Authors: Robert Mustacchi <[email protected]> |
Sponsor: Rich Lowe <[email protected]> |
State: published |
While we often talk about whether a link is up and what it’s speed is, there is another question that gets asked about: media. While we have visibility about discrete, pluggable transceivers that shows up in topo, this doesn’t always cover the question of what type of media we are using. As such, this IPD provides a bit of background and then goes into a concrete proposal for how this should be implemented and exposed through a combination of user tools and throughout the GLDv3 APIs that we have.
To summarize this IPD proposes:
-
A new dladm link property called
media
. -
Adds a new per-mac plugin enumeration to cover all different media kinds.
-
Adds a provision for exposing this through a MAC stat.
-
Adds a provision for exposing this through mac_capab_transceiver(9E).
Today, dladm(8)
allows folks to think
about configuring most of the parameters of the link such as:
-
Controlling the speed and duplex
-
Controlling auto-negotiation
-
Controlling forward error correct (FEC)
-
Controlling pause frames
Similarly, dladm(8)
will report this information back and describe
what is negotiated. However, there is another part of this which is
missing: the actual underlying, generally IEEE, standard that describes
how the link is actually transmitting data. Consider 100 GbE Ethernet
for a moment. That could be done with a passive copper DAC cable with 4
lanes, utilize a single lambda transceiver ala 100GBASE-SR1, or the even
older and uncommon, 10-lane 100GBASE-CR10/SR10.
For the time being, what we care about is making the type of media that is in use clear to users; however, this is not about extending administrative control. Rather, that is intended to be done with the datalink properties.
There is an existing triple of stats that exist in mac around
transceivers: ETHER_STAT_XCVR_ADDR
, ETHER_STAT_XCVR_ID
, and
ETHER_STAT_XCVR_INUSE
. The first two were meant to be specific to a
MII/GMII receiver and indicate the address and ID. The latter provided a
few sets of transceiver IDs that covered 10BASE-T, various forms of 100M
Ethernet, and simple 1000BASE-T and 1000BASE-X. It was never extended to
cover additional states, most likely because higher speed Ethernet
controllers use different phy interfaces and controllers have evolved
away from having the driver manage the phy directly as opposed to what
was traditionally done.
Next, we separately tried to add a first class transceiver interface to MAC to allow us to get at information such as whether a separate discrete device was present or not. mac_capab_transceiver(9E) introduced a MAC capability that provides information about the number of discrete transceivers that existed on a device and how to get information about them. This was designed in particular around the idea of understanding SFP and QSFP style devices that are plugged into NICs as a discrete entity and may or may not be considered a phy in the more classic MII/RMII/RGMII/GMII interfaces.
Currently the main way that this information is exposed and consumed is
through the FMA topology interfaces. All network nodes that implement
this where the transceiver is a discrete device have this information.
This information is also available through a prototype tool called
dltraninfo
which is more of an experiment in figuring out what the
right way to expose this for different device types is in dladm(8).
Ultimately, one can look through all the different information in the transceiver to determine what media options are present, but this is a manual and error-prone process and also relies on there being an external SFP/QSFP style transceiver which may not exist at all!
Neither of the existing interfaces really give us quite what we want and the applicability of each is an interesting question.
Fundamentally we propose that every MAC plugin add a new enumeration to
describe the valid types of media. These are plugin-specific as what is
valid for Ethernet based devices is not true for WiFi, IB, and doesn’t
make sense for some of the other things there. These should all take the
form of mac_<plugin>_media_t
. Our focus in this IPD is for Ethernet,
while other types could be done (and many may be shared with IB), we are
leaving that to future work.
More specifically we will create the enumeration mac_ether_media_t
in
<sys/mac_ether.h>
. Each entry in the enumeration will be a specific
type, for example: ETHER_MEDIA_10BASE_T
, ETHER_MEDIA_1000BASE_T
,
ETHER_MEDIA_10GBASE_CR
, ETHER_MEDIA_40GBASE_LR4
, etc. We will have the
lower values of this enumeration match the existing XCVR_*
constants
that are in <sys/mac_provider.h>
to make things easier to adopt for
existing drivers.
Even if we weren’t doing this, we would still suggest that we phrase
media types as a specific numeric enumeration and not as a series of
bitfields. While C23 is introducing sized enumerations that could be as
large as a uint64_t
, we know that this value will likely get
overflowed in its lifetime. For example, the Mellanox firmware has
already used a full uint32_t
worth of bits and doesn’t even cover
lower speeds such as the various 10/100/1000 Mbit fields nor does it
cover the single lambda variants or various QSFP-DD/OSFP pieces. Now
some of these may be things that could be filled in due to the presence
of 10 unknown entries; however, what this tells us is that we can’t
really assume to fit in a 32-bit value or a 64-bit bitfield.
Using the approach of a separate integer for each value makes it so we
have plenty of space to cover this, even if the compiler’s natural enum
size was a 16-bit integer (though in all of our ABIs it is a 32-bit
integer). The downside to this approach in so far as it is is that if we
ever do additional work such as adding a list of supported media, we
will need to figure out a way to communicate that and have a growable
list since it will not neatly fit in this. This ultimately means a
variant on the mac_propval_range_t
which is a series of integers.
While we will introduce an initial set of values that attempts to be exhaustive, we expect and anticipate that new values will be appended to the enumeration over time. This may mean that speeds will be in different orders, that is fine and while we’ll strive to keep it consistent, we’re ultimately at the mercy of IEEE and therefore that is just how it is.
The most fundamental way that we want to allow someone to express this
is to have a new dladm(8)
read-only link property called media
with
the corresponding MAC property constant MAC_PROP_MEDIA
. This is the
primary way we expect drivers to implement this. See Appendix A: Driver Specific Implementation Notes for more
on the driver-specific enabling plan and path. Note, while the values
are plugin-specific our intent is to use the same link property for all
devices. Drivers will not have to implement the property information for
this. The mac framework will fill it out itself as is done with other
properties.
The MAC property works well for incremental evolution as drivers already
have to deal with properties that they don’t know about and the dladm
show-linkprop
command already handles that just fine. Media types will
be stringified in their approximate IEEE form, e.g. 100GBASE-CR4
,
100BASE-TX
, etc.
As mentioned in the background section, MAC already has a few Ethernet
statistics around this. In particular, we propose repurposing in a
backwards-compatible way the ETHER_STAT_XCVR_INUSE
stat to take a
fuller range of transceivers. Our advice to drivers would be to place a
mac_ether_media_t
value in here which will overlap with the traditional
values here.
Just because a driver implements ETHER_STAT_XCVR_INUSE
does not
suggest it’ll implement or have to lie about the other mii/gmii related
stats.
The main reason that we opted to use this was that otherwise we’d create
another Ethernet-specific stat and it’d just be mostly another copy of
this existing stat, but with additional values. That didn’t seem to aid
anyone. In addition, because we are using a plugin-specific set of
definitions, we want to have the stat scoped to the plugin. This also
means that if we say add an IB specific version of this, then it’d
separately be IB_STAT_XCVR_INUSE
and accept different values.
Originally we proposed adding an extension to the MAC transceiver
capability. However, as we dug into the implementation, it become less
desirable at this time and we have moved it to future work. In
particular, most implementations of MAC_CAPAB_TRANSCEIVER
are specific
to the presence of extrenal SFP-style PHYs. As such, plumbing this
through in a subset of circumstnaces isn’t particularly useful. Our
prototype of adding this to the topo datalink property groups worked
just fine wih the link property.
To implement this, we will do an initial integration of the mac features along with a few drivers. Additional drivers will be integrated in subsequent changes in part based on needs and testing capabilities.
If we want to build on this IPD, here are the high-level ways we expect to follow in the future, but are not at the level of a concrete proposal.
We ultimately want to be able to introduce something more akin to a
dladm show-phy
or dladm show-transceiver
which would take the
information proposed here, the information from the transceiver
capability, and make it a first class dladm-level experience. If we go
down this path then we’ll also want to add an additional property to the
transceiver mct_info(9E) entry point that indicates whether the
transceiver is built-in or not. If we do this, we should consider what
the implementation for drivers without external SFP-style PHYs looks
like as most do not implement MAC_CAPAB_TRANSCIEVER
otherwise.
A different direction that we should consider is potentially introducing an array of supported media. This isn’t a priority here because the link properties that we already have cover most of what someone needs to know and in general there aren’t many times where someone is switching between different medias as the same speed today. We’re going to let demand help motivate this being added, which unlike the one above is something that is less obvious.
This section contains notes on how we implement this functionality for each of the drivers listed below. Not all drivers are listed. Our general plan is to start with more common devices and implement this as we get community support for testing a wider device variety.
These drivers already have an implementation of the
ETHER_STAT_XCVR_INUSE
logic that looks if the chip is in a fiber-based
mode and otherwise uses the link speed to determine the answer.
The bnxe(4D) driver can get this information from the internal
media_info
member of the struct elink_phy
which is directly
accessible already today and is used as part of the
MAC_CAPAB_TRANSCEIVER
mct_read(9E) entry point. So we can take this
and combine it with the speed to get what we need.
The cxgbe(4D) and t4nex drivers work together to get this information.
Right now the driver has the general enum fw_port_type
which describes
the different modes that are supported on the device. The current
version for the device is stored on the port_type
member of the struct
port_info
.
To determine this we need to look at a series of different fields on the device. In particular, the media type, whether it thinks it supports 100-BASET4 or not, and manually put this information together.
The i40e firmware provides us a few different types of information. In
particular, it has an internal enumeration of PHY types that it uses as
part of Get Link Status command (opecode 0x0607). This enum called enum
i40e_aq_phy_type
tells us very specific information about what kind of
phy media is currently in use. In addition, there is also a general
media type that is part of the phy capabilities data in the struct
i40e_phy_info
. This struct also has an array of which PHYs are
supported, but for us the most important member is the one in the link
status.
To determine the media type for ixgbe(4D), we need to combine the
current link speed with the results of the
ixgbe_get_supported_physical_layer()
function in the common code. By
combining these two we can get the current mode of the link.
The mlxcx(4D) driver already has a notion of this with the
mlxcx_eth_proto_t
enumeration which contains the current operational
mode in the port’s mlp_oper_proto
member. All we need to do is convert
this to the appropriate general type.
The qede(4D) driver’s firmware mailbox has some information here. In
particular there is a function ecore_mcp_get_media_type()
which is
used to extract from firmware the struct public_port
member
media_type
. This gives us something we can then comare with the speed
to figure out the exact type of.
The sfxge(4D) driver does have a way of asking the underlying system what the type of the port is. However, this only gets us to high-level PHY information such as BASE-T, SFP, XPF, QSFP, etc. Unlike other drivers or other systems it does not attempt to decode the SFP compliance codes to figure out how to operate (that we can easily see). This means that we will have to return an unknown media type in such circumstances.
The mii framework today already tracks this information as part of configuring its state. This is used by the following drivers:
-
afe
-
atge
-
dmfe
-
efe
-
elxl
-
hme
-
iprb
-
pcn
-
rtls
-
yge
We likely can provide a straightforward callback into the mii layer.