Skip to content

Latest commit

 

History

History
332 lines (269 loc) · 14.3 KB

README.adoc

File metadata and controls

332 lines (269 loc) · 14.3 KB

IPD 39 Datalink Media Types

Authors: Robert Mustacchi <[email protected]>

Sponsor: Rich Lowe <[email protected]>

State: published

While we often talk about whether a link is up and what it’s speed is, there is another question that gets asked about: media. While we have visibility about discrete, pluggable transceivers that shows up in topo, this doesn’t always cover the question of what type of media we are using. As such, this IPD provides a bit of background and then goes into a concrete proposal for how this should be implemented and exposed through a combination of user tools and throughout the GLDv3 APIs that we have.

To summarize this IPD proposes:

  • A new dladm link property called media.

  • Adds a new per-mac plugin enumeration to cover all different media kinds.

  • Adds a provision for exposing this through a MAC stat.

  • Adds a provision for exposing this through mac_capab_transceiver(9E).

1. Media Types and Existing Interfaces

Today, dladm(8) allows folks to think about configuring most of the parameters of the link such as:

  • Controlling the speed and duplex

  • Controlling auto-negotiation

  • Controlling forward error correct (FEC)

  • Controlling pause frames

Similarly, dladm(8) will report this information back and describe what is negotiated. However, there is another part of this which is missing: the actual underlying, generally IEEE, standard that describes how the link is actually transmitting data. Consider 100 GbE Ethernet for a moment. That could be done with a passive copper DAC cable with 4 lanes, utilize a single lambda transceiver ala 100GBASE-SR1, or the even older and uncommon, 10-lane 100GBASE-CR10/SR10.

For the time being, what we care about is making the type of media that is in use clear to users; however, this is not about extending administrative control. Rather, that is intended to be done with the datalink properties.

There is an existing triple of stats that exist in mac around transceivers: ETHER_STAT_XCVR_ADDR, ETHER_STAT_XCVR_ID, and ETHER_STAT_XCVR_INUSE. The first two were meant to be specific to a MII/GMII receiver and indicate the address and ID. The latter provided a few sets of transceiver IDs that covered 10BASE-T, various forms of 100M Ethernet, and simple 1000BASE-T and 1000BASE-X. It was never extended to cover additional states, most likely because higher speed Ethernet controllers use different phy interfaces and controllers have evolved away from having the driver manage the phy directly as opposed to what was traditionally done.

Next, we separately tried to add a first class transceiver interface to MAC to allow us to get at information such as whether a separate discrete device was present or not. mac_capab_transceiver(9E) introduced a MAC capability that provides information about the number of discrete transceivers that existed on a device and how to get information about them. This was designed in particular around the idea of understanding SFP and QSFP style devices that are plugged into NICs as a discrete entity and may or may not be considered a phy in the more classic MII/RMII/RGMII/GMII interfaces.

Currently the main way that this information is exposed and consumed is through the FMA topology interfaces. All network nodes that implement this where the transceiver is a discrete device have this information. This information is also available through a prototype tool called dltraninfo which is more of an experiment in figuring out what the right way to expose this for different device types is in dladm(8).

Ultimately, one can look through all the different information in the transceiver to determine what media options are present, but this is a manual and error-prone process and also relies on there being an external SFP/QSFP style transceiver which may not exist at all!

Neither of the existing interfaces really give us quite what we want and the applicability of each is an interesting question.

2. Proposal

Fundamentally we propose that every MAC plugin add a new enumeration to describe the valid types of media. These are plugin-specific as what is valid for Ethernet based devices is not true for WiFi, IB, and doesn’t make sense for some of the other things there. These should all take the form of mac_<plugin>_media_t. Our focus in this IPD is for Ethernet, while other types could be done (and many may be shared with IB), we are leaving that to future work.

More specifically we will create the enumeration mac_ether_media_t in <sys/mac_ether.h>. Each entry in the enumeration will be a specific type, for example: ETHER_MEDIA_10BASE_T, ETHER_MEDIA_1000BASE_T, ETHER_MEDIA_10GBASE_CR, ETHER_MEDIA_40GBASE_LR4, etc. We will have the lower values of this enumeration match the existing XCVR_* constants that are in <sys/mac_provider.h> to make things easier to adopt for existing drivers.

Even if we weren’t doing this, we would still suggest that we phrase media types as a specific numeric enumeration and not as a series of bitfields. While C23 is introducing sized enumerations that could be as large as a uint64_t, we know that this value will likely get overflowed in its lifetime. For example, the Mellanox firmware has already used a full uint32_t worth of bits and doesn’t even cover lower speeds such as the various 10/100/1000 Mbit fields nor does it cover the single lambda variants or various QSFP-DD/OSFP pieces. Now some of these may be things that could be filled in due to the presence of 10 unknown entries; however, what this tells us is that we can’t really assume to fit in a 32-bit value or a 64-bit bitfield.

Using the approach of a separate integer for each value makes it so we have plenty of space to cover this, even if the compiler’s natural enum size was a 16-bit integer (though in all of our ABIs it is a 32-bit integer). The downside to this approach in so far as it is is that if we ever do additional work such as adding a list of supported media, we will need to figure out a way to communicate that and have a growable list since it will not neatly fit in this. This ultimately means a variant on the mac_propval_range_t which is a series of integers.

While we will introduce an initial set of values that attempts to be exhaustive, we expect and anticipate that new values will be appended to the enumeration over time. This may mean that speeds will be in different orders, that is fine and while we’ll strive to keep it consistent, we’re ultimately at the mercy of IEEE and therefore that is just how it is.

The most fundamental way that we want to allow someone to express this is to have a new dladm(8) read-only link property called media with the corresponding MAC property constant MAC_PROP_MEDIA. This is the primary way we expect drivers to implement this. See Appendix A: Driver Specific Implementation Notes for more on the driver-specific enabling plan and path. Note, while the values are plugin-specific our intent is to use the same link property for all devices. Drivers will not have to implement the property information for this. The mac framework will fill it out itself as is done with other properties.

The MAC property works well for incremental evolution as drivers already have to deal with properties that they don’t know about and the dladm show-linkprop command already handles that just fine. Media types will be stringified in their approximate IEEE form, e.g. 100GBASE-CR4, 100BASE-TX, etc.

2.2. Ethernet Statistics

As mentioned in the background section, MAC already has a few Ethernet statistics around this. In particular, we propose repurposing in a backwards-compatible way the ETHER_STAT_XCVR_INUSE stat to take a fuller range of transceivers. Our advice to drivers would be to place a mac_ether_media_t value in here which will overlap with the traditional values here.

Just because a driver implements ETHER_STAT_XCVR_INUSE does not suggest it’ll implement or have to lie about the other mii/gmii related stats.

The main reason that we opted to use this was that otherwise we’d create another Ethernet-specific stat and it’d just be mostly another copy of this existing stat, but with additional values. That didn’t seem to aid anyone. In addition, because we are using a plugin-specific set of definitions, we want to have the stat scoped to the plugin. This also means that if we say add an IB specific version of this, then it’d separately be IB_STAT_XCVR_INUSE and accept different values.

2.3. Transceiver Capability

Originally we proposed adding an extension to the MAC transceiver capability. However, as we dug into the implementation, it become less desirable at this time and we have moved it to future work. In particular, most implementations of MAC_CAPAB_TRANSCEIVER are specific to the presence of extrenal SFP-style PHYs. As such, plumbing this through in a subset of circumstnaces isn’t particularly useful. Our prototype of adding this to the topo datalink property groups worked just fine wih the link property.

3. Implementation Plan

To implement this, we will do an initial integration of the mac features along with a few drivers. Additional drivers will be integrated in subsequent changes in part based on needs and testing capabilities.

4. Future Directions

If we want to build on this IPD, here are the high-level ways we expect to follow in the future, but are not at the level of a concrete proposal.

We ultimately want to be able to introduce something more akin to a dladm show-phy or dladm show-transceiver which would take the information proposed here, the information from the transceiver capability, and make it a first class dladm-level experience. If we go down this path then we’ll also want to add an additional property to the transceiver mct_info(9E) entry point that indicates whether the transceiver is built-in or not. If we do this, we should consider what the implementation for drivers without external SFP-style PHYs looks like as most do not implement MAC_CAPAB_TRANSCIEVER otherwise.

A different direction that we should consider is potentially introducing an array of supported media. This isn’t a priority here because the link properties that we already have cover most of what someone needs to know and in general there aren’t many times where someone is switching between different medias as the same speed today. We’re going to let demand help motivate this being added, which unlike the one above is something that is less obvious.

5. Appendix A: Driver Specific Implementation Notes

This section contains notes on how we implement this functionality for each of the drivers listed below. Not all drivers are listed. Our general plan is to start with more common devices and implement this as we get community support for testing a wider device variety.

5.1. bge and bnx

These drivers already have an implementation of the ETHER_STAT_XCVR_INUSE logic that looks if the chip is in a fiber-based mode and otherwise uses the link speed to determine the answer.

5.2. bnxe

The bnxe(4D) driver can get this information from the internal media_info member of the struct elink_phy which is directly accessible already today and is used as part of the MAC_CAPAB_TRANSCEIVER mct_read(9E) entry point. So we can take this and combine it with the speed to get what we need.

5.3. cxgbe

The cxgbe(4D) and t4nex drivers work together to get this information. Right now the driver has the general enum fw_port_type which describes the different modes that are supported on the device. The current version for the device is stored on the port_type member of the struct port_info.

5.4. e1000g and igb

To determine this we need to look at a series of different fields on the device. In particular, the media type, whether it thinks it supports 100-BASET4 or not, and manually put this information together.

5.5. i40e

The i40e firmware provides us a few different types of information. In particular, it has an internal enumeration of PHY types that it uses as part of Get Link Status command (opecode 0x0607). This enum called enum i40e_aq_phy_type tells us very specific information about what kind of phy media is currently in use. In addition, there is also a general media type that is part of the phy capabilities data in the struct i40e_phy_info. This struct also has an array of which PHYs are supported, but for us the most important member is the one in the link status.

5.6. ixgbe

To determine the media type for ixgbe(4D), we need to combine the current link speed with the results of the ixgbe_get_supported_physical_layer() function in the common code. By combining these two we can get the current mode of the link.

5.7. mlxcx

The mlxcx(4D) driver already has a notion of this with the mlxcx_eth_proto_t enumeration which contains the current operational mode in the port’s mlp_oper_proto member. All we need to do is convert this to the appropriate general type.

5.8. qede

The qede(4D) driver’s firmware mailbox has some information here. In particular there is a function ecore_mcp_get_media_type() which is used to extract from firmware the struct public_port member media_type. This gives us something we can then comare with the speed to figure out the exact type of.

5.9. sfxge

The sfxge(4D) driver does have a way of asking the underlying system what the type of the port is. However, this only gets us to high-level PHY information such as BASE-T, SFP, XPF, QSFP, etc. Unlike other drivers or other systems it does not attempt to decode the SFP compliance codes to figure out how to operate (that we can easily see). This means that we will have to return an unknown media type in such circumstances.

5.10. mii-framework based drivers

The mii framework today already tracks this information as part of configuring its state. This is used by the following drivers:

  • afe

  • atge

  • dmfe

  • efe

  • elxl

  • hme

  • iprb

  • pcn

  • rtls

  • yge

We likely can provide a straightforward callback into the mii layer.

5.11. No Implementation

For several drivers, an implementation of this doesn’t make sense either because the device is synthetic or it’s an amalgamation of many things. This includes the following known ones right now:

  • aggr(4D)

  • vmxnet3s(4D)

  • overlay(4D)

  • vioif(4D)

  • xnf(4D)