Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MFS set too low on Intel 40GBE? #987

Open
srhb opened this issue Jul 3, 2023 · 8 comments
Open

MFS set too low on Intel 40GBE? #987

srhb opened this issue Jul 3, 2023 · 8 comments

Comments

@srhb
Copy link

srhb commented Jul 3, 2023

Hi

I'm a little unclear on whether this behaviour is expected, so please bear with me as I try to explain.

We recently upgrade from 7099539 (good) to ae435cb (bad) and somewhere between them, our intel 40GBE NICs ended up with a too low MFS set, even if our ipxe script is just a fall-through "exit 1" which immediately proceeds to boot whatever is next in the boot order.

Linux, once booted after the "bad" rev of ipxe was briefly loaded, reports

Jul 03 12:44:33 container-p01 kernel: i40e 0000:01:00.1:
MFS for port 1 has been set below the default: 600

(we expected to be able to run jumbo frames at this point)

Everything behaves as expected on the "good" rev.

I've been trying to bisect the issue more completely, but unfortunately most of the revs simply crash so it's unclear what's going on. If I simply add a printf("-----MFS: %d------", intelxl->mfs); here: https://github.com/ipxe/ipxe/blob/master/src/drivers/net/intelxl.c#L1414-L1416 (feel free to mentally insert "i have no idea what I'm doing" doge meme here)

I can tell it's being set to 1536 on both the good and bad rev, so I have no idea where 600 is coming from!

Now, this issue only arises because ipxe is configuring via dhcp and we did not set a MTU (before or after), with eg dhcp setting mtu to 9000 the problem also goes away (mfs is set to 9216 according to the printf, and persists into the OS) -- but it still seems so weird that the MFS somehow is different between the two versions, and that it survives into the booted OS after ipxe despite the fact that we are manually configuring an MTU (static config of the NIC in the OS) but I suppose that may be a bug in the linux driver itself.

Please let me know if I can do anything else to explain this behaviour. It was a bit of a surprise, but I'm not sure that what ipxe is doing is actually wrong.

@srhb
Copy link
Author

srhb commented Jul 3, 2023

I thought having our DHCP server send a mtu of 9000 would fix this, but it just leads to another strange value after ipxe has had its hands on it;

Jul 03 14:05:17 container-p01 kernel: i40e 0000:01:00.1: MFS for port 1 has been set below the default: 2400

@mcb30
Copy link
Member

mcb30 commented Dec 28, 2023

I can tell it's being set to 1536 on both the good and bad rev, so I have no idea where 600 is coming from!

The 600 is presumably a hex value reported without a 0x prefix, since 0x600 == 1536.

I don't honestly know what the "correct" behaviour would be. iPXE assumes that in a non-virtual environment (specifically: when using a NIC that isn't a PCI virtual function or a virtual NIC such as netfront) then the MTU is 1536 unless specified otherwise. We therefore allocate only 1536-byte RX buffers, and so configure the NIC to use only that buffer size.

The Intel 40GbE does have two separate places where MFS is configured: intelxl_admin_mac_config() (which sets the maximum frame length that will be accepted as a valid packet by the MAC) and intelxl_context_rx() (which sets the size of receive descriptor buffers). It's possible that everything would still work if we were to set only the latter, and use the default value (whatever that is) in intelxl_admin_mac_config() (or just omit that function entirely). You could try that and see if it fixes the problem, and report back.

@ErwanAliasr1
Copy link

ErwanAliasr1 commented Feb 27, 2024

Just faced the issue here under Linux where some i40e devices detect the MFS is too low but don't set it :/
This ipxe behavior broke the driver behavior assuming the MFS is properly set. I'll send a patch to the kernel to avoid accepting setting up the MTU if its greater than the MFS (https://lore.kernel.org/netdev/[email protected]/T/#u)
I really wonder why iPXE tries to enforce the MFS which is the greatest L2 packet to be received. Why not assuming the firmware set it properly ?

@toreanderson
Copy link
Contributor

For what it is worth, this issue impacts 10GbE Intel NICs using the i40e driver too, in particular we were seeing it on Intel Corporation Ethernet Connection X722 for 10GbE SFP+ [8086:37d0] (rev 09).

The symptom is that the device iPXE booted from rather silently discards both in- and outgoing jumbo frames, even though the Linux netdev has an sufficiently high MTU set. The inbound drops can however be observed with, e.g., ethtool -S eth0 | grep rx_oversize.

We have verified that this problem is there as of today's Git master (226531e)

The commit blamed in @ErwanAliasr1 netdev post (6871a7d) and the one immediately preceding it fails to boot our servers. iPXE loads but immediately fail with the error message No more network devices. We therefore downgraded to d3c8944, which is the last commit preceding a string of presumably related and inter-dependent changes tagged [intelxl] (including the alleged culprit), which successfully solved the problem.

@ErwanAliasr1
Copy link

The tx_error is also increasing. I have to push a 2nd version of my Linux Kernel patch to ensure the MTU is not set in such particular configuration. I have also to spend some time understanding how to set the mfs properly under Linux if the mfs is too small to contain the jumbo packet.

That said, I still don't understand the why iPXE has to reset the MFS to a smaller value than the default (firmware). This is not part of the commit message. I wonder if the previous code using PRTGL was really changing the configuration.

Maybe it is moving to the admin queue usage that made this mfs change functional and as it's computed from the expected mtu, the value is always rewritten. So if the iPXe MTU isn't jumbo, it will reset the default value to a 1500ish value and as the Linux Kernel is not managing this case, it fails.

Having iPXE modifying the default MFS can also probably affect other OSes.
I wonder if we shouldn't have a check here not to change the default MFS if the MTU is smaller than MFS.

@mcb30 any feeling around this issue ?

@mcb30
Copy link
Member

mcb30 commented Mar 19, 2024

That said, I still don't understand the why iPXE has to reset the MFS to a smaller value than the default (firmware). This is not part of the commit message. I wonder if the previous code using PRTGL was really changing the configuration.

The PRTGL_SA[HL] register holds the 48-bit MAC address and 16-bit MFS in the same 64-bit register, so the code that wrote to PRTGL_SA[HL] was definitely setting the MFS since it had no way to avoid doing so.

The commit was part of a series to add support for the Intel 100Gig NICs. The PRTGL_SA[HL] registers are not present in the same form on the 100Gig NICs, but the relevant admin queue commands are identical for both 40Gig and 100Gig, and so by switching to using the admin queue commands instead we can share code between the two drivers.

iPXE needs to set the RX buffer length, so that the hardware doesn't attempt to DMA packets beyond the end of the allocated RX buffer. We also potentially need to set the MFS if we are attempting to use jumbo frames, since we don't know what value the MFS will have when we first start driving the card.

If the MFS is higher than the RX buffer length then the hardware may attempt to split packets across multiple RX descriptors, which the iPXE driver will not support. I'd have to recheck the datasheet to figure out what would actually happen in this scenario. It's possible that we'd just see the fragments as separate packets.

I'm not sure why the Linux (or other) drivers would not set the MFS. There seems to be no particular reason to assume that it has a sensible power-on default value, since the correct value is a property of the network infrastructure to which the NIC is attached, not a property of the NIC itself.

@ErwanAliasr1
Copy link

Yeah I agree the Kernel should also ensure the proper size of the MFS, trying to figure out how to properly handle it in the Kernel. They consider the firmware set properly, which is said in the x710 spec to be the default (9728).

For now, my patch which prevents to set jumbo frames when iPXE sets the MFS to a lower value than the default is about to be merged. At least, users will avoid falling into this issue.

Thx for the explanation about why not considering the default was not considered, I was lacking this information in the commit message.

@mcb30
Copy link
Member

mcb30 commented Mar 19, 2024

I've looked through the relevant parts of the XL710 spec. If we do not set MFS, and a jumbo frame is received, then it will be split across multiple RX buffers. iPXE will interpret the last fragment of the packet as a truncated packet (which will be discarded as nonsensical by the rest of the stack), and will interpret any preceding fragments in an undefined way since it will assume that various flags in the RX writeback descriptor are valid (which is not the case for non-final fragments).

We could potentially reassemble fragments into complete packets using iob_concatenate(), but I'm not sure it's worth doing. We allocate very few RX buffers and so there's a good chance that portions of a packet would be lost. The datasheet does not seem to describe how this situation is handled. It's possible that the hardware doesn't implement low latency RX cut-through and so won't start consuming RX descriptors until the whole packet length is known, in which case it would probably cleanly drop the whole packet, but I don't see anything in the spec that directly addresses this.

We could fairly easily add logic in intelxl_poll_rx() to drop fragmented packets: whenever a completed descriptor does not have the EOP bit set, then both that descriptor and the next completed descriptor should be treated as receive errors. (This would require maintaining a state flag in struct intelx_ring, since the next completed descriptor may not be processed in the same invocation of intelxl_poll_rx().)

This would allow us to function correctly even if we leave the MFS with its default value: any received frames above our expected MTU would then be cleanly treated as (multiple) receive errors. However, in my opinion it would be adding a bug to iPXE to leave the MFS untouched since (as I mentioned above) we have no guarantees that it already holds a sensible value.

It may be worth adding this bug for the sake of compatibility with other operating systems, if every other operating system handles MFS incorrectly and will continue doing so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants