-
Notifications
You must be signed in to change notification settings - Fork 615
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MFS set too low on Intel 40GBE? #987
Comments
I thought having our DHCP server send a mtu of 9000 would fix this, but it just leads to another strange value after ipxe has had its hands on it;
|
The I don't honestly know what the "correct" behaviour would be. iPXE assumes that in a non-virtual environment (specifically: when using a NIC that isn't a PCI virtual function or a virtual NIC such as The Intel 40GbE does have two separate places where MFS is configured: |
Just faced the issue here under Linux where some i40e devices detect the MFS is too low but don't set it :/ |
For what it is worth, this issue impacts 10GbE Intel NICs using the i40e driver too, in particular we were seeing it on The symptom is that the device iPXE booted from rather silently discards both in- and outgoing jumbo frames, even though the Linux netdev has an sufficiently high MTU set. The inbound drops can however be observed with, e.g., We have verified that this problem is there as of today's Git master (226531e) The commit blamed in @ErwanAliasr1 netdev post (6871a7d) and the one immediately preceding it fails to boot our servers. iPXE loads but immediately fail with the error message |
The tx_error is also increasing. I have to push a 2nd version of my Linux Kernel patch to ensure the MTU is not set in such particular configuration. I have also to spend some time understanding how to set the mfs properly under Linux if the mfs is too small to contain the jumbo packet. That said, I still don't understand the why iPXE has to reset the MFS to a smaller value than the default (firmware). This is not part of the commit message. I wonder if the previous code using PRTGL was really changing the configuration. Maybe it is moving to the admin queue usage that made this mfs change functional and as it's computed from the expected mtu, the value is always rewritten. So if the iPXe MTU isn't jumbo, it will reset the default value to a 1500ish value and as the Linux Kernel is not managing this case, it fails. Having iPXE modifying the default MFS can also probably affect other OSes. @mcb30 any feeling around this issue ? |
The The commit was part of a series to add support for the Intel 100Gig NICs. The iPXE needs to set the RX buffer length, so that the hardware doesn't attempt to DMA packets beyond the end of the allocated RX buffer. We also potentially need to set the MFS if we are attempting to use jumbo frames, since we don't know what value the MFS will have when we first start driving the card. If the MFS is higher than the RX buffer length then the hardware may attempt to split packets across multiple RX descriptors, which the iPXE driver will not support. I'd have to recheck the datasheet to figure out what would actually happen in this scenario. It's possible that we'd just see the fragments as separate packets. I'm not sure why the Linux (or other) drivers would not set the MFS. There seems to be no particular reason to assume that it has a sensible power-on default value, since the correct value is a property of the network infrastructure to which the NIC is attached, not a property of the NIC itself. |
Yeah I agree the Kernel should also ensure the proper size of the MFS, trying to figure out how to properly handle it in the Kernel. They consider the firmware set properly, which is said in the x710 spec to be the default (9728). For now, my patch which prevents to set jumbo frames when iPXE sets the MFS to a lower value than the default is about to be merged. At least, users will avoid falling into this issue. Thx for the explanation about why not considering the default was not considered, I was lacking this information in the commit message. |
I've looked through the relevant parts of the XL710 spec. If we do not set MFS, and a jumbo frame is received, then it will be split across multiple RX buffers. iPXE will interpret the last fragment of the packet as a truncated packet (which will be discarded as nonsensical by the rest of the stack), and will interpret any preceding fragments in an undefined way since it will assume that various flags in the RX writeback descriptor are valid (which is not the case for non-final fragments). We could potentially reassemble fragments into complete packets using We could fairly easily add logic in This would allow us to function correctly even if we leave the MFS with its default value: any received frames above our expected MTU would then be cleanly treated as (multiple) receive errors. However, in my opinion it would be adding a bug to iPXE to leave the MFS untouched since (as I mentioned above) we have no guarantees that it already holds a sensible value. It may be worth adding this bug for the sake of compatibility with other operating systems, if every other operating system handles MFS incorrectly and will continue doing so. |
Hi
I'm a little unclear on whether this behaviour is expected, so please bear with me as I try to explain.
We recently upgrade from 7099539 (good) to ae435cb (bad) and somewhere between them, our intel 40GBE NICs ended up with a too low MFS set, even if our ipxe script is just a fall-through "exit 1" which immediately proceeds to boot whatever is next in the boot order.
Linux, once booted after the "bad" rev of ipxe was briefly loaded, reports
(we expected to be able to run jumbo frames at this point)
Everything behaves as expected on the "good" rev.
I've been trying to bisect the issue more completely, but unfortunately most of the revs simply crash so it's unclear what's going on. If I simply add a
printf("-----MFS: %d------", intelxl->mfs);
here: https://github.com/ipxe/ipxe/blob/master/src/drivers/net/intelxl.c#L1414-L1416 (feel free to mentally insert "i have no idea what I'm doing" doge meme here)I can tell it's being set to 1536 on both the good and bad rev, so I have no idea where 600 is coming from!
Now, this issue only arises because ipxe is configuring via dhcp and we did not set a MTU (before or after),
with eg dhcp setting mtu to 9000 the problem also goes away (mfs is set to 9216 according to the printf, and persists into the OS) -- but it still seems so weird that the MFS somehow is different between the two versions, and that it survives into the booted OS after ipxe despite the fact that we are manually configuring an MTU (static config of the NIC in the OS) but I suppose that may be a bug in the linux driver itself.Please let me know if I can do anything else to explain this behaviour. It was a bit of a surprise, but I'm not sure that what ipxe is doing is actually wrong.
The text was updated successfully, but these errors were encountered: