ipxe.efi hangs at downloading large files over HTTP #1023

Open
elvinup opened this issue Aug 12, 2023 · 22 comments

@elvinup

elvinup commented Aug 12, 2023

image

I've been trying to boot to WinPE which requires me to download a boot.wim file at ~400MB. It always stops after the first block it downloads (could be 2%, 16%, 29% etc), and never progresses further afterwards.

Interestingly, I am able to download the same boot.wim file over TFTP with the latest ipxe source code, but i definitely had some data corruption issues on that boot.wim file since i'd run into this Red screen of death half the time, and the other half it would boot properly. Perhaps because TFTP is using UDP instead of TCP and is dropping/corrupting some packets? Not sure, but anyways...

I spent a couple days wondering what the issue could be trying to get HTTP working, and said screw it eventually and tried to compile the ipxe.efi file off of commit 1295b4acff1f2014261c40d9f9d2107ffd668d92 instead after reading issue 155.

The ipxe.efi file i compiled from that commit now allows the boot.wim file to download over HTTP in < 1 second and boots straight to WinPE properly. Also, i don't get a red screen of death half the time, anymore. It's consistent when downloading over HTTP instead of TFTP!

I have no clue what happened between now and 2020 when that commit was made, but I just wanted to shed light on this issue that downloading a large file from HTTP seems to be busted currently.

If someone makes a patch to try to address this, I am happy to recompile that ipxe.efi file with it to test :). Until then, i'll keep using this older version.

@NiKiZe
Contributor

NiKiZe commented Aug 13, 2023

I say it works perfectly fine as is. (That is, it works for me and many others on the hardware we have)
There are several details you are leaving out.

  • Which binary are you using (ipxe.efi, snponly efi, snp.efi), as well as exact version?
  • What nic is used? 10gb intel?
  • What hardware/machine is used? HPE?
  • Have you done a tcpdump to check what is going on at the wire?
  • Have you tried a git bisect to verify whether it is the given commit, or something else, that is responsible for the issues you are seeing?
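
For the tcpdump point above, a capture on the web server side could look roughly like the sketch below. The interface name, client address and output file are placeholders rather than values from this report, and the filter assumes plain HTTP on port 80.

```shell
#!/bin/sh
# Build a capture filter limited to one client and one TCP port.
capture_filter() {
    # $1 = IP address of the booting client, $2 = TCP port of the web server
    printf 'host %s and tcp port %s\n' "$1" "$2"
}

# Write the matching packets to a pcap file for later inspection (needs root).
capture_http() {
    # $1 = interface, $2 = client IP, $3 = output file (all placeholders)
    tcpdump -n -i "$1" -w "$3" "$(capture_filter "$2" 80)"
}
```

For example, `capture_http eth0 192.0.2.10 /tmp/ipxe-http.pcap`, then look at the last packets exchanged before the transfer stalls.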

Since a few versions back of wimboot you no longer need boot.sdi or BCD

@elvinup
Author

elvinup commented Aug 14, 2023

@NiKiZe I understand, I figure it likely hasn't been mentioned yet in the git issues lately because mostly people aren't running into the same problem with my specific HW perhaps? To answer your questions:

  • I'm using ipxe.efi, compiled myself off the latest commit c1834f3 using make bin-x86_64-efi/ipxe.efi
  • This is the exact NIC used, seems to be a 10gb one.
  • Yes, it's an HPE machine
  • I have done a tcpdump and didn't really see anything of relevance; packets seem normal until they suddenly just stop
  • Haven't done a git bisect to narrow down the issue. I plan to eventually but it's gonna take some time since quite a lot of commits have seemed to come through since then.

Thanks for the pointer on wimboot, i did not know that! Just getting my feet wet with pxe booting in general in the last couple weeks.

At the end of the day, the ipxe.efi i compiled after doing these steps made the difference

git clone git://git.ipxe.org/ipxe.git
cd ipxe/src
git checkout 1295b4acff1f2014261c40d9f9d2107ffd668d92
make bin-x86_64-efi/ipxe.efi

And it worked perfectly and allowed me to fully download my boot.wim file from my web server.

So going back in time in git history definitely shows something went wrong somewhere, and so I doubt it has much to do with my machine, but rather some regression that's happened since then that maybe nobody noticed until now? 🤷‍♂️ Just brought up the issue so that at least someone else can see it in case they run into the same thing, or if the problem might be obvious to anyone familiar with the codebase.

Meantime I can try to figure out which commit this starts to break to pinpoint it, but until then if you or anyone else has an idea, happy to hear it!

@elvinup
Author

elvinup commented Aug 15, 2023

Just realized how slick git bisect is, got it narrowed down faster than I thought 😁

The problems start right at commit 059c4dc @NiKiZe
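
For anyone wanting to repeat the bisect, the workflow can be sketched roughly as follows. The known-good commit is the one mentioned earlier in this thread; the helper function names are made up for illustration, and the good/bad verdict still has to come from test-booting each build on the affected machine.

```shell
#!/bin/sh
# Run from the top of an ipxe.git checkout.
GOOD=1295b4acff1f2014261c40d9f9d2107ffd668d92   # known-good commit from this thread
BAD=master                                      # known-bad: HTTP download hangs

# Tell git the two endpoints; it then checks out a commit halfway between.
start_bisect() {
    git bisect start "$BAD" "$GOOD"
}

# Build the binary for whichever commit bisect has checked out.
build_step() {
    make -C src bin-x86_64-efi/ipxe.efi
}

# Record the verdict after test-booting the build: "good" or "bad".
mark_step() {
    case "$1" in
        good|bad) git bisect "$1" ;;
        *) echo "usage: mark_step good|bad" >&2; return 2 ;;
    esac
}
```

The loop is `start_bisect`, then repeat `build_step`, boot the result, `mark_step good` or `mark_step bad` until git names the first bad commit; `git bisect reset` returns to the original branch.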

@elvinup
Author

elvinup commented Aug 15, 2023

Some more info. This is my exact NIC

Also simply reverting that commit 059c4dc makes downloading over HTTP work without issues, from the latest master commit.

But looking at the diff I don't really see what the issue could really be 😕 , specifically with device id 16D8 like mine. Is it possible this affected all BNXT nics?

@mcb30
Member

mcb30 commented Sep 4, 2023

> Also simply reverting that commit 059c4dc makes downloading over HTTP work without issues, from the latest master commit.
>
> But looking at the diff I don't really see what the issue could really be 😕 , specifically with device id 16D8 like mine. Is it possible this affected all BNXT nics?

That is strange. There is nothing of substance in that commit that should affect behaviour.

Could you check the output of ifstat in known-good and known-bad builds of ipxe.efi? I'm wondering if somehow the commit is causing the device to fail to be detected by the driver, causing iPXE to fall back to using the NIC's existing UNDI/NII driver.

@euthuppan

euthuppan commented Sep 6, 2023

@mcb30 Here's the output of an ifstat on:

a "good" ipxe.efi file
image

and a "bad" ipxe.efi file
image

I do see a slight change in that top line using NII vs 14e4-16D8. Perhaps that's the clue to the problem here?

@mcb30
Member

mcb30 commented Sep 6, 2023

> I do see a slight change in that top line using NII vs 14e4-16D8. Perhaps that's the clue to the problem here?

Yes, that's a major difference. 🙂

For some reason, the commit is causing the NIC to be recognised by the bnxt driver in iPXE. This driver then seems to suffer from the problem that you have described.

In the earlier commit, the NIC is for some reason not recognised by the bnxt driver (or fails in some way during the device probe). iPXE then falls back to using the NII driver (i.e. using API calls into the driver provided by the platform firmware, instead of driving the hardware directly).

@euthuppan

Thanks for the explanation, things are starting to make sense!
So my machine spits this out when checking for the NIC:

# lspci | egrep -i --color 'network|ethernet'
5d:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)
5d:00.1 Ethernet controller: Broadcom Inc. and subsidiaries BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)

It sounds like the current logic is somehow more correct, but for some reason the bnxt driver just doesn't want to work, right? I have updated the NIC firmware before and it still shows the same problem, so I'm not sure if it's solely a problem with this specific NIC or with the bnxt NICs in general that iPXE tries to drive natively.

Do you think all bnxt drivers should just fall back to NII if it works anyways?
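
To tie the lspci output to what iPXE matches on, the numeric PCI ID can be pulled out and searched for in the bnxt driver sources. This is only a sketch: it assumes `lspci -nn` is available on the booted system and that the search runs inside an ipxe checkout; the helper names are made up for illustration.

```shell
#!/bin/sh
# Extract vendor:device pairs such as 14e4:16d8 from `lspci -nn` output on stdin.
pci_ids() {
    sed -n 's/.*\[\([0-9a-f]\{4\}:[0-9a-f]\{4\}\)\].*/\1/p'
}

# List the lines in the bnxt driver sources that mention a given device ID.
# $1 = device ID without the 0x prefix, e.g. 16d8
driver_mentions() {
    git grep -n -i "$1" -- src/drivers/net/bnxt/
}
```

For example, `lspci -nn | grep -i ethernet | pci_ids` followed by `driver_mentions 16d8`. If the ID is listed there, the native bnxt driver will try to claim the NIC instead of leaving it to the firmware's NII driver.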

@NiKiZe
Contributor

NiKiZe commented Sep 6, 2023

The Broadcom driver has issues on many machines. Often it depends on the firmware in the machine (on Macs, for example), where it works with older versions but not newer ones.

Consider using snponly.efi or snp.efi, at least on these machines.
But of course if we can figure out how to fix the driver that would be great. (But this should now probably be a duplicate of that existing issue)
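
As a rough sketch of building those alternatives: as far as I understand the build targets, snponly.efi and snp.efi go through the network interface provided by the platform firmware rather than iPXE's native drivers, which sidesteps the bnxt driver. The helper names below are made up for illustration; the make targets follow the same pattern as the ipxe.efi build used earlier in this thread.

```shell
#!/bin/sh
# Map a short build name to its x86_64 EFI make target.
efi_target() {
    printf 'bin-x86_64-efi/%s.efi\n' "$1"
}

# Build one variant from the top of an ipxe checkout, e.g. build_efi snponly
build_efi() {
    make -C src "$(efi_target "$1")"
}
```

`build_efi snponly` or `build_efi snp` then leaves the binary under `src/bin-x86_64-efi/`.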

@elvinup
Author

elvinup commented Sep 7, 2023

@NiKiZe Thanks for the suggestion, snp.efi seems to work just fine, now realizing that it was an option 😁

Feel free to mark this as a duplicate then to another bad driver issue.

@sploders101

sploders101 commented Oct 1, 2023

I'm not sure how useful this is, but I am also experiencing this issue. I have an ASRock Rack motherboard with the BCM57416 (same chip as the card linked earlier). Reverting 059c4dc on master (currently 8b14652) fixed it for me as well. Once I finish my redundant server setup, I'd be happy to do some testing and maybe contribute. I'm not great with C, but I can kind of kludge something together and take comments in a review to clean it up.

This commit doesn't seem too complicated, but I'm missing some of the context. If someone who understands it could point me in the right direction I'd love to give it a go. Otherwise, I might circle back to it later.

@NiKiZe
Contributor

NiKiZe commented Oct 1, 2023

Could you grab the ifstat output from iPXE, both with and without that commit?

Unless that commit changed which devices are "supported", we wouldn't expect any changes.

@sploders101

Sorry for the late response. I haven't had a good opportunity to take my server down for troubleshooting lately without making someone upset. Here are the results of running ifstat on the latest build before and after DHCP. The port it's connected to is a trunk port with multiple VLANs, but I don't see why traffic for an undeclared VLAN would trigger an error.

Screenshot 2023-10-16 at 11 36 22 AM
Screenshot 2023-10-16 at 11 34 48 AM

@NiKiZe
Contributor

NiKiZe commented Oct 16, 2023

Reported errors from ifstat are common, and often not an issue.

The interesting part is comparing ifstat output between the working and non working iPXE builds.
This is the one with issues? What about ifstat from the working build?

@NiKiZe
Contributor

NiKiZe commented Dec 5, 2023

Did you have a chance to compare ifstat between working and non working builds?

@ErwanAliasr1

I had a boot issue with a BCM57414 (14e4:16d7): depending on the server I'm loading a large ramfs (500MB+) from, iPXE gets stuck at a random percentage of the file.

It was reproducible, so I started from the upstream version (26d3ef0) and enabled debug.
The trace always ends like this:

  • RX Stat Total 1955 Good 1955 Drop err 0 LB 0 VLAN 0
  • CQ Type (rx) cid 27
    RX desc_idx 3 PktLen 60
  • RX Stat Total 1956 Good 1956 Drop err 0 LB 0 VLAN 0
  • CQ Type (rx) cid 29
  • CQ Type (rx) cid 29
  • CQ Type (rx) cid 29
  • CQ Type (rx) cid 29
  • CQ Type (rx) cid 29
  • CQ Type (rx) cid 29
  • CQ Type (rx) cid 29
  • CQ Type (rx) cid 29
    [...]

It appears that bnxt_rx_complete() (https://github.com/ipxe/ipxe/blob/master/src/drivers/net/bnxt/bnxt.c#L484) returns NO_MORE_CQ_BD_TO_SERVICE in a loop.

This makes the driver loop on the same completion entry without reading new packets, which blocks the boot process.
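
To check whether another machine shows the same signature, a captured debug log can be scanned for the longest run of identical "CQ Type (rx) cid" lines. This is a sketch only: the log format is taken from the trace above, the helper names are made up, and `DEBUG=bnxt` assumes the driver's debug object is named after bnxt.c.

```shell
#!/bin/sh
# Build with driver debug messages enabled (assumption: object name is "bnxt").
build_debug() {
    make -C src bin-x86_64-efi/ipxe.efi DEBUG=bnxt
}

# Read a debug log on stdin and print "<cid> <count>" for the longest run of
# consecutive completion-queue lines that report the same cid.
longest_cid_run() {
    awk '
        /CQ Type \(rx\) cid/ {
            cid = $NF
            if (cid == prev) run++; else run = 1
            prev = cid
            if (run > best) { best = run; best_cid = cid }
            next
        }
        # Any other line (e.g. an RX descriptor) means progress was made.
        { prev = ""; run = 0 }
        END { if (best) print best_cid, best }
    '
}
```

A healthy log should only show short runs; a count that keeps growing for one cid matches the stuck state described above.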

It's unclear to me if it's a firmware or driver issue.

I tried a workaround (attached to this comment): by reducing the number of RX buffers, all my servers were able to boot large files. This patch can affect the download speed when operating in perfect conditions, but it solved my issue here.

If anyone wants to test it and give feedback, I'd be happy to hear it. I can also offer a PR, but I'd love to have more comments on it first.

0004-bnxt.txt

For reference, here is my card info:
FW Version : 226.0.145.0
cmd timeout : 5000
hwrm_max_req_len : 128
hwrm_max_ext_req : 384
chip_num : 16d7
chip_id : 1010000
Port Number : 0
fid : 0x0001
PF MAC : 14:23:f2:c3:e4:20
min_hw_ring_grps : 169
max_hw_ring_grps : 169
min_tx_rings : 170
max_tx_rings : 170
min_rx_rings : 241
max_rx_rings : 241
min_cq_rings : 295
max_cq_rings : 295
min_stat_ctxs : 235
max_stat_ctxs : 235
ordinal_value : 0
stat_ctx_id : 0
num_cmpl_rings : 1
num_tx_rings : 1
num_rx_rings : 1
num_ring_grps : 1
num_stat_ctxs : 1

I also emailed the original driver author to inform him about the issue and my workaround, we'll see how it goes.

@jw14812
Contributor

jw14812 commented Feb 12, 2024

Hi, I tried reproducing this issue by downloading a 2.5GB test file using TFTP and HTTP, but I was not able to observe the hang. Is the boot.wim or large ramfs available to download so that I can test it on my setup directly?

@ErwanAliasr1

I cannot share the RAMFS I'm using in production, but it's a 600MB one. Please note that this is on a real production network infrastructure involving several switches and routing between the server and the client.

@mcb30
Member

mcb30 commented Feb 13, 2024

@jw14812 thanks for testing!

@ErwanAliasr1 can you try simplifying the setup by e.g. trying a different (and public) large image to download, or by using a direct connection that eliminates the variety of switches and routers from the scenario?

@ErwanAliasr1

@mcb30 Trying to download another file will be easy, but bypassing the whole infra will be complicated for me. I can't hijack the infra like this :(
That said, I don't think the content of the file has any particular impact, as the driver was stuck after a couple of seconds.
The driver is a bit complex to read as there are no comments/explanations. I was hoping the traces I reported would help @jw14812 get an idea of what's wrong here.

@agrevtsev

Hey guys! Thanks for the clues!
We were trying to boot a Supermicro server with an H12SSL-NT motherboard and these NICs:

45:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)
45:00.1 Ethernet controller: Broadcom Inc. and subsidiaries BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)

When I used ipxe.efi, booting over HTTP got stuck while downloading the initrd (around 30Mb), with the error "Error: No buffer space available".
As soon as I rebuilt ipxe.efi with @ErwanAliasr1's fix (thanks mate!), the server could download the initrd and boot (the download was veeeery slow, but successful).

Sorry for lack of debug logs - unfortunately have only IPMI access, can't copy-paste.

Br, Alexey

@ErwanAliasr1

ErwanAliasr1 commented Feb 26, 2024 via email
