[CLI]: Large files not successfully uploading when resuming #7557
Comments
@collinmccarthy could it be that at some point you deleted these files?
Yes, I did delete the files via the Api, accidentally. I continually delete old checkpoints via the Api during training, including both of the large files linked above. Now I would like to re-add the last one to the same run that I deleted them from.
@collinmccarthy it looks like it is currently not possible to reuse the name of a deleted file within the same run. Our backend team is trying to figure out the best approach to solve this issue. Sorry for the trouble here.
@kptkin Got it, thank you for the update. I'll go with the prefix approach for now; hopefully I won't accidentally delete something again anyway.
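The prefix approach mentioned above could look roughly like the following sketch. The helper names and the `reupload_` prefix are hypothetical and do not come from the thread; the only wandb call used is `wandb.save` with its `base_path` and `policy` arguments, and it assumes a run is already active via `wandb.init()`.

```python
import shutil
from pathlib import Path


def prefixed_copy(checkpoint: Path, prefix: str = "reupload_") -> Path:
    """Copy a checkpoint next to the original under a new, prefixed file name."""
    target = checkpoint.with_name(prefix + checkpoint.name)
    shutil.copy2(checkpoint, target)
    return target


def save_under_new_name(checkpoint: Path, prefix: str = "reupload_") -> Path:
    """Upload the prefixed copy so the deleted file name is not reused.

    Assumes wandb.init() has already been called for the resumed run.
    """
    import wandb  # imported lazily so prefixed_copy works without wandb installed

    target = prefixed_copy(checkpoint, prefix)
    # base_path makes the uploaded name relative to the checkpoint directory.
    wandb.save(str(target), base_path=str(target.parent), policy="now")
    return target
```

Copying doubles the disk space used by a large checkpoint; renaming the file instead would avoid that, at the cost of changing the local name as well.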
Hi @collinmccarthy we have raised a bug request for this and our Engineering team is already looking at this issue. We will get back to you for an update.
Describe the bug
When my run finished training, it accidentally deleted a checkpoint (using wandb.Api()) that I wanted to keep. I always back up the wandb-resume.json file so I can resume training even after calling finish with exit_code=0. So I re-ran the training run, and at the start of training it calls wandb.save for all checkpoint files, which should have re-uploaded the checkpoint I deleted via wandb.Api().
The run ends, and calling finish() uploads all the files after a minute or so. But only the small files show up, not the large ones. I checked the files directory and I see the symlinks for all checkpoints, which look correct. I tried changing the policy from 'now' to 'live', and reverted to wandb version 0.16.6 (I was on 0.16.7 dev). No luck.
To narrow down the issue I added some other files, a dummy .pth file (with just a line of text), and a different checkpoint. The dummy .pth file uploads fine. Both large checkpoints go through the process of uploading, but they never appear in the web GUI nor when querying the files via wandb.Api().
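A check like the one described above can be scripted. The sketch below compares the file names expected on the run with those returned by `wandb.Api().run(...).files()`. The function names are hypothetical, and the run path format `entity/project/run_id` is an assumption about how the run is addressed.

```python
from typing import Iterable, List


def missing_uploads(expected: Iterable[str], remote: Iterable[str]) -> List[str]:
    """Return the expected file names that are absent from the remote listing."""
    remote_set = set(remote)
    return sorted(name for name in expected if name not in remote_set)


def remote_file_names(run_path: str) -> List[str]:
    """List the file names stored on a run, e.g. run_path='entity/project/run_id'."""
    import wandb  # imported lazily so missing_uploads works without wandb installed

    run = wandb.Api().run(run_path)
    return [f.name for f in run.files()]
```

Calling `missing_uploads(expected_names, remote_file_names(run_path))` after `finish()` would list any checkpoint that went through the upload step but never reached the run.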
I also tried using the CLI to call wandb sync, and it said it synced, but the large checkpoints are still missing.
Additional Files
I've attached my log files. The large checkpoints are best_coco_PQ_epoch_70.pth and best_coco_PQ_epoch_150.pth; they failed to upload (but didn't throw any exceptions or print any errors). The dummy checkpoint is best_test.pth, which uploaded fine.
debug-internal.log
debug.log
Environment
WandB version: 0.16.6
OS: Ubuntu 22.04.4 LTS
Python version: 3.11.8
Versions of relevant libraries: Conda env with any packages installed. Let me know if you want the full list.
Additional Context
This issue only seems to happen when resuming. If I delete the wandb-resume.json file, it creates a new run, uploads the files, then immediately exits, and everything works. It takes a minute to upload, and I immediately see the files on Wandb.