[CLI]: Large files not successfully uploading when resuming #7557

Open
collinmccarthy opened this issue May 3, 2024 · 7 comments
Labels
a:app Area: Frontend/Backend

Comments

@collinmccarthy
collinmccarthy commented May 3, 2024

Describe the bug

When my run finished training, it accidentally deleted a checkpoint (using wandb.Api()) that I didn't want it to. I always back up the wandb-resume.json file so I can resume training even after calling finish with exit_code=0. So I re-ran the training run, and at the start of training it calls wandb.save for all checkpoint files, which should have re-uploaded the checkpoint I deleted via wandb.Api().

The run ends, and calling finish() uploads all the files after a minute or so. But only the small files show up, not the large ones. I checked the files directory and I see the symlinks for all checkpoints, which look correct. I tried changing the policy from 'now' to 'live', and reverted back to Wandb version 0.16.6 (I was on 0.16.7 dev). No luck.

To narrow down the issue I added some other files, a dummy .pth file (with just a line of text), and a different checkpoint. The dummy .pth file uploads fine. Both large checkpoints go through the process of uploading, but they never appear in the web GUI nor when querying the files via wandb.Api().

I also tried using the CLI to call wandb sync and it said it sync'ed but the large checkpoints are still missing.
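
For reference, the check against wandb.Api() described above can be sketched roughly as follows. This is a minimal illustration, not the code from this run: the helper names and the "entity/project/run_id" path are placeholders, and the remote listing in the demo is a hard-coded stand-in that mirrors the report.

```python
def remote_file_names(run_path):
    """Return the set of file names the W&B backend lists for a run.

    `run_path` has the form "entity/project/run_id" (placeholder, not from this issue).
    """
    import wandb  # imported lazily so the pure helper below works without wandb

    run = wandb.Api().run(run_path)
    return {f.name for f in run.files()}


def missing_uploads(expected, remote):
    """Names that were saved locally but are absent from the remote listing."""
    return sorted(set(expected) - set(remote))


if __name__ == "__main__":
    expected = [
        "best_coco_PQ_epoch_70.pth",
        "best_coco_PQ_epoch_150.pth",
        "best_test.pth",
    ]
    # Stand-in for remote_file_names("entity/project/run_id").
    remote = {"best_test.pth"}
    print(missing_uploads(expected, remote))
```

Depending on the base_path passed to wandb.save, remote names may include a directory prefix, so the expected names may need the same prefix.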

Additional Files

I've attached my log files. The large checkpoints are best_coco_PQ_epoch_70.pth and best_coco_PQ_epoch_150.pth; both failed to upload (but didn't throw any exceptions or print any errors). The dummy checkpoint is best_test.pth, which uploaded fine.

debug-internal.log
debug.log

Environment

WandB version: 0.16.6

OS: Ubuntu 22.04.4 LTS

Python version: 3.11.8

Versions of relevant libraries: Conda env with any packages installed. Let me know if you want the full list.

Additional Context

This issue only seems to happen when resuming. If I delete the wandb-resume.json file, it creates a new run, uploads the files, then immediately exits, and everything works. It takes a minute to upload, and I immediately see the files on Wandb.

@kptkin
Contributor

kptkin commented May 6, 2024

@collinmccarthy could it be that at some point you deleted these files?
This is the only way I was able to reproduce this behavior.
If so, that would explain the behavior you are seeing: it seems that once a file is deleted, the UI currently doesn't show it even when you re-upload it.

@kptkin kptkin added a:app Area: Frontend/Backend and removed a:cli Area: Client c:stitch c:save labels May 6, 2024
@collinmccarthy
Author

collinmccarthy commented May 6, 2024

Yes, I did delete the files via the Api, accidentally. I continually delete old checkpoints via the Api during training, including both of the large files linked above. Now I would like to re-add the last one to the same run that I deleted them from.
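
One way to guard this kind of cleanup against removing the newest checkpoint is sketched below. It is an illustration under assumptions, not the code used in this run: the helper names, the `epoch_<N>.pth` naming pattern (inferred from the file names above), and the `run_path` placeholder are all hypothetical.

```python
import re

# Assumed naming pattern, based on names like best_coco_PQ_epoch_70.pth.
EPOCH_RE = re.compile(r"epoch_(\d+)\.pth$")


def stale_checkpoints(names, keep_last=1):
    """Return checkpoint names that may be deleted, oldest epoch first.

    Only names matching `epoch_<N>.pth` are considered, and the `keep_last`
    highest-epoch checkpoints are always retained.
    """
    epochs = []
    for name in names:
        match = EPOCH_RE.search(name)
        if match:
            epochs.append((int(match.group(1)), name))
    epochs.sort()
    if keep_last <= 0:
        return [name for _, name in epochs]
    return [name for _, name in epochs[:-keep_last]]


def delete_remote_files(run_path, names):
    """Delete the given files from a run ("entity/project/run_id" placeholder)."""
    import wandb  # imported lazily so the selection logic is testable without it

    doomed = set(names)
    # Materialize the listing first so deletion does not disturb pagination.
    for f in list(wandb.Api().run(run_path).files()):
        if f.name in doomed:
            f.delete()
```

Keeping the selection logic separate from the API call makes it possible to log or dry-run the list before anything is deleted.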

@kptkin
Contributor

kptkin commented May 10, 2024

@collinmccarthy it looks like it is currently not possible to reuse the name of a deleted file within the same run. Our backend team is trying to figure out the best approach to solve this issue.
In the short term, I would suggest changing the name slightly, for example by adding an index prefix or any other scheme that makes sense in your code.

sorry for the trouble here.
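
As an illustration of the renaming workaround, a minimal sketch might look like the following. The helper name `fresh_name` and the `v<N>_` prefix scheme are assumptions chosen for the example, not something wandb provides; the usage in the trailing comments assumes an active run and that the set of previously deleted names is tracked by the caller.

```python
def fresh_name(name, used_names):
    """Return `name`, or an index-prefixed variant not present in `used_names`.

    `used_names` should hold every file name that was ever uploaded to (or
    deleted from) the run, since deleted names cannot currently be reused.
    """
    candidate = name
    index = 1
    while candidate in used_names:
        candidate = f"v{index}_{name}"
        index += 1
    return candidate


# Hypothetical usage (requires wandb and an active run):
#   import shutil, tempfile, wandb
#   from pathlib import Path
#   new_name = fresh_name("best_coco_PQ_epoch_150.pth", deleted_names)
#   staging = Path(tempfile.mkdtemp())
#   shutil.copy2(ckpt_path, staging / new_name)
#   wandb.save(str(staging / new_name), base_path=str(staging), policy="now")
```

Copying a large checkpoint doubles its disk usage temporarily; a symlink under the new name would avoid that, at the cost of depending on the original file staying in place until the upload finishes.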

@collinmccarthy
Author

@kptkin Got it, thank you for the update. I'll go with the prefix approach for now, hopefully I won't accidentally delete something again anyway.

@paulosabile-wb

Hi @collinmccarthy, we have raised a bug request for this and our Engineering team is already looking at this issue. We will get back to you with an update.
