Slow down between epochs when using ddp with num_workers > 0 #576
Normal ddp should work correctly now. Have you tried it? I updated the default ddp config recently in #571. I'm not sure what's going on when you log images to wandb, but have you made sure to execute logging only on the rank 0 process? You don't want each DDP process to log the same image independently.
Yes, I used the normal ddp before. While observing the slowdown, I played around with many different settings to solve it, and finally found that when using ddp_spawn, the pause between epochs disappears. I'm sure I log only on the rank 0 process; the function is decorated with . I think the problem is that the output dir of the wandb logger is set to
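For context, the rank-0-only logging pattern mentioned above can be sketched as a small decorator. This is a self-contained illustration of the idea, not the decorator actually used in the project; reading the rank from the `RANK`/`LOCAL_RANK` environment variables (as common DDP launchers set them) is an assumption:

```python
import os
from functools import wraps

def rank_zero_only(fn):
    """Run fn only on the global rank 0 process; no-op on other ranks."""
    @wraps(fn)
    def wrapped(*args, **kwargs):
        # DDP launchers typically export RANK/LOCAL_RANK; default to 0 in a single process
        rank = int(os.environ.get("RANK", os.environ.get("LOCAL_RANK", 0)))
        if rank == 0:
            return fn(*args, **kwargs)
        return None
    return wrapped

calls = []

@rank_zero_only
def log_images(batch):
    calls.append(batch)

os.environ["RANK"] = "0"
log_images("imgs-a")  # runs on rank 0
os.environ["RANK"] = "1"
log_images("imgs-b")  # skipped on non-zero ranks
```

With this gating in place, only one copy of each image reaches the logger regardless of how many DDP processes are running.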
It seems like it. I guess you could set
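If the idea is to redirect the logger's output away from the project root, one way would be overriding the wandb logger's `save_dir` in a Hydra-style config. The file path, key names, and values below are assumptions for illustration, not the template's actual config:

```yaml
# e.g. configs/logger/wandb.yaml -- path and keys are assumptions
wandb:
  _target_: pytorch_lightning.loggers.wandb.WandbLogger
  project: my_project   # hypothetical project name
  save_dir: logs/       # write wandb files under the logs dir instead of the project root
```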
When using ddp with num_workers > 0, the training slows down between epochs. I tried using ddp_spawn with persistent_workers according to the doc,
but in the same document, they say num_workers > 0 should not be used with ddp_spawn, which is confusing.
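For reference, `persistent_workers` is a flag on the PyTorch `DataLoader`: with `num_workers > 0` it keeps the worker processes alive across epochs instead of re-spawning them at each epoch boundary, which is the usual source of the pause described above. A minimal sketch (the toy dataset and sizes are made up):

```python
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return idx

# persistent_workers=True requires num_workers > 0; workers are reused on every epoch
loader = DataLoader(ToyDataset(), batch_size=4, num_workers=2, persistent_workers=True)

seen = []
for epoch in range(2):
    for batch in loader:  # on the second epoch, no worker re-spawn pause
        seen.extend(batch.tolist())
```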
Also, when using the wandb logger to log images in ddp_spawn mode, the images are logged into the root project dir and are not sent to the server correctly. How can we fix this problem?