Slow down between epochs when using ddp with num_workers > 0 #576

Open
5c4lar opened this issue May 12, 2023 · 3 comments

Comments

@5c4lar

5c4lar commented May 12, 2023

When using ddp with num_workers > 0, training slows down between epochs. I tried using ddp_spawn with persistent_workers, as the docs suggest:

When using strategy="ddp_spawn" and num_workers>0, consider setting persistent_workers=True inside your DataLoader since it can result in data-loading bottlenecks and slowdowns.

But the same document also says that num_workers > 0 should not be used with ddp_spawn, which is confusing.
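For reference, the persistent_workers setting quoted above is an argument of the PyTorch DataLoader. A minimal sketch, assuming a toy TensorDataset in place of the real dataset; the helper name make_train_loader and the batch size are hypothetical and not taken from the template:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_train_loader(num_workers: int = 2) -> DataLoader:
    """Build a DataLoader that keeps its worker processes alive across epochs."""
    # Toy dataset standing in for the real one (assumption for illustration).
    dataset = TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,)))
    return DataLoader(
        dataset,
        batch_size=16,
        shuffle=True,
        num_workers=num_workers,
        # persistent_workers is only valid when num_workers > 0, so it is
        # switched off automatically for the single-process case.
        persistent_workers=num_workers > 0,
    )
```

Without persistent_workers=True, the worker processes are shut down and started again each time the loader is iterated, which is one plausible source of a pause at epoch boundaries.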

Also, when using the wandb logger to log images in ddp_spawn mode, the images are written to the project root dir and are not sent to the server correctly. How can we fix this?

@ashleve
Owner

ashleve commented May 12, 2023

Normal ddp should work correctly now. Have you tried it? I have updated the default ddp config recently #571

I'm not sure what's going on when you log images to wandb, but have you made sure that logging runs only on the rank 0 process? You don't want each DDP process to log the same image independently.
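To illustrate the rank 0 guard, here is a minimal stand-in decorator. It is a sketch only: Lightning ships its own rank_zero_only utility, and this version does not reproduce it. The name rank_zero_only_sketch, the use of the RANK environment variable, and the log_image placeholder are all assumptions made for the example.

```python
import functools
import os
from typing import Any, Callable, Optional


def rank_zero_only_sketch(fn: Callable[..., Any]) -> Callable[..., Optional[Any]]:
    """Run fn only in the process whose RANK environment variable is 0."""

    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> Optional[Any]:
        # Read the rank at call time; an unset RANK is treated as rank 0
        # (single-process run). This convention is an assumption.
        rank = int(os.environ.get("RANK", "0"))
        if rank == 0:
            return fn(*args, **kwargs)
        # Every other rank skips the call entirely.
        return None

    return wrapper


@rank_zero_only_sketch
def log_image(path: str) -> str:
    # Placeholder for a real logger call, e.g. sending an image to wandb.
    return f"logged {path}"
```

With a guard like this, calling log_image from every DDP process results in a single logged image rather than one per process.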

@5c4lar
Author

5c4lar commented May 12, 2023

Normal ddp should work correctly now. Have you tried it? I have updated the default ddp config recently #571

Yes, I used normal ddp before. After observing the slowdown, I tried many different settings to fix it and finally found that with ddp_spawn the pause between epochs disappears.

I'm sure I log only on the rank 0 process. The function is decorated with rank_zero_only.

I think the problem is that the output dir of wandb logger is set to output_dir: ${hydra:runtime.output_dir}, which doesn't work as desired for ddp spawn mode.

@ashleve
Owner

ashleve commented May 12, 2023

I think the problem is that the output dir of wandb logger is set to output_dir: ${hydra:runtime.output_dir}, which doesn't work as desired for ddp spawn mode.

It seems like it. I guess you could set output_dir: ${paths.root_dir}/.wandb as a fix for now, so the wandb dir will always be the same.
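The same idea can be sketched in plain Python: derive the wandb directory from a fixed project root instead of a per-run output directory, so that every process resolves the same path. The helper name fixed_wandb_dir is hypothetical, and whether this fully resolves the ddp_spawn behavior is not confirmed in this thread.

```python
from pathlib import Path


def fixed_wandb_dir(root_dir: str) -> Path:
    """Return <root_dir>/.wandb, creating the directory if it is missing."""
    wandb_dir = Path(root_dir) / ".wandb"
    # exist_ok=True means no error is raised if the directory already
    # exists, for example because another process created it first.
    wandb_dir.mkdir(parents=True, exist_ok=True)
    return wandb_dir
```

The returned path could then be handed to the logger as its save directory; that wiring is left out here because it depends on the project's logger config.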
