val loss in distribute training #674

Open
LiuSiQi-TJ opened this issue Aug 15, 2023 · 2 comments
Labels: bug, help wanted

Comments

@LiuSiQi-TJ

I am training DCCRN on the LibriMix dataset with 8 GPUs.
I enabled early stopping in the conf.
The model always stops at a very early stage, around epoch 10 or 20.
In the log, I see that val loss is calculated separately on each GPU, while early stopping is applied only by GPU 0. I think this is the reason for the very early stop. The log is as follows:

[rank: 5] Metric val_loss improved by 0.433 >= min_delta = 0.0. New best score: -11.178
[rank: 0] Metric val_loss improved by 0.333 >= min_delta = 0.0. New best score: -11.104
[rank: 7] Metric val_loss improved by 0.530 >= min_delta = 0.0. New best score: -10.551
[rank: 4] Metric val_loss improved by 0.408 >= min_delta = 0.0. New best score: -10.931
[rank: 1] Metric val_loss improved by 0.287 >= min_delta = 0.0. New best score: -10.971
[rank: 3] Metric val_loss improved by 0.415 >= min_delta = 0.0. New best score: -11.321
[rank: 2] Metric val_loss improved by 0.418 >= min_delta = 0.0. New best score: -10.858
[rank: 6] Metric val_loss improved by 0.504 >= min_delta = 0.0. New best score: -11.375
Epoch 2, global step 1587: 'val_loss' reached -11.10351 (best -11.10351),
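
To make the spread in the log concrete, here is a minimal sketch in plain Python using the per-rank "New best score" values shown above. It assumes each rank evaluated an equal-size shard of the validation set, so a simple mean stands in for what an all-reduce mean across ranks would produce. `reduce_mean` is a hypothetical helper for illustration, not a Lightning or asteroid function. In Lightning, `self.log("val_loss", loss, sync_dist=True)` is the option that requests a cross-rank reduction of a logged metric; whether it resolves this issue has not been confirmed in this thread.

```python
from statistics import mean

# Per-rank "New best score" values copied from the log above (rank -> val_loss).
per_rank_val_loss = {
    0: -11.104, 1: -10.971, 2: -10.858, 3: -11.321,
    4: -10.931, 5: -11.178, 6: -11.375, 7: -10.551,
}


def reduce_mean(values):
    """Hypothetical stand-in for an all-reduce mean over ranks.

    Assumes equal-size validation shards, so the plain mean is the value
    every rank would see after synchronization.
    """
    return mean(values)


# Single value that all ranks would share after the reduction.
synced = reduce_mean(per_rank_val_loss.values())

# Gap between the best and worst per-rank value when nothing is synchronized.
spread = max(per_rank_val_loss.values()) - min(per_rank_val_loss.values())

print(f"rank 0 only : {per_rank_val_loss[0]:.3f}")
print(f"synced mean : {synced:.3f}")
print(f"rank spread : {spread:.3f}")
```

Without synchronization, a callback that monitors `val_loss` on one rank sees only that rank's shard, so its value can differ noticeably from the other ranks and from the average.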

@LiuSiQi-TJ
Author

I set CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 in run.sh. Did I do something wrong?

@mpariente
Collaborator

Hello,

I would say you did not do anything wrong. Which version of Lightning are you using?
