
Gradient overflow #30

Open
datar001 opened this issue Aug 21, 2021 · 4 comments
@datar001
Hi, why does this repo output "Gradient overflow"? I ran the msrvtt_qa task with 1 and 2 1080Ti GPU(s). Can this repo be trained on a single GPU (1080Ti)? Thanks!

[screenshots of the training log showing "Gradient overflow" warnings]

@jayleicn
Owner

Hi @datar001, by default we trained on 4-8 V100 GPUs; we have not tried training on a single 1080Ti. This warning is expected and is due to the use of mixed precision (fp16).

@linjieli222
Collaborator

One more piece of information: from our past experience, if the loss scaler stays above 1, training should be stable.
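For context on why the scaler value matters: fp16 training typically uses a dynamic loss scaler that halves the scale and skips the optimizer step whenever gradients overflow (the source of the warning), and grows the scale again after a run of clean steps. The sketch below is a simplified, hypothetical model of that behavior, not the actual code used by this repo; class and parameter names are made up for illustration.

```python
import math

class DynamicLossScaler:
    """Simplified dynamic loss scaler mimicking typical fp16 behavior.

    On overflow (inf/nan gradients): halve the scale and skip the step,
    which is when frameworks print a "Gradient overflow" warning.
    After `growth_interval` clean steps: double the scale.
    """

    def __init__(self, init_scale=2.0 ** 15, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, grads):
        """Return True if the optimizer step should be applied, False if skipped."""
        overflow = any(math.isinf(g) or math.isnan(g) for g in grads)
        if overflow:
            # Overflow detected: shrink the scale and skip this step.
            self.scale /= 2.0
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            # Enough clean steps in a row: try a larger scale again.
            self.scale *= 2.0
            self._good_steps = 0
        return True
```

Under this model, repeated overflows drive the scale down step after step; if it falls to 1 or below, essentially every step is overflowing and training is diverging, which matches the rule of thumb above that the scaler staying above 1 indicates stable training.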

@peiswang

@datar001 Have you solved this problem? I have the same problem when running on 8 RTX 3090 GPUs.

@akira-l

akira-l commented Dec 13, 2022

Met the same problem on 2x 2080Ti and on 2x P6000. Training on the 2080Tis lasted longer but still failed after two epochs.
