batch_size > 1 results in NaN loss value #3950
Comments
Hi @K-Mistele! This is a known issue that we recently debugged, and it's actually not specific to Ludwig! The best way to solve it is to set the training precision to bfloat16. However, I notice you're training on a V100, and I don't think bfloat16 is supported there since it only works on Ampere architectures and above. Is there any chance you can use a newer Nvidia GPU?
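(Not from the thread, just a minimal PyTorch sketch of the underlying idea: float16's small dynamic range is a common cause of NaN losses that bfloat16 avoids, and bfloat16 support can be checked at runtime. The V100 reports compute capability 7.0, below the Ampere 8.0 requirement mentioned above.)

```python
import torch

# float16 tops out at 65504, so large intermediate values overflow to inf,
# and later ops on inf (e.g. inf - inf) turn the loss into NaN.
x = torch.tensor(70000.0)
print(x.to(torch.float16))   # inf
print(x.to(torch.bfloat16))  # finite (~70000): bfloat16 shares float32's exponent range

# bfloat16 compute needs Ampere (compute capability >= 8.0) or newer.
if torch.cuda.is_available():
    print(torch.cuda.get_device_capability())  # e.g. (7, 0) on a V100
    print(torch.cuda.is_bf16_supported())      # False on a V100, True on Ampere+
```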
The only Nvidia GPU that supports the
@K-Mistele that makes sense! Actually, the entire A series uses Ampere, so you could consider an A5000 from AWS, which is pretty cheap. I might also suggest giving the Predibase free trial a try, since we have A5000s/A6000s etc. (A10Gs) for fine-tuning and $25 in free trial credits!
I am planning to give it a try; I just want to make sure I can use the tool locally first.
Is there no workaround for a V100?
Unfortunately, not to my knowledge with Mistral. Do you want to test Llama-2-7B instead? The issue doesn't show up there with larger batch sizes!
Yeah, I can try it.
@K-Mistele let me know how it goes!
Do you know if Zephyr has the same problem @arnavgarg1?
@K-Mistele not to my knowledge!
@K-Mistele Did the fix work?
Describe the bug
When I set a `trainer.batch_size` of > 1 or `auto`, my loss value is always `NaN`, and training will fail and exit at the end of the first epoch. Setting `batch_size` to 1 fixes the issue, but results in very inefficient GPU utilization for more powerful GPUs.

To Reproduce
Steps to reproduce the behavior:
Do LoRA training with a `trainer.batch_size` of `auto` or > 1.
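(Not part of the original report: a hedged sketch of the kind of LoRA fine-tuning setup this describes. The config keys follow the Ludwig LLM fine-tuning layout as I recall it and should be checked against the version in use; the feature names and dataset path are placeholders.)

```python
# Illustrative only: a minimal Ludwig LLM LoRA fine-tuning setup.
# Keys/values are assumptions; feature names and dataset path are placeholders.
from ludwig.api import LudwigModel

config = {
    "model_type": "llm",
    "base_model": "mistralai/Mistral-7B-v0.1",  # Mistral is the model discussed in the thread
    "input_features": [{"name": "prompt", "type": "text"}],
    "output_features": [{"name": "output", "type": "text"}],
    "adapter": {"type": "lora"},
    "trainer": {
        "type": "finetune",
        "batch_size": 4,  # > 1 (or "auto") is what triggers the NaN loss on a V100
        "epochs": 1,
    },
}

model = LudwigModel(config)
results = model.train(dataset="train.csv")  # placeholder dataset
```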
Expected behavior
I would expect a non-`NaN` loss value.

Screenshots
Environment (please complete the following information):
Additional context
GPU: 1x Tesla V100 32GB