batch_size > 1 results in NaN loss value #3950

Open
K-Mistele opened this issue Feb 29, 2024 · 11 comments

Comments

@K-Mistele

Describe the bug

When I set a trainer.batch_size of > 1 or auto, my loss value is always NaN, and training will fail and exit at the end of the first epoch. Setting batch_size to 1 fixes the issue, but results in very inefficient GPU utilization for more powerful GPUs.

To Reproduce

Steps to reproduce the behavior:

Do LoRA training with a trainer.batch_size of auto or > 1:

model_type: llm
base_model: mistralai/Mistral-7B-v0.1
quantization:
  bits: 4

adapter:
  type: lora

prompt:
  template: >-
    You are given a premise and a hypothesis below. If the premise entails the hypothesis, return 0. If the premise co>

    ### Premise: {premise}

    ### Hypothesis: {hypothesis}

    ### Label:

input_features:
  - name: input # this is a placeholder since we are using a prompt template; it is not expected to match a column
    type: text

output_features:
  - name: label
    type: text

trainer:
  type: finetune
  batch_size: 1
  enable_gradient_checkpointing: true
  epochs: 1
  learning_rate: 0.00002
  learning_rate_scheduler:
      decay: cosine
      warmup_fraction: 0.03
      reduce_on_plateau: 0
backend: 
  type: local

generation:
  temperature: 0.1
  max_new_tokens: 512

preprocessing:
  split:
     type: random
     probabilities: [0.9, 0.05, 0.05]
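
Note that the config above is posted with batch_size: 1 (the setting that works); to reproduce the failure, the trainer section is changed to something like the following sketch, using auto or any value greater than 1:

trainer:
  type: finetune
  batch_size: auto  # or any value > 1; this is what triggers the NaN loss
  enable_gradient_checkpointing: true
  epochs: 1
  learning_rate: 0.00002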

Expected behavior

I would expect a non-NaN loss value.

Screenshots

Starting with step 0, epoch: 0
Training:  33%|███▎      | 429/1287 [32:07<1:08:57,  4.82s/it, loss=nan]Found NaN or inf values in parameter 'model.base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight' of module 'LLM'
NaN or inf tensors found in the model. Stopping training.
Could not load best checkpoint state from /mnt/disk/AI/ludwig/ludwig-lora/results/experiment_run/model/training_checkpoints/best.ckpt. Best checkpoint may not exist.
Traceback (most recent call last):
  File "/home/constellate/anaconda3/envs/ludwig/bin/ludwig", line 8, in <module>
    sys.exit(main())
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/cli.py", line 197, in main
    CLI()
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/cli.py", line 72, in __init__
    getattr(self, args.command)()
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/cli.py", line 77, in train
    train.cli(sys.argv[2:])
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/train.py", line 395, in cli
    train_cli(**vars(args))
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/train.py", line 185, in train_cli
    model.train(
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/api.py", line 678, in train
    train_stats = trainer.train(
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/trainers/trainer.py", line 1130, in train
    raise RuntimeError(error_message)
RuntimeError: Training ran into an error. No checkpoint was saved. This is because training was terminated early due to the presence of NaN or Inf values in the model weights before a single valid checkpoint could be saved.

Environment (please complete the following information):

  • OS: Debian 12 (Bookworm)
  • Python version: Python 3.10.13 (via Anaconda)
  • Ludwig version: latest (v0.10.0)

Additional context
GPU: 1x Tesla V100 32GB

@arnavgarg1
Contributor

arnavgarg1 commented Feb 29, 2024

Hi @K-Mistele! This is a known issue that we recently debugged, and it's actually not specific to Ludwig!

The best way to solve it is to set bnb_4bit_compute_dtype in the quantization section of the Ludwig config to bfloat16 instead of float16, since batch sizes > 1 with Mistral in particular lead to bit overflows during training, resulting in a NaN loss during the first backprop of the training loop.

However, I notice you're training on a V100, and I don't think bfloat16 is supported there since it only works on Ampere architectures and above. Is there any chance you can use a newer NVIDIA GPU?
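
For reference, the change suggested above would look roughly like this in the quantization section of the posted config (a sketch based on the comment; bnb_4bit_compute_dtype is the bitsandbytes compute dtype that Ludwig exposes here, and bfloat16 requires Ampere or newer hardware):

quantization:
  bits: 4
  bnb_4bit_compute_dtype: bfloat16  # default is float16; bfloat16 avoids the overflow but needs an Ampere+ GPU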

@K-Mistele
Author

The only NVIDIA GPU that supports bfloat16 is the A100, which I don't have access to. My V100 is a GPU I own, not a rented/cloud one, so I try to stick with it whenever possible since I'm not paying by the hour.

@arnavgarg1
Contributor

@K-Mistele that makes sense! Actually, the entire A series uses Ampere, so you could consider an A5000 from AWS, which is pretty cheap. I might also suggest giving the Predibase free trial a try, since we have A5000s/A6000s etc. (A10Gs) for fine-tuning and $25 in free trial credits!

@K-Mistele
Author

I am planning to; I just want to make sure I can use the tool locally first.

@K-Mistele
Author

Is there no workaround for a V100?

@arnavgarg1
Contributor

Unfortunately, not to my knowledge with Mistral. Do you want to test Llama-2-7B instead? The issue doesn't show up there with larger batch sizes!
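
If that route is taken, the only change needed in the config above would be the base model line, e.g. (assuming the Hugging Face ID meta-llama/Llama-2-7b-hf, which is gated and requires accepting Meta's license plus an access token):

base_model: meta-llama/Llama-2-7b-hf  # assumed Hugging Face ID; gated repo, requires an HF access token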

@K-Mistele
Author

Yeah, I can try it.

@arnavgarg1
Contributor

@K-Mistele let me know how it goes!

@K-Mistele
Author

Do you know if zephyr has the same problem, @arnavgarg1?

@arnavgarg1
Contributor

@K-Mistele not to my knowledge!

@arnavgarg1
Contributor

@K-Mistele Did the fix work?
