Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OOM error while QLoRA+Deepspeed fine tuning of Llama3-70B model on 4xA100-40GB gpus #1703

Open
3 of 4 tasks
hrushikesh198 opened this issue May 2, 2024 · 3 comments
Open
3 of 4 tasks

Comments

@hrushikesh198
Copy link

hrushikesh198 commented May 2, 2024

System Info

OS                        Ubuntu
GPUS                      4XA100-40GB
Python                    3.10.14
accelerate                0.29.3
bitsandbytes              0.43.1
deepspeed                 0.14.2
peft                      0.10.0
transformers              4.40.1
trl                       0.8.6

Who can help?

@pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

deepspeed_config_z3_qlora_4g.yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
run_peft_qlora_deepspeed_stage3_llama3_70b.sh
accelerate launch --config_file "configs/deepspeed_config_z3_qlora_4g.yaml"  train.py \
--seed 100 \
--model_name_or_path "NousResearch/Meta-Llama-3-70B" \
--dataset_name "smangrul/ultrachat-10k-chatml" \
--chat_template_format "chatml" \
--add_special_tokens False \
--append_concat_token False \
--splits "train,test" \
--max_seq_len 2048 \
--num_train_epochs 1 \
--logging_steps 5 \
--log_level "info" \
--logging_strategy "steps" \
--evaluation_strategy "epoch" \
--save_strategy "epoch" \
--bf16 True \
--packing True \
--learning_rate 1e-4 \
--lr_scheduler_type "cosine" \
--weight_decay 1e-4 \
--warmup_ratio 0.0 \
--max_grad_norm 1.0 \
--output_dir "llama-sft-qlora-dsz3" \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 2 \
--gradient_checkpointing True \
--use_reentrant True \
--dataset_text_field "content" \
--use_flash_attn True \
--use_peft_lora True \
--lora_r 8 \
--lora_alpha 16 \
--lora_dropout 0.1 \
--lora_target_modules "all-linear" \
--use_4bit_quantization True \
--use_nested_quant True \
--bnb_4bit_compute_dtype "bfloat16" \
--bnb_4bit_quant_storage_dtype "bfloat16"

Using the official train.py example from https://github.com/huggingface/peft/blob/main/examples/sft/train.py

Log output
[2024-05-02 06:46:54,788] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
[2024-05-02 06:46:56,799] torch.distributed.run: [WARNING]
[2024-05-02 06:46:56,799] torch.distributed.run: [WARNING] *****************************************
[2024-05-02 06:46:56,799] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-05-02 06:46:56,799] torch.distributed.run: [WARNING] *****************************************
[2024-05-02 06:47:02,228] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-02 06:47:02,234] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-02 06:47:02,239] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-02 06:47:02,241] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
[2024-05-02 06:47:02,967] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-02 06:47:02,967] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-02 06:47:02,972] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-02 06:47:02,991] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-05-02 06:47:02,992] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
`low_cpu_mem_usage` was None, now set to True since model is quantized.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Loading checkpoint shards: 100%|██████████████████████████████| 30/30 [01:25<00:00,  2.86s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|██████████████████████████████| 30/30 [01:27<00:00,  2.91s/it]
Loading checkpoint shards: 100%|██████████████████████████████| 30/30 [01:27<00:00,  2.93s/it]
Loading checkpoint shards: 100%|██████████████████████████████| 30/30 [01:27<00:00,  2.93s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Size of the train set: 10000. Size of the validation set: 2000
A sample of train dataset: {'content': "<|im_start|>user\nThese instructions apply to section-based themes (Responsive 6.0+, Retina 4.0+, Parallax 3.0+ Turbo 2.0+, Mobilia 5.0+). What theme version am I using?\nOn your Collections pages & Featured Collections sections, you can easily show the secondary image of a product on hover by enabling one of the theme's built-in settings!\nYour Collection pages & Featured Collections sections will now display the secondary product image just by hovering over that product image thumbnail.\nDoes this feature apply to all sections of the theme or just specific ones as listed in the text material?<|im_end|>\n<|im_start|>assistant\nThis feature only applies to Collection pages and Featured Collections sections of the section-based themes listed in the text material.<|im_end|>\n<|im_start|>user\nCan you guide me through the process of enabling the secondary image hover feature on my Collection pages and Featured Collections sections?<|im_end|>\n<|im_start|>assistant\nSure, here are the steps to enable the secondary image hover feature on your Collection pages and Featured Collections sections:\n\n1. Log in to your Shopify account and go to your Online Store.\n2. Click on Customize theme for the section-based theme you are using.\n3. Navigate to the Collection pages or Featured Collections section where you want to enable the secondary image hover feature.\n4. With the section open, click on the section's setting (gear) icon in the top-left corner.\n5. In the settings panel that appears, look for an option labeled 'Image display' or 'Image hover'.\n6. If available, select 'Show secondary image on hover'.\n7. Save the changes and preview the Collection/Featured Collection page to see the effect.\n\nIf you're having trouble finding the setting, the best thing to do is refer to your theme's documentation, since the location and labeling of settings can vary between themes.<|im_end|>\n<|im_start|>user\nCan you provide me with a link to the documentation for my theme?<|im_end|>\n<|im_start|>assistant\nI don't have access to your store's theme information. However, you can usually find the documentation for your theme by going to the shopify theme store, finding your theme and clicking on the 'support' link located in the bottom right corner of the page. Alternatively, you can do a google search for the name of your theme followed by 'documentation' or 'user guide'.<|im_end|>\n<|im_start|>user\nCan you confirm if this feature also works for the Quick Shop section of my theme?<|im_end|>\n<|im_start|>assistant\nThe secondary image hover feature may or may not work for your Quick Shop section, depending on the configuration of your theme. Some themes include this feature in the Quick Shop section by default, while others may require additional customization. To check if this feature is available for the Quick Shop section of your theme, follow these steps:\n\n1. Go to the Quick Shop section where you would like to enable the feature. 2. Click on the Quick Shop settings icon (gear icon) and look for 'Image display' or 'Image hover'. 3. If available, select 'Show secondary image on hover'. 4. Save the changes. If this option is not available in your Quick Shop section settings, you may need to reach out to your theme developer for assistance with customizing your Quick Shop section to include this feature.<|im_end|>\n"}
Size of the train set: 10000. Size of the validation set: 2000
A sample of train dataset: {'content': "<|im_start|>user\nThese instructions apply to section-based themes (Responsive 6.0+, Retina 4.0+, Parallax 3.0+ Turbo 2.0+, Mobilia 5.0+). What theme version am I using?\nOn your Collections pages & Featured Collections sections, you can easily show the secondary image of a product on hover by enabling one of the theme's built-in settings!\nYour Collection pages & Featured Collections sections will now display the secondary product image just by hovering over that product image thumbnail.\nDoes this feature apply to all sections of the theme or just specific ones as listed in the text material?<|im_end|>\n<|im_start|>assistant\nThis feature only applies to Collection pages and Featured Collections sections of the section-based themes listed in the text material.<|im_end|>\n<|im_start|>user\nCan you guide me through the process of enabling the secondary image hover feature on my Collection pages and Featured Collections sections?<|im_end|>\n<|im_start|>assistant\nSure, here are the steps to enable the secondary image hover feature on your Collection pages and Featured Collections sections:\n\n1. Log in to your Shopify account and go to your Online Store.\n2. Click on Customize theme for the section-based theme you are using.\n3. Navigate to the Collection pages or Featured Collections section where you want to enable the secondary image hover feature.\n4. With the section open, click on the section's setting (gear) icon in the top-left corner.\n5. In the settings panel that appears, look for an option labeled 'Image display' or 'Image hover'.\n6. If available, select 'Show secondary image on hover'.\n7. Save the changes and preview the Collection/Featured Collection page to see the effect.\n\nIf you're having trouble finding the setting, the best thing to do is refer to your theme's documentation, since the location and labeling of settings can vary between themes.<|im_end|>\n<|im_start|>user\nCan you provide me with a link to the documentation for my theme?<|im_end|>\n<|im_start|>assistant\nI don't have access to your store's theme information. However, you can usually find the documentation for your theme by going to the shopify theme store, finding your theme and clicking on the 'support' link located in the bottom right corner of the page. Alternatively, you can do a google search for the name of your theme followed by 'documentation' or 'user guide'.<|im_end|>\n<|im_start|>user\nCan you confirm if this feature also works for the Quick Shop section of my theme?<|im_end|>\n<|im_start|>assistant\nThe secondary image hover feature may or may not work for your Quick Shop section, depending on the configuration of your theme. Some themes include this feature in the Quick Shop section by default, while others may require additional customization. To check if this feature is available for the Quick Shop section of your theme, follow these steps:\n\n1. Go to the Quick Shop section where you would like to enable the feature. 2. Click on the Quick Shop settings icon (gear icon) and look for 'Image display' or 'Image hover'. 3. If available, select 'Show secondary image on hover'. 4. Save the changes. If this option is not available in your Quick Shop section settings, you may need to reach out to your theme developer for assistance with customizing your Quick Shop section to include this feature.<|im_end|>\n"}
Size of the train set: 10000. Size of the validation set: 2000
A sample of train dataset: {'content': "<|im_start|>user\nThese instructions apply to section-based themes (Responsive 6.0+, Retina 4.0+, Parallax 3.0+ Turbo 2.0+, Mobilia 5.0+). What theme version am I using?\nOn your Collections pages & Featured Collections sections, you can easily show the secondary image of a product on hover by enabling one of the theme's built-in settings!\nYour Collection pages & Featured Collections sections will now display the secondary product image just by hovering over that product image thumbnail.\nDoes this feature apply to all sections of the theme or just specific ones as listed in the text material?<|im_end|>\n<|im_start|>assistant\nThis feature only applies to Collection pages and Featured Collections sections of the section-based themes listed in the text material.<|im_end|>\n<|im_start|>user\nCan you guide me through the process of enabling the secondary image hover feature on my Collection pages and Featured Collections sections?<|im_end|>\n<|im_start|>assistant\nSure, here are the steps to enable the secondary image hover feature on your Collection pages and Featured Collections sections:\n\n1. Log in to your Shopify account and go to your Online Store.\n2. Click on Customize theme for the section-based theme you are using.\n3. Navigate to the Collection pages or Featured Collections section where you want to enable the secondary image hover feature.\n4. With the section open, click on the section's setting (gear) icon in the top-left corner.\n5. In the settings panel that appears, look for an option labeled 'Image display' or 'Image hover'.\n6. If available, select 'Show secondary image on hover'.\n7. Save the changes and preview the Collection/Featured Collection page to see the effect.\n\nIf you're having trouble finding the setting, the best thing to do is refer to your theme's documentation, since the location and labeling of settings can vary between themes.<|im_end|>\n<|im_start|>user\nCan you provide me with a link to the documentation for my theme?<|im_end|>\n<|im_start|>assistant\nI don't have access to your store's theme information. However, you can usually find the documentation for your theme by going to the shopify theme store, finding your theme and clicking on the 'support' link located in the bottom right corner of the page. Alternatively, you can do a google search for the name of your theme followed by 'documentation' or 'user guide'.<|im_end|>\n<|im_start|>user\nCan you confirm if this feature also works for the Quick Shop section of my theme?<|im_end|>\n<|im_start|>assistant\nThe secondary image hover feature may or may not work for your Quick Shop section, depending on the configuration of your theme. Some themes include this feature in the Quick Shop section by default, while others may require additional customization. To check if this feature is available for the Quick Shop section of your theme, follow these steps:\n\n1. Go to the Quick Shop section where you would like to enable the feature. 2. Click on the Quick Shop settings icon (gear icon) and look for 'Image display' or 'Image hover'. 3. If available, select 'Show secondary image on hover'. 4. Save the changes. If this option is not available in your Quick Shop section settings, you may need to reach out to your theme developer for assistance with customizing your Quick Shop section to include this feature.<|im_end|>\n"}
Size of the train set: 10000. Size of the validation set: 2000
A sample of train dataset: {'content': "<|im_start|>user\nThese instructions apply to section-based themes (Responsive 6.0+, Retina 4.0+, Parallax 3.0+ Turbo 2.0+, Mobilia 5.0+). What theme version am I using?\nOn your Collections pages & Featured Collections sections, you can easily show the secondary image of a product on hover by enabling one of the theme's built-in settings!\nYour Collection pages & Featured Collections sections will now display the secondary product image just by hovering over that product image thumbnail.\nDoes this feature apply to all sections of the theme or just specific ones as listed in the text material?<|im_end|>\n<|im_start|>assistant\nThis feature only applies to Collection pages and Featured Collections sections of the section-based themes listed in the text material.<|im_end|>\n<|im_start|>user\nCan you guide me through the process of enabling the secondary image hover feature on my Collection pages and Featured Collections sections?<|im_end|>\n<|im_start|>assistant\nSure, here are the steps to enable the secondary image hover feature on your Collection pages and Featured Collections sections:\n\n1. Log in to your Shopify account and go to your Online Store.\n2. Click on Customize theme for the section-based theme you are using.\n3. Navigate to the Collection pages or Featured Collections section where you want to enable the secondary image hover feature.\n4. With the section open, click on the section's setting (gear) icon in the top-left corner.\n5. In the settings panel that appears, look for an option labeled 'Image display' or 'Image hover'.\n6. If available, select 'Show secondary image on hover'.\n7. Save the changes and preview the Collection/Featured Collection page to see the effect.\n\nIf you're having trouble finding the setting, the best thing to do is refer to your theme's documentation, since the location and labeling of settings can vary between themes.<|im_end|>\n<|im_start|>user\nCan you provide me with a link to the documentation for my theme?<|im_end|>\n<|im_start|>assistant\nI don't have access to your store's theme information. However, you can usually find the documentation for your theme by going to the shopify theme store, finding your theme and clicking on the 'support' link located in the bottom right corner of the page. Alternatively, you can do a google search for the name of your theme followed by 'documentation' or 'user guide'.<|im_end|>\n<|im_start|>user\nCan you confirm if this feature also works for the Quick Shop section of my theme?<|im_end|>\n<|im_start|>assistant\nThe secondary image hover feature may or may not work for your Quick Shop section, depending on the configuration of your theme. Some themes include this feature in the Quick Shop section by default, while others may require additional customization. To check if this feature is available for the Quick Shop section of your theme, follow these steps:\n\n1. Go to the Quick Shop section where you would like to enable the feature. 2. Click on the Quick Shop settings icon (gear icon) and look for 'Image display' or 'Image hover'. 3. If available, select 'Show secondary image on hover'. 4. Save the changes. If this option is not available in your Quick Shop section settings, you may need to reach out to your theme developer for assistance with customizing your Quick Shop section to include this feature.<|im_end|>\n"}
Using auto half precision backend
PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128264, 8192)
        (layers): ModuleList(
          (0-79): 80 x LlamaDecoderLayer(
            (self_attn): LlamaFlashAttention2(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=8192, out_features=8192, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=8192, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=8192, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=8192, out_features=1024, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=8192, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=1024, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (v_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=8192, out_features=1024, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=8192, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=1024, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (o_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=8192, out_features=8192, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=8192, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=8192, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (rotary_emb): LlamaRotaryEmbedding()
            )
            (mlp): LlamaMLP(
              (gate_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=8192, out_features=28672, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=8192, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=28672, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (up_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=8192, out_features=28672, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=8192, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=28672, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (down_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=28672, out_features=8192, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=28672, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=8192, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (act_fn): SiLU()
            )
            (input_layernorm): LlamaRMSNorm()
            (post_attention_layernorm): LlamaRMSNorm()
          )
        )
        (norm): LlamaRMSNorm()
      )
      (lm_head): Linear(in_features=8192, out_features=128264, bias=False)
    )
  )
)
trainable params: 103,546,880 || all params: 70,657,384,448 || trainable%: 0.1465478531493122
[2024-05-02 06:48:45,278] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.2, git-hash=unknown, git-branch=unknown
trainable params: 103,546,880 || all params: 70,657,384,448 || trainable%: 0.1465478531493122
trainable params: 103,546,880 || all params: 70,657,384,448 || trainable%: 0.1465478531493122
trainable params: 103,546,880 || all params: 70,657,384,448 || trainable%: 0.1465478531493122
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1914, in broadcast
    work = group.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1711403380909/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 2 'out of memory'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/hmohapa/peft/examples/sft/train.py", line 162, in <module>
    main(model_args, data_args, training_args)
  File "/root/hmohapa/peft/examples/sft/train.py", line 146, in main
    trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/conda/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 361, in train
    output = super().train(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2012, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1266, in prepare
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1914, in broadcast
    result = self._prepare_deepspeed(*args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1652, in _prepare_deepspeed
        engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)work = group.broadcast([tensor], opts)

  File "/opt/conda/lib/python3.10/site-packages/deepspeed/__init__.py", line 181, in initialize
torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1711403380909/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 2 'out of memory'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/hmohapa/peft/examples/sft/train.py", line 162, in <module>
    engine = DeepSpeedEngine(args=args,
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 262, in __init__
    self._configure_distributed_model(model)
main(model_args, data_args, training_args)  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1157, in _configure_distributed_model

  File "/root/hmohapa/peft/examples/sft/train.py", line 146, in main
    trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/conda/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 361, in train
    self._broadcast_model()
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1077, in _broadcast_model
    output = super().train(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1859, in train
    dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
    return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 199, in broadcast
    return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 74, in wrapper
    msg_dict = _get_msg_dict(func.__name__, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 49, in _get_msg_dict
        "args": f"{args}, {kwargs}",return inner_training_loop(

  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 461, in __repr__
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2012, in _inner_training_loop
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor_str.py", line 677, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor_str.py", line 597, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor_str.py", line 331, in _tensor_str
    self = self.float()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.92 GiB. GPU 2 has a total capacity of 39.39 GiB of which 2.00 MiB is free. Process 112078 has 39.38 GiB memory in use. Of the allocated memory 37.36 GiB is allocated by PyTorch, and 76.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1266, in prepare
    result = self._prepare_deepspeed(*args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1652, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/__init__.py", line 181, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 262, in __init__
    self._configure_distributed_model(model)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1157, in _configure_distributed_model
    self._broadcast_model()
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1077, in _broadcast_model
    dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
    return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 199, in broadcast
    return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 74, in wrapper
    msg_dict = _get_msg_dict(func.__name__, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 49, in _get_msg_dict
    "args": f"{args}, {kwargs}",
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 461, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor_str.py", line 677, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor_str.py", line 597, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor_str.py", line 331, in _tensor_str
    self = self.float()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.92 GiB. GPU 1 has a total capacity of 39.39 GiB of which 2.00 MiB is free. Process 112077 has 39.38 GiB memory in use. Of the allocated memory 37.36 GiB is allocated by PyTorch, and 76.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[2024-05-02 06:49:11,955] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 322 closing signal SIGTERM
[2024-05-02 06:49:11,955] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 325 closing signal SIGTERM
[2024-05-02 06:49:13,221] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 323) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1060, in launch_command
    deepspeed_launcher(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 764, in deepspeed_launcher
    distrib_run.run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-05-02_06:49:11
  host      : d34443f89434
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 324)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-02_06:49:11
  host      : d34443f89434
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 323)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Expected behavior

LoRA adapters should fine tune successfully

@adamamer20
Copy link

I also had OOM with Qlora #1708

@BenjaminBossan
Copy link
Member

@pacman100 Could you please take a look 🙏?

Copy link

github-actions bot commented Jun 1, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants