
OOM after training for a while with --group_by_length #3668

Closed
1 task done
fst813 opened this issue May 10, 2024 · 3 comments
Labels
solved This problem has been already solved.

Comments

@fst813

fst813 commented May 10, 2024

Reminder

  • I have read the README and searched the existing issues.

Reproduction

{'loss': 0.0924, 'grad_norm': 0.24532318115234375, 'learning_rate': 9.868477119388896e-05, 'epoch': 0.37}
{'loss': 0.1159, 'grad_norm': 0.26738905906677246, 'learning_rate': 9.868363724534642e-05, 'epoch': 0.37}
{'loss': 0.1304, 'grad_norm': 0.30571019649505615, 'learning_rate': 9.868250281470779e-05, 'epoch': 0.37}
{'loss': 0.1323, 'grad_norm': 0.6097196340560913, 'learning_rate': 9.868136790198433e-05, 'epoch': 0.37}
{'loss': 0.0982, 'grad_norm': 0.3634645342826843, 'learning_rate': 9.868023250718725e-05, 'epoch': 0.37}
{'loss': 0.1017, 'grad_norm': 0.3034124970436096, 'learning_rate': 9.867909663032783e-05, 'epoch': 0.37}
{'loss': 0.1334, 'grad_norm': 0.331504225730896, 'learning_rate': 9.867796027141728e-05, 'epoch': 0.37}
{'loss': 0.0979, 'grad_norm': 0.25397709012031555, 'learning_rate': 9.867682343046687e-05, 'epoch': 0.37}
{'loss': 0.0912, 'grad_norm': 0.27199316024780273, 'learning_rate': 9.867568610748785e-05, 'epoch': 0.37}
{'loss': 0.1006, 'grad_norm': 0.2768314480781555, 'learning_rate': 9.86745483024915e-05, 'epoch': 0.37}
  7%|▋         | 2321/31570 [4:16:04<33:18:50,  4.10s/it]Traceback (most recent call last):
  File "src/train_bash.py", line 16, in <module>
    main()
  File "src/train_bash.py", line 7, in main
    run_exp()
  File "/data2/fst/llama-factory_0507-master/src/llmtuner/train/tuner.py", line 121, in run_exp
    run_exe(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/data2/fst/llama-factory_0507-master/src/llmtuner/train/tuner.py", line 30, in run_exe
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/data2/fst/llama-factory_0507-master/src/llmtuner/train/sft/workflow.py", line 71, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1624, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1961, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 2902, in training_step
    loss = self.compute_loss(model, inputs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 2925, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])  # type: ignore[index]
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/operations.py", line 817, in forward
    return model_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/operations.py", line 805, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/opt/conda/lib/python3.8/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/peft/peft_model.py", line 1129, in forward
    return self.base_model(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/peft/tuners/tuners_utils.py", line 161, in forward
    return self.model.forward(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 1201, in forward
    loss = loss_fct(shift_logits, shift_labels)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/loss.py", line 1174, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 3029, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 15.07 GiB (GPU 3; 79.35 GiB total capacity; 48.54 GiB already allocated; 14.61 GiB free; 62.91 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
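As the error message itself suggests, allocator fragmentation can sometimes be reduced by setting PYTORCH_CUDA_ALLOC_CONF in the environment before launching training. A minimal example (the split size value below is only illustrative, not a tuned recommendation):

    export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128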

Expected behavior

With --group_by_length added, aren't the sequences sorted by length from longest to shortest? Why does the OOM happen in the middle of training?

System Info

No response

Others

No response

fst813 changed the title from "OOM after training for a while when frou is added" to "OOM after training for a while with --group_by_length" on May 10, 2024
@codemayq
Collaborator

It is not necessarily related to this parameter; it depends on your batch_size and cutoff_len.
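For example, both values can be lowered directly on the LLaMA-Factory launch command. The flag names below are assumed from the project's training arguments at the time, and the values are illustrative only ("..." stands for the rest of the usual arguments):

    python src/train_bash.py ... \
        --per_device_train_batch_size 2 \
        --cutoff_len 2048 \
        --group_by_length True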

@fst813
Author

fst813 commented May 10, 2024

@codemayq Reducing the batch size works, of course, but my question is: doesn't group_by_length sort the sequences from longest to shortest? Then why does the OOM show up only partway through training rather than at the start? The sequence lengths in the middle should not be longer than those in the first batches.

codemayq added the "solved" label on May 10, 2024
@codemayq
Collaborator

codemayq commented May 11, 2024

Is there any documentation to support the claim that group_by_length sorts the samples from longest to shortest for training?
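For reference, length grouping is commonly implemented by splitting a shuffled permutation of the dataset into mega-batches and sorting by length only inside each mega-batch; the mega-batches themselves stay in shuffled order, so batches are not globally arranged from longest to shortest. The sketch below illustrates that general idea only; it is not the exact transformers implementation.

    import random

    def length_grouped_indices(lengths, batch_size, mega_batch_mult=50, seed=0):
        # Shuffle all sample indices, then split them into mega-batches.
        rng = random.Random(seed)
        indices = list(range(len(lengths)))
        rng.shuffle(indices)
        megabatch_size = mega_batch_mult * batch_size
        megabatches = [indices[i:i + megabatch_size]
                       for i in range(0, len(indices), megabatch_size)]
        # Sort by length (descending) only inside each mega-batch; the
        # mega-batches keep their shuffled order, so a later batch can
        # still contain longer sequences than an earlier one.
        megabatches = [sorted(mb, key=lambda i: lengths[i], reverse=True)
                       for mb in megabatches]
        return [i for mb in megabatches for i in mb]

    # Toy example: 8 samples, batch_size=2, tiny mega-batches for illustration.
    lengths = [10, 500, 30, 700, 20, 40, 900, 60]
    print(length_grouped_indices(lengths, batch_size=2, mega_batch_mult=2))

Under such a scheme an OOM can surface whenever a later mega-batch happens to hold the longest sequences, which is consistent with the failure appearing only after a few thousand steps.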

fst813 closed this as completed May 13, 2024