
Why does it keep telling me I'm out of memory halfway through training? #211

Open
anyiz opened this issue Feb 5, 2024 · 8 comments

@anyiz commented Feb 5, 2024

I'm training on 4 A10 GPUs and the GPU memory keeps blowing up after training for a while. My config is below. Is there any way to fix this?

    "output_dir": "output/firefly-qwen-7b-sft-full",
    "model_name_or_path": "/home/gpu/cuizhai/Firefly-master/Qwen/Qwen-7B-Chat2",
    "deepspeed": "./train_args/ds_z3_config.json",
    "train_file": "./data/output.jsonl",
    "template_name": "qwen",
    "train_mode": "full",
    "num_train_epochs": 1,
    "per_device_train_batch_size": 1, 
    "gradient_accumulation_steps": 8,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
    "logging_steps": 50,
    "save_steps": 100,
    "save_total_limit": 1,
    "lr_scheduler_type": "cosine",
    "warmup_steps": 100,

    "gradient_checkpointing": true,
    "disable_tqdm": false,
    "optim": "adamw_hf",
    "seed": 42,
    "bf16": true,
    "report_to": "tensorboard",
    "dataloader_num_workers": 0,
    "save_strategy": "steps",
    "weight_decay": 0,
    "max_grad_norm": 1.0,
    "remove_unused_columns": false
}

[screenshot: bug3]

@yangjianxin1 (Owner)

First, try uncommenting this line and see whether it OOMs immediately: https://github.com/yangjianxin1/Firefly/blob/master/component/collator.py#L17

If it OOMs right away, that means a sequence length of 512 is too long for your setup.
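
For context, that line presumably switches the collator from dynamic padding to padding every batch to the full max_seq_length, so the very first step already allocates the worst case. A minimal sketch of that idea (illustrative only; the actual collator.py may differ):

```python
import torch

class FixedLengthCollator:
    """Illustrative collator: pad every batch to the full max_seq_length so the
    very first training step already allocates the worst-case memory."""

    def __init__(self, pad_token_id, max_seq_length):
        self.pad_token_id = pad_token_id
        self.max_seq_length = max_seq_length

    def __call__(self, features):
        # Dynamic padding (the usual default) would use the longest sample in the batch:
        # batch_len = min(max(len(f["input_ids"]) for f in features), self.max_seq_length)
        batch_len = self.max_seq_length  # the "uncommented" fixed-length behaviour

        input_ids, attention_mask, labels = [], [], []
        for f in features:
            ids = f["input_ids"][:batch_len]
            lbl = f["labels"][:batch_len]
            pad = batch_len - len(ids)
            input_ids.append(ids + [self.pad_token_id] * pad)
            attention_mask.append([1] * len(ids) + [0] * pad)
            labels.append(lbl + [-100] * pad)  # -100 is ignored by the loss

        return {
            "input_ids": torch.tensor(input_ids, dtype=torch.long),
            "attention_mask": torch.tensor(attention_mask, dtype=torch.long),
            "labels": torch.tensor(labels, dtype=torch.long),
        }
```

If a first step padded to 512 survives, the model itself fits; a later OOM then points at fragmentation or occasional memory spikes rather than the base footprint.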

@anyiz (Author) commented Feb 5, 2024

> First, try uncommenting this line and see whether it OOMs immediately: https://github.com/yangjianxin1/Firefly/blob/master/component/collator.py#L17
>
> If it OOMs right away, that means a sequence length of 512 is too long for your setup.

With that line uncommented, training runs normally at first, but it still OOMs and crashes after a while.
[screenshot: bug]

@yangjianxin1 (Owner)

An A10 has 24 GB of memory, which is still a bit tight for full-parameter training of a 7B model. On my side, full-parameter training of a 7B model is a struggle even on V100s.

If you are doing instruction fine-tuning, then given your hardware I'd recommend QLoRA; it can also give very good results.
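
A rough back-of-the-envelope estimate of why 4×24 GB is tight for full-parameter training (assuming the standard mixed-precision Adam layout of roughly 16 bytes per parameter, with ZeRO-3 sharding model states evenly across the 4 GPUs):

```python
# Ballpark only: excludes activations, the CUDA context and allocator fragmentation.
params = 7.7e9                  # Qwen-7B is roughly 7.7B parameters
weights_and_grads = 2 + 2       # bf16 weights + bf16 gradients
adam_states = 4 + 4 + 4         # fp32 master weights, momentum, variance
num_gpus = 4

per_gpu_gb = params * (weights_and_grads + adam_states) / num_gpus / 1024**3
print(f"~{per_gpu_gb:.0f} GB of model states per GPU")   # ~29 GB, already above 24 GB
```

QLoRA sidesteps this because the base weights stay frozen in 4-bit and only the small adapter matrices are trained.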

@yangjianxin1 (Owner)

Roughly how many steps does training run before it OOMs?

@anyiz (Author) commented Feb 5, 2024

> Roughly how many steps does training run before it OOMs?

It isn't consistent; I've seen it die around step 17, 11, and 20-something.
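
If every batch is padded to the same fixed length, the per-step footprint should be nearly constant, so an OOM at a different step each run usually points at allocator fragmentation or at a few unusually heavy samples when dynamic padding is active. A quick, rough way to check how sample lengths are distributed, assuming the model path and train_file from the config above:

```python
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "/home/gpu/cuizhai/Firefly-master/Qwen/Qwen-7B-Chat2", trust_remote_code=True
)

lengths = []
with open("./data/output.jsonl", encoding="utf-8") as f:
    for line in f:
        # Rough proxy: tokenize the raw JSON line; the chat template adds a few tokens on top.
        lengths.append(len(tokenizer(line)["input_ids"]))

lengths.sort()
n = len(lengths)
print("p50:", lengths[n // 2], "p95:", lengths[int(n * 0.95)], "max:", lengths[-1])
```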

@anyiz (Author) commented Feb 5, 2024

> An A10 has 24 GB of memory, which is still a bit tight for full-parameter training of a 7B model. On my side, full-parameter training of a 7B model is a struggle even on V100s.
>
> If you are doing instruction fine-tuning, then given your hardware I'd recommend QLoRA; it can also give very good results.

When I fine-tune with QLoRA I get the error below. I've searched online for a long time and haven't been able to solve it.

Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
Loading checkpoint shards: 100%|████████████████████████| 8/8 [00:23<00:00,  2.88s/it]
Loading checkpoint shards: 100%|████████████████████████| 8/8 [00:23<00:00,  2.88s/it]
Loading checkpoint shards: 100%|████████████████████████| 8/8 [00:23<00:00,  2.88s/it]
Loading checkpoint shards: 100%|████████████████████████| 8/8 [00:23<00:00,  2.89s/it]
Traceback (most recent call last):
  File "/home/gpu/cuizhai/Firefly-master/train.py", line 350, in <module>
    main()
  File "/home/gpu/cuizhai/Firefly-master/train.py", line 335, in main
    trainer = init_components(args, training_args)
  File "/home/gpu/cuizhai/Firefly-master/train.py", line 309, in init_components
    model = load_model(args, training_args)
  File "/home/gpu/cuizhai/Firefly-master/train.py", line 257, in load_model
    model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=training_args.gradient_checkpointing)
  File "/home/gpu/anaconda3/envs/firefly/lib/python3.9/site-packages/peft/utils/other.py", line 81, in prepare_model_for_kbit_training
    param.data = param.data.to(torch.float32)
RuntimeError: CUDA out of memory. Tried to allocate 2.32 GiB (GPU 1; 22.02 GiB total capacity; 6.63 GiB already allocated; 2.17 GiB free; 6.72 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3475817 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3475820 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3475822 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 3475819) of binary: /home/gpu/anaconda3/envs/firefly/bin/python3.9
Traceback (most recent call last):
  File "/home/gpu/anaconda3/envs/firefly/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/gpu/anaconda3/envs/firefly/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/gpu/anaconda3/envs/firefly/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/gpu/anaconda3/envs/firefly/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/gpu/anaconda3/envs/firefly/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/gpu/anaconda3/envs/firefly/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-02-05_17:54:27
  host      : gpu-P10DRG-PMN34567890123456789012
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 3475819)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

It says I'm out of GPU memory, but even with the config parameters turned way down I still hit the same error. Yet after adjusting the parameters, running train.py directly on a single GPU trains just fine.
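
For what it's worth, the trace shows the upcast inside prepare_model_for_kbit_training failing while GPU 1 has only ~2 GiB free out of 22 GiB, which usually means several processes ended up piling onto the same device. I don't know exactly how Firefly's load_model builds the model, but a common pattern for QLoRA under torchrun is to pin each rank's 4-bit copy to its own GPU, and the error text itself suggests trying max_split_size_mb against fragmentation. A sketch under those assumptions (values and calls illustrative, not Firefly's actual code):

```python
import os

# Must be set before CUDA is initialised; the value is only an example, as the
# OOM message above suggests trying max_split_size_mb against fragmentation.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

local_rank = int(os.environ.get("LOCAL_RANK", "0"))  # set by torchrun

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    "/home/gpu/cuizhai/Firefly-master/Qwen/Qwen-7B-Chat2",
    quantization_config=bnb_config,
    device_map={"": local_rank},   # one full 4-bit copy per rank, each on its own GPU
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
```

If Firefly already does this, the fallback is what you observed: QLoRA fits comfortably on a single GPU, so running single-GPU (or fewer ranks) is a reasonable workaround.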

@anyiz (Author) commented Feb 6, 2024

Yesterday I organized my data into the format below, but the trained model doesn't perform well. Is there a problem with my config file or my data?

    "conversation_id":148,
    "category":"Brainstorming",
    "conversation":[
        {
            "human":"下面是一位用户的基本信息:姓名: 李先生,所属银行: 华夏银行,逾期情况: 是否逾期: 否,逾期金额: 无,逾期时间: 无,逾期卡号: 无,逾期后果: 无,最后还款日: 无,还款方式: 自动扣款",
            "assistant":"您好,请问是李先生吗?"
        },
        {
            "human":"是我,您是哪位?",
            "assistant":"我是华夏银行的客服百聆。给您打电话是想确认一下您名下的华夏银行信用卡情况。根据我们的记录,您没有逾期欠款,账户正常。"
        },
        {
            "human":"好的,谢谢您的提醒。",
            "assistant":"您的信用卡还款方式是自动扣款吗?"
        },
        {
            "human":"是的,我设置了自动扣款。",
            "assistant":"很好,这样确保您每月都能按时还款,不会漏掉还款时间。如有需要,您可以登录网银或拨打我们的客服热线进行相关查询。"
        },
        {
            "human":"好的,谢谢您。",
            "assistant":"不客气,如果您还有其他问题,可以随时联系我们。祝您生活愉快,再见!"
        }
    ]
}

I fine-tune the qwen model with python train.py --train_args_file train_args/sft/qlora/qwen-7b-sft-qlora.json; the JSON config file is:

{
    "output_dir": "output/firefly-qwen-7b-sft-qlora",
    "model_name_or_path": "/home/gpu/cuizhai/Firefly-master/Qwen/Qwen-7B-Chat2",
    "train_file": "./data/output.jsonl",
    "template_name": "qwen",
    "num_train_epochs": 1,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 16,
    "learning_rate": 2e-4,
    "max_seq_length": 512,
    "logging_steps": 100,
    "save_steps": 100,
    "save_total_limit": 1,
    "lr_scheduler_type": "constant_with_warmup",
    "warmup_steps": 100,
    "lora_rank": 64,
    "lora_alpha": 16,
    "lora_dropout": 0.05,

    "gradient_checkpointing": true,
    "disable_tqdm": false,
    "optim": "paged_adamw_32bit",
    "seed": 42,
    "bf16": true,
    "report_to": "tensorboard",
    "dataloader_num_workers": 0,
    "save_strategy": "steps",
    "weight_decay": 0,
    "max_grad_norm": 0.3,
    "remove_unused_columns": false
}

[screenshot: bug]
I don't understand why it generates such long replies. Is my dataset format wrong?
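
One thing worth double-checking on the data side: train_file is a .jsonl, so each conversation has to be a single JSON object on its own line, with a "conversation" list of {"human": ..., "assistant": ...} turns as in the sample above. A quick sanity check along those lines (paths taken from the config above):

```python
import json

# Every line of the jsonl must be one complete conversation object.
with open("./data/output.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        try:
            sample = json.loads(line)
        except json.JSONDecodeError as exc:
            raise SystemExit(f"line {i} is not valid JSON: {exc}")
        turns = sample.get("conversation")
        assert isinstance(turns, list) and turns, f"line {i}: empty or missing 'conversation'"
        for turn in turns:
            assert "human" in turn and "assistant" in turn, f"line {i}: turn missing keys"

print("train_file looks well-formed")
```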

@Parasolation

> Yesterday I organized my data into the format below, but the trained model doesn't perform well. Is there a problem with my config file or my data? […]

Are you using the base version of the model? I ran into the same thing with the base model; switching to the chat model fixed it.
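
That matches the symptom: a base model has never learned the chat template's end-of-turn marker, so it tends to keep generating past the answer, while the chat model stops on it. At inference time you can also guard against rambling by stopping on that marker and capping new tokens. A sketch assuming a HF-style generate call and Qwen's ChatML-style <|im_end|> token (adjust names if your tokenizer differs):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "/home/gpu/cuizhai/Firefly-master/Qwen/Qwen-7B-Chat2"  # path from the config above
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, trust_remote_code=True, device_map="auto"
)

# In practice, format the prompt with the same "qwen" template used in training.
prompt = "你好,请问是李先生吗?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=256,                                          # hard cap on reply length
    eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>"),  # stop at end-of-turn
)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```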
