
Why does it keep telling me I'm out of memory halfway through training? #211

Open
anyiz opened this issue Feb 5, 2024 · 8 comments

@anyiz commented Feb 5, 2024

I'm training on 4 A10 GPUs and the GPU memory keeps blowing up after training for a while. My config is below. Is there any way to fix this?

    "output_dir": "output/firefly-qwen-7b-sft-full",
    "model_name_or_path": "/home/gpu/cuizhai/Firefly-master/Qwen/Qwen-7B-Chat2",
    "deepspeed": "./train_args/ds_z3_config.json",
    "train_file": "./data/output.jsonl",
    "template_name": "qwen",
    "train_mode": "full",
    "num_train_epochs": 1,
    "per_device_train_batch_size": 1, 
    "gradient_accumulation_steps": 8,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
    "logging_steps": 50,
    "save_steps": 100,
    "save_total_limit": 1,
    "lr_scheduler_type": "cosine",
    "warmup_steps": 100,

    "gradient_checkpointing": true,
    "disable_tqdm": false,
    "optim": "adamw_hf",
    "seed": 42,
    "bf16": true,
    "report_to": "tensorboard",
    "dataloader_num_workers": 0,
    "save_strategy": "steps",
    "weight_decay": 0,
    "max_grad_norm": 1.0,
    "remove_unused_columns": false
}

[screenshot: bug3]

@yangjianxin1 (Owner)

First, try uncommenting this line and see whether it OOMs immediately: https://github.com/yangjianxin1/Firefly/blob/master/component/collator.py#L17

If it OOMs right away, that means a sequence length of 512 is too long for your setup.
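
For context, that line presumably switches the collator from dynamic padding to padding every batch to the full max_seq_length, so the very first step already allocates the worst case. A minimal sketch of that idea (illustrative only; the actual collator.py may differ):

```python
import torch

class FixedLengthCollator:
    """Illustrative collator: pad every batch to the full max_seq_length so the
    very first training step already allocates the worst-case memory."""

    def __init__(self, pad_token_id, max_seq_length):
        self.pad_token_id = pad_token_id
        self.max_seq_length = max_seq_length

    def __call__(self, features):
        # Dynamic padding (the usual default) would use the longest sample in the batch:
        # batch_len = min(max(len(f["input_ids"]) for f in features), self.max_seq_length)
        batch_len = self.max_seq_length  # the "uncommented" fixed-length behaviour

        input_ids, attention_mask, labels = [], [], []
        for f in features:
            ids = f["input_ids"][:batch_len]
            lbl = f["labels"][:batch_len]
            pad = batch_len - len(ids)
            input_ids.append(ids + [self.pad_token_id] * pad)
            attention_mask.append([1] * len(ids) + [0] * pad)
            labels.append(lbl + [-100] * pad)  # -100 is ignored by the loss

        return {
            "input_ids": torch.tensor(input_ids, dtype=torch.long),
            "attention_mask": torch.tensor(attention_mask, dtype=torch.long),
            "labels": torch.tensor(labels, dtype=torch.long),
        }
```

If a first step padded to 512 survives, the model itself fits; a later OOM then points at fragmentation or occasional memory spikes rather than the base footprint.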

@anyiz (Author) commented Feb 5, 2024

> First, try uncommenting this line and see whether it OOMs immediately: https://github.com/yangjianxin1/Firefly/blob/master/component/collator.py#L17
>
> If it OOMs right away, that means a sequence length of 512 is too long for your setup.

With that line uncommented, training runs normally at first, but it still OOMs and crashes after a while.
[screenshot: bug]

@yangjianxin1 (Owner)

An A10 has 24 GB of memory, which is still a bit tight for full-parameter training of a 7B model. On my side, full-parameter training of a 7B model is a struggle even on V100s.

If you are doing instruction fine-tuning, then given your hardware I'd recommend QLoRA; it can also give very good results.
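
A rough back-of-the-envelope estimate of why 4×24 GB is tight for full-parameter training (assuming the standard mixed-precision Adam layout of roughly 16 bytes per parameter, with ZeRO-3 sharding model states evenly across the 4 GPUs):

```python
# Ballpark only: excludes activations, the CUDA context and allocator fragmentation.
params = 7.7e9                  # Qwen-7B is roughly 7.7B parameters
weights_and_grads = 2 + 2       # bf16 weights + bf16 gradients
adam_states = 4 + 4 + 4         # fp32 master weights, momentum, variance
num_gpus = 4

per_gpu_gb = params * (weights_and_grads + adam_states) / num_gpus / 1024**3
print(f"~{per_gpu_gb:.0f} GB of model states per GPU")   # ~29 GB, already above 24 GB
```

QLoRA sidesteps this because the base weights stay frozen in 4-bit and only the small adapter matrices are trained.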

@yangjianxin1 (Owner)

Roughly how many steps does training run before it OOMs?

@anyiz (Author) commented Feb 5, 2024

> Roughly how many steps does training run before it OOMs?

It isn't consistent; I've seen it die around step 17, 11, and 20-something.
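
If every batch is padded to the same fixed length, the per-step footprint should be nearly constant, so an OOM at a different step each run usually points at allocator fragmentation or at a few unusually heavy samples when dynamic padding is active. A quick, rough way to check how sample lengths are distributed, assuming the model path and train_file from the config above:

```python
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "/home/gpu/cuizhai/Firefly-master/Qwen/Qwen-7B-Chat2", trust_remote_code=True
)

lengths = []
with open("./data/output.jsonl", encoding="utf-8") as f:
    for line in f:
        # Rough proxy: tokenize the raw JSON line; the chat template adds a few tokens on top.
        lengths.append(len(tokenizer(line)["input_ids"]))

lengths.sort()
n = len(lengths)
print("p50:", lengths[n // 2], "p95:", lengths[int(n * 0.95)], "max:", lengths[-1])
```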

@anyiz (Author) commented Feb 5, 2024

> An A10 has 24 GB of memory, which is still a bit tight for full-parameter training of a 7B model. On my side, full-parameter training of a 7B model is a struggle even on V100s.
>
> If you are doing instruction fine-tuning, then given your hardware I'd recommend QLoRA; it can also give very good results.

When I fine-tune with QLoRA I get the error below. I've searched online for a long time and haven't been able to solve it.

Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
Loading checkpoint shards: 100%|████████████████████████| 8/8 [00:23<00:00,  2.88s/it]
Loading checkpoint shards: 100%|████████████████████████| 8/8 [00:23<00:00,  2.88s/it]
Loading checkpoint shards: 100%|████████████████████████| 8/8 [00:23<00:00,  2.88s/it]
Loading checkpoint shards: 100%|████████████████████████| 8/8 [00:23<00:00,  2.89s/it]
Traceback (most recent call last):
  File "/home/gpu/cuizhai/Firefly-master/train.py", line 350, in <module>
    main()
  File "/home/gpu/cuizhai/Firefly-master/train.py", line 335, in main
    trainer = init_components(args, training_args)
  File "/home/gpu/cuizhai/Firefly-master/train.py", line 309, in init_components
    model = load_model(args, training_args)
  File "/home/gpu/cuizhai/Firefly-master/train.py", line 257, in load_model
    model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=training_args.gradient_checkpointing)
  File "/home/gpu/anaconda3/envs/firefly/lib/python3.9/site-packages/peft/utils/other.py", line 81, in prepare_model_for_kbit_training
    param.data = param.data.to(torch.float32)
RuntimeError: CUDA out of memory. Tried to allocate 2.32 GiB (GPU 1; 22.02 GiB total capacity; 6.63 GiB already allocated; 2.17 GiB free; 6.72 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3475817 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3475820 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3475822 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 3475819) of binary: /home/gpu/anaconda3/envs/firefly/bin/python3.9
Traceback (most recent call last):
  File "/home/gpu/anaconda3/envs/firefly/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/gpu/anaconda3/envs/firefly/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/gpu/anaconda3/envs/firefly/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/gpu/anaconda3/envs/firefly/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/gpu/anaconda3/envs/firefly/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/gpu/anaconda3/envs/firefly/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-02-05_17:54:27
  host      : gpu-P10DRG-PMN34567890123456789012
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 3475819)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

It says I'm out of GPU memory, but even with the config parameters turned way down I still hit the same error. Yet after adjusting the parameters, running train.py directly on a single GPU trains just fine.
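
For what it's worth, the trace shows the upcast inside prepare_model_for_kbit_training failing while GPU 1 has only ~2 GiB free out of 22 GiB, which usually means several processes ended up piling onto the same device. I don't know exactly how Firefly's load_model builds the model, but a common pattern for QLoRA under torchrun is to pin each rank's 4-bit copy to its own GPU, and the error text itself suggests trying max_split_size_mb against fragmentation. A sketch under those assumptions (values and calls illustrative, not Firefly's actual code):

```python
import os

# Must be set before CUDA is initialised; the value is only an example, as the
# OOM message above suggests trying max_split_size_mb against fragmentation.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

local_rank = int(os.environ.get("LOCAL_RANK", "0"))  # set by torchrun

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    "/home/gpu/cuizhai/Firefly-master/Qwen/Qwen-7B-Chat2",
    quantization_config=bnb_config,
    device_map={"": local_rank},   # one full 4-bit copy per rank, each on its own GPU
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
```

If Firefly already does this, the fallback is what you observed: QLoRA fits comfortably on a single GPU, so running single-GPU (or fewer ranks) is a reasonable workaround.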

@anyiz (Author) commented Feb 6, 2024

Yesterday I organized my data into the format below, but the trained model doesn't perform well. Is there a problem with my config file or my data?

    "conversation_id":148,
    "category":"Brainstorming",
    "conversation":[
        {
            "human":"下面是一位用户的基本信息:姓名: 李先生,所属银行: 华夏银行,逾期情况: 是否逾期: 否,逾期金额: 无,逾期时间: 无,逾期卡号: 无,逾期后果: 无,最后还款日: 无,还款方式: 自动扣款",
            "assistant":"您好,请问是李先生吗?"
        },
        {
            "human":"是我,您是哪位?",
            "assistant":"我是华夏银行的客服百聆。给您打电话是想确认一下您名下的华夏银行信用卡情况。根据我们的记录,您没有逾期欠款,账户正常。"
        },
        {
            "human":"好的,谢谢您的提醒。",
            "assistant":"您的信用卡还款方式是自动扣款吗?"
        },
        {
            "human":"是的,我设置了自动扣款。",
            "assistant":"很好,这样确保您每月都能按时还款,不会漏掉还款时间。如有需要,您可以登录网银或拨打我们的客服热线进行相关查询。"
        },
        {
            "human":"好的,谢谢您。",
            "assistant":"不客气,如果您还有其他问题,可以随时联系我们。祝您生活愉快,再见!"
        }
    ]
}

I fine-tune the qwen model with python train.py --train_args_file train_args/sft/qlora/qwen-7b-sft-qlora.json; the JSON config file is:

{
    "output_dir": "output/firefly-qwen-7b-sft-qlora",
    "model_name_or_path": "/home/gpu/cuizhai/Firefly-master/Qwen/Qwen-7B-Chat2",
    "train_file": "./data/output.jsonl",
    "template_name": "qwen",
    "num_train_epochs": 1,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 16,
    "learning_rate": 2e-4,
    "max_seq_length": 512,
    "logging_steps": 100,
    "save_steps": 100,
    "save_total_limit": 1,
    "lr_scheduler_type": "constant_with_warmup",
    "warmup_steps": 100,
    "lora_rank": 64,
    "lora_alpha": 16,
    "lora_dropout": 0.05,

    "gradient_checkpointing": true,
    "disable_tqdm": false,
    "optim": "paged_adamw_32bit",
    "seed": 42,
    "bf16": true,
    "report_to": "tensorboard",
    "dataloader_num_workers": 0,
    "save_strategy": "steps",
    "weight_decay": 0,
    "max_grad_norm": 0.3,
    "remove_unused_columns": false
}

[screenshot: bug]
I don't understand why it generates such long replies. Is my dataset format wrong?
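
One thing worth double-checking on the data side: train_file is a .jsonl, so each conversation has to be a single JSON object on its own line, with a "conversation" list of {"human": ..., "assistant": ...} turns as in the sample above. A quick sanity check along those lines (paths taken from the config above):

```python
import json

# Every line of the jsonl must be one complete conversation object.
with open("./data/output.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        try:
            sample = json.loads(line)
        except json.JSONDecodeError as exc:
            raise SystemExit(f"line {i} is not valid JSON: {exc}")
        turns = sample.get("conversation")
        assert isinstance(turns, list) and turns, f"line {i}: empty or missing 'conversation'"
        for turn in turns:
            assert "human" in turn and "assistant" in turn, f"line {i}: turn missing keys"

print("train_file looks well-formed")
```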

@Parasolation

> Yesterday I organized my data into the format below, but the trained model doesn't perform well. Is there a problem with my config file or my data? […]

Are you using the base version of the model? I ran into the same thing with the base model; switching to the chat model fixed it.
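
That matches the symptom: a base model has never learned the chat template's end-of-turn marker, so it tends to keep generating past the answer, while the chat model stops on it. At inference time you can also guard against rambling by stopping on that marker and capping new tokens. A sketch assuming a HF-style generate call and Qwen's ChatML-style <|im_end|> token (adjust names if your tokenizer differs):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "/home/gpu/cuizhai/Firefly-master/Qwen/Qwen-7B-Chat2"  # path from the config above
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, trust_remote_code=True, device_map="auto"
)

# In practice, format the prompt with the same "qwen" template used in training.
prompt = "你好,请问是李先生吗?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=256,                                          # hard cap on reply length
    eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>"),  # stop at end-of-turn
)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```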
