I am using the official code to run QLoRA training; the only change is that I specified the number of GPUs. The command I ran is:
export CUDA_VISIBLE_DEVICES=1,2,3
torchrun --nproc_per_node=3 train.py --train_args_file train_args/sft/qlora/qwen1.5-14b-sft-qlora.json
Error message:
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2589) of binary: /home/powerop/work/conda/envs/firefly/bin/python
Traceback (most recent call last):
  File "/home/powerop/work/conda/envs/firefly/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/powerop/work/conda/envs/firefly/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/powerop/work/conda/envs/firefly/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/powerop/work/conda/envs/firefly/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/powerop/work/conda/envs/firefly/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/powerop/work/conda/envs/firefly/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-14_19:51:21
  host      : deeplearning-use-1-tr034784-0
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 2590)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-14_19:51:21
  host      : deeplearning-use-1-tr034784-0
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2589)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
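The summary above only says that the workers exited with code 1. It reports `error_file: <N/A>` and does not include the exception raised inside train.py. The linked PyTorch page describes the `record` decorator for capturing that traceback. Below is a minimal sketch of how it might be applied, assuming train.py has a `main()` entry point (a hypothetical name here, since the actual layout of train.py is not shown). The `ImportError` fallback is only there so the snippet runs without PyTorch installed.

```python
# Sketch: wrap the training entry point so torchrun can report the
# worker's own traceback instead of "error_file: <N/A>".
try:
    from torch.distributed.elastic.multiprocessing.errors import record
except ImportError:
    # Fallback for environments without PyTorch: a no-op decorator.
    def record(fn):
        return fn


@record
def main():
    # Hypothetical placeholder for the real training logic in train.py.
    # Any exception raised here is recorded by the decorator and re-raised.
    return 0


if __name__ == "__main__":
    main()
```

With the entry point decorated this way, rerunning the same torchrun command should, if the setup matches these assumptions, include the failing worker's traceback in the failure summary, which is what is needed to diagnose the actual cause.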