ERROR:torch.distributed.elastic.multiprocessing.api:failed #221

Open
ArlanCooper opened this issue Mar 14, 2024 · 0 comments

Comments

@ArlanCooper

I'm running the official code in QLoRA mode; the only thing I changed is the number of GPUs.
The command I'm running is:

export CUDA_VISIBLE_DEVICES=1,2,3
torchrun --nproc_per_node=3 train.py --train_args_file train_args/sft/qlora/qwen1.5-14b-sft-qlora.json


Error message:


ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2589) of binary: /home/powerop/work/conda/envs/firefly/bin/python
Traceback (most recent call last):
  File "/home/powerop/work/conda/envs/firefly/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/powerop/work/conda/envs/firefly/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/powerop/work/conda/envs/firefly/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/powerop/work/conda/envs/firefly/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/powerop/work/conda/envs/firefly/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/powerop/work/conda/envs/firefly/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-14_19:51:21
  host      : deeplearning-use-1-tr034784-0
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 2590)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-14_19:51:21
  host      : deeplearning-use-1-tr034784-0
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2589)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
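
The ChildFailedError summary above only reports that the child ranks exited with code 1; the actual exception raised inside train.py is not shown (error_file is N/A). As a debugging aid, and following the "To enable traceback" link in the log, one option is to wrap the script's entrypoint with the record decorator from torch.distributed.elastic. This is a minimal sketch, assuming train.py has (or can be given) a main() function; it is not part of the original report:

# sketch: surface the real per-rank exception instead of the generic summary
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    # existing training logic from train.py would go here
    ...

if __name__ == "__main__":
    main()

With the entrypoint decorated, the traceback from the failing rank (local_rank 1 in this run) is recorded and shown in the torchrun failure report, which should reveal the underlying error.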

