Error torch.distributed when running #1309
Hello, it is hard to find the root cause from these logs, as anything that causes the child process to crash would produce this error. Often this is caused by running out of RAM or GPU RAM, so one quick check would be to lower the batch size and see whether that stops the issue. Otherwise, please try to get a traceback and share it here.
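To get a usable traceback out of a crashing worker, one option is to wrap the training entry point so each process writes its own traceback to a file before dying. This is a minimal sketch using only the standard library; `entry_point` stands in for whatever `tools/run.py` ultimately calls, and the `save/crash_logs` directory name is an arbitrary choice:

```python
# Sketch: dump a per-rank traceback file when a worker crashes, so the
# real error survives even if torch.distributed.launch swallows it.
import os
import traceback

def run_with_traceback(entry_point, log_dir="save/crash_logs"):
    """Run entry_point; on any exception, write a per-rank traceback file."""
    try:
        entry_point()
    except Exception:
        os.makedirs(log_dir, exist_ok=True)
        # LOCAL_RANK is set by torch.distributed.launch for each worker.
        rank = os.environ.get("LOCAL_RANK", "0")
        path = os.path.join(log_dir, f"rank_{rank}_crash.log")
        with open(path, "w") as f:
            f.write(traceback.format_exc())
        raise  # re-raise so the launcher still reports the failure
```

After a crash, the file under `save/crash_logs/` contains the full Python traceback (for example, a CUDA out-of-memory error), which is far more useful to share than the launcher's generic exit message.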
Hi,
I am working on one of the extended mmf projects, but when I run it with the command below I get the following error. It should be noted that I have encountered this error in other extended Pythia frameworks as well.
Command for running:
python -m torch.distributed.launch --nproc_per_node 1 tools/run.py --pretrain --tasks vqa --datasets m4c_textvqa --model m4c_split --seed 13 --config configs/vqa/m4c_textvqa/tap_base_pretrain.yml --save_dir save/m4c_split_pretrain_test training_parameters.distributed True
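As a quick memory check, the batch size can be lowered from the command line using the same override syntax the command above already uses for `training_parameters.distributed`. The key name `training_parameters.batch_size` is an assumption based on that pattern and on the YAML config; verify the exact key in `tap_base_pretrain.yml` before relying on it:

```shell
# Sketch: same run with a smaller batch size appended as a config override.
# training_parameters.batch_size is assumed from the override pattern above.
python -m torch.distributed.launch --nproc_per_node 1 tools/run.py \
    --pretrain --tasks vqa --datasets m4c_textvqa --model m4c_split \
    --seed 13 --config configs/vqa/m4c_textvqa/tap_base_pretrain.yml \
    --save_dir save/m4c_split_pretrain_test \
    training_parameters.distributed True \
    training_parameters.batch_size 16
```

If the crash disappears at a smaller batch size, the original failure was almost certainly an out-of-memory condition rather than a distributed-setup problem.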
Error:
I installed the environment with the following information:
python = 3.8
pytorch/cuda install command: conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
GPU: 1x GeForce RTX 3090 (24 GB of GPU RAM)
Could you help me to solve this problem?
Is this error because of using 1 GPU?
Do I need to change the initial value of some parameters (like local_rank)?
Could the reason for this error be due to lack of GPU-memory?
It is very important to me to solve this problem and I would be very grateful if you could guide me.
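On the single-GPU question: launching with `--nproc_per_node 1` is a valid configuration, and the launcher still exports the usual distributed environment variables for its one worker, so `local_rank` normally needs no manual change. A small sketch of the environment a one-worker launch sees (the variable names are the real ones set by `torch.distributed.launch`; the helper function itself is illustrative):

```python
# Sketch: the distributed environment variables a single-process launch
# (--nproc_per_node 1) exports for its one and only worker.
def single_process_env(master_addr="127.0.0.1", master_port="29500"):
    """Return the environment a one-worker torch.distributed launch would see."""
    return {
        "RANK": "0",            # global rank of this worker
        "LOCAL_RANK": "0",      # rank of this worker on this machine
        "WORLD_SIZE": "1",      # total number of workers across all machines
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": master_port,
    }
```

Since `WORLD_SIZE` is 1, process-group initialization is trivial, which is why a crash under this setup usually points at something else (most often memory) rather than the distributed configuration itself.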