Error torch.distributed when running #1309

Open
tinaboya2023 opened this issue Apr 18, 2023 · 1 comment

Comments

tinaboya2023 commented Apr 18, 2023

Hi,
I am working on one of the extended mmf projects. When I run it with the command below, I get the following error. Note that I have also encountered this error in other frameworks extended from pythia.
Command for running:
python -m torch.distributed.launch --nproc_per_node 1 tools/run.py --pretrain --tasks vqa --datasets m4c_textvqa --model m4c_split --seed 13 --config configs/vqa/m4c_textvqa/tap_base_pretrain.yml --save_dir save/m4c_split_pretrain_test training_parameters.distributed True

Error:
3

////////////////
I set up the environment as follows:
python=3.8
pytorch and cuda, installed with: conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
GPU: 1 GeForce RTX 3090 (24 GB GPU RAM)
/////////////////
Could you help me solve this problem?
Is this error caused by using only 1 GPU?
Do I need to change the initial value of some parameters (like local_rank)?
Could the error be due to a lack of GPU memory?
Solving this problem is very important to me, and I would be very grateful for any guidance.

@pbontrager

Hello, it is hard to find the root cause from these logs, since anything that crashes the child process would produce this error. It is often caused by running out of RAM or GPU RAM, so one quick check would be to lower the batch size and see whether the error goes away. Otherwise, please try to get a traceback and share it here.
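
One possible way to get that traceback is to wrap the training entry point so that any Python exception in the child is written to a file before the launcher reports the failure. This is only a minimal sketch using the standard library: `run_with_traceback`, the `main` callable and the `worker_error.log` path are hypothetical names, not part of mmf or PyTorch, and the sketch assumes the launcher sets the `RANK` environment variable for each child. It will not capture anything if the process is killed from outside (for example by the operating system when host RAM runs out), because no Python exception is raised in that case.

```python
import os
import sys
import traceback


def run_with_traceback(main, log_path="worker_error.log"):
    """Run main(); on failure, save the full traceback and re-raise.

    `main` is a hypothetical stand-in for whatever function the
    training script uses as its entry point.
    """
    try:
        return main()
    except Exception:
        # RANK is assumed to be set by the launcher; fall back to "0".
        rank = os.environ.get("RANK", "0")
        report = f"[rank {rank}]\n{traceback.format_exc()}"
        # Append so that repeated runs do not overwrite earlier reports.
        with open(log_path, "a") as f:
            f.write(report)
        sys.stderr.write(report)
        # Re-raise so the launcher still sees a non-zero exit code.
        raise
```

Calling `run_with_traceback(main)` in place of `main()` at the bottom of the script leaves normal runs unchanged and only adds the log file when the child fails.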
