Error torch.distributed when running #1309
Hello, it is hard to find the root cause from these logs, as anything that causes the child process to crash would produce this error. Often this is caused by running out of RAM or GPU RAM, so one quick check would be to lower the batch size and see whether that stops the issue. Otherwise, please try to get a traceback and share it here.
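To get a usable traceback out of a crashing worker, one option is to wrap the training entry point so each process writes its own traceback to a file before dying. This is a minimal sketch using only the standard library; `entry_point` stands in for whatever `tools/run.py` ultimately calls, and the `save/crash_logs` directory name is an arbitrary choice:

```python
# Sketch: dump a per-rank traceback file when a worker crashes, so the
# real error survives even if torch.distributed.launch swallows it.
import os
import traceback

def run_with_traceback(entry_point, log_dir="save/crash_logs"):
    """Run entry_point; on any exception, write a per-rank traceback file."""
    try:
        entry_point()
    except Exception:
        os.makedirs(log_dir, exist_ok=True)
        # LOCAL_RANK is set by torch.distributed.launch for each worker.
        rank = os.environ.get("LOCAL_RANK", "0")
        path = os.path.join(log_dir, f"rank_{rank}_crash.log")
        with open(path, "w") as f:
            f.write(traceback.format_exc())
        raise  # re-raise so the launcher still reports the failure
```

After a crash, the file under `save/crash_logs/` contains the full Python traceback (for example, a CUDA out-of-memory error), which is far more useful to share than the launcher's generic exit message.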
Hi,
I am working on one of the extended mmf projects, but when I run it with the command below I get the following error. It should be noted that I have encountered this error in other extended Pythia frameworks as well.
Command for running:
python -m torch.distributed.launch --nproc_per_node 1 tools/run.py --pretrain --tasks vqa --datasets m4c_textvqa --model m4c_split --seed 13 --config configs/vqa/m4c_textvqa/tap_base_pretrain.yml --save_dir save/m4c_split_pretrain_test training_parameters.distributed True
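As a quick memory check, the batch size can be lowered from the command line using the same override syntax the command above already uses for `training_parameters.distributed`. The key name `training_parameters.batch_size` is an assumption based on that pattern and on the YAML config; verify the exact key in `tap_base_pretrain.yml` before relying on it:

```shell
# Sketch: same run with a smaller batch size appended as a config override.
# training_parameters.batch_size is assumed from the override pattern above.
python -m torch.distributed.launch --nproc_per_node 1 tools/run.py \
    --pretrain --tasks vqa --datasets m4c_textvqa --model m4c_split \
    --seed 13 --config configs/vqa/m4c_textvqa/tap_base_pretrain.yml \
    --save_dir save/m4c_split_pretrain_test \
    training_parameters.distributed True \
    training_parameters.batch_size 16
```

If the crash disappears at a smaller batch size, the original failure was almost certainly an out-of-memory condition rather than a distributed-setup problem.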
Error:
I installed the environment with the following information:
python = 3.8
pytorch/cuda install command: conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
GPU: 1x GeForce RTX 3090 (24 GB of GPU RAM)
Could you help me to solve this problem?
Is this error because of using 1 GPU?
Do I need to change the initial value of some parameters (like local_rank)?
Could the reason for this error be due to lack of GPU-memory?
It is very important to me to solve this problem and I would be very grateful if you could guide me.
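On the single-GPU question: launching with `--nproc_per_node 1` is a valid configuration, and the launcher still exports the usual distributed environment variables for its one worker, so `local_rank` normally needs no manual change. A small sketch of the environment a one-worker launch sees (the variable names are the real ones set by `torch.distributed.launch`; the helper function itself is illustrative):

```python
# Sketch: the distributed environment variables a single-process launch
# (--nproc_per_node 1) exports for its one and only worker.
def single_process_env(master_addr="127.0.0.1", master_port="29500"):
    """Return the environment a one-worker torch.distributed launch would see."""
    return {
        "RANK": "0",            # global rank of this worker
        "LOCAL_RANK": "0",      # rank of this worker on this machine
        "WORLD_SIZE": "1",      # total number of workers across all machines
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": master_port,
    }
```

Since `WORLD_SIZE` is 1, process-group initialization is trivial, which is why a crash under this setup usually points at something else (most often memory) rather than the distributed configuration itself.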