When distributed training was performed, the program remained unresponsive #92

Open
mumu029 opened this issue Jan 28, 2024 · 0 comments

mumu029 commented Jan 28, 2024

I want to train the model on two servers with one GPU each. After I set up the configuration and ran it, the program hung at one point and stopped responding. The program works when I train on a single server.

export MASTER_ADDR=192.168.1.12
export MASTER_PORT=17788
export NODE_RANK=0

(py36tr108cu117) (base) cx@v100:~/ViLT-master$ python run.py with data_root=../../data/TrinityMultimodalTrojAI-main/data/clean/ num_gpus=1 num_nodes=2 task_finetune_vqa_randaug per_gpu_batchsize=64 load_path=../../data/model_weight/vilt_200k_mlm_itm.ckpt
WARNING - root - Changed type of config entry "max_steps" from int to NoneType
WARNING - ViLT - No observers have been added to this run
INFO - ViLT - Running command 'main'
INFO - ViLT - Started
Global seed set to 0
INFO - lightning - Global seed set to 0
GPU available: True, used: True
INFO - lightning - GPU available: True, used: True
TPU available: None, using: 0 TPU cores
INFO - lightning - TPU available: None, using: 0 TPU cores
Using environment variable NODE_RANK for node rank (0).
INFO - lightning - Using environment variable NODE_RANK for node rank (0).
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO - lightning - LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Using native 16bit precision.
INFO - lightning - Using native 16bit precision.
Missing logger folder: result/finetune_vqa_randaug_seed0_from_vilt_200k_mlm_itm
WARNING - lightning - Missing logger folder: result/finetune_vqa_randaug_seed0_from_vilt_200k_mlm_itm
Global seed set to 0
INFO - lightning - Global seed set to 0
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
INFO - lightning - initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
INFO - root - Added key: store_based_barrier_key:1 to store for rank: 0

The program hangs at this point and prints nothing further.
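
The last log lines (`MEMBER: 1/2` and the `store_based_barrier_key` entry) suggest that rank 0 is waiting for the second node to join the process group. One possible check, offered as an assumption rather than a confirmed diagnosis, is to test from the second server whether `MASTER_ADDR:MASTER_PORT` is reachable over TCP while rank 0 is waiting. The helper name `can_reach_master` below is hypothetical and is not part of ViLT or PyTorch Lightning.

```python
import socket


def can_reach_master(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling `can_reach_master("192.168.1.12", 17788)` on the second server while the first is stuck at the barrier would return `False` if a firewall or wrong address is blocking the rendezvous. A `True` result only shows that the port is open, not that the second node is configured correctly.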
