
Port problem when using DDP (more than 2 experiments on same server) #483

kaen2891 opened this issue May 21, 2023 · 1 comment

@kaen2891
Contributor

Hello, is there any way to run two DDP-based experiments on the same server?

For example, I have 4 GPUs, and I first run the run_downstream.py task with DDP on 2 of them like: CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node 2 run_downstream.py ~ ~

That leaves 2 GPUs free.

I want to use these 2 GPUs for another DDP experiment, but I got the following error:

[screenshot of the error message]

Maybe this is due to a port conflict (the second job trying to use the same port as the first), but there is no way to change the port in run_downstream.py.

How can I solve it?

Best,

@hank0316
Contributor

Hi, you can specify the argument --master-port after --nproc_per_node. Since you have already launched one DDP training job, the default port 29500 is taken by it, which causes the error. Specifying --master-port 29501 (or any other port that is not in use) for the second job solves this problem.
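For example (a sketch only; adapt the script arguments to your own command, and note that older versions of torch.distributed.launch spell the flag --master_port with an underscore):

# first experiment, uses the default port 29500
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node 2 run_downstream.py ~ ~

# second experiment on the remaining 2 GPUs, pointed at a free port
CUDA_VISIBLE_DEVICES=2,3 python -m torch.distributed.launch --nproc_per_node 2 --master_port 29501 run_downstream.py ~ ~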
