
Port problem when using DDP (more than 2 experiments on same server) #483

kaen2891 opened this issue May 21, 2023 · 1 comment

@kaen2891
Contributor

Hello, is there any way to run two DDP-based experiments on the same server?

For example, I have 4 GPUs, and I first run the run_downstream.py task with DDP on 2 of them like: CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node 2 run_downstream.py ~ ~

That leaves 2 GPUs free.

I want to use these 2 GPUs for another DDP experiment, but I got the following error:

[screenshot of the error message]

Maybe this is due to a port conflict (the second job trying to use the same port as the first), but there is no way to change the port in run_downstream.py.

How can I solve it?

Best,

@hank0316
Contributor

Hi, you can specify the argument --master-port after --nproc_per_node. Since you have already launched one DDP training job, the default port 29500 is taken by it, which causes the error. Specifying --master-port 29501 (or any other port that is not in use) for the second job solves this problem.
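For example (a sketch only; adapt the script arguments to your own command, and note that older versions of torch.distributed.launch spell the flag --master_port with an underscore):

# first experiment, uses the default port 29500
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node 2 run_downstream.py ~ ~

# second experiment on the remaining 2 GPUs, pointed at a free port
CUDA_VISIBLE_DEVICES=2,3 python -m torch.distributed.launch --nproc_per_node 2 --master_port 29501 run_downstream.py ~ ~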
