
How to use MPI for distributed training? #8978

Answered by yingfhu
FDInSky asked this question in Q&A

Hi @ZwwWayne, I was also wondering how to use MPI for multi-machine training. Could you give an example here?

An example of using mpirun for distributed training with 2 GPUs on 2 nodes (1 GPU per node):

mpirun \
--allow-run-as-root \
--npernode 1 --np 2 \
python tools/train.py ${CONFIG_FILE} --launcher mpi

Note: you should at least set the MASTER_ADDR environment variable, which PyTorch's distributed initialization requires. Refer to
https://github.com/open-mmlab/mmcv/blob/master/mmcv/runner/dist_utils.py#L66
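For example, with Open MPI (which the --allow-run-as-root flag above suggests), the -x flag exports an environment variable to every launched rank. A minimal sketch, where the hostname and port below are placeholder values, not ones from the original answer:

# MASTER_ADDR must point at a node reachable by all ranks (usually
# the rank-0 node); 29500 is torch.distributed's conventional default
# port. Both values are placeholders, substitute your own.
export MASTER_ADDR=node1.example.com
export MASTER_PORT=29500

mpirun \
--allow-run-as-root \
--npernode 1 --np 2 \
-x MASTER_ADDR -x MASTER_PORT \
python tools/train.py ${CONFIG_FILE} --launcher mpi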

Answer selected by ZwwWayne
This discussion was converted from issue #7164 on October 09, 2022 07:16.