[QUESTION] RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:10:00)
#782 · Open · JanryPei opened this issue on Apr 16, 2024 · 2 comments
Could you try running a simple PyTorch Distributed test program to help determine whether the problem lies in your infrastructure and job startup or in Megatron itself?
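A minimal sketch of such a test program is below. It is not part of Megatron; the file name `ddp_check.py`, the helper `read_rendezvous_env`, and the 2-minute timeout are assumptions chosen for illustration. It only checks that every rank can join the process group and complete one `all_reduce`. The reported error shows `worker_count=1` with `world_size=2`, so the thing to watch for is whether the second node ever reaches the rendezvous.

```python
import datetime
import os


def read_rendezvous_env(env=None):
    """Return the rendezvous settings a launcher such as torchrun exports.

    Hypothetical helper: it fails fast with a clear message instead of
    waiting for the store-based barrier to time out.
    """
    env = os.environ if env is None else env
    names = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")
    missing = [name for name in names if name not in env]
    if missing:
        raise RuntimeError(f"missing environment variables: {missing}")
    return {
        "master_addr": env["MASTER_ADDR"],
        "master_port": int(env["MASTER_PORT"]),
        "rank": int(env["RANK"]),
        "world_size": int(env["WORLD_SIZE"]),
    }


def main():
    # Imported here so the helper above can be used without PyTorch installed.
    import torch
    import torch.distributed as dist

    cfg = read_rendezvous_env()
    print(
        f"rank {cfg['rank']}/{cfg['world_size']} connecting to "
        f"{cfg['master_addr']}:{cfg['master_port']}",
        flush=True,
    )

    # Use NCCL when GPUs are visible, otherwise fall back to Gloo on CPU.
    if torch.cuda.is_available():
        backend = "nccl"
        local_rank = int(os.environ.get("LOCAL_RANK", 0))
        torch.cuda.set_device(local_rank)
        device = torch.device("cuda", local_rank)
    else:
        backend = "gloo"
        device = torch.device("cpu")

    # Short timeout (an assumption) so a broken setup fails quickly.
    dist.init_process_group(
        backend=backend,
        init_method="env://",
        timeout=datetime.timedelta(minutes=2),
    )

    # Each rank contributes 1; the default reduction is a sum,
    # so every rank should end up with world_size.
    value = torch.ones(1, device=device)
    dist.all_reduce(value)
    print(
        f"rank {dist.get_rank()}: all_reduce result = {value.item()} "
        f"(expected {dist.get_world_size()})",
        flush=True,
    )

    dist.destroy_process_group()


if __name__ == "__main__":
    if "RANK" in os.environ:
        main()
    else:
        print("Launch with torchrun so RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT are set.")
```

To try it on two nodes, something like `torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 --master_addr=<node 0 IP> --master_port=29500 ddp_check.py` on the first node and the same command with `--node_rank=1` on the second should work, assuming the chosen port is reachable between the containers. If this small script also hangs or times out, the problem is likely in the container networking or launch arguments rather than in Megatron.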
My question
I am trying to run Megatron across multiple nodes in Docker.
I started the Docker container with the following command:
pretrain.sh was configured like this:
However, when I run the script, the following error occurs: