------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
Local host: 5c3bd44422c5
Remote host: b0004ecbf31abd2
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
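The last bullet in the ORTE message points at network connectivity, so one quick check is whether a given TCP port on the other machine accepts connections at all. A minimal sketch, assuming bash with `/dev/tcp` support and the coreutils `timeout` command are available inside the container; `check_port` is a hypothetical helper name, and `host1` with port 2223 are the values used in this setup:

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a TCP port on a host accepts connections.
# It relies on bash's built-in /dev/tcp redirection, so no extra tools such as
# nc or telnet are needed. The 3-second timeout is an arbitrary choice.
check_port() {
  local host="$1" port="$2"
  # Host and port are passed as positional arguments to the inner shell
  # so that they are not re-parsed as shell code.
  if timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null; then
    echo "open"
  else
    echo "unreachable"
  fi
}

# Example call (commented out because it depends on the actual network):
# check_port host1 2223
```

Running it in both directions (host to worker and worker to host) may help narrow down which side cannot be reached, since the daemons that mpirun starts also need to connect back to it.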
I have one worker node and one host node. Both run a Docker container that publishes port 2222. To launch the container on the host:
nvidia-docker run -it --publish=2222:2222 -v /uhome/rooijmv/.ssh:/root/.ssh horovod/horovod:sha-626f6a6
On the worker:
nvidia-docker run -it --publish=2222:2222 -v /uhome/rooijmv/.ssh:/root/.ssh horovod/horovod:sha-626f6a6 bash -c "/usr/sbin/sshd -p 2223 -o PermitUserEnvironment=yes; env > /root/.ssh/environment; sleep infinity"
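One detail that may be worth double-checking: the worker command publishes port 2222 but starts sshd on port 2223. With Docker's default bridge networking, only published container ports are reachable from other machines. A minimal sketch of that comparison, assuming the plain `HOST:CONTAINER` publish form used here (no IP prefix or `/tcp` suffix); `port_is_published` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if container port $2 is the container side
# of a Docker publish spec written as HOST:CONTAINER.
port_is_published() {
  local spec="$1" wanted="$2"
  local container_port="${spec##*:}"   # keep the text after the last colon
  [ "$container_port" = "$wanted" ]
}

# Values taken from the worker command: --publish=2222:2222 and sshd -p 2223.
if port_is_published "2222:2222" 2223; then
  echo "sshd port 2223 is published"
else
  echo "sshd port 2223 is NOT published by 2222:2222"
fi
```

Whether this mismatch matters depends on how `host1` resolves and on the network mode the containers use, so treat it as something to rule out and not as a confirmed cause.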
Now on the host:
mpirun -mca plm_rsh_args "-p 2223" --allow-run-as-root -np 2 -H localhost:1,host1:1 hostname
This gives me the error output shown above. The worker node seems unable to connect to the host node. Does anyone have experience with this?