The program blocks hvd.init(). #4018

Open
divmid opened this issue Jan 26, 2024 · 1 comment

divmid commented Jan 26, 2024

Environment:

  1. Framework: (TensorFlow, Keras, PyTorch, MXNet): TensorFlow
  2. Framework version: 2.9.2
  3. Horovod version: 0.28.1
  4. MPI version: mpirun (Open MPI) 4.1.4
  5. CUDA version:
  6. NCCL version:
  7. Python version: 3.8.10
  8. Spark / PySpark version:
  9. Ray version:
  10. OS and version:
  11. GCC version:
  12. CMake version:

Checklist:

  1. Did you search issues to find if somebody asked this question before?
  2. If your question is about hang, did you read this doc?
  3. If your question is about docker, did you read this doc?
  4. Did you check if your question is answered in the troubleshooting guide?

Bug report:
Please describe erroneous behavior you're observing and steps to reproduce it.

  1. My physical host is:
[root@bm83 ~]# cat /etc/centos-release
CentOS Linux release 8.1.1911 (Core)

top - 15:59:32 up 345 days,  5:36,  2 users,  load average: 3.35, 3.45, 3.31
Tasks: 395 total,   1 running, 380 sleeping,  14 stopped,   0 zombie
%Cpu(s):  5.3 us,  5.9 sy,  0.0 ni, 85.8 id,  1.6 wa,  0.2 hi,  1.1 si,  0.0 st
MiB Mem :  64260.5 total,    383.6 free,   5623.3 used,  58253.6 buff/cache
MiB Swap:  32288.0 total,  25981.3 free,   6306.7 used.  57995.8 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
12652 root      20   0 1838576 363928 183776 S  97.7   0.6  16:13.30 python
12653 root      20   0 1838524 364924 184768 S  97.7   0.6  16:14.91 python
17665 1000      20   0   16.1g   2.1g  25564 S  57.3   3.4 419294:36 java

2. I built my environment as follows:

[root@bm83 ~]# docker images
REPOSITORY                    TAG       IMAGE ID       CREATED         SIZE
horovod/horovod               latest    4f3896dc9b9e   7 months ago    14.3GB

docker run -it -d --privileged   --name horovod  --network host -v /data/ssh/:/root/.ssh/ -v /data/horovod:/data/ horovod/horovod:latest

docker exec -it horovod /bin/bash
sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
sed -i 's/#PubkeyAuthentication yes/PubkeyAuthentication yes/' /etc/ssh/sshd_config
sed -i 's/#Port 22/Port 12345/' /etc/ssh/sshd_config
service ssh restart
apt update -y && apt install rsync net-tools vim ncat telnet -y
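
Since sshd inside the container is moved to port 12345, a quick reachability check from the launching host can rule out SSH connectivity before involving mpirun. This is a minimal sketch using only the Python standard library; the helper name `port_is_open` is hypothetical, and the host and port in the comment are the ones from this report. It only tests that a TCP connection can be opened, not that key-based login works.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host and applies the timeout
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors
        return False


# Example call for the setup in this issue (assumed values):
#   port_is_open("10.206.74.32", 12345)
```

A `False` result would point at the sshd configuration or the container networking rather than at Horovod itself.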

3. The script I ran is main.py:

import tensorflow as tf
import numpy as np
from tensorflow import keras
import horovod.tensorflow.keras as hvd
print("1111111111111111111")
hvd.init()
print("2222222222222222222")
model = tf.keras.Sequential([keras.layers.Dense(units=1, input_shape=[1])])
model.compile(optimizer='sgd', loss='mean_squared_error')
xs = np.array([-1.0, 0.0, 1.0, 2.0, 3.0, 4.0], dtype=float)
ys = np.array([-2.0, 1.0, 4.0, 7.0, 10.0, 13.0], dtype=float)
model.fit(xs, ys, epochs=3000)
if hvd.rank() == 0:
    model.save_weights("adasd.h5")

4. The command I used to launch it is:

root@bm83:/data/QuakeMitchell# export HOROVOD_LOG_LEVEL=trace
root@bm83:/data/QuakeMitchell# mpirun --allow-run-as-root -oversubscribe   --mca oob_tcp_include eth0,eth2 --mca btl tcp,self --mca oob tcp -map-by slot  --mca plm_rsh_args "-p 12345 -q -o StrictHostKeyChecking=no"    -np 2 -H 10.206.74.32:2  python   /data/QuakeMitchell/main.py
1111111111111111111
[2024-01-26 07:43:02.115518: D /tmp/pip-req-build-9nlys6qr/horovod/common/utils/env_parser.cc:107] Using MPI to perform controller operations.
[2024-01-26 07:43:02.115573: D /tmp/pip-req-build-9nlys6qr/horovod/common/utils/env_parser.cc:73] Using MPI to perform CPU operations.
[2024-01-26 07:43:02.115589: D /tmp/pip-req-build-9nlys6qr/horovod/common/mpi/mpi_context.h:51] MPI context enabled.
[2024-01-26 07:43:02.115612: D /tmp/pip-req-build-9nlys6qr/horovod/common/mpi/mpi_controller.h:36] MPI Controller constructed.
1111111111111111111
[2024-01-26 07:43:02.118399: D /tmp/pip-req-build-9nlys6qr/horovod/common/utils/env_parser.cc:107] Using MPI to perform controller operations.
[2024-01-26 07:43:02.118443: D /tmp/pip-req-build-9nlys6qr/horovod/common/utils/env_parser.cc:73] Using MPI to perform CPU operations.
[2024-01-26 07:43:02.118473: D /tmp/pip-req-build-9nlys6qr/horovod/common/mpi/mpi_context.h:51] MPI context enabled.
[2024-01-26 07:43:02.118503: D /tmp/pip-req-build-9nlys6qr/horovod/common/mpi/mpi_controller.h:36] MPI Controller constructed.
[2024-01-26 07:43:02.185741: D /tmp/pip-req-build-9nlys6qr/horovod/common/mpi/mpi_context.cc:195] Using MPI_COMM_WORLD as global communicator.
[2024-01-26 07:43:02.185741: D /tmp/pip-req-build-9nlys6qr/horovod/common/mpi/mpi_context.cc:195] Using MPI_COMM_WORLD as global communicator.

-------------- The program blocks at hvd.init() --------------
root@bm83:/data/QuakeMitchell# top
top - 07:28:21 up 345 days,  5:05,  1 user,  load average: 2.62, 3.10, 3.13
Tasks:  26 total,   1 running,  11 sleeping,  14 stopped,   0 zombie
%Cpu(s):  5.5 us,  5.7 sy,  0.0 ni, 87.1 id,  0.3 wa,  0.2 hi,  1.1 si,  0.0 st
MiB Mem :  64260.5 total,    308.6 free,   5691.5 used,  58260.4 buff/cache
MiB Swap:  32288.0 total,  26062.5 free,   6225.5 used.  57926.1 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 9134 root      20   0 1842688 367908 183580 S  98.3   0.6   1:43.27 python
 9135 root      20   0 1842640 368152 183812 S  97.7   0.6   1:42.90 python
    1 root      20   0    4244      0      0 S   0.0   0.0   0:00.04 bash
   29 root      20   0    4244   1808   1544 S   0.0   0.0   0:00.15 bash

divmid added the bug label Jan 26, 2024

MrAta (Contributor) commented Jan 30, 2024

Neither your code nor the way you're using horovod sounds correct. Please follow the keras example here:
https://github.com/horovod/horovod/blob/master/examples/keras/keras_mnist.py
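
As a rough sketch, the pattern used in the Horovod Keras examples, applied to the script from this issue, looks like the following: wrap the optimizer in `hvd.DistributedOptimizer`, broadcast the initial weights from rank 0, and save only on rank 0. The function names `scaled_learning_rate` and `train`, and the base learning rate of 0.01, are assumptions made for illustration. Scaling the learning rate by the number of workers is a common convention, not a requirement. Note that these changes concern how training is distributed; they are not claimed to explain the hang in `hvd.init()`.

```python
def scaled_learning_rate(base_lr: float, world_size: int) -> float:
    """Scale the learning rate linearly with the number of workers."""
    return base_lr * world_size


def train(epochs: int = 3000, weights_path: str = "adasd.h5") -> None:
    # Heavy imports are kept inside the function so the file can be
    # imported on machines without TensorFlow or Horovod installed.
    import numpy as np
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()

    model = tf.keras.Sequential(
        [tf.keras.layers.Dense(units=1, input_shape=[1])]
    )

    # Wrap the optimizer so gradients are averaged across workers.
    base_opt = tf.keras.optimizers.SGD(
        learning_rate=scaled_learning_rate(0.01, hvd.size())
    )
    model.compile(
        optimizer=hvd.DistributedOptimizer(base_opt),
        loss="mean_squared_error",
    )

    xs = np.array([-1.0, 0.0, 1.0, 2.0, 3.0, 4.0], dtype=float)
    ys = np.array([-2.0, 1.0, 4.0, 7.0, 10.0, 13.0], dtype=float)

    # Start all workers from the same initial weights (those of rank 0).
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

    model.fit(
        xs,
        ys,
        epochs=epochs,
        callbacks=callbacks,
        verbose=1 if hvd.rank() == 0 else 0,  # log from one worker only
    )

    # Save from a single worker to avoid concurrent writes to one file.
    if hvd.rank() == 0:
        model.save_weights(weights_path)
```

To use this as main.py, add a call to `train()` at the bottom of the file and launch it through mpirun or horovodrun.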
Also follow the horovod-mpi docs to see how to run the program using horovodrun command:
https://github.com/horovod/horovod/blob/master/examples/keras/keras_mnist.py
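
For reference, a launch through horovodrun that mirrors the mpirun invocation in this issue could be assembled as below. This is a sketch under assumptions: the helper `build_horovodrun_command` is hypothetical, it uses only the `-np`, `-H`, and `-p` (SSH port) options of horovodrun, and the host, slot count, port, and script path are copied from the report.

```python
import shlex
from typing import List, Optional, Tuple


def build_horovodrun_command(
    num_proc: int,
    hosts: List[Tuple[str, int]],
    script: str,
    ssh_port: Optional[int] = None,
) -> List[str]:
    """Build an argv list for horovodrun from (host, slots) pairs."""
    total_slots = sum(slots for _, slots in hosts)
    if num_proc > total_slots:
        raise ValueError(
            f"requested {num_proc} processes but only {total_slots} slots"
        )
    host_arg = ",".join(f"{host}:{slots}" for host, slots in hosts)
    cmd = ["horovodrun", "-np", str(num_proc), "-H", host_arg]
    if ssh_port is not None:
        # Needed when sshd does not listen on the default port 22
        cmd += ["-p", str(ssh_port)]
    cmd += ["python", script]
    return cmd


cmd = build_horovodrun_command(
    2, [("10.206.74.32", 2)], "/data/QuakeMitchell/main.py", ssh_port=12345
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell inside the container; passing the argv list to `subprocess.run` would work as well.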
