Getting error while running multi node machine learning training on H100 servers #3989

Open
PurvagLapsiwala opened this issue Oct 2, 2023 · 1 comment


PurvagLapsiwala commented Oct 2, 2023

Is your feature request related to a problem? Please describe.

Multi-node training on a 2xH100 server works fine inside the nvcr.io/nvidia/tensorflow:23.02-tf2-py3 Docker container provided by NVIDIA, which ships the following package versions:

Docker Container
horovod:0.26.1+nv23.2
tensorflow:2.11.0+nv23.2

When I run the same training directly on the host, without Docker, using the following versions, I get the error shown below.

Host
horovod: 0.28.1
tf-nightly:2.14.0.dev20230706
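
For anyone comparing the two environments, here is a minimal sketch that prints the installed versions of the packages listed above. Running it once inside the container and once on the host gives output that can be compared directly. The helper name `installed_versions` and the package tuple are illustrative assumptions, not part of Horovod or TensorFlow.

```python
from importlib import metadata

# Package names taken from the version lists in this report.
PACKAGES = ("horovod", "tensorflow", "tf-nightly")


def installed_versions(packages=PACKAGES):
    """Return {package: version or 'not installed'} for the current interpreter."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is absent from this environment.
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version}")
```

The sketch reads only the package metadata, so it does not import TensorFlow or Horovod and should also work in an environment where one of them fails to load.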

Error:

[6]<stderr>:  File "multinode_training/multinode_training.py", line 477, in <module>
[6]<stderr>:    net.fit(train_batches,
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
[6]<stderr>:    return fn(*args, **kwargs)
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/keras/src/engine/training.py", line 1742, in fit
[6]<stderr>:    tmp_logs = self.train_function(iterator)
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/keras/src/engine/training.py", line 1338, in train_function
[6]<stderr>:    return step_function(self, iterator)
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/keras/src/engine/training.py", line 1322, in step_function
[6]<stderr>:    outputs = model.distribute_strategy.run(run_step, args=(data,))
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/keras/src/engine/training.py", line 1303, in run_step
[6]<stderr>:    outputs = model.train_step(data)
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/keras/src/engine/training.py", line 1084, in train_step
[6]<stderr>:    self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/keras/src/optimizers/legacy/optimizer_v2.py", line 598, in minimize
[6]<stderr>:    grads_and_vars = self._compute_gradients(
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/_keras/__init__.py", line 136, in _compute_gradients
[6]<stderr>:    allreduced_grads = self._allreduce(grads, weights)
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/_keras/__init__.py", line 218, in _allreduce
[6]<stderr>:    return __filtered_reduce_grads(grads, vars)
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/_keras/__init__.py", line 184, in __filtered_reduce_grads
[6]<stderr>:    rg = self._allreduce_grads(rg, rv)
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 573, in allreduce_grads
[6]<stderr>:    if groups is not None:
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 616, in allreduce_grads
[6]<stderr>:    op=op,
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 616, in allreduce_grads
[6]<stderr>:    op=op,
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 616, in allreduce_grads
[6]<stderr>:    op=op,
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 398, in _allreduce_cond
[6]<stderr>:    return tf.cond(cond, allreduce_fn, id_fn)
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 384, in allreduce_fn
[6]<stderr>:    return allreduce(tensor, *args, process_set=process_set, **kwargs)
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 102, in allreduce
[6]<stderr>:    if isinstance(tensor, tf.IndexedSlices):
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 138, in allreduce
[6]<stderr>:    summed_tensor_compressed = _allreduce(tensor_compressed, op=op,
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/mpi_ops.py", line 130, in _allreduce
[6]<stderr>:    return MPI_LIB.horovod_allreduce(tensor, name=name, reduce_op=op,
[6]<stderr>:
[6]<stderr>:  File "<string>", line 108, in horovod_allreduce
[6]<stderr>:
[6]<stderr>:ncclCommInitRank failed: invalid usage (run with NCCL_DEBUG=WARN for details)
[6]<stderr>:	 [[{{node DistributedAdam_Allreduce/cond_74/HorovodAllreduce_grads_74_0}}]] [Op:__inference_train_function_6700]
[6]<stderr>:Terminated
[0]<stderr>:Terminated
[4]<stderr>:Terminated
[5]<stderr>:Terminated
Process 7 exit with status code 1.
Terminating remaining workers after failure of Process 7.
Process 3 exit with status code 1.
Process 1 exit with status code 1.
Process 2 exit with status code 1.
Process 6 exit with status code 143.
Process 0 exit with status code 143.
Process 4 exit with status code 143.
Process 5 exit with status code 143.
Traceback (most recent call last):
  File "/home/idps/.local/bin/horovodrun", line 8, in <module>
    sys.exit(run_commandline())
  File "/home/idps/.local/lib/python3.8/site-packages/horovod/runner/launch.py", line 837, in run_commandline
    _run(args)
  File "/home/idps/.local/lib/python3.8/site-packages/horovod/runner/launch.py", line 827, in _run
    return _run_static(args)
  File "/home/idps/.local/lib/python3.8/site-packages/horovod/runner/launch.py", line 685, in _run_static
    _launch_job(args, settings, nics, command)
  File "/home/idps/.local/lib/python3.8/site-packages/horovod/runner/launch.py", line 800, in _launch_job
    run_controller(args.use_gloo, gloo_run_fn,
  File "/home/idps/.local/lib/python3.8/site-packages/horovod/runner/launch.py", line 776, in run_controller
    gloo_run()
  File "/home/idps/.local/lib/python3.8/site-packages/horovod/runner/launch.py", line 792, in gloo_run_fn
    gloo_run(settings, nics, env, driver_ip, command)
  File "/home/idps/.local/lib/python3.8/site-packages/horovod/runner/gloo_run.py", line 300, in gloo_run
    launch_gloo(command, exec_command, settings, nics, env, server_ip)
  File "/home/idps/.local/lib/python3.8/site-packages/horovod/runner/gloo_run.py", line 284, in launch_gloo
    raise RuntimeError('Horovod detected that one or more processes exited with non-zero '
RuntimeError: Horovod detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: 7
Exit code: 1
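
The NCCL message in the log suggests rerunning with `NCCL_DEBUG=WARN`. A minimal sketch of doing that from inside the training script follows. It assumes that setting the variable in the worker process before Horovod and NCCL initialize is sufficient; if it is not, the variable has to be exported through the launcher environment instead. The helper name `enable_nccl_debug` is hypothetical.

```python
import os


def enable_nccl_debug(level="WARN"):
    """Set NCCL_DEBUG for this process unless it is already set, and return its value."""
    # setdefault keeps a value exported by the shell or the launcher.
    os.environ.setdefault("NCCL_DEBUG", level)
    return os.environ["NCCL_DEBUG"]


if __name__ == "__main__":
    # Call this before hvd.init() and before the first allreduce, so the
    # variable is in place when NCCL creates its communicator (assumption).
    print("NCCL_DEBUG =", enable_nccl_debug())
```

With the variable set, the `ncclCommInitRank failed: invalid usage` line should be accompanied by a more specific NCCL warning in the worker's stderr.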

Script:

import tensorflow as tf
import horovod.tensorflow.keras as hvd
# Initialize Horovod
hvd.init()

if hvd.local_rank() == 0:
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    hvd.broadcast(0, 0)
else:
    hvd.broadcast(0, 0)
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Data partition for different workers
num_pics_per_rank = x_train.shape[0] // hvd.size()
pic_begin = num_pics_per_rank * hvd.rank()
pic_end = pic_begin + num_pics_per_rank
x_train = x_train[pic_begin:pic_end,]
y_train = y_train[pic_begin:pic_end,]

x_train, x_test = x_train / 255.0, x_test / 255.0


def build_and_compile_cnn_model():
  model = tf.keras.Sequential([
      tf.keras.layers.InputLayer(input_shape=(28, 28)),
      tf.keras.layers.Reshape(target_shape=(28, 28, 1)),
      tf.keras.layers.Conv2D(32, 3, activation='relu'),
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(128, activation='relu'),
      tf.keras.layers.Dense(10)
  ])
  model.compile(
      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
      optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
      metrics=['accuracy'])
  return model

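# Build the model, then train on this worker's shard of the data.
model = build_and_compile_cnn_model()
model.fit(x_train, y_train, epochs=3, batch_size=128)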

Describe the solution you'd like

The training should run on the host without any errors, as it does in the container.


@PurvagLapsiwala PurvagLapsiwala changed the title Getting error while running multi node machine learning training on H100 server Getting error while running multi node machine learning training on H100 servers Oct 2, 2023
@PurvangL

Marking this as solved; the problem turned out to be with the installed libraries.
