Getting error while running multi node machine learning training on H100 servers #3989

PurvagLapsiwala · 2023-10-02T17:34:31Z

Is your feature request related to a problem? Please describe.

Multi-node training on 2x H100 servers works fine inside the nvcr.io/nvidia/tensorflow:23.02-tf2-py3 Docker container provided by NVIDIA, which ships the following package versions.

Docker Container
horovod: 0.26.1+nv23.2
tensorflow: 2.11.0+nv23.2

When I run the same training directly on the host, without Docker, using the versions below, I get the error shown further down.

Host
horovod: 0.28.1
tf-nightly: 2.14.0.dev20230706
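
Since the container and the host differ only in their installed packages, it can help to print the versions from inside each environment and compare them. The following is a minimal sketch using only the standard library; the helper name `installed_versions` and the list of packages queried are illustrative choices, not part of Horovod or TensorFlow.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: version string}, or None for packages that are not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Package names are assumptions based on the versions listed in this report.
for name, version in installed_versions(
    ["horovod", "tensorflow", "tf-nightly", "keras"]
).items():
    print(f"{name}: {version or 'not installed'}")
```

Running this once in the container and once on the host gives two listings that can be diffed directly.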

Error:

[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
[6]<stderr>:    return fn(*args, **kwargs)
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/keras/src/engine/training.py", line 1742, in fit
[6]<stderr>:    tmp_logs = self.train_function(iterator)
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/keras/src/engine/training.py", line 1338, in train_function
[6]<stderr>:    return step_function(self, iterator)
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/keras/src/engine/training.py", line 1322, in step_function
[6]<stderr>:    outputs = model.distribute_strategy.run(run_step, args=(data,))
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/keras/src/engine/training.py", line 1303, in run_step
[6]<stderr>:    outputs = model.train_step(data)
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/keras/src/engine/training.py", line 1084, in train_step
[6]<stderr>:    self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/keras/src/optimizers/legacy/optimizer_v2.py", line 598, in minimize
[6]<stderr>:    grads_and_vars = self._compute_gradients(
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/_keras/__init__.py", line 136, in _compute_gradients
[6]<stderr>:    allreduced_grads = self._allreduce(grads, weights)
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/_keras/__init__.py", line 218, in _allreduce
[6]<stderr>:    return __filtered_reduce_grads(grads, vars)
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/_keras/__init__.py", line 184, in __filtered_reduce_grads
[6]<stderr>:    rg = self._allreduce_grads(rg, rv)
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 573, in allreduce_grads
[6]<stderr>:    if groups is not None:
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 616, in allreduce_grads
[6]<stderr>:    op=op,
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 616, in allreduce_grads
[6]<stderr>:    op=op,
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 616, in allreduce_grads
[6]<stderr>:    op=op,
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 398, in _allreduce_cond
[6]<stderr>:    return tf.cond(cond, allreduce_fn, id_fn)
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 384, in allreduce_fn
[6]<stderr>:    return allreduce(tensor, *args, process_set=process_set, **kwargs)
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 102, in allreduce
[6]<stderr>:    if isinstance(tensor, tf.IndexedSlices):
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 138, in allreduce
[6]<stderr>:    summed_tensor_compressed = _allreduce(tensor_compressed, op=op,
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/mpi_ops.py", line 130, in _allreduce
[6]<stderr>:    return MPI_LIB.horovod_allreduce(tensor, name=name, reduce_op=op,
[6]<stderr>:
[6]<stderr>:  File "multinode_training/multinode_training.py", line 477, in <module>
[6]<stderr>:    net.fit(train_batches,
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
[6]<stderr>:    return fn(*args, **kwargs)
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/keras/src/engine/training.py", line 1742, in fit
[6]<stderr>:    tmp_logs = self.train_function(iterator)
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/keras/src/engine/training.py", line 1338, in train_function
[6]<stderr>:    return step_function(self, iterator)
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/keras/src/engine/training.py", line 1322, in step_function
[6]<stderr>:    outputs = model.distribute_strategy.run(run_step, args=(data,))
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/keras/src/engine/training.py", line 1303, in run_step
[6]<stderr>:    outputs = model.train_step(data)
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/keras/src/engine/training.py", line 1084, in train_step
[6]<stderr>:    self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/keras/src/optimizers/legacy/optimizer_v2.py", line 598, in minimize
[6]<stderr>:    grads_and_vars = self._compute_gradients(
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/_keras/__init__.py", line 136, in _compute_gradients
[6]<stderr>:    allreduced_grads = self._allreduce(grads, weights)
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/_keras/__init__.py", line 218, in _allreduce
[6]<stderr>:    return __filtered_reduce_grads(grads, vars)
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/_keras/__init__.py", line 184, in __filtered_reduce_grads
[6]<stderr>:    rg = self._allreduce_grads(rg, rv)
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 573, in allreduce_grads
[6]<stderr>:    if groups is not None:
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 616, in allreduce_grads
[6]<stderr>:    op=op,
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 616, in allreduce_grads
[6]<stderr>:    op=op,
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 616, in allreduce_grads
[6]<stderr>:    op=op,
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 398, in _allreduce_cond
[6]<stderr>:    return tf.cond(cond, allreduce_fn, id_fn)
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 384, in allreduce_fn
[6]<stderr>:    return allreduce(tensor, *args, process_set=process_set, **kwargs)
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 102, in allreduce
[6]<stderr>:    if isinstance(tensor, tf.IndexedSlices):
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 138, in allreduce
[6]<stderr>:    summed_tensor_compressed = _allreduce(tensor_compressed, op=op,
[6]<stderr>:
[6]<stderr>:  File "/home/idps/.local/lib/python3.8/site-packages/horovod/tensorflow/mpi_ops.py", line 130, in _allreduce
[6]<stderr>:    return MPI_LIB.horovod_allreduce(tensor, name=name, reduce_op=op,
[6]<stderr>:
[6]<stderr>:  File "<string>", line 108, in horovod_allreduce
[6]<stderr>:
[6]<stderr>:ncclCommInitRank failed: invalid usage (run with NCCL_DEBUG=WARN for details)
[6]<stderr>:	 [[{{node DistributedAdam_Allreduce/cond_74/HorovodAllreduce_grads_74_0}}]] [Op:__inference_train_function_6700]
[6]<stderr>:Terminated
[0]<stderr>:Terminated
[4]<stderr>:Terminated
[5]<stderr>:Terminated
Process 7 exit with status code 1.
Terminating remaining workers after failure of Process 7.
Process 3 exit with status code 1.
Process 1 exit with status code 1.
Process 2 exit with status code 1.
Process 6 exit with status code 143.
Process 0 exit with status code 143.
Process 4 exit with status code 143.
Process 5 exit with status code 143.
Traceback (most recent call last):
  File "/home/idps/.local/bin/horovodrun", line 8, in <module>
    sys.exit(run_commandline())
  File "/home/idps/.local/lib/python3.8/site-packages/horovod/runner/launch.py", line 837, in run_commandline
    _run(args)
  File "/home/idps/.local/lib/python3.8/site-packages/horovod/runner/launch.py", line 827, in _run
    return _run_static(args)
  File "/home/idps/.local/lib/python3.8/site-packages/horovod/runner/launch.py", line 685, in _run_static
    _launch_job(args, settings, nics, command)
  File "/home/idps/.local/lib/python3.8/site-packages/horovod/runner/launch.py", line 800, in _launch_job
    run_controller(args.use_gloo, gloo_run_fn,
  File "/home/idps/.local/lib/python3.8/site-packages/horovod/runner/launch.py", line 776, in run_controller
    gloo_run()
  File "/home/idps/.local/lib/python3.8/site-packages/horovod/runner/launch.py", line 792, in gloo_run_fn
    gloo_run(settings, nics, env, driver_ip, command)
  File "/home/idps/.local/lib/python3.8/site-packages/horovod/runner/gloo_run.py", line 300, in gloo_run
    launch_gloo(command, exec_command, settings, nics, env, server_ip)
  File "/home/idps/.local/lib/python3.8/site-packages/horovod/runner/gloo_run.py", line 284, in launch_gloo
    raise RuntimeError('Horovod detected that one or more processes exited with non-zero '
RuntimeError: Horovod detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: 7
Exit code: 1
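
The exit codes at the end of the log are worth separating: by the usual shell convention, a status above 128 means the process was killed by signal (status minus 128), so 143 corresponds to SIGTERM, which is consistent with the launcher terminating the remaining workers after the first failure. The sketch below parses the "Process N exit with status code M." lines and splits the ranks into those that failed on their own and those that were signalled. The function name `classify_exits` is hypothetical, and the interpretation of codes above 128 is an assumption based on that convention.

```python
import re
import signal

EXIT_LINE = re.compile(r"Process (\d+) exit with status code (\d+)\.")


def classify_exits(log_text):
    """Split ranks into those that exited with an error and those killed by a signal."""
    failed, signalled = {}, {}
    for rank, code in EXIT_LINE.findall(log_text):
        rank, code = int(rank), int(code)
        if code > 128:
            # Assumed shell convention: 128 + N means "terminated by signal N".
            try:
                signalled[rank] = signal.Signals(code - 128).name
            except ValueError:
                signalled[rank] = f"signal {code - 128}"
        elif code != 0:
            failed[rank] = code
    return failed, signalled


# Exit lines copied from the log above.
log = """\
Process 7 exit with status code 1.
Terminating remaining workers after failure of Process 7.
Process 3 exit with status code 1.
Process 1 exit with status code 1.
Process 2 exit with status code 1.
Process 6 exit with status code 143.
Process 0 exit with status code 143.
Process 4 exit with status code 143.
Process 5 exit with status code 143.
"""

failed, signalled = classify_exits(log)
print("failed on their own:", sorted(failed))
print("terminated by signal:", sorted(signalled))
```

Read this way, ranks 1, 2, 3 and 7 hit the error themselves, while ranks 0, 4, 5 and 6 were stopped afterwards.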

Script:

import tensorflow as tf
import horovod.tensorflow.keras as hvd
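# Initialize Horovod
hvd.init()

# The first process on each node downloads MNIST; the broadcast acts as a
# barrier so the other processes only load the data once it is cached.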
if hvd.local_rank() == 0:
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    hvd.broadcast(0, 0)
else:
    hvd.broadcast(0, 0)
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Data partition for different workers
num_pics_per_rank = x_train.shape[0] // hvd.size()
pic_begin = num_pics_per_rank * hvd.rank()
pic_end = pic_begin + num_pics_per_rank
x_train = x_train[pic_begin:pic_end,]
y_train = y_train[pic_begin:pic_end,]

x_train, x_test = x_train / 255.0, x_test / 255.0


def build_and_compile_cnn_model():
  model = tf.keras.Sequential([
      tf.keras.layers.InputLayer(input_shape=(28, 28)),
      tf.keras.layers.Reshape(target_shape=(28, 28, 1)),
      tf.keras.layers.Conv2D(32, 3, activation='relu'),
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(128, activation='relu'),
      tf.keras.layers.Dense(10)
  ])
  model.compile(
      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
      optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
      metrics=['accuracy'])
  return model

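# Build the model, then train on this worker's shard of the data.
model = build_and_compile_cnn_model()
model.fit(x_train, y_train, epochs=3, batch_size=128)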
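
The NCCL error in the log asks for a rerun with NCCL_DEBUG=WARN. One way to do that without changing the launch command is to set the variable at the top of the script, before `hvd.init()`, so it is in the process environment when NCCL initializes. This is a sketch under that assumption; the helper `enable_nccl_diagnostics` is hypothetical, and it leaves any value already exported by the launcher untouched.

```python
import os


def enable_nccl_diagnostics(env=os.environ, level="WARN"):
    """Set NCCL_DEBUG to `level` unless it is already set; return the effective value."""
    env.setdefault("NCCL_DEBUG", level)
    return env["NCCL_DEBUG"]


# Call this before hvd.init() so every worker process carries the setting.
print("NCCL_DEBUG =", enable_nccl_diagnostics())
```

With the variable set, the failing rank should print the detailed reason behind "ncclCommInitRank failed: invalid usage" instead of only the summary line.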

Describe the solution you'd like

The training should run on the host without errors, as it does in the container.

PurvangL · 2023-10-23T17:46:11Z

Marking this as solved; it turned out to be a library issue.

PurvagLapsiwala added the enhancement label Oct 2, 2023

PurvagLapsiwala changed the title ~~Getting error while running multi node machine learning training on H100 server~~ Getting error while running multi node machine learning training on H100 servers Oct 2, 2023
