
Support for configuration issues #426

Open

EmmaQiaoCh opened this issue Oct 25, 2023 · 1 comment

@EmmaQiaoCh (Collaborator)

There is an issue related to MLPerf DLRMv2; the original link is mlcommons/training_results_v3.0#5

Describe the bug
Hi,
I am trying to bring up a multi-node GPU HugeCTR training benchmark using the code at https://github.com/mlcommons/training_results_v3.0/tree/main/NVIDIA/benchmarks/dlrm_dcnv2/implementations/hugectr

On a single node I am able to run the benchmark, but when I run it across multiple nodes (say, 2 nodes) I hit the issue shown below. Could you please help me resolve it?

[HCTR][17:28:15.456][WARNING][RK0][main]: The model name is not specified when creating the solver.
[1695144496.484294] [hpci5201:103648:0] ib_device.c:1250 UCX ERROR ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:192.160.0.55 sgid_index=3 traffic_class=106) for UD verbs connect on bnxt_re0 failed: Connection timed out
[hpci5201:103648] pml_ucx.c:419 Error: ucp_ep_create(proc=1) failed: Endpoint timeout
[hpci5201:103648] pml_ucx.c:472 Error: Failed to resolve UCX endpoint for rank 1
Traceback (most recent call last):
File "/dev/shm/data/hugectl/train.py", line 344, in
model = hugectr.Model(solver, reader, optimizer)
RuntimeError: Runtime error: MPI_ERR_OTHER: known error not in list
MPI_Bcast(&seed, 1, (static_cast<MPI_Datatype> (static_cast<void *> (&(ompi_mpi_unsigned_long_long)))), 0, (static_cast<MPI_Comm> (static_cast<void *> (&(ompi_mpi_comm_world))))) at create (/workspace/dlrm/hugectr/HugeCTR/src/resource_managers/resource_manager_ext.cpp:39)
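To help narrow this down: the failing call is the initial MPI_Bcast of the random seed in HugeCTR's resource manager, and it fails because UCX cannot create an endpoint to rank 1. Below is a minimal sketch (not from the original report) of a standalone broadcast check, assuming mpi4py is available in the container; the script name and host names are placeholders. If this also fails with a UCX endpoint timeout when launched the same way as train.py, the problem is in the inter-node UCX/InfiniBand setup rather than in HugeCTR.

# mpi_bcast_check.py -- hypothetical standalone sanity check, not part of the MLPerf code.
# Launch it with the same mpirun/hostfile used for train.py, e.g.:
#   mpirun -np 2 -H node1,node2 python mpi_bcast_check.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Mirror resource_manager_ext.cpp: rank 0 broadcasts a seed to every rank.
seed = 42 if rank == 0 else None
seed = comm.bcast(seed, root=0)

print(f"rank {rank} on {MPI.Get_processor_name()} received seed {seed}")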

To Reproduce
Steps to reproduce the behavior:

  1. How to build, including the docker pull & docker run commands
  2. How to run, including the JSON config file used

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

Environment (please complete the following information):

  • OS: [e.g. Ubuntu xx.yy]
  • Graphic card: [e.g. a single NVIDIA V100 or NVIDIA DGX A100]
  • CUDA version: [e.g. CUDA 11.x]
  • Docker image

Additional context


EmmaQiaoCh commented Oct 26, 2023

Hi RaghavendraChari, I can't reproduce this error on 2 nodes in our cluster, even when building the image from the 'training_results_v3.0' repo.
Could you provide detailed steps to reproduce it? How did you build the image? What configuration did you use? Which GPUs did you use?
Thanks!
