Support for configuration issues #426
Comments
Hi RaghavendraChari, I can't reproduce this error on 2 nodes in our cluster, even with an image built from the 'training_results_v3.0' repo.
This issue concerns MLPerf DLRMv2; the original report is mlcommons/training_results_v3.0#5
Describe the bug
Hi,
I am trying to bring up a multi-node GPU HugeCTR training benchmark using the code at https://github.com/mlcommons/training_results_v3.0/tree/main/NVIDIA/benchmarks/dlrm_dcnv2/implementations/hugectr
On a single node I can run the benchmark, but when I run it on multiple nodes (for example, 2 nodes) I hit the error shown below. Could you please help me resolve it?
[HCTR][17:28:15.456][WARNING][RK0][main]: The model name is not specified when creating the solver.
[1695144496.484294] [hpci5201:103648:0] ib_device.c:1250 UCX ERROR ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:192.160.0.55 sgid_index=3 traffic_class=106) for UD verbs connect on bnxt_re0 failed: Connection timed out
[hpci5201:103648] pml_ucx.c:419 Error: ucp_ep_create(proc=1) failed: Endpoint timeout
[hpci5201:103648] pml_ucx.c:472 Error: Failed to resolve UCX endpoint for rank 1
Traceback (most recent call last):
File "/dev/shm/data/hugectl/train.py", line 344, in
model = hugectr.Model(solver, reader, optimizer)
RuntimeError: Runtime error: MPI_ERR_OTHER: known error not in list
MPI_Bcast(&seed, 1, (static_cast<MPI_Datatype> (static_cast<void *> (&(ompi_mpi_unsigned_long_long)))), 0, (static_cast<MPI_Comm> (static_cast<void *> (&(ompi_mpi_comm_world))))) at create (/workspace/dlrm/hugectr/HugeCTR/src/resource_managers/resource_manager_ext.cpp:39)
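Since the traceback ends in an `MPI_Bcast` of the seed, one way to check whether the failure is in the MPI/UCX fabric rather than in HugeCTR itself might be to run the same kind of broadcast on its own. The following is only a minimal sketch, assuming `mpi4py` is installed in the container (it may not be); the function name `check_broadcast` and the script layout are hypothetical and not part of HugeCTR.

```python
import random
import socket


def check_broadcast(comm):
    """Broadcast a 64-bit seed from rank 0 and return (rank, size, seed)."""
    rank = comm.Get_rank()
    size = comm.Get_size()
    # Only rank 0 picks the seed; every other rank has to receive it
    # over the interconnect, similar to the MPI_Bcast in the traceback.
    seed = random.getrandbits(64) if rank == 0 else None
    seed = comm.bcast(seed, root=0)
    return rank, size, seed


def main():
    try:
        # Assumption: mpi4py is available in the image.
        from mpi4py import MPI
    except ImportError:
        print("mpi4py is not available in this environment")
        return
    rank, size, seed = check_broadcast(MPI.COMM_WORLD)
    print(f"rank {rank}/{size} on {socket.gethostname()} received seed {seed}")


if __name__ == "__main__":
    main()
```

Launched with the same MPI launcher and host list used for train.py, every rank should print the same seed. If this small script fails with the same UCX endpoint timeout, the problem is likely in the network or UCX configuration between the nodes rather than in the training code.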
To Reproduce
Steps to reproduce the behavior:
docker pull & docker run commands
Expected behavior
A clear and concise description of what you expected to happen.
Screenshots
If applicable, add screenshots to help explain your problem.
Environment (please complete the following information):
Additional context