Inter-node MPI_Get on GPU buffer hangs #6888
Comments
I was wondering whether there is an issue when you exhaust a "chunk" of GenQ buffers while issuing the gets. You could try increasing the number of buffers per chunk.
Reading more closely now, the ranks are stuck waiting for active-message RMA completions, not the native libfabric ops. I'll have to take another look at the AM code, since it differs from the netmod implementation.
I reproduced this with MPICH main on Sunspot. Just jotting down some notes after adding printfs to the code. All MPI_Get operations are going through the active message path. For the process that issues the get, a request is created and the local window completion counter is incremented. At request completion (i.e., once the data has been sent back from the target and placed in the user buffer), the local counter is decremented via the completion notification mechanism. When the code hangs, I observe that the completion notification is never triggered, so the window counter grows and stays where it is, and the fence never returns. When fence is called more frequently, I can observe requests completing and completion_notification being triggered as expected.
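For readers unfamiliar with the AM RMA path, here is a minimal sketch of the completion-counter pattern described above. The names (`win_state_t`, `issue_am_get`, `make_progress`, etc.) are illustrative, not MPICH's actual internal symbols; the point is only that fence spins until the counter returns to zero, so a lost completion notification leaves it spinning forever.

```c
/* Sketch of the window completion-counter pattern; all names are
 * illustrative, not MPICH's actual internals. */
typedef struct {
    int local_cmpl_cnt;     /* outstanding AM-path RMA ops issued from this rank */
} win_state_t;

/* Placeholder for the progress engine: poll the network, run AM handlers. */
static void make_progress(win_state_t *win) { (void) win; }

/* Origin side: issuing an AM-path MPI_Get creates a request and bumps the
 * local window completion counter. */
static void issue_am_get(win_state_t *win)
{
    win->local_cmpl_cnt++;
    /* ... send the get request to the target ... */
}

/* Completion notification: the data has come back from the target and been
 * placed in the user buffer, so the counter is decremented. If this callback
 * never fires, the counter never returns to zero. */
static void completion_notification(win_state_t *win)
{
    win->local_cmpl_cnt--;
}

/* Fence cannot return until every outstanding operation has completed,
 * i.e. until the counter drops back to zero. */
static void win_fence(win_state_t *win)
{
    while (win->local_cmpl_cnt > 0)
        make_progress(win);
}
```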
Processes are stuck in an infinite loop here: mpich/src/mpid/ch4/netmod/ofi/ofi_am_impl.h, line 379 (commit 8af3921).
The target is trying to send the responses back to the origin, but it keeps getting EAGAIN. We need to understand why progress is not being made so those messages can get through.
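As background, a libfabric send that returns `-FI_EAGAIN` generally has to be retried, and progress has to be driven between retries (for example by draining the completion queue) or the loop can spin forever. The sketch below shows that generic pattern only; it is not the MPICH code at ofi_am_impl.h:379, and the variables (`ep`, `cq`, `buf`, ...) are assumed to be set up elsewhere.

```c
/* Generic sketch: retry fi_send on FI_EAGAIN while draining the local CQ so
 * the provider can free resources. Without the progress step, the retry loop
 * may never succeed, which matches the observed hang. */
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_errno.h>

static int send_with_retry(struct fid_ep *ep, struct fid_cq *cq,
                           const void *buf, size_t len, fi_addr_t dest,
                           void *context)
{
    ssize_t ret;
    do {
        ret = fi_send(ep, buf, len, NULL, dest, context);
        if (ret == -FI_EAGAIN) {
            /* Make progress: reap any pending completions before retrying. */
            struct fi_cq_tagged_entry entry;
            ssize_t n = fi_cq_read(cq, &entry, 1);
            if (n > 0) {
                /* ... process the completion ... */
            }
        }
    } while (ret == -FI_EAGAIN);
    return (int) ret;
}
```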
When a large number of MPI_Get calls are issued on a GPU buffer across nodes before an MPI_Win_fence, the program seems to hang. I will share the location of the reproducer by email.
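The actual reproducer is shared separately, but the access pattern it exercises looks roughly like the sketch below: many small inter-node MPI_Get operations on a CUDA device window between two fences. The message size, count, and target-rank choice here are placeholders, not the real reproducer's parameters.

```c
/* Hypothetical shape of the reproducer: many small MPI_Get calls on a GPU
 * window between two MPI_Win_fence calls. Sizes/counts are placeholders. */
#include <mpi.h>
#include <cuda_runtime.h>

#define MSG_SIZE 8192       /* 8KB: the failing (eager-path) size noted below */
#define NUM_GETS 10000      /* "a large number" of gets before the fence */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* GPU window buffer on every rank */
    char *win_buf, *local_buf;
    cudaMalloc((void **) &win_buf, (size_t) MSG_SIZE * NUM_GETS);
    cudaMalloc((void **) &local_buf, (size_t) MSG_SIZE * NUM_GETS);

    MPI_Win win;
    MPI_Win_create(win_buf, (MPI_Aint) MSG_SIZE * NUM_GETS, 1,
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        int target = size - 1;      /* assume this rank lives on another node */
        for (int i = 0; i < NUM_GETS; i++)
            MPI_Get(local_buf + (size_t) i * MSG_SIZE, MSG_SIZE, MPI_BYTE,
                    target, (MPI_Aint) i * MSG_SIZE, MSG_SIZE, MPI_BYTE, win);
    }
    MPI_Win_fence(0, win);          /* hang reported here */

    MPI_Win_free(&win);
    cudaFree(win_buf);
    cudaFree(local_buf);
    MPI_Finalize();
    return 0;
}
```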
In MPIDIG_mpi_win_fence, some ranks are stuck at the wait on the completion counters, with local_cmpl_cnts greater than 0.
There are some observations from previous experiments:
- Changing all the messages to 40KB works.
- Changing all the messages to 8KB fails (eager protocol).