Inter-node MPI_Get on GPU buffer hangs #6888

Open
dycz0fx opened this issue Jan 30, 2024 · 5 comments

Comments

@dycz0fx (Contributor) commented Jan 30, 2024

When a large number of MPI_Get operations are called before an MPI_Win_fence on a GPU buffer across nodes, the program appears to hang. I will share the location of the reproducer by email; a minimal sketch of the access pattern is included after the observations below.

In MPIDIG_mpi_win_fence, some ranks are stuck at

MPIDIU_PROGRESS_DO_WHILE(MPIR_cc_get(MPIDIG_WIN(win, local_cmpl_cnts)) != 0 ||
                         MPIR_cc_get(MPIDIG_WIN(win, remote_acc_cmpl_cnts)) != 0, vci);

with local_cmpl_cnts greater than 0.

There are some observations from previous experiments:

  1. The reproducer works for large messages but fails for small messages: changing all messages to 40KB works, while changing all messages to 8KB fails (eager protocol).
  2. The reproducer works for a CPU buffer but fails for a GPU buffer.
  3. The reproducer works if MPI_Win_fence is called more often; for example, calling MPI_Win_fence every 1000 messages works (even for small messages on the GPU).
  4. The reproducer works if all ranks are on the same node but fails if ranks are distributed across multiple nodes.
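
For reference, here is a minimal sketch of the access pattern described above. This is not the actual reproducer (which was shared by email); the message size, iteration count, target choice, and the use of CUDA for the device allocation are assumptions for illustration only.

/* Illustrative sketch of the access pattern described above, not the actual
 * reproducer.  Message size, iteration count, and the use of CUDA for the
 * device allocation are assumptions for illustration only. */
#include <mpi.h>
#include <cuda_runtime.h>   /* any device allocator (e.g. Level Zero on Sunspot) would do */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const size_t msg_sz = 8 * 1024;   /* 8KB: small enough to take the eager path */
    const int niters = 100000;        /* many gets before the closing fence */

    char *win_buf, *get_buf;
    cudaMalloc((void **) &win_buf, msg_sz);
    cudaMalloc((void **) &get_buf, msg_sz);

    MPI_Win win;
    MPI_Win_create(win_buf, (MPI_Aint) msg_sz, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    int target = (rank + 1) % size;   /* with one rank per node, this is a remote target */
    for (int i = 0; i < niters; i++)
        MPI_Get(get_buf, (int) msg_sz, MPI_BYTE, target, 0, (int) msg_sz, MPI_BYTE, win);
    MPI_Win_fence(0, win);            /* the hang is observed in this fence */

    MPI_Win_free(&win);
    cudaFree(win_buf);
    cudaFree(get_buf);
    MPI_Finalize();
    return 0;
}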
@dycz0fx (Contributor, Author) commented Jan 30, 2024

@raffenet @hzhou
Hi Ken and Hui, I have sent the reproducer by email. Would you please point me to the CVARs you mentioned in the meeting so I can give them a try?

@raffenet (Contributor) commented:

I was wondering if there is an issue when you exhaust a "chunk" of GenQ buffers while issuing the gets. You could try increasing the number of buffers per chunk with MPIR_CVAR_CH4_NUM_PACK_BUFFERS_PER_CHUNK.
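
For anyone trying this: MPICH CVARs can be set as environment variables at launch time. The value, launcher arguments, and binary name below are only an example, not a recommended setting:

MPIR_CVAR_CH4_NUM_PACK_BUFFERS_PER_CHUNK=256 mpiexec -n 2 -ppn 1 ./reproducer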

@raffenet (Contributor) commented:

Now reading more closely, the ranks are stuck waiting for active message RMA completions, not the native libfabric ops. I'll have to take another look at the AM code since it differs from the netmod implementation.

@raffenet (Contributor) commented Feb 7, 2024

I reproduced this with MPICH main on Sunspot. Just jotting down some notes after adding printfs to the code. All MPI_Get operations are going through the active message path. For the process that issues the get, a request is created and the local window completion counter is incremented. At request completion (i.e., when the data has been sent back from the target and placed in the user buffer), the local counter is decremented via the completion_notification pointer.

When the code hangs, I observe that the completion notification mechanism is never triggered, meaning the window counter grows and stays where it is, so the fence never returns. When fence is called more frequently, I can observe requests completing and completion_notification being triggered as expected.
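
To make that counting scheme concrete, here is a heavily simplified sketch of the mechanism described above; the struct and function names are illustrative, not MPICH's actual internals.

/* Simplified sketch of the completion-counter scheme described above.
 * Names and types are illustrative, not MPICH's actual internals. */
#include <stdatomic.h>

typedef struct {
    atomic_int local_cmpl_cnts;              /* per-window local completion counter */
} win_state_t;

typedef struct {
    atomic_int *completion_notification;     /* points back at the window counter */
} am_request_t;

/* Origin side: issuing an MPI_Get over the active-message path */
void issue_get(win_state_t *win, am_request_t *req)
{
    atomic_fetch_add(&win->local_cmpl_cnts, 1);           /* one more outstanding op */
    req->completion_notification = &win->local_cmpl_cnts;
}

/* Runs when the response data has landed in the user buffer */
void complete_request(am_request_t *req)
{
    atomic_fetch_sub(req->completion_notification, 1);    /* fence waits for zero */
}

/* MPI_Win_fence then spins in the progress engine until the counter reaches 0;
 * in the hang, complete_request is never reached, so the counter never drops. */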

@raffenet (Contributor) commented Feb 7, 2024

Processes are stuck in an infinite loop here:

MPIDI_OFI_CALL_RETRY_AM(fi_send(MPIDI_OFI_global.ctx[ctx_idx].tx, msg_hdr, total_msg_sz,
The target is trying to send the responses back to the origin, but it is getting EAGAIN. Need to understand why progress is not being made so those messages can get through.
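
For context, a simplified sketch of what that retry-on-EAGAIN pattern amounts to. This is not the actual MPIDI_OFI_CALL_RETRY_AM macro, and poke_progress() is a hypothetical stand-in for whatever drives the progress engine.

/* Simplified sketch of the retry-on-EAGAIN pattern, not the actual
 * MPIDI_OFI_CALL_RETRY_AM macro. */
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_errno.h>

extern void poke_progress(void);   /* hypothetical stand-in for the progress hook */

ssize_t send_with_retry(struct fid_ep *ep, const void *buf, size_t len,
                        fi_addr_t dest, void *context)
{
    ssize_t ret;
    do {
        ret = fi_send(ep, buf, len, NULL, dest, context);
        if (ret == -FI_EAGAIN) {
            /* If this progress call cannot drain completions (or is never
             * effective), the loop spins forever, which matches the hang
             * observed here. */
            poke_progress();
        }
    } while (ret == -FI_EAGAIN);
    return ret;
}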
