Inter-node MPI_Get on GPU buffer hangs #6888

Open
dycz0fx opened this issue Jan 30, 2024 · 5 comments

Comments

@dycz0fx (Contributor) commented Jan 30, 2024

When a large number of MPI_Get operations are called before an MPI_Win_fence on a GPU buffer across nodes, the program appears to hang. I will share the location of the reproducer by email; a minimal sketch of the access pattern is included after the observations below.

In MPIDIG_mpi_win_fence, some ranks are stuck at

MPIDIU_PROGRESS_DO_WHILE(MPIR_cc_get(MPIDIG_WIN(win, local_cmpl_cnts)) != 0 ||
                         MPIR_cc_get(MPIDIG_WIN(win, remote_acc_cmpl_cnts)) != 0, vci);

with local_cmpl_cnts greater than 0.

There are some observations from previous experiments:

  1. The reproducer works for large messages but fails for small messages: changing all messages to 40KB works, while changing all messages to 8KB fails (eager protocol).
  2. The reproducer works for a CPU buffer but fails for a GPU buffer.
  3. The reproducer works if MPI_Win_fence is called more often; for example, calling MPI_Win_fence every 1000 messages works (even for small messages on the GPU).
  4. The reproducer works if all ranks are on the same node but fails if ranks are distributed across multiple nodes.
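
For reference, here is a minimal sketch of the access pattern described above. This is not the actual reproducer (which was shared by email); the message size, iteration count, target choice, and the use of CUDA for the device allocation are assumptions for illustration only.

/* Illustrative sketch of the access pattern described above, not the actual
 * reproducer.  Message size, iteration count, and the use of CUDA for the
 * device allocation are assumptions for illustration only. */
#include <mpi.h>
#include <cuda_runtime.h>   /* any device allocator (e.g. Level Zero on Sunspot) would do */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const size_t msg_sz = 8 * 1024;   /* 8KB: small enough to take the eager path */
    const int niters = 100000;        /* many gets before the closing fence */

    char *win_buf, *get_buf;
    cudaMalloc((void **) &win_buf, msg_sz);
    cudaMalloc((void **) &get_buf, msg_sz);

    MPI_Win win;
    MPI_Win_create(win_buf, (MPI_Aint) msg_sz, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    int target = (rank + 1) % size;   /* with one rank per node, this is a remote target */
    for (int i = 0; i < niters; i++)
        MPI_Get(get_buf, (int) msg_sz, MPI_BYTE, target, 0, (int) msg_sz, MPI_BYTE, win);
    MPI_Win_fence(0, win);            /* the hang is observed in this fence */

    MPI_Win_free(&win);
    cudaFree(win_buf);
    cudaFree(get_buf);
    MPI_Finalize();
    return 0;
}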
@dycz0fx (Contributor, Author) commented Jan 30, 2024

@raffenet @hzhou
Hi Ken and Hui, I have sent the reproducer by email. Would you please point me to the CVARs you mentioned in the meeting so I can give them a try?

@raffenet (Contributor) commented:

I was wondering if there is an issue when you exhaust a "chunk" of GenQ buffers while issuing the gets. You could try increasing the number of buffers per chunk with MPIR_CVAR_CH4_NUM_PACK_BUFFERS_PER_CHUNK.
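
For anyone trying this: MPICH CVARs can be set as environment variables at launch time. The value, launcher arguments, and binary name below are only an example, not a recommended setting:

MPIR_CVAR_CH4_NUM_PACK_BUFFERS_PER_CHUNK=256 mpiexec -n 2 -ppn 1 ./reproducer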

@raffenet (Contributor) commented:

Now reading more closely, the ranks are stuck waiting for active message RMA completions, not the native libfabric ops. I'll have to take another look at the AM code since it differs from the netmod implementation.

@raffenet (Contributor) commented Feb 7, 2024

I reproduced this with MPICH main on Sunspot. Just jotting down some notes after adding printfs to the code. All MPI_Get operations are going through the active message path. For the process that issues the get, a request is created and the local window completion counter is incremented. At request completion (i.e., when the data has been sent back from the target and placed in the user buffer), the local counter is decremented via the completion_notification pointer.

When the code hangs, I observe that the completion notification mechanism is never triggered, meaning the window counter grows and stays where it is, so the fence never returns. When fence is called more frequently, I can observe requests completing and completion_notification being triggered as expected.
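
To make that counting scheme concrete, here is a heavily simplified sketch of the mechanism described above; the struct and function names are illustrative, not MPICH's actual internals.

/* Simplified sketch of the completion-counter scheme described above.
 * Names and types are illustrative, not MPICH's actual internals. */
#include <stdatomic.h>

typedef struct {
    atomic_int local_cmpl_cnts;              /* per-window local completion counter */
} win_state_t;

typedef struct {
    atomic_int *completion_notification;     /* points back at the window counter */
} am_request_t;

/* Origin side: issuing an MPI_Get over the active-message path */
void issue_get(win_state_t *win, am_request_t *req)
{
    atomic_fetch_add(&win->local_cmpl_cnts, 1);           /* one more outstanding op */
    req->completion_notification = &win->local_cmpl_cnts;
}

/* Runs when the response data has landed in the user buffer */
void complete_request(am_request_t *req)
{
    atomic_fetch_sub(req->completion_notification, 1);    /* fence waits for zero */
}

/* MPI_Win_fence then spins in the progress engine until the counter reaches 0;
 * in the hang, complete_request is never reached, so the counter never drops. */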

@raffenet (Contributor) commented Feb 7, 2024

Processes are stuck in an infinite loop here:

MPIDI_OFI_CALL_RETRY_AM(fi_send(MPIDI_OFI_global.ctx[ctx_idx].tx, msg_hdr, total_msg_sz,
The target is trying to send the responses back to the origin, but it is getting EAGAIN. Need to understand why progress is not being made so those messages can get through.
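
For context, a simplified sketch of what that retry-on-EAGAIN pattern amounts to. This is not the actual MPIDI_OFI_CALL_RETRY_AM macro, and poke_progress() is a hypothetical stand-in for whatever drives the progress engine.

/* Simplified sketch of the retry-on-EAGAIN pattern, not the actual
 * MPIDI_OFI_CALL_RETRY_AM macro. */
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_errno.h>

extern void poke_progress(void);   /* hypothetical stand-in for the progress hook */

ssize_t send_with_retry(struct fid_ep *ep, const void *buf, size_t len,
                        fi_addr_t dest, void *context)
{
    ssize_t ret;
    do {
        ret = fi_send(ep, buf, len, NULL, dest, context);
        if (ret == -FI_EAGAIN) {
            /* If this progress call cannot drain completions (or is never
             * effective), the loop spins forever, which matches the hang
             * observed here. */
            poke_progress();
        }
    } while (ret == -FI_EAGAIN);
    return ret;
}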
