Add stream operations to accelerator components #12356

devreal · 2024-02-20T21:27:21Z

First batch of changes from #12318 (offloading of reductions to devices).

This PR adds stream operations to the accelerator components:

~~Default stream~~
Stream-based alloc and free
Stream-based memmove
Wait for stream to complete

Also, enable querying for number of devices and memory bandwidth.

This PR is missing implementations for the ze component because I haven't had time to dig into that. Maybe someone familiar with that API can contribute the implementation? Otherwise I will need to find some time in the coming week(s) to implement them myself (the ze component didn't exist when I made these changes).

hppritcha · 2024-02-20T21:50:26Z

Could you please add stub functions for the ZE component?

devreal · 2024-02-20T22:04:17Z

@hppritcha I added stubs returning OPAL_ERR_NOT_IMPLEMENTED to the ze component.

edgargabriel

just a few minor comments/requests, nothing major, looks good overall.

opal/mca/accelerator/accelerator.h

opal/mca/accelerator/rocm/accelerator_rocm_component.c

opal/mca/accelerator/rocm/accelerator_rocm_module.c

opal/mca/accelerator/rocm/accelerator_rocm_component.c

Akshay-Venkatesh

Hi @devreal . Some high-level questions:

I don't know the motivation behind the need for stream-ordered allocations for reductions. Can you explain the high-level picture?
How does MPI user pass streams to the MPI implementation? Without that ability, I'm not sure how stream-ordered memmove operations in accelerator component is taken advantage of.

opal/mca/accelerator/cuda/accelerator_cuda.c

opal/mca/accelerator/cuda/accelerator_cuda_component.c

opal/mca/accelerator/accelerator.h

opal/mca/accelerator/cuda/accelerator_cuda.c

hppritcha · 2024-05-02T15:33:15Z

@devreal could you rebase on top of head of main to pull in the CI ZE compiler sanity check?

edgargabriel

LGTM

Akshay-Venkatesh · 2024-05-14T18:28:11Z

I don't know the motivation behind the need for stream-ordered allocations for reductions. Can you explain the high-level picture?

How does MPI user pass streams to the MPI implementation? Without that ability, I'm not sure how stream-ordered memmove operations in accelerator component is taken advantage of.

Hi @devreal No more comments on the PR changes but I didn't get an answer for the above questions. Apologies if I've missed the response somewhere.

devreal · 2024-05-23T00:23:04Z

@Akshay-Venkatesh Sorry, I only replied to your comments inline.

I don't know the motivation behind the need for stream-ordered allocations for reductions. Can you explain the high-level picture?

For operations, we may need to allocate temporary device buffers. Maybe we can work around that eventually but it probably doesn't hurt to have that functionality available.

How does MPI user pass streams to the MPI implementation? Without that ability, I'm not sure how stream-ordered memmove operations in accelerator component is taken advantage of.

There is no infrastructure for that yet (aside from some research projects). We can take advantage of it in the collective reduction operations though.

bosilca

Based on the content of this PR I cannot figure out why this functionality is needed. What is the grand scheme in which we need to :

expose the memory bandwidth of a device ?
have an asynchronous memmove ?
have an explicit synchronization routine instead of relying on existing mechanisms (create / record / and wait for an event to complete) ?

devreal · 2024-05-23T15:01:24Z

The context is in the referenced PR (#12318) but that's a monster so it might not have been clear.

Based on the content of this PR I cannot figure out why this functionality is needed. What is the grand scheme in which we need to :

expose the memory bandwidth of a device ?

That is needed to decide whether to fetch the data to the host and perform the operator there or to submit a kernel. This will be part of a separate PR that has yet to come. I was hoping to get the basics in first but I can post all PRs and we synchronize across.

have an asynchronous memmove ?

See #12570. We could do without (i.e., use blocking memmove instead) and I'm not even sure how often memmove is actually used there but for the purpose of completeness I thought I'd be the right thing.

have an explicit synchronization routine instead of relying on existing mechanisms (create / record / and wait for an event to complete) ?

See #12570. Not sure how useful events would be there. Sounds like more overhead to create an event than to simply synchronize the stream. That's used to submit multiple data transfers and synchronizing once instead copying piecemeal.

edgargabriel · 2024-06-04T13:19:21Z

it would be good if we could have this PR merged in the near future, since this would simplify evaluating/testing the subsequent PRs

janjust · 2024-06-11T15:36:44Z

Will merge after resolve

- Stream-based alloc and free - Stream-based memmove - Wait for stream to complete Also, enable querying for number of devices and memory bandwidth. These operations are needed for operation device offloading. Co-authored-by: Phuong Nguyen <[email protected]> Signed-off-by: Joseph Schuchart <[email protected]>

devreal · 2024-06-12T22:57:03Z

Rebased and fixed conflicts.

devreal added the Target: main label Feb 20, 2024

devreal requested review from bosilca, janjust and edgargabriel February 20, 2024 21:27

devreal marked this pull request as draft February 20, 2024 21:27

devreal force-pushed the opal_accelerator_stream_ops branch from 3d87780 to ead2c40 Compare February 20, 2024 22:03

devreal force-pushed the opal_accelerator_stream_ops branch 2 times, most recently from df07a9c to 236af1a Compare February 20, 2024 22:29

edgargabriel requested changes Feb 21, 2024

View reviewed changes

devreal force-pushed the opal_accelerator_stream_ops branch 2 times, most recently from 1c4d11b to 529a68d Compare February 25, 2024 15:33

edgargabriel reviewed Feb 26, 2024

View reviewed changes

opal/mca/accelerator/rocm/accelerator_rocm_component.c Show resolved Hide resolved

devreal force-pushed the opal_accelerator_stream_ops branch 2 times, most recently from ebad01d to 97dbb09 Compare February 26, 2024 14:55

Akshay-Venkatesh reviewed Apr 9, 2024

View reviewed changes

opal/mca/accelerator/cuda/accelerator_cuda.c Outdated Show resolved Hide resolved

opal/mca/accelerator/cuda/accelerator_cuda.c Show resolved Hide resolved

opal/mca/accelerator/cuda/accelerator_cuda_component.c Show resolved Hide resolved

wenduwan reviewed Apr 17, 2024

View reviewed changes

opal/mca/accelerator/accelerator.h Outdated Show resolved Hide resolved

opal/mca/accelerator/cuda/accelerator_cuda.c Show resolved Hide resolved

devreal force-pushed the opal_accelerator_stream_ops branch 2 times, most recently from 0765a9e to 7de3f68 Compare April 24, 2024 21:40

devreal marked this pull request as ready for review April 24, 2024 21:41

devreal requested review from Akshay-Venkatesh and edgargabriel April 24, 2024 21:41

janjust removed their request for review April 24, 2024 21:46

devreal force-pushed the opal_accelerator_stream_ops branch from 7de3f68 to f3133b4 Compare April 25, 2024 14:15

wenduwan approved these changes Apr 25, 2024

View reviewed changes

devreal force-pushed the opal_accelerator_stream_ops branch 2 times, most recently from f3133b4 to 26185d6 Compare April 25, 2024 18:23

devreal force-pushed the opal_accelerator_stream_ops branch from 26185d6 to c13779f Compare May 2, 2024 15:40

edgargabriel approved these changes May 14, 2024

View reviewed changes

devreal mentioned this pull request May 23, 2024

Add CUDA/HIP implementations of reduction operators #12569

Open

bosilca reviewed May 23, 2024

View reviewed changes

devreal mentioned this pull request May 23, 2024

Make datatype copy stream-aware #12570

Open

devreal mentioned this pull request Jun 10, 2024

Add accelerator-awareness to most allreduce implementations #12611

Open

devreal force-pushed the opal_accelerator_stream_ops branch 2 times, most recently from a3c8443 to 1d23d06 Compare June 12, 2024 15:27

devreal force-pushed the opal_accelerator_stream_ops branch from 1d23d06 to 684ff97 Compare June 12, 2024 15:58

devreal merged commit 5e0be56 into open-mpi:main Jun 12, 2024
12 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add stream operations to accelerator components #12356

Add stream operations to accelerator components #12356

devreal commented Feb 20, 2024 •

edited

hppritcha commented Feb 20, 2024

devreal commented Feb 20, 2024

edgargabriel left a comment

Akshay-Venkatesh left a comment

hppritcha commented May 2, 2024

edgargabriel left a comment

Akshay-Venkatesh commented May 14, 2024

devreal commented May 23, 2024

bosilca left a comment

devreal commented May 23, 2024

edgargabriel commented Jun 4, 2024

janjust commented Jun 11, 2024

devreal commented Jun 12, 2024

Add stream operations to accelerator components #12356

Add stream operations to accelerator components #12356

Conversation

devreal commented Feb 20, 2024 • edited

hppritcha commented Feb 20, 2024

devreal commented Feb 20, 2024

edgargabriel left a comment

Choose a reason for hiding this comment

Akshay-Venkatesh left a comment

Choose a reason for hiding this comment

hppritcha commented May 2, 2024

edgargabriel left a comment

Choose a reason for hiding this comment

Akshay-Venkatesh commented May 14, 2024

devreal commented May 23, 2024

bosilca left a comment

Choose a reason for hiding this comment

devreal commented May 23, 2024

edgargabriel commented Jun 4, 2024

janjust commented Jun 11, 2024

devreal commented Jun 12, 2024

devreal commented Feb 20, 2024 •

edited