resplit with Custom MPI Datatypes and AlltoAllW #1493

Open

wants to merge 22 commits into main
Conversation

@JuanPedroGHM (Member) commented May 23, 2024

Due Diligence

  • General:
  • Implementation:
    • unit tests: all split configurations tested
    • unit tests: multiple dtypes tested
    • documentation updated where needed

Description

Rewrote most of resplit to use Alltoallw and custom MPI datatypes.

Issue/s resolved: #

Changes proposed:

  • MPICommunicator
    • Alltoallw operation that mimics the MPI Alltoallw interface (see the sketch after this list).
    • mpi_type_of class method for ease of use
    • _create_recursive_vector to handle subarray datatype creation for non-contiguous send buffers.
  • Manipulations
    • _axis2axis method to handle all non-trivial resplits
  • Tiling
    • get_subarray_params method to calculate MPI subarray type params
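
For readers new to the pattern, here is a minimal, self-contained sketch of the core idea (illustrative only, not Heat's actual implementation; the shapes and variable names are made up): redistributing a 2D float64 array from a row split to a column split with a single Alltoallw call, where MPI subarray datatypes describe the non-contiguous pieces so that no manual packing loops are needed.

# Illustrative sketch (not Heat's code): row split -> column split of a 2D
# float64 array with one Alltoallw call and MPI subarray datatypes.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

gshape = (size * 4, size * 6)                 # global shape, divisible for simplicity
lrows, lcols = gshape[0] // size, gshape[1] // size

# local chunk of the row-split (split=0) array
full = np.arange(gshape[0] * gshape[1], dtype=np.float64).reshape(gshape)
send = full[rank * lrows:(rank + 1) * lrows, :].copy()
recv = np.empty((gshape[0], lcols), dtype=np.float64)   # column-split (split=1) result

send_types, recv_types = [], []
for p in range(size):
    # the piece of my row block that belongs to rank p's column block
    st = MPI.DOUBLE.Create_subarray(send.shape, (lrows, lcols), (0, p * lcols))
    st.Commit()
    send_types.append(st)
    # where rank p's contribution lands inside my column block
    rt = MPI.DOUBLE.Create_subarray(recv.shape, (lrows, lcols), (p * lrows, 0))
    rt.Commit()
    recv_types.append(rt)

counts = [1] * size
displs = [0] * size        # offsets are already encoded in the subarray types
comm.Alltoallw([send, counts, displs, send_types],
               [recv, counts, displs, recv_types])

for t in send_types + recv_types:
    t.Free()

assert np.array_equal(recv, full[:, rank * lcols:(rank + 1) * lcols])

In the PR itself, the subarray parameters come from the tiling layer (get_subarray_params) and the datatype handling lives in MPICommunicator; the sketch only illustrates the communication pattern.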

Type of change

  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Memory requirements

Coming soon...

Performance

Coming soon...

Does this change modify the behaviour of other functions? If so, which?

Yes.

Probably most high-level ones.

@JuanPedroGHM JuanPedroGHM self-assigned this May 23, 2024
@ClaudiaComito ClaudiaComito changed the title from "Reshape with Custom MPI Datatypes and AlltoAllW" to "resplit with Custom MPI Datatypes and AlltoAllW" May 23, 2024
@JuanPedroGHM JuanPedroGHM marked this pull request as ready for review June 5, 2024 19:55
@JuanPedroGHM JuanPedroGHM added the MPI and communication labels Jun 5, 2024
@JuanPedroGHM (Member, Author) commented Jun 6, 2024

Some results from Horeka using the code below. With the old resplit, doubling the size of the array overflowed the GPU memory (H100 with 94 GB of memory), and when trying on 4 nodes (16 GPUs) it hit the 1-hour walltime.

from mpi4py import MPI
import heat as ht
import argparse
import perun
import torch

from heat.core.communication import CUDA_AWARE_MPI
print(f"CUDA_AWARE_MPI: {CUDA_AWARE_MPI}")

@perun.monitor()
def cpu_contiguous(a):
    a = a.resplit(4)
    a.resplit_(3)


@perun.monitor()
def cpu_noncontiguous(a):
    a = a.resplit(0)
    a.resplit_(2)


@perun.monitor()
def gpu_contiguous(a):
    a = a.resplit(4)
    a.resplit_(3)

@perun.monitor()
def gpu_noncontiguous(a):
    a = a.resplit(0)
    a.resplit_(2)

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()
    
    # Contiguous data creation
    shape = [100, 50, 50, 20, 250]
    n_elements = ht.array(shape).prod().item()
    Mem = n_elements * 8 / 1e9
    base_array = torch.arange(0, n_elements, dtype=torch.float64).reshape(shape) * (rank+1)
    
    print(f"Rank {rank} - Local Shape: {shape} - Memory: {Mem * size} GB - Per rank: {Mem} GB")
    
    # CPU contiguous data
    print("CPU contiguous data")
    a = ht.array(base_array, dtype=ht.float64, is_split=0, copy=True)
    print(f"Rank {rank} - Shape: {a.shape} - Split: {a.split} - Lshape: {a.lshape} - Device: {a.device}")
    cpu_contiguous(a)
    del a
    
    # CPU non-contiguous data
    print("CPU non-contiguous data")
    a = ht.array(base_array, dtype=ht.float64, is_split=1, copy=True).transpose((1,0,4,3,2))
    print(f"Rank {rank} - Shape: {a.shape} - Split: {a.split} - Lshape: {a.lshape} - Device: {a.device}")
    cpu_noncontiguous(a)
    del a

    # GPU contiguous data
    print("GPU contiguous data")
    a = ht.array(base_array, dtype=ht.float64, device="cuda", is_split=0, copy=True)
    print(f"Rank {rank} - Shape: {a.shape} - Split: {a.split} - Lshape: {a.lshape} - Device: {a.device}")
    gpu_contiguous(a)
    del a
    
    # GPU non-contiguous data
    print("GPU non-contiguous data")
    a = ht.array(base_array, dtype=ht.float64, device="cuda", is_split=1, copy=True).transpose((1,0,4,3,2))
    print(f"Rank {rank} - Shape: {a.shape} - Split: {a.split} - Lshape: {a.lshape} - Device: {a.device}")
    gpu_noncontiguous(a)
    del a

    torch.cuda.empty_cache()
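
For context, this runs with one MPI rank per GPU. On a SLURM system like Horeka the launch would look roughly like the line below (the node/task counts and the script name are illustrative, not the exact job configuration used for these measurements):

srun -N 1 --ntasks-per-node=4 --gpus-per-node=4 python resplit_benchmark.py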

Runtime, Memory, and GPU Memory plots attached as images (not reproduced here).

@ClaudiaComito (Contributor)

And the runtime axis is logarithmic? 😁 Fantastic @JuanPedroGHM !


codecov bot commented Jun 7, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.78%. Comparing base (6f5fa1f) to head (48939e1).

Current head 48939e1 differs from the pull request's most recent head c083678.

Please upload reports for the commit c083678 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1493      +/-   ##
==========================================
- Coverage   91.80%   91.78%   -0.02%     
==========================================
  Files          80       80              
  Lines       11772    11810      +38     
==========================================
+ Hits        10807    10840      +33     
- Misses        965      970       +5     
Flag: unit, Coverage: 91.78% <100.00%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown.


@ClaudiaComito (Contributor) left a comment

@JuanPedroGHM I will review next week, but in the meantime - was resplit ever tested on column-major arrays? To me it looks like we forgot that one. If so, would you add a test for DNDarrays with order="F"? There are some examples in test_dndarray.test_stride_and_strides.

Thanks again for the fantastic job!
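
For what it's worth, a rough sketch of what such a test could look like (hypothetical; it assumes the order="F" keyword of ht.array used in test_stride_and_strides and would sit inside the existing resplit test class — names and shapes are illustrative):

import numpy as np
import heat as ht

def test_resplit_column_major(self):
    # hypothetical sketch: resplit a column-major (order="F") DNDarray across
    # several axes and compare against a freshly split reference
    data = np.arange(4 * 5 * 6, dtype=np.float64).reshape(4, 5, 6)
    a = ht.array(data, split=0, order="F")
    for new_split in (1, 2, None):
        b = ht.resplit(a, new_split)
        reference = ht.array(data, split=new_split, order="F")
        self.assertEqual(b.split, new_split)
        self.assertTrue(ht.equal(b, reference))
        # optionally, if preserving the memory layout is part of the contract:
        # self.assertEqual(b.larray.stride(), reference.larray.stride())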

@ClaudiaComito (Contributor) left a comment

@JuanPedroGHM thank you so much for this. I have a few comments, mostly from the point of view of maintainability. Great job!

heat/core/communication.py (3 outdated review threads, resolved)
Comment on lines 1472 to 1475
sendbuf: Union[DNDarray, torch.Tensor, Any]
Buffer address of the send message
recvbuf: Union[DNDarray, torch.Tensor, Any]
Buffer address where to store the result
Contributor:

Can we give more details on how sendbuf and recvbuf should be constructed? Related: #1072 (which we haven't addressed so far)

Member Author:

That is definitely a big problem throughout the communication.py file. I guess it was made like this originally to support _alltoall_like and similar methods that need a flexible function signature.

For me, what we need is a more consistent communication interface, for example one where every communication buffer is a tuple of a torch tensor and a collection of views into that buffer that we use to define the datatypes (rough sketch below). What do you think?

For now, I'll expand on the actual contents of the buffer in the docstring.
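
Purely as a strawman for that discussion (hypothetical, not existing Heat code; CommBuffer and its fields are made-up names), such an interface could look roughly like this:

# Hypothetical strawman (not Heat's API): bundle the backing torch tensor
# with the per-rank datatypes/counts/displacements that describe it.
from dataclasses import dataclass
from typing import List
import torch
from mpi4py import MPI

@dataclass
class CommBuffer:
    tensor: torch.Tensor            # contiguous backing storage
    datatypes: List[MPI.Datatype]   # one derived datatype per rank
    counts: List[int]               # usually all 1 when the types encode the layout
    displacements: List[int]        # byte offsets into the storage

    def as_msg(self):
        # the [buf, counts, displs, types] form expected by mpi4py's Alltoallw
        nbytes = self.tensor.element_size() * self.tensor.nelement()
        raw = MPI.memory.fromaddress(self.tensor.data_ptr(), nbytes)
        return [raw, self.counts, self.displacements, self.datatypes]

Something along these lines would make the expected buffer contents explicit in the signature instead of having to document them per method.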


Notes
-----
This function creates a recursive vector datatype by defining vectors out of the previous datatype with specified strides and sizes. The extent of the new datatype is set to the extent of the basic datatype to allow interweaving of data.
Contributor:

stupid question for sure, but what is the "extent of a datatype"?

Member Author:

It is a weird name for the length, i.e. how many bytes, the datatype spans. Every MPI datatype has a symbolic extent, which we use to read non-contiguous data in the right order, and a true extent. More here: https://enccs.github.io/intermediate-mpi/derived-datatypes-pt2/
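
A tiny mpi4py illustration of the difference (standalone example, not taken from the PR): a strided vector of 3 doubles naturally spans 5 doubles, but resizing its extent to a single double lets consecutive instances of the type interleave.

from mpi4py import MPI

basic = MPI.DOUBLE
lb, ext = basic.Get_extent()            # ext == 8 bytes for a double

# every other element of a row: 3 doubles with stride 2
vec = basic.Create_vector(3, 1, 2)      # count=3, blocklength=1, stride=2
print(vec.Get_extent())                 # (0, 40): the type spans 5 doubles
print(vec.Get_true_extent())            # (0, 40): bytes actually touched

# shrink only the *extent* to one double, so the "next" instance of the type
# starts a single element further along and the pieces interleave
vec_interleaved = vec.Create_resized(0, ext)
print(vec_interleaved.Get_extent())     # (0, 8)

vec.Free()
vec_interleaved.Free()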

heat/core/tests/test_dndarray.py (outdated review thread, resolved)
heat/core/tiling.py (outdated review thread, resolved)
heat/core/communication.py (outdated review thread, resolved)
Comment on lines +1481 to 1483
recv_buffer = torch.empty(
tuple(new_lshape), dtype=self.dtype.torch_type(), device=self.device.torch_device
)
Contributor:

Is this still in-place? 😬 Not sure this function ever was in place, really. Expand docs?

What's the difference in memory usage between array.resplit_(newaxis) and array = ht.resplit(array, newaxis)?

Member Author:

Well, it is the same kind of in-place resplit as before, which is not really in place with respect to the larray: the DNDarray object is reused and all its other properties are rewritten, but a new larray is created.

We could keep the same larray if the incoming and outgoing data have the same length, but it would have to be written in a non-contiguous way, which is not impossible, but not ideal in my opinion.

@@ -3537,34 +3537,15 @@ def resplit(arr: DNDarray, axis: int = None) -> DNDarray:
gathered, is_split=axis, device=arr.device, comm=arr.comm, dtype=arr.dtype
)
return new_arr
arr_tiles = tiling.SplitTiles(arr)

Contributor:

Should we update the docstring here, and mention the Dalcin paper?

Member Author:

Do you mean the FFT paper? I would reference it inside Alltoallw, since much of the relevant code landed there, or in tiling.py.

Labels: benchmark PR, communication, MPI (anything related to MPI communication)
Projects: none yet
Linked issues: none yet
2 participants