This release enables COSMA to take advantage of fast GPU-to-GPU interconnects like NVLink, to efficiently utilize modern Multi-GPU Systems. This is achieved in 2 ways:
Using NCCL/RCCL Libraries: by specifying -DCOSMA_WITH_NCCL=ON cmake option.
Using GPU-aware MPI: by specifying -DCOSMA_WITH_GPU_AWARE_MPI=ON cmake option, as proposed here.
See README and INSTALL for more info on how to build.
In addition, the following performance improvements have been made:
Improved Caching:
all NCCL buffers, MPI communicators and NCCL communicators are cached and reused when appropriate.
all device memory is cached and reused.
Reduced Data Transfers: Tiled-MM, the GPU backend of COSMA, now lets the user leave the resulting matrix C on the GPU. In that case, there is no need to transfer matrix C from device to host, which not only reduces communication but also speeds up the whole CPU->GPU pipeline, as no additional synchronizations are needed. Furthermore, the reduce_scatter operation does not have to wait for C to be transferred back to the host; it is invoked immediately with GPU pointers, thus utilizing the fast inter-GPU links. This way, there are no unnecessary data transfers between CPU and GPU.
All collectives updated: both all-gather and reduce-scatter collectives are improved.
Reduced Data Reshuffling: avoids double reshuffling of data, i.e. the data from NCCL/RCCL GPU buffers is immediately copied into the right layout, without additional reshuffling.
Works for variable blocks: the NCCL/RCCL reduce_scatter operation assumes that all blocks are of the same size and is hence not completely equivalent to MPI_Reduce_scatterv, which we previously used. We padded all the blocks to overcome this issue.
Portability: Supports both NVIDIA and AMD GPUs.
Tiled-MM: Updated submodule
COSTA: Updated submodule
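To illustrate the padding mentioned under "Works for variable blocks", here is a minimal host-only sketch of how a variable-size reduce-scatter can be emulated on top of an equal-block one. It is not COSMA's actual implementation: all function names are hypothetical, padding to the largest block size is an assumption, and the equal-block collective is simulated with plain loops instead of a real NCCL/RCCL call.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: lay out one rank's contribution so that block i starts
// at i * padded. Only the first sizes[i] entries of each block carry data;
// the remainder is zero padding.
std::vector<double> pad_blocks(const std::vector<double>& data,
                               const std::vector<std::size_t>& sizes,
                               std::size_t padded) {
    std::vector<double> out(sizes.size() * padded, 0.0);
    std::size_t offset = 0;
    for (std::size_t i = 0; i < sizes.size(); ++i) {
        assert(sizes[i] <= padded && offset + sizes[i] <= data.size());
        std::copy(data.begin() + offset, data.begin() + offset + sizes[i],
                  out.begin() + i * padded);
        offset += sizes[i];
    }
    return out;
}

// Host-side stand-in for an equal-block reduce-scatter (sum): rank r receives
// the element-wise sum of block r taken over every rank's send buffer.
std::vector<std::vector<double>> reduce_scatter_equal(
    const std::vector<std::vector<double>>& send, std::size_t padded) {
    const std::size_t n_ranks = send.size();
    std::vector<std::vector<double>> recv(n_ranks,
                                          std::vector<double>(padded, 0.0));
    for (std::size_t r = 0; r < n_ranks; ++r)
        for (const auto& s : send)
            for (std::size_t k = 0; k < padded; ++k)
                recv[r][k] += s[r * padded + k];
    return recv;
}

// Variable-size reduce-scatter built from the equal-block one:
// pad, reduce-scatter, then drop the padding on each receiving rank.
// contributions[p] holds rank p's blocks stored back to back;
// sizes[r] is the length of the block destined for rank r.
std::vector<std::vector<double>> reduce_scatter_variable(
    const std::vector<std::vector<double>>& contributions,
    const std::vector<std::size_t>& sizes) {
    assert(!sizes.empty() && sizes.size() == contributions.size());
    // Assumption: pad every block to the size of the largest one.
    const std::size_t padded = *std::max_element(sizes.begin(), sizes.end());
    std::vector<std::vector<double>> send;
    for (const auto& c : contributions)
        send.push_back(pad_blocks(c, sizes, padded));
    auto recv = reduce_scatter_equal(send, padded);
    for (std::size_t r = 0; r < recv.size(); ++r)
        recv[r].resize(sizes[r]);  // keep only the real data
    return recv;
}
```

With two ranks and block sizes {2, 1}, each rank contributes three values; after the call, rank 0 holds the two summed values of the first block and rank 1 holds the single summed value of the second. The cost of this approach is extra buffer space and bandwidth proportional to the difference between the largest block and the others.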
This discussion was created from the release COSMA-v2.6.0.