cuMPI: easy-to-use MPI for CUDA programming

cuMPI binds each process to a GPU device (within one node, or across multiple nodes in a network) and provides high-performance communication between processes (i.e. GPU devices) based on NVIDIA NCCL. As a result, every MPI process can use the compute and memory resources of its own GPU device and communicate with other MPI processes via GPU direct access (supported on recent architectures), as if all compute and communication tasks had moved to the GPU.

With cuMPI, we can do almost everything on the GPU that we could previously do in CPU-side MPI programming. Porting is just a matter of adding a few necessary variables and replacing MPI_* calls with cuMPI_* calls!

Supported API

So far, cuMPI supports the most commonly used MPI methods, e.g.

  • Allreduce
  • SendRecv
  • Broadcast
  • Barrier

The following management APIs are also fully supported (a minimal usage sketch follows the list):

  • MPI_Init
  • MPI_Finalize
  • MPI_Comm_size
  • MPI_Comm_rank
  • MPI_Initialized

Usage

Build

git clone git@github.com:ueqri/cuMPI.git
mkdir -p cuMPI/build
cd cuMPI/build
cmake -DCUDA_ARCH_CODE="sm_70" \
      -DCUDA_ARCH_COMPUTE="compute_70" \
      ..
make -j

Note: replace sm_70 and compute_70 with the values for your GPU's compute capability, which you can look up at https://developer.nvidia.com/cuda-gpus. For example, for an A100 (compute capability 8.0) pass sm_80 and compute_80.

Install

The framework is built as a shared library, so installation depends on how you intend to use cuMPI.

  • If you want to use libcuMPI.so locally, copy the artifact /path/to/cuMPI/build/src/libcuMPI.so to the directory of your choice and skip this step.

  • Otherwise, enter /path/to/cuMPI/build and run make install, which by default installs cuMPI_runtime.h to /usr/local/include and libcuMPI.so to /usr/local/lib.

Integration

  • Step 0: check that the MPI_Datatype and MPI_Op values and the functions used in your existing MPI code are supported by cuMPI
  • Step 1: replace every MPI_* call with its cuMPI_* counterpart (see the sketch after this list)
  • Step 2: add the necessary configuration to your CMakeLists.txt or Makefile, following the CMake files in the benchmark or test directory:
    • find the NCCL, CUDA, and MPI packages
    • link libcuMPI.so to your executable
  • Step 3: launch with mpirun or mpiexec and specify the right number of slots (i.e. the number of GPU devices) on a single node or across multiple nodes
  • Step 4: run the all-in-GPU tasks
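
To make Steps 1 and 4 concrete, the sketch below shows the general shape of a ported all-in-GPU task: a CUDA kernel updates data in device memory, and a cuMPI_Allreduce combines the results across ranks without copying anything back to the host. The kernel is hypothetical, and the argument list of cuMPI_Allreduce (device pointers, count, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) is assumed to follow MPI_Allreduce; adjust it to the prototype in cuMPI_runtime.h.

// Porting sketch (assumed signatures): compute and communication both stay on
// the GPU. cuMPI_Allreduce is assumed to take the same arguments as
// MPI_Allreduce but with device pointers.
#include <cuda_runtime.h>
#include <mpi.h>
#include "cuMPI_runtime.h"

// Hypothetical kernel: each rank squares its local chunk in place.
__global__ void squareLocalChunk(double *x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] = x[i] * x[i];
}

int main(int argc, char **argv) {
  cuMPI_Init(&argc, &argv);

  const int n = 1 << 20;
  double *d_local = nullptr, *d_sum = nullptr;
  cudaMalloc(&d_local, n * sizeof(double));
  cudaMalloc(&d_sum, n * sizeof(double));
  cudaMemset(d_local, 0, n * sizeof(double));

  for (int iter = 0; iter < 10; ++iter) {
    // Step 4: compute on this rank's own GPU ...
    squareLocalChunk<<<(n + 255) / 256, 256>>>(d_local, n);
    cudaDeviceSynchronize();

    // Step 1: the former MPI_Allreduce call, now cuMPI_Allreduce on device memory.
    cuMPI_Allreduce(d_local, d_sum, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  }

  cudaFree(d_local);
  cudaFree(d_sum);
  cuMPI_Finalize();
  return 0;
}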

Benchmark

The benchmark in benchmark/CocurrentSendRecv.cu uses concurrent SendRecv communication to measure the bandwidth between two GPUs (a sketch of the idea follows the run commands below).

  1. If the two GPUs are in one node, the benchmark shows the estimated bandwidth of the PCIe or NVLink P2P connection. You can inspect the GPU topology within a single node with nvidia-smi topo -m.

  2. If the two GPUs are in different nodes, the benchmark mainly shows the bandwidth of the InfiniBand or Ethernet connection, provided the NVLink or PCIe bandwidth is not the bottleneck.

To run the benchmark:

cd /path/to/cuMPI/build/benchmark
# Test the connection bandwidth between two GPUs in a single node
mpirun -np 2 ./cuMPI-benchmark
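
The idea behind such a measurement is simply bytes moved divided by elapsed time. The sketch below is not the actual benchmark source; it assumes a cuMPI_SendRecv with the same argument list as MPI_Sendrecv (including MPI_CHAR and MPI_STATUS_IGNORE) operating on device buffers, and times a number of two-way exchanges between two ranks to estimate bandwidth.

// Bandwidth-measurement sketch (not the real benchmark): assumes
// cuMPI_SendRecv mirrors MPI_Sendrecv but operates on device buffers.
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>
#include <mpi.h>
#include "cuMPI_runtime.h"

int main(int argc, char **argv) {
  cuMPI_Init(&argc, &argv);

  int rank = 0;
  cuMPI_Comm_rank(MPI_COMM_WORLD, &rank);
  const int peer = 1 - rank;            // run with exactly 2 ranks

  const int count = 64 << 20;           // 64 MiB per direction (MPI_CHAR elements)
  char *d_send = nullptr, *d_recv = nullptr;
  cudaMalloc(&d_send, count);
  cudaMalloc(&d_recv, count);

  const int iters = 50;
  cuMPI_Barrier(MPI_COMM_WORLD);
  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) {
    cuMPI_SendRecv(d_send, count, MPI_CHAR, peer, 0,
                   d_recv, count, MPI_CHAR, peer, 0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }
  cuMPI_Barrier(MPI_COMM_WORLD);
  double sec = std::chrono::duration<double>(
      std::chrono::steady_clock::now() - t0).count();

  // Each iteration moves `count` bytes in each direction on every rank.
  double gbps = (2.0 * count * iters) / sec / 1e9;
  if (rank == 0) printf("estimated bandwidth: %.2f GB/s\n", gbps);

  cudaFree(d_send);
  cudaFree(d_recv);
  cuMPI_Finalize();
  return 0;
}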

Test

Tests are provided for Allreduce, SendRecv, and Broadcast.

Run the following commands to test:

cd /path/to/cuMPI/build/test
mpirun -np 2 ./cuMPI-testBcast
mpirun -np 2 ./cuMPI-testSendRecv
mpirun -np 2 ./cuMPI-testAllReduce

As a next step, we plan to add GoogleTest to modernize the test suite.
