can - A simple dense matrix-matrix multiplication benchmark.

There are serial, Intel MKL dgemm(), OpenMP, MPI, hybrid(MPI+OpenMP), and hybrid(MPI+OpenACC) versions.
The MPI version is based on Cannon's algorithm. The Intel compiler and the Intel MKL library are needed.
The input matrices are filled with pseudorandom numbers generated by the Intel MKL Mersenne Twister (MT19937) generator.
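
As a rough illustration, the input generation done by ./create_input could look like the sketch below, which draws uniform pseudorandom numbers from the MKL VSL MT19937 generator. The seed, array layout, and output file names here are assumptions for illustration, not the actual create_input source.

  include 'mkl_vsl.f90'

  program create_input_sketch
    use mkl_vsl_type
    use mkl_vsl
    implicit none
    integer, parameter :: imax = 4096                 ! assumption: must match param.f
    double precision, allocatable :: a(:), b(:)       ! imax x imax matrices, column-major
    type(vsl_stream_state) :: stream
    integer :: ierr

    allocate(a(imax*imax), b(imax*imax))
    ierr = vslnewstream(stream, vsl_brng_mt19937, 5489)   ! seed value is an assumption
    ierr = vdrnguniform(vsl_rng_method_uniform_std, stream, imax*imax, a, 0.d0, 1.d0)
    ierr = vdrnguniform(vsl_rng_method_uniform_std, stream, imax*imax, b, 0.d0, 1.d0)
    ierr = vsldeletestream(stream)
    ! hypothetical output files; the real create_input may use a different format
    open(10, file='a.dat', form='unformatted', access='stream')
    write(10) a
    close(10)
    open(11, file='b.dat', form='unformatted', access='stream')
    write(11) b
    close(11)
  end program create_input_sketch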

  • binary names:

    • serial: seri
    • OpenMP: omp
    • Intel MKL dgemm(): dgemm
    • MPI: can
    • hybrid(MPI+OpenMP): can_hyb
    • hybrid(MPI+OpenACC): can_acc
  • matrix size: imax x imax (set in param.f)

  • Some notes for the MPI and hybrid versions (see the sketch after this list):

    • sqrt(np) must be an integer (np is the number of MPI processes).
    • imax/sqrt(np) must be an integer.
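
For reference, the overall structure of Cannon's algorithm as used by can/can_hyb is sketched below: the np processes form a periodic sqrt(np) x sqrt(np) grid, each rank holds an (imax/sqrt(np))^2 block of A, B, and C, and after an initial skew the blocks are multiplied and circularly shifted sqrt(np) times. This is a simplified sketch, not the actual source; the local multiply is written with MKL dgemm() for brevity, and I/O and timing are omitted.

  program cannon_sketch
    use mpi
    implicit none
    integer, parameter :: nb = 1024       ! local block size = imax/sqrt(np) (assumption)
    double precision, allocatable :: a(:,:), b(:,:), c(:,:)
    integer :: comm2d, rank, np, q, dims(2), coords(2)
    logical :: periods(2)
    integer :: src, dst, k, ierr

    call mpi_init(ierr)
    call mpi_comm_size(mpi_comm_world, np, ierr)
    q = nint(sqrt(dble(np)))              ! sqrt(np) must be an integer
    dims = (/ q, q /)
    periods = .true.                      ! circular shifts need a periodic grid
    call mpi_cart_create(mpi_comm_world, 2, dims, periods, .true., comm2d, ierr)
    call mpi_comm_rank(comm2d, rank, ierr)
    call mpi_cart_coords(comm2d, rank, 2, coords, ierr)

    allocate(a(nb,nb), b(nb,nb), c(nb,nb))
    ! ... read/scatter the local blocks of A and B here ...
    c = 0.d0

    ! initial alignment: shift row i of A left by i, column j of B up by j
    call mpi_cart_shift(comm2d, 1, -coords(1), src, dst, ierr)
    call mpi_sendrecv_replace(a, nb*nb, mpi_double_precision, dst, 0, src, 0, &
         comm2d, mpi_status_ignore, ierr)
    call mpi_cart_shift(comm2d, 0, -coords(2), src, dst, ierr)
    call mpi_sendrecv_replace(b, nb*nb, mpi_double_precision, dst, 1, src, 1, &
         comm2d, mpi_status_ignore, ierr)

    ! sqrt(np) steps: local multiply-accumulate, then shift A left and B up by one
    do k = 1, q
       call dgemm('n', 'n', nb, nb, nb, 1.d0, a, nb, b, nb, 1.d0, c, nb)
       call mpi_cart_shift(comm2d, 1, -1, src, dst, ierr)
       call mpi_sendrecv_replace(a, nb*nb, mpi_double_precision, dst, 0, src, 0, &
            comm2d, mpi_status_ignore, ierr)
       call mpi_cart_shift(comm2d, 0, -1, src, dst, ierr)
       call mpi_sendrecv_replace(b, nb*nb, mpi_double_precision, dst, 1, src, 1, &
            comm2d, mpi_status_ignore, ierr)
    end do

    ! ... gather/write the local blocks of C, print the trace, etc. ...
    call mpi_finalize(ierr)
  end program cannon_sketch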

how to run

  • Intel compiler and Intel MPI are required.
$ make
$ ./create_input

$ ./seri
or
$ ./omp
or
$ ./dgemm
or
$ mpirun -np $NP ./can
or
$ mpirun -np $NP ./can_hyb

performance comparison (matrix size: 4096x4096; Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz, 14 cores/socket, 2 sockets/node, 4 nodes; Intel OPA interconnect):

  • serial
$ ./seri
 serial time:   6.99500107765198        19.6481675908668      Gflops
 trace:   4196462.48061815
  • MKL dgemm() (single thread)
$ MKL_NUM_THREADS=1 ./dgemm
 dgemm time:   3.69211506843567        37.2249918879782      Gflops
 trace:   4196462.48061815
  • MKL dgemm() (28 threads)
$ MKL_NUM_THREADS=28 KMP_AFFINITY=compact ./dgemm
 dgemm time:   1.08629608154297        126.520711808868      Gflops
 trace:   4196462.48061815
  • OpenMP (28 threads)
$ OMP_NUM_THREADS=28 KMP_AFFINITY=compact ./omp
 omp time:  0.852473020553589        161.223816071913      Gflops
 trace:   4196462.48061815
  • MPI
$ mpiexec.hydra -ppn 16 -np 64 ./can
 MPI time:  0.405706882476807        338.764165480622      Gflops
 trace:   4196462.48061815
  • hybrid(MPI+OpenMP)
$ OMP_NUM_THREADS=$((28/4)) KMP_AFFINITY=compact mpiexec.hydra -ppn 4 -np 16 ./can_hyb
 MPI time:  0.325567960739136        422.151347939683      Gflops
 trace:   4196462.48061815
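
(The Gflops figures are presumably derived from the standard 2*imax^3 operation count of a dense matrix-matrix multiplication; e.g. for the serial run above, 2*4096^3 / 6.995 s ≈ 19.6 Gflops, which matches the printed value.)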

check the results

$ ./check c.seri c.dgemm
 maximum error:  9.094947017729282E-012
$ ./check c.seri c.omp
 maximum error:  0.000000000000000E+000
$ ./check c.seri c.can
 maximum error:  1.409716787748039E-011
$ ./check c.seri c.can_hyb
 maximum error:  1.409716787748039E-011
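
The check utility presumably loads the two result matrices and reports the largest elementwise difference, along the lines of the sketch below. The file format (a raw stream dump of the imax x imax result) is an assumption.

  program check_sketch
    implicit none
    integer, parameter :: imax = 4096               ! assumption: must match param.f
    double precision, allocatable :: c1(:), c2(:)
    character(len=256) :: f1, f2

    allocate(c1(imax*imax), c2(imax*imax))
    call get_command_argument(1, f1)
    call get_command_argument(2, f2)
    open(10, file=trim(f1), form='unformatted', access='stream')
    read(10) c1
    close(10)
    open(11, file=trim(f2), form='unformatted', access='stream')
    read(11) c2
    close(11)
    print *, 'maximum error:', maxval(abs(c1 - c2))
  end program check_sketch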

MPI+OpenACC version

The PGI compiler, OpenMPI, and Intel MKL are required.
The CPU and interconnect are the same as in the runs above; each node has 4 NVIDIA P100 GPUs.
GPUDirect is used.
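
The local block multiply in can_acc plausibly looks something like the OpenACC sketch below; the loop structure and data clauses are assumptions, not the actual kernel. With PSM2_CUDA/PSM2_GPUDIRECT enabled, the MPI shifts of Cannon's algorithm can pass device buffers straight to MPI (e.g. by wrapping the send/receive calls in !$acc host_data use_device(...)), so the blocks need not be staged through host memory.

  ! simplified OpenACC local block multiply (sketch only)
  subroutine block_mm_acc(nb, a, b, c)
    implicit none
    integer, intent(in) :: nb
    double precision, intent(in)    :: a(nb,nb), b(nb,nb)
    double precision, intent(inout) :: c(nb,nb)
    integer :: i, j, k
    double precision :: tmp

    !$acc parallel loop collapse(2) private(tmp) copyin(a, b) copy(c)
    do j = 1, nb
       do i = 1, nb
          tmp = 0.d0
          !$acc loop seq
          do k = 1, nb
             tmp = tmp + a(i,k) * b(k,j)
          end do
          c(i,j) = c(i,j) + tmp
       end do
    end do
  end subroutine block_mm_acc

In practice the blocks would presumably stay resident on the GPU across all sqrt(np) steps (for example inside an enclosing !$acc data region) rather than being copied in and out on every call.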

$ make -f makefile.acc.mk
$ ./create_input
$ ./seri
 serial time:    51.68619100000000         2.659103927236580      Gflops
 trace:    4196462.480618147
$ mpirun -x LD_LIBRARY_PATH -x PSM2_CUDA=1 -x PSM2_GPUDIRECT=1 -npernode 4 -np 16 ./can_acc
 MPI time:   0.1217727372422814         1128.651261230572      Gflops
 trace:    4196462.480618146
$ ./check c.seri c.can_acc
 maximum error:   1.2278178473934531E-011

Large size test (imax=16*1024, 4 nodes)

  • flat MPI, 64 cores, Intel compiler and Intel MPI
$ mpiexec.hydra -ppn 16 -np 64 ./can
 MPI time:   82.6075530052185        106.480493637819      Gflops
 trace:   67116321.7059676
  • hybrid(MPI+OpenMP), 112 cores, Intel compiler and Intel MPI
$ OMP_NUM_THREADS=$((28/4)) KMP_AFFINITY=compact mpiexec.hydra -ppn 4 -np 16 ./can_hyb
 MPI time:   40.3734800815582        217.868090747666      Gflops
 trace:   67116321.7059676
  • hybrid(MPI+OpenACC), 16 GPUs, PGI compiler and OpenMPI
$ mpirun -x LD_LIBRARY_PATH -x PSM2_CUDA=1 -x PSM2_GPUDIRECT=1 -npernode 4 -np 16 ./can_acc
 MPI time:    4.504744562320411         1952.628589816666      Gflops
 trace:    67116321.70596765
