can - A simple dense matrix-matrix multiplication benchmark.

There are serial, Intel MKL dgemm(), OpenMP, MPI, hybrid(MPI+OpenMP), and hybrid(MPI+OpenACC) versions.
The MPI version is based on Cannon's algorithm. The Intel compiler and the Intel MKL library are needed.
The input matrices are filled with pseudorandom numbers generated by the Intel MKL Mersenne Twister (MT19937) generator.
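
As a rough illustration, the input generation done by ./create_input could look like the sketch below, which draws uniform pseudorandom numbers from the MKL VSL MT19937 generator. The seed, array layout, and output file names here are assumptions for illustration, not the actual create_input source.

  include 'mkl_vsl.f90'

  program create_input_sketch
    use mkl_vsl_type
    use mkl_vsl
    implicit none
    integer, parameter :: imax = 4096                 ! assumption: must match param.f
    double precision, allocatable :: a(:), b(:)       ! imax x imax matrices, column-major
    type(vsl_stream_state) :: stream
    integer :: ierr

    allocate(a(imax*imax), b(imax*imax))
    ierr = vslnewstream(stream, vsl_brng_mt19937, 5489)   ! seed value is an assumption
    ierr = vdrnguniform(vsl_rng_method_uniform_std, stream, imax*imax, a, 0.d0, 1.d0)
    ierr = vdrnguniform(vsl_rng_method_uniform_std, stream, imax*imax, b, 0.d0, 1.d0)
    ierr = vsldeletestream(stream)
    ! hypothetical output files; the real create_input may use a different format
    open(10, file='a.dat', form='unformatted', access='stream')
    write(10) a
    close(10)
    open(11, file='b.dat', form='unformatted', access='stream')
    write(11) b
    close(11)
  end program create_input_sketch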

  • binary names:

    • serial: seri
    • OpenMP: omp
    • Intel MKL dgemm(): dgemm
    • MPI: can
    • hybrid(MPI+OpenMP): can_hyb
    • hybrid(MPI+OpenACC): can_acc
  • matrix size: imax x imax (set in param.f)

  • Some notes for the MPI and hybrid versions (see the sketch after this list):

    • sqrt(np) must be an integer (np is the number of MPI processes).
    • imax/sqrt(np) must be an integer.
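
For reference, the overall structure of Cannon's algorithm as used by can/can_hyb is sketched below: the np processes form a periodic sqrt(np) x sqrt(np) grid, each rank holds an (imax/sqrt(np))^2 block of A, B, and C, and after an initial skew the blocks are multiplied and circularly shifted sqrt(np) times. This is a simplified sketch, not the actual source; the local multiply is written with MKL dgemm() for brevity, and I/O and timing are omitted.

  program cannon_sketch
    use mpi
    implicit none
    integer, parameter :: nb = 1024       ! local block size = imax/sqrt(np) (assumption)
    double precision, allocatable :: a(:,:), b(:,:), c(:,:)
    integer :: comm2d, rank, np, q, dims(2), coords(2)
    logical :: periods(2)
    integer :: src, dst, k, ierr

    call mpi_init(ierr)
    call mpi_comm_size(mpi_comm_world, np, ierr)
    q = nint(sqrt(dble(np)))              ! sqrt(np) must be an integer
    dims = (/ q, q /)
    periods = .true.                      ! circular shifts need a periodic grid
    call mpi_cart_create(mpi_comm_world, 2, dims, periods, .true., comm2d, ierr)
    call mpi_comm_rank(comm2d, rank, ierr)
    call mpi_cart_coords(comm2d, rank, 2, coords, ierr)

    allocate(a(nb,nb), b(nb,nb), c(nb,nb))
    ! ... read/scatter the local blocks of A and B here ...
    c = 0.d0

    ! initial alignment: shift row i of A left by i, column j of B up by j
    call mpi_cart_shift(comm2d, 1, -coords(1), src, dst, ierr)
    call mpi_sendrecv_replace(a, nb*nb, mpi_double_precision, dst, 0, src, 0, &
         comm2d, mpi_status_ignore, ierr)
    call mpi_cart_shift(comm2d, 0, -coords(2), src, dst, ierr)
    call mpi_sendrecv_replace(b, nb*nb, mpi_double_precision, dst, 1, src, 1, &
         comm2d, mpi_status_ignore, ierr)

    ! sqrt(np) steps: local multiply-accumulate, then shift A left and B up by one
    do k = 1, q
       call dgemm('n', 'n', nb, nb, nb, 1.d0, a, nb, b, nb, 1.d0, c, nb)
       call mpi_cart_shift(comm2d, 1, -1, src, dst, ierr)
       call mpi_sendrecv_replace(a, nb*nb, mpi_double_precision, dst, 0, src, 0, &
            comm2d, mpi_status_ignore, ierr)
       call mpi_cart_shift(comm2d, 0, -1, src, dst, ierr)
       call mpi_sendrecv_replace(b, nb*nb, mpi_double_precision, dst, 1, src, 1, &
            comm2d, mpi_status_ignore, ierr)
    end do

    ! ... gather/write the local blocks of C, print the trace, etc. ...
    call mpi_finalize(ierr)
  end program cannon_sketch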

how to run

  • Intel compiler and Intel MPI are required.
$ make
$ ./create_input

$ ./seri
or
$ ./omp
or
$ ./dgemm
or
$ mpirun -np $NP ./can
or
$ mpirun -np $NP ./can_hyb

performance comparison (matrix size: 4096x4096; Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz, 14 cores/socket, 2 sockets/node, 4 nodes; Intel OPA interconnect):

  • serial
$ ./seri
 serial time:   6.99500107765198        19.6481675908668      Gflops
 trace:   4196462.48061815
  • MKL dgemm() (single thread)
$ MKL_NUM_THREADS=1 ./dgemm
 dgemm time:   3.69211506843567        37.2249918879782      Gflops
 trace:   4196462.48061815
  • MKL dgemm() (28 threads)
$ MKL_NUM_THREADS=28 KMP_AFFINITY=compact ./dgemm
 dgemm time:   1.08629608154297        126.520711808868      Gflops
 trace:   4196462.48061815
  • OpenMP (28 threads)
$ OMP_NUM_THREADS=28 KMP_AFFINITY=compact ./omp
 omp time:  0.852473020553589        161.223816071913      Gflops
 trace:   4196462.48061815
  • MPI
$ mpiexec.hydra -ppn 16 -np 64 ./can
 MPI time:  0.405706882476807        338.764165480622      Gflops
 trace:   4196462.48061815
  • hybrid(MPI+OpenMP)
$ OMP_NUM_THREADS=$((28/4)) KMP_AFFINITY=compact mpiexec.hydra -ppn 4 -np 16 ./can_hyb
 MPI time:  0.325567960739136        422.151347939683      Gflops
 trace:   4196462.48061815
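
(The Gflops figures are presumably derived from the standard 2*imax^3 operation count of a dense matrix-matrix multiplication; e.g. for the serial run above, 2*4096^3 / 6.995 s ≈ 19.6 Gflops, which matches the printed value.)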

check the results

$ ./check c.seri c.dgemm
 maximum error:  9.094947017729282E-012
$ ./check c.seri c.omp
 maximum error:  0.000000000000000E+000
$ ./check c.seri c.can
 maximum error:  1.409716787748039E-011
$ ./check c.seri c.can_hyb
 maximum error:  1.409716787748039E-011
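
The check utility presumably loads the two result matrices and reports the largest elementwise difference, along the lines of the sketch below. The file format (a raw stream dump of the imax x imax result) is an assumption.

  program check_sketch
    implicit none
    integer, parameter :: imax = 4096               ! assumption: must match param.f
    double precision, allocatable :: c1(:), c2(:)
    character(len=256) :: f1, f2

    allocate(c1(imax*imax), c2(imax*imax))
    call get_command_argument(1, f1)
    call get_command_argument(2, f2)
    open(10, file=trim(f1), form='unformatted', access='stream')
    read(10) c1
    close(10)
    open(11, file=trim(f2), form='unformatted', access='stream')
    read(11) c2
    close(11)
    print *, 'maximum error:', maxval(abs(c1 - c2))
  end program check_sketch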

MPI+OpenACC version

The PGI compiler, OpenMPI, and Intel MKL are required.
The CPU and interconnect are the same as in the runs above; each node has 4 NVIDIA P100 GPUs.
GPUDirect is used.
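
The local block multiply in can_acc plausibly looks something like the OpenACC sketch below; the loop structure and data clauses are assumptions, not the actual kernel. With PSM2_CUDA/PSM2_GPUDIRECT enabled, the MPI shifts of Cannon's algorithm can pass device buffers straight to MPI (e.g. by wrapping the send/receive calls in !$acc host_data use_device(...)), so the blocks need not be staged through host memory.

  ! simplified OpenACC local block multiply (sketch only)
  subroutine block_mm_acc(nb, a, b, c)
    implicit none
    integer, intent(in) :: nb
    double precision, intent(in)    :: a(nb,nb), b(nb,nb)
    double precision, intent(inout) :: c(nb,nb)
    integer :: i, j, k
    double precision :: tmp

    !$acc parallel loop collapse(2) private(tmp) copyin(a, b) copy(c)
    do j = 1, nb
       do i = 1, nb
          tmp = 0.d0
          !$acc loop seq
          do k = 1, nb
             tmp = tmp + a(i,k) * b(k,j)
          end do
          c(i,j) = c(i,j) + tmp
       end do
    end do
  end subroutine block_mm_acc

In practice the blocks would presumably stay resident on the GPU across all sqrt(np) steps (for example inside an enclosing !$acc data region) rather than being copied in and out on every call.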

$ make -f makefile.acc.mk
$ ./create_input
$ ./seri
 serial time:    51.68619100000000         2.659103927236580      Gflops
 trace:    4196462.480618147
$ mpirun -x LD_LIBRARY_PATH -x PSM2_CUDA=1 -x PSM2_GPUDIRECT=1 -npernode 4 -np 16 ./can_acc
 MPI time:   0.1217727372422814         1128.651261230572      Gflops
 trace:    4196462.480618146
$ ./check c.seri c.can_acc
 maximum error:   1.2278178473934531E-011

Large size test (imax=16*1024, 4 nodes)

  • flat MPI, 64 cores, Intel compiler and Intel MPI
$ mpiexec.hydra -ppn 16 -np 64 ./can
 MPI time:   82.6075530052185        106.480493637819      Gflops
 trace:   67116321.7059676
  • hybrid(MPI+OpenMP), 112 cores, Intel compiler and Intel MPI
$ OMP_NUM_THREADS=$((28/4)) KMP_AFFINITY=compact mpiexec.hydra -ppn 4 -np 16 ./can_hyb
 MPI time:   40.3734800815582        217.868090747666      Gflops
 trace:   67116321.7059676
  • hybrid(MPI+OpenACC), 16 GPUs, PGI compiler and OpenMPI
$ mpirun -x LD_LIBRARY_PATH -x PSM2_CUDA=1 -x PSM2_GPUDIRECT=1 -npernode 4 -np 16 ./can_acc
 MPI time:    4.504744562320411         1952.628589816666      Gflops
 trace:    67116321.70596765
