MPI run-time issue with charm++ example #3701

Open
jscook2345 opened this issue Mar 31, 2023 · 5 comments
@jscook2345

Hello,

I'm getting the following error when running one of the charm++ examples. I was looking for some guidance on how to debug the issue, or ideas on what to try next.

Thanks,

Justin

Charm++> Running on MPI version: 3.1
Charm++> level of thread support used: MPI_THREAD_SINGLE (desired: MPI_THREAD_SINGLE)
Charm++> Running in non-SMP mode: 2 processes (PEs)
Converse/Charm++ Commit ID: v7.0.0
Charm++ built without optimization.
Do not use for performance benchmarking (build with --with-production to do so).
Charm++ built with internal error checking enabled.
Do not use for performance benchmarking (build without --enable-error-checking to do so).
Charm++> MPI timer is synchronized
Isomalloc> Synchronized global address space.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 2 hosts (2 sockets x 64 cores x 2 PUs = 256-way SMP)
Charm++> cpu topology info is gathered in 0.004 seconds.
Running Hello on 2 processors for 2000000 elements
MPICH ERROR [Rank 0] [job id 6753376.0] [Fri Mar 31 11:29:51 2023] [nid004265] - Abort(203531535) (rank 0 in comm 0): Fatal error in PMPI_Iprobe: Other MPI error, error stack:
PMPI_Iprobe(126).......: MPI_Iprobe(src=MPI_ANY_SOURCE, tag=1375, comm=0x84000000, flag=0x7fff50e11494, status=0x7fff50e11480) failed
MPID_Iprobe(257).......:
MPIDI_iprobe_safe(118).:
MPIDI_iprobe_unsafe(42):
(unknown)(): Other MPI error

aborting job:
Fatal error in PMPI_Iprobe: Other MPI error, error stack:
PMPI_Iprobe(126).......: MPI_Iprobe(src=MPI_ANY_SOURCE, tag=1375, comm=0x84000000, flag=0x7fff50e11494, status=0x7fff50e11480) failed
MPID_Iprobe(257).......:
MPIDI_iprobe_safe(118).:
MPIDI_iprobe_unsafe(42):
(unknown)(): Other MPI error
srun: error: nid004265: task 0: Exited with exit code 255
srun: launch/slurm: _step_signal: Terminating StepId=6753376.0
srun: error: nid004748: task 1: Terminated
srun: Force Terminated StepId=6753376.0
@jscook2345
Author

If it helps, this is on Perlmutter: https://www.nersc.gov/systems/perlmutter/

@stwhite91
Contributor

How did you build Charm++ (./build charm++ mpi-crayshasta ?), what modules are you using (PrgEnv, mpi, etc.), and what is your run command?

@jscook2345
Author

Build (tag v7.0.0):

./build charm++ mpi-crayshasta -g 
cd mpi-crayshasta/examples/charm++/hello/1darraymsg
make

Modules:

craype-x86-milan
libfabric/1.15.2.0
craype-network-ofi
xpmem/2.5.2-2.4_3.30__gd0f7936.shasta
PrgEnv-gnu/8.3.3
cray-dsmml/0.2.2
cray-libsci/23.02.1.1
cray-mpich/8.1.24
craype/2.7.19
gcc/11.2.0
perftools-base/23.02.0
cpe/23.02
xalt/2.10.2
cpu/1.0
cray-pmi/6.1.9
craype-hugepages8M

Run:

srun -C cpu -q debug -N 2 -n 2 --ntasks-per-node=1 -c 256 ./hello 2000000

@stwhite91
Contributor

That all looks correct. It looks to me like an issue in cray-mpich, or in libfabric below it, since the parameters passed into the MPI_Iprobe() call all look valid. Could you try reproducing the error in a standalone MPI_Iprobe test program using the same environment?
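
A minimal sketch of such a standalone reproducer, mimicking the MPI_Iprobe(MPI_ANY_SOURCE, ...) polling that the Charm++ MPI layer performs before receiving; the file name, tag value, and build/run commands in the comments are illustrative assumptions, not taken from this issue:

/*
 * iprobe_test.c -- minimal MPI_Iprobe reproducer (sketch).
 * Rank 1 sends one tagged message; rank 0 polls with
 * MPI_Iprobe(MPI_ANY_SOURCE, ...) and then receives it.
 *
 * Build (e.g., with the Cray wrapper):  cc iprobe_test.c -o iprobe_test
 * Run (e.g.):                           srun -N 2 -n 2 ./iprobe_test
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int tag = 1375;  /* same tag as in the error stack above */

    if (rank == 1) {
        int payload = 42;
        MPI_Send(&payload, 1, MPI_INT, 0, tag, MPI_COMM_WORLD);
    } else if (rank == 0) {
        int flag = 0;
        MPI_Status status;
        /* Poll for a pending message from any source, as the
           Charm++ progress loop does. */
        while (!flag) {
            MPI_Iprobe(MPI_ANY_SOURCE, tag, MPI_COMM_WORLD, &flag, &status);
        }
        int payload;
        MPI_Recv(&payload, 1, MPI_INT, status.MPI_SOURCE, tag,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 0 received %d from rank %d\n",
               payload, status.MPI_SOURCE);
    }

    MPI_Finalize();
    return 0;
}

If this small program fails the same way with two tasks on two nodes, that would point at cray-mpich/libfabric rather than Charm++.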

@jscook2345
Author

I'll give that a try. Thanks for the initial look.

If I wanted to hook up a parallel debugger like ddt or totalview, do I need to do anything different because of kokkos?

Thanks,

Justin
