PMI error when running on SDSC Expanse #6924

Open
JiakunYan opened this issue Feb 26, 2024 · 16 comments

Comments

@JiakunYan

I am getting the following error when trying to run MPICH on SDSC Expanse (an InfiniBand machine with Slurm).

srun -n 2 hello_world
PMII_singinit: execv failed: No such file or directory
[unset]: This singleton init program attempted to access some feature
[unset]: for which process manager support was required, e.g. spawn or universe_size.
[unset]: But the necessary mpiexec is not in your path.
PMII_singinit: execv failed: No such file or directory
[unset]: This singleton init program attempted to access some feature
[unset]: for which process manager support was required, e.g. spawn or universe_size.
[unset]: But the necessary mpiexec is not in your path.
[unset]: write_line error; fd=-1 buf=:cmd=get kvsname=singinit_kvs_943014_0 key=PMI_mpi_memory_alloc_kinds
:
system msg for write_line failure : Bad file descriptor
[unset]: write_line error; fd=-1 buf=:cmd=get kvsname=singinit_kvs_943015_0 key=PMI_mpi_memory_alloc_kinds
:
system msg for write_line failure : Bad file descriptor
exp-9-17: 0 / 1 OK
exp-9-17: 0 / 1 OK
^Csrun: interrupt (one more within 1 sec to abort)
srun: StepId=28936391.1 tasks 0-1: running
^Csrun: sending Ctrl-C to StepId=28936391.1
srun: forcing job termination
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 28936391.1 ON exp-9-17 CANCELLED AT 2024-02-26T11:17:29 ***

mpichversion output

MPICH Version: 4.3.0a1
MPICH Release date: unreleased development copy
MPICH ABI: 0:0:0
MPICH Device: ch4:ucx
MPICH configure: --prefix=/home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/mpich-master-4uxeueqze7pn3732cbji36ckezyqld4o --disable-silent-rules --enable-shared --with-pm=no --enable-romio --without-ibverbs --enable-wrapper-rpath=yes --with-yaksa=/home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/yaksa-0.2-3r62jn5cdiiovsmntoqdrkzircgzvxqh --with-hwloc=/home/jackyan1/opt/hwloc/2.9.1 --with-slurm=yes --with-slurm-include=/cm/shared/apps/slurm/current/include --with-slurm-lib=/cm/shared/apps/slurm/current/lib --with-pmi=slurm --without-cuda --without-hip --with-device=ch4:ucx --with-ucx=/home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/ucx-1.14.0-znbygw2dkj6m2uvbebmufqmkggyapleb --enable-libxml2 --enable-thread-cs=per-vci --with-datatype-engine=auto
MPICH CC: /home/jackyan1/workspace/spack/lib/spack/env/gcc/gcc -O2
MPICH CXX: /home/jackyan1/workspace/spack/lib/spack/env/gcc/g++ -O2
MPICH F77: /home/jackyan1/workspace/spack/lib/spack/env/gcc/gfortran -O2
MPICH FC: /home/jackyan1/workspace/spack/lib/spack/env/gcc/gfortran -O2
MPICH features: threadcomm

Any idea why this could happen?

@raffenet
Contributor

Can you confirm your MPICH library and hello_world are linked with the Slurm PMI library? The output suggests each process thinks it is a singleton, so something is wrong in the discovery of other processes in the job.

@JiakunYan
Author

According to the output of ldd, it does seem to link against the Slurm PMI library.

srun -n 1 ldd ~/workspace/hpx-lci_scripts/spack_env/expanse/hpx-lcw/.spack-env/view/bin/hello_world
linux-vdso.so.1 (0x0000155555551000)
liblcw.so => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/lcw-master-qdc2ohhyw7cfzumwivkojiilsto66qlh/lib64/liblcw.so (0x000015555511a000)
libstdc++.so.6 => /cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/gcc-10.2.0-npcyll4gxjhf4tejksmdzlsl3d3usqpd/lib64/libstdc++.so.6 (0x0000155554d47000)
libm.so.6 => /lib64/libm.so.6 (0x00001555549c5000)
libgcc_s.so.1 => /cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/gcc-10.2.0-npcyll4gxjhf4tejksmdzlsl3d3usqpd/lib64/libgcc_s.so.1 (0x00001555547ac000)
libc.so.6 => /lib64/libc.so.6 (0x00001555543e7000)
liblci.so => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/lci-master-ladq3cnji4lzds5pvnmz3pr5kpskvhvs/lib64/liblci.so (0x00001555541c1000)
liblct.so => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/lci-master-ladq3cnji4lzds5pvnmz3pr5kpskvhvs/lib64/liblct.so (0x0000155553f78000)
libibverbs.so.1 => /lib64/libibverbs.so.1 (0x0000155553d58000)
libmpicxx.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/mpich-master-4uxeueqze7pn3732cbji36ckezyqld4o/lib/libmpicxx.so.0 (0x0000155553b35000)
libmpi.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/mpich-master-4uxeueqze7pn3732cbji36ckezyqld4o/lib/libmpi.so.0 (0x00001555534a2000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0000155553282000)
/lib64/ld-linux-x86-64.so.2 (0x0000155555325000)
liblci-ucx.so => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/lci-master-ladq3cnji4lzds5pvnmz3pr5kpskvhvs/lib64/liblci-ucx.so (0x0000155553011000)
libnl-route-3.so.200 => /lib64/libnl-route-3.so.200 (0x0000155552d7f000)
libnl-3.so.200 => /lib64/libnl-3.so.200 (0x0000155552b5c000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000155552958000)
libhwloc.so.15 => /home/jackyan1/opt/hwloc/2.9.1/lib/libhwloc.so.15 (0x00001555526f9000)
libpciaccess.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/libpciaccess-0.17-jqqzmoorywzwslxnvh3whvxmxgggxddg/lib/libpciaccess.so.0 (0x00001555524ef000)
libxml2.so.2 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/libxml2-2.10.3-riigwi634oahw6njkyhbrhqjx2hsbjyt/lib/libxml2.so.2 (0x0000155552184000)
libucp.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/ucx-1.14.0-znbygw2dkj6m2uvbebmufqmkggyapleb/lib/libucp.so.0 (0x0000155551eb6000)
libucs.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/ucx-1.14.0-znbygw2dkj6m2uvbebmufqmkggyapleb/lib/libucs.so.0 (0x0000155551c55000)
libyaksa.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/yaksa-0.2-3r62jn5cdiiovsmntoqdrkzircgzvxqh/lib/libyaksa.so.0 (0x000015554f989000)
libxpmem.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/xpmem-2.6.5-36-n47tincumvgfjwbnhddzsqskzs7nxohd/lib/libxpmem.so.0 (0x000015554f786000)
librt.so.1 => /lib64/librt.so.1 (0x000015554f57e000)
libpmi.so.0 => /cm/shared/apps/slurm/current/lib64/libpmi.so.0 (0x000015554f378000)
libz.so.1 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/zlib-1.2.13-xhijn7cz7apogelukw47ulnzhhardvos/lib/libz.so.1 (0x000015554f160000)
liblzma.so.5 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/xz-5.4.1-knnmdfklcssmtvciq4pupvfqsh2upbzy/lib/liblzma.so.5 (0x000015554ef33000)
libiconv.so.2 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/libiconv-1.17-fdzdmyikb3i5dtfkt26raiyq63tumvnq/lib/libiconv.so.2 (0x000015554ec26000)
libuct.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/ucx-1.14.0-znbygw2dkj6m2uvbebmufqmkggyapleb/lib/libuct.so.0 (0x000015554e9eb000)
libnuma.so.1 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/numactl-2.0.14-k3pqb32bk6b5sl2c7kvzd6errjicvsye/lib/libnuma.so.1 (0x000015554e7df000)
libucm.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/ucx-1.14.0-znbygw2dkj6m2uvbebmufqmkggyapleb/lib/libucm.so.0 (0x000015554e5c4000)
libslurm_pmi.so => /cm/shared/apps/slurm/23.02.7/lib64/slurm/libslurm_pmi.so (0x000015554e1d2000)
libresolv.so.2 => /lib64/libresolv.so.2 (0x000015554dfba000)
libatomic.so.1 => /cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/gcc-10.2.0-npcyll4gxjhf4tejksmdzlsl3d3usqpd/lib64/libatomic.so.1 (0x000015554ddb2000)

I also tried the --mpi=pmi2 option of srun and got a different error:

srun -n 2 --mpi=pmi2 hello_world
[cli_0]: write_line: message string doesn't end in newline: :cmd=put kvsname=28943582.7 key=-allgather-shm-1-0-seg-1/2 value=20D0539CE2EDC88B822000637577CC2B3200F8D74F[5]4F030088C70230AC7B6CE151210600020A150136CF1917B7513832A8384F9241AF360113230082BB9321060002C6CA67C9CF1917B7513832A8384F9241AF360013230082C8B7211201028F9A9BAD3BDF1A8A98[2]F0[4]CF1917B75138539E3E4BD7E037370113230082C119210600020A160136CF1917B751383A0B345023ADAE360113230082A92342088F9A9BAD3BDF1A0A8D5377CC2B32[2]705077CCAB33004F2300883B808067[4]43088F9A9BAD3BDF1A0AA7D377CC2B32[2]705077CCAB33004F2300882D75180006[2]C0241148[10]FFFF0A1501367E3977CC2B338142364F8797713526DF230007AB73014A0C[2]278AB00FA1338142364F5C7C613526DF22[2]7AD477CC2B338142364F5C7C613526DF2200010031D15C7CE1338142364FF28969352603270003C27301B30F77CCAB338142364FF28969352603270083C8730125030094057E3977CC2B33EF483850A2E73B3532DF2300072B9701490C[2]278AB00FA133EF48385077CC2B3532DF22[2]7AD477CC2B33EF48385077CC2B3532DF2200010031D15C7CE133EF4838500DDA33353203270003419701B30F77CCAB33EF4838500DDA3335320327008347970126088F9A9BAD3BDF1A0A478BBD3706360024:
[cli_1]: write_line: message string doesn't end in newline: :cmd=put kvsname=28943582.7 key=-allgather-shm-1-1-seg-1/2 value=2056E44C658714485C2000637577CC2B3200F8D74F[5]4F0300884C479A1F197DC693210600020A150136CF1917B7513832A8384F9241AF360113230082E5BD21060002C6CA67C9CF1917B7513832A8384F9241AF36001323008293FD211201028F9A9BAD3BDF1A8A98[2]F0[4]CF1917B75138539E3E4BD7E037370113230082854F210600020A160136CF1917B751383A0B345023ADAE36011323008295F342088F9A9BAD3BDF1A0A8D5377CC2B32[2]705077CCAB33004F2300883C808067[4]43088F9A9BAD3BDF1A0AA7D377CC2B32[2]705077CCAB33004F2300882E75180006[2]C0241148[10]FFFF0A1501367E3977CC2B338142364F8797713526DF230007AC7301490C[2]278AB00FA1338142364F5C7C613526DF22[2]7AD477CC2B338142364F5C7C613526DF2200010031D15C7CE1338142364FF28969352603270003C17301B30F77CCAB338142364FF28969352603270083C7730125030094057E3977CC2B33EF483850A2E73B3532DF2300072A97014B0C[2]278AB00FA133EF48385077CC2B3532DF22[2]7AD477CC2B33EF48385077CC2B3532DF2200010031D15C7CE133EF4838500DDA33353203270003409701B30F77CCAB33EF4838500DDA3335320327008346970126088F9A9BAD3BDF1A0A478BBD3706360024:
^Csrun: interrupt (one more within 1 sec to abort)
srun: StepId=28943582.7 tasks 0-1: running
^Csrun: sending Ctrl-C to StepId=28943582.7
srun: forcing job termination
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

@hzhou
Contributor

hzhou commented Feb 27, 2024

With srun --mpi=pmi2, PMI itself is working, but it looks like the exchanged address string gets too long to fit within the PMI message limit. I am not sure where the inconsistency comes from.

@JiakunYan
Author

JiakunYan commented May 3, 2024

For the error reported with srun --mpi=pmi2, manually modifying the MPICH source code to reduce pmi_max_val_size by half fixed the issue. I would appreciate it if MPICH could provide an environment variable that lets users control this value (like the I_MPI_PMI_VALUE_LENGTH_MAX environment variable in Intel MPI).
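As a purely illustrative sketch (not actual MPICH code; the environment variable name MPICH_PMI_VALUE_LENGTH_MAX is made up here), such an override could look roughly like this on the library side:

#include <stdlib.h>

#ifndef PMI2_MAX_VALLEN
#define PMI2_MAX_VALLEN 1024   /* default from slurm/pmi2.h */
#endif

/* Cap the PMI value size from an environment variable, falling back to
 * the header default. Only shrinking below the header limit is allowed. */
static int get_pmi_max_val_size(void)
{
  int max_val = PMI2_MAX_VALLEN;
  const char *s = getenv("MPICH_PMI_VALUE_LENGTH_MAX");  /* hypothetical name */
  if (s != NULL) {
    int v = atoi(s);
    if (v > 0 && v <= PMI2_MAX_VALLEN)
      max_val = v;
  }
  return max_val;
}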

yfguo added a commit to yfguo/mpich that referenced this issue May 22, 2024
@raffenet
Contributor

raffenet commented May 23, 2024

FWIW, a simple Slurm+PMI2 example putting a max-size value does not hang on the Bebop cluster here at Argonne. There may still be a bug in the segmented put in the MPIR layer for PMI2. I will investigate further...

#include <slurm/pmi2.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
  int has_parent, size, rank, appnum;
  int pmi_max_val = PMI2_MAX_VALLEN;
  int pmi_max_key = PMI2_MAX_KEYLEN;

  PMI2_Init(&has_parent, &size, &rank, &appnum);
  char *pmi_kvs_name = malloc(pmi_max_val);
  PMI2_Job_GetId(pmi_kvs_name, pmi_max_val);

  char *valbuf = malloc(pmi_max_val);
  memset(valbuf, 'a', pmi_max_val);
  valbuf[pmi_max_val - 1] = '\0';

  PMI2_KVS_Put("foo", valbuf);
  PMI2_KVS_Fence();
  int out_len;
  PMI2_KVS_Get(pmi_kvs_name, PMI2_ID_NULL, "bar", NULL, 0, &out_len);

  PMI2_Finalize();
  return 0;
}

@hzhou
Contributor

hzhou commented May 23, 2024

What Jiakun pointed out is that the PMI2_MAX_VALLEN in pmi2.h is likely too big. Historically it is 1024. When we exchange addresses and the address string is too long, we segment it according to PMI2_MAX_VALLEN, and that apparently overflows libpmi2 in Slurm.
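For illustration, the segmentation being described is roughly the following (a sketch only, with an invented key format and helper name, not MPICH's actual code):

#include <stdio.h>
#include <string.h>
#include <slurm/pmi2.h>

/* Split a long address string into pieces that each fit PMI2_MAX_VALLEN
 * and publish them under per-segment keys. */
static void put_segmented(const char *base_key, const char *val)
{
  size_t len = strlen(val);
  size_t seg_size = PMI2_MAX_VALLEN - 1;              /* leave room for '\0' */
  int nseg = (int) ((len + seg_size - 1) / seg_size);

  for (int i = 0; i < nseg; i++) {
    char key[PMI2_MAX_KEYLEN];
    char seg[PMI2_MAX_VALLEN];
    size_t off = (size_t) i * seg_size;
    size_t n = (len - off < seg_size) ? len - off : seg_size;

    snprintf(key, sizeof(key), "%s-seg-%d/%d", base_key, i + 1, nseg);
    memcpy(seg, val + off, n);
    seg[n] = '\0';
    PMI2_KVS_Put(key, seg);   /* each value is at most PMI2_MAX_VALLEN - 1 chars */
  }
}

If the receiving library sizes its buffers without the cmd=put ... header in mind, a value of exactly this size can still overflow on the wire.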

@hzhou
Contributor

hzhou commented May 23, 2024

So I think the right solution is to fix PMI2_MAX_VALLEN in pmi2.h. The header should be consistent with the library libpmi2.so.

If we want to add an environment override, it should be named PMI2_MAX_VALLEN, IMO.

@hzhou
Contributor

hzhou commented May 23, 2024

@JiakunYan Does the example in #6924 (comment) work on SDSC Expanse?

@raffenet
Contributor

So I think the right solution is to fix PMI2_MAX_VALLEN in pmi2.h. The header should be consistent with the library libpmi2.so.

I am saying that the library accepts a value equal to the maximum value length on Bebop. We should confirm it can do the same on Expanse before we say this is a bug in the header.

@hzhou
Contributor

hzhou commented May 23, 2024

FWIW, a simple Slurm+PMI2 example putting a max size value does not hang on the Bebop cluster here at Argonne. There may still be a bug in the segmented put in the MPIR layer for PMI2. Will investigate further...

Make sure to check the return values from the PMI2 functions. They may be returning errors.
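For example, something along these lines (a sketch; it only assumes the usual PMI2_SUCCESS return code from slurm/pmi2.h):

#include <slurm/pmi2.h>
#include <stdio.h>

/* Report any PMI2 call that does not return PMI2_SUCCESS. */
#define CHECK_PMI2(call)                                          \
  do {                                                            \
    int rc_ = (call);                                             \
    if (rc_ != PMI2_SUCCESS)                                      \
      fprintf(stderr, "%s failed with rc=%d\n", #call, rc_);      \
  } while (0)

/* usage: CHECK_PMI2(PMI2_KVS_Put("foo", valbuf)); */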

@raffenet
Contributor

Here's output from a modified example that puts and then gets the key.

#include <slurm/pmi2.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

int main(void)
{
  int has_parent, size, rank, appnum;
  int pmi_max_val = PMI2_MAX_VALLEN;
  int pmi_max_key = PMI2_MAX_KEYLEN;

  PMI2_Init(&has_parent, &size, &rank, &appnum);
  char *pmi_kvs_name = malloc(pmi_max_val);
  PMI2_Job_GetId(pmi_kvs_name, pmi_max_val);

  char *valbuf = malloc(pmi_max_val + 1);
  memset(valbuf, 'a', pmi_max_val);
  valbuf[pmi_max_val] = '\0';
  assert(strlen(valbuf) <= (size_t) pmi_max_val);
  printf("vallen = %d, max = %d\n", (int) strlen(valbuf), pmi_max_val);

  PMI2_KVS_Put("foo", valbuf);
  printf("put %s into kvs\n", valbuf);
  PMI2_KVS_Fence();
  int out_len;
  char *outbuf = malloc(pmi_max_val + 1);
  PMI2_KVS_Get(pmi_kvs_name, PMI2_ID_NULL, "foo", outbuf, pmi_max_val + 1, &out_len);
  printf("out_len = %d, strlen = %d\n", out_len, strlen(outbuf));
  printf("got %s from kvs\n", outbuf);

  PMI2_Finalize();

  free(valbuf);
  free(outbuf);
  free(pmi_kvs_name);

  return 0;
}
[raffenet@beboplogin4]~% srun --mpi=pmi2 ./a.out
vallen = 1024, max = 1024
put aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa into kvs
out_len = 1024, strlen = 1024
got aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa from kvs

@hzhou
Contributor

hzhou commented May 23, 2024

[cli_0]: write_line: message string doesn't end in newline: :cmd=put kvsname=28943582.7 key=-allgather-shm-1-0-seg-1/2 value=20D0539CE2...

I suspect PMI2_MAX_VALLEN doesn't account for the size of the overhead, i.e. cmd=put kvsname=28943582.7 key=-allgather-shm-1-0-seg-1/2 value=. Again, this should be accounted for by the libpmi implementations, since the upper layer is not aware of the internal protocols. I am not against allowing users to override it via the environment, but it should be documented with clear reasons.
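As a rough back-of-the-envelope illustration of the overhead (the header string is copied from the failing message above; nothing here is from Slurm's source):

#include <stdio.h>
#include <string.h>

int main(void)
{
  /* The wire message is header + value; if the value alone is sized to
   * PMI2_MAX_VALLEN (1024), the total exceeds a 1024-byte buffer. */
  const char *header = "cmd=put kvsname=28943582.7 key=-allgather-shm-1-0-seg-1/2 value=";
  size_t vallen = 1024;
  size_t total = strlen(header) + vallen + 1;  /* + trailing newline */
  printf("header=%zu value=%zu total=%zu\n", strlen(header), vallen, total);
  return 0;
}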

@hzhou
Contributor

hzhou commented May 23, 2024

Here's output from a modified example that puts and the gets the key.

Yeah, this seems to confirm that Slurm is able to accommodate a 1024-byte value -- its internal buffer must be even larger.

@hzhou
Contributor

hzhou commented May 23, 2024

libpmi.so.0 => /cm/shared/apps/slurm/current/lib64/libpmi.so.0 (0x000015554f378000)

Oh, I just realized @JiakunYan was linking with PMI-1 rather than PMI-2. @raffenet We need to test Slurm's PMI-1.

@JiakunYan
Author

@raffenet @hzhou https://pm.bsc.es/gitlab/rarias/bscpkgs/-/issues/126 might be helpful in explaining the situation.

I think it is related to the PMI implementation of specific Slurm versions.

@raffenet
Contributor

@JiakunYan thanks, that is helpful. I probably need to run on multiple nodes to trigger the problem. Will try again when I have a chance. It would be good, IMO, to submit a ticket with a PMI-only reproducer to Slurm.
