PMI error when running on SDSC Expanse #6924
Comments
Can you confirm your MPICH library and |
According to the output of ldd, it seems it did link to the slurm pmi library.
I also tried the
|
The |
For the error reported when |
FWIW, a simple Slurm+PMI2 example putting a max size value does not hang on the Bebop cluster here at Argonne. There may still be a bug in the segmented put in the MPIR layer for PMI2. Will investigate further...
```c
#include <slurm/pmi2.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    int has_parent, size, rank, appnum;
    int pmi_max_val = PMI2_MAX_VALLEN;
    int pmi_max_key = PMI2_MAX_KEYLEN;

    PMI2_Init(&has_parent, &size, &rank, &appnum);

    char *pmi_kvs_name = malloc(pmi_max_val);
    PMI2_Job_GetId(pmi_kvs_name, pmi_max_val);

    /* Build a value that fills the whole buffer, including the terminator. */
    char *valbuf = malloc(pmi_max_val);
    memset(valbuf, 'a', pmi_max_val);
    valbuf[pmi_max_val - 1] = '\0';

    PMI2_KVS_Put("foo", valbuf);
    PMI2_KVS_Fence();

    /* Query the KVS after the fence; the result is not inspected here. */
    int out_len;
    PMI2_KVS_Get(pmi_kvs_name, PMI2_ID_NULL, "bar", NULL, 0, &out_len);

    PMI2_Finalize();
    return 0;
}
```
|
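To make the "segmented put" idea above concrete, here is a minimal sketch of splitting a long value into segments that each fit under a per-entry limit. It is not MPICH's MPIR code; the helper names `segment_count` and `copy_segment` and the `seg_payload` parameter are hypothetical and only illustrate the arithmetic that decides where an off-by-one at the maximum length could appear.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of segments needed for a value of val_len
 * characters when one segment holds at most seg_payload characters
 * (the NUL terminator is not counted in seg_payload). */
static size_t segment_count(size_t val_len, size_t seg_payload)
{
    assert(seg_payload > 0);
    if (val_len == 0)
        return 1; /* an empty value still occupies one (empty) segment */
    return (val_len + seg_payload - 1) / seg_payload;
}

/* Hypothetical helper: copy segment number idx of value into out, which must
 * have room for seg_payload + 1 bytes. Returns the characters copied. */
static size_t copy_segment(const char *value, size_t seg_payload,
                           size_t idx, char *out)
{
    size_t val_len = strlen(value);
    size_t start = idx * seg_payload;
    size_t n = 0;

    if (start < val_len) {
        n = val_len - start;
        if (n > seg_payload)
            n = seg_payload;
        memcpy(out, value + start, n);
    }
    out[n] = '\0';
    return n;
}
```

With a payload limit of 1023 characters, a 1023-character value fits in one segment while a 1024-character value needs two, which is the boundary the two reproducers in this thread sit on either side of.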
What Jiakun pointed out is it's likely the |
So I think the right solution is to fix If we want to add an environment override, it should be named |
@JiakunYan Does the example in #6924 (comment) work on SDSC Expanse? |
I am saying that the library accepts a value equal to the maximum value length on Bebop. We should confirm it can do the same on Expanse before we say this is a bug in the header. |
Make sure to check the return values from the PMI2 functions. They may be returning errors. |
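A minimal sketch of what checking the return codes could look like. So that it builds without Slurm installed, it uses a hypothetical stand-in (`sketch_kvs_put`) instead of the real `PMI2_KVS_Put`, and it assumes that the success code is 0, as I believe `PMI2_SUCCESS` is in `pmi2.h`. The 1024 limit mirrors the `max = 1024` seen in this thread and is likewise only an assumption about the stand-in.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumptions for this sketch only; the real constants come from pmi2.h. */
#define SKETCH_PMI_SUCCESS 0
#define SKETCH_PMI_FAIL (-1)
#define SKETCH_MAX_VALLEN 1024

/* Hypothetical stand-in for a KVS put: rejects values above the limit. */
static int sketch_kvs_put(const char *key, const char *value)
{
    if (key == NULL || value == NULL)
        return SKETCH_PMI_FAIL;
    if (strlen(value) > SKETCH_MAX_VALLEN)
        return SKETCH_PMI_FAIL;
    return SKETCH_PMI_SUCCESS;
}

/* Report a failing call on stderr and hand the code back, so the caller
 * can stop instead of silently continuing after an error. */
static int check_pmi(int rc, const char *what)
{
    assert(what != NULL);
    if (rc != SKETCH_PMI_SUCCESS)
        fprintf(stderr, "%s failed with rc = %d\n", what, rc);
    return rc;
}
```

In a real reproducer the same wrapper would go around each PMI call, e.g. `if (check_pmi(PMI2_KVS_Put("foo", valbuf), "PMI2_KVS_Put") != PMI2_SUCCESS) return 1;`.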
Here's output from a modified example that puts and then gets the key.
```c
#include <slurm/pmi2.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

int main(void)
{
    int has_parent, size, rank, appnum;
    int pmi_max_val = PMI2_MAX_VALLEN;
    int pmi_max_key = PMI2_MAX_KEYLEN;

    PMI2_Init(&has_parent, &size, &rank, &appnum);

    char *pmi_kvs_name = malloc(pmi_max_val);
    PMI2_Job_GetId(pmi_kvs_name, pmi_max_val);

    /* Value with exactly pmi_max_val characters plus the terminator. */
    char *valbuf = malloc(pmi_max_val + 1);
    memset(valbuf, 'a', pmi_max_val);
    valbuf[pmi_max_val] = '\0';
    assert(strlen(valbuf) <= (size_t) pmi_max_val);
    printf("vallen = %zu, max = %d\n", strlen(valbuf), pmi_max_val);

    PMI2_KVS_Put("foo", valbuf);
    printf("put %s into kvs\n", valbuf);
    PMI2_KVS_Fence();

    /* Read the same key back and compare the lengths. */
    int out_len;
    char *outbuf = malloc(pmi_max_val + 1);
    PMI2_KVS_Get(pmi_kvs_name, PMI2_ID_NULL, "foo", outbuf, pmi_max_val + 1, &out_len);
    printf("out_len = %d, strlen = %zu\n", out_len, strlen(outbuf));
    printf("got %s from kvs\n", outbuf);

    PMI2_Finalize();
    free(valbuf);
    free(outbuf);
    free(pmi_kvs_name);
    return 0;
}
```
[raffenet@beboplogin4]~% srun --mpi=pmi2 ./a.out
vallen = 1024, max = 1024
put aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa into kvs
out_len = 1024, strlen = 1024
got aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa from kvs |
I suspect the |
Yeah, this seems to support that Slurm is able to accommodate a value size of 1024 -- its internal buffer size must be even larger. |
Oh, just realized @JiakunYan was linking with PMI-1 rather than PMI-2. @raffenet We need to test Slurm's PMI-1. |
@raffenet @hzhou https://pm.bsc.es/gitlab/rarias/bscpkgs/-/issues/126 might be helpful in explaining the situation. I think it is related to the PMI implementation in specific Slurm versions. |
@JiakunYan thanks, that is helpful. I probably need to run on multiple nodes to trigger the problem. Will try again when I have a chance. It would be good, IMO, to submit a ticket with a PMI-only reproducer to Slurm. |
I am getting the following error when trying to run MPICH on SDSC Expanse (InfiniBand machine with Slurm).
mpichversion output
Any idea why this could happen?