Crash when node names are too long #55

Open
MK4H opened this issue May 4, 2021 · 4 comments

MK4H commented May 4, 2021

If node names are over 20 characters long, the output of sinfo -h -N -O "NodeList,AllocMem,Memory,CPUsState,StateLong", used at node.go:85, looks like this:

cpu-always-on-st-t30                   1                   0/2/0/2             idle                
cpu-spot-dy-c52xlar0                   1                   0/8/0/8             idle~               
cpu-spot-dy-c52xlar0                   1                   0/8/0/8             idle~               

You can see that the truncated node name and the AllocMem value are not separated by whitespace.

This results in a crash with the following output:

prometheus-slurm-exporter[5783]: panic: runtime error: index out of range [4] with length 4
prometheus-slurm-exporter[5783]: goroutine 9 [running]:
prometheus-slurm-exporter[5783]: main.ParseNodeMetrics(0xc00016e000, 0x5eb, 0xe00, 0xc0000b10d8)
prometheus-slurm-exporter[5783]: #011/home/ubuntu/aws-parallelcluster-monitoring/prometheus-slurm-exporter/node.go:56 +0x6cf
prometheus-slurm-exporter[5783]: main.NodeGetMetrics(0x8b7f20)
prometheus-slurm-exporter[5783]: #011/home/ubuntu/aws-parallelcluster-monitoring/prometheus-slurm-exporter/node.go:40 +0x2a
prometheus-slurm-exporter[5783]: main.(*NodeCollector).Collect(0xc00007a000, 0xc0000b1080)
prometheus-slurm-exporter[5783]: #011/home/ubuntu/aws-parallelcluster-monitoring/prometheus-slurm-exporter/node.go:128 +0x37
prometheus-slurm-exporter[5783]: github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
prometheus-slurm-exporter[5783]: #011/home/ubuntu/aws-parallelcluster-monitoring/prometheus-slurm-exporter/go/modules/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:443 +0x1a2
prometheus-slurm-exporter[5783]: created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
prometheus-slurm-exporter[5783]: #011/home/ubuntu/aws-parallelcluster-monitoring/prometheus-slurm-exporter/go/modules/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:535 +0xe8e
systemd[1]: slurm_exporter.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
systemd[1]: slurm_exporter.service: Failed with result 'exit-code'.

The parser expects 5 fields separated by whitespace, but finds only 4, which results in an out-of-bounds array access and a panic.

A possible fix is to change sinfo -h -N -O "NodeList,AllocMem,Memory,CPUsState,StateLong" to sinfo -h -N -O "NodeList: ,AllocMem: ,Memory: ,CPUsState: ,StateLong: ", explicitly telling Slurm to append a space after each value.
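
For illustration, here is a minimal sketch (not the actual node.go code; the function name and error handling are hypothetical) of a defensive version of the parsing, combining the modified sinfo format with a field-count check so that a malformed line produces an error instead of a panic:

package main

import (
	"fmt"
	"strings"
)

// nodeLine holds the five columns requested from sinfo:
// NodeList, AllocMem, Memory, CPUsState, StateLong.
type nodeLine struct {
	Name, AllocMem, Memory, CPUsState, State string
}

// parseNodeLines splits the sinfo output into whitespace-separated fields
// and checks the field count before indexing, avoiding the
// "index out of range" panic on malformed lines.
func parseNodeLines(output string) ([]nodeLine, error) {
	var nodes []nodeLine
	for _, line := range strings.Split(strings.TrimSpace(output), "\n") {
		fields := strings.Fields(line)
		if len(fields) < 5 {
			return nil, fmt.Errorf("unexpected sinfo line (want 5 fields, got %d): %q", len(fields), line)
		}
		nodes = append(nodes, nodeLine{fields[0], fields[1], fields[2], fields[3], fields[4]})
	}
	return nodes, nil
}

func main() {
	out := "cpu-always-on-st-t3amedium-1 0 1 0/2/0/2 idle\ncpu-spot-dy-c52xlarge-1 0 1 0/8/0/8 idle~"
	nodes, err := parseNodeLines(out)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, n := range nodes {
		fmt.Printf("%s: %s/%s MB allocated, state %s\n", n.Name, n.AllocMem, n.Memory, n.State)
	}
}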

MK4H pushed a commit to MK4H/prometheus-slurm-exporter that referenced this issue May 6, 2021
Fix vpenso#55

When node names are over 20 characters long, sinfo truncates the name
but does not preserve the space between NodeList and AllocMem. Example:

cpu-always-on-st-t30                   1                   0/2/0/2
idle

This commit explicitly tells sinfo to append a space after each entry,
fixing the issue.
@mtds mtds self-assigned this May 18, 2021
mtds commented May 19, 2021

On our current infrastructure it is not possible to reproduce this problem, since we use hostnames shorter than 10 characters.

On Slurm 18.08.8 (CentOS 7.8), I get quite different output.

  1. If I use the command currently in node.go, I get the following (shortened for clarity):
[...]
node1032            main              189200              191762              86/10/0/96          mixed
node1033            main              189200              191762              86/10/0/96          mixed
  2. With the proposed modification to the syntax, I get this instead:
[...]
node1032main18920019176286/10/0/96mixed
node1033main18920019176286/10/0/96mixed

The second output would most likely crash node.go as well, since the parser assumes the values (node name, partition, etc.) are separated by whitespace.

Which version of Slurm are you using? From the crash log you posted and the hostnames, I assume you are running a cluster on AWS, but we definitely do not have operational experience with that environment.

MK4H commented May 29, 2021

We are using AWS ParallelCluster 2.10.0, running on Ubuntu 18.04 using Slurm version 20.02.4. With this version, the output of sinfo -h -N -O "NodeList: ,AllocMem: ,Memory: ,CPUsState: ,StateLong: " is

cpu-always-on-st-t3amedium-1 0 1 0/2/0/2 idle 
cpu-always-on-st-t3amedium-2 0 1 0/2/0/2 idle 
cpu-always-on-st-t3amedium-3 0 1 0/2/0/2 idle 
cpu-spot-dy-c52xlarge-1 0 1 0/8/0/8 idle~ 
cpu-spot-dy-c52xlarge-2 0 1 0/8/0/8 idle~ 

So I guess there was a change between 18.08.8 and 20.02.4 that changed the interface and output of sinfo.

Edit:
Found it: SchedMD/slurm@9ea6c94

mtds commented May 29, 2021

We have faced this problem in the past. Since this exporter basically parses the output of sinfo, squeue, sdiag, etc., it is very sensitive to the output format, which is changed from time to time by the SchedMD developers.

The node.go module is particularly prone to this problem. In the past, I implemented a workaround using regular expressions (e.g. in nodes.go), but so far I have not had the chance to do the same for node.go.
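
For comparison, here is a rough sketch of the regex idea (this is not the actual nodes.go code; the pattern and what it extracts are illustrative only): instead of splitting on whitespace and indexing fixed positions, anchor on the parts of the line that stay unambiguous even when columns run together, such as the A/I/O/T CPU counts and the trailing state keyword.

package main

import (
	"fmt"
	"regexp"
)

// nodeLineRE anchors on the CPUsState quadruple (allocated/idle/other/total)
// and the node state at the end of the line, which remain parseable even when
// a truncated node name runs into the next column. Note that AllocMem and
// Memory are not recovered here; this only shows the anchoring idea.
var nodeLineRE = regexp.MustCompile(`(\d+)/(\d+)/(\d+)/(\d+)\s+(\w+[~*#$@+-]?)\s*$`)

func main() {
	line := "cpu-spot-dy-c52xlar0                   1                   0/8/0/8             idle~"
	m := nodeLineRE.FindStringSubmatch(line)
	if m == nil {
		fmt.Println("line did not match expected pattern:", line)
		return
	}
	fmt.Printf("allocated=%s idle=%s other=%s total=%s state=%s\n", m[1], m[2], m[3], m[4], m[5])
}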

gianlorenzop pushed a commit to gianlorenzop/prometheus-slurm-exporter that referenced this issue Mar 3, 2022
sjpb commented May 24, 2022

Recent versions of sinfo have --json, although that won't be an option for everyone. Does this exporter need fields which are not supported by -o|--format? The man page says that if a size isn't provided, -o automatically picks one long enough, as opposed to -O|--Format, which defaults to 20 characters when no size is given.
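
Another option (untested here, but based on the documented -O|--Format size syntax) would be to keep the current field list and give each field an explicit width large enough for long node names, for example:

sinfo -h -N -O "NodeList:100,AllocMem:20,Memory:20,CPUsState:30,StateLong:30"

This only moves the limit rather than removing it, so the explicit trailing-space suffix or --json parsing would still be more robust.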
