Crash when node names are too long #55

Open
MK4H opened this issue May 4, 2021 · 4 comments

MK4H commented May 4, 2021

If node names are over 20 characters long, the output of sinfo -h -N -O "NodeList,AllocMem,Memory,CPUsState,StateLong", used at node.go:85, looks like this:

cpu-always-on-st-t30                   1                   0/2/0/2             idle                
cpu-spot-dy-c52xlar0                   1                   0/8/0/8             idle~               
cpu-spot-dy-c52xlar0                   1                   0/8/0/8             idle~               

You can see that the truncated node name and the AllocMem value are not separated by whitespace.

This results in a crash with the following output:

prometheus-slurm-exporter[5783]: panic: runtime error: index out of range [4] with length 4
prometheus-slurm-exporter[5783]: goroutine 9 [running]:
prometheus-slurm-exporter[5783]: main.ParseNodeMetrics(0xc00016e000, 0x5eb, 0xe00, 0xc0000b10d8)
prometheus-slurm-exporter[5783]: #011/home/ubuntu/aws-parallelcluster-monitoring/prometheus-slurm-exporter/node.go:56 +0x6cf
prometheus-slurm-exporter[5783]: main.NodeGetMetrics(0x8b7f20)
prometheus-slurm-exporter[5783]: #011/home/ubuntu/aws-parallelcluster-monitoring/prometheus-slurm-exporter/node.go:40 +0x2a
prometheus-slurm-exporter[5783]: main.(*NodeCollector).Collect(0xc00007a000, 0xc0000b1080)
prometheus-slurm-exporter[5783]: #011/home/ubuntu/aws-parallelcluster-monitoring/prometheus-slurm-exporter/node.go:128 +0x37
prometheus-slurm-exporter[5783]: github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
prometheus-slurm-exporter[5783]: #011/home/ubuntu/aws-parallelcluster-monitoring/prometheus-slurm-exporter/go/modules/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:443 +0x1a2
prometheus-slurm-exporter[5783]: created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
prometheus-slurm-exporter[5783]: #011/home/ubuntu/aws-parallelcluster-monitoring/prometheus-slurm-exporter/go/modules/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:535 +0xe8e
systemd[1]: slurm_exporter.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
systemd[1]: slurm_exporter.service: Failed with result 'exit-code'.

The parser expects 5 fields separated by whitespace, but finds only 4, which results in an out-of-bounds array access and a panic.

A possible fix is to change sinfo -h -N -O "NodeList,AllocMem,Memory,CPUsState,StateLong" to sinfo -h -N -O "NodeList: ,AllocMem: ,Memory: ,CPUsState: ,StateLong: ", explicitly telling Slurm to append a space after each value.
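
For illustration, here is a minimal sketch (not the actual node.go code; the function name and error handling are hypothetical) of a defensive version of the parsing, combining the modified sinfo format with a field-count check so that a malformed line produces an error instead of a panic:

package main

import (
	"fmt"
	"strings"
)

// nodeLine holds the five columns requested from sinfo:
// NodeList, AllocMem, Memory, CPUsState, StateLong.
type nodeLine struct {
	Name, AllocMem, Memory, CPUsState, State string
}

// parseNodeLines splits the sinfo output into whitespace-separated fields
// and checks the field count before indexing, avoiding the
// "index out of range" panic on malformed lines.
func parseNodeLines(output string) ([]nodeLine, error) {
	var nodes []nodeLine
	for _, line := range strings.Split(strings.TrimSpace(output), "\n") {
		fields := strings.Fields(line)
		if len(fields) < 5 {
			return nil, fmt.Errorf("unexpected sinfo line (want 5 fields, got %d): %q", len(fields), line)
		}
		nodes = append(nodes, nodeLine{fields[0], fields[1], fields[2], fields[3], fields[4]})
	}
	return nodes, nil
}

func main() {
	out := "cpu-always-on-st-t3amedium-1 0 1 0/2/0/2 idle\ncpu-spot-dy-c52xlarge-1 0 1 0/8/0/8 idle~"
	nodes, err := parseNodeLines(out)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, n := range nodes {
		fmt.Printf("%s: %s/%s MB allocated, state %s\n", n.Name, n.AllocMem, n.Memory, n.State)
	}
}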

MK4H pushed a commit to MK4H/prometheus-slurm-exporter that referenced this issue May 6, 2021
Fix vpenso#55

When node names are over 20 characters long, sinfo truncates the name
but does not preserve the space between NodeList and AllocMem. Example:

cpu-always-on-st-t30                   1                   0/2/0/2
idle

This commit explicitly tells sinfo to append a space after each entry,
fixing the issue.
@mtds mtds self-assigned this May 18, 2021
mtds commented May 19, 2021

On our current infrastructure it is not possible to reproduce this problem, since we use hostnames shorter than 10 characters.

On Slurm 18.08.8 (CentOS 7.8), I get quite different output.

  1. If I use the command currently in node.go, I get the following (shortened for clarity):
[...]
node1032            main              189200              191762              86/10/0/96          mixed
node1033            main              189200              191762              86/10/0/96          mixed
  2. With the proposed modification to the syntax, I get this instead:
[...]
node1032main18920019176286/10/0/96mixed
node1033main18920019176286/10/0/96mixed

The second output would most likely crash node.go as well, since the parser assumes the values (node name, partition, etc.) are separated by whitespace.

Which version of Slurm are you using? From the crash log you posted and the hostnames, I assume you are running a cluster on AWS, but we definitely do not have operational experience with that environment.

MK4H commented May 29, 2021

We are using AWS ParallelCluster 2.10.0, running on Ubuntu 18.04 using Slurm version 20.02.4. With this version, the output of sinfo -h -N -O "NodeList: ,AllocMem: ,Memory: ,CPUsState: ,StateLong: " is

cpu-always-on-st-t3amedium-1 0 1 0/2/0/2 idle 
cpu-always-on-st-t3amedium-2 0 1 0/2/0/2 idle 
cpu-always-on-st-t3amedium-3 0 1 0/2/0/2 idle 
cpu-spot-dy-c52xlarge-1 0 1 0/8/0/8 idle~ 
cpu-spot-dy-c52xlarge-2 0 1 0/8/0/8 idle~ 

So I guess there was a change between 18.08.8 and 20.02.4 that changed the interface and output of sinfo.

Edit:
Found it: SchedMD/slurm@9ea6c94

mtds commented May 29, 2021

We have faced this problem in the past. Since this exporter basically parses the output of sinfo, squeue, sdiag, etc., it is very sensitive to the output format, which is changed from time to time by the SchedMD developers.

The node.go module is particularly prone to this problem. In the past, I implemented a workaround using regular expressions (e.g. in nodes.go), but so far I have not had the chance to do the same for node.go.
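
For comparison, here is a rough sketch of the regex idea (this is not the actual nodes.go code; the pattern and what it extracts are illustrative only): instead of splitting on whitespace and indexing fixed positions, anchor on the parts of the line that stay unambiguous even when columns run together, such as the A/I/O/T CPU counts and the trailing state keyword.

package main

import (
	"fmt"
	"regexp"
)

// nodeLineRE anchors on the CPUsState quadruple (allocated/idle/other/total)
// and the node state at the end of the line, which remain parseable even when
// a truncated node name runs into the next column. Note that AllocMem and
// Memory are not recovered here; this only shows the anchoring idea.
var nodeLineRE = regexp.MustCompile(`(\d+)/(\d+)/(\d+)/(\d+)\s+(\w+[~*#$@+-]?)\s*$`)

func main() {
	line := "cpu-spot-dy-c52xlar0                   1                   0/8/0/8             idle~"
	m := nodeLineRE.FindStringSubmatch(line)
	if m == nil {
		fmt.Println("line did not match expected pattern:", line)
		return
	}
	fmt.Printf("allocated=%s idle=%s other=%s total=%s state=%s\n", m[1], m[2], m[3], m[4], m[5])
}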

gianlorenzop pushed a commit to gianlorenzop/prometheus-slurm-exporter that referenced this issue Mar 3, 2022
sjpb commented May 24, 2022

Recent versions of sinfo have --json, although that won't be an option for everyone. Does this exporter need fields which are not supported by -o|--format? The man page says that if a size isn't provided, -o automatically picks one long enough, as opposed to -O|--Format, which defaults to 20 characters when no size is given.
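
Another option (untested here, but based on the documented -O|--Format size syntax) would be to keep the current field list and give each field an explicit width large enough for long node names, for example:

sinfo -h -N -O "NodeList:100,AllocMem:20,Memory:20,CPUsState:30,StateLong:30"

This only moves the limit rather than removing it, so the explicit trailing-space suffix or --json parsing would still be more robust.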
