
gpu version #38

Open
titansmc opened this issue Feb 8, 2021 · 25 comments
@titansmc

titansmc commented Feb 8, 2021

Hi,
I am testing the latest version and the GPU info does not seem to be accurate. How can I start debugging?

# HELP slurm_gpus_alloc Allocated GPUs
# TYPE slurm_gpus_alloc gauge
slurm_gpus_alloc 21
# HELP slurm_gpus_idle Idle GPUs
# TYPE slurm_gpus_idle gauge
slurm_gpus_idle -21
# HELP slurm_gpus_total Total GPUs
# TYPE slurm_gpus_total gauge
slurm_gpus_total 0
# HELP slurm_gpus_utilization Total GPU utilization
# TYPE slurm_gpus_utilization gauge
slurm_gpus_utilization +Inf

Cheers.

@mtds
Collaborator

mtds commented Feb 8, 2021

The GPU module is basically a wrapper which parses the output of the following commands:

Allocated GPUs

sacct -a -X --format=Allocgres --state=RUNNING --noheader --parsable2

Total available GPUs

sinfo -h -o "%n %G"

What do these commands report to you? Since the total number of GPUs reported in your case is 0, it is not surprising that the GPU utilization metric (which the Go module calculates as 'allocated' divided by 'total' GPUs) goes to infinity. This also explains why 'Idle GPUs' is negative: it is evaluated as the total minus the allocated GPUs.

If the Slurm commands report the same results that the exporter is showing, then there is something in your configuration that has to be verified (possibly in the commands we are using too). If that is not the case, then there is something wrong in the logic of this module and we have to look deeper into it.
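For reference, here is a minimal sketch (not the exporter's actual code; the names are illustrative) of how the three derived values relate, including the zero-total guard that would avoid the -21 and +Inf readings shown above:

package main

import "fmt"

// computeGPUMetrics derives idle and utilization from the allocated and total
// GPU counts, guarding against a zero total so utilization never becomes +Inf.
func computeGPUMetrics(alloc, total float64) (idle, utilization float64) {
	idle = total - alloc
	if total > 0 {
		utilization = alloc / total
	}
	return idle, utilization
}

func main() {
	// With total reported as 0 (as in the metrics above), idle goes negative
	// and an unguarded division would produce +Inf.
	idle, util := computeGPUMetrics(21, 0)
	fmt.Println(idle, util) // -21 0
}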

@mtds mtds self-assigned this Feb 8, 2021
@titansmc
Author

titansmc commented Feb 8, 2021

The first command just outputs a bunch of blank lines, with a few entries in between:




gpu:2




gpu:1





about the second one:

[root@~]# sinfo -h -o "%n %G"
sb01-13 tmp:844G
sb01-01 tmp:844G
sb01-02 tmp:844G
sb01-03 tmp:844G
sb01-04 tmp:844G
sb01-05 tmp:844G
sb01-06 tmp:844G
sb01-07 tmp:844G
sb01-08 tmp:844G
sb01-09 tmp:844G
sb01-10 tmp:844G
sb01-11 tmp:844G
sb01-12 tmp:844G
sb01-14 tmp:844G
sb01-15 tmp:844G
sb01-16 tmp:844G
sb01-17 tmp:844G
sb01-18 tmp:844G
sb01-19 tmp:844G
sb01-20 tmp:844G
sb02-01 tmp:844G
sb02-02 tmp:844G
sb02-03 tmp:844G
sb02-04 tmp:844G
sb02-05 tmp:844G
sb02-06 tmp:844G
sb02-07 tmp:844G
sb02-08 tmp:844G
sb02-09 tmp:844G
sb02-10 tmp:844G
sb02-11 tmp:844G
sb02-12 tmp:844G
sb02-13 tmp:844G
sb02-14 tmp:844G
sb02-15 tmp:844G
sb02-16 tmp:844G
sb02-17 tmp:844G
sb02-18 tmp:844G
sb02-19 tmp:844G
sb02-20 tmp:844G
sb03-02 tmp:467G
sb03-03 tmp:467G
sb03-04 tmp:467G
sb04-02 tmp:467G
sb04-03 tmp:467G
sb04-04 tmp:467G
sb04-05 tmp:467G
sb04-06 tmp:467G
sb04-07 tmp:467G
sb04-08 tmp:467G
sb04-09 tmp:467G
sb04-10 tmp:467G
sb04-11 tmp:467G
sb04-12 tmp:467G
sb04-13 tmp:467G
sb04-14 tmp:467G
sb04-15 tmp:467G
sb04-16 tmp:467G
sb04-17 tmp:467G
sb04-18 tmp:467G
sb04-19 tmp:467G
sb04-20 tmp:467G
sb05-02 tmp:467G
sb05-03 tmp:467G
sb05-04 tmp:467G
sb05-05 tmp:467G
sb05-06 tmp:467G
sb05-07 tmp:467G
sb05-08 tmp:467G
sb05-09 tmp:467G
sb05-10 tmp:467G
sb05-11 tmp:467G
sb05-12 tmp:467G
sb05-13 tmp:467G
sb05-14 tmp:467G
sb05-15 tmp:467G
sb05-16 tmp:467G
sb05-17 tmp:467G
sb05-18 tmp:467G
sb05-19 tmp:467G
sb05-20 tmp:467G
sm-epyc-01 tmp:7571G
sm-epyc-02 tmp:9400282M
sm-epyc-03 tmp:9400282M
sm-epyc-04 tmp:9400282M
sm-epyc-05 tmp:9400282M
smer01-1 tmp:203G
smer01-2 tmp:203G
smer01-3 tmp:203G
smer01-4 tmp:203G
smer02-1 tmp:203G
smer02-2 tmp:203G
smer02-3 tmp:203G
smer02-4 tmp:203G
smer03-1 tmp:203G
smer03-2 tmp:203G
smer03-3 tmp:203G
smer03-4 tmp:203G
smer04-1 tmp:203G
smer04-2 tmp:203G
smer04-3 tmp:203G
smer04-4 tmp:203G
smer05-1 tmp:203G
smer05-2 tmp:203G
smer05-3 tmp:203G
smer05-4 tmp:203G
smer06-1 tmp:203G
smer06-2 tmp:203G
smer06-3 tmp:203G
smer06-4 tmp:203G
smer07-1 tmp:203G
smer07-2 tmp:203G
smer07-3 tmp:203G
smer07-4 tmp:203G
smer08-1 tmp:203G
smer08-2 tmp:203G
smer08-3 tmp:203G
smer08-4 tmp:203G
smer09-1 tmp:203G
smer09-2 tmp:203G
smer09-3 tmp:203G
smer09-4 tmp:203G
smer10-1 tmp:203G
smer10-2 tmp:203G
smer10-3 tmp:203G
smer10-4 tmp:203G
smer11-1 tmp:203G
smer11-2 tmp:203G
smer11-3 tmp:203G
smer11-4 tmp:203G
smer12-1 tmp:203G
smer12-2 tmp:203G
smer12-3 tmp:203G
smer12-4 tmp:203G
smer13-1 tmp:203G
smer13-2 tmp:203G
smer13-3 tmp:203G
smer13-4 tmp:203G
smer14-1 tmp:203G
smer14-2 tmp:203G
smer14-3 tmp:203G
smer14-4 tmp:203G
smer15-1 tmp:203G
smer15-2 tmp:203G
smer15-3 tmp:203G
smer15-4 tmp:203G
smer16-1 tmp:203G
smer16-2 tmp:203G
smer16-3 tmp:203G
smer16-4 tmp:203G
smer17-1 tmp:203G
smer17-2 tmp:203G
smer17-3 tmp:203G
smer17-4 tmp:203G
smer18-1 tmp:203G
smer18-2 tmp:203G
smer18-3 tmp:203G
smer18-4 tmp:203G
smer19-1 tmp:203G
smer19-2 tmp:203G
smer19-3 tmp:203G
smer19-4 tmp:203G
smer20-1 tmp:203G
smer20-2 tmp:203G
smer20-3 tmp:203G
smer20-4 tmp:203G
smer21-1 tmp:203G
smer21-2 tmp:203G
smer21-3 tmp:203G
smer21-4 tmp:203G
smer22-1 tmp:203G
smer22-2 tmp:203G
smer22-3 tmp:203G
smer22-4 tmp:203G
smer23-1 tmp:203G
smer23-2 tmp:203G
smer23-3 tmp:203G
smer23-4 tmp:203G
smer24-1 tmp:203G
smer24-2 tmp:203G
smer24-3 tmp:203G
smer24-4 tmp:203G
smer25-1 tmp:203G
smer25-2 tmp:203G
smer25-3 tmp:203G
smer25-4 tmp:203G
smer26-1 tmp:203G
smer26-2 tmp:203G
smer26-3 tmp:203G
smer26-4 tmp:203G
smer27-1 tmp:203G
smer27-2 tmp:203G
smer27-3 tmp:203G
smer27-4 tmp:203G
smer28-1 tmp:203G
smer28-2 tmp:203G
smer28-3 tmp:203G
smer28-4 tmp:203G
smer29-1 tmp:203G
smer29-2 tmp:203G
smer29-3 tmp:203G
smer29-4 tmp:203G
smer30-1 tmp:203G
smer30-2 tmp:203G
smer30-3 tmp:203G
smer30-4 tmp:203G
gpu4 gpu:1080Ti:8(S:0-1),tmp:1127G
gpu5 gpu:1080Ti:8(S:0-1),tmp:1127G
gpu8 gpu:2080Ti:8(S:0),tmp:3100G
gpu10 gpu:V100:4,tmp:456G
gpu11 gpu:2080Ti:4,tmp:467G
gpu12 gpu:2080Ti:4,tmp:467G
gpu13 gpu:2080Ti:4,tmp:467G
gpu14 gpu:2080Ti:4,tmp:467G
gpu15 gpu:2080Ti:4,tmp:467G
sb03-05 gpu:A100:1,tmp:467G
gpu9 gpu:2080Ti:8(S:0),tmp:3100G
gpu16 gpu:2080Ti:4,tmp:467G
gpu17 gpu:2080Ti:4,tmp:467G
gpu18 gpu:2080Ti:4,tmp:467G
gpu19 gpu:2080Ti:4,tmp:467G
gpu20 gpu:2080Ti:4,tmp:467G
sb03-06 gpu:A100:1,tmp:467G
sb03-07 gpu:A100:1,tmp:467G
sb03-08 gpu:A100:1,tmp:467G
sb03-09 gpu:A100:1,tmp:467G
sb03-10 gpu:A100:1,tmp:467G
sb03-11 gpu:A100:1,tmp:467G
sb03-12 gpu:A100:1,tmp:467G
sb03-13 gpu:A100:1,tmp:467G
sb03-14 gpu:A100:1,tmp:467G
sb03-15 gpu:A100:1,tmp:467G
sb03-16 gpu:A100:1,tmp:467G
sb03-17 gpu:A100:1,tmp:467G
sb03-18 gpu:A100:1,tmp:467G
sb03-19 gpu:A100:1,tmp:467G
sb03-20 gpu:A100:1,tmp:467G
bn01 (null)
bn02 (null)
bn03 (null)
bn04 (null)
sb04-01 (null)

@biocyberman

The GPU module is basically a wrapper which parses the output of the following commands:

Allocated GPUs

sacct -a -X --format=Allocgres --state=RUNNING --noheader --parsable2

Depending on which version you are running: on Slurm 20.04, Allocgres is replaced with AllocTRES. The Grafana JSON rev3, however, doesn't have anything for GPUs yet, @mtds.

@titansmc
Author

I am running 20.02

@mtds
Collaborator

mtds commented Mar 4, 2021

@titansmc
Author

titansmc commented Mar 5, 2021

@mtds It doesn't work; it produces output like:

[root@lrms1 ~]# sacct -a -X --format=AllocTRES --state=RUNNING --noheader --parsable2
billing=6,cpu=6,mem=35G,node=1
billing=1,cpu=1,mem=2G,node=1
billing=1,cpu=1,mem=2G,node=1
billing=1,cpu=1,mem=2G,node=1

while Gres :

[root@lrms1 ~]# sacct -a -X --format=Allocgres --state=RUNNING --noheader --parsable2


gpu:2







gpu:2



gpu:2

What should I expect from that command?

@mtds
Collaborator

mtds commented Mar 10, 2021

What should I expect from that command?

I don't have an answer for that, since it is highly dependent on your Slurm GPU configuration,
and there is also issue #40 in the mix.

@JoeriHermans: could you possibly offer some insight about the output format? We do not have enough GPUs
right now to change our configuration accordingly and run a test. Could you perhaps provide test data for the
output as well, like the other *_test.go files?

@lahwaacz
Contributor

This patch seems to work for me:

diff --git a/gpus.go b/gpus.go
index ca3bcaf..9e90421 100644
--- a/gpus.go
+++ b/gpus.go
@@ -38,15 +38,19 @@ func GPUsGetMetrics() *GPUsMetrics {
 func ParseAllocatedGPUs() float64 {
 	var num_gpus = 0.0
 
-	args := []string{"-a", "-X", "--format=Allocgres", "--state=RUNNING", "--noheader", "--parsable2"}
+	args := []string{"-a", "-X", "--format=AllocTRES", "--state=RUNNING", "--noheader", "--parsable2"}
 	output := string(Execute("sacct", args))
 	if len(output) > 0 {
 		for _, line := range strings.Split(output, "\n") {
 			if len(line) > 0 {
 				line = strings.Trim(line, "\"")
-				descriptor := strings.TrimPrefix(line, "gpu:")
-				job_gpus, _ := strconv.ParseFloat(descriptor, 64)
-				num_gpus += job_gpus
+				for _, resource := range strings.Split(line, ",") {
+					if strings.HasPrefix(resource, "gres/gpu=") {
+						descriptor := strings.TrimPrefix(resource, "gres/gpu=")
+						job_gpus, _ := strconv.ParseFloat(descriptor, 64)
+						num_gpus += job_gpus
+					}
+				}
 			}
 		}
 	}
@@ -63,11 +67,17 @@ func ParseTotalGPUs() float64 {
 		for _, line := range strings.Split(output, "\n") {
 			if len(line) > 0 {
 				line = strings.Trim(line, "\"")
-				descriptor := strings.Fields(line)[1]
-				descriptor = strings.TrimPrefix(descriptor, "gpu:")
-				descriptor = strings.Split(descriptor, "(")[0]
-				node_gpus, _ :=  strconv.ParseFloat(descriptor, 64)
-				num_gpus += node_gpus
+				gres := strings.Fields(line)[1]
+				// gres column format: comma-delimited list of resources
+				for _, resource := range strings.Split(gres, ",") {
+					if strings.HasPrefix(resource, "gpu:") {
+						// format: gpu:<type>:N(S:<something>), e.g. gpu:RTX2070:2(S:0)
+						descriptor := strings.Split(resource, ":")[2]
+						descriptor = strings.Split(descriptor, "(")[0]
+						node_gpus, _ :=  strconv.ParseFloat(descriptor, 64)
+						num_gpus += node_gpus
+					}
+				}
 			}
 		}
 	}
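For context on what the patched code looks for: assuming gres/gpu is tracked in AccountingStorageTRES, a running job with GPUs allocated shows a gres/gpu=N entry in its AllocTRES string, along the lines of (illustrative values):

billing=8,cpu=8,gres/gpu=2,mem=32G,node=1

The AllocTRES lines shown earlier contain no gres/gpu entry simply because those jobs requested no GPUs.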

@mtds
Collaborator

mtds commented Mar 10, 2021

Thanks for the patch, but as I wrote above, I do not currently have a chance to run a test on a cluster with GPUs.

I assume that this patch will work (i.e. no obvious syntax errors) but I am wary of integrating it right now. I would need
at least two other people who can test it on their configurations.

@crinavar

crinavar commented Mar 15, 2021

I can confirm the same problem with Slurm version 20.11.2. So far I have only changed the single line to use the "AllocTRES" argument.
@lahwaacz @mtds which grafana dashboard works with the patch?

@ThomasADavis

Just updated to this on our system.

The sinfo -h -o "%n %G" command takes the scrape from 1/2 seconds to 2.5 minutes on our system, and it has no GPUs.

'%n' means you have to examine every node in the system to see if it has a GPU, and our Cray has over 10k nodes with no GPUs.

Is there a way to disable this? I changed it from %n to %N and everything is much faster now.

@mtds
Collaborator

mtds commented Mar 18, 2021

Is there a way to disable this? I changed it from %n to %N and everything is much faster now.

At the moment there is no way to turn it off, but given the mixed results of this patch, I believe I will add
a command line switch to explicitly turn it on; otherwise, by default, it will be disabled.

The fact that changing the options makes the sinfo command so much faster makes me wonder.

According to the man page of sinfo, those options are doing the following:

%n
    List of node hostnames. 
%N
    List of node names. 

So, with %n you'll get a complete list:

host001
host002
[...]

while with %N a compressed list of the hosts is printed, like the following:

host0[01-10]

No wonder the second case is faster; it may depend on the length of the output, but I am not sure
how sinfo goes through the list internally (naively, I would think it should check the configuration files,
since it is not mentioned anywhere that this command will issue RPC calls to slurmctld).

How many nodes (approximately) do you have on your cluster? We never tested this exporter with more than 800
nodes, so I cannot say how performant those sinfo commands are on very big installations.

@mtds
Collaborator

mtds commented Mar 18, 2021

[...] which grafana dashboard works with the patch?

@crinavar: you can try the one I made, available here: https://grafana.com/grafana/dashboards/4323

Note: there are no graph panels (yet) for GPUs, since we do not have much of that hardware in our current
installation, so I have not had the chance to create an additional dashboard so far.

@lahwaacz
Contributor

lahwaacz commented Mar 18, 2021

while with %N a compressed list of the hosts is printed, like the following:

host0[01-10]

This also means that if there were some GPUs on these nodes, they would not be counted correctly. The exporter expects one node per line and does not know that e.g. host0[01-10] is 10 nodes...
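To illustrate the point, here is a minimal sketch (not part of the exporter) that expands only the simple single-range form such as host0[01-10]; real Slurm hostlists can be more complex (comma-separated sets, multiple ranges), so in practice expanding them via scontrol show hostnames would be the safer option:

package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// expandHostlist expands the simple form "prefix[start-end]" (e.g. "host0[01-10]")
// into individual node names, preserving the zero padding. Anything without a
// bracketed range is returned unchanged.
func expandHostlist(list string) []string {
	re := regexp.MustCompile(`^(.*)\[(\d+)-(\d+)\]$`)
	m := re.FindStringSubmatch(list)
	if m == nil {
		return []string{list}
	}
	start, _ := strconv.Atoi(m[2])
	end, _ := strconv.Atoi(m[3])
	width := len(m[2])
	var hosts []string
	for i := start; i <= end; i++ {
		hosts = append(hosts, fmt.Sprintf("%s%0*d", m[1], width, i))
	}
	return hosts
}

func main() {
	fmt.Println(expandHostlist("host0[01-10]")) // host001 host002 ... host010
}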

@ThomasADavis

Cori (cori.nersc.gov) currently has 2,388 Haswell nodes and 9,688 KNL nodes. No GPUs.

Perlmutter's size I am not at liberty to disclose at this time, but it has GPUs, and it will have more nodes than Cori.

@mtds
Collaborator

mtds commented Mar 18, 2021

@ThomasADavis: I see. That's quite a difference in terms of installation size.

Take a look at the gpus_acct branch

There is only one commit of difference from the master branch:

  • By default, the GPUs collector is now disabled.
  • A command line option -gpus-acct must be set to true in order to enable it (see the sketch after this list).
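As a rough sketch of how such a switch is typically wired in Go (the flag name comes from the list above; the exporter's actual handling may differ):

package main

import (
	"flag"
	"log"
)

func main() {
	// Disabled by default; must be explicitly set to true to enable GPU accounting.
	gpusAcct := flag.Bool("gpus-acct", false, "Enable GPUs accounting (disabled by default)")
	flag.Parse()

	// Log the status of the feature either way; the GPUs collector would only
	// be registered when the flag is enabled.
	if *gpusAcct {
		log.Println("GPUs accounting: enabled")
	} else {
		log.Println("GPUs accounting: disabled")
	}
}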

@ThomasADavis

I'll just add: "We break things."

I will look at it. I thought we were still under a blackout, but they did post that there will be 6000+ GPUs in the Perlmutter phase 1 system.

@mtds
Collaborator

mtds commented Mar 18, 2021

I'll just add: "We break things."

For us it's interesting to know that there are such big installations using this exporter! And bug reports are always welcome :-)

[...] but they did post that there will be 6000+ GPUs in the Perlmutter phase 1 system.

Those are definitely more GPUs than we are expecting to receive and install in the coming months... and in the coming years as well, I guess.

I cannot say right now whether sacct will perform correctly: with utilities that interact directly with the Slurm DB backend, there is always the possibility that 'horrible' SQL queries behind the scenes will result in timed-out answers. This is the reason why (whenever possible) we have used only sinfo, squeue and sdiag.

@ThomasADavis

We have a contract with the Slurm people to deal with some of those issues.

@crinavar

crinavar commented Apr 10, 2021

[...] which grafana dashboard works with the patch?

@crinavar: you can try the one I made, available here: https://grafana.com/grafana/dashboards/4323

Note: there are no graph panels (yet) for GPUs, since we do not have much of that hardware in our current
installation, so I have not had the chance to create an additional dashboard so far.

Many thanks.
I am now testing the patch presented by @lahwaacz, but it was giving this error:
slurm-exporter_1 | panic: runtime error: index out of range
EDIT: solved; the patch now works and the exporter is working properly.
For anyone having the same problem, the error was caused because I had the gres.conf file like this:

# GRES configuration for native GPUS
# DGX A100: 8x Nvidia A100

# Autodetect not working
#AutoDetect=nvml
Name=gpu File=/dev/nvidia[0-7]

Not having "Type" made index "2" in the gpus.go patch produce an out-of-bounds error. Adding Type solved the problem:

# GRES configuration for native GPUS
# DGX A100: 8x Nvidia A100

# Autodetect not working
#AutoDetect=nvml
Name=gpu Type=A100 File=/dev/nvidia[0-7]

The patch actually has a very important comment I didn't pay attention to:
// format: gpu:<type>:N(S:<something>), e.g. gpu:RTX2070:2(S:0)
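A more defensive version of that parsing step would accept both the typed and the untyped form instead of assuming that index 2 exists. A rough sketch (not the patch as merged):

package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseGresGPUCount extracts the GPU count from a single GRES entry, accepting
// both "gpu:8" (no Type= in gres.conf) and "gpu:A100:8(S:0-1)" (typed).
func parseGresGPUCount(resource string) float64 {
	if !strings.HasPrefix(resource, "gpu:") {
		return 0
	}
	// Strip any "(S:...)" socket suffix, then take the last ":"-separated field.
	resource = strings.Split(resource, "(")[0]
	fields := strings.Split(resource, ":")
	n, err := strconv.ParseFloat(fields[len(fields)-1], 64)
	if err != nil {
		return 0
	}
	return n
}

func main() {
	fmt.Println(parseGresGPUCount("gpu:8"))             // 8
	fmt.Println(parseGresGPUCount("gpu:A100:8(S:0-1)")) // 8
	fmt.Println(parseGresGPUCount("tmp:467G"))          // 0
}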

best

@mtds
Collaborator

mtds commented Apr 16, 2021

Version 0.19 introduces a breaking change: by default, GPU accounting will not be enabled, but a command line option can be used to explicitly activate it. Note that the exporter will also log the status of this feature, whether it's enabled or not.

Considering the ongoing discussion here and what was also reported in issue #40, we have decided to change the default behaviour of the exporter and play it safe by keeping this functionality off by default.

Until we have a chance to test this feature on our cluster (we are going through the process of acquiring new servers equipped with GPUs), we will leave these issues open.

It would be useful if other users could report how the GPU accounting functionality is working in their infrastructure. In particular:

  • version of Slurm;
  • details about how the GPUs are configured in Slurm.

nuno-silva added a commit to nuno-silva/prometheus-slurm-exporter that referenced this issue Feb 2, 2022
These changes make accounting work with slurm 19 and should help with vpenso#38
@itzsimpl

itzsimpl commented Mar 4, 2022

Ever since v19.05.0rc1, Slurm has provided another way to check for available and active GRES, i.e. via sinfo -a -h --Format=Nodes,Gres,GresUsed. I have refactored gpus.go to be based on this call and to also handle cases where a GRES type is defined. See PR #73. I have tested it on Slurm 21.08.5; note that for Slurm versions below 19.05.0rc1 it is, at the moment, better to stay on the old implementation.
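For reference, a rough sketch of the aggregation that approach implies (illustrative only; the real implementation is in PR #73): each output line carries a node count plus the Gres and GresUsed strings, so the per-node GPU figures have to be multiplied by the node count.

package main

import (
	"fmt"
	"strconv"
	"strings"
)

// gpuCount pulls the GPU count out of a GRES string such as
// "gpu:tesla:4(IDX:0-3)", "gpu:tesla:4" or "gpu:0"; "(null)" yields 0.
func gpuCount(gres string) float64 {
	if !strings.HasPrefix(gres, "gpu:") {
		return 0
	}
	gres = strings.Split(gres, "(")[0]
	fields := strings.Split(gres, ":")
	n, _ := strconv.ParseFloat(fields[len(fields)-1], 64)
	return n
}

func main() {
	// Sample lines in the shape of "sinfo -a -h --Format=Nodes,Gres,GresUsed".
	output := "4 (null) gpu:0\n8 gpu:tesla:4 gpu:tesla:2(IDX:0-1)"

	var total, used float64
	for _, line := range strings.Split(output, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 3 {
			continue
		}
		nodes, _ := strconv.ParseFloat(fields[0], 64)
		total += nodes * gpuCount(fields[1])
		used += nodes * gpuCount(fields[2])
	}
	fmt.Println(total, used) // 32 16
}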

@mtds
Collaborator

mtds commented Mar 29, 2022

@itzsimpl I have merged your updated PR into the development branch (among other contributions).
Please take a look and let me know if it works.

We are currently not able to test the GPU part of this exporter on newer versions of Slurm,
so I am relying on feedback from other users about stability and functionality.

Last but not least: thanks!!

@martialblog
Contributor

Hey, some feedback for Slurm 20.11.9 (CentOS 7, 20.11.9-1.el7.x86_64). As far as I can tell, the GPU export looks alright using the development branch.

Raw Output:

sinfo -a -h --Format="Nodes: ,GresUsed:" --state=allocated
4 gpu:tesla:4(IDX:0-3)
1 gpu:tesla:3(IDX:0,2-3)
1 gpu:tesla:1(IDX:1)
1 gpu:tesla:1(IDX:3)
1 gpu:tesla:2(IDX:1-2)

sinfo -a -h --Format="Nodes: ,Gres: ,GresUsed:" --state=idle,allocated
4 (null) gpu:0
4 gpu:tesla:4 gpu:tesla:4(IDX:0-3)
1 gpu:tesla:4 gpu:tesla:3(IDX:0,2-3)
1 gpu:tesla:4 gpu:tesla:1(IDX:1)
1 gpu:tesla:4 gpu:tesla:1(IDX:3)
1 gpu:tesla:4 gpu:tesla:2(IDX:1-2)

sinfo -a -h --Format="Nodes: ,Gres:"
4 (null)
8 gpu:tesla:4

Parsed:

# HELP slurm_gpus_alloc Allocated GPUs
# TYPE slurm_gpus_alloc gauge
slurm_gpus_alloc 23
# HELP slurm_gpus_idle Idle GPUs
# TYPE slurm_gpus_idle gauge
slurm_gpus_idle 9
# HELP slurm_gpus_other Other GPUs
# TYPE slurm_gpus_other gauge
slurm_gpus_other 0
# HELP slurm_gpus_total Total GPUs
# TYPE slurm_gpus_total gauge
slurm_gpus_total 32
# HELP slurm_gpus_utilization Total GPU utilization
# TYPE slurm_gpus_utilization gauge
slurm_gpus_utilization 0.71875
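As a sanity check against the raw output above: allocated = 4×4 + 3 + 1 + 1 + 2 = 23, total = 8×4 = 32, so idle = 32 - 23 = 9 and utilization = 23/32 = 0.71875, which matches the parsed metrics.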

I've got a GPU cluster if you need further testing before a new release. Is there anything I can help with to get the development branch ready to merge?

@itzsimpl

@martialblog thanks for testing; I did not have the chance to test the development branch, but PR #73 has been up and running on SLURM 21.08.5 for a couple of months now.
