gpu version #38
The GPU module is basically a wrapper which parses the output of the following commands:

Allocated GPUs: sacct -a -X --format=Allocgres --state=RUNNING --noheader --parsable2
Total available GPUs: sinfo -h -o "%n %G"

What do these commands report to you? Since the total number of GPUs reported in your case is 0, it is not surprising that the GPU utilization metric (which the Go module calculates as 'allocated' divided by 'total' GPUs) goes to infinity. This also explains why 'Idle GPUs' is negative: it is evaluated as the total minus the allocated GPUs.

If the Slurm commands report the same results that the exporter is showing, then there is something in your configuration that has to be verified (possibly in the commands we are using too). If that is not the case, then there is something wrong in the logic of this module and we have to look deeper into it.
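To illustrate the arithmetic described above, here is a minimal sketch (not the exporter's actual code; the hard-coded values stand in for the parsed command output):

```go
package main

import "fmt"

func main() {
	allocated := 2.0 // e.g. parsed from: sacct -a -X --format=Allocgres ...
	total := 0.0     // e.g. parsed from: sinfo -h -o "%n %G"; stays 0 if parsing fails

	utilization := allocated / total // becomes +Inf when total is 0
	idle := total - allocated        // goes negative when allocated > total

	fmt.Println(utilization, idle) // prints: +Inf -2
}
```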
The first command just outputs a bunch of blank lines and, in between, a few times:

About the second one:
Depending on which version you are running; on slurm 20.04
I am running 20.02
@mtds It doesn't work; it produces output like:

while Gres:

What should I expect from that command?
I don't have an answer for that, since it is highly dependent on your Slurm configuration related to the GPUs. @JoeriHermans: can you possibly offer an insight into the output format? We do not have enough GPUs
This patch seems to work for me:

```diff
diff --git a/gpus.go b/gpus.go
index ca3bcaf..9e90421 100644
--- a/gpus.go
+++ b/gpus.go
@@ -38,15 +38,19 @@ func GPUsGetMetrics() *GPUsMetrics {
func ParseAllocatedGPUs() float64 {
var num_gpus = 0.0
- args := []string{"-a", "-X", "--format=Allocgres", "--state=RUNNING", "--noheader", "--parsable2"}
+ args := []string{"-a", "-X", "--format=AllocTRES", "--state=RUNNING", "--noheader", "--parsable2"}
output := string(Execute("sacct", args))
if len(output) > 0 {
for _, line := range strings.Split(output, "\n") {
if len(line) > 0 {
line = strings.Trim(line, "\"")
- descriptor := strings.TrimPrefix(line, "gpu:")
- job_gpus, _ := strconv.ParseFloat(descriptor, 64)
- num_gpus += job_gpus
+ for _, resource := range strings.Split(line, ",") {
+ if strings.HasPrefix(resource, "gres/gpu=") {
+ descriptor := strings.TrimPrefix(resource, "gres/gpu=")
+ job_gpus, _ := strconv.ParseFloat(descriptor, 64)
+ num_gpus += job_gpus
+ }
+ }
}
}
}
@@ -63,11 +67,17 @@ func ParseTotalGPUs() float64 {
for _, line := range strings.Split(output, "\n") {
if len(line) > 0 {
line = strings.Trim(line, "\"")
- descriptor := strings.Fields(line)[1]
- descriptor = strings.TrimPrefix(descriptor, "gpu:")
- descriptor = strings.Split(descriptor, "(")[0]
- node_gpus, _ := strconv.ParseFloat(descriptor, 64)
- num_gpus += node_gpus
+ gres := strings.Fields(line)[1]
+ // gres column format: comma-delimited list of resources
+ for _, resource := range strings.Split(gres, ",") {
+ if strings.HasPrefix(resource, "gpu:") {
+ // format: gpu:<type>:N(S:<something>), e.g. gpu:RTX2070:2(S:0)
+ descriptor := strings.Split(resource, ":")[2]
+ descriptor = strings.Split(descriptor, "(")[0]
+ node_gpus, _ := strconv.ParseFloat(descriptor, 64)
+ num_gpus += node_gpus
+ }
+ }
}
}
}
```
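To make the parsing logic above concrete, here is a self-contained sketch of the two input formats the patch expects; the sample lines are invented for illustration, not taken from a real cluster:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

func main() {
	// sacct --format=AllocTRES emits a comma-delimited TRES list per job, e.g.:
	line := "billing=8,cpu=8,gres/gpu=2,mem=32G,node=1"
	for _, resource := range strings.Split(line, ",") {
		if strings.HasPrefix(resource, "gres/gpu=") {
			n, _ := strconv.ParseFloat(strings.TrimPrefix(resource, "gres/gpu="), 64)
			fmt.Println("allocated GPUs:", n) // 2
		}
	}

	// sinfo -h -o "%n %G" emits "<hostname> <gres>" per node, e.g.:
	gres := strings.Fields("node001 gpu:RTX2070:2(S:0)")[1]
	for _, resource := range strings.Split(gres, ",") {
		if strings.HasPrefix(resource, "gpu:") {
			descriptor := strings.Split(strings.Split(resource, ":")[2], "(")[0]
			n, _ := strconv.ParseFloat(descriptor, 64)
			fmt.Println("total GPUs on node:", n) // 2
		}
	}
}
```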
Thanks for the patch, but as I wrote above, I do not have the chance to run a test on a cluster with GPUs at the moment. I assume that this patch will work (e.g. no obvious syntax errors) but I am wary of integrating it right now. I would need
Just updated to this on our system. The sinfo -h -o "%n %G" command takes the scrape from 1-2 seconds to 2.5 minutes on our system, and it has no GPUs. '%n' means you have to examine every node in the system to see if it has a GPU, and our Cray has over 10k nodes with no GPUs. Is there a way to disable this? I changed it from %n to %N and everything is much faster now.
At the moment there is no way to turn it off but, given the mixed results of this patch, I believe I will add an option to disable it. The fact that you changed the options and the scrape got faster is interesting. According to the man page of sinfo, those options are doing the following:

So, with %n: "List of node hostnames" (one line per node),

while with %N: "List of node names" (identical nodes are grouped into a compact range).

No wonder in the second case it's faster, though it may depend on the length of the output; I am not sure. How many nodes (approximately) do you have on your cluster? We never tested this exporter with more than 800 nodes.
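To make the difference concrete, here is a hypothetical output from a three-node cluster (node names and GRES invented for illustration):

```
$ sinfo -h -o "%n %G"     # %n: one line per node hostname
node001 gpu:RTX2070:2
node002 gpu:RTX2070:2
node003 (null)

$ sinfo -h -o "%N %G"     # %N: identical nodes collapsed into a range
node[001-002] gpu:RTX2070:2
node003 (null)
```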
@crinavar: you can try the one I made, available here: https://grafana.com/grafana/dashboards/4323

Note: there are no graph panels (yet) for GPUs, since we do not have much of that HW in our current installation.
This also means that if there were some GPUs on these nodes, they would not be counted correctly. The exporter expects one node per line and does not know that e.g. a compact name like node[001-100] stands for multiple nodes.
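If the exporter were ever pointed at %N output, it would need to expand such ranges. A minimal sketch of a hypothetical helper (not part of the exporter) that counts how many hosts a compact name represents:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// countNodes returns how many hosts a sinfo %N entry represents,
// e.g. "node001" -> 1, "node[001-004]" -> 4, "node[1,3-5]" -> 4.
func countNodes(name string) int {
	re := regexp.MustCompile(`\[([0-9,-]+)\]`)
	m := re.FindStringSubmatch(name)
	if m == nil {
		return 1 // plain hostname, no bracketed range
	}
	count := 0
	for _, part := range strings.Split(m[1], ",") {
		if lo, hi, ok := strings.Cut(part, "-"); ok {
			a, _ := strconv.Atoi(lo)
			b, _ := strconv.Atoi(hi)
			count += b - a + 1
		} else {
			count++
		}
	}
	return count
}

func main() {
	fmt.Println(countNodes("node[001-004]")) // 4
}
```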
Cori (cori.nersc.gov) currently has 2,388 Haswell nodes and 9,688 KNL nodes. No GPUs. Perlmutter's size I am not at liberty to disclose at this time, but it has GPUs, and it will have more nodes than Cori.
@ThomasADavis: I see. That's quite a difference in terms of installation size. Take a look at the gpus_acct branch. There is only 1 commit of difference with the master branch:
I'll just add: "We break things." I will look at it. I thought we were still under blackout, but they did post that there will be 6000+ GPUs in the Perlmutter phase 1 system.
For us it's interesting to know that there are such big installations using this exporter! And bug reports are always welcome :-)
Those are definitely more GPUs than we are expecting to receive and install in the next months... and next years as well, I guess. I cannot say now if
We have a contract with the slurm people to deal with some of those issues. |
Many thanks,

Not having the "Type" made the index "2" in the patch of "gpus.go" produce an out-of-bounds error. Adding Type solved the problem.

The patch actually has a very important comment I didn't pay attention to.

Best
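For anyone hitting the same panic, here is a sketch of a more defensive parse that tolerates both forms of the GRES string (with and without a Type) instead of indexing field 2 unconditionally; parseGPUCount is a hypothetical helper, not code from the exporter:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseGPUCount extracts the GPU count from a GRES entry, tolerating
// both "gpu:4" (no Type configured) and "gpu:RTX2070:2(S:0)".
func parseGPUCount(resource string) float64 {
	fields := strings.Split(resource, ":")
	descriptor := fields[len(fields)-1]            // the last field holds the count
	descriptor = strings.Split(descriptor, "(")[0] // drop any "(S:...)" suffix
	n, _ := strconv.ParseFloat(descriptor, 64)
	return n
}

func main() {
	fmt.Println(parseGPUCount("gpu:4"))              // 4
	fmt.Println(parseGPUCount("gpu:RTX2070:2(S:0)")) // 2
}
```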
Version 0.19 introduces a breaking change: by default the GPUs accounting will not be enabled, but a command line option can be used to explicitly activate it. Note that the exporter will also log the status of this function, whether it's enabled or not.

Considering the ongoing discussion here and what was also reported in issue #40, we have decided to change the default behaviour of the exporter and play the safe bet of keeping such a functionality off by default.

Until we have a chance to test this feature on our cluster (we are going through the process of acquiring new servers equipped with GPUs), we will leave these issues open. It would be useful if other users could report how the GPUs accounting functionality is working in their infrastructure. In particular:
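For reference, enabling the accounting would then look something like the following; the exact flag name is an assumption here, so check the exporter's -h output:

```
# assuming the option is called -gpus-acct; verify with -h
$ ./prometheus-slurm-exporter -gpus-acct
```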
These changes make accounting work with slurm 19 and should help with vpenso#38
Ever since Slurm v19.05.0rc1, Slurm provides another way to check for available and active GRES, i.e. via:
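The command itself was cut off above; it presumably refers to sinfo's Gres and GresUsed format fields, which report configured versus in-use GRES per node line. A sketch, assuming those field names (the output line is invented for illustration):

```
$ sinfo -a -h --Format="Nodes,Gres,GresUsed"
1 gpu:RTX2070:2 gpu:RTX2070:1(IDX:0)
```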
@itzsimpl I have merged your updated PR into the development branch (among other contributions). We are currently not able to test this exporter, for the GPU part, on newer versions of Slurm.

Last but not least: thanks!!
Hey, some feedback for Slurm 20.11.9 (CentOS 7, 20.11.9-1.el7.x86_64). As far as I can tell, the GPU export looks alright using the development branch.

Raw output:
Parsed:
Got a GPU cluster if you need further testing before a new release. Anything I can help with to get the development branch ready to merge?
@martialblog thanks for testing; I did not have the chance to test the
Hi,
I am testing the latest version and the GPU info seems to not be so accurate. How can I start debugging?
Cheers.