Reintroduce GPU Usage & Efficiency #2731

thomasvn · 2024-05-03T23:53:34Z

What does this PR change?

This PR reintroduces GPU Usage/Efficiency to the Allocation API. It also adds GPURequestAverage and GPUUsageAverage to bingen.
It requires deploying the dcgm-exporter and configuring Prometheus to scrape the exporter. Specifically, this PR leverages the DCGM_FI_DEV_GPU_UTIL metric.

TODO. I think some future work is required to further validate how the DCGM_FI_DEV_GPU_UTIL metric is used and queried. Open questions:

Does it work correctly for both controlled and uncontrolled pods?
Does it work correctly for a job that uses the GPU only for a short time and then ends in a "Completed" status?
Does it work correctly when pods are waiting in the queue to use the GPU?

Does this PR relate to any other PRs?

Reintroduces GPU Usage/Efficiency (previously Support Allocation GPU utilization/efficiency through integration with Nvidia GPU Operator/DCGM. #944)

How will this PR impact users?

The next release will add three new fields to the Allocation API: gpuRequestAverage, gpuUsageAverage, and gpuEfficiency.
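
To make the relationship between the three fields concrete, here is a small Go sketch. It assumes that efficiency is the ratio of average usage to average request; the PR description does not spell out the formula, so treat both the definition and the helper name `gpuEfficiency` as assumptions.

```go
package main

import "fmt"

// gpuEfficiency returns average usage divided by average request.
// Assumption: this ratio is an illustrative definition, not confirmed by the PR.
// A zero or negative request yields 0 to avoid dividing by zero.
func gpuEfficiency(gpuUsageAverage, gpuRequestAverage float64) float64 {
	if gpuRequestAverage <= 0 {
		return 0
	}
	return gpuUsageAverage / gpuRequestAverage
}

func main() {
	// A pod that requested one GPU and used half of it on average.
	fmt.Println(gpuEfficiency(0.5, 1.0)) // prints 0.5
}
```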

Does this PR address any GitHub or Zendesk issues?

Closes Reintroduce gpuRequestAverage and gpuUsageAverage to the Allocation API Schema kubecost/cost-analyzer-helm-chart#1787

How was this PR tested?

Setup

kubectl port-forward svc/prometheus-server 9080:80

rm -rf /tmp/localrun/default
export CLUSTER_ID="cluster-localrun-default"
export CONFIG_PATH="/tmp/localrun/default"
export PROMETHEUS_SERVER_ENDPOINT="http://127.0.0.1:9080"
export KUBERNETES_PORT="helloworld"
export CLOUD_PROVIDER_API_KEY="REDACTED"
mkdir -p $CONFIG_PATH

go run cmd/costmodel/main.go

Set up a GPU node and Prometheus.
Run OpenCost pointed at that Prometheus.
Deploy a sample workload that requests use of the GPU.
Open http://localhost:9003/metrics and check that the metrics are updated.
Open http://localhost:9003/allocation?window=1d and validate that my dcgmproftester deployment has the new gpuRequestAverage and gpuUsageAverage fields. Example result: allocation.json.
Open http://localhost:9003/allocation/summary?window=1d and validate the GPU fields. Example result: allocationsummary.json.
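
The manual check of the /allocation response could also be scripted. The Go sketch below decodes only the two new fields from a response body. The envelope shape (a `data` array of maps keyed by allocation name) and the helper name `parseGPUFields` are assumptions, and the sample payload is invented for illustration; it is not real output from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// gpuFields holds only the fields this check cares about.
type gpuFields struct {
	GPURequestAverage float64 `json:"gpuRequestAverage"`
	GPUUsageAverage   float64 `json:"gpuUsageAverage"`
}

// allocationResponse models the assumed response envelope.
type allocationResponse struct {
	Code int                    `json:"code"`
	Data []map[string]gpuFields `json:"data"`
}

// parseGPUFields flattens every allocation set in the response into one map
// from allocation name to its GPU fields.
func parseGPUFields(body []byte) (map[string]gpuFields, error) {
	var resp allocationResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	out := make(map[string]gpuFields)
	for _, set := range resp.Data {
		for name, alloc := range set {
			out[name] = alloc
		}
	}
	return out, nil
}

func main() {
	// Invented sample payload; the name and values are placeholders.
	sample := []byte(`{"code":200,"data":[{"default/dcgmproftester":{"gpuRequestAverage":1,"gpuUsageAverage":0.42}}]}`)
	fields, err := parseGPUFields(sample)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	for name, f := range fields {
		fmt.Printf("%s request=%.2f usage=%.2f\n", name, f.GPURequestAverage, f.GPUUsageAverage)
	}
}
```

In a real check, the body would come from an HTTP GET against the allocation endpoint rather than from a literal.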

Does this PR require changes to documentation?

TODO. Document how to deploy the DCGM Exporter.

Have you labeled this PR and its corresponding Issue as "next release" if it should be part of the next OpenCost release? If not, why not?

v2.4

vercel · 2024-05-03T23:53:38Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
opencost	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	May 10, 2024 0:30am

thomasvn · 2024-05-16T16:45:22Z

core/pkg/opencost/allocation.go

+ GPURequestAverage float64 `json:"gpuRequestAverage"` //@bingen:field[version=22]
+ GPUUsageAverage float64 `json:"gpuUsageAverage"` //@bingen:field[version=22]


I added these new fields to the bottom of the struct in accordance with https://github.com/opencost/opencost/blob/develop/core/pkg/opencost/bingen.go.

I've noticed that the codecs perform "field version checks", which leads me to believe that new fields don't have to be added to the end of the struct. If possible, I'd like to group these new fields alongside the other GPU fields.
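
For readers unfamiliar with the idea, here is a minimal Go sketch of a version-gated field check. It is a hypothetical hand-written codec, not the generated opencost_codecs code: the wire layout, the `encode`/`decode` helpers, and the `GPUHours` placeholder field are all assumptions. It only shows how a decoder can skip fields that a given payload version does not contain; whether bingen also decouples wire order from struct order is the open question raised above.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// allocation is a trimmed, hypothetical stand-in for the real struct.
type allocation struct {
	GPUHours          float64 // placeholder for a pre-existing field
	GPURequestAverage float64 // gated on version 22, as in the annotation
	GPUUsageAverage   float64 // gated on version 22, as in the annotation
}

// fieldsFor lists the fields present on the wire for a given version.
func fieldsFor(version uint8, a *allocation) []*float64 {
	fields := []*float64{&a.GPUHours}
	if version >= 22 {
		fields = append(fields, &a.GPURequestAverage, &a.GPUUsageAverage)
	}
	return fields
}

// encode writes a one-byte version followed by the fields for that version.
func encode(version uint8, a allocation) []byte {
	var buf bytes.Buffer
	buf.WriteByte(version)
	for _, f := range fieldsFor(version, &a) {
		// Writes to a bytes.Buffer do not fail for fixed-size values.
		_ = binary.Write(&buf, binary.LittleEndian, *f)
	}
	return buf.Bytes()
}

// decode reads the version first and only then the fields it implies.
func decode(data []byte) (allocation, error) {
	var a allocation
	r := bytes.NewReader(data)
	version, err := r.ReadByte()
	if err != nil {
		return a, err
	}
	for _, f := range fieldsFor(version, &a) {
		if err := binary.Read(r, binary.LittleEndian, f); err != nil {
			return a, err
		}
	}
	return a, nil
}

func main() {
	in := allocation{GPUHours: 3, GPURequestAverage: 1, GPUUsageAverage: 0.25}
	newer, _ := decode(encode(22, in))
	older, _ := decode(encode(21, in))
	fmt.Printf("v22: %+v\nv21: %+v\n", newer, older)
}
```

A payload written at version 21 decodes with the two new fields left at zero, which is the backward-compatibility behaviour a version check is meant to provide.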

sonarcloud · 2024-05-16T16:48:24Z

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

thomasvn added 2 commits May 3, 2024 14:47

Initial progress on gpu-efficiency. Still needs more troubleshooting & testing. Signed-off-by: thomasvn <[email protected]>

a92a5b3

Bring GPURequestAverage and GPUEfficiency back into the Allocation API response. Signed-off-by: thomasvn <[email protected]>

2279782

github-actions bot added the needs-follow-up label May 3, 2024

vercel bot deployed to Preview May 3, 2024 23:54 View deployment

Merge branch 'develop' into thomasn/gpu-efficiency

3546594

vercel bot deployed to Preview May 3, 2024 23:55 View deployment

chipzoller mentioned this pull request May 9, 2024

Reintroduce gpuRequestAverage and gpuUsageAverage to the Allocation API Schema kubecost/cost-analyzer-helm-chart#1787

Closed

Merge branch 'develop' into thomasn/gpu-efficiency

01ccc38

vercel bot deployed to Preview May 9, 2024 17:54 View deployment

Add new Allocation fields to opencost_codecs so that they are correctly marshalled/unmarshalled to/from bytes. Signed-off-by: thomasvn <[email protected]>

975a26f

vercel bot deployed to Preview May 10, 2024 00:30 View deployment

thomasvn added 7 commits May 14, 2024 11:55

Merge branch 'develop' into thomasn/gpu-efficiency

9ab69ce

Bump AllocationSet Codec Version

035534f

Signed-off-by: thomasvn <[email protected]>

Other updates to bingen bump

4adcd7b

Signed-off-by: thomasvn <[email protected]>

Merge branch 'develop' into thomasn/gpu-efficiency

bd99121

Move gpuRequestAverage and gpuUsageAverage to bottom of struct in accordance with documentation here https://github.com/opencost/opencost/blob/develop/core/pkg/opencost/bingen.go. Rerun `go generate`. Signed-off-by: thomasvn <[email protected]>

2c658dd

Merge branch 'develop' into thomasn/gpu-efficiency

9c35024

Cleanup

aad39ed

Signed-off-by: thomasvn <[email protected]>

thomasvn changed the title ~~[WIP] Reintroduce GPU Usage & Efficiency~~ Reintroduce GPU Usage & Efficiency May 16, 2024

thomasvn marked this pull request as ready for review May 16, 2024 16:46

thomasvn requested review from mbolt35 and kaelanspatel May 16, 2024 16:51

thomasvn added 2 commits May 20, 2024 18:50

Update ordering of Allocation JSON response.

d87bc73

Signed-off-by: thomasvn <[email protected]>

Add to SummaryAllocation{}

5a11220

Signed-off-by: thomasvn <[email protected]>
