Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reintroduce GPU Usage & Efficiency #2731

Open
wants to merge 14 commits into
base: develop
Choose a base branch
from

Conversation

thomasvn
Copy link
Contributor

@thomasvn thomasvn commented May 3, 2024

What does this PR change?

  • This PR reintroduces GPU Usage/Efficiency to the Allocation API. It also adds GPURequestAverage and GPUUsageAverage to bingen.
  • It requires deploying the dcgm-exporter and configuring Prometheus to scrape the exporter. Specifically, this PR leverages the DCGM_FI_DEV_GPU_UTIL metric.

TODO. I think there is some future work required to further validate the usage/querying of the DCGM_FI_DEV_GPU_UTIL metric.

  • Works correctly for both controlled and uncontrolled pods?
  • Works correctly for a job which only uses the GPU for a short amount of time, then ending in a "Completed" status?
  • Works correctly when pods are waiting in the queue to use the GPU?

Does this PR relate to any other PRs?

How will this PR impact users?

  • The next release will include three new fields to the Allocation API. gpuRequestAverage, gpuUsageAverage, and gpuEfficiency.

Does this PR address any GitHub or Zendesk issues?

How was this PR tested?

Setup
kubectl port-forward svc/prometheus-server 9080:80
rm -rf /tmp/localrun/default
export CLUSTER_ID="cluster-localrun-default"
export CONFIG_PATH="/tmp/localrun/default"
export PROMETHEUS_SERVER_ENDPOINT="http://127.0.0.1:9080"
export KUBERNETES_PORT="helloworld"
export CLOUD_PROVIDER_API_KEY="REDACTED"
mkdir -p $CONFIG_PATH

go run cmd/costmodel/main.go
  1. Setup GPU Node + Prometheus
  2. Run OpenCost pointed at that Prometheus
  3. Deploy a sample workload to request usage of the GPU
  4. http://localhost:9003/metrics. Check to see that metrics are updated.
  5. http://locahost:9003/allocation?window=1d. Validate my dcgmproftester deployment has new gpuRequestAverage and gpuUsageAverage fields. Example result here allocation.json.
  6. http://locahost:9003/allocation/summary?window=1d. Validate GPU fields. Example result here allocationsummary.json.

Does this PR require changes to documentation?

  • TODO. Document how to deploy the DCGM Exporter.

Have you labeled this PR and its corresponding Issue as "next release" if it should be part of the next OpenCost release? If not, why not?

  • v2.4

Copy link

vercel bot commented May 3, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
opencost ✅ Ready (Inspect) Visit Preview 💬 Add feedback May 10, 2024 0:30am

marshalled/unmarshalled to/from bytes.

Signed-off-by: thomasvn <[email protected]>
Comment on lines +101 to +102
GPURequestAverage float64 `json:"gpuRequestAverage"` //@bingen:field[version=22]
GPUUsageAverage float64 `json:"gpuUsageAverage"` //@bingen:field[version=22]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added these new fields to the bottom of the struct in accordance with https://github.com/opencost/opencost/blob/develop/core/pkg/opencost/bingen.go.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've noticed that the codecs performs "field version checks". Which leads me to believe that we don't have to add new fields to the end of the struct. If possible, I'd like to group these new fields alongside the other GPU fields.

@thomasvn thomasvn changed the title [WIP] Reintroduce GPU Usage & Efficiency Reintroduce GPU Usage & Efficiency May 16, 2024
@thomasvn thomasvn marked this pull request as ready for review May 16, 2024 16:46
Copy link

sonarcloud bot commented May 16, 2024

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reintroduce gpuRequestAverage and gpuUsageAverage to the Allocation API Schema
1 participant