[k8s] Support Nvidia GFD Labels for GPU type detection #2460

Open
romilbhardwaj opened this issue Aug 25, 2023 · 3 comments · May be fixed by #3493
Labels
k8s Kubernetes related items

Comments

@romilbhardwaj
Collaborator

To detect GPU type on the cluster, we currently support GKE labels and skypilot.co/accelerators labels created by our GPU labelling script (python -m sky.utils.kubernetes.gpu_labeler).

It would be good to add a GPULabelFormatter for Nvidia GPU Feature Discovery. To do so, we will need a list of label values generated by Nvidia GFD for SkyPilot supported GPUs (e.g., nvidia.com/gpu.product: A100-SXM4-40GB).

Related issue: NVIDIA/k8s-device-plugin#739

@romilbhardwaj romilbhardwaj added this to the k8s milestone Aug 25, 2023
@romilbhardwaj romilbhardwaj added the k8s Kubernetes related items label Sep 23, 2023
@romilbhardwaj romilbhardwaj removed this from the k8s milestone Sep 23, 2023

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@asaiacai
Contributor

asaiacai commented Apr 26, 2024

Based on the code that the GPU Feature Discovery maintainer pointed to in the related issue, I believe the value of nvidia.com/gpu.product is exactly what you get from running nvidia-smi --query-gpu=name --format=csv,noheader,nounits as done here? The only difference is that there are dashes instead of spaces. On my machine I get the following outputs:

$ kubectl describe node | grep nvidia.com/gpu.product
                    nvidia.com/gpu.product=NVIDIA-H100-80GB-HBM3

$ nvidia-smi --query-gpu=name --format=csv,noheader,nounits
NVIDIA H100 80GB HBM3

So it's probably enough to use logic similar to the labeler job's to compare against the canonical GPU names we have?
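A minimal sketch of that comparison, assuming the label value really is the nvidia-smi name with spaces replaced by dashes. The `CANONICAL_GPU_NAMES` list and the function name `canonical_name_from_gfd` are hypothetical placeholders for illustration, not SkyPilot's actual list or API:

```python
from typing import Optional

# Hypothetical subset of canonical GPU names, for illustration only.
CANONICAL_GPU_NAMES = ["H100", "A100", "V100", "T4"]


def canonical_name_from_gfd(label_value: str) -> Optional[str]:
    """Map an nvidia.com/gpu.product value to a canonical GPU name.

    Splits the label on dashes and looks for a whole token that equals a
    canonical name. Whole-token matching avoids substring false positives
    (for example, "A10" matching inside "A100").
    """
    tokens = label_value.upper().split("-")
    for name in CANONICAL_GPU_NAMES:
        if name.upper() in tokens:
            return name
    # Unknown product string: report no match instead of guessing.
    return None


print(canonical_name_from_gfd("NVIDIA-H100-80GB-HBM3"))
print(canonical_name_from_gfd("A100-SXM4-40GB"))
```

Matching on whole tokens rather than substrings is a design choice in this sketch; memory-size variants (such as 40GB vs 80GB) are not distinguished here and would need extra handling.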

I can draft up a PR for this. It was getting hard to remember to rerun the labeler job every time I added a node to my k3s cluster, but this way it just works directly with feature discovery. This somewhat resolves #3432 if people are running with the NVIDIA GPU Operator.

@romilbhardwaj
Collaborator Author

Good point. If the nvidia.com/gpu.product label indeed uses the output of nvidia-smi --query-gpu=name --format=csv,noheader,nounits with - instead of spaces, we should be able to have a GPULabelFormatter that parses GFD labels into SkyPilot canonical names. Would love to see a PR. Just noting that we will need to test it extensively, since having a half-functioning label formatter is worse than having users run the labelling script for consistency.
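One way to keep such a formatter from being half-functioning is to have it fail loudly whenever a label cannot be mapped to exactly one canonical name. The sketch below illustrates that idea only; the class name `GFDLabelFormatter`, its method names, and the canonical list passed to it are assumptions for illustration and do not reflect SkyPilot's actual GPULabelFormatter interface:

```python
from typing import Dict, List


class GFDLabelFormatter:
    """Hypothetical formatter for the nvidia.com/gpu.product node label."""

    LABEL_KEY = "nvidia.com/gpu.product"

    def __init__(self, canonical_names: List[str]) -> None:
        self._canonical = {name.upper() for name in canonical_names}

    def get_accelerator_from_label_value(self, value: str) -> str:
        # The label is assumed to be the nvidia-smi name with dashes
        # in place of spaces, so split on dashes and match whole tokens.
        tokens = value.upper().split("-")
        matches = sorted({t for t in tokens if t in self._canonical})
        if len(matches) != 1:
            # Zero or several candidates: refuse to guess.
            raise ValueError(
                f"Cannot map {self.LABEL_KEY}={value!r} to exactly one "
                f"canonical GPU name (candidates: {matches})")
        return matches[0]

    def get_accelerator_from_node_labels(self, labels: Dict[str, str]) -> str:
        if self.LABEL_KEY not in labels:
            raise KeyError(f"Node has no {self.LABEL_KEY} label")
        return self.get_accelerator_from_label_value(labels[self.LABEL_KEY])


formatter = GFDLabelFormatter(["H100", "A100"])  # hypothetical canonical list
node_labels = {"nvidia.com/gpu.product": "NVIDIA-H100-80GB-HBM3"}
print(formatter.get_accelerator_from_node_labels(node_labels))
```

Raising on unknown or ambiguous values means an unsupported GPU surfaces as an explicit error, and users can still fall back to the labelling script for those nodes.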

@asaiacai asaiacai linked a pull request Apr 27, 2024 that will close this issue