[k8s] GPU Feature discovery label formatter #3493

Open
wants to merge 27 commits into
base: master

Contributor

asaiacai commented Apr 27, 2024

Resolves #2460

This allows k8s to consume the node label nvidia.com/gpu.product, which is created by GPU Feature Discovery (commonly deployed through the NVIDIA GPU Operator).
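To make the idea concrete, here is a minimal sketch of how a label formatter could turn a nvidia.com/gpu.product value into a short accelerator name. It is not the code in this PR: the function name accelerator_from_gfd_label, the token-based matching, and the sample label values are all assumptions for illustration. The canonical names are the GPU types listed in the test checklist of this description.

```python
from typing import Optional

# Canonical names, copied from the GPU types in this PR's test checklist.
# 'A100-80GB' is listed before 'A100' so the more specific name wins.
CANONICAL_GPU_NAMES = [
    'A100-80GB', 'A100', 'H100', 'T4', 'V100', 'A10G', 'P100', 'P4', 'L4'
]


def accelerator_from_gfd_label(label_value: str) -> Optional[str]:
    """Hypothetical helper: map a GFD product string to a canonical name.

    Matching is done on whole dash-separated tokens so that, for example,
    an 'L40' product is not mistaken for 'L4'.
    """
    tokens = label_value.upper().split('-')
    for name in CANONICAL_GPU_NAMES:
        if name == 'A100-80GB':
            # The memory size may be separated from the model by a form
            # factor token (e.g. 'A100-SXM4-80GB'), so check both tokens.
            if 'A100' in tokens and '80GB' in tokens:
                return name
        elif name in tokens:
            return name
    return None  # Unknown product; the caller decides how to handle it.


if __name__ == '__main__':
    # Sample label values are assumed, not taken from this thread.
    for sample in ('NVIDIA-A100-SXM4-80GB', 'NVIDIA-A100-SXM4-40GB',
                   'Tesla-T4'):
        print(sample, '->', accelerator_from_gfd_label(sample))
```

Matching on whole tokens rather than substrings is a design choice of this sketch; a substring match would be shorter but would need careful ordering to avoid confusing similarly named products.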

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Manual test: test against GKE labels (tested against T4)
  • Manual test: test against skypilot labeler script labels on EKS deployed via eks_test_cluster.yaml
  • Manual tests: deploy k3s with gpu-operator using deploy_k3s.sh (modified to exclude the skypilot k8s labeler), then ensure the following can run:
# check nvidia-smi and nvidia.com/gpu.product info
nvidia-smi --query-gpu=name --format=csv,noheader,nounits
kubectl describe node | grep nvidia.com/gpu.product
# test skypilot against gpu type
sky show-gpus --cloud kubernetes
sky launch --cloud kubernetes --gpus <GPU_TYPE>
  • A100-80GB
  • A100
  • H100
  • T4
  • V100
  • A10G
  • P100
  • P4
  • L4
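
The kubectl check in the manual test above can also be scripted. The sketch below is an assumption-laden illustration rather than part of this PR: the helper names gpu_products_by_node and fetch_gpu_products are hypothetical, and the only things taken from the thread are the label key nvidia.com/gpu.product and the use of kubectl. It reads the JSON produced by kubectl get nodes -o json and collects the GFD product label for each node that has one.

```python
import json
import subprocess
from typing import Dict

# Node label key named in this PR description.
GFD_LABEL_KEY = 'nvidia.com/gpu.product'


def gpu_products_by_node(node_list: dict) -> Dict[str, str]:
    """Hypothetical helper: map node name -> GFD product label value.

    `node_list` is the parsed output of `kubectl get nodes -o json`.
    Nodes without the label (e.g. CPU-only nodes) are skipped.
    """
    products = {}
    for node in node_list.get('items', []):
        metadata = node.get('metadata', {})
        labels = metadata.get('labels') or {}
        if GFD_LABEL_KEY in labels:
            products[metadata.get('name', '<unnamed>')] = labels[GFD_LABEL_KEY]
    return products


def fetch_gpu_products() -> Dict[str, str]:
    """Query the current kubectl context; requires kubectl on PATH."""
    result = subprocess.run(['kubectl', 'get', 'nodes', '-o', 'json'],
                            check=True,
                            capture_output=True,
                            text=True)
    return gpu_products_by_node(json.loads(result.stdout))
```

Keeping the parsing separate from the kubectl call means the parsing can be exercised with a hand-written dictionary, without a cluster.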

@asaiacai asaiacai marked this pull request as ready for review May 7, 2024 00:24
Collaborator

@Michaelvll Michaelvll left a comment

This is awesome @asaiacai! It looks very reasonable to me. @romilbhardwaj for another look to make sure it does not break our other formatters : )

sky/provision/kubernetes/utils.py (resolved)
Co-authored-by: Zhanghao Wu <[email protected]>
Collaborator

@romilbhardwaj romilbhardwaj left a comment

Thanks @asaiacai!

sky/provision/kubernetes/utils.py (outdated, resolved)
sky/provision/kubernetes/utils.py (resolved)
sky/provision/kubernetes/utils.py (outdated, resolved)
tests/kubernetes/scripts/deploy_k3s.sh (resolved)
Collaborator

@romilbhardwaj romilbhardwaj left a comment

Thanks @asaiacai. Tested on A100 and H100 from Lambda. Left a comment about documenting that this label formatter cannot be used with autoscaling; otherwise lgtm!

sky/provision/kubernetes/utils.py (resolved)
Contributor Author

asaiacai commented Jun 4, 2024

Just added the docstring, @romilbhardwaj. Thanks for the review! lmk if this needs anything else.

Collaborator

@romilbhardwaj romilbhardwaj left a comment

Thanks @asaiacai!

Development

Successfully merging this pull request may close these issues.

[k8s] Support Nvidia GFD Labels for GPU type detection
3 participants