[k8s] Realtime GPU availability of kubernetes cluster in sky show-gpus #3499

Merged
romilbhardwaj merged 28 commits into master from k8s_show_gpus_availability
May 27, 2024

Conversation

romilbhardwaj
Collaborator

@romilbhardwaj romilbhardwaj commented Apr 30, 2024

Closes #2839 and #3448. Shows real-time availability of GPUs on the cluster when --cloud kubernetes is passed to sky show-gpus.

Examples

On a kubernetes cluster with the following configuration:

  • 2x T4:4 nodes, for a total of 8 T4 GPUs
  • 2x V100:2 nodes, for a total of 4 V100 GPUs
  • 2 jobs running - 1 using T4:2 and another using V100:2.
$ sky show-gpus --cloud kubernetes     
GPU   QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS  
T4    1, 2, 3, 4    8           6               
V100  1, 2          4           2               

# With name and quantity filter
$ sky show-gpus T4:2 --cloud kubernetes
GPU  QTY_FILTER  TOTAL_GPUS  FILTERED_FREE_GPUS  
T4   2           8           6               

# With name and a 4x GPU quantity filter. Note that `FILTERED_FREE_GPUS` is now 4, since only one node has 4 GPUs free as a set (the other node has only 2 of its 4 GPUs free).
$ sky show-gpus T4:4 --cloud kubernetes
GPU  QTY_FILTER  TOTAL_GPUS  FILTERED_FREE_GPUS  
T4   4           8           4               

# Without cloud filter, behavior remains unchanged
$ sky show-gpus T4:4                   
GPU  QTY  CLOUD       INSTANCE_TYPE          DEVICE_MEM  vCPUs  HOST_MEM  HOURLY_PRICE  HOURLY_SPOT_PRICE  REGION       
T4   4    AWS         g4dn.12xlarge          16GB        48     192GB     $ 3.912       $ 1.378            us-west-2    
T4   4    Azure       Standard_NC64as_T4_v3  -           64     440GB     $ 4.352       $ 0.435            eastus       
T4   4    GCP         n1-standard-64         16GB        64     240GB     $ 4.440       $ 1.168            us-central1  
T4   4    Kubernetes  (attachable)           -           -      -         $ 0.000       $ 0.000            kubernetes   

# ======= Error messages ==========

# GPU not present on the cluster
$ sky show-gpus L4 --cloud kubernetes
No GPUs matching name 'L4' found in Kubernetes cluster. If your cluster contains GPUs, make sure nvidia.com/gpu resource is available on the nodes and the node labels for identifying GPUs (e.g., skypilot.co/accelerator) are setup correctly. To list all available accelerators, run: sky show-gpus --cloud kubernetes.

# Checking for more quantity than is available
$ sky show-gpus T4:8 --cloud kubernetes
No GPUs matching name 'T4' with quantity 8 found in Kubernetes cluster. If your cluster contains GPUs, make sure nvidia.com/gpu resource is available on the nodes and the node labels for identifying GPUs (e.g., skypilot.co/accelerator) are setup correctly. To list all available accelerators, run: sky show-gpus --cloud kubernetes.

# On a cluster with no GPUs (e.g., `sky local up`)
$ sky show-gpus --cloud kubernetes 
No GPUs found in Kubernetes cluster. If your cluster contains GPUs, make sure nvidia.com/gpu resource is available on the nodes and the node labels for identifying GPUs (e.g., skypilot.co/accelerator) are setup correctly. To further debug, run: sky check.
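
The counts in the examples above can be reproduced with a small sketch. It assumes the per-node totals and free counts are already known (the Kubernetes API lookup is not shown), and it assumes a node contributes its free GPUs to the filtered count only when it can satisfy the requested quantity on its own. `NodeGpus`, `qty_per_node`, `total_gpus`, and `filtered_free_gpus` are hypothetical names for illustration, not SkyPilot functions, and the actual implementation in this PR may compute these columns differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeGpus:
    """GPU counts for one node (hypothetical structure for illustration)."""
    gpu: str
    total: int
    free: int


def qty_per_node(nodes: List[NodeGpus], gpu: str) -> List[int]:
    """Quantities requestable on a single node: 1 up to the largest node."""
    largest = max((n.total for n in nodes if n.gpu == gpu), default=0)
    return list(range(1, largest + 1))


def total_gpus(nodes: List[NodeGpus], gpu: str) -> int:
    """All GPUs of this type in the cluster, busy or free."""
    return sum(n.total for n in nodes if n.gpu == gpu)


def filtered_free_gpus(nodes: List[NodeGpus], gpu: str,
                       qty: Optional[int] = None) -> int:
    """Free GPUs on nodes that can fit `qty` GPUs on one node.

    Assumption: a node counts only if its free GPUs >= qty. With qty=None
    every node counts, which matches the TOTAL_FREE_GPUS column.
    """
    need = 1 if qty is None else qty
    return sum(n.free for n in nodes if n.gpu == gpu and n.free >= need)


# The example cluster: 2x T4:4 nodes, 2x V100:2 nodes,
# with one job using T4:2 and another using V100:2.
CLUSTER = [
    NodeGpus("T4", total=4, free=4),
    NodeGpus("T4", total=4, free=2),
    NodeGpus("V100", total=2, free=2),
    NodeGpus("V100", total=2, free=0),
]
```

Under these assumptions, `filtered_free_gpus(CLUSTER, "T4", 2)` gives 6 and `filtered_free_gpus(CLUSTER, "T4", 4)` gives 4, matching the tables above.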

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Rendered docs
  • Manual tests with the examples above and on a no GPU kubernetes cluster
  • pytest tests/test_list_accelerators.py

@romilbhardwaj romilbhardwaj marked this pull request as ready for review April 30, 2024 19:15
@romilbhardwaj
Collaborator Author

romilbhardwaj commented Apr 30, 2024

Ran some tests + updated cli docs. Ready for review.

@Michaelvll Michaelvll self-requested a review May 1, 2024 05:02
Collaborator

@Michaelvll Michaelvll left a comment

Thanks for adding support for this @romilbhardwaj! Mostly looks good to me, with minor nits. I just tried it on a k8s cluster with GPUs and it seems to be working well.

sky/cli.py Outdated Show resolved Hide resolved
Comment on lines +50 to +52
return list_accelerators_realtime(gpus_only, name_filter, region_filter,
quantity_filter, case_sensitive,
all_regions, require_price)[0]
Collaborator

Will calling this add overhead to list_accelerators? We rely on list_accelerators to generate the candidate resources for optimization, and it is called multiple times during the failover process. It would be nice to make sure this does not add overhead. : )

Collaborator Author

That's a good point. The overhead isn't much different from the previous implementation, since the previous implementation was also invoking the Kubernetes API:

This branch:
multitime -n 5 sky launch --dryrun -y --gpus T4:1
===> multitime results
1: sky launch --dryrun -y --gpus T4:1
            Mean        Std.Dev.    Min         Median      Max
real        3.883       0.064       3.782       3.883       3.982
user        2.775       0.081       2.654       2.766       2.871
sys         3.136       0.285       2.676       3.268       3.448


Master: 
multitime -n 5 sky launch --dryrun -y --gpus T4:1
1: sky launch --dryrun -y --gpus T4:1
            Mean        Std.Dev.    Min         Median      Max
real        3.863       0.032       3.829       3.860       3.917
user        2.713       0.023       2.670       2.716       2.735
sys         3.438       0.097       3.267       3.471       3.535

That said, we should add an LRU cache with a time-to-live (TTL) so that entries expire after a set time. Added a TODO.
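
As an illustration of the TODO above, here is a minimal sketch of an LRU cache whose entries also expire after a TTL. It is not the code in this PR: `ttl_lru_cache` and `fetch_free_gpus` are hypothetical names, the defaults are arbitrary, and the `clock` parameter exists only so expiry can be exercised without sleeping. The standard library's `functools.lru_cache` has no TTL, so the sketch uses an `OrderedDict` instead. It is not thread-safe.

```python
import functools
import time
from collections import OrderedDict
from typing import Any, Callable, Tuple


def ttl_lru_cache(maxsize: int = 128, ttl_seconds: float = 5.0,
                  clock: Callable[[], float] = time.monotonic):
    """Cache results by positional args; evict by age and by LRU order."""

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        # Maps args -> (time stored, value); oldest-used entries come first.
        cache: "OrderedDict[Tuple[Any, ...], Tuple[float, Any]]" = OrderedDict()

        @functools.wraps(func)
        def wrapper(*args: Any) -> Any:
            now = clock()
            entry = cache.get(args)
            if entry is not None:
                stored_at, value = entry
                if now - stored_at < ttl_seconds:
                    cache.move_to_end(args)  # mark as most recently used
                    return value
                del cache[args]  # expired: fall through and recompute
            value = func(*args)
            cache[args] = (now, value)
            if len(cache) > maxsize:
                cache.popitem(last=False)  # drop the least recently used
            return value

        return wrapper

    return decorator


@ttl_lru_cache(maxsize=32, ttl_seconds=10.0)
def fetch_free_gpus(gpu_name: str) -> int:
    """Placeholder for an expensive cluster query (hypothetical)."""
    return 0
```

A short TTL keeps repeated calls during failover cheap while bounding how stale the reported availability can be; the right value depends on how quickly cluster state changes.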

sky/cli.py Outdated Show resolved Hide resolved
sky/cli.py Outdated Show resolved Hide resolved
sky/cli.py Outdated Show resolved Hide resolved
@romilbhardwaj
Collaborator Author

Thanks @Michaelvll! Ready for another look.

Collaborator

@Michaelvll Michaelvll left a comment

Thanks for the update @romilbhardwaj! LGTM. IIRC, we may want to have a separate section for the k8s table in sky show-gpus without any argument, so that it can be easier to distinguish those "on-prem" GPUs.

Also, it seems sky show-gpus t4 does not contain the kubernetes cluster, although sky show-gpus --cloud kubernetes does show the T4 GPUs. Can we show the k8s section in sky show-gpus t4 as well?

@romilbhardwaj romilbhardwaj added this to the v0.6 milestone May 23, 2024
@romilbhardwaj
Collaborator Author

Thanks @Michaelvll - I've made some updates:

  1. Thanks for catching the case sensitivity bug! It's fixed now - sky show-gpus t4 or sky show-gpus T4 will show:
(base) ➜  ~ sky show-gpus t4
GPU  QTY  CLOUD       INSTANCE_TYPE          DEVICE_MEM  vCPUs  HOST_MEM  HOURLY_PRICE  HOURLY_SPOT_PRICE  REGION
T4   1    Kubernetes  (attachable)           -           -      -         $ 0.000       $ 0.000            kubernetes
T4   2    Kubernetes  (attachable)           -           -      -         $ 0.000       $ 0.000            kubernetes
T4   3    Kubernetes  (attachable)           -           -      -         $ 0.000       $ 0.000            kubernetes
T4   4    Kubernetes  (attachable)           -           -      -         $ 0.000       $ 0.000            kubernetes
T4   1    Azure       Standard_NC4as_T4_v3   -           4      28GB      $ 0.526       $ 0.053            eastus
...
  2. I've updated sky show-gpus to show Kubernetes GPUs in a separate table (in the examples below, P500 is a dummy GPU I created on one of the nodes to simulate any non-canonical GPUs that users may have on their cluster):
===== When Kubernetes is enabled and has GPUs =====
$ sky show-gpus
COMMON_GPU  AVAILABLE_QUANTITIES
A10         1, 2, 4
A10G        1, 4, 8
A100        1, 2, 4, 8, 16
A100-80GB   1, 2, 4, 8
H100        1, 2, 4, 6, 8, 12
K80         1, 2, 4, 8, 16
L4          1, 2, 4, 8
M60         1, 2, 4
P100        1, 2, 4
T4          1, 2, 3, 4, 8
V100        1, 2, 4, 8
V100-32GB   1, 2, 4, 8

KUBERNETES_GPU  QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
P500            1, 2, 3, 4    4           4
T4              1, 2, 3, 4    8           8
V100            1, 2          4           4

GOOGLE_TPU   AVAILABLE_QUANTITIES
tpu-v2-8     1
tpu-v2-32    1
tpu-v2-128   1
tpu-v2-256   1
tpu-v2-512   1
tpu-v3-8     1
tpu-v3-32    1
tpu-v3-64    1
tpu-v3-128   1
tpu-v3-256   1
tpu-v3-512   1
tpu-v3-1024  1
tpu-v3-2048  1

$ sky show-gpus --cloud kubernetes
GPU   QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
P500  1, 2, 3, 4    4           4
T4    1, 2, 3, 4    8           8
V100  1, 2          4           4

===== When Kubernetes is enabled but does not have GPUs =====
$ sky show-gpus
COMMON_GPU  AVAILABLE_QUANTITIES
A10         1, 2, 4
A10G        1, 4, 8
A100        1, 2, 4, 8, 16
A100-80GB   1, 2, 4, 8
H100        1, 2, 4, 6, 8, 12
K80         1, 2, 4, 8, 16
L4          1, 2, 4, 8
M60         1, 2, 4
P100        1, 2, 4
T4          1, 2, 4, 8
V100        1, 2, 4, 8
V100-32GB   1, 2, 4, 8

No GPUs found in Kubernetes cluster. If your cluster contains GPUs, make sure nvidia.com/gpu resource is available on the nodes and the node labels for identifying GPUs (e.g., skypilot.co/accelerator) are setup correctly. To further debug, run: sky check.

GOOGLE_TPU   AVAILABLE_QUANTITIES
tpu-v2-8     1
tpu-v2-32    1
tpu-v2-128   1
tpu-v2-256   1
tpu-v2-512   1
tpu-v3-8     1
tpu-v3-32    1
tpu-v3-64    1
tpu-v3-128   1
tpu-v3-256   1
tpu-v3-512   1
tpu-v3-1024  1
tpu-v3-2048  1

Hint: use -a/--all to see all accelerators (including non-common ones) and pricing.

$ sky show-gpus --cloud kubernetes
No GPUs found in Kubernetes cluster. If your cluster contains GPUs, make sure nvidia.com/gpu resource is available on the nodes and the node labels for identifying GPUs (e.g., skypilot.co/accelerator) are setup correctly. To further debug, run: sky check.

===== When Kubernetes is not enabled =====
$ sky show-gpus
COMMON_GPU  AVAILABLE_QUANTITIES
A10         1, 2, 4
A10G        1, 4, 8
A100        1, 2, 4, 8, 16
A100-80GB   1, 2, 4, 8
H100        1, 2, 4, 6, 8, 12
K80         1, 2, 4, 8, 16
L4          1, 2, 4, 8
M60         1, 2, 4
P100        1, 2, 4
T4          1, 2, 4, 8
V100        1, 2, 4, 8
V100-32GB   1, 2, 4, 8

GOOGLE_TPU   AVAILABLE_QUANTITIES
tpu-v2-8     1
tpu-v2-32    1
tpu-v2-128   1
tpu-v2-256   1
tpu-v2-512   1
tpu-v3-8     1
tpu-v3-32    1
tpu-v3-64    1
tpu-v3-128   1
tpu-v3-256   1
tpu-v3-512   1
tpu-v3-1024  1
tpu-v3-2048  1

Hint: use -a/--all to see all accelerators (including non-common ones) and pricing.

Collaborator

@Michaelvll Michaelvll left a comment

Thanks for the update @romilbhardwaj!

I found it a bit odd to have Kubernetes GPUs mixed in with the cloud tables in sky show-gpus t4.

One idea: we just have two sections, one for clouds, and one for k8s? For the k8s section, we just show the real-time availability table.

Similarly for sky show-gpus, we can have two sections, each with a title, e.g., Clouds, Kubernetes (similar to our sky status with three sections for clusters, jobs, and services).

We can have the Kubernetes section at the top so as to make all the cloud tables more connected together : )

sky/cli.py Outdated Show resolved Hide resolved
sky/cli.py Outdated Show resolved Hide resolved
@romilbhardwaj
Collaborator Author

Thanks @Michaelvll - here's the latest behavior to help review:

===== When Kubernetes is enabled and has GPUs =====
(base) ➜  ~ sky show-gpus
Kubernetes GPUs
GPU   QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
P500  1, 2, 3, 4    4           4
T4    1, 2, 3, 4    8           8
V100  1, 2          4           4

Cloud GPUs
COMMON_GPU  AVAILABLE_QUANTITIES
A10         1, 2, 4
A10G        1, 4, 8
A100        1, 2, 4, 8, 16
A100-80GB   1, 2, 4, 8
H100        1, 2, 4, 6, 8, 12
K80         1, 2, 4, 8, 16
L4          1, 2, 4, 8
M60         1, 2, 4
P100        1, 2, 4
T4          1, 2, 4, 8
V100        1, 2, 4, 8
V100-32GB   1, 2, 4, 8

GOOGLE_TPU   AVAILABLE_QUANTITIES
tpu-v2-8     1
tpu-v2-32    1
tpu-v2-128   1
tpu-v2-256   1
tpu-v2-512   1
tpu-v3-8     1
tpu-v3-32    1
tpu-v3-64    1
tpu-v3-128   1
tpu-v3-256   1
tpu-v3-512   1
tpu-v3-1024  1
tpu-v3-2048  1

Hint: use -a/--all to see all accelerators (including non-common ones) and pricing.

$ sky show-gpus --cloud kubernetes
Kubernetes GPUs
GPU   QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
P500  1, 2, 3, 4    4           4
T4    1, 2, 3, 4    8           8
V100  1, 2          4           4

# GPU that only exists in kubernetes
$ sky show-gpus P500
Kubernetes GPUs
GPU   QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
P500  1, 2, 3, 4    4           4

Cloud GPUs
Resources 'P500' not found in cloud catalogs. To show available accelerators, run: sky show-gpus --all

# GPU that doesn't exist in Kubernetes
$ sky show-gpus H100
Kubernetes GPUs
Resources 'H100' not found in Kubernetes cluster. If your cluster contains GPUs, make sure nvidia.com/gpu resource is available on the nodes and the node labels for identifying GPUs (e.g., skypilot.co/accelerator) are setup correctly. To show available accelerators on kubernetes, run: sky show-gpus --cloud kubernetes

Cloud GPUs
...

# Invalid GPU name
$ sky show-gpus K9000
Kubernetes GPUs
Resources 'K9000' not found in Kubernetes cluster. If your cluster contains GPUs, make sure nvidia.com/gpu resource is available on the nodes and the node labels for identifying GPUs (e.g., skypilot.co/accelerator) are setup correctly. To show available accelerators on kubernetes, run: sky show-gpus --cloud kubernetes

Cloud GPUs
Resources 'K9000' not found in cloud catalogs. To show available accelerators, run: sky show-gpus --all

===== When Kubernetes is enabled but does not have GPUs =====
(base) ➜  ~ sky show-gpus
COMMON_GPU  AVAILABLE_QUANTITIES
A10         1, 2, 4
A10G        1, 4, 8
A100        1, 2, 4, 8, 16
A100-80GB   1, 2, 4, 8
H100        1, 2, 4, 6, 8, 12
K80         1, 2, 4, 8, 16
L4          1, 2, 4, 8
M60         1, 2, 4
P100        1, 2, 4
T4          1, 2, 4, 8
V100        1, 2, 4, 8
V100-32GB   1, 2, 4, 8

GOOGLE_TPU   AVAILABLE_QUANTITIES
tpu-v2-8     1
tpu-v2-32    1
tpu-v2-128   1
tpu-v2-256   1
tpu-v2-512   1
tpu-v3-8     1
tpu-v3-32    1
tpu-v3-64    1
tpu-v3-128   1
tpu-v3-256   1
tpu-v3-512   1
tpu-v3-1024  1
tpu-v3-2048  1

Hint: use -a/--all to see all accelerators (including non-common ones) and pricing.

Note: No GPUs found in Kubernetes cluster. If your cluster contains GPUs, make sure nvidia.com/gpu resource is available on the nodes and the node labels for identifying GPUs (e.g., skypilot.co/accelerator) are setup correctly. To further debug, run: sky check.

$ sky show-gpus --cloud kubernetes
No GPUs found in Kubernetes cluster. If your cluster contains GPUs, make sure nvidia.com/gpu resource is available on the nodes and the node labels for identifying GPUs (e.g., skypilot.co/accelerator) are setup correctly. To further debug, run: sky check.


$ sky show-gpus --all
<Note is shown after the quantities before the start of the longer tables since that output can be quite long >
COMMON_GPU  AVAILABLE_QUANTITIES
...

GOOGLE_TPU   AVAILABLE_QUANTITIES
...

OTHER_GPU        AVAILABLE_QUANTITIES
...

Note: No GPUs found in Kubernetes cluster. If your cluster contains GPUs, make sure nvidia.com/gpu resource is available on the nodes and the node labels for identifying GPUs (e.g., skypilot.co/accelerator) are setup correctly. To further debug, run: sky check.

GPU   QTY  CLOUD       INSTANCE_TYPE        DEVICE_MEM  vCPUs  HOST_MEM  HOURLY_PRICE  HOURLY_SPOT_PRICE
...

===== When Kubernetes is not enabled =====
$ sky show-gpus
COMMON_GPU  AVAILABLE_QUANTITIES
A10         1, 2, 4
A10G        1, 4, 8
A100        1, 2, 4, 8, 16
A100-80GB   1, 2, 4, 8
H100        1, 2, 4, 6, 8, 12
K80         1, 2, 4, 8, 16
L4          1, 2, 4, 8
M60         1, 2, 4
P100        1, 2, 4
T4          1, 2, 4, 8
V100        1, 2, 4, 8
V100-32GB   1, 2, 4, 8

GOOGLE_TPU   AVAILABLE_QUANTITIES
tpu-v2-8     1
tpu-v2-32    1
tpu-v2-128   1
tpu-v2-256   1
tpu-v2-512   1
tpu-v3-8     1
tpu-v3-32    1
tpu-v3-64    1
tpu-v3-128   1
tpu-v3-256   1
tpu-v3-512   1
tpu-v3-1024  1
tpu-v3-2048  1

Hint: use -a/--all to see all accelerators (including non-common ones) and pricing.

(base) ➜  ~ sky show-gpus --cloud kubernetes
Kubernetes is not enabled. To fix, run: sky check kubernetes
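
The section ordering shown in the transcripts above can be summarized in a short sketch. This is an illustration of the observed behavior, not the code in sky/cli.py: `render_show_gpus` and its parameters are hypothetical, the tables are passed in as pre-rendered strings, the "No GPUs found" message is shortened to its first sentence, and name filters and -a/--all are not modeled.

```python
from typing import List, Optional

# First sentence only; the real message continues with debugging advice.
NO_K8S_GPUS = "No GPUs found in Kubernetes cluster."
K8S_DISABLED = "Kubernetes is not enabled. To fix, run: sky check kubernetes"
HINT = ("Hint: use -a/--all to see all accelerators "
        "(including non-common ones) and pricing.")


def render_show_gpus(k8s_enabled: bool,
                     k8s_table: Optional[str],
                     cloud_tables: str,
                     cloud: Optional[str] = None) -> str:
    """Assemble output sections in the order shown in the transcripts."""
    if cloud == "kubernetes":
        # Only the Kubernetes view is requested.
        if not k8s_enabled:
            return K8S_DISABLED
        if not k8s_table:
            return NO_K8S_GPUS
        return "Kubernetes GPUs\n" + k8s_table

    parts: List[str] = []
    if k8s_enabled and k8s_table:
        # Kubernetes section first, then the cloud tables under a title.
        parts.append("Kubernetes GPUs\n" + k8s_table)
        parts.append("Cloud GPUs\n" + cloud_tables)
    else:
        # No Kubernetes GPUs to show: cloud tables appear without a title.
        parts.append(cloud_tables)
    parts.append(HINT)
    if k8s_enabled and not k8s_table:
        # Enabled but empty cluster: a note follows the hint.
        parts.append("Note: " + NO_K8S_GPUS)
    return "\n\n".join(parts)
```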

Collaborator

@Michaelvll Michaelvll left a comment

Thanks @romilbhardwaj for updating this! It works great! LGTM!

sky/cli.py Outdated Show resolved Hide resolved
@Michaelvll
Collaborator

Michaelvll commented May 27, 2024

A minor point: for sky show-gpus -a, it would be nice to show the hint at the top instead of in the middle, since it is hard to find there, especially as the output is piped through | less.

@romilbhardwaj
Collaborator Author

Thanks @Michaelvll! Moved the hint to the top for -a and simplified the logic a bit in 997bec1.

@romilbhardwaj romilbhardwaj merged commit e006a79 into master May 27, 2024
20 checks passed
@romilbhardwaj romilbhardwaj deleted the k8s_show_gpus_availability branch May 27, 2024 21:23

Successfully merging this pull request may close these issues.

[k8s] Show currently available GPUs on Kubernetes cluster
2 participants