SkyPilot v0.5.0: SkyServe, New Provisioner, LLMs, Kubernetes, and More Clouds
We are excited to release SkyPilot v0.5.0, which introduces a significant number of new features and enhancements, including:
- SkyPilot Serving
- New provisioner
- LLM recipes for the latest open models and engines
- Kubernetes support improvements
- 4 new clouds (contributed by the cloud providers!)
and more!
Release Highlights
New Features
- Multiple candidate resources: SkyPilot now supports multiple candidate resources for a single task (using multiple accelerators, `any_of` or `ordered` in `resources`), allowing users to significantly enlarge the resource pool and get higher availability.
- New Provisioner: The provisioner gets a new implementation, which is 2x faster and more reliable for supported clouds. It supports launching clusters with more than 100 nodes. Dependency requirements for clouds are also significantly reduced.
- Disk Tier: Introducing the `best` disk tier for the best performance and cost, so you can choose the best disk for any cloud. (#2434)
- Allow 2x spot jobs to be run concurrently
- Mount storage back after cluster restart
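The difference between the two forms of candidate resources can be pictured with a small model. The sketch below is not SkyPilot's optimizer. It assumes that `ordered` candidates are tried in the order listed, and that `any_of` candidates are unordered, with the cheapest available one chosen. The `Candidate` class, the prices, and the availability flags are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    """A hypothetical candidate resource (not a SkyPilot class)."""
    accelerator: str
    hourly_price: float  # made-up price, for illustration only
    available: bool      # made-up availability flag


def pick_ordered(candidates: List[Candidate]) -> Optional[Candidate]:
    """Assumed `ordered` behavior: first available candidate in listed order."""
    for cand in candidates:
        if cand.available:
            return cand
    return None


def pick_any_of(candidates: List[Candidate]) -> Optional[Candidate]:
    """Assumed `any_of` behavior: cheapest available candidate, order ignored."""
    available = [c for c in candidates if c.available]
    return min(available, key=lambda c: c.hourly_price, default=None)


candidates = [
    Candidate("A100", hourly_price=3.0, available=False),
    Candidate("V100", hourly_price=2.5, available=True),
    Candidate("L4", hourly_price=0.7, available=True),
]

print(pick_ordered(candidates).accelerator)  # V100
print(pick_any_of(candidates).accelerator)   # L4
```

In this model, a longer candidate list only ever adds options, which is why a larger resource pool tends to improve availability.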
SkyServe
SkyServe is a serving system on top of SkyPilot that deploys and scales any HTTP services across one or more regions or clouds, with autoscaling, load balancing, and more.
- Introducing SkyServe: deploy and scale your AI models across multiple regions or clouds. (#2458)
- Autoscaler: Request rate based autoscaling policy. (#2868, #2878)
- Autoscaler: Support scaling to 0 when no requests (#2938)
- Rolling update: Support rolling update for existing services (#2935, #3057)
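A request-rate-based policy with scale-to-zero can be illustrated with a short calculation. This is a minimal sketch of the general idea, not SkyServe's implementation: the function name and the parameters `target_qps_per_replica`, `min_replicas`, and `max_replicas` are assumptions chosen for illustration, and the sketch ignores any smoothing or delay a real autoscaler would apply before acting.

```python
import math


def target_replicas(request_rate_qps: float,
                    target_qps_per_replica: float,
                    min_replicas: int,
                    max_replicas: int) -> int:
    """Replica count needed to keep each replica at or below its target load.

    All parameter names are hypothetical. Setting min_replicas to 0 models
    scaling to zero when there are no requests.
    """
    if target_qps_per_replica <= 0:
        raise ValueError("target_qps_per_replica must be positive")
    needed = math.ceil(request_rate_qps / target_qps_per_replica)
    # Clamp the demand-driven count to the configured bounds.
    return max(min_replicas, min(max_replicas, needed))


print(target_replicas(0.0, 2.0, min_replicas=0, max_replicas=5))   # 0
print(target_replicas(7.0, 2.0, min_replicas=0, max_replicas=5))   # 4
print(target_replicas(50.0, 2.0, min_replicas=1, max_replicas=5))  # 5
```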
Other Enhancements
- Environment variable support in services field (#3078)
- Override task configurations with CLI arguments (#2979)
- Logging improvement for replicas (#2924, #2949)
- Smoke tests for SkyServe (#2911)
- Documentation for SkyServe (#3022, #2794, #2864, #2894, #2922, #2989, #3182)
- UX improvements for SkyServe (#2895, #2940, #2961, #3054, #3176, #3094)
- Bug fixes and robustness improvement (#2811, #2822, #2860, #2995, #2983, #3058, #3075, #3226)
New LLM Recipes
- Gemma: Serve your Gemma on any cloud (#3207, #3220)
- SGLang: Speed up your LLM deployments with SGLang for 5x throughput on SkyServe (#3126, #3140, #3170, #3145)
- Mixtral 8x7B: Serving and scaling the Mixtral 8x7B model in any region or cloud (#2857, #2888, #3017, #3067, #2882)
- Mistral 7B: Official docs for hosting Mistral 7B from mistral.ai (#2615, #2856)
- CodeLlama: Hosting CodeLlama model with SkyServe and accessing it with API, chat or VSCode (#3050, #3143)
- LoRAX: efficient multi-LoRA LLM inference (#2883)
- axolotl: one of the latest LLM tools for finetuning AI models, running on SkyPilot (#2784, #2789)
- Tabby: Self-host coding assistant Tabby on SkyPilot (#2597, #3068)
- vLLM: Serve with vLLM to expose OpenAI API for Vicuna and Mixtral (#2614, #2643, #2616, #2786, #2791, #2948, #3118)
- TGI: Scale the inference engine TGI with SkyServe (#3121)
Kubernetes
Kubernetes support received a number of new features and enhancements.
- Multi-node support for Kubernetes (#2609, #3019)
- Open ports support for Kubernetes (#2588, #2713, #2997, #3200)
- Support Coreweave label for GPUs in Kubernetes (Coreweave support under development) (#2650)
- Starting a Kubernetes GPU cluster locally with `sky local up` (#2890)
- Custom image support for Kubernetes instances (#2729, #3019, #3210)
- New provisioner for Kubernetes for better performance and robustness (#3019)
- Supporting Kubernetes cluster launched with k3s and Rancher (#3148)
Other Enhancements
- Support H100 80GB in Kubernetes (#2840)
- Share SSH jump pod across users to reduce resources consumption (#2826)
- Allow `KUBECONFIG` env var for config file specification (#3169)
- Robustify Kubernetes cluster removal (#3043)
- Fix GPU labeller (#2636, #2653)
- UX and robustness improvements (#2638, #2712, #2589, #2785, #2551, #2795, #2884, #2913)
- Documentation improvements (#2595, #2705, #2957, #2991, #2997, #3119)
More Clouds
SkyPilot now supports 13 cloud providers, including 4 new provider-contributed clouds: VMware vSphere, RunPod, Fluidstack and Cudo Compute.
- RunPod: RunPod is a specialized AI cloud, with additional capacity for high-end GPUs. (#2980, #3018)
- Fluidstack: Fluidstack offers accessible GPUs for AI at low cost. (#3086, #3224)
- Cudo Compute: a GPU cloud that provides low-cost GPUs powered by green energy. (#2975, #3224)
- VMware vSphere: you can now bring your own vSphere cluster to SkyPilot. (#3000)
Clouds
AWS
New Features
- New provisioner for AWS: >2x faster for multi-node provisioning and more reliable for cluster launching. (#1702, #2719, #2792)
- Support for AWS Trainium accelerator (#2690)
- Support null for proxy command to filter regions (#2756)
- Support CUDA 12.1 with default image updates (#2788)
- Job scheduling on Inferentia and Trainium (#2969, #2798)
- Allow specifying security_group (#3133)
Enhancements
- Make public / private subnet selection robust (#2867)
- Avoid hanging for restarting an instance in STOPPING state (#2998)
- Remove sunset instance types (#2610)
- Add docs for custom VPC support (#2776)
Fixes
- Fix conda installation on AWS default image (#3206)
- Robustify the custom image support (#3216)
- Fix subnet selection for AWS and autodown for spot instances (#2921)
- Fix minimal permission for AWS (#2978)
- Improve opening ports for AWS (#2716)
- Autostop with new provisioner (#2719)
GCP
New Features
- Security: Custom VPC support for GCP. (#2764, #2772, #2854, #2944)
- Security: Support private IP with proxy jump on GCP. (#2819)
- New provisioner: Adopted new provisioner for GCP with >2x faster and more robust provisioning (#2681, #2719, #2943)
- Automatically use reserved instances from multiple reserved pools (#2836, #2681)
- Support L4 accelerator for GCP (#2724)
- Allow stopping spot clusters on GCP (#2877)
Enhancements
- Allow stopping VM with local SSD (#2587)
- Update default runtime version for TPU node (#2601, #2602)
- Handling transient error during launching GCP clusters (#2669)
- Update GCSFuse version to 1.3.0 for GCS storage mount (#2887)
- Make TPU VM the default option for TPU accelerators (#1758)
- Ignore missing GCP credentials for latest gcloud and avoid duplicating credentials (#3028, #3172, #3234)
Fixes
- Fix custom docker image support (#3218)
- Fix minimal roles required for GCP (#2704)
- Robustify the catalog fetching (#3141)
- Fix ports on TPU VM and cluster launched before 0.4.0 (#2641)
- Fix backward compatibility issue with GCP clusters (#2604)
- Fix `--disk-size` for custom machine images (#2718)
- Update catalog fetcher with more options (#2562)
- Assign GCP VMs with service account (#2972)
- Fix machine image support (#3030, #3236)
- Fix error handling for failed provisioning (#2852)
- Leave out TPU v5 in catalog as it is not supported (#2656)
- Fix GCP minimal permission (#2947, #2770, #2761)
Azure
Enhancements
- Make port opening more robust (#2649, #2891, #3084)
- Additional arguments for Azure catalog fetcher and support H100 (#2561, #2844, #2847)
- Support CUDA 12.1 with default image updates (#2468)
- Support spot instances on Azure (#2871)
Fixes
- Fix custom docker image support (#3218)
- UX: Fix Azure disk tier explicitly shown in resources str (#3064)
- Fix status query for Azure (#3015)
SCP
- Fix SCP error raised in `sky check` (#3038)
CLI & Core interfaces
New Features
- Multi-node jobs fail fast on a single node failure (#3081)
- Add configurations for not uploading credentials (#2904)
- Add `sky status --endpoints` CLI (#3199)
- Support more characters in cluster name (#3130)
- Show all regions and more accurate prices in `sky show-gpus` (#2583, #2892, #2933, #2946, #3083, #3149, #3113)
- Allow inferring cloud from region or zone (#2632)
- Add `--commit` and `--version` for `sky` CLI (#2720, #2731, #2733)
Enhancements
- Robustify runtime initialization on remote cluster (#3132)
- Better error message for YAML parsing (#3040)
- Smarter GPU name completion (#3014)
- Speed up retry until up by not doing exponential backoff (#2821)
- Add schema validation for config (#2645)
- Allow `--disk-tier none` override (#2906)
- `sky check` improvement (#3174, #3212, #3160)
- Better logging for CLIs (#2535, #2691, #2728, #3139, #3175)
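To see why dropping exponential backoff speeds up retry-until-up, compare the cumulative wait of the two schedules. This is only an illustrative sketch: it assumes a fixed gap takes the place of the backoff, and the helper names, the gap lengths, and the retry count are hypothetical values rather than SkyPilot's actual settings.

```python
from typing import List


def exponential_gaps(retries: int, initial: float, factor: float,
                     cap: float) -> List[float]:
    """Wait before each retry when the gap grows geometrically up to a cap."""
    return [min(cap, initial * factor ** i) for i in range(retries)]


def constant_gaps(retries: int, gap: float) -> List[float]:
    """Wait before each retry when the gap stays fixed."""
    return [gap] * retries


# Hypothetical numbers: six retries, starting from a 5-second gap.
expo = exponential_gaps(retries=6, initial=5.0, factor=2.0, cap=300.0)
flat = constant_gaps(retries=6, gap=5.0)

print(sum(expo))  # 315.0 seconds spent waiting
print(sum(flat))  # 30.0 seconds spent waiting
```

The trade-off is that a fixed gap issues provisioning requests more often, which suits a loop whose goal is to grab capacity as soon as it appears.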
Fixes
- Fix permission issues for SSH config file on specific linux distributions (#3151)
- Fix `sky_logs` and mounting directory (#2667, #2845)
- Fix job related commands (#2662, #2767)
- Fix `sky logs` with `--sync-down` (#2660)
Deprecations
- Deprecate `cpunode/gpunode/tpunode`, hide `admin` (#2800)
- Remove deprecated `Local` cloud which is now replaced by Kubernetes support (#3037, #3186)
Backend/Provisioner
New Features
- Support multiple candidate resources (#2498, #2803, #2833, #2886, #3107)
- Support launching 100-node cluster for AWS, GCP, Kubernetes, and RunPod (#3004, #3005)
- Support spaces in paths (#2762)
- Support long local username with special characters (#3105, #3130)
Enhancements
- Robustify termination of failed clusters during failover (#2990)
- Improve the ssh check for clusters just provisioned (#2797)
- Robustify failover to avoid terminating clusters that have user data (#2977)
- Move ssh config to `~/.ssh/generated/ssh` instead of directly editing `~/.ssh/config` (#2706, #3069)
- Code refactoring and cleanup (#2541, #2736, #3046, #2633, #2870, #2925, #3087, #3088, #3153)
- Improve usage collection (#2654, #2672)
- Better explanation of failover in docs (#2850, #2834)
Fixes
- Avoid backward compatibility issue with provisioner (#2682)
- Fix cloud provisioning internal file mount cache (#2715)
- Fix optimization for DAG when some resources provided are not feasible (#2657)
- Fix runtime installation on remote VM (#2909, #2912)
- Fix cluster termination when the cluster is not fully UP (#3025)
- Fixes for tests (#2651, #2976, #3023, #3166, #3167, #3202)
- Improve logging (#2594, #2678, #2696, #3003)
Managed spot
New Features
Enhancements
- Better logging and UX (#2630)
- Add docs for customizing spot controller (#2753)
- Add spot pipeline docs (#2936)
Fixes
- Fix private VPC support for spot jobs (#2874)
- Fix `~/.sky/config.yaml` for spot jobs (#2876)
- Fix OOM for long running spot jobs (#2675)
- Fix AWS NoCredentialError caused by credential rotation (#2695)
- Fix Azure dependency on spot controller (#2875)
Storage
New Features
Enhancements
- Clarify the syntax for external and managed storage (#3162, #2804)
- Confirmation prompt for sky storage delete, and --yes flag to skip it (#2726)
- Refactor and clean up storage code (#2774, #2986)
Fixes
- Fix permission issue for S3 mounting on specific images (#3215)
- Fix spaces in source path for storages (#2835)
Dependencies
- Recommend nightly build in docs for better performance and robustness (#2984)
- Automatic build for nightly Docker image (#2229)
- Avoid ray dependency locally for AWS, GCP, and Kubernetes (#2625, #2943, #3019)
- Remove AWS dependency by default for better setup time and fewer conflicts (#2841, #2942)
- Fix GCP dependency by updating google-api-python-client (#2577, #2759)
- Pin remote dependency for ray job (#2659)
- Robustify dependencies (#2642, #2679, #3024)
Examples
- NeMo distributed training for BERT and GPT3 (#2533)
- Add docker compose example to run multiple containers (#2745)
- Distributed ray train example (#2828)
- Benchmark Torch DDP (#2987)
- Example updates for supported models (#2637, #2825)
Full Changelog: v0.4.0...v0.5.0
Thanks to all contributors!
New contributors: @rtalaricw, @jackyk02, @Vaibhav2001, @rohanvaidya45, @shrinandan, @manishiitg, @amitkumarj441, @tgaddair, @aseriesof-tubes, @changxiaohui, @thams, @kishb87, @PratikKumar125, @mmcclean, @dtran24, @davidwagnerkc, @mjibril, @kbrgl, @msehsah1, @JungleCatSW, @Ying1123
Many thanks to everyone who contributed to this release!
Contributors: @Michaelvll, @concretevitamin, @cblmemo, @romilbhardwaj, @MaoZiming, @landscapepainter, @sunny0826, @suquark, @Vaibhav2001, @infwinston, @hemildesai, @asaiacai, @shrinandan, @kishb87, @rtalaricw, @iojw, @aseriesof-tubes, @manishiitg, @jackyk02, @mmcclean, @thams, @amitkumarj441, @rohanvaidya45, @saihtaungkham, @tgaddair, @davidwagnerkc, @PratikKumar125, @dtran24, @changxiaohui, @mjibril, @kbrgl, @msehsah1, @JungleCatSW, @Ying1123