[Ray Autoscaling] Issues related to the handling of Pending Worker Nodes when scaling down #45195
Labels
core, enhancement, @external-author-action-required, P1
Description
When using KubeRay to deploy a Ray cluster on Kubernetes, some Worker Nodes can get stuck in the Pending state if the Kubernetes cluster's resources are tight during scale-up. After a job finishes, the Running Worker Nodes are scaled down according to the configured `idleTimeoutSeconds`. Then, as resources in the cluster are freed, the Worker Nodes that were previously Pending transition to Running and must in turn wait another `idleTimeoutSeconds` before they are scaled down. When there are many Pending Worker Nodes, it therefore takes a long time to scale down all the unneeded nodes and release the occupied resources after the job completes, which lowers resource utilization.
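To make the cost concrete, here is a rough back-of-the-envelope sketch of the drain time (the `total_drain_time_s` helper and all numbers are hypothetical, and the model ignores pod scheduling latency):

```python
import math

# Hypothetical model: the cluster has room for `capacity` Running workers
# at once, `pending` workers are stuck Pending, and every Running worker
# must sit idle for `idle_timeout_s` before the autoscaler removes it.
def total_drain_time_s(pending: int, capacity: int, idle_timeout_s: int) -> int:
    # One round for the workers that are already Running, then one round
    # for each batch of formerly-Pending workers that gets scheduled into
    # the freed capacity and must idle out in turn.
    rounds = 1 + math.ceil(pending / capacity)
    return rounds * idle_timeout_s

# Example: 90 Pending workers, capacity for 10 Running at a time, 60 s
# timeout -> (1 + 9) * 60 s = 600 s before all resources are released.
print(total_drain_time_s(pending=90, capacity=10, idle_timeout_s=60))
```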
Use case
Users may set a large `maxWorkerNum` and submit a large number of Ray Tasks at once with autoscaling enabled. Under the current autoscaling rules, the autoscaler tries to allocate enough Worker Nodes to satisfy the resources required by all Tasks, so when Kubernetes resources are tight, a large number of Worker Nodes end up Pending, i.e. the number of available Worker Nodes is small while the number of desired Worker Nodes is large. When the tasks complete, the available Worker Nodes are scaled down according to the configured `idleTimeoutSeconds`, the released resources are immediately claimed by the Pending Worker Nodes, and those nodes must then wait another `idleTimeoutSeconds`, so releasing all the resources takes a long time.

Possible solution: before scaling down an idle Worker Node of a given type, remove all Pending Worker Nodes of that type first; if a Worker Node of that type is already idle, the Pending nodes of the same type cannot be needed either. A sketch of this rule follows below.
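A minimal sketch of the proposed rule, assuming a simplified in-memory node model (the `Node` class and `scale_down_idle` helper are illustrative only, not the actual autoscaler code):

```python
from dataclasses import dataclass

@dataclass(eq=False)  # compare nodes by identity, not by field values
class Node:
    node_type: str
    state: str  # simplified: "Running", "Pending", or "Idle"

def scale_down_idle(nodes: list[Node], idle_node: Node) -> list[Node]:
    """Terminate an idle worker together with all Pending workers of its type.

    Rationale: if a worker of this type is already idle, the Pending
    workers of the same type cannot be needed, so removing them in the
    same step avoids paying idleTimeoutSeconds for each of them later.
    """
    doomed = {id(n) for n in nodes
              if n.state == "Pending" and n.node_type == idle_node.node_type}
    doomed.add(id(idle_node))
    return [n for n in nodes if id(n) not in doomed]

# Example: one idle "small" worker triggers removal of both Pending
# "small" workers in the same step; only the "large" Running node survives.
cluster = [Node("small", "Idle"), Node("small", "Pending"),
           Node("small", "Pending"), Node("large", "Running")]
print(scale_down_idle(cluster, cluster[0]))
```

With this rule, each idle-timeout expiry removes the idle node and every Pending node of its type in one step, so the cluster never waits `idleTimeoutSeconds` for a node that was never actually used.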