[Bug] [raycluster-controller] KubeRay does not recreate a new RayCluster head pod after it has been evicted by the kubelet due to disk pressure #2125

xjhust opened this issue May 8, 2024 · 1 comment
Labels: bug (Something isn't working), triage

Comments

xjhust commented May 8, 2024

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

I have deployed kuberay-operator and a RayCluster in my k8s cluster, and they work fine most of the time. When I use the fallocate command to push the k8s node where the RayCluster head pod runs into a disk-pressure condition, the head pod is evicted by the kubelet. No new head pod is created on another healthy k8s node, even after a long time. And when I delete the large file created by fallocate and relieve the node's disk pressure, the head pod remains evicted and still no new head pod is created.
So I have to manually delete the evicted head pod to make the cluster work again; obviously this makes our production environment unstable and the service is not highly available.
My expected behavior is that the raycluster controller works like a Deployment: when the head pod is evicted, it automatically recreates a new head pod.
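For now the workaround is that manual cleanup. Below is a minimal client-go sketch of it, not part of KubeRay itself; it assumes the RayCluster runs in the default namespace and that its head pod carries the ray.io/node-type=head label, both of which may differ in other setups.

// delete_evicted_head.go: manually remove an evicted RayCluster head pod so the
// controller creates a replacement. Sketch only; namespace and label are assumptions.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Assumed label selector for KubeRay head pods.
	pods, err := clientset.CoreV1().Pods("default").List(context.TODO(), metav1.ListOptions{
		LabelSelector: "ray.io/node-type=head",
	})
	if err != nil {
		panic(err)
	}

	for _, pod := range pods.Items {
		// An evicted pod reports phase=Failed with reason=Evicted; the kubelet never restarts it.
		if pod.Status.Phase == corev1.PodFailed && pod.Status.Reason == "Evicted" {
			fmt.Printf("deleting evicted head pod %s\n", pod.Name)
			if err := clientset.CoreV1().Pods(pod.Namespace).Delete(context.TODO(), pod.Name, metav1.DeleteOptions{}); err != nil {
				panic(err)
			}
		}
	}
}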

Reproduction script

Basic version information

Kubernetes: v1.20.15
ray-operator: v1.0.0
raycluster(ray): 2.9.0

Reproduction steps

  1. Deploy kuberay-operator and a RayCluster in the k8s cluster.
  2. Find the k8s node where the RayCluster head pod runs, and use a fallocate command such as fallocate -l 1000G tmpfile to fill the imagefs; the head pod will then be evicted by the kubelet due to disk pressure.
  3. Wait for a while: no new head pod is created on another k8s node.
  4. Free the imagefs by deleting the tmpfile and wait for a while: the head pod is still in the evicted state and the RayCluster is still not available.

Anything else

This happens every time the RayCluster head pod is evicted.
The relevant log lines from the ray-operator pod look like this:
2024-05-08T07:55:30.304Z INFO controllers.RayCluster reconcilePods {"Found 1 head Pod": "ray-cluster-head-dqtxb", "Pod status": "Failed", "Pod restart policy": "Always", "Ray container terminated status": "nil"}
2024-05-08T07:55:30.304Z INFO controllers.RayCluster reconcilePods {"head Pod": "ray-cluster-head-dqtxb", "shouldDelete": false, "reason": "The status of the head Pod ray-cluster-head-dqtxb is Failed. However, KubeRay will not delete the Pod because its restartPolicy is set to 'Always' and it should be able to restart automatically."}

So I read the relevant code in the raycluster controller and found that it relies on the Pod's restartPolicy to restart the Pod when its status is Failed. But in this case, a pod that has been evicted by the kubelet is never restarted in place, so the RayCluster stays broken after the eviction.
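If the maintainers agree, one possible direction for a PR would be to treat evicted pods as unrecoverable regardless of restartPolicy. The helper below is only a sketch written for this issue, not existing KubeRay code; the function and package names are illustrative.

package controllers // package name is illustrative

import (
	corev1 "k8s.io/api/core/v1"
)

// isPodEvicted is a hypothetical helper: an evicted Pod reports phase=Failed with
// reason=Evicted and is never restarted in place by the kubelet, even when its
// restartPolicy is Always. The reconcile loop could treat this case as
// shouldDelete=true so the next reconcile creates a fresh head Pod.
func isPodEvicted(pod corev1.Pod) bool {
	return pod.Status.Phase == corev1.PodFailed && pod.Status.Reason == "Evicted"
}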

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

DmitriGekhtman (Collaborator) commented May 16, 2024

Well, at least the log message clearly expresses the authors' misunderstanding of Kubernetes eviction behavior (which to be fair is extremely confusing).
