-
Notifications
You must be signed in to change notification settings - Fork 323
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug] Fail the job, if the head node crashes #2161
Comments
|
Yes, mean KubeRay CRD. Using KubeRay v1.0.0 and upgrading to V1.1.1 resolved these problems. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
When the head node crashes due to OOM or other reasons, the cluster head node will respawn and the job will not gracefully resume when GCS fault tolerance is disabled. The job does not resume; however, the cluster recovers leaving lingering resources until the job is manually deleted.
Add in support for that when GCS is disabled, Ray Jobs fail when there are head node interruptions.
Reproduction script
Create a Ray job CRD with GCS fault tolerance disabled on the operator, delete the head node once the job begins. The job will stay in a running state until the job is manually deleted.
Anything else
No response
Are you willing to submit a PR?
The text was updated successfully, but these errors were encountered: