[Bug] Fail the job, if the head node crashes #2161

peterghaddad · 2024-05-21T13:43:28Z

Search before asking

I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

When GCS fault tolerance is disabled and the head node crashes (due to OOM or other reasons), the cluster's head node respawns but the job does not resume. The cluster itself recovers, yet the job's resources linger until the job is manually deleted.

Expected: add support so that, when GCS fault tolerance is disabled, Ray Jobs fail when the head node is interrupted.
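The requested behavior can be read as a small decision rule. The sketch below is only an illustration of that rule, not KubeRay code: `HeadPodObservation` and `should_fail_job` are hypothetical names, and comparing the head pod's UID and restart count is an assumed way of detecting an interruption. Leaving the job alone when fault tolerance is enabled is also an assumption, since this issue only covers the disabled case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeadPodObservation:
    """Hypothetical snapshot of the head pod (not a KubeRay type)."""
    uid: str            # pod UID at the time of the observation
    restart_count: int  # container restart count at the time of the observation


def should_fail_job(
    at_submit: HeadPodObservation,
    now: HeadPodObservation,
    gcs_ft_enabled: bool,
    job_running: bool,
) -> bool:
    """Return True when a running job should be marked failed.

    Assumed rule: without GCS fault tolerance, any head pod replacement
    or restart since submission means the job cannot resume.
    """
    if not job_running or gcs_ft_enabled:
        return False
    head_replaced = now.uid != at_submit.uid
    head_restarted = now.restart_count > at_submit.restart_count
    return head_replaced or head_restarted
```

For example, `should_fail_job(HeadPodObservation("a", 0), HeadPodObservation("b", 0), gcs_ft_enabled=False, job_running=True)` returns `True`, because the head pod was replaced while the job was still running.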

Reproduction script

Create a Ray job CRD with GCS fault tolerance disabled on the operator, then delete the head node once the job begins. The job will stay in a running state until it is manually deleted.

Anything else

No response

Are you willing to submit a PR?

Yes I am willing to submit a PR!

kevin85421 · 2024-05-25T17:51:13Z

> until the job is manually deleted.

Do you mean RayJob CRD?
Which version of KubeRay do you use?

peterghaddad · 2024-06-03T16:29:50Z

Yes, I mean the KubeRay CRD.

I was using KubeRay v1.0.0; upgrading to v1.1.1 resolved these problems.

peterghaddad added bug and triage labels May 21, 2024

kevin85421 added rayjob and removed triage labels May 25, 2024

peterghaddad closed this as completed Jun 3, 2024
