[Bug] Fail the job, if the head node crashes #2161

Closed
2 tasks done
peterghaddad opened this issue May 21, 2024 · 2 comments
Labels
bug Something isn't working rayjob


peterghaddad commented May 21, 2024

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

When the head node crashes (due to OOM or another cause) and GCS fault tolerance is disabled, the head node respawns but the job does not resume. The cluster itself recovers, yet the job's resources linger until the job is manually deleted.

Expected: add support so that, when GCS fault tolerance is disabled, a RayJob fails if the head node is interrupted.
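
The requested behavior can be sketched as a small decision rule. This is only an illustration under stated assumptions, not KubeRay's implementation: the names `JobObservation` and `next_job_status` and the status strings are hypothetical, and how an operator would actually detect a head restart is left out.

```python
from dataclasses import dataclass

# Hypothetical status strings, used only for this sketch.
RUNNING = "RUNNING"
FAILED = "FAILED"


@dataclass(frozen=True)
class JobObservation:
    """What a reconciler is assumed to know about a job at one point in time."""
    job_status: str        # current job status
    head_restarted: bool   # head node crashed or was deleted since submission
    gcs_ft_enabled: bool   # whether GCS fault tolerance is enabled


def next_job_status(obs: JobObservation) -> str:
    """Return the status the job should move to.

    Without GCS fault tolerance, a restarted head has lost the job's state,
    so a job that was running cannot resume and is marked as failed.
    """
    if obs.job_status == RUNNING and obs.head_restarted and not obs.gcs_ft_enabled:
        return FAILED
    return obs.job_status
```

With this rule, a running job whose head restarted without GCS fault tolerance moves to a failed state instead of staying in a running state indefinitely; every other combination leaves the status unchanged.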

Reproduction script

Create a RayJob custom resource with GCS fault tolerance disabled on the operator, then delete the head node once the job begins. The job stays in a running state until it is manually deleted.

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

peterghaddad added the bug and triage labels May 21, 2024

kevin85421 (Member) commented

> until the job is manually deleted.

  • Do you mean the RayJob CRD?
  • Which version of KubeRay are you using?

kevin85421 added the rayjob label and removed the triage label May 25, 2024

peterghaddad (Author) commented

Yes, I mean the KubeRay CRD.

I was using KubeRay v1.0.0; upgrading to v1.1.1 resolved these problems.
