Failed K8s nodes leave jobs hanging indefinitely #2072
Kubernetes batch/v1 Job has a similar feature: Pod Failure Policy and Pod Disruption Conditions. /kind feature
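For reference, a minimal sketch of the batch/v1 Job feature mentioned above, written against the Go types in k8s.io/api: a podFailurePolicy rule keyed on the "DisruptionTarget" pod condition so that failures caused by disruption (e.g. node shutdown) are retried rather than counted against the backoff limit. The second rule and the exit code value are illustrative assumptions, not taken from this issue.

```go
package example

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// failurePolicy sketches a batch/v1 Job podFailurePolicy that ignores pod
// failures caused by node disruption (surfaced via the DisruptionTarget
// pod condition) and fails the Job outright on a non-retriable exit code.
func failurePolicy() *batchv1.PodFailurePolicy {
	return &batchv1.PodFailurePolicy{
		Rules: []batchv1.PodFailurePolicyRule{
			{
				// Pod was disrupted (node shutdown, eviction, preemption):
				// ignore the failure and let a replacement pod be created.
				Action: batchv1.PodFailurePolicyActionIgnore,
				OnPodConditions: []batchv1.PodFailurePolicyOnPodConditionsPattern{
					{
						Type:   corev1.PodConditionType("DisruptionTarget"),
						Status: corev1.ConditionTrue,
					},
				},
			},
			{
				// Application-level failure (illustrative exit code):
				// fail the whole Job immediately instead of retrying.
				Action: batchv1.PodFailurePolicyActionFailJob,
				OnExitCodes: &batchv1.PodFailurePolicyOnExitCodesRequirement{
					Operator: batchv1.PodFailurePolicyOnExitCodesOpIn,
					Values:   []int32{1},
				},
			},
		},
	}
}
```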
Observed Problem
I tested this with PyTorchJobs, but presumably it applies to other job types as well. If you fully shut down a node that a job is running on, and that job's RestartPolicy is set to `OnFailure`, the PyTorchJob does not recover gracefully. The controller does recognize that the pod failed, but it tries to terminate the pod before it creates a new one. Since the node is down, the pod cannot be terminated, so it stays in a "Terminating" state forever and the PyTorchJob stays in a "Restarting" state forever.
Proposed Solution
Add a configurable interval after which a pod that has been stuck in the `Terminating` status for more than the given interval is force-deleted. Concretely, modify the `DeletePod` function in pod_control.go so that, when it checks the deletion timestamp, it forces a deletion if the configured interval has elapsed since the deletion timestamp. A sketch of this check follows below. I plan to submit a PR for this unless there's better insight on how to handle this.
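To make the idea concrete, here is a rough sketch of the proposed check using client-go. This is not the actual `DeletePod` code from pod_control.go; the function name `deletePodWithTimeout` and the `forceDeleteTimeout` value are hypothetical placeholders for whatever the configurable interval ends up being.

```go
package example

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// forceDeleteTimeout is a hypothetical, configurable interval; a pod that
// has carried a deletionTimestamp for longer than this is assumed to be
// stuck on a dead node and is force-deleted.
const forceDeleteTimeout = 5 * time.Minute

// deletePodWithTimeout illustrates the proposed behavior: if the pod is
// already terminating and the configured interval has elapsed since its
// deletionTimestamp, delete it with a zero grace period so the API server
// removes the object even though the kubelet on the dead node can never
// confirm termination.
func deletePodWithTimeout(ctx context.Context, c kubernetes.Interface, namespace, name string) error {
	pod, err := c.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}

	opts := metav1.DeleteOptions{}
	if ts := pod.DeletionTimestamp; ts != nil && time.Since(ts.Time) > forceDeleteTimeout {
		// The pod has been "Terminating" past the timeout: force the deletion.
		zero := int64(0)
		opts.GracePeriodSeconds = &zero
	}
	return c.CoreV1().Pods(namespace).Delete(ctx, name, opts)
}
```

The force-delete branch does the same thing an operator would do by hand with `kubectl delete pod --force --grace-period=0`; the controller would just apply it automatically once the configured interval has passed.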