
Job status is Failed when scaling in PS replicas #2070

Open
Mesilenceki opened this issue Apr 18, 2024 · 4 comments

Comments

@Mesilenceki

The tf-operator implementation supports scaling replicas. For instance, if I reduce the number of parameter server (PS) replicas from 4 to 2 by using `kubectl delete pod`, the pods receive a SIGTERM signal and the containers exit with code 137. However, the operator still checks whether each pod's exit code is normal, so the job ends up failing. Is such an implementation somewhat peculiar, or could it be considered a bug?
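
For illustration, here is a minimal Go sketch of the kind of exit-code check described above. This is hypothetical, not the operator's actual code; the only assumed dependency is the `k8s.io/api/core/v1` types, and `podFailed` is a made-up helper name.

```go
// Hypothetical sketch: a terminated container with a non-zero exit code is
// treated as a failure, so a PS pod killed during scale-in (SIGTERM, then
// exit code 137) would mark the whole job as failed.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// podFailed reports whether any container in the pod terminated with a
// non-zero exit code.
func podFailed(pod *corev1.Pod) bool {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Terminated != nil && cs.State.Terminated.ExitCode != 0 {
			return true // exit code 137 from a deleted PS pod lands here
		}
	}
	return false
}

func main() {
	// A PS pod removed with `kubectl delete pod` typically reports exit code 137.
	pod := &corev1.Pod{
		Status: corev1.PodStatus{
			ContainerStatuses: []corev1.ContainerStatus{
				{State: corev1.ContainerState{
					Terminated: &corev1.ContainerStateTerminated{ExitCode: 137},
				}},
			},
		},
	}
	fmt.Println("treated as failed:", podFailed(pod)) // prints: treated as failed: true
}
```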

@tenzen-y
Member

IIUC, the TFJob doesn't support any scaling approaches, but there aren't any validations preventing it either.
For confirmation: @kubeflow/wg-training-leads

@johnugeorge
Member

Yes. We do not have any handling for this case. I am curious whether you were doing this intentionally to test the behaviour?

@Mesilenceki
Author

> Yes. We do not have any handling for this case. I am curious whether you were doing this intentionally to test the behaviour?

Actually, we want to make our TFJobs scale elastically to improve resource utilization. I understand that supporting scale-in within the operator should not pose any other issues, right?

@tenzen-y
Member

We meant that this behavior is intended and expected, not a bug.
