Job status is Failed when scaling in PS #2070
Comments
IIUC, the TFJob doesn't support any scaling approaches, but there is no validation to reject it either.
Yes. We do not have any handling for this case. I am curious whether you were doing this intentionally to test the behaviour?
Actually, we want to scale our TFJobs elastically to improve resource utilization. Am I right in understanding that supporting scale-in within the operator should not pose any other issues?
We meant that this behavior is intended and expected, not a bug.
In the current implementation, tf-operator does react to changes in the number of replicas. For instance, if I reduce the number of parameter server (PS) replicas from 4 to 2 by using kubectl delete pod, the pods receive the SIGTERM signal and the containers exit with code 137. However, the operator still checks whether each pod's exit code is normal, so the job is marked as failed. Is such an implementation somewhat peculiar, or could it be considered a bug?