
Job status is Failed when scaling in PS replicas #2070

Open
Mesilenceki opened this issue Apr 18, 2024 · 4 comments

Comments

@Mesilenceki

The tf-operator implementation supports scaling replicas. For instance, if I reduce the number of parameter server (PS) replicas from 4 to 2 by using `kubectl delete pod`, the pods receive a SIGTERM signal and the containers exit with code 137. However, the operator still checks whether each pod's exit code is normal, so the job ends up failing. Is such an implementation somewhat peculiar, or could it be considered a bug?
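
For illustration, here is a minimal Go sketch of the kind of exit-code check described above. This is hypothetical, not the operator's actual code; the only assumed dependency is the `k8s.io/api/core/v1` types, and `podFailed` is a made-up helper name.

```go
// Hypothetical sketch: a terminated container with a non-zero exit code is
// treated as a failure, so a PS pod killed during scale-in (SIGTERM, then
// exit code 137) would mark the whole job as failed.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// podFailed reports whether any container in the pod terminated with a
// non-zero exit code.
func podFailed(pod *corev1.Pod) bool {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Terminated != nil && cs.State.Terminated.ExitCode != 0 {
			return true // exit code 137 from a deleted PS pod lands here
		}
	}
	return false
}

func main() {
	// A PS pod removed with `kubectl delete pod` typically reports exit code 137.
	pod := &corev1.Pod{
		Status: corev1.PodStatus{
			ContainerStatuses: []corev1.ContainerStatus{
				{State: corev1.ContainerState{
					Terminated: &corev1.ContainerStateTerminated{ExitCode: 137},
				}},
			},
		},
	}
	fmt.Println("treated as failed:", podFailed(pod)) // prints: treated as failed: true
}
```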

@tenzen-y
Member

IIUC, the TFJob doesn't support any scaling approaches, but there aren't any validations preventing it either.
For confirmation: @kubeflow/wg-training-leads

@johnugeorge
Member

Yes. We do not have any handling for this case. I am curious whether you were doing this intentionally to test the behaviour?

@Mesilenceki
Author

> Yes. We do not have any handling for this case. I am curious whether you were doing this intentionally to test the behaviour?

Actually, we want to make our TFJobs scale elastically to improve resource utilization. I understand that supporting scale-in within the operator should not pose any other issues, right?

@tenzen-y
Member

We meant that this behavior is intended and expected, not a bug.
