dlrover/blob/master/docs/tutorial/tf_elasticjob_on_k8s: tf_elasticjob_on_k8s example failed to start #1121
Comments
I have fixed the example in #1141. You can try the example with the latest master branch.
Something's not quite right, and I'm not sure what caused it...
It appears that the workers failed to start.
It does not matter.
You can use kubectl to delete a worker, and a new worker will be started to replace it.
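The suggestion above (delete a worker and let a replacement come up) could look roughly like the sketch below. The `dlrover` namespace is taken from the `kubectl logs` command in the issue; the helper name `restart_worker` and the pod name in the usage note are hypothetical, so list the real pod names with `kubectl get pods -n dlrover` first.

```shell
#!/bin/sh
# Minimal sketch, not an official DLRover procedure.
# KUBECTL can be overridden (for example KUBECTL="echo kubectl")
# to preview the command without touching the cluster.
KUBECTL="${KUBECTL:-kubectl}"
NAMESPACE="${NAMESPACE:-dlrover}"

# restart_worker is a hypothetical helper: it deletes the named worker pod
# so that a replacement can be started, as described in the comment above.
restart_worker() {
    pod="$1"
    if [ -z "$pod" ]; then
        echo "usage: restart_worker <worker-pod-name>" >&2
        return 1
    fi
    $KUBECTL delete pod "$pod" -n "$NAMESPACE"
}
```

For example, `restart_worker <worker-pod-name>` followed by `kubectl get pods -n dlrover -w` lets you watch whether a new worker pod appears.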
While executing the tf_elasticjob_on_k8s example, the job fails with an error.
cd examples/tensorflow/criteo_deeprec
kubectl apply -f autoscale_job.yaml
The initial error messages are as follows:
Error logs from the chief pod:
kubectl logs -f deepctr-auto-scale-edljob-chief-0 -n dlrover
Upon unpacking the image registry.cn-hangzhou.aliyuncs.com/intell-ai/dlrover:deeprec_criteo_v1, it appears that the directory is missing. However, there are files in the directory /dlrover/examples/tensorflow/criteo_deeprec.
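One way to double-check which files the image actually contains, without unpacking it by hand, is to run `ls` inside a throwaway container. This is only a sketch: it assumes Docker is available locally and that the image ships an `ls` binary, and the helper name `list_image_dir` is hypothetical.

```shell
#!/bin/sh
# Minimal sketch for inspecting a directory inside the example image.
# DOCKER can be overridden (for example DOCKER="echo docker") to preview
# the command instead of running it.
DOCKER="${DOCKER:-docker}"
IMAGE="registry.cn-hangzhou.aliyuncs.com/intell-ai/dlrover:deeprec_criteo_v1"

# list_image_dir is a hypothetical helper: it overrides the image entrypoint
# with ls and lists the given directory (default: the example directory
# mentioned in the issue).
list_image_dir() {
    dir="${1:-/dlrover/examples/tensorflow/criteo_deeprec}"
    $DOCKER run --rm --entrypoint ls "$IMAGE" -la "$dir"
}
```

Calling `list_image_dir` with no argument lists the example directory; passing another path checks whether the directory that the job's command expects exists in the image.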
I've changed spec.containers.command in the examples/tensorflow/criteo_deeprec/autoscale_job.yaml file as follows to try to remedy the issue:
The failure persists with the following error logs:
dlrover-master error logs:
chief error logs: