master pod not getting started for pytorch job #2034
Comments
Is the master not up? random-exp-jw6qxmrm-master-0 doesn't resolve.
Yes, the master pod is not getting scheduled. I see an init failure on the workers, and they show CrashLoopBackOff.
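The name-resolution check from the first comment can be reproduced with a few lines of Python. This is only a sketch: it assumes it is run from a pod in the same namespace as the job, and `can_resolve` is a hypothetical helper name, not part of the training operator or Katib.

```python
import socket


def can_resolve(hostname, resolver=socket.getaddrinfo):
    """Return True if `hostname` resolves to at least one address.

    `resolver` defaults to socket.getaddrinfo; it is a parameter only so
    the check can be exercised without a real DNS lookup.
    """
    try:
        return len(resolver(hostname, None)) > 0
    except socket.gaierror:
        # Raised by getaddrinfo when the name cannot be resolved.
        return False


# Example (hostname taken from the comment above; run inside the cluster):
# print(can_resolve("random-exp-jw6qxmrm-master-0"))
```

If this returns False while the worker pods exist, that is consistent with the master pod (and its service) never having been created, rather than with a DNS problem as such.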
I'm trying to run the training operator standalone on an OpenShift cluster with Katib. When I apply a PyTorchJob, the worker pods are created, but for some reason the master pod is not started.
Here is the events log of the worker pod:
I changed the init container image because of Docker pull rate limits.
Here is the pod log:
Here is the PyTorch experiment I'm deploying: