master pod not getting started for pytorch job #2034

Open
bharathappali opened this issue Mar 25, 2024 · 2 comments

Comments

bharathappali commented Mar 25, 2024

I'm trying to run the Training Operator standalone on an OpenShift cluster with Katib. When I apply a PyTorchJob, the worker pods are created, but for some reason the master pods are not getting started.
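
For anyone reproducing this, checks along these lines show whether the master pod object was created at all, as opposed to being created and failing to schedule (the trial name comes from the pod names below; the training.kubeflow.org label key assumes a recent training-operator release):

# List all pods belonging to the trial's PyTorchJob (label key may differ on older operator versions)
kubectl -n sampler get pods -l training.kubeflow.org/job-name=random-exp-jw6qxmrm

# Inspect the PyTorchJob generated by the trial, including its conditions and events
kubectl -n sampler describe pytorchjob random-exp-jw6qxmrm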

Here is the events log of the worker pod:

Events:
  Type     Reason          Age                    From               Message
  ----     ------          ----                   ----               -------
  Normal   Scheduled       9m35s                  default-scheduler  Successfully assigned sampler/random-exp-jw6qxmrm-worker-0 to acorvin-hpo-poc-jfrlm-worker-0-twvtz
  Normal   AddedInterface  9m33s                  multus             Add eth0 [10.131.5.61/23] from openshift-sdn
  Normal   Pulling         9m33s                  kubelet            Pulling image "quay.io/bharathappali/alpine:3.10"
  Normal   Pulled          9m32s                  kubelet            Successfully pulled image "quay.io/bharathappali/alpine:3.10" in 1.065165424s (1.065174057s including waiting)
  Warning  BackOff         2m49s                  kubelet            Back-off restarting failed container init-pytorch in pod random-exp-jw6qxmrm-worker-0_sampler(8d6860a7-204d-45c8-bb57-8d84a6cf8e66)
  Normal   Created         2m34s (x3 over 9m31s)  kubelet            Created container init-pytorch
  Normal   Started         2m34s (x3 over 9m31s)  kubelet            Started container init-pytorch
  Normal   Pulled          2m34s (x2 over 6m11s)  kubelet            Container image "quay.io/bharathappali/alpine:3.10" already present on machine

I have changed the init container image because of the Docker Hub pull rate limit.

Here is the log of the worker pod's init-pytorch container:

nslookup: can't resolve 'random-exp-jw6qxmrm-master-0': Name does not resolve
waiting for master
nslookup: can't resolve '(null)': Name does not resolve
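
For context, the init-pytorch container that produces this output is injected into worker pods by the training operator; it runs roughly the following wait loop (a sketch; the retry count and exact command vary by operator version, and MASTER_ADDR stands for the generated master hostname, here random-exp-jw6qxmrm-master-0):

# Sketch of the operator-injected wait loop: keep retrying DNS resolution of the
# master until it succeeds, otherwise exit non-zero so the init container restarts.
err=1
for i in $(seq 100); do
  if nslookup "$MASTER_ADDR"; then
    err=0 && break
  fi
  echo waiting for master
  sleep 1s
done
exit $err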

Here is the Katib Experiment (with a PyTorchJob trial template) I'm deploying:

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-exp
  namespace: sampler
spec:
  maxTrialCount: 25
  parallelTrialCount: 3
  maxFailedTrialCount: 3
  resumePolicy: Never
  objective:
    type: maximize
    goal: 0.9
    objectiveMetricName: accuracy
    additionalMetricNames: []
  algorithm:
    algorithmName: bayesianoptimization
    algorithmSettings:
      - name: base_estimator
        value: GP
      - name: n_initial_points
        value: '10'
      - name: acq_func
        value: gp_hedge
      - name: acq_optimizer
        value: auto
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: '0.01'
        max: '0.03'
        step: '0.01'
  metricsCollectorSpec:
    collector:
      kind: StdOut
  trialTemplate:
    primaryContainerName: pytorch
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
    retain: false
    trialParameters:
      - name: learningRate
        reference: lr
        description: ''
    trialSpec:
      apiVersion: kubeflow.org/v1
      kind: PyTorchJob
      spec:
        pytorchReplicaSpecs:
          Master:
            replicas: 1
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                  - name: pytorch
                    image: quay.io/bharathappali/pytorch-mnist-cpu:v0.16.0
                    resources:
                      limits:
                        cpu: "1"
                        memory: "2Gi"
                      requests:
                        cpu: "1"
                        memory: "1Gi"
                    command:
                      - python3
                      - /opt/pytorch-mnist/mnist.py
                      - '--epochs=1'
                      - '--lr=${trialParameters.learningRate}'
                      - '--momentum=0.5'
          Worker:
            replicas: 1
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                  - name: pytorch
                    image: quay.io/bharathappali/pytorch-mnist-cpu:v0.16.0
                    resources:
                      limits:
                        cpu: "1"
                        memory: "2Gi"
                      requests:
                        cpu: "1"
                        memory: "1Gi"
                    command:
                      - python3
                      - /opt/pytorch-mnist/mnist.py
                      - '--epochs=1'
                      - '--lr=${trialParameters.learningRate}'
                      - '--momentum=0.5'
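
If the PyTorchJob controller never creates the master pod, its logs usually say why. Assuming a standalone install with the default deployment name and namespace (adjust both to your cluster):

# Grep the training-operator logs for reconcile errors on this trial's PyTorchJob
kubectl -n kubeflow logs deployment/training-operator | grep random-exp-jw6qxmrm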

@johnugeorge
Member

Is the master not up? random-exp-jw6qxmrm-master-0 doesn't resolve.

@bharathappali
Author

Yes, the master pod is not getting scheduled. I see the workers' init container failing, and the worker pods show CrashLoopBackOff.
