master pod not getting started for pytorch job #2034

Open
bharathappali opened this issue Mar 25, 2024 · 2 comments

Comments

bharathappali commented Mar 25, 2024

I'm trying to run the Training Operator standalone on an OpenShift cluster with Katib. When I apply a PyTorchJob, the worker pods are created, but for some reason the master pods are not getting started.
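
For anyone reproducing this, checks along these lines show whether the master pod object was created at all, as opposed to being created and failing to schedule (the trial name comes from the pod names below; the training.kubeflow.org label key assumes a recent training-operator release):

# List all pods belonging to the trial's PyTorchJob (label key may differ on older operator versions)
kubectl -n sampler get pods -l training.kubeflow.org/job-name=random-exp-jw6qxmrm

# Inspect the PyTorchJob generated by the trial, including its conditions and events
kubectl -n sampler describe pytorchjob random-exp-jw6qxmrm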

Here is the events log of the worker pod:

Events:
  Type     Reason          Age                    From               Message
  ----     ------          ----                   ----               -------
  Normal   Scheduled       9m35s                  default-scheduler  Successfully assigned sampler/random-exp-jw6qxmrm-worker-0 to acorvin-hpo-poc-jfrlm-worker-0-twvtz
  Normal   AddedInterface  9m33s                  multus             Add eth0 [10.131.5.61/23] from openshift-sdn
  Normal   Pulling         9m33s                  kubelet            Pulling image "quay.io/bharathappali/alpine:3.10"
  Normal   Pulled          9m32s                  kubelet            Successfully pulled image "quay.io/bharathappali/alpine:3.10" in 1.065165424s (1.065174057s including waiting)
  Warning  BackOff         2m49s                  kubelet            Back-off restarting failed container init-pytorch in pod random-exp-jw6qxmrm-worker-0_sampler(8d6860a7-204d-45c8-bb57-8d84a6cf8e66)
  Normal   Created         2m34s (x3 over 9m31s)  kubelet            Created container init-pytorch
  Normal   Started         2m34s (x3 over 9m31s)  kubelet            Started container init-pytorch
  Normal   Pulled          2m34s (x2 over 6m11s)  kubelet            Container image "quay.io/bharathappali/alpine:3.10" already present on machine

I have changed the init container image because of the Docker Hub pull rate limit.

Here is the log of the worker pod's init-pytorch container:

nslookup: can't resolve 'random-exp-jw6qxmrm-master-0': Name does not resolve
waiting for master
nslookup: can't resolve '(null)': Name does not resolve
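
For context, the init-pytorch container that produces this output is injected into worker pods by the training operator; it runs roughly the following wait loop (a sketch; the retry count and exact command vary by operator version, and MASTER_ADDR stands for the generated master hostname, here random-exp-jw6qxmrm-master-0):

# Sketch of the operator-injected wait loop: keep retrying DNS resolution of the
# master until it succeeds, otherwise exit non-zero so the init container restarts.
err=1
for i in $(seq 100); do
  if nslookup "$MASTER_ADDR"; then
    err=0 && break
  fi
  echo waiting for master
  sleep 1s
done
exit $err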

Here is the Katib Experiment (with a PyTorchJob trial template) I'm deploying:

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-exp
  namespace: sampler
spec:
  maxTrialCount: 25
  parallelTrialCount: 3
  maxFailedTrialCount: 3
  resumePolicy: Never
  objective:
    type: maximize
    goal: 0.9
    objectiveMetricName: accuracy
    additionalMetricNames: []
  algorithm:
    algorithmName: bayesianoptimization
    algorithmSettings:
      - name: base_estimator
        value: GP
      - name: n_initial_points
        value: '10'
      - name: acq_func
        value: gp_hedge
      - name: acq_optimizer
        value: auto
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: '0.01'
        max: '0.03'
        step: '0.01'
  metricsCollectorSpec:
    collector:
      kind: StdOut
  trialTemplate:
    primaryContainerName: pytorch
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
    retain: false
    trialParameters:
      - name: learningRate
        reference: lr
        description: ''
    trialSpec:
      apiVersion: kubeflow.org/v1
      kind: PyTorchJob
      spec:
        pytorchReplicaSpecs:
          Master:
            replicas: 1
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                  - name: pytorch
                    image: quay.io/bharathappali/pytorch-mnist-cpu:v0.16.0
                    resources:
                      limits:
                        cpu: "1"
                        memory: "2Gi"
                      requests:
                        cpu: "1"
                        memory: "1Gi"
                    command:
                      - python3
                      - /opt/pytorch-mnist/mnist.py
                      - '--epochs=1'
                      - '--lr=${trialParameters.learningRate}'
                      - '--momentum=0.5'
          Worker:
            replicas: 1
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                  - name: pytorch
                    image: quay.io/bharathappali/pytorch-mnist-cpu:v0.16.0
                    resources:
                      limits:
                        cpu: "1"
                        memory: "2Gi"
                      requests:
                        cpu: "1"
                        memory: "1Gi"
                    command:
                      - python3
                      - /opt/pytorch-mnist/mnist.py
                      - '--epochs=1'
                      - '--lr=${trialParameters.learningRate}'
                      - '--momentum=0.5'
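
If the PyTorchJob controller never creates the master pod, its logs usually say why. Assuming a standalone install with the default deployment name and namespace (adjust both to your cluster):

# Grep the training-operator logs for reconcile errors on this trial's PyTorchJob
kubectl -n kubeflow logs deployment/training-operator | grep random-exp-jw6qxmrm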

@johnugeorge
Member

Is the master not up? random-exp-jw6qxmrm-master-0 doesn't resolve.

@bharathappali
Author

Yes, the master pod is not getting scheduled. I see the workers' init container failing, and the worker pods show CrashLoopBackOff.
