example failed: examples/tensorflow/criteo_deeprec/manual_job.yaml #1136

Open
jason-i-vv opened this issue May 21, 2024 · 1 comment
@jason-i-vv

Environment

  1. Branch: master
  2. Kubernetes version: 1.22
  3. CUDA version: 12.3

Problem

I ran kubectl apply -f examples/tensorflow/criteo_deeprec/manual_job.yaml, but the worker pods never appeared; only the master pod is running.

kubectl get po -n dlrover
NAME                                             READY   STATUS    RESTARTS   AGE
dlrover-controller-manager-85cb9778b-9sqb8       2/2     Running   0          5d16h
elasticjob-deepctr-manual-scale-dlrover-master   1/1     Running   0          18h

Thousands of ScalePlan objects have been created:

dlrover     deepctr-manual-scale-scaleplan-986    Scaling     102m
dlrover     deepctr-manual-scale-scaleplan-987    Scaling     101m
dlrover     deepctr-manual-scale-scaleplan-988    Scaling     100m
dlrover     deepctr-manual-scale-scaleplan-989    Scaling     99m
dlrover     deepctr-manual-scale-scaleplan-99     Scaling     16h
dlrover     deepctr-manual-scale-scaleplan-990    Succeeded   98m
dlrover     deepctr-manual-scale-scaleplan-991    Succeeded   97m
dlrover     deepctr-manual-scale-scaleplan-992    Scaling     96m
dlrover     deepctr-manual-scale-scaleplan-993    Scaling     95m
dlrover     deepctr-manual-scale-scaleplan-994    Succeeded   94m
dlrover     deepctr-manual-scale-scaleplan-995    Succeeded   93m
dlrover     deepctr-manual-scale-scaleplan-996    Scaling     92m
dlrover     deepctr-manual-scale-scaleplan-997    Succeeded   91m
dlrover     deepctr-manual-scale-scaleplan-998    Succeeded   90m
dlrover     deepctr-manual-scale-scaleplan-999    Succeeded   89m
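With this many objects, a per-phase tally is easier to read than the raw listing. A minimal sketch, assuming the rows are whitespace-separated in the order namespace, name, phase, age as in the listing above; the function name count_scaleplan_phases is hypothetical and not part of DLRover:

```shell
# Tally ScalePlan rows by phase. Rows are read from stdin.
# Assumption: columns are NAMESPACE NAME PHASE AGE, as in the listing above.
count_scaleplan_phases() {
  # Skip a header row if one is present, count the third column,
  # then sort so the output order is stable.
  awk 'NF >= 3 && $1 != "NAMESPACE" { count[$3]++ }
       END { for (p in count) print p, count[p] }' | sort
}
```

One possible invocation is `kubectl get scaleplans.elastic.iml.github.io -A --no-headers | count_scaleplan_phases`, which prints one line per phase followed by its count.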

The spec of each of these ScalePlans is also empty:

 kubectl describe scaleplans.elastic.iml.github.io -n dlrover deepctr-manual-scale-scaleplan-999
Name:         deepctr-manual-scale-scaleplan-999
Namespace:    dlrover
Labels:       scale-type=auto
Annotations:  <none>
API Version:  elastic.iml.github.io/v1alpha1
Kind:         ScalePlan
Metadata:
  Creation Timestamp:  2024-05-21T02:23:58Z
  Generation:          1
  Managed Fields:
    API Version:  elastic.iml.github.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .:
          f:scale-type:
        f:ownerReferences:
          .:
          k:{"uid":"a7665789-d3cd-4b42-998b-35e12a7e8d8f"}:
      f:spec:
        .:
        f:createPods:
        f:ownerJob:
        f:psHosts:
        f:removePods:
        f:replicaResourceSpecs:
    Manager:      OpenAPI-Generator
    Operation:    Update
    Time:         2024-05-21T02:23:58Z
    API Version:  elastic.iml.github.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:createTime:
        f:finishTime:
        f:phase:
    Manager:      manager
    Operation:    Update
    Subresource:  status
    Time:         2024-05-21T02:23:58Z
  Owner References:
    API Version:           elastic.iml.github.io/v1alpha1
    Block Owner Deletion:  true
    Kind:                  elasticjob
    Name:                  deepctr-manual-scale
    UID:                   a7665789-d3cd-4b42-998b-35e12a7e8d8f
  Resource Version:        43898773
  UID:                     8f451cf8-2d2c-4a8b-8208-b146187599e9
Spec:
  Create Pods:
  Owner Job:  deepctr-manual-scale
  Ps Hosts:
  Remove Pods:
  Replica Resource Specs:
Status:
  Create Time:  2024-05-21T02:23:58Z
  Finish Time:  2024-05-21T02:23:58Z
  Phase:        Succeeded
Events:         <none>

How can I verify elasticity for a TensorFlow job, whether manual or automatic?

@workingloong
Collaborator

workingloong commented May 28, 2024

I have fixed this example in PR #1141. You can follow these steps:

kubectl -n dlrover apply -f examples/tensorflow/criteo_deeprec/manual_job.yaml

The job will have the following Pods:

NAME                                             READY   STATUS              RESTARTS      AGE
deepctr-manual-scale-edljob-chief-0              1/1     Running             0             117s
deepctr-manual-scale-edljob-ps-0                 1/1     Running             0             4m33s
deepctr-manual-scale-edljob-worker-0             1/1     Running             0             4m33s
elasticjob-deepctr-manual-scale-dlrover-master   1/1     Running             0             4m49s

Once chief-0 and worker-0 are running, you can manually scale up by adding a worker:

kubectl -n dlrover apply -f examples/tensorflow/criteo_deeprec/scale_plan.yaml

You will then see a new worker-1:

NAME                                             READY   STATUS              RESTARTS      AGE
deepctr-manual-scale-edljob-chief-0              1/1     Running             0             117s
deepctr-manual-scale-edljob-ps-0                 1/1     Running             0             4m33s
deepctr-manual-scale-edljob-worker-0             1/1     Running             0             4m33s
deepctr-manual-scale-edljob-worker-1             0/1     ContainerCreating   0             0s
elasticjob-deepctr-manual-scale-dlrover-master   1/1     Running             0             4m49s
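To watch only the workers while scaling, the pod listing can be filtered. A minimal sketch, assuming the default kubectl columns (NAME READY STATUS RESTARTS AGE) and the "-edljob-worker-<n>" naming seen in the listing above; the function name list_worker_status is hypothetical:

```shell
# Print "<pod-name> <status>" for each worker pod. Rows are read from stdin.
# Assumption: default `kubectl get po` columns, with STATUS in the third field.
list_worker_status() {
  awk '$1 ~ /-edljob-worker-[0-9]+$/ { print $1, $3 }'
}
```

One possible invocation is `kubectl -n dlrover get po | list_worker_status`. To block until the new pod is ready, `kubectl -n dlrover wait --for=condition=Ready pod/deepctr-manual-scale-edljob-worker-1 --timeout=300s` should also work; the timeout value here is an arbitrary choice.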

If that does not work, check whether the image of the master pod elasticjob-deepctr-manual-scale-dlrover-master is registry.cn-hangzhou.aliyuncs.com/intell-ai/dlrover:master.
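The image check can be scripted. A minimal sketch, assuming the master pod's image is on its first container; the function name check_master_image and the variable EXPECTED_IMAGE are hypothetical helpers, not part of DLRover:

```shell
# Expected image, taken from the comment above.
EXPECTED_IMAGE="registry.cn-hangzhou.aliyuncs.com/intell-ai/dlrover:master"

# Compare an image string (first argument) against the expected image.
# Prints the result and returns non-zero on a mismatch.
check_master_image() {
  if [ "$1" = "$EXPECTED_IMAGE" ]; then
    echo "ok: $1"
  else
    echo "mismatch: got '$1', expected '$EXPECTED_IMAGE'"
    return 1
  fi
}
```

One possible invocation is `check_master_image "$(kubectl -n dlrover get pod elasticjob-deepctr-manual-scale-dlrover-master -o jsonpath='{.spec.containers[0].image}')"`.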
