Flaky Test: [It] should create desired Pods and Services: Distributed TFJob (4 workers, 2 PS) is succeeded #2086

Open
tenzen-y opened this issue Apr 27, 2024 · 0 comments

• [FAILED] [0.459 seconds]
TFJob controller Test Normal Path [It] should create desired Pods and Services
/home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/tfjob_controller_test.go:39

  Timeline >>
  STEP: Distributed TFJob (4 workers, 2 PS) is created @ 04/26/24 21:22:53.416
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-0" not found	{"tfjob": {"name":"test-case-norm-0","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-0"}
  2024-04-26T21:22:53Z	DEBUG	events	Created pod: test-case-norm-0-worker-0	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-case-norm-0","uid":"18381da0-8cbb-465d-bf07-96de003f9a1b"}, "reason": "SuccessfulCreatePod"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-0" not found	{"tfjob": {"name":"test-case-norm-0","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-0"}
  2024-04-26T21:22:53Z	DEBUG	events	Created pod: test-case-norm-0-worker-1	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-case-norm-0","uid":"18381da0-8cbb-465d-bf07-96de003f9a1b"}, "reason": "SuccessfulCreatePod"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-0" not found	{"tfjob": {"name":"test-case-norm-0","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-0"}
  2024-04-26T21:22:53Z	DEBUG	events	Created pod: test-case-norm-0-worker-2	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-case-norm-0","uid":"18381da0-8cbb-465d-bf07-96de003f9a1b"}, "reason": "SuccessfulCreatePod"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-0" not found	{"tfjob": {"name":"test-case-norm-0","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-0"}
  2024-04-26T21:22:53Z	DEBUG	events	Created pod: test-case-norm-0-worker-3	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-case-norm-0","uid":"18381da0-8cbb-465d-bf07-96de003f9a1b"}, "reason": "SuccessfulCreatePod"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-0" not found	{"tfjob": {"name":"test-case-norm-0","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-0"}
  2024-04-26T21:22:53Z	DEBUG	events	Created service: test-case-norm-0-worker-0	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-case-norm-0","uid":"18381da0-8cbb-465d-bf07-96de003f9a1b"}, "reason": "SuccessfulCreateService"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-0" not found	{"tfjob": {"name":"test-case-norm-0","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-0"}
  2024-04-26T21:22:53Z	DEBUG	events	Created service: test-case-norm-0-worker-1	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-case-norm-0","uid":"18381da0-8cbb-465d-bf07-96de003f9a1b"}, "reason": "SuccessfulCreateService"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-0" not found	{"tfjob": {"name":"test-case-norm-0","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-0"}
  2024-04-26T21:22:53Z	DEBUG	events	Created service: test-case-norm-0-worker-2	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-case-norm-0","uid":"18381da0-8cbb-465d-bf07-96de003f9a1b"}, "reason": "SuccessfulCreateService"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-0" not found	{"tfjob": {"name":"test-case-norm-0","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-0"}
  2024-04-26T21:22:53Z	DEBUG	events	Created service: test-case-norm-0-worker-3	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-case-norm-0","uid":"18381da0-8cbb-465d-bf07-96de003f9a1b"}, "reason": "SuccessfulCreateService"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-0" not found	{"tfjob": {"name":"test-case-norm-0","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-0"}
  2024-04-26T21:22:53Z	DEBUG	events	Created pod: test-case-norm-0-ps-0	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-case-norm-0","uid":"18381da0-8cbb-465d-bf07-96de003f9a1b"}, "reason": "SuccessfulCreatePod"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-0" not found	{"tfjob": {"name":"test-case-norm-0","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-0"}
  2024-04-26T21:22:53Z	DEBUG	events	Created pod: test-case-norm-0-ps-1	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-case-norm-0","uid":"18381da0-8cbb-465d-bf07-96de003f9a1b"}, "reason": "SuccessfulCreatePod"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-0" not found	{"tfjob": {"name":"test-case-norm-0","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-0"}
  2024-04-26T21:22:53Z	DEBUG	events	Created service: test-case-norm-0-ps-0	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-case-norm-0","uid":"18381da0-8cbb-465d-bf07-96de003f9a1b"}, "reason": "SuccessfulCreateService"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-0" not found	{"tfjob": {"name":"test-case-norm-0","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-0"}
  2024-04-26T21:22:53Z	DEBUG	events	Created service: test-case-norm-0-ps-1	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-case-norm-0","uid":"18381da0-8cbb-465d-bf07-96de003f9a1b"}, "reason": "SuccessfulCreateService"}
  2024-04-26T21:22:53Z	INFO	KubeAPIWarningLogger	unknown field "spec.tfReplicaSpecs.PS.template.metadata.creationTimestamp"
  STEP: Distributed TFJob (4 workers, 2 PS) is created and all replicas are pending @ 04/26/24 21:22:53.458
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-1" not found	{"tfjob": {"name":"test-case-norm-1","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-1"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-1" not found	{"tfjob": {"name":"test-case-norm-1","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-1"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-1" not found	{"tfjob": {"name":"test-case-norm-1","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-1"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-1" not found	{"tfjob": {"name":"test-case-norm-1","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-1"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-1" not found	{"tfjob": {"name":"test-case-norm-1","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-1"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-1" not found	{"tfjob": {"name":"test-case-norm-1","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-1"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-1" not found	{"tfjob": {"name":"test-case-norm-1","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-1"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-1" not found	{"tfjob": {"name":"test-case-norm-1","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-1"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-1" not found	{"tfjob": {"name":"test-case-norm-1","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-1"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-1" not found	{"tfjob": {"name":"test-case-norm-1","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-1"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-1" not found	{"tfjob": {"name":"test-case-norm-1","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-1"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-1" not found	{"tfjob": {"name":"test-case-norm-1","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-1"}
  STEP: Distributed TFJob (4 workers, 2 PS) is created, 2 workers, 1 PS are pending @ 04/26/24 21:22:53.533
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-2" not found	{"tfjob": {"name":"test-case-norm-2","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-2"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-2" not found	{"tfjob": {"name":"test-case-norm-2","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-2"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-2" not found	{"tfjob": {"name":"test-case-norm-2","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-2"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-2" not found	{"tfjob": {"name":"test-case-norm-2","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-2"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-2" not found	{"tfjob": {"name":"test-case-norm-2","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-2"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-2" not found	{"tfjob": {"name":"test-case-norm-2","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-2"}
  2024-04-26T21:22:53Z	ERROR	Reconciler error	{"controller": "tfjob-controller", "object": {"name":"test-tfjob","namespace":"tfjob-ns-ls4h8"}, "namespace": "tfjob-ns-ls4h8", "name": "test-tfjob", "reconcileID": "1b2c2164-a3b1-4be5-b812-a9388b02a99a", "error": "pods \"test-tfjob-worker-1\" is forbidden: unable to create new content in namespace tfjob-ns-ls4h8 because it is being terminated"}
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
  	/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
  	/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
  	/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
  2024-04-26T21:22:53Z	DEBUG	events	Error creating: pods "test-tfjob-worker-1" is forbidden: unable to create new content in namespace tfjob-ns-ls4h8 because it is being terminated	{"type": "Warning", "object": {"kind":"TFJob","namespace":"tfjob-ns-ls4h8","name":"test-tfjob","uid":"fcdfcb63-412d-45f5-8cd0-dddc17442233","apiVersion":"kubeflow.org/v1","resourceVersion":"227"}, "reason": "FailedCreatePod"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-2" not found	{"tfjob": {"name":"test-case-norm-2","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-2"}
  2024-04-26T21:22:53Z	DEBUG	events	Created pod: test-case-norm-2-worker-2	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-case-norm-2","uid":"58f59bd2-48bf-4f30-9530-2e7f7f63b434"}, "reason": "SuccessfulCreatePod"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-2" not found	{"tfjob": {"name":"test-case-norm-2","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-2"}
  2024-04-26T21:22:53Z	DEBUG	events	Created pod: test-case-norm-2-worker-3	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-case-norm-2","uid":"58f59bd2-48bf-4f30-9530-2e7f7f63b434"}, "reason": "SuccessfulCreatePod"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-2" not found	{"tfjob": {"name":"test-case-norm-2","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-2"}
  2024-04-26T21:22:53Z	DEBUG	events	Created service: test-case-norm-2-worker-2	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-case-norm-2","uid":"58f59bd2-48bf-4f30-9530-2e7f7f63b434"}, "reason": "SuccessfulCreateService"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-2" not found	{"tfjob": {"name":"test-case-norm-2","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-2"}
  2024-04-26T21:22:53Z	DEBUG	events	Created service: test-case-norm-2-worker-3	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-case-norm-2","uid":"58f59bd2-48bf-4f30-9530-2e7f7f63b434"}, "reason": "SuccessfulCreateService"}
  2024-04-26T21:22:53Z	DEBUG	events	Created pod: test-case-norm-2-ps-1	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-case-norm-2","uid":"58f59bd2-48bf-4f30-9530-2e7f7f63b434"}, "reason": "SuccessfulCreatePod"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-2" not found	{"tfjob": {"name":"test-case-norm-2","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-2"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-2" not found	{"tfjob": {"name":"test-case-norm-2","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-2"}
  2024-04-26T21:22:53Z	DEBUG	events	Created service: test-case-norm-2-ps-1	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-case-norm-2","uid":"58f59bd2-48bf-4f30-9530-2e7f7f63b434"}, "reason": "SuccessfulCreateService"}
  STEP: Distributed TFJob (4 workers, 2 PS) is created, 2 workers, 1 PS are pending, 1 worker is succeeded @ 04/26/24 21:22:53.613
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-3" not found	{"tfjob": {"name":"test-case-norm-3","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-3"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-3" not found	{"tfjob": {"name":"test-case-norm-3","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-3"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-3" not found	{"tfjob": {"name":"test-case-norm-3","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-3"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-3" not found	{"tfjob": {"name":"test-case-norm-3","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-3"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-3" not found	{"tfjob": {"name":"test-case-norm-3","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-3"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-3" not found	{"tfjob": {"name":"test-case-norm-3","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-3"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-3" not found	{"tfjob": {"name":"test-case-norm-3","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-3"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-3" not found	{"tfjob": {"name":"test-case-norm-3","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-3"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-3" not found	{"tfjob": {"name":"test-case-norm-3","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-3"}
  2024-04-26T21:22:53Z	DEBUG	events	Created pod: test-case-norm-3-worker-3	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-case-norm-3","uid":"214c8fcb-d647-4d1b-b9f9-de1f3e8f6924"}, "reason": "SuccessfulCreatePod"}
  2024-04-26T21:22:53Z	DEBUG	events	Created service: test-case-norm-3-worker-3	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-case-norm-3","uid":"214c8fcb-d647-4d1b-b9f9-de1f3e8f6924"}, "reason": "SuccessfulCreateService"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-3" not found	{"tfjob": {"name":"test-case-norm-3","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-3"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-3" not found	{"tfjob": {"name":"test-case-norm-3","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-3"}
  2024-04-26T21:22:53Z	DEBUG	events	Created pod: test-case-norm-3-ps-1	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-case-norm-3","uid":"214c8fcb-d647-4d1b-b9f9-de1f3e8f6924"}, "reason": "SuccessfulCreatePod"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-3" not found	{"tfjob": {"name":"test-case-norm-3","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-3"}
  2024-04-26T21:22:53Z	DEBUG	events	Created service: test-case-norm-3-ps-1	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-case-norm-3","uid":"214c8fcb-d647-4d1b-b9f9-de1f3e8f6924"}, "reason": "SuccessfulCreateService"}
  STEP: Distributed TFJob (4 workers, 2 PS) is created and all replicas are running @ 04/26/24 21:22:53.704
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-4" not found	{"tfjob": {"name":"test-case-norm-4","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-4"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-4" not found	{"tfjob": {"name":"test-case-norm-4","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-4"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-4" not found	{"tfjob": {"name":"test-case-norm-4","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-4"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-4" not found	{"tfjob": {"name":"test-case-norm-4","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-4"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-4" not found	{"tfjob": {"name":"test-case-norm-4","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-4"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-4" not found	{"tfjob": {"name":"test-case-norm-4","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-4"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-4" not found	{"tfjob": {"name":"test-case-norm-4","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-4"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-4" not found	{"tfjob": {"name":"test-case-norm-4","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-4"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-4" not found	{"tfjob": {"name":"test-case-norm-4","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-4"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-4" not found	{"tfjob": {"name":"test-case-norm-4","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-4"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-4" not found	{"tfjob": {"name":"test-case-norm-4","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-4"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-4" not found	{"tfjob": {"name":"test-case-norm-4","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-4"}
  STEP: Distributed TFJob (4 workers, 2 PS) is created, 2 workers, 1 PS are pending, 1 worker is running @ 04/26/24 21:22:53.776
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-5" not found	{"tfjob": {"name":"test-case-norm-5","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-5"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-5" not found	{"tfjob": {"name":"test-case-norm-5","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-5"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-5" not found	{"tfjob": {"name":"test-case-norm-5","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-5"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-5" not found	{"tfjob": {"name":"test-case-norm-5","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-5"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-5" not found	{"tfjob": {"name":"test-case-norm-5","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-5"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-5" not found	{"tfjob": {"name":"test-case-norm-5","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-5"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-5" not found	{"tfjob": {"name":"test-case-norm-5","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-5"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-5" not found	{"tfjob": {"name":"test-case-norm-5","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-5"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-5" not found	{"tfjob": {"name":"test-case-norm-5","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-5"}
  2024-04-26T21:22:53Z	DEBUG	events	Created pod: test-case-norm-5-worker-3	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-case-norm-5","uid":"280513e0-f904-4854-8b62-748e7f3d1889"}, "reason": "SuccessfulCreatePod"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-5" not found	{"tfjob": {"name":"test-case-norm-5","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-5"}
  2024-04-26T21:22:53Z	DEBUG	events	Created service: test-case-norm-5-worker-3	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-case-norm-5","uid":"280513e0-f904-4854-8b62-748e7f3d1889"}, "reason": "SuccessfulCreateService"}
  2024-04-26T21:22:53Z	DEBUG	events	Created pod: test-case-norm-5-ps-1	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-case-norm-5","uid":"280513e0-f904-4854-8b62-748e7f3d1889"}, "reason": "SuccessfulCreatePod"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-5" not found	{"tfjob": {"name":"test-case-norm-5","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-5"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-5" not found	{"tfjob": {"name":"test-case-norm-5","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-5"}
  2024-04-26T21:22:53Z	DEBUG	events	Created service: test-case-norm-5-ps-1	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-case-norm-5","uid":"280513e0-f904-4854-8b62-748e7f3d1889"}, "reason": "SuccessfulCreateService"}
  STEP: Distributed TFJob (4 workers, 2 PS) is succeeded @ 04/26/24 21:22:53.818
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-6" not found	{"tfjob": {"name":"test-case-norm-6","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-6"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-6" not found	{"tfjob": {"name":"test-case-norm-6","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-6"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-6" not found	{"tfjob": {"name":"test-case-norm-6","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-6"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-6" not found	{"tfjob": {"name":"test-case-norm-6","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-6"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-6" not found	{"tfjob": {"name":"test-case-norm-6","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-6"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-6" not found	{"tfjob": {"name":"test-case-norm-6","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-6"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-6" not found	{"tfjob": {"name":"test-case-norm-6","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-6"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-6" not found	{"tfjob": {"name":"test-case-norm-6","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-6"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-6" not found	{"tfjob": {"name":"test-case-norm-6","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-6"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-6" not found	{"tfjob": {"name":"test-case-norm-6","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-6"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-6" not found	{"tfjob": {"name":"test-case-norm-6","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-6"}
  2024-04-26T21:22:53Z	INFO	TFJob.kubeflow.org "test-case-norm-6" not found	{"tfjob": {"name":"test-case-norm-6","namespace":"default"}, "unable to fetch TFJob": "default/test-case-norm-6"}
  2024-04-26T21:22:53Z	DEBUG	events	Error creating: services "test-case-norm-6-ps-1" already exists	{"type": "Warning", "object": {"kind":"TFJob","namespace":"default","name":"test-case-norm-6","uid":"5499f286-168c-4046-8286-3893f2fa67ea"}, "reason": "FailedCreateService"}
  [FAILED] in [It] - /home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/tfjob_controller_test.go:321 @ 04/26/24 21:22:53.876
  << Timeline

  [FAILED] Expected
      <bool>: false
  to be true
  In [It] at: /home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/tfjob_controller_test.go:321 @ 04/26/24 21:22:53.876

Observed at: https://github.com/kubeflow/training-operator/actions/runs/8854426275/job/24317424179#step:4:364

tenzen-y changed the title from "Flaky Test: [It] should create desired Pods and Services" to "Flaky Test: [It] should create desired Pods and Services: Distributed TFJob (4 workers, 2 PS) is succeeded" on Apr 27, 2024