Read deployment failure from Kubernetes events #402

iszulcdeepsense · 2024-01-08T12:15:48Z

Sometimes the deployment of a job gets stuck, for instance when Kubernetes can't create a new pod because the cluster doesn't have enough resources. The deployment command (kubectl apply) succeeds, but the real deployment process happens later in the background, and only then does Kubernetes find that the cluster is out of memory and report an error event. However, Racetrack doesn't know about that and keeps waiting patiently for the pod to be created until the timeout occurs. This makes the deployment process unnecessarily long in this case.
It would be better to read the k8s events, or get notified, in order to show a meaningful error to the user and abort as soon as the error happens.
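To illustrate the idea, here is a minimal sketch of scanning events for a failure. It assumes the events are already available as plain dicts shaped like the items of `kubectl get events -o json` (`type`, `reason`, `message`, `involvedObject`). The function name, the list of reasons treated as fatal, and the sample data are hypothetical, not taken from Racetrack.

```python
from typing import Iterable, Optional

# Assumption: these event reasons mean the pod cannot start.
# The set is illustrative, not exhaustive.
FATAL_REASONS = {"FailedScheduling", "FailedCreate", "FailedMount"}


def find_deployment_failure(events: Iterable[dict], resource_name: str) -> Optional[str]:
    """Return a readable message for the first fatal Warning event, or None."""
    for event in events:
        involved = event.get("involvedObject", {})
        # Pods created by a Deployment carry its name as a prefix.
        if not involved.get("name", "").startswith(resource_name):
            continue
        if event.get("type") == "Warning" and event.get("reason") in FATAL_REASONS:
            return f'{event["reason"]}: {event.get("message", "")}'
    return None


# Made-up sample events, for illustration only.
sample_events = [
    {
        "type": "Normal",
        "reason": "ScalingReplicaSet",
        "message": "Scaled up replica set",
        "involvedObject": {"kind": "Deployment", "name": "job-adder"},
    },
    {
        "type": "Warning",
        "reason": "FailedScheduling",
        "message": "0/3 nodes are available: 3 Insufficient cpu.",
        "involvedObject": {"kind": "Pod", "name": "job-adder-5d9c7b-x2x4q"},
    },
]

print(find_deployment_failure(sample_events, "job-adder"))
```

Matching on a name prefix is a simplification; a real implementation would more likely select the pods by label.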


anders314159 · 2024-04-19T05:19:51Z

Is there a failing example for this issue?
Will $ racetrack deploy job.yaml succeed while kubectl apply runs in the kubernetes plugin? Where is the timeout?
Do we want this feature to be in lifecycle-supervisor, or is it something for the kubernetes plugin?

iszulcdeepsense · 2024-04-19T08:11:06Z

@anders314159

Is there a failing example for this issue?

Sort of. For example, when requesting too many CPU cores, like this:

resources:
  cpu_min: 10M # 10M is not millis, but Mega

Kubernetes says it's okay at the kubectl apply stage, but then it's unable to create a pod. So there's room for improvement in getting this information back to the user.
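To show why 10M is so far off, here is a small sketch of how Kubernetes quantity suffixes scale a value. It covers only a handful of suffixes and assumes cpu_min is passed to Kubernetes as a regular resource quantity. `parse_quantity` is a hypothetical helper, not part of Racetrack or of the Kubernetes client.

```python
from decimal import Decimal

# A subset of Kubernetes quantity suffixes: "m" is milli, "M" is mega.
SUFFIXES = {
    "m": Decimal("0.001"),
    "k": Decimal(10) ** 3,
    "M": Decimal(10) ** 6,
    "G": Decimal(10) ** 9,
    "Ki": Decimal(2) ** 10,
    "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30,
}


def parse_quantity(text: str) -> Decimal:
    """Convert a quantity string such as '10m' or '10M' into a plain number."""
    # Try longer suffixes first so that "Mi" is not mistaken for a shorter one.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * SUFFIXES[suffix]
    return Decimal(text)


print(parse_quantity("10m"))  # a hundredth of a core
print(parse_quantity("10M"))  # ten million cores
```

Such a request is a valid quantity, which is why it passes at apply time, yet no node can ever satisfy it.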

Will $ racetrack deploy job.yaml succeed while kubectl apply runs in the kubernetes plugin? Where is the timeout?

racetrack deploy job.yaml will fail, but only after 15 minutes (after the timeout).
The timeout comes from the liveness probe check. Lifecycle tries to check the /live endpoint of a job, but since the pod doesn't exist, it keeps waiting for it to appear until the timeout expires:

racetrack/lifecycle/lifecycle/monitor/health.py

Line 36 in 5bb2f98

response = _wait_until_job_is_alive(base_url, deployment_timestamp, headers)
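A wait loop that aborts early could look roughly like the sketch below. This is not the actual `_wait_until_job_is_alive` code; the function name, its parameters, and both callbacks are hypothetical. `is_alive` stands for the /live check, and `find_failure` stands for any lookup that returns an error message once Kubernetes has reported a failure event.

```python
import time
from typing import Callable, Optional


def wait_until_alive(
    is_alive: Callable[[], bool],
    find_failure: Callable[[], Optional[str]],
    timeout_s: float = 15 * 60,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Poll until the job is alive; abort early if a failure is reported."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if is_alive():
            return
        # Check for a reported failure on every iteration, so the user
        # gets the error right away instead of after the full timeout.
        failure = find_failure()
        if failure is not None:
            raise RuntimeError(f"deployment failed: {failure}")
        sleep(interval_s)
    raise TimeoutError(f"job did not become alive within {timeout_s} seconds")
```

Passing the checks in as callables keeps the loop generic, so the Kubernetes-specific part can stay in a plugin.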

Do we want this feature to be in lifecycle-supervisor, or is it something for the kubernetes plugin?

Good question. It's probably going to end up in the kubernetes plugin, but if possible we could bring some generic parts into Lifecycle (in order not to repeat the same thing in many plugins).

anders314159 self-assigned this Mar 11, 2024

anders314159 mentioned this issue Apr 25, 2024

Read deployment failure from Kubernetes events TheRacetrack/plugin-kubernetes-infrastructure#14

Closed

iszulcdeepsense linked a pull request May 15, 2024 that will close this issue

14 read deployment failure from kubernetes events TheRacetrack/plugin-kubernetes-infrastructure#15

Merged

iszulcdeepsense closed this as completed in TheRacetrack/plugin-kubernetes-infrastructure#15 May 20, 2024
