Read deployment failure from Kubernetes events #402

Closed
iszulcdeepsense opened this issue Jan 8, 2024 · 2 comments · Fixed by TheRacetrack/plugin-kubernetes-infrastructure#15

@iszulcdeepsense
Collaborator

Sometimes the deployment of a job gets stuck, for instance when Kubernetes can't create a new pod because the cluster doesn't have enough resources. The deployment command (kubectl apply) succeeds, but the actual deployment happens later in the background, and only then does Kubernetes find that the cluster is out of memory and report an error event. Racetrack doesn't know about this, though, and keeps waiting for the pod to be created until the timeout occurs. This makes the deployment process unnecessarily long in this case.
It would be better to read the k8s events, or get notified about them, so that we can show a meaningful error to the user and abort as soon as the error happens.
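As a rough illustration of the idea, here is a minimal sketch that scans a list of event objects (shaped like the items returned by `kubectl get events -o json`) and picks out the first warning that looks fatal for a given job. The function name, the `FATAL_REASONS` set and the sample resource names are all hypothetical, not part of Racetrack or its Kubernetes plugin. Some of these reasons can be transient (for example on clusters that scale up on demand), so a real implementation would likely need more careful rules.

```python
from typing import Iterable, Optional

# Assumption: event reasons treated as fatal for a deployment.
# The right set would need tuning against real clusters.
FATAL_REASONS = {"FailedScheduling", "FailedCreate"}


def find_deployment_failure(events: Iterable[dict], name_prefix: str) -> Optional[str]:
    """Return a readable error for the first fatal Warning event, or None."""
    for event in events:
        involved = event.get("involvedObject", {})
        if not involved.get("name", "").startswith(name_prefix):
            continue  # the event belongs to some other workload
        if event.get("type") == "Warning" and event.get("reason") in FATAL_REASONS:
            return (
                f"{involved.get('kind')} {involved.get('name')}: "
                f"{event.get('reason')}: {event.get('message')}"
            )
    return None


# Hypothetical sample data, shaped like Kubernetes Event objects.
sample_events = [
    {
        "type": "Normal",
        "reason": "ScalingReplicaSet",
        "involvedObject": {"kind": "Deployment", "name": "job-adder-v-0-0-1"},
        "message": "Scaled up replica set job-adder-v-0-0-1-5d9c to 1",
    },
    {
        "type": "Warning",
        "reason": "FailedScheduling",
        "involvedObject": {"kind": "Pod", "name": "job-adder-v-0-0-1-5d9c-abcde"},
        "message": "0/3 nodes are available: 3 Insufficient cpu.",
    },
]

print(find_deployment_failure(sample_events, "job-adder"))
```

The returned message could then be shown to the user instead of waiting for the timeout.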

@anders314159 anders314159 self-assigned this Mar 11, 2024
@anders314159
Contributor

  1. Is there a failing example for this issue?
  2. Will $ racetrack deploy job.yaml succeed while kubectl apply runs in the kubernetes plugin? Where is the timeout?
  3. Do we want this feature to be in lifecycle-supervisor, or is it something in the Kubernetes plugin?

@iszulcdeepsense
Collaborator Author

iszulcdeepsense commented Apr 19, 2024

@anders314159

  1. Is there a failing example for this issue?

Sort of. For example, when requesting too many CPU cores, like this:

resources:
  cpu_min: 10M # 10M is not millis, but Mega

Kubernetes says it's okay at the kubectl apply stage, but then it's unable to create a pod. So there's room for improvement in getting this information back to the user.
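To make the suffix difference concrete, here is a small sketch that converts a CPU quantity string into a number of cores, assuming the value is interpreted as a Kubernetes resource quantity (where `m` means milli and `M` means mega). The helper name `cpu_cores` is hypothetical, and only a few decimal SI suffixes are handled.

```python
from decimal import Decimal

# A few of the decimal SI suffixes accepted in Kubernetes resource quantities.
# Binary suffixes (Ki, Mi, ...) and exponent notation are left out of this sketch.
SI_SUFFIXES = {
    "m": Decimal("0.001"),
    "k": Decimal(10) ** 3,
    "M": Decimal(10) ** 6,
    "G": Decimal(10) ** 9,
}


def cpu_cores(quantity: str) -> Decimal:
    """Convert a CPU quantity string such as '10m' or '10M' to a number of cores."""
    suffix = quantity[-1:]
    if suffix in SI_SUFFIXES:
        return Decimal(quantity[:-1]) * SI_SUFFIXES[suffix]
    return Decimal(quantity)


print(cpu_cores("10m"))  # ten millicores
print(cpu_cores("10M"))  # ten million cores
```

So `10M` asks for ten million cores, which no node can provide, while `10m` asks for one hundredth of a core.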

  2. Will $ racetrack deploy job.yaml succeed while kubectl apply runs in the kubernetes plugin? Where is the timeout?

racetrack deploy job.yaml will fail, but only after 15 minutes (after the timeout).
The timeout comes from the liveness probe check. Lifecycle tries to check the /live endpoint of the job, but since the pod doesn't exist, it keeps waiting for the pod to appear until the timeout is reached:

response = _wait_until_job_is_alive(base_url, deployment_timestamp, headers)
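One possible shape for a fail-fast version of such a wait loop is sketched below. It is not the actual `_wait_until_job_is_alive` implementation: the function name, its parameters and the two callbacks are hypothetical. `is_alive` stands for the /live check, and `find_failure` stands for whatever the infrastructure plugin can report (for example an error found in Kubernetes events).

```python
import time
from typing import Callable, Optional


def wait_until_alive(
    is_alive: Callable[[], bool],
    find_failure: Callable[[], Optional[str]],
    timeout_s: float = 15 * 60,
    interval_s: float = 5.0,
) -> None:
    """Poll until the job is alive; abort early if the cluster reports an error."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_alive():
            return
        # Hypothetical hook: ask the infrastructure whether deployment already failed.
        failure = find_failure()
        if failure is not None:
            raise RuntimeError(f"deployment failed: {failure}")
        time.sleep(interval_s)
    raise TimeoutError(f"job did not become alive within {timeout_s} seconds")
```

Keeping the loop in Lifecycle and passing the failure check in as a callback is one way to keep the generic part out of the individual plugins.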

  3. Do we want this feature to be in lifecycle-supervisor, or is it something in the Kubernetes plugin?

Good question. It will probably end up in the Kubernetes plugin, but if possible we could move some generic parts into Lifecycle (so as not to repeat the same thing in many plugins).
