clusterqueue_controller panics when workload has no admission #2182
Thank you for creating this issue!
Thank you for the quick reply @tenzen-y!
Edit: also worth pointing out the workload had no template in
For what it's worth, I found this recent PR #2171 where it looks like a nil check is being added against the
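To illustrate the class of bug being discussed: dereferencing an unset `Admission` field panics in Go, and a nil guard avoids it. The types below are a minimal sketch loosely mirroring the shape of a Workload status; the names `Admission`, `ClusterQueue`, and `clusterQueueName` are illustrative, not Kueue's actual API.

```go
package main

import "fmt"

// Admission is a loose stand-in for the admission block of a Workload status.
type Admission struct {
	ClusterQueue string
}

// WorkloadStatus holds a pointer that is nil when the workload was never admitted.
type WorkloadStatus struct {
	Admission *Admission
}

// clusterQueueName guards the dereference: without the nil check,
// reading s.Admission.ClusterQueue on an unadmitted workload panics.
func clusterQueueName(s WorkloadStatus) (string, bool) {
	if s.Admission == nil {
		return "", false
	}
	return s.Admission.ClusterQueue, true
}

func main() {
	name, ok := clusterQueueName(WorkloadStatus{}) // no admission set
	fmt.Println(name, ok)

	name, ok = clusterQueueName(WorkloadStatus{Admission: &Admission{ClusterQueue: "cluster-queue"}})
	fmt.Println(name, ok)
}
```

This is the same defensive pattern a controller reconcile loop needs before touching optional status sub-fields.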
@preslavgerchev do you still have the spec for the job
@trasc sure! Here's a slightly redacted yaml spec, just the values have been substituted. I have dropped the
Thanks @preslavgerchev, I'll take a deeper look at this. /assign
Thank you for taking this. I suspect that QuotaReserved introduced a breaking change without backward compatibility, since we added it recently.
@preslavgerchev, is there any chance you still have
@alculquicondor, have you encountered something like this?
@trasc, as part of our update flow we updated the CRD. Here's a link to the CRDs as we had them deployed as part of the upgrade.
Unfortunately I don't know how to dig deeper in this multiple-CRD-version scenario; can you remove the
Maybe additionally confirm if there is no other old pod of Kueue running. If there is a Pod running v0.2.x, then it might be responding to webhooks and dropping fields. |
Unlikely; the workload has the QuotaReserved condition, which is more or less new.
Yeah, but the webhook wouldn't drop conditions... it doesn't validate names of conditions, IIRC. We see that the Workload shared by the OP has no podset template or admission fields, which are two things that changed when we released v1beta1.
So ... we have a workload admitted by the
hi @trasc, sorry for the late reply on this one. I currently cannot safely drop the
If this is the recommended approach, I will look into first stopping Kueue entirely, ensuring there are no workload objects in the cluster, and then upgrading.
Hi @preslavgerchev,
That should be the kueue controller. The only way we create jobs is by labelling k8s jobs with the
When we upgraded, there were no workload objects in the cluster, which means that the new kueue controller (v0.6.2) took a k8s job with that label and created a workload object for it with the missing admission and pod set fields.
Can confirm, we only had one pod (v0.6.2) of the controller manager running.
The newer versions of the kueue controller (v0.3+) are only using
If there is no old kueue or custom controller working with
@preslavgerchev can you open a support ticket with GKE as well?
What happened:
We have recently updated kueue to 0.6.2. After some time the kueue manager started crashing with the following panic:
Upon further inspection we found that there is one workload object that has no status.admission.clusterQueueName.
We use kueue by assigning labels to a k8s job as described here: https://kueue.sigs.k8s.io/docs/tasks/run/jobs/
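For context, the labelling approach from the linked docs looks roughly like the fragment below: a suspended batch Job carrying the kueue.x-k8s.io/queue-name label, which the Kueue controller picks up to create a Workload object. The queue name "user-queue", the Job name, and the container spec here are placeholder values, not the reporter's actual manifest.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: sample-job            # placeholder name
  labels:
    kueue.x-k8s.io/queue-name: user-queue  # name of a LocalQueue; assumed value
spec:
  suspend: true               # Kueue unsuspends the Job once the workload is admitted
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox
        command: ["sleep", "5"]
```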
To me it seems like there's a race condition in the controller logic, as we had the new version (0.6.2) running for some time until it crashed. We had to scale down the deployment and delete the validating/mutating webhooks so we could manually get rid of the workload. After restarting everything, it worked for 2-3 hours until the same panic reoccurred.
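One way to spot such a broken object before it crashes the controller is to filter `kubectl get workloads -A -o json` output for items whose status has no admission block. This is a sketch under assumed field paths; a small inline fixture stands in for real cluster output here.

```shell
# Fixture standing in for "kubectl get workloads -A -o json" output:
# one admitted workload and one with no admission block.
cat > /tmp/workloads.json <<'EOF'
{"items":[
  {"metadata":{"namespace":"default","name":"job-ok"},
   "status":{"admission":{"clusterQueue":"cluster-queue"}}},
  {"metadata":{"namespace":"default","name":"job-broken"},
   "status":{}}
]}
EOF

# Print namespace/name of every workload missing status.admission.
jq -r '.items[]
       | select(.status.admission == null)
       | "\(.metadata.namespace)/\(.metadata.name)"' /tmp/workloads.json
```

Against a live cluster one would pipe `kubectl get workloads -A -o json` straight into the same jq filter instead of using a fixture file.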
For completeness, here's our resource flavor, local and cluster queue definitions if those are needed:
We have deployed everything as specified in the 0.6.2 manifests:
https://github.com/kubernetes-sigs/kueue/releases/download/v0.6.2/manifests.yaml
The deployment's containers are using the following images:
Environment:
Kubernetes version (use `kubectl version`): 1.27.7-gke.1121002
Kueue version (use `git describe --tags --dirty --always`): 0.6.2