[Flaky Test] E2E tests for extensions can fail due to unavailability of clusterrole-machine-controller-manager.local.extensions.gardener.cloud webhook #9020

Open
plkokanov opened this issue Jan 10, 2024 · 2 comments · May be fixed by #9752
Labels
area/auto-scaling: Auto-scaling (CA/HPA/VPA/HVPA, predominantly control plane, but also otherwise) related
area/scalability: Scalability related
area/testing: Testing related
kind/flake: Tracking or fixing a flaky test

Comments

@plkokanov
Contributor

How to categorize this issue?

/area testing
/kind flake

Which test(s)/suite(s) are flaking:
E2E tests for extensions which use the KinD setup can sometimes flake during the step which deploys the extensions' charts in the local KinD cluster.

CI link:

Reason for failure:
This can happen if the extensions' charts contain a ClusterRole resource. For example, the shoot-rsyslog-relp extension deploys a ClusterRole as part of the skaffold deployment for the shoot-rsyslog-relp-admission used for the e2e tests.

This skaffold deployment can fail with the following error:

Error: INSTALLATION FAILED: 1 error occurred:
	* Internal error occurred: failed calling webhook "clusterrole-machine-controller-manager.local.extensions.gardener.cloud": failed to call webhook: Post "https://gardener-extension-provider-local.extension-provider-local-5mf8n.svc:443/clusterrole-machine-controller-manager?timeout=5s": dial tcp 10.2.126.230:443: connect: connection refused

The failure occurs because the gardener-extension-provider-local pods can be evicted by VPA while the extension charts are being deployed, which leaves the gardener-extension-provider-local webhook server temporarily unavailable.

The clusterrole-machine-controller-manager.local.extensions.gardener.cloud webhook does not have any selectors:

	return &extensionswebhook.Webhook{
		Name:           name,
		Provider:       provider,
		Types:          types,
		Target:         target,
		Path:           name,
		Webhook:        &admission.Webhook{Handler: handler, RecoverPanic: true},
		FailurePolicy:  &failurePolicy,
		TimeoutSeconds: pointer.Int32(5),
	}, nil

However, it is only responsible for the system:machine-controller-manager-runtime ClusterRole:

	if newObj.GetName() != "system:machine-controller-manager-runtime" {
		return nil
	}

Therefore, anything that tries to deploy a ClusterRole while the gardener-extension-provider-local pods are down will fail.
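
To illustrate the failure mode described above, here is a minimal, stdlib-only sketch. It is not the Gardener or kube-apiserver code: the `webhook` type, `admit`, `handle`, and the ClusterRole name `unrelated-clusterrole` are hypothetical, and the sketch assumes the webhook's failure policy is `Fail` (consistent with the error message, but not shown in the snippet above). The point is that the name check lives inside the handler, so it can only run after the webhook has been reached.

```go
package main

import (
	"errors"
	"fmt"
)

// targetName is the only ClusterRole the handler actually cares about.
const targetName = "system:machine-controller-manager-runtime"

// webhook is a simplified, hypothetical model of a webhook registration
// without any selectors.
type webhook struct {
	name       string
	failClosed bool // assumption: failure policy is Fail
	available  bool // false while the backing pods are evicted
}

// handle mirrors the name check from the issue: every other ClusterRole
// is ignored, but only once the request has reached the handler.
func handle(clusterRoleName string) string {
	if clusterRoleName != targetName {
		return "ignored"
	}
	return "handled"
}

// admit models the API server side: with no selector on the registration,
// every ClusterRole request is sent to the webhook first.
func admit(w webhook, clusterRoleName string) (string, error) {
	if !w.available {
		if w.failClosed {
			return "", errors.New("failed calling webhook " + w.name + ": connection refused")
		}
		return "skipped", nil
	}
	return handle(clusterRoleName), nil
}

func main() {
	w := webhook{
		name:       "clusterrole-machine-controller-manager.local.extensions.gardener.cloud",
		failClosed: true,
		available:  false, // pods evicted by VPA
	}
	// An unrelated ClusterRole is rejected even though the handler would ignore it.
	_, err := admit(w, "unrelated-clusterrole")
	fmt.Println("unrelated ClusterRole while webhook is down:", err)
}
```

Under these assumptions, any ClusterRole create or update is rejected while the webhook is down, regardless of its name.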

Anything else we need to know:

gardener-prow bot added the area/testing and kind/flake labels on Jan 10, 2024
@rfranzke
Member

We should probably improve the webhooks in a more general way, i.e., introduce well-known labels for resources targeted by webhooks instead of switching over names. This way, we can add objectSelectors to all webhook registrations and prevent errors like this.
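
A rough sketch of how the label-based approach could behave, again stdlib-only and not taken from the Gardener code base. The label key `example.com/webhook-target` and its value are hypothetical (the thread does not name a label), and `matchesSelector` only models exact-match label selection, which is what the `objectSelector` field of a Kubernetes webhook configuration does with `matchLabels`.

```go
package main

import "fmt"

// Hypothetical well-known label; the issue does not specify one.
const (
	labelKey   = "example.com/webhook-target"
	labelValue = "machine-controller-manager"
)

// matchesSelector reports whether all key/value pairs of the selector are
// present in the object's labels. An empty selector matches every object,
// which corresponds to a webhook registered without an objectSelector.
func matchesSelector(selector, labels map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

func main() {
	selector := map[string]string{labelKey: labelValue}

	// The ClusterRole targeted by the webhook carries the label.
	target := map[string]string{labelKey: labelValue}
	// A ClusterRole from some other chart carries no such label.
	unrelated := map[string]string{"app": "some-extension"}

	fmt.Println("call webhook for target:", matchesSelector(selector, target))
	fmt.Println("call webhook for unrelated:", matchesSelector(selector, unrelated))
	fmt.Println("no selector, unrelated:", matchesSelector(nil, unrelated))
}
```

With such a selector in place, requests for unlabeled ClusterRoles would never be sent to the webhook, so its temporary unavailability would not block them.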

@ialidzhikov
Member

/assign
/area auto-scaling
