[Flaky Test] E2E tests for extensions can fail due to unavailability of `clusterrole-machine-controller-manager.local.extensions.gardener.cloud` webhook #9020

plkokanov · 2024-01-10T09:27:12Z

How to categorize this issue?

/area testing
/kind flake

Which test(s)/suite(s) are flaking:
E2E tests for extensions that use the KinD setup can occasionally flake during the step that deploys the extensions' charts into the local KinD cluster.

CI link:

Reason for failure:
This can happen if the extensions' charts contain a ClusterRole resource. For example, the shoot-rsyslog-relp extension deploys a ClusterRole as part of the skaffold deployment of shoot-rsyslog-relp-admission, which is used for the e2e tests.

This skaffold deployment can fail with the following error:

Error: INSTALLATION FAILED: 1 error occurred:
	* Internal error occurred: failed calling webhook "clusterrole-machine-controller-manager.local.extensions.gardener.cloud": failed to call webhook: Post "https://gardener-extension-provider-local.extension-provider-local-5mf8n.svc:443/clusterrole-machine-controller-manager?timeout=5s": dial tcp 10.2.126.230:443: connect: connection refused

The failure occurs because the gardener-extension-provider-local pods can be evicted by VPA while the extension charts are being deployed, which makes the gardener-extension-provider-local webhook server temporarily unavailable.

The clusterrole-machine-controller-manager.local.extensions.gardener.cloud webhook does not have any selectors:

gardener/pkg/provider-local/webhook/machinecontrollermanager/add.go, lines 71 to 80 in bcaed6d:

    return &extensionswebhook.Webhook{
        Name:           name,
        Provider:       provider,
        Types:          types,
        Target:         target,
        Path:           name,
        Webhook:        &admission.Webhook{Handler: handler, RecoverPanic: true},
        FailurePolicy:  &failurePolicy,
        TimeoutSeconds: pointer.Int32(5),
    }, nil

However, it is only responsible for the system:machine-controller-manager-runtime ClusterRole:

gardener/pkg/provider-local/webhook/machinecontrollermanager/mutator.go, lines 30 to 32 in bcaed6d:

    if newObj.GetName() != "system:machine-controller-manager-runtime" {
        return nil
    }

Therefore, anything that tries to deploy a ClusterRole while the gardener-extension-provider-local pods are down will fail.
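The failure mode described above can be illustrated with a small, self-contained model. This is only a sketch, not Gardener or Kubernetes code: the `webhook` type and the `admit` function are hypothetical, and it assumes the webhook's failure policy is `Fail`, which the error above suggests but the quoted snippet does not show.

```go
package main

import (
	"errors"
	"fmt"
)

// webhook is a simplified, hypothetical model of a webhook registration
// that matches on resource kind only (no object or namespace selector).
type webhook struct {
	name      string
	kind      string // resource kind the webhook is registered for
	available bool   // whether the backing webhook server accepts connections
	failOpen  bool   // true models failurePolicy=Ignore, false models failurePolicy=Fail
}

var errConnRefused = errors.New("connect: connection refused")

// admit models the API server's decision for a single request.
func admit(w webhook, kind, objName string) error {
	if kind != w.kind {
		return nil // not matched, so the webhook is never called
	}
	if !w.available {
		if w.failOpen {
			return nil // failurePolicy=Ignore lets the request through
		}
		// Without a selector, every object of the matched kind ends up here,
		// even though the handler would have been a no-op for it.
		return fmt.Errorf("admitting %s %q: failed calling webhook %q: %w",
			kind, objName, w.name, errConnRefused)
	}
	// The real handler only acts on one specific ClusterRole name.
	return nil
}

func main() {
	w := webhook{
		name:      "clusterrole-machine-controller-manager.local.extensions.gardener.cloud",
		kind:      "ClusterRole",
		available: false, // pods were just evicted
	}
	fmt.Println(admit(w, "ClusterRole", "some-extension-admission"))
	fmt.Println(admit(w, "ConfigMap", "some-config"))
}
```

In this model, an unrelated ClusterRole is rejected while the server is down, whereas a ConfigMap is unaffected because the webhook does not match its kind.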

Anything else we need to know:


rfranzke · 2024-01-19T12:44:24Z

We should probably improve the webhooks in a more general way, i.e., introduce well-known labels for resources targeted by webhooks instead of switching over names. This way, we can add objectSelectors to all webhook registrations and prevent errors like this.
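A rough sketch of how such a label-based selector would narrow the set of objects sent to the webhook is shown below. It models only `matchLabels` semantics with a hypothetical `selector` type; the label key `example.com/webhook-target` and its value are made up for illustration and are not labels defined by Gardener.

```go
package main

import "fmt"

// selector is a simplified stand-in for a label selector that supports
// only matchLabels (no matchExpressions).
type selector map[string]string

// matches reports whether every key/value pair of the selector is present
// in labels. An empty selector matches every object, which mirrors a
// webhook registration without an objectSelector.
func (s selector) matches(labels map[string]string) bool {
	for k, want := range s {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical well-known label marking objects targeted by the webhook.
	withSelector := selector{"example.com/webhook-target": "machine-controller-manager"}
	noSelector := selector{}

	targeted := map[string]string{"example.com/webhook-target": "machine-controller-manager"}
	unrelated := map[string]string{"app": "shoot-rsyslog-relp-admission"}

	fmt.Println("no selector, unrelated ClusterRole:", noSelector.matches(unrelated))
	fmt.Println("selector, unrelated ClusterRole:   ", withSelector.matches(unrelated))
	fmt.Println("selector, targeted ClusterRole:    ", withSelector.matches(targeted))
}
```

With a selector in place, an unrelated ClusterRole would no longer be routed to the webhook, so a temporarily unavailable webhook server could not block its creation.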

ialidzhikov · 2024-02-13T13:07:54Z

/assign
/area auto-scaling

gardener-prow bot added the area/testing and kind/flake labels on Jan 10, 2024

gardener-prow bot assigned ialidzhikov on Feb 13, 2024

gardener-prow bot added the area/auto-scaling label on Feb 13, 2024

ialidzhikov removed their assignment on Apr 29, 2024

ialidzhikov self-assigned this on May 13, 2024

ialidzhikov mentioned this issue on May 15, 2024 in "extensions lib: Consider dropping EnsureKubeAPIServerService as it is no longer required after ManagedIstio/APIServerSNI is unconditionally enabled" #9755 (closed)

ialidzhikov added the area/scalability label on May 15, 2024

rfranzke linked a pull request on May 27, 2024 that will close this issue: "[provider-local] Harmonize local VPN setup with real-world scenario" #9752 (open)
