[Flaky Test] E2E tests for extensions can fail due to unavailability of clusterrole-machine-controller-manager.local.extensions.gardener.cloud
webhook
#9020
Labels
area/auto-scaling
Auto-scaling (CA/HPA/VPA/HVPA, predominantly control plane, but also otherwise) related
area/scalability
Scalability related
area/testing
Testing related
kind/flake
Tracking or fixing a flaky test
How to categorize this issue?
/area testing
/kind flake
Which test(s)/suite(s) are flaking:
E2E tests for extensions which use the KinD setup can sometimes flake during the step which deploys the extensions' charts in the local KinD cluster.
CI link:
Reason for failure:
This can happen if the extensions' charts contain a
clusterrole
resource. E.g. theshoot-rsyslog-relp
extension deploys a ClusterRole as part of the skaffold deployment for theshoot-rsyslog-relp-admission
used for the e2e tests.This skaffold deployment can fail with the following error:
The reason for the failure is that the
gardener-extension-provider-local
pods could get evicted by VPA during the deployment of the extension charts, meaning that thegardener-extension-provider-local
s webhook server will be temporarily unavailable.The
clusterrole-machine-controller-manager.local.extensions.gardener.cloud
webhook does not have any selectors:gardener/pkg/provider-local/webhook/machinecontrollermanager/add.go
Lines 71 to 80 in bcaed6d
However, it is only responsible for the
system:machine-controller-manager-runtime
ClusterRole:gardener/pkg/provider-local/webhook/machinecontrollermanager/mutator.go
Lines 30 to 32 in bcaed6d
Therefore, anything that tries to deploy a ClusterRole while the
gardener-extension-provider-local
pods are down will fail.Anything else we need to know:
The text was updated successfully, but these errors were encountered: