GRM on Seeds might end up in a crash-loop when their kube-apiserver
domain starts resolving to a different IP
#9528
How to categorize this issue?
/area robustness
/kind bug
What happened:
On Seeds which are also Shoots (where the `KUBERNETES_SERVICE_HOST` environment variable is set), the GRM deployed in the `garden` namespace can end up in a crash loop when the domain of its own `kube-apiserver` suddenly starts resolving to a different IP.
Egress access for GRM is restricted by the `allow-to-runtime-apiserver` NetworkPolicy in the `garden` namespace. This NetworkPolicy allows egress traffic on port 443 to the endpoints of `kubernetes.default.svc` and to the IPs resolved from its `kube-apiserver` domain (ref).

If the `kube-apiserver` domain starts resolving to a different IP and the old IP is no longer reachable, GRM crashes after a while because it loses access to the `kube-apiserver`. The new GRM pods try to reach the `kube-apiserver` via its new IP, but this is not allowed by the NetworkPolicy, which still contains the old IP only. GRM cannot recover until the NetworkPolicy is updated.

The NetworkPolicy controller in `gardenlet` is responsible for updating the `allow-to-runtime-apiserver` NetworkPolicy. When the `kube-apiserver` IP changes, `gardenlet` needs to restart too. Usually, this is not a problem because `gardenlet` has a NetworkPolicy which allows all egress traffic on port 443.

However, there is a scenario where
`gardenlet` is not able to restart properly: `gardenlet` requires the GRM HA webhook in order to start; otherwise, it panics immediately.

The panic happens before the `allow-to-runtime-apiserver` NetworkPolicy is updated. This is probably a race between the NetworkPolicy controller and the seed controller which the latter always wins, because DNS resolution takes some time.

Thus, there is a race between the restarts of `gardenlet` and GRM:

- If `gardenlet` restarts first, the GRM webhook is still running and `gardenlet` can update the `allow-to-runtime-apiserver` NetworkPolicy.
- If GRM restarts first, GRM and `gardenlet` are stuck in a crash loop.

This situation can only be resolved manually, e.g. by updating the IP in the `allow-to-runtime-apiserver` NetworkPolicy or by creating a temporary NetworkPolicy for GRM.

What you expected to happen:
GRM should be able to handle IP changes of its `kube-apiserver`.

How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
We could prevent this from happening by creating a special NetworkPolicy for GRM which allows egress traffic on port 443 to any IP.
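Such a policy might look roughly like the following. This is only a sketch: the policy name and the pod selector label are assumptions and would need to match the actual GRM deployment.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-grm-to-any-apiserver      # hypothetical name
  namespace: garden
spec:
  podSelector:
    matchLabels:
      app: gardener-resource-manager    # assumed label, must match the GRM pods
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0                 # any IP, so a changed kube-apiserver IP stays reachable
    ports:
    - protocol: TCP
      port: 443
```

The trade-off is that this is broader than the IP-pinned `allow-to-runtime-apiserver` policy, but it removes the dependency on the NetworkPolicy controller re-resolving the domain in time.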
Another option could be that `gardenlet` starts the NetworkPolicy controller earlier, so that it can update the network policies before it requires the GRM webhook to be available.

Environment:
Kubernetes version (use `kubectl version`):