Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CAPV controller manager stuck during reconcile #2832

Open
jansoukup opened this issue Mar 18, 2024 · 1 comment
Open

CAPV controller manager stuck during reconcile #2832

jansoukup opened this issue Mar 18, 2024 · 1 comment
Labels
area/govmomi Issues or PRs related to the govmomi mode kind/bug Categorizes issue or PR as related to a bug.

Comments

@jansoukup
Copy link

/kind bug

We have 1 management cluster with 7 workload clusters. Each workload cluster has ~25 worker nodes.
Sometimes during the reconciliation of all workload clusters, CAPV stops reconciling without any significant information in the logs (nor in CAPI logs). No new VMs are visible in vCenter, nothing is deleted, and new Machines remain in the "Provisioning" state indefinitely.
The quickest fix is to restart the CAPV deployment, after which everything runs smoothly again.

CAPV controller manager:

I0315 16:23:04.608993       1 vimmachine.go:432] "capv-controller-manager/vspheremachine-controller/cluster-1-test/cluster-1-test-worker-ce3e0b-8rg5v: updated vm" vm="cluster-1-test/cluster-1-test-worker-ktfbx-7lrt2"
I0315 16:23:04.653791       1 http.go:143] "controller-runtime/webhook/webhooks: wrote response" webhook="/validate-infrastructure-cluster-x-k8s-io-v1beta1-vspherevm" code=200 reason= UID=8c8eb2a6-ccb9-4663-b78a-aee2dc62a9f4 allowed=true
I0315 16:23:04.645912       1 request.go:622] Waited for 91.100271ms due to client-side throttling, not priority and fairness, request: PATCH:https://10.43.0.1:443/apis/infrastructure.cluster.x-k8s.io/v1beta1/namespaces/cluster-2-test/vspherevms/cluster-2-test-master-mnmsp
I0315 16:23:04.653134       1 http.go:96] "controller-runtime/webhook/webhooks: received request" webhook="/validate-infrastructure-cluster-x-k8s-io-v1beta1-vspherevm" UID=8c8eb2a6-ccb9-4663-b78a-aee2dc62a9f4 kind="infrastructure.cluster.x-k8s.io/v1beta1, Kind=VSphereVM" resource={Group:infrastructure.cluster.x-k8s.io Version:v1beta1 Resource:vspherevms}
I0315 16:23:04.696343       1 request.go:622] Waited for 138.639409ms due to client-side throttling, not priority and fairness, request: PATCH:https://10.43.0.1:443/apis/infrastructure.cluster.x-k8s.io/v1beta1/namespaces/cluster-1-test/vspherevms/cluster-1-test-worker-ktfbx-rvsbd
I0315 16:23:04.658346       1 vimmachine.go:432] "capv-controller-manager/vspheremachine-controller/cluster-2-test/cluster-2-test-master-87ac03-9lxbp: updated vm" vm="cluster-2-test/cluster-2-test-master-mnmsp"
I0315 16:24:21.806753       1 http.go:96] "controller-runtime/webhook/webhooks: received request" webhook="/validate-infrastructure-cluster-x-k8s-io-v1beta1-vspheremachine" UID=518c3035-294d-4a27-940b-6099a1e849f3 kind="infrastructure.cluster.x-k8s.io/v1beta1, Kind=VSphereMachine" resource={Group:infrastructure.cluster.x-k8s.io Version:v1beta1 Resource:vspheremachines}
I0315 16:24:21.807204       1 http.go:143] "controller-runtime/webhook/webhooks: wrote response" webhook="/validate-infrastructure-cluster-x-k8s-io-v1beta1-vspheremachine" code=200 reason= UID=518c3035-294d-4a27-940b-6099a1e849f3 allowed=true
I0315 16:24:21.904439       1 http.go:96] "controller-runtime/webhook/webhooks: received request" webhook="/mutate-infrastructure-cluster-x-k8s-io-v1beta1-vspheremachine" UID=db175ab7-9f4c-4554-9d74-73756c742219 kind="infrastructure.cluster.x-k8s.io/v1beta1, Kind=VSphereMachine" resource={Group:infrastructure.cluster.x-k8s.io Version:v1beta1 Resource:vspheremachines}
I0315 16:24:21.905326       1 http.go:143] "controller-runtime/webhook/webhooks: wrote response" webhook="/mutate-infrastructure-cluster-x-k8s-io-v1beta1-vspheremachine" code=200 reason= UID=db175ab7-9f4c-4554-9d74-73756c742219 allowed=true
I0315 16:24:21.908133       1 http.go:96] "controller-runtime/webhook/webhooks: received request" webhook="/validate-infrastructure-cluster-x-k8s-io-v1beta1-vspheremachine" UID=a3c5edd2-4f72-4b3c-a5ef-5a6079e753f5 kind="infrastructure.cluster.x-k8s.io/v1beta1, Kind=VSphereMachine" resource={Group:infrastructure.cluster.x-k8s.io Version:v1beta1 Resource:vspheremachines}
I0315 16:24:21.909097       1 http.go:143] "controller-runtime/webhook/webhooks: wrote response" webhook="/validate-infrastructure-cluster-x-k8s-io-v1beta1-vspheremachine" code=200 reason= UID=a3c5edd2-4f72-4b3c-a5ef-5a6079e753f5 allowed=true

CAPI controller manager:

I0315 15:17:35.753443       1 machine_controller_phases.go:286] "Waiting for infrastructure provider to create machine infrastructure and report status.ready" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="cluster-2-test/cluster-2-test-worker-2z784-52fc2" namespace="cluster-2-test" name="cluster-2-test-worker-2z784-52fc2" reconcileID=f9599a99-1da2-4872-b531-98446f266c73 MachineSet="cluster-2-test/cluster-2-test-worker-2z784" MachineDeployment="cluster-2-test/cluster-2-test-worker" Cluster="cluster-2-test/cluster-2-test" VSphereMachine="cluster-2-test/cluster-2-test-worker-1e6e95-4cvmt"

Omit state where CAPV runs in this strange state, without any info in logs.

Our workaround is scheduled Job for CAPV restart twice per day.

Environment:

  • Cluster-api-provider-vsphere version: 1.7.5
  • Kubernetes version: (use kubectl version): 1.24.17
  • OS (e.g. from /etc/os-release): Ubuntu 22.04
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Mar 18, 2024
@sbueringer
Copy link
Member

sbueringer commented Mar 22, 2024

I would guess that maybe a controller is stuck.

This could be confirmed via metrics (active workers or something) and via a go routine dump of the controller (via kill -ABRT )

@sbueringer sbueringer added the area/govmomi Issues or PRs related to the govmomi mode label Apr 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/govmomi Issues or PRs related to the govmomi mode kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

No branches or pull requests

3 participants