vsphere-csi-node-xxxxx are in CrashLoopBackOff #2519

Open
dattebayo6716 opened this issue Nov 27, 2023 · 8 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@dattebayo6716

dattebayo6716 commented Nov 27, 2023

/kind bug

What steps did you take and what happened:

What I see on the provisioned cluster

  1. Some calico pods are in pending state
  2. Some coredns pods are in pending state
  3. vsphere-csi-controller-manager pod is in pending state
  4. vsphere-csi-node-xxxxx are in CrashLoopBackOff without much information
  5. There is NO log of what error occurred. I checked the logs of the CAPI and CAPV pods in the bootstrap cluster, and there are no errors in the provisioned cluster's pods either (a sketch of log-collection commands is just below this list).
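
A minimal sketch of the commands that could be used to try to surface the crash reason (pod, container, and kubeconfig names are taken from the output further below; --previous prints the logs of the last terminated attempt):

kubectl logs vsphere-csi-node-dtvrg -n kube-system -c vsphere-csi-node --previous --kubeconfig=mcluster.kubeconfig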

What did you expect to happen:
I expected to see a cluster with all pods running.

Anything else you would like to add:
Below is some of the kubectl output for reference.

Here are some of the env variables I have

# VSPHERE_TEMPLATE: "ubuntu-2204-kube-v1.27.3"
# CONTROL_PLANE_ENDPOINT_IP: "10.63.32.100"
# VIP_NETWORK_INTERFACE: "ens192"
# VSPHERE_TLS_THUMBPRINT: ""
# EXP_CLUSTER_RESOURCE_SET: true  
# VSPHERE_SSH_AUTHORIZED_KEY: ""

# VSPHERE_STORAGE_POLICY: ""
# CPI_IMAGE_K8S_VERSION: "v1.27.3"

All bootstrap pods are running without errors.

ubuntu@frun10926:~/k8s$ kubectl get po -A -o wide
NAMESPACE                           NAME                                                             READY   STATUS    RESTARTS      AGE     IP            NODE                 NOMINATED NODE   READINESS GATES
capi-kubeadm-bootstrap-system       capi-kubeadm-bootstrap-controller-manager-557b778d6b-qpxn7       1/1     Running   1 (24h ago)   2d22h   10.244.0.9    kind-control-plane   <none>           <none>
capi-kubeadm-control-plane-system   capi-kubeadm-control-plane-controller-manager-55d8f6b576-8hl5r   1/1     Running   1 (24h ago)   2d22h   10.244.0.10   kind-control-plane   <none>           <none>
capi-system                         capi-controller-manager-685454967c-tnmcj                         1/1     Running   3 (24h ago)   2d22h   10.244.0.8    kind-control-plane   <none>           <none>
capv-system                         capv-controller-manager-84d85cdcbd-cb2wp                         1/1     Running   3 (24h ago)   2d22h   10.244.0.11   kind-control-plane   <none>           <none>
cert-manager                        cert-manager-75d57c8d4b-7j4tk                                    1/1     Running   1 (24h ago)   2d22h   10.244.0.6    kind-control-plane   <none>           <none>
cert-manager                        cert-manager-cainjector-69d6f4d488-rvp67                         1/1     Running   2 (24h ago)   2d22h   10.244.0.5    kind-control-plane   <none>           <none>
cert-manager                        cert-manager-webhook-869b6c65c4-h6xdt                            1/1     Running   0             2d22h   10.244.0.7    kind-control-plane   <none>           <none>
kube-system                         coredns-5d78c9869d-djj9s                                         1/1     Running   0             2d22h   10.244.0.4    kind-control-plane   <none>           <none>
kube-system                         coredns-5d78c9869d-vltjl                                         1/1     Running   0             2d22h   10.244.0.3    kind-control-plane   <none>           <none>
kube-system                         etcd-kind-control-plane                                          1/1     Running   0             2d22h   172.18.0.2    kind-control-plane   <none>           <none>
kube-system                         kindnet-zp6c5                                                    1/1     Running   1 (24h ago)   2d22h   172.18.0.2    kind-control-plane   <none>           <none>
kube-system                         kube-apiserver-kind-control-plane                                1/1     Running   1 (24h ago)   2d22h   172.18.0.2    kind-control-plane   <none>           <none>
kube-system                         kube-controller-manager-kind-control-plane                       1/1     Running   1 (24h ago)   2d22h   172.18.0.2    kind-control-plane   <none>           <none>
kube-system                         kube-proxy-t2g5b                                                 1/1     Running   0             2d22h   172.18.0.2    kind-control-plane   <none>           <none>
kube-system                         kube-scheduler-kind-control-plane                                1/1     Running   1 (24h ago)   2d22h   172.18.0.2    kind-control-plane   <none>           <none>
local-path-storage                  local-path-provisioner-6bc4bddd6b-rkwwm                          1/1     Running   0             2d22h   10.244.0.2    kind-control-plane   <none>           <none>

Here are the pods on the vSphere cluster that was provisioned using CAPI

ubuntu@frun10926:~/k8s$ kubectl get po -A --kubeconfig=mcluster.kubeconfig -o wide
NAMESPACE         NAME                                       READY   STATUS             RESTARTS          AGE     IP                NODE                        NOMINATED NODE   READINESS GATES
calico-system     calico-kube-controllers-5f9d445bb4-hp7rt   0/1     Pending            0                 2d20h   <none>            <none>                      <none>           <none>
calico-system     calico-node-6mrpv                          1/1     Running            0                 2d20h   10.63.32.83       mcluster-md-0-4kxmk-zplmd   <none>           <none>
calico-system     calico-node-dg42m                          1/1     Running            0                 2d20h   10.63.32.84       mcluster-klljm              <none>           <none>
calico-system     calico-node-f6n9r                          1/1     Running            0                 2d20h   10.63.32.81       mcluster-md-0-4kxmk-wfscb   <none>           <none>
calico-system     calico-node-gtxcg                          1/1     Running            0                 2d20h   10.63.32.82       mcluster-md-0-4kxmk-gbcjj   <none>           <none>
calico-system     calico-typha-5b866db66c-sdnpv              1/1     Running            0                 2d20h   10.63.32.81       mcluster-md-0-4kxmk-wfscb   <none>           <none>
calico-system     calico-typha-5b866db66c-trwlj              1/1     Running            0                 2d20h   10.63.32.82       mcluster-md-0-4kxmk-gbcjj   <none>           <none>
calico-system     csi-node-driver-drblt                      2/2     Running            0                 2d20h   192.168.232.193   mcluster-klljm              <none>           <none>
calico-system     csi-node-driver-pbhvm                      2/2     Running            0                 2d20h   192.168.68.65     mcluster-md-0-4kxmk-zplmd   <none>           <none>
calico-system     csi-node-driver-vflj4                      2/2     Running            0                 2d20h   192.168.141.66    mcluster-md-0-4kxmk-gbcjj   <none>           <none>
calico-system     csi-node-driver-wzmtr                      2/2     Running            0                 2d20h   192.168.83.65     mcluster-md-0-4kxmk-wfscb   <none>           <none>
kube-system       coredns-5d78c9869d-ckdjb                   0/1     Pending            0                 2d20h   <none>            <none>                      <none>           <none>
kube-system       coredns-5d78c9869d-vlpkw                   0/1     Pending            0                 2d20h   <none>            <none>                      <none>           <none>
kube-system       etcd-mcluster-klljm                        1/1     Running            0                 2d20h   10.63.32.84       mcluster-klljm              <none>           <none>
kube-system       kube-apiserver-mcluster-klljm              1/1     Running            0                 2d20h   10.63.32.84       mcluster-klljm              <none>           <none>
kube-system       kube-controller-manager-mcluster-klljm     1/1     Running            0                 2d20h   10.63.32.84       mcluster-klljm              <none>           <none>
kube-system       kube-proxy-7dxb2                           1/1     Running            0                 2d20h   10.63.32.82       mcluster-md-0-4kxmk-gbcjj   <none>           <none>
kube-system       kube-proxy-gsgzz                           1/1     Running            0                 2d20h   10.63.32.84       mcluster-klljm              <none>           <none>
kube-system       kube-proxy-mp98t                           1/1     Running            0                 2d20h   10.63.32.83       mcluster-md-0-4kxmk-zplmd   <none>           <none>
kube-system       kube-proxy-x97w4                           1/1     Running            0                 2d20h   10.63.32.81       mcluster-md-0-4kxmk-wfscb   <none>           <none>
kube-system       kube-scheduler-mcluster-klljm              1/1     Running            0                 2d20h   10.63.32.84       mcluster-klljm              <none>           <none>
kube-system       kube-vip-mcluster-klljm                    1/1     Running            0                 2d20h   10.63.32.84       mcluster-klljm              <none>           <none>
kube-system       vsphere-cloud-controller-manager-hzvzj     1/1     Running            0                 2d20h   10.63.32.84       mcluster-klljm              <none>           <none>
kube-system       vsphere-csi-controller-664c45f69b-6ddz4    0/5     Pending            0                 2d20h   <none>            <none>                      <none>           <none>
kube-system       vsphere-csi-node-dtvrg                     2/3     CrashLoopBackOff   809 (3m57s ago)   2d20h   192.168.141.65    mcluster-md-0-4kxmk-gbcjj   <none>           <none>
kube-system       vsphere-csi-node-jcpxj                     2/3     CrashLoopBackOff   810 (73s ago)     2d20h   192.168.232.194   mcluster-klljm              <none>           <none>
kube-system       vsphere-csi-node-lpjxj                     2/3     CrashLoopBackOff   809 (2m22s ago)   2d20h   192.168.83.66     mcluster-md-0-4kxmk-wfscb   <none>           <none>
kube-system       vsphere-csi-node-nkh6m                     2/3     CrashLoopBackOff   809 (3m35s ago)   2d20h   192.168.68.66     mcluster-md-0-4kxmk-zplmd   <none>           <none>
tigera-operator   tigera-operator-84cf9b6dbb-w6lkf           1/1     Running            0                 2d20h   10.63.32.83       mcluster-md-0-4kxmk-zplmd   <none>           <none>

Here is a sample kubectl describe for one of the vsphere-csi-node-xxxxx pods

ubuntu@frun10926:~/k8s$ kubectl describe pod  vsphere-csi-node-dtvrg -n kube-system --kubeconfig=mcluster.kubeconfig
Name:             vsphere-csi-node-dtvrg
Namespace:        kube-system
Priority:         0
Service Account:  default
Node:             mcluster-md-0-4kxmk-gbcjj/10.63.32.82
Start Time:       Fri, 24 Nov 2023 19:14:52 +0000
Labels:           app=vsphere-csi-node
                  controller-revision-hash=69967bd89d
                  pod-template-generation=1
                  role=vsphere-csi
Annotations:      cni.projectcalico.org/containerID: 0e30215c3f275ce821e98584c24cd139273c8c061af590ef5ddeb915b421e6ec
                  cni.projectcalico.org/podIP: 192.168.141.65/32
                  cni.projectcalico.org/podIPs: 192.168.141.65/32
Status:           Running
IP:               192.168.141.65
IPs:
  IP:           192.168.141.65
Controlled By:  DaemonSet/vsphere-csi-node
Containers:
  node-driver-registrar:
    Container ID:  containerd://075a9e6aa183294562e6edfbd55577f8eeca891c19cb43603973a1057d2f8125
    Image:         quay.io/k8scsi/csi-node-driver-registrar:v2.0.1
    Image ID:      quay.io/k8scsi/csi-node-driver-registrar@sha256:a104f0f0ec5fdd007a4a85ffad95a93cfb73dd7e86296d3cc7846fde505248d3
    Port:          <none>
    Host Port:     <none>
    Args:
      --v=5
      --csi-address=$(ADDRESS)
      --kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)
    State:          Running
      Started:      Fri, 24 Nov 2023 19:31:30 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      ADDRESS:               /csi/csi.sock
      DRIVER_REG_SOCK_PATH:  /var/lib/kubelet/plugins/csi.vsphere.vmware.com/csi.sock
    Mounts:
      /csi from plugin-dir (rw)
      /registration from registration-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-glb6m (ro)
  vsphere-csi-node:
    Container ID:   containerd://b8ec60cc34ad576e31564f0d993b2b50440f8de2753f744c545cb772407ee654
    Image:          gcr.io/cloud-provider-vsphere/csi/release/driver:v3.1.2
    Image ID:       gcr.io/cloud-provider-vsphere/csi/release/driver@sha256:471db9143b6daf2abdb656383f9d7ad34123a22c163c3f0e62dc8921048566bb
    Port:           9808/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 27 Nov 2023 15:56:46 +0000
      Finished:     Mon, 27 Nov 2023 15:56:46 +0000
    Ready:          False
    Restart Count:  807
    Liveness:       http-get http://:healthz/healthz delay=10s timeout=3s period=5s #success=1 #failure=3
    Environment:
      CSI_ENDPOINT:               unix:///csi/csi.sock
      X_CSI_MODE:                 node
      X_CSI_SPEC_REQ_VALIDATION:  false
      VSPHERE_CSI_CONFIG:         /etc/cloud/csi-vsphere.conf
      LOGGER_LEVEL:               PRODUCTION
      X_CSI_LOG_LEVEL:            INFO
      NODE_NAME:                   (v1:spec.nodeName)
    Mounts:
      /csi from plugin-dir (rw)
      /dev from device-dir (rw)
      /etc/cloud from vsphere-config-volume (rw)
      /var/lib/kubelet from pods-mount-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-glb6m (ro)
  liveness-probe:
    Container ID:  containerd://3ccf0d77472d57ac853a20305fd7862c97163b2509e40977cdc735e26b21665a
    Image:         quay.io/k8scsi/livenessprobe:v2.1.0
    Image ID:      quay.io/k8scsi/livenessprobe@sha256:04a9c4a49de1bd83d21e962122da2ac768f356119fb384660aa33d93183996c3
    Port:          <none>
    Host Port:     <none>
    Args:
      --csi-address=/csi/csi.sock
    State:          Running
      Started:      Fri, 24 Nov 2023 19:31:54 +0000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /csi from plugin-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-glb6m (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  vsphere-config-volume:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  csi-vsphere-config
    Optional:    false
  registration-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/plugins_registry
    HostPathType:  Directory
  plugin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/plugins/csi.vsphere.vmware.com/
    HostPathType:  DirectoryOrCreate
  pods-mount-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet
    HostPathType:  Directory
  device-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /dev
    HostPathType:  
  kube-api-access-glb6m:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 :NoSchedule op=Exists
                             :NoExecute op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason            Age                      From     Message
  ----     ------            ----                     ----     -------
  Warning  DNSConfigForming  28s (x20490 over 2d20h)  kubelet  Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.242.46.35 10.242.46.36 10.250.46.36

Environment:

  • Cluster-api-provider-vsphere version: 1.5.3
  • Kubernetes version: (use kubectl version): 1.27.3
  • OS (e.g. from /etc/os-release): Ubuntu 22.04 OVA image that vSphere recommends (with no changes to the OVA).
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Nov 27, 2023
@chrischdi
Member

Could you take a look at why vsphere-csi-controller-664c45f69b-6ddz4 is Pending (via kubectl describe pod)?
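
For example (pod name and kubeconfig file taken from the output above):

kubectl describe pod vsphere-csi-controller-664c45f69b-6ddz4 -n kube-system --kubeconfig=mcluster.kubeconfig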

If I understand correctly, this pod needs to be up first so that the DaemonSet pods can succeed.

Did you use the default templates provided by CAPV or did you manually deploy CSI?

@dattebayo6716
Author

dattebayo6716 commented Dec 14, 2023

I posted sample output from kubectl describe <pod> above.

I used the default template and followed the instructions on the quick-start page to generate the cluster YAML file.
I am not using the YAML files from the templates folder.

@chrischdi
Member

chrischdi commented Dec 14, 2023

So something prevents the vsphere-csi-controller from getting scheduled. There may be taints or something else causing this.

You need to figure out why that is; then the DaemonSet pods should also become ready.
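
A quick sketch of how to check for taints and for scheduler events (the kubeconfig file is the one from the output above; FailedScheduling is the reason the scheduler normally reports when a pod cannot be placed):

kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints --kubeconfig=mcluster.kubeconfig
kubectl get events -n kube-system --field-selector reason=FailedScheduling --kubeconfig=mcluster.kubeconfig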

@rvanderp3
Contributor

Can you get the events from that namespace?
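
For example, something like:

kubectl get events -n kube-system --sort-by=.lastTimestamp --kubeconfig=mcluster.kubeconfig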

@habibullinrsh

habibullinrsh commented Feb 1, 2024

The csi-node-driver that the tigera-operator installs conflicts with vsphere-csi-node. I couldn't disable the installation of csi-node-driver, so I install Calico with kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/calico.yaml instead.

@chrischdi
Member

It would be interesting to figure out, together with https://github.com/kubernetes/cloud-provider-vsphere, where the gaps are so that both can run at the same time (for CSI we simply consume the above).

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 1, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 31, 2024