
Stargate not being scheduled due to untolerated taint and service isn't being provisioned #1599

Open
mohmdnofal opened this issue Feb 24, 2023 · 2 comments
mohmdnofal commented Feb 24, 2023

Hello,

I'm trying to deploy a multi-cluster Cassandra setup. The data pods are coming up nicely; however, the Stargate pod is stuck with `0/9 nodes are available: 6 node(s) had untolerated taint {app: cassandra}, 7 node(s) didn't match Pod's node affinity/selector.` The other issue is that I'm not seeing the "stargate-service" in any of the clusters.

Setup: 4 clusters (1 control plane and 3 data planes) in AKS

Error:

```
kubectl -n k8ssandra-operator describe pod demo-dc1-northeurope-northeurope-1-stargate-deployment-6dfnd2ck
.....
QoS Class:       Burstable
Node-Selectors:
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason             Age                   From                Message
  ----     ------             ----                  ----                -------
  Warning  FailedScheduling   16m (x226 over 19h)   default-scheduler   0/9 nodes are available: 6 node(s) had untolerated taint {app: cassandra}, 7 node(s) didn't match Pod's node affinity/selector. preemption: 0/9 nodes are available: 9 Preemption is not helpful for scheduling..
  Normal   NotTriggerScaleUp  23s (x6871 over 19h)  cluster-autoscaler  pod didn't trigger scale-up: 3 node(s) had untolerated taint {app: cassandra}
```
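To see which nodes actually carry the `app: cassandra` taint and which labels are available for the rack `nodeAffinityLabels` to match, something like the following can help (a sketch using standard kubectl commands; the taint key/value and label names are the ones used in this cluster):

```
# Show the taints on every node; the Cassandra pools should report app=cassandra:NoSchedule
kubectl describe nodes | grep -E '^Name:|^Taints:'

# Show the zone and agentpool labels that the rack nodeAffinityLabels must match
kubectl get nodes -L topology.kubernetes.io/zone -L agentpool
```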

Configuration

```yaml
apiVersion: k8ssandra.io/v1alpha1
kind: K8ssandraCluster
metadata:
  name: demo
spec:
  cassandra:
    serverVersion: 4.0.3
    storageConfig:
      cassandraDataVolumeClaimSpec:
        storageClassName: managed-premium
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 5Gi
    config:
      jvmOptions:
        heapSize: 512M
    networking:
      hostNetwork: false
    datacenters:
      - metadata:
           name: dc1-northeurope
        k8sContext: cassandra-cluster-northeurope
        size: 3
        tolerations:
          - key: "app"
            operator: "Equal"
            value: "cassandra"
            effect: "NoSchedule"        
        stargate:
          size: 1
          heapSize: 256M
          allowStargateOnDataNodes: true                    
        racks:
          - name: northeurope-1
            nodeAffinityLabels:
                topology.kubernetes.io/zone: northeurope-1
                agentpool: cassandraz1
          - name: northeurope-2
            nodeAffinityLabels:
                topology.kubernetes.io/zone: northeurope-2
                agentpool: cassandraz2
          - name: northeurope-3
            nodeAffinityLabels:
                topology.kubernetes.io/zone: northeurope-3
                agentpool: cassandraz3                        
.....
```

I didn't configure tolerations on Stargate because the docs say tolerations will be inherited from the data pods: "Tolerations are tolerations to apply to the Stargate pods. Leave nil to let the controller reuse the same tolerations used for data pods in this datacenter." I also tried to explicitly add tolerations to Stargate, which yielded the same result.
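For illustration, explicitly adding the toleration to the Stargate template would look roughly like this (a sketch; it assumes the Stargate spec accepts a `tolerations` field in the same shape as the datacenter spec):

```yaml
        stargate:
          size: 1
          heapSize: 256M
          allowStargateOnDataNodes: true
          tolerations:
            - key: "app"        # same taint as on the Cassandra node pools
              operator: "Equal"
              value: "cassandra"
              effect: "NoSchedule"
```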

Helm Chart Version: k8ssandra-operator-0.39.3

Current State:

```
$ kubectl -n k8ssandra-operator get pods
NAME                                                              READY   STATUS    RESTARTS   AGE
demo-dc1-northeurope-northeurope-1-stargate-deployment-6dfwjc5f   0/1     Pending   0          24m
demo-dc1-northeurope-northeurope-1-sts-0                          2/2     Running   0          37m
demo-dc1-northeurope-northeurope-2-sts-0                          2/2     Running   0          37m
demo-dc1-northeurope-northeurope-3-sts-0                          2/2     Running   0          37m
k8ssandra-operator-77769c8855-8tj7j                               1/1     Running   0          3d2h
k8ssandra-operator-cass-operator-767d5ffb64-g7hmj                 1/1     Running   0          3d2h
```

```
$ kubectl -n k8ssandra-operator get svc
NAME                                               TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                                        AGE
demo-dc1-northeurope-additional-seed-service       ClusterIP   None                                                                        36m
demo-dc1-northeurope-all-pods-service              ClusterIP   None                         9042/TCP,8080/TCP,9103/TCP,9000/TCP            36m
demo-dc1-northeurope-service                       ClusterIP   None                         9042/TCP,9142/TCP,8080/TCP,9103/TCP,9000/TCP   36m
demo-seed-service                                  ClusterIP   None                                                                        36m
k8ssandra-operator-cass-operator-webhook-service   ClusterIP   10.0.180.22                  443/TCP                                        3d2h
k8ssandra-operator-webhook-service                 ClusterIP   10.0.130.216                 443/TCP                                        3d2h
```

mohmdnofal added the question (Further information is requested) label on Feb 24, 2023
Miles-Garnsey (Member) commented

Hi @mohmdnofal, can you let us know what you're trying to achieve here? It seems that you have some tainted nodes; is that because you are trying to cordon off a set of nodes to run only Cassandra?

vivekmishrasq1 commented

It might be because Stargate has a podAntiAffinity rule that keeps it off nodes where Cassandra pods are scheduled:

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: cassandra.datastax.com/cluster
              operator: In
              values:
                - demo
            - key: cassandra.datastax.com/datacenter
              operator: In
              values:
                - dc1
            - key: cassandra.datastax.com/rack
              operator: In
              values:
                - default
        namespaces:
          - k8ssandra-operator
```
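If that is the cause, one way to confirm it is to look at the affinity and tolerations that ended up on the generated Stargate deployment (a sketch; substitute the actual deployment name reported by `kubectl get deployments`):

```
# List deployments, then dump the scheduling constraints of the Stargate one
kubectl -n k8ssandra-operator get deployments
kubectl -n k8ssandra-operator get deployment <stargate-deployment-name> \
  -o jsonpath='{.spec.template.spec.affinity}{"\n"}{.spec.template.spec.tolerations}{"\n"}'
```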
