
Stargate not being scheduled due to untolerated taint and service isn't being provisioned #1599

Open
mohmdnofal opened this issue Feb 24, 2023 · 2 comments
mohmdnofal commented Feb 24, 2023

Hello,

I'm trying to deploy a multi-cluster Cassandra setup. The data pods are coming up nicely; however, the Stargate pod is stuck with `0/9 nodes are available: 6 node(s) had untolerated taint {app: cassandra}, 7 node(s) didn't match Pod's node affinity/selector.` The other issue is that I'm not seeing the "stargate-service" in any of the clusters.

Setup: 4 clusters (1 control plane and 3 data planes) in AKS

Error:

```
kubectl -n k8ssandra-operator describe pod demo-dc1-northeurope-northeurope-1-stargate-deployment-6dfnd2ck
.....
QoS Class:       Burstable
Node-Selectors:
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason             Age                   From                Message
  ----     ------             ----                  ----                -------
  Warning  FailedScheduling   16m (x226 over 19h)   default-scheduler   0/9 nodes are available: 6 node(s) had untolerated taint {app: cassandra}, 7 node(s) didn't match Pod's node affinity/selector. preemption: 0/9 nodes are available: 9 Preemption is not helpful for scheduling..
  Normal   NotTriggerScaleUp  23s (x6871 over 19h)  cluster-autoscaler  pod didn't trigger scale-up: 3 node(s) had untolerated taint {app: cassandra}
```
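To see which nodes actually carry the `app: cassandra` taint and which labels are available for the rack `nodeAffinityLabels` to match, something like the following can help (a sketch using standard kubectl commands; the taint key/value and label names are the ones used in this cluster):

```
# Show the taints on every node; the Cassandra pools should report app=cassandra:NoSchedule
kubectl describe nodes | grep -E '^Name:|^Taints:'

# Show the zone and agentpool labels that the rack nodeAffinityLabels must match
kubectl get nodes -L topology.kubernetes.io/zone -L agentpool
```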

Configuration

```yaml
apiVersion: k8ssandra.io/v1alpha1
kind: K8ssandraCluster
metadata:
  name: demo
spec:
  cassandra:
    serverVersion: 4.0.3
    storageConfig:
      cassandraDataVolumeClaimSpec:
        storageClassName: managed-premium
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 5Gi
    config:
      jvmOptions:
        heapSize: 512M
    networking:
      hostNetwork: false
    datacenters:
      - metadata:
           name: dc1-northeurope
        k8sContext: cassandra-cluster-northeurope
        size: 3
        tolerations:
          - key: "app"
            operator: "Equal"
            value: "cassandra"
            effect: "NoSchedule"        
        stargate:
          size: 1
          heapSize: 256M
          allowStargateOnDataNodes: true                    
        racks:
          - name: northeurope-1
            nodeAffinityLabels:
                topology.kubernetes.io/zone: northeurope-1
                agentpool: cassandraz1
          - name: northeurope-2
            nodeAffinityLabels:
                topology.kubernetes.io/zone: northeurope-2
                agentpool: cassandraz2
          - name: northeurope-3
            nodeAffinityLabels:
                topology.kubernetes.io/zone: northeurope-3
                agentpool: cassandraz3                        
.....
```

I didn't configure tolerations on Stargate because the docs say tolerations will be inherited from the data pods: "Tolerations are tolerations to apply to the Stargate pods. Leave nil to let the controller reuse the same tolerations used for data pods in this datacenter." I also tried to explicitly add tolerations to Stargate, which yielded the same result.
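For illustration, explicitly adding the toleration to the Stargate template would look roughly like this (a sketch; it assumes the Stargate spec accepts a `tolerations` field in the same shape as the datacenter spec):

```yaml
        stargate:
          size: 1
          heapSize: 256M
          allowStargateOnDataNodes: true
          tolerations:
            - key: "app"        # same taint as on the Cassandra node pools
              operator: "Equal"
              value: "cassandra"
              effect: "NoSchedule"
```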

Helm Chart Version: k8ssandra-operator-0.39.3

Current State:

```
$ kubectl -n k8ssandra-operator get pods
NAME                                                              READY   STATUS    RESTARTS   AGE
demo-dc1-northeurope-northeurope-1-stargate-deployment-6dfwjc5f   0/1     Pending   0          24m
demo-dc1-northeurope-northeurope-1-sts-0                          2/2     Running   0          37m
demo-dc1-northeurope-northeurope-2-sts-0                          2/2     Running   0          37m
demo-dc1-northeurope-northeurope-3-sts-0                          2/2     Running   0          37m
k8ssandra-operator-77769c8855-8tj7j                               1/1     Running   0          3d2h
k8ssandra-operator-cass-operator-767d5ffb64-g7hmj                 1/1     Running   0          3d2h
```

```
$ kubectl -n k8ssandra-operator get svc
NAME                                               TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                                        AGE
demo-dc1-northeurope-additional-seed-service       ClusterIP   None                                                                        36m
demo-dc1-northeurope-all-pods-service              ClusterIP   None                         9042/TCP,8080/TCP,9103/TCP,9000/TCP            36m
demo-dc1-northeurope-service                       ClusterIP   None                         9042/TCP,9142/TCP,8080/TCP,9103/TCP,9000/TCP   36m
demo-seed-service                                  ClusterIP   None                                                                        36m
k8ssandra-operator-cass-operator-webhook-service   ClusterIP   10.0.180.22                  443/TCP                                        3d2h
k8ssandra-operator-webhook-service                 ClusterIP   10.0.130.216                 443/TCP                                        3d2h
```

mohmdnofal added the question (Further information is requested) label on Feb 24, 2023
Miles-Garnsey (Member) commented

Hi @mohmdnofal, can you let us know what you're trying to achieve here? It seems that you have some tainted nodes; is that because you are trying to cordon off a set of nodes to run only Cassandra?

vivekmishrasq1 commented

It might be because Stargate has a podAntiAffinity rule that keeps it off nodes where Cassandra pods are scheduled:

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: cassandra.datastax.com/cluster
              operator: In
              values:
                - demo
            - key: cassandra.datastax.com/datacenter
              operator: In
              values:
                - dc1
            - key: cassandra.datastax.com/rack
              operator: In
              values:
                - default
        namespaces:
          - k8ssandra-operator
```
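If that is the cause, one way to confirm it is to look at the affinity and tolerations that ended up on the generated Stargate deployment (a sketch; substitute the actual deployment name reported by `kubectl get deployments`):

```
# List deployments, then dump the scheduling constraints of the Stargate one
kubectl -n k8ssandra-operator get deployments
kubectl -n k8ssandra-operator get deployment <stargate-deployment-name> \
  -o jsonpath='{.spec.template.spec.affinity}{"\n"}{.spec.template.spec.tolerations}{"\n"}'
```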
