Does the operator handle CASSANDRA-17883? #539

rhuffy · 2023-06-14T16:29:16Z

What happened?

While reading through open Cassandra issues, I came across CASSANDRA-17883. The issue is that, when a C* node is removed, its IP address gets added to a list of ignoredEndpoints in MigrationCoordinator. In the C* source, there is a TODO comment that describes the issue:

        // TODO The endpoint address is now ignored but when a node with the same address is added again later,
        //  there will be no way to include it in schema synchronization other than restarting each other node
        //  see https://issues.apache.org/jira/browse/CASSANDRA-17883 for details

When a pod bounces and comes up with a different IP, the old IP is removed from gossip, and I believe it's also added to ignoredEndpoints. If another pod bounces and gets that original IP, my concern is that any schema changes on that node will be ignored by the rest of the cluster.

Does the operator do anything to handle this situation?

What did you expect to happen?

No response

How can we reproduce it (as minimally and precisely as possible)?

I don't have a repro on a test k8s cluster since I'm not sure how to force pods to come up with particular IPs.

You can, however, reproduce it in Cassandra dtests with these steps:

1. Create a 3-node cluster (127.0.0.1, 127.0.0.2, 127.0.0.3).
2. Stop node1.
3. Stop node2, change its IP to 127.0.0.1, and start it again.
4. Create a keyspace on node2.
5. Assert that node3 receives that schema change.

Note that if node1 is restarted with some new IP, it will receive the schema change from node2, and pass it along to node3.
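
To make the concern easier to reason about, here is a toy Python model of the steps above. It is not Cassandra code and not a dtest: the `Node` class, `receive_schema`, and `run_repro` are hypothetical names, and the model simply assumes that node3 ends up with 127.0.0.1 in its ignored set once node1 is gone, which is the behavior the TODO comment describes.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a Cassandra node; not real Cassandra code."""
    address: str
    schema: set = field(default_factory=set)
    # Models the ignoredEndpoints list: addresses whose schema pushes are dropped.
    ignored_endpoints: set = field(default_factory=set)

    def receive_schema(self, sender_address, keyspaces):
        """Merge keyspaces from a sender unless its address is ignored."""
        if sender_address in self.ignored_endpoints:
            return False
        self.schema |= set(keyspaces)
        return True


def run_repro():
    node1 = Node("127.0.0.1")
    node2 = Node("127.0.0.2")
    node3 = Node("127.0.0.3")

    # Stop node1. ASSUMPTION: node3 now ignores node1's old address.
    node3.ignored_endpoints.add(node1.address)

    # Stop node2, change its IP to 127.0.0.1, and start it again.
    node2.address = "127.0.0.1"

    # Create a keyspace on node2 and try to propagate it to node3.
    node2.schema.add("ks1")
    delivered = node3.receive_schema(node2.address, node2.schema)
    return node2, node3, delivered


if __name__ == "__main__":
    node2, node3, delivered = run_repro()
    print("delivered to node3:", delivered)
    print("node3 schema:", sorted(node3.schema))
```

In this model the push from node2 is dropped because it arrives from an ignored address. A node that comes back on a fresh address (for example a new `Node("127.0.0.4")`) can still receive the keyspace from node2 and relay it to node3, which matches the note above.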

cass-operator version

1.15.0

Kubernetes version

1.24

Method of installation

No response

Anything else we need to know?

No response


burmanm · 2023-06-19T12:03:56Z

I assume this is the same as #130?

adejanovski · 2023-06-19T13:25:50Z

@burmanm, it seems like a different (although somewhat related) issue.
Here the nodes won't refuse to start, which is apparently what's described in #130.
I'm not sure how the operator could detect that 🤔 The other nodes are the ones ignoring the node that inherited an old IP, so that node cannot tell (or can it?) that it's getting ignored.
Unless we can detect some schema update failures in the mgmt-api and bounce the node so that it gets a new IP?
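
One possible shape for such a check, as a rough sketch only: compare the schema version each node reports and flag the cluster when more than one version is present. The function names below are hypothetical, and how the versions are collected is left open (the `schema_version` column in `system.local` or the output of `nodetool describecluster` are possible sources; whether the mgmt-api exposes this is an assumption). The input is just a plain mapping of endpoint to schema version string.

```python
from collections import defaultdict


def group_by_schema_version(versions):
    """Group endpoints by the schema version they report.

    `versions` maps endpoint address -> schema version string.
    Returns a dict of schema version -> sorted list of endpoints.
    """
    groups = defaultdict(list)
    for endpoint, version in versions.items():
        groups[version].append(endpoint)
    return {version: sorted(endpoints) for version, endpoints in groups.items()}


def in_schema_agreement(versions):
    """True when every reporting endpoint has the same schema version."""
    return len(set(versions.values())) <= 1


if __name__ == "__main__":
    # Hypothetical sample data; the version strings are made up.
    reported = {
        "10.0.0.5": "version-a",
        "10.0.0.6": "version-b",
        "10.0.0.7": "version-b",
    }
    print(in_schema_agreement(reported))
    print(group_by_schema_version(reported))
```

Brief disagreement is normal right after a schema change, so a check like this would only be meaningful if the disagreement persists past some grace period. It also cannot tell by itself which node should be bounced.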

rhuffy added the "bug" label (Something isn't working) on Jun 14, 2023

adejanovski added the "assess" label (Issues in the state 'assess') on Jun 19, 2023
