
Add MinReadySeconds to 5 in StatefulSets to prevent sudden double restarts #647

Open: burmanm wants to merge 3 commits into master

Conversation

@burmanm (Contributor) commented May 2, 2024

What this PR does:
Sets MinReadySeconds (default 5) on the StatefulSet.
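
For illustration, a minimal sketch (not the exact diff in this PR; the function name and values are only an example) of where MinReadySeconds sits on the StatefulSet spec, using the upstream appsv1 types:

package example

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// exampleStatefulSet shows the MinReadySeconds field on StatefulSetSpec:
// a pod must stay Ready for this many seconds before the StatefulSet
// controller treats it as available and moves on during a rolling update.
func exampleStatefulSet(replicas int32, template corev1.PodTemplateSpec) *appsv1.StatefulSet {
	return &appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{Name: "example-dc1-default-sts"},
		Spec: appsv1.StatefulSetSpec{
			Replicas:        &replicas,
			Template:        template,
			MinReadySeconds: 5, // the default this PR proposes
		},
	}
}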

Which issue(s) this PR fixes:
Fixes #648
Fixes #382

Checklist

  • Changes manually tested
  • Automated Tests added/updated
  • Documentation added/updated
  • CHANGELOG.md updated (not required for documentation PRs)
  • CLA Signed: DataStax CLA

@burmanm requested a review from a team as a code owner on May 2, 2024 at 12:03
@adejanovski (Contributor) left a comment


Issue: Sadly this doesn't prevent multiple restarts at once.
What probably happens is consecutive updates to the StatefulSet, with the second one landing after the first pod has already restarted.
We end up with one pod finishing its restart from the first change while another one goes through the second change.
[Screenshot attached: 2024-05-03 15:29]

I'm not sure what we can do about this, but minReadySeconds doesn't help.

@KeepItSimpleStupid

Thanks @burmanm & @adejanovski for these experiments after our exchange in k8ssandra/k8ssandra#1432: we were about to submit a pull request ;)

I believe the ability to configure MinReadySeconds is still valuable even if it doesn't solve our issue, but maybe default it to 0 (the k8s default when not specified) instead of 5s?

Concerning the issue, we first tried to work around it by disabling Reaper & Medusa on our clusters before a K8ssandra Operator / Cass Operator upgrade (Helm-chart based), to minimize the sources of changes to the StatefulSet, and re-enabling them after the rollout... but it's not bulletproof.

Interestingly, I noticed our other k8s stateful workloads use different strategies than the one used by cass-operator (StatefulSet + spec.updateStrategy.type = RollingUpdate), so I wonder whether they faced similar issues.

Maybe a way to solve this issue would be for cass-operator to switch to spec.updateStrategy.type = OnDelete and control the rollout completely?
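
For reference, a minimal sketch (assuming the upstream appsv1 types; not anything cass-operator does today) of what switching to the OnDelete strategy would look like on the StatefulSet spec:

package example

import appsv1 "k8s.io/api/apps/v1"

// withOnDeleteStrategy switches a StatefulSet to the OnDelete update
// strategy: a template change alone no longer restarts any pod, and the
// operator (or a human) must delete pods one by one to roll them to the
// new revision.
func withOnDeleteStrategy(sts *appsv1.StatefulSet) {
	sts.Spec.UpdateStrategy = appsv1.StatefulSetUpdateStrategy{
		Type: appsv1.OnDeleteStatefulSetStrategyType,
	}
}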

@burmanm (Contributor, Author) commented May 14, 2024

Well, there are two (three) things here. First of all, preventing a single StatefulSet (per rack, not per cluster/DC; those should be guarded by cass-operator, see below) from restarting multiple pods at once is a relatively simple change:

diff --git a/pkg/reconciliation/construct_statefulset.go b/pkg/reconciliation/construct_statefulset.go
index e48f3ce..a5443a7 100644
--- a/pkg/reconciliation/construct_statefulset.go
+++ b/pkg/reconciliation/construct_statefulset.go
@@ -142,7 +142,7 @@ func newStatefulSetForCassandraDatacenter(
                        },
                        Replicas:             &replicaCountInt32,
                        ServiceName:          dc.GetAllPodsServiceName(),
-                       PodManagementPolicy:  appsv1.ParallelPodManagement,
+                       PodManagementPolicy:  appsv1.OrderedReadyPodManagement,
                        Template:             *template,
                        VolumeClaimTemplates: volumeClaimTemplates,
                },

The bigger issue is how this is happening in the first place when no one is editing the StatefulSets directly (as no one should). I haven't been able to trigger this issue with just CassandraDatacenter changes (and the reproducer by @adejanovski doesn't either). Editing the StS directly bypasses the reconcile logic in cass-operator, which makes it odd. MinReadySeconds itself should have prevented that, since the controller-runtime cache would not be updated instantly in any case and we should be requeueing when an update happens. I don't think MinReadySeconds = 0 is necessarily even a good setting, since all of these can cause small concurrency issues.

And to be fair, I don't even know that MinReadySeconds fails to stop this behavior. Perhaps it works correctly in this case and the "reproducer" is simply wrong (since it's testing something other than real behavior). Sort of like we can't stop a user from deleting all the pods at the same time if they really want to.

The third problem is our use of the PodDisruptionBudget, which does not know about replication factors or availability requirements, so it can't actually enforce a sane cluster structure. If it knew everything is always queried at QUORUM, for example, it could enforce that. Also, it only targets the cluster-wide labeling, not per-rack settings. That's a much more complex issue to solve.
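
To make that concrete, here is a hypothetical sketch; the quorum helper, the label key, and the whole idea of a replication-aware budget are illustrative assumptions, not current cass-operator behavior:

package example

import (
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// quorum is the minimum number of replicas that must stay up to serve
// QUORUM requests for a given replication factor: floor(RF/2) + 1.
func quorum(rf int) int { return rf/2 + 1 }

// quorumAwarePDB is a hypothetical, replication-aware PodDisruptionBudget.
// Note that a PDB only gates evictions; the StatefulSet controller does not
// consult it during rolling updates.
func quorumAwarePDB(dcName string, rf int) *policyv1.PodDisruptionBudget {
	minAvailable := intstr.FromInt(quorum(rf))
	return &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: dcName + "-pdb"},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvailable,
			// Illustrative label key; a per-rack budget would need a
			// rack-level selector instead of a datacenter-wide one.
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"cassandra.datastax.com/datacenter": dcName},
			},
		},
	}
}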

@KeepItSimpleStupid

Hi!

About the first point:

> Well, there are two (three) things here. First of all, preventing a single StatefulSet (per rack, not per cluster/DC; those should be guarded by cass-operator, see below) from restarting multiple pods at once is a relatively simple change:

I believe that .spec.podManagementPolicy = Parallel affects only scaling operations, not updates, as documented here.

However, there seems to be a .spec.updateStrategy.rollingUpdate.maxUnavailable field, in alpha since k8s 1.24, but I imagine it's effectively 1 if not set or if the feature gate is not enabled 🤷‍♂️
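
For reference, a sketch of where that alpha field sits, assuming the upstream appsv1 types (the feature gate name, MaxUnavailableStatefulSet, is from memory):

package example

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// withMaxUnavailable sets the alpha maxUnavailable knob on the rolling
// update strategy. Without the feature gate enabled on the cluster, the
// controller behaves as if this were 1.
func withMaxUnavailable(sts *appsv1.StatefulSet, maxUnavailable int) {
	mu := intstr.FromInt(maxUnavailable)
	sts.Spec.UpdateStrategy = appsv1.StatefulSetUpdateStrategy{
		Type: appsv1.RollingUpdateStatefulSetStrategyType,
		RollingUpdate: &appsv1.RollingUpdateStatefulSetStrategy{
			MaxUnavailable: &mu,
		},
	}
}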

@KeepItSimpleStupid

> The bigger issue is how this is happening in the first place when no one is editing the StatefulSets directly (as no one should). I haven't been able to trigger this issue with just CassandraDatacenter changes (and the reproducer by @adejanovski doesn't either). Editing the StS directly bypasses the reconcile logic in cass-operator, which makes it odd. MinReadySeconds itself should have prevented that, since the controller-runtime cache would not be updated instantly in any case and we should be requeueing when an update happens. I don't think MinReadySeconds = 0 is necessarily even a good setting, since all of these can cause small concurrency issues.

> And to be fair, I don't even know that MinReadySeconds fails to stop this behavior. Perhaps it works correctly in this case and the "reproducer" is simply wrong (since it's testing something other than real behavior). Sort of like we can't stop a user from deleting all the pods at the same time if they really want to.

Maybe a good test would be to set MinReadySeconds to a ridiculously long delay, so that all events/webhooks are handled (almost) sequentially, and check whether the issue still appears?
We could test it here on our own clusters, which are often hit by the issue. That's also why I suggested defaulting it to 0 (the k8s default when not set): you could merge the PR without any breaking change for the community, while still letting us test our assumptions further ;)

@KeepItSimpleStupid

> The third problem is our use of the PodDisruptionBudget, which does not know about replication factors or availability requirements, so it can't actually enforce a sane cluster structure. If it knew everything is always queried at QUORUM, for example, it could enforce that. Also, it only targets the cluster-wide labeling, not per-rack settings. That's a much more complex issue to solve.

Just to state it: we encounter the issue on our single-DC / single-rack clusters with RF 3, which are always queried at QUORUM consistency level :/
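
(For context: QUORUM with RF 3 needs floor(3/2) + 1 = 2 live replicas, so only one replica can be down at a time; two pods restarting together is enough to fail QUORUM requests on the ranges they share.)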

@burmanm (Contributor, Author) commented May 14, 2024

> I believe that .spec.podManagementPolicy = Parallel affects only scaling operations, not updates, as documented here.

Well, that's what the documentation says, but when I tested it I got different behavior.

> However, there seems to be a .spec.updateStrategy.rollingUpdate.maxUnavailable field, in alpha since k8s 1.24, but I imagine it's effectively 1 if not set or if the feature gate is not enabled 🤷‍♂️

For this one, the documentation says it requires a feature gate to be enabled (it defaults to false), which makes it fairly useless for now.

> Just to state it: we encounter the issue on our single-DC / single-rack clusters with RF 3, which are always queried at QUORUM consistency level :/

The current setup is only setting minAvailable.

@burmanm (Contributor, Author) commented May 14, 2024

> The current setup is only setting minAvailable.

And on that note, a PDB only governs evictions; I don't think the StS controller cares about it.

(commit added) …nReadySeconds in StatefulSets. Set the default MinReadySeconds to 5 in StatefulSets to prevent sudden double restarts from statefulset controller.
@burmanm (Contributor, Author) commented May 14, 2024

In case you still want to test this PR, the easiest way is to use gh pr checkout.

So gh pr checkout 647, and then, depending on the target cluster, either make docker-build docker-logger-build and push the images somewhere (running from our source requires building the system-logger image as well), or, with kind, make docker-kind docker-logger-kind to get the correct images.

Then make deploy will deploy cass-operator to the cass-operator namespace with namespace scope (I should probably make this cluster scope for testing purposes, but in this PR it's still namespace scope).

@KeepItSimpleStupid

There are an issue and a PR open on the k8s repo about this:
Statefulset with podManagementPolicy=Parallel ignores minReadySeconds on statefulset rollout update
kubernetes/kubernetes#119234
kubernetes/kubernetes#123975

@burmanm (Contributor, Author) commented May 14, 2024

Awesome, so I wasn't going crazy when reading the statefulset controller code. In that case, my latest commit should hopefully prevent the double change from happening. There are also updates to the k8ssandra-operator upgrade process here that would further block any accidental restarts:

k8ssandra/k8ssandra-operator#1298

@KeepItSimpleStupid

Nice! I'll try to find some time this week to test all these changes :)

@burmanm (Contributor, Author) commented May 14, 2024

Yeah, I'll need to fix some more tomorrow; I made a mistake, and the canary upgrade process breaks these rules. So I need to find another way to do the prevention at the StS level.

As for OrderedReady, it didn't feel like it adds a lot of extra time when expanding clusters (seconds per node, not more than that, so irrelevant). I'll probably investigate it more to see whether we should apply that as well; if there's no real downside, we could prevent the behavior at the StS level too.

@burmanm (Contributor, Author) commented May 17, 2024

I've moved everything other than the MinReadySeconds / OrderedPodManagement change to another PR: #654
