✨ Fix leader election issue with workspace controller and other KCP Controllers #3111

sankar17 · 2024-04-04T09:03:56Z

This PR addresses the following:

Change the workspace controller's start logic inside the runners to fix the leader election issue.
The way we register controllers and define the runner is problematic: the runner only calls start. When leader election is lost, start returns (it was waiting on <-ctx.Done()), which causes the deferred queue.ShutDown() to run. Once a queue is shut down, there is no way to restart it.
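The failure mode described above can be sketched with stand-in types. This is a minimal illustration using only the standard library, not the kcp or client-go code: `queue`, `controller`, `runTerm`, `reuseAcrossTerms`, and `rebuildPerTerm` are hypothetical names, and the queue is assumed to reject work permanently once it has been shut down.

```go
package main

import (
	"context"
	"fmt"
)

// queue is a hypothetical stand-in for a work queue: after ShutDown it
// rejects new items and cannot be reopened.
type queue struct{ closed bool }

// Add reports whether the queue accepted the item.
func (q *queue) Add(item string) bool { return !q.closed }

// ShutDown permanently closes the queue.
func (q *queue) ShutDown() { q.closed = true }

// controller owns exactly one queue for its whole lifetime.
type controller struct{ q *queue }

func newController() *controller { return &controller{q: &queue{}} }

// Start tries to enqueue one item, blocks until ctx is done (leadership
// lost), and then shuts the queue down via defer. It reports whether the
// queue accepted work during this term.
func (c *controller) Start(ctx context.Context) bool {
	defer c.q.ShutDown()
	accepted := c.q.Add("workspace-1")
	<-ctx.Done()
	return accepted
}

// runTerm simulates one leadership term that is lost right away.
func runTerm(runner func(ctx context.Context) bool) bool {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return runner(ctx)
}

// reuseAcrossTerms builds the controller once, outside the runner, so the
// second term reuses a queue that is already shut down.
func reuseAcrossTerms() (first, second bool) {
	c := newController()
	runner := func(ctx context.Context) bool { return c.Start(ctx) }
	first = runTerm(runner)
	second = runTerm(runner)
	return first, second
}

// rebuildPerTerm builds the controller inside the runner, so every term
// gets a fresh queue.
func rebuildPerTerm() (first, second bool) {
	runner := func(ctx context.Context) bool { return newController().Start(ctx) }
	first = runTerm(runner)
	second = runTerm(runner)
	return first, second
}

func main() {
	a, b := reuseAcrossTerms()
	fmt.Println("reused controller: ", a, b)
	c, d := rebuildPerTerm()
	fmt.Println("rebuilt controller:", c, d)
}
```

In this sketch the reused controller accepts work only in its first term, while the controller rebuilt inside the runner accepts work in every term.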

Background:
At times, workspace creation by the workspace controller got stuck in the scheduling phase and never recovered. Even with leader election enabled, requests/events were queued on both the leader and the other pods, which made the queue depth grow.

kcp-ci-bot · 2024-04-04T09:03:59Z

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

kcp-ci-bot · 2024-04-04T09:04:09Z

Hi @sankar17. Thanks for your PR.

I'm waiting for a kcp-dev member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

yastij · 2024-04-05T17:29:41Z

pkg/server/controllers.go


 if err := s.registerController(&controllerWrapper{
 Name: workspace.ControllerName,
 Runner: func(ctx context.Context) {
+ workspaceController, err := workspace.NewController(


This issue happens with every controller; we need to do it for the rest of the controllers.

After this PR popped up I was wondering why it would only affect specific controllers; this seemed like a fairly fundamental flaw in how I implemented leader election.

@sankar17 is this something that you want to refactor at a bigger scale to address the architectural problem or would you like me to take a look?

@embik This has to be fixed for all the controllers. Initially, we hit it when workspace creation stopped working during our longevity/overnight tests with 200 workspace creations and deletions. The kcp metrics showed requests queued on all 3/3 pods, even though leader election was enabled.

Currently I am working on changing this logic for all the controllers and need to do more tests.

If you think this should be fixed at the architectural level, please take a look at it.

CC: @yastij @palnabarun

Okay, taking a quick look at where queues are being started I think the approach in this PR is overall fine. @sankar17 feel free to carry on, we should change it for all controllers. I can also take over if you don't have time.

@embik I made the changes to all the controllers and am currently testing the fix. If all is good, I will update this PR in a few days.

palnabarun · 2024-04-05T19:13:49Z

/ok-to-test

ramramu3433 · 2024-04-11T06:33:53Z

/retest

sankar17 · 2024-04-11T07:47:29Z

/test pull-kcp-test-e2e

ramramu3433 · 2024-04-11T08:35:54Z

/retest

embik · 2024-04-11T08:52:00Z

@sankar17 @ramramu3433 these failures across the e2e test board don't seem like flakes to me. Does make test-e2e work locally for you for this branch?

sankar17 · 2024-04-11T09:20:01Z

/retest

sankar17 · 2024-04-11T09:20:37Z

> @sankar17 @ramramu3433 these failures across the e2e test board don't seem like flakes to me. Does make test-e2e work locally for you for this branch?

I will test and update.

embik · 2024-04-11T09:28:51Z

@sankar17 Please consider not running re-tests when tests are failing consistently, at least not without any code changes pushed. Those tests burn CI cycles without any real reason, we already know that they don't work.

sankar17 · 2024-04-11T09:31:55Z

> @sankar17 Please consider not running re-tests when tests are failing consistently, at least not without any code changes pushed. Those tests burn CI cycles without any real reason, we already know that they don't work.

Sure, I will make sure it works locally before retesting.

embik · 2024-04-11T09:34:04Z

Thanks!

embik · 2024-04-14T20:09:58Z

pkg/server/controllers.go

+ // APIBinding indexers
+ indexers.AddIfNotPresentOrDie(s.KcpSharedInformerFactory.Apis().V1alpha1().APIBindings().Informer().GetIndexer(), cache.Indexers{
+ indexers.APIBindingsByAPIExport: indexers.IndexAPIBindingByAPIExport,
+ })
+
+ // APIExport indexers
+ indexers.AddIfNotPresentOrDie(s.KcpSharedInformerFactory.Apis().V1alpha1().APIExports().Informer().GetIndexer(), cache.Indexers{
+ indexers.ByLogicalClusterPathAndName: indexers.IndexByLogicalClusterPathAndName,
+ indexAPIExportsByAPIResourceSchema: indexAPIExportsByAPIResourceSchemasFunc,
+ })
+ indexers.AddIfNotPresentOrDie(s.CacheKcpSharedInformerFactory.Apis().V1alpha1().APIExports().Informer().GetIndexer(), cache.Indexers{
+ indexers.ByLogicalClusterPathAndName: indexers.IndexByLogicalClusterPathAndName,
+ indexAPIExportsByAPIResourceSchema: indexAPIExportsByAPIResourceSchemasFunc,
+ })


What is the reason for moving the indexer setup out of the controller creation to here (this happens in other parts of the code, so this is more of a placeholder reference for all those changes)?

The core reason we moved the registration of indexers outside of controller.NewFooController is that, with this change, all of the controller.NewFooController calls are inside the Runner function, which is called every time a KCP pod becomes the leader.

indexers.AddIfNotPresentOrDie is supposed to run only once in the lifecycle of a KCP pod, because otherwise it will panic.

The mental model is that:

The part of installFooController outside of Runner holds the steps that can, or need to, run only once in the lifecycle of the KCP pod.

Runner holds all the steps that can be run multiple times during the runtime of a pod.

Note: The same KCP pod can lose leader election and get elected again in due course. The changes in this PR enable that scenario.
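The split between one-time setup and repeatable Runner steps can be sketched as follows. This is a minimal illustration with stand-in types, not the kcp code: `store`, `addIndexer`, `setupInsideRunner`, and `setupOutsideRunner` are hypothetical, and the sketch assumes a store that refuses new indexers once it has been started, which is only one possible reason a repeated registration could fail.

```go
package main

import (
	"errors"
	"fmt"
)

// store is a hypothetical stand-in for an informer's index store.
// Assumption for illustration: indexers cannot be added once it has started.
type store struct {
	started  bool
	indexers map[string]bool
}

func newStore() *store { return &store{indexers: map[string]bool{}} }

// addIndexer registers an index, failing if the store is already running.
func (s *store) addIndexer(name string) error {
	if s.started {
		return errors.New("store already started: cannot add indexer " + name)
	}
	s.indexers[name] = true
	return nil
}

// indexName is an illustrative index name, not a kcp constant.
const indexName = "byLogicalClusterPathAndName"

// setupInsideRunner registers the indexer on every leadership term.
// The first term starts the store, so the second term fails.
func setupInsideRunner(s *store) func() error {
	return func() error {
		if err := s.addIndexer(indexName); err != nil {
			return err
		}
		s.started = true
		return nil
	}
}

// setupOutsideRunner registers the indexer once at install time and
// returns a runner that contains only repeatable steps.
func setupOutsideRunner(s *store) (func() error, error) {
	if err := s.addIndexer(indexName); err != nil {
		return nil, err
	}
	return func() error {
		s.started = true
		return nil
	}, nil
}

func main() {
	inside := setupInsideRunner(newStore())
	fmt.Println("inside, term 1:", inside())
	fmt.Println("inside, term 2:", inside())

	outside, err := setupOutsideRunner(newStore())
	if err != nil {
		panic(err)
	}
	fmt.Println("outside, term 1:", outside())
	fmt.Println("outside, term 2:", outside())
}
```

With the indexer registered outside the runner, the runner can be invoked for any number of leadership terms without touching the one-time setup again.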

@embik this is a bit complicated to explain in one go. Maybe we can explain it on the community call later.

@palnabarun I first understood indexers.AddIfNotPresentOrDie to be idempotent since it filters out already present indexers (that's why I asked), but my feeling is the problem isn't that part of the logic, it's that calling indexer.AddIndexers even with an empty list of indexers (because indexers.AddIfNotPresentOrDie filtered out all existing indexers) will panic when the store is already started -- do I get this right? Or is the problem that it's called with an empty list of indexers (because they all already exist)?

I won't be on the next community call unfortunately, but I don't mind anyone else reviewing this and discussing it there either if that works better.

embik

Overall this PR looks good to me but it needs:

Squashed commits.
DCO sign-off on all commits.
A release note in the PR description.

kcp-ci-bot · 2024-04-16T09:35:13Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from embik. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

sankar17 · 2024-04-16T12:09:44Z

/test pull-kcp-verify

sankar17 · 2024-04-16T12:26:27Z

/retest-required

sttts · 2024-04-24T06:50:45Z

pkg/server/controllers.go

+ indexAPIExportEndpointSliceByAPIExport = "indexAPIExportEndpointSliceByAPIExport"
+ indexAPIExportEndpointSlicesByPartition = "indexAPIExportEndpointSlicesByPartition"
+)
+


can we qualify these more? indexWhatByWhat

sttts · 2024-04-24T06:51:28Z

pkg/server/controllers.go

+
+ return []string{}, nil
+}
+


would move all of these into an index.go. This file here is getting big.

Sure, we will do that.

sttts · 2024-04-24T06:54:09Z

pkg/server/controllers.go

+ kubeClusterClient,
+ s.KcpSharedInformerFactory.Core().V1alpha1().LogicalClusters(),
+ s.KubeSharedInformerFactory.Rbac().V1().ClusterRoleBindings(),
+ )


🔴 What are the semantics here now in the Runner? Do we ensure the informer factory is started after all the runners? Looking at the code this can't be the case: controller.Start blocks. This means we have a race where individual informers are not started if the factory is faster than the runners.

In a few controllers, waiting for the informer caches to sync is already implemented with a wait call before start. Can we just add this to all the controllers whose informers need to be synced before start?

https://github.com/sankar17/kcp/blob/sankar17/rest-cfg-timeout-and-workspace-ctrl-leader-election-fix/pkg/server/controllers.go#L719-L732

https://github.com/sankar17/kcp/blob/sankar17/rest-cfg-timeout-and-workspace-ctrl-leader-election-fix/pkg/server/controllers.go#L912-L918

https://github.com/sankar17/kcp/blob/sankar17/rest-cfg-timeout-and-workspace-ctrl-leader-election-fix/pkg/server/controllers.go#L991-L997

https://github.com/sankar17/kcp/blob/sankar17/rest-cfg-timeout-and-workspace-ctrl-leader-election-fix/pkg/server/controllers.go#L1217-L1225

https://github.com/sankar17/kcp/blob/sankar17/rest-cfg-timeout-and-workspace-ctrl-leader-election-fix/pkg/server/controllers.go#L1254-L1261

https://github.com/sankar17/kcp/blob/sankar17/rest-cfg-timeout-and-workspace-ctrl-leader-election-fix/pkg/server/controllers.go#L1467-L1478
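The "wait for sync, then start" pattern proposed in this reply can be sketched with the standard library only. This is a hedged illustration, not the code behind the links above: `waitForSync`, `runner`, and `demo` are hypothetical names. Real controllers would typically rely on client-go's `cache.WaitForCacheSync`, which this sketch does not call, and the thread does not settle whether waiting alone closes the race raised in the review comment.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// waitForSync polls every hasSynced function until all of them report true
// or ctx is done. It returns false if ctx ended before the caches synced.
func waitForSync(ctx context.Context, interval time.Duration, hasSynced ...func() bool) bool {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		all := true
		for _, synced := range hasSynced {
			if !synced() {
				all = false
				break
			}
		}
		if all {
			return true
		}
		select {
		case <-ctx.Done():
			return false
		case <-ticker.C:
		}
	}
}

// runner starts the controller only after its caches report synced.
// It reports whether start was called.
func runner(ctx context.Context, hasSynced func() bool, start func(context.Context)) bool {
	if !waitForSync(ctx, time.Millisecond, hasSynced) {
		return false
	}
	start(ctx)
	return true
}

// demo simulates a cache that syncs after syncAfterMs milliseconds and a
// leadership term that lasts at most timeoutMs milliseconds.
func demo(syncAfterMs, timeoutMs int) bool {
	var synced atomic.Bool
	time.AfterFunc(time.Duration(syncAfterMs)*time.Millisecond, func() { synced.Store(true) })
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutMs)*time.Millisecond)
	defer cancel()
	return runner(ctx, synced.Load, func(context.Context) {})
}

func main() {
	fmt.Println("started when the cache syncs in time:", demo(5, 1000))
	fmt.Println("started when the term ends first:   ", demo(1000, 5))
}
```

The design choice here is that a runner whose context ends before the caches sync returns without starting the controller, so a short leadership term never processes items against an empty cache.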

CC: @yastij @palnabarun

Signed-off-by: sankarm <[email protected]>

kcp-ci-bot · 2024-04-29T13:13:05Z

@sankar17: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-kcp-verify	`eb6571d`	link	true	`/test pull-kcp-verify`
pull-kcp-verify-codegen	`eb6571d`	link	true	`/test pull-kcp-verify-codegen`

Full PR test history

kcp-ci-bot · 2024-05-30T13:52:17Z

PR needs rebase.

sankar17 · 2024-05-30T14:39:38Z

This is implemented with an alternative approach in #3132, hence this PR is no longer needed.

sankar17 changed the title ~~✨ Set controllers rest config to 30 secs and Fix leader election issue with workspace controller~~ ✨ Set controllers rest config Timeout to 30 secs and Fix leader election issue with workspace controller Apr 4, 2024

sankar17 changed the title ~~✨ Set controllers rest config Timeout to 30 secs and Fix leader election issue with workspace controller~~ ✨ Set controllers rest config timeout to 30 secs and Fix leader election issue with workspace controller Apr 4, 2024

sankar17 force-pushed the sankar17/rest-cfg-timeout-and-workspace-ctrl-leader-election-fix branch from 55f9e70 to 987b9a8 Compare April 4, 2024 09:08

kcp-ci-bot added dco-signoff: yes Indicates the PR's author has signed the DCO. and removed dco-signoff: no Indicates the PR's author has not signed the DCO. labels Apr 4, 2024

yastij reviewed Apr 5, 2024

kcp-ci-bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 5, 2024

sankar17 force-pushed the sankar17/rest-cfg-timeout-and-workspace-ctrl-leader-election-fix branch from 11084cb to 3ce207d Compare April 10, 2024 14:31

kcp-ci-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 10, 2024

sankar17 requested review from yastij and embik April 11, 2024 08:38

kcp-ci-bot added the dco-signoff: no Indicates the PR's author has not signed the DCO. label Apr 11, 2024

kcp-ci-bot removed the dco-signoff: yes Indicates the PR's author has signed the DCO. label Apr 11, 2024

sankar17 changed the title ~~✨ Set controllers rest config timeout to 30 secs and Fix leader election issue with workspace controller~~ ✨ Fix leader election issue with workspace controller and other KCP Controllers Apr 11, 2024

embik reviewed Apr 14, 2024

embik requested changes Apr 15, 2024

kcp-ci-bot assigned embik Apr 15, 2024

kcp-ci-bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 16, 2024

sankar17 force-pushed the sankar17/rest-cfg-timeout-and-workspace-ctrl-leader-election-fix branch from 237b632 to f57951c Compare April 16, 2024 09:53

kcp-ci-bot added dco-signoff: yes Indicates the PR's author has signed the DCO. and removed dco-signoff: no Indicates the PR's author has not signed the DCO. labels Apr 16, 2024

sankar17 force-pushed the sankar17/rest-cfg-timeout-and-workspace-ctrl-leader-election-fix branch from 6d230d9 to 804f5c4 Compare April 16, 2024 12:34

sankar17 force-pushed the sankar17/rest-cfg-timeout-and-workspace-ctrl-leader-election-fix branch from 804f5c4 to 2706416 Compare April 23, 2024 15:50

sttts reviewed Apr 24, 2024

leader election issue fixed for all the kcp controllers

eb6571d

Signed-off-by: sankarm <[email protected]>

sankar17 force-pushed the sankar17/rest-cfg-timeout-and-workspace-ctrl-leader-election-fix branch from 2706416 to eb6571d Compare April 29, 2024 13:05

kcp-ci-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 30, 2024

sankar17 closed this May 30, 2024

sankar17 commented Apr 4, 2024 • edited

kcp-ci-bot commented Apr 4, 2024

kcp-ci-bot commented Apr 4, 2024

yastij Apr 5, 2024

embik Apr 6, 2024

sankar17 Apr 7, 2024

embik Apr 9, 2024

sankar17 Apr 10, 2024

palnabarun commented Apr 5, 2024

ramramu3433 commented Apr 11, 2024

sankar17 commented Apr 11, 2024

ramramu3433 commented Apr 11, 2024

embik commented Apr 11, 2024 • edited

sankar17 commented Apr 11, 2024

sankar17 commented Apr 11, 2024

embik commented Apr 11, 2024

sankar17 commented Apr 11, 2024

embik commented Apr 11, 2024

embik Apr 14, 2024

palnabarun Apr 15, 2024

palnabarun Apr 15, 2024

embik Apr 15, 2024 • edited

embik left a comment

kcp-ci-bot commented Apr 16, 2024

sankar17 commented Apr 16, 2024

sankar17 commented Apr 16, 2024

sttts Apr 24, 2024

sttts Apr 24, 2024

sankar17 Apr 24, 2024

sttts Apr 24, 2024

sankar17 Apr 25, 2024

kcp-ci-bot commented Apr 29, 2024

kcp-ci-bot commented May 30, 2024

sankar17 commented May 30, 2024
