Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ Fix leader election issue with workspace controller and other KCP Controllers #3111

Conversation

sankar17
Copy link

@sankar17 sankar17 commented Apr 4, 2024

This PR addresses the following,

Change the workspace controllers start logic inside runners to fix leader election issue.
The way we register controllers and define the runner is problematic, the runner calls start only. but in case leader election is lost start finishes (as it was waiting on <- ctx.Done()) which leads to the defer on the queue.Shutdown() to run. Once you shutdown a queue, there’s no way to restart it

Background:
At times we faced workspace controller creation stuck at scheduling phase and never recovers. Regarding leader election the requests/events queued to both leader and other pods aswell , this makes the queue depth to grow.

@kcp-ci-bot
Copy link
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kcp-ci-bot kcp-ci-bot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. dco-signoff: no Indicates the PR's author has not signed the DCO. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 4, 2024
@kcp-ci-bot
Copy link
Contributor

Hi @sankar17. Thanks for your PR.

I'm waiting for a kcp-dev member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sankar17 sankar17 changed the title ✨ Set controllers rest config to 30 secs and Fix leader election issue with workspace controller ✨ Set controllers rest config Timeout to 30 secs and Fix leader election issue with workspace controller Apr 4, 2024
@sankar17 sankar17 changed the title ✨ Set controllers rest config Timeout to 30 secs and Fix leader election issue with workspace controller ✨ Set controllers rest config timeout to 30 secs and Fix leader election issue with workspace controller Apr 4, 2024
@sankar17 sankar17 force-pushed the sankar17/rest-cfg-timeout-and-workspace-ctrl-leader-election-fix branch from 55f9e70 to 987b9a8 Compare April 4, 2024 09:08
@kcp-ci-bot kcp-ci-bot added dco-signoff: yes Indicates the PR's author has signed the DCO. and removed dco-signoff: no Indicates the PR's author has not signed the DCO. labels Apr 4, 2024

if err := s.registerController(&controllerWrapper{
Name: workspace.ControllerName,
Runner: func(ctx context.Context) {
workspaceController, err := workspace.NewController(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this issue happens with every controller, we need to to do it for the rest of the controllers

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this PR popped up I was wondering why it would only affect specific controllers, this seemed like a fairly fundamental flaw in how I implemented leader election.

@sankar17 is this something that you want to refactor at a bigger scale to address the architectural problem or would you like me to take a look?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@embik This has to be fixed for all the controllers, initially we got stuck with workspace controller creation stopped working with our longevity tests / over night test with 200 workspace creation / deletion . We saw the kcp metrics and the request queued to all 3/3 pods, though the leader election is enabled.

Currently I am working on changing this logic for all the controllers and need to do more tests.

if you think this is going to be fixed architectural level , please take a look at it.

CC: @yastij @palnabarun

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, taking a quick look at where queues are being started I think the approach in this PR is overall fine. @sankar17 feel free to carry on, we should change it for all controllers. I can also take over if you don't have time.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@embik I did changes to all the controllers, currently testing the fix, if all good I will update this PR in few days

@palnabarun
Copy link
Member

/ok-to-test

@kcp-ci-bot kcp-ci-bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 5, 2024
@sankar17 sankar17 force-pushed the sankar17/rest-cfg-timeout-and-workspace-ctrl-leader-election-fix branch from 11084cb to 3ce207d Compare April 10, 2024 14:31
@kcp-ci-bot kcp-ci-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 10, 2024
@ramramu3433
Copy link

/retest

@sankar17
Copy link
Author

/test pull-kcp-test-e2e

@ramramu3433
Copy link

/retest

@sankar17 sankar17 requested review from yastij and embik April 11, 2024 08:38
@embik
Copy link
Member

embik commented Apr 11, 2024

@sankar17 @ramramu3433 these failures across the e2e test board don't seem like flakes to me. Does make test-e2e work locally for you for this branch?

@sankar17
Copy link
Author

/retest

@sankar17
Copy link
Author

@sankar17 @ramramu3433 these failures across the e2e test board don't seem like flakes to me. Does make test-e2e work locally for you for this branch?

I will test and udpate

@embik
Copy link
Member

embik commented Apr 11, 2024

@sankar17 Please consider not running re-tests when tests are failing consistently, at least not without any code changes pushed. Those tests burn CI cycles without any real reason, we already know that they don't work.

@sankar17
Copy link
Author

@sankar17 Please consider not running re-tests when tests are failing consistently, at least not without any code changes pushed. Those tests burn CI cycles without any real reason, we already know that they don't work.

Sure I will make sure it works in local and do retest

@embik
Copy link
Member

embik commented Apr 11, 2024

Thanks!

@kcp-ci-bot kcp-ci-bot added the dco-signoff: no Indicates the PR's author has not signed the DCO. label Apr 11, 2024
@kcp-ci-bot kcp-ci-bot removed the dco-signoff: yes Indicates the PR's author has signed the DCO. label Apr 11, 2024
@sankar17 sankar17 changed the title ✨ Set controllers rest config timeout to 30 secs and Fix leader election issue with workspace controller ✨ Fix leader election issue with workspace controller and other KCP Controllers Apr 11, 2024
Comment on lines +685 to +715
// APIBinding indexers
indexers.AddIfNotPresentOrDie(s.KcpSharedInformerFactory.Apis().V1alpha1().APIBindings().Informer().GetIndexer(), cache.Indexers{
indexers.APIBindingsByAPIExport: indexers.IndexAPIBindingByAPIExport,
})

// APIExport indexers
indexers.AddIfNotPresentOrDie(s.KcpSharedInformerFactory.Apis().V1alpha1().APIExports().Informer().GetIndexer(), cache.Indexers{
indexers.ByLogicalClusterPathAndName: indexers.IndexByLogicalClusterPathAndName,
indexAPIExportsByAPIResourceSchema: indexAPIExportsByAPIResourceSchemasFunc,
})
indexers.AddIfNotPresentOrDie(s.CacheKcpSharedInformerFactory.Apis().V1alpha1().APIExports().Informer().GetIndexer(), cache.Indexers{
indexers.ByLogicalClusterPathAndName: indexers.IndexByLogicalClusterPathAndName,
indexAPIExportsByAPIResourceSchema: indexAPIExportsByAPIResourceSchemasFunc,
})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason for moving the indexer setup out of the controller creation to here (this happens in other parts of the code, so this is more of a placeholder reference for all those changes)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The core reason we moved registration of indexers outside the controller.NewFooController is because with this change, all of the controller.NewFooController calls are inside the Runner function which is called everytime a KCP pod becomes a leader.

indexers.AddIfNotPresentOrDie is supposed to be run only once in the lifecycle of a KCP pod because otherwise it will panic.

The mental model is that:

  • installFooController outside of Runner has steps that can or needed to run one-time only in the lifecycle of the KCP pod
  • Runner has all the steps that can be run multiple times in the runtime of a Pod.

Note: The same KCP pod can somehow lose leader-election and get elected again in due course of time. The changes in this PR enables that scenario.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@embik this is a little bit complicated to wrap at one go. Maybe we can explain over the community call later.

Copy link
Member

@embik embik Apr 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@palnabarun I first understood indexers.AddIfNotPresentOrDie to be idempotent since it filters out already present indexers (that's why I asked), but my feeling is the problem isn't that part of the logic, it's that calling indexer.AddIndexers even with an empty list of indexers (because indexers.AddIfNotPresentOrDie filtered out all existing indexers) will panic when the store is already started -- do I get this right? Or is the problem that it's called with an empty list of indexers (because they all already exist)?

I won't be on the next community call unfortunately, but I don't mind anyone else reviewing this and discussing it there either if that works better.

Copy link
Member

@embik embik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this PR looks good to me but it needs:

  1. Squashed commits.
  2. DCO sign-off on all commits.
  3. a release note in the PR description.

@kcp-ci-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from embik. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kcp-ci-bot kcp-ci-bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 16, 2024
@sankar17 sankar17 force-pushed the sankar17/rest-cfg-timeout-and-workspace-ctrl-leader-election-fix branch from 237b632 to f57951c Compare April 16, 2024 09:53
@kcp-ci-bot kcp-ci-bot added dco-signoff: yes Indicates the PR's author has signed the DCO. and removed dco-signoff: no Indicates the PR's author has not signed the DCO. labels Apr 16, 2024
@sankar17
Copy link
Author

/test pull-kcp-verify

@sankar17
Copy link
Author

/retest-required

@sankar17 sankar17 force-pushed the sankar17/rest-cfg-timeout-and-workspace-ctrl-leader-election-fix branch from 6d230d9 to 804f5c4 Compare April 16, 2024 12:34
@sankar17 sankar17 force-pushed the sankar17/rest-cfg-timeout-and-workspace-ctrl-leader-election-fix branch from 804f5c4 to 2706416 Compare April 23, 2024 15:50
indexAPIExportEndpointSliceByAPIExport = "indexAPIExportEndpointSliceByAPIExport"
indexAPIExportEndpointSlicesByPartition = "indexAPIExportEndpointSlicesByPartition"
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we qualify these more? indexWhatByWhat


return []string{}, nil
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would move all of these into an index.go. This file here is getting big.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure , we wil do

kubeClusterClient,
s.KcpSharedInformerFactory.Core().V1alpha1().LogicalClusters(),
s.KubeSharedInformerFactory.Rbac().V1().ClusterRoleBindings(),
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 What is the semantics here now in the Runner? Do we ensure the informer factor is started after all the runners? Looking at the code this can't be the case: controller.Start blocks. This means we have a race of the individual informers not being started if the factory is faster than the runners.

@sankar17 sankar17 force-pushed the sankar17/rest-cfg-timeout-and-workspace-ctrl-leader-election-fix branch from 2706416 to eb6571d Compare April 29, 2024 13:05
@kcp-ci-bot
Copy link
Contributor

@sankar17: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kcp-verify eb6571d link true /test pull-kcp-verify
pull-kcp-verify-codegen eb6571d link true /test pull-kcp-verify-codegen

Full PR test history

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@kcp-ci-bot kcp-ci-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 30, 2024
@kcp-ci-bot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sankar17
Copy link
Author

This is implemented with alternative approach #3132
, hence this PR is no longer needed

@sankar17 sankar17 closed this May 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dco-signoff: yes Indicates the PR's author has signed the DCO. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants