
Make lifecycle-manager HA #186

Open · 2rs2ts opened this issue Jan 10, 2024 · 0 comments

2rs2ts (Contributor) commented Jan 10, 2024

Is this a BUG REPORT or FEATURE REQUEST?: Feature Request

What happened: If you run multiple replicas of this software (so that it evicting itself isn't disruptive and doesn't trigger alerts about under-replicated deployments, which many cluster operators rely on to catch capacity or other uptime issues), the replicas step on each other's toes and log a bunch of errors and warnings. If you run just one replica, you get all the problems of having something important to your normal rollout operations occasionally disappear because it got evicted, which is suboptimal, to say the least.

What you expected to happen: I would like this project to do leader election via Leases (pretty easy to do with the Kubernetes Go SDK), and if the current leader disappears, the next leader should be able to pick up any ongoing operation, like a node drain or a response to a lifecycle event.
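For illustration, a minimal sketch of what Lease-based leader election with client-go could look like; the Lease name, namespace, and timing values below are assumptions, not anything lifecycle-manager does today:

```go
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Use the pod hostname as this candidate's identity.
	id, _ := os.Hostname()

	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "lifecycle-manager", // assumed Lease name
			Namespace: "lifecycle-manager", // assumed namespace
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Only the elected replica processes lifecycle hooks and drains nodes.
				runLifecycleManager(ctx)
			},
			OnStoppedLeading: func() {
				// Lost the lease (e.g. this pod got evicted); exit so another replica takes over.
				os.Exit(0)
			},
		},
	})
}

// runLifecycleManager is a placeholder for the existing event-processing loop.
func runLifecycleManager(ctx context.Context) {
	<-ctx.Done()
}
```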

Implementing leader election is pretty easy, but picking up where the previous leader left off might be a little more complicated. I'm not sure exactly how much more, since I'm not really familiar with the codebase, but I reckon the worst case would mean either writing logic that infers what the previous leader was up to (which you might already have, given that this was written without leader election and doesn't seem to just completely kill ASG updates when it evicts itself), or writing some sort of progress state to a ConfigMap.
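If progress state does need to be persisted, the ConfigMap write itself is small. A rough sketch, where the ConfigMap name and the fields of the state record are made up for illustration:

```go
package main

import (
	"context"
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// progressState is a hypothetical record of an in-flight operation that the
// next leader could read back and resume; the fields are illustrative only.
type progressState struct {
	InstanceID string `json:"instanceID"`
	NodeName   string `json:"nodeName"`
	Phase      string `json:"phase"` // e.g. "draining", "completing-lifecycle-action"
}

// saveProgress upserts the current state into a ConfigMap so a newly elected
// leader can pick up where the previous one left off.
func saveProgress(ctx context.Context, client kubernetes.Interface, namespace string, state progressState) error {
	raw, err := json.Marshal(state)
	if err != nil {
		return err
	}
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: "lifecycle-manager-progress", Namespace: namespace},
		Data:       map[string]string{"state": string(raw)},
	}
	_, err = client.CoreV1().ConfigMaps(namespace).Create(ctx, cm, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		_, err = client.CoreV1().ConfigMaps(namespace).Update(ctx, cm, metav1.UpdateOptions{})
	}
	return err
}
```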

How to reproduce it (as minimally and precisely as possible): For the errors in the logs, just run multiple replicas and do a normal ASG update. As for all the degenerate cases caused by having only 1 replica, you're probably already experiencing them in your own clusters, aren't you?

Anything else we need to know?:

Environment:

  • Kubernetes version: we run several versions, but I think this applies to any relatively modern version; Leases are not a brand-new feature.

Other debugging information (if applicable):

This is one of the kinds of errors you will see if you run multiple replicas:

time="2024-01-04T22:54:32Z" level=error msg="failed to complete lifecycle action: ValidationError: No active Lifecycle Action found with instance ID i-0f51b5dffa4dd4b12\n\tstatus code: 400, request id: 92d0b3d9-711a-4482-aab8-33265e7cab49"