Parallelize the eviction of pods with volumes #848
Comments
Any opinion @gardener/mcm-maintainers?
Hi Tim, we can support this, though I am doubtful whether we should make it configurable in the shoot YAML: a high value configured by an operator can lead to severe degradation and then a fair amount of effort diagnosing and troubleshooting such issues. However, I think we can introduce a fixed degree of parallelism for evicting pods with PVs after relevant testing of the behaviour on problematic providers like Azure. Now that we have implemented #781, we wait for all volumes to be detached from the node before proceeding to VM deletion, so the edge cases where still-attached volumes drive the attach/detach controller into timeouts are ameliorated.
Thanks for the feedback @elankath. It wasn't meant to be an option for shoot owners. The degree of parallelism can also be configured by Gardenlet via its config. |
That's fine. A "hidden knob" like a CLI option would work.
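Such a hidden knob could be exposed as an ordinary CLI flag on the controller, defaulting to the current serial behaviour. This is only a sketch; the flag name `max-parallel-pv-evictions` and the helper below are hypothetical, not an existing MCM or gardenlet option:

```go
package main

import (
	"flag"
	"fmt"
)

// parseDrainFlags parses a hypothetical drain-parallelism knob from CLI
// arguments. A value of 1 preserves today's serial eviction of pods with
// volumes; higher values allow that many evictions in flight at once.
func parseDrainFlags(args []string) (int, error) {
	fs := flag.NewFlagSet("machine-controller-manager", flag.ContinueOnError)
	maxParallel := fs.Int("max-parallel-pv-evictions", 1,
		"maximum number of pods with volumes evicted in parallel during drain (1 = serial)")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	return *maxParallel, nil
}

func main() {
	n, err := parseDrainFlags([]string{"--max-parallel-pv-evictions=4"})
	fmt.Println(n, err)
}
```

Because the flag is hidden (not surfaced in the shoot YAML), only operators who know about it would change it, which limits the troubleshooting risk raised above.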
How to categorize this issue?
/area performance
/kind enhancement
/priority 3
What would you like to be added:
MCM should provide a knob to configure the degree of parallel evictions for pods with volumes.
Why is this needed:
#262 established serial eviction of pods with volumes to make the overall node drain process faster, especially for cloud providers where many parallel detach/attach operations lead to rate limiting and huge back-offs.
On some infrastructures, and up to some degree of parallelism, evicting pods with volumes in parallel may yield a significant performance boost. Today, shoot clusters with many nodes often need a considerable amount of time to perform rolling updates, and we see this serialization as one of the root causes that can be improved.