
🌱 Allow inplace update of fields related to deletion during Machine deletion #10589

Open · wants to merge 1 commit into base: main

Conversation

davidvossel

Fixes #10588
/area machine

What this PR does / why we need it:

Machines default to nodeDrainTimeout: 0s, which blocks indefinitely if a pod can't be evicted. The nodeDrainTimeout can't be changed in place from the MachineDeployment or MachineSet after a Machine is marked for deletion.

This results in a Machine that is wedged forever and can't be updated through the top-level objects that own it.

To fix this, this PR allows fields related to Machine deletion to be updated in place even when the Machine is marked for deletion.
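For illustration, a minimal sketch of the idea, assuming a helper inside the MachineSet controller's syncMachines flow (the function name syncDeletionTimeoutsInPlace and its shape are hypothetical, not the exact diff): when a Machine is already being deleted, only the deletion-related timeouts are synced in place from the MachineSet template.

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// syncDeletionTimeoutsInPlace is a hypothetical helper sketching the approach:
// for a Machine that is already being deleted, propagate only the
// deletion-related timeouts in place from the MachineSet template.
func syncDeletionTimeoutsInPlace(ctx context.Context, c client.Client, ms *clusterv1.MachineSet, m *clusterv1.Machine) error {
	if m.DeletionTimestamp.IsZero() {
		// Not deleting; the regular in-place sync handles this Machine.
		return nil
	}
	patchBase := client.MergeFrom(m.DeepCopy())
	m.Spec.NodeDrainTimeout = ms.Spec.Template.Spec.NodeDrainTimeout
	m.Spec.NodeDeletionTimeout = ms.Spec.Template.Spec.NodeDeletionTimeout
	m.Spec.NodeVolumeDetachTimeout = ms.Spec.Template.Spec.NodeVolumeDetachTimeout
	return c.Patch(ctx, m, patchBase)
}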

NOTE: I have not added unit tests for this PR yet. I want confirmation that this is an acceptable approach before investing time into testing.

@k8s-ci-robot k8s-ci-robot added area/machine Issues or PRs related to machine lifecycle management cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 10, 2024
@enxebre
Member

enxebre commented May 13, 2024

Thanks @davidvossel, the change makes sense to me. Smoother deletion is actually one of the supporting use cases for in-place propagation. Let's include some unit tests.

See related #5880 and #9285

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 13, 2024
@davidvossel
Author

Let's include some unit tests.

@enxebre I extended the existing unit test to cover the case of updating a deleting Machine.

@sbueringer sbueringer changed the title Allow inplace update of fields related to deletion during Machine deletion 🌱 Allow inplace update of fields related to deletion during Machine deletion May 14, 2024
@enxebre
Member

enxebre commented May 20, 2024

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 20, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 44936fae936d0eab3c39b86c432c24c5e199979d

@@ -362,8 +362,21 @@ func (r *Reconciler) syncMachines(ctx context.Context, machineSet *clusterv1.Mac
log := ctrl.LoggerFrom(ctx)
Member

Just trying to think through various cases where Machines belonging to MachineSets are deleted.

  1. MD is deleted

The following happens:

  • MD goes away
  • ownerRef triggers MS deletion
  • MS goes away
  • ownerRef triggers Machine deletion

=> The current PR doesn't help in this scenario, because the MS will already be gone when the deletionTimestamp is set on the Machines. In this case folks would have to modify the timeouts on each Machine individually.

I recently had a discussion with @vincepri that we should maybe consider changing our MD deletion flow: basically adding a finalizer on MD & MS, so MD & MS stick around until all Machines are gone. If we did this, the MS => Machine propagation of the timeouts implemented here would help for this case as well.

  2. MD is scaled down to 0

The following happens:

  • MD scales down MS to 0
  • MS deletes Machine

=> This PR helps in this case because the timeouts are then propagated from MS to Machine

  3. MD rollout

The following happens:

  • Someone updates the MD (e.g. bump the Kubernetes version)
  • MD creates a new MS and scales it up
  • In parallel MD scales down the old MS to 0

=> In this scenario the current PR won't help, because the MD controller does not propagate the timeouts from MD to all MS (only to the new/current one, not to the old ones)

I see how this PR addresses scenario 2. Wondering if we want to solve this problem more holistically. (maybe I also missed some cases, not sure)

Author

Here's what's going on... the use case is subtle, but an easy one to get trapped by.

  • A MS is created with the default node drain timeout of 0s (wait forever).
  • The MS needs to scale down to zero (but not be deleted). The intent is to bring this MS back online at some point.
  • The user discovers that the default node drain timeout is blocking the scale-down to zero. The user likely only encounters this drain block the first time they scale down to zero, because during normal scale-down operations there are typically other nodes available, which allows PDBs to be satisfied.

The outcome is that the user is now trapped. They can't gracefully scale the MS down to zero because the default node drain timeout can't be updated on the machines. So the user is either forced to take some manual action to tear down the machines or delete the MS.

By allowing the node drain timeout to be modified while the Machines are marked for deletion, we give the user a path to unblock themselves using the top-level API (either MS or MD) rather than mutating individual Machines or performing some other manual operation.

if !m.DeletionTimestamp.IsZero() {
	patch := client.MergeFrom(m.DeepCopy())
Member

Let's please use the patchHelper here (see e.g. l.1010). It's doing basically the same thing, but I would prefer using it so we can benefit from its error handling & potential improvements here as well (e.g. logging etc.)
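A rough sketch of the same update using the patch helper from sigs.k8s.io/cluster-api/util/patch, assuming the surrounding variable names (m, machineSet, r.Client, ctx) from the excerpt above and errors from github.com/pkg/errors; not the exact diff:

patchHelper, err := patch.NewHelper(m, r.Client)
if err != nil {
	return err
}
// Only deletion-related timeouts are synced in place for a deleting Machine.
m.Spec.NodeDrainTimeout = machineSet.Spec.Template.Spec.NodeDrainTimeout
m.Spec.NodeDeletionTimeout = machineSet.Spec.Template.Spec.NodeDeletionTimeout
m.Spec.NodeVolumeDetachTimeout = machineSet.Spec.Template.Spec.NodeVolumeDetachTimeout
if err := patchHelper.Patch(ctx, m); err != nil {
	return errors.Wrapf(err, "failed to update deletion timeouts on Machine %q", m.Name)
}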

Author

fixed

g.Expect(reconciler.syncMachines(ctx, ms, []*clusterv1.Machine{updatedInPlaceMutatingMachine, deletingMachine})).To(Succeed())
updatedDeletingMachine := deletingMachine.DeepCopy()

g.Eventually(func(g Gomega) {
Member

Eventually shouldn't be required here (syncMachines does synchronous calls and GetAPIReader() reads without cache)
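For illustration, a sketch of the assertion without Eventually, assuming the variable names (env, ms, deletingMachine) from the surrounding test:

// Direct read through the uncached API reader; no polling is needed because
// syncMachines has already patched the Machine synchronously.
updatedDeletingMachine := deletingMachine.DeepCopy()
g.Expect(env.GetAPIReader().Get(ctx, client.ObjectKeyFromObject(updatedDeletingMachine), updatedDeletingMachine)).To(Succeed())
g.Expect(updatedDeletingMachine.Spec.NodeDrainTimeout).To(Equal(ms.Spec.Template.Spec.NodeDrainTimeout))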

Author

fixed

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 30, 2024
@k8s-ci-robot k8s-ci-robot requested a review from enxebre May 30, 2024 14:45
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from enxebre. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@enxebre
Member

enxebre commented May 31, 2024

/lgtm
/assign @sbueringer

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 31, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: a7820897401d291e168e73cfc2ea745d5f2c8d87

Labels
area/machine Issues or PRs related to machine lifecycle management cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

MachineSet Inplace update does not work during machine deletion
4 participants