Update CrashLoopBackOff condition as soon as timeout expires #790

himanshu-kun · 2023-02-24T04:44:57Z

How to categorize this issue?

/area robustness
/kind bug
/priority 2

What happened:

Currently MCM doesn't turn CrashLoopBackoff(CLBF) machines to Failed as soon as creationTimeout expires , but delays have been observed which can range to 2min to any time.

There are 2 parts of the problem:

timeout check for CLBF machine is done after making CreateMachine() driver call . CreateMachine() itself could take any amount of time as it provisions VM on the cloudprovider (in Azure big delays like 30min or more have been seen before)
even if the 1) doesn't exist to contribute to the delay, the retry/re-push to the queue can also introduce the delay of ShortRetry(3min). This happens because the retry period is not calculated considering the time left before timeout but is just a constant value of ShortRetry(3min)

What you expected to happen:
Turn to Failed as soon as timeout expires.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

We could take inspiration from machineDeployment logic which turns Progressing condition to False with reason ProgressDeadlineExceeded as soon as the deadline exceeds.

Environment:

Kubernetes version (use kubectl version):
Cloud provider or hardware configuration:
Others:

The text was updated successfully, but these errors were encountered:

elankath · 2023-02-24T09:35:55Z

Grooming Discussion Results

Our retry handling in machine creation is present in machineCreateErrorHandler in machine_util.go. This method specifies a medium retry (3min) for enqueuing the machine object if the VM instance failed to be created.
The error handler method mentioned above leverages the helper method getCreateFailurePhase which transitions the the machine phase: CrashLoopBackoff ➡ Failed on creation timeout.
We should ideally transition the machine phase to Failed faster if we are very close to the creation timeout and if the next retry added to the current time exceeds the creation timeout.
To ensure that the above works fine, we should establish that Driver.CreateMachine returns within a bounded time. We can do this by making sure there is a deadline on the Context passed to the CreateMachine method. The deadline should be the time pending for us to reach machine creation timeout.

aaronfern · 2023-11-10T09:40:32Z

On further testing we saw that although this change is fine if MCM runs standalone, there is a scenario that when CA runs alongside this change can cause CAs backoff mechanism to not work.

The case we focussed on was CA backing off from a MachineDeployment if the Machine does not get ready and CA knows that there is another MachineDeployment whose Machine spec is sufficient and adequate for incoming workload. In normal circumstances, CA would backoff, scale down the MachineDeployment it had originally scaled up, and scale up a different MachineDeployment whose Machines would then be used.

With this change however (assuming MCMs --machine-creation-timeout and CAs --max-node-provision-time are the same), MCM will replace the Machine before CA has a chance to look at it and calculate it's backoff parameters. So from CAs point of view it won't see a node not in the Running state for longer than required for it to backoff from the MachineDeployment

Solution:

* Set the --machine-creation-timeout in MCM to 0 by default so that unless set, MCM will not have a timeout for machine creation. Field can be set by the user in the shoot spec

* Document this behaviour and advice that MCMs --machine-creation-timeout should be a value greater then CAs --max-node-provision-time

Updating the solution to be more specific and descriptive

Update MCM to be able to interpret a negative or zero timeout value for --machine-creation-timeout. This means that if this flag is set to 0 or some negative value, then MCM will not have a timeout for machine creation. This does not change the default value for this flag, and if no value is passed, MCM will still use it's current default
Update gardenlet code to set a negative value to --machine-creation-timeout for MCM be default if no value is set via the shoot spec. If a value is specified in the shoot spec, then the value specified is always used
This behaviour needs to be documented and there needs to be a recommendation to set MCMs --machine-creation-timeout value to be greater than CAs --max-node-provision-time value.
Validation needs to be added to the shoot to not allow machineCreationTimeout to be greater than maxNodeProvisionTime
No additional change from CA is needed

elankath · 2023-11-10T11:18:01Z

Regarding Set the --machine-creation-timeout in MCM to 0 by default - is this OK for standalone consumers of the MCM ? Do we have any stakeholders where folks use the MCM but not our fork of the CA?

unmarshall · 2023-11-10T11:54:37Z

We should just add a doc that the default value is set to 0. There are 2 different recommendations:

If MCM is used standalone without using CA then consumers should use a value > 0.
If MCM is used along with CA then:
1. It is recommended that you leave the default value of machine-creation-timeout to 0.
2. If there is a real need to set this value then machine-creation-timeout should be > CA's max-node-provision-time

himanshu-kun added the kind/bug Bug label Feb 24, 2023

gardener-robot added area/robustness Robustness, reliability, resilience related priority/2 Priority (lower number equals higher priority) labels Feb 24, 2023

himanshu-kun added the needs/planning Needs (more) planning with other MCM maintainers label Feb 24, 2023

himanshu-kun assigned aaronfern Apr 17, 2023

himanshu-kun mentioned this issue Jul 3, 2023

Consider failed machine as terminating gardener/autoscaler#118

Closed

aaronfern linked a pull request Sep 12, 2023 that will close this issue

Reconcile machine before creation deadline #849

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update CrashLoopBackOff condition as soon as timeout expires #790

Update CrashLoopBackOff condition as soon as timeout expires #790

himanshu-kun commented Feb 24, 2023 •

edited

elankath commented Feb 24, 2023 •

edited

aaronfern commented Nov 10, 2023 •

edited

elankath commented Nov 10, 2023

unmarshall commented Nov 10, 2023

Update CrashLoopBackOff condition as soon as timeout expires #790

Update CrashLoopBackOff condition as soon as timeout expires #790

Comments

himanshu-kun commented Feb 24, 2023 • edited

elankath commented Feb 24, 2023 • edited

Grooming Discussion Results

aaronfern commented Nov 10, 2023 • edited

Solution:

elankath commented Nov 10, 2023

unmarshall commented Nov 10, 2023

himanshu-kun commented Feb 24, 2023 •

edited

elankath commented Feb 24, 2023 •

edited

aaronfern commented Nov 10, 2023 •

edited