reboot nodes after cloudinit #160

Open
wants to merge 4 commits into
base: main

Conversation

bsdlp
Contributor

@bsdlp bsdlp commented Jul 10, 2023

while debugging #159 i noticed that after the package upgrades that kubitect applies when updateOnBoot is true, ubuntu reports that it needs a reboot to apply those upgrades. this change reboots the nodes using cloudinit after everything is done.

i'd also like to get your thoughts on whether or not this should be configurable / if updateOnBoot should be the one that sets reboot powerstate

@MusicDin
Owner

MusicDin commented Jul 10, 2023

I like the idea of rebooting VMs if updateOnBoot is true.

We could simply set in cloud-init:

power_state:
  mode: reboot
  condition: ${update}

However, a reboot can cause an error in the remote-exec provisioner that waits for cloud-init to finish. So we need to figure out how to prevent this.

@MusicDin
Owner

One way of doing it would be to duplicate the remote-exec block within the vm_domain.
The first provisioner can be set to ignore the error (on_failure = continue) and the second one should catch it (on_failure = fail).

This way, if the reboot happens, the error (lost connection) will be ignored, and the second provisioner will wait for cloud-init to finish. If cloud-init does not finish, the second provisioner will throw an error.

In that case, the second provisioner should have its timeout set to 1 minute or something like that.


This is just a quick-fix that would probably work, but I would prefer a more reliable approach (if possible).

@bsdlp
Contributor Author

bsdlp commented Jul 13, 2023

reading cloud-init documentation, it looks like cloud-init will report that it is done before the reboot is executed. as such, we should block on cloud-init status --wait and that should be sufficient. alternatively we can keep the logfile parsing loop but set a delay on powerstate to something greater than 2s to make sure that the remote-exec doesn't get interrupted by reboot. (i would prefer moving to calling the cloud-init cli)

from power state docs:

Using this module ensures that cloud-init is entirely finished with modules that would be executed.

fwiw i've initialized my nodes probably 100 times over the past week with reboot mode and haven't run into any issues
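
A minimal sketch of the "block on cloud-init status --wait" idea from this comment, assuming a POSIX shell on the node. The function name `wait_for_cloud_init` and the `CLOUD_INIT_BIN` override are hypothetical, introduced here only so the command can be swapped out for testing; they are not part of kubitect.

```shell
# Sketch only: block until cloud-init reports that it has finished.
# CLOUD_INIT_BIN is a hypothetical override (defaults to the real cloud-init CLI).
wait_for_cloud_init() {
  # "status --wait" blocks until cloud-init is done; its progress output is discarded.
  if "${CLOUD_INIT_BIN:-cloud-init}" status --wait > /dev/null 2>&1; then
    echo "cloud-init finished"
  else
    rc=$?  # exit status of the failed cloud-init call
    echo "cloud-init exited with status $rc" >&2
    return "$rc"
  fi
}
```

Calling `wait_for_cloud_init` from a remote-exec script would replace the log-parsing loop with a single blocking call; whether the reboot still interrupts the connection is a separate question.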

@MusicDin
Owner

MusicDin commented Jul 31, 2023

I would prefer the cloud-init status --wait command over the current log-parsing approach, with the addition of redirecting the standard output to /dev/null. Otherwise, the cloud-init command outputs a dot every second.

cloud-init status --wait > /dev/null

Regarding the reboot, I always run into the following error:

wait: remote command exited without exit status or exit signal

To me, it seems that either the VM is terminated too quickly or cloud-init reports the exit status too late. However, status: done is output before the error is raised.

I suggest that we add a second remote provisioner, where errors in the first one are ignored and the second one connects to the VM once it has rebooted. This way we also ensure that VMs are properly started after the reboot. Any thoughts on that?
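
The "connect again once the VM is back" step could be approximated outside Terraform as a bounded retry loop. This is a rough sketch, not the provisioner setup itself; `retry_until` is a hypothetical helper, and the `ssh user@node` call in the comment is a placeholder.

```shell
# Sketch only: retry a command until it succeeds or the attempts run out.
# Usage: retry_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Example (placeholder host): retry_until 12 5 ssh user@node 'cloud-init status --wait > /dev/null'
retry_until() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Success ends the loop immediately.
    if "$@"; then
      return 0
    fi
    attempt=$((attempt + 1))
    # Sleep only if another attempt follows.
    if [ "$attempt" -le "$max_attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}
```

Like the second provisioner, the loop tolerates the lost connection during the reboot and fails only if the VM never becomes reachable within the allowed attempts.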

@MusicDin
Owner

MusicDin commented Aug 2, 2023

I think that cloud-init terminates the VM just a bit too fast, which occasionally produces the above error. I've tried setting a reboot delay of 1 second, which passed all of my tests. Setting it to 3 or maybe even 5 seconds should be reliable enough for now:

power_state:
  mode: reboot
  condition: ${update}
  delay: 5

Any thoughts against this approach?

@bsdlp
Contributor Author

bsdlp commented Aug 14, 2023

ah - just realized that delay is in minutes - would a delay of 1 minute make more sense?

https://cloudinit.readthedocs.io/en/latest/reference/modules.html#power-state-change


2 participants