Buildbot infrastructure instability caused by time synchronization on latent workers #269

Open
pmisik opened this issue Jan 4, 2024 · 7 comments


pmisik commented Jan 4, 2024

Hi

I think there is Buildbot infrastructure instability caused by time synchronization on latent workers.
On latent workers p12-pd-??, I'm seeing bizarre errors that seem to be time sync related.
It looks as if time synchronization occurred while steps were executing.
I suspect a time sync issue because I am randomly seeing these problems:

@p12tic what do you think?

Member

verm commented Jan 4, 2024

I just checked, and it seems that ntpd was started incorrectly on service3, but I don't see anything in the logs about any time issues. When I restarted it, the adjustment was on the order of microseconds.

Author

pmisik commented Jan 5, 2024

I’m not sure whether you use VMs for running the worker machines.
Since the time offset was significant (16497 seconds = 04:34:57), I wonder if this is one of the time synchronization issues you can run into on VM infrastructure (at least I've encountered them):

  • The VM is booted from an image with time sync disabled at VM startup.
  • It then starts the buildbot-worker, which begins executing commands with the wrong time taken from the image/snapshot.
  • While the buildbot-worker is running, the ntpd daemon/service runs, synchronizes against the NTP server, and shifts the time.
  • During the run, the VM guest agent/daemon/service starts and synchronizes again, this time against the host, and shifts the time.
  • In a VM cluster with multiple nodes/hosts: while the VM is running, it is relocated from one node to another. The VM guest agent/daemon/service then synchronizes the time in the guest OS, which shifts the time.
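One way to confirm that the wall clock is being adjusted while a step runs would be to compare the wall-clock delta against the monotonic-clock delta over the same interval. The following is only a minimal sketch, not Buildbot code: the helper name `detect_clock_jump` and the `tolerance` parameter are made up for illustration, and it assumes the monotonic clock itself behaves normally in the worker environment, which has not been verified here.

```python
import time


def detect_clock_jump(wall_then, mono_then, wall_now, mono_now, tolerance=1.0):
    """Return the apparent wall-clock adjustment in seconds.

    The wall clock (time.time) can be stepped by NTP or a guest agent,
    while the monotonic clock (time.monotonic) is not supposed to go
    backwards. If both deltas agree within `tolerance`, report 0.0.
    """
    drift = (wall_now - wall_then) - (mono_now - mono_then)
    return drift if abs(drift) > tolerance else 0.0


# Hypothetical usage around a build step:
wall0, mono0 = time.time(), time.monotonic()
# ... the step would run here ...
jump = detect_clock_jump(wall0, mono0, time.time(), time.monotonic())
if jump:
    print(f"wall clock moved by {jump:+.3f} s during the step")
```

Logging a non-zero result next to each step would show whether the shifts line up with the failing builds.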

For example, https://buildbot.buildbot.net/#/builders/108/builds/2120 shows an interesting situation where the time was probably shifted twice.

  • Step 6 (/tmp/bbvenv/bin/pip install -e master -e worker) took 7 seconds according to the master's web UI, but according to the worker log it took elapsedTime=-16486.171892 (negative duration).

  • Step 7 (set -e) took 5 seconds according to the master's web UI, but according to the worker log it took elapsedTime=16497.727639 (positive duration).
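The two figures above can be turned into an implied clock shift per step by subtracting the duration seen by the master from the duration logged by the worker. This is a rough sketch under the assumption that the master's durations are approximately correct (they are rounded to whole seconds in the web UI); `implied_shift` and `hms` are illustrative helper names only.

```python
def implied_shift(worker_elapsed, master_elapsed):
    """Clock shift (seconds) that would explain the worker's elapsedTime."""
    return worker_elapsed - master_elapsed


def hms(seconds):
    """Format a signed number of seconds as [-]HH:MM:SS (fraction truncated)."""
    sign = "-" if seconds < 0 else ""
    hours, rem = divmod(int(abs(seconds)), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{sign}{hours:02d}:{minutes:02d}:{secs:02d}"


step6 = implied_shift(-16486.171892, 7)  # worker log vs. master web UI
step7 = implied_shift(16497.727639, 5)

print(hms(step6))  # -04:34:53
print(hms(step7))  # 04:34:52
```

Under that assumption, the clock went back by roughly 4 h 35 min during step 6 and forward by nearly the same amount during step 7, which is consistent with two shifts of opposite sign.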

Member

p12tic commented Jan 5, 2024

Interesting. These workers are on a machine I boot up when I want faster test execution. Recently I migrated them to podman containers using the gVisor container runtime. Probably gVisor doesn't fake syscalls well enough.

Member

verm commented Jan 5, 2024

Just checking in: this isn't an issue with time on the master? It sounds like it's not, but I want to make sure. If there's anything I need to do, let me know.

Member

p12tic commented Jan 5, 2024

@verm There are no issues with time on the master. For any issues with p12-* workers, the worker setup is the first suspect.

Member

verm commented Jan 6, 2024

@p12tic okay great!

Author

pmisik commented Jan 18, 2024

Now it looks like the p12-pd-? workers have run out of disk space on /home, because of errors like

error Error: ENOSPC: no space left on device, mkdir '/home/buildbot/...
https://buildbot.buildbot.net/#/builders/126/builds/128
https://buildbot.buildbot.net/#/builders/122/builds/1120

WARNING: Building wheel for buildbot failed: [Errno 28] No space left on device: '/home/buildbot/.cache/pip/wheels/62'
https://buildbot.buildbot.net/#/builders/127/builds/132

This applies at least to

  • p12-pd-4
  • p12-pd-5
  • p12-pd-6
  • p12-pd-14
  • p12-pd-19
  • p12-pd-33
  • p12-pd-34
  • p12-pd-37
  • p12-pd-38
  • p12-pd-39
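A quick check for this condition on a worker could look like the sketch below. It uses only the standard library's `shutil.disk_usage`; the helper name `has_free_space` and the 1 GiB threshold are arbitrary choices for illustration, not anything taken from the worker setup.

```python
import shutil


def has_free_space(path, min_free_bytes):
    """Return (ok, free_bytes) for the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return usage.free >= min_free_bytes, usage.free


# Hypothetical usage on a worker, with an assumed 1 GiB threshold:
#   ok, free = has_free_space("/home", 1024**3)
#   if not ok:
#       print(f"/home is low on space: {free} bytes free")
```

Running something like this before a build starts would turn the ENOSPC failures into an explicit, early signal.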


3 participants