bucc vm does not survive a restart #214

Open
damzog opened this issue Aug 14, 2020 · 10 comments

@damzog (Contributor) commented Aug 14, 2020

Hi,

We are still on 0.92 on OpenStack. I observed that after a restart of the bucc VM (e.g. via bucc ssh followed by shutdown -r now), the VM does not come up again: it reboots, but no monit processes are running and the persistent disk does not seem to be mounted properly, see below. Any ideas? Is it a stemcell problem?

bosh/0:/var/vcap/bosh/bin# monit summary
/var/vcap/monit/job/0024_nats.monitrc:3: Warning: the executable does not exist '/var/vcap/jobs/bpm/bin/bpm'
/var/vcap/monit/job/0024_nats.monitrc:4: Warning: the executable does not exist '/var/vcap/jobs/bpm/bin/bpm'
/var/vcap/monit/job/0023_postgres.monitrc:3: Warning: the executable does not exist '/var/vcap/jobs/postgres/bin/postgres_ctl'
/var/vcap/monit/job/0023_postgres.monitrc:5: Warning: the executable does not exist '/var/vcap/jobs/postgres/bin/postgres_ctl'
[...]
/bosh_dns_resolvconf_ctl'
/var/vcap/monit/job/0001_director-bosh-dns.monitrc:3: Warning: the executable does not exist '/var/vcap/jobs/bpm/bin/bpm'
/var/vcap/monit/job/0001_director-bosh-dns.monitrc:4: Warning: the executable does not exist '/var/vcap/jobs/bpm/bin/bpm'
monit: no status available -- the monit daemon is not running
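
A quick way to confirm the missing mount after such a reboot (a minimal check, assuming the standard BOSH layout where the persistent disk backs /var/vcap/store):

mount | grep /var/vcap/store   # no output means the persistent disk is not mounted
lsblk                          # the disk should still show up as a block device, just without a mount point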

@ramonskie (Contributor)

I only know of this problem in combination with the VirtualBox CPI; it could have gone wrong for several reasons.
Is this a one-time occurrence, or does it happen every time?

If the disk recorded in the state file (./state/state.json) still exists, then you should be able to just do a bucc up.
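
A minimal sketch of that check (the state.json field names are an assumption and may differ between BOSH CLI versions):

jq '.current_disk_id, .disks' state/state.json   # show the disk CID that bosh create-env recorded
openstack volume list | grep <disk-cid>          # on OpenStack, verify the volume still exists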

@damzog (Contributor, Author) commented Aug 14, 2020

Yes, it is reproducible. A plain bucc up doesn't help because it detects no changes and takes no action; bucc up --recreate recreates the VM and everything works fine again.

@owwweiha

This issue also occurs on vSphere. On reboot, /var/vcap/store and /var/vcap/data are not mounted. Workaround: execute bucc up with the --recreate flag.

@chewfred

I think we have the same problem with bucc up --lite --cpi=docker-desktop. When I restart the BOSH instance in Docker, none of the HTTPS requests work anymore.
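
For reference, a minimal way to reproduce this on the Docker CPI (the container name and director address are assumptions; look them up with docker ps and your own settings):

docker ps                                 # find the container backing the BOSH VM
docker restart <container-id>             # simulate the reboot
curl -k https://<director-ip>:25555/info  # hypothetical check; stops responding once the bug hits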

@ramonskie (Contributor)

Unfortunately this is a CPI issue; there is not much we can do about it from the bucc side.
We could try to fix it in the CPI and make a PR there,
if anyone is up for that?

@owwweiha

@ramonskie can you explain this in more detail? If I understood you correctly, this issue occurs with at least the OpenStack, Docker, vSphere and VirtualBox CPIs.

@ramonskie (Contributor)

I have not seen this issue occurring on vSphere,
only on Docker/VirtualBox, and that is due to how the disks are mounted by those specific CPIs in combination with the BOSH agent.

See this long-standing open issue: cloudfoundry/bosh-virtualbox-cpi-release#7
So in order to fix this, someone should fix those issues in the CPI/agent;
unfortunately we cannot duct-tape a fix into bucc in this case.
The only things we can do are either let the BOSH team know so they can prioritize the work,
or fix it ourselves and make a PR to BOSH.

@owwweiha

Well, we are facing this issue with the vSphere CPI, and @damzog, who opened this issue, uses the OpenStack CPI. That's why I'm asking. To me it sounds like it's not only a bug in the Docker/VirtualBox CPIs but in some other component. :(

@ramonskie (Contributor)

Is it reproducible?
Have you already done some preliminary debugging of this issue?
We are testing on vSphere with full upgrade scenarios etc. and have not seen these kinds of errors yet.

@owwweiha

Yes, I can reproduce this behaviour; I just did it again. We noticed this issue while performing some failover tests (e.g., vSphere HA moving and restarting the VM), but it is also reproducible by simply rebooting the bucc VM via the vSphere GUI or by using govc vm.power -r=true (see the sketch below).
As far as we know, updating bucc is not affected by this issue.
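
A minimal reproduction sketch with govc (the VM inventory path is an assumption):

govc vm.power -r=true /<datacenter>/vm/<bucc-vm>
# then, on the rebooted VM via bucc ssh:
mount | grep -E '/var/vcap/(store|data)'  # returns nothing when the bug hits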
