
Recovering from and diagnosing lxc copy hangs #13492

Closed
webdock-io opened this issue May 20, 2024 · 2 comments

lxd v5.21.1 LTS
Ubuntu Noble

zfs pool backed systems

We are seeing periodic hangs when doing lxc copy, which may or may not be network related (we are investigating that). Regardless, what happens is that lxc copy hangs indefinitely while a zfs send process sits in our process list. If we then conclude it must be hung, because we see no activity on the ZFS pool on the receiving side, and try to attach strace to the zfs send process, we get this:

# strace -p 2429972
strace: Process 2429972 attached
write(2, "warning: cannot send 'lxd/contai"..., 110) = 110
close(3)                                = 0
close(4)                                = 0
close(5)                                = 0
exit_group(1)                           = ?
+++ exited with 1 +++

And the process terminates at that point.

The lxc copy process still hangs around in the process list, however, and if we strace that, we get:

# strace -p 1765544
strace: Process 1765544 attached
futex(0x1121fe8, FUTEX_WAIT_PRIVATE, 0, NULL

So not only does zfs send hang, which may be something on our end, but LXD just sits there and doesn't realize zfs send has died.
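
For what it's worth, since the lxc client is a Go binary, our understanding is that sending it SIGQUIT should make the Go runtime print goroutine stack traces to the client's stderr before it exits, which would at least show where it is blocked. The PID below is just the hung client from the strace above:

# kill -QUIT 1765544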

If we then kill the lxc copy process, all may seem well and good, but if we try to redo the copy, we get a message from the target server along the lines of "Instance is busy performing a Create operation".

The only way we have found to recover from that is to reload the snap daemon on the target, which is bad, as that interrupts any other copy operations in flight with the "LXD is shutting down" message.
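
Something we plan to try that might be less disruptive than reloading the snap daemon (we haven't verified yet whether it actually clears this lock, so take it as a guess): list the operations on the target and cancel just the stuck "create" operation with the lxc operation subcommands, where <UUID> is the ID shown in the listing:

# lxc operation list
# lxc operation show <UUID>
# lxc operation delete <UUID>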

So, the issues here are a few:

  1. LXD does not try to detect whether zfs send is actually making progress; it would be pretty cool if it could, with timeouts or something like that (see the sketch after this list)
  2. LXD doesn't notice that the zfs send process has died, so it never moves on
  3. The receiving end is left in a bad state
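
Regarding point 1, a crude check we can do on our side today, without attaching strace to it, is to sample the I/O counters of the zfs send process from /proc a couple of times and see whether read_bytes/write_bytes keep growing (2429972 is just the example PID from above):

# grep -E 'read_bytes|write_bytes' /proc/2429972/io
# sleep 60
# grep -E 'read_bytes|write_bytes' /proc/2429972/io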

Any hints, information and tips to help us with this issue would be much appreciated :)

webdock-io commented May 21, 2024

A few followup notes from here which we've discovered in the past 24 hours:

We are seeing hanging behavior on specific hosts for specific instances, and the issue seems twofold: we occasionally see a hang on an initial full instance copy, but more reliably and reproducibly we see hangs on incremental --refresh copies. Here zfs send is never invoked and the lxc copy process just hangs indefinitely. Any hints as to how to debug what exactly it is doing would be helpful.
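
In case it helps, what we are going to try next (not sure yet how much it will reveal) is running the copy with the client's global --debug flag and watching the daemon event stream on both hosts while it runs:

# lxc copy --mode push --refresh --stateless --debug beldexnode1 backupserver:leeta-beldexnode1-primary
# lxc monitor --type=logging --pretty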

Second, we realized that just attaching strace to any zfs send process, whether it is working or not, causes it to abort with the message above. So the example/message we show in our original post here can basically be disregarded, as that's an artefact of us running strace -p on the zfs send process.

I really don't know why that kills zfs send, but it does.

Edit: Some supplemental information:

The initial refresh runs and copies all the data, then it just hangs indefinitely:

# lxc copy --mode push --refresh --stateless beldexnode1 backupserver:leeta-beldexnode1-primary
Transferring instance: beldexnode1: 383.00MB (53.86MB/s)

After waiting for up to 10 hours, we can break out, and if we retry we get:

# lxc copy --mode push --refresh --stateless beldexnode1 backupserver:leeta-beldexnode1-primary
Error: Failed getting exclusive access to instance: Instance is busy running a "create" operation

@webdock-io

We ended up going for the hardcore solution here: rebooting our host system and upgrading LXD to latest/stable. After that, copy operations worked smoothly again.

I guess this can be closed; however, it would be excellent if you could leave some pointers as to which log files we should look at if this happens again.
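
For reference, the places we have been looking so far on the snap install (these seem to be the relevant ones, but corrections welcome):

# snap logs lxd
# journalctl -u snap.lxd.daemon
# less /var/snap/lxd/common/lxd/logs/lxd.log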
