
Recovering from and diagnosing lxc copy hangs #13492

Closed
webdock-io opened this issue May 20, 2024 · 2 comments

lxd v5.21.1 LTS
Ubuntu Noble

zfs pool backed systems

We are seeing periodic hangs when doing lxc copy, which may or may not be network related (we are investigating that). Regardless, what happens is that lxc copy hangs indefinitely while a zfs send process sits in our process list. If we then conclude it must be hung, because we see no activity on the ZFS pool on the receiving side, and try to attach strace to the zfs send process, we get this:

# strace -p 2429972
strace: Process 2429972 attached
write(2, "warning: cannot send 'lxd/contai"..., 110) = 110
close(3)                                = 0
close(4)                                = 0
close(5)                                = 0
exit_group(1)                           = ?
+++ exited with 1 +++

And the process terminates at that point.

The lxc copy process still hangs around in the process list, however, and if we strace that, we get:

# strace -p 1765544
strace: Process 1765544 attached
futex(0x1121fe8, FUTEX_WAIT_PRIVATE, 0, NULL

So not only does zfs send hang, which may be something on our end, but LXD just sits there and doesn't realize zfs send has died.
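
For what it's worth, since the lxc client is a Go binary, our understanding is that sending it SIGQUIT should make the Go runtime print goroutine stack traces to the client's stderr before it exits, which would at least show where it is blocked. The PID below is just the hung client from the strace above:

# kill -QUIT 1765544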

If we then kill the lxc copy process, all may seem well and good, but if we try to redo the copy, we get a message from the target server along the lines of "Instance is busy performing a Create operation".

The only way we have found to recover from that is to reload the snap daemon on the target, which is bad, as that interrupts any other copy operations in flight with the "LXD is shutting down" message.
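
Something we plan to try that might be less disruptive than reloading the snap daemon (we haven't verified yet whether it actually clears this lock, so take it as a guess): list the operations on the target and cancel just the stuck "create" operation with the lxc operation subcommands, where <UUID> is the ID shown in the listing:

# lxc operation list
# lxc operation show <UUID>
# lxc operation delete <UUID>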

So, the issues here are a few:

  1. LXD does not try to detect whether zfs send is actually making progress; it would be pretty cool if it could, with timeouts or something like that (see the sketch after this list)
  2. LXD doesn't notice that the zfs send process has died, so it never moves on
  3. The receiving end is left in a bad state
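
Regarding point 1, a crude check we can do on our side today, without attaching strace to it, is to sample the I/O counters of the zfs send process from /proc a couple of times and see whether read_bytes/write_bytes keep growing (2429972 is just the example PID from above):

# grep -E 'read_bytes|write_bytes' /proc/2429972/io
# sleep 60
# grep -E 'read_bytes|write_bytes' /proc/2429972/io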

Any hints, information and tips to help us with this issue would be much appreciated :)

webdock-io commented May 21, 2024

A few followup notes from here which we've discovered in the past 24 hours:

We are seeing hanging behavior on specific hosts for specific instances, and the issue seems twofold: we occasionally see a hang on an initial full instance copy, but more reliably and reproducibly we see hangs on incremental --refresh copies. Here zfs send is never invoked and the lxc copy process just hangs indefinitely. Any hints as to how to debug what exactly it is doing would be helpful.
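
In case it helps, what we are going to try next (not sure yet how much it will reveal) is running the copy with the client's global --debug flag and watching the daemon event stream on both hosts while it runs:

# lxc copy --mode push --refresh --stateless --debug beldexnode1 backupserver:leeta-beldexnode1-primary
# lxc monitor --type=logging --pretty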

Second, we realized that just attaching strace to any zfs send process, whether it is working or not, causes it to abort with the message above. So the example/message we show in our original post here can basically be disregarded, as that's an artefact of us running strace -p on the zfs send process.

I really don't know why that kills zfs send, but it does.

Edit: Some supplemental information:

The initial refresh runs and copies all the data, then it just hangs indefinitely:

# lxc copy --mode push --refresh --stateless beldexnode1 backupserver:leeta-beldexnode1-primary
Transferring instance: beldexnode1: 383.00MB (53.86MB/s)

After waiting for up to 10 hours, we can break out, and if we retry we get:

# lxc copy --mode push --refresh --stateless beldexnode1 backupserver:leeta-beldexnode1-primary
Error: Failed getting exclusive access to instance: Instance is busy running a "create" operation

@webdock-io

We ended up going for the hardcore solution here: rebooting our host system and upgrading LXD to latest/stable. After that, copy operations worked smoothly again.

I guess this can be closed; however, it would be excellent if you could leave some pointers as to which log files we should look at if this happens again.
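
For reference, the places we have been looking so far on the snap install (these seem to be the relevant ones, but corrections welcome):

# snap logs lxd
# journalctl -u snap.lxd.daemon
# less /var/snap/lxd/common/lxd/logs/lxd.log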
