Recovering from and diagnosing lxc copy hangs #13492
Comments
A few follow-up notes on what we've discovered in the past 24 hours.

We are seeing hangs on specific hosts for specific instances, and the issue seems twofold. We occasionally see a hang on an initial full instance copy with - but, more reliably and reproducibly, we see hangs on incremental --refresh copies. In that case zfs send is never invoked and the lxc copy process just hangs indefinitely. Any hints on how to debug what it is actually doing would be helpful.

Second, we found that running strace on any zfs send process, whether it is working or not, causes it to abort with the message quoted in our original post. So that message can basically be disregarded, as it is an artefact of us running strace -p on the zfs send process. I really don't know why that kills zfs send, but it does.

Edit: Some supplemental information: the initial refresh run copies all the data, then it just hangs indefinitely.
After waiting for up to 10 hours, we can break out and if we retry we get:
We ended up going with the hardcore solution here: we rebooted our host system and upgraded lxd to latest/stable. After that, copy operations worked smoothly again. I guess this can be closed; however, it would be excellent if you could leave some pointers as to which log files we can look at if this happens again.
lxd v5.21.1 LTS
Ubuntu Noble
zfs pool backed systems
We are seeing periodic hangs when doing lxc copy, which may or may not be network related (we are investigating that). Regardless, what happens is that lxc copy hangs indefinitely while a zfs send process sits in our process list. If we then conclude it must be hung, because we see no activity on the zfs pool on the receiving side, and try to strace it, we get this:
And the process terminates at that point.
The lxc copy process will still hang around in the process list however, and if we strace that, we get:
So not only does zfs send hang, which may be something on our end, but lxd just sits there and doesn't realize zfs has died.
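Since attaching strace appears to abort zfs send (see the follow-up comment), one option is to look at the process through /proc instead, which does not attach via ptrace. The following is only a minimal sketch under assumptions: the helper name `proc_snapshot` is hypothetical, `/proc/<pid>/stack` is usually readable only by root, and we have not confirmed that this leaves zfs send undisturbed in our setup.

```python
from pathlib import Path


def proc_snapshot(pid: int) -> dict:
    """Collect the scheduler state of `pid` from /proc without using ptrace."""
    base = Path("/proc") / str(pid)
    info = {"pid": pid}

    # /proc/<pid>/status carries the process name and its state letter,
    # e.g. "S (sleeping)" or "D (disk sleep)" for uninterruptible waits.
    for line in (base / "status").read_text().splitlines():
        key, _, value = line.partition(":")
        if key in ("Name", "State"):
            info[key.lower()] = value.strip()

    # wchan names the kernel function the task is sleeping in (if any);
    # stack is the kernel stack trace and normally needs root to read.
    for name in ("wchan", "stack"):
        try:
            info[name] = (base / name).read_text().strip()
        except OSError:
            info[name] = None

    return info
```

Calling `proc_snapshot(<pid of zfs send>)` a few times, some seconds apart, would show whether the process stays in the same state and wait channel, which is at least a hint about whether it is blocked in the kernel or waiting on its output pipe.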
If we then kill the lxc copy process all may seem well and good - but if we try to redo the copy, we get a message from the target server along the lines of "Instance is busy performing a Create operation"
The only way to recover from that is to reload the snap daemon on the target, which is bad as that interrupts any other copy operations going on with the "LXD is shutting down" message.
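Before reloading the daemon, it may be worth checking which background operation the target still holds for the instance. Below is a rough sketch, not a verified fix: it assumes `lxc operation list <remote>: --format json` is available in the installed client and that each operation lists the affected instance URLs under `resources`. The helper names `list_operations` and `operations_for_instance` are hypothetical.

```python
import json
import subprocess
from urllib.parse import urlsplit


def list_operations(remote: str) -> list:
    """Return the background operations reported by `remote` as parsed JSON."""
    result = subprocess.run(
        ["lxc", "operation", "list", f"{remote}:", "--format", "json"],
        check=True, capture_output=True, text=True, timeout=60,
    )
    return json.loads(result.stdout)


def operations_for_instance(operations: list, instance: str) -> list:
    """Keep only the operations whose resources reference `instance`."""
    matches = []
    for op in operations:
        resources = op.get("resources") or {}
        for urls in resources.values():
            # Resource entries look like "/1.0/instances/<name>?project=...";
            # compare on the last path segment with the query string dropped.
            names = {
                urlsplit(url).path.rstrip("/").rsplit("/", 1)[-1]
                for url in (urls or [])
            }
            if instance in names:
                matches.append(op)
                break
    return matches
```

For example, `operations_for_instance(list_operations("target"), "web1")` (remote and instance names are placeholders) would return the operations still attached to that instance. `lxc operation delete target:<id>` then asks LXD to cancel one of them; whether that actually clears a stuck Create operation in this situation is something we have not verified.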
So, the issues here are a few:
Any hints, information and tips to help us with this issue would be much appreciated :)