
Stuck on PLANNING #791

Open
lapo-luchini opened this issue May 20, 2024 · 6 comments

@lapo-luchini
Contributor

It has been happening more and more often lately that the process gets stuck in some PLANNING phase:
[screenshot of the status output showing a job stuck in PLANNING]
If I do a signal reset and then a wakeup, it starts again; sometimes it works, while other times it gets stuck again after a while (apparently on random filesystems, not always the same ones).
How can I help debug this?
And, as a safeguard, would it be possible to have a watchdog… like, if a phase hasn't ended within 10 minutes, abort it and consider it failed for this round?
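
For reference, the reset-then-wakeup sequence described above uses zrepl's signal subcommand (here <job> is a placeholder for the affected job's name):

  sudo zrepl signal reset <job>
  sudo zrepl signal wakeup <job>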

@A1bi

A1bi commented Jun 4, 2024

I am also experiencing this frequently now on FreeBSD 14.0. I'm pretty sure it only started recently, maybe with the update to 14.0; I haven't seen it on my Linux-based systems yet, so it may be related to that FreeBSD update.

@kapsel

kapsel commented Jun 4, 2024

We are also seeing this issue; we are on FreeBSD 14.0 too. Restarting zrepl works around it.
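(On FreeBSD, with the port's rc script, that should be something like sudo service zrepl restart.)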

@lapo-luchini
Contributor Author

This might be a "better" workaround than a full restart:
sudo zrepl signal reset <name>
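A subsequent sudo zrepl signal wakeup <name> then starts a new replication attempt, as described in the opening comment.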

@dsh2dsh
Contributor

dsh2dsh commented Jun 5, 2024

Just FYI, in my fork I implemented a timeout and it has helped me a lot. Also, using cron specs I configured the jobs that touch the same ZFS datasets so that zrepl fires them at different times and they don't overlap.
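
For illustration, a minimal sketch of that staggering idea using cron-style snapshotting specs — the fork's actual mechanism may differ, the keys below follow the cron snapshotting type from the upstream zrepl docs, and the job bodies are elided:

  jobs:
    - name: backup-a
      # ... job type, connection, filesystems for the first job ...
      snapshotting:
        type: cron
        prefix: zrepl_
        cron: "0 * * * *"    # fires at the top of every hour
    - name: backup-b
      # ... second job touching the same datasets ...
      snapshotting:
        type: cron
        prefix: zrepl_
        cron: "30 * * * *"   # fires at half past, so the two never start together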

@lapo-luchini
Contributor Author

Wow @dsh2dsh, there's a lot of work in your fork! Why doesn't it show up on github.com as a fork?
(That makes it more difficult to inspect the diff between the two projects.)
@problame any chance any of that work will be included?
It would be nice to have a new official release. :)

@lapo-luchini
Contributor Author

lapo-luchini commented Jun 11, 2024

At least for me this seems to be related to the number of parallel size-estimation steps; it hasn't happened again since I did this:

  replication:
    concurrency:
      size_estimates: 1
      #size_estimates: 4
      steps: 10

Edit to add: nope… changing that value solved it on one server, but on a different server it actually made things worse.
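
For context, as I understand the zrepl docs: size_estimates caps how many per-filesystem send-size estimations run concurrently during planning, while steps caps how many replication steps run concurrently. Both change how much parallel load planning puts on zfs at once, which may be why tuning them merely shifts the problem around.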
