sanoid --prune-snapshots performance is bad because of iszfsbusy #912

Open
sdettmer opened this issue Apr 18, 2024 · 5 comments

@sdettmer

sdettmer commented Apr 18, 2024

I noticed that on my machine sanoid appears a little slow. I had reduced the snapshot retention to free some space, so the first sanoid run had quite a bit of work to do. Deletion takes ~100 ms per zfs destroy, but sanoid needs ~500 ms for each one. I think this is because iszfsbusy() takes a lot of time, and it is not executed once per dataset but once for every snapshot (and I think it is a race anyway).
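For illustration, here is a minimal Perl sketch of why a per-snapshot busy check that scans the full process list gets expensive. The sub name and details are my assumptions, not sanoid's exact code:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative sketch only: the sub name and details are assumptions,
# not sanoid's exact implementation of iszfsbusy().
sub is_snapshot_busy {
    my ($snapshot) = @_;
    # Listing every process is the expensive part (~hundreds of ms on a
    # busy node), and it is repeated for every snapshot to be destroyed.
    my @procs = `ps -Ao args=`;
    foreach my $args (@procs) {
        return 1 if $args =~ /zfs\s+(send|receive|recv)\b.*\Q$snapshot\E/;
    }
    return 0;
}

# Pruning loop: one full process scan per snapshot dominates the runtime,
# because the zfs destroy itself only takes ~100 ms here.
foreach my $snap (@ARGV) {
    next if is_snapshot_busy($snap);
    system('zfs', 'destroy', $snap);
}
```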

Do I understand correctly that --force-prune skips that check (and does nothing else)? Under which circumstances does the check actually help?

With --force-prune I get more than 5x the speed (EDIT: 10x the speed), and I think it is still safe, unless the expiry rules are so short that they expire recent snapshots that could still be used by replication (and those would be lost anyway if the send started a fraction of a second later).
Do I understand correctly that this check normally should not be needed, and that even if a zfs send were running, there would only be an error message but no other disturbance? Do we still need this, or could we make --force-prune the default?

@phreaker0
Copy link
Collaborator

@jimsalterjrs I think we can remove the whole send/recv check. zfs destroy will error out if the snapshot is in use, and if we need to know why it failed we can capture the output.

What do you think? If you are okay with it I will create a PR for it.
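As an illustration of that approach (not sanoid's actual code), a minimal Perl sketch that skips the pre-check and just captures the output of a failing zfs destroy:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch of the alternative above: skip the busy pre-check and let
# zfs destroy fail on its own, capturing its output so the reason can
# be logged. Illustration only, not sanoid's actual code.
sub destroy_snapshot {
    my ($snapshot) = @_;
    my $output = `zfs destroy "$snapshot" 2>&1`;
    if ($? != 0) {
        warn "could not destroy $snapshot: $output";
        return 0;
    }
    return 1;
}

destroy_snapshot($_) foreach @ARGV;
```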

@jimsalterjrs
Owner

> @jimsalterjrs I think we can remove the whole send/recv check. zfs destroy will error out if the snapshot is in use, and if we need to know why it failed we can capture the output.
>
> What do you think? If you are okay with it I will create a PR for it.

Sounds fine... I knew this was belt-and-suspenders, but did not expect iszfsbusy() to cause so much of a slowdown. Is the slowdown our own fault, or is something upstream performing less well than it ought to? If the latter, we should probably file a bug upstream despite "fixing" it on our end by not bothering to check.

Thanks for offering to create the PR!

@sdettmer
Author

I think the performance impact is just too big when there are many "empty" snapshots (like the frequent snapshots sanoid takes during the night). Destroying them happens in an instant, but ps -Ac on a node with, let's say, 5000 processes takes ~500 ms (I just tested this on two randomly picked systems).
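If anyone wants to reproduce that number, a small Perl measurement aid (not part of sanoid) that times a full process listing, which is what the busy check has to pay for on every snapshot:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Measurement aid only: time one full process listing. The result will
# vary with the process count on the machine.
my $t0      = [gettimeofday];
my @procs   = `ps -Ao args=`;
my $elapsed = tv_interval($t0);
printf "ps listed %d processes in %.0f ms\n", scalar(@procs), $elapsed * 1000;
```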

(Normally, with only a couple of snapshots to delete, it should have no big effect, but new people may come to sanoid because of an issue, such as slowness caused by 100,000 snapshots; they fix the rules and then the pruning performance hurts. I think I found exactly one such issue on the internet, where someone proposed a script that was "ten times faster" (IIRC), possibly for the same reason.)

But to be clear for @phreaker0: I have no patch yet! Changing the default value shouldn't be too difficult, and I think I could prepare a patch proposal. Or did you already take a look? What do you all think?

sdettmer added commits to sdettmer/sanoid that referenced this issue Apr 20, 2024:
…y making --force-prune default (and adding --no-force-prune to get old behavior)
…y making --force-prune default (and adding --double-check-prune to get old behavior)
@sdettmer
Author

sdettmer commented Apr 20, 2024

Indeed the change seems very straightforward (I hope I did not miss anything).

I see three major approaches and I made a proposal patch for each:

  1. dev/sde/no-force-prune - the "least invasive" option is probably to make --force-prune the default and add --no-force-prune as an option to restore the old behavior (see the sketch after this list)
  2. dev/sde/double-check-prune - I think the name --no-force-prune is confusing, so it might be renamed to --double-check-prune
  3. dev/sde/remove-force-prune - if we think the whole functionality is not needed, it would be best to remove it and keep things cleaner. Maybe this could be done in a later version, when experience shows that no one ever used --no-force-prune or --double-check-prune (doing this later instead of right now may be overly cautious, but I work in an industry with 20-year warranty periods, so I'm used to thinking this way :))
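For option 1, a minimal Perl sketch of how the flag handling could look, assuming Getopt::Long's negatable option form; the variable name and messages are illustrative, not the actual patch:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long;

# Sketch of option 1: default --force-prune to on and let Getopt::Long's
# negatable option form also accept --no-force-prune to restore the old
# double-check. Illustration only, not the actual patch.
my $force_prune = 1;    # new default: skip the busy pre-check
GetOptions('force-prune!' => \$force_prune)
    or die "usage: $0 [--force-prune|--no-force-prune]\n";

if ($force_prune) {
    print "pruning without the send/recv busy check\n";
} else {
    print "double-checking for in-flight zfs send/receive before each destroy\n";
}
```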

@sdettmer
Author

sdettmer commented Apr 20, 2024

These are alternatives - of course only one of them should be merged.

I'm happy to receive feedback on the chosen option and to improve my proposal according to your review!
