Panic randomly occurs on node shutdown, leading to unclean shutdown #1661

Open
hhsel opened this issue Jul 23, 2023 · 1 comment

Comments

hhsel (Contributor) commented Jul 23, 2023

Expected behaviour

Panic should not happen on normal node shutdown.

Actual behaviour

The panic "panic: sync: WaitGroup is reused before previous Wait has returned" occasionally occurs on node shutdown, leading to an unclean shutdown and data loss on the node.
I stop one of the non-validator nodes once a day to safely take a disk snapshot.
I have observed this panic about once every month or two.

Steps to reproduce the behaviour

Launch a QBFT cluster and schedule a normal shutdown of a non-validator node once a day.
Occasionally, the message "panic: sync: WaitGroup is reused before previous Wait has returned" appears on node shutdown, causing data loss on the node.

hhsel (Contributor, Author) commented Jul 23, 2023

This might be related to ethereum/go-ethereum#27509, and applying ethereum/go-ethereum#27665 might help alleviate this issue.

The panic does not seem to be completely random: there are situations in which it becomes much more likely, so simply repeating systemctl start and systemctl stop is not enough to reproduce it.
I suspect this is a race condition that only manifests once enough dirty caches have accumulated.
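For context, the Go documentation for sync.WaitGroup requires that, when a WaitGroup is reused, new Add calls happen only after all previous Wait calls have returned; the "WaitGroup is reused before previous Wait has returned" panic is raised when an Add slips in while a Wait is still returning. The exact call site in GoQuorum is not identified here, so the following is only a minimal sketch of a common guard pattern, assuming the race is between background work being registered (for example, cache flushes) and shutdown waiting on that work. The type shutdownGroup and its methods TryAdd, Done, and Close are hypothetical and are not part of GoQuorum or go-ethereum.

```go
package main

import (
	"fmt"
	"sync"
)

// shutdownGroup is a hypothetical helper (not GoQuorum or go-ethereum code)
// that refuses new work once shutdown has started, so WaitGroup.Add can
// never race with WaitGroup.Wait.
type shutdownGroup struct {
	mu     sync.Mutex
	closed bool
	wg     sync.WaitGroup
}

// TryAdd registers one unit of background work. It returns false once
// Close has been called, in which case the caller must not start the work.
func (g *shutdownGroup) TryAdd() bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.closed {
		return false
	}
	g.wg.Add(1)
	return true
}

// Done marks one unit of registered work as finished.
func (g *shutdownGroup) Done() { g.wg.Done() }

// Close stops accepting new work and blocks until all registered work is done.
// Because closed is set under the same lock that guards Add, every successful
// Add happens before Wait starts.
func (g *shutdownGroup) Close() {
	g.mu.Lock()
	g.closed = true
	g.mu.Unlock()
	g.wg.Wait()
}

func main() {
	var g shutdownGroup
	results := make(chan int, 4)

	for i := 0; i < 4; i++ {
		if g.TryAdd() {
			go func(n int) {
				defer g.Done()
				results <- n * n // stand-in for flushing a dirty cache entry
			}(i)
		}
	}

	// All workers have sent before Done runs, so closing the channel is safe.
	g.Close()
	close(results)

	sum := 0
	for r := range results {
		sum += r
	}
	fmt.Println("sum:", sum)                          // 0+1+4+9 = 14
	fmt.Println("accepted after close:", g.TryAdd()) // false
}
```

The design choice is that shutdown flips a flag under a mutex before calling Wait, so any work that tries to register afterwards is rejected instead of calling Add on a WaitGroup that is already being waited on.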
