Timeout ticker has incorrect behavior when two timeouts placed in quick succession #3091
ValarDragon added the enhancement (New feature or request) and needs-triage (This issue/PR has not yet been triaged by the team) labels on May 17, 2024.
This was referenced May 17, 2024
After understanding the race condition more: this will occur whenever we schedule two timeouts at ~the same time. We can get timeout 1's timer delay, but with timeout 2's value. From skimming the test I wrote, we are definitely at risk of this if the timeouts are scheduled within 100000ns of each other, but the true threshold at which the risk starts is likely higher.
melekes added the bug (Something isn't working) and consensus labels, and removed the enhancement and needs-triage labels, on May 18, 2024.
ValarDragon changed the title from "Timeout ticker has incorrect behavior in a race condition with negative timeout durations" to "Timeout ticker has incorrect behavior when two timeouts placed in quick succession" on May 22, 2024.
github-merge-queue bot pushed a commit that referenced this issue on May 22, 2024:
… scheduled at ~approximately the same time (#3092) #3091

The problem is we have an edge case where we should drain the timer channel, but we "let it slide" in certain race conditions when two timeouts are scheduled near each other. This means we can have unsafe timeout behavior, as demonstrated in the GitHub issue, and likely in more spots in consensus.

Notice that aside from NewTimer and OnStop, all timer accesses are from the same thread. In NewTimer we can block until the timer is drained (very quickly, up to goroutine scheduling). In OnStop we don't need to guarantee draining before the method ends; we can just launch something into the channel that will kill it. In the main timer goroutine, we can safely maintain a "timerActive" variable and force-drain when it's active. This removes the edge case. The test I created does fail on main.

---

#### PR checklist
- [x] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code comments
- [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
mergify bot pushed a commit that referenced this issue on May 22, 2024: the same change, cherry picked from commit 153281a.
mergify bot pushed a commit that referenced this issue on May 22, 2024: the same change, cherry picked from commit 153281a, with conflicts in .changelog/v0.38.3/bug-fixes/3092-consensus-timeout-ticker-data-race.md, consensus/ticker.go, and consensus/ticker_test.go.
mergify bot pushed a commit that referenced this issue on May 22, 2024: the same change, cherry picked from commit 153281a, with conflicts in consensus/ticker.go and consensus/ticker_test.go.
This was referenced May 22, 2024
sergio-mena pushed a commit that referenced this issue on May 22, 2024: automatic Mergify backport of #3092 (#3105). Co-authored-by: Dev Ojha.
sergio-mena added a commit that referenced this issue on May 22, 2024: automatic Mergify backport of #3092 (#3106). Co-authored-by: Dev Ojha and Sergio Mena.
sergio-mena added a commit that referenced this issue on May 22, 2024: automatic Mergify backport of #3092 (#3107). Co-authored-by: Dev Ojha, Sergio Mena, and mergify[bot].
PaddyMc pushed a commit to osmosis-labs/cometbft that referenced this issue on May 23, 2024: automatic Mergify backport of cometbft#3092 (cometbft#3107).
mergify bot added a commit to osmosis-labs/cometbft that referenced this issue on May 23, 2024: the same backport, cherry picked from commit a77f195.
PaddyMc pushed a commit to osmosis-labs/cometbft that referenced this issue on May 23, 2024: the same backport (#68), cherry picked from commit a77f195.
Feature Request
Summary
I found that we can get a negative timeout duration, at least in tests, and subsequently incorrect behavior, while working on #3008. I have made a PR that can reproduce this failing test (at least locally).
Suppose we have code like the following (we effectively have this in the create-empty-blocks tests):
The intended behavior is one of:
Unfortunately, we also get the behavior:
which is our bug. This should be causing test flakiness. Unfortunately, I don't know enough about our timeout usages to know whether this poses any safety problems.