
Timeout ticker has incorrect behavior when two timeouts placed in quick succession #3091

Closed
ValarDragon opened this issue May 17, 2024 · 1 comment
Labels
bug Something isn't working consensus

@ValarDragon
Collaborator

Bug Report

Summary

While working on #3008, I found that we can get a negative timeout duration (at least in tests) and, subsequently, incorrect behavior. I have made a PR that reproduces this failing test (at least locally).

Suppose we have code like the following (we effectively have this in the create-empty-blocks tests):

    startTime := time.Now()
    negTimeout := timeoutInfo{Duration: -1 * time.Millisecond, Height: height, Round: 0, Step: types.RoundStepNewHeight}
    timeout := timeoutInfo{Duration: 5 * time.Millisecond, Height: height, Round: 0, Step: types.RoundStepNewRound}
    ticker.ScheduleTimeout(negTimeout)
    ticker.ScheduleTimeout(timeout)

    // Wait for the timeout to be received
    to := <-c
    elapsedTime := time.Since(startTime)

The intended behavior is one of:

  • to := negTimeout, elapsedTime ~= 0
  • to := timeout, elapsedTime ~= 5ms

Unfortunately, we can also get the behavior

  • to := timeout, elapsedTime ~= 0

which is the bug. This is likely causing test flakiness; unfortunately I don't know enough about our timeout usages to know whether it poses any safety problems.


ValarDragon commented May 17, 2024

After understanding the race condition better: this occurs whenever we schedule two timeouts at ~the same time. We can get timeout 1's timer delay, but with timeout 2's value. From the test I wrote, we are definitely at risk of this if the timeouts are scheduled within 100,000 ns (100 µs) of each other, but the true threshold at which the risk starts is likely higher.

@melekes melekes added bug Something isn't working consensus and removed enhancement New feature or request needs-triage This issue/PR has not yet been triaged by the team. labels May 18, 2024
@melekes melekes added this to the 2024-Q2 milestone May 18, 2024
@ValarDragon ValarDragon changed the title Timeout ticker has incorrect behavior in a race condition with negative timeout durations Timeout ticker has incorrect behavior when two timeouts placed in quick succession May 22, 2024
github-merge-queue bot pushed a commit that referenced this issue May 22, 2024
… scheduled at ~approximately the same time (#3092)

#3091 

The problem is we have an edge case where we should drain the timer
channel, but we "let it slide" in certain race conditions when two
timeouts are scheduled near each other. This means we can have unsafe
timeout behavior, as demonstrated in the GitHub issue, and likely in
more spots in consensus.

Notice that aside from NewTimer and OnStop, all timer accesses are from
the same thread. In NewTimer we can block until the timer is drained
(very quickly, up to goroutine scheduling). In OnStop we don't need to
guarantee draining before the method ends; we can just send something
into the channel that will kill it.

In the main timer goroutine, we can safely maintain this "timerActive"
variable, and force-drain when it's active. This removes the edge case.

The test I created does fail on main.


---

#### PR checklist

- [X] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [X] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
mergify bot pushed a commit that referenced this issue May 22, 2024
… scheduled at ~approximately the same time (#3092)

(cherry picked from commit 153281a)
mergify bot pushed a commit that referenced this issue May 22, 2024
… scheduled at ~approximately the same time (#3092)

(cherry picked from commit 153281a)

# Conflicts:
#	.changelog/v0.38.3/bug-fixes/3092-consensus-timeout-ticker-data-race.md
#	consensus/ticker.go
#	consensus/ticker_test.go
mergify bot pushed a commit that referenced this issue May 22, 2024
… scheduled at ~approximately the same time (#3092)

(cherry picked from commit 153281a)

# Conflicts:
#	consensus/ticker.go
#	consensus/ticker_test.go
sergio-mena pushed a commit that referenced this issue May 22, 2024
… scheduled at ~approximately the same time (backport #3092) (#3105)

<hr>This is an automatic backport of pull request #3092 done by
[Mergify](https://mergify.com).

Co-authored-by: Dev Ojha <[email protected]>
sergio-mena added a commit that referenced this issue May 22, 2024
… scheduled at ~approximately the same time (backport #3092) (#3106)

sergio-mena added a commit that referenced this issue May 22, 2024
… scheduled at ~approximately the same time (backport #3092) (#3107)

PaddyMc pushed a commit to osmosis-labs/cometbft that referenced this issue May 23, 2024
… scheduled at ~approximately the same time (backport cometbft#3092) (cometbft#3107)

mergify bot added a commit to osmosis-labs/cometbft that referenced this issue May 23, 2024
… scheduled at ~approximately the same time (backport cometbft#3092) (cometbft#3107)

(cherry picked from commit a77f195)
PaddyMc pushed a commit to osmosis-labs/cometbft that referenced this issue May 23, 2024
… scheduled at ~approximately the same time (backport cometbft#3092) (cometbft#3107) (#68)
