
Timeout ticker has incorrect behavior when two timeouts placed in quick succession #3091

Closed
ValarDragon opened this issue May 17, 2024 · 1 comment
Labels
bug Something isn't working consensus

@ValarDragon
Collaborator

Bug Report

Summary

While working on #3008, I found that we can get a negative timeout duration (at least in tests) and, subsequently, incorrect behavior. I have made a PR that reproduces this failing test (at least locally).

Suppose we have code like the following (we effectively have this in the create-empty-blocks tests):

    startTime := time.Now()
    negTimeout := timeoutInfo{Duration: -1 * time.Millisecond, Height: height, Round: 0, Step: types.RoundStepNewHeight}
    timeout := timeoutInfo{Duration: 5 * time.Millisecond, Height: height, Round: 0, Step: types.RoundStepNewRound}
    ticker.ScheduleTimeout(negTimeout)
    ticker.ScheduleTimeout(timeout)

    // Wait for the timeout to be received
    to := <-c
    elapsedTime := time.Since(startTime)

The intended behavior is one of:

  • to := negTimeout, elapsedTime ~= 0
  • to := timeout, elapsedTime ~= 5ms

Unfortunately, we can also get the behavior

  • to := timeout, elapsedTime ~= 0

which is the bug. This is likely causing test flakiness; unfortunately I don't know enough about our timeout usages to know whether it poses any safety problems.


ValarDragon commented May 17, 2024

After understanding the race condition better: this occurs whenever we schedule two timeouts at ~the same time. We can get timeout 1's timer delay, but with timeout 2's value. From the test I wrote, we are definitely at risk of this if the timeouts are scheduled within 100,000 ns (100 µs) of each other, but the true threshold at which the risk starts is likely higher.

@melekes melekes added bug Something isn't working consensus and removed enhancement New feature or request needs-triage This issue/PR has not yet been triaged by the team. labels May 18, 2024
@melekes melekes added this to the 2024-Q2 milestone May 18, 2024
@ValarDragon ValarDragon changed the title Timeout ticker has incorrect behavior in a race condition with negative timeout durations Timeout ticker has incorrect behavior when two timeouts placed in quick succession May 22, 2024
github-merge-queue bot pushed a commit that referenced this issue May 22, 2024
… scheduled at ~approximately the same time (#3092)

#3091 

The problem is we have an edge case where we should drain the timer
channel, but we "let it slide" in certain race conditions when two
timeouts are scheduled near each other. This means we can have unsafe
timeout behavior, as demonstrated in the GitHub issue, and likely in
more spots in consensus.

Notice that aside from NewTimer and OnStop, all timer accesses are from
the same thread. In NewTimer we can block until the timer is drained
(very quickly, up to goroutine scheduling). In OnStop we don't need to
guarantee draining before the method ends; we can just send something
into the channel that will kill it.

In the main timer goroutine, we can safely maintain this "timerActive"
variable, and force-drain when it's active. This removes the edge case.

The test I created does fail on main.


---

#### PR checklist

- [X] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [X] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
mergify bot pushed a commit that referenced this issue May 22, 2024
… scheduled at ~approximately the same time (#3092)

(cherry picked from commit 153281a)
mergify bot pushed a commit that referenced this issue May 22, 2024
… scheduled at ~approximately the same time (#3092)

(cherry picked from commit 153281a)

# Conflicts:
#	.changelog/v0.38.3/bug-fixes/3092-consensus-timeout-ticker-data-race.md
#	consensus/ticker.go
#	consensus/ticker_test.go
mergify bot pushed a commit that referenced this issue May 22, 2024
… scheduled at ~approximately the same time (#3092)

(cherry picked from commit 153281a)

# Conflicts:
#	consensus/ticker.go
#	consensus/ticker_test.go
sergio-mena pushed a commit that referenced this issue May 22, 2024
… scheduled at ~approximately the same time (backport #3092) (#3105)

<hr>This is an automatic backport of pull request #3092 done by
[Mergify](https://mergify.com).

Co-authored-by: Dev Ojha <[email protected]>
sergio-mena added a commit that referenced this issue May 22, 2024
… scheduled at ~approximately the same time (backport #3092) (#3106)

sergio-mena added a commit that referenced this issue May 22, 2024
… scheduled at ~approximately the same time (backport #3092) (#3107)

PaddyMc pushed a commit to osmosis-labs/cometbft that referenced this issue May 23, 2024
… scheduled at ~approximately the same time (backport cometbft#3092) (cometbft#3107)

mergify bot added a commit to osmosis-labs/cometbft that referenced this issue May 23, 2024
… scheduled at ~approximately the same time (backport cometbft#3092) (cometbft#3107)

(cherry picked from commit a77f195)
PaddyMc pushed a commit to osmosis-labs/cometbft that referenced this issue May 23, 2024
… scheduled at ~approximately the same time (backport cometbft#3092) (cometbft#3107) (#68)
