Add unstable on_thread_park_id() to runtime Builder (for stuck task watchdog) #6370
Motivation
Fixes #6353. I'm looking for a way to detect when a task is not yielding back to tokio properly. Currently such a task can prevent other, unrelated tasks from running. We had an incident where a buggy task went into a busy loop and put a service into a zombie state: it was still responding to most requests, but some background tasks were not running as expected.
If there is an existing way to detect a stuck task, I'd be delighted to be enlightened.
Solution
Add a new method on_thread_park_id() that includes the id of the worker being parked. This allows a watchdog process to determine which workers are not parked and also not polling (the poll count is already available from RuntimeMetrics::worker_poll_count()). The watchdog code would look something like this:

Rather than printing, a real watchdog implementation could call std::process::abort() if the task stays stuck past some duration (e.g. 10 seconds).
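
The detection rule described here (a worker that is neither parked nor advancing its poll count is a candidate for being stuck) can be illustrated without tokio. What follows is a minimal std-only sketch under stated assumptions, not the code from this PR: WorkerState and Watchdog are hypothetical names, set_parked() stands in for what the proposed on_thread_park_id() hook and an assumed matching unpark hook would record, and record_poll() stands in for the counter reported by RuntimeMetrics::worker_poll_count().

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

/// Per-worker state shared between the runtime hooks and the watchdog.
/// Hypothetical type for illustration; not part of tokio.
pub struct WorkerState {
    /// Assumed to be set from the proposed on_thread_park_id() hook
    /// and cleared again when the worker is unparked.
    parked: AtomicBool,
    /// Stand-in for RuntimeMetrics::worker_poll_count(worker).
    poll_count: AtomicU64,
}

impl WorkerState {
    pub fn new() -> Self {
        WorkerState {
            parked: AtomicBool::new(false),
            poll_count: AtomicU64::new(0),
        }
    }

    /// Record whether the worker is currently parked.
    pub fn set_parked(&self, parked: bool) {
        self.parked.store(parked, Ordering::Relaxed);
    }

    /// Record that the worker polled one task.
    pub fn record_poll(&self) {
        self.poll_count.fetch_add(1, Ordering::Relaxed);
    }
}

/// Remembers the poll count seen for each worker at the previous check.
pub struct Watchdog {
    last_poll_counts: Vec<Option<u64>>,
}

impl Watchdog {
    pub fn new(workers: usize) -> Self {
        Watchdog {
            last_poll_counts: vec![None; workers],
        }
    }

    /// Returns the ids of workers that are not parked and whose poll count
    /// has not changed since the previous call. The first call only records
    /// a baseline and therefore reports nothing.
    pub fn check(&mut self, workers: &[WorkerState]) -> Vec<usize> {
        let mut stuck = Vec::new();
        let pairs = workers.iter().zip(self.last_poll_counts.iter_mut());
        for (id, (worker, last)) in pairs.enumerate() {
            let polls = worker.poll_count.load(Ordering::Relaxed);
            let parked = worker.parked.load(Ordering::Relaxed);
            if !parked && *last == Some(polls) {
                stuck.push(id);
            }
            *last = Some(polls);
        }
        stuck
    }
}
```

A watchdog thread would call check() at a fixed interval. To follow the suggestion above, it could track how long each id keeps being reported and call std::process::abort() once that exceeds the chosen limit (e.g. 10 seconds), rather than acting on a single observation.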