-
Notifications
You must be signed in to change notification settings - Fork 456
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce global and per-tenant flags to control rule evaluation concurrency #8146
base: main
Are you sure you want to change the base?
Conversation
…urrency This change introduces a global boolean flag named `-ruler.concurrent-rule-evaluation` to control if Mimir runs independent rules concurrently. In addition to this flag, a per tenant configuration option of `-ruler.max-concurrent-rule-evaluations` is also introduced to control the amount of concurrency we can have per tenant. By default, the new feature is disabled globally and it can also be disabbled per tenant by using a value of `0` as part of `-ruler.max-concurrent-rule-evaluations`. Signed-off-by: gotjosh <[email protected]>
Signed-off-by: gotjosh <[email protected]>
Going back to draft, as there are two things I need to do:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Josh for working on this. This PR does what is it says and I'm not super sure it does what we need. I see two main issues.
First of all, in a multi-tenant Mimir cluster, the concurrency is unbounded because the max concurrency is configurable on a per-tenant basis but there's no per-ruler instance max concurrency. I think this is something we should do to ensure that each ruler instance will not fire an unbounded number of concurrent queries (we still want the ruler to keep spreading queries over time as much as possible).
Second, and more tricky, the queries to run concurrently get selected randomly. What I mean is that given the concurrency is limited, there's no algorithm to decide which query should be executed concurrently and which shouldn't, among all the independent queries (the ones for which is feasible to run concurrently). Our goal is to make to sure we never miss rule group evaluations. We don't care to run concurrently queries for a rule group that evaluated every 1m and all their queries take 10s to run, because we're well below the budget. On the contrary, we want to run concurrently the queries for rule groups that are at risk of missed evaluation. I'm wondering if we can track how long it takes to evaluate each rule group and enable concurrency only for rule groups that take more than 50% of their evaluation period, as a gauge to only do it for rule groups that are at risk of misses.
@@ -316,6 +317,9 @@ func DefaultTenantManagerFactory( | |||
// Wrap the queryable with our custom logic. | |||
wrappedQueryable := WrapQueryableWithReadConsistency(queryable, logger) | |||
|
|||
// Determine if we need to enable concurrent evaluations based on the global flag and per-tenant limits. | |||
concurrentEvaluationEnabled := cfg.EnableConcurrentRuleEvaluation && overrides.RulerMaxConcurrentRuleEvaluations(userID) > 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need EnableConcurrentRuleEvaluation
at all? I would just automatically enable it when the max is > 0.
@@ -134,6 +134,8 @@ type Config struct { | |||
// Allow to override timers for testing purposes. | |||
RingCheckPeriod time.Duration `yaml:"-"` | |||
rulerSyncQueuePollFrequency time.Duration `yaml:"-"` | |||
|
|||
EnableConcurrentRuleEvaluation bool `yaml:"enable_concurrent_rule_evaluation" category:"advanced"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we decide too keep this flag (see other comment) then should be marked as experimental and then listed in docs/sources/mimir/configure/about-versioning.md
.
@@ -181,6 +181,7 @@ type Limits struct { | |||
RulerRecordingRulesEvaluationEnabled bool `yaml:"ruler_recording_rules_evaluation_enabled" json:"ruler_recording_rules_evaluation_enabled" category:"experimental"` | |||
RulerAlertingRulesEvaluationEnabled bool `yaml:"ruler_alerting_rules_evaluation_enabled" json:"ruler_alerting_rules_evaluation_enabled" category:"experimental"` | |||
RulerSyncRulesOnChangesEnabled bool `yaml:"ruler_sync_rules_on_changes_enabled" json:"ruler_sync_rules_on_changes_enabled" category:"advanced"` | |||
RulerMaxConcurrentRuleEvaluations int64 `yaml:"ruler_max_concurrent_rule_evaluations" json:"ruler_max_concurrent_rule_evaluations" category:"advanced"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would mark as experimental and list it in docs/sources/mimir/configure/about-versioning.md
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to this, new experimental options should be disabled by default until we're confident in the option and default value.
What this PR does
This change introduces a global boolean flag named
-ruler.concurrent-rule-evaluation
to control if Mimir runs independent rules concurrently. In addition to this flag, a per tenant configuration option of-ruler.max-concurrent-rule-evaluations
is also introduced to control the amount of concurrency we can have per tenant.By default, the new feature is disabled globally, and it can also be disabled per tenant by using a value of
0
as part of-ruler.max-concurrent-rule-evaluations
.Which issue(s) this PR fixes or relates to
Part of https://github.com/grafana/mimir-squad/issues/2047
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]
.about-versioning.md
updated with experimental features.NB: I couldn't think of a way to test this without incurring in a significant effort to set a test it, but I'm happy to spend the time if we think it's worth it.