Introduce global and per-tenant flags to control rule evaluation concurrency #8146

gotjosh · 2024-05-15T15:06:39Z

What this PR does

This change introduces a global boolean flag named -ruler.concurrent-rule-evaluation to control whether Mimir runs independent rules concurrently. In addition to this flag, a per-tenant configuration option, -ruler.max-concurrent-rule-evaluations, is introduced to control how much concurrency each tenant can have.

By default, the new feature is disabled globally. It can also be disabled per tenant by setting -ruler.max-concurrent-rule-evaluations to 0.

Which issue(s) this PR fixes or relates to

Part of https://github.com/grafana/mimir-squad/issues/2047

Checklist

Tests updated.
Documentation added.
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
about-versioning.md updated with experimental features.

NB: I couldn't think of a way to test this without significant effort to set up a test, but I'm happy to spend the time if we think it's worth it.

…urrency This change introduces a global boolean flag named `-ruler.concurrent-rule-evaluation` to control if Mimir runs independent rules concurrently. In addition to this flag, a per tenant configuration option of `-ruler.max-concurrent-rule-evaluations` is also introduced to control the amount of concurrency we can have per tenant. By default, the new feature is disabled globally and it can also be disabled per tenant by using a value of `0` as part of `-ruler.max-concurrent-rule-evaluations`. Signed-off-by: gotjosh <[email protected]>


gotjosh · 2024-05-16T08:49:15Z

Going back to draft, as there are two things I need to do:

Put a global limit around concurrency
Mark the flags as experimental

pracucci

Thanks Josh for working on this. This PR does what it says, but I'm not sure it does what we need. I see two main issues.

First of all, in a multi-tenant Mimir cluster, the concurrency is unbounded because the max concurrency is configurable on a per-tenant basis but there's no per-ruler instance max concurrency. I think this is something we should do to ensure that each ruler instance will not fire an unbounded number of concurrent queries (we still want the ruler to keep spreading queries over time as much as possible).
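To make the first point concrete, here is a minimal sketch of a two-level gate in which a rule runs concurrently only if both the tenant and the ruler instance have a free slot. It is an illustration under stated assumptions, not code from this PR: the `limiter` type, `newLimiter`, `tryAcquire`, and `release` are hypothetical names, and the limits are passed in directly rather than read from Mimir's overrides.

```go
package main

import "fmt"

// limiter is a hypothetical two-level concurrency gate (not part of this PR).
// A rule may run concurrently only when a slot is free both for its tenant
// and for the ruler instance as a whole.
type limiter struct {
	global chan struct{}            // capacity is the per-instance maximum
	tenant map[string]chan struct{} // capacity is the per-tenant maximum
}

func newLimiter(instanceMax int, tenantMax map[string]int) *limiter {
	l := &limiter{
		global: make(chan struct{}, instanceMax),
		tenant: make(map[string]chan struct{}, len(tenantMax)),
	}
	for id, max := range tenantMax {
		l.tenant[id] = make(chan struct{}, max)
	}
	return l
}

// tryAcquire never blocks. When it returns false the caller is expected to
// evaluate the rule sequentially instead of concurrently.
func (l *limiter) tryAcquire(tenantID string) bool {
	t, ok := l.tenant[tenantID]
	if !ok {
		return false
	}
	select {
	case t <- struct{}{}: // took a tenant slot
	default:
		return false // tenant is at its limit (or its limit is 0)
	}
	select {
	case l.global <- struct{}{}: // took an instance slot
		return true
	default:
		<-t // instance is saturated: give the tenant slot back
		return false
	}
}

// release must be called once for every successful tryAcquire.
func (l *limiter) release(tenantID string) {
	<-l.global
	<-l.tenant[tenantID]
}

func main() {
	l := newLimiter(2, map[string]int{"tenant-a": 2, "tenant-b": 1, "tenant-c": 0})
	fmt.Println("tenant-a:", l.tryAcquire("tenant-a"))
	fmt.Println("tenant-a:", l.tryAcquire("tenant-a"))
	fmt.Println("tenant-b:", l.tryAcquire("tenant-b")) // instance limit already reached
	fmt.Println("tenant-c:", l.tryAcquire("tenant-c")) // per-tenant limit is 0
	l.release("tenant-a")
	fmt.Println("tenant-b:", l.tryAcquire("tenant-b"))
}
```

Falling back to sequential evaluation when no slot is free, rather than blocking, keeps queries spread over time and means a saturated instance never delays a rule group more than it would without concurrency.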

Second, and more tricky, the queries to run concurrently get selected randomly. What I mean is that, given the concurrency is limited, there's no algorithm to decide which query should be executed concurrently and which shouldn't, among all the independent queries (the ones for which it is feasible to run concurrently). Our goal is to make sure we never miss rule group evaluations. We don't need to run queries concurrently for a rule group that is evaluated every 1m and whose queries all take 10s to run, because we're well below the budget. On the contrary, we want to run concurrently the queries for rule groups that are at risk of missed evaluations. I'm wondering if we can track how long it takes to evaluate each rule group and enable concurrency only for rule groups that take more than 50% of their evaluation period, as a gauge to only do it for rule groups that are at risk of misses.

pracucci · 2024-05-19T07:58:31Z

pkg/ruler/compat.go

@@ -316,6 +317,9 @@ func DefaultTenantManagerFactory(
 // Wrap the queryable with our custom logic.
 wrappedQueryable := WrapQueryableWithReadConsistency(queryable, logger)

+ // Determine if we need to enable concurrent evaluations based on the global flag and per-tenant limits.
+ concurrentEvaluationEnabled := cfg.EnableConcurrentRuleEvaluation && overrides.RulerMaxConcurrentRuleEvaluations(userID) > 0


Do we need EnableConcurrentRuleEvaluation at all? I would just automatically enable it when the max is > 0.

pracucci · 2024-05-19T07:59:10Z

pkg/ruler/ruler.go

@@ -134,6 +134,8 @@ type Config struct {
 // Allow to override timers for testing purposes.
 RingCheckPeriod time.Duration `yaml:"-"`
 rulerSyncQueuePollFrequency time.Duration `yaml:"-"`
+
+ EnableConcurrentRuleEvaluation bool `yaml:"enable_concurrent_rule_evaluation" category:"advanced"`


If we decide to keep this flag (see other comment), then it should be marked as experimental and listed in docs/sources/mimir/configure/about-versioning.md.

pracucci · 2024-05-19T07:59:25Z

pkg/util/validation/limits.go

@@ -181,6 +181,7 @@ type Limits struct {
 RulerRecordingRulesEvaluationEnabled bool `yaml:"ruler_recording_rules_evaluation_enabled" json:"ruler_recording_rules_evaluation_enabled" category:"experimental"`
 RulerAlertingRulesEvaluationEnabled bool `yaml:"ruler_alerting_rules_evaluation_enabled" json:"ruler_alerting_rules_evaluation_enabled" category:"experimental"`
 RulerSyncRulesOnChangesEnabled bool `yaml:"ruler_sync_rules_on_changes_enabled" json:"ruler_sync_rules_on_changes_enabled" category:"advanced"`
+ RulerMaxConcurrentRuleEvaluations int64 `yaml:"ruler_max_concurrent_rule_evaluations" json:"ruler_max_concurrent_rule_evaluations" category:"advanced"`


I would mark it as experimental and list it in docs/sources/mimir/configure/about-versioning.md.

In addition to this, new experimental options should be disabled by default until we're confident in the option and default value.

gotjosh requested review from a team and jdbaldry as code owners May 15, 2024 15:06

Update changelog

b656a68

Signed-off-by: gotjosh <[email protected]>

gotjosh marked this pull request as draft May 16, 2024 08:47

pracucci self-requested a review May 19, 2024 07:55

pracucci reviewed May 19, 2024


56quarters May 20, 2024
