Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce global and per-tenant flags to control rule evaluation concurrency #8146

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

gotjosh
Copy link
Contributor

@gotjosh gotjosh commented May 15, 2024

What this PR does

This change introduces a global boolean flag named -ruler.concurrent-rule-evaluation to control if Mimir runs independent rules concurrently. In addition to this flag, a per tenant configuration option of -ruler.max-concurrent-rule-evaluations is also introduced to control the amount of concurrency we can have per tenant.

By default, the new feature is disabled globally, and it can also be disabled per tenant by using a value of 0 as part of -ruler.max-concurrent-rule-evaluations.

Which issue(s) this PR fixes or relates to

Part of https://github.com/grafana/mimir-squad/issues/2047

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

NB: I couldn't think of a way to test this without incurring in a significant effort to set a test it, but I'm happy to spend the time if we think it's worth it.

…urrency

This change introduces a global boolean flag named `-ruler.concurrent-rule-evaluation` to control if Mimir runs independent rules concurrently.
In addition to this flag, a per tenant configuration option of `-ruler.max-concurrent-rule-evaluations` is also introduced to control the amount of concurrency we can have per tenant.

By default, the new feature is disabled globally and it can also be disabbled per tenant by using a value of `0` as part of `-ruler.max-concurrent-rule-evaluations`.

Signed-off-by: gotjosh <[email protected]>
@gotjosh gotjosh requested review from a team and jdbaldry as code owners May 15, 2024 15:06
Signed-off-by: gotjosh <[email protected]>
@gotjosh gotjosh marked this pull request as draft May 16, 2024 08:47
@gotjosh
Copy link
Contributor Author

gotjosh commented May 16, 2024

Going back to draft, as there are two things I need to do:

  1. Put a global limit around concurrency
  2. Mark the flags as experimental

@pracucci pracucci self-requested a review May 19, 2024 07:55
Copy link
Collaborator

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Josh for working on this. This PR does what is it says and I'm not super sure it does what we need. I see two main issues.

First of all, in a multi-tenant Mimir cluster, the concurrency is unbounded because the max concurrency is configurable on a per-tenant basis but there's no per-ruler instance max concurrency. I think this is something we should do to ensure that each ruler instance will not fire an unbounded number of concurrent queries (we still want the ruler to keep spreading queries over time as much as possible).

Second, and more tricky, the queries to run concurrently get selected randomly. What I mean is that given the concurrency is limited, there's no algorithm to decide which query should be executed concurrently and which shouldn't, among all the independent queries (the ones for which is feasible to run concurrently). Our goal is to make to sure we never miss rule group evaluations. We don't care to run concurrently queries for a rule group that evaluated every 1m and all their queries take 10s to run, because we're well below the budget. On the contrary, we want to run concurrently the queries for rule groups that are at risk of missed evaluation. I'm wondering if we can track how long it takes to evaluate each rule group and enable concurrency only for rule groups that take more than 50% of their evaluation period, as a gauge to only do it for rule groups that are at risk of misses.

@@ -316,6 +317,9 @@ func DefaultTenantManagerFactory(
// Wrap the queryable with our custom logic.
wrappedQueryable := WrapQueryableWithReadConsistency(queryable, logger)

// Determine if we need to enable concurrent evaluations based on the global flag and per-tenant limits.
concurrentEvaluationEnabled := cfg.EnableConcurrentRuleEvaluation && overrides.RulerMaxConcurrentRuleEvaluations(userID) > 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need EnableConcurrentRuleEvaluation  at all? I would just automatically enable it when the max is > 0.

@@ -134,6 +134,8 @@ type Config struct {
// Allow to override timers for testing purposes.
RingCheckPeriod time.Duration `yaml:"-"`
rulerSyncQueuePollFrequency time.Duration `yaml:"-"`

EnableConcurrentRuleEvaluation bool `yaml:"enable_concurrent_rule_evaluation" category:"advanced"`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we decide too keep this flag (see other comment) then should be marked as experimental and then listed in docs/sources/mimir/configure/about-versioning.md.

@@ -181,6 +181,7 @@ type Limits struct {
RulerRecordingRulesEvaluationEnabled bool `yaml:"ruler_recording_rules_evaluation_enabled" json:"ruler_recording_rules_evaluation_enabled" category:"experimental"`
RulerAlertingRulesEvaluationEnabled bool `yaml:"ruler_alerting_rules_evaluation_enabled" json:"ruler_alerting_rules_evaluation_enabled" category:"experimental"`
RulerSyncRulesOnChangesEnabled bool `yaml:"ruler_sync_rules_on_changes_enabled" json:"ruler_sync_rules_on_changes_enabled" category:"advanced"`
RulerMaxConcurrentRuleEvaluations int64 `yaml:"ruler_max_concurrent_rule_evaluations" json:"ruler_max_concurrent_rule_evaluations" category:"advanced"`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would mark as experimental and list it in docs/sources/mimir/configure/about-versioning.md.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to this, new experimental options should be disabled by default until we're confident in the option and default value.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants