
How to resource scale the default GMP rules-evaluator? #971

Closed

clearclaw opened this issue May 14, 2024 · 9 comments

@clearclaw commented May 14, 2024

Since I added a few dozen rewrite rules, the GMP-default rules-evaluator is frequently being OOMKilled. Editing the deployment to raise the resources gets reset to defaults by the nanny, and I'm not seeing a ConfigMap or any other way to override the defaults. What should I do?

I'm currently running a few dozen rewrite rules (some admittedly busy) and expect to be in the middle-to-upper hundreds of rewrite rules (e.g. for SLOs) as we go to production. How do I get this to scale?


@pintohutch (Collaborator)

Hi @clearclaw,

Thanks for reaching out, and apologies that you're hitting this.

I imagine you are using the built-in rule-evaluator as part of the managed collection stack.

This component is hardcoded to have a 1G memory limit, which is what you are likely hitting.

Out of curiosity, how many Rules resources are you using? And are your rule queries typically consolidated into a few groups, or spread among many?

@clearclaw (Author) commented May 14, 2024

Hi @pintohutch!

Yep, GMP and the built-in evaluator. The current breaking load is a ~dozen groups, each of 10-25 rules.

I'm guessing from your question that rule groups are executed in a single transactional context and are thus an indivisible resource unit? I can look at breaking up some of the larger groups, but as this rolls out we'll "naturally" have 10^2 groups, roughly Zipf-distributed: mostly ~5-rule groups, with a handful of 20-30-rule groups at the upper end. Am I looking at unhappiness?

I'm also guessing that the current GMP-default replicaset of 2 evaluators doesn't scale...?

@clearclaw (Author)

@pintohutch What are the primary factors in memory consumption in the default rule-evaluator?

@pintohutch (Collaborator)

Hey @clearclaw,

> I'm guessing from your question that rule groups are executed in a single transactional context and are thus an indivisible resource unit?

Sort of, per the docs:

> Rules within a group are run sequentially at a regular interval, with the same evaluation time.

Essentially, the more distinct rule_groups you have, the more parallel executions there are. If you have a long evaluation interval (e.g. 5m+), you may be able to group more of your rules together so you have fewer concurrent evaluations, and presumably a smaller resource hit. See the sketch below.
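To make that concrete, here's a minimal sketch (hypothetical metric and group names, not from this cluster) of what consolidation can look like in a GMP Rules resource: three rules sharing a 5m cadence live in one group and evaluate sequentially, instead of being spread across three groups that would evaluate in parallel:

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  name: consolidated-slo-rules   # hypothetical name
  namespace: my-app              # hypothetical namespace
spec:
  groups:
    # One group instead of three: rules in a group run sequentially
    # at the group's interval, so fewer groups means fewer concurrent
    # evaluations in flight at any given moment.
    - name: slo-recording-rules
      interval: 5m
      rules:
        - record: job:http_requests:rate5m
          expr: sum by (job) (rate(http_requests_total[5m]))
        - record: job:http_errors:rate5m
          expr: sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
        - record: job:http_error_ratio:rate5m
          expr: |
            sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
            /
            sum by (job) (rate(http_requests_total[5m]))
```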

> I can look at breaking up some of the larger groups, but as this rolls out we'll "naturally" have 10^2 groups, roughly Zipf-distributed: mostly ~5-rule groups, with a handful of 20-30-rule groups at the upper end. Am I looking at unhappiness?

Manually breaking up rule groups does not sound fun :). Alternatively...

The rule-evaluator could be a good candidate for a VPA. If you're using a GKE Standard cluster, you need to ensure vertical Pod autoscaling is enabled first. You can adjust the example to suit your needs. Note: we have not extensively tested this ourselves and would love any feedback you have from using it.
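For anyone landing here later, a minimal sketch of such a VPA, assuming the managed rule-evaluator runs as the rule-evaluator Deployment in the gmp-system namespace (verify the names in your cluster first; this is untested, as noted above):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: rule-evaluator-vpa
  namespace: gmp-system          # assumed namespace of the managed stack
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rule-evaluator         # assumed Deployment name
  updatePolicy:
    updateMode: "Auto"           # VPA may evict pods to apply new requests
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          memory: 256Mi
        maxAllowed:
          memory: 4Gi            # cap so a runaway evaluation can't eat the node
```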

> What are the primary factors in memory consumption in the default rule-evaluator?

Empirically, we've seen memory usage increase with the number of rule groups, as well as with query complexity (i.e. how long the rule-evaluator's gRPC client has to hold the connection open).

Hope that helps.

@lyanco (Collaborator) commented May 14, 2024

Could you possibly paste a smattering of your rules? Long-horizon queries, especially those that reach back further than 25 hours, can be much slower than those that stay within the 25-hour horizon. That slowness can cause the rule evaluator to wait longer, consuming more resources and triggering this issue.
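To illustrate the distinction (hypothetical rules, not from this cluster): the first rule below stays within the 25-hour horizon, while the second reaches back 30 days and is the kind of query that can hold the evaluator's resources far longer:

```yaml
groups:
  - name: horizon-examples
    interval: 5m
    rules:
      # Range stays well inside the ~25h horizon: typically fast.
      - record: job:http_requests:rate1h
        expr: sum by (job) (rate(http_requests_total[1h]))
      # Reaches back 30 days: can be much slower, keeping the
      # evaluator's query in flight (and memory held) far longer.
      - record: job:http_requests:avg30d
        expr: avg_over_time(job:http_requests:rate1h[30d])
```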

pintohutch assigned bwplotka and unassigned pintohutch on May 15, 2024
@clearclaw (Author)

(progress report)

Thanks Daniel. The VPA is great. Memory is consistently riding between 1.4 GB and 1.8 GB, but it is solid as a rock (and I'm amused that the nanny is not resetting it). Which is cool, as we're in frantic prep for a company demo now.

Lee, I'll see about getting you a set of sample rules, but I'm distracted by the impending demo. Meanwhile, the primary offender appears to have been a ruleset attempting to recast Istio's istio_request_duration_milliseconds into seconds (for use with tools like Pyrra that insist on a timebase of seconds). Much has changed today (24 hours, demo, and all that), but our primary load is going to be the GMP Rules versions of Pyrra's generated PrometheusRules for SLOs covering a hundred-ish microservices, gateways, and related bits. Apologies for the delay.
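For context, a sketch of what that ms-to-s recast might look like (hypothetical names; note the histogram's bucket series can't be fully converted this way, since a recording rule can't rescale the values of the le label):

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  name: istio-duration-seconds     # hypothetical name
spec:
  groups:
    - name: istio-unit-conversion
      interval: 1m
      rules:
        # The _sum series converts cleanly: divide milliseconds by 1000.
        - record: istio_request_duration_seconds_sum
          expr: istio_request_duration_milliseconds_sum / 1000
        # The _count series is unitless and carries over unchanged.
        - record: istio_request_duration_seconds_count
          expr: istio_request_duration_milliseconds_count
```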

@pintohutch (Collaborator)

> Thanks Daniel. The VPA is great. Memory is consistently riding between 1.4 GB and 1.8 GB, but it is solid as a rock (and I'm amused that the nanny is not resetting it). Which is cool, as we're in frantic prep for a company demo now.

Fantastic! Maybe we should support built-in autoscaling then; I've created #975 to track it.

@clearclaw (Author) commented May 17, 2024 via email

bernot-dev assigned bernot-dev and bwplotka and unassigned bwplotka and bernot-dev on May 17, 2024
@bwplotka (Collaborator)

Thanks! We can also look into what we can do about extra request/query parameters, though validating all parameters might be better UX on our side. Let us know (ideally in a separate issue) if this becomes a blocker. Thanks!
