How to resource scale the default GMP rules-evaluator? #971
Comments
Hi @clearclaw, Thanks for reaching out, and apologies you are hitting this. I imagine you are using the built-in rule-evaluator as part of the managed collection stack. This component is hardcoded to a 1G memory limit, which is likely what you are hitting. Out of curiosity, how many
Hi @pintohutch ! Yep, GMP and built-in evaluator. Current breaking load is a ~dozen groups, each of 10-25 rules. I'm guessing by your question that rule groups are executed in a single transactional context and thus are an indivisible resource unit? I can look at breaking up some of the larger groups, but as this rolls out we'll "naturally" have 10^2 groups, distributed (zipf curve) from mostly ~5 rules up to a handful of 20-30-rule groups at the upper end. Am I looking at unhappiness? I'm also guessing that the current GMP-default replicaset of 2 evaluators doesn't scale...?
@pintohutch What are the primary factors in memory consumption in the default rule-evaluator?
Hey @clearclaw,
Sort of, per the docs:
Essentially, the more distinct
Manually breaking up rule groups does not sound fun :), alternatively... The rule-evaluator could be a good candidate for a VPA. If you're using a GKE standard cluster, you need to ensure it is enabled first. You can adjust the example to suit your needs. Note: we have not extensively tested this ourselves, and would love any feedback you have in using it.
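As a rough sketch of what such a VPA could look like (this assumes the managed collection defaults of a `rule-evaluator` Deployment in the `gmp-system` namespace and an `evaluator` container name; verify those names in your cluster before applying):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: rule-evaluator
  namespace: gmp-system          # assumed GMP namespace
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rule-evaluator         # assumed deployment name
  updatePolicy:
    updateMode: "Auto"           # let the VPA evict and resize pods
  resourcePolicy:
    containerPolicies:
    - containerName: evaluator   # assumed container name
      controlledResources: ["memory"]
      maxAllowed:
        memory: 4Gi              # cap to taste
```

On a GKE Standard cluster, Vertical Pod Autoscaling must be enabled on the cluster first (e.g. via `gcloud container clusters update CLUSTER_NAME --enable-vertical-pod-autoscaling`); Autopilot has it on by default.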
Empirically, we've seen memory usage increase with the number of rule groups, as well as with the complexity of the query (i.e. how long the rule-evaluator gRPC client has to hold the connection). Hope that helps.
Could you possibly paste a smattering of your rules? Long-horizon queries, especially those that look back further than 25 hours, can be much slower than those within the 25-hour horizon. This slowness could be causing your rule-evaluator to wait longer, consuming more resources and causing this issue.
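To illustrate the distinction, here is a hedged example (field names follow the GMP `Rules` CRD; the metric and rule names are made up):

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  name: horizon-example
  namespace: default
spec:
  groups:
  - name: example
    interval: 60s
    rules:
    # Within the 25h horizon: cheap to evaluate.
    - record: job:http_requests:rate5m
      expr: sum by (job) (rate(http_requests_total[5m]))
    # Looks back 7 days: each evaluation scans far more data
    # and holds the query connection much longer.
    - record: job:http_requests:rate5m_avg7d
      expr: avg_over_time(job:http_requests:rate5m[7d])
```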
(progress report) Thanks Daniel. The VPA is great. Memory is consistently riding between 1.4GB and 1.8GB, but it is solid as a rock (and I'm amused that the nanny is not resetting it). Which is cool, as frantic prep for company demo now.

Lee, I'll see about getting you a set of sample rules but am distracted by the impending demo. Meanwhile, the primary offender appears to have been a ruleset attempting to recast Istio's istio_request_duration_milliseconds into seconds (for use with tools like Pyrra that insist on a timebase of seconds).

Much has changed today, because 24 hours and demo and stuff, but our primary loading is going to be the GMP Rules versions of Pyrra's generated PrometheusRules for SLOs for a hundredish microservices, gateways, and related bits. Apologies for the delay.
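For context, a sketch of what such a millisecond-to-second recast can look like (hypothetical rule names; note that histogram `_bucket` series cannot be fully converted this way, since the `le` label values would also need rewriting, which recording rules cannot do):

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  name: istio-duration-seconds
  namespace: default
spec:
  groups:
  - name: istio-recast
    interval: 30s
    rules:
    # Sum converts cleanly: milliseconds -> seconds.
    - record: istio_request_duration_seconds_sum
      expr: istio_request_duration_milliseconds_sum / 1000
    # Count is unit-less and carries over as-is.
    - record: istio_request_duration_seconds_count
      expr: istio_request_duration_milliseconds_count
```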
Fantastic! Maybe we should support built-in autoscaling then; created #975.
Coolness. In the next ~week I'll be dropping in middle hundreds of Pyrra SLOs and their supporting recording rules. I'll report back how/if that flies.

Dunno if this is your bag, so just mentioning in case it is, as the issue I raised with GCP submarined: pyrra-dev/pyrra#1062. Short version is that upstream Prometheus is generous in what it accepts and GMP/Monarch isn't (I assume from a protobuf reduction), and that breaks stuff. Current result is that I'm running an NGINX/Lua proxy to rewrite queries to Prometheus, and that's, umm, LessGood.

-- JCL then mutters something about please-pretty-please patching the CRDs for OperatorConfig for GitOps deploys.
Thanks! We can also look into what we can do about the extra request/query parameters. That said, validating all parameters might be a better UX on our side. Let us know (ideally in a separate issue) if this becomes a blocker. Thanks!
Since I added a few dozen rewrite rules, the GMP-default rules-evaluator is frequently OOMKilling. Editing the deployment to up the resources gets reset to defaults by the nanny, and I'm not seeing a ConfigMap or other way to override the defaults. What to do?

I'm currently running a few dozen rewrite rules (some admittedly busy) and expect to be in the middle to upper hundreds of rewrite rules (e.g. for SLOs) as we go to production. Trying to figure out how to get this to scale.