[Feature] Alerts for the compaction job metrics #603
We received many alerts in the gardener AC canary landscape saying "Pod 60f2a9-compact-job-cb9tv is not ready for more than 30 minutes". Is that related to this issue?
@DelinaDeng It shouldn't be related to this issue. The issue you mentioned seems to be due to the pods not running the container at all, caused by resource unavailability or a scheduling problem. We can check further if you give us the cluster details.
As per an out-of-band discussion with @shreyas-s-rao, we decided to consider
After more discussion, it was decided that raising an alert for every compaction job's failure would cause a very large number of alerts to be raised, since the job could fail for a multitude of reasons. To get a more holistic understanding of the health of the shoots in the seed, alerts are to be raised at a seed level when more than X% (for example, 10%) of the compaction jobs deployed in the seed fail.
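The seed-level idea described above could be expressed as a Prometheus alerting rule along these lines. This is only a sketch: the alert name, the `succeeded="false"` label selector, and the bare `jobs_total` metric name are assumptions for illustration, not confirmed by this issue (the real metric may carry an etcd-druid-specific prefix and a different failure label).

```yaml
# Hypothetical seed-level alerting rule for the aggregate Prometheus.
# Fires when failed compaction jobs exceed 10% of all compaction jobs
# aggregated across the seed.
groups:
- name: compaction-job-seed.rules
  rules:
  - alert: CompactionJobHighFailureRateInSeed  # hypothetical name
    expr: |
      sum(jobs_total{succeeded="false"})
        /
      sum(jobs_total)
        > 0.10
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: More than 10% of compaction jobs in the seed have failed.
```

The `for: 15m` clause is a common way to avoid flapping on transient failures; the exact hold duration and threshold would need to be tuned for the landscape.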
Feature (What you would like to be added):
We are exposing metrics for the compaction job in etcd-druid via #569. Now we need to whitelist the metrics in g/g and raise the correct alerts based on them. We have decided to raise alerts when `job_duration_seconds` and `jobs_total` with the failed label cross certain thresholds.

Motivation (Why is this needed?):

Approach/Hint to implement the solution (optional):
Threshold for `jobs_total` with the failed label in a seed: 10% of the aggregated `jobs_total` (this alert would be raised per seed).
Threshold for `jobs_total` with the failed label in a shoot: 10% of `jobs_total` (this alert would be raised per shoot).

To raise the alert at seed level, the aggregate Prometheus can be used. The metric `jobs_total` with the failed label is scraped by the cache Prometheus from etcd-druid. The aggregate Prometheus can aggregate `jobs_total` from the cache Prometheus, and seed-level alerts can then be raised on these aggregated metrics from the aggregate Prometheus.

To raise the alert at shoot level, alerts can be raised in the control plane Prometheus. The control plane Prometheus already federates the shoot-specific `jobs_total` from the cache Prometheus. So, to raise the alert for `jobs_total` at shoot level, we need to add an alert for `jobs_total` here.

Another idea is to aggregate the alert data that is already raised on the shoot control plane Prometheus. Alerts from the shoot control plane Prometheus are passed to the aggregate Prometheus in the garden namespace. These are just alert data; we can aggregate this alert data streaming from multiple shoots into the aggregate Prometheus and raise an alert for `jobs_total` at seed level.
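The shoot-level variant described above could look roughly like the following rule in the control plane Prometheus. As with any sketch here, the alert name and the `succeeded="false"` label are assumptions; since each shoot's control plane Prometheus only federates that shoot's metrics, no per-shoot grouping is needed in the expression.

```yaml
# Hypothetical shoot-level alerting rule for the control plane Prometheus.
# Fires when failed compaction jobs for this shoot exceed 10% of its
# total compaction jobs.
groups:
- name: compaction-job-shoot.rules
  rules:
  - alert: CompactionJobHighFailureRateInShoot  # hypothetical name
    expr: |
      sum(jobs_total{succeeded="false"})
        /
      sum(jobs_total)
        > 0.10
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: More than 10% of this shoot's compaction jobs have failed.
```

For the alternative idea of aggregating already-firing shoot alerts at seed level, the aggregate Prometheus could instead alert on a count of the federated `ALERTS{alertname="CompactionJobHighFailureRateInShoot"}` series, though whether to alert on raw metrics or on alert data is the design choice this issue leaves open.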