-
Notifications
You must be signed in to change notification settings - Fork 450
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rework seed alerts to consider all possible conditions #9750
Rework seed alerts to consider all possible conditions #9750
Conversation
Skipping CI for Draft Pull Request. |
@istvanballok I have the feeling a significant number of these alerts get auto-resolved after 5 minutes. Maybe we could discuss increasing the for clause to 15 minutes. |
912852b
to
015111f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm with some minor comments for consideration
pkg/component/observability/monitoring/prometheus/garden/testdata/seed.prometheusrule.test.yaml
Outdated
Show resolved
Hide resolved
pkg/component/observability/monitoring/prometheus/garden/testdata/seed.prometheusrule.test.yaml
Show resolved
Hide resolved
- Topology `shoot` to refer a seed is incorrect. - Change topology from `garden` to `seed` for those alerts checking seed conditions.
We do not want to suppress alerts on seeds that are managed by the Gardener team, even if the seeds are not compliant or the errors are user errors.
Without it, flapping conditions would probably go undetected because the alerts would be reset each time. For example, when the seed is reconciling its artifacts, some of its managed resources can enter the progressing state for a brief period, and to avoid alerts from being restarted, the `max_over_time` function is now introduced to smooth out this behaviour. This is combined with the subquery syntax [1] to query over the last minutes only. This change is also reducing the time before firing an alert in the `for` entry from 30m to 10m for some alerts to alert more aggressively if any of those conditions fail. An exception is the alert `SeedAPIServerUnavailable`, which was already firing after 2m, but now changed to 3m to preserve the same behaviour after the change introducing `max_over_time` to smooth out possible flappings. [1] https://prometheus.io/docs/prometheus/latest/querying/basics/#subquery
The usage of a `<aggregate>_over_time` function is only required to check if there is data within the queried time window that triggers the alert. If the query return data, then the value is always true. Otherwise, the query returns nothing. Therefore, it is meaningless to calculate the maximum over a set of "trues". Instead, using `last_over_time` is a better approach to check if the query returned data.
`mute_on_weekend` should be used for shoot alerts that are customer issues and for which the Gardener team can't really help. However, the Gardener team has full control of the seeds so alerts on those should always be reacted upon: if seed are having issues, then shoots can't be correctly managed.
Add the list of active alerts as in other alerts by `garden_shoot_condition`
- Use common text for all alerts regarding seed conditions - Fix description for the APIServerAvailable alert by adding the list of active alerts as in other alerts by `garden_shoot_condition`
The different condition states are mapped to numbers by the `gardener-metrics-exporter`: ``` 2: Progressing 1: True 0: False -1: Unknown ```
This commit and the next discard most of previous work. The fundamental change is that all seed condition are grouped into one alert only, instead of having a specific alert for condition. For the sake of change diff, this change is split in two commits: this one removes previous alerts, and the next adds the new single alert.
015111f
to
2c7a256
Compare
This and previous commit discard most of previous work. The fundamental change is that all seed condition are now grouped into one alert only, instead of having a specific alert for condition. For the sake of change diff, this change is split in two commits: the previous one removed previous alerts, and this adds the new single alert.
This unit test shows that, if two different seeds, have failing conditions then two different alerts pop up as well.
Mute this alert on weekends as we gain experience on how noisy it will be on live and canary. The goal, though, is that we eventually unmute it.
After testing this change on dev, we found alerting becomes a bit too noisy so we choose to increase the for clause up to 10 minutes again.
For better alert formatting, alert descriptions are not supposed to contain an empty line at the very end.
2c7a256
to
96877ac
Compare
- Unroll the time series in the tests so it's simpler. - Add comment on how conditions states are mapped into numbers.
96877ac
to
ae4870d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
LGTM label has been added. Git tree hash: a290f4d5d604d4a44ef067496d1d23643dc906b7
|
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: istvanballok, rfranzke The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
How to categorize this PR?
/area monitoring
/kind enhancement
What this PR does / why we need it:
The list of current alerts for the seeds is incomplete. Furthermore, we detected some of the current alerts have misconfigurations that prevent them from working correctly. For example, some of them do not deal properly with flapping, or some others are set the
shoot
topology, which are not sent to the right receiver in the Alertmanager configuration.This PR proposes grouping all seed conditions in one single alert. The benefits are twofold. First, the list of alerts is now complete. Second, having a single alert simplifies alert maintenance and reduces the risk of misconfigurations due to alert duplication.
This PR also takes into account that seeds can be shoots as well and therefore queries both metrics
garden_seed_condition
andgarden_shoot_condition
. The seed conditions also show up in the shoot resource when the seed is a shoot but this does not pose a problem. On the contrary, by querying these two condition metrics, both managed and unmanaged seeds (e.g., soils) are tackled by the same alert.The new alert is muted on the weekends for now. We first want to get an idea of how it behaves on canary and live. We might choose to silent it in the Alertmanager directly if it becomes too noisy but the ultimate goal is to unmute it in a follow-up PR.
Finally, please note we attempted to have individual alerts until commit
0a8a328
, when we iterated once again over the work in progress and decided to group all alerts. In consequence, this commit starts over with a clean slate. Nonetheless, we preserved previous commit history.Special notes for your reviewer:
/cc @istvanballok @rickardsjp
Release note: