Rework seed alerts to consider all possible conditions #9750

vicwicker · 2024-05-14T15:32:51Z

How to categorize this PR?

/area monitoring
/kind enhancement

What this PR does / why we need it:

The current list of seed alerts is incomplete. Furthermore, we found that some of the current alerts are misconfigured in ways that prevent them from working correctly. For example, some do not handle flapping properly, and others are set to the shoot topology, so they are not sent to the right receiver in the Alertmanager configuration.

This PR proposes grouping all seed conditions in one single alert. The benefits are twofold. First, the list of alerts is now complete. Second, having a single alert simplifies alert maintenance and reduces the risk of misconfigurations due to alert duplication.

This PR also takes into account that seeds can be shoots as well, and therefore queries both metrics, `garden_seed_condition` and `garden_shoot_condition`. The seed conditions also show up in the shoot resource when the seed is a shoot, but this does not pose a problem. On the contrary, by querying these two condition metrics, both managed and unmanaged seeds (e.g., soils) are covered by the same alert.

The new alert is muted on weekends for now. We first want to get an idea of how it behaves on canary and live. We might choose to silence it in the Alertmanager directly if it becomes too noisy, but the ultimate goal is to unmute it in a follow-up PR.

Finally, please note we attempted to have individual alerts until commit 0a8a328, when we iterated once again over the work in progress and decided to group all alerts. In consequence, this commit starts over with a clean slate. Nonetheless, we preserved previous commit history.

Special notes for your reviewer:

/cc @istvanballok @rickardsjp

Release note:

Introduce a single unified alert for all seed conditions. The previous seed alerts `GardenletDown`, `GardenletUnknown`, `SeedAPIServerUnavailable`, `SeedControlPlaneUnhealthy` and `SeedSystemComponentsUnhealthy` are removed.

gardener-prow · 2024-05-14T15:32:55Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

vicwicker · 2024-05-21T08:29:24Z

@istvanballok I have the feeling that a significant number of these alerts get auto-resolved after 5 minutes. Maybe we could discuss increasing the `for` clause to 15 minutes.

istvanballok

/lgtm with some minor comments for consideration

pkg/component/observability/monitoring/prometheus/garden/testdata/seed.prometheusrule.test.yaml

- Using topology `shoot` to refer to a seed is incorrect.
- Change topology from `garden` to `seed` for those alerts checking seed conditions.

We do not want to suppress alerts on seeds that are managed by the Gardener team, even if the seeds are not compliant or the errors are user errors.

Without smoothing, flapping conditions would probably go undetected because the alerts would be reset each time. For example, when the seed is reconciling its artifacts, some of its managed resources can enter the progressing state for a brief period. To prevent alerts from being restarted, the `max_over_time` function is now introduced to smooth out this behaviour. It is combined with the subquery syntax [1] to query over the last few minutes only. This change also reduces the `for` duration before an alert fires from 30m to 10m for some alerts, to alert more aggressively if any of those conditions fail. An exception is the alert `SeedAPIServerUnavailable`, which already fired after 2m and is now changed to 3m to preserve the same behaviour after introducing `max_over_time` to smooth out possible flapping.

[1] https://prometheus.io/docs/prometheus/latest/querying/basics/#subquery
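To illustrate why smoothing matters for flapping conditions, here is a minimal Python sketch of a simplified alert evaluator. It is not Prometheus code. It assumes one rule evaluation per minute, a `for` duration expressed as a number of evaluation steps, and a series where 1 means "the alert expression matched" and 0 means it did not. The helpers `fires` and `smooth` and the sample series are hypothetical and only exist for this illustration; `smooth` stands in for taking the maximum over a trailing window, as `max_over_time` over a subquery would.

```python
from typing import List


def fires(active_series: List[int], for_steps: int) -> bool:
    """Simplified model of the `for` clause: the expression must stay
    active for `for_steps` evaluation intervals after it first matched.
    Any inactive evaluation resets the pending state."""
    pending = 0  # consecutive active evaluations seen so far
    for active in active_series:
        if active:
            pending += 1
            if pending > for_steps:
                return True
        else:
            pending = 0
    return False


def smooth(active_series: List[int], window: int) -> List[int]:
    """Maximum of the indicator over the trailing `window` evaluations."""
    return [
        max(active_series[max(0, i - window + 1): i + 1])
        for i in range(len(active_series))
    ]


# A condition that fails for two minutes, recovers for one, and repeats.
flapping = [1, 1, 0] * 4

print(fires(flapping, for_steps=3))                    # False
print(fires(smooth(flapping, window=2), for_steps=3))  # True
```

In this model the raw series never stays active long enough to fire, while the smoothed series bridges the one-minute recoveries and does fire.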

An `<aggregate>_over_time` function is only needed to check whether there is data within the queried time window that triggers the alert. If the query returns data, then the value is always true. Otherwise, the query returns nothing. Therefore, it is meaningless to calculate the maximum over a set of "trues". Instead, using `last_over_time` is a better approach to check whether the query returned data.
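The argument can be checked with a small Python model. This is a sketch, not PromQL: it assumes the inner query acts as a filter that yields the constant 1 for a matching sample and nothing otherwise, modelled here as `None`. The helpers `max_over_window` and `last_over_window` are hypothetical stand-ins for `max_over_time` and `last_over_time` applied to one such window.

```python
from typing import List, Optional

# None models "the inner query returned no sample at this step".
Window = List[Optional[int]]


def max_over_window(window: Window) -> Optional[int]:
    """Maximum of the samples present in the window, or None if empty."""
    present = [v for v in window if v is not None]
    return max(present) if present else None


def last_over_window(window: Window) -> Optional[int]:
    """Most recent sample present in the window, or None if empty."""
    present = [v for v in window if v is not None]
    return present[-1] if present else None


windows: List[Window] = [
    [None, None, None],  # no data: neither function returns anything
    [None, 1, None],     # one matching sample in the middle
    [1, None, None],     # matching sample only at the start
    [1, 1, 1],           # matching the whole time
]

for w in windows:
    # Every present sample is 1, so both functions always agree.
    assert max_over_window(w) == last_over_window(w)
```

Under that assumption both functions only signal whether any sample exists in the window, which is why the choice between them is about intent rather than result.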

`mute_on_weekend` should be used for shoot alerts that are customer issues and for which the Gardener team can't really help. However, the Gardener team has full control of the seeds, so alerts on those should always be reacted upon: if seeds are having issues, then shoots can't be correctly managed.

Add the list of active alerts as in other alerts by `garden_shoot_condition`

- Use common text for all alerts regarding seed conditions
- Fix description for the APIServerAvailable alert by adding the list of active alerts as in other alerts by `garden_shoot_condition`

The different condition states are mapped to numbers by the `gardener-metrics-exporter`:

```
 2: Progressing
 1: True
 0: False
-1: Unknown
```
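For readers who want to decode raw metric values, the mapping above can be written as a small Python enum. This is only a sketch: the `ConditionState` class and the `describe` helper are hypothetical names for this illustration and are not part of `gardener-metrics-exporter`.

```python
from enum import IntEnum


class ConditionState(IntEnum):
    """Numeric values as listed in the mapping above."""
    PROGRESSING = 2
    TRUE = 1
    FALSE = 0
    UNKNOWN = -1


def describe(value: float) -> str:
    """Translate an exported sample value back into its state name.
    Raises ValueError for a value outside the mapping."""
    return ConditionState(int(value)).name.capitalize()


print(describe(2))   # Progressing
print(describe(-1))  # Unknown
```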

This commit and the next discard most of the previous work. The fundamental change is that all seed conditions are grouped into one alert only, instead of having a specific alert for each condition. To keep the diff readable, this change is split into two commits: this one removes the previous alerts, and the next adds the new single alert.

This and the previous commit discard most of the previous work. The fundamental change is that all seed conditions are now grouped into one alert only, instead of having a specific alert for each condition. To keep the diff readable, this change is split into two commits: the previous one removed the previous alerts, and this one adds the new single alert.

This unit test shows that, if two different seeds have failing conditions, then two different alerts pop up as well.
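The behaviour the test checks can be pictured with a short Python sketch. It rests on two assumptions that this PR does not confirm: alert instances are keyed by seed name, and any value other than 1 (True) counts as failing. The function `alerts_by_seed`, the seed names, and the sample values are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (seed name, condition type, exported value); all values are illustrative.
Sample = Tuple[str, str, int]


def alerts_by_seed(samples: List[Sample]) -> Dict[str, List[str]]:
    """Group conditions whose value is not 1 (True) by seed name.
    Assumption: one alert instance per seed with a failing condition."""
    failing: Dict[str, List[str]] = defaultdict(list)
    for seed, condition, value in samples:
        if value != 1:
            failing[seed].append(condition)
    return dict(failing)


samples: List[Sample] = [
    ("seed-a", "GardenletReady", 0),
    ("seed-a", "SeedSystemComponentsHealthy", 1),
    ("seed-b", "ExtensionsReady", -1),
]

alerts = alerts_by_seed(samples)
print(len(alerts))  # 2
```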

Mute this alert on weekends as we gain experience on how noisy it will be on live and canary. The goal, though, is that we eventually unmute it.

After testing this change on dev, we found that alerting becomes a bit too noisy, so we chose to increase the `for` clause back to 10 minutes.

For better alert formatting, alert descriptions are not supposed to contain an empty line at the very end.

- Unroll the time series in the tests so it's simpler.
- Add a comment on how condition states are mapped to numbers.

istvanballok

/lgtm

gardener-prow · 2024-05-27T09:35:27Z

LGTM label has been added.

Git tree hash: a290f4d5d604d4a44ef067496d1d23643dc906b7

rfranzke · 2024-05-27T10:19:01Z

/approve

gardener-prow · 2024-05-27T10:19:09Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: istvanballok, rfranzke

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [rfranzke]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

gardener-prow bot requested review from istvanballok and rickardsjp May 14, 2024 15:32

vicwicker force-pushed the add-alert-prometheus-seed branch 2 times, most recently from 912852b to 015111f May 24, 2024 13:01

vicwicker marked this pull request as ready for review May 24, 2024 13:01

gardener-prow bot removed the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) May 24, 2024

istvanballok reviewed May 24, 2024

pkg/component/observability/monitoring/prometheus/garden/testdata/seed.prometheusrule.test.yaml (outdated, resolved)

pkg/component/observability/monitoring/prometheus/garden/testdata/seed.prometheusrule.test.yaml (resolved)

vicwicker added 10 commits May 27, 2024 08:58

Revisit topology for current alerts

36a102c

- Using topology `shoot` to refer to a seed is incorrect.
- Change topology from `garden` to `seed` for those alerts checking seed conditions.

Remove unnecessary labels is_compliant and has_user_errors

18810e0

We do not want to suppress alerts on seeds that are managed by the Gardener team, even if the seeds are not compliant or the errors are user errors.

Fix description for the APIServerAvailable alert

5d92fd8

Add the list of active alerts as in other alerts by `garden_shoot_condition`

Add new alerts to cover all possible seed conditions

ac7a1bf

Homogenize summary and description for seed condition alerts

8439acc

- Use common text for all alerts regarding seed conditions
- Fix description for the APIServerAvailable alert by adding the list of active alerts as in other alerts by `garden_shoot_condition`

Add promtool tests for new seed condition alerts

08f99a4

The different condition states are mapped to numbers by the `gardener-metrics-exporter`:

```
 2: Progressing
 1: True
 0: False
-1: Unknown
```

vicwicker force-pushed the add-alert-prometheus-seed branch from 015111f to 2c7a256 May 27, 2024 08:39

vicwicker added 5 commits May 27, 2024 10:42

Add unit test for multiple shoots and alerts

16d4e22

This unit test shows that, if two different seeds have failing conditions, then two different alerts pop up as well.

Mute new alert on seed conditions

58fbca6

Mute this alert on weekends as we gain experience on how noisy it will be on live and canary. The goal, though, is that we eventually unmute it.

Alert after 10 minutes instead of 6 minutes

a200ec5

After testing this change on dev, we found that alerting becomes a bit too noisy, so we chose to increase the `for` clause back to 10 minutes.

Remove empty line at the end of the alert description

62df34b

For better alert formatting, alert descriptions are not supposed to contain an empty line at the very end.

vicwicker force-pushed the add-alert-prometheus-seed branch from 2c7a256 to 96877ac May 27, 2024 08:42

[review] Address review comments

ae4870d

- Unroll the time series in the tests so it's simpler.
- Add a comment on how condition states are mapped to numbers.

vicwicker force-pushed the add-alert-prometheus-seed branch from 96877ac to ae4870d May 27, 2024 09:30

istvanballok approved these changes May 27, 2024

gardener-prow bot assigned istvanballok May 27, 2024

gardener-prow bot added the lgtm label (indicates that a PR is ready to be merged) May 27, 2024

gardener-prow bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) May 27, 2024

gardener-prow bot merged commit b48c204 into gardener:master May 27, 2024
18 checks passed

vicwicker deleted the add-alert-prometheus-seed branch May 27, 2024 15:30
