Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework seed alerts to consider all possible conditions #9750

Merged
merged 16 commits into from
May 27, 2024

Conversation

vicwicker
Copy link
Contributor

@vicwicker vicwicker commented May 14, 2024

How to categorize this PR?

/area monitoring
/kind enhancement

What this PR does / why we need it:

The list of current alerts for the seeds is incomplete. Furthermore, we detected some of the current alerts have misconfigurations that prevent them from working correctly. For example, some of them do not deal properly with flapping, or some others are set the shoot topology, which are not sent to the right receiver in the Alertmanager configuration.

This PR proposes grouping all seed conditions in one single alert. The benefits are twofold. First, the list of alerts is now complete. Second, having a single alert simplifies alert maintenance and reduces the risk of misconfigurations due to alert duplication.

This PR also takes into account that seeds can be shoots as well and therefore queries both metrics garden_seed_condition and garden_shoot_condition. The seed conditions also show up in the shoot resource when the seed is a shoot but this does not pose a problem. On the contrary, by querying these two condition metrics, both managed and unmanaged seeds (e.g., soils) are tackled by the same alert.

The new alert is muted on the weekends for now. We first want to get an idea of how it behaves on canary and live. We might choose to silent it in the Alertmanager directly if it becomes too noisy but the ultimate goal is to unmute it in a follow-up PR.

Finally, please note we attempted to have individual alerts until commit 0a8a328, when we iterated once again over the work in progress and decided to group all alerts. In consequence, this commit starts over with a clean slate. Nonetheless, we preserved previous commit history.

Special notes for your reviewer:

/cc @istvanballok @rickardsjp

Release note:

Introduce a unified single alert for all seed conditions. Previous seed alerts `GardenletDown`, `GardenletUnknown`,  `SeedAPIServerUnavailable`, `SeedControlPlaneUnhealthy` and `SeedSystemComponentsUnhealthy` are removed.

Copy link
Contributor

gardener-prow bot commented May 14, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@gardener-prow gardener-prow bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. area/monitoring Monitoring (including availability monitoring and alerting) related kind/enhancement Enhancement, improvement, extension cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 14, 2024
@vicwicker
Copy link
Contributor Author

@istvanballok I have the feeling a significant number of these alerts get auto-resolved after 5 minutes. Maybe we could discuss increasing the for clause to 15 minutes.

@vicwicker vicwicker force-pushed the add-alert-prometheus-seed branch 2 times, most recently from 912852b to 015111f Compare May 24, 2024 13:01
@vicwicker vicwicker marked this pull request as ready for review May 24, 2024 13:01
@gardener-prow gardener-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 24, 2024
Copy link
Contributor

@istvanballok istvanballok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm with some minor comments for consideration

- Topology `shoot` to refer a seed is incorrect.
- Change topology from `garden` to `seed` for
  those alerts checking seed conditions.
We do not want to suppress alerts on seeds that are managed by the
Gardener team, even if the seeds are not compliant or the errors are
user errors.
Without it, flapping conditions would probably go undetected because
the alerts would be reset each time.

For example, when the seed is reconciling its artifacts, some of its
managed resources can enter the progressing state for a brief period,
and to avoid alerts from being restarted, the `max_over_time` function
is now introduced to smooth out this behaviour. This is combined with
the subquery syntax [1] to query over the last minutes only.

This change is also reducing the time before firing an alert in the `for`
entry from 30m to 10m for some alerts to alert more aggressively if any of
those conditions fail. An exception is the alert `SeedAPIServerUnavailable`,
which was already firing after 2m, but now changed to 3m to preserve the
same behaviour after the change introducing `max_over_time` to smooth
out possible flappings.

[1] https://prometheus.io/docs/prometheus/latest/querying/basics/#subquery
The usage of a `<aggregate>_over_time` function is only required to
check if there is data within the queried time window that triggers
the alert. If the query return data, then the value is always true.
Otherwise, the query returns nothing.

Therefore, it is meaningless to calculate the maximum over a set
of "trues". Instead, using `last_over_time` is a better approach
to check if the query returned data.
`mute_on_weekend` should be used for shoot alerts that are
customer issues and for which the Gardener team can't really
help. However, the Gardener team has full control of the
seeds so alerts on those should always be reacted upon: if
seed are having issues, then shoots can't be correctly managed.
Add the list of active alerts as in other alerts by `garden_shoot_condition`
- Use common text for all alerts regarding seed conditions
- Fix description for the APIServerAvailable alert by adding
  the list of active alerts as in other alerts by `garden_shoot_condition`
The different condition states are mapped to numbers by the
`gardener-metrics-exporter`:

```
2: Progressing
1: True
0: False
-1: Unknown
```
This commit and the next discard most of previous work. The fundamental
change is that all seed condition are grouped into one alert only,
instead of having a specific alert for condition.

For the sake of change diff, this change is split in two commits: this
one removes previous alerts, and the next adds the new single alert.
This and previous commit discard most of previous work. The fundamental
change is that all seed condition are now grouped into one alert only,
instead of having a specific alert for condition.

For the sake of change diff, this change is split in two commits: the
previous one removed previous alerts, and this adds the new single alert.
This unit test shows that, if two different seeds, have failing
conditions then two different alerts pop up as well.
Mute this alert on weekends as we gain experience on how noisy
it will be on live and canary. The goal, though, is that we
eventually unmute it.
After testing this change on dev, we found alerting becomes
a bit too noisy so we choose to increase the for clause up
to 10 minutes again.
For better alert formatting, alert descriptions are not supposed to
contain an empty line at the very end.
- Unroll the time series in the tests so it's simpler.
- Add comment on how conditions states are mapped into numbers.
Copy link
Contributor

@istvanballok istvanballok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@gardener-prow gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label May 27, 2024
Copy link
Contributor

gardener-prow bot commented May 27, 2024

LGTM label has been added.

Git tree hash: a290f4d5d604d4a44ef067496d1d23643dc906b7

@rfranzke
Copy link
Member

/approve

Copy link
Contributor

gardener-prow bot commented May 27, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: istvanballok, rfranzke

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 27, 2024
@gardener-prow gardener-prow bot merged commit b48c204 into gardener:master May 27, 2024
18 checks passed
@vicwicker vicwicker deleted the add-alert-prometheus-seed branch May 27, 2024 15:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/monitoring Monitoring (including availability monitoring and alerting) related cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. kind/enhancement Enhancement, improvement, extension lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants