#3513: Add GroupMarker interface #3792

Merged

@grobinson-grafana grobinson-grafana commented Apr 2, 2024

This commit adds a new GroupMarker interface that marks the status of groups, for example whether a group is muted because of one or more active or mute time intervals. It renames the existing Marker interface to AlertMarker to avoid confusion.

It is based on #3791.

@grobinson-grafana grobinson-grafana changed the title Add GroupMarker interface #3513: Add GroupMarker interface Apr 3, 2024
types/types.go Outdated
// Muted returns true if the group is muted, otherwise false. If the group
// is muted then it also returns the names of the time intervals that muted
// it.
Muted(groupKey string) ([]string, bool)
Contributor Author

In the original version of GroupMarker I had:

type GroupMarker interface {
	// Muted returns true if the group is muted, otherwise false. If the group
	// is muted then it also returns the names of the time intervals that muted
	// it.
	Muted(groupKey string, fingerprint model.Fingerprint) ([]string, bool)
	...
}

but then I realized we didn't need to store the fingerprint at all.

The reason is that active and mute timings work against routes. That means either all alerts in an aggregation group are suppressed because of time intervals, or none of them are. It is not possible to have two alerts A and B in the same aggregation group where A is muted by a mute time interval and B is not.

Contributor Author

We have found an issue: because groupKey is not guaranteed to be unique, two (or more) different groups can have the same groupKey. In that case, if one group is muted and another is not, both will be marked as muted.
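
To make the failure mode concrete, here is a minimal sketch of a marker that is keyed by groupKey alone. The naiveMarker type and the key value are hypothetical and exist only for illustration; this is not the code in this PR.

```go
package main

import "fmt"

// naiveMarker is a hypothetical marker keyed by groupKey alone.
type naiveMarker struct {
	mutedBy map[string][]string
}

// SetMuted records the time intervals muting the group with this key.
func (m *naiveMarker) SetMuted(groupKey string, names []string) {
	m.mutedBy[groupKey] = names
}

// Muted reports whether any group with this key has been marked as muted.
func (m *naiveMarker) Muted(groupKey string) ([]string, bool) {
	names := m.mutedBy[groupKey]
	return names, len(names) > 0
}

func main() {
	m := &naiveMarker{mutedBy: map[string][]string{}}

	// Two distinct groups that happen to share one key (placeholder value).
	const sharedKey = "shared-group-key"

	// Only the second group is muted by the "weekends" interval...
	m.SetMuted(sharedKey, []string{"weekends"})

	// ...but a lookup on behalf of the first group cannot tell the two
	// apart, so it is reported as muted as well.
	names, muted := m.Muted(sharedKey)
	fmt.Println(names, muted)
}
```

Because the map has a single entry for both groups, whichever group writes last decides the status that every group with that key observes.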

Contributor Author

The following YAML shows a configuration containing two routes with the same groupKey. This happens because the groupKey is calculated using:

  1. The matchers in the route
  2. The group_by labels (default is empty)

And both routes have the same matchers and group_by labels.

receivers:
  - name: test1
  - name: test2
route:
  receiver: test1
  routes:
    - receiver: test1
      matchers:
        - foo=bar
    - receiver: test2
      matchers:
        - foo=bar
      mute_time_intervals:
        - name: weekends

The reason a user might have such a configuration is to mute notifications on the weekends, but still send webhooks to an issue tracker.
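
The collision in the configuration above can be reproduced with a toy key function. This is a sketch under stated assumptions: the route type and sketchGroupKey are hypothetical, and the key format is invented for illustration. It only follows the two inputs listed above (route matchers and group_by labels) and is not Alertmanager's actual key derivation.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// route is a hypothetical, simplified stand-in for a routing tree node.
type route struct {
	receiver string
	matchers []string
	groupBy  []string
}

// sketchGroupKey builds a key from the route's matchers and the values of
// its group_by labels. Illustration only: the format is an assumption.
func sketchGroupKey(r route, alertLabels map[string]string) string {
	// Copy before sorting so the caller's slices are left untouched.
	matchers := append([]string(nil), r.matchers...)
	sort.Strings(matchers)

	groupBy := append([]string(nil), r.groupBy...)
	sort.Strings(groupBy)
	pairs := make([]string, 0, len(groupBy))
	for _, name := range groupBy {
		pairs = append(pairs, fmt.Sprintf("%s=%q", name, alertLabels[name]))
	}
	return "{" + strings.Join(matchers, ",") + "}:{" + strings.Join(pairs, ",") + "}"
}

func main() {
	labels := map[string]string{"foo": "bar"}

	// The two child routes from the YAML: same matchers, default group_by,
	// different receivers, and only the second has a mute time interval.
	r1 := route{receiver: "test1", matchers: []string{"foo=bar"}}
	r2 := route{receiver: "test2", matchers: []string{"foo=bar"}}

	k1, k2 := sketchGroupKey(r1, labels), sketchGroupKey(r2, labels)
	fmt.Println(k1 == k2) // true: the receiver is not part of the key
}
```

Neither the receiver nor the mute_time_intervals setting feeds into the key in this sketch, which is why the two routes are indistinguishable to anything that looks groups up by key alone.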

@grobinson-grafana grobinson-grafana force-pushed the grobinson/mark-muted-alerts-2 branch 2 times, most recently from 7fb35af to f3b9659 Compare April 11, 2024 14:26
@grobinson-grafana grobinson-grafana marked this pull request as draft April 11, 2024 15:01
@grobinson-grafana grobinson-grafana marked this pull request as ready for review April 24, 2024 11:26
// GroupMarker helps to mark groups as active or muted.
// All methods are goroutine-safe.
//
// TODO(grobinson): routeID is used in Muted and SetMuted because groupKey
Contributor Author

I added a TODO in the source code to make it clear that this is not how I want the interface to look. Fixing it is blocked by #3817, which we will fix in another PR.

// All methods are goroutine-safe.
-type Marker interface {
+type AlertMarker interface {
Contributor Author

I renamed this interface to AlertMarker so it could be better differentiated from GroupMarker.

func (m *MemMarker) Muted(routeID, groupKey string) ([]string, bool) {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	status, ok := m.groups[routeID+groupKey]
Contributor Author

Since routeID will be removed in the future I have chosen to just concatenate the two strings together.
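
For readers following along, here is a minimal sketch of how a composite routeID+groupKey map key keeps two groups with the same groupKey apart. The sketchMarker type and the route ID values are hypothetical; the field and parameter names are borrowed from the snippets in this PR, but this is not the real MemMarker. It also assumes that passing an empty list of names clears the marker.

```go
package main

import (
	"fmt"
	"sync"
)

// groupStatus holds the names of the time intervals muting a group.
type groupStatus struct {
	mutedBy []string
}

// sketchMarker is a hypothetical, trimmed-down marker for illustration.
type sketchMarker struct {
	mtx    sync.Mutex
	groups map[string]*groupStatus
}

// SetMuted replaces the list of intervals muting the group. Assumption:
// a nil or empty list means the group is no longer muted.
func (m *sketchMarker) SetMuted(routeID, groupKey string, timeIntervalNames []string) {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	status, ok := m.groups[routeID+groupKey]
	if !ok {
		status = &groupStatus{}
		m.groups[routeID+groupKey] = status
	}
	status.mutedBy = timeIntervalNames
}

// Muted returns the muting interval names and whether the group is muted.
func (m *sketchMarker) Muted(routeID, groupKey string) ([]string, bool) {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	status, ok := m.groups[routeID+groupKey]
	if !ok {
		return nil, false
	}
	return status.mutedBy, len(status.mutedBy) > 0
}

func main() {
	m := &sketchMarker{groups: map[string]*groupStatus{}}

	// Placeholder route IDs; both groups share the same groupKey.
	m.SetMuted("route-2", "key", []string{"weekends"})

	_, first := m.Muted("route-1", "key")
	names, second := m.Muted("route-2", "key")
	fmt.Println(first, second, names) // false true [weekends]
}
```

One caveat with plain concatenation in general: without a separator, different pairs can produce the same string ("a"+"bc" equals "ab"+"c"). Whether that can happen in practice depends on the format of the two values, which this sketch does not model.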

This commit adds a new GroupMarker interface that marks the status
of groups. For example, whether an alert is muted because of one
or more active or mute time intervals.

It renames the existing Marker interface to AlertMarker to avoid
confusion.

Signed-off-by: George Robinson <[email protected]>
Signed-off-by: George Robinson <[email protected]>
Signed-off-by: George Robinson <[email protected]>
This commit changes memMarker to MemMarker as it now implements
both the AlertMarker and GroupMarker interfaces. We can return
*memMarker, but it causes lint to fail.

Signed-off-by: George Robinson <[email protected]>
This commit fixes a bug in SetMuted where a marker could not be
removed. The method now works as documented in the interface.

Signed-off-by: George Robinson <[email protected]>
Signed-off-by: George Robinson <[email protected]>
I realized that since active and mute timings are applied to
the whole group rather than individual alerts within a group,
we can remove fingerprints from the GroupMarker interface.
This will make the code much simpler and also reduce the amount
of data that needs to be tracked.

Signed-off-by: George Robinson <[email protected]>
Signed-off-by: George Robinson <[email protected]>
Signed-off-by: George Robinson <[email protected]>
// Muted returns true if the group is muted, otherwise false. If the group
// is muted then it also returns the names of the time intervals that muted
// it.
Muted(routeID, groupKey string) ([]string, bool)
Contributor Author

I added routeID to prevent non-unique group keys from causing groups to be incorrectly marked as muted. See #3817 for more information.

Member

@gotjosh gotjosh left a comment

Overall looks good, but please see my comment.

Comment on lines +55 to +56
// groupStatus stores the state of the group, and, as applicable, the names
// of all active and mute time intervals that are muting it.
Member

This claims to store the state of the group, but its only attribute is mutedBy. Is this correct?

Contributor Author

Yes! I copied it from here:

// AlertStatus stores the state of an alert and, as applicable, the IDs of
// silences silencing the alert and of other alerts inhibiting the alert.

		status = &groupStatus{}
		m.groups[routeID+groupKey] = status
	}
	status.mutedBy = timeIntervalNames
Member

Don't we need to aggregate all the time-intervals that this has been muted by?

Contributor Author

We don't – the reason is that the Mutes method we just merged returns them all:

func (i *Intervener) Mutes(names []string, now time.Time) (bool, []string, error) {
	var in []string
	for _, name := range names {
		interval, ok := i.intervals[name]
		if !ok {
			return false, nil, fmt.Errorf("time interval %s doesn't exist in config", name)
		}
		for _, ti := range interval {
			if ti.ContainsTime(now.UTC()) {
				in = append(in, name)
			}
		}
	}
	return len(in) > 0, in, nil
}

func NewMarker(r prometheus.Registerer) *MemMarker {
	m := &MemMarker{
		alerts: map[model.Fingerprint]*AlertStatus{},
		groups: map[string]*groupStatus{},
Member

How is this garbage collected? As in, how does the marker know that it no longer needs to store the status of an alert?

Contributor Author

Member

Thanks.

Member

@gotjosh gotjosh left a comment

LGTM

Thanks for clarifying my doubts!

@gotjosh gotjosh merged commit d31a249 into prometheus:main Apr 30, 2024
11 checks passed
TheMeier pushed a commit to TheMeier/alertmanager that referenced this pull request May 3, 2024
* Add GroupMarker interface

This commit adds a new GroupMarker interface that marks the status
of groups. For example, whether an alert is muted because of one
or more active or mute time intervals.

It renames the existing Marker interface to AlertMarker to avoid
confusion.

Signed-off-by: George Robinson <[email protected]>

---------

Signed-off-by: George Robinson <[email protected]>