flux alerts #396

Open
TheKangaroo opened this issue Jan 5, 2024 · 3 comments

@TheKangaroo

I would like to add some alerts for flux.
As discussed here, my alerts rely on a custom kube-state-metrics config, so I'm not sure if this is something that's helpful for others.
Basically, I rely on the kube-state-metrics config from the Flux monitoring-example repo.
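
For readers who have not seen that config, here is a rough sketch of the kind of kube-state-metrics custom resource state entry that yields a `gotk_resource_info` series. It is written from memory of the general format, not copied from the example repo, so treat the API version, the label names and the paths as assumptions and compare against the repo before using it. Only one resource kind is shown.

```yaml
# Sketch only: one entry of a kube-state-metrics CustomResourceStateMetrics config.
kind: CustomResourceStateMetrics
spec:
  resources:
    - groupVersionKind:
        group: kustomize.toolkit.fluxcd.io
        version: v1            # assumption: adjust to the CRD version in your cluster
        kind: Kustomization
      metricNamePrefix: gotk   # prefix + metric name below gives gotk_resource_info
      metrics:
        - name: resource_info
          help: The current state of a Flux Kustomization resource.
          each:
            type: Info
            info:
              labelsFromPath:
                name: [metadata, name]
          labelsFromPath:
            # Assumption: the namespace may be exported under a different label
            # name in your setup; the alert rules in this thread read
            # $labels.namespace.
            exported_namespace: [metadata, namespace]
            # Status of the Ready condition ("True" / "False" / "Unknown").
            ready: [status, conditions, "[type=Ready]", status]
            # Value of spec.suspend, matched as suspended="true" in the rules.
            suspended: [spec, suspend]
```

kube-state-metrics adds the `customresource_group`, `customresource_version` and `customresource_kind` labels itself, which is what the `customresource_kind` matchers in the rules filter on.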

Perhaps it's possible to add a usage description to the alerts?
If so, let me know and I'll send you a PR :)

@jeff-french

@TheKangaroo If the custom KSM config means the rules aren't a good fit here, would you mind sharing them in a Gist (or anywhere else)? I'm about to write alerts for Flux in the next week or so and would love to have a jump start!

@TheKangaroo
Author

Sure, these are basically our alerts (as a Helm template). We'll improve them over time, but for now we started with this.
As I said earlier, they rely on the kube-state-metrics config in the flux monitoring-example repo.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: flux.rules
  labels:
    app.kubernetes.io/part-of: flux
    app.kubernetes.io/component: monitoring
spec:
  groups:
  - name: flux.rules
    rules:
    - alert: FluxKustomizationFailing
      annotations:
        description: Flux Kustomization {{`{{`}} $labels.name {{`}}`}} in namespace {{`{{`}} $labels.namespace {{`}}`}} failed.
        runbook_url: {{ .Values.defaultRules.runbookUrl }}/{{ .Template.Name }}
        summary: Errors while reconciling Flux Kustomization(s)
      expr: gotk_resource_info{customresource_kind=~"Kustomization",
        ready="False"}
      for: 5m
      labels:
        severity: warning
{{- if .Values.defaultRules.additionalRuleLabels }}
{{ toYaml .Values.defaultRules.additionalRuleLabels | indent 8 }}
{{- end }}
    - alert: FluxHelmReleaseFailing
      annotations:
        description: Flux HelmRelease {{`{{`}} $labels.name {{`}}`}} in namespace {{`{{`}} $labels.namespace {{`}}`}} failed.
        runbook_url: {{ .Values.defaultRules.runbookUrl }}/{{ .Template.Name }}
        summary: Errors while reconciling Flux HelmRelease(s)
      expr: gotk_resource_info{customresource_kind=~"HelmRelease",
        ready="False"}
      for: 5m
      labels:
        severity: warning
{{- if .Values.defaultRules.additionalRuleLabels }}
{{ toYaml .Values.defaultRules.additionalRuleLabels | indent 8 }}
{{- end }}
    - alert: FluxSourceFailing
      annotations:
        description: Flux Source {{`{{`}} $labels.name {{`}}`}} in namespace {{`{{`}} $labels.namespace {{`}}`}} failed.
        runbook_url: {{ .Values.defaultRules.runbookUrl }}/{{ .Template.Name }}
        summary: Errors while reconciling Flux Source(s)
      expr: gotk_resource_info{customresource_kind=~"GitRepository|HelmRepository|Bucket|OCIRepository",
        ready="False"}
      for: 5m
      labels:
        severity: warning
{{- if .Values.defaultRules.additionalRuleLabels }}
{{ toYaml .Values.defaultRules.additionalRuleLabels | indent 8 }}
{{- end }}
    - alert: FluxResourceSuspended
      annotations:
        description: Flux Resource {{`{{`}} $labels.name {{`}}`}} in namespace {{`{{`}} $labels.namespace {{`}}`}} suspended.
        runbook_url: {{ .Values.defaultRules.runbookUrl }}/{{ .Template.Name }}
        summary: Flux Resource(s) are suspended for an extended period of time.
      expr: gotk_resource_info{suspended="true"}
      for: 2h
      labels:
        severity: none
{{- if .Values.defaultRules.additionalRuleLabels }}
{{ toYaml .Values.defaultRules.additionalRuleLabels | indent 8 }}
{{- end }}

We send alerts for failing resources: sources (GitRepository, HelmRepository, Bucket, OCIRepository), Kustomizations and HelmReleases.
We added the "suspended" alert with a `for` duration of 2h in case someone suspends a Flux resource while troubleshooting and forgets to resume it afterwards.
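
The template above reads two keys under `.Values.defaultRules`: `runbookUrl` and the optional `additionalRuleLabels`. A minimal values sketch that would satisfy it is shown here; the URL and the `team` label are placeholders for illustration, not values from this thread.

```yaml
# values.yaml (sketch; URL and label below are hypothetical placeholders)
defaultRules:
  # Used as the prefix of every rule's runbook_url annotation.
  runbookUrl: https://example.com/runbooks
  # Optional. Rendered with `indent 8`, so each key lands next to `severity`
  # under the rule's `labels:` map.
  additionalRuleLabels:
    team: platform
```

With these values, each rule's `labels` should render as `severity` plus `team: platform`. Note that Helm's `.Template.Name` expands to the template's file path inside the chart, not to the alert name, so all four rules end up with the same `runbook_url`.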

@jeff-french

Awesome! Thanks for sharing!
