secrets: Improve debuggability & reliability of misconfigured *Monitoring CRs with secrets. #917

Open
bwplotka opened this issue Mar 27, 2024 · 0 comments

(This relates to the unreleased feature from PR #776.)

When a secret is configured in e.g. a PodMonitoring but cannot be found by Prometheus, we get a nice Target page error:

[image: screenshot of the Target page error]

Hopefully this works with the Target Status feature too. I think it does not fail the Prometheus config apply, but I didn't check.

However, when the user forgets to add permissions for an existing, correctly referenced secret, Prometheus scrape config parsing (and reloading) fails: we get a cryptic "unknown" error and the status page shows 401 Unauthorized.

Full log:

{"caller":"main.go:1326","err":"unable to watch secret default/go-synthetic-basic-auth: unknown (get secrets)","level":"error","msg":"Failed to apply configuration","ts":"2024-03-26T21:24:20.265Z"}
{"caller":"main.go:1043","err":"one or more errors occurred while applying the new configuration (--config.file=\"/prometheus/config_out/config.yaml\")","level":"error","msg":"Error reloading config","ts":"2024-03-26T21:24:20.266Z

The consequences of a failing config reload are not as bad as I initially thought: only the affected per-reloader, per-job functionality gets stopped in some state. Still, perhaps there is a way to show a consistent status page error instead of failing the apply.

I have a GKE cluster ready with your changes applied (I will keep it running for some time) if you want to check, e.g. @TheSpiritXIII

AC

  • Ideally, a permission error does not fail the configuration apply but behaves like a not-found secret, a not-found port, etc.
  • Ideally, a permission error results in a more descriptive error log/status than "unknown".
  • Double-check the target status feature for not-found and no-permission errors related to secrets.

Nice to have:

  • Ideally, the operator logs (or provides in status or via webhook) the exact RBAC Role + RoleBinding to apply when it is missing. This is a bit hard to do in the webhook, but easy to log on the collector. The latter, however, is a bit too deep for customers to find. Putting two small-ish YAMLs through target status might be odd too (maybe fine?). For this case we might want to put it in the "analysis/troubleshooting" CLI/functionality we discussed one day.
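
As an illustration of the nice-to-have above, here is a minimal Go sketch that renders the Role + RoleBinding pair for a given secret, so it could be logged or printed by a troubleshooting command. Everything specific here is an assumption for illustration: the function name `secretRBACManifest`, the `secret-reader-` role name prefix, and the `collector` service account in `gmp-system` used in `main` are not taken from this issue and should be checked against the real deployment.

```go
package main

import "fmt"

// secretRBACManifest (hypothetical helper) renders a Role and RoleBinding
// that let the given service account read one named secret.
// The role name prefix "secret-reader-" is made up for this sketch.
func secretRBACManifest(secretNamespace, secretName, saNamespace, saName string) string {
	roleName := "secret-reader-" + secretName
	return fmt.Sprintf(`apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: %[1]s
  namespace: %[2]s
rules:
- apiGroups: [""]
  resources: ["secrets"]
  resourceNames: [%[3]q]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: %[1]s
  namespace: %[2]s
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: %[1]s
subjects:
- kind: ServiceAccount
  name: %[4]s
  namespace: %[5]s
`, roleName, secretNamespace, secretName, saName, saNamespace)
}

func main() {
	// Assumed service account and namespace; adjust to the actual collector.
	fmt.Print(secretRBACManifest("default", "go-synthetic-basic-auth", "gmp-system", "collector"))
}
```

Whether this output belongs in a log line, in target status, or in a separate CLI is exactly the open question above; the sketch only shows that generating the two YAMLs is cheap once the secret reference and the service account are known.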