ServiceMonitor using secrets that are created later #6018

Open
jbnjohnathan opened this issue Oct 18, 2023 · 7 comments
@jbnjohnathan

jbnjohnathan commented Oct 18, 2023

What did you do?
I am deploying Elastic using the Elastic operator.
When deploying the Elastic custom resource, the operator creates the Elastic pods as well as secrets containing the certificates and the password for the instance.
I want to scrape metrics from the Elastic instance, so I deploy a ServiceMonitor to instruct the Prometheus Operator to do this.
The ServiceMonitor needs the username, password, and CA certificate in order to connect to the Elastic instance.
This information is contained in secrets that are created by the Elastic operator.

I am deploying both the Elastic custom resource and the ServiceMonitor using Helm.
After applying the Elastic custom resource it takes a little while for the operator to deploy the Elastic pods and create the secrets containing the certificates and password for the instance.
This means that the ServiceMonitor is created before these secrets exist.
When this happens, Prometheus skips the ServiceMonitor and does not retry it.

level=warn ts=2023-10-17T11:32:39.147133443Z caller=operator.go:2255 component=prometheusoperator msg="skipping servicemonitor" error="failed to get basic auth username: unable to get secret \"prometheus-elastic-basic-auth-username\": secrets \"prometheus-elastic-basic-auth-username\" not found" servicemonitor=NAMESPACE/elk-prometheus-metrics namespace=CLUSTER-NAMESPACE prometheus=user-workload

A possible workaround is to deploy the secrets with Helm, but empty; the operator will then update the secrets with the correct content later.
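
For illustration, such a placeholder for the CA secret could look roughly like this (a sketch only; the name mirrors the tlsConfig reference in the manifest below, and the Elastic operator is expected to fill the secret in later):

apiVersion: v1
kind: Secret
metadata:
  name: {{ .Release.Name }}-es-http-ca-internal   # placeholder, later updated by the Elastic operator
type: Opaque
stringData:
  tls.crt: ""
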
I did this, but got another error:

ts=2023-10-18T09:53:44.698Z caller=manager.go:216 level=error component="scrape manager" msg="error creating new scrape pool" 
err="error creating HTTP client: unable to load specified CA cert /etc/prometheus/certs/secret_NAMESPACE_-s-http-ca-internal_tls.crt: 
open /etc/prometheus/certs/secret_NAMESPACE-es-http-ca-internal_tls.crt: no such file or directory" scrape_pool=serviceMonitor/NAMESPACE/elk-prometheus-metrics-data/0

Did you expect to see something different?

First, I expected Prometheus to retry adding the ServiceMonitor even though the secret did not exist on the first attempt.
This would be the best solution and would require no workaround on my part.

I also have a question: if I add empty secrets instead, will Prometheus reload the information from the secrets when their content changes? It looks like Prometheus mounts the secret as a file and reads it, so when the secret is updated the change is not automatically reflected on the Prometheus server until the secret is reloaded. Is there a mechanism for this?
And does this mechanism work even if loading the certificate failed on the first attempt because it was empty?

Environment

  • Prometheus Operator version:

operator: 0.60.1
prometheus: 2.39.1

  • Kubernetes version information:

v1.25.4+a34b9e9

  • Kubernetes cluster kind:

    OpenShift

  • Manifests:

---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: elk-prometheus-metrics-master
  labels:
    app: elk
spec:
  selector:
    matchLabels:
      common.k8s.elastic.co/type: elasticsearch
      elasticsearch.k8s.elastic.co/statefulset-name: {{ .Release.Name }}-es-master
  endpoints:
    - path: "/_prometheus/metrics"
      port: https
      scheme: https
      basicAuth:
        password:
          name: {{ .Release.Name }}-es-elastic-user
          key: elastic
        username:
          name: prometheus-elastic-basic-auth-username
          key: username
      tlsConfig:
        ca: 
          secret: 
            name: {{ .Release.Name }}-es-http-ca-internal
            key: tls.crt
        serverName: {{ .Release.Name }}-es-http.{{ .Release.Namespace }}.es.local

Anything else we need to know?:

@nicolastakashi
Contributor

Yeah!
IMHO the Operator should fix it on the next reconcile loop; for me, this is a bug we need to fix.

@nicolastakashi
Contributor

Hi @jbnjohnathan
I went through the code using the ServiceMonitor you provided in the issue description, and everything seems to work fine.

I created the ServiceMonitor before the secrets, then created the secrets one by one, and the Prometheus Operator reconciled each secret that was added.

As you can see here:

func (c *Operator) handleSecretAdd(obj interface{}) {

The Prometheus Operator watches any secret that is added, updated, or removed from the cluster and then adds the corresponding Prometheus object (server or agent) back to the reconcile queue.

You can test it yourself by checking the following expression:

sum(rate(prometheus_operator_triggered_total{triggered_by="Secret"}[5m])) by (action)

If you don't see any change in the expression above, check whether you have the secret-field-selector flag configured on your operator; it might be filtering out some secrets.
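
For example, an operator Deployment configured along these lines (a hypothetical excerpt, with an illustrative selector value) never even sees the secrets excluded by the selector, so they never trigger a reconciliation:

# Hypothetical excerpt from a prometheus-operator Deployment
containers:
  - name: prometheus-operator
    args:
      - --secret-field-selector=type!=kubernetes.io/service-account-token,type!=kubernetes.io/dockercfg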

I also noticed you're using version 0.60 while the latest one is 0.68.
I'd suggest upgrading the Operator in case this issue is still happening.


This issue has been automatically marked as stale because it has not had any activity in the last 60 days. Thank you for your contributions.

@github-actions github-actions bot added the stale label Dec 28, 2023
@arghosh93

arghosh93 commented Jan 17, 2024

I am facing a similar problem where the target is not active for a ServiceMonitor.

level=warn ts=2024-01-17T15:17:04.607817181Z caller=resource_selector.go:174 component=prometheusoperator msg="skipping servicemonitor" error="failed to get cert: unable to get secret \"ovn-control-plane-metrics-cert\": secrets \"ovn-control-plane-metrics-cert\" not found" servicemonitor=arghosh-arghosh/monitor-ovn-control-plane-metrics namespace=openshift-user-workload-monitoring prometheus=user-workload

The target showed the error below before it disappeared:

Get "https://10.129.2.248:9108/metrics": open /etc/prometheus/certs/configmap_arghosh-arghosh_openshift-service-ca.crt_service-ca.crt: no such file or directory

However, the secret exists, and in the Prometheus pod I can see the above-mentioned file that the Prometheus Operator is complaining about.

[arghosh@arghosh-thinkpadp1gen3 ~]$ oc get secret ovn-control-plane-metrics-cert
NAME                             TYPE                DATA   AGE
ovn-control-plane-metrics-cert   kubernetes.io/tls   2      50m

sh-4.4$ ls /etc/prometheus/certs/configmap_arghosh-arghosh_openshift-service-ca.crt_service-ca.crt
/etc/prometheus/certs/configmap_arghosh-arghosh_openshift-service-ca.crt_service-ca.crt

Below is the ServiceMonitor definition:

spec:
  endpoints:
  - bearerTokenSecret:
      key: ""
    interval: 30s
    scheme: https
    tlsConfig:
      ca:
        configMap:
          key: service-ca.crt
          name: openshift-service-ca.crt
      cert:
        secret:
          key: tls.crt
          name: ovn-control-plane-metrics-cert
      keySecret:
        key: tls.key
        name: ovn-control-plane-metrics-cert
      serverName: ovn-kubernetes-control-plane.arghosh-arghosh.svc
  jobLabel: app
  namespaceSelector:
    matchNames:
    - arghosh-arghosh
  selector:
    matchLabels:
      app: ovnkube-control-plane

@arghosh93

sh-4.4$ /usr/bin/operator -version
prometheus-operator, version 0.67.1 (branch: rhaos-4.14-rhel-8, revision: 553f776)
build user: root
build date: 20231009-17:54:20

@simonpasquier
Contributor

Looks similar to #6309; I commented there with a possible solution: #6309 (comment)

@simonpasquier simonpasquier pinned this issue Mar 19, 2024
@simonpasquier
Contributor

The cause of the issue is that the addition or update of a secret/configmap only triggers a reconciliation of Prometheus/PrometheusAgent objects living in the same namespace as the secret. But it doesn't reconcile when the secret/configmap's namespace is different from the Prometheus/PrometheusAgent namespace.

func (c *Operator) handleSecretAdd(obj interface{}) {
    o, ok := c.accessor.ObjectMetadata(obj)
    if !ok {
        return
    }

    level.Debug(c.logger).Log("msg", "Secret added")
    c.metrics.TriggerByCounter("Secret", operator.AddEvent).Inc()

    c.enqueueForPrometheusNamespace(o.GetNamespace())
}

The fix should be to call both c.enqueueForPrometheusNamespace and c.enqueueForMonitoringNamespace with the secret's namespace. We can avoid duplicate triggers when we know that the controller watches the same namespaces for Prometheus & monitoring objects.
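
A minimal sketch of that change (reusing the handler above and the enqueueForMonitoringNamespace helper proposed in this comment) could look like this:

func (c *Operator) handleSecretAdd(obj interface{}) {
    o, ok := c.accessor.ObjectMetadata(obj)
    if !ok {
        return
    }

    level.Debug(c.logger).Log("msg", "Secret added")
    c.metrics.TriggerByCounter("Secret", operator.AddEvent).Inc()

    // Reconcile the Prometheus/PrometheusAgent objects living in the secret's namespace...
    c.enqueueForPrometheusNamespace(o.GetNamespace())
    // ...and also the ones selecting monitoring objects (ServiceMonitors, PodMonitors, ...)
    // from the secret's namespace, which may live in a different namespace.
    c.enqueueForMonitoringNamespace(o.GetNamespace())
}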

Having said that, we should avoid a thundering herd problem: reconciling on every secret/configmap update could increase the number of operations significantly. The operator should keep an index of all secrets/configmaps being referenced by ServiceMonitors, PodMonitors, ... and only trigger the reconciliation if there's a match.
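
As a rough illustration of that idea (secretRefIndex and IsReferenced are hypothetical names, not existing operator code), the handlers could guard the enqueue calls like this:

    // Inside handleSecretAdd/handleSecretUpdate, before enqueueing anything:
    // skip the reconciliation when no ServiceMonitor, PodMonitor, ... references this secret.
    if !c.secretRefIndex.IsReferenced(o.GetNamespace(), o.GetName()) {
        return
    }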
