ServiceMonitor using secrets that are created later #6018

Open
jbnjohnathan opened this issue Oct 18, 2023 · 7 comments
@jbnjohnathan

jbnjohnathan commented Oct 18, 2023

What did you do?
I am deploying Elastic using the Elastic operator.
When deploying the Elastic custom resource, the operator creates the Elastic pods as well as secrets containing the certificates and the password for the instance.
I want to scrape metrics from the Elastic instance, so I deploy a ServiceMonitor to instruct the Prometheus Operator to do this.
The ServiceMonitor needs the username, password, and CA certificate in order to connect to the Elastic instance.
This information is contained in secrets that are created by the Elastic operator.

I am deploying both the Elastic custom resource and the ServiceMonitor using Helm.
After applying the Elastic custom resource it takes a little while for the operator to deploy the Elastic pods and create the secrets containing the certificates and password for the instance.
This means that the ServiceMonitor is created before these secrets exist.
When this happens, Prometheus skips the ServiceMonitor and does not retry it.

level=warn ts=2023-10-17T11:32:39.147133443Z caller=operator.go:2255 component=prometheusoperator msg="skipping servicemonitor" error="failed to get basic auth username: unable to get secret \"prometheus-elastic-basic-auth-username\": secrets \"prometheus-elastic-basic-auth-username\" not found" servicemonitor=NAMESPACE/elk-prometheus-metrics namespace=CLUSTER-NAMESPACE prometheus=user-workload

A possible workaround is to deploy the secrets with Helm, but empty; the operator will then update the secrets with the correct content later.
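
For illustration, such a placeholder for the CA secret could look roughly like this (a sketch only; the name mirrors the tlsConfig reference in the manifest below, and the Elastic operator is expected to fill the secret in later):

apiVersion: v1
kind: Secret
metadata:
  name: {{ .Release.Name }}-es-http-ca-internal   # placeholder, later updated by the Elastic operator
type: Opaque
stringData:
  tls.crt: ""
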
I did this, but got another error:

ts=2023-10-18T09:53:44.698Z caller=manager.go:216 level=error component="scrape manager" msg="error creating new scrape pool" 
err="error creating HTTP client: unable to load specified CA cert /etc/prometheus/certs/secret_NAMESPACE_-s-http-ca-internal_tls.crt: 
open /etc/prometheus/certs/secret_NAMESPACE-es-http-ca-internal_tls.crt: no such file or directory" scrape_pool=serviceMonitor/NAMESPACE/elk-prometheus-metrics-data/0

Did you expect to see something different?

First, I expected Prometheus to retry adding the ServiceMonitor even though the secret did not exist on the first attempt.
This would be the best solution and would require no workaround on my part.

I also have a question: if I add empty secrets instead, will Prometheus reload the information from the secrets when their content changes? It looks like Prometheus mounts the secret as a file and reads it, so when the secret is updated the change is not automatically reflected on the Prometheus server until the secret is reloaded. Is there a mechanism for this?
And does this mechanism work even if loading the certificate failed on the first attempt because it was empty?

Environment

  • Prometheus Operator version:

operator: 0.60.1
prometheus: 2.39.1

  • Kubernetes version information:

v1.25.4+a34b9e9

  • Kubernetes cluster kind:

    OpenShift

  • Manifests:

---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: elk-prometheus-metrics-master
  labels:
    app: elk
spec:
  selector:
    matchLabels:
      common.k8s.elastic.co/type: elasticsearch
      elasticsearch.k8s.elastic.co/statefulset-name: {{ .Release.Name }}-es-master
  endpoints:
    - path: "/_prometheus/metrics"
      port: https
      scheme: https
      basicAuth:
        password:
          name: {{ .Release.Name }}-es-elastic-user
          key: elastic
        username:
          name: prometheus-elastic-basic-auth-username
          key: username
      tlsConfig:
        ca: 
          secret: 
            name: {{ .Release.Name }}-es-http-ca-internal
            key: tls.crt
        serverName: {{ .Release.Name }}-es-http.{{ .Release.Namespace }}.es.local

Anything else we need to know?:

@nicolastakashi
Contributor

Yeah!
IMHO the Operator should fix it on the next reconcile loop; for me, this is a bug we need to fix.

@nicolastakashi
Contributor

Hi @jbnjohnathan
I went through the code using the ServiceMonitor you provided in the issue description, and everything seems to work fine.

I created the ServiceMonitor before the secrets, then created the secrets one by one, and the Prometheus Operator reconciled each secret that was added.

As you can see here:

func (c *Operator) handleSecretAdd(obj interface{}) {

The Prometheus Operator watches any secret that is added, updated, or removed from the cluster and then adds the corresponding Prometheus object (server or agent) back to the reconcile queue.

You can test it yourself by checking the following expression:

sum(rate(prometheus_operator_triggered_total{triggered_by="Secret"}[5m])) by (action)

If you don't see any change in the expression above, check whether you have the secret-field-selector flag configured on your operator; it might be filtering out some secrets.
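
For example, an operator Deployment configured along these lines (a hypothetical excerpt, with an illustrative selector value) never even sees the secrets excluded by the selector, so they never trigger a reconciliation:

# Hypothetical excerpt from a prometheus-operator Deployment
containers:
  - name: prometheus-operator
    args:
      - --secret-field-selector=type!=kubernetes.io/service-account-token,type!=kubernetes.io/dockercfg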

I also noticed you're using version 0.60 while the latest one is 0.68.
I'd suggest upgrading the Operator in case this issue is still happening.


This issue has been automatically marked as stale because it has not had any activity in the last 60 days. Thank you for your contributions.

@github-actions github-actions bot added the stale label Dec 28, 2023
@arghosh93

arghosh93 commented Jan 17, 2024

I am facing a similar problem where the target is not active for a ServiceMonitor.

level=warn ts=2024-01-17T15:17:04.607817181Z caller=resource_selector.go:174 component=prometheusoperator msg="skipping servicemonitor" error="failed to get cert: unable to get secret \"ovn-control-plane-metrics-cert\": secrets \"ovn-control-plane-metrics-cert\" not found" servicemonitor=arghosh-arghosh/monitor-ovn-control-plane-metrics namespace=openshift-user-workload-monitoring prometheus=user-workload

The target showed the error below before it disappeared:

Get "https://10.129.2.248:9108/metrics": open /etc/prometheus/certs/configmap_arghosh-arghosh_openshift-service-ca.crt_service-ca.crt: no such file or directory

However, the secret exists, and in the Prometheus pod I can see the above-mentioned file that the Prometheus Operator is complaining about.

[arghosh@arghosh-thinkpadp1gen3 ~]$ oc get secret ovn-control-plane-metrics-cert
NAME                             TYPE                DATA   AGE
ovn-control-plane-metrics-cert   kubernetes.io/tls   2      50m

sh-4.4$ ls /etc/prometheus/certs/configmap_arghosh-arghosh_openshift-service-ca.crt_service-ca.crt
/etc/prometheus/certs/configmap_arghosh-arghosh_openshift-service-ca.crt_service-ca.crt

Below is the ServiceMonitor definition:

spec:
  endpoints:
  - bearerTokenSecret:
      key: ""
    interval: 30s
    scheme: https
    tlsConfig:
      ca:
        configMap:
          key: service-ca.crt
          name: openshift-service-ca.crt
      cert:
        secret:
          key: tls.crt
          name: ovn-control-plane-metrics-cert
      keySecret:
        key: tls.key
        name: ovn-control-plane-metrics-cert
      serverName: ovn-kubernetes-control-plane.arghosh-arghosh.svc
  jobLabel: app
  namespaceSelector:
    matchNames:
    - arghosh-arghosh
  selector:
    matchLabels:
      app: ovnkube-control-plane

@arghosh93

sh-4.4$ /usr/bin/operator -version
prometheus-operator, version 0.67.1 (branch: rhaos-4.14-rhel-8, revision: 553f776)
build user: root
build date: 20231009-17:54:20

@simonpasquier
Contributor

Looks similar to #6309; I commented there with a possible solution: #6309 (comment)

@simonpasquier simonpasquier pinned this issue Mar 19, 2024
@simonpasquier
Contributor

The cause of the issue is that the addition or update of a secret/configmap only triggers a reconciliation of Prometheus/PrometheusAgent objects living in the same namespace as the secret. But it doesn't reconcile when the secret/configmap's namespace is different from the Prometheus/PrometheusAgent namespace.

func (c *Operator) handleSecretAdd(obj interface{}) {
    o, ok := c.accessor.ObjectMetadata(obj)
    if !ok {
        return
    }

    level.Debug(c.logger).Log("msg", "Secret added")
    c.metrics.TriggerByCounter("Secret", operator.AddEvent).Inc()

    c.enqueueForPrometheusNamespace(o.GetNamespace())
}

The fix should be to call both c.enqueueForPrometheusNamespace and c.enqueueForMonitoringNamespace with the secret's namespace. We can avoid duplicate triggers when we know that the controller watches the same namespaces for Prometheus & monitoring objects.
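
A minimal sketch of that change (reusing the handler above and the enqueueForMonitoringNamespace helper proposed in this comment) could look like this:

func (c *Operator) handleSecretAdd(obj interface{}) {
    o, ok := c.accessor.ObjectMetadata(obj)
    if !ok {
        return
    }

    level.Debug(c.logger).Log("msg", "Secret added")
    c.metrics.TriggerByCounter("Secret", operator.AddEvent).Inc()

    // Reconcile the Prometheus/PrometheusAgent objects living in the secret's namespace...
    c.enqueueForPrometheusNamespace(o.GetNamespace())
    // ...and also the ones selecting monitoring objects (ServiceMonitors, PodMonitors, ...)
    // from the secret's namespace, which may live in a different namespace.
    c.enqueueForMonitoringNamespace(o.GetNamespace())
}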

Having said that, we should avoid a thundering herd problem: reconciling on every secret/configmap update could increase the number of operations significantly. The operator should keep an index of all secrets/configmaps being referenced by ServiceMonitors, PodMonitors, ... and only trigger the reconciliation if there's a match.
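
As a rough illustration of that idea (secretRefIndex and IsReferenced are hypothetical names, not existing operator code), the handlers could guard the enqueue calls like this:

    // Inside handleSecretAdd/handleSecretUpdate, before enqueueing anything:
    // skip the reconciliation when no ServiceMonitor, PodMonitor, ... references this secret.
    if !c.secretRefIndex.IsReferenced(o.GetNamespace(), o.GetName()) {
        return
    }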
