auto certificate renewal with restartOnTLSSecretUpdate and cert-manager fails #390

Open
nosvalds opened this issue Jan 11, 2022 · 8 comments · May be fixed by #461

@nosvalds
Contributor

I asked this in https://kubernetes.slack.com/archives/CQSNS615F/p1639651250121300 and was recommended to bring it here.

I’m seeing an issue where my Solr cloud (3 replicas) doesn’t recover when cert-manager renews the certificate. It seems to be tied to my updateStrategy.
What I’m seeing:

  • Certificate is renewed by cert-manager at 60 days (the cert duration is 90 days, the default for Let's Encrypt), and the restartOnTLSSecretUpdate setting starts a rolling restart of the cluster:
solrTLS:  
    restartOnTLSSecretUpdate: true
    pkcs12Secret:
      name: solr-tls
      key: keystore.p12
    keyStorePasswordSecret:
      name: pkcs12-keystore-password
      key: password-key
  • The settings below ensure only 1 pod restarts at a time, so we still have 2 active pods to serve requests:
  updateStrategy:
    managed:
      maxPodsUnavailable: 1
      maxShardReplicasUnavailable: 1

After the 1 pod restarts, its collections are shown as having a “Down” status; in the logs I see:

2021-12-16 10:15:14.746 ERROR (recoveryExecutor-11-thread-3-processing-n:pod-1.default:443_solr x:transaction_10_shard1_replica_n5 c:transaction_10 s:shard1 r:core_node6) [c:transaction_10 s:shard1 r:core_node6 x:transaction_10_shard1_replica_n5] o.a.s.c.RecoveryStrategy Failed to connect leader https://pod-2.default:443/solr on recovery, try again
  • The cluster is stuck in this state, as the updateStrategy settings won't allow the other 2 pods to restart while the first pod's collections are “Down” (see the cluster-status sketch after this list). The error seems to indicate that the pod that has restarted can't communicate with the pods that haven't restarted yet, presumably because of the certificates.

  • A (bad) “workaround” I found is that if I set the below, then all 3 pods will restart and come up healthy. But this obviously has the downside of a short downtime.

updateStrategy:
    managed:
      maxPodsUnavailable: 3
      maxShardReplicasUnavailable: 3
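
For reference, one way to see the replica state while the cluster is stuck like this is the Collections API CLUSTERSTATUS action; a rough sketch (the host is a placeholder for this setup, adjust TLS flags for your environment):

curl -sk "https://<solr-host>/solr/admin/collections?action=CLUSTERSTATUS" \
  | grep -o '"state":"[a-z]*"' | sort | uniq -c
# replicas on the restarted pod stay "state":"down" until recovery succeeds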

I don't think the active certs would have expired, as Let's Encrypt certs have a 90-day duration and they renewed at 60 days (cert-manager's default of 2/3 of the duration). https://cert-manager.io/docs/usage/certificate/#renewal & https://cert-manager.io/docs/faq/#if-renewbefore-or-duration-is-not-defined-what-will-be-the-default-value
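
For context, that renewal timing corresponds to a cert-manager Certificate roughly like the following. This is a sketch only; the issuer and dnsNames are placeholders, and duration/renewBefore just spell out the documented defaults:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: solr-tls
spec:
  secretName: solr-tls
  issuerRef:
    name: letsencrypt-prod   # placeholder ACME ClusterIssuer
    kind: ClusterIssuer
  dnsNames:
    - solr.example.org       # placeholder
  duration: 2160h     # 90 days, matching the Let's Encrypt cert lifetime
  renewBefore: 720h   # default is duration/3, i.e. renewal at ~60 days
  keystores:
    pkcs12:
      create: true    # writes keystore.p12 into the solr-tls secret
      passwordSecretRef:
        name: pkcs12-keystore-password
        key: password-key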

I wonder if the act of cert-manager renewing the certificates invalidates the active one, though? I can't find this specifically in the docs. That would be a problem. It would also explain why I was still seeing this problem when triggering a manual renewal using the cert-manager cmctl CLI (https://cert-manager.io/docs/usage/cmctl/#renew). If this is the case, we would need restartOnTLSSecretUpdate to be able to ignore the updateStrategy.managed.max*Unavailable settings.

Sort of related issue: cert-manager/cert-manager#1168, but Solr has restartOnTLSSecretUpdate for our pods.

@thelabdude thelabdude self-assigned this Jan 11, 2022
@thelabdude
Contributor

@nosvalds Is there a separate truststore.p12 file in the TLS secret? Maybe the problem here is that keystore.p12 is being used as the truststore, and since it's a new cert, it's not trusting the leader's cert? This feels like a truststore thing to me, so can you share the env vars that were generated for a Solr pod (kubectl describe po <pod>)? I just need the SSL-related ones.

@nosvalds
Contributor Author

Here it is for 1 pod. They look the same across the 3 pods. This is in a non-broken state; if you need it in the broken state I'll need to set up a test cluster.

❯ k describe po iati-prod-solrcloud-1 | grep SSL
      SOLR_SSL_ENABLED:                       true
      SOLR_SSL_WANT_CLIENT_AUTH:              false
      SOLR_SSL_NEED_CLIENT_AUTH:              false
      SOLR_SSL_CHECK_PEER_NAME:               false
      SOLR_SSL_CLIENT_HOSTNAME_VERIFICATION:  false
      SOLR_SSL_KEY_STORE:                     /var/solr/tls/keystore.p12
      SOLR_SSL_KEY_STORE_PASSWORD:            <set to the key 'password-key' in secret 'pkcs12-keystore-password'>  Optional: false
      SOLR_SSL_TRUST_STORE:                   /var/solr/tls/keystore.p12
      SOLR_SSL_TRUST_STORE_PASSWORD:          <set to the key 'password-key' in secret 'pkcs12-keystore-password'>  Optional: false

@nosvalds
Contributor Author

FYI if you want to see my manifests they are here: https://github.com/IATI/solr-k8s, main branch.

@thelabdude
Contributor

Thanks Nik ... Solr is getting configured with the keystore as the truststore, which is likely the cause of the problem here (since once the new cert is loaded, it's not trusted by the other nodes).

So there is a bug here: the Solr operator should not set the truststore to the keystore, and should instead just have the JVM fall back to using cacerts. The source of this bug is that you want to use the keystore as the truststore when using self-signed certs, and I incorrectly applied that logic in all cases :-( I think an easy workaround for you for now is to just create a generic secret in your cluster(s) containing the Let's Encrypt root CA for your truststore (PKCS12 format).

You can download the CA pem files from here: https://letsencrypt.org/certificates/

Here's a script that does what you need (change the password). It also imports all the CA certs that come with your JVM, so you'll need to point to a JAVA_HOME for your env before running this script:

#!/bin/bash

# Pick a truststore file name and password
TRUST_STORE=truststore.p12
TRUST_STORE_PASSWORD=test1234

# Point this at the JAVA_HOME for your env
JAVA_HOME=TODO

# Create a PKCS12 file containing all the CA certs already trusted by your JVM
keytool -importkeystore -srckeystore $JAVA_HOME/jre/lib/security/cacerts -srcstorepass changeit \
  -destkeystore $TRUST_STORE -deststoretype pkcs12 -deststorepass $TRUST_STORE_PASSWORD

# Download the Let's Encrypt PEM file
wget https://letsencrypt.org/certs/lets-encrypt-r3.pem

# Import the Let's Encrypt Intermediate CA Cert into the truststore
keytool -import -trustcacerts -alias letsencrypt -file lets-encrypt-r3.pem \
  -keystore $TRUST_STORE -storepass $TRUST_STORE_PASSWORD

# Verify the truststore is readable and the letsencrypt alias exists
keytool -list -keystore $TRUST_STORE -storepass $TRUST_STORE_PASSWORD -storetype pkcs12 | grep letsencrypt

# Create a k8s secret containing the truststore.p12 file and password literal
kubectl create secret generic lets-encrypt-truststore-secret \
  --from-file=truststore.p12=$TRUST_STORE \
  --from-literal=password-key=$TRUST_STORE_PASSWORD

Once the lets-encrypt-truststore-secret exists, update your SolrCloud CRD definition to point to that secret:

spec:
  solrTLS:
    ...
    trustStorePasswordSecret:
      name: lets-encrypt-truststore-secret
      key: password-key
    trustStoreSecret:
      name: lets-encrypt-truststore-secret
      key: truststore.p12

Sorry for the trouble here! Let me know if this workaround works for you for now and I'll get a fix into the next version of the operator that doesn't set the truststore to the keystore.

@nosvalds
Contributor Author

Thanks @thelabdude !

I think this worked. I made some modifications to the script to pull the cacerts file from one of my Solr pods:

#!/bin/bash

# Pick a truststore password
TRUST_STORE=truststore.p12
TRUST_STORE_PASSWORD=test123

# Get JAVA_HOME from a Solr pod, e.g.:
#   kubectl exec <pod-name> -- bash -c 'echo $JAVA_HOME'
JAVA_HOME=TODO # from the command above

# Copy cacerts from the pod to the local machine
kubectl cp <pod-name>:$JAVA_HOME/lib/security/cacerts ./cacerts

# Create a PKCS12 file containing all the CA certs already trusted by your JVM
keytool -importkeystore -srckeystore cacerts -srcstorepass changeit \
  -destkeystore $TRUST_STORE -deststoretype pkcs12 -deststorepass $TRUST_STORE_PASSWORD

# Download the Let's Encrypt PEM file
wget https://letsencrypt.org/certs/lets-encrypt-r3.pem

# Import the Let's Encrypt Intermediate CA Cert into the truststore
keytool -import -trustcacerts -alias letsencrypt -file lets-encrypt-r3.pem \
  -keystore $TRUST_STORE -storepass $TRUST_STORE_PASSWORD

# Verify the truststore is readable and the letsencrypt alias exists
keytool -list -keystore $TRUST_STORE -storepass $TRUST_STORE_PASSWORD -storetype pkcs12 | grep letsencrypt

# Create a k8s secret containing the truststore.p12 file and password literal
kubectl create secret generic lets-encrypt-truststore-secret \
  --from-file=truststore.p12=$TRUST_STORE \
  --from-literal=password-key=$TRUST_STORE_PASSWORD

Then I updated my SolrCloud CRD as directed.

Looks like the truststore was updated appropriately:

❯ k describe po iati-dev-solrcloud-0 | grep SSL
      SOLR_SSL_ENABLED:                       true
      SOLR_SSL_WANT_CLIENT_AUTH:              false
      SOLR_SSL_NEED_CLIENT_AUTH:              false
      SOLR_SSL_CHECK_PEER_NAME:               false
      SOLR_SSL_CLIENT_HOSTNAME_VERIFICATION:  false
      SOLR_SSL_KEY_STORE:                     /var/solr/tls/keystore.p12
      SOLR_SSL_KEY_STORE_PASSWORD:            <set to the key 'password-key' in secret 'pkcs12-keystore-password-dev'>  Optional: false
      SOLR_SSL_TRUST_STORE:                   /var/solr/tls-truststore/truststore.p12
      SOLR_SSL_TRUST_STORE_PASSWORD:          <set to the key 'password-key' in secret 'lets-encrypt-truststore-secret'>  Optional: false

I've only done this on my dev environment so far, which is only 1 SolrCloud pod, so I'm not able to fully test that it's fixed yet.

@thelabdude
Contributor

The fix for this is starting to feel more like it should go into 0.6.0 vs. 0.5.1.

My initial approach for this was to not set the SOLR_SSL_TRUST_STORE env var unless the SolrCloud config explicitly declares it via spec.solrTLS.trustStoreSecret. I thought that would just let the JVM's default truststore apply, but Solr has this in jetty-ssl.xml:

  <Set name="TrustStorePath"><Property name="solr.jetty.truststore" default="./etc/solr-ssl.keystore.jks"/></Set>

So Solr would start failing with:

Caused by: java.lang.IllegalStateException: /opt/solr/server/./etc/solr-ssl.keystore.jks is not a valid keystore
	at org.eclipse.jetty.util.security.CertificateUtils.getKeyStore(CertificateUtils.java:50)
	at org.eclipse.jetty.util.ssl.SslContextFactory.loadTrustStore(SslContextFactory.java:1224)
	at org.eclipse.jetty.util.ssl.SslContextFactory.load(SslContextFactory.java:324)
	at org.eclipse.jetty.util.ssl.SslContextFactory.doStart(SslContextFactory.java:244)

So the current behavior (look for a truststore.p12 in the keystore secret and use it if found, otherwise fall back to the keystore.p12) seems required, given that Solr has this default built into the jetty config. I didn't love the idea of changing default behavior in a bug-fix release anyway, so I think we just keep the current functionality.
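
For reference, bin/solr maps these env vars onto Jetty system properties along these lines (paraphrased, not a verbatim copy of the script):

# Paraphrased from Solr's bin/solr SSL handling:
if [ -n "$SOLR_SSL_TRUST_STORE" ]; then
  SOLR_SSL_OPTS="$SOLR_SSL_OPTS -Dsolr.jetty.truststore=$SOLR_SSL_TRUST_STORE"
fi
# If solr.jetty.truststore is never set, jetty-ssl.xml falls back to its default
# ./etc/solr-ssl.keystore.jks, which doesn't exist in the image, hence the
# IllegalStateException above.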

So for now, I think the work-around Nik used (create your own truststore and put it into a secret) is the best solution until 0.6.0.

In 0.6.0, I do want to add an option to "merge" the JVM's built-in truststore with a user-provided truststore using an initContainer, but that would ideally require a new option in the CRD (something like mergeJavaTruststore: /usr/local/openjdk-11/lib/security/cacerts) or, more hackily, pull from a user-provided env var:

spec:
  customSolrKubeOptions:
    podOptions:
      envVars:
       - name: JAVA_TRUST_STORE
         value: /usr/local/openjdk-11/lib/security/cacerts

For Nik's case, just using the Java default cacerts as the truststore for Solr should fix his issue with Let's Encrypt's certs renewing, as modern Java (https://www.oracle.com/java/technologies/javase/8u141-relnotes.html) includes Let's Encrypt's root CA cert (see: https://letsencrypt.org/docs/certificate-compatibility/). So in 0.6.0, my plan is to support this merge option, and if there's no explicit user-provided truststore, Solr will just boot with the Java cacerts as the truststore.
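
As a rough illustration only (this is a hypothetical sketch of the proposed merge behavior, not something the operator generates today; the container/volume names and the TRUST_STORE_PASS variable are made up), an initContainer could build the merged truststore on an emptyDir shared with the Solr container:

initContainers:
  - name: merge-truststore               # hypothetical name
    image: solr:8.11                     # any image with keytool works
    command: ["bash", "-c"]
    args:
      - |
        # start from the JVM's built-in CA certs
        keytool -importkeystore -srckeystore /usr/local/openjdk-11/lib/security/cacerts \
          -srcstorepass changeit \
          -destkeystore /merged/truststore.p12 -deststoretype pkcs12 -deststorepass "$TRUST_STORE_PASS"
        # then add the user-provided truststore mounted by the operator
        keytool -importkeystore -srckeystore /var/solr/tls-truststore/truststore.p12 \
          -srcstoretype pkcs12 -srcstorepass "$TRUST_STORE_PASS" \
          -destkeystore /merged/truststore.p12 -deststoretype pkcs12 -deststorepass "$TRUST_STORE_PASS"
    volumeMounts:
      - name: merged-truststore          # hypothetical emptyDir also mounted by the Solr container
        mountPath: /merged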

@nosvalds
Contributor Author

I've just gotten around to implementing the workaround on my production environment with 3 Solr pods. After updating the truststore I manually triggered a renewal of the certificate with kubectl cert-manager renew <cert>, and the rolling restart worked as expected with no downtime. Thanks @thelabdude!
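
For anyone repeating this check, the sequence was roughly the following (the certificate name is a placeholder):

# Trigger an early renewal via the cert-manager kubectl plugin
kubectl cert-manager renew <cert-name>

# Watch the rolling restart proceed one pod at a time per maxPodsUnavailable: 1
kubectl get pods -w

The CLUSTERSTATUS check sketched earlier can confirm each pod's replicas return to "active" before the next pod restarts.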

@thelabdude
Contributor

Thank you for following up @nosvalds, glad to hear the work-around works for now.

@HoustonPutman HoustonPutman added cloud security TLS or Auth for Solr labels Apr 6, 2022
@HoustonPutman HoustonPutman added this to the main (v0.6.0) milestone Jul 12, 2022
@gerlowskija gerlowskija removed this from the main (v0.6.0) milestone Aug 3, 2022