
K3S startup stuck in a deadlock when a KMS provider is configured and the node is rebooted #10058

Open
jirenugo opened this issue May 2, 2024 · 4 comments

jirenugo commented May 2, 2024

Environmental Info:
K3s Version:

k3s version v1.29.4+k3s1 (94e29e2e)
go version go1.21.9

Node(s) CPU architecture, OS, and Version:

Linux TDC1792640621 5.15.0-1061-azure #70~20.04.1-Ubuntu SMP Mon Apr 8 15:38:58 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

  • Single node server

Describe the bug:

Steps To Reproduce:

  • Install K3s: curl -sfL https://get.k3s.io | sh -s - server --cluster-init --write-kubeconfig-mode 644
  • Configure a KMS provider by placing a .yaml file under /etc/rancher/k3s/config.yaml.d/ with the following contents (a sketch of the encryption-config.yaml it points at appears after these steps). The KMS provider itself runs as a pod on the cluster:

```yaml
kube-apiserver-arg:
  - "encryption-provider-config=/etc/rancher/k3s/encryption-config.yaml"
  - "encryption-provider-config-automatic-reload"
cluster-init: false
```
  • Restart k3s: systemctl restart k3s
  • Observe that the KMS provider runs as a pod successfully, is correctly configured, and receives encryption calls from the api-server
  • Restart the node
  • Now attempt to start k3s: sudo systemctl start k3s
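
For reference, the encryption-config.yaml referenced above looks roughly like this; the provider name and socket path are placeholders rather than the exact values from my setup:

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - kms:
          apiVersion: v2
          name: my-kms-plugin                              # placeholder plugin name
          endpoint: unix:///var/run/kmsplugin/socket.sock  # placeholder socket path
          timeout: 3s
      - identity: {}   # fallback so pre-existing plaintext data can still be read
```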

Expected behavior:

k3s starts up successfully and starts the KMS pod

Actual behavior:

k3s startup deadlocks: it attempts to decrypt a secret (/registry/secrets/kube-system/k3s-serving) that is now encrypted by the KMS provider, but the KMS provider pod cannot come up until k3s has finished starting.

Are there any workarounds for this issue? Is it possible to configure k3s to store its bootstrap secrets as a different resource type so that they can be exempted from KMS encryption?
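
The closest thing I am aware of is the standard upstream procedure for decrypting data (move identity ahead of kms in the provider list, then rewrite all secrets), but that requires the apiserver and the KMS plugin to both be up, so it would have to run before a reboot rather than after. Roughly, with the same placeholder names as above:

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - identity: {}   # new and rewritten secrets are stored unencrypted
      - kms:           # kept so existing KMS-encrypted secrets can still be read
          apiVersion: v2
          name: my-kms-plugin
          endpoint: unix:///var/run/kmsplugin/socket.sock
          timeout: 3s
# then force every secret to be rewritten under the new first provider:
#   kubectl get secrets --all-namespaces -o json | kubectl replace -f -
```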

Additional context / logs:

Logs from the systemd service attempting to decrypt the secret protected by KMS:

```
● k3s.service - Lightweight Kubernetes
     Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: enabled)
     Active: activating (start) since Wed 2024-05-01 22:09:30 UTC; 6min ago
       Docs: https://k3s.io
   Main PID: 1378 (k3s-server)
      Tasks: 75
     Memory: 682.4M
     CGroup: /system.slice/k3s.service
             ├─1378 /usr/local/bin/k3s server
             └─2167 containerd

May 01 22:15:58 TDC1792640621 k3s[1378]: I0501 22:15:58.418075    1378 controller.go:126] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
May 01 22:15:58 TDC1792640621 k3s[1378]: E0501 22:15:58.418123    1378 controller.go:102] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to download v1beta1.metrics.k8s.io: failed to retrieve openAPI spec, >
May 01 22:15:58 TDC1792640621 k3s[1378]: , Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
May 01 22:15:58 TDC1792640621 k3s[1378]: I0501 22:15:58.419228    1378 controller.go:109] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
May 01 22:15:58 TDC1792640621 k3s[1378]: E0501 22:15:58.569280    1378 transformer.go:163] "failed to decrypt data" err="got unexpected nil transformer"
May 01 22:15:58 TDC1792640621 k3s[1378]: W0501 22:15:58.569326    1378 reflector.go:539] storage/cacher.go:/secrets: failed to list *core.Secret: unable to transform key "/registry/secrets/kube-system/k3s-serving": got unexpected>
May 01 22:15:58 TDC1792640621 k3s[1378]: E0501 22:15:58.569335    1378 cacher.go:475] cacher (secrets): unexpected ListAndWatch error: failed to list *core.Secret: unable to transform key "/registry/secrets/kube-system/k3s-servin>
May 01 22:15:59 TDC1792640621 k3s[1378]: E0501 22:15:59.570628    1378 transformer.go:163] "failed to decrypt data" err="got unexpected nil transformer"
May 01 22:15:59 TDC1792640621 k3s[1378]: W0501 22:15:59.570669    1378 reflector.go:539] storage/cacher.go:/secrets: failed to list *core.Secret: unable to transform key "/registry/secrets/kube-system/k3s-serving": got unexpected>
May 01 22:15:59 TDC1792640621 k3s[1378]: E0501 22:15:59.570679    1378 cacher.go:475] cacher (secrets): unexpected ListAndWatch error: failed to list *core.Secret: unable to transform key "/registry/secrets/kube-system/k3s-servin>
```

brandond commented May 2, 2024

> The KMS provider itself runs as a pod on the cluster.

I'm not familiar with this deployment pattern for KMS providers - why are you trying to do this? It suffers from the obvious chicken-and-egg problem you're running into here, where the cluster can't start because it needs access to something that won't be available until after it's up.

You're trying to figure out how to lock your keys in the car but still open the door. I don't think there's a good way to make this work.


jirenugo commented May 2, 2024

> The KMS provider itself runs as a pod on the cluster.

This is not an uncommon pattern for KMS deployment.
Arguably, k3s has a circular dependency on Kubernetes secrets here. It is unfortunate that this scenario is not part of the conformance tests, at least as far as I can tell.

https://github.com/kubernetes-sigs/aws-encryption-provider
https://github.com/Azure/kubernetes-kms?tab=readme-ov-file
https://github.com/kubernetes/cloud-provider-openstack/blob/master/docs/barbican-kms-plugin/using-barbican-kms-plugin.md
https://github.com/Tencent/tke-kms-plugin/blob/90b71a5c7d78a564567040ebe1ce7135afe99ce5/deployment/tke-kms-plugin.yaml#L4


brandond commented May 6, 2024

K3s uses secrets for a couple of things internally:

  • Node password verification
  • Supervisor/apiserver certificate sync

Both of these should soft-fail and retry until secrets can be read. Where exactly does k3s startup stall?

I see that https://github.com/kubernetes-sigs/aws-encryption-provider, for example, suggests running the KMS plugin as a static pod. Are you doing that by placing the pod spec in a file in /var/lib/rancher/k3s/agent/pod-manifests/, or are you trying to deploy it via kubectl apply?
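
For reference, by a static pod I mean a manifest dropped directly into that directory, something along these lines (the name, image, and socket path are placeholders, not a tested spec):

```yaml
# /var/lib/rancher/k3s/agent/pod-manifests/kms-plugin.yaml (hypothetical)
apiVersion: v1
kind: Pod
metadata:
  name: kms-plugin
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
    - name: kms-plugin
      image: registry.example.com/kms-plugin:v0.1.0   # placeholder image
      volumeMounts:
        - name: socket-dir
          mountPath: /var/run/kmsplugin               # plugin serves its unix socket here
  volumes:
    - name: socket-dir
      hostPath:
        path: /var/run/kmsplugin                      # same path the apiserver dials
        type: DirectoryOrCreate
```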


jirenugo commented May 7, 2024

> suggests running the KMS as a static pod

Yes. Static pods have the same issue.

> Both of these should soft-fail and retry until secrets can be read. Where exactly does k3s startup stall?

I don't know. I attached the logs from the systemd service in the issue description, where it is trying to access /registry/secrets/kube-system/k3s-serving. Does that answer your question? Why does it hard-fail on this secret? I can get more logs if you share instructions.
