
Container should exit with non-zero if auto-config key changed #177

Open
barakbd opened this issue Mar 30, 2022 · 12 comments
Comments

@barakbd

barakbd commented Mar 30, 2022

Describe the bug

Container does not exit when auto config key is incorrect due to a configuration reset.

To reproduce

  1. LDRP container starts with correct auto-config key
  2. Change the auto-config key in the UI
  3. Request a new feature flag value

Expected behavior
The container should exit with non-zero code. See related issue: #165

Logs

│ 2022/03/30 00:14:02.313649 INFO: [AutoConfiguration] Will restart auto-configuration stream to get new data due to  │
│ 2022/03/30 00:14:02.313770 INFO: [AutoConfiguration] Reconnecting in 19.8087 secs                                   │
│ 2022/03/30 00:14:02.313777 INFO: [AutoConfiguration] Reconnecting in 22.0218 secs                                   │
│ 2022/03/30 00:14:23.013052 ERROR: [AutoConfiguration] Invalid auto-configuration key; cannot get environments       │

Docker Image
https://hub.docker.com/layers/launchdarkly/ld-relay/latest/images/sha256-8932028a51c815a016b73a265c5102ec9bfa99125464a9470e6e44768601a4df?context=explore

Notes

I noticed that the ERROR log was generated even without an SDK making a connection or a feature flag request. Does the relay proxy send a heartbeat, or does the server make a call to the LDRP? In other words, how does the LDRP know the auto-config key changed?

@eli-darkly
Contributor

I noticed that the ERROR log was generated even without an SDK making a connection or a feature flag request. Does the relay proxy send a heartbeat, or does the server make a call to the LDRP? In other words, how does the LDRP know the auto-config key changed?

That's because the Relay Proxy maintains a streaming connection to LaunchDarkly for auto-config updates, not unlike the other streaming connection that is used to get flag data updates. This allows it to immediately detect changes like adding or removing an environment from your auto-config profile on the dashboard.

If you change the key, then LaunchDarkly immediately closes that connection— since you would presumably not want anyone who had previously connected with that old key to keep on receiving SDK keys and other information via the auto-config stream. That would be a security violation.

Anyway... regarding your suggestion that the Relay Proxy should quit if this happens, I think we'll have to discuss it internally. Personally I'm not convinced that this is right— I mean, on the one hand the Relay Proxy is not going to be able to keep functioning in any useful way if this has happened, but on the other hand, we have never before defined any circumstance in which it would just terminate the process after having previously had a successful startup, no matter what happened, so this might be surprising/undesirable behavior for some users.

What I'm not sure about is whether this condition is currently detectable via external monitoring. I mean, the error log message is meant to help get your attention in that regard, but if for instance the /status resource doesn't also reflect that there's a problem, it probably should. So I'll look into that.

@barakbd
Author

barakbd commented Mar 30, 2022

The /status endpoint keeps returning 200 and keeps reporting the environments. It should not. BTW, we are using that endpoint as the livenessProbe in K8s, so if it did not return 200, K8s would automatically kill the pod.
There should be 2 fixes:

  1. /status should not return 200
  2. The pod should exit with non-zero
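
Since /status keeps returning 200 in this failure mode (as noted above), one way to get the "kill the pod" behavior without changing the relay is to move the decision into an exec-style check that inspects the response body rather than the HTTP code. Below is a minimal Go sketch of such a check; the port (8030, the relay's default) and the top-level "status" field with values such as "healthy"/"degraded" are assumptions about the /status payload, so verify them against your ld-relay version before relying on this.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

// relayStatus models only the field this check cares about; the real
// /status body contains more detail (environments, version, etc.).
type relayStatus struct {
	Status string `json:"status"` // assumed values: "healthy" / "degraded"
}

func main() {
	client := &http.Client{Timeout: 5 * time.Second}

	// Assumes the Relay Proxy's default port (8030); adjust for your deployment.
	resp, err := client.Get("http://localhost:8030/status")
	if err != nil {
		fmt.Fprintln(os.Stderr, "relay unreachable:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	var s relayStatus
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		fmt.Fprintln(os.Stderr, "could not parse /status body:", err)
		os.Exit(1)
	}

	// Exit non-zero on anything other than "healthy" so an exec-style probe
	// (or a cron job feeding an alert) treats the pod as unhealthy.
	if s.Status != "healthy" {
		fmt.Fprintln(os.Stderr, "relay reports status:", s.Status)
		os.Exit(1)
	}
}
```

Built into the image, something like this could back an exec livenessProbe instead of the plain HTTP GET on /status, so Kubernetes restarts the pod when the relay reports itself degraded.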

@eli-darkly
Contributor

eli-darkly commented Mar 30, 2022

Well, there are three different issues here.

First, yes, we want this failure mode to be discoverable in some way.

Second, regarding the HTTP semantics of the /status resource. It's been consistently the behavior of the Relay Proxy that the /status resource returns 200 as long as the Relay Proxy is actually able to receive and respond to requests on that endpoint. We do not use a status like 503 as a response from that resource, because it would be misleading in terms of HTTP semantics, and also because we want the caller to be able to see details of the service's operational status, not just "it's broken, we won't give you any status details beyond that". Existing users of the Relay Proxy rely on this behavior. So, if we were to provide a way for a monitoring service to make a simple "yes it's fine" vs. "it's bad, kill it" decision based solely on the HTTP response status, we would do that by adding a new endpoint, not by changing the semantics of the current endpoint.

However, since the Relay Proxy is capable of doing more than one thing, this raises a third question of how broken would it have to be before we say "it's bad, kill it." I think in fact I was wrong in my previous comment to say that "the Relay Proxy is not going to be able to keep functioning in any useful way if this has happened [i.e. if the auto-config key has been changed]". It still possesses valid SDK keys of whatever environments it was using up till that point (assuming those keys have not also been rotated). It has only lost the ability to automatically detect changes to the configuration. It is not self-evident that it ought to stop serving flags to existing clients in this case; maybe that's what you want, but it isn't necessarily what everyone would want. There is already a different way to indicate "I don't want existing services that possess this SDK key to be able to continue to get flags", and that is to rotate the SDK key.
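
Given that the HTTP code stays 200, the "how broken is too broken" policy has to live with the caller, based on the detail in the /status body. As an illustration only (the environments map and per-environment "status" values such as "connected"/"disconnected" are assumptions about the payload and may differ between relay versions), here is a Go sketch that flags individual environments rather than making an all-or-nothing decision:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

// Only the fields this example inspects; the real payload has more.
type environmentStatus struct {
	Status string `json:"status"` // assumed values: "connected" / "disconnected"
}

type relayStatus struct {
	Environments map[string]environmentStatus `json:"environments"`
}

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get("http://localhost:8030/status") // assumed default port
	if err != nil {
		fmt.Fprintln(os.Stderr, "status request failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	var s relayStatus
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		fmt.Fprintln(os.Stderr, "could not decode /status body:", err)
		os.Exit(1)
	}

	// The caller decides the policy: here, any environment that is not
	// "connected" makes the check fail, but you could just as well log a
	// warning and keep serving, which matches the behavior discussed above.
	failed := 0
	for key, env := range s.Environments {
		if env.Status != "connected" {
			fmt.Fprintf(os.Stderr, "environment %q reports %q\n", key, env.Status)
			failed++
		}
	}
	if failed > 0 {
		os.Exit(1)
	}
}
```

Whether a lost auto-config stream should fail such a check, or merely be logged while the relay keeps serving with the SDK keys it already has, is exactly the policy question raised in this comment.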

@eli-darkly
Contributor

So I think we will need to do some more internal discussion and the answer may be that this should be configurable behavior. Our customers have many different approaches to deployment and monitoring.

@barakbd
Author

barakbd commented Mar 31, 2022

I tested the ability of the LDRP to relay feature flags once the auto-config key has changed, using this example repo:
https://github.com/launchdarkly/hello-js
I added color to the example code, so when the feature flag is "On" (which it is), the text should turn green and say "Showing your feature to [email protected]".

The LDRP is not able to serve feature flag values and requests return with an error.
[screenshot of the failing request]

@eli-darkly
Contributor

Well, at this point I'm pretty confused. I'm the primary maintainer of the Relay Proxy code, but there's a lot of it so I certainly may have missed or forgotten something. And I haven't yet tried to replicate the test you described. But from my reading of the code right now... I do not see a way for it to do what you're saying it does. It looks to me like it only deactivates an environment if LaunchDarkly specifically told it that that environment was removed from the configuration, or that that SDK key (as in the environment key, not the auto-config key) was deactivated, and I don't see any code path from "auto-config stream was disconnected and then got a 401 error" to deactivating the environments, even if it would arguably be logical for it to do so. The code I'm looking at seems to match the intended behavior as I remembered it, which was "quit on startup if the auto-config key is bad, but after that, any errors on the auto-config stream just mean we won't get updates any more."

@barakbd
Author

barakbd commented Mar 31, 2022

I am happy to get on a Zoom call to test together.
You can email me at [email protected] and we can schedule.

@eli-darkly
Contributor

@barakbd Thank you but I do need to begin with some testing on my own here, as well as further review of the code— and also, it's not clear yet how high a priority this is, not to mention that there is still not agreement on the desired behavior. So I can't carve out time to collaborate closely like that.

@barakbd
Author

barakbd commented Apr 7, 2022

Any updates please?
The issue is we need a way to alert when the key is invalid, as from my testing, the LDRP will stop working.
We could possibly rely on the logs; however, this is cumbersome. It would also make sense to have the container exit, since a new deployment using the updated secret (auto-config key) would be needed anyway. If the container exits, we can alert based on the k8s.container.ready metric (an OTEL Collector metric).
Hope this makes sense.

@louis-launchdarkly
Contributor

Hello @barakbd, thank you for reaching out to us. I think it will be helpful to understand your use case a bit more through a support ticket. Do you mind creating one with us?

@barakbd
Author

barakbd commented Apr 11, 2022

Support ticket created: #20701

@barakbd
Author

barakbd commented Apr 28, 2022

@eli-darkly - I just retested and you are correct. The LDRP continues to serve requests until the container is killed and tries to restart, at which point it fails with a non-zero code. It seems this is the desired behavior; however, I can see this causing confusion when someone updates the configuration (adds/removes a Project/Environment) and the LDRP does not respect the new configuration.
