LD Relay doesn't repopulate Redis when Redis is restarted #141

Open · khepin opened this issue Jun 2, 2021 · 11 comments

@khepin commented Jun 2, 2021

Describe the bug
We use LaunchDarkly and LD Relay for PHP apps, where the recommended approach is to have LD Relay persist the flag data in Redis and have the PHP apps read from Redis.

When the Redis instance is restarted, it loses its data, and the PHP apps then return false for every evaluation.

After Redis had been restarted, LD Relay did not repopulate the data in Redis, and all of our flags returned false for an hour.

A restart of LD Relay fixed the issue.

To reproduce

  • Start LD Relay with persistence to Redis
  • Restart Redis
  • The data is lost in Redis and doesn't get repopulated

Expected behavior

  • Start LD Relay with persistence to Redis
  • Restart Redis
  • LD Relay detects the connection reset to Redis / restart and repopulates the data in Redis

Logs
No useful logs were found

SDK version

Language version, developer tools
PHP 7.4

OS/platform

Additional context
Our cache TTL for Redis was set to 30s, but we were missing flag definitions for almost an hour, until we rebooted LD Relay, which fixed the issue.

@eli-darkly (Contributor)

In the current implementation, there is no way for the Relay Proxy to reliably detect that Redis has restarted.

It can detect that Redis is unavailable at a time when it's actually trying to do something with Redis: if it receives a flag update from LaunchDarkly and tries to save it to Redis, or if it gets a request for flags from an SDK at a time when the cache has expired so it has to read from Redis, and Redis is unavailable at that moment, it will detect that as a database error. In that case, it enters a "database is unavailable" state where it keeps checking until the database is available again, and when it is, it re-saves all the flags.
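
For illustration, a minimal sketch of that recover-then-resave pattern using the redigo client. This is not the Relay Proxy's actual code: writeAllFlags and the key it writes are hypothetical stand-ins.

package main

import (
	"log"
	"time"

	"github.com/gomodule/redigo/redis"
)

// writeAllFlags is a hypothetical stand-in for re-saving the full flag
// set to Redis once the database is reachable again.
func writeAllFlags(c redis.Conn) error {
	_, err := c.Do("SET", "example:$inited", "1")
	return err
}

// recoverAndResave models the "database is unavailable" state: keep
// retrying until Redis answers again, then re-save everything.
func recoverAndResave(addr string) {
	for {
		c, err := redis.Dial("tcp", addr)
		if err == nil {
			err = writeAllFlags(c)
			c.Close()
			if err == nil {
				log.Println("Redis is back; flags re-saved")
				return
			}
		}
		log.Println("Redis still unavailable:", err)
		time.Sleep(time.Second)
	}
}

func main() {
	recoverAndResave("redis:6379")
}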

But if the Redis connection goes away briefly during a time when it wasn't active anyway, the Relay Proxy can't detect that: as far as I know, the Redis client doesn't make such interruptions visible to the application, it handles them transparently. There is no signal that says "the database was available the last time you checked, and it is also available now, but all the data has been lost in between."

In any case, the current situation is that if you know you are going to restart Redis and lose your data, you also need to restart the Relay Proxy.

@khepin (Author) commented Jun 2, 2021

Are there any plans (or avenues to be explored) to move away from the current situation towards something more reliable?

@eli-darkly (Contributor) commented Jun 2, 2021

This is, as far as I know, the first time the issue has been raised, so there haven't been any plans up till now, and it's too soon for me to comment on how we might handle it. Again, at present it's not clear to what extent the Redis client library even allows us to detect this condition, so I'm not sure whether this is an issue specific to the Relay Proxy or a more general one inherent in using Redis without a persistent backing store.

@khepin (Author) commented Jun 2, 2021

From an initial look at the code, I believe you're using the connection pool for Redis, which would indeed hide connection failures. That package also offers the ability to make a direct connection, over which a periodic health check could be sent to verify that the connection is still alive.

The following code, for example, detects connection errors by sending a periodic PING. If I then restart Redis, it detects EOF on the connection on the next attempt.

package main

import (
	"fmt"
	"time"

	"github.com/gomodule/redigo/redis"
)

func main() {
	// Block until the health check reports a connection failure.
	err := <-signalOnConnectionFailure()
	fmt.Println(err)
}

// signalOnConnectionFailure dials Redis directly and sends a PING every
// 30 seconds. The first error (e.g. EOF after a Redis restart) is sent
// on the returned channel, which is then closed.
func signalOnConnectionFailure() <-chan error {
	healthchecks := make(chan error)
	go func() {
		c, err := redis.Dial("tcp", "redis:6379")
		if err != nil {
			healthchecks <- err
			close(healthchecks)
			return
		}
		defer c.Close()

		for {
			<-time.After(30 * time.Second)
			if _, err := c.Do("PING"); err != nil {
				healthchecks <- err
				close(healthchecks)
				return
			}
			// connection healthy; keep polling
		}
	}()
	return healthchecks
}

@bezmax commented Jun 3, 2021

ld-relay could simply maintain and verify a marker entry in Redis. If the entry is gone, that means Redis needs to be repopulated.
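
For example, a minimal sketch of that idea using the redigo client (the marker key name and the repopulate callback are hypothetical stand-ins, not ld-relay's actual API):

package main

import (
	"log"
	"time"

	"github.com/gomodule/redigo/redis"
)

// watchMarker polls for a marker key; if it disappears (e.g. because
// Redis was restarted and lost its data), it invokes repopulate.
func watchMarker(pool *redis.Pool, key string, repopulate func() error) {
	for {
		time.Sleep(30 * time.Second)
		c := pool.Get()
		exists, err := redis.Bool(c.Do("EXISTS", key))
		c.Close()
		if err != nil {
			log.Println("marker check failed:", err)
			continue
		}
		if !exists {
			log.Println("marker entry is gone; repopulating Redis")
			if err := repopulate(); err != nil {
				log.Println("repopulation failed:", err)
			}
		}
	}
}

func main() {
	pool := &redis.Pool{
		Dial: func() (redis.Conn, error) { return redis.Dial("tcp", "redis:6379") },
	}
	watchMarker(pool, "example:marker", func() error {
		return nil // re-save all flag data here
	})
}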

@khepin (Author) commented Jun 3, 2021

Yeah that seems like a very valid approach as well.

@eli-darkly (Contributor)

Well, there already is a marker entry; it's just that up till now the data store implementation has not done any kind of constant polling of the database (except, in the case I mentioned, after it has already run into a server error), but has only accessed it when there was an immediate need to read or write data. We were trying to avoid creating more server traffic than was strictly necessary for applications to use LD.

We may need to re-evaluate that, but we'll need to do it carefully; there are some design decisions to be made. This whole thing is not just part of ld-relay: the entire data store implementation comes from the LD Go SDK, which Relay uses. In applications other than Relay there can be different concerns. For instance, a Go application might only be reading flags from the database that were put there by Relay, and should never populate the database itself, so we would need to build in some extra context-awareness that there isn't a place for in the current internal API. There's also the question of how much of this should be specific to Redis versus other data store integrations: Redis is somewhat unusual in not necessarily being a persistent store (although many of our customers do use it with persistent storage enabled), and with other databases like DynamoDB, where even a small query is subject to throttling, it would generally be undesirable to poll in this way.

So, for now I can only say that this is a thing we're considering.

@eli-darkly (Contributor) commented Jun 3, 2021

@christianladam Yes, that's the marker entry I mentioned, but due to the current internal design of the Go SDK we can't quite do what you're suggesting - Relay does not currently have direct access to the object that's implemented in redis_impl.go, it sees only a higher layer that adds caching around those queries, and caching would defeat the purpose of polling in this case. So, again, whatever we do would require some internal API rethinking.

@christianladam commented Jun 3, 2021

> @christianladam Yes, that's the marker entry I mentioned, but due to the current internal design of the Go SDK we can't quite do what you're suggesting - Relay does not currently have direct access to the object that's implemented in redis_impl.go, it sees only a higher layer that adds caching around those queries, and caching would defeat the purpose of polling in this case. So, again, whatever we do would require some internal API rethinking.

I deleted my comment before your response, but yeah, there's additional work required on the Relay side itself to expose this, since you definitely don't want the LDClients outside the Relay messing with the data population, as you already mentioned.

@eli-darkly (Contributor)

> you definitely don't want the LDClients outside the relay messing with the data population, as you already mentioned

It's not quite as simple as that either. An SDK instance that's configured to use Redis could be using it in a read-only way, where it expects that Relay (or some equivalent process) is responsible for populating the database. Or it could be using Redis as a backing store that belongs to the SDK (i.e. much the same way Relay would use it: the app puts flags in the database, so if the app has to restart and can't reach LD right away, the last known data is there), in which case it's not read-only, and we would probably want the SDK to have the same behavior that's been suggested here. Currently the SDK does have some ability to know which of those situations it's in, but that knowledge isn't quite available at the level where it would need to be to implement this.
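
To make the two situations concrete, here is a sketch of how each would be configured, assuming the v5 Go SDK and its Redis integration (the package paths and builder names below are my understanding of those libraries and should be treated as assumptions):

package main

import (
	ldredis "github.com/launchdarkly/go-server-sdk-redis"
	ld "gopkg.in/launchdarkly/go-server-sdk.v5"
	"gopkg.in/launchdarkly/go-server-sdk.v5/ldcomponents"
)

func main() {
	store := ldcomponents.PersistentDataStore(ldredis.DataStore())

	// Read-only ("daemon") mode: the SDK only reads flags that Relay
	// (or an equivalent process) put in Redis, and never populates the
	// store itself - so it should not be the one repopulating it.
	readOnly := ld.Config{
		DataSource: ldcomponents.ExternalUpdatesOnly(),
		DataStore:  store,
	}

	// Writer mode: the SDK connects to LaunchDarkly itself and uses
	// Redis as its own backing store, so it would want the same
	// re-save-on-restart behavior suggested in this issue.
	writer := ld.Config{
		DataStore: store,
	}

	// Either config would then be passed to ld.MakeCustomClient.
	_, _ = readOnly, writer
}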

LaunchDarklyCI pushed a commit that referenced this issue Jun 3, 2021
(v6 - #9) create RelayCore and move most of the core logic into it
@louis-launchdarkly (Contributor)

An update: recently the SDK team has been looking at the persistent store design, and we will keep this use case in mind.
