[redis] allow the client to reconnect on redis exceptions #1306
base: main
Conversation
@luxe nice! I was wondering if you were able to continue a worker end to end after a transient redis failure with this PR?
jedis = jedisClusterFactory.get();
} catch (Exception e) {
  redisClientRebuildErrorCounter.inc();
  System.out.println("Failed to rebuild redis client");
Should this plumb in log?
fixed
// This will block the overall thread until redis can be connected to.
// It may be a useful strategy for gaining stability on a poorly performing network,
// or a redis cluster that goes down.
while (true) {
I would prefer the configuration to be a number of reconnects. If it's 0 or null then we don't retry; otherwise we retry up to that number of times. Bonus if there is some backoff here so we don't spam retries continuously.
Agree. Switched to retry amount + duration between retries. We have a Retrier.java class in this repo, but it's too specific to grpc so I didn't use it. Might be nice to use something like https://resilience4j.readme.io/docs/retry in the future.
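For illustration, here is a minimal sketch of that retry-amount-plus-wait approach. The method, parameters, and exception handling are assumptions for the sketch, not the PR's exact code; only `jedisClusterFactory` and `redisClientRebuildErrorCounter` come from the diff above:

```java
// Sketch only: rebuild the redis client up to maxAttempts times, sleeping
// waitMillis between attempts. Bounding the loop means a permanently
// unreachable redis surfaces as an IOException instead of an infinite hang.
private JedisCluster buildClientWithRetries(int maxAttempts, long waitMillis)
    throws IOException {
  for (int attempt = 0; ; attempt++) {
    try {
      return jedisClusterFactory.get();
    } catch (Exception e) {
      // Counter name taken from the diff above.
      redisClientRebuildErrorCounter.inc();
      if (attempt >= maxAttempts) {
        throw new IOException("Failed to rebuild redis client", e);
      }
      try {
        Thread.sleep(waitMillis);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        throw new IOException(ie);
      }
    }
  }
}
```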
Added a testing section to the PR. I've observed that the servers/workers can keep running while redis is down. However, new workers will fail to start up when redis is down. We should address that use-case as well.
Fixed. If you also start a worker without redis, it will keep trying to establish the client on startup. When redis is available, the worker will complete its startup.
Yeah I think this will definitely make it easier to deal with redis timeouts. Recovering from
This definitely helps with some of the redis problems I was hitting, and it's working well in general. The only other thing I thought of was backoff, which I could perhaps propose as a followup: if there were packet loss in a big spike, backoff could help recovery. Overall this is a large improvement 🥳
} catch (Exception redisException) {
  // Record redis failure.
  redisErrorCounter.inc();
  log.log(
Minor: do we want SEVERE here or ERROR? I wasn't sure whether the semantics of SEVERE imply the failure isn't recoverable.
switched to ERROR
Oh, there is no Level.ERROR! Switched to Level.WARNING.
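(For context: java.util.logging's Level defines SEVERE, WARNING, INFO, CONFIG, FINE, FINER, and FINEST, but no ERROR, which is why the switch above landed on WARNING. A minimal self-contained sketch of the resulting log call, with an illustrative class name:)

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class RedisLogging {
  private static final Logger log = Logger.getLogger(RedisLogging.class.getName());

  static void logRedisFailure(Exception redisException) {
    // java.util.logging defines SEVERE, WARNING, INFO, CONFIG, FINE, FINER,
    // and FINEST; there is no Level.ERROR, so WARNING is the closest choice
    // below SEVERE.
    log.log(Level.WARNING, "redis operation failed", redisException);
  }
}
```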
public <T> T call(JedisContext<T> withJedis) throws IOException {
  return callImpl(withJedis);
Minor: do we need to make a separate method for this still?
fixed. callImpl folded into call
private void rebuildJedisCluser() {
  try {
    log.log(Level.SEVERE, "Rebuilding redis client");
Is this redundant with the log in the upstream call? E.g., the caller already logs when it catches?
agree, removed
Backoff is a good idea. We should do it as a followup. Our redis cluster currently takes thousands of requests per second, so when the client enters this "retry" state it's already sending significantly less traffic than normal. There are also a lot of places in buildfarm where we want to retry things, so I like the idea of having the same retry framework everywhere with configurable backoffs: for example, all the network calls between server/worker, the queues which have been nonblocking, etc.
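As one possible shape for that followup, a minimal sketch of capped exponential backoff; the class and constants are illustrative placeholders, not something this PR implements:

```java
class Backoff {
  // Placeholder constants; real values would come from configuration.
  private static final long INITIAL_MILLIS = 100;
  private static final long MAX_MILLIS = 30_000;

  // Wait before the given (0-based) retry attempt: doubles each time,
  // capped so a long outage does not grow the wait without bound.
  static long millisBeforeAttempt(int attempt) {
    long backoff = INITIAL_MILLIS << Math.min(attempt, 20);
    return Math.min(backoff, MAX_MILLIS);
  }
}
```

Adding random jitter on top of the cap would further spread out reconnect attempts from many workers at once.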
Can you add the new default configs to the full yaml example and update the docs?
Yeah, this would help improve some of the failure modes I've run into recently 👍! Happy to help where I can on it.
Problem:
When workers have issues connecting to redis, exceptions are thrown from the worker's backplane. Since the backplane is used in many places throughout the worker, this can lead to the worker crashing or, worse, getting stuck in a broken state where it stops processing actions. We want the ability to make the server/worker more resilient to redis issues.
Goal:
A worker should handle all redis exceptions in a single place and have the ability to re-connect the client. The cluster should not become broken when redis goes down. Servers / workers should wait for redis to become available again, reconnect, and continue operating as normal. A build client should not fail because redis was unavailable. Luckily, all interactions with redis occur in RedisClient.java. Therefore, the majority of these goals should be achievable by handling issues within the client.
Changes:
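For illustration, a minimal sketch of the reconnect-on-exception pattern described above (not the PR's exact code; it assumes `JedisContext` exposes a `run(jedis)` method):

```java
// Sketch only: every redis interaction goes through call(), so catching the
// exception here lets one place rebuild the connection and retry, instead of
// every caller handling redis failures itself. The PR bounds the retries by
// configuration; this sketch elides that bound for brevity.
public <T> T call(JedisContext<T> withJedis) throws IOException {
  while (true) {
    try {
      return withJedis.run(jedis);
    } catch (JedisException e) {
      redisErrorCounter.inc();
      log.log(Level.WARNING, "redis call failed; rebuilding client", e);
      rebuildJedisCluser(); // method name as it appears in the diff above
    }
  }
}
```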
Testing:
Here's what I saw anecdotally from this change.
If redis goes down in the middle of a build, the following occurs:
When redis comes back, the following occurs: