[redis] allow the client to reconnect on redis exceptions #1306

Open · wants to merge 13 commits into main

Conversation

@luxe (Collaborator) commented Mar 30, 2023

Problem:

When workers have issues connecting to redis, exceptions are thrown from the worker's backplane. Since the backplane is used in many places throughout the worker, this can lead to the worker crashing or, worse, getting stuck in a broken state where it stops processing actions. We want the ability to make the server/worker more resilient to redis issues.

Goal:

A worker should handle all redis exceptions in a single place and have the ability to reconnect the client. The cluster should not become broken when redis goes down. Servers and workers should wait for redis to become available again, reconnect, and continue operating as normal. A build client should not fail because redis was unavailable. Luckily, all interactions with redis occur in RedisClient.java, so the majority of these goals should be achievable by handling issues within the client.
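
Conceptually, every redis operation goes through one wrapper that catches Jedis exceptions, rebuilds the client, and retries. A minimal sketch of that shape (hypothetical names, not the exact code in this PR):

import java.util.function.Function;
import java.util.function.Supplier;

import redis.clients.jedis.JedisCluster;
import redis.clients.jedis.exceptions.JedisException;

class ReconnectingRedisClient {
  private final Supplier<JedisCluster> factory; // hypothetical factory, e.g. wrapping the jedis cluster factory
  private JedisCluster jedis;

  ReconnectingRedisClient(Supplier<JedisCluster> factory) {
    this.factory = factory;
    this.jedis = factory.get();
  }

  // Every backplane operation funnels through here, so redis failures are handled in one place.
  synchronized <T> T call(Function<JedisCluster, T> operation) throws InterruptedException {
    while (true) {
      try {
        return operation.apply(jedis);
      } catch (JedisException e) {
        // record the failure (e.g. bump a counter), rebuild the client, and retry
        rebuild();
        Thread.sleep(1000); // fixed delay between attempts; see the review discussion about backoff
      }
    }
  }

  private void rebuild() {
    try {
      jedis = factory.get();
    } catch (Exception e) {
      // rebuilding the client itself failed; the next loop iteration will try again
    }
  }
}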

Changes:

  • allow the client to reconnect on redis exceptions
  • prevent exceptions from surfacing to other code and getting servers / workers stuck
  • print metrics on redis failures so we can track their occurrence (a counter sketch follows this list)
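
For the metrics, the .inc() calls in the diff suggest Prometheus-style counters; here is a sketch of how such counters are typically declared with the Prometheus Java simpleclient (the metric names are assumptions, not the PR's actual names):

import io.prometheus.client.Counter;

class RedisMetrics {
  // incremented whenever a redis operation throws
  static final Counter redisErrorCounter =
      Counter.build().name("redis_errors_total").help("Number of failed redis calls").register();

  // incremented whenever rebuilding the jedis client fails
  static final Counter redisClientRebuildErrorCounter =
      Counter.build()
          .name("redis_client_rebuild_errors_total")
          .help("Number of failed redis client rebuilds")
          .register();
}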

Testing:

Here's what I saw anecdotally from this change.
If redis goes down in the middle of a build, the following occurs:

  • server stays up and keeps trying to reconnect to redis
  • worker stays up and keeps trying to reconnect to redis
  • build client does not fail, but its actions make no progress.

When redis comes back the following occurs:

  • server reconnects and continues normal operations
  • worker reconnects and continues normal operations
  • build client continues making progress and completes a successful build.

@luxe requested a review from werkt as a code owner March 30, 2023 22:15
@jerrymarino (Contributor) left a comment

@luxe nice! I was wondering if you were able to continue a worker end to end after a transient redis failure with this PR?

  jedis = jedisClusterFactory.get();
} catch (Exception e) {
  redisClientRebuildErrorCounter.inc();
  System.out.println("Failed to rebuild redis client");
Contributor:

Should this plumb in log?

Collaborator Author:

fixed

// This will block the overall thread until redis can be connected to.
// It may be a useful strategy for gaining stability on a poorly performing network,
// or a redis cluster that goes down.
while (true) {
Collaborator:

I would prefer the configuration to be a number of reconnects. If it's 0 or null then we don't retry; otherwise we retry up to that number of times. Bonus if there is some backoff here so we don't spam retries continuously.

Collaborator Author:

Agree. Switched to a retry amount plus a duration between retries. We have a Retrier.java class in this repo, but it's too specific to grpc, so I didn't use it. Might be nice to use something like https://resilience4j.readme.io/docs/retry in the future.
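
Roughly this shape (the constant names below are made up for illustration, not the PR's actual config keys):

import java.util.function.Supplier;

import redis.clients.jedis.JedisCluster;

class RetryingClientFactory {
  // hypothetical configuration values
  static final int RETRY_AMOUNT = 5;          // 0 means do not retry
  static final long RETRY_WAIT_MILLIS = 1000; // duration between retries

  static JedisCluster connect(Supplier<JedisCluster> factory) throws InterruptedException {
    for (int attempt = 0; ; attempt++) {
      try {
        return factory.get();
      } catch (RuntimeException e) {
        if (attempt >= RETRY_AMOUNT) {
          throw e; // retries exhausted; surface the failure to the caller
        }
        Thread.sleep(RETRY_WAIT_MILLIS);
      }
    }
  }
}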

@luxe (Collaborator, Author) commented Apr 9, 2023

@luxe nice! I was wondering if you were able to continue a worker end to end after a transient redis failure with this PR?

Added a testing section to the PR. I've observed that the servers/workers can keep running while redis is down. However, new workers will fail to start up when redis is down. We should address that use-case as well.

@luxe (Collaborator, Author) commented Apr 9, 2023

Fixed. If you also start a worker without redis, it will keep trying to establish the client on startup. When redis is available, the worker will complete its startup.

@jerrymarino (Contributor) commented Apr 10, 2023

Yeah, I think this will definitely make it easier to deal with redis timeouts. Recovering from the MatchStage hang I got due to this elasticache event may be handled here, if a critical redis call raises an exception inside of the MatchStage? FWIW it took about 1 hour for that update to go through, so it might not have been a standard outcome. I'm going to try this PR a bit today too.

Apr 05, 2023 1:15:22 PM build.buildfarm.worker.PipelineStage run
SEVERE: MatchStage::run(): stage terminated due to exception
redis.clients.jedis.exceptions.JedisDataException: UNBLOCKED force unblock from blocking operation, instance state changed (master -> replica?)
        at redis.clients.jedis.Protocol.processError(Protocol.java:132)
        at redis.clients.jedis.Protocol.process(Protocol.java:166)
        at redis.clients.jedis.Protocol.read(Protocol.java:220)
        at redis.clients.jedis.Connection.readProtocolWithCheckingBroken(Connection.java:389)
        at redis.clients.jedis.Connection.getBinaryBulkReply(Connection.java:299)
        at redis.clients.jedis.Connection.getBulkReply(Connection.java:289)
        at redis.clients.jedis.Connection$1.call(Connection.java:283)
        at redis.clients.jedis.Connection$1.call(Connection.java:280)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1589)

@jerrymarino (Contributor) left a comment

This definitely helps with some of the redis problems I was hitting; working well in general. The only possible thing I thought of was backoff, which I could perhaps propose as a followup: if there was packet loss in a big spike, backoff could help recovery. But overall this is a large improvement 🥳

} catch (Exception redisException) {
  // Record redis failure.
  redisErrorCounter.inc();
  log.log(
Contributor:

Minor: do we want SEVERE here or ERROR? I wasn't sure if the semantics of SEVERE implied it isn't recoverable.

Collaborator Author:

switched to ERROR

Collaborator Author:

Oh, there is no Level.ERROR! Switched to Level.WARNING.
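
(For context, java.util.logging only has levels like SEVERE, WARNING, and INFO; there is no ERROR. A warning-level log of a redis failure might look like the sketch below; the class and message text are just an example, not the PR's code.)

import java.util.logging.Level;
import java.util.logging.Logger;

class RedisLogging {
  private static final Logger log = Logger.getLogger(RedisLogging.class.getName());

  static void logRedisFailure(Exception redisException) {
    // java.util.logging has SEVERE, WARNING, INFO, ... but no ERROR level
    log.log(Level.WARNING, "Failed to execute redis call; rebuilding client", redisException);
  }
}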

public <T> T call(JedisContext<T> withJedis) throws IOException {
  return callImpl(withJedis);
Contributor:

Minor: do we need to make a separate method for this still?

Collaborator Author:

fixed. callImpl folded into call


private void rebuildJedisCluser() {
  try {
    log.log(Level.SEVERE, "Rebuilding redis client");
Contributor:

Is this redundant with the log in the upstream call? E.g., the caller already logs when it catches?

Collaborator Author:

agree, removed

@luxe (Collaborator, Author) commented Apr 13, 2023

This definitely helps with some of the redis problems I was hitting; working well in general. The only possible thing I thought of was backoff, which I could perhaps propose as a followup: if there was packet loss in a big spike, backoff could help recovery. But overall this is a large improvement 🥳

Backoff is a good idea. We should do it as a followup. Our redis cluster currently takes thousands of requests per second, so when the client enters this "retry" state it is already sending significantly less traffic than normal.

There are also a lot of places in buildfarm where we want to retry things, so I like the idea of having the same retry framework everywhere with configurable backoffs; e.g. all the network calls between server/worker, the queues which have been nonblocking, etc.
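
A shared retry helper with backoff could look roughly like this resilience4j sketch (the settings and names are just an assumption for illustration, not an agreed design):

import java.util.function.Supplier;

import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;

class BackoffRetry {
  // one retry policy that could be shared by redis calls, server/worker network calls, queues, etc.
  private static final RetryConfig CONFIG =
      RetryConfig.custom()
          .maxAttempts(5)
          .intervalFunction(IntervalFunction.ofExponentialBackoff(500, 2.0)) // 500ms, doubling each attempt
          .build();

  static <T> T withRetry(String name, Supplier<T> operation) {
    Retry retry = Retry.of(name, CONFIG);
    return Retry.decorateSupplier(retry, operation).get();
  }
}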

@80degreeswest (Collaborator) left a comment

Can you add the new default configs to the full yaml example and update the docs?

@jerrymarino (Contributor):

There are also a lot of places in buildfarm where we want to retry things, so I like the idea of having the same retry framework everywhere with configurable backoffs; e.g. all the network calls between server/worker, the queues which have been nonblocking, etc.

Yeah, this would help improve some of the failure modes I've run into recently 👍! Happy to help where I can on it.

jerrymarino pushed a commit to bazel-ios/bazel-buildfarm that referenced this pull request Jun 1, 2023