[redis] allow the client to reconnect on redis exceptions #1306

Open · wants to merge 13 commits into main

Conversation

@luxe (Collaborator) commented Mar 30, 2023

Problem:

When workers have issues connecting to redis, exceptions are thrown from the worker's backplane. Since the backplane is used in many places throughout the worker, this can lead to the worker crashing or, worse, getting stuck in a broken state where it stops processing actions. We want the ability to make the server/worker more resilient to redis issues.

Goal:

A worker should handle all redis exceptions in a single place and have the ability to reconnect the client. The cluster should not become broken when redis goes down. Servers and workers should wait for redis to become available again, reconnect, and continue operating as normal. A build client should not fail because redis was unavailable. Luckily, all interactions with redis occur in RedisClient.java, so the majority of these goals should be achievable by handling issues within the client.
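
Conceptually, every redis operation goes through one wrapper that catches Jedis exceptions, rebuilds the client, and retries. A minimal sketch of that shape (hypothetical names, not the exact code in this PR):

import java.util.function.Function;
import java.util.function.Supplier;

import redis.clients.jedis.JedisCluster;
import redis.clients.jedis.exceptions.JedisException;

class ReconnectingRedisClient {
  private final Supplier<JedisCluster> factory; // hypothetical factory, e.g. wrapping the jedis cluster factory
  private JedisCluster jedis;

  ReconnectingRedisClient(Supplier<JedisCluster> factory) {
    this.factory = factory;
    this.jedis = factory.get();
  }

  // Every backplane operation funnels through here, so redis failures are handled in one place.
  synchronized <T> T call(Function<JedisCluster, T> operation) throws InterruptedException {
    while (true) {
      try {
        return operation.apply(jedis);
      } catch (JedisException e) {
        // record the failure (e.g. bump a counter), rebuild the client, and retry
        rebuild();
        Thread.sleep(1000); // fixed delay between attempts; see the review discussion about backoff
      }
    }
  }

  private void rebuild() {
    try {
      jedis = factory.get();
    } catch (Exception e) {
      // rebuilding the client itself failed; the next loop iteration will try again
    }
  }
}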

Changes:

  • allow the client to reconnect on redis exceptions
  • prevent exceptions from surfacing to other code and getting servers / workers stuck
  • print metrics on redis failures so we can track their occurrence (a counter sketch follows this list)
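
For the metrics, the .inc() calls in the diff suggest Prometheus-style counters; here is a sketch of how such counters are typically declared with the Prometheus Java simpleclient (the metric names are assumptions, not the PR's actual names):

import io.prometheus.client.Counter;

class RedisMetrics {
  // incremented whenever a redis operation throws
  static final Counter redisErrorCounter =
      Counter.build().name("redis_errors_total").help("Number of failed redis calls").register();

  // incremented whenever rebuilding the jedis client fails
  static final Counter redisClientRebuildErrorCounter =
      Counter.build()
          .name("redis_client_rebuild_errors_total")
          .help("Number of failed redis client rebuilds")
          .register();
}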

Testing:

Here's what I saw anecdotally from this change.
If redis goes down in the middle of a build, the following occurs:

  • server stays up and keeps trying to reconnect to redis
  • worker stays up and keeps trying to reconnect to redis
  • build client does not fail, but its actions make no progress.

When redis comes back the following occurs:

  • server reconnects and continues normal operations
  • worker reconnects and continues normal operations
  • build client continues making progress and completes a successful build.

@luxe requested a review from werkt as a code owner March 30, 2023 22:15
@jerrymarino (Contributor) left a comment

@luxe nice! I was wondering if you were able to continue a worker end to end after a transient redis failure with this PR?

  jedis = jedisClusterFactory.get();
} catch (Exception e) {
  redisClientRebuildErrorCounter.inc();
  System.out.println("Failed to rebuild redis client");
Contributor:

Should this plumb in log?

Collaborator Author:

fixed

// This will block the overall thread until redis can be connected to.
// It may be a useful strategy for gaining stability on a poorly performing network,
// or a redis cluster that goes down.
while (true) {
Collaborator:

I would prefer the configuration to be a number of reconnects. If it's 0 or null then we don't retry; otherwise we retry up to that number of times. Bonus if there is some backoff here so we don't spam retries continuously.

Collaborator Author:

Agree. Switched to a retry amount plus a duration between retries. We have a Retrier.java class in this repo, but it's too specific to grpc, so I didn't use it. Might be nice to use something like https://resilience4j.readme.io/docs/retry in the future.
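
Roughly this shape (the constant names below are made up for illustration, not the PR's actual config keys):

import java.util.function.Supplier;

import redis.clients.jedis.JedisCluster;

class RetryingClientFactory {
  // hypothetical configuration values
  static final int RETRY_AMOUNT = 5;          // 0 means do not retry
  static final long RETRY_WAIT_MILLIS = 1000; // duration between retries

  static JedisCluster connect(Supplier<JedisCluster> factory) throws InterruptedException {
    for (int attempt = 0; ; attempt++) {
      try {
        return factory.get();
      } catch (RuntimeException e) {
        if (attempt >= RETRY_AMOUNT) {
          throw e; // retries exhausted; surface the failure to the caller
        }
        Thread.sleep(RETRY_WAIT_MILLIS);
      }
    }
  }
}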

@luxe (Collaborator, Author) commented Apr 9, 2023

@luxe nice! I was wondering if you were able to continue a worker end to end after a transient redis failure with this PR?

Added a testing section to the PR. I've observed that the servers/workers can keep running while redis is down. However, new workers will fail to start up when redis is down. We should address that use-case as well.

@luxe (Collaborator, Author) commented Apr 9, 2023

Fixed. If you also start a worker without redis, it will keep trying to establish the client on startup. When redis is available, the worker will complete its startup.

@jerrymarino (Contributor) commented Apr 10, 2023

Yeah, I think this will definitely make it easier to deal with redis timeouts. Recovering from the MatchStage hang I got due to this elasticache event may be handled here, if a critical redis call raises an exception inside of the MatchStage? FWIW it took about 1 hour for that update to go through, so it might not have been a standard outcome. I'm going to try this PR a bit today too.

Apr 05, 2023 1:15:22 PM build.buildfarm.worker.PipelineStage run
SEVERE: MatchStage::run(): stage terminated due to exception
redis.clients.jedis.exceptions.JedisDataException: UNBLOCKED force unblock from blocking operation, instance state changed (master -> replica?)
        at redis.clients.jedis.Protocol.processError(Protocol.java:132)
        at redis.clients.jedis.Protocol.process(Protocol.java:166)
        at redis.clients.jedis.Protocol.read(Protocol.java:220)
        at redis.clients.jedis.Connection.readProtocolWithCheckingBroken(Connection.java:389)
        at redis.clients.jedis.Connection.getBinaryBulkReply(Connection.java:299)
        at redis.clients.jedis.Connection.getBulkReply(Connection.java:289)
        at redis.clients.jedis.Connection$1.call(Connection.java:283)
        at redis.clients.jedis.Connection$1.call(Connection.java:280)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1589)

@jerrymarino (Contributor) left a comment

This definitely helps with some of the redis problems I was hitting; working well in general. The only possible thing I thought of was backoff, which I could perhaps propose as a followup: if there was packet loss in a big spike, backoff could help recovery. But overall this is a large improvement 🥳

} catch (Exception redisException) {
  // Record redis failure.
  redisErrorCounter.inc();
  log.log(
Contributor:

Minor: do we want SEVERE here or ERROR? I wasn't sure if the semantics of SEVERE implied it isn't recoverable.

Collaborator Author:

switched to ERROR

Collaborator Author:

Oh, there is no Level.ERROR! Switched to Level.WARNING.
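
(For context, java.util.logging only has levels like SEVERE, WARNING, and INFO; there is no ERROR. A warning-level log of a redis failure might look like the sketch below; the class and message text are just an example, not the PR's code.)

import java.util.logging.Level;
import java.util.logging.Logger;

class RedisLogging {
  private static final Logger log = Logger.getLogger(RedisLogging.class.getName());

  static void logRedisFailure(Exception redisException) {
    // java.util.logging has SEVERE, WARNING, INFO, ... but no ERROR level
    log.log(Level.WARNING, "Failed to execute redis call; rebuilding client", redisException);
  }
}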

public <T> T call(JedisContext<T> withJedis) throws IOException {
  return callImpl(withJedis);
Contributor:

Minor: do we need to make a separate method for this still?

Collaborator Author:

fixed. callImpl folded into call


private void rebuildJedisCluser() {
  try {
    log.log(Level.SEVERE, "Rebuilding redis client");
Contributor:

Is this redundant with the log in the upstream call? E.g., the caller already logs when it catches?

Collaborator Author:

agree, removed

@luxe (Collaborator, Author) commented Apr 13, 2023

This definitely helps with some of the redis problems I was hitting; working well in general. The only possible thing I thought of was backoff, which I could perhaps propose as a followup: if there was packet loss in a big spike, backoff could help recovery. But overall this is a large improvement 🥳

Backoff is a good idea. We should do it as a followup. Our redis cluster currently takes thousands of requests per second, so when the client enters this "retry" state it is already sending significantly less traffic than normal.

There are also a lot of places in buildfarm where we want to retry things, so I like the idea of having the same retry framework everywhere with configurable backoffs; e.g. all the network calls between server/worker, the queues which have been nonblocking, etc.
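
A shared retry helper with backoff could look roughly like this resilience4j sketch (the settings and names are just an assumption for illustration, not an agreed design):

import java.util.function.Supplier;

import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;

class BackoffRetry {
  // one retry policy that could be shared by redis calls, server/worker network calls, queues, etc.
  private static final RetryConfig CONFIG =
      RetryConfig.custom()
          .maxAttempts(5)
          .intervalFunction(IntervalFunction.ofExponentialBackoff(500, 2.0)) // 500ms, doubling each attempt
          .build();

  static <T> T withRetry(String name, Supplier<T> operation) {
    Retry retry = Retry.of(name, CONFIG);
    return Retry.decorateSupplier(retry, operation).get();
  }
}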

@80degreeswest (Collaborator) left a comment

Can you add the new default configs to the full yaml example and update the docs?

@jerrymarino (Contributor):

There are also a lot of places in buildfarm where we want to retry things, so I like the idea of having the same retry framework everywhere with configurable backoffs; e.g. all the network calls between server/worker, the queues which have been nonblocking, etc.

Yeah, this would help improve some of the failure modes I've run into recently 👍! Happy to help where I can on it.

jerrymarino pushed a commit to bazel-ios/bazel-buildfarm that referenced this pull request Jun 1, 2023