Redis decode_responses rq incompatibility #6424

Open
SchrosCat2013 opened this issue Sep 6, 2023 · 18 comments
@SchrosCat2013
Contributor

Redash is setting the flag decode_responses for its Redis connection. This is incompatible with the rq library - rq/rq#1188.

We are having some issues, and I suspect this might be the underlying cause. Primarily:

  • scheduled queries look like they're being executed multiple times
  • occasionally a query gets stuck. The job running the query disappears but the Redis entry remains, consuming a space in the query pool.

The second issue eventually consumes the entire query pool and prevents any query from running. When this happens, we have to manually flush the Redis cache.

We are running Redash v10.1.0 from the redash/redash:10.1.0.b50633 docker image.

We found #5801, which looks like the same issue we are having. We have upgraded rq to 1.10.1 and rq-scheduler to 0.10.0 in our environment. This does seem to have alleviated the stuck-queries issue, but it doesn't seem to have changed the multiple executions of scheduled queries. However, we are now seeing the error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9c in position 1: invalid start byte
repeated throughout the logs. The stack trace here matches the linked issue in the rq library. I am not sure if upgrading rq caused these errors - I more suspect that upgrading rq has improved the error logging and is surfacing an existing issue.

@justinclift
Member

Thanks, that looks like an important bug to investigate / fix. 😄

@justinclift justinclift added the Bug label Sep 6, 2023
@justinclift
Member

justinclift commented Sep 6, 2023

Our latest development code is using rq 1.5.0 and rq-scheduler 0.10.0, both of which seem fairly old now:

redash/requirements.txt

Lines 47 to 48 in c2e7df0

rq==1.5.0
rq-scheduler==0.10.0

Latest versions are currently:

We should be able to bump them up to much newer, and hopefully also adjust the connection options without too much hassle.

@justinclift
Member

This seems to be where decode_responses=True is added:

from urllib.parse import urlparse, urlunparse

def add_decode_responses_to_redis_url(url):
    """Make sure that the Redis URL includes the `decode_responses` option."""
    parsed = urlparse(url)
    query = "decode_responses=True"
    if parsed.query and "decode_responses" not in parsed.query:
        query = "{}&{}".format(parsed.query, query)
    elif "decode_responses" in parsed.query:
        query = parsed.query
    return urlunparse(
        [
            parsed.scheme,
            parsed.netloc,
            parsed.path,
            parsed.params,
            query,
            parsed.fragment,
        ]
    )
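
For reference, here's the kind of URL the helper produces (a quick sketch; the URL values are illustrative):

add_decode_responses_to_redis_url("redis://localhost:6379/0")
# -> 'redis://localhost:6379/0?decode_responses=True'

add_decode_responses_to_redis_url("redis://localhost:6379/0?socket_timeout=5")
# -> 'redis://localhost:6379/0?socket_timeout=5&decode_responses=True'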

Looks like that was originally added during the migration of Redash from Python 2 to Python 3: 246eca1

* Decode bytes returned from Redis

@justinclift
Member

Not setting that option might just be as simple as removing the helper function and removing the calls to it:

add_decode_responses_to_redis_url,

REDIS_URL = add_decode_responses_to_redis_url(_REDIS_URL)

Am not finding any info about why the option was added in the first place. It was likely done for a reason, though what that might be I have no idea.

We can probably try removing that option and then see if our CI tests pass. That will at least let us know that Redash might work without it. Ideally we'd want to see it operate without causing errors in practice, though.

@justinclift
Member

justinclift commented Sep 6, 2023

Ahhh. Looks like the rq docs specifically call out that option as unsupported these days too:

https://github.com/rq/rq/pull/1833/files

Yeah, let's try removing it. 😄

@justinclift
Member

Looks like just removing that function completely isn't going to work as-is. When I tried that just now on my local computer, the backend unit tests started throwing errors:

========================================================================== short test summary info ==========================================================================
FAILED tests/test_models.py::QueryOutdatedQueriesTest::test_outdated_queries_works_scheduled_queries_tracker - AssertionError: Query(id=1, name='Query', query_hash='2d5127b8139f1f5f1086c4053324733a', version=1, user_id=1, org_id=1, data_source_id=1, query_hash='2d5127b8139f1f5f1...
FAILED tests/handlers/test_data_sources.py::TestDataSourcePausePost::test_pause_sets_reason - AssertionError: b'testing' != 'testing'
FAILED tests/handlers/test_query_results.py::TestQueryResultListAPI::test_execute_query_with_params - TypeError: can only concatenate str (not "bytes") to str
FAILED tests/models/test_data_sources.py::TestDataSourceIsPaused::test_allows_setting_reason - AssertionError: b'Some good reason.' != 'Some good reason.'
FAILED tests/models/test_users.py::TestUserDetail::test_sync - sqlalchemy.exc.ProgrammingError: (psycopg2.errors.UndefinedFunction) operator does not exist: integer = bytea
FAILED tests/tasks/test_queries.py::TestEnqueueTask::test_multiple_enqueue_of_same_query - AttributeError: 'bytes' object has no attribute 'id'
FAILED tests/tasks/test_queries.py::TestEnqueueTask::test_reenqueue_during_job_cancellation - AttributeError: 'bytes' object has no attribute 'id'
FAILED tests/tasks/test_worker.py::TestWorkerMetrics::test_worker_records_success_metrics - TypeError: can only concatenate str (not "bytes") to str
FAILED tests/tasks/test_worker.py::TestQueueMetrics::test_enqueue_query_records_created_metric - TypeError: can only concatenate str (not "bytes") to str
==================================================== 9 failed, 787 passed, 1 skipped, 1175 warnings in 182.47s (0:03:02) ====================================================
make: *** [Makefile:40: backend-unit-tests] Error 1

Errors like can only concatenate str (not "bytes") to str should be solvable by decoding the byte values to strings first. However, I'm concerned there may be deeper issues with this approach. Let's see what others say...
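
For example, the failing assertions compare bytes against str; decoding fixes the comparison (a minimal sketch, with the value taken from the test output above):

value = b"testing"            # what redis-py returns when decode_responses is off
text = value.decode("utf-8")  # 'testing' -> safe to compare or concatenate with str
assert text == "testing"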

@guidopetri Any interest in this one? 😄

@justinclift
Member

justinclift commented Sep 6, 2023

Thinking about it more, in the existing code base the connection to rq already doesn't set that option:

# _REDIS_URL is the unchanged REDIS_URL we get from env vars, to be used later with RQ
_REDIS_URL = os.environ.get("REDASH_REDIS_URL", os.environ.get("REDIS_URL", "redis://localhost:6379/0"))
# This is the one to use for Redash' own connection:
REDIS_URL = add_decode_responses_to_redis_url(_REDIS_URL)

So _REDIS_URL is what's apparently getting used for the connection to rq, and has decode_responses unset.

Whereas REDIS_URL (different to _REDIS_URL) has decode_responses set, but isn't used for connecting to rq.


So, my reading of things is that we should definitely try upgrading the rq and rq-scheduler libraries to see if it helps your problem. The decode_responses option probably isn't involved here after all, as the connection to rq doesn't use it.

Am testing the upgraded libraries on my local computer at the moment. If they pass our CI tests on my local computer, then I'll make a PR out of it. 😄

@justinclift
Member

I've just created PR #6426, which upgrades the rq related dependencies as far as they can go without needing changes to our code base or tests.

@SchrosCat2013 Would you be ok to try adjusting your redis, rq, and rq-scheduler dependencies to match that, and see if it helps fix the problems?

And yeah, I'm aware you've been trying a slightly newer version of rq. That newer rq version isn't working 100% with our current tests, so it might be causing some of the problem.

@SchrosCat2013
Contributor Author

Thanks, I'll give that a go and let you know how it turns out.

@guidopetri
Collaborator

can only concatenate str (not "bytes") to str

This does indeed probably stem from Redis, but I agree that it's probably unrelated to the original error. AFAIK Redis by default returns bytes, and you have to decode with utf-8 to get str type, so that's what the error in the tests is complaining about.
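
To illustrate (a minimal sketch assuming a local Redis at localhost:6379; the key name is made up):

from redis import Redis

raw = Redis()                           # default: decode_responses=False
raw.set("greeting", "hello")
print(raw.get("greeting"))              # b'hello' -> bytes

decoded = Redis(decode_responses=True)
print(decoded.get("greeting"))          # 'hello' -> str, decoded as UTF-8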

@SchrosCat2013
Contributor Author

Unfortunately, swapping to the suggested versions seems to have exacerbated the problem, if anything.

Part of this might be our environment - we're currently running an older version of Redis (v3.2.10). The suggested upgrade of redis-py puts this outside the list of supported versions. I'll give these library updates another go once we have upgraded Redis.

In the meantime I'm going to see if I can stand up a Redash development environment and reproduce the issues there.

@justinclift
Member

Interesting. Hopefully we can all get this solved sooner rather than later. 😄

In the meantime I'm going to see if I can stand up a Redash development environment and reproduce the issues there.

For the dev environment setup, please use the wiki page documentation rather than the knowledge base pages:

https://github.com/getredash/redash/wiki/Local-development-setup

We need to update the knowledge base pages to include the wiki stuff, but haven't gotten around to it yet.

@SchrosCat2013
Contributor Author

SchrosCat2013 commented Oct 19, 2023

I've had a chance to look into this. I found one place where the wrong connection was being passed to an RQ function leading to the UnicodeDecodeError. #6539

While looking for the problem, I noticed that the enqueue_query method creates a Redis pipeline to watch a value, but the pipeline is never reset. I don't think this is related to the issue at hand, but it seems like something that should be fixed. I've created PR #6540, which calls reset inside a finally block. The other option is to change line 39 to a context manager (with redis_connection.pipeline() as pipe:), but that means another layer of nesting, and the code in that function is already quite nested.
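
For illustration, a sketch of the two options using a plain redis-py connection as a stand-in for Redash's redis_connection (the watched key name is hypothetical):

from redis import Redis

redis_connection = Redis()  # stand-in for Redash's shared connection

# Option 1 (PR #6540): explicit reset in a finally block
pipe = redis_connection.pipeline()
try:
    pipe.watch("query_task_tracker")  # hypothetical key
    # ... transactional work ...
finally:
    pipe.reset()  # releases the WATCH and returns the connection to the pool

# Option 2: a context manager, which calls reset() on exit
with redis_connection.pipeline() as pipe:
    pipe.watch("query_task_tracker")  # hypothetical key
    # ... transactional work ...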

I believe the actual source of the problem is the HardLimitingWorker.

It performs a heartbeat on the Worker, but not on the Job. Once a query is picked off the queue for execution, RQ gives the job a default timeout of 60s; normally this would be refreshed every time the job heartbeat is called. Nothing happens immediately when that timeout expires, but the next time the RQ StartedJobRegistry.cleanup method is called, the job's status is set to failed. The job itself keeps running and is still listed (as execute_query) on the Redash Admin RQ status page, but the query no longer appears on the queries tab of that page. To a user who has the query open, it also looks like it has stopped executing, possibly leading them to click refresh and start another copy of the query.

The StartedJobRegistry.cleanup method is called as a result of calling get_job_ids, and somewhat ironically this happens in two places in Redash (see the sketch after this list) -

  • The remove_ghost_locks maintenance task, which is probably not actually progressing to this step because of the UnicodeDecodeError
  • The rq_status API call, used by the Admin RQ status page where you would go to see the list of running queries
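
For illustration, here's how get_job_ids triggers the cleanup (a sketch using RQ's public registry API; the queue name is illustrative):

from redis import Redis
from rq.registry import StartedJobRegistry

registry = StartedJobRegistry("queries", connection=Redis())

# get_job_ids() calls cleanup() first, which moves any job whose registry
# score (its timeout deadline) has expired into the failed state, even if
# the work horse process is still running.
job_ids = registry.get_job_ids()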

Rather than adding a call to the job heartbeat from within the HardLimitingWorker, I believe the best fix in this case is to remove the HardLimitingWorker entirely. It looks like it was cloned from an early version of RQ, and its purpose is to hard-kill a job that runs for too long without respecting its time limit. That functionality has since been baked directly into RQ: it was added in version 1.2.1 and has been through a number of rounds of bug fixes.
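
For instance, on a recent rq the stock Worker enforces the timeout itself, so a plain enqueue with job_timeout is all that's needed (a sketch; the queue name and function path are illustrative):

from redis import Redis
from rq import Queue

queue = Queue("queries", connection=Redis())

# The stock rq Worker's work-horse monitoring kills the job once
# job_timeout elapses; no custom HardLimitingWorker subclass is needed.
job = queue.enqueue("redash.tasks.queries.execute_query", job_timeout=300)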

I have managed to reproduce this locally for ad-hoc queries. I'm not certain if this is also the cause of the scheduled queries running multiple times yet, but will give the fixes a go and get back to you with the results.

@justinclift
Member

Sounds like some really well done investigation. Awesome. 😄

@justinclift
Member

I believe the best fix in this case is to remove the HardLimitingWorker entirely. ...

Want to throw together a PR for this?

Also, any ideas on how we can definitively test the broken behaviour beforehand (smoking gun style), and test that it's fixed afterwards?

@SchrosCat2013
Contributor Author

Happy to throw a PR together for this.

I'd have to give some thought to how you would go about testing this. Since the broken behaviour only shows up once RQ's initial 60s timeout has passed, a test would either have to wait 60s, mess with the internals of RQ, or directly modify the timeout on the job's underlying entry in Redis. None of those sounds particularly appealing, though I could be persuaded on the third option.
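
A sketch of that third option, assuming RQ's internal key layout (the started-job registry is a sorted set at rq:wip:<queue> whose score is the job's expiry timestamp; the queue name and job id are made up):

import time

from redis import Redis

conn = Redis()

# Rewind the job's expiry score so the next cleanup() treats it as stale,
# without waiting out the real 60s timeout.
conn.zadd("rq:wip:queries", {"my-test-job-id": time.time() - 1})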

@justinclift
Member

... or directly modified the timeout for the job's underlying entry in Redis.

Of the three, that one sounds most reasonable for our purposes here. But let's ask @guidopetri, as he's a lot better with Python than me. 😄

@guidopetri
Collaborator

IMO waiting 60s in a test isn't that big of a deal; I'd be fine with that. Messing with the internals of Redis sounds more error-prone, and in my opinion it doesn't give enough lift to be worth doing instead.
