SQS messages can be dropped during shutdown #1819

Open
mgorven opened this issue Nov 7, 2023 · 4 comments

Comments

mgorven commented Nov 7, 2023

When Celery is stopped gracefully, the in-flight request to SQS is not completed. If SQS returns messages for this request, it considers them delivered, but Celery doesn't do anything with them (even if task_reject_on_worker_lost is set). SQS will redeliver the messages when the visibility timeout expires, but this isn't great for a graceful shutdown.

The SQS transport uses CurlClient to make requests, which uses celery.worker.loops.asynloop to do IO. The event loop is the first thing to stop when SIGTERM/SIGINT is received, so in-flight requests are simply dropped.
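To make the failure mode concrete, here is a minimal in-memory sketch of the visibility-timeout behaviour described above. `ToyQueue` and `simulate_shutdown_race` are hypothetical names invented for illustration; this is not the SQS or kombu API, just a toy model under the assumption that a received message stays hidden until its visibility timeout elapses unless it is deleted.

```python
import itertools


class ToyQueue:
    """In-memory stand-in for an SQS queue (hypothetical, not a real API)."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._ids = itertools.count(1)
        self._messages = {}  # message id -> (body, time it becomes visible)

    def send(self, body, now):
        msg_id = next(self._ids)
        self._messages[msg_id] = (body, now)
        return msg_id

    def receive(self, now):
        """Return visible messages and hide them for visibility_timeout."""
        delivered = []
        for msg_id, (body, visible_at) in list(self._messages.items()):
            if visible_at <= now:
                self._messages[msg_id] = (body, now + self.visibility_timeout)
                delivered.append((msg_id, body))
        return delivered

    def delete(self, msg_id):
        """Acknowledge a message so it is never redelivered."""
        self._messages.pop(msg_id, None)


def simulate_shutdown_race(visibility_timeout=300):
    queue = ToyQueue(visibility_timeout)
    queue.send("task-1", now=0)

    # The old worker's in-flight receive is answered by the queue, but the
    # worker's event loop has already stopped, so the response is discarded
    # and the message is never deleted.
    dropped = queue.receive(now=1)

    # A restarted worker polls shortly afterwards and sees nothing.
    after_restart = queue.receive(now=10)

    # Only once the visibility timeout has elapsed is the message redelivered.
    after_timeout = queue.receive(now=1 + visibility_timeout)
    return dropped, after_restart, after_timeout
```

In this model the message is not lost, but it is unavailable to any worker for the whole visibility timeout, which matches the delay described above.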

Member

auvipy commented Nov 8, 2023

@rafidka any insight to share?

Contributor

rafidka commented Nov 8, 2023

@auvipy, I would need to do some debugging on this. I will try to allocate some time for it, but I cannot promise anything this week.

@mgorven, it would help if you had an easy step-by-step reproduction.

Author

mgorven commented Nov 8, 2023

I don't have an easy repro of actually losing events; I'll work on that. What I've done is add debug prints in CurlClient._setup_request() and CurlClient._process(), and I can see that a request is started but never finished during shutdown: https://gist.github.com/mgorven/f671b3b9384b2814d86f7e99451a2936
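For anyone wanting to add similar debug prints without editing library files, one possible approach is to wrap the methods at runtime. The sketch below is only an illustration: `trace_method` and `FakeClient` are hypothetical names, and `FakeClient` is a stand-in so the example runs on its own. Applying the helper to the real CurlClient is an assumption that has not been tested here.

```python
import functools


def trace_method(cls, name, log=print):
    """Wrap cls.<name> so that every call is logged before it runs."""
    original = getattr(cls, name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        log(f"{cls.__name__}.{name} called")
        return original(self, *args, **kwargs)

    setattr(cls, name, wrapper)
    return original  # keep a handle so the patch can be undone


class FakeClient:
    """Stand-in for the real client class; purely illustrative."""

    def _setup_request(self, url):
        return f"request to {url}"

    def _process(self, response):
        return response.upper()


# Patch both methods; a request that is set up but never processed would
# then show a "_setup_request called" line with no matching "_process" line.
trace_method(FakeClient, "_setup_request")
trace_method(FakeClient, "_process")
```

Passing a different `log` callable (for example a list's `append`) collects the trace instead of printing it.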

Author

mgorven commented Nov 8, 2023

  1. Set up Celery with an SQS queue which isn't processing other events and with a visibility timeout of a few minutes
  2. In another shell, have a celery call command ready to dispatch a task
  3. Add a debug print in CurlClient._setup_request() so you can see when a request is started
  4. Run celery worker
  5. Ctrl-C the worker just after it starts an SQS request
  6. Run celery call
  7. Start the worker again

You'd expect the worker to immediately process the dispatched task, but instead it only does so after the visibility timeout expires.
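For step 1, a configuration module along the following lines might be used. This is a sketch only: the region, queue name, and five-minute timeout are assumed values chosen for illustration, not settings taken from the report.

```python
# celeryconfig.py (illustrative values; adjust to your own AWS setup)

# Use the SQS transport; credentials are assumed to come from the environment.
broker_url = "sqs://"

broker_transport_options = {
    "region": "us-east-1",        # assumed region
    "visibility_timeout": 300,    # seconds; "a few minutes" as in step 1
}

# Hypothetical dedicated queue so that no other events are being processed.
task_default_queue = "shutdown-repro"
```

With a timeout of this size, the delay before the restarted worker picks up the task in step 7 should be long enough to notice by eye.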
