
Need details of what happens to running jobs during Shutdown #6213

Open · nhorton opened this issue Mar 4, 2024 · 6 comments

nhorton commented Mar 4, 2024

This is selfishly for Sidekiq Pro, but I think it applies to all the other editions.

On https://github.com/sidekiq/sidekiq/wiki/Deployment, it says

"If any jobs are still running when the timeout is up, Sidekiq will push those jobs back to Redis so they can be rerun later".

But what does that actually mean, mechanically?

  1. Is that a Sidekiq retry that goes back? Or is the job actually put straight into the queue again?
  2. It looks like a Sidekiq::Shutdown is raised in each worker thread - is that correct and the intended behavior? Right now, that triggers Rails after_discard callbacks, since Sidekiq::Shutdown is a descendant of Interrupt and nothing on the Rails side will rescue it (somewhat by design).
  3. Is there any other signal allowed through that would reach sub-processes we have started? Or do we need to catch the Sidekiq::Shutdown, send a TERM to child processes, then re-raise the Shutdown? (Sketched after this list.)
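
For what it's worth, here is a minimal sketch of the pattern question 3 describes: catch the shutdown, forward TERM to children, re-raise. This is speculative, not something Sidekiq documents; `run_children` and `child_pids` are hypothetical stand-ins for however a job spawns and tracks its child processes:

```ruby
require "sidekiq"

class ForkingJob
  include Sidekiq::Job

  def perform
    run_children              # hypothetical: spawns and tracks child processes
  rescue Sidekiq::Shutdown
    child_pids.each do |pid|  # hypothetical accessor for the spawned PIDs
      Process.kill("TERM", pid)
    rescue Errno::ESRCH
      # child already exited; nothing to signal
    end
    raise # re-raise so Sidekiq can push the job back to Redis
  end
end
```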
mperham (Collaborator) commented Mar 5, 2024

I don't document this because it starts to get into implementation details. Anything I document becomes behavior I have to support and maintain backwards compatibility with. I always advise people to read the code itself; the How Sidekiq Works blog post also talks a bit about job fetch and shutdown.

  1. It pushes the raw job payload back to the front of the queue; the job will be restarted ASAP. (See the sketch after this list.)
  2. Yes, for jobs which run past the shutdown timeout (25 seconds by default). Most jobs should finish within the timeout. A gem like sidekiq-iteration can help you with long-running database jobs.
  3. I'm not an expert in Ruby's signal-handling semantics with child processes, so I can't answer this with confidence. If your child process is long-running, it seems like this might be necessary, but that's speculative on my part.
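
For concreteness, a minimal sketch of what "push the raw payload back to the front of the queue" means in Redis terms. This is not Sidekiq's actual code (the real logic lives in Sidekiq::Manager and the fetcher); `requeue_interrupted` is a hypothetical name for illustration only:

```ruby
require "sidekiq"

# Sidekiq enqueues with LPUSH and fetches with BRPOP, so RPUSH-ing the raw
# JSON payloads puts interrupted jobs at the *front* of the fetch order.
# Note: nothing touches the retry counter; the same payload simply re-runs.
def requeue_interrupted(queue, raw_payloads)
  return if raw_payloads.empty?

  Sidekiq.redis do |conn|
    conn.rpush("queue:#{queue}", raw_payloads)
  end
end
```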

nhorton (Author) commented Mar 8, 2024

Thanks for the info on this. I think #2 might be worth documenting, because the behavior is effectively "jobs disappear if they did not stop in time": that Sidekiq::Shutdown exception won't be one that any standard handler deals with (illustrated below).
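
To illustrate why standard handlers miss it: Sidekiq::Shutdown descends from Interrupt, which descends from Exception rather than StandardError, so a bare rescue never sees it. A sketch, assuming hypothetical `do_long_work`/`log_failure` helpers:

```ruby
require "sidekiq"

class MyJob
  include Sidekiq::Job

  def perform
    do_long_work            # hypothetical
  rescue => e               # a bare rescue catches StandardError only...
    log_failure(e)          # ...so Sidekiq::Shutdown (an Interrupt) never lands here
    raise
  end
end
```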

mperham (Collaborator) commented Mar 8, 2024

Could you explain more? Long-running jobs are pushed back to Redis by Sidekiq::Manager.

nhorton (Author) commented Mar 13, 2024

@mperham - sorry, I thought I responded.

There are two reasons I say that:

  1. We still have anecdotal experiences with jobs disappearing, and this seems related.
  2. We see after_discard Rails callbacks fire when the Sidekiq::Shutdown exceptions are raised, so Rails thinks the jobs are being discarded. I suspect that is because Sidekiq's retry mechanism counts as a new job in Rails's sense?

Closely related to all this is that we see a ton of questionable behaviors from Sidekiq in shutdown situations. For example, we had an instance get hard-killed earlier today, and two hours later the jobs it was running were still showing in the UI as in progress. The jobs were indeed re-enqueued and ran on another machine, but even two hours later the working-set status was incorrect.

mperham (Collaborator) commented Mar 13, 2024

The jobs showing in Busy hours later is because of the hard kill. Sidekiq's heartbeat is not fully transactional, in order to minimize its runtime overhead: it uses pipelined instead of multi. This means there's a race condition where killing the process while it is updating the heartbeat data can leave that data in a half-baked form, without an expiry.

```ruby
conn.pipelined do |transaction|
  transaction.unlink(work_key)
  curstate.each_pair do |tid, hash|
    transaction.hset(work_key, tid, Sidekiq.dump_json(hash))
  end
  transaction.expire(work_key, 60)
end
```
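
For contrast, a hedged sketch of the same update done atomically with MULTI/EXEC. Sidekiq deliberately avoids this to keep heartbeat overhead low, so this is illustrative, not a proposed change:

```ruby
# Wrapped in MULTI/EXEC, the commands apply all-or-nothing, so a kill
# mid-update can no longer strand work_key without its expiry.
conn.multi do |transaction|
  transaction.unlink(work_key)
  curstate.each_pair do |tid, hash|
    transaction.hset(work_key, tid, Sidekiq.dump_json(hash))
  end
  transaction.expire(work_key, 60)
end
```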

Here's a simple fix (-1 means "no expiry" in Redis terms):

```
% redis-cli
127.0.0.1:6379> keys *:work
1) "Mikes-MacBook-Pro.local:18490:303a2ae38652:work"
127.0.0.1:6379> ttl Mikes-MacBook-Pro.local:18490:303a2ae38652:work
(integer) -1
127.0.0.1:6379> expire Mikes-MacBook-Pro.local:18490:303a2ae38652:work 60
(integer) 1
```

mperham (Collaborator) commented Mar 13, 2024

...and that leakage is now fixed on main.
