
Need details of what happens to running jobs during Shutdown #6213

Open · nhorton opened this issue Mar 4, 2024 · 6 comments

nhorton commented Mar 4, 2024

This is selfishly for Sidekiq Pro, but I think it applies to all the other editions.

On https://github.com/sidekiq/sidekiq/wiki/Deployment, it says

"If any jobs are still running when the timeout is up, Sidekiq will push those jobs back to Redis so they can be rerun later".

But what does that actually mean, mechanically?

  1. Is that a Sidekiq retry that goes back? Or is the job actually put straight into the queue again?
  2. It looks like a Sidekiq::Shutdown is raised in each worker thread - is that correct and the intended behavior? Right now, that triggers Rails after_discard callbacks, since Sidekiq::Shutdown is a descendant of Interrupt and nothing on the Rails side will rescue it (somewhat by design).
  3. Is there any other signal allowed through that would reach sub-processes we have started? Or do we need to catch the Sidekiq::Shutdown, send a TERM to child processes, then re-raise the Shutdown? (Sketched after this list.)
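
For what it's worth, here is a minimal sketch of the pattern question 3 describes: catch the shutdown, forward TERM to children, re-raise. This is speculative, not something Sidekiq documents; `run_children` and `child_pids` are hypothetical stand-ins for however a job spawns and tracks its child processes:

```ruby
require "sidekiq"

class ForkingJob
  include Sidekiq::Job

  def perform
    run_children              # hypothetical: spawns and tracks child processes
  rescue Sidekiq::Shutdown
    child_pids.each do |pid|  # hypothetical accessor for the spawned PIDs
      Process.kill("TERM", pid)
    rescue Errno::ESRCH
      # child already exited; nothing to signal
    end
    raise # re-raise so Sidekiq can push the job back to Redis
  end
end
```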
mperham (Collaborator) commented Mar 5, 2024

I don't document this because it starts to get into implementation details. Anything I document becomes behavior I have to support and maintain backwards compatibility with. I always advise people to read the code itself; the How Sidekiq Works blog post also talks a bit about job fetch and shutdown.

  1. It pushes the raw job payload back to the front of the queue; the job will be restarted ASAP. (See the sketch after this list.)
  2. Yes, for jobs which run past the shutdown timeout (25 seconds by default). Most jobs should finish within the timeout. A gem like sidekiq-iteration can help you with long-running database jobs.
  3. I'm not an expert in Ruby's signal-handling semantics with child processes, so I can't answer this with confidence. If your child process is long-running, it seems like this might be necessary, but that's speculative on my part.
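
For concreteness, a minimal sketch of what "push the raw payload back to the front of the queue" means in Redis terms. This is not Sidekiq's actual code (the real logic lives in Sidekiq::Manager and the fetcher); `requeue_interrupted` is a hypothetical name for illustration only:

```ruby
require "sidekiq"

# Sidekiq enqueues with LPUSH and fetches with BRPOP, so RPUSH-ing the raw
# JSON payloads puts interrupted jobs at the *front* of the fetch order.
# Note: nothing touches the retry counter; the same payload simply re-runs.
def requeue_interrupted(queue, raw_payloads)
  return if raw_payloads.empty?

  Sidekiq.redis do |conn|
    conn.rpush("queue:#{queue}", raw_payloads)
  end
end
```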

nhorton (Author) commented Mar 8, 2024

Thanks for the info on this. I think #2 might be worth documenting, because the behavior is effectively "jobs disappear if they did not stop in time": that Sidekiq::Shutdown exception won't be one that any standard handler deals with (illustrated below).
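
To illustrate why standard handlers miss it: Sidekiq::Shutdown descends from Interrupt, which descends from Exception rather than StandardError, so a bare rescue never sees it. A sketch, assuming hypothetical `do_long_work`/`log_failure` helpers:

```ruby
require "sidekiq"

class MyJob
  include Sidekiq::Job

  def perform
    do_long_work            # hypothetical
  rescue => e               # a bare rescue catches StandardError only...
    log_failure(e)          # ...so Sidekiq::Shutdown (an Interrupt) never lands here
    raise
  end
end
```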

mperham (Collaborator) commented Mar 8, 2024

Could you explain more? Long-running jobs are pushed back to Redis by Sidekiq::Manager.

nhorton (Author) commented Mar 13, 2024

@mperham - sorry, I thought I responded.

There are two reasons I say that:

  1. We still have anecdotal experiences with jobs disappearing, and this seems related.
  2. We see after_discard Rails callbacks fire when the Sidekiq::Shutdown exceptions are raised, so Rails thinks the jobs are being discarded. I suspect that is because Sidekiq's retry mechanism counts as a new job in Rails's sense?

Closely related to all this is that we see a ton of questionable behaviors from Sidekiq in shutdown situations. For example, we had an instance get hard-killed earlier today, and two hours later the jobs it was running were still showing in the UI as in progress. The jobs were indeed re-enqueued and ran on another machine, but even two hours later the working-set status was incorrect.

mperham (Collaborator) commented Mar 13, 2024

The jobs showing in Busy hours later is because of the hard kill. Sidekiq's heartbeat is not fully transactional, in order to minimize its runtime overhead: it uses pipelined instead of multi. This means there's a race condition where killing the process while it is updating the heartbeat data can leave that data in a half-baked form, without an expiry.

```ruby
conn.pipelined do |transaction|
  transaction.unlink(work_key)
  curstate.each_pair do |tid, hash|
    transaction.hset(work_key, tid, Sidekiq.dump_json(hash))
  end
  transaction.expire(work_key, 60)
end
```
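
For contrast, a hedged sketch of the same update done atomically with MULTI/EXEC. Sidekiq deliberately avoids this to keep heartbeat overhead low, so this is illustrative, not a proposed change:

```ruby
# Wrapped in MULTI/EXEC, the commands apply all-or-nothing, so a kill
# mid-update can no longer strand work_key without its expiry.
conn.multi do |transaction|
  transaction.unlink(work_key)
  curstate.each_pair do |tid, hash|
    transaction.hset(work_key, tid, Sidekiq.dump_json(hash))
  end
  transaction.expire(work_key, 60)
end
```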

Here's a simple fix (-1 means "no expiry" in Redis terms):

```
% redis-cli
127.0.0.1:6379> keys *:work
1) "Mikes-MacBook-Pro.local:18490:303a2ae38652:work"
127.0.0.1:6379> ttl Mikes-MacBook-Pro.local:18490:303a2ae38652:work
(integer) -1
127.0.0.1:6379> expire Mikes-MacBook-Pro.local:18490:303a2ae38652:work 60
(integer) 1
```

mperham (Collaborator) commented Mar 13, 2024

...and that leakage is now fixed on main.
