
How to retry a timed out job? #401

Open
ross-nordstrom opened this issue May 26, 2023 · 2 comments

Comments

@ross-nordstrom
Contributor

ross-nordstrom commented May 26, 2023

Context

I'm trying to add some fail-safes around a resource-intensive job with a lot of external dependencies: it can sometimes hang or OOM, and it usually works on the next retry.

Issue

I'd like to set a job-specific timeout and have the job retry after a TimeoutError, but I can't figure out how to do that. The TimeoutError seems to be terminal and I can't get the job to retry. Any advice on how to make this work?

See related issue: #402

Reproduction

  1. Run the worker: python script.py worker
  2. Enqueue jobs by running python script.py client
  3. Watch the job run, time out, and NOT retry in the worker
import asyncio
import random
import sys

import arq.worker
from arq import create_pool
from arq.connections import RedisSettings
from arq.typing import WorkerSettingsBase

from settings import redis_settings


async def __flaky_job(_ctx):
    # Simulate a flaky job: the random sleep usually exceeds the 3s timeout configured below.
    latency = random.uniform(1.0, 60.0)
    print(f"Starting a job that will take {latency:.2f}s to run (and is allowed to run up to 3s)...")
    await asyncio.sleep(latency)
    print("Done!")


flaky_job_with_timeout = arq.worker.func(__flaky_job, name='flaky_job', timeout=3)


class WorkerSettings(WorkerSettingsBase):
    redis_settings = redis_settings
    functions = [flaky_job_with_timeout]


def worker():
    print('Starting worker')
    arq.run_worker(WorkerSettings)


async def client():
    print('Running client')
    redis = await create_pool(RedisSettings())
    await redis.enqueue_job('flaky_job')
    print('Enqueued job')


if __name__ == '__main__':
    match sys.argv[1]:
        case 'client':
            asyncio.run(client())
        case 'worker':
            worker()
@JonasKs
Sponsor Collaborator

JonasKs commented May 26, 2023

There are many settings on the worker class, and you can specify max retries for specific jobs. Please see the documentation; I also suggest reading this part: https://arq-docs.helpmanual.io/#retrying-jobs-and-cancellation
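
For illustration, here's a rough sketch of the knobs I mean (untested, names follow the script above, and the values are only placeholders): max_tries can be set per function via arq.worker.func or worker-wide on the settings class, alongside job_timeout and retry_jobs.

import arq.worker

from settings import redis_settings


async def flaky_job(ctx):
    ...  # the actual job body


# Per-function limits: `timeout` bounds a single attempt, `max_tries` caps
# how many attempts arq will make for this job.
flaky_job_with_limits = arq.worker.func(flaky_job, name='flaky_job', timeout=3, max_tries=3)


class WorkerSettings:
    redis_settings = redis_settings
    functions = [flaky_job_with_limits]
    # Worker-wide defaults, used when a function doesn't set its own:
    job_timeout = 3
    max_tries = 3
    retry_jobs = True  # default; re-enqueue jobs that raise Retry or are cancelled mid-run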

@ross-nordstrom
Contributor Author

Right, I've been using Retry and job aborting successfully, but I'm struggling with timeouts vs. retries.

Is it a correct expectation that a timed-out job (job runs longer than timeout) will be automatically retried?
If not, is there a way to do this?

I tried catching the error, but it doesn't propagate up to where I invoke run_worker:

try:
    arq.run_worker(WorkerSettings)
except Exception as e:
    # Never hit on TimeoutError
    logging.exception('worker error')

By the way, this is what the worker logs when you run my reproduction steps:

Starting worker
Starting a job that will take 30.57s to run (and is allowed to run up to 3s)...
  3.00s ! 0de71325829747abbbd608ec97fc1f4c:flaky_job failed, TimeoutError: 
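
For now, the workaround I'm leaning towards is enforcing the time limit inside the job and converting it into a Retry so arq re-enqueues the job itself. Untested sketch: do_the_real_work() is just a stand-in for the actual job body, and the arq-level timeout would need to be removed (or set higher than the inner one) so it doesn't fire first.

import asyncio

from arq.worker import Retry


async def flaky_job(ctx):
    try:
        # Enforce our own per-attempt limit instead of relying on arq's
        # `timeout`, so the timeout is catchable inside the job.
        await asyncio.wait_for(do_the_real_work(), timeout=3)  # do_the_real_work() is a placeholder
    except asyncio.TimeoutError:
        # Ask arq to re-enqueue the job after 10s; each attempt still
        # counts towards max_tries.
        raise Retry(defer=10)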
