DEV Community: Kolade Fajimi

Running Async Python Inside Celery Is Harder Than You Think.

Kolade Fajimi — Mon, 22 Jun 2026 23:17:50 +0000

The problem is straightforward to state and surprisingly hard to solve correctly.

Celery workers are synchronous. Celery spawns prefork worker processes, and when a task arrives, it calls your task function like this: task_function(*args, **kwargs). It expects a return value. It blocks the worker thread until it gets one. It does not know or care that you wrote async def.

But modern Python services are async. FastAPI is async. SQLAlchemy 2.0 is async. httpx, aiohttp, asyncpg: the entire interesting half of the ecosystem has gone async-first. Maintaining two parallel code paths, one async for your web layer and one sync for your task layer, is exactly the kind of thing that creates maintenance debt, copy-paste bugs, and the kind of divergence you only notice when something breaks in production.

So you want to write async def task functions and have them work inside a Celery worker. How hard can it be?

Harder than it looks.

Why `asyncio.run()` doesn't work

The first thing most people try:

def task_wrapper(*args, **kwargs):
    return asyncio.run(your_async_function(*args, **kwargs))

This works in isolation. It fails in production for a specific reason: asyncio.run() creates a new event loop, runs the coroutine to completion, then closes the loop. If there is already a running event loop on the current thread (and there frequently is: in test environments, in newer Celery versions, in signal handlers), it raises:

RuntimeError: asyncio.run() cannot be called from a running event loop
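
The failure is easy to reproduce without Celery. A minimal sketch, assuming nothing beyond the standard library (inner and outer are illustrative names): call asyncio.run() from a coroutine that is already being driven by a loop on the same thread and capture whatever the interpreter raises.

```python
import asyncio


async def inner():
    return 42


async def outer():
    coro = inner()
    try:
        # A loop is already running on this thread, so asyncio.run() refuses.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # the coroutine never started; close it to avoid a warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```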

The fix most people find next is nest_asyncio:

import nest_asyncio
nest_asyncio.apply()
# now asyncio.run() "works" from inside a running loop

nest_asyncio patches the event loop to allow re-entrant calls. It works in simple cases. The subtle failure mode: re-entrant event loops change the execution order of scheduled callbacks and coroutines. Code that was safe under normal scheduling assumptions becomes non-deterministic under concurrent load. Bugs appear only at production concurrency, only under specific timing, and are nearly impossible to reproduce in development.

The prefork complication

Even if you solve the asyncio.run() problem, Celery's prefork concurrency model introduces a second failure that takes longer to diagnose because it manifests as infinite silence rather than an immediate error.

When Celery starts, it forks N worker processes from a single parent. After fork(), the child process inherits the parent's memory including any event loop objects that existed before the fork.

The problem: fork() does not copy threads. A Python asyncio.AbstractEventLoop is driven by a thread calling loop.run_forever(). After fork(), the child has the loop object but not the thread running it. The loop's internal state may indicate it was running; nothing is actually driving it. Any coroutine scheduled onto this loop hangs indefinitely.

@worker_process_init.connect
def bad_init(**kwargs):
    loop = asyncio.get_event_loop()
    # This loop was inherited from the parent.
    # The thread driving it was not copied into this process by fork().
    # Whatever loop.is_running() reports, nothing here is driving the loop.
    # Scheduling coroutines onto it produces no results and no errors.
    future = asyncio.run_coroutine_threadsafe(some_coro(), loop)
    future.result()  # blocks forever

This is the kind of bug that fails without leaving a trace. The loop object exists and looks valid. No exception is raised. Work just never completes. I spent the better part of a day convinced the issue was in the Redis client before realizing the loop that was supposed to drive it had lost its thread at fork time.
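
The hang can be reproduced with the standard library alone. The sketch below is POSIX-only (it uses os.fork) and is not how Celery forks its pool; it starts a loop on a background thread, forks, and has the child schedule a coroutine onto the inherited loop. The timeout is there only so the example terminates; without it, the child would wait forever. On Python 3.12 and later, forking a multi-threaded process also emits a DeprecationWarning, which is the same hazard stated by the interpreter.

```python
import asyncio
import concurrent.futures
import os
import threading


async def ping():
    return "pong"


def child_outcome_after_fork(timeout=0.5):
    loop = asyncio.new_event_loop()
    threading.Thread(target=loop.run_forever, daemon=True).start()

    # In the parent the loop is alive: this returns "pong".
    parent_outcome = asyncio.run_coroutine_threadsafe(ping(), loop).result(timeout=5)

    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: same loop object, but the thread that drove it was not copied.
        future = asyncio.run_coroutine_threadsafe(ping(), loop)
        try:
            outcome = future.result(timeout=timeout)
        except concurrent.futures.TimeoutError:
            outcome = "timeout"
        os.write(write_fd, outcome.encode())
        os._exit(0)

    # Parent: collect the child's report and clean up.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as reader:
        child_outcome = reader.read().decode()
    os.waitpid(pid, 0)
    loop.call_soon_threadsafe(loop.stop)
    return parent_outcome, child_outcome


parent_outcome, child_outcome = child_outcome_after_fork()
print(parent_outcome, child_outcome)
```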

The solution: a persistent bridge loop per worker process

The correct approach is to create a brand-new event loop inside each forked worker process and start a dedicated daemon thread to drive it. The bridge loop is the only asyncio runtime in the worker process. All async work runs on it. Celery's synchronous worker threads never touch an event loop directly.

import asyncio
import threading

from celery.signals import worker_process_init

worker_loop: asyncio.AbstractEventLoop | None = None

@worker_process_init.connect
def init_worker_process(**kwargs):
    global worker_loop

    # Always create a fresh loop in the forked child.
    # Never reuse the inherited parent loop object.
    worker_loop = asyncio.new_event_loop()

    # A daemon thread drives the loop independently of Celery's
    # synchronous execution threads.
    t = threading.Thread(
        target=_run_event_loop,
        args=(worker_loop,),
        daemon=True
    )
    t.start()

def _run_event_loop(loop):
    asyncio.set_event_loop(loop)
    loop.run_forever()

Now the bridge is asyncio.run_coroutine_threadsafe. When Celery calls the synchronous task wrapper, the wrapper schedules the async orchestration coroutine onto the background loop and blocks the worker thread waiting for the result:

def wrapper(self, *args, **kwargs):
    async def _orchestrate():
        # Schema migration, idempotency check, Phoenix heartbeat,
        # OTel span setup, task execution, fence validation, DLQ quarantine.
        result = await your_async_task_function(*args, **kwargs)
        return result

    # Schedule the coroutine from this synchronous thread onto
    # the event loop running on the background thread.
    future = asyncio.run_coroutine_threadsafe(_orchestrate(), worker_loop)

    # Block the Celery worker thread here. All actual work happens
    # on the bridge loop thread.
    return future.result(timeout=300)

run_coroutine_threadsafe is the correct API for this pattern. It is thread-safe, it returns a concurrent.futures.Future (not an asyncio Future), and future.result() blocks without touching the event loop. The background loop thread does all the async I/O. The Celery worker thread just waits.
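
Stripped of Celery, the whole bridge fits in a few lines. This is a minimal standalone sketch, not Relier's implementation: start_bridge_loop, run_on_bridge, and describe_invoice are hypothetical names, and the cancel-on-timeout step is my addition so that a timed-out coroutine does not keep running on the loop.

```python
import asyncio
import concurrent.futures
import threading


def start_bridge_loop():
    """Create a fresh loop and drive it from a daemon thread."""
    loop = asyncio.new_event_loop()

    def _drive():
        asyncio.set_event_loop(loop)
        loop.run_forever()

    threading.Thread(target=_drive, name="bridge-loop", daemon=True).start()
    return loop


async def describe_invoice(invoice_id):
    # Hypothetical async task body.
    await asyncio.sleep(0.01)
    return {"invoice_id": invoice_id, "ran_on": threading.current_thread().name}


def run_on_bridge(loop, coro, timeout=5.0):
    """Synchronous entry point: schedule on the bridge loop and wait."""
    future = asyncio.run_coroutine_threadsafe(coro, loop)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        # Ask the loop to cancel the coroutine instead of leaving it running.
        future.cancel()
        raise


bridge = start_bridge_loop()
caller_thread = threading.current_thread().name
result = run_on_bridge(bridge, describe_invoice("inv-1"))
print(caller_thread, result)
bridge.call_soon_threadsafe(bridge.stop)
```

The calling thread only blocks on a concurrent.futures.Future; the coroutine itself reports that it ran on the bridge thread.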

This solves both problems cleanly:

No asyncio.run() from inside a running loop. The loop lives on a different thread.
No inherited-but-dead loop. Each worker creates its own after fork.

The `push/apush` split

Dispatching tasks has its own version of this problem. Celery's send_task is synchronous and blocking: it opens a broker connection and writes a message. If you call it from inside an async FastAPI route handler, you block the event loop during a network round-trip.

This is why Relier has two dispatch methods:

# From async code (FastAPI, async Django):
receipt = await send_invoice.apush(invoice_id)

# From sync code (Flask routes, sync Django views, scripts):
receipt = send_invoice.push(invoice_id)

apush runs the blocking broker send in an executor so the async caller is never blocked:

async def apush(self, *args, **kwargs):
    # Admission check, schema wrapping, OTel context injection...

    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(
        None,
        lambda: celery_app.send_task(
            self.name,
            args=(envelope,),
            queue=queue,
            task_id=task_id,
        ),
    )
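
To see why the executor matters, here is a small self-contained sketch. blocking_send is a stand-in for a blocking broker call, not Celery's send_task, and the ticker exists only to show that the loop keeps scheduling other work while the send is in flight.

```python
import asyncio
import time


def blocking_send(payload):
    # Stand-in for a blocking network write to a broker.
    time.sleep(0.2)
    return f"sent:{payload}"


async def ticker(ticks):
    # Appends a timestamp every 10 ms for as long as the loop is free to run it.
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)


async def main():
    ticks = []
    tick_task = asyncio.create_task(ticker(ticks))
    loop = asyncio.get_running_loop()
    # The blocking call runs on the default thread pool, not on the loop thread.
    receipt = await loop.run_in_executor(None, blocking_send, "invoice-42")
    tick_task.cancel()
    await asyncio.gather(tick_task, return_exceptions=True)
    return receipt, len(ticks)


receipt, tick_count = asyncio.run(main())
print(receipt, tick_count)
```

If blocking_send were called directly inside main(), the ticker would record a single tick before the send and nothing during it.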

push explicitly guards against being called from inside a running loop:

def push(self, *args, **kwargs):
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        pass  # No running loop on this thread. Safe.
    else:
        raise RuntimeError(
            f"{self.name}.push() was called from inside a running event loop, "
            "where it would block and deadlock that loop. "
            f"Use `await {self.name}.apush(...)` instead."
        )

    # Inside a Celery worker: reuse the bridge loop.
    if worker_loop and worker_loop.is_running():
        future = asyncio.run_coroutine_threadsafe(
            self.apush(*args, **kwargs), worker_loop
        )
        return future.result(timeout=5.0)

    # Outside Celery (Flask route, script):
    return asyncio.run(self.apush(*args, **kwargs))

The error message in the RuntimeError matters. When someone calls push() from a FastAPI route handler, they get an actionable message telling them exactly what to do instead. Not a silent deadlock. Not a timeout with no context. A specific message at the exact moment the mistake is made.

The check itself, asyncio.get_running_loop() in a try/except RuntimeError, is the canonical way to detect whether the current thread is running an event loop. It raises RuntimeError if no loop is running on this thread, which is the safe case for push().
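
The guard is small enough to test in isolation. A sketch with hypothetical names (running_loop_on_this_thread, sync_only), not Relier's code:

```python
import asyncio


def running_loop_on_this_thread():
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False  # no loop on this thread: safe to block
    return True


def sync_only(name="push"):
    if running_loop_on_this_thread():
        raise RuntimeError(
            f"{name}() was called from inside a running event loop; "
            "use the async variant instead"
        )
    return "dispatched"


async def misuse():
    # Calling the sync entry point from a coroutine trips the guard.
    try:
        return sync_only()
    except RuntimeError as exc:
        return f"rejected: {exc}"


from_sync = sync_only()
from_async = asyncio.run(misuse())
print(from_sync)
print(from_async)
```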

Sync tasks in an async world

What about existing sync task functions? A codebase of def tasks shouldn't require a full rewrite to benefit from Relier's reliability stack.

Inside the orchestration coroutine, execution branches on whether the function is async:

if inspect.iscoroutinefunction(func):
    result = await func(*actual_args, **actual_kwargs)
else:
    result = await asyncio.to_thread(func, *actual_args, **actual_kwargs)

asyncio.to_thread runs the sync function in Python's default thread pool executor. The orchestration layer awaits it without blocking the bridge loop. All the async infrastructure (heartbeat refreshes, Phoenix registration, OTel span updates, fence validation) keeps running concurrently on the bridge loop while the sync function runs on a thread pool thread.
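
A runnable version of that branch, as a sketch: call_task and the two task functions are illustrative names, and each task returns the identifier of the thread it ran on so the difference is visible.

```python
import asyncio
import inspect
import threading


async def call_task(func, *args, **kwargs):
    if inspect.iscoroutinefunction(func):
        return await func(*args, **kwargs)
    # Sync functions go to the default thread pool so the loop stays free.
    return await asyncio.to_thread(func, *args, **kwargs)


async def async_task(x):
    await asyncio.sleep(0)
    return ("async", x, threading.get_ident())


def sync_task(x):
    return ("sync", x, threading.get_ident())


async def main():
    loop_thread = threading.get_ident()
    async_result = await call_task(async_task, 1)
    sync_result = await call_task(sync_task, 2)
    return loop_thread, async_result, sync_result


loop_thread, async_result, sync_result = asyncio.run(main())
print(async_result[2] == loop_thread, sync_result[2] == loop_thread)
```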

The constraint is honest: two-tier timeouts (soft_timeout, hard_timeout) only work for async def tasks. A sync function running in asyncio.to_thread cannot be cooperatively cancelled from outside. Relier raises ValueError at decoration time if you pass timeout parameters to a sync task, rather than silently providing no protection:

@rl_task(soft_timeout=8, hard_timeout=10)  # ValueError at import time
def sync_task(data: str) -> dict:
    ...

# Fix: convert to async def, or remove the timeout parameters.

Failing loudly at decoration time is better than failing silently at runtime when the timeout fires and nothing happens.
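
The check itself is cheap. Here is a minimal sketch of a decorator that fails at decoration time; task_with_timeouts is a hypothetical stand-in, not rl_task, and it only records the timeouts rather than enforcing them.

```python
import inspect


def task_with_timeouts(soft_timeout=None, hard_timeout=None):
    """Hypothetical decorator: rejects timeouts on sync functions up front."""

    def decorate(func):
        wants_timeouts = soft_timeout is not None or hard_timeout is not None
        if wants_timeouts and not inspect.iscoroutinefunction(func):
            raise ValueError(
                f"{func.__name__}: timeouts require an async def task; a sync "
                "function running in a thread cannot be cancelled cooperatively"
            )
        func.timeouts = (soft_timeout, hard_timeout)
        return func

    return decorate


@task_with_timeouts(soft_timeout=8, hard_timeout=10)
async def async_task(data):
    return {"data": data}


try:
    @task_with_timeouts(soft_timeout=8, hard_timeout=10)
    def sync_task(data):
        return {"data": data}
except ValueError as exc:
    decoration_error = str(exc)
else:
    decoration_error = None

print(decoration_error)
```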

Timeout enforcement without thread kills

Two-tier timeouts deserve their own explanation because they interact with the bridge loop in a non-obvious way.

When a task starts, Relier spawns two watcher coroutines as asyncio tasks alongside the actual work:

task_coro = asyncio.create_task(func(*args, **kwargs))

async def _soft_timeout_handler():
    await asyncio.sleep(float(soft))
    if not task_coro.done():
        # Fire the recovery hook. Task keeps running.
        if on_soft:
            await on_soft(ctx)

async def _hard_timeout_handler():
    await asyncio.sleep(float(hard))
    if not task_coro.done():
        task_coro.cancel()  # Delivers CancelledError at next await point.

soft_watcher = asyncio.create_task(_soft_timeout_handler())
hard_watcher = asyncio.create_task(_hard_timeout_handler())

done, pending = await asyncio.wait(
    {task_coro, hard_watcher},
    return_when=asyncio.FIRST_COMPLETED,
)

All three coroutines run concurrently on the bridge loop. The soft timeout fires and calls your recovery hook, where you can call ctx.set_partial(state) to checkpoint work in progress while the task keeps running. If the task doesn't finish before the hard deadline, task_coro.cancel() delivers asyncio.CancelledError at the task's next await point.

No thread kills. No SIGALRM. No OS-level signals. Pure cooperative asyncio cancellation. This matters for cleanup: CancelledError propagates through finally blocks. Resources get released. Partial state gets checkpointed. The task gets quarantined to the DLQ with its full payload and resurrection history. None of that happens with a hard OS kill.
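
The cleanup guarantee is easy to verify with a standalone sketch. run_with_timeouts below is a simplified illustration of the two-watcher idea, not Relier's orchestration code; the timeouts are shortened to fractions of a second and the recovery hook just records that it fired.

```python
import asyncio


async def run_with_timeouts(coro, soft, hard, on_soft=None):
    """Run coro with a soft hook and a hard cooperative cancel (sketch)."""
    task = asyncio.ensure_future(coro)

    async def soft_watch():
        await asyncio.sleep(soft)
        if not task.done() and on_soft is not None:
            await on_soft()  # the task keeps running

    async def hard_watch():
        await asyncio.sleep(hard)
        if not task.done():
            task.cancel()  # CancelledError lands at the task's next await

    watchers = [
        asyncio.ensure_future(soft_watch()),
        asyncio.ensure_future(hard_watch()),
    ]
    try:
        return await task
    finally:
        for watcher in watchers:
            watcher.cancel()
        await asyncio.gather(*watchers, return_exceptions=True)


events = []


async def slow_job():
    try:
        await asyncio.sleep(10)
    finally:
        events.append("cleanup")  # runs even when the task is cancelled


async def on_soft():
    events.append("soft")


async def main():
    try:
        await run_with_timeouts(slow_job(), soft=0.05, hard=0.15, on_soft=on_soft)
    except asyncio.CancelledError:
        events.append("hard-cancelled")


asyncio.run(main())
print(events)
```

The finally block in slow_job runs before the caller ever sees the cancellation, which is the ordering a DLQ write or checkpoint depends on.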

The disposable loop case

One edge case worth knowing: outside a Celery worker (in a CLI script, a management command, a test) there's no bridge loop. The task wrapper's loop resolution falls through to creating a fresh event loop just for that call:

def _get_worker_loop():
    # 1. Check for persistent worker bridge.
    if relier.tasks.app.worker_loop is not None:
        return relier.tasks.app.worker_loop

    # 2. Check for a running loop on this thread (test contexts).
    try:
        return asyncio.get_running_loop()
    except RuntimeError:
        pass

    # 3. Create a disposable loop for this one call.
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    return loop

Disposable loops are cleaned up after the call: Redis connections are closed, the loop is stopped and closed, and asyncio.set_event_loop(None) clears the thread-local reference. The persistent worker_loop is specifically excluded from this cleanup path, because closing the bridge loop mid-execution would kill all in-flight tasks.
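
The cleanup order for a disposable loop can be sketched as follows. run_on_disposable_loop is a hypothetical helper, the Redis teardown is omitted, and the loops_seen list exists only so the example can confirm the loop was closed.

```python
import asyncio

loops_seen = []  # kept only so the example can verify cleanup


def run_on_disposable_loop(coro):
    loop = asyncio.new_event_loop()
    loops_seen.append(loop)
    asyncio.set_event_loop(loop)
    try:
        return loop.run_until_complete(coro)
    finally:
        try:
            # Finalize any async generators the call left behind.
            loop.run_until_complete(loop.shutdown_asyncgens())
        finally:
            asyncio.set_event_loop(None)  # clear the thread-local reference
            loop.close()


async def lookup(key):
    await asyncio.sleep(0)
    return key.upper()


value = run_on_disposable_loop(lookup("abc"))
print(value, loops_seen[0].is_closed())
```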

What I learned

The prefork problem is the kind of failure that shows up as "nothing happens" rather than an exception. You schedule coroutines, they don't run, no error surfaces. It took a day of debugging the wrong thing before I isolated it to the inherited-but-dead loop. The fix (create a fresh loop in worker_process_init) is obvious in retrospect. Getting there required understanding exactly what fork() does to threads.

asyncio.run_coroutine_threadsafe is underused. Most Python developers never need to cross a thread boundary into a running event loop, so the API is obscure. But for anything that marries a sync framework (Celery, Django ORM, WSGI in general) with async internals, it is the correct and safe way to do it. It appears in the Python docs in a single paragraph. It deserves more.

The two-method dispatch split (push/apush) is the right API surface even though it introduces surface area. The alternative, a single method that auto-detects the context and does the right thing, sounds better but produces confusing failures when the auto-detection is wrong. The explicit split makes the contract clear. Async code always uses apush. Sync code always uses push. The guard in push() exists so that misuse produces a useful error immediately rather than a deadlock ten seconds later.

Cooperative timeout cancellation is better than OS-level signals for tasks that care about cleanup. The finally block guarantee is the part that matters: partial state can be persisted, connections can be closed, the DLQ entry gets written with everything needed to re-inspect or re-dispatch. An OS kill gives you none of that.

The whole bridge (bridge loop thread, run_coroutine_threadsafe, push/apush split, disposable loop cleanup) is about 200 lines in app.py and decorator.py combined. The complexity is real but contained. Once the pattern is in place, every async def task function just works, without the task author knowing anything about the event loop infrastructure underneath.

GitHub: github.com/getrelier/relier

Docs: getrelier.github.io/relier

pip install relier

Redis Lua Scripting for Distributed Systems: How Atomicity Prevents Race Conditions.

Kolade Fajimi — Mon, 15 Jun 2026 09:29:13 +0000

There is a failure mode that is almost impossible to reproduce in tests but completely reproducible in production under load. Two workers. One Redis key. Both check whether a task has already been claimed. Both see "no." Both claim it. Now the task runs twice.

This is not a logic bug. The logic is correct. The problem is that correctness requires atomicity, and two separate Redis commands, however fast, are never atomic.

Every critical operation in Relier is a Lua script. Not because Lua is elegant. Because Redis's single-threaded execution model means a Lua script is the only way to make a distributed check-then-act actually correct.

This is a walk through all six of them.

Why GET + SET is never safe

When you need to claim a task, check whether it's been taken, and if not, take it, the naive implementation is two commands:

existing = await redis.get(key)
if existing is None:
    await redis.set(key, worker_id)

This has a race condition that is invisible at low concurrency and guaranteed to trigger at high concurrency.

t=0ms  Worker A: GET "task:123" → nil   # not claimed
t=0ms  Worker B: GET "task:123" → nil   # not claimed (same read, no lock)
t=1ms  Worker A: SET "task:123" "worker-A"
t=1ms  Worker B: SET "task:123" "worker-B"

Both workers claimed the task. Worker A's write is immediately overwritten by Worker B. Now both are executing it.

The fix is not a Python-side lock. Python-side locks are process-local: they can't protect you across multiple workers, containers, or machines. The fix is moving the check-and-set into a single Redis operation that cannot be interleaved with anything else.

That is what Lua gives you. Redis executes Lua scripts atomically. No other command runs between the first line and the last.
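
The interleaving can be replayed deterministically without a Redis server. The sketch below uses an in-memory stand-in (FakeRedis is illustrative, not a real client); its lock models the fact that Redis runs one script at a time.

```python
import threading


class FakeRedis:
    """In-memory stand-in for a Redis keyspace (illustration only)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

    def claim_atomically(self, key, value):
        # Models one indivisible check-and-set, as a Lua script provides.
        with self._lock:
            if key in self._data:
                return False
            self._data[key] = value
            return True


def naive_claims(store, key, workers):
    # Replay the timeline: every worker reads first, then every worker writes.
    reads = {worker: store.get(key) for worker in workers}
    winners = []
    for worker in workers:
        if reads[worker] is None:
            store.set(key, worker)
            winners.append(worker)
    return winners


def atomic_claims(store, key, workers):
    return [w for w in workers if store.claim_atomically(key, w)]


workers = ["worker-A", "worker-B"]
naive_winners = naive_claims(FakeRedis(), "task:123", workers)
atomic_winners = atomic_claims(FakeRedis(), "task:123", workers)
print(naive_winners, atomic_winners)
```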

ACQUIRE_LUA (idempotency claim)

The first script Relier uses is the idempotency claim. When @rl_task(idempotent=True) is set, every task submission runs this before any work happens:

local existing = redis.call('GET', KEYS[1])
if existing then
    return {1, existing}
end
redis.call('SET', KEYS[1], ARGV[1], 'NX', 'EX', ARGV[2])
return {0, false}

KEYS[1] is the idempotency key, derived from the task name and arguments, or set explicitly by the caller. ARGV[1] is the in-flight sentinel (a UUID tied to this execution attempt). ARGV[2] is the TTL.

What this closes: the race between two concurrent submissions of the same task. Both arrive at Redis. Both invoke the script. Redis executes them one at a time. The first one finds no existing value (a missing key comes back to Lua as false), claims the key, and returns {0, false}: proceed. The second one sees existing = <the sentinel> and returns {1, existing}: already claimed, skip. One execution. Not two.

The NX flag on the SET is belt-and-suspenders. The GET already gates on existence, but NX means the SET itself is also a no-op if the key was written between the GET and SET, which cannot happen inside a Lua script, but matters if you ever run the commands outside one.

The return value carries the existing state so the caller can distinguish between in-flight (sentinel value) and completed (cached result JSON). That distinction is what lets Relier return the cached result to a duplicate caller without re-running the task.

RELEASE_LUA (compare-and-delete)

When a task fails mid-execution, Relier needs to release the in-flight sentinel so future submissions can retry. The naive approach is a plain DEL:

await redis.delete(key)

This is wrong.

Consider: Worker A claims task X and sets its sentinel. Worker A stalls long enough to look dead. Relier's Phoenix resurrector detects the expired heartbeat and re-queues task X. Worker B claims it and sets its own sentinel. Now Worker A's failure handler runs (the process never actually died, it threw an exception) and issues DEL key. It deletes Worker B's sentinel. Worker B's task is now claimable by a third worker. Task runs twice.

The correct operation is conditional delete: delete the key only if the stored value is still your sentinel. That is RELEASE_LUA:

if redis.call('GET', KEYS[1]) == ARGV[1] then
    return redis.call('DEL', KEYS[1])
end
return 0

ARGV[1] is the sentinel value this worker set when it claimed the task. If the stored value no longer matches, because another worker claimed it after resurrection, the delete is skipped. Worker B's execution is protected.

The compare-and-delete pattern is a close relative of compare-and-swap (CAS) from the distributed systems literature. The point is always the same: never release a lock without first verifying you still own it. An unconditional DEL on a key that another worker may now own is always wrong.
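
The stale-release scenario can be replayed against a plain dictionary. This is a model of the two release strategies, not Relier's code: the key name and sentinel values are made up, and each function stands in for one atomic Redis operation.

```python
def plain_release(store, key, sentinel):
    # Unconditional delete: ignores who owns the key.
    store.pop(key, None)
    return 1


def release_if_owner(store, key, sentinel):
    # Compare-and-delete: only the current owner may release.
    if store.get(key) == sentinel:
        del store[key]
        return 1
    return 0


def replay(release):
    store = {}
    key = "idem:task-x"
    store[key] = "sentinel-A"  # Worker A claims the task
    store[key] = "sentinel-B"  # after resurrection, Worker B owns the slot
    released = release(store, key, "sentinel-A")  # Worker A's late handler runs
    return released, store.get(key)


plain_outcome = replay(plain_release)
guarded_outcome = replay(release_if_owner)
print(plain_outcome, guarded_outcome)
```

With the plain delete, Worker B's sentinel is gone; with the guarded version, the late release is a no-op.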

RESURRECT_LUA (atomic lease and fence token)

When the Phoenix resurrector detects a dead worker and re-queues a task, two things need to happen atomically:

Acquire a lease, so concurrent resurrectors don't both re-queue the same task
Publish a new fence token, so the zombie worker (if it wakes up) can't commit stale results

local lease_key = KEYS[1]
local fence_key = KEYS[2]

local token = ARGV[1]
local lease_ttl = tonumber(ARGV[2])
local fence_ttl = tonumber(ARGV[3])

if redis.call("EXISTS", lease_key) == 1 then
    return 0
end

redis.call("SET", lease_key, token, "EX", lease_ttl)
redis.call("SET", fence_key, token, "EX", fence_ttl)

return 1

The atomicity here solves a specific problem. Imagine two resurrectors scanning simultaneously, which happens whenever you run more than one worker embedding the scanner. Both detect the same expired heartbeat. If they could interleave:

Resurrector A: EXISTS lease_key → 0  (no lease)
Resurrector B: EXISTS lease_key → 0  (no lease, A hasn't written yet)
Resurrector A: SET lease_key ...
Resurrector B: SET lease_key ...     # both acquired the lease
Both dispatch the task to the queue

The script prevents this. The EXISTS check and the SET are one atomic unit. One of the two resurrectors wins. The other sees EXISTS = 1 and returns 0. One re-queue dispatch. Not two.

The fence token is set in the same script for a different reason: the new fence token must be visible in Redis before any worker picks up the re-queued task. If lease acquisition and fence token write were separate commands, a worker could pick up the task between them, before the fence token was set, and have nothing to validate against. That window is eliminated by collapsing both writes into one atomic script.
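
A threaded model makes the one-winner property concrete. In this sketch the lock stands in for Redis's one-script-at-a-time execution; FakeRedis and the key names are hypothetical, and TTL handling is left out.

```python
import threading


class FakeRedis:
    """In-memory model; the lock plays the role of atomic script execution."""

    def __init__(self):
        self.data = {}
        self._script_lock = threading.Lock()

    def resurrect(self, lease_key, fence_key, token):
        with self._script_lock:
            if lease_key in self.data:
                return 0
            # Lease and fence become visible together, never one without the other.
            self.data[lease_key] = token
            self.data[fence_key] = token
            return 1


def race(resurrector_count=8):
    store = FakeRedis()
    barrier = threading.Barrier(resurrector_count)
    outcomes = {}

    def resurrector(index):
        token = f"token-{index}"
        barrier.wait()  # release all scanners at the same moment
        outcomes[token] = store.resurrect("lease:task-x", "fence:task-x", token)

    threads = [
        threading.Thread(target=resurrector, args=(i,))
        for i in range(resurrector_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return store, outcomes


store, outcomes = race()
winners = [token for token, won in outcomes.items() if won == 1]
print(winners, store.data)
```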

VALIDATE_LUA (worker self-check)

While a task is running, the worker periodically validates that it still owns the execution slot, that it hasn't been declared dead and resurrected while it was doing real work:

local lease_key = KEYS[1]
local fence_key = KEYS[2]

local expected = ARGV[1]

local lease = redis.call("GET", lease_key)
if lease ~= expected then
    return 0
end

local fence = redis.call("GET", fence_key)
if fence ~= expected then
    return 2
end

return 1

Return values: 1 = still current, 0 = invalid lease, 2 = stale fence.

The worker checks both the lease key and the fence key against its own token. If either has been overwritten by a resurrector — because the worker's heartbeat expired while it was GC-paused or doing slow I/O — the worker gets a non-1 return, cancels its own execution, and exits cleanly.

This is how Relier handles the zombie worker problem: the zombie doesn't get detected by an external process and killed — it self-detects during a validation check and stops. The heartbeat expiry + periodic VALIDATE_LUA check forms a cooperative detection loop: the resurrector re-queues the task, the zombie eventually validates and cancels itself.

COMMIT_CHECK_LUA (the last gate before writing results)

Even with VALIDATE_LUA running periodically, there is a window: the last validation passes, then the fence token expires, then the task tries to write its result. Without a final gate, that write goes through even though the worker is now a zombie from Redis's perspective.

COMMIT_CHECK_LUA runs immediately before any result is committed to storage:

local fence_key = KEYS[1]
local expected = ARGV[1]

local current = redis.call("GET", fence_key)

if current ~= expected then
    return 0
end

return 1

0 = stale worker, result rejected. 1 = still current, write proceeds.

This is the commit protocol that makes "exactly-once execution" a verifiable claim rather than an optimistic assertion. The fence token is the proof of ownership. The Lua script makes checking that proof atomic: the fence read and the verdict are one Redis operation, so the verdict never reflects a half-updated fence. For the guarantee to cover the write itself, the result write has to be conditioned on that verdict inside the same atomic step; a separate write issued after a successful check would reopen the window.

In our chaos tests, this script rejects stale commits from zombie workers on every run. The log line "zombie commit rejected" appears exactly as often as "GC-pause-length resurrection" events, one rejection per zombie wakeup. The math checks out.

ADMISSION_LUA (rate limiting without races)

The admission control script is the simplest of the set and the most illustrative of why Lua is the right tool:

local current = redis.call('INCR', KEYS[1])
if current == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[2])
end
local limit = tonumber(ARGV[1])
if current > limit then
    return {0, current, redis.call('TTL', KEYS[1])}
end
return {1, current, 0}

KEYS[1] = the rate limit window key (e.g., rl:admission:global). ARGV[1] = the request limit. ARGV[2] = the window duration in seconds.

The correctness problem here is the INCR + EXPIRE pair. INCR creates the key if it doesn't exist and increments atomically. But if the EXPIRE ran as a separate command:

Worker A: INCR → 1
Worker A: (crashes before EXPIRE)   # the TTL is never set
Worker B: INCR → 2                  # not the first increment, so B does not set one either

If the EXPIRE never runs (crash, exception, anything between the two commands), the key has no TTL. It accumulates forever. Every request after the window should have reset is rejected until someone manually clears it.

Inside the Lua script, EXPIRE runs on the first increment (current == 1) in the same atomic block as the INCR. Either both happen or neither happens. The window key always has a TTL.

The returned TTL from the final TTL call becomes the Retry-After value in the HTTP 429 response. The client knows exactly when the window resets and when to retry. This is not an approximation: it is the live Redis TTL, accurate to the second.
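
The script's behaviour over one window can be traced with an in-memory model. FixedWindowLimiter is an illustration of the same INCR, EXPIRE, and TTL logic, not Relier's code, and it takes an injected clock so the run is deterministic.

```python
import math


class FixedWindowLimiter:
    """In-memory model of the INCR + EXPIRE window (not Redis itself)."""

    def __init__(self, limit, window_seconds, clock):
        self.limit = limit
        self.window_seconds = window_seconds
        self._clock = clock
        self._count = 0
        self._expires_at = None

    def admit(self):
        now = self._clock()
        if self._expires_at is not None and now >= self._expires_at:
            self._count = 0  # the window key expired
            self._expires_at = None
        self._count += 1  # INCR
        if self._count == 1:
            self._expires_at = now + self.window_seconds  # EXPIRE on first hit
        if self._count > self.limit:
            retry_after = math.ceil(self._expires_at - now)  # TTL
            return (0, self._count, retry_after)
        return (1, self._count, 0)


now = [0.0]
limiter = FixedWindowLimiter(limit=2, window_seconds=10, clock=lambda: now[0])

results = []
for timestamp in (0.0, 1.0, 3.0, 10.0):
    now[0] = timestamp
    results.append(limiter.admit())

print(results)
```

The third request is rejected with a retry hint of 7 seconds, and the fourth lands in a fresh window.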

Why EVALSHA matters on hot paths

Loading a Lua script into Redis via SCRIPT LOAD returns a SHA1 hash. Subsequent calls use EVALSHA instead of EVAL, so the script body never travels over the network again.

On a hot admission control path processing 5,000 requests per 10-second window, that is 5,000 round-trips that skip the script serialization and deserialization entirely. The Redis server has the script compiled in its script cache. The only network payload is the EVALSHA command, the key, and the arguments.

If the Redis server is restarted, the script cache is cleared. EVALSHA will return NOSCRIPT. Relier handles this with a fallback: on NoScriptError, reload the script and retry:

async def _evalsha_with_fallback(self, redis_client, sha, window_key, *args):
    try:
        return await redis_client.evalsha(sha, 1, window_key, *args)
    except redis.exceptions.NoScriptError:
        self._script_sha = await redis_client.script_load(ADMISSION_LUA)
        return await redis_client.evalsha(self._script_sha, 1, window_key, *args)

One reload on restart. Every call after that is EVALSHA again.
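
The reload path can be exercised without a server. In this sketch FakeScriptServer and its NoScriptError are stand-ins for a Redis client and redis.exceptions.NoScriptError; the one real-world detail it relies on is that the script identifier is the SHA1 hex digest of the script body.

```python
import hashlib


class NoScriptError(Exception):
    """Stand-in for redis.exceptions.NoScriptError."""


class FakeScriptServer:
    """Models only the script cache, not script execution."""

    def __init__(self):
        self.cache = {}
        self.loads = 0

    def script_load(self, source):
        sha = hashlib.sha1(source.encode()).hexdigest()
        self.cache[sha] = source
        self.loads += 1
        return sha

    def evalsha(self, sha, *args):
        if sha not in self.cache:
            raise NoScriptError(sha)
        return ("ran", args)

    def restart(self):
        self.cache.clear()  # a restart empties the script cache


def evalsha_with_fallback(server, source, sha, *args):
    try:
        return server.evalsha(sha, *args)
    except NoScriptError:
        server.script_load(source)  # one reload, then retry
        return server.evalsha(sha, *args)


SCRIPT = "return 1"
server = FakeScriptServer()
first_sha = server.script_load(SCRIPT)

server.restart()
result = evalsha_with_fallback(server, SCRIPT, first_sha, "rl:admission:global", 5000, 10)
again = evalsha_with_fallback(server, SCRIPT, first_sha, "rl:admission:global", 5000, 10)
print(result, server.loads)
```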

Our benchmark warmup phase (100 iterations discarded before measurement) covers exactly this: Redis Lua scripts load on first call, connection pools establish, the asyncio loop settles. By the time we start measuring, every script is loaded and every subsequent call hits the SHA cache. The p99 0.559ms admission control latency reflects the warm-path cost, not the cold-start cost.

What I learned

The failure modes Lua scripts prevent are all variations of the same shape: check a condition, act on it. In a single-threaded program, that is always safe. In a distributed system, the gap between the check and the action is where concurrent writes slip through.

The non-obvious lesson is that this problem does not get easier as your system gets faster. Faster workers mean more concurrent check-then-act operations per second, which means more collisions per second, which means more corrupted state per second. Atomicity requirements get harder to satisfy as throughput increases, not easier.

The six scripts in Relier each close a specific gap. ACQUIRE_LUA closes the claim race. RELEASE_LUA closes the stale-release race. RESURRECT_LUA closes the double-resurrection race. VALIDATE_LUA closes the zombie-detection gap. COMMIT_CHECK_LUA closes the stale-commit window. ADMISSION_LUA closes the TTL-expiry race.

None of these are novel patterns. Distributed systems literature has described all of them. What took time was mapping each abstract race to a concrete failure in a running test, watching it reproduce, and then verifying that the Lua script eliminated it.

The chaos suite in the Relier repo exists for exactly this: run it against your own cluster, on your own Redis, with your own task code. The correctness claims should survive your environment, not just ours.

GitHub: https://github.com/getrelier/relier
Docs: https://getrelier.github.io/relier
Architecture reference (all scripts): https://getrelier.github.io/relier/architecture/
pip install relier

Celery loses 8% of your tasks by default. Here's the reliability layer I built to fix that.

Kolade Fajimi — Tue, 02 Jun 2026 00:47:11 +0000

Celery is one of the most widely deployed task queue systems in Python. It is also, by default, a system that silently loses tasks the moment a worker crashes: about 8% of them in our crash benchmark.

This is not a bug. It is the designed default behaviour. And most teams shipping Celery in production either do not know about it or have accepted it as a cost of doing business.

I built Relier because I was not willing to accept it.

How Celery loses tasks

When a Celery worker picks up a task from the broker, it sends an acknowledgement (ACK) immediately, before the task runs. From the broker's perspective, the task is done. The worker owns it now.

If the worker is killed (OOM, SIGKILL, kernel memory pressure, deploy) while the task is executing, the broker has already marked that task as delivered. The task is gone. No retry, no trace, no record it was ever picked up.

This is task_acks_late=False, Celery's default.

At 10M tasks per day, 8% loss is 800,000 silently dropped jobs. Every. Single. Day.

Why flipping `task_acks_late=True` is not enough

The standard advice for this problem is task_acks_late=True. It helps. In our benchmarks, it takes delivery from 92.0% to 96.0%, recovering about half the lost tasks.

But it does not solve the problem, for a specific reason.

When a worker dies with task_acks_late=True, the broker keeps the unacknowledged message in an unacked set. Redelivery is gated by visibility_timeout, the time the broker waits before assuming the worker is gone and requeuing the message. On the Redis broker, this defaults to one hour.

So a task killed at 2:00 PM sits waiting for redelivery until 3:00 PM. In most production systems, the SLA for that task is measured in seconds or minutes, not hours.

The deeper problem: you have traded silent loss for hour-long redelivery latency, without knowing which tasks are stuck in that limbo.

Our bench ran 500 tasks through 5 SIGKILL cycles:

| Configuration | Delivery rate |
| --- | --- |
| Vanilla Celery (default) | 92.0% (460/500) |
| Vanilla + `task_acks_late=True` | 96.0% (480/500) |
| Relier | 100% (500/500) |

The Phoenix Pattern

Relier implements what I call the Phoenix Pattern. The design is straightforward in principle and non-trivial to get right in practice.

Every @rl_task registers a heartbeat in Redis when it starts executing, a key with a configurable TTL (default 10 seconds). The task refreshes that heartbeat on a background loop while it runs. Every worker embeds a resurrection scanner that watches for expired heartbeats every few seconds, so the surviving workers recover a dead worker's tasks on their own. Distributed locks keep concurrent scanners from replaying the same task twice. (You can also run a standalone rl run-resurrector process as belt-and-suspenders for the case where every worker dies at once.)

When a worker dies mid-task, its heartbeat stops refreshing. After one TTL window, the resurrector detects the expired heartbeat and atomically re-queues the orphaned task onto a special re-queue queue. A healthy worker picks it up. The original task arguments are preserved exactly.

In our benchmarks, OOM recovery averaged 7.1 seconds with a p99 of 8.9 seconds. Not 35 seconds, not an hour. Seconds.

Worker dies at t=0
Heartbeat expires by t=10s (at most one heartbeat_ttl after the last refresh)
Resurrector detects by t=12s (next scan)
Task re-queued at t=12s (atomic)
Healthy worker picks up at t=12–14s
Task completes

This is why Relier achieves 100% delivery: it does not rely on the broker's visibility timeout. It has its own independent detection mechanism with a TTL you control.
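
The worst case in that timeline is plain arithmetic. A sketch, where the 2-second scan interval is an assumption read off the t=12s detection point rather than a documented default:

```python
HEARTBEAT_TTL_S = 10         # default heartbeat TTL from the text
SCAN_INTERVAL_S = 2          # assumption: inferred from detection at t=12s
VISIBILITY_TIMEOUT_S = 3600  # Redis broker default of one hour


def worst_case_detection_s(heartbeat_ttl_s, scan_interval_s):
    # The heartbeat outlives the worker by at most one TTL, and the scanner
    # is at most one interval away from its next pass.
    return heartbeat_ttl_s + scan_interval_s


worst_case = worst_case_detection_s(HEARTBEAT_TTL_S, SCAN_INTERVAL_S)
ratio = VISIBILITY_TIMEOUT_S / worst_case
print(worst_case, ratio)
```

Under those numbers the detection bound is 12 seconds, 300 times shorter than waiting out the visibility timeout.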

The hard part: fence tokens and zombie workers

The description above makes Phoenix sound simple. The part that took the most work to get right is the zombie worker problem.

Consider this scenario:

1. Worker A picks up Task X. Heartbeat registered.
2. Worker A has a long GC pause. Its heartbeat expires.
3. The resurrector detects the expired heartbeat and re-queues Task X.
4. Worker B picks up Task X and completes it. Result committed to Redis.
5. Worker A wakes up from its GC pause and tries to commit its result.

Without any protection, step 5 causes silent data corruption. Worker A commits a stale result, overwriting Worker B's correct result. The task has now effectively executed twice, with the wrong result stored.

Relier prevents this with fence tokens. When Phoenix re-queues a task, it generates a new fence token, a monotonically increasing integer associated with the task's execution slot. The completion protocol is an atomic Lua script: "commit this result only if the current fence token matches the token this worker was given when it claimed the task."

Worker A was given fence token v1. After resurrection, the slot is now at v2. When Worker A tries to commit, the Lua script sees the mismatch and rejects the write. No data corruption. No duplicate result.

This is the correctness guarantee that makes "exactly-once execution" mean something.
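
The compare-and-commit rule can be sketched in a few lines. This is an illustrative in-memory model, not Relier's Lua script: ExecutionSlot, claim, and commit are hypothetical names, and a threading.Lock plays the role that Lua script atomicity plays in Redis.

```python
import threading


class ExecutionSlot:
    """In-memory stand-in for one task's execution slot (hypothetical names)."""

    def __init__(self):
        self._lock = threading.Lock()  # stands in for the atomicity of a Lua script
        self.fence = 0
        self.result = None

    def claim(self) -> int:
        # Called on first pickup and again on every resurrection:
        # bump the fence token and hand the new value to the claiming worker.
        with self._lock:
            self.fence += 1
            return self.fence

    def commit(self, token: int, result) -> bool:
        # Compare-and-commit: write only if the caller still holds the current token.
        with self._lock:
            if token != self.fence:
                return False  # stale worker; its write is rejected
            self.result = result
            return True


slot = ExecutionSlot()
token_a = slot.claim()          # Worker A claims the task (token 1)
token_b = slot.claim()          # after resurrection, Worker B claims it (token 2)
slot.commit(token_b, "fresh")   # accepted
slot.commit(token_a, "stale")   # rejected: token 1 no longer matches
```

The important property is that the comparison and the write happen as one indivisible step. Doing the check and the write as two separate round trips would reopen the race.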

Everything else Relier adds

Beyond Phoenix, a production-grade task system needs several more things. Relier ships them as part of the same @rl_task decorator:

Idempotency. @rl_task(idempotent=True) adds an atomic Redis Lua check before task execution. If the same task has already been submitted for the same logical key (which you can set explicitly or let Relier derive from the arguments), the second submission returns immediately without spawning work. In our benchmark: 50 submissions of the same task, 1 execution. Vanilla Celery: 50 executions.
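
As a rough sketch of the claim-before-execute idea (not Relier's implementation), the following derives a logical key from the task name and arguments and admits only the first submission for that key. derive_key and IdempotencyGate are hypothetical names, the hashing scheme is an assumption, and a Python set stands in for the atomic check that would run in Redis.

```python
import hashlib
import json


def derive_key(task_name: str, args: tuple, kwargs: dict) -> str:
    """Derive a logical key from the task name and arguments (assumed scheme)."""
    payload = json.dumps([task_name, list(args), kwargs], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


class IdempotencyGate:
    """In-memory stand-in for an atomic claim in Redis."""

    def __init__(self):
        self._claimed = set()

    def try_claim(self, key: str) -> bool:
        # True only for the first caller with this key; later callers are duplicates.
        if key in self._claimed:
            return False
        self._claimed.add(key)
        return True


gate = IdempotencyGate()
key = derive_key("send_invoice", ("inv_42",), {})
executions = sum(gate.try_claim(key) for _ in range(50))  # 50 submissions, 1 claim
```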

Two-tier timeouts. soft_timeout=8, hard_timeout=10 gives you a cleanup hook that fires at 8 seconds (save state, close connections, emit structured logs) and a hard cancellation at 10 seconds via asyncio.CancelledError. Zombie tasks that would block a worker forever are quarantined instead.
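
One way to get this two-tier behaviour with plain asyncio is sketched below. It is an illustration under assumptions, not Relier's code: run_with_timeouts and on_soft_timeout are hypothetical names, and the quarantine step is reduced to returning the string "cancelled".

```python
import asyncio


async def run_with_timeouts(coro_fn, soft_timeout, hard_timeout, on_soft_timeout):
    """Run coro_fn(); call the cleanup hook at soft_timeout, cancel at hard_timeout."""
    task = asyncio.create_task(coro_fn())

    # Tier 1: wait up to the soft limit without cancelling anything.
    done, _ = await asyncio.wait({task}, timeout=soft_timeout)
    if not done:
        on_soft_timeout()  # cleanup hook: save state, close connections, log
        # Tier 2: give the task the remaining time up to the hard limit.
        done, _ = await asyncio.wait({task}, timeout=hard_timeout - soft_timeout)

    if not done:
        task.cancel()  # raises asyncio.CancelledError inside the task
        try:
            await task
        except asyncio.CancelledError:
            pass
        return "cancelled"
    return task.result()


async def slow_task():
    await asyncio.sleep(1.0)
    return "done"


events = []
outcome = asyncio.run(
    run_with_timeouts(slow_task, 0.05, 0.1, lambda: events.append("soft"))
)
```

Cancellation here is cooperative: it is delivered at the task's next await point, so a task that blocks the event loop with synchronous work cannot be interrupted this way.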

Graceful shutdown. On SIGTERM, the worker drains in-flight tasks rather than dropping them. Tasks that cannot complete before shutdown are handed off to Phoenix on the re-queue queue. In our benchmark: 3 cycles of 20 tasks each, Relier 100% survival, vanilla Celery 0%.

Dead Letter Queue. Tasks that exhaust their max_resurrections allowance are quarantined to the DLQ with their full payload, stack trace, and complete resurrection history. The rl dlq inspect CLI shows everything. rl dlq release <id> re-dispatches a specific failed task. Nothing disappears silently.

Admission control. An atomic Lua fixed-window rate limiter runs on every apush() call. If the cluster is saturated, you get an AdmissionRejectedError with a Retry-After header, not a flooded queue and a cascade failure. In our benchmark: p99 0.559ms, well under the 1ms claim.
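
For readers unfamiliar with fixed-window limiting, here is a small in-memory sketch of the algorithm. FixedWindowLimiter and its parameters are hypothetical names; the real check would run atomically in Redis, and the limits shown are made up for the example.

```python
class FixedWindowLimiter:
    """Fixed-window rate limiter: at most `limit` admissions per window."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.counts = {}  # window_id -> admissions so far

    def admit(self, now: float):
        """Return (admitted, retry_after_seconds)."""
        window_id = int(now // self.window)
        count = self.counts.get(window_id, 0)
        if count >= self.limit:
            # Saturated: tell the caller how long until the next window opens.
            retry_after = (window_id + 1) * self.window - now
            return False, retry_after
        self.counts = {window_id: count + 1}  # keep only the current window
        return True, 0.0


limiter = FixedWindowLimiter(limit=2, window_seconds=1.0)
limiter.admit(now=0.10)  # admitted
limiter.admit(now=0.20)  # admitted
limiter.admit(now=0.25)  # rejected, retry after 0.75s
limiter.admit(now=1.00)  # new window, admitted again
```

A fixed window is cheap because it needs a single counter per window, at the cost of allowing a burst of up to twice the limit across a window boundary.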

Rolling deploy protection. Every payload is wrapped in a versioned envelope with a SHA-256 checksum. Register a migration function, bump CURRENT_VERSION, and v2 workers silently upgrade v1 payloads mid-deploy. Old and new workers can run simultaneously without payload schema mismatches. Checksums catch broker-side corruption before your code ever runs.
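
The envelope idea can be sketched as follows. Only CURRENT_VERSION comes from the description above; wrap, unwrap, the migration decorator, the envelope field names, and the example v1-to-v2 rename are all assumptions made for illustration, not Relier's actual API.

```python
import hashlib
import json

CURRENT_VERSION = 2
MIGRATIONS = {}  # from_version -> function that upgrades a payload by one version


def migration(from_version: int):
    """Register a function that upgrades payloads from `from_version` to the next."""
    def register(fn):
        MIGRATIONS[from_version] = fn
        return fn
    return register


def _digest(version: int, payload: dict) -> str:
    body = json.dumps({"v": version, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode()).hexdigest()


def wrap(payload: dict, version: int = CURRENT_VERSION) -> dict:
    return {"v": version, "payload": payload, "sha256": _digest(version, payload)}


def unwrap(envelope: dict) -> dict:
    version, payload = envelope["v"], envelope["payload"]
    # Verify the checksum before any task code sees the payload.
    if _digest(version, payload) != envelope["sha256"]:
        raise ValueError("checksum mismatch: payload corrupted in transit")
    # Upgrade older payloads one version at a time.
    while version < CURRENT_VERSION:
        payload = MIGRATIONS[version](payload)
        version += 1
    return payload


@migration(1)
def rename_id_to_invoice_id(payload: dict) -> dict:
    # Hypothetical v1 -> v2 change: the "id" argument was renamed to "invoice_id".
    upgraded = dict(payload)
    upgraded["invoice_id"] = upgraded.pop("id")
    return upgraded


old_envelope = wrap({"id": "inv_1"}, version=1)  # produced by a v1 worker
payload = unwrap(old_envelope)                    # a v2 worker sees {"invoice_id": "inv_1"}
```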

Benchmarks

All numbers below come from the built-in bench suite running against live Redis on Linux (Docker, python:3.11-slim, prefork=4 workers), with synthetic 0.5s tasks:

Metric	Relier v0.1.6	Vanilla	Vanilla + acks_late
Task delivery (500 tasks, 5 kills)	100%	92.0%	96.0%
OOM recovery avg / p99	7.1s / 8.9s	∞ lost	∞ (visibility_timeout)
Idempotent recovery (delayed restart)	re-ran 4.8s	∞ lost	∞ (visibility_timeout)
Graceful shutdown (3 cycles)	100%	0%	0%
Duplicate prevention (50 submissions)	1/50 ran	50/50 ran	50/50 ran
Admission control p99	0.559ms	n/a	n/a
Dispatch overhead (net)	+1.87ms	baseline	n/a

The 1.87ms dispatch overhead covers the admission Lua script + SHA-256 envelope wrap + heartbeat registration. On any task doing real work (a database query, an HTTP call, an AI inference), this cost is invisible.

Getting started

pip install relier

from relier import rl_task

@rl_task(
    queue="default",
    idempotent=True,
    soft_timeout=25,
    hard_timeout=30,
)
async def send_invoice(invoice_id: str) -> dict:
    await charge_card(invoice_id)
    await send_email(invoice_id)
    return {"invoice_id": invoice_id}

# From FastAPI:
await send_invoice.apush(invoice_id)

# From Flask / Django:
send_invoice.push(invoice_id)

Three processes to run:

celery -A relier.tasks.app worker -l info -Q high_priority,default,re-queue --include=tasks
rl run-resurrector
uvicorn main:app

Or the full stack (Redis, workers, resurrector, Prometheus, Grafana) with:

docker compose -f docker-compose.bench.yml up --build

Requirements: Python 3.11+, Redis 7+ with AOF persistence and maxmemory-policy noeviction. Relier preflight-checks both on startup and refuses to run if either is wrong.

What I learned building this

The failure modes that are hardest to reason about are not the obvious ones. A worker dying is obvious: you see the process disappear. The case that breaks naive implementations is a GC pause that makes a healthy process look dead to an external observer, after which the process wakes up and tries to write stale state.

Rolling deploys without schema versioning are a silent data loss vector that almost nobody talks about. The checksum + migration system exists because I watched a TypeError on a renamed argument silently DLQ a week's worth of invoice tasks with no alert.

Fence tokens are not a novel idea. The pattern comes from Martin Kleppmann's writing on distributed locking. But seeing the exact failure mode in a test, instrumenting it, and then watching the Lua script atomically reject the zombie commit was the moment Relier went from "probably correct" to "verifiably correct."

The chaos suite in the repo exists for this reason. Five scenarios: worker-kill, network-partition, load-spike, task-corrupt, slow-task. Run them against your own cluster, your own Redis, your own task code. You should not have to trust my benchmarks. Prove it yourself.

GitHub: github.com/getrelier/relier

Docs: getrelier.github.io/relier

Install: pip install relier
