<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kolade Fajimi</title>
    <description>The latest articles on DEV Community by Kolade Fajimi (@akoladefaj).</description>
    <link>https://dev.to/akoladefaj</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3704648%2Fceef07a0-12ba-4ab6-9c68-3a1ab613f07e.png</url>
      <title>DEV Community: Kolade Fajimi</title>
      <link>https://dev.to/akoladefaj</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/akoladefaj"/>
    <language>en</language>
    <item>
      <title>Running Async Python Inside Celery Is Harder Than You Think.</title>
      <dc:creator>Kolade Fajimi</dc:creator>
      <pubDate>Mon, 22 Jun 2026 23:17:50 +0000</pubDate>
      <link>https://dev.to/akoladefaj/running-async-python-inside-celery-is-harder-than-you-think-2046</link>
      <guid>https://dev.to/akoladefaj/running-async-python-inside-celery-is-harder-than-you-think-2046</guid>
      <description>&lt;p&gt;The problem is straightforward to state and surprisingly hard to solve correctly.&lt;/p&gt;

&lt;p&gt;Celery workers are synchronous. Celery spawns prefork worker processes, and when a task arrives, it calls your task function like this: &lt;code&gt;task_function(*args, **kwargs)&lt;/code&gt;. It expects a return value. It blocks the worker thread until it gets one. It does not know or care that you wrote &lt;code&gt;async def&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;But modern Python services are async. FastAPI is async. SQLAlchemy 2.0 is async. httpx, aiohttp, asyncpg the entire interesting half of the ecosystem has gone async-first. The idea of maintaining two parallel code paths, one async for your web layer, one sync for your task layer is exactly the kind of thing that creates maintenance debt, copy-paste bugs, and the kind of divergence you only notice when something breaks in production.&lt;/p&gt;

&lt;p&gt;So you want to write &lt;code&gt;async def&lt;/code&gt; task functions and have them work inside a Celery worker. How hard can it be?&lt;/p&gt;

&lt;p&gt;Harder than it looks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why &lt;code&gt;asyncio.run()&lt;/code&gt; doesn't work
&lt;/h3&gt;

&lt;p&gt;The first thing most people try:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;task_wrapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;your_async_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works in isolation. It fails in production for a specific reason: &lt;code&gt;asyncio.run()&lt;/code&gt; creates a new event loop, runs the coroutine to completion, then closes the loop. If there is already a running event loop on the current thread, and there frequently is, in test environments, in newer Celery versions, in signal handlers, it raises:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;RuntimeError: This event loop is already running&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The fix most people find next is &lt;code&gt;nest_asyncio&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;nest_asyncio&lt;/span&gt;
&lt;span class="n"&gt;nest_asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# now asyncio.run() "works" from inside a running loop
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;nest_asyncio&lt;/code&gt; patches the event loop to allow re-entrant calls. It works in simple cases. The subtle failure mode: re-entrant event loops change the execution order of scheduled callbacks and coroutines. Code that was safe under normal scheduling assumptions becomes non-deterministic under concurrent load. Bugs appear only at production concurrency, only under specific timing, and are nearly impossible to reproduce in development.&lt;/p&gt;

&lt;h3&gt;
  
  
  The prefork complication
&lt;/h3&gt;

&lt;p&gt;Even if you solve the &lt;code&gt;asyncio.run()&lt;/code&gt; problem, Celery's prefork concurrency model introduces a second failure that takes longer to diagnose because it manifests as infinite silence rather than an immediate error.&lt;/p&gt;

&lt;p&gt;When Celery starts, it forks N worker processes from a single parent. After &lt;code&gt;fork()&lt;/code&gt;, the child process inherits the parent's memory including any event loop objects that existed before the fork.&lt;/p&gt;

&lt;p&gt;The problem: &lt;code&gt;fork()&lt;/code&gt; does not copy threads. A Python &lt;code&gt;asyncio.AbstractEventLoop&lt;/code&gt; is driven by a thread calling &lt;code&gt;loop.run_forever()&lt;/code&gt;. After &lt;code&gt;fork()&lt;/code&gt;, the child has the loop object but not the thread running it. The loop's internal state may indicate it was running; nothing is actually driving it. Any coroutine scheduled onto this loop hangs indefinitely.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@worker_process_init.connect&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;bad_init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;loop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_event_loop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# This loop was inherited from the parent.
&lt;/span&gt;    &lt;span class="c1"&gt;# The thread driving it died when the parent forked.
&lt;/span&gt;    &lt;span class="c1"&gt;# loop.is_running() → False.
&lt;/span&gt;    &lt;span class="c1"&gt;# Scheduling coroutines onto it produces no results and no errors.
&lt;/span&gt;    &lt;span class="n"&gt;future&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_coroutine_threadsafe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;some_coro&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;future&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# blocks forever
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the kind of bug that produces a zero-width failure window. The loop object exists and looks valid. No exception is raised. Work just never completes. I spent the better part of a day convinced the issue was in the Redis client before realizing the loop scheduled to drive it had died at fork time.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution: a persistent bridge loop per worker process
&lt;/h3&gt;

&lt;p&gt;The correct approach is to create a brand-new event loop inside each forked worker process and start a dedicated daemon thread to drive it. The bridge loop is the only asyncio runtime in the worker process. All async work runs on it. Celery's synchronous worker threads never touch an event loop directly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;worker_loop&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AbstractEventLoop&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="nd"&gt;@worker_process_init.connect&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;init_worker_process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;worker_loop&lt;/span&gt;

    &lt;span class="c1"&gt;# Always create a fresh loop in the forked child.
&lt;/span&gt;    &lt;span class="c1"&gt;# Never reuse the inherited parent loop object.
&lt;/span&gt;    &lt;span class="n"&gt;worker_loop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_event_loop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# A daemon thread drives the loop independently of Celery's
&lt;/span&gt;    &lt;span class="c1"&gt;# synchronous execution threads.
&lt;/span&gt;    &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_run_event_loop&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;worker_loop&lt;/span&gt;&lt;span class="p"&gt;,),&lt;/span&gt;
        &lt;span class="n"&gt;daemon&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_run_event_loop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loop&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_event_loop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loop&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;loop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_forever&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the bridge is &lt;code&gt;asyncio.run_coroutine_threadsafe&lt;/code&gt;. When Celery calls the synchronous task wrapper, the wrapper schedules the async orchestration coroutine onto the background loop and blocks the worker thread waiting for the result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;wrapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_orchestrate&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="c1"&gt;# Schema migration, idempotency check, Phoenix heartbeat,
&lt;/span&gt;        &lt;span class="c1"&gt;# OTel span setup, task execution, fence validation, DLQ quarantine.
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;your_async_task_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;

    &lt;span class="c1"&gt;# Schedule the coroutine from this synchronous thread onto
&lt;/span&gt;    &lt;span class="c1"&gt;# the event loop running on the background thread.
&lt;/span&gt;    &lt;span class="n"&gt;future&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_coroutine_threadsafe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;_orchestrate&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;worker_loop&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Block the Celery worker thread here. All actual work happens
&lt;/span&gt;    &lt;span class="c1"&gt;# on the bridge loop thread.
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;future&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;run_coroutine_threadsafe&lt;/code&gt; is the correct API for this pattern. It is thread-safe, it returns a &lt;code&gt;concurrent.futures.Future&lt;/code&gt; (not an asyncio Future), and &lt;code&gt;future.result()&lt;/code&gt; blocks without touching the event loop. The background loop thread does all the async I/O. The Celery worker thread just waits.&lt;/p&gt;

&lt;p&gt;This solves both problems cleanly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No &lt;code&gt;asyncio.run()&lt;/code&gt; from inside a running loop. The loop lives on a different thread.&lt;/li&gt;
&lt;li&gt;No inherited-but-dead loop. Each worker creates its own after fork.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The &lt;code&gt;push/apush&lt;/code&gt; split
&lt;/h3&gt;

&lt;p&gt;Dispatching tasks has its own version of this problem. Celery's send_task is synchronous and blocking, it opens a broker connection and writes a message. If you call it from inside an async FastAPI route handler, you block the event loop during a network round-trip.&lt;/p&gt;

&lt;p&gt;This is why Relier has two dispatch methods:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# From async code (FastAPI, async Django):
&lt;/span&gt;&lt;span class="n"&gt;receipt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;send_invoice&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apush&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invoice_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# From sync code (Flask routes, sync Django views, scripts):
&lt;/span&gt;&lt;span class="n"&gt;receipt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;send_invoice&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invoice_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;apush&lt;/code&gt; runs the blocking broker send in an executor so the async caller is never blocked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;apush&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Admission check, schema wrapping, OTel context injection...
&lt;/span&gt;
    &lt;span class="n"&gt;loop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_running_loop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_in_executor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;celery_app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;envelope&lt;/span&gt;&lt;span class="p"&gt;,),&lt;/span&gt;
            &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;push&lt;/code&gt; explicitly guards against being called from inside a running loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_running_loop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;  &lt;span class="c1"&gt;# No running loop on this thread. Safe.
&lt;/span&gt;    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.push() was called from inside a running event loop, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;where it would block and deadlock that loop. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Use `await &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.apush(...)` instead.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Inside a Celery worker: reuse the bridge loop.
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;worker_loop&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;worker_loop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_running&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;future&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_coroutine_threadsafe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apush&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;worker_loop&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;future&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;5.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Outside Celery (Flask route, script):
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apush&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The error message in the &lt;code&gt;RuntimeError&lt;/code&gt; matters. When someone calls &lt;code&gt;push()&lt;/code&gt; from a FastAPI route handler, they get an actionable message telling them exactly what to do instead. Not a silent deadlock. Not a timeout with no context. A specific message at the exact moment the mistake is made.&lt;/p&gt;

&lt;p&gt;The check itself, &lt;code&gt;asyncio.get_running_loop()&lt;/code&gt; in a &lt;code&gt;try/except&lt;/code&gt; &lt;code&gt;RuntimeError&lt;/code&gt; is the canonical way to detect whether the current thread is running an event loop. It raises &lt;code&gt;RuntimeError&lt;/code&gt; if no loop is running on this thread, which is the safe case for &lt;code&gt;push()&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sync tasks in an async world
&lt;/h3&gt;

&lt;p&gt;What about existing sync task functions? A codebase of &lt;code&gt;def tasks&lt;/code&gt; shouldn't require a full rewrite to benefit from Relier's reliability stack.&lt;/p&gt;

&lt;p&gt;Inside the orchestration coroutine, execution branches on whether the function is async:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;inspect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iscoroutinefunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;actual_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;actual_kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;actual_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;actual_kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;asyncio.to_thread&lt;/code&gt; runs the sync function in Python's default thread pool executor. The orchestration layer awaits it without blocking the bridge loop. All the async infrastructure, heartbeat refreshes, Phoenix registration, OTel span updates, fence validation keeps running concurrently on the bridge loop while the sync function runs on a thread pool thread.&lt;/p&gt;

&lt;p&gt;The constraint is honest: two-tier timeouts (&lt;code&gt;soft_timeout&lt;/code&gt;, &lt;code&gt;hard_timeout&lt;/code&gt;) only work for &lt;code&gt;async def&lt;/code&gt; tasks. A sync function running in &lt;code&gt;asyncio.to_thread&lt;/code&gt; cannot be cooperatively cancelled from outside. Relier raises &lt;code&gt;ValueError&lt;/code&gt; at decoration time if you pass timeout parameters to a sync task, rather than silently providing no protection:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@rl_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;soft_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hard_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ValueError at import time
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sync_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="c1"&gt;# Fix: convert to async def, or remove the timeout parameters.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Failing loudly at decoration time is better than failing silently at runtime when the timeout fires and nothing happens.&lt;/p&gt;

&lt;h3&gt;
  
  
  Timeout enforcement without thread kills
&lt;/h3&gt;

&lt;p&gt;Two-tier timeouts deserve their own explanation because they interact with the bridge loop in a non-obvious way.&lt;/p&gt;

&lt;p&gt;When a task starts, Relier spawns two watcher coroutines as asyncio tasks alongside the actual work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;task_coro&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_soft_timeout_handler&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;soft&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;task_coro&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;done&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="c1"&gt;# Fire the recovery hook. Task keeps running.
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;on_soft&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;on_soft&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_hard_timeout_handler&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hard&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;task_coro&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;done&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;task_coro&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cancel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Delivers CancelledError at next await point.
&lt;/span&gt;
&lt;span class="n"&gt;soft_watcher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;_soft_timeout_handler&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;hard_watcher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;_hard_timeout_handler&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;done&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pending&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task_coro&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hard_watcher&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;return_when&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FIRST_COMPLETED&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All three coroutines run concurrently on the bridge loop. The soft timeout fires and calls your recovery hook, where you can call &lt;code&gt;ctx.set_partial(state)&lt;/code&gt; to checkpoint work in progress while the task keeps running. If the task doesn't finish before the hard deadline, &lt;code&gt;task_coro.cancel()&lt;/code&gt; delivers &lt;code&gt;asyncio.CancelledError&lt;/code&gt; at the task's next await point.&lt;/p&gt;

&lt;p&gt;No thread kills. No SIGALRM. No OS-level signals. Pure cooperative asyncio cancellation. This matters for cleanup: &lt;code&gt;CancelledError&lt;/code&gt; propagates through finally blocks. Resources get released. Partial state gets checkpointed. The task gets quarantined to the DLQ with its full payload and resurrection history. None of that happens with a hard OS kill.&lt;/p&gt;

&lt;h3&gt;
  
  
  The disposable loop case
&lt;/h3&gt;

&lt;p&gt;One edge case worth knowing: outside a Celery worker in a CLI script, a management command, a test, there's no bridge loop. The task wrapper's loop resolution falls through to creating a fresh event loop just for that call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_get_worker_loop&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Check for persistent worker bridge.
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;relier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;worker_loop&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;relier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;worker_loop&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Check for a running loop on this thread (test contexts).
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_running_loop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. Create a disposable loop for this one call.
&lt;/span&gt;    &lt;span class="n"&gt;loop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_event_loop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_event_loop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loop&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Disposable loops are cleaned up after the call: Redis connections are closed, the loop is stopped and closed, and &lt;code&gt;asyncio.set_event_loop(None)&lt;/code&gt; clears the thread-local reference. The persistent &lt;code&gt;worker_loop&lt;/code&gt; is specifically excluded from this cleanup path closing the bridge loop mid-execution would kill all in-flight tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I learned
&lt;/h3&gt;

&lt;p&gt;The prefork problem is the kind of failure that shows up as "nothing happens" rather than an exception. You schedule coroutines, they don't run, no error surfaces. It took a day of debugging the wrong thing before I isolated it to the inherited-but-dead loop. The fix (create a fresh loop in &lt;code&gt;worker_process_init&lt;/code&gt;) is obvious in retrospect. Getting there required understanding exactly what &lt;code&gt;fork()&lt;/code&gt; does to threads.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;asyncio.run_coroutine_threadsafe&lt;/code&gt; is underused. Most Python developers never need to cross a thread boundary into a running event loop, so the API is obscure. But for anything that marries a sync framework (Celery, Django ORM, WSGI in general) with async internals, it is the correct and safe way to do it. It appears in the Python docs in a single paragraph. It deserves more.&lt;/p&gt;

&lt;p&gt;The two-method dispatch split (push/apush) is the right API surface even though it introduces surface area. The alternative, a single method that auto-detects the context and does the right thing sounds better but produces confusing failures when the auto-detection is wrong. The explicit split makes the contract clear. Async code always uses &lt;code&gt;apush&lt;/code&gt;. Sync code always uses &lt;code&gt;push&lt;/code&gt;. The guard in &lt;code&gt;push()&lt;/code&gt; exists so that misuse produces a useful error immediately rather than a deadlock ten seconds later.&lt;/p&gt;

&lt;p&gt;Cooperative timeout cancellation is better than OS-level signals for tasks that care about cleanup. The finally block guarantee is the part that matters: partial state can be persisted, connections can be closed, the DLQ entry gets written with everything needed to re-inspect or re-dispatch. An OS kill gives you none of that.&lt;/p&gt;

&lt;p&gt;The whole bridge, bridge loop thread, run_coroutine_threadsafe, push/apush split, disposable loop cleanup is about 200 lines in &lt;code&gt;app.py&lt;/code&gt; and &lt;code&gt;decorator.py&lt;/code&gt; combined. The complexity is real but contained. Once the pattern is in place, every &lt;code&gt;async def&lt;/code&gt; task function just works, without the task author knowing anything about the event loop infrastructure underneath.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/getrelier/relier" rel="noopener noreferrer"&gt;github.com/getrelier/relier&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://getrelier.github.io/relier" rel="noopener noreferrer"&gt;getrelier.github.io/relier&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;pip install relier&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>backend</category>
      <category>programming</category>
      <category>python</category>
    </item>
    <item>
      <title>Redis Lua Scripting for Distributed Systems: How Atomicity Prevents Race Conditions.</title>
      <dc:creator>Kolade Fajimi</dc:creator>
      <pubDate>Mon, 15 Jun 2026 09:29:13 +0000</pubDate>
      <link>https://dev.to/akoladefaj/redis-lua-scripting-for-distributed-systems-how-atomicity-prevents-race-conditions-2bgk</link>
      <guid>https://dev.to/akoladefaj/redis-lua-scripting-for-distributed-systems-how-atomicity-prevents-race-conditions-2bgk</guid>
      <description>&lt;p&gt;There is a failure mode that is almost impossible to reproduce in tests but completely reproducible in production under load. Two workers. One Redis key. Both check whether a task has already been claimed. Both see "no." Both claim it. Now the task runs twice.&lt;/p&gt;

&lt;p&gt;This is not a logic bug. The logic is correct. The problem is that correctness requires atomicity, and two separate Redis commands, however fast are never atomic.&lt;/p&gt;

&lt;p&gt;Every critical operation in Relier is a Lua script. Not because Lua is elegant. Because Redis's single-threaded execution model means a Lua script is the only way to make a distributed check-then-act actually correct.&lt;/p&gt;

&lt;p&gt;This is a walk through all six of them.&lt;/p&gt;




&lt;h3&gt;
  
  
  Why GET + SET is never safe
&lt;/h3&gt;

&lt;p&gt;When you need to claim a task, check whether it's been taken, and if not, take it, the naive implementation is two commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;existing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;existing&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;worker_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This has a race condition that is invisible at low concurrency and guaranteed to trigger at high concurrency.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;t=0ms  Worker A: GET "task:123" → nil   # not claimed
t=0ms  Worker B: GET "task:123" → nil   # not claimed (same read, no lock)
t=1ms  Worker A: SET "task:123" "worker-A"
t=1ms  Worker B: SET "task:123" "worker-B"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both workers claimed the task. Worker A's write is immediately overwritten by Worker B. Now both are executing it.&lt;/p&gt;

&lt;p&gt;The fix is not a Python-side lock. Python-side locks are process-local, they can't protect you across multiple workers, containers, or machines. The fix is moving the check-and-set into a single Redis operation that cannot be interleaved with anything else.&lt;/p&gt;

&lt;p&gt;That is what Lua gives you. Redis executes Lua scripts atomically. No other command runs between the first line and the last.&lt;/p&gt;




&lt;h3&gt;
  
  
  ACQUIRE_LUA (idempotency claim)
&lt;/h3&gt;

&lt;p&gt;The first script Relier uses is the idempotency claim. When &lt;code&gt;@rl_task(idempotent=True)&lt;/code&gt; is set, every task submission runs this before any work happens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lua"&gt;&lt;code&gt;&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;existing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'GET'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;KEYS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;existing&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'SET'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;KEYS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;ARGV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;'NX'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'EX'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ARGV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;KEYS[1]&lt;/code&gt; is the idempotency key, derived from the task name and arguments, or set explicitly by the caller. &lt;code&gt;ARGV[1]&lt;/code&gt; is the in-flight sentinel (a UUID tied to this execution attempt). &lt;code&gt;ARGV[2]&lt;/code&gt; is the TTL.&lt;/p&gt;

&lt;p&gt;What this closes: the race between two concurrent submissions of the same task. Both arrive at Redis. Both invoke the script. Redis executes them one at a time. The first one sees &lt;code&gt;existing = nil&lt;/code&gt;, claims the key, returns &lt;code&gt;{0, false}&lt;/code&gt;, proceed. The second one sees &lt;code&gt;existing = &amp;lt;the sentinel&amp;gt;&lt;/code&gt;, returns &lt;code&gt;{1, existing}&lt;/code&gt;, already claimed, skip. One execution. Not two.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;NX&lt;/code&gt; flag on the &lt;code&gt;SET&lt;/code&gt; is belt-and-suspenders. The &lt;code&gt;GET&lt;/code&gt; already gates on existence, but &lt;code&gt;NX&lt;/code&gt; means the SET itself is also a no-op if the key was written between the GET and SET, which cannot happen inside a Lua script, but matters if you ever run the commands outside one.&lt;/p&gt;

&lt;p&gt;The return value carries the existing state so the caller can distinguish between in-flight (sentinel value) and completed (cached result JSON). That distinction is what lets Relier return the cached result to a duplicate caller without re-running the task.&lt;/p&gt;




&lt;h3&gt;
  
  
  RELEASE_LUA (compare-and-delete)
&lt;/h3&gt;

&lt;p&gt;When a task fails mid-execution, Relier needs to release the in-flight sentinel so future submissions can retry. The naive approach is a plain &lt;code&gt;DEL&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is wrong.&lt;/p&gt;

&lt;p&gt;Consider: Worker A claims task X and sets its sentinel. Worker A dies. Relier's Phoenix resurrector detects the expired heartbeat and re-queues task X. Worker B claims it and sets &lt;strong&gt;its own&lt;/strong&gt; sentinel. Now Worker A's crash handler wakes up (the process didn't actually die, it threw an exception) and runs &lt;code&gt;DEL key&lt;/code&gt;. It deletes Worker B's sentinel. Worker B's task is now claimable by a third worker. Task runs twice.&lt;/p&gt;

&lt;p&gt;The correct operation is conditional delete: delete the key only if the stored value is still your sentinel. That is RELEASE_LUA:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lua"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'GET'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;KEYS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;ARGV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'DEL'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;KEYS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;ARGV[1]&lt;/code&gt; is the sentinel value this worker set when it claimed the task. If the stored value no longer matches, because another worker claimed it after resurrection, the delete is skipped. Worker B's execution is protected.&lt;/p&gt;

&lt;p&gt;The compare-and-delete pattern shows up in distributed systems literature under many names (CAS, compare-and-swap). The point is always the same: never release a lock without first verifying you still own it. Plain DEL is always wrong.&lt;/p&gt;




&lt;h3&gt;
  
  
  RESURRECT_LUA (atomic lease and fence token)
&lt;/h3&gt;

&lt;p&gt;When the Phoenix resurrector detects a dead worker and re-queues a task, two things need to happen atomically:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Acquire a lease, so concurrent resurrectors don't both re-queue the same task&lt;/li&gt;
&lt;li&gt;Publish a new fence token, so the zombie worker (if it wakes up) can't commit stale results
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lua"&gt;&lt;code&gt;&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;lease_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;KEYS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;fence_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;KEYS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ARGV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;lease_ttl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;tonumber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ARGV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;fence_ttl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;tonumber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ARGV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"EXISTS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lease_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"SET"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lease_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"EX"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lease_ttl&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"SET"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fence_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"EX"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fence_ttl&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The atomicity here solves a specific problem. Imagine two resurrectors scanning simultaneously, which happens whenever you run more than one worker embedding the scanner. Both detect the same expired heartbeat. If they could interleave:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Resurrector A: EXISTS lease_key → 0  (no lease)
Resurrector B: EXISTS lease_key → 0  (no lease, A hasn't written yet)
Resurrector A: SET lease_key ...
Resurrector B: SET lease_key ...     # both acquired the lease
Both dispatch the task to the queue
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script prevents this. The EXISTS check and the SET are one atomic unit. One of the two resurrectors wins. The other sees &lt;code&gt;EXISTS = 1&lt;/code&gt; and returns 0. One re-queue dispatch. Not two.&lt;/p&gt;

&lt;p&gt;The fence token is set in the same script for a different reason: the new fence token must be visible in Redis before any worker picks up the re-queued task. If lease acquisition and fence token write were separate commands, a worker could pick up the task between them, before the fence token was set and have nothing to validate against. That window is eliminated by collapsing both writes into one atomic script.&lt;/p&gt;




&lt;h3&gt;
  
  
  VALIDATE_LUA (worker self-check)
&lt;/h3&gt;

&lt;p&gt;While a task is running, the worker periodically validates that it still owns the execution slot, that it hasn't been declared dead and resurrected while it was doing real work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lua"&gt;&lt;code&gt;&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;lease_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;KEYS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;fence_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;KEYS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ARGV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;lease&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"GET"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lease_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;lease&lt;/span&gt; &lt;span class="o"&gt;~=&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;fence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"GET"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fence_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;fence&lt;/span&gt; &lt;span class="o"&gt;~=&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Return values: &lt;code&gt;1&lt;/code&gt; = still current, &lt;code&gt;0&lt;/code&gt; = invalid lease, &lt;code&gt;2&lt;/code&gt; = stale fence.&lt;/p&gt;

&lt;p&gt;The worker checks both the lease key and the fence key against its own token. If either has been overwritten by a resurrector — because the worker's heartbeat expired while it was GC-paused or doing slow I/O — the worker gets a non-1 return, cancels its own execution, and exits cleanly.&lt;/p&gt;

&lt;p&gt;This is how Relier handles the zombie worker problem: the zombie doesn't get detected by an external process and killed — it self-detects during a validation check and stops. The heartbeat expiry + periodic VALIDATE_LUA check forms a cooperative detection loop: the resurrector re-queues the task, the zombie eventually validates and cancels itself.&lt;/p&gt;




&lt;h3&gt;
  
  
  COMMIT_CHECK_LUA (the last gate before writing results)
&lt;/h3&gt;

&lt;p&gt;Even with VALIDATE_LUA running periodically, there is a window: the last validation passes, then the fence token expires, then the task tries to write its result. Without a final gate, that write goes through even though the worker is now a zombie from Redis's perspective.&lt;/p&gt;

&lt;p&gt;COMMIT_CHECK_LUA runs immediately before any result is committed to storage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lua"&gt;&lt;code&gt;&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;fence_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;KEYS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ARGV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"GET"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fence_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;~=&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;0&lt;/code&gt; = stale worker, result rejected. &lt;code&gt;1&lt;/code&gt; = still current, write proceeds.&lt;/p&gt;

&lt;p&gt;This is the commit protocol that makes "exactly-once execution" a verifiable claim rather than an optimistic assertion. The fence token is the proof of ownership. The Lua script makes checking and acting on that proof atomic. No commit can slip through between a successful check and the actual write because the check and the conditional return are the same Redis operation.&lt;/p&gt;

&lt;p&gt;In our chaos tests, this script rejects stale commits from zombie workers on every run. The log line &lt;code&gt;"zombie commit rejected"&lt;/code&gt; appears exactly as often as &lt;code&gt;"GC-pause-length resurrection"&lt;/code&gt; events, one rejection per zombie wakeup. The math checks out.&lt;/p&gt;




&lt;h3&gt;
  
  
  ADMISSION_LUA (rate limiting without races)
&lt;/h3&gt;

&lt;p&gt;The admission control script is the simplest of the set and the most illustrative of why Lua is the right tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lua"&gt;&lt;code&gt;&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'INCR'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;KEYS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
    &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'EXPIRE'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;KEYS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;ARGV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;tonumber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ARGV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'TTL'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;KEYS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])}&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;KEYS[1]&lt;/code&gt; = the rate limit window key (e.g., &lt;code&gt;rl:admission:global&lt;/code&gt;). &lt;code&gt;ARGV[1]&lt;/code&gt; = the request limit. &lt;code&gt;ARGV[2]&lt;/code&gt; = the window duration in seconds.&lt;/p&gt;

&lt;p&gt;The correctness problem here is the INCR + EXPIRE pair. &lt;code&gt;INCR&lt;/code&gt; creates the key if it doesn't exist and increments atomically. But if the &lt;code&gt;EXPIRE&lt;/code&gt; ran as a separate command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Worker A: INCR → 1
Worker B: INCR → 2
Worker A: EXPIRE key 10   # only one EXPIRE runs
Worker B: EXPIRE key 10   # fine here, but what if B crashed before this?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the &lt;code&gt;EXPIRE&lt;/code&gt; never runs, crash, exception, anything between the two commands, the key has no TTL. It accumulates forever. Every request after the window should have reset is rejected until someone manually clears it.&lt;/p&gt;

&lt;p&gt;Inside the Lua script, &lt;code&gt;EXPIRE&lt;/code&gt; runs on the first increment (&lt;code&gt;current == 1&lt;/code&gt;) in the same atomic block as the &lt;code&gt;INCR&lt;/code&gt;. Either both happen or neither happens. The window key always has a TTL.&lt;/p&gt;

&lt;p&gt;The returned TTL from the final &lt;code&gt;TTL&lt;/code&gt; call becomes the &lt;code&gt;Retry-After&lt;/code&gt; value in the HTTP 429 response. The client knows exactly when the window resets and when to retry. This is not an approximation, it is the live Redis TTL, accurate to the second.&lt;/p&gt;




&lt;h3&gt;
  
  
  Why EVALSHA matters on hot paths
&lt;/h3&gt;

&lt;p&gt;Loading a Lua script into Redis via &lt;code&gt;SCRIPT LOAD&lt;/code&gt; returns a SHA1 hash. Subsequent calls use &lt;code&gt;EVALSHA&lt;/code&gt; instead of &lt;code&gt;EVAL&lt;/code&gt; the script body never travels over the network again.&lt;/p&gt;

&lt;p&gt;On a hot admission control path processing 5,000 requests per 10-second window, that is 5,000 round-trips that skip the script serialization and deserialization entirely. The Redis server has the script compiled in its script cache. The only network payload is the EVALSHA command, the key, and the arguments.&lt;/p&gt;

&lt;p&gt;If the Redis server is restarted, the script cache is cleared. &lt;code&gt;EVALSHA&lt;/code&gt; will return &lt;code&gt;NOSCRIPT&lt;/code&gt;. Relier handles this with a fallback: on &lt;code&gt;NoScriptError&lt;/code&gt;, reload the script and retry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_evalsha_with_fallback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;redis_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;redis_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evalsha&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NoScriptError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_script_sha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;redis_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;script_load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ADMISSION_LUA&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;redis_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evalsha&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_script_sha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One reload on restart. Every call after that is EVALSHA again.&lt;/p&gt;

&lt;p&gt;Our benchmark warmup phase (100 iterations discarded before measurement) covers exactly this: Redis Lua scripts load on first call, connection pools establish, the asyncio loop settles. By the time we start measuring, every script is loaded and every subsequent call hits the SHA cache. The p99 0.559ms admission control latency reflects the warm-path cost, not the cold-start cost.&lt;/p&gt;




&lt;h3&gt;
  
  
  What I learned
&lt;/h3&gt;

&lt;p&gt;The failure modes Lua scripts prevent are all variations of the same shape: check a condition, act on it. In a single-threaded program, that is always safe. In a distributed system, the gap between the check and the action is where concurrent writes slip through.&lt;/p&gt;

&lt;p&gt;The non-obvious lesson is that this problem does not get easier as your system gets faster. Faster workers mean more concurrent check-then-act operations per second, which means more collisions per second, which means more corrupted state per second. Atomicity requirements get harder to satisfy as throughput increases, not easier.&lt;/p&gt;

&lt;p&gt;The six scripts in Relier each close a specific gap. ACQUIRE_LUA closes the claim race. RELEASE_LUA closes the stale-release race. RESURRECT_LUA closes the double-resurrection race. VALIDATE_LUA closes the zombie-detection gap. COMMIT_CHECK_LUA closes the stale-commit window. ADMISSION_LUA closes the TTL-expiry race.&lt;/p&gt;

&lt;p&gt;None of these are novel patterns. Distributed systems literature has described all of them. What took time was mapping each abstract race to a concrete failure in a running test, watching it reproduce, and then verifying that the Lua script eliminated it.&lt;/p&gt;

&lt;p&gt;The chaos suite in the Relier repo exists for exactly this: run it against your own cluster, on your own Redis, with your own task code. The correctness claims should survive your environment, not just ours.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/getrelier/relier" rel="noopener noreferrer"&gt;https://github.com/getrelier/relier&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://getrelier.github.io/relier" rel="noopener noreferrer"&gt;https://getrelier.github.io/relier&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Architecture reference (all scripts):&lt;/strong&gt; &lt;a href="https://getrelier.github.io/relier/architecture/" rel="noopener noreferrer"&gt;https://getrelier.github.io/relier/architecture/&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;pip install relier&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>redis</category>
      <category>distributedsystems</category>
      <category>python</category>
      <category>backend</category>
    </item>
    <item>
      <title>Celery loses 8% of your tasks by default. Here's the reliability layer I built to fix that.</title>
      <dc:creator>Kolade Fajimi</dc:creator>
      <pubDate>Tue, 02 Jun 2026 00:47:11 +0000</pubDate>
      <link>https://dev.to/akoladefaj/celery-loses-8-of-your-tasks-by-default-heres-the-reliability-layer-i-built-to-fix-that-40mc</link>
      <guid>https://dev.to/akoladefaj/celery-loses-8-of-your-tasks-by-default-heres-the-reliability-layer-i-built-to-fix-that-40mc</guid>
      <description>&lt;p&gt;Celery is one of the most widely deployed task queue systems in Python. It is also, by default, a system that silently loses approximately 8% of your tasks the moment a worker crashes.&lt;/p&gt;

&lt;p&gt;This is not a bug. It is the designed default behaviour. And most teams shipping Celery in production either do not know about it or have accepted it as a cost of doing business.&lt;/p&gt;

&lt;p&gt;I built Relier because I was not willing to accept it.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Celery loses tasks
&lt;/h3&gt;

&lt;p&gt;When a Celery worker picks up a task from the broker, it sends an acknowledgement (ACK) immediately, before the task runs. From the broker's perspective, the task is done. The worker owns it now.&lt;/p&gt;

&lt;p&gt;If the worker is killed (OOM, SIGKILL, kernel memory pressure, deploy) while the task is executing, the broker has already marked that task as delivered. The task is gone. No retry, no trace, no record it was ever picked up.&lt;/p&gt;

&lt;p&gt;This is &lt;code&gt;task_acks_late=False&lt;/code&gt;, Celery's default.&lt;/p&gt;

&lt;p&gt;At 10M tasks per day, 8% loss is 800,000 silently dropped jobs. Every. Single. Day.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why flipping &lt;code&gt;task_acks_late=True&lt;/code&gt; is not enough
&lt;/h3&gt;

&lt;p&gt;The standard advice for this problem is &lt;code&gt;task_acks_late=True&lt;/code&gt;. It helps. In our benchmarks, it takes delivery from 92.0% to 96.0%, recovering about half the lost tasks.&lt;/p&gt;

&lt;p&gt;But it does not solve the problem, for a specific reason.&lt;/p&gt;

&lt;p&gt;When a worker dies with &lt;code&gt;task_acks_late=True&lt;/code&gt;, the broker keeps the unacknowledged message in an &lt;code&gt;unacked&lt;/code&gt; set. Redelivery is gated by &lt;code&gt;visibility_timeout&lt;/code&gt;, the time the broker waits before assuming the worker is gone and requeuing the message. On the Redis broker, this defaults to approximately one hour.&lt;/p&gt;

&lt;p&gt;So a task killed at 2:00 PM sits waiting for redelivery until 3:00 PM. In most production systems, the SLA for that task is measured in seconds or minutes, not hours.&lt;/p&gt;

&lt;p&gt;The deeper problem: you have traded silent loss for hour-long redelivery latency, without knowing which tasks are stuck in that limbo.&lt;/p&gt;

&lt;p&gt;Our bench ran 500 tasks through 5 SIGKILL cycles:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Delivery rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Vanilla Celery (default)&lt;/td&gt;
&lt;td&gt;92.0% (460/500)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vanilla + &lt;code&gt;task_acks_late=True&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;96.0% (480/500)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Relier&lt;/td&gt;
&lt;td&gt;100% (500/500)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The Phoenix Pattern
&lt;/h3&gt;

&lt;p&gt;Relier implements what I call the Phoenix Pattern. The design is straightforward in principle and non-trivial to get right in practice.&lt;/p&gt;

&lt;p&gt;Every &lt;code&gt;@rl_task&lt;/code&gt; registers a heartbeat in Redis when it starts executing, a key with a configurable TTL (default 10 seconds). The task refreshes that heartbeat on a background loop while it runs. Every worker embeds a resurrection scanner that watches for expired heartbeats every few seconds, so the surviving workers recover a dead worker's tasks on their own, distributed locks keep concurrent scanners from replaying the same task twice. (You can also run a standalone &lt;code&gt;rl run-resurrector&lt;/code&gt; process as belt-and-suspenders for the case where every worker dies at once.)&lt;/p&gt;

&lt;p&gt;When a worker dies mid-task, its heartbeat stops refreshing. After one TTL window, the resurrector detects the expired heartbeat and atomically re-queues the orphaned task onto a special &lt;code&gt;re-queue&lt;/code&gt; queue. A healthy worker picks it up. The original task arguments are preserved exactly.&lt;/p&gt;

&lt;p&gt;In our benchmarks, OOM recovery averaged 7.1 seconds with a p99 of 8.9 seconds not 35 seconds, not an hour. Seconds.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Worker dies at t=0&lt;/li&gt;
&lt;li&gt;Heartbeat expires at t=10s (heartbeat_ttl)&lt;/li&gt;
&lt;li&gt;Resurrector detects at t=12s (next scan)&lt;/li&gt;
&lt;li&gt;Task re-queued at t=12s (atomic)&lt;/li&gt;
&lt;li&gt;Healthy worker picks up at t=12–14s&lt;/li&gt;
&lt;li&gt;Task completes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why Relier achieves 100% delivery: it does not rely on the broker's visibility timeout. It has its own independent detection mechanism with a TTL you control.&lt;/p&gt;

&lt;h3&gt;
  
  
  The hard part: fence tokens and zombie workers
&lt;/h3&gt;

&lt;p&gt;The description above makes Phoenix sound simple. The part that took the most work to get right is the zombie worker problem.&lt;/p&gt;

&lt;p&gt;Consider this scenario:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Worker A picks up Task X. Heartbeat registered.&lt;/li&gt;
&lt;li&gt;Worker A has a long GC pause. Its heartbeat expires.&lt;/li&gt;
&lt;li&gt;The resurrector detects the expired heartbeat and re-queues Task X.&lt;/li&gt;
&lt;li&gt;Worker B picks up Task X and completes it. Result committed to Redis.&lt;/li&gt;
&lt;li&gt;Worker A wakes up from its GC pause and tries to commit its result.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Without any protection, step 5 causes silent data corruption. Worker A commits a stale result, overwriting Worker B's correct result. The task has now effectively executed twice, with the wrong result stored.&lt;/p&gt;

&lt;p&gt;Relier prevents this with fence tokens. When Phoenix re-queues a task, it generates a new fence token, a monotonically increasing integer associated with the task's execution slot. The completion protocol is an atomic Lua script: "commit this result only if the current fence token matches the token this worker was given when it claimed the task."&lt;/p&gt;

&lt;p&gt;Worker A was given fence token &lt;code&gt;v1&lt;/code&gt;. After resurrection, the slot is now at &lt;code&gt;v2&lt;/code&gt;. When Worker A tries to commit, the Lua script sees the mismatch and rejects the write. No data corruption. No duplicate result.&lt;/p&gt;

&lt;p&gt;This is the correctness guarantee that makes "exactly-once execution" mean something.&lt;/p&gt;

&lt;h3&gt;
  
  
  Everything else Relier adds
&lt;/h3&gt;

&lt;p&gt;Beyond Phoenix, a production-grade task system needs several more things. Relier ships them as part of the same &lt;code&gt;@rl_task&lt;/code&gt; decorator:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Idempotency.&lt;/strong&gt; &lt;code&gt;@rl_task(idempotent=True)&lt;/code&gt; adds an atomic Redis Lua check before task execution. If the same task has already been submitted for the same logical key (which you can set explicitly or let Relier derive from the arguments), the second submission returns immediately without spawning work. In our benchmark: 50 submissions of the same task, 1 execution. Vanilla Celery: 50 executions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two-tier timeouts.&lt;/strong&gt; &lt;code&gt;soft_timeout=8, hard_timeout=10&lt;/code&gt; gives you a cleanup hook that fires at 8 seconds (save state, close connections, emit structured logs) and a hard cancellation at 10 seconds via &lt;code&gt;asyncio.CancelledError&lt;/code&gt;. Zombie tasks that would block a worker forever are quarantined instead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Graceful shutdown.&lt;/strong&gt; On SIGTERM, the worker drains in-flight tasks rather than dropping them. Tasks that cannot complete before shutdown hands them off to Phoenix on the re-queue queue. In our benchmark: 3 cycles of 20 tasks each, Relier 100% survival, vanilla Celery 0%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dead Letter Queue.&lt;/strong&gt; Tasks that exhaust their &lt;code&gt;max_resurrections&lt;/code&gt; allowance are quarantined to the DLQ with their full payload, stack trace, and complete resurrection history. The &lt;code&gt;rl dlq inspect&lt;/code&gt; CLI shows everything. &lt;code&gt;rl dlq release &amp;lt;id&amp;gt;&lt;/code&gt; re-dispatches a specific failed task. Nothing disappears silently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Admission control.&lt;/strong&gt; An atomic Lua fixed-window rate limiter on every &lt;code&gt;apush()&lt;/code&gt; call. If the cluster is saturated, you get an &lt;code&gt;AdmissionRejectedError&lt;/code&gt; with a &lt;code&gt;Retry-After&lt;/code&gt; header, not a flooded queue and a cascade failure. In our benchmark: p99 0.559ms, well under the 1ms claim.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rolling deploy protection.&lt;/strong&gt; Every payload is wrapped in a versioned envelope with a SHA-256 checksum. Register a migration function, bump &lt;code&gt;CURRENT_VERSION&lt;/code&gt;, and v2 workers silently upgrade v1 payloads mid-deploy. Old and new workers can run simultaneously without payload schema mismatches. Checksums catch broker-side corruption before your code ever runs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Benchmarks
&lt;/h3&gt;

&lt;p&gt;All numbers below from the built-in bench suite running against live Redis on Linux (Docker, python:3.11-slim, prefork=4 workers), synthetic 0.5s tasks:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Relier v0.1.6&lt;/th&gt;
&lt;th&gt;Vanilla&lt;/th&gt;
&lt;th&gt;Vanilla + acks_late&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Task delivery (500 tasks, 5 kills)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;92.0%&lt;/td&gt;
&lt;td&gt;96.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OOM recovery avg / p99&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;7.1s / 8.9s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;∞ lost&lt;/td&gt;
&lt;td&gt;∞ (visibility_timeout)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Idempotent recovery (delayed restart)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;re-ran 4.8s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;∞ lost&lt;/td&gt;
&lt;td&gt;∞ (visibility_timeout)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Graceful shutdown (3 cycles)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Duplicate prevention (50 submissions)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1/50 ran&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50/50 ran&lt;/td&gt;
&lt;td&gt;50/50 ran&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Admission control p99&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.559ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dispatch overhead (net)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+1.87ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xzab5kzj9lxfz6610yg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xzab5kzj9lxfz6610yg.png" alt="Grafana dashboard — end of benchmark run. Green line (Relier) reaches 577 cumulative completions across all test cycles. Yellow line (Vanilla Celery) flatlines at 460 after the first SIGKILL cycles. Resurrections: 51 total. Redis memory: 2.92 MiB — no accumulation across the full run." width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft16tykm7zh6dxyp031bn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft16tykm7zh6dxyp031bn.png" alt="Redis commands/sec across the full benchmark run (Grafana, Test 7). Spikes correspond to task turnover bursts during SIGKILL cycles — peaking at ~83 ops/sec during the 500-task delivery test. Baseline returns to near-zero immediately after each burst. No accumulation across 577 completions and 51 resurrections." width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The 1.87ms dispatch overhead covers the admission Lua script + SHA-256 envelope wrap + heartbeat registration. On any task doing real work (a database query, an HTTP call, an AI inference), this cost is invisible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting started
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;relier
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;relier&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;rl_task&lt;/span&gt;

&lt;span class="nd"&gt;@rl_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;idempotent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;soft_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;hard_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;send_invoice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invoice_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;charge_card&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invoice_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;send_email&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invoice_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;invoice_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;invoice_id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# From FastAPI:
&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;send_invoice&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apush&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invoice_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# From Flask / Django:
&lt;/span&gt;&lt;span class="n"&gt;send_invoice&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invoice_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three processes to run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;celery &lt;span class="nt"&gt;-A&lt;/span&gt; relier.tasks.app worker &lt;span class="nt"&gt;-l&lt;/span&gt; info &lt;span class="nt"&gt;-Q&lt;/span&gt; high_priority,default,re-queue &lt;span class="nt"&gt;--include&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;tasks
rl run-resurrector
uvicorn main:app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or the full stack (Redis, workers, resurrector, Prometheus, Grafana) with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose &lt;span class="nt"&gt;-f&lt;/span&gt; docker-compose.bench.yml up &lt;span class="nt"&gt;--build&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Requirements: Python 3.11+, Redis 7+ with AOF persistence and &lt;code&gt;maxmemory-policy noeviction&lt;/code&gt;. Relier preflight-checks both on startup and refuses to run if either is wrong.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I learned building this
&lt;/h3&gt;

&lt;p&gt;The failure modes that are hardest to reason about are not the obvious ones. A worker dying is obvious, you see the process disappear. A GC pause that makes a healthy process look dead to an external observer, then have it wake up and try to write stale state, that is the case that breaks naive implementations.&lt;/p&gt;

&lt;p&gt;Rolling deploys without schema versioning are a silent data loss vector that almost nobody talks about. The checksum + migration system exists because I watched a TypeError on a renamed argument silently DLQ a week's worth of invoice tasks with no alert.&lt;/p&gt;

&lt;p&gt;Fence tokens are not a novel idea. The pattern comes from Martin Kleppmann's writing on distributed locking. But seeing the exact failure mode in a test, instrumenting it, and then watching the Lua script atomically reject the zombie commit, that was the moment Relier went from "probably correct" to "verifiably correct."&lt;/p&gt;

&lt;p&gt;The chaos suite in the repo exists for this reason. Five scenarios: worker-kill, network-partition, load-spike, task-corrupt, slow-task. Run them against your own cluster, your own Redis, your own task code. You should not have to trust my benchmarks. Prove it yourself.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/getrelier/relier" rel="noopener noreferrer"&gt;github.com/getrelier/relier&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://getrelier.github.io/relier" rel="noopener noreferrer"&gt;getrelier.github.io/relier&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Install:&lt;/strong&gt; &lt;code&gt;pip install relier&lt;/code&gt;&lt;/p&gt;




</description>
      <category>python</category>
      <category>celery</category>
      <category>webdev</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
