<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kolade Fajimi</title>
    <description>The latest articles on DEV Community by Kolade Fajimi (@akoladefaj).</description>
    <link>https://dev.to/akoladefaj</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3704648%2Fceef07a0-12ba-4ab6-9c68-3a1ab613f07e.png</url>
      <title>DEV Community: Kolade Fajimi</title>
      <link>https://dev.to/akoladefaj</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/akoladefaj"/>
    <language>en</language>
    <item>
      <title>Celery loses 8% of your tasks by default. Here's the reliability layer I built to fix that.</title>
      <dc:creator>Kolade Fajimi</dc:creator>
      <pubDate>Tue, 02 Jun 2026 00:47:11 +0000</pubDate>
      <link>https://dev.to/akoladefaj/celery-loses-8-of-your-tasks-by-default-heres-the-reliability-layer-i-built-to-fix-that-40mc</link>
      <guid>https://dev.to/akoladefaj/celery-loses-8-of-your-tasks-by-default-heres-the-reliability-layer-i-built-to-fix-that-40mc</guid>
      <description>&lt;p&gt;Celery is one of the most widely deployed task queue systems in Python. It is also, by default, a system that silently loses approximately 8% of your tasks the moment a worker crashes.&lt;/p&gt;

&lt;p&gt;This is not a bug. It is the designed default behaviour. And most teams shipping Celery in production either do not know about it or have accepted it as a cost of doing business.&lt;/p&gt;

&lt;p&gt;I built Relier because I was not willing to accept it.&lt;/p&gt;

&lt;h3&gt;How Celery loses tasks&lt;/h3&gt;

&lt;p&gt;When a Celery worker picks up a task from the broker, it sends an acknowledgement (ACK) immediately, before the task runs. From the broker's perspective, the task is done. The worker owns it now.&lt;/p&gt;

&lt;p&gt;If the worker is killed (OOM, SIGKILL, kernel memory pressure, deploy) while the task is executing, the broker has already marked that task as delivered. The task is gone. No retry, no trace, no record it was ever picked up.&lt;/p&gt;

&lt;p&gt;This is &lt;code&gt;task_acks_late=False&lt;/code&gt;, Celery's default.&lt;/p&gt;

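&lt;p&gt;At 10M tasks per day, 8% loss is 800,000 silently dropped jobs. Every. Single. Day. (The 8% comes from our SIGKILL benchmark below; the real rate depends on how often your workers die mid-task.)&lt;/p&gt;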

&lt;h3&gt;Why flipping &lt;code&gt;task_acks_late=True&lt;/code&gt; is not enough&lt;/h3&gt;

&lt;p&gt;The standard advice for this problem is &lt;code&gt;task_acks_late=True&lt;/code&gt;. It helps. In our benchmarks, it takes delivery from 92.0% to 96.0%, recovering about half the lost tasks.&lt;/p&gt;

&lt;p&gt;But it does not solve the problem, for a specific reason.&lt;/p&gt;

&lt;p&gt;When a worker dies with &lt;code&gt;task_acks_late=True&lt;/code&gt;, the broker keeps the unacknowledged message in an &lt;code&gt;unacked&lt;/code&gt; set. Redelivery is gated by &lt;code&gt;visibility_timeout&lt;/code&gt;, the time the broker waits before assuming the worker is gone and requeuing the message. On the Redis broker, this defaults to one hour (3600 seconds).&lt;/p&gt;

&lt;p&gt;So a task killed at 2:00 PM sits waiting for redelivery until 3:00 PM. In most production systems, the SLA for that task is measured in seconds or minutes, not hours.&lt;/p&gt;

&lt;p&gt;The deeper problem: you have traded silent loss for hour-long redelivery latency, without knowing which tasks are stuck in that limbo.&lt;/p&gt;
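
&lt;p&gt;For reference, here is a minimal sketch of the settings discussed above. &lt;code&gt;task_acks_late&lt;/code&gt;, &lt;code&gt;task_reject_on_worker_lost&lt;/code&gt;, and &lt;code&gt;broker_transport_options&lt;/code&gt; are standard Celery settings; the 300-second value and the &lt;code&gt;SETTINGS&lt;/code&gt; name are illustrative assumptions, not a recommendation. It is written as a plain dict so it runs without Celery installed; in a real project you would pass it to &lt;code&gt;app.conf.update(**SETTINGS)&lt;/code&gt;.&lt;/p&gt;

```python
# Illustrative Celery settings (the values are assumptions, tune them for your workload).
SETTINGS = {
    # ACK after the task finishes instead of when it is received.
    "task_acks_late": True,
    # Requeue the message if the child process running the task is killed
    # while the main worker process survives.
    "task_reject_on_worker_lost": True,
    # Redis transport: seconds before an unacked message is redelivered.
    # Tasks scheduled with an ETA or countdown longer than this value can be
    # redelivered and run more than once.
    "broker_transport_options": {"visibility_timeout": 300},
}

if __name__ == "__main__":
    for name, value in SETTINGS.items():
        print(name, "=", value)
```

&lt;p&gt;Lowering the timeout shortens the limbo window but does not remove it, and it still gives you no record of which tasks are waiting for redelivery.&lt;/p&gt;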

&lt;p&gt;Our bench ran 500 tasks through 5 SIGKILL cycles:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Delivery rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Vanilla Celery (default)&lt;/td&gt;
&lt;td&gt;92.0% (460/500)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vanilla + &lt;code&gt;task_acks_late=True&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;96.0% (480/500)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Relier&lt;/td&gt;
&lt;td&gt;100% (500/500)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;The Phoenix Pattern&lt;/h3&gt;

&lt;p&gt;Relier implements what I call the Phoenix Pattern. The design is straightforward in principle and non-trivial to get right in practice.&lt;/p&gt;

&lt;p&gt;Every &lt;code&gt;@rl_task&lt;/code&gt; registers a heartbeat in Redis when it starts executing: a key with a configurable TTL (default 10 seconds). The task refreshes that heartbeat on a background loop while it runs. Every worker embeds a resurrection scanner that checks for expired heartbeats every few seconds, so the surviving workers recover a dead worker's tasks on their own. Distributed locks keep concurrent scanners from replaying the same task twice. (You can also run a standalone &lt;code&gt;rl run-resurrector&lt;/code&gt; process as belt-and-suspenders for the case where every worker dies at once.)&lt;/p&gt;

&lt;p&gt;When a worker dies mid-task, its heartbeat stops refreshing. After one TTL window, the resurrector detects the expired heartbeat and atomically re-queues the orphaned task onto a special &lt;code&gt;re-queue&lt;/code&gt; queue. A healthy worker picks it up. The original task arguments are preserved exactly.&lt;/p&gt;

&lt;p&gt;In our benchmarks, OOM recovery averaged 7.1 seconds with a p99 of 8.9 seconds. Not 35 seconds, not an hour. Seconds.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Worker dies at t=0&lt;/li&gt;
&lt;li&gt;Heartbeat expires by t=10s at the latest (heartbeat_ttl, counted from the last refresh)&lt;/li&gt;
&lt;li&gt;Resurrector detects at t=12s (next scan)&lt;/li&gt;
&lt;li&gt;Task re-queued at t=12s (atomic)&lt;/li&gt;
&lt;li&gt;Healthy worker picks up at t=12–14s&lt;/li&gt;
&lt;li&gt;Task completes&lt;/li&gt;
&lt;/ul&gt;
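
&lt;p&gt;The timeline above can be modelled with a small in-memory sketch. This is not Relier's code: &lt;code&gt;HeartbeatRegistry&lt;/code&gt; and its methods are hypothetical names, a dict stands in for the Redis keys, and timestamps are passed in explicitly so the behaviour is easy to follow.&lt;/p&gt;

```python
class HeartbeatRegistry:
    """In-memory stand-in for the Redis heartbeat keys (illustration only)."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.deadlines = {}   # maps task_id to the time its heartbeat lapses
        self.payloads = {}    # maps task_id to the original task arguments

    def beat(self, task_id, now, payload=None):
        # Called when a task starts and again on every refresh while it runs.
        self.deadlines[task_id] = now + self.ttl
        self.payloads.setdefault(task_id, payload)

    def finish(self, task_id):
        # A completed task removes its heartbeat so it is never resurrected.
        self.deadlines.pop(task_id, None)
        self.payloads.pop(task_id, None)

    def scan(self, now):
        """Return (task_id, payload) for every task whose heartbeat has lapsed."""
        orphans = []
        for task_id, deadline in list(self.deadlines.items()):
            time_left = max(0.0, deadline - now)
            if time_left == 0.0:
                # Drop the entry before handing the task back, so a later
                # scan cannot replay the same task.
                del self.deadlines[task_id]
                orphans.append((task_id, self.payloads.pop(task_id, None)))
        return orphans


if __name__ == "__main__":
    registry = HeartbeatRegistry(ttl_seconds=10)
    registry.beat("task-x", now=0.0, payload={"invoice_id": "inv-1"})  # worker dies right after
    print(registry.scan(now=8.0))    # heartbeat has not lapsed yet
    print(registry.scan(now=12.0))   # lapsed: task is handed back for re-queue
```

&lt;p&gt;In a real deployment, several scanners run concurrently, which is why the removal step has to be atomic and guarded by a lock; the single-process sketch sidesteps that.&lt;/p&gt;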

&lt;p&gt;This is why Relier achieves 100% delivery: it does not rely on the broker's visibility timeout. It has its own independent detection mechanism with a TTL you control.&lt;/p&gt;

&lt;h3&gt;The hard part: fence tokens and zombie workers&lt;/h3&gt;

&lt;p&gt;The description above makes Phoenix sound simple. The part that took the most work to get right is the zombie worker problem.&lt;/p&gt;

&lt;p&gt;Consider this scenario:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Worker A picks up Task X. Heartbeat registered.&lt;/li&gt;
&lt;li&gt;Worker A has a long GC pause. Its heartbeat expires.&lt;/li&gt;
&lt;li&gt;The resurrector detects the expired heartbeat and re-queues Task X.&lt;/li&gt;
&lt;li&gt;Worker B picks up Task X and completes it. Result committed to Redis.&lt;/li&gt;
&lt;li&gt;Worker A wakes up from its GC pause and tries to commit its result.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Without any protection, step 5 causes silent data corruption. Worker A commits a stale result, overwriting Worker B's correct result. The task has now effectively executed twice, with the wrong result stored.&lt;/p&gt;

&lt;p&gt;Relier prevents this with fence tokens. When Phoenix re-queues a task, it generates a new fence token, a monotonically increasing integer associated with the task's execution slot. The completion protocol is an atomic Lua script: "commit this result only if the current fence token matches the token this worker was given when it claimed the task."&lt;/p&gt;

&lt;p&gt;Worker A was given fence token &lt;code&gt;v1&lt;/code&gt;. After resurrection, the slot is now at &lt;code&gt;v2&lt;/code&gt;. When Worker A tries to commit, the Lua script sees the mismatch and rejects the write. No data corruption. No duplicate result.&lt;/p&gt;

&lt;p&gt;This is the correctness guarantee that makes "exactly-once execution" mean something: a resurrected task may start more than once, but only one result is ever committed.&lt;/p&gt;
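
&lt;p&gt;The fence-token check can be sketched in a few lines. This is an in-memory model under stated assumptions, not Relier's Lua script: &lt;code&gt;ResultSlot&lt;/code&gt; and its methods are hypothetical names, and a plain method call stands in for the atomic script.&lt;/p&gt;

```python
class ResultSlot:
    """In-memory model of one task's execution slot (illustration only)."""

    def __init__(self):
        self.fence = 1       # current fence token for the slot
        self.result = None

    def claim(self):
        # A worker remembers the token that was current when it claimed the task.
        return self.fence

    def resurrect(self):
        # Re-queuing bumps the token, which invalidates every earlier claim.
        self.fence += 1
        return self.fence

    def commit(self, token, result):
        # In Redis, this check and write would run inside a single Lua script
        # so that no other client can interleave between the two steps.
        if token != self.fence:
            return False
        self.result = result
        return True


if __name__ == "__main__":
    slot = ResultSlot()
    token_a = slot.claim()     # Worker A claims with v1, then stalls
    slot.resurrect()           # heartbeat lapses, the slot moves to v2
    token_b = slot.claim()     # Worker B claims with v2
    print(slot.commit(token_b, "result from B"))
    print(slot.commit(token_a, "stale result from A"))
    print(slot.result)
```

&lt;p&gt;Worker B's commit is accepted and Worker A's late commit is refused, so the stored result stays the one from B.&lt;/p&gt;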

&lt;h3&gt;Everything else Relier adds&lt;/h3&gt;

&lt;p&gt;Beyond Phoenix, a production-grade task system needs several more things. Relier ships them as part of the same &lt;code&gt;@rl_task&lt;/code&gt; decorator:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Idempotency.&lt;/strong&gt; &lt;code&gt;@rl_task(idempotent=True)&lt;/code&gt; adds an atomic Redis Lua check before task execution. If the same task has already been submitted for the same logical key (which you can set explicitly or let Relier derive from the arguments), the second submission returns immediately without spawning work. In our benchmark: 50 submissions of the same task, 1 execution. Vanilla Celery: 50 executions.&lt;/p&gt;
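
&lt;p&gt;As a rough illustration of the idea, the sketch below derives a logical key from the task arguments and admits only the first submission for that key. The names &lt;code&gt;derive_key&lt;/code&gt; and &lt;code&gt;IdempotencyGuard&lt;/code&gt; are hypothetical, and a dict stands in for the atomic Redis check.&lt;/p&gt;

```python
import hashlib
import json


def derive_key(task_name, args, kwargs):
    """Derive a stable logical key from the task name and its arguments."""
    canonical = json.dumps([task_name, list(args), kwargs], sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class IdempotencyGuard:
    """In-memory stand-in for an atomic set-if-absent check (illustration only)."""

    def __init__(self):
        self._owners = {}

    def first_submission(self, key, task_id):
        # setdefault stores task_id only when the key is new and returns
        # whichever id owns the key, so new and duplicate submissions are
        # told apart in a single step.
        owner = self._owners.setdefault(key, task_id)
        return owner == task_id


if __name__ == "__main__":
    guard = IdempotencyGuard()
    key = derive_key("send_invoice", ("inv-42",), {})
    executed = 0
    for n in range(50):
        if guard.first_submission(key, task_id=f"submission-{n}"):
            executed += 1
    print(executed)
```

&lt;p&gt;Sorting the keys before hashing means the same keyword arguments in a different order still map to the same logical key.&lt;/p&gt;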

&lt;p&gt;&lt;strong&gt;Two-tier timeouts.&lt;/strong&gt; &lt;code&gt;soft_timeout=8, hard_timeout=10&lt;/code&gt; gives you a cleanup hook that fires at 8 seconds (save state, close connections, emit structured logs) and a hard cancellation at 10 seconds via &lt;code&gt;asyncio.CancelledError&lt;/code&gt;. Zombie tasks that would block a worker forever are quarantined instead.&lt;/p&gt;
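
&lt;p&gt;A minimal sketch of the two-tier idea using only the standard library. &lt;code&gt;run_with_timeouts&lt;/code&gt; is a hypothetical helper, not Relier's API, and the quarantine step is left out; the timeouts are shortened so the demo finishes quickly.&lt;/p&gt;

```python
import asyncio


async def run_with_timeouts(task_fn, soft_timeout, hard_timeout, on_soft_timeout):
    """Run task_fn() with a soft cleanup hook and a hard cancellation."""
    loop = asyncio.get_running_loop()
    # The soft timeout only fires a callback; the task keeps running.
    soft_handle = loop.call_later(soft_timeout, on_soft_timeout)
    try:
        # wait_for cancels the task (raising CancelledError inside it) once
        # hard_timeout elapses, then raises TimeoutError to the caller.
        return await asyncio.wait_for(task_fn(), timeout=hard_timeout)
    finally:
        # If the task finished early, the soft hook must not fire afterwards.
        soft_handle.cancel()


async def demo():
    events = []

    async def slow_task():
        try:
            await asyncio.sleep(5)
        except asyncio.CancelledError:
            events.append("cancelled")
            raise

    try:
        await run_with_timeouts(slow_task, 0.05, 0.1, lambda: events.append("soft"))
    except asyncio.TimeoutError:
        events.append("hard")
    return events


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

&lt;p&gt;The task sees the cancellation as an exception at its current &lt;code&gt;await&lt;/code&gt;, which is what gives it a last chance to release resources before the worker moves on.&lt;/p&gt;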

&lt;p&gt;&lt;strong&gt;Graceful shutdown.&lt;/strong&gt; On SIGTERM, the worker drains in-flight tasks rather than dropping them. Tasks that cannot complete before shutdown are handed off to Phoenix on the re-queue queue. In our benchmark: 3 cycles of 20 tasks each, Relier 100% survival, vanilla Celery 0%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dead Letter Queue.&lt;/strong&gt; Tasks that exhaust their &lt;code&gt;max_resurrections&lt;/code&gt; allowance are quarantined to the DLQ with their full payload, stack trace, and complete resurrection history. The &lt;code&gt;rl dlq inspect&lt;/code&gt; CLI shows everything. &lt;code&gt;rl dlq release &amp;lt;id&amp;gt;&lt;/code&gt; re-dispatches a specific failed task. Nothing disappears silently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Admission control.&lt;/strong&gt; An atomic Lua fixed-window rate limiter on every &lt;code&gt;apush()&lt;/code&gt; call. If the cluster is saturated, you get an &lt;code&gt;AdmissionRejectedError&lt;/code&gt; with a &lt;code&gt;Retry-After&lt;/code&gt; header, not a flooded queue and a cascade failure. In our benchmark: p99 0.559ms, well under the 1ms claim.&lt;/p&gt;
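
&lt;p&gt;For readers unfamiliar with fixed-window limiting, here is a single-process sketch of the technique. &lt;code&gt;FixedWindowLimiter&lt;/code&gt; and &lt;code&gt;AdmissionRejected&lt;/code&gt; are hypothetical names chosen for this example, and instance attributes stand in for the Redis counter that the Lua script would update atomically.&lt;/p&gt;

```python
class AdmissionRejected(Exception):
    """Raised when the current window is full (hypothetical name)."""

    def __init__(self, retry_after):
        super().__init__(f"retry after {retry_after:.2f}s")
        self.retry_after = retry_after


class FixedWindowLimiter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.index = None   # which window the counter belongs to
        self.count = 0      # submissions admitted in that window

    def admit(self, now):
        index = int(now // self.window)
        if index != self.index:
            # A new window has started, so the counter resets.
            self.index, self.count = index, 0
        if self.count == self.limit:
            # Window is full: tell the caller when the next window opens.
            next_window_start = (index + 1) * self.window
            raise AdmissionRejected(retry_after=next_window_start - now)
        self.count += 1


if __name__ == "__main__":
    limiter = FixedWindowLimiter(limit=3, window_seconds=1.0)
    for t in (0.1, 0.2, 0.3):
        limiter.admit(now=t)
    try:
        limiter.admit(now=0.4)
    except AdmissionRejected as exc:
        print("rejected:", exc)
    limiter.admit(now=1.0)   # the next window admits again
    print("admitted at t=1.0")
```

&lt;p&gt;A fixed window is cheap to evaluate but allows a burst of up to twice the limit across a window boundary, which is the usual trade-off against sliding-window designs.&lt;/p&gt;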

&lt;p&gt;&lt;strong&gt;Rolling deploy protection.&lt;/strong&gt; Every payload is wrapped in a versioned envelope with a SHA-256 checksum. Register a migration function, bump &lt;code&gt;CURRENT_VERSION&lt;/code&gt;, and v2 workers silently upgrade v1 payloads mid-deploy. Old and new workers can run simultaneously without payload schema mismatches. Checksums catch broker-side corruption before your code ever runs.&lt;/p&gt;
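
&lt;p&gt;The envelope idea can be illustrated with the standard library alone. This is a sketch under assumptions, not Relier's wire format: the field names, the &lt;code&gt;MIGRATIONS&lt;/code&gt; table, and the renamed &lt;code&gt;id&lt;/code&gt; argument are all invented for the example.&lt;/p&gt;

```python
import hashlib
import json

CURRENT_VERSION = 2

# Maps a version to the function that upgrades a payload to the next version.
MIGRATIONS = {
    1: lambda payload: {"invoice_id": payload["id"]},   # hypothetical rename in v2
}


def _checksum(version, payload):
    body = json.dumps({"version": version, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def wrap(payload, version=CURRENT_VERSION):
    """Wrap a payload in a versioned envelope with a SHA-256 checksum."""
    return {"version": version, "payload": payload, "checksum": _checksum(version, payload)}


def unwrap(envelope):
    """Verify the checksum, then migrate the payload up to CURRENT_VERSION."""
    version, payload = envelope["version"], envelope["payload"]
    if _checksum(version, payload) != envelope["checksum"]:
        raise ValueError("checksum mismatch: payload was altered in transit")
    while version != CURRENT_VERSION:
        payload = MIGRATIONS[version](payload)   # KeyError if no migration is registered
        version += 1
    return payload


if __name__ == "__main__":
    old_envelope = wrap({"id": "inv-1"}, version=1)   # produced by a v1 worker
    print(unwrap(old_envelope))                        # consumed by a v2 worker
```

&lt;p&gt;Verifying the checksum before migrating means a corrupted payload is rejected before any task code or migration function touches it.&lt;/p&gt;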

&lt;h3&gt;Benchmarks&lt;/h3&gt;

&lt;p&gt;All numbers below are from the built-in bench suite running against live Redis on Linux (Docker, python:3.11-slim, prefork=4 workers) with synthetic 0.5s tasks:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Relier v0.1.6&lt;/th&gt;
&lt;th&gt;Vanilla&lt;/th&gt;
&lt;th&gt;Vanilla + acks_late&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Task delivery (500 tasks, 5 kills)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;92.0%&lt;/td&gt;
&lt;td&gt;96.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OOM recovery avg / p99&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;7.1s / 8.9s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;∞ lost&lt;/td&gt;
&lt;td&gt;∞ (visibility_timeout)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Idempotent recovery (delayed restart)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;re-ran 4.8s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;∞ lost&lt;/td&gt;
&lt;td&gt;∞ (visibility_timeout)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Graceful shutdown (3 cycles)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Duplicate prevention (50 submissions)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1/50 ran&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50/50 ran&lt;/td&gt;
&lt;td&gt;50/50 ran&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Admission control p99&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.559ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dispatch overhead (net)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+1.87ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xzab5kzj9lxfz6610yg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xzab5kzj9lxfz6610yg.png" alt="Grafana dashboard — end of benchmark run. Green line (Relier) reaches 577 cumulative completions across all test cycles. Yellow line (Vanilla Celery) flatlines at 460 after the first SIGKILL cycles. Resurrections: 51 total. Redis memory: 2.92 MiB — no accumulation across the full run." width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft16tykm7zh6dxyp031bn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft16tykm7zh6dxyp031bn.png" alt="Redis commands/sec across the full benchmark run (Grafana, Test 7). Spikes correspond to task turnover bursts during SIGKILL cycles — peaking at ~83 ops/sec during the 500-task delivery test. Baseline returns to near-zero immediately after each burst. No accumulation across 577 completions and 51 resurrections." width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The 1.87ms dispatch overhead covers the admission Lua script + SHA-256 envelope wrap + heartbeat registration. On any task doing real work (a database query, an HTTP call, an AI inference), this cost is invisible.&lt;/p&gt;

&lt;h3&gt;Getting started&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;relier
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;relier&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;rl_task&lt;/span&gt;

&lt;span class="nd"&gt;@rl_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;idempotent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;soft_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;hard_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;send_invoice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invoice_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;charge_card&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invoice_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;send_email&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invoice_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;invoice_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;invoice_id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# From FastAPI:
&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;send_invoice&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apush&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invoice_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# From Flask / Django:
&lt;/span&gt;&lt;span class="n"&gt;send_invoice&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invoice_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three processes to run:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;celery &lt;span class="nt"&gt;-A&lt;/span&gt; relier.tasks.app worker &lt;span class="nt"&gt;-l&lt;/span&gt; info &lt;span class="nt"&gt;-Q&lt;/span&gt; high_priority,default,re-queue &lt;span class="nt"&gt;--include&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;tasks
rl run-resurrector
uvicorn main:app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or the full stack (Redis, workers, resurrector, Prometheus, Grafana) with:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose &lt;span class="nt"&gt;-f&lt;/span&gt; docker-compose.bench.yml up &lt;span class="nt"&gt;--build&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Requirements: Python 3.11+, Redis 7+ with AOF persistence and &lt;code&gt;maxmemory-policy noeviction&lt;/code&gt;. Relier preflight-checks both on startup and refuses to run if either is wrong.&lt;/p&gt;

&lt;h3&gt;What I learned building this&lt;/h3&gt;

&lt;p&gt;The failure modes that are hardest to reason about are not the obvious ones. A worker dying is obvious: you see the process disappear. A GC pause that makes a healthy process look dead to an external observer, after which the process wakes up and tries to write stale state, is the case that breaks naive implementations.&lt;/p&gt;

&lt;p&gt;Rolling deploys without schema versioning are a silent data loss vector that almost nobody talks about. The checksum + migration system exists because I watched a TypeError on a renamed argument silently DLQ a week's worth of invoice tasks with no alert.&lt;/p&gt;

&lt;p&gt;Fence tokens are not a novel idea. The pattern comes from Martin Kleppmann's writing on distributed locking. But seeing the exact failure mode in a test, instrumenting it, and then watching the Lua script atomically reject the zombie commit, that was the moment Relier went from "probably correct" to "verifiably correct."&lt;/p&gt;

&lt;p&gt;The chaos suite in the repo exists for this reason. Five scenarios: worker-kill, network-partition, load-spike, task-corrupt, slow-task. Run them against your own cluster, your own Redis, your own task code. You should not have to trust my benchmarks. Prove it yourself.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/getrelier/relier" rel="noopener noreferrer"&gt;github.com/getrelier/relier&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://getrelier.github.io/relier" rel="noopener noreferrer"&gt;getrelier.github.io/relier&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Install:&lt;/strong&gt; &lt;code&gt;pip install relier&lt;/code&gt;&lt;/p&gt;




</description>
      <category>python</category>
      <category>celery</category>
      <category>webdev</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
