- Book: Event-Driven Architecture Pocket Guide
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
Picture it: 5:47 PM on a Friday, the on-call phone vibrates. A single payment event with a malformed currency field, "USD " with a trailing space your parser hates, has been retrying since 4:12 PM. Each retry blocks the partition behind it. Six thousand orders are now stuck waiting on one rotten apple. Your consumer's lag chart looks like a staircase going up.
You did not have a dead letter queue. That is the entire problem.
The DLQ pattern is one of those ideas that sounds boring in a system-design interview and saves your weekend in production. The shape is simple: when a message fails N times, stop hammering it, set it aside in a separate queue, and move on. The implementation is small enough to fit in 200 lines. The discipline around it (failure metadata and redrive tooling) is what separates teams that ship event-driven systems from teams that fear them.
The shape of a poison-message incident
You usually find out from a graph, not a log. Consumer lag climbs. Throughput on the partition drops to zero. processed_per_minute turns into a flat line. The code is fine for every other message. One specific message, the poison one, is redelivered forever because the consumer marks it "failed, retry" on every attempt.
Three things compound the damage:
- In-order delivery on partitioned queues. Kafka delivers per partition in order. If consumer offsets do not advance past the bad message, every later message on that partition waits behind it. RabbitMQ with a single consumer behaves the same way.
- Tight retry loops. Retrying immediately, with no backoff, means the same parser hits the same broken payload 400 times a minute. Your CPU graph spikes. Your error log fills with the same stack trace.
- No isolation. Without a DLQ, the only way to "skip" the message is to commit the offset by hand or delete the message manually. Both require a human who knows what they're doing, on a Friday night.
The SQS docs call these "poison pill" messages. The fix is the same regardless of broker: cap retries, route the failure somewhere safe, keep the main queue moving. Confluent describes the same pattern under the Kafka DLQ topic name.
The four moving parts
A working DLQ has four pieces, and missing any of them leaves you with the original problem in a different shape.
- A main queue with an attempt counter on each message.
- A failure handler that increments the counter, applies backoff, and decides "retry or DLQ."
- A dead letter queue that stores the failed message plus the failure metadata you'll need at 9 AM Monday: the original payload, the last exception, the stack trace, the consumer that failed it, the timestamp, and the attempt count.
- A redrive path that lets a human (or a cron job) pull messages out of the DLQ, repair them, and put them back on the main queue.
Most teams nail the first three and forget the fourth. A DLQ without redrive is a graveyard. You want the operational verb to be "republish from DLQ," not "open a ticket and copy-paste from logs."
A 200-line Postgres-backed worker queue with DLQ
Postgres is not always the right transport for events (Kafka or NATS will outscale it), but it is the right place to start a worker queue when you already have Postgres in your stack. FOR UPDATE SKIP LOCKED makes it work. The packagemain Outbox post uses the same Postgres-as-queue primitive in Go.
Here is the schema. Two tables, one for the live queue, one for the DLQ.
CREATE TABLE jobs (
id BIGSERIAL PRIMARY KEY,
payload JSONB NOT NULL,
attempts INT NOT NULL DEFAULT 0,
max_attempts INT NOT NULL DEFAULT 5,
run_after TIMESTAMPTZ NOT NULL DEFAULT NOW(),
locked_at TIMESTAMPTZ,
locked_by TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX jobs_run_after_idx
ON jobs (run_after)
WHERE locked_at IS NULL;
CREATE TABLE jobs_dlq (
id BIGSERIAL PRIMARY KEY,
original_id BIGINT NOT NULL,
payload JSONB NOT NULL,
attempts INT NOT NULL,
error_class TEXT NOT NULL,
error_msg TEXT NOT NULL,
stacktrace TEXT NOT NULL,
failed_by TEXT NOT NULL,
failed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
Two notes. The partial index on jobs (run_after) WHERE locked_at IS NULL keeps the worker poll cheap as the table grows. The jobs_dlq table captures the metadata you'll wish you had: exception class and full stacktrace as separate columns so you can GROUP BY error_class and find your top 5 poison shapes.
Now the worker. Roughly 100 lines of Python with psycopg, no framework.
import json
import os
import socket
import time
import traceback
from contextlib import contextmanager
from datetime import datetime, timedelta
import psycopg
WORKER_ID = f"{socket.gethostname()}-{os.getpid()}"
LOCK_TIMEOUT = timedelta(minutes=5)
@contextmanager
def conn():
c = psycopg.connect(os.environ["DATABASE_URL"], autocommit=False)
try:
yield c
finally:
c.close()
def claim_one(c):
with c.cursor() as cur:
cur.execute(
"""
SELECT id, payload, attempts, max_attempts
FROM jobs
WHERE run_after <= NOW()
AND (locked_at IS NULL
OR locked_at < NOW() - %s)
ORDER BY id
FOR UPDATE SKIP LOCKED
LIMIT 1
""",
(LOCK_TIMEOUT,),
)
row = cur.fetchone()
if not row:
return None
job_id, payload, attempts, max_attempts = row
cur.execute(
"UPDATE jobs SET locked_at = NOW(),"
" locked_by = %s WHERE id = %s",
(WORKER_ID, job_id),
)
c.commit()
return job_id, payload, attempts, max_attempts
def complete(c, job_id):
with c.cursor() as cur:
cur.execute("DELETE FROM jobs WHERE id = %s", (job_id,))
c.commit()
def fail(c, job_id, payload, attempts, max_attempts, exc):
next_attempts = attempts + 1
if next_attempts >= max_attempts:
move_to_dlq(c, job_id, payload, next_attempts, exc)
return
backoff = min(2 ** next_attempts, 300)
with c.cursor() as cur:
cur.execute(
"""
UPDATE jobs
SET attempts = %s,
run_after = NOW() + (%s || ' seconds')::interval,
locked_at = NULL,
locked_by = NULL
WHERE id = %s
""",
(next_attempts, str(backoff), job_id),
)
c.commit()
def move_to_dlq(c, job_id, payload, attempts, exc):
with c.cursor() as cur:
cur.execute(
"""
INSERT INTO jobs_dlq
(original_id, payload, attempts,
error_class, error_msg, stacktrace, failed_by)
VALUES (%s, %s, %s, %s, %s, %s, %s)
""",
(
job_id,
json.dumps(payload),
attempts,
type(exc).__name__,
str(exc)[:500],
traceback.format_exc()[:4000],
WORKER_ID,
),
)
cur.execute("DELETE FROM jobs WHERE id = %s", (job_id,))
c.commit()
The shape worth pointing at: claim_one is the only function that holds a row lock across a transaction. The handler runs outside that transaction, so a slow handler does not block other workers. The locked_at < NOW() - LOCK_TIMEOUT clause is your crash recovery. If a worker dies mid-job, the row becomes claimable again after five minutes.
The runner loop is short.
def handle(payload):
# Your business logic. Raise on failure.
if "currency" in payload:
if not payload["currency"].isalpha():
raise ValueError("invalid currency code")
# ... process payment, fan out, whatever.
def run_forever(idle_sleep=1.0):
while True:
with conn() as c:
claimed = claim_one(c)
if not claimed:
time.sleep(idle_sleep)
continue
job_id, payload, attempts, max_attempts = claimed
try:
handle(payload)
with conn() as c:
complete(c, job_id)
except Exception as exc:
with conn() as c:
fail(c, job_id, payload, attempts,
max_attempts, exc)
if __name__ == "__main__":
run_forever()
You now have a worker queue with capped retries, exponential backoff, crash recovery, and a DLQ that captures enough metadata to debug Monday morning. The worker is about 130 lines; with the redrive script below, you land at ~200.
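One piece the listings leave out is the producer side. Enqueueing is a single INSERT; a minimal helper might look like the sketch below (the enqueue name and the connection-per-call shape are illustrative, not part of the 200 lines).
import json
import os

import psycopg


def enqueue(payload: dict, max_attempts: int = 5) -> int:
    # Put a job on the main queue; run_after defaults to NOW(), attempts to 0.
    with psycopg.connect(os.environ["DATABASE_URL"]) as c:
        with c.cursor() as cur:
            cur.execute(
                "INSERT INTO jobs (payload, max_attempts)"
                " VALUES (%s, %s) RETURNING id",
                (json.dumps(payload), max_attempts),
            )
            (job_id,) = cur.fetchone()
        c.commit()
    return job_id
Calling enqueue({"currency": "USD", "amount_cents": 4200}) from your API process is enough to feed the worker loop above.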
The redrive script nobody writes (until they need it)
The piece that makes this a real DLQ is the part that gets messages out. Here it is, about 50 lines.
import json
import os
import sys
import psycopg
def list_dlq(error_class=None, limit=20):
with psycopg.connect(os.environ["DATABASE_URL"]) as c:
with c.cursor() as cur:
if error_class:
cur.execute(
"""
SELECT id, original_id, error_class,
error_msg, attempts, failed_at
FROM jobs_dlq
WHERE error_class = %s
ORDER BY failed_at DESC
LIMIT %s
""",
(error_class, limit),
)
else:
cur.execute(
"""
SELECT error_class, COUNT(*) AS n
FROM jobs_dlq
GROUP BY error_class
ORDER BY n DESC
"""
)
for row in cur.fetchall():
print(row)
def redrive(dlq_id):
with psycopg.connect(os.environ["DATABASE_URL"]) as c:
with c.cursor() as cur:
cur.execute(
"SELECT payload FROM jobs_dlq WHERE id = %s",
(dlq_id,),
)
(payload,) = cur.fetchone()
cur.execute(
"INSERT INTO jobs (payload, attempts)"
" VALUES (%s, 0)",
(json.dumps(payload),),
)
cur.execute(
"DELETE FROM jobs_dlq WHERE id = %s", (dlq_id,)
)
c.commit()
print(f"redriven {dlq_id}")
if __name__ == "__main__":
    cmd = sys.argv[1]
    if cmd == "ls":
        # Optional args: an error class filter, then an integer limit.
        error_class = sys.argv[2] if len(sys.argv) > 2 else None
        limit = int(sys.argv[3]) if len(sys.argv) > 3 else 20
        list_dlq(error_class, limit)
    elif cmd == "redrive":
        redrive(int(sys.argv[2]))
python dlq.py ls shows the top error classes. python dlq.py ls ValueError 50 shows the last 50 of one class. python dlq.py redrive 12345 puts a single message back on the main queue with attempts = 0. That last one is the verb your on-call needs at 6 PM on a Friday.
Things to add before this sees real traffic
The 200 lines above are a working spine, not a finished system. Two adjustments matter before production.
Alert on DLQ growth, not just size. A DLQ sitting at 1,200 messages is fine if it has been there for a month. A DLQ that grew by 200 in the last 5 minutes is a fire. Expose jobs_dlq_count as a Prometheus gauge backed by SELECT COUNT(*) FROM jobs_dlq (a prometheus_client Gauge scraped on a 15-second interval is enough), then alert when delta(jobs_dlq_count[5m]) crosses the threshold for your traffic shape (delta rather than rate, since this is a gauge, not a counter). The Confluent guide makes the same point about Kafka DLQ topics.
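A minimal sketch of that gauge, assuming prometheus_client is installed; the port, metric help text, and polling loop are assumptions, not prescriptions.
import os
import time

import psycopg
from prometheus_client import Gauge, start_http_server

dlq_count = Gauge("jobs_dlq_count", "Rows currently in jobs_dlq")


def serve_metrics(port=9102, interval=15.0):
    # Expose /metrics and refresh the gauge from Postgres every 15 seconds.
    start_http_server(port)
    while True:
        with psycopg.connect(os.environ["DATABASE_URL"]) as c:
            with c.cursor() as cur:
                cur.execute("SELECT COUNT(*) FROM jobs_dlq")
                (n,) = cur.fetchone()
        dlq_count.set(n)
        time.sleep(interval)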
Validate before redriving. Whatever broke the message is probably still broken. A redrive that loops the same message back into the same poison path just refills the DLQ. Add a --validate flag that runs the message through handle in a sandbox before the insert.
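One hedged way to sketch that, as a sibling of redrive in the same script: it assumes the worker listing above is importable as a module named worker, and that a dry run of handle is safe, meaning side-effect-free or pointed at a staging dependency.
from worker import handle  # hypothetical module name for the worker listing above


def redrive_validated(dlq_id):
    with psycopg.connect(os.environ["DATABASE_URL"]) as c:
        with c.cursor() as cur:
            cur.execute(
                "SELECT payload FROM jobs_dlq WHERE id = %s", (dlq_id,)
            )
            (payload,) = cur.fetchone()
    try:
        # Dry run: does the current code accept this payload now?
        handle(payload)
    except Exception as exc:
        print(f"still poisonous ({type(exc).__name__}: {exc}); not redriving")
        return
    redrive(dlq_id)
Wiring this up as a --validate flag is a two-line change in the argv handling.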
For Kafka deployments, the same shape applies: a DLQ topic per consumer group, headers carrying original_topic, attempt_count, error_class, and a redrive consumer that reads from the DLQ topic and republishes to the source. The Karafka DLQ docs and the SQS redrive blog both walk through broker-specific variants.
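As a sketch of the publish side only, using confluent-kafka; the topic name, header keys, and broker address here are illustrative assumptions, not a prescribed layout.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})


def send_to_dlq(msg, exc, attempt_count):
    # Republish the message verbatim; the failure context travels in headers.
    producer.produce(
        "payments.dlq",
        value=msg.value(),
        key=msg.key(),
        headers=[
            ("original_topic", msg.topic().encode()),
            ("attempt_count", str(attempt_count).encode()),
            ("error_class", type(exc).__name__.encode()),
        ],
    )
    producer.flush()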
The Friday rule
If your event-driven system does not have these 200 lines or their equivalent, the next poison message is your weekend. Write the redrive script. Page yourself when the DLQ grows fast, not when it is non-empty. Your Friday self will thank your Tuesday self.
If this was useful
The Outbox, Saga, and DLQ chapters of Event-Driven Architecture Pocket Guide cover the failure modes most tutorials skip: partition stalls, redrive loops, ordering guarantees that quietly break under load, and the operational rituals that keep distributed systems shippable. If you've ever had a Friday like the one above, the book is for you.