- Book: Event-Driven Architecture Pocket Guide
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
Picture it: 5:47 PM on a Friday, the on-call phone vibrates. A single payment event with a malformed currency field, "USD " with a trailing space your parser hates, has been retrying since 4:12 PM. Each retry blocks the partition behind it. Six thousand orders are now stuck waiting on one rotten apple. Your consumer's lag chart looks like a staircase going up.
You did not have a dead letter queue. That is the entire problem.
The DLQ pattern is one of those ideas that sounds boring in a system-design interview and saves your weekend in production. The shape is simple: when a message fails N times, stop hammering it, set it aside in a separate queue, and move on. The implementation is small enough to fit in 200 lines. The discipline around it (failure metadata and redrive tooling) is what separates teams that ship event-driven systems from teams that fear them.
The shape of a poison-message incident
You usually find out from a graph, not a log. Consumer lag climbs. Throughput on the partition drops to zero. processed_per_minute turns into a flat line. The code is fine for every other message. One specific message, the poison one, is redelivered forever because the consumer marks it "failed, retry" on every attempt.
Three things compound the damage:
- In-order delivery on partitioned queues. Kafka delivers per partition in order. If consumer offsets do not advance past the bad message, every later message on that partition waits behind it. RabbitMQ with a single consumer behaves the same way.
- Tight retry loops. Retrying immediately, with no backoff, means the same parser hits the same broken payload 400 times a minute. Your CPU graph spikes. Your error log fills with the same stack trace.
- No isolation. Without a DLQ, the only way to "skip" the message is to commit the offset by hand or delete the message manually. Both require a human who knows what they're doing, on a Friday night.
The SQS docs call these "poison pill" messages. The fix is the same regardless of broker: cap retries, route the failure somewhere safe, keep the main queue moving. Confluent describes the same pattern under the Kafka DLQ topic name.
The four moving parts
A working DLQ has four pieces, and missing any of them leaves you with the original problem in a different shape.
- A main queue with an attempt counter on each message.
- A failure handler that increments the counter, applies backoff, and decides "retry or DLQ."
- A dead letter queue that stores the failed message plus the failure metadata you'll need at 9 AM Monday: the original payload, the last exception, the stack trace, the consumer that failed it, the timestamp, and the attempt count.
- A redrive path that lets a human (or a cron job) pull messages out of the DLQ, repair them, and put them back on the main queue.
Most teams nail the first three and forget the fourth. A DLQ without redrive is a graveyard. You want the operational verb to be "republish from DLQ," not "open a ticket and copy-paste from logs."
A 200-line Postgres-backed worker queue with DLQ
Postgres is not always the right transport for events (Kafka or NATS will outscale it), but it is the right place to start a worker queue when you already have Postgres in your stack. FOR UPDATE SKIP LOCKED makes it work. The packagemain Outbox post uses the same Postgres-as-queue primitive in Go.
Here is the schema. Two tables, one for the live queue, one for the DLQ.
CREATE TABLE jobs (
id BIGSERIAL PRIMARY KEY,
payload JSONB NOT NULL,
attempts INT NOT NULL DEFAULT 0,
max_attempts INT NOT NULL DEFAULT 5,
run_after TIMESTAMPTZ NOT NULL DEFAULT NOW(),
locked_at TIMESTAMPTZ,
locked_by TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX jobs_run_after_idx
ON jobs (run_after)
WHERE locked_at IS NULL;
CREATE TABLE jobs_dlq (
id BIGSERIAL PRIMARY KEY,
original_id BIGINT NOT NULL,
payload JSONB NOT NULL,
attempts INT NOT NULL,
error_class TEXT NOT NULL,
error_msg TEXT NOT NULL,
stacktrace TEXT NOT NULL,
failed_by TEXT NOT NULL,
failed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
Two notes. The partial index on jobs (run_after) WHERE locked_at IS NULL keeps the worker poll cheap as the table grows. The jobs_dlq table captures the metadata you'll wish you had: exception class and full stacktrace as separate columns so you can GROUP BY error_class and find your top 5 poison shapes.
Now the worker. Roughly 100 lines of Python with psycopg, no framework.
import json
import os
import socket
import time
import traceback
from contextlib import contextmanager
from datetime import datetime, timedelta
import psycopg
WORKER_ID = f"{socket.gethostname()}-{os.getpid()}"
LOCK_TIMEOUT = timedelta(minutes=5)
@contextmanager
def conn():
c = psycopg.connect(os.environ["DATABASE_URL"], autocommit=False)
try:
yield c
finally:
c.close()
def claim_one(c):
with c.cursor() as cur:
cur.execute(
"""
SELECT id, payload, attempts, max_attempts
FROM jobs
WHERE run_after <= NOW()
AND (locked_at IS NULL
OR locked_at < NOW() - %s)
ORDER BY id
FOR UPDATE SKIP LOCKED
LIMIT 1
""",
(LOCK_TIMEOUT,),
)
row = cur.fetchone()
if not row:
return None
job_id, payload, attempts, max_attempts = row
cur.execute(
"UPDATE jobs SET locked_at = NOW(),"
" locked_by = %s WHERE id = %s",
(WORKER_ID, job_id),
)
c.commit()
return job_id, payload, attempts, max_attempts
def complete(c, job_id):
with c.cursor() as cur:
cur.execute("DELETE FROM jobs WHERE id = %s", (job_id,))
c.commit()
def fail(c, job_id, payload, attempts, max_attempts, exc):
next_attempts = attempts + 1
if next_attempts >= max_attempts:
move_to_dlq(c, job_id, payload, next_attempts, exc)
return
backoff = min(2 ** next_attempts, 300)
with c.cursor() as cur:
cur.execute(
"""
UPDATE jobs
SET attempts = %s,
run_after = NOW() + (%s || ' seconds')::interval,
locked_at = NULL,
locked_by = NULL
WHERE id = %s
""",
(next_attempts, str(backoff), job_id),
)
c.commit()
def move_to_dlq(c, job_id, payload, attempts, exc):
with c.cursor() as cur:
cur.execute(
"""
INSERT INTO jobs_dlq
(original_id, payload, attempts,
error_class, error_msg, stacktrace, failed_by)
VALUES (%s, %s, %s, %s, %s, %s, %s)
""",
(
job_id,
json.dumps(payload),
attempts,
type(exc).__name__,
str(exc)[:500],
traceback.format_exc()[:4000],
WORKER_ID,
),
)
cur.execute("DELETE FROM jobs WHERE id = %s", (job_id,))
c.commit()
The shape worth pointing at: claim_one is the only function that holds a row lock across a transaction. The handler runs outside that transaction, so a slow handler does not block other workers. The locked_at < NOW() - LOCK_TIMEOUT clause is your crash recovery. If a worker dies mid-job, the row becomes claimable again after five minutes.
The runner loop is short.
def handle(payload):
# Your business logic. Raise on failure.
if "currency" in payload:
if not payload["currency"].isalpha():
raise ValueError("invalid currency code")
# ... process payment, fan out, whatever.
def run_forever(idle_sleep=1.0):
while True:
with conn() as c:
claimed = claim_one(c)
if not claimed:
time.sleep(idle_sleep)
continue
job_id, payload, attempts, max_attempts = claimed
try:
handle(payload)
with conn() as c:
complete(c, job_id)
except Exception as exc:
with conn() as c:
fail(c, job_id, payload, attempts,
max_attempts, exc)
if __name__ == "__main__":
run_forever()
You now have a worker queue with capped retries, exponential backoff, crash recovery, and a DLQ that captures enough metadata to debug Monday morning. The worker is about 130 lines; with the redrive script below, you land at ~200.
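One piece the listings leave out is the producer side. Enqueueing is a single INSERT; a minimal helper might look like the sketch below (the enqueue name and the connection-per-call shape are illustrative, not part of the 200 lines).
import json
import os

import psycopg


def enqueue(payload: dict, max_attempts: int = 5) -> int:
    # Put a job on the main queue; run_after defaults to NOW(), attempts to 0.
    with psycopg.connect(os.environ["DATABASE_URL"]) as c:
        with c.cursor() as cur:
            cur.execute(
                "INSERT INTO jobs (payload, max_attempts)"
                " VALUES (%s, %s) RETURNING id",
                (json.dumps(payload), max_attempts),
            )
            (job_id,) = cur.fetchone()
        c.commit()
    return job_id
Calling enqueue({"currency": "USD", "amount_cents": 4200}) from your API process is enough to feed the worker loop above.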
The redrive script nobody writes (until they need it)
The piece that makes this a real DLQ is the part that gets messages out. Here it is, about 50 lines.
import json
import os
import sys
import psycopg
def list_dlq(error_class=None, limit=20):
with psycopg.connect(os.environ["DATABASE_URL"]) as c:
with c.cursor() as cur:
if error_class:
cur.execute(
"""
SELECT id, original_id, error_class,
error_msg, attempts, failed_at
FROM jobs_dlq
WHERE error_class = %s
ORDER BY failed_at DESC
LIMIT %s
""",
(error_class, limit),
)
else:
cur.execute(
"""
SELECT error_class, COUNT(*) AS n
FROM jobs_dlq
GROUP BY error_class
ORDER BY n DESC
"""
)
for row in cur.fetchall():
print(row)
def redrive(dlq_id):
with psycopg.connect(os.environ["DATABASE_URL"]) as c:
with c.cursor() as cur:
cur.execute(
"SELECT payload FROM jobs_dlq WHERE id = %s",
(dlq_id,),
)
(payload,) = cur.fetchone()
cur.execute(
"INSERT INTO jobs (payload, attempts)"
" VALUES (%s, 0)",
(json.dumps(payload),),
)
cur.execute(
"DELETE FROM jobs_dlq WHERE id = %s", (dlq_id,)
)
c.commit()
print(f"redriven {dlq_id}")
if __name__ == "__main__":
    cmd = sys.argv[1]
    if cmd == "ls":
        # Optional args: an error class filter, then an integer limit.
        error_class = sys.argv[2] if len(sys.argv) > 2 else None
        limit = int(sys.argv[3]) if len(sys.argv) > 3 else 20
        list_dlq(error_class, limit)
    elif cmd == "redrive":
        redrive(int(sys.argv[2]))
python dlq.py ls shows the top error classes. python dlq.py ls ValueError 50 shows the last 50 of one class. python dlq.py redrive 12345 puts a single message back on the main queue with attempts = 0. That last one is the verb your on-call needs at 6 PM on a Friday.
Things to add before this sees real traffic
The 200 lines above are a working spine, not a finished system. Two adjustments matter before production.
Alert on DLQ growth, not just size. A DLQ sitting at 1,200 messages is fine if it has been there for a month. A DLQ that grew by 200 in the last 5 minutes is a fire. Expose jobs_dlq_count as a Prometheus gauge backed by SELECT COUNT(*) FROM jobs_dlq (a prometheus_client Gauge scraped on a 15-second interval is enough), then alert when delta(jobs_dlq_count[5m]) crosses the threshold for your traffic shape (delta rather than rate, since this is a gauge, not a counter). The Confluent guide makes the same point about Kafka DLQ topics.
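A minimal sketch of that gauge, assuming prometheus_client is installed; the port, metric help text, and polling loop are assumptions, not prescriptions.
import os
import time

import psycopg
from prometheus_client import Gauge, start_http_server

dlq_count = Gauge("jobs_dlq_count", "Rows currently in jobs_dlq")


def serve_metrics(port=9102, interval=15.0):
    # Expose /metrics and refresh the gauge from Postgres every 15 seconds.
    start_http_server(port)
    while True:
        with psycopg.connect(os.environ["DATABASE_URL"]) as c:
            with c.cursor() as cur:
                cur.execute("SELECT COUNT(*) FROM jobs_dlq")
                (n,) = cur.fetchone()
        dlq_count.set(n)
        time.sleep(interval)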
Validate before redriving. Whatever broke the message is probably still broken. A redrive that loops the same message back into the same poison path just refills the DLQ. Add a --validate flag that runs the message through handle in a sandbox before the insert.
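One hedged way to sketch that, as a sibling of redrive in the same script: it assumes the worker listing above is importable as a module named worker, and that a dry run of handle is safe, meaning side-effect-free or pointed at a staging dependency.
from worker import handle  # hypothetical module name for the worker listing above


def redrive_validated(dlq_id):
    with psycopg.connect(os.environ["DATABASE_URL"]) as c:
        with c.cursor() as cur:
            cur.execute(
                "SELECT payload FROM jobs_dlq WHERE id = %s", (dlq_id,)
            )
            (payload,) = cur.fetchone()
    try:
        # Dry run: does the current code accept this payload now?
        handle(payload)
    except Exception as exc:
        print(f"still poisonous ({type(exc).__name__}: {exc}); not redriving")
        return
    redrive(dlq_id)
Wiring this up as a --validate flag is a two-line change in the argv handling.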
For Kafka deployments, the same shape applies: a DLQ topic per consumer group, headers carrying original_topic, attempt_count, error_class, and a redrive consumer that reads from the DLQ topic and republishes to the source. The Karafka DLQ docs and the SQS redrive blog both walk through broker-specific variants.
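As a sketch of the publish side only, using confluent-kafka; the topic name, header keys, and broker address here are illustrative assumptions, not a prescribed layout.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})


def send_to_dlq(msg, exc, attempt_count):
    # Republish the message verbatim; the failure context travels in headers.
    producer.produce(
        "payments.dlq",
        value=msg.value(),
        key=msg.key(),
        headers=[
            ("original_topic", msg.topic().encode()),
            ("attempt_count", str(attempt_count).encode()),
            ("error_class", type(exc).__name__.encode()),
        ],
    )
    producer.flush()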
The Friday rule
If your event-driven system does not have these 200 lines or their equivalent, the next poison message is your weekend. Write the redrive script. Page yourself when the DLQ grows fast, not when it is non-empty. Your Friday self will thank your Tuesday self.
If this was useful
The Outbox, Saga, and DLQ chapters of Event-Driven Architecture Pocket Guide cover the failure modes most tutorials skip: partition stalls, redrive loops, ordering guarantees that quietly break under load, and the operational rituals that keep distributed systems shippable. If you've ever had a Friday like the one above, the book is for you.