Dead Letter Queue Triage: The 5 Categories That Cover 95% of Failures

#architecture #eventdriven #devops #reliability

Book: Event-Driven Architecture Pocket Guide: Saga, CQRS, Outbox, and the Traps Nobody Warns You About
Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

Your dead-letter queue has 2,400 messages. Six engineers each open one, shrug, and close the tab. Nobody knows which are safe to replay, which are landmines, which were already fixed three deploys ago. The queue sits there for a month and then somebody clicks "purge all" because the alert noise won. That's the failure mode this post is about.

The fix isn't more dashboards. It's a taxonomy. Five categories cover roughly 95% of what you'll see in a working production system, and once messages are tagged you get one replay protocol per category instead of one judgement call per message.

Why DLQs become graveyards

A DLQ without classification is just a folder of mysteries. Each message needs a human to read the payload, find the original consumer code, guess what went wrong, decide if the root cause is fixed, and pick between replay, edit-and-replay, drop, or escalate. That's ten minutes of careful work per message. At 2,400 messages, you have an entire engineer-month sitting in a queue you ignore.

Three things are missing. No owner: the queue belongs to whoever's on call this week, which means nobody. No taxonomy, so every failure looks bespoke until you've sorted twenty of them. No replay protocol, so even when you know what happened, you don't know what to do next.

The taxonomy is the cheap fix. The other two follow from it.

Category 1: Transient

The handler did everything right and the world hiccupped. DNS resolution timed out for 800ms. The downstream API returned 503 because their pod was rotating. A connection pool was momentarily exhausted. Rate-limit headers came back saying "try again in 12 seconds" and the retry budget ran out.

Concrete examples you'll recognise: requests.exceptions.ConnectionError, psycopg2.OperationalError: server closed the connection unexpectedly, HTTP 429 with Retry-After, gRPC UNAVAILABLE, Kafka NotLeaderForPartitionException.

These messages are safe to replay as-is. The payload is valid, the logic is correct, the only thing that failed is a network or a quota. If the same message lands in the DLQ twice in a row with the same transient cause, your retry budget on the consumer is too small. Fix that and these stop arriving.

Category 2: Schema mismatch

The consumer was built against OrderPlaced v2 and a v3 event landed. Maybe the producer team added a required field and forgot to upcaster the old consumer. Maybe protobuf got bumped, a field got renamed, an enum got a new value the consumer's switch statement doesn't handle. JSON consumers fail with KeyError: 'currency_code'. Avro consumers fail at deserialize. Protobuf consumers silently default fields and then explode three calls later.

The tell: the error happens during decoding or right after, before any business logic runs. Stack traces look like pydantic.ValidationError, json.JSONDecodeError, google.protobuf.message.DecodeError, avro.io.SchemaResolutionException.

Don't replay these blindly. The consumer needs a code change or a schema-registry upcaster before the message becomes processable. Replaying without the fix puts the same message right back in the DLQ.

Category 3: Business-rule rejection

The event arrived clean, decoded fine, and the handler ran the right code path. The business logic said no. OrderShipped for an order that's already in REFUNDED state. PaymentCaptured for an amount that exceeds the auth. InventoryReserved for a SKU that was discontinued yesterday.

These are real events that violate an invariant. Replaying them won't help: the invariant will still reject them tomorrow. Sometimes they represent a real bug upstream (the order service shouldn't have emitted that event). Sometimes they're race conditions you can fix with idempotency keys. Sometimes they're correct behavior, and the event is genuinely stale and dropping it is right.

The action is almost never "replay." It's "investigate, decide, document, drop or compensate." If you find yourself replaying category-3 messages, your downstream is probably tolerating bad state.

Category 4: Poison message

The handler itself has a bug, and a specific input shape trips it. Division by zero when a discount is 100%. An off-by-one when an order has exactly one line item. None propagated through a chain that assumed non-null. A regex that ReDoSes on a long string. An integer overflow on a quantity field nobody thought would exceed 32 bits.

The signature: the same payload fails identically on every retry, and a small change to the payload would make it succeed. Stack traces point at your code, not at infrastructure or decoders.

Replay without a code fix and the message comes right back. Worse, in some setups it can wedge a partition. The consumer keeps picking up the poisonous offset, fails, retries, never advances. Park these in the DLQ, ship the code fix, then replay.

Category 5: Lost context

The event references state that no longer exists. RefundIssued for an order that was hard-deleted last week during a GDPR purge. CommentCreated on a post whose ID got reassigned after a database restore. A foreign key into a tenant that was offboarded. A reference to a feature flag that's been removed.

These look like business-rule rejections but they're worse. There's no row to fix, no compensating action to take. The handler queries the parent entity and gets nothing back, or gets back something with a different tenant, or hits a stale cache that returned data the auth layer no longer permits.

Some are safe to drop with an audit note. Some need a compensating event on the source side. None should be replayed in their current form.

A triage script: auto-classify on arrival

Don't make humans do the first pass. The classifier below tags each message the moment it lands in the DLQ. It uses three signals: the exception class from the consumer, the HTTP status if there was a downstream call, and the payload shape. Tag goes into a message attribute (triage_category) and into a separate index so the on-call dashboard can show counts per category.

# dlq_classifier.py - run on DLQ arrival, tag with triage_category.
# Subscribes to the DLQ topic, classifies, writes the tag back as an
# attribute and emits a metric. Runs as a sidecar per service.

import json
import re
from dataclasses import dataclass
from typing import Literal

Category = Literal[
    "transient", "schema_mismatch", "business_rule",
    "poison", "lost_context", "unknown",
]

TRANSIENT_EXC = {
    "ConnectionError", "Timeout", "OperationalError",
    "ServiceUnavailable", "ThrottlingException",
    "NotLeaderForPartitionException", "RetryError",
}
TRANSIENT_STATUS = {408, 425, 429, 500, 502, 503, 504}

SCHEMA_EXC = {
    "ValidationError", "JSONDecodeError", "DecodeError",
    "SchemaResolutionException", "KeyError", "AttributeError",
}

BUSINESS_PATTERNS = [
    re.compile(r"invariant\s+violat", re.I),
    re.compile(r"already\s+(refunded|shipped|cancelled)", re.I),
    re.compile(r"amount\s+exceeds", re.I),
    re.compile(r"state\s+transition\s+not\s+allowed", re.I),
]

LOST_CONTEXT_PATTERNS = [
    re.compile(r"(order|user|tenant|post)\s+not\s+found", re.I),
    re.compile(r"foreign\s+key\s+constraint", re.I),
    re.compile(r"no\s+such\s+(row|record|entity)", re.I),
]


@dataclass
class FailedMessage:
    payload: dict
    exception_class: str
    exception_message: str
    downstream_status: int | None  # None if no HTTP call involved
    retry_count: int


def classify(msg: FailedMessage) -> Category:
    # Transient first - cheapest check, most common cause.
    if msg.exception_class in TRANSIENT_EXC:
        return "transient"
    if msg.downstream_status in TRANSIENT_STATUS:
        return "transient"

    # Schema before business: decode errors fire before logic runs.
    if msg.exception_class in SCHEMA_EXC:
        return "schema_mismatch"

    # Lost context before business: "not found" is the giveaway.
    for pattern in LOST_CONTEXT_PATTERNS:
        if pattern.search(msg.exception_message):
            return "lost_context"

    # Business rule next: explicit invariant language.
    for pattern in BUSINESS_PATTERNS:
        if pattern.search(msg.exception_message):
            return "business_rule"

    # Poison heuristic: same payload, same exception, >= 3 retries.
    # Tune the threshold to your retry budget.
    if msg.retry_count >= 3 and msg.exception_class not in TRANSIENT_EXC:
        return "poison"

    return "unknown"


def handle(raw: bytes) -> None:
    envelope = json.loads(raw)
    msg = FailedMessage(
        payload=envelope["payload"],
        exception_class=envelope["error"]["class"],
        exception_message=envelope["error"]["message"],
        downstream_status=envelope.get("downstream", {}).get("status"),
        retry_count=envelope.get("retry_count", 0),
    )
    category = classify(msg)
    # Write back to the DLQ index (a tiny Postgres table or
    # whatever your dashboard reads). Don't mutate the original.
    upsert_triage(envelope["message_id"], category)
    emit_metric("dlq.classified", tags={"category": category})

That's about 60 lines and it'll tag 80–90% of your real DLQ traffic correctly the first day you ship it. The unknown bucket is where you tune. Every unknown is either a new pattern to add or a signal your downstream is throwing exception types nobody's catalogued yet.

Replay protocol per category

The classifier earns its keep at replay time. One protocol per category. No judgement calls.

Transient. Bulk replay with exponential backoff. Cap the rate so you don't re-DoS the recovered downstream. If a message lands back in the DLQ as transient again, escalate the retry budget, don't keep replaying.

Schema mismatch. Block replay until a consumer version newer than the event version is deployed. Tag the DLQ message with the producer's schema version. After the deploy, replay only messages older than the deploy timestamp.

Business-rule rejection. Never bulk replay. Open a ticket per message or per pattern. Decide explicitly: drop with audit, compensate (emit a corrective event), or escalate to the product owner.

Poison. Block replay until a fix is shipped. Tag with the commit SHA that introduced the bug if you can identify it. After the fix deploys, replay only messages whose created_at predates the fix.

Lost context. Never replay. Either drop with an audit row, or emit a compensating event on the source service. These are the most dangerous to replay blindly because the failure won't reproduce in staging.

The whole on-call playbook fits on one page once the categories are defined. That's the point.

The gotcha: re-categorising during replay

When you replay a tagged message and it fails again, the new failure can have a different signature. A transient message hits a schema_mismatch because the consumer got upgraded mid-replay. A business_rule rejection turns into lost_context because the parent entity got cleaned up overnight.

The instinct is to overwrite the original tag with the new one. Don't. You'll destroy the audit trail and the next on-call will have no idea why the message was originally parked.

Store the original classification immutably with a timestamp. Add a replay_attempts array where each entry has its own category, exception, and timestamp. The current state is the last entry; the history is the array. When you build dashboards or alerting, key off both. "Messages originally classified X that became Y on replay" is one of the most useful signals you can get about the health of your event flows.

CREATE TABLE dlq_triage (
    message_id        text PRIMARY KEY,
    original_category text NOT NULL,
    original_tagged_at timestamptz NOT NULL DEFAULT now(),
    replay_attempts   jsonb NOT NULL DEFAULT '[]'::jsonb,
    current_category  text NOT NULL,
    resolved          boolean NOT NULL DEFAULT false
);

-- Append a replay attempt without losing the original.
UPDATE dlq_triage
   SET replay_attempts = replay_attempts || jsonb_build_object(
           'category',   $1,
           'exception',  $2,
           'attempted_at', now()
       ),
       current_category = $1
 WHERE message_id = $3;

Five categories. One classifier. Five replay protocols. An audit trail that survives re-tagging. That's the whole system, and it scales from a 200-message DLQ to a 200,000-message DLQ without changing shape.

The hard part isn't writing the code. It's the discipline of refusing to replay until something is classified, and refusing to replay business-rule and lost-context messages at all without a human decision. Most teams skip both steps and end up purging the queue twice a quarter. Don't be that team.

What's the biggest category in your DLQ right now, and what's stopping you from automating the replay protocol for it?

If this was useful

DLQ design is one of the recurring traps in event-driven systems. The queue itself isn't hard, but the operational discipline around it is what separates a working pipeline from a graveyard. The chapter on outbox, replay, and failure handling in the Event-Driven Architecture Pocket Guide walks through this taxonomy plus the surrounding patterns (idempotency keys, poison-pill detection, compensating events) in more depth, with the same opinionated stance: automate the boring decisions, escalate the real ones.