Gabriel Anhaia

Outbox Pattern in Postgres, End to End: Producer, Relayer, Consumer


You charged the card. You sent the payment.captured webhook. Then the database transaction rolled back on a unique-key violation in the orders table, and the customer just got a confirmation email for an order that does not exist. By morning, support has 14 tickets and the finance lead is asking why Stripe and your DB disagree on what happened on April 18 at 02:14 UTC.

This is the dual-write problem: you wrote to Postgres and to Kafka in two separate steps, and one of them landed while the other did not. No retry policy fixes it, because the broker call and the DB commit are not in the same atomic unit. The outbox pattern is the boring fix that has been quietly carrying production event-driven systems for a decade. Below is a complete, runnable version in Postgres: schema, producer, relayer with FOR UPDATE SKIP LOCKED, idempotent consumer, and the failure modes that bite you anyway.

The schema

One table. Same database as your business tables, so you can write to both inside the same transaction.

CREATE TABLE events_outbox (
    id            BIGSERIAL PRIMARY KEY,
    aggregate_id  UUID        NOT NULL,
    type          TEXT        NOT NULL,
    payload       JSONB       NOT NULL,
    created_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
    dispatched_at TIMESTAMPTZ
);

CREATE INDEX events_outbox_pending_idx
    ON events_outbox (created_at)
    WHERE dispatched_at IS NULL;

aggregate_id is the entity the event is about: order id, user id, payment id. Consumers downstream will use it as a partition key so events for the same aggregate stay ordered. The partial index on created_at WHERE dispatched_at IS NULL is what keeps the relayer query cheap as the table grows. Without it, every poll over a 50M-row outbox degrades into a sequential scan, and the relayer's supposedly cheap query becomes the most expensive thing on the box.
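Once you have a psycopg connection (the producer section below uses the same conn object), a quick EXPLAIN confirms the planner actually picks the partial index. A sketch; the exact plan shape depends on your Postgres version and table statistics:

plan = conn.execute(
    """
    EXPLAIN (COSTS OFF)
    SELECT id, aggregate_id, type, payload
    FROM events_outbox
    WHERE dispatched_at IS NULL
    ORDER BY id
    LIMIT 100
    """
).fetchall()
for (line,) in plan:
    print(line)
# Look for events_outbox_pending_idx in the output
# instead of "Seq Scan on events_outbox".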

Notice what is not here. No topic column, no retry_count, no status enum. Keep this table small. The relayer derives the topic from type. Retries are the relayer's problem, not the table's. A status enum invites people to add states ("processing", "failed", "skipped") that turn the outbox into a workflow engine, which is where teams quietly lose the plot.

The producer: one transaction, two writes

The whole point of the pattern is that the business write and the outbox write commit together or not at all. In Python with psycopg:

import json
from uuid import UUID
import psycopg

def place_order(
    conn: psycopg.Connection,
    order_id: UUID,
    customer_id: UUID,
    total_cents: int,
) -> None:
    with conn.transaction():
        conn.execute(
            """
            INSERT INTO orders (id, customer_id, total_cents, status)
            VALUES (%s, %s, %s, 'placed')
            """,
            (order_id, customer_id, total_cents),
        )
        _emit_order_placed(
            conn, order_id, customer_id, total_cents,
        )

The outbox insert is a thin helper that reuses the same connection, so both writes ride the same transaction:

def _emit_order_placed(
    conn: psycopg.Connection,
    order_id: UUID,
    customer_id: UUID,
    total_cents: int,
) -> None:
    payload = json.dumps({
        "order_id": str(order_id),
        "customer_id": str(customer_id),
        "total_cents": total_cents,
    })
    conn.execute(
        """
        INSERT INTO events_outbox (aggregate_id, type, payload)
        VALUES (%s, %s, %s::jsonb)
        """,
        (order_id, "order.placed", payload),
    )

If the order insert fails, the outbox insert never happened. If the outbox insert fails, the order insert is rolled back. There is no path where the customer sees the side-effect of an event the system never published. That is the entire guarantee, and you get it for free as long as both writes share the same connection and transaction.

A few things that look optional but are not. Always serialize the payload at write time, not at relay time. The row should carry the exact bytes you intend to publish. Always include enough data in the payload that the consumer does not need to call back into your service to enrich it; otherwise the event is a tombstone that points at mutable state, and the contract leaks.

The relayer: poll, claim, publish, mark

A separate process (a sidecar, a scheduled job, a long-running consumer) polls the outbox, claims a batch, publishes to the broker, and marks the rows dispatched. The claim uses FOR UPDATE SKIP LOCKED so multiple relayer replicas can run in parallel without stepping on each other.

SELECT id, aggregate_id, type, payload
FROM events_outbox
WHERE dispatched_at IS NULL
ORDER BY id
LIMIT 100
FOR UPDATE SKIP LOCKED;

FOR UPDATE takes a row lock for the length of the transaction. SKIP LOCKED tells Postgres not to wait if a row is already locked; it just walks past it. With three relayer pods running this query concurrently, each one grabs a disjoint batch and the broker fan-out scales horizontally. No coordination service, no lease table, no Zookeeper. (The Postgres SELECT documentation spells out the exact SKIP LOCKED semantics.)

In Python, wrapping it around a Kafka producer:

import psycopg
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "kafka:9092"})

CLAIM_SQL = """
SELECT id, aggregate_id, type,
       payload::text  -- raw JSON string; psycopg decodes bare jsonb into a dict
FROM events_outbox
WHERE dispatched_at IS NULL
ORDER BY id
LIMIT %s
FOR UPDATE SKIP LOCKED
"""

MARK_SQL = """
UPDATE events_outbox
SET dispatched_at = now()
WHERE id = ANY(%s)
"""

The relay loop opens a transaction, claims a batch with FOR UPDATE SKIP LOCKED, publishes each row to Kafka keyed by aggregate_id, then marks the rows dispatched before commit:

def relay_once(
    conn: psycopg.Connection, batch: int = 100,
) -> int:
    with conn.transaction():
        rows = conn.execute(CLAIM_SQL, (batch,)).fetchall()
        if not rows:
            return 0

        ids = []
        for event_id, aggregate_id, etype, payload in rows:
            producer.produce(
                topic=etype.split(".", 1)[0],  # "order.placed" -> topic "order"
                key=str(aggregate_id),
                value=payload,
                headers=[
                    ("event_id", str(event_id).encode()),
                ],
            )
            ids.append(event_id)

        # flush() returns the number of messages still undelivered;
        # raise so the transaction rolls back and the rows are re-claimed.
        if producer.flush(timeout=5) > 0:
            raise RuntimeError("broker publish timed out")

        conn.execute(MARK_SQL, (ids,))
    return len(rows)

Read the order of operations carefully. The transaction stays open across the broker publish. If the publish fails, or the flush times out with messages still queued, the UPDATE never runs, the transaction rolls back, the row locks release, and the next poll picks the rows up again. If the publish succeeds but the UPDATE fails, you have already published, and the row gets re-claimed on the next poll and re-published. At-least-once delivery is the contract. The consumer side is what makes it look exactly-once.

The event_id header is the key to the consumer's idempotency. Include it on every message. It is the only piece of identity that survives a re-publish.
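Turning relay_once into a long-running process is just a loop with a sleep. A minimal sketch, assuming a single connection and a one-second poll; real deployments add reconnection, backoff, and metrics:

import time

import psycopg

def run_relayer(conninfo: str, poll_interval: float = 1.0) -> None:
    # Drain the outbox in batches; sleep only when a poll comes back empty.
    with psycopg.connect(conninfo) as conn:
        while True:
            if relay_once(conn) == 0:
                time.sleep(poll_interval)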

The idempotent consumer

An idempotent consumer processes the same event twice without duplicate observable side-effects. The pattern is a processed_events table on the consumer side, scoped to the consumer name, with a unique constraint on (consumer, event_id).

CREATE TABLE processed_events (
    consumer    TEXT        NOT NULL,
    event_id    BIGINT      NOT NULL,
    processed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (consumer, event_id)
);

The consumer wraps the side-effect and the dedup write in one transaction. If the event has already been processed, the dedup insert hits the primary-key conflict and the handler returns without redoing the work:

def handle_order_placed(
    conn: psycopg.Connection,
    consumer: str,
    event_id: int,
    payload: dict,
) -> None:
    with conn.transaction():
        # ON CONFLICT DO NOTHING instead of catching UniqueViolation:
        # a raised error would abort the surrounding transaction.
        cur = conn.execute(
            """
            INSERT INTO processed_events (consumer, event_id)
            VALUES (%s, %s)
            ON CONFLICT DO NOTHING
            """,
            (consumer, event_id),
        )
        if cur.rowcount == 0:
            return  # already processed; skip the side-effect

        # business side-effect goes inside the same tx
        conn.execute(
            """
            INSERT INTO loyalty_points (customer_id, points)
            VALUES (%s, %s)
            ON CONFLICT (customer_id)
            DO UPDATE SET
              points = loyalty_points.points + EXCLUDED.points
            """,
            (payload["customer_id"], payload["total_cents"] // 100),
        )

Two rules to keep the contract honest. The dedup write and the side-effect must share a transaction. Otherwise a crash between them can make the event look processed while the work is missing. And the side-effect itself should be idempotent at the data layer too (ON CONFLICT, upserts, conditional updates) so a partial replay during a Kafka rebalance does not corrupt state.
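For completeness, here is one way to wire the handler to a Kafka consumer, pulling the event_id header off each message. A sketch with illustrative names (the "loyalty" group id, the "order" topic); offsets are committed manually, only after the handler's transaction commits:

import json

import psycopg
from confluent_kafka import Consumer

def consume_loop(conn: psycopg.Connection) -> None:
    consumer = Consumer({
        "bootstrap.servers": "kafka:9092",
        "group.id": "loyalty",          # doubles as the dedup scope
        "enable.auto.commit": False,    # commit offsets by hand
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["order"])
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        headers = dict(msg.headers() or [])
        event_id = int(headers["event_id"].decode())
        handle_order_placed(
            conn, "loyalty", event_id, json.loads(msg.value()),
        )
        consumer.commit(message=msg)  # offset moves only after the DB tx

If the process dies between the database commit and the offset commit, the message is redelivered and the processed_events check swallows it. That is exactly the replay path the dedup table exists for.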

Failure modes that still bite you

The pattern fixes the dual-write problem. It does not fix the laws of physics.

Reordering across aggregates. If two relayer replicas pick up rows 100 and 101 at the same time and the broker accepts 101 first, downstream consumers see 101 before 100. As long as both rows belong to different aggregate_id values, this is fine. Kafka partitioning by aggregate_id keeps per-entity order. Across aggregates, you have no ordering guarantee, and any consumer that assumes global event order is wrong. Single-replica relayer plus a single Kafka partition is the only way to get total order, and it caps your throughput at one stream.

Dispatch lag. The relayer polls every N seconds. Between commit and publish there is a window where the row exists in the database but no event has reached Kafka. Any system that reads the DB and a downstream projection in the same breath will see a state the events have not announced. Document the lag, set an SLO on it, and alert when now() - min(created_at) over rows WHERE dispatched_at IS NULL crosses the threshold; the oldest undispatched row is what defines your lag, not the newest. (SeatGeek's outbox writeup puts the working number around 1–5 seconds.)
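The lag gauge itself is one query. A sketch, with the metric plumbing left out:

LAG_SQL = """
SELECT COALESCE(
    EXTRACT(EPOCH FROM now() - min(created_at)), 0
) AS lag_seconds
FROM events_outbox
WHERE dispatched_at IS NULL
"""

# Export as a gauge; alert when it crosses your SLO.
lag_seconds = conn.execute(LAG_SQL).fetchone()[0]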

Stuck rows. Sometimes a row never dispatches. The producer wrote a payload that exceeds the broker's max message size. The producer wrote a type that no topic resolves to. The relayer keeps claiming, failing on, and releasing the same poison row on every poll. The fix is a watchdog query that runs every minute:

SELECT id, type, age(now(), created_at) AS waiting
FROM events_outbox
WHERE dispatched_at IS NULL
  AND created_at < now() - interval '5 minutes'
ORDER BY id
LIMIT 50;

Page on rows that have been waiting more than five minutes. The vast majority of the time, the alert catches a misconfigured topic or a payload-too-large situation before the customer-facing system feels it.

Outbox table bloat. Successfully dispatched rows pile up. A nightly job that deletes rows where dispatched_at < now() - interval '7 days' keeps the table small enough that the partial index stays warm in cache. Do not delete rows synchronously inside the relayer transaction. You will fight your own vacuum process and the tail latency goes feral.
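A batched purge keeps each DELETE short, so it never blocks the relayer or fights vacuum for long. A sketch of the nightly job; the batch size is a tuning knob, not gospel:

PURGE_SQL = """
DELETE FROM events_outbox
WHERE id IN (
    SELECT id FROM events_outbox
    WHERE dispatched_at < now() - interval '7 days'
    ORDER BY id
    LIMIT 10000
)
"""

def purge_dispatched(conn: psycopg.Connection) -> None:
    # Delete in bounded batches, committing between each, so no single
    # transaction holds locks on millions of dead rows.
    while True:
        with conn.transaction():
            deleted = conn.execute(PURGE_SQL).rowcount
        if deleted == 0:
            break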

The outbox is not glamorous. It is a single table, a polling loop, and a unique index on the consumer side, and it has been quietly outperforming most of the distributed-transaction frameworks built to replace it. Get the schema right, keep the relayer dumb, make the consumer idempotent at the data layer, and put alerts on the three failure modes above. Everything else is decoration.


If this was useful

The Outbox chapter of the Event-Driven Architecture Pocket Guide goes deeper into the trade-offs between polling relayers and Debezium-style log tailing, when to switch from a single outbox to per-aggregate streams, and the schema-evolution rules that keep five-year-old consumers from breaking when you rename a field. If you are running outbox in production and the failure modes section above sounded too familiar, the rest of the book is for you.

