Gabriel Anhaia

200-Line Outbox Pattern: ~90% Fewer Mystery Bugs


You saved the order in Postgres. The transaction committed at 14:02:11. Then you called kafka.Produce("order-placed", payload). The broker was rebalancing. The call timed out. Your handler logged the error and returned. The order is in the database. The event is not on the topic. Three downstream services never hear about it: billing, fulfillment, customer-emails. The customer's card was charged. Their warehouse pick never happens. Your on-call gets paged at 3am with a ticket titled "where is my order."

This is the dual-write problem, and an engineer I know spent six months chasing it before someone gave it a name. They called the resulting class of incidents "mystery bugs." In their retrospective, the engineer estimated that roughly 90% of them disappeared the day they shipped a transactional outbox. The technique has a canonical writeup on microservices.io by Chris Richardson and a Confluent course on the same pattern. The implementation is small enough to fit in your head. This post walks the 200-line Go version end-to-end.

Why dual writes break

The shape of the bug is always the same. Two systems. One transaction in your head, two transactions in reality. Whatever you wrote second can fail while the thing you wrote first is durable.

// The bug.
func PlaceOrder(ctx context.Context, db *sql.DB, k Producer, o Order) error {
    if _, err := db.ExecContext(ctx,
        "INSERT INTO orders (id, total, status) VALUES ($1, $2, 'placed')",
        o.ID, o.Total); err != nil {
        return err
    }
    // Anything below this line that fails leaves orders inconsistent.
    return k.Produce("order-placed", o.Bytes())
}

If the Produce call fails, you can retry. But the retry is in a different transaction context than the database write, and your service can crash between the commit and the retry. If you swap the order (publish first, then write), you can commit a phantom event for an order that never lands. There is no ordering of two independent writes that gives you atomicity.

The outbox pattern reframes the problem. You only do one durable write, to your own database. You write the business row and the event row in the same transaction. A separate process reads the event row and publishes it to Kafka, then marks the row as sent. If the relay crashes mid-publish, the row stays unsent, the relay restarts, and the event eventually makes it. If Kafka is down for an hour, the rows pile up in the outbox; when Kafka comes back, the relay drains it.

You trade synchronous publish for at-least-once delivery. The Decodable revisit of the pattern names this trade explicitly: events arrive milliseconds to seconds after commit, but they always arrive.

The schema

Two tables. Your business table (orders, in this example) and the outbox.

CREATE TABLE orders (
    id      uuid PRIMARY KEY,
    total   numeric(12, 2) NOT NULL,
    status  text NOT NULL,
    placed_at timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE outbox (
    id              bigserial PRIMARY KEY,
    aggregate_type  text NOT NULL,
    aggregate_id    uuid NOT NULL,
    event_type      text NOT NULL,
    payload         jsonb NOT NULL,
    idempotency_key uuid NOT NULL UNIQUE,
    created_at      timestamptz NOT NULL DEFAULT now(),
    sent_at         timestamptz
);

CREATE INDEX outbox_unsent
    ON outbox (created_at)
    WHERE sent_at IS NULL;

The partial index on unsent rows is the trick that keeps the relay's polling query cheap forever. The idempotency_key is a UUID generated when the row is inserted; downstream consumers use it to deduplicate, because at-least-once means at-some-point-twice.

The atomic write

The producer side is a Go function that writes the order and the event in the same transaction. If anything fails, the whole thing rolls back.

package outbox

import (
    "context"
    "database/sql"
    "encoding/json"

    "github.com/google/uuid"
)

type Event struct {
    AggregateType  string
    AggregateID    uuid.UUID
    EventType      string
    Payload        any
    IdempotencyKey uuid.UUID
}

func Append(ctx context.Context, tx *sql.Tx, e Event) error {
    body, err := json.Marshal(e.Payload)
    if err != nil {
        return err
    }
    _, err = tx.ExecContext(ctx, `
        INSERT INTO outbox
            (aggregate_type, aggregate_id, event_type,
             payload, idempotency_key)
        VALUES ($1, $2, $3, $4, $5)
    `, e.AggregateType, e.AggregateID, e.EventType,
        body, e.IdempotencyKey)
    return err
}

And the place-order handler that uses it:

func PlaceOrder(ctx context.Context, db *sql.DB, o Order) error {
    tx, err := db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    defer tx.Rollback()

    if _, err := tx.ExecContext(ctx, `
        INSERT INTO orders (id, total, status)
        VALUES ($1, $2, 'placed')
    `, o.ID, o.Total); err != nil {
        return err
    }

    err = outbox.Append(ctx, tx, outbox.Event{
        AggregateType:  "Order",
        AggregateID:    o.ID,
        EventType:      "order-placed",
        Payload:        o,
        IdempotencyKey: uuid.New(),
    })
    if err != nil {
        return err
    }
    return tx.Commit()
}

The order row and the event row commit together or not at all. There is no window where the order exists without the event, and no window where the event exists without the order.
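You can check that guarantee with a quick integration test: do both writes, roll back instead of committing, and assert that nothing survived. A minimal sketch, in the same elided style as the handler above; openTestDB is a hypothetical helper returning a *sql.DB pointed at a test database.

func TestOrderAndEventRollBackTogether(t *testing.T) {
    ctx := context.Background()
    db := openTestDB(t) // hypothetical helper returning a *sql.DB

    tx, err := db.BeginTx(ctx, nil)
    if err != nil {
        t.Fatal(err)
    }
    id := uuid.New()
    if _, err := tx.ExecContext(ctx,
        "INSERT INTO orders (id, total, status) VALUES ($1, 10.00, 'placed')",
        id); err != nil {
        t.Fatal(err)
    }
    if err := outbox.Append(ctx, tx, outbox.Event{
        AggregateType:  "Order",
        AggregateID:    id,
        EventType:      "order-placed",
        Payload:        map[string]string{"id": id.String()},
        IdempotencyKey: uuid.New(),
    }); err != nil {
        t.Fatal(err)
    }
    // Roll back instead of committing: both rows must vanish together.
    if err := tx.Rollback(); err != nil {
        t.Fatal(err)
    }

    var n int
    if err := db.QueryRowContext(ctx,
        "SELECT count(*) FROM outbox WHERE aggregate_id = $1",
        id).Scan(&n); err != nil {
        t.Fatal(err)
    }
    if n != 0 {
        t.Fatalf("want 0 outbox rows after rollback, got %d", n)
    }
}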

The relay

A separate goroutine (or a separate process, depending on your deploy model) polls the outbox, publishes anything unsent, and marks it sent. The polling is cheap because of the partial index on sent_at IS NULL.

package outbox

import (
    "context"
    "database/sql"
    "log"
    "time"
)

type Producer interface {
    Produce(ctx context.Context, topic string,
        key, payload []byte, headers map[string]string) error
}

type Relay struct {
    DB        *sql.DB
    Producer  Producer
    BatchSize int
    Tick      time.Duration
}

func (r *Relay) Run(ctx context.Context) error {
    t := time.NewTicker(r.Tick)
    defer t.Stop()
    for {
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-t.C:
            if err := r.drain(ctx); err != nil {
                log.Printf("outbox drain: %v", err)
            }
        }
    }
}

func (r *Relay) drain(ctx context.Context) error {
    for {
        n, err := r.publishBatch(ctx)
        if err != nil {
            return err
        }
        if n == 0 {
            return nil
        }
    }
}

The actual publish loop uses SELECT ... FOR UPDATE SKIP LOCKED so multiple relay replicas can run safely without stepping on each other.

func (r *Relay) publishBatch(ctx context.Context) (int, error) {
    tx, err := r.DB.BeginTx(ctx, nil)
    if err != nil {
        return 0, err
    }
    defer tx.Rollback()

    rows, err := tx.QueryContext(ctx, `
        SELECT id, aggregate_id, event_type,
               payload, idempotency_key
        FROM outbox
        WHERE sent_at IS NULL
        ORDER BY id
        LIMIT $1
        FOR UPDATE SKIP LOCKED
    `, r.BatchSize)
    if err != nil {
        return 0, err
    }
    type row struct {
        id      int64
        aggID   string
        evType  string
        payload []byte
        idemKey string
    }
    var batch []row
    for rows.Next() {
        var rr row
        if err := rows.Scan(&rr.id, &rr.aggID, &rr.evType,
            &rr.payload, &rr.idemKey); err != nil {
            rows.Close()
            return 0, err
        }
        batch = append(batch, rr)
    }
    rows.Close()
    if err := rows.Err(); err != nil {
        return 0, err
    }
    if len(batch) == 0 {
        return 0, tx.Commit()
    }

The SELECT locks each candidate row for the duration of this transaction. Other relay workers see those rows but skip them, which is what makes horizontal scaling free here. Once the batch is in memory, we publish each row to Kafka and then update sent_at in the same transaction.

    // If a publish fails partway through, the transaction rolls back and
    // the whole batch retries next tick; consumer-side dedup absorbs the
    // duplicates. This is the at-least-once part.
    for _, b := range batch {
        headers := map[string]string{
            "idempotency-key": b.idemKey,
        }
        if err := r.Producer.Produce(ctx, b.evType,
            []byte(b.aggID), b.payload, headers); err != nil {
            return 0, err
        }
    }

    ids := make([]int64, len(batch))
    for i, b := range batch {
        ids[i] = b.id
    }
    // Driver note: pgx's database/sql driver binds []int64 as a Postgres
    // array here; with lib/pq you would pass pq.Array(ids) instead.
    if _, err := tx.ExecContext(ctx, `
        UPDATE outbox
        SET sent_at = now()
        WHERE id = ANY($1)
    `, ids); err != nil {
        return 0, err
    }
    return len(batch), tx.Commit()
}

That is the full thing. The total surface area, including the schema, sits comfortably under 200 lines. No framework. No CDC connector. No replication slot. You can read every line in a code review.

The SKIP LOCKED matters. It lets you run two relays for redundancy without inventing a leader-election scheme. Each relay grabs different rows; if one dies mid-batch, its locks are released and the other relay picks them up on the next tick.
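For completeness, here is what wiring it up can look like. This is a sketch, not part of the 200 lines: the pgx stdlib driver and the yourmodule import path are assumptions, and logProducer is a hypothetical stand-in for a real Kafka client adapter.

package main

import (
    "context"
    "database/sql"
    "log"
    "time"

    _ "github.com/jackc/pgx/v5/stdlib" // assumed driver; any database/sql Postgres driver works

    "yourmodule/outbox" // hypothetical module path for the package above
)

// logProducer satisfies outbox.Producer; swap in your Kafka client adapter.
type logProducer struct{}

func (logProducer) Produce(ctx context.Context, topic string,
    key, payload []byte, headers map[string]string) error {
    log.Printf("publish topic=%s key=%s bytes=%d", topic, key, len(payload))
    return nil
}

func main() {
    db, err := sql.Open("pgx", "postgres://localhost/app?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }

    relay := &outbox.Relay{
        DB:        db,
        Producer:  logProducer{},
        BatchSize: 100,
        Tick:      200 * time.Millisecond,
    }
    // A second replica can run the same loop; SKIP LOCKED keeps the two
    // from publishing the same rows.
    log.Fatal(relay.Run(context.Background()))
}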

Idempotency on the consumer side

At-least-once means the same event can land twice. Consumers deduplicate using the idempotency-key header. Kafka and most other brokers carry per-message headers; the relay sets this one from the outbox row.

// Consumer side, sketched.
func handleOrderPlaced(ctx context.Context, db *sql.DB, msg Message) error {
    key := msg.Headers["idempotency-key"]
    var seen bool
    err := db.QueryRowContext(ctx, `
        SELECT EXISTS (SELECT 1 FROM processed_events WHERE id = $1)
    `, key).Scan(&seen)
    if err != nil {
        return err
    }
    if seen {
        return nil // already handled, ack
    }
    // ... do the work, then record the key in the same transaction
}

Without the dedup table on the consumer, at-least-once is a foot-gun. With it, redelivery becomes a non-event.
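The select-then-act check above has a small race window if two deliveries of the same event arrive concurrently. A tighter variant claims the key with an insert inside the same transaction as the work, so the claim and the work commit atomically. A sketch, assuming a processed_events table with a uuid primary key:

func handleOrderPlacedOnce(ctx context.Context, db *sql.DB, msg Message) error {
    tx, err := db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    defer tx.Rollback()

    // Claim the key first; a duplicate delivery inserts zero rows.
    res, err := tx.ExecContext(ctx, `
        INSERT INTO processed_events (id) VALUES ($1)
        ON CONFLICT (id) DO NOTHING
    `, msg.Headers["idempotency-key"])
    if err != nil {
        return err
    }
    if n, _ := res.RowsAffected(); n == 0 {
        return nil // already handled, ack
    }

    // ... do the work on tx here; it commits atomically with the claim.

    return tx.Commit()
}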

The gotchas nobody warns you about

A few of these took the team I know an embarrassing number of postmortems to learn. The Conduktor outbox writeup and event-driven.io's deep dive into Postgres outbox ordering catalogue these failure modes in detail.

LISTEN/NOTIFY is not a polling replacement. Postgres LISTEN/NOTIFY looks like a free latency win — the relay subscribes, the writer notifies, the relay drains immediately. The catch is that notifications are not durable. If the relay is offline when the NOTIFY fires, that signal is gone. You still need a polling fallback for catch-up after restart and for any window where the listener is disconnected. The right architecture is NOTIFY as a wakeup hint and polling as the source of truth, not the other way around.
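A sketch of that shape, assuming pgx v5 (github.com/jackc/pgx/v5) for a dedicated listening connection; notifications are per-connection, so this cannot go through a pool. The writer fires SELECT pg_notify('outbox_wakeup', '') inside the PlaceOrder transaction, and the relay's select loop gains a case <-wake: alongside the ticker.

func listenForWakeups(ctx context.Context, dsn string, wake chan<- struct{}) error {
    conn, err := pgx.Connect(ctx, dsn) // dedicated connection for LISTEN
    if err != nil {
        return err
    }
    defer conn.Close(ctx)

    if _, err := conn.Exec(ctx, "LISTEN outbox_wakeup"); err != nil {
        return err
    }
    for {
        if _, err := conn.WaitForNotification(ctx); err != nil {
            return err // caller reconnects; polling covers the gap
        }
        select {
        case wake <- struct{}{}:
        default: // a wakeup is already pending; dropping this one is safe
        }
    }
}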

Logical replication trades polling for slot management. Debezium-style CDC reads from the WAL via a replication slot and is genuinely lower latency than polling. The trade is operational. A replication slot that falls behind retains WAL files indefinitely, and unbounded WAL retention is how you fill a Postgres data volume at 4am. If you go this route, monitor slot lag the way you monitor disk space, because a stuck slot will eventually become a full disk.
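A starting point for that monitoring, using the standard pg_replication_slots view:

SELECT slot_name,
       active,
       pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots;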

Replication lag is not zero. If your relay reads from a Postgres read replica to take load off the primary, you will get gaps. The replica may not yet see the row the primary committed two seconds ago. The relay will skip it, the row sits unsent until the next poll, and the latency is whatever the replica lag is. Either point the relay at the primary or accept lag-shaped delivery latency.

Ordering across transactions is not guaranteed. If you depend on event A landing on the topic before event B because A's transaction committed first, you are about to be disappointed. Two transactions issuing nextval('outbox_id_seq') at the same time can be assigned IDs in one order and commit in the other, so the relay's ORDER BY id is not commit order. With two relay workers, concurrent batches can also publish out of sequence. If you need ordering, partition by aggregate_id (the Kafka key, which the relay above already sets) so events for the same aggregate land on the same partition and stay in order there. Cross-aggregate ordering is not a guarantee Kafka or this pattern will give you.

Outbox table growth is your problem. A high-throughput service can write millions of outbox rows a day. If you never delete them, the partial index stays cheap (it filters on sent_at IS NULL) but the table itself bloats. Add a periodic DELETE FROM outbox WHERE sent_at < now() - interval '7 days' and a Postgres VACUUM strategy that keeps up. Or partition the table by date and drop old partitions. Either way, plan for it before the table is 80GB.
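A sketch of that retention job, batched so a week of backlog never becomes one multi-minute delete holding locks; run it from cron or pg_cron and repeat until it deletes zero rows.

DELETE FROM outbox
WHERE id IN (
    SELECT id
    FROM outbox
    WHERE sent_at < now() - interval '7 days'
    ORDER BY id
    LIMIT 10000
);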

Why the team I know calls them "mystery bugs"

The specific incidents that disappeared after the outbox shipped had a common shape. Order placed, no email. Refund issued, no audit log entry. User signed up, no welcome message. The database had the right state, the downstream system had the wrong state, and the logs were clean: the publish call had succeeded, except the one time it had not, and that one time was the case the on-call was looking for.

The fix changes nothing about your business logic. The contract with downstream services is still "you receive an event when an order is placed." What changes is that the contract is now true.

200 lines and the right schema buy you that, without a library, a vendor, or a CDC pipeline. The complicated version exists. It is called Debezium, and it is excellent at what it does. But most teams do not need it for their first three years of event-driven traffic. They need the version that fits in their head, that they can debug at 3am, that does not introduce a second failure mode bigger than the one it fixed.

If you ship messages between services and you have not done this yet, this is the week.


If this was useful

The Event-Driven Architecture Pocket Guide covers the rest of the patterns that compose with the outbox: sagas, CQRS, idempotent consumers, ordering strategies, and the failure modes (the real ones, the ones that page you) that come with each. If "mystery bugs" describes any portion of your week, this is the book it was written for.

