Yasir Jafri

Posted on Jun 28 • Originally published at yasir323.hashnode.dev

The Transactional Outbox Pattern: Solving the Dual-Write Problem

#python #distributedsystems #systemdesign #microservices

There's a particular kind of bug I've learned to be afraid of. It doesn't throw an exception. It doesn't show up in your tests. It passes code review, ships on a Friday, and works perfectly for three weeks. Then one night a customer pays for an order, gets charged, and never receives a confirmation. Support can't reproduce it. The logs look clean. The order is right there in the database.

What happened is that your service did two things that looked like one thing:

async def create_order(payload):
    await db.save(order)  # 1. write to your database
    await broker.publish("order.created", evt)  # 2. notify

These two lines are the most innocent-looking distributed systems bug I know. This article is about why they're broken, why none of the "obvious" fixes actually work, and how the transactional outbox pattern fixes them properly, including the parts of the pattern that the tidy architecture diagrams quietly leave out.

The dual-write problem

Step back and look at what those two lines actually do. They write to two different systems, your database and your message broker, and there is no transaction spanning both. The database doesn't know the broker exists. The broker doesn't know the database exists. You are hoping both succeed, and hope is not a delivery guarantee.

Here's the failure that bites people:

The order is safely in your database. The event was never published. Every downstream system: the email service, the analytics pipeline, the ledger, the warehouse, etc have no idea this order exists. Nothing errored. Nothing retried. The event simply evaporated.

And crashing isn't the only way to lose. Walk through the combinations:

DB commits, then crash before publish → order exists, event lost (the case above).
Publish "succeeds" but the ack times out → you retry and publish a duplicate.
You flip the order - publish first, then save → crash in between, and now you've announced an order that doesn't exist.
Publish succeeds, then the DB transaction rolls back → you've told the world about something that never happened.

There is no ordering of two independent writes that survives a crash in the gap between them. That gap is the whole problem, and it has a name: the dual-write problem.

Why the obvious fixes don't work

When engineers first hit this, they reach for one of three fixes. Each one is worth killing explicitly, because each feels like it should work.

"I'll just reorder them / wrap them in try/except." Reordering moves the gap; it doesn't close it. A try/except around the publish lets you detect a failure, but what then? You can't un-commit the database row - it's already committed. You're left logging "publish failed" and... what, hoping someone reads the log? You've turned a silent failure into a slightly louder one.

"I'll use a two-phase commit (2PC) / distributed transaction." In theory, a distributed transaction coordinator could make the DB write and the broker publish atomic. In practice: most modern brokers (Kafka, SQS, Redis Streams) don't support XA transactions with your database, 2PC is operationally miserable, it couples the availability of two systems together, and it scales badly. The industry walked away from distributed transactions for good reasons. This is not the road back.

"I'll add a retry queue / background job that republishes." Now you're getting warmer - you've intuited that the event needs to be durable somewhere so it can be retried. But where do you put it durably? If you put it in a separate queue, you're back to a dual write (save to DB and save to queue). The insight you're circling is: it needs to be durable in the same place as the order.

That last instinct, taken to its conclusion, is the outbox pattern.

The core idea: turn two writes into one

The transactional outbox pattern rests on a single move: stop publishing to the broker inside your request. Instead, write the event as a row in your own database, in the same transaction as the business data.

Your database already gives you atomicity for writes to itself. Two INSERTs in one transaction either both commit or both roll back - that's the one guarantee you can actually lean on. So we use it:

Two parts:

The write side records the event atomically with the business change. No broker call in the request path. If the transaction commits, the event is guaranteed to exist. If it rolls back, so does the event. The dual write is gone.
A separate relay worker reads unpublished events from the outbox table and forwards them to the broker, marking each as published once the broker confirms. If the relay crashes, it just re-reads the table on restart. Nothing is lost, because the events live in durable, transactional storage.

The outbox table is, in effect, a crash-proof queue that happens to live inside your transactional database. That's the entire trick. Everything else is detail - but the details are where this pattern earns its keep, so let's build it.

The write side, in code

First, the table. Nothing exotic, it's a log of things that need to be published:

from datetime import datetime
from uuid import UUID, uuid4

from sqlalchemy import DateTime, Integer, String, func
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class OutboxEvent(Base):
    __tablename__ = "outbox_event"

    id: Mapped[UUID] = mapped_column(primary_key=True, default=uuid4)
    aggregate_id: Mapped[str] = mapped_column(String, index=True)
    topic: Mapped[str] = mapped_column(String)
    payload: Mapped[dict] = mapped_column(JSONB)

    status: Mapped[str] = mapped_column(String, default="pending")
    attempts: Mapped[int] = mapped_column(Integer, default=0)
    max_attempts: Mapped[int] = mapped_column(Integer, default=8)

    available_at: Mapped[datetime] = mapped_column(
        DateTime(timezone=True), server_default=func.now()
    )
    created_at: Mapped[datetime] = mapped_column(
        DateTime(timezone=True), server_default=func.now()
    )
    published_at: Mapped[datetime | None] = mapped_column(
        DateTime(timezone=True), nullable=True
    )
    last_error: Mapped[str | None] = mapped_column(String, nullable=True)

Hold on to aggregate_id, attempts, and available_at - each one earns its place later.

Now the write itself. The non-negotiable rule: the event row must be inserted in the same transaction as the business row. In SQLAlchemy, that means both adds happen inside one session.begin() block:

from fastapi import Depends, FastAPI
from sqlalchemy.ext.asyncio import AsyncSession

app = FastAPI()


@app.post("/orders")
async def create_order(
    req: CreateOrderRequest,
    session: AsyncSession = Depends(get_session),
):
    order = Order(id=uuid4(), total_paise=req.total_paise)

    async with session.begin():  # one transaction...
        session.add(order)  # ...the business row...
        session.add(
            OutboxEvent(  # ...and the event, together.
                aggregate_id=str(order.id),
                topic="orders.created",
                payload={
                    "order_id": str(order.id),
                    "total_paise": req.total_paise,
                },
            )
        )
    # commit happens here - order and event, atomically or not at all

    return {"id": str(order.id)}

That's the heart of the pattern, and it's almost anticlimactically simple. There's no broker in this handler at all. The request is fast (no network call to Kafka/SQS) and it cannot half-succeed. If the server is struck by lightning right after the commit, the order and its event are both safely on disk.

A small but real discipline: resist the temptation to "just publish here too, for low latency." The moment you add a broker call back into this transaction, you've reintroduced the dual write. Latency is the relay's job now.

The relay: a small worker with a hard job

The relay is a separate, long-running process. Conceptually its loop is trivial:

The naive implementation writes itself:

async def run_once(session_factory, publisher):
    async with session_factory() as session:
        async with session.begin():
            events = (
                await session.execute(
                    select(OutboxEvent)
                    .where(OutboxEvent.status == "pending")
                    .limit(100)
                )
            ).scalars().all()

            for event in events:
                await publisher.publish(event)
                event.status = "published"

This works on your laptop. It will hurt you in production, and the reasons it hurts you are exactly the four questions that separate "I read a blog post" from "I've run this." Let's take them one at a time.

1. How do you run more than one relay?

You will want more than one relay process - for availability (if the only relay dies, events stop flowing) and for throughput (one process may not keep up). So picture two relays running that naive loop at the same instant. Both execute SELECT ... WHERE status = 'pending' LIMIT 100. Both get the same 100 rows. Both publish them. Every consumer receives every event twice.

A plain SELECT takes no locks, so the database has no idea these two relays are fighting over the same work.

The fix is a feature Postgres has had since v9.5 and that I genuinely love: SELECT ... FOR UPDATE SKIP LOCKED.

from sqlalchemy import func, select


async def claim_batch(session, batch_size: int = 100):
    stmt = (
        select(OutboxEvent)
        .where(OutboxEvent.status == "pending")
        .where(OutboxEvent.available_at <= func.now())
        .order_by(OutboxEvent.aggregate_id, OutboxEvent.created_at)
        .limit(batch_size)
        .with_for_update(skip_locked=True)   # the magic
    )
    return (await session.execute(stmt)).scalars().all()

FOR UPDATE locks the rows you select so no other transaction can grab them. But on its own, FOR UPDATE makes the second relay block - it waits for the first relay's transaction to finish, and your two processes serialize into one. You paid for concurrency and got a queue.

SKIP LOCKED is the difference. It tells Postgres: "lock the rows I select, and skip any rows that are already locked by someone else - give me the next available ones instead." Now:

Relay A claims and locks rows 1–100.
Relay B runs the same query, sees 1–100 are locked, skips them, and claims 101–200.
Each publishes its own disjoint batch and commits, releasing its locks.

No row is ever claimed twice, and the relays run fully in parallel. This one clause turns the outbox table into a safe concurrent work queue - it's the same mechanism that Postgres-backed job queues use under the hood. If you take one piece of SQL away from this article, make it this one.

2. How do you keep events in order?

Order #42 emits three events over its life: order.created, order.paid, order.shipped. A consumer that sees shipped before created is in a corrupted state. So events for the same entity must arrive in the order they were produced.

The key phrase is for the same entity. Order #42's events have no ordering relationship with order #99's events - they're independent. Forcing a single global order across all events would serialize your entire system for no reason. So the guarantee we actually want is per-aggregate ordering: events sharing an aggregate_id are delivered in production order; everything else is free to be parallel.

That's why the claim query above has ORDER BY aggregate_id, created_at. But here's the subtlety that trips people up, and it's where Questions 1 and 2 collide: SKIP LOCKED happily lets Relay A grab order.created for #42 while Relay B grabs order.paid for #42 in a different batch. Now two relays are racing to publish the same aggregate's events, and B might win - shipping paid before created. The very concurrency we wanted in Question 1 just broke the ordering we wanted in Question 2.

The reconciliation: make the aggregate, not the row, the unit of concurrency. A single relay must own all currently-pending events for a given aggregate_id at once. In practice you either claim by aggregate (lock every pending row for a set of aggregates together and publish each aggregate's events in sequence) or you partition aggregates across relays by hashing aggregate_id, so a given aggregate is only ever handled by one relay. Different aggregates still flow in parallel - that's your throughput - but a single aggregate is always strictly serial.

I call this out because it's the most important design decision in the whole pattern, and it's the one most tutorials skip. "SKIP LOCKED gives you concurrency; per-aggregate ordering constrains it; you reconcile the two by claiming aggregates rather than rows." Say that in a design review and people will know you've actually run this thing.

(If your domain genuinely doesn't need ordering - say, fire-and-forget notifications and then skip all of this and claim rows freely. Ordering is a cost; only pay it when you need it.)

3. What about events that keep failing?

Publishes fail. The trick is realizing there are two kinds of failure and they need opposite treatment.

Transient failures: the broker is briefly down, a network blip, a rate limit. These succeed if you try again later. But "try again" needs manners: retrying instantly and forever just hammers a broker that's already struggling. So you back off exponentially - wait 1s, then 2s, 4s, 8s and you express that in the data, not with sleep():

from datetime import datetime, timedelta, timezone


def next_attempt_at(attempts: int) -> datetime:
    delay = min(2 ** attempts, 300)  # cap at 5 minutes
    return datetime.now(timezone.utc) + timedelta(seconds=delay)

On a failure you bump attempts and set available_at = next_attempt_at(attempts). Remember the claim query filters available_at <= now() - so a failed row simply becomes invisible until its cooldown passes. Backoff falls out of the schema for free. No timers, no background sleeps, nothing to coordinate across relays.

Permanent failures (poison messages): the event is malformed, points at a topic that doesn't exist, or trips a bug. No number of retries will ever help. And this one is genuinely dangerous, because of Question 2: if you enforce per-aggregate ordering, you cannot publish later events for that aggregate until the stuck one succeeds. So one poison event doesn't just fail itself - it freezes that customer's entire event stream. Forever. This is head-of-line blocking, and it's the kind of thing that takes down a feature at 3am.

The fix is a dead-letter table plus a max_attempts cap. After N attempts, you stop retrying, move the row out of the hot path into outbox_dead_letter (with the failure reason and a timestamp), and let the queue behind it flow again:

async def run_once(session_factory, publisher, batch_size: int = 100):
    async with session_factory() as session:
        async with session.begin():
            events = await claim_batch(session, batch_size)

            for event in events:
                try:
                    await publisher.publish(event)
                    event.status = "published"
                    event.published_at = datetime.now(timezone.utc)
                except Exception as exc:  # noqa: BLE001
                    event.attempts += 1
                    event.last_error = str(exc)[:500]
                    if event.attempts >= event.max_attempts:
                        event.status = "dead"  # → dead-letter
                    else:
                        event.available_at = next_attempt_at(event.attempts)
        # COMMIT here releases the row locks held by SKIP LOCKED

Anything landing in the dead-letter table should page someone - it means a real bug needs a human. The point of the DLQ isn't to hide failures; it's to contain them so a single bad event can't hold the rest hostage.

(One trade-off baked into the loop above: we hold the row locks for the whole publish call. Simple and correct. If your broker is slow and you want shorter locks, the variation is to flip rows to a publishing state and commit quickly before publishing - slightly more bookkeeping, shorter lock windows. Start simple; optimize when a benchmark tells you to.)

4. What can you honestly promise? (Spoiler: not exactly-once)

This is the part I care about most, because it's where engineers are tempted to lie - sometimes to others, often to themselves.

Look closely at the success path: the relay publishes to the broker, the broker acks, then the relay writes status = 'published' and commits. Now crash the relay in between - after the ack, before the commit:

The event reached the broker. But the row never got marked published, so on restart the relay finds it still pending and sends it again. A duplicate.

Here's the uncomfortable truth: you cannot close this window. Closing it would require the broker-publish and the DB-update to be one atomic transaction across two systems - which is the dual-write problem all over again, the exact thing this whole pattern exists to avoid. It's turtles all the way down. There is no atomic "publish-and-record" spanning a broker and a database.

So you don't get to eliminate the failure; you only get to choose it:

Mark published before publishing → crash in the window and the row says "done" but nothing went out → lost event. This is at-most-once guarantee. For anything that matters, losing events is unacceptable.
Mark published after the ack → crash in the window and you re-send → duplicate. This is at-least-once guarantee. You never lose; you might double up.

The outbox pattern chooses at-least-once guarantee, because duplicates can be made harmless and lost events can't be recovered. Every event carries a stable id. The consumer deduplicates on it:

# On the consumer side - make reprocessing a no-op.
async def handle(event, session):
    async with session.begin():
        result = await session.execute(
            insert(ProcessedEvent)
            .values(event_id=event.id)
            .on_conflict_do_nothing()  # already seen? skip.
        )
        if result.rowcount == 0:
            return  # duplicate - ignore
        await do_the_actual_work(event)

A one-row INSERT ... ON CONFLICT DO NOTHING (or your broker's native dedup, like SQS FIFO's MessageDeduplicationId) turns a duplicate delivery into a no-op. The observable behavior becomes "processed once." That combination - at-least-once delivery from the relay, deduplication on the consumer - is the strongest honest guarantee there is. The proper name for it is effectively-once, and you should be suspicious of anyone who promises you the other thing. If a system claims true exactly-once delivery across a database, a relay, and a broker, either it's wrong or it's hiding a distributed transaction nobody wanted.

Write "exactly-once" in your docs and you'll be debugging a duplicate charge while re-reading your own marketing. Write "at-least-once, dedupe on event id" and you'll sleep better.

When not to reach for this

A pattern you can't argue against is a pattern you don't understand, so: the outbox is not free. It adds a table, a worker to operate and monitor, a dead-letter queue to watch, and end-to-end latency bounded by your poll interval. If you're building a small monolith where the "downstream consumer" is another function in the same process and the same transaction, you don't need any of this - just call the function. If you can tolerate occasionally losing an event (some analytics, some best-effort notifications), the bookkeeping may not be worth it.

Reach for the outbox when an event leaving your service matters - money moved, an order placed, a contract signed - and a lost or duplicated event has real consequences. That's exactly when those two innocent lines from the top of this article will eventually betray you, and exactly when it's worth the table and the worker to make sure they don't.

The one-paragraph version

If you only remember one thing: never write to your database and your broker as two separate steps. Write the event into an outbox table in the same transaction as your data, and let a separate relay forward it. Use SELECT ... FOR UPDATE SKIP LOCKED so multiple relays can share the work without stepping on each other. Order by aggregate_id and claim per-aggregate when ordering matters. Back off and dead-letter the events that won't go through, so one poison message can't block the rest. And tell the truth about your delivery guarantee: at-least-once, with consumers deduplicating on a stable event id. Do that, and the bug that loses a customer's confirmation email at 2am simply stops being possible.

If this was useful, I write about backend reliability, payments, and the unglamorous distributed-systems details that only show up in production. The idempotency-key pattern - the inbound cousin of the outbox - is next.

DEV Community