Outbox Pattern: When CDC Beats Polling, When Polling Beats CDC

Book: Event-Driven Architecture Pocket Guide: Saga, CQRS, Outbox, and the Traps Nobody Warns You About
Me: xgabriel.com | GitHub

The outbox pattern is a 2-line config in Debezium. It's also the wrong default for half the teams shipping it. Polling beats CDC in five specific cases, CDC is right in five others, and the decision usually gets made by whoever read the blog post most recently.

There's a version of this article you've read before: "use CDC, it's faster, here's a Kafka Connect manifest." That article is fine if you already run Kafka Connect, your traffic justifies it, and your team can debug a connector at 2am. If any of those is shaky, polling is the saner default and you can graduate to CDC later without rewriting your producers.

So: same outcome (events on a bus, exactly-once-ish, no dual writes), two very different operational shapes. Here's the matrix.

The two flavors

Both start with the same table. A transaction writes a domain change plus an outbox_events row in the same commit. The split is how those rows leave the database.

CREATE TABLE outbox_events (
  id          BIGSERIAL PRIMARY KEY,
  aggregate   TEXT NOT NULL,
  event_type  TEXT NOT NULL,
  payload     JSONB NOT NULL,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
  published_at TIMESTAMPTZ
);

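-- Partial index on id: it holds only unpublished rows, already in the
-- order the worker reads them (ORDER BY id), so no extra sort is needed.
CREATE INDEX ON outbox_events (id) WHERE published_at IS NULL;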
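
The "same commit" part is the whole trick, so here is a minimal sketch of the write side. It runs on an in-memory SQLite database so it works anywhere; the orders table and the place_order and make_db helpers are assumptions for illustration, not part of the schema above. With Postgres the shape would be the same: both inserts inside one transaction.

```python
import json
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a throwaway database with a domain table and an outbox table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute(
        """CREATE TABLE outbox_events (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               aggregate TEXT NOT NULL,
               event_type TEXT NOT NULL,
               payload TEXT NOT NULL,
               published_at TEXT)"""
    )
    return conn


def place_order(conn: sqlite3.Connection, order_id: int, fail: bool = False) -> None:
    """Write the domain row and its outbox row in one transaction."""
    with conn:  # commits on success, rolls back if the block raises
        conn.execute(
            "INSERT INTO orders (id, status) VALUES (?, 'PLACED')", (order_id,)
        )
        conn.execute(
            "INSERT INTO outbox_events (aggregate, event_type, payload) VALUES (?, ?, ?)",
            ("order", "OrderPlaced", json.dumps({"order_id": order_id})),
        )
        if fail:
            # Simulate a crash after both inserts but before commit.
            raise RuntimeError("simulated failure before commit")
```

If the transaction fails, neither row exists. If it commits, both do. There is no state where the order was saved but the event was lost, which is exactly what dual writes can't promise.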

Polling-based outbox: a worker reads unpublished rows, ships them to your bus, marks them published. One process. No new infra.

CDC-based outbox: Debezium tails the Postgres WAL, and every insert into outbox_events lands on a Kafka topic. No worker process, but you now run Kafka Connect.

Same contract. Wildly different cost shape.

Latency

CDC reads the WAL as Postgres writes it. End-to-end latency to a Kafka consumer sits around 50-200ms in a healthy cluster. Most of that is replication ack and connector batching.

A polling worker checks the table on a tick. If the tick is 1 second, an event waits about 500ms at the median just to be picked up and close to the full second at p99, and batch size and processing time add to both. Tighten the tick to 100ms and you're hammering the table with index scans, most of which return empty.
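
The tick arithmetic, as a back-of-the-envelope sketch. It assumes events arrive uniformly at random within a tick and ignores batch and publish time; both function names are made up for this example.

```python
def polling_wait_ms(tick_ms: float, quantile: float) -> float:
    """Pickup delay at a given quantile for an event that arrives at a
    uniformly random moment inside a polling tick."""
    if not 0.0 <= quantile <= 1.0:
        raise ValueError("quantile must be between 0 and 1")
    return tick_ms * quantile


def polls_per_hour(tick_ms: float) -> int:
    """How many SELECTs the worker issues per hour when the table is idle."""
    return int(3_600_000 / tick_ms)
```

With a 1000ms tick, polling_wait_ms(1000, 0.5) comes out at 500ms. Drop the tick to 100ms and polls_per_hour(100) is 36,000 queries an hour against a table that is usually empty.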

When 200ms vs 500ms matters: real-time pricing, fraud signals, order routing where downstream SLAs are tight. When it doesn't: invoicing, email triggers, analytics ingestion, anything humans look at within minutes. Most business events fall in the second bucket. Be honest about which you're actually in.

Ops

A polling worker is a process. It has a deployment, a healthcheck, a Prometheus counter for lag. If it dies, you restart it. The on-call playbook is one page.

# outbox_worker.py: the entire thing
import json
from time import sleep

import psycopg
from confluent_kafka import Producer

BATCH = 500
TICK_MS = 200


def run(db_url: str, kafka_brokers: str):
    conn = psycopg.connect(db_url, autocommit=False)
    producer = Producer({"bootstrap.servers": kafka_brokers,
                         "enable.idempotence": True,
                         "acks": "all"})
    failed = []

    def on_delivery(err, msg):
        # Invoked during flush(); remember any message Kafka rejected.
        if err is not None:
            failed.append(err)

    while True:
        with conn.cursor() as cur:
            # SKIP LOCKED so multiple workers don't fight on the same rows
            cur.execute("""
                SELECT id, aggregate, event_type, payload
                FROM outbox_events
                WHERE published_at IS NULL
                ORDER BY id
                LIMIT %s
                FOR UPDATE SKIP LOCKED
            """, (BATCH,))
            rows = cur.fetchall()
            if not rows:
                conn.rollback()
                sleep(TICK_MS / 1000)
                continue

            failed.clear()
            ids = []
            for row_id, aggregate, event_type, payload in rows:
                producer.produce(
                    topic=f"events.{aggregate}",
                    key=aggregate.encode(),
                    value=json.dumps({"type": event_type, "data": payload}).encode(),
                    on_delivery=on_delivery,
                )
                ids.append(row_id)

            # Block until Kafka confirms. flush() returns the number of
            # messages still undelivered when the timeout expires, so a
            # non-zero result or a delivery error means the batch is not
            # safe to mark published.
            undelivered = producer.flush(timeout=10)
            if undelivered or failed:
                # Release the row locks and retry the batch on the next loop.
                # Some messages may be sent twice: this is at-least-once.
                conn.rollback()
                sleep(TICK_MS / 1000)
                continue

            cur.execute(
                "UPDATE outbox_events SET published_at = now() WHERE id = ANY(%s)",
                (ids,),
            )
            conn.commit()

That's it. Add structured logging, a metric for len(rows) per tick, a healthcheck endpoint, and you ship.
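
For the lag metric, one option is "age of the oldest unpublished row". A minimal sketch, assuming you run the query below on a timer and feed the result to whatever metrics client you use; LAG_QUERY and outbox_lag_seconds are names invented for this example.

```python
from datetime import datetime, timezone
from typing import Optional

# Assumed query: oldest row the worker has not published yet.
LAG_QUERY = "SELECT min(created_at) FROM outbox_events WHERE published_at IS NULL"


def outbox_lag_seconds(
    oldest_unpublished: Optional[datetime],
    now: Optional[datetime] = None,
) -> float:
    """Seconds the oldest unpublished event has been waiting.

    Returns 0.0 when the outbox is empty (the query returned NULL).
    """
    if oldest_unpublished is None:
        return 0.0
    if now is None:
        now = datetime.now(timezone.utc)
    # Clamp at zero in case of small clock skew between app and database.
    return max(0.0, (now - oldest_unpublished).total_seconds())
```

Alert on this number rather than on row counts: a thousand rows that are one second old is a busy system, one row that is ten minutes old is a stuck worker.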

CDC's surface area is bigger. You need a Kafka Connect cluster (or Strimzi, or Confluent Cloud), a Debezium connector config, schema registry if you're using Avro, a way to monitor connector lag, and a story for what happens when the connector dies mid-snapshot. The connector itself isn't hard. The supporting cast is.

# debezium-outbox.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: outbox-postgres
  labels:
    strimzi.io/cluster: connect-cluster
spec:
  class: io.debezium.connector.postgresql.PostgresConnector
  tasksMax: 1
  config:
    database.hostname: postgres-primary.prod.svc
    database.port: 5432
    database.user: debezium
    database.password: ${secrets:postgres/debezium-pw}
    database.dbname: orders
    topic.prefix: orders-cdc
    plugin.name: pgoutput
    publication.autocreate.mode: filtered
    slot.name: debezium_outbox

    # Only capture the outbox table, ignore everything else
    table.include.list: public.outbox_events

    # The outbox SMT extracts the payload and routes per aggregate
    transforms: outbox
    transforms.outbox.type: io.debezium.transforms.outbox.EventRouter
    transforms.outbox.route.by.field: aggregate
    transforms.outbox.route.topic.replacement: events.${routedByValue}
    transforms.outbox.table.field.event.key: aggregate
    transforms.outbox.table.field.event.payload: payload

    # Skip the initial snapshot — the outbox table is meant to be ephemeral
    snapshot.mode: never

Three lines you'll regret skipping: slot.name (so you can track and drop the slot manually if the connector dies and Postgres starts growing WAL forever), publication.autocreate.mode: filtered (so the publication only covers what you actually want), and snapshot.mode: never (snapshotting an outbox table that's already been drained is pointless and slow).
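
The "Postgres starts growing WAL forever" failure is worth an alert of its own. A hedged sketch: the query uses the pg_replication_slots view and pg_wal_lsn_diff from stock Postgres (10 and later), while the 5 GiB threshold and the slots_to_alert helper are assumptions for illustration.

```python
from typing import Iterable, List, Optional, Tuple

# How much WAL each replication slot is forcing Postgres to keep around.
SLOT_QUERY = """
SELECT slot_name,
       active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots
"""

SlotRow = Tuple[str, bool, Optional[int]]


def slots_to_alert(
    rows: Iterable[SlotRow],
    max_retained_bytes: int = 5 * 1024**3,  # assumed threshold: 5 GiB
) -> List[str]:
    """Return slot names that are inactive or retaining too much WAL."""
    alerts = []
    for slot_name, active, retained_bytes in rows:
        if retained_bytes is None:
            # restart_lsn is NULL: the slot is not holding any WAL yet.
            continue
        if not active or retained_bytes > max_retained_bytes:
            alerts.append(slot_name)
    return alerts
```

Feed it the rows from SLOT_QUERY on a schedule. An inactive slot named debezium_outbox showing up in the result is your cue to restart the connector or drop the slot before the disk fills.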

Cost in headcount: a polling worker needs zero new skills on the team. CDC needs at least one person who can debug Kafka Connect on a bad day, and that person is rarely cheap.

Schema evolution

You renamed customer_id to account_id. What happens?

Polling: the worker reads JSONB, ships it. The worker doesn't know or care what's in the payload. Consumers might break, but the pipeline keeps moving. You roll out a producer change, then a consumer change, in whatever order makes sense.
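
"In whatever order makes sense" only works if consumers tolerate both shapes during the rollout. A minimal sketch of a tolerant reader for the rename in question; the account_id_of helper is hypothetical, and it assumes the event's data is a plain dict.

```python
def account_id_of(data: dict) -> str:
    """Read the account identifier from an event payload, accepting both
    the new field name and the one it replaced."""
    for field in ("account_id", "customer_id"):  # new name first, then legacy
        if field in data:
            return data[field]
    raise KeyError("payload has neither account_id nor customer_id")
```

Deploy this to consumers first and the producer can switch field names whenever it likes. Once no old-shape events remain in the topic's retention window, drop the fallback.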

CDC with the EventRouter SMT: the connector inspects each row to extract payload and aggregate fields. If your outbox row is already shaped as {aggregate, event_type, payload}, you're fine. The column names stay stable, the payload field is opaque. Most teams set their outbox table up this way precisely because it survives schema drift.

But if you're using CDC against your domain tables directly (skipping the outbox table, "let's just stream the orders table to Kafka"), schema evolution turns into a tax. A column rename triggers a schema-registry compatibility check, downstream consumers break, and the Debezium docs page on schema changes becomes your bedtime reading.

The lesson: CDC against an outbox table with a stable shape is fine. CDC against your live domain tables is a coupling trap dressed as an integration pattern. Don't fall for it.

Multi-region

This is where CDC's geographic story gets interesting and the marketing slides skip the asterisks.

Debezium connectors can tail a Postgres instance in each region and write to per-region Kafka clusters, with MirrorMaker 2 stitching the topics together. (Tailing a standby rather than a primary relies on logical decoding on standbys, which arrived in PostgreSQL 16; before that, each region needs its own primary.) Each connector runs next to its database, the consumers in each region can read locally, and end-to-end latency stays inside the region for the common case. This is real, it works, and it's why companies with strict data-residency or cross-region failover requirements end up on CDC even when their throughput doesn't demand it.

Polling can do multi-region too, with caveats. You run a worker per region, each pointed at the regional database, each writing to a regional bus. That's fine until you start trying to coordinate dedup across regions, at which point you're rebuilding parts of what Debezium gives you for free.
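
To see why cross-region dedup gets awkward: with the table above, ids are a per-database BIGSERIAL, so event 17 from one region and event 17 from another are different events. A sketch of the smallest dedup that accounts for this, assuming each event carries its origin region. RegionalDeduper is a made-up name, and the in-memory set stands in for whatever shared store a real deployment would need.

```python
from typing import Set, Tuple


class RegionalDeduper:
    """Tracks which (origin_region, event_id) pairs were already handled."""

    def __init__(self) -> None:
        self._seen: Set[Tuple[str, int]] = set()

    def first_time(self, origin_region: str, event_id: int) -> bool:
        """Return True the first time a pair is seen, False on repeats."""
        key = (origin_region, event_id)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

The class is trivial. Keeping that set consistent across regions is the part that turns into a project.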

For a single region, polling has zero geographic story to maintain. That's the point: nothing to coordinate. For active-active across three continents with strict residency, CDC is the path of less resistance.

When polling wins

You don't already run Kafka. Standing up Kafka and Connect to get the outbox pattern is a 6-month detour. Polling to RabbitMQ or NATS or SQS lets you ship next week.
You're under ~5k events/sec. A single polling worker on a t3.medium handles 10k events/sec with room. CDC isn't faster at this scale, it's just more moving parts.
Single region. The geographic argument disappears.
Small team. One worker beats one cluster when the team is five engineers and one of them is the on-call.
Schema is still wobbling. Early-stage products that rename fields every sprint should not be touching the WAL directly.

When CDC wins

You already operate Kafka Connect. Marginal cost of one more connector is small. Don't run a custom worker if you have a connector platform.
Sustained high throughput. Past ~20k events/sec, polling needs sharded workers, partition coordination, and you've rebuilt Connect badly.
Multi-region with residency. The geographic story above.
Schema is stable. Mature systems with versioned events and a schema registry get real value from the SMT pipeline.
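You want exactly-once with Kafka transactions end-to-end. Polling can do effectively-once with idempotent producers, but Kafka's transactional APIs and Connect's offset commits give you a cleaner contract. Debezium is at-least-once out of the box, so that contract only holds once exactly-once source support is enabled on the Connect cluster.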
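
For a sense of what "sharded workers" means on the polling side, here is the core of it as a sketch. It assumes each worker is started with a shard number and only publishes events whose key maps to that shard; shard_for is a hypothetical helper, and the coordination around it (who owns which shard, what happens on restart) is the part that ends up resembling Connect.

```python
import zlib


def shard_for(aggregate_key: str, num_shards: int) -> int:
    """Map an aggregate key to a shard in a way that is stable across
    processes and restarts."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    # crc32 is deterministic; Python's built-in hash() is salted per
    # process for strings, so it would give each worker a different answer.
    return zlib.crc32(aggregate_key.encode("utf-8")) % num_shards
```

Every event for the same key lands on the same worker, which keeps per-key ordering intact while the workers run in parallel.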

The gotcha: transaction ordering

This one bites every team that switches from polling to CDC for "performance" without reading the WAL semantics.

A polling worker reading outbox_events ordered by id ASC sees events in id order, and ids come from a sequence that is assigned at insert time, not at commit time. If transaction A inserts events 1 and 2 and transaction B inserts events 3 and 4, and both have committed by the time the worker polls, the worker emits 1, 2, 3, 4 regardless of which transaction committed first. If B commits while A is still open, the worker emits 3 and 4 and picks up 1 and 2 on a later tick. Per-aggregate ordering holds as long as writes to one aggregate are serialized, because each insert happens inside the same transaction as the domain write.

Debezium reads the WAL through logical decoding, which by default hands over each transaction's changes as a unit, in commit order. If two transactions interleave their writes (A inserts row 1, B inserts row 3, A inserts row 2, B inserts row 4) and B commits first, the connector emits 3, 4, 1, 2. Order inside a single transaction is preserved, but order across transactions follows commits rather than ids, so it can differ from what a polling worker would emit for the same writes. On top of that, the EventRouter fans events out to per-aggregate topics, and Kafka only orders within a partition.

For most use cases this is fine. Consumers key off the aggregate ID and process per-key in order. But if your consumer assumes "if I see event for order 42 in state SHIPPED, I've already seen state PAID for some earlier order," you'll get weird intermittent failures. The fix is to never write cross-aggregate ordering assumptions into consumers, and to test that assumption with chaos injection rather than discovering it in production.

Polling gives the illusion of total ordering by reading a single index in id order. CDC emits commit order, which is what the database actually did. Neither gives consumers a cross-aggregate guarantee they can rely on, so your consumers are the thing that needs to handle reordering.
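
One way to keep consumers honest is a test that only asserts what is actually guaranteed: order within each aggregate key. A minimal sketch, assuming each event carries an aggregate key and a sequence number that increases per key; per_aggregate_order_ok is a hypothetical helper, not part of any library.

```python
from typing import Dict, Iterable, Tuple


def per_aggregate_order_ok(events: Iterable[Tuple[str, int]]) -> bool:
    """Check that sequence numbers strictly increase within each aggregate
    key, ignoring how different keys interleave."""
    last_seen: Dict[str, int] = {}
    for aggregate_key, seq in events:
        if aggregate_key in last_seen and seq <= last_seen[aggregate_key]:
            return False
        last_seen[aggregate_key] = seq
    return True


# Events 1 and 2 belong to one aggregate, 3 and 4 to another.
in_id_order = [("order-A", 1), ("order-A", 2), ("order-B", 3), ("order-B", 4)]
interleaved = [("order-A", 1), ("order-B", 3), ("order-A", 2), ("order-B", 4)]
b_first = [("order-B", 3), ("order-B", 4), ("order-A", 1), ("order-A", 2)]
broken = [("order-A", 2), ("order-A", 1), ("order-B", 3), ("order-B", 4)]
```

The first three streams all pass, because each aggregate's events stay in order no matter how the two aggregates interleave. Only the last one fails. If your consumer's tests need anything stricter than this check, it is leaning on an ordering nobody promised.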

The actual answer

If you already run Kafka Connect and ship more than ~10k events/sec across regions, CDC is the right default and the connector config above gets you 80% there.

If you don't, the 60-line polling worker is the right default. Ship it, monitor lag, and revisit when you actually feel the ceiling. Most teams never do.

What's your current outbox setup — polling worker, Debezium connector, or are you still doing dual writes and hoping for the best? Curious which case in the matrix above lines up with your team.

If this was useful

The outbox pattern is one of about a dozen integration patterns that get described as "just do X" when the real answer is "it depends on five things." The Event-Driven Architecture Pocket Guide walks through saga compensation, CQRS read-model rebuilds, and the rest of the outbox/CDC tradeoff space with the same matrix shape, including a chapter on transaction ordering across replication boundaries that goes deeper than the gotcha above.