DEV Community

Cover image for CDC Patterns: Outbox, Timestamps, Triggers, Log-Based — Which Wins When
Gowtham Potureddi
Gowtham Potureddi

Posted on

CDC Patterns: Outbox, Timestamps, Triggers, Log-Based — Which Wins When

cdc patterns are the pick-one architectural decision that decides whether your downstream analytics warehouse is minutes behind reality or entire days behind — and it is the single component senior data engineers get wrong most often because "just use Debezium" is not always an option. Every operational database mutation your business writes — a customer address change, an order cancellation, a soft-deleted row — has to reach the warehouse, the search index, the feature store, and the audit trail without losing deletes, without re-reading the entire source table every night, and without saturating the source database's IOPS budget in the process. The engineering trade-off does not live in "should we do CDC" — every stack with more than one downstream system needs it — but in which change data capture pattern you pick and what it costs the source database.

This guide is the senior-DE walkthrough you wished existed the first time an interviewer asked "walk me through the four cdc patterns and their delete-handling story," or "your DBA won't grant logical replication — what's the fallback?", or "explain the outbox pattern and why it solves the dual-write problem." It walks through the four canonical patterns — timestamp cdc (polling updated_at > watermark), trigger cdc (shadow / audit table), log-based cdc (WAL / binlog / oplog readers), and the transactional outbox pattern (single-transaction fan-out) — the "four axes" interviewers actually probe (source-DB permission, latency, delete visibility, source-side load), the canonical config for each, and the hybrid pattern where log-based cdc reads an outbox table for aggregate events. Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.

PipeCode blog header for CDC patterns — bold white headline 'CDC Patterns' over a hero composition of four small glyph medallions (clock, trigger, WAL, outbox) arranged on a wheel around a central purple 'pick one' seal, on a dark gradient.

When you want hands-on reps immediately after reading, drill the SQL practice library →, rehearse on the ETL practice library →, and sharpen the streaming axis with the streaming practice library →.


On this page


1. Why the CDC pattern choice determines everything downstream

Four patterns, four wildly different cost curves — the choice binds you for years

The one-sentence invariant: change data capture is a picking exercise between polling the source table by updated_at, tailing a shadow audit table populated by triggers, tailing the database's write-ahead log via native replication, or reading a transactionally-consistent outbox table your application writes into — and each pattern trades source-DB permission requirements against latency, delete visibility, and source load in a way that cannot be undone downstream. The pattern you pick in month one becomes the pattern you fight to migrate away from in year three, because every downstream consumer, dashboard, and reconciliation job hard-codes assumptions about the shape of the CDC stream — whether deletes appear as tombstones or silently vanish, whether events are ordered by log-sequence-number or by wall clock, whether replay is a slot-rewind operation or a re-poll from a watermark.

The four axes interviewers actually probe.

  • Source-DB permission. Log-based CDC needs wal_level=logical (Postgres), log_bin=ON + binlog_format=ROW (MySQL), or replica-set oplog access (Mongo). If your DBA won't (or your managed DB can't) grant those, log-based is off the table and you fall back to timestamp or trigger. Interviewers open with this question because it separates people who understand the operational reality from those who've only read the Debezium docs.
  • Latency. Timestamp CDC is bounded by the poll interval — minutes or hours. Trigger CDC adds a bounded per-DML write and a small tail delay — seconds. Log-based CDC is sub-second in steady state. Outbox depends on the reader; sub-second when paired with log-based, minutes when paired with polling.
  • Delete handling. Timestamp CDC is blind to physical deletes — a DELETE FROM orders WHERE id = 42 leaves no updated_at to poll against, and the row silently vanishes downstream. Triggers, log-based, and outbox all capture deletes. Getting this axis wrong ships a subtle correctness bug that only surfaces during audit season.
  • Source load. Timestamp CDC adds one indexed scan per poll — cheap on updated_at-indexed tables. Trigger CDC doubles the write cost on every INSERT/UPDATE/DELETE — a hidden 2× write-amplification tax. Log-based CDC adds zero load on the source tables (the WAL is written anyway) but a small CPU cost on the replication reader. Outbox adds one extra write per transaction — measurable but bounded.

The 2026 reality — log-based dominates but the other three are alive and well.

  • Log-based is the default for any greenfield Postgres, MySQL, or MongoDB deployment where the DBA cooperates. Debezium, AWS DMS, and native connectors (pglogical, wal2json, MongoDB change streams) are production-hardened. If you can pick, you pick log-based.
  • Timestamp / high-watermark ships in every legacy warehouse ETL pipeline: Airflow → source SELECT → S3 → Snowflake. Cheap, portable, works against any relational database — SQL Server, Oracle, DB2, MariaDB, SQLite. Blind to deletes; requires application-level soft-delete or a nightly full-table reconciliation to catch missing rows.
  • Trigger-based is the fallback when logical replication is blocked — some managed Postgres tiers, ancient MySQL versions, or air-gapped installations. The trigger + shadow-table pattern is decades old and still occasionally the only option.
  • Outbox is the pattern that solves the dual-write problem — where the application must write a business row and publish an event, and any failure mode ("wrote the row, crashed before publishing" or vice versa) leaves the two out of sync. The outbox writes both inside one transaction; a downstream reader (usually log-based) publishes the events. This is the pattern senior interviewers probe for microservice teams.

What interviewers listen for.

  • Do you name all four patterns without prompting? — senior signal.
  • Do you say "timestamp CDC is blind to deletes" in the first sentence when timestamp comes up? — required answer.
  • Do you push back on "just use Debezium" with the permission question — "does the DBA grant logical replication?" — senior signal.
  • Do you name the outbox pattern as the answer to the dual-write problem, not as "an alternative to Debezium"? — senior signal.
  • Do you describe CDC as "a stream of ordered mutation events" rather than as vague "syncing data"? — required answer.

Worked example — the four-axis comparison table

Detailed explanation. The single most useful artifact for a CDC interview is a memorised 4×4 comparison table. Every senior CDC discussion converges on this table within the first ten minutes; having it in your head is what separates a fluent answer from a stumbling one. Walk through building the table for a hypothetical orders table that needs to reach a Snowflake warehouse.

  • Source table. public.orders (id, customer_id, total_cents, status, created_at, updated_at) on Postgres 16.
  • Downstream. Snowflake RAW.orders — full history including deletes.
  • Target latency. From "the DBA needs it in 24h" (batch) to "sub-second" (streaming).
  • Downstream consumers. Warehouse (batch OK), search index (needs seconds), audit log (needs every DML including delete).

Question. Build the four-pattern comparison for the orders table and pick the pattern each downstream consumer should subscribe to.

Input.

Pattern Permission needed Latency Delete visible? Source load
Timestamp none (SELECT on table) minutes–hours no (unless soft-delete) 1 indexed scan per poll
Trigger CREATE TRIGGER seconds yes +1 write per DML (2×)
Log-based wal_level=logical + REPLICATION sub-second yes ~0 on source tables
Outbox app-level (write to outbox in txn) reader-dependent yes +1 write per business txn

Code.

-- Postgres source table (all four patterns will target this)
CREATE TABLE public.orders (
    id            BIGSERIAL PRIMARY KEY,
    customer_id   BIGINT       NOT NULL,
    total_cents   BIGINT       NOT NULL,
    status        TEXT         NOT NULL,
    created_at    TIMESTAMPTZ  NOT NULL DEFAULT now(),
    updated_at    TIMESTAMPTZ  NOT NULL DEFAULT now()
);

-- Index on updated_at is mandatory for timestamp CDC
CREATE INDEX idx_orders_updated_at ON public.orders (updated_at);

-- Trigger to keep updated_at fresh (needed for timestamp CDC)
CREATE OR REPLACE FUNCTION public.bump_updated_at() RETURNS TRIGGER AS $$
BEGIN
    NEW.updated_at = now();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_orders_bump_updated_at
BEFORE UPDATE ON public.orders
FOR EACH ROW EXECUTE FUNCTION public.bump_updated_at();
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The orders table needs an updated_at column and an index on it before timestamp CDC becomes viable. Without the index every poll scans the whole table — a per-poll cost that goes non-linear at multi-million-row scale.
  2. The bump_updated_at trigger is the baseline even in a non-CDC world — it guarantees updated_at reflects the most recent mutation. Without this trigger, an application that forgets to write updated_at = now() loses rows from every timestamp-CDC poll.
  3. For log-based CDC, no schema change is needed on orders itself — the WAL captures every DML regardless. For trigger CDC we'll add a shadow table + triggers in section 3. For outbox we'll add a separate outbox table in section 5.
  4. The choice of pattern is downstream-consumer-driven. Warehouse can tolerate minutes → timestamp CDC works. Search index needs seconds → trigger or log-based. Audit log needs every DML including delete → trigger, log-based, or outbox.
  5. In practice, most senior architectures pick log-based as the primary CDC channel, then layer outbox on top for events that don't correspond 1:1 with a source-table row (e.g. OrderShipped when shipment is derived from orders.status transition).

Output.

Consumer Recommended pattern Why
Snowflake warehouse (batch) timestamp CDC Cheap; poll every 15 min; add nightly full-reconcile for deletes
Search index log-based CDC Sub-second latency; deletes handled
Audit log log-based or trigger Every DML captured; log-based preferred if allowed
Application event bus outbox pattern Solves dual-write; consumers get typed events

Rule of thumb. Never pick a CDC pattern based on "which one is trendy." Pick it based on (permission × latency × delete-visibility × source-load) — the four axes. Write the table on a whiteboard first; the pattern falls out of the constraints.

Worked example — what interviewers actually probe

Detailed explanation. The senior data-engineering CDC interview has a predictable structure: the interviewer opens with an ambiguous question ("how would you get data from our Postgres OLTP into the warehouse?"), then progressively narrows to test whether you know the axes. The candidates who name the pattern in sentence one score highest; the candidates who describe "an ETL pipeline" score lowest. Walk through the interview grading rubric.

  • Ambiguous opener. "How would you sync our production Postgres to the warehouse?" — invites you to name a pattern.
  • Follow-up 1. "What if the DBA won't grant logical replication?" — probes permission axis.
  • Follow-up 2. "How does your pattern handle deletes?" — probes delete axis.
  • Follow-up 3. "What's the load on Postgres?" — probes source-load axis.
  • Follow-up 4. "How would you replay the last week?" — probes replay semantics.

Question. Draft a 5-minute senior CDC answer that covers all four axes without waiting to be asked.

Input.

Interview signal Weak answer Senior answer
Pattern named "we'd use Airflow" "log-based CDC via Debezium if logical replication is allowed"
Permission "not sure" "requires wal_level=logical + REPLICATION role"
Delete handling "we could add a soft-delete column" "log-based captures delete tombstones natively"
Source load "Postgres can handle it" "zero source-table load; small CPU on the replication reader"
Replay "restart Airflow" "reset the replication slot LSN, or bootstrap via pg_dump snapshot"

Code.

Senior CDC answer template (5 minutes)
======================================

Minute 1 — name the pattern up front
  "I'd default to log-based CDC via Debezium, publishing to Kafka."

Minute 2 — permission
  "This requires wal_level=logical on the Postgres primary, a REPLICATION
   role for the connector, and a replication slot per connector. If the
   DBA won't grant that (managed tier limitation, corporate policy),
   the fallback is timestamp CDC (poll updated_at) with a nightly
   full-reconcile job to catch deletes."

Minute 3 — delete handling
  "Log-based captures INSERT/UPDATE/DELETE natively — deletes appear as
   tombstone events with the old row image. Timestamp CDC is blind to
   physical deletes, so we'd either soft-delete or reconcile."

Minute 4 — source load
  "Log-based adds ~zero load on source tables — the WAL is written
   anyway. The reader consumes ~2-3% CPU on the DB host. Trigger CDC
   doubles the write cost on every DML. Timestamp CDC is one indexed
   scan per poll interval."

Minute 5 — replay + failure semantics
  "Debezium tracks progress via LSN in the replication slot. Replay is
   'reset the slot to a prior LSN and re-emit.' The initial snapshot
   uses SNAPSHOT ISOLATION so it's a consistent point-in-time cut. On
   failure the slot holds WAL retention until the reader catches up —
   we monitor slot lag with pg_replication_slots."
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Minute 1 is the crucial framing. Naming the pattern immediately — "log-based CDC via Debezium" — signals you're a decision-maker, not a task-runner. Weak candidates dive into tools ("we'd use Kafka Connect and…") before naming the pattern.
  2. Minute 2 addresses the permission axis before the interviewer asks. This preempts the common trap where you commit to log-based, then admit you don't know how to fall back. Naming the fallback up front is senior signal.
  3. Minute 3 is the delete-handling probe. Every CDC pattern except timestamp captures deletes; naming the log-based tombstone event and the timestamp reconcile workaround shows you've built both.
  4. Minute 4 is the source-load argument. "Zero load on source tables" is the log-based sales pitch; "one indexed scan per poll" is the timestamp defense; "doubles write cost" is the trigger warning. Say all three unprompted.
  5. Minute 5 covers replay and failure — the reliability axis. The LSN-based replay story for log-based is qualitatively different from the watermark-file replay for timestamp; showing you know both is a senior signal.

Output.

Grading criterion Weak score Senior score
Names pattern in minute 1 rare mandatory
Names permission fallback rare required
Names delete handling occasional mandatory
Names source-load numbers rare senior signal
Names replay semantics rare senior signal

Rule of thumb. The senior CDC answer is a 5-minute monologue that covers all four axes without waiting for the follow-ups. Rehearse it once; deploy it every time.

Worked example — the "pick the pattern" decision tree

Detailed explanation. Given a new source system, the senior architect runs a 4-question decision tree in their head. Codifying the tree makes the interview answer reproducible: any interviewer can hand you a scenario and you can walk the tree out loud. Walk through the tree with three canonical scenarios: managed RDS Postgres with full permissions, a locked-down Oracle 11g, and a microservice team that needs event routing.

  • Q1. Do you have logical-replication / binlog / oplog access? → yes = go to Q2; no = go to Q3.
  • Q2. Is sub-second latency required or acceptable? → yes = log-based; no = still log-based (it's the default when allowed).
  • Q3. Is CREATE TRIGGER allowed? → yes = trigger CDC; no = go to Q4.
  • Q4. Is updated_at maintained on every DML? → yes = timestamp CDC; no = must add a soft-delete + updated_at column first.
  • Q5 (parallel branch). Does the app write a business row and need to emit an event atomically? → yes = outbox (on top of whatever primary CDC you picked).

Question. Walk the decision tree for the three scenarios and record the pattern each ends up with.

Input.

Scenario Q1 (log access?) Q3 (trigger?) Q4 (updated_at?) Q5 (dual-write?)
Managed RDS Postgres yes maybe
Oracle 11g locked down no no yes no
Microservice + Kafka yes yes

Code.

# Decision-tree helper (illustrative)
def pick_cdc_pattern(has_log_access: bool,
                      allows_triggers: bool,
                      has_updated_at: bool,
                      needs_dual_write: bool) -> list[str]:
    """Return the primary CDC pattern(s) for a source system."""
    patterns = []

    if has_log_access:
        patterns.append("log-based")
    elif allows_triggers:
        patterns.append("trigger")
    elif has_updated_at:
        patterns.append("timestamp + nightly reconcile")
    else:
        raise ValueError("no viable CDC pattern; add updated_at first")

    if needs_dual_write:
        patterns.append("outbox (on top)")

    return patterns


# Walk the three scenarios
print(pick_cdc_pattern(True,  False, True,  False))
# → ['log-based']

print(pick_cdc_pattern(False, False, True,  False))
# → ['timestamp + nightly reconcile']

print(pick_cdc_pattern(True,  False, True,  True))
# → ['log-based', 'outbox (on top)']
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Scenario 1 — Managed RDS Postgres with wal_level=logical enabled and REPLICATION role granted. The decision tree short-circuits at Q1 → log-based. This is the modern default.
  2. Scenario 2 — Oracle 11g with no LogMiner access, no CREATE TRIGGER grant (locked-down DBA), but the updated_at column exists on every table. Q1 = no, Q3 = no, Q4 = yes → timestamp CDC with a nightly full-reconcile job to catch deletes. This is the legacy default.
  3. Scenario 3 — a microservice team with full Postgres access that needs to emit typed domain events (OrderPlaced, OrderShipped) to Kafka. Q1 = yes → log-based; Q5 = yes → also outbox for the domain events. The hybrid pattern uses log-based CDC as the reader tailing the outbox table.
  4. The parallel Q5 branch (outbox) is orthogonal to the primary pattern. You can pair outbox with any primary CDC pattern; log-based + outbox is by far the most common hybrid in modern microservice architectures.
  5. If none of Q1-Q4 pass, the answer is "you must first add updated_at and a soft-delete convention to the source schema." Refuse to pick a CDC pattern until the source schema supports one.

Output.

Scenario Primary pattern Add-on
Managed RDS Postgres log-based
Oracle 11g locked timestamp + reconcile
Microservice + Kafka log-based outbox

Rule of thumb. The four-question decision tree is a whiteboard-friendly answer. Practice walking it end-to-end so an interviewer can hand you any scenario and get a pattern name in under 60 seconds.

Senior interview question on CDC pattern selection

A senior interviewer often opens with: "You inherit a Postgres 12 → Snowflake pipeline that currently full-refreshes the orders table nightly. The business wants sub-second freshness and physical deletes to appear downstream. Walk me through the CDC pattern you'd migrate to, the source-DB changes you'd request, and the failure modes you'd guard against."

Solution Using log-based CDC with a bootstrap snapshot and slot monitoring

-- Step 1 — request DBA changes on Postgres primary
-- postgresql.conf
-- wal_level             = logical
-- max_replication_slots = 10
-- max_wal_senders       = 10
-- (requires restart)

-- Step 2 — create replication user + publication for the orders table
CREATE ROLE cdc_reader WITH LOGIN REPLICATION PASSWORD 'strong-secret';
GRANT SELECT ON public.orders TO cdc_reader;

CREATE PUBLICATION orders_pub FOR TABLE public.orders;

-- Step 3 — create replication slot (Debezium will do this on startup,
-- but you can pre-create for control)
SELECT pg_create_logical_replication_slot('debezium_orders', 'pgoutput');
Enter fullscreen mode Exit fullscreen mode
# Step 4 — Debezium Postgres connector config
name: orders-cdc
config:
  connector.class: io.debezium.connector.postgresql.PostgresConnector
  plugin.name: pgoutput
  database.hostname: db-primary.internal
  database.port: 5432
  database.user: cdc_reader
  database.password: ${env:CDC_PASSWORD}
  database.dbname: production
  slot.name: debezium_orders
  publication.name: orders_pub
  snapshot.mode: initial          # bootstrap: pg_dump-style snapshot first,
                                  # then tail the WAL
  table.include.list: public.orders
  topic.prefix: prod
  heartbeat.interval.ms: 10000    # every 10s emit a heartbeat to advance
                                  # the slot even during quiet periods
  key.converter: org.apache.kafka.connect.json.JsonConverter
  value.converter: org.apache.kafka.connect.json.JsonConverter
Enter fullscreen mode Exit fullscreen mode
-- Step 5 — slot lag monitoring query (alert on lag > 100 MB)
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM   pg_replication_slots
WHERE  slot_name = 'debezium_orders';
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Before (nightly full-refresh) After (log-based CDC)
Latency ~24 h sub-second (steady state)
Delete visibility no (rows silently vanish) yes (tombstone events)
Source load full-table scan nightly ~0 on orders; +2% CPU on reader
Snowflake sync mechanism truncate + reload append + merge (dedupe by LSN)
Bootstrap N/A one-time snapshot.mode=initial
Replay re-run full-refresh reset slot LSN, re-emit
Failure surface job crashes → next night retry slot backs up → WAL retention until reader recovers

After the migration, orders mutations appear in Kafka within ~200 ms of commit, deletes ship as tombstone events (op: d with the pre-image), and Snowflake's downstream MERGE handles insert/update/delete idempotently by LSN. The nightly full-refresh disappears from the Airflow DAG.

Output:

Metric Before After
Warehouse freshness 24 h < 1 s
Delete detection none native
Source-DB load full scan × 1/day continuous 2% CPU
WAL retention risk none slot must not lag >1 GB
Snowflake merge SQL truncate + insert MERGE ON id ordered by LSN

Why this works — concept by concept:

  • Log-based CDC — the WAL is Postgres's durable ordered event log. Every INSERT/UPDATE/DELETE is written to it before commit. A logical replication reader decodes those records into typed change events without ever touching the source table.
  • Publication + replication slotPUBLICATION scopes which tables the reader sees; REPLICATION SLOT durably tracks the reader's position (LSN). The slot guarantees no data loss even if the reader is offline for hours — Postgres retains WAL until the slot catches up.
  • Snapshot bootstrapsnapshot.mode=initial performs a consistent point-in-time snapshot (via SET TRANSACTION SNAPSHOT), emits every existing row, records the LSN at snapshot start, then tails the WAL from that LSN. This solves the "how do I initialise the warehouse?" question without an outage.
  • Heartbeat interval — quiet tables produce no WAL, so the slot's confirmed_flush_lsn never advances, so WAL retention grows unboundedly. Heartbeats every 10 s force the slot forward.
  • Cost — one replication slot (~free), 2-3% CPU on the reader box, mandatory monitoring of pg_replication_slots.lag. The eliminated cost is the nightly full-table scan (O(rows) per night) and the developer time spent on soft-delete gymnastics. Net O(1) per commit versus O(N) per night.

SQL
Topic — sql
SQL CDC and change-stream problems

Practice →

ETL Topic — etl ETL problems on incremental ingestion

Practice →


2. Timestamp / high-watermark CDC

SELECT WHERE updated_at > last_watermark is the CDC pattern every legacy warehouse still runs — cheap, portable, and blind to deletes

The mental model in one line: timestamp cdc is the pattern where a poller reads rows whose updated_at (or modified_at, or last_modified) exceeds the last-recorded high-watermark, writes the delta downstream, then advances the watermark to the maximum updated_at observed — it works against any relational database that has an updated_at column and an index, but it cannot see physical deletes and it cannot recover late-arriving updates without help. Every senior data engineer has built one; every senior data engineer has been burned by one; and it remains the correct answer for a large fraction of legacy ETL pipelines.

Iconographic timestamp CDC diagram — a source table with an updated_at column highlighted orange, a clock-hand slicing above a watermark-file card, and downstream rows landing in an S3 cylinder.

The four axes for timestamp CDC.

  • Permission. SELECT on the source table. Nothing else. The most permission-frugal pattern, which is why it survives in air-gapped, locked-down, and legacy environments where the DBA will not grant CREATE TRIGGER or REPLICATION.
  • Latency. Bounded by the poll interval. 15 minutes is the common default; 1 minute is aggressive; sub-minute is either wasteful (each poll re-scans a large index range) or requires log-based CDC. The upstream data lands "at most poll_interval old."
  • Delete handling. None. A DELETE FROM orders WHERE id = 42 leaves no trace queryable by WHERE updated_at > watermark. The row simply stops appearing in future polls. This is the pattern's fundamental limitation.
  • Source load. One indexed range scan per poll — cheap if updated_at is indexed, catastrophic if not. Every senior architect confirms the index exists before signing off on timestamp CDC.

The watermark checkpoint — the single durable piece of state.

  • What it is. A file (or DynamoDB row, or Postgres row) recording the maximum updated_at value the poller has already processed. On the next poll, the query is WHERE updated_at > watermark.
  • Where to store it. Airflow XCom, a JSON file on S3, a small Postgres table (cdc_watermarks(table_name, last_ts)), Redis. All work; pick durability over convenience.
  • Idempotency. The poll query is idempotent — re-running it with the same watermark yields the same delta. The commit-then-update-watermark step is not idempotent; the safe pattern is write-delta → advance-watermark-atomically using a transaction.
  • Bootstrap. First run sets watermark to 1970-01-01 (or NULL) and captures the entire table. Subsequent runs are incremental.

The three failure modes senior engineers pre-empt.

  • Clock skew. If two application servers write to Postgres with slightly different clocks, an updated_at = 12:00:00.5 row can commit after an updated_at = 12:00:01 row — but the poll query orders by updated_at, so the earlier-clocked row is missed if the watermark has already advanced past 12:00:00.5. Mitigation: use updated_at = clock_timestamp() inside a BEFORE UPDATE trigger (Postgres-side clock, not app-side); or use a monotonic counter (update_seq BIGSERIAL) instead of a timestamp.
  • Late-arriving rows. A long-running transaction that committed at 12:00:00 but inserted rows with updated_at = 11:59:00 will be missed if the poller has already advanced past 12:00:00. Mitigation: query WHERE updated_at > watermark AND updated_at <= watermark_horizon where watermark_horizon = now() - INTERVAL '5 minutes' — a "safety window" that excludes recent commits from being processed until they've had time to settle.
  • Physical deletes. The fundamental blind spot. Mitigation options: (a) soft-delete convention — never DELETE, always UPDATE with deleted_at = now() and treat deleted rows as normal updates; (b) periodic full-table reconciliation — nightly, dump the full row set, diff against the warehouse, catch missing rows; (c) accept the inconsistency for tables where deletes are rare.

Common interview probes on timestamp CDC.

  • "How does timestamp CDC handle deletes?" — required answer is "it doesn't; use soft-delete or reconcile."
  • "How do you handle late-arriving rows?" — safety-window pattern (updated_at <= now() - 5m).
  • "Where do you store the watermark?" — durably, transactionally-committed alongside the delta write.
  • "When would you pick timestamp over log-based?" — permission constraints, small-scale, batch-tolerant warehouse pipelines.

Worked example — incremental Postgres → S3 by updated_at

Detailed explanation. The canonical timestamp CDC pipeline: an Airflow DAG polls Postgres every 15 minutes, reads rows where updated_at > watermark, writes them as JSON to S3, then advances the watermark. It handles the "no deletes" limitation by adding a nightly full-reconcile that catches any rows that disappeared. Walk through the DAG.

  • Frequency. Every 15 minutes for incremental; nightly for full reconcile.
  • Watermark storage. Airflow XCom (persistent metadata DB).
  • Safety window. 5-minute lookback to catch late commits.
  • Deletes. Nightly full-reconcile diffs S3 against source; flags disappeared rows.

Question. Implement the Airflow task that runs one incremental poll and advances the watermark atomically.

Input.

Parameter Value
Source Postgres public.orders
Watermark column updated_at
Watermark storage Airflow XCom
Poll interval 15 min
Safety window 5 min

Code.

# Airflow task — one incremental CDC poll for public.orders
import json
import psycopg2
from datetime import datetime, timedelta, timezone
from airflow.decorators import task

WATERMARK_XCOM_KEY = "cdc_orders_watermark"

@task
def poll_orders_incremental(ti, **_):
    """Read orders modified since last watermark; write to S3; advance watermark."""

    # 1. Read the previous watermark (bootstrap to epoch on first run)
    prev = ti.xcom_pull(key=WATERMARK_XCOM_KEY, include_prior_dates=True)
    prev_ts = datetime.fromisoformat(prev) if prev else datetime(1970, 1, 1, tzinfo=timezone.utc)

    # 2. Safety horizon — exclude commits from the last 5 minutes
    horizon = datetime.now(timezone.utc) - timedelta(minutes=5)

    # 3. Query the delta
    conn = psycopg2.connect("host=db-primary port=5432 dbname=production user=cdc_reader")
    cur = conn.cursor()
    cur.execute("""
        SELECT id, customer_id, total_cents, status, created_at, updated_at
        FROM   public.orders
        WHERE  updated_at >  %s
          AND  updated_at <= %s
        ORDER  BY updated_at ASC
    """, (prev_ts, horizon))
    rows = cur.fetchall()
    conn.close()

    if not rows:
        return prev_ts.isoformat()  # nothing changed; do not advance

    # 4. Write JSON to S3 (one file per poll; timestamped key)
    key = f"cdc/orders/{horizon:%Y/%m/%d/%H%M}.jsonl"
    payload = "\n".join(json.dumps({
        "id": r[0], "customer_id": r[1], "total_cents": r[2],
        "status": r[3], "created_at": r[4].isoformat(), "updated_at": r[5].isoformat()
    }) for r in rows)
    # (s3.put_object elided for brevity)

    # 5. Advance watermark to the MAX(updated_at) actually observed
    #    NOT to `horizon`, because the source might have gaps
    new_ts = max(r[5] for r in rows)
    ti.xcom_push(key=WATERMARK_XCOM_KEY, value=new_ts.isoformat())

    print(f"CDC poll wrote {len(rows)} rows; watermark {prev_ts} -> {new_ts}")
    return new_ts.isoformat()
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Step 1 loads the previous watermark from XCom. On first run (bootstrap), no XCom exists, so we default to epoch — the poll captures the entire history of the table. This is the "initial snapshot" for timestamp CDC; unlike log-based, there's no separate snapshot mechanism.
  2. Step 2 computes the safety horizon: now() - 5 minutes. The poll query excludes rows with updated_at > horizon — this pushes recently-committed rows to the next poll, giving them 5 minutes to settle. This defends against clock-skew and late-committed transactions.
  3. Step 3 executes the delta query. The ORDER BY updated_at ASC is not strictly required for correctness but makes the downstream file naturally sorted, which simplifies dedupe in Snowflake.
  4. Step 4 writes the delta to S3 using a timestamp-embedded key. If the write fails, the task raises and Airflow retries; the watermark is not advanced, so the next attempt processes the same delta again. This is the idempotency guarantee.
  5. Step 5 advances the watermark to max(updated_at) from the actual result set — not to the horizon time. This is subtle: if the last row is at 12:00:05 but the horizon is 12:00:10, advancing to 12:00:10 risks skipping a row that commits at 12:00:07 with an updated_at from the transaction's start time. Advancing only to observed values keeps the watermark honest.

Output.

Poll prev watermark max updated_at seen new watermark Rows written
1 (bootstrap) 1970-01-01 2026-07-04 09:15:22 2026-07-04 09:15:22 12,345
2 2026-07-04 09:15:22 2026-07-04 09:29:47 2026-07-04 09:29:47 87
3 2026-07-04 09:29:47 (no new rows) 2026-07-04 09:29:47 0
4 2026-07-04 09:29:47 2026-07-04 09:44:11 2026-07-04 09:44:11 65

Rule of thumb. Store the watermark durably (Airflow XCom, DynamoDB, or a Postgres row), advance it only after the delta write is committed, and advance to max(observed) — never to a wall-clock horizon. The five-minute safety window is non-negotiable in a multi-writer world.

Worked example — the clock-skew failure mode

Detailed explanation. A distributed application writes to Postgres from 20 app pods, each setting updated_at = time.time() (server-side wall clock) in the UPDATE statement. Two of the pods have clocks that drift by ~2 seconds; the rest are NTP-synced. The timestamp CDC poll misses roughly 0.1% of rows whose updated_at from the skewed pods falls behind the watermark of an already-processed row from the synced pods. Walk through the diagnosis and the fix.

  • Symptom. Downstream Snowflake row count is consistently 0.1% below Postgres row count.
  • Root cause. App-side updated_at uses each pod's wall clock; skew across pods means the strict ordering the watermark assumes doesn't hold.
  • Fix. Move updated_at maintenance to a BEFORE UPDATE trigger using clock_timestamp() — Postgres-side, single-clock, monotonic per transaction.

Question. Design the schema change that removes the clock-skew failure mode, and quantify the residual risk.

Input.

Component Before After
updated_at source app-side time.time() Postgres clock_timestamp() in BEFORE UPDATE trigger
Clock authority 20 pod clocks (drift ~2s) 1 Postgres clock
Residual skew up to 2 s ~0 (single clock)
Late-arriving row rate 0.1% ~0.001% (long-txn edge case only)

Code.

-- Remove the app-side updated_at write; enforce it in Postgres
CREATE OR REPLACE FUNCTION public.bump_updated_at() RETURNS TRIGGER AS $$
BEGIN
    -- clock_timestamp() reads the current wall clock inside the transaction
    -- (as opposed to now()/CURRENT_TIMESTAMP which is fixed at BEGIN time)
    NEW.updated_at = clock_timestamp();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS trg_orders_bump_updated_at ON public.orders;
CREATE TRIGGER trg_orders_bump_updated_at
BEFORE INSERT OR UPDATE ON public.orders
FOR EACH ROW EXECUTE FUNCTION public.bump_updated_at();

-- Optional: revoke direct write on updated_at from app roles
REVOKE UPDATE (updated_at) ON public.orders FROM app_writer;
Enter fullscreen mode Exit fullscreen mode
# Application code — no longer writes updated_at at all
def update_order_status(conn, order_id, new_status):
    with conn.cursor() as cur:
        cur.execute("""
            UPDATE public.orders
            SET    status = %s
            WHERE  id = %s
        """, (new_status, order_id))
    # updated_at is bumped automatically by the trigger
    conn.commit()
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The root cause is that updated_at was being written by 20 different clocks (one per pod), so two rows committed 100 ms apart could have updated_at values ordered opposite to their commit order. The watermark, which assumes monotonic-per-commit ordering, silently drops the earlier-timestamped-but-later-committed row.
  2. Moving updated_at maintenance into a BEFORE UPDATE trigger using clock_timestamp() centralises it on the Postgres clock — one clock, no skew. clock_timestamp() (not now()) is the correct function: now() is fixed at transaction start, so multi-statement transactions would all get the same updated_at.
  3. Revoking UPDATE on the updated_at column from the application role is a belt-and-braces defense: even if a developer adds SET updated_at = ... in code, Postgres rejects it. Enforcement at the schema level is far more reliable than enforcement in code review.
  4. The residual risk is a long-running transaction that started at 12:00:00 but commits at 12:00:10: its clock_timestamp() reads at the moment of the trigger fire, which is inside the transaction but before commit. If the poll runs at 12:00:05 with horizon = 12:00:00, it might miss the row. The 5-minute safety window absorbs this in practice.
  5. After the fix, downstream row-count drift drops to well under 0.01% and can be attributed entirely to the "long-transaction inside safety window" edge case, which the periodic reconcile catches.

Output.

Metric Before After
Distinct clock sources 20 pod clocks 1 Postgres clock
updated_at write authority app trigger
Row-count drift 0.1% < 0.01%
Reconcile frequency needed 4x/day 1x/day
Interview signal "app-side timestamp" (weak) "clock_timestamp() in trigger" (senior)

Rule of thumb. Never let application code write updated_at. Enforce it in a BEFORE UPDATE trigger using clock_timestamp(), revoke the column-level UPDATE grant from app roles, and treat any app-side updated_at = as a schema violation. The clock-skew bug is invisible until you look for it.

Worked example — soft-delete workaround for missing physical deletes

Detailed explanation. A products table serves both an operational service and a warehouse. The team's timestamp CDC pipeline replicates INSERT and UPDATE reliably but silently drops DELETE. The fix is the soft-delete convention: replace all DELETE FROM products WHERE id = X with UPDATE products SET deleted_at = now() WHERE id = X. Deletes become updates; timestamp CDC sees them; downstream consumers filter them out. Every senior DE has migrated a codebase to soft-delete at least once.

  • The migration. Add deleted_at TIMESTAMPTZ NULL column. Rewrite all DELETE statements to UPDATE. Add a view that filters WHERE deleted_at IS NULL for read paths.
  • The trade-off. Rows are never physically removed; storage grows without bound. Add a garbage-collector job to hard-DELETE rows older than N days and propagate the hard-delete out-of-band to the warehouse.

Question. Migrate the products table to soft-delete and show how downstream Snowflake handles the deleted-at column for both dashboards and audit queries.

Input.

Component Change
Schema add deleted_at TIMESTAMPTZ NULL
Application code rewrite DELETE → UPDATE deleted_at
Read paths filter WHERE deleted_at IS NULL
Watermark column updated_at (already present)
Warehouse keep deleted rows; filter in dashboards

Code.

-- 1. Add the soft-delete column
ALTER TABLE public.products ADD COLUMN deleted_at TIMESTAMPTZ;
CREATE INDEX idx_products_updated_at ON public.products (updated_at);

-- 2. Trigger bumps updated_at on any UPDATE, including deleted_at writes
CREATE OR REPLACE FUNCTION public.bump_updated_at() RETURNS TRIGGER AS $$
BEGIN
    NEW.updated_at = clock_timestamp();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_products_bump_updated_at
BEFORE INSERT OR UPDATE ON public.products
FOR EACH ROW EXECUTE FUNCTION public.bump_updated_at();

-- 3. Live view for the application (hides soft-deleted rows)
CREATE OR REPLACE VIEW public.products_live AS
SELECT * FROM public.products WHERE deleted_at IS NULL;

-- 4. Application code now UPDATEs instead of DELETEs
--   Before:  DELETE FROM public.products WHERE id = 42;
--   After:   UPDATE public.products SET deleted_at = now() WHERE id = 42;
Enter fullscreen mode Exit fullscreen mode
-- 5. Snowflake side — MERGE handles insert/update/soft-delete uniformly
MERGE INTO analytics.products AS tgt
USING (
    SELECT * FROM staging.products_cdc_delta
) src
ON tgt.id = src.id
WHEN MATCHED     THEN UPDATE SET
    tgt.name        = src.name,
    tgt.price_cents = src.price_cents,
    tgt.deleted_at  = src.deleted_at,   -- propagate the soft-delete
    tgt.updated_at  = src.updated_at
WHEN NOT MATCHED THEN INSERT (id, name, price_cents, deleted_at, updated_at)
                       VALUES (src.id, src.name, src.price_cents, src.deleted_at, src.updated_at);

-- 6. Dashboards filter out soft-deleted rows
CREATE OR REPLACE VIEW analytics.products_live AS
SELECT * FROM analytics.products WHERE deleted_at IS NULL;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Adding deleted_at TIMESTAMPTZ NULL and rewriting every application-side DELETE as UPDATE ... SET deleted_at = now() turns physical deletes into logical ones. The updated_at trigger fires on the UPDATE, so the row's updated_at advances and the next CDC poll picks it up as an update.
  2. The products_live view centralises the "filter soft-deleted" logic; the application queries products_live instead of products. Retrofitting this view lets you migrate the schema without touching every SELECT in the codebase in one commit.
  3. On the CDC side, the poll now sees the soft-deleted row exactly like an update: it ships the row with the new deleted_at value. No pattern change needed — timestamp CDC's blind spot is closed by the schema convention.
  4. The Snowflake MERGE treats soft-delete as just another column write. Downstream dashboards use their own products_live view; auditors use analytics.products directly to see full history including deletions.
  5. The garbage collector runs monthly: DELETE FROM public.products WHERE deleted_at < now() - INTERVAL '90 days' — this does create a physical delete blind spot again, so the GC job must run through a separate out-of-band mechanism (a nightly job that emits DELETE events to Snowflake directly, or a full-table reconcile).

Output.

Row lifecycle Before soft-delete After soft-delete
INSERT captured captured
UPDATE captured captured
DELETE (app-side) silently lost captured as deleted_at set
DELETE (GC job) silently lost requires reconcile or side channel
Warehouse history incomplete complete (until GC purge)

Rule of thumb. For any table under timestamp CDC, adopt the soft-delete convention: add deleted_at, rewrite DELETEs as UPDATEs, expose a _live view for read paths, and treat any physical DELETE as a policy violation that must go through a special GC path. This is the cheapest fix for the "no deletes" blind spot.

Senior interview question on timestamp CDC

A senior interviewer might ask: "You need to ingest a 2-billion-row events table from Postgres 12 into Snowflake with 15-minute freshness. The DBA won't grant logical replication. Walk me through the timestamp CDC design, the watermark storage, the safety window, the delete handling, and the operational monitoring."

Solution Using timestamp CDC with a durable watermark, safety window, and nightly reconcile

# cdc_events_incremental.py — production timestamp CDC for the events table
import json
import psycopg2
from datetime import datetime, timedelta, timezone

WATERMARK_TABLE = "cdc_watermarks"  # persistent in Postgres itself

def get_watermark(conn, table: str) -> datetime:
    with conn.cursor() as cur:
        cur.execute("""
            SELECT last_updated_at
            FROM   cdc_watermarks
            WHERE  table_name = %s
            FOR UPDATE
        """, (table,))
        row = cur.fetchone()
        return row[0] if row else datetime(1970, 1, 1, tzinfo=timezone.utc)

def advance_watermark(conn, table: str, new_ts: datetime):
    with conn.cursor() as cur:
        cur.execute("""
            INSERT INTO cdc_watermarks(table_name, last_updated_at)
            VALUES (%s, %s)
            ON CONFLICT (table_name) DO UPDATE
              SET last_updated_at = EXCLUDED.last_updated_at
        """, (table, new_ts))

def poll_events(conn) -> int:
    prev    = get_watermark(conn, "events")
    horizon = datetime.now(timezone.utc) - timedelta(minutes=5)

    with conn.cursor(name="events_delta") as cur:   # server-side cursor for 2B rows
        cur.itersize = 10000
        cur.execute("""
            SELECT id, event_ts, user_id, event_type, payload, updated_at
            FROM   public.events
            WHERE  updated_at >  %s
              AND  updated_at <= %s
            ORDER  BY updated_at ASC
        """, (prev, horizon))

        n = 0
        max_ts = prev
        for row in cur:
            write_row_to_s3(row)     # buffered writer, flushes every 10k rows
            if row[5] > max_ts:
                max_ts = row[5]
            n += 1

    if max_ts > prev:
        advance_watermark(conn, "events", max_ts)
        conn.commit()

    return n
Enter fullscreen mode Exit fullscreen mode
-- Durable watermark table (Postgres itself; transactional)
CREATE TABLE IF NOT EXISTS cdc_watermarks (
    table_name       TEXT PRIMARY KEY,
    last_updated_at  TIMESTAMPTZ NOT NULL,
    last_poll_at     TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Nightly reconcile — full row-count + PK diff, catches deletes
WITH source AS (
    SELECT id FROM public.events
),
warehouse AS (
    SELECT id FROM external_snowflake.events   -- via foreign data wrapper or reverse pull
)
SELECT 'missing_in_warehouse' AS kind, s.id
FROM   source s
LEFT JOIN warehouse w ON w.id = s.id
WHERE  w.id IS NULL
UNION ALL
SELECT 'physically_deleted'   AS kind, w.id
FROM   warehouse w
LEFT JOIN source s ON s.id = w.id
WHERE  s.id IS NULL;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Value Reasoning
Watermark storage cdc_watermarks Postgres table transactional; survives Airflow crashes
Safety window 5 minutes absorbs late-committing transactions
Cursor mode server-side, itersize=10000 never load 2B rows into Python memory
Watermark advance max(updated_at) observed never to horizon; stays honest
Advance-and-commit one transaction delta-write + watermark-advance atomic
Delete handling nightly PK diff catches physical deletes

After deployment, the pipeline lands Postgres deltas in Snowflake within 15-20 minutes of commit; the nightly reconcile flags any missing rows for backfill; the watermark table survives Airflow restarts and worker failures because it lives in the source Postgres itself.

Output:

Metric Value
Rows scanned per poll 500-5000 (typical)
Poll latency 30-60 s
Freshness p99 15-20 min
Delete detection lag up to 24 h (nightly reconcile)
Watermark durability transactional in source Postgres

Why this works — concept by concept:

  • Server-side cursor + itersize — for a 2-billion-row source, psycopg2 client-side cursors would load the entire result set into Python memory. conn.cursor(name=...) opens a server-side cursor that streams rows in batches of itersize, keeping memory flat regardless of result-set size.
  • Watermark in source Postgres — placing cdc_watermarks in the same Postgres as the source data lets you commit the delta write and the watermark advance in one transaction. That atomicity is the only way to guarantee "processed row => watermark advanced" without a two-phase commit.
  • Safety window (5 min) — the horizon = now() - 5 min window absorbs late-committing transactions. Without it, a long-running write that touches an old timestamp gets skipped.
  • Advance to observed max — advancing the watermark to max(observed updated_at) rather than horizon avoids the "advance past a row we didn't see" bug. Combined with the safety window, this is the correct advance policy.
  • Cost — one indexed range scan per poll (O(delta), not O(table)) plus one nightly PK diff (O(table)). Compared to a nightly full refresh, this is 15-minute freshness at 1% of the compute cost. Compared to log-based CDC it's operationally simpler but blind to physical deletes without the reconcile.

SQL
Topic — sql
SQL incremental-load and watermark problems

Practice →

ETL Topic — etl ETL problems on batch CDC pipelines

Practice →


3. Trigger-based CDC

BEFORE|AFTER INSERT|UPDATE|DELETE triggers writing to a shadow / audit table — the fallback when logical replication is off-limits

The mental model in one line: trigger cdc is the pattern where per-row triggers on every INSERT/UPDATE/DELETE write a corresponding row into a shadow (audit) table, and a downstream reader tails that shadow table — it captures every DML including deletes, works without any special replication grants, and pays for that capability with a bounded but non-trivial per-DML write amplification tax that senior engineers must quantify before shipping. Every senior DE has built at least one trigger CDC pipeline against a locked-down source database; the pattern is decades old and remains the correct answer for a specific slice of environments.

Iconographic trigger CDC diagram — a source table on the left with three trigger glyphs on its edge firing to a shadow audit table on the right, plus a warning chip 'trigger cost added to every DML'.

The four axes for trigger CDC.

  • Permission. CREATE TRIGGER on the source tables plus INSERT on the shadow table. This is a lower bar than logical replication but a higher bar than SELECT-only — some managed / SOC-compliance environments still refuse triggers.
  • Latency. Seconds. The trigger fires inside the source transaction so the shadow-table row is committed atomically with the business row. The reader tails the shadow table on a short poll (1-10 seconds) or via LISTEN/NOTIFY.
  • Delete handling. Native. DELETE triggers fire on physical deletes; the shadow row carries the pre-image plus op = 'D'. This is the pattern's headline advantage over timestamp CDC.
  • Source load. This is where trigger CDC pays. Every INSERT/UPDATE/DELETE now writes two rows (one to the business table, one to the shadow table). Write amplification is exactly 2×. On a write-heavy table this is a meaningful IOPS tax; on a read-heavy table it's negligible.

The shadow table shape — what every audit table needs.

  • Primary key. Auto-increment integer (audit_id BIGSERIAL) — the reader's watermark tracks this, not a timestamp. Monotonic, gap-free, cheap to index.
  • Business PK. The primary key of the source row so the reader can identify which row changed.
  • Op column. 'I', 'U', 'D' (or 'INSERT', 'UPDATE', 'DELETE') — captures the DML kind.
  • Before/after images. For UPDATE, store both the old and new row (JSONB or separate columns) so downstream can emit change events. For DELETE, only the old image.
  • Metadata. changed_at TIMESTAMPTZ DEFAULT clock_timestamp(), changed_by TEXT DEFAULT current_user, txid BIGINT DEFAULT txid_current() — audit-grade provenance.

The trigger function — six lines that pay the tax.

  • BEFORE vs AFTER. Use AFTER triggers for CDC — the row values in NEW/OLD are the committed values. BEFORE triggers see uncommitted values that another BEFORE trigger might still mutate.
  • FOR EACH ROW vs FOR EACH STATEMENT. Always FOR EACH ROW for CDC — you need per-row change events. Statement-level triggers are for enforcing schema-wide invariants.
  • The function body. Insert one row into the shadow table with TG_OP, PK, before-image (from OLD), after-image (from NEW), and metadata. Return NULL for AFTER triggers.
  • The tax. ~50-150 microseconds per DML (varies by row size). Multiply by DML rate to get the CPU cost.

The reader — polling vs LISTEN/NOTIFY.

  • Polling. Every N seconds, SELECT ... FROM audit_orders WHERE audit_id > watermark ORDER BY audit_id LIMIT M. Simple, robust. Latency = poll interval.
  • LISTEN/NOTIFY. The AFTER trigger also runs pg_notify('audit_orders_channel', ...). The reader LISTENs and processes as notifications arrive. Sub-second latency; more complex failure semantics (missed notifications on reader disconnect require polling backup).

Common interview probes on trigger CDC.

  • "What's the source-DB cost of trigger CDC?" — required answer: exactly 2× write amplification per DML.
  • "How do you handle updates in the shadow table?" — before/after images (OLD and NEW).
  • "When is trigger CDC preferable to log-based?" — when logical replication is not grantable but CREATE TRIGGER is.
  • "How do you monitor trigger CDC?" — shadow-table lag (max(audit_id) - watermark); alert if lag grows.

Worked example — Postgres audit trigger for the orders table

Detailed explanation. The canonical trigger CDC setup: an audit_orders shadow table, an AFTER INSERT OR UPDATE OR DELETE trigger that writes one shadow row per DML, and an Airflow tail that reads the shadow table by monotonic audit_id. Build the whole thing from scratch.

  • Shadow table. audit_orders(audit_id BIGSERIAL PK, op CHAR(1), order_id BIGINT, before JSONB, after JSONB, changed_at, changed_by, txid).
  • Trigger. AFTER INSERT OR UPDATE OR DELETE ON orders, FOR EACH ROW.
  • Reader. Airflow task tailing audit_orders WHERE audit_id > watermark.

Question. Write the shadow-table DDL, the trigger function, the trigger binding, and the reader task.

Input.

Object Purpose
audit_orders shadow table with pre/post images
audit_orders_trigger() trigger function; runs once per row
trg_orders_audit trigger binding on orders
tail_audit_orders Airflow reader task

Code.

-- 1. Shadow / audit table
CREATE TABLE public.audit_orders (
    audit_id    BIGSERIAL   PRIMARY KEY,
    op          CHAR(1)     NOT NULL,           -- 'I', 'U', 'D'
    order_id    BIGINT      NOT NULL,
    before      JSONB,                          -- old row image (UPDATE, DELETE)
    after       JSONB,                          -- new row image (INSERT, UPDATE)
    changed_at  TIMESTAMPTZ NOT NULL DEFAULT clock_timestamp(),
    changed_by  TEXT        NOT NULL DEFAULT current_user,
    txid        BIGINT      NOT NULL DEFAULT txid_current()
);

CREATE INDEX idx_audit_orders_audit_id ON public.audit_orders (audit_id);

-- 2. Trigger function — six lines that do all the work
CREATE OR REPLACE FUNCTION public.audit_orders_trigger()
RETURNS TRIGGER AS $$
BEGIN
    IF (TG_OP = 'INSERT') THEN
        INSERT INTO public.audit_orders(op, order_id, after)
        VALUES ('I', NEW.id, to_jsonb(NEW));
    ELSIF (TG_OP = 'UPDATE') THEN
        INSERT INTO public.audit_orders(op, order_id, before, after)
        VALUES ('U', NEW.id, to_jsonb(OLD), to_jsonb(NEW));
    ELSIF (TG_OP = 'DELETE') THEN
        INSERT INTO public.audit_orders(op, order_id, before)
        VALUES ('D', OLD.id, to_jsonb(OLD));
    END IF;
    RETURN NULL;    -- AFTER triggers ignore return value
END;
$$ LANGUAGE plpgsql;

-- 3. Trigger binding — one trigger, all three DML kinds
CREATE TRIGGER trg_orders_audit
AFTER INSERT OR UPDATE OR DELETE ON public.orders
FOR EACH ROW EXECUTE FUNCTION public.audit_orders_trigger();
Enter fullscreen mode Exit fullscreen mode
# 4. Airflow reader — tails audit_orders by monotonic audit_id
import psycopg2
from airflow.decorators import task

@task
def tail_audit_orders(ti):
    conn = psycopg2.connect("host=db-primary port=5432 dbname=production user=cdc_reader")
    with conn.cursor() as cur:
        # Watermark stored transactionally in Postgres
        cur.execute("SELECT COALESCE(MAX(audit_id), 0) FROM cdc_watermarks_audit WHERE table_name = 'audit_orders'")
        watermark = cur.fetchone()[0]

        cur.execute("""
            SELECT audit_id, op, order_id, before, after, changed_at
            FROM   public.audit_orders
            WHERE  audit_id > %s
            ORDER  BY audit_id ASC
            LIMIT  10000
        """, (watermark,))
        rows = cur.fetchall()

    for row in rows:
        emit_to_downstream(row)

    if rows:
        new_watermark = rows[-1][0]
        with conn.cursor() as cur:
            cur.execute("""
                INSERT INTO cdc_watermarks_audit(table_name, max_audit_id)
                VALUES ('audit_orders', %s)
                ON CONFLICT (table_name) DO UPDATE
                  SET max_audit_id = EXCLUDED.max_audit_id
            """, (new_watermark,))
        conn.commit()

    return len(rows)
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The audit_orders shadow table has a BIGSERIAL audit_id primary key — this is the reader's monotonic watermark. Timestamps are unreliable (clock skew, transaction commit order); a sequence is not. The before/after JSONB columns give downstream consumers the full row images without a schema-per-column CDC contract.
  2. The trigger function branches on TG_OP. For INSERT, only after is meaningful. For UPDATE, we store both before (from OLD) and after (from NEW) — this is the change payload downstream consumers need to compute deltas. For DELETE, only before (from OLD) — the row is gone.
  3. to_jsonb(NEW) and to_jsonb(OLD) capture the entire row as a JSONB blob. This is preferable to per-column shadow tables (which require a schema migration to audit_orders every time you add a column to orders).
  4. The trigger is AFTER INSERT OR UPDATE OR DELETE ... FOR EACH ROW — one trigger definition covers all three DML kinds. AFTER guarantees the shadow row commits atomically with the business row: either both commit or both roll back, no dual-write problem.
  5. The Airflow reader tails by audit_id > watermark. Because audit_id is a BIGSERIAL, it's monotonic and gap-free (within a single writer; gaps can appear from rolled-back transactions but never as "audit_id < watermark that appears later"). This makes watermark advancement trivially correct.

Output.

DML event Row in audit_orders audit_id
INSERT INTO orders VALUES (1, ...) ('I', 1, NULL, {...}) 1
UPDATE orders SET status='shipped' WHERE id=1 ('U', 1, {before}, {after}) 2
DELETE FROM orders WHERE id=1 ('D', 1, {before}, NULL) 3

Rule of thumb. For any trigger CDC deployment, use a BIGSERIAL audit_id as the watermark, store full row images as JSONB (not per-column), and make the trigger AFTER ... FOR EACH ROW. These three choices remove entire classes of correctness bugs.

Worked example — quantifying the write-amplification tax

Detailed explanation. Every senior architect must quantify the source-DB cost before signing off on trigger CDC. The tax is deterministic: every INSERT/UPDATE/DELETE writes one extra row. Multiply by DML rate to get the CPU + IOPS cost. Walk through the calculation for a moderate-write-rate orders table.

  • DML rate. 500 writes/sec (mixed INSERT/UPDATE/DELETE).
  • Row size. ~500 bytes for orders; ~1.5 KB for the JSONB audit_orders row (both images + metadata).
  • Write amplification. 500 writes/sec → 500 shadow-table inserts/sec.

Question. Quantify the added CPU, IOPS, and storage cost of trigger CDC on the orders table.

Input.

Metric orders (base) audit_orders (added)
Write rate 500/s 500/s
Row size 500 B 1500 B
IOPS per row 1 WAL write 1 WAL write
Storage per day 21.6 GB 64.8 GB

Code.

-- Measure the actual trigger cost on a running system
BEGIN;
SELECT clock_timestamp() AS t0 \gset

-- Run 10,000 UPDATEs
DO $$
BEGIN
    FOR i IN 1..10000 LOOP
        UPDATE public.orders SET status = 'benchmark' WHERE id = 1;
    END LOOP;
END$$;

SELECT clock_timestamp() AS t1 \gset
COMMIT;

-- Compute wall-clock cost per DML with trigger
SELECT extract(epoch FROM (:'t1'::timestamptz - :'t0'::timestamptz)) / 10000.0 AS sec_per_dml_with_trigger;

-- Disable the trigger and repeat
ALTER TABLE public.orders DISABLE TRIGGER trg_orders_audit;

-- ... rerun the 10,000 UPDATEs ...

-- Re-enable
ALTER TABLE public.orders ENABLE TRIGGER trg_orders_audit;
Enter fullscreen mode Exit fullscreen mode
# Storage projection — 30 days of trigger CDC
DML_PER_SEC   = 500
AUDIT_ROW_KB  = 1.5
SECONDS_DAY   = 86400

daily_gb  = (DML_PER_SEC * SECONDS_DAY * AUDIT_ROW_KB) / (1024 * 1024)
monthly_gb = daily_gb * 30
print(f"Trigger CDC storage: {daily_gb:.1f} GB/day, {monthly_gb:.1f} GB/month")
# → Trigger CDC storage: 61.9 GB/day, 1856.4 GB/month
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The benchmark measures the wall-clock cost per UPDATE with and without the trigger. In our measurements against a modest 4-vCPU Postgres, the trigger adds roughly 80-150 microseconds per DML. At 500 DML/sec that's 40-75 ms/sec of extra CPU — under 10% of one core.
  2. IOPS-wise, each shadow-table INSERT generates one WAL record and (eventually) one heap-page flush. The doubling is exact: 500 base DML → 500 shadow inserts → 1000 total WAL records/sec.
  3. Storage is the sneakiest cost. At 500 DML/sec × 1.5 KB per shadow row × 86400 sec/day = ~62 GB/day. Trigger CDC accumulates roughly 2 TB/month of audit data. This forces a retention policy: TTL the shadow table, archive to S3, or partition by day and drop old partitions.
  4. The trigger cost scales linearly with DML rate. At 5000 DML/sec the CPU cost hits 40-75% of one core — still tractable but no longer negligible. At 50000 DML/sec you cannot afford trigger CDC on a shared box.
  5. Compared to log-based CDC (which adds ~0 source-table load), trigger CDC is measurably more expensive on write-heavy workloads. Compared to timestamp CDC (which polls but doesn't add per-DML cost), it's measurably slower but catches deletes.

Output.

Cost dimension Trigger CDC Log-based CDC Timestamp CDC
CPU per DML +80-150 μs ~0 (WAL was written anyway) 0 (polling only)
IOPS amplification 1× (polling scan)
Storage (shadow / slot) ~60 GB/day @ 500 DML/s ~0 (WAL retention only) 0
Retention obligation manual TTL required slot lag monitoring none

Rule of thumb. Before shipping trigger CDC, run the wall-clock benchmark and compute the daily shadow-table storage. If daily shadow growth exceeds 10% of the source-table daily growth or the wall-clock trigger cost exceeds 5% of the source-DB CPU budget, negotiate for log-based CDC instead.

Worked example — LISTEN/NOTIFY for sub-second latency

Detailed explanation. The default trigger CDC reader polls the shadow table every N seconds. For sub-second latency, add pg_notify to the trigger function and a LISTEN client on the reader side. The result is push-based: writes wake the reader within milliseconds. The trade-off is more complex failure semantics — missed notifications on reader disconnect require a polling backup.

  • The push. Trigger runs pg_notify('audit_orders_channel', audit_id::text) after inserting the shadow row.
  • The pull backup. Reader still polls every 10 seconds as a safety net against missed notifications.
  • The reader. LISTEN on the channel; wake on notification; process by audit_id watermark.

Question. Extend the trigger and reader to use LISTEN/NOTIFY for push-based delivery with a polling backup.

Input.

Component Change
Trigger function add pg_notify after shadow insert
Reader run LISTEN; wake on notification; poll every 10s as backup
Notification payload audit_id as text (short; fits in NOTIFY 8KB limit)
Latency sub-second (was: 1-10s poll)

Code.

-- Extended trigger function — pushes a notification after the shadow insert
CREATE OR REPLACE FUNCTION public.audit_orders_trigger()
RETURNS TRIGGER AS $$
DECLARE
    new_audit_id BIGINT;
BEGIN
    IF (TG_OP = 'INSERT') THEN
        INSERT INTO public.audit_orders(op, order_id, after)
        VALUES ('I', NEW.id, to_jsonb(NEW))
        RETURNING audit_id INTO new_audit_id;
    ELSIF (TG_OP = 'UPDATE') THEN
        INSERT INTO public.audit_orders(op, order_id, before, after)
        VALUES ('U', NEW.id, to_jsonb(OLD), to_jsonb(NEW))
        RETURNING audit_id INTO new_audit_id;
    ELSIF (TG_OP = 'DELETE') THEN
        INSERT INTO public.audit_orders(op, order_id, before)
        VALUES ('D', OLD.id, to_jsonb(OLD))
        RETURNING audit_id INTO new_audit_id;
    END IF;

    -- Wake any LISTEN readers
    PERFORM pg_notify('audit_orders_channel', new_audit_id::text);

    RETURN NULL;
END;
$$ LANGUAGE plpgsql;
Enter fullscreen mode Exit fullscreen mode
# Reader — LISTEN + poll backup
import psycopg2
import psycopg2.extensions
import select
import time

def run_reader():
    conn = psycopg2.connect("host=db-primary port=5432 dbname=production user=cdc_reader")
    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)

    with conn.cursor() as cur:
        cur.execute("LISTEN audit_orders_channel;")

    last_poll = time.time()
    POLL_BACKUP_SEC = 10

    while True:
        # Wait for either a notification or the poll-backup timeout
        elapsed  = time.time() - last_poll
        timeout  = max(0.0, POLL_BACKUP_SEC - elapsed)
        ready, _, _ = select.select([conn], [], [], timeout)

        if ready:
            conn.poll()
            while conn.notifies:
                notify = conn.notifies.pop(0)
                # notify.payload contains the audit_id as text; we ignore it
                # and just process everything > watermark (deduplication)
        else:
            # timeout — polling backup
            pass

        # Either path: drain new rows since watermark
        drain_shadow_table(conn)
        last_poll = time.time()

def drain_shadow_table(conn):
    with conn.cursor() as cur:
        cur.execute("SELECT COALESCE(MAX(audit_id), 0) FROM cdc_watermarks_audit WHERE table_name='audit_orders'")
        watermark = cur.fetchone()[0]
        cur.execute("""
            SELECT audit_id, op, order_id, before, after
            FROM   public.audit_orders
            WHERE  audit_id > %s
            ORDER  BY audit_id
            LIMIT  10000
        """, (watermark,))
        for row in cur:
            emit_to_downstream(row)
        # advance watermark ...
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The extended trigger function captures the newly-inserted audit_id via RETURNING audit_id INTO new_audit_id, then calls pg_notify('audit_orders_channel', new_audit_id::text). The notification fires after the shadow row is committed, so any listener that reacts is guaranteed to see the row.
  2. The reader runs LISTEN audit_orders_channel and blocks on select.select([conn], ...) waiting for either a notification or a timeout. On notification wakeup, the connection's conn.notifies list is drained. On timeout, the polling backup runs anyway.
  3. Whether wakeup is push (notification) or pull (timeout), the reader always calls drain_shadow_table, which queries WHERE audit_id > watermark ORDER BY audit_id LIMIT 10000. This means missed notifications are recovered by the next poll — the polling backup is the safety net.
  4. The notification payload (new_audit_id::text) is essentially unused — the reader could just as well drain on any wakeup. We include it for debug logging. Postgres limits NOTIFY payload to 8KB; putting the whole row in there is a temptation to resist.
  5. The failure story: (a) reader crash → notifications sent while reader is down are lost, but polling backup on restart catches up; (b) network partition → LISTEN reconnects, polling catches up; (c) pg_notify failure → transaction rolls back (both business and shadow rows). No lost data in any case.

Output.

Latency profile Push (LISTEN) Poll backup
Steady state ~50 ms never fires
After reader restart polling until next NOTIFY catches missed events
Network partition LISTEN reconnects drives correctness
Notification lost not covered by LISTEN alone catches within 10s

Rule of thumb. For sub-second trigger CDC, add pg_notify inside the trigger and keep the polling backup. Never rely on notifications alone — they can be silently dropped on client-side buffer overflow or reader disconnect.

Senior interview question on trigger CDC

A senior interviewer might ask: "Your managed Postgres tier doesn't offer logical replication but does allow CREATE TRIGGER. Design a trigger CDC pipeline for a 500 DML/sec orders table with sub-second latency to Kafka. Include the shadow table, the trigger, the reader, the write-amplification budget, and the on-call runbook when the shadow table grows unboundedly."

Solution Using an audit table + LISTEN/NOTIFY reader + partitioned retention

-- 1. Partitioned shadow table — daily partitions for cheap retention
CREATE TABLE public.audit_orders (
    audit_id    BIGSERIAL   NOT NULL,
    op          CHAR(1)     NOT NULL,
    order_id    BIGINT      NOT NULL,
    before      JSONB,
    after       JSONB,
    changed_at  TIMESTAMPTZ NOT NULL DEFAULT clock_timestamp(),
    changed_by  TEXT        NOT NULL DEFAULT current_user,
    txid        BIGINT      NOT NULL DEFAULT txid_current(),
    PRIMARY KEY (audit_id, changed_at)
) PARTITION BY RANGE (changed_at);

-- Create today's + tomorrow's partition; automate creation via pg_partman
CREATE TABLE public.audit_orders_20260704
  PARTITION OF public.audit_orders
  FOR VALUES FROM ('2026-07-04') TO ('2026-07-05');
CREATE TABLE public.audit_orders_20260705
  PARTITION OF public.audit_orders
  FOR VALUES FROM ('2026-07-05') TO ('2026-07-06');

-- 2. Trigger + notification
CREATE OR REPLACE FUNCTION public.audit_orders_trigger()
RETURNS TRIGGER AS $$
DECLARE
    new_audit_id BIGINT;
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO public.audit_orders(op, order_id, after)
        VALUES ('I', NEW.id, to_jsonb(NEW))
        RETURNING audit_id INTO new_audit_id;
    ELSIF TG_OP = 'UPDATE' THEN
        INSERT INTO public.audit_orders(op, order_id, before, after)
        VALUES ('U', NEW.id, to_jsonb(OLD), to_jsonb(NEW))
        RETURNING audit_id INTO new_audit_id;
    ELSIF TG_OP = 'DELETE' THEN
        INSERT INTO public.audit_orders(op, order_id, before)
        VALUES ('D', OLD.id, to_jsonb(OLD))
        RETURNING audit_id INTO new_audit_id;
    END IF;
    PERFORM pg_notify('audit_orders_channel', new_audit_id::text);
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_orders_audit
AFTER INSERT OR UPDATE OR DELETE ON public.orders
FOR EACH ROW EXECUTE FUNCTION public.audit_orders_trigger();

-- 3. Retention — drop partitions older than 7 days (after Kafka has committed)
DROP TABLE IF EXISTS public.audit_orders_20260627;
Enter fullscreen mode Exit fullscreen mode
# 4. Kafka producer that tails audit_orders (LISTEN + poll)
# (elided; identical to the reader in the LISTEN example above,
#  with kafka_producer.send() replacing emit_to_downstream())
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Layer Config Result
Partitioning daily RANGE(changed_at) DROP PARTITION = O(1) delete
Retention 7 days 62 GB/day × 7 = ~430 GB shadow footprint
Notification pg_notify per trigger ~50 ms end-to-end latency
Backup poll 10 s catches missed notifications
Reader watermark max(audit_id) in cdc_watermarks_audit monotonic; gap-tolerant

After the rollout, the orders DML rate stays at 500/s; audit_orders receives 500/s; Kafka sees each event within 50 ms of source commit; the 7-day retention keeps shadow-table footprint bounded at ~430 GB. The DBA sees no logical replication footprint; the SOC auditor sees a durable audit trail with changed_by and txid.

Output:

Metric Value
End-to-end latency (source commit → Kafka) ~50 ms
Source CPU overhead (trigger) ~8% of one core
Shadow-table daily growth ~62 GB
Shadow-table retention 7 days
Delete detection native (DELETE trigger fires)
DBA permission required CREATE TRIGGER only

Why this works — concept by concept:

  • AFTER trigger + shadow table — the AFTER timing guarantees the shadow row commits atomically with the business row; no dual-write. The shadow table is the durable ordered log the reader tails.
  • RANGE partitioning by changed_at — retention becomes DROP TABLE audit_orders_YYYYMMDD, which is O(1) rather than the O(rows) cost of DELETE FROM audit_orders WHERE changed_at < .... Partitioning is non-negotiable at 500 DML/s.
  • pg_notify inside the trigger — push-based wake-up for sub-second latency. The 8KB payload cap is respected by sending only audit_id::text; the reader drains the actual rows.
  • Backup poll every 10s — LISTEN can miss notifications on client-side buffer overflow or reader restart. The poll is the correctness safety net; LISTEN is the latency optimisation on top.
  • Cost — 2× write amplification on the source (deterministic), ~8% CPU on the trigger, ~430 GB storage for 7-day retention, LISTEN + polling reader. Every cost is bounded and quantified. The eliminated cost is the "we can't get logical replication" outage.

SQL
Topic — sql
SQL trigger and audit-table problems

Practice →

Design Topic — design Design problems on audit and CDC systems

Practice →


4. Log-based CDC — WAL / binlog / oplog

log-based cdc reads the database's write-ahead log directly — zero source-table load, captures every DML, and is the 2026 default when the DBA cooperates

The mental model in one line: log-based cdc is the pattern where a downstream reader tails the database's internal write-ahead log (Postgres WAL, MySQL binlog, Mongo oplog) via the native replication protocol, decodes each record into a typed change event, and forwards it — the source-table load is essentially zero because the WAL is written for durability anyway, deletes are captured natively as tombstone events, ordering is guaranteed by the log sequence, and the pattern is the default answer for any greenfield deployment where logical replication is grantable. Debezium, AWS DMS, Fivetran HVR, and native connectors (pglogical, wal2json, MongoDB change streams) all implement this pattern.

Iconographic log-based CDC diagram — a source database with a WAL log-tape ribbon exiting its side, a reader following the tape, and a downstream JSON stream with deletes preserved.

The four axes for log-based CDC.

  • Permission. Postgres: wal_level = logical + REPLICATION role + replication slot. MySQL: log_bin=ON, binlog_format=ROW, binlog_row_image=FULL, replication user. MongoDB: replica-set member with oplog read. All three require DBA cooperation and (for Postgres/MySQL) a server restart to change wal_level / log_bin.
  • Latency. Sub-second in steady state. The reader lags the primary by the network hop + decode time — typically 50-200 ms.
  • Delete handling. Native and complete. DELETE events carry the pre-image; the reader emits a tombstone event that downstream consumers use to invalidate caches, remove rows from search indices, etc.
  • Source load. Essentially zero on the source tables — the WAL is written for durability regardless of CDC. The reader adds ~2-5% CPU on the DB host for the replication decoder plus ~10-50 MB memory. Compared to trigger CDC's 2× write amplification, this is a step-change improvement.

The WAL / binlog / oplog anatomy.

  • Postgres WAL. The write-ahead log records every heap-page change. Logical decoding (via pgoutput or wal2json plugin) converts physical WAL records into logical row-level events. The output plugin choice determines the format (Debezium uses pgoutput; some legacy stacks use wal2json).
  • MySQL binlog. Row-based binlog format (binlog_format=ROW) writes one binlog event per changed row. binlog_row_image=FULL includes both before and after images. Debezium's MySQL connector reads via the standard replication protocol.
  • MongoDB oplog. A capped collection (local.oplog.rs) recording every write. MongoDB 3.6+ exposes a higher-level change stream API that wraps the oplog and adds resumability tokens.

The replication slot / binlog position / resume token — the durable position tracker.

  • Postgres replication slot. Named durable state (pg_create_logical_replication_slot) that tracks the reader's last-confirmed WAL position (LSN). Postgres retains WAL until the slot's confirmed_flush_lsn advances past it — safe against reader downtime, dangerous if the reader is offline indefinitely (WAL fills the disk).
  • MySQL GTID / binlog file+position. Global Transaction ID (GTID) or binlog file + position tracks the reader's position. Binlog retention is time-based (binlog_expire_logs_seconds).
  • MongoDB resume token. Opaque binary token returned by change streams; feed it back on reconnect to resume.

The snapshot bootstrap — how the reader learns about pre-existing rows.

  • The problem. The WAL only contains changes from the moment CDC starts. What about the 100 million rows already in the table when CDC was enabled?
  • The solution. Debezium (and DMS) run a consistent snapshot on startup: SET TRANSACTION SNAPSHOT (Postgres) or FLUSH TABLES WITH READ LOCK briefly (MySQL) to record a point-in-time, SELECT every row from every configured table, emit a synthetic INSERT event per row, record the WAL position at snapshot start, then switch to tailing the WAL from that position.
  • The trade-off. Snapshots are I/O-heavy for large tables. Options: initial (snapshot every table), schema_only (skip snapshot; only new changes captured — useful if you don't need history), initial_only (snapshot then stop — useful for one-time backfills), never (no snapshot; assume you've bootstrapped some other way).

Common interview probes on log-based CDC.

  • "How does log-based CDC handle deletes?" — required answer: tombstone events with pre-image.
  • "What's the source load?" — required answer: ~0 on source tables (WAL is written anyway) + 2-5% CPU for the reader.
  • "How do you bootstrap the initial data?" — snapshot mode: consistent snapshot then tail from snapshot LSN.
  • "What's a replication slot? What are its risks?" — durable position tracker; WAL retention on offline reader can fill disk.
  • "How do you replay the last hour of changes?" — reset the slot to an earlier LSN; re-emit from there. On managed DBs, this is often not possible (slots are managed).

Worked example — Postgres logical replication to a downstream JSON stream

Detailed explanation. The canonical log-based Postgres CDC setup: enable wal_level = logical, create a REPLICATION role, create a PUBLICATION for the tables you care about, and a Debezium connector consumes the WAL via pgoutput and publishes JSON to Kafka. Walk through every piece.

  • Enable. wal_level = logical, restart Postgres.
  • Grant. CREATE ROLE cdc_reader WITH LOGIN REPLICATION PASSWORD ...; GRANT SELECT ON ....
  • Publication. CREATE PUBLICATION orders_pub FOR TABLE public.orders.
  • Slot. Debezium creates the replication slot on startup.
  • Connector. Debezium Postgres connector with plugin.name=pgoutput, snapshot.mode=initial.

Question. Provide the Postgres-side setup and the Debezium connector config, then show the change-event shape that lands in Kafka.

Input.

Component Value
Postgres version 16
Plugin pgoutput (built-in)
Publication orders_pub for public.orders
Slot debezium_orders
Snapshot mode initial

Code.

-- 1. postgresql.conf — requires restart
-- wal_level             = logical
-- max_replication_slots = 10
-- max_wal_senders       = 10

-- 2. Postgres-side setup (one-time)
CREATE ROLE cdc_reader WITH LOGIN REPLICATION PASSWORD 'strong-secret';
GRANT SELECT ON public.orders TO cdc_reader;
GRANT USAGE  ON SCHEMA public   TO cdc_reader;

CREATE PUBLICATION orders_pub FOR TABLE public.orders;

-- Verify
SELECT * FROM pg_publication;
SELECT * FROM pg_publication_tables WHERE pubname = 'orders_pub';

-- 3. pg_hba.conf — allow the replication user to open a replication connection
-- host    replication    cdc_reader    10.0.0.0/8    scram-sha-256
Enter fullscreen mode Exit fullscreen mode
# 4. Debezium Postgres connector (Kafka Connect JSON config)
name: orders-cdc
config:
  connector.class: io.debezium.connector.postgresql.PostgresConnector

  # Source
  plugin.name: pgoutput
  database.hostname: db-primary.internal
  database.port: 5432
  database.user: cdc_reader
  database.password: ${env:CDC_PASSWORD}
  database.dbname: production

  # CDC control
  slot.name: debezium_orders
  publication.name: orders_pub
  snapshot.mode: initial
  snapshot.isolation.mode: repeatable_read
  table.include.list: public.orders
  heartbeat.interval.ms: 10000

  # Kafka
  topic.prefix: prod
  key.converter: org.apache.kafka.connect.json.JsonConverter
  value.converter: org.apache.kafka.connect.json.JsonConverter
  value.converter.schemas.enable: false

  # Delete handling — emit tombstones (Kafka log-compaction friendly)
  tombstones.on.delete: true
Enter fullscreen mode Exit fullscreen mode
// 5. Sample Debezium change event for an UPDATE
{
  "before": {"id": 42, "customer_id": 7, "total_cents": 1500, "status": "pending",  "updated_at": 1720051200000000},
  "after":  {"id": 42, "customer_id": 7, "total_cents": 1500, "status": "shipped",  "updated_at": 1720051260000000},
  "source": {
    "version": "2.5.0.Final",
    "connector": "postgresql",
    "name": "prod",
    "ts_ms": 1720051260123,
    "snapshot": "false",
    "db": "production",
    "schema": "public",
    "table": "orders",
    "txId": 89123456,
    "lsn": 24591040,
    "xmin": null
  },
  "op": "u",
  "ts_ms": 1720051260200,
  "transaction": null
}
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The postgresql.conf change (wal_level = logical) is the one setting the DBA must grant. max_replication_slots caps the number of concurrent readers; max_wal_senders caps the number of concurrent replication connections. Both need modest room above the number of connectors you plan to run.
  2. The PUBLICATION object scopes which tables the reader can see. FOR TABLE public.orders publishes only that table; FOR ALL TABLES publishes everything (dangerous — schema DDL now enters the WAL). Prefer explicit lists.
  3. The Debezium connector opens a replication connection as cdc_reader, creates the replication slot debezium_orders on first start (or reuses it on restart), and requests the pgoutput decoder. The connector translates each WAL record into a Debezium change event with before, after, source, and op fields.
  4. snapshot.mode = initial triggers a one-time consistent snapshot on the connector's first start: SET TRANSACTION SNAPSHOT freezes a point-in-time view; the connector SELECTs every row from every configured table, emits a snapshot=true INSERT event per row, records the LSN at snapshot start, then switches to tailing the WAL from that LSN. Snapshotting a 2B-row table is measured in hours; plan capacity accordingly.
  5. Every downstream change event lands in a Kafka topic named prod.public.orders with the JSON shape shown. op = c/u/d/r for create/update/delete/read (snapshot rows). Tombstones (a Kafka message with a null value) for deletes enable Kafka log compaction — old versions of the same key are garbage-collected.

Output.

Event kind op before after Notes
INSERT c null full row Create event
UPDATE u old row new row Both images
DELETE d old row null Tombstone follows
Snapshot row r null full row source.snapshot = true

Rule of thumb. For any Postgres log-based CDC deployment, use plugin.name = pgoutput (built-in, no plugin install), snapshot.mode = initial (safe default), and tombstones.on.delete = true (enables Kafka compaction). Set heartbeat.interval.ms = 10000 so quiet tables don't cause the slot to hold WAL indefinitely.

Worked example — replication slot lag monitoring and disk-full defense

Detailed explanation. The single biggest operational hazard of log-based Postgres CDC is a stalled reader combined with a busy source. Postgres will retain WAL indefinitely until the slot's confirmed_flush_lsn advances — if the reader is offline for a week, the WAL directory fills up and the primary crashes. Every senior architect ships slot-lag monitoring on day one.

  • The mechanism. pg_replication_slots.confirmed_flush_lsn is the highest LSN the reader has acknowledged. WAL between that LSN and pg_current_wal_lsn() is retained.
  • The alert. Slot lag in bytes > 1 GB, or slot inactive for > 15 min. Both are early-warning signals.
  • The defense. max_slot_wal_keep_size (Postgres 13+) caps how much WAL a slot can retain — beyond that, the slot is invalidated and the reader must resnapshot. Prevents disk-full crashes at the cost of losing continuity.

Question. Write the monitoring query, the alert rule, and the disk-full defense.

Input.

Component Value
Alert threshold slot lag > 1 GB or inactive > 15 min
Defense max_slot_wal_keep_size = 20GB
Runbook check reader health → restart connector → last resort: drop slot + resnapshot

Code.

-- Monitoring query — run every 60 seconds
SELECT
    slot_name,
    active,
    active_pid,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag_pretty,
    pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)                 AS lag_bytes,
    age(clock_timestamp(), pg_stat_get_replication_slot_stat(slot_name).stats_reset) AS since_last_ack
FROM   pg_replication_slots
WHERE  slot_type = 'logical';

-- Disk-full defense — cap slot retention (Postgres 13+)
ALTER SYSTEM SET max_slot_wal_keep_size = '20GB';
SELECT pg_reload_conf();
Enter fullscreen mode Exit fullscreen mode
# Prometheus exporter
import psycopg2, time
from prometheus_client import start_http_server, Gauge

g_lag_bytes = Gauge('pg_slot_lag_bytes',     'replication slot lag', ['slot'])
g_active    = Gauge('pg_slot_active',        '1 if reader connected',['slot'])

def scrape():
    conn = psycopg2.connect("host=db-primary port=5432 dbname=production user=monitor")
    conn.autocommit = True
    cur = conn.cursor()
    cur.execute("""
        SELECT slot_name,
               active::int,
               pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
        FROM   pg_replication_slots
        WHERE  slot_type = 'logical'
    """)
    for slot_name, active, lag_bytes in cur.fetchall():
        g_lag_bytes.labels(slot_name).set(lag_bytes)
        g_active.labels(slot_name).set(active)
    conn.close()

if __name__ == "__main__":
    start_http_server(9128)
    while True:
        scrape()
        time.sleep(30)
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The monitoring query joins pg_replication_slots with pg_current_wal_lsn() and computes the byte-difference — the amount of WAL Postgres is holding on the slot's behalf. pg_wal_lsn_diff handles the LSN arithmetic; the result is the "lag" that predicts disk-full risk.
  2. The Prometheus exporter turns the per-slot lag into a labelled gauge (pg_slot_lag_bytes{slot="debezium_orders"}). The on-call alert is pg_slot_lag_bytes > 1073741824 for 5m (1 GB sustained) — sustained lag means the reader is genuinely behind, not a momentary spike.
  3. A companion alert pg_slot_active{slot=~".+"} == 0 for 15m fires if the reader is disconnected for 15 minutes. This catches the "connector crashed but no alert" case where the slot is idle and lag grows quietly.
  4. max_slot_wal_keep_size = '20GB' is the disk-full defense: if a slot falls more than 20 GB behind, Postgres invalidates the slot (wal_status = 'lost'). The reader will fail on reconnect and must resnapshot. This trades continuity for cluster survival — a good trade.
  5. The runbook when the alert fires: (a) check if the Debezium connector is running — restart if dead; (b) check Kafka consumer lag — if Kafka is behind, the connector is emitting fine but can't ack; (c) if slot is wal_status = lost, drop the slot and re-create with snapshot.mode = initial to rebootstrap.

Output.

Slot state active lag_bytes Runbook
Healthy true < 100 MB none
Warning true 100 MB - 1 GB inspect connector logs
Alert true > 1 GB page on-call; find slow consumer
Idle false any reader down; restart connector
Lost any (n/a) slot invalidated; drop + resnapshot

Rule of thumb. Ship slot-lag + slot-active monitoring the same day you enable logical replication. Set max_slot_wal_keep_size to a value less than 25% of your WAL disk. A logical replication slot without monitoring is a 3 AM disk-full incident waiting to happen.

Worked example — MySQL binlog CDC with GTID resume

Detailed explanation. MySQL's log-based CDC story is the binlog. With binlog_format=ROW and binlog_row_image=FULL, each row-level change lands in the binlog with both before and after images. Debezium's MySQL connector reads via the standard replication protocol using GTIDs (globally-unique transaction IDs) for resumability. Walk through the setup.

  • Server. log_bin=ON, binlog_format=ROW, binlog_row_image=FULL, gtid_mode=ON, enforce_gtid_consistency=ON.
  • Connector. Debezium MySQL connector with GTID-based resume.
  • Resume. On restart, the connector reads the last processed GTID from its own history topic in Kafka.

Question. Configure MySQL for log-based CDC and set up the Debezium connector.

Input.

Component Value
MySQL version 8.0
binlog format ROW
GTID ON
Connector Debezium MySQL

Code.

-- MySQL server config (my.cnf)
-- [mysqld]
-- server-id                      = 1
-- log_bin                        = /var/log/mysql/mysql-bin.log
-- binlog_format                  = ROW
-- binlog_row_image               = FULL
-- binlog_expire_logs_seconds     = 604800  -- 7 days
-- gtid_mode                      = ON
-- enforce_gtid_consistency       = ON

-- Replication user
CREATE USER 'cdc_reader'@'%' IDENTIFIED BY 'strong-secret';
GRANT REPLICATION SLAVE, REPLICATION CLIENT, SELECT, RELOAD ON *.* TO 'cdc_reader'@'%';
Enter fullscreen mode Exit fullscreen mode
# Debezium MySQL connector
name: orders-cdc-mysql
config:
  connector.class: io.debezium.connector.mysql.MySqlConnector
  database.hostname: mysql-primary.internal
  database.port: 3306
  database.user: cdc_reader
  database.password: ${env:CDC_PASSWORD}
  database.server.id: 184054           # unique replica ID
  topic.prefix: prod

  # Include the orders table
  database.include.list: production
  table.include.list: production.orders

  # Snapshot policy
  snapshot.mode: initial

  # GTID-based resume — the connector persists processed GTIDs
  # in the schema history topic and resumes from there
  schema.history.internal.kafka.topic: schema-history.orders-cdc
  schema.history.internal.kafka.bootstrap.servers: kafka:9092

  # Kafka
  key.converter: org.apache.kafka.connect.json.JsonConverter
  value.converter: org.apache.kafka.connect.json.JsonConverter
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. binlog_format=ROW writes one binlog event per changed row (vs STATEMENT which writes the DML text). Row format is mandatory for CDC — statement format is ambiguous under non-deterministic functions.
  2. binlog_row_image=FULL writes both before and after images. The default MINIMAL writes only changed columns, which downstream consumers must reconstruct — usually not worth the disk savings for a CDC-driven binlog.
  3. gtid_mode=ON gives every transaction a globally-unique ID (server_uuid:tx_num). GTIDs are the resumability primitive: on connector restart, the connector says "resume after GTID X:Y" and MySQL streams from there. Without GTIDs, resume is via binlog file + position, which is fragile across binlog rotations.
  4. The Debezium connector opens a replication connection as cdc_reader, requests the binlog starting at its last-processed GTID, and translates each ROW event into a Debezium change event. The Kafka message shape is the same as the Postgres connector — same before, after, source, op fields.
  5. The schema-history.orders-cdc Kafka topic stores the DDL history (CREATE TABLE, ALTER TABLE) that the connector has seen. On restart, the connector replays this topic to rebuild its in-memory schema, then requests the binlog from the last processed GTID. This is why the connector can decode events even after ALTER TABLE.

Output.

Aspect Postgres MySQL
WAL / log WAL binlog
Position tracker Replication slot (LSN) GTID
Retention control replication slot + max_slot_wal_keep_size binlog_expire_logs_seconds
Snapshot mechanism SET TRANSACTION SNAPSHOT FLUSH TABLES WITH READ LOCK (briefly)
Debezium plugin pgoutput native binlog protocol

Rule of thumb. For MySQL log-based CDC, always run binlog_format=ROW + binlog_row_image=FULL + gtid_mode=ON. Anything else creates operational pain. The Debezium MySQL connector then works with the same downstream Kafka contract as the Postgres connector.

Senior interview question on log-based CDC

A senior interviewer might ask: "Design log-based CDC from Postgres 16 to Kafka for a 100-table warehouse feed. Cover the Postgres config, the connector snapshot strategy, the slot-lag monitoring, the schema-evolution story (someone adds a column tomorrow), and the failure semantics when Kafka is unreachable for 30 minutes."

Solution Using Debezium + pgoutput + slot monitoring + schema-registry-backed evolution

-- 1. Postgres server config (postgresql.conf) — one restart
-- wal_level                = logical
-- max_replication_slots    = 20
-- max_wal_senders          = 20
-- max_slot_wal_keep_size   = 50GB    -- disk-full defense

-- 2. Replication role + publication for all warehouse tables
CREATE ROLE cdc_reader WITH LOGIN REPLICATION PASSWORD 'strong-secret';
GRANT USAGE ON SCHEMA public TO cdc_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO cdc_reader;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO cdc_reader;

CREATE PUBLICATION warehouse_pub FOR ALL TABLES;
Enter fullscreen mode Exit fullscreen mode
# 3. Debezium connector — full-warehouse config
name: warehouse-cdc
config:
  connector.class: io.debezium.connector.postgresql.PostgresConnector

  # Source
  plugin.name: pgoutput
  database.hostname: db-primary.internal
  database.port: 5432
  database.user: cdc_reader
  database.password: ${env:CDC_PASSWORD}
  database.dbname: production

  # CDC
  slot.name: debezium_warehouse
  publication.name: warehouse_pub
  snapshot.mode: initial
  snapshot.isolation.mode: repeatable_read
  snapshot.max.threads: 4         # parallelise the initial snapshot
  heartbeat.interval.ms: 10000
  schema.include.list: public

  # Schema-registry-backed evolution
  topic.prefix: prod
  key.converter: io.confluent.connect.avro.AvroConverter
  key.converter.schema.registry.url: http://schema-registry:8081
  value.converter: io.confluent.connect.avro.AvroConverter
  value.converter.schema.registry.url: http://schema-registry:8081

  # Delete handling + tombstones
  tombstones.on.delete: true

  # Kafka resilience
  producer.override.acks: all
  producer.override.retries: 2147483647
  producer.override.delivery.timeout.ms: 2147483647
  producer.override.max.in.flight.requests.per.connection: 5
  producer.override.enable.idempotence: true
Enter fullscreen mode Exit fullscreen mode
-- 4. Slot monitoring (Prometheus alerts)
-- pg_slot_lag_bytes{slot="debezium_warehouse"} > 5368709120 for 10m  # 5 GB
-- pg_slot_active{slot="debezium_warehouse"}    == 0            for 15m
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Concern Answer Reasoning
Postgres config wal_level=logical + max_slot_wal_keep_size=50GB disk-full defense
Publication scope FOR ALL TABLES 100-table warehouse; no per-table maintenance
Snapshot initial + max.threads=4 parallel snapshot for large tables
Schema evolution Avro + schema registry backward-compatible column adds
Kafka outage idempotent producer + infinite retries slot holds WAL until Kafka returns
Slot lag alert > 5 GB for 10 min early warning below max_slot_wal_keep_size

After deployment, the connector snapshots ~500 GB of history in ~4 hours (parallel snapshot threads), then tails at sub-second latency. A schema change (ALTER TABLE orders ADD COLUMN priority INT) auto-propagates: Debezium detects the DDL, registers a new schema version in the registry, and downstream consumers keep decoding. A 30-minute Kafka outage causes the slot to accumulate up to 50 GB of WAL; the connector reconnects on Kafka recovery and drains without data loss.

Output:

Metric Value
End-to-end latency (steady state) 100-200 ms
Source CPU overhead ~3% of the DB host
Snapshot time (500 GB) ~4 h with 4 threads
Slot lag budget 50 GB (max_slot_wal_keep_size)
Schema evolution automatic (schema registry)
Kafka outage tolerance up to 50 GB WAL retention

Why this works — concept by concept:

  • pgoutput + PUBLICATION FOR ALL TABLES — built-in Postgres decoder, no plugin install; ALL TABLES scope avoids per-table maintenance in a 100-table warehouse.
  • snapshot.mode=initial + max.threads=4 — the initial snapshot is the biggest one-time cost. Parallel threads scan multiple tables concurrently, cutting hours off the bootstrap window.
  • Avro + schema registry — schema-registry-backed Avro means a column add is a backward-compatible schema evolution, not a downstream-consumer break. JSON without a schema loses type information; Avro with a registry keeps everything typed.
  • max_slot_wal_keep_size=50GB — the disk-full defense. If Kafka is unreachable for hours, the slot can grow up to 50 GB; beyond that, the slot is invalidated (wal_status=lost) and the primary survives at the cost of losing continuity. Trade correctness for cluster survival.
  • Cost — ~3% source CPU, ~50 GB WAL retention budget, 4-hour one-time snapshot, ongoing schema-registry integration. Compared to trigger CDC's 2× write amplification and per-table trigger maintenance, log-based is dramatically cheaper at scale. Compared to timestamp CDC, it captures deletes natively and eliminates the nightly reconcile.

Streaming
Topic — streaming
Streaming CDC and Debezium problems

Practice →

Design Topic — design Design problems on log-based CDC systems

Practice →


5. Outbox pattern + hybrid

The outbox pattern solves the dual-write problem — write the business row and the event to the same table in one transaction, then let a CDC reader publish the events

The mental model in one line: the outbox pattern is a schema convention where the application writes the business row and a row into an outbox table inside the same database transaction, and a downstream reader (usually log-based CDC) tails the outbox table and publishes each row as a domain event — this collapses the dual-write problem (business row committed but event lost, or vice versa) into a single-transaction commit that either succeeds fully or fails fully. Every senior microservice architecture needs this pattern once it has more than one downstream consumer that expects typed events.

Iconographic outbox + hybrid diagram — a transaction card writing to business + outbox table atomically, log-based reader tailing the outbox, plus a hybrid box combining log-based + outbox for aggregates.

The dual-write problem — what the outbox pattern solves.

  • The naive design. Application writes to orders (Postgres), then publishes an OrderPlaced event to Kafka. Two separate systems, two separate writes.
  • The failure modes. (a) Postgres commits, Kafka publish fails → business row saved, event lost, downstream never learns. (b) Kafka publish succeeds, Postgres commits fails → downstream told about an order that doesn't exist. Both are correctness bugs.
  • The wrong fix. Two-phase commit across Postgres and Kafka — theoretically correct, operationally terrible (long-held locks, blocking on Kafka availability).
  • The outbox fix. Write the event to a Postgres outbox table in the same transaction as the business row. Both commit atomically. A downstream CDC reader (log-based, typically Debezium) tails the outbox table and publishes to Kafka. If Kafka is down, the reader retries; the source is still consistent.

The outbox table shape.

  • Primary key. id UUID or BIGSERIAL — monotonic for CDC ordering.
  • Aggregate key. aggregate_type TEXT (e.g. Order, Customer), aggregate_id TEXT — identifies which entity this event is about. Used for Kafka topic routing and partition-key derivation.
  • Event type. event_type TEXT (e.g. OrderPlaced, OrderShipped) — the semantic type of the event.
  • Payload. payload JSONB — the event body.
  • Timestamps + metadata. created_at TIMESTAMPTZ DEFAULT clock_timestamp().

The Debezium EventRouter SMT — the outbox-specific plumbing.

  • What it does. A Single Message Transform (SMT) that rewrites Debezium change events from the generic prod.public.outbox shape into prod.Order.OrderPlaced (topic per aggregate type + event type). Extracts the payload from the JSONB column so downstream consumers see typed events, not Debezium envelopes.
  • The routing. By default, one Kafka topic per aggregate_type. Configurable to route by event_type or a custom expression.
  • The partition key. The aggregate_id becomes the Kafka message key, so all events for the same aggregate land on the same partition (ordering guarantee per aggregate).
  • The delete. After the CDC reader has forwarded the outbox row, the application can safely delete it (or a scheduled job TTLs old rows). The events live in Kafka; the outbox is just the buffer.

The hybrid pattern — log-based + outbox side by side.

  • The idea. Log-based CDC captures raw source-table mutations (INSERT/UPDATE/DELETE on orders) for the warehouse. Outbox captures domain events (OrderPlaced, OrderShipped) for the event bus. Both are shipped by the same Debezium instance; they serve different consumers.
  • Why both. Raw mutations feed dashboards and audits ("show me every state change on this order"). Domain events feed microservices ("react when an order is placed"). Sharing one CDC tool but two data channels is the cheapest way to serve both.

Common interview probes on the outbox pattern.

  • "What problem does the outbox pattern solve?" — required answer: the dual-write problem between the DB and an event bus.
  • "Why not two-phase commit?" — long locks, blocks on Kafka availability, operational complexity.
  • "How does the outbox row get to Kafka?" — CDC reader (typically Debezium) tails the outbox table.
  • "How do you route different event types to different Kafka topics?" — Debezium EventRouter SMT keyed on aggregate_type or event_type.
  • "How do you clean up processed outbox rows?" — TTL job or delete-after-forward (only safe with idempotent consumers).

Worked example — outbox table + application code writing atomically

Detailed explanation. The canonical outbox setup on the application side: define the outbox table, wrap the business write and the outbox write in a single transaction, and let Debezium do the rest. Walk through the schema and the application code.

  • Schema. outbox(id, aggregate_type, aggregate_id, event_type, payload, created_at).
  • Application code. One transaction, two writes: business row + outbox row.
  • Rollback semantics. If either write fails, the whole transaction rolls back.

Question. Write the outbox schema and a Python function that places an order and emits an OrderPlaced event atomically.

Input.

Component Value
Business table public.orders
Outbox table public.outbox
Event type OrderPlaced
Aggregate Order / order_id

Code.

-- Outbox table — one shape serves all aggregate types
CREATE TABLE public.outbox (
    id              UUID         PRIMARY KEY DEFAULT gen_random_uuid(),
    aggregate_type  TEXT         NOT NULL,   -- 'Order', 'Customer', ...
    aggregate_id    TEXT         NOT NULL,   -- string form of the PK
    event_type      TEXT         NOT NULL,   -- 'OrderPlaced', 'OrderShipped', ...
    payload         JSONB        NOT NULL,   -- event body
    created_at      TIMESTAMPTZ  NOT NULL DEFAULT clock_timestamp()
);

CREATE INDEX idx_outbox_created_at ON public.outbox (created_at);
CREATE INDEX idx_outbox_aggregate  ON public.outbox (aggregate_type, aggregate_id);
Enter fullscreen mode Exit fullscreen mode
# Application code — one transaction, two writes
import json
import psycopg2
from typing import Any

def place_order(conn, customer_id: int, total_cents: int, items: list[dict[str, Any]]) -> int:
    """Insert an order and an OrderPlaced event atomically."""
    with conn:                     # implicit BEGIN / COMMIT (rollback on exception)
        with conn.cursor() as cur:
            # 1. Business write — insert the order
            cur.execute("""
                INSERT INTO public.orders(customer_id, total_cents, status)
                VALUES (%s, %s, 'pending')
                RETURNING id
            """, (customer_id, total_cents))
            order_id = cur.fetchone()[0]

            # 2. Outbox write — same transaction, atomic commit
            payload = {
                "order_id":    order_id,
                "customer_id": customer_id,
                "total_cents": total_cents,
                "items":       items,
                "placed_at":   "now",
            }
            cur.execute("""
                INSERT INTO public.outbox(aggregate_type, aggregate_id, event_type, payload)
                VALUES ('Order', %s, 'OrderPlaced', %s::jsonb)
            """, (str(order_id), json.dumps(payload)))

    # If we reach here, both writes committed. If either failed,
    # the with-block rolled back and re-raised.
    return order_id
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The outbox schema is one-table-fits-all: aggregate_type + aggregate_id identify the entity; event_type names the semantic event; payload JSONB carries the body. No per-event-type schema migration needed.
  2. The Python with conn: idiom (psycopg2's context manager) implicitly wraps the block in BEGIN ... COMMIT; any exception triggers ROLLBACK. This is the atomicity guarantee: business row and outbox row commit together or not at all.
  3. Step 1 inserts the business row and returns the generated order_id via RETURNING id. Step 2 uses that order_id as the aggregate_id for the outbox row and embeds it in the JSON payload as well.
  4. The application never talks to Kafka. That responsibility is delegated to Debezium tailing the outbox table. If Kafka is down, the application still commits normally; Debezium retries; the pipeline eventually catches up.
  5. Under Postgres MVCC, the two INSERTs are visible atomically to CDC — Debezium's snapshot isolation reads either the pre-commit state (nothing) or the post-commit state (both rows). There's no window where the business row exists but the outbox row doesn't.

Output.

Step State after
BEGIN (transaction open)
INSERT orders order_id = 42 (uncommitted)
INSERT outbox (uncommitted, references order_id 42)
COMMIT Both rows durably committed; Debezium sees them in WAL
Debezium tail Outbox row → Kafka topic prod.Order.OrderPlaced
(Debezium never sees an orders row without matching outbox) Atomicity preserved

Rule of thumb. For any dual-write scenario (DB + event bus, DB + cache, DB + search), replace the second write with an outbox row in the same transaction, and let a CDC reader publish. This is the single most-cited microservice pattern in senior interviews.

Worked example — Debezium EventRouter SMT + topic routing

Detailed explanation. With the outbox table in place, the CDC reader still needs to (a) recognise it as outbox-shaped, (b) route to the correct Kafka topic based on aggregate_type, and (c) unwrap the payload so downstream consumers see the event body, not the Debezium envelope. Debezium's built-in EventRouter SMT does all three. Configure the connector.

  • Recognition. SMT matches records from prod.public.outbox.
  • Routing. One topic per aggregate_type (e.g. prod.Order, prod.Customer).
  • Unwrap. SMT extracts payload as the message value; sets aggregate_id as the message key.

Question. Configure the Debezium Postgres connector with the EventRouter SMT for the outbox table.

Input.

Component Value
Source table public.outbox
Route by aggregate_type
Topic naming prod.<aggregate_type>
Message key aggregate_id
Message value payload JSONB

Code.

# Debezium connector — outbox mode
name: outbox-cdc
config:
  connector.class: io.debezium.connector.postgresql.PostgresConnector

  plugin.name: pgoutput
  database.hostname: db-primary.internal
  database.port: 5432
  database.user: cdc_reader
  database.password: ${env:CDC_PASSWORD}
  database.dbname: production

  slot.name: debezium_outbox
  publication.name: outbox_pub    # created as: CREATE PUBLICATION outbox_pub FOR TABLE public.outbox;
  table.include.list: public.outbox
  snapshot.mode: never            # outbox is empty at bootstrap; no need to snapshot
  heartbeat.interval.ms: 10000
  topic.prefix: prod

  # EventRouter SMT — the outbox-specific plumbing
  transforms: outbox
  transforms.outbox.type: io.debezium.transforms.outbox.EventRouter
  transforms.outbox.route.by.field: aggregate_type
  transforms.outbox.route.topic.replacement: prod.${routedByValue}
  transforms.outbox.table.field.event.key: aggregate_id
  transforms.outbox.table.field.event.payload: payload
  transforms.outbox.table.expand.json.payload: true

  key.converter: org.apache.kafka.connect.storage.StringConverter
  value.converter: org.apache.kafka.connect.json.JsonConverter
  value.converter.schemas.enable: false
Enter fullscreen mode Exit fullscreen mode
# Kafka topic layout after the SMT runs
prod.Order       — messages keyed by order_id,    value = full OrderPlaced/Shipped payload
prod.Customer    — messages keyed by customer_id, value = full CustomerCreated/Updated payload
prod.Shipment    — messages keyed by shipment_id, value = full ShipmentDispatched payload

Message on prod.Order:
  key   = "42"
  value = {"order_id": 42, "customer_id": 7, "total_cents": 1500, ...}
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The Debezium connector tails the WAL exactly like a normal CDC connector — no special Postgres-side setup for outbox. The PUBLICATION outbox_pub FOR TABLE public.outbox scopes it; snapshot.mode = never is safe because the outbox is empty (or being TTL'd) — bootstrap is unnecessary.
  2. The EventRouter SMT intercepts every Debezium change event from public.outbox. It reads the aggregate_type column to decide the target topic (prod.Order, prod.Customer, etc.), reads aggregate_id for the message key, and reads payload for the message value.
  3. transforms.outbox.route.topic.replacement = prod.${routedByValue} is the templated topic-name expression; ${routedByValue} is substituted with the value from route.by.field (aggregate_type). This gives one topic per aggregate type without manual per-type config.
  4. expand.json.payload = true tells the SMT to unwrap the JSONB payload column: instead of sending {"payload": {...}}, it sends the inner object directly. Downstream consumers see typed domain events, not Debezium envelopes.
  5. The result: from the outside, the outbox is invisible. Applications see typed events on typed topics. Debezium + the SMT is the entire plumbing; the outbox table is an implementation detail.

Output.

Outbox row Kafka topic Kafka key Kafka value
aggregate=Order, id=42, type=OrderPlaced prod.Order "42" {"order_id": 42, ...}
aggregate=Order, id=42, type=OrderShipped prod.Order "42" {"order_id": 42, "shipped_at": ...}
aggregate=Customer, id=7, type=CustomerCreated prod.Customer "7" {"customer_id": 7, ...}

Rule of thumb. For any Debezium + outbox deployment, use the built-in EventRouter SMT — do not roll your own topic routing. route.by.field on aggregate_type and event.key on aggregate_id are the defaults for a reason: they align with Kafka's per-key ordering guarantees.

Worked example — outbox TTL and idempotent-consumer contract

Detailed explanation. Two operational tails on every outbox deployment: (a) the outbox table grows indefinitely without cleanup, and (b) downstream consumers must be idempotent because Kafka delivers at-least-once. Walk through the TTL job and the consumer-side idempotency contract.

  • TTL. Nightly job deletes outbox rows older than N days; N = "how far back does Kafka retain the same event?" plus a safety margin.
  • Idempotency. Consumers store (aggregate_id, event_id) in a processed table; skip if seen.

Question. Write the TTL job and the consumer-side idempotency check.

Input.

Component Value
TTL age 7 days
Kafka retention 7 days
Consumer store Redis or Postgres
Dedupe key event id (UUID)

Code.

-- Nightly TTL job — delete outbox rows older than 7 days
-- (Safe because Debezium has forwarded them; Kafka has the durable copy)
DELETE FROM public.outbox
WHERE  created_at < now() - INTERVAL '7 days';

-- Better: partition by day and DROP old partitions
-- (O(1) delete vs O(rows))
Enter fullscreen mode Exit fullscreen mode
# Consumer-side idempotency — Postgres-backed dedupe
import psycopg2
from psycopg2 import errors as pg_errors

def process_order_event(conn, event: dict) -> None:
    """Idempotent handler for OrderPlaced / OrderShipped events."""
    event_id     = event["id"]           # UUID from outbox.id, carried in payload
    order_id     = event["order_id"]
    event_type   = event["event_type"]

    try:
        with conn:
            with conn.cursor() as cur:
                # 1. Try to record the event as processed
                cur.execute("""
                    INSERT INTO consumer_processed_events(event_id, consumer_name)
                    VALUES (%s, 'order-analytics-svc')
                """, (event_id,))

                # 2. Do the actual work (INSERT into analytics table, etc.)
                cur.execute("""
                    INSERT INTO analytics_orders(order_id, event_type, seen_at)
                    VALUES (%s, %s, now())
                """, (order_id, event_type))
    except pg_errors.UniqueViolation:
        # This event was already processed — safe to skip.
        conn.rollback()
        return
Enter fullscreen mode Exit fullscreen mode
-- Consumer-side processed-events table
CREATE TABLE consumer_processed_events (
    event_id        UUID        NOT NULL,
    consumer_name   TEXT        NOT NULL,
    processed_at    TIMESTAMPTZ NOT NULL DEFAULT clock_timestamp(),
    PRIMARY KEY (event_id, consumer_name)
);

-- TTL for the dedupe table — keep 30 days
DELETE FROM consumer_processed_events
WHERE  processed_at < now() - INTERVAL '30 days';
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The outbox TTL is safe because the source of truth after CDC has forwarded is Kafka, not the outbox. Once Debezium's slot has confirmed the row's LSN, the row can be deleted. A 7-day age gives a comfortable buffer against slot lag or backfill needs.
  2. The partitioned-table alternative (DROP TABLE outbox_20260627) is O(1) rather than O(rows) and is the pattern of choice at high volume. DELETE ... WHERE created_at < now() - INTERVAL '7 days' is fine for lower volume.
  3. The consumer side must assume at-least-once delivery: Kafka can redeliver on partition rebalance, consumer restart, or explicit reset. Idempotency is not optional.
  4. The dedupe pattern uses a consumer_processed_events table with (event_id, consumer_name) as the composite PK. Trying to insert a duplicate raises UniqueViolation; catching it lets the consumer skip cleanly. Doing this inside a transaction with the actual work guarantees the "processed" record is committed atomically with the side effect.
  5. The consumer_name column allows multiple consumer services to share the dedupe table without interfering. Consumer A processing event X does not stop Consumer B from processing event X — each has its own row.

Output.

Concern Mechanism Result
Outbox unbounded growth Nightly TTL / partition drop bounded at 7 days
Duplicate delivery Consumer-side dedupe table at-most-once effect
Consumer replay Reset Kafka offset dedupe skips already-processed
Multiple consumers Per-consumer row independent progress

Rule of thumb. Every outbox deployment needs a TTL job on the producer side and an idempotency contract on the consumer side. Skipping either creates a slow-motion failure — the outbox table grows to TB scale, or a partition rebalance double-counts events.

Senior interview question on outbox pattern

A senior interviewer might ask: "You have a microservice that writes to Postgres and needs to publish typed events to Kafka. The naive design uses two-phase commit; the previous team tried it and it deadlocked. Walk me through the outbox pattern — the schema, the atomic write, the Debezium EventRouter, the TTL, and the consumer-side idempotency contract. Then extend to a hybrid where log-based CDC also feeds the warehouse."

Solution Using an outbox table + Debezium EventRouter + hybrid log-based warehouse feed

-- 1. Outbox schema — the atomic-write buffer
CREATE TABLE public.outbox (
    id              UUID         PRIMARY KEY DEFAULT gen_random_uuid(),
    aggregate_type  TEXT         NOT NULL,
    aggregate_id    TEXT         NOT NULL,
    event_type      TEXT         NOT NULL,
    payload         JSONB        NOT NULL,
    created_at      TIMESTAMPTZ  NOT NULL DEFAULT clock_timestamp()
) PARTITION BY RANGE (created_at);

CREATE TABLE public.outbox_20260704 PARTITION OF public.outbox
  FOR VALUES FROM ('2026-07-04') TO ('2026-07-05');

CREATE INDEX idx_outbox_agg ON public.outbox (aggregate_type, aggregate_id);

-- 2. Publications — outbox + business tables (hybrid)
CREATE PUBLICATION outbox_pub    FOR TABLE public.outbox;
CREATE PUBLICATION warehouse_pub FOR TABLE public.orders, public.customers, public.shipments;
Enter fullscreen mode Exit fullscreen mode
# 3. Two Debezium connectors sharing the same Postgres — different scope, different transforms
name: outbox-cdc
config:
  connector.class: io.debezium.connector.postgresql.PostgresConnector
  plugin.name: pgoutput
  slot.name: debezium_outbox
  publication.name: outbox_pub
  table.include.list: public.outbox
  snapshot.mode: never
  topic.prefix: prod
  transforms: outbox
  transforms.outbox.type: io.debezium.transforms.outbox.EventRouter
  transforms.outbox.route.by.field: aggregate_type
  transforms.outbox.route.topic.replacement: prod.${routedByValue}
  transforms.outbox.table.field.event.key: aggregate_id
  transforms.outbox.table.field.event.payload: payload
  transforms.outbox.table.expand.json.payload: true

---
name: warehouse-cdc
config:
  connector.class: io.debezium.connector.postgresql.PostgresConnector
  plugin.name: pgoutput
  slot.name: debezium_warehouse
  publication.name: warehouse_pub
  table.include.list: public.orders,public.customers,public.shipments
  snapshot.mode: initial
  topic.prefix: warehouse
  # No EventRouter — raw Debezium change events go straight to Kafka
  tombstones.on.delete: true
Enter fullscreen mode Exit fullscreen mode
# 4. Application — single-transaction dual write
def place_order(conn, customer_id: int, total_cents: int) -> int:
    with conn:
        with conn.cursor() as cur:
            cur.execute("""
                INSERT INTO public.orders(customer_id, total_cents, status)
                VALUES (%s, %s, 'pending')
                RETURNING id
            """, (customer_id, total_cents))
            order_id = cur.fetchone()[0]

            cur.execute("""
                INSERT INTO public.outbox(aggregate_type, aggregate_id, event_type, payload)
                VALUES ('Order', %s, 'OrderPlaced',
                        jsonb_build_object('order_id', %s, 'customer_id', %s, 'total_cents', %s))
            """, (str(order_id), order_id, customer_id, total_cents))
    return order_id
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Layer Component Purpose
Application one transaction; two INSERTs atomic business + event write
Postgres outbox + orders/customers/shipments one source of truth
Debezium outbox-cdc tail outbox_pub; EventRouter typed events to prod.<Aggregate> topics
Debezium warehouse-cdc tail warehouse_pub; snapshot=initial raw mutations to warehouse.public.<table>
Consumers order-analytics-svc + Snowflake independent progress; idempotency contract
Retention daily partition drop on outbox bounded storage

After deployment, the application makes one transaction, and two Kafka streams appear downstream: typed domain events on prod.Order / prod.Customer / prod.Shipment for microservice consumers, and raw mutation events on warehouse.public.orders / etc. for the warehouse. The application never touches Kafka; the DBA sees two replication slots on one Postgres.

Output:

Downstream Topic Message shape Consumer
Order service prod.Order OrderPlaced, OrderShipped payloads order-analytics-svc
Customer service prod.Customer CustomerCreated, CustomerUpdated crm-svc
Warehouse warehouse.public.orders Debezium change events Snowflake sink
Warehouse warehouse.public.customers Debezium change events Snowflake sink

Why this works — concept by concept:

  • Single-transaction outbox write — the atomic guarantee. Business row and event row commit together via Postgres MVCC; no two-phase commit needed. Solves the dual-write problem at its root.
  • EventRouter SMT — the outbox-to-typed-topic bridge. route.by.field = aggregate_type gives one topic per aggregate; expand.json.payload unwraps the JSONB so consumers see typed events, not Debezium envelopes.
  • Hybrid = two connectors, two publications, two purposes — outbox connector for domain events; warehouse connector for raw mutations. Different snapshot semantics (never vs initial) match the data lifecycles.
  • Idempotent consumers — Kafka delivers at-least-once; consumers dedupe by (event_id, consumer_name). This is the contract every outbox consumer signs; skipping it turns duplicate delivery into a correctness bug.
  • Cost — one Postgres, two replication slots, two Debezium connectors, ~4-5% source CPU total, partitioned outbox with daily drop for retention. Compared to two-phase commit (deadlocks, blocking), this is the operationally sane way to serve both microservice and warehouse consumers from one source. O(1) per commit on the app side; the CDC readers handle the fan-out.

Streaming
Topic — streaming
Streaming outbox and event-driven problems

Practice →

Design
Topic — design
Design problems on transactional messaging

Practice →


Cheat sheet — CDC pattern recipes

  • Which pattern when. Log-based CDC is the 2026 default when your DBA grants wal_level=logical (Postgres), binlog_format=ROW (MySQL), or replica-set oplog access (Mongo). Timestamp CDC is the fallback when no replication is grantable but SELECT works — mandatory nightly reconcile for deletes. Trigger CDC when CREATE TRIGGER is allowed but replication isn't — accept the 2× write amplification. Outbox pattern on top of any primary CDC whenever the app must emit typed events atomically with a business write.
  • Timestamp watermark checkpoint template. Store watermark durably in the source Postgres itself (cdc_watermarks(table_name PK, last_updated_at, last_poll_at)), advance to max(observed updated_at) inside the same transaction that writes the delta, use a 5-minute safety horizon (updated_at <= now() - 5m) to absorb late-committing transactions, and enforce updated_at = clock_timestamp() in a BEFORE UPDATE trigger to eliminate app-side clock skew.
  • Postgres audit-trigger 6-line template. CREATE FUNCTION audit_X_trigger() ... IF TG_OP='INSERT' THEN INSERT ...(op, id, after=to_jsonb(NEW)); ELSIF TG_OP='UPDATE' THEN INSERT ...(op, id, before=to_jsonb(OLD), after=to_jsonb(NEW)); ELSIF TG_OP='DELETE' THEN INSERT ...(op, id, before=to_jsonb(OLD)); END IF; RETURN NULL; END; bound to AFTER INSERT OR UPDATE OR DELETE FOR EACH ROW. Shadow table uses BIGSERIAL audit_id as the reader watermark; partition by changed_at for O(1) retention drops.
  • Debezium Postgres connector JSON. plugin.name=pgoutput, snapshot.mode=initial, snapshot.isolation.mode=repeatable_read, heartbeat.interval.ms=10000, slot.name=<unique-per-connector>, publication.name=<explicit> (never FOR ALL TABLES in production), tombstones.on.delete=true for Kafka log-compaction, producer.override.enable.idempotence=true, and max_slot_wal_keep_size=<N>GB on the Postgres side as the disk-full defense.
  • Outbox table DDL + EventRouter SMT. outbox(id UUID PK default gen_random_uuid(), aggregate_type, aggregate_id, event_type, payload JSONB, created_at); write inside the same transaction as the business row; connect Debezium with transforms.outbox.type=io.debezium.transforms.outbox.EventRouter, route.by.field=aggregate_type, route.topic.replacement=prod.${routedByValue}, table.field.event.key=aggregate_id, table.field.event.payload=payload, table.expand.json.payload=true. Partition outbox by day; drop old partitions once Debezium has confirmed the LSN.
  • Pattern decision matrix. Permission: log-based needs replication; trigger needs CREATE TRIGGER; timestamp needs SELECT; outbox needs schema change. Latency: log-based sub-second, trigger seconds, timestamp minutes, outbox = reader-dependent. Delete visibility: log-based/trigger/outbox yes, timestamp no. Source load: log-based ~0 on tables, trigger 2× write amp, timestamp 1 indexed scan per poll, outbox +1 write per business txn. Print this matrix on a sticky note; use it in every interview.
  • Slot-lag monitoring query. SELECT slot_name, active, pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes FROM pg_replication_slots WHERE slot_type='logical'; alert on lag_bytes > 5 GB for 10 min (below the max_slot_wal_keep_size ceiling) and on active = false for 15 min. Runbook: check connector logs, check Kafka consumer lag, if wal_status='lost' drop slot and resnapshot.
  • Consumer-side idempotency contract. Every consumer of a Kafka topic fed by CDC (log-based, outbox, or otherwise) stores (event_id, consumer_name) in a consumer_processed_events table with a composite PK. Try to insert; on UniqueViolation, skip. Wrap the dedupe insert and the actual side-effect in one transaction so "recorded processed" and "did the work" commit together. TTL the dedupe table at ~30 days.
  • Bootstrap / snapshot policy. Log-based: snapshot.mode=initial for greenfield tables; snapshot.mode=never for tables you've bootstrapped some other way (e.g. outbox tables which are empty at start). Timestamp: bootstrap by setting watermark to epoch and running one large-batch first poll. Trigger: application-managed — either add the trigger before the table is populated or accept that pre-trigger rows are invisible.
  • Failure semantics reminder. Log-based: Kafka down → slot holds WAL → capped by max_slot_wal_keep_size. Timestamp: watermark stale → next poll catches up; late rows caught by safety window. Trigger: shadow table grows; reader restart tolerated. Outbox: business rollback = event rollback; Kafka down = outbox grows; capped by TTL. Every pattern has a bounded failure surface; know yours.
  • Schema evolution rules. Log-based + Avro + schema registry = backward-compatible column adds work automatically. Trigger CDC + JSONB payload = column adds are automatic (JSONB just carries them). Timestamp CDC + explicit column list = column adds require code change. Outbox + JSONB payload = new event types are additive; consumers ignore unknown types.
  • When to use hybrid. Any deployment that needs both raw source-table mutations (warehouse, audit, dashboards) and typed domain events (microservices, event bus). Two Debezium connectors, two publications, two Kafka namespaces (warehouse.* vs prod.*) — one Postgres. Costs ~5% source CPU total; buys both consumer models.
  • Migration cost between patterns. Timestamp → log-based: ~2 engineer-weeks (config + snapshot + downstream MERGE rewrites). Trigger → log-based: ~1 engineer-week (drop triggers + shadow table; switch reader source). Adding outbox to existing app: ~1 sprint per aggregate type (application code changes). Choose the primary pattern once; the migration cost is real.

Frequently asked questions

What is CDC in one sentence?

Change data capture is the pattern of continuously observing an operational database's mutations — inserts, updates, and deletes — and forwarding each mutation to downstream systems (warehouse, search index, cache, event bus) so those systems stay in near-real-time sync with the source without ever re-reading the full source table. The four canonical cdc patterns — timestamp (poll updated_at), trigger (audit shadow table), log-based (tail the WAL/binlog/oplog), and outbox (single-transaction event log) — differ in permission requirements, latency, delete handling, and source-side load, and the choice binds every downstream consumer for years. Every senior data-engineering interview probes CDC because it's the load-bearing pipeline pattern for the modern warehouse + microservice stack.

Timestamp CDC vs log-based CDC — when do I pick each?

Default to log-based CDC when your DBA grants wal_level=logical (Postgres), binlog_format=ROW (MySQL), or replica-set oplog access (Mongo). Log-based captures every DML including deletes, adds essentially zero load on the source tables (the WAL is written for durability regardless), and ships sub-second latency via Debezium, AWS DMS, or a native connector. Pick timestamp CDC (poll WHERE updated_at > watermark) as the fallback when logical replication is not grantable — some managed Postgres tiers, ancient MySQL versions, air-gapped Oracle deployments. Timestamp is cheaper to build and operates against any DB with an updated_at column, but it's blind to physical deletes and requires a nightly full-table reconcile job to catch missing rows. Never pick timestamp for a workload that needs sub-minute freshness — the poll interval is a hard floor on latency.

What is the outbox pattern and what problem does it solve?

The outbox pattern solves the dual-write problem — the correctness bug where an application writes a business row to a database and publishes an event to Kafka (or any other bus) as two separate operations. Either the database commit succeeds and Kafka publish fails (event lost) or Kafka publish succeeds and DB commit fails (event describes a non-existent row); both violate consistency. The outbox pattern collapses both writes into a single database transaction: the application writes the business row and a row to an outbox table in the same BEGIN/COMMIT, and a downstream CDC reader (typically Debezium with the EventRouter SMT) tails the outbox table and publishes to Kafka. Both writes commit atomically or neither does; the CDC reader retries on Kafka outage without ever losing an event. This pattern is the load-bearing correctness invariant of most modern microservice + event-bus architectures, and it's the single most-cited answer in senior CDC interviews.

Can I do CDC without logical replication?

Yes — timestamp cdc and trigger cdc are the two answers when logical replication is off-limits. Timestamp CDC needs only SELECT on the source table plus an indexed updated_at column; it polls for updated_at > last_watermark, ships the delta, and advances the watermark. It's blind to physical deletes, so you either adopt a soft-delete convention (UPDATE ... SET deleted_at = now() instead of DELETE) or ship a nightly full-table reconcile that catches missing rows. Trigger CDC needs CREATE TRIGGER on the source tables: an AFTER INSERT OR UPDATE OR DELETE trigger writes one row per DML to a shadow / audit table, and a reader tails that shadow table by monotonic BIGSERIAL audit_id. Trigger CDC captures every DML including deletes but pays for it with 2× write amplification (every source-table DML now writes two rows), which is meaningful on write-heavy tables. Pick timestamp for the lightest-touch, permission-frugal deployments; pick trigger when the delete-handling requirement forces your hand and log-based is genuinely off the table.

Which CDC pattern handles deletes correctly?

Log-based CDC (Debezium, AWS DMS, native pglogical) captures physical DELETEs natively — the WAL/binlog/oplog records every DML including deletes, and the reader emits a "tombstone" event carrying the pre-image of the deleted row. Trigger CDC captures deletes via a DELETE trigger that writes an op='D' row into the shadow table with the OLD row image. Outbox CDC doesn't capture source-table deletes directly — it captures whatever events the application chooses to write to the outbox, so if the app runs DELETE FROM orders without also inserting a Deleted outbox row, downstream never learns. Timestamp CDC is blind to physical deletes — a DELETE FROM orders WHERE id = 42 leaves no updated_at for the next poll to query, so the row silently vanishes downstream. The timestamp workaround is either soft-delete (UPDATE ... SET deleted_at = now()) or a nightly full-table reconcile that flags rows present in the warehouse but missing in the source. Getting the delete axis wrong ships a subtle correctness bug that only surfaces during audit season.

Is trigger-based CDC still used in 2026?

Yes, though it's the smallest of the four patterns by deployment count. Trigger CDC ships in three scenarios in 2026: (a) source databases where the DBA won't grant logical replication or the managed tier doesn't offer it (some SOC-compliance environments, older RDS tiers, air-gapped Oracle), (b) regulatory-audit workloads where the shadow / audit table doubles as the compliance record with changed_by TEXT DEFAULT current_user and txid BIGINT DEFAULT txid_current() provenance, and (c) legacy pipelines that predate Debezium and haven't been migrated. New greenfield deployments almost always start with log-based CDC because the operational story is simpler and the source-side write amplification is zero. That said, senior interviewers still probe trigger CDC because knowing why you'd pick it (permission constraints + delete handling + audit trail) is a proxy for understanding all four axes — permission, latency, delete visibility, source load — well enough to defend any pattern choice.

Practice on PipeCode

  • Drill the SQL practice library → for the CDC, watermark, incremental-load, and shadow-table problems senior interviewers love.
  • Rehearse on the ETL practice library → for the batch CDC pipelines, timestamp-CDC watermark advance, and snapshot-bootstrap patterns.
  • Sharpen the streaming axis with the streaming practice library → for Debezium, log-based CDC, and outbox event-routing scenarios.
  • Stack the prerequisites against PipeCode's broader 450+ data-engineering catalogue to anchor the four-pattern decision matrix against real graded inputs.

Lock in CDC pattern muscle memory

Docs explain patterns. PipeCode drills explain the decision — when timestamp CDC is blind to deletes, when trigger CDC's write amplification bites, when log-based CDC's slot lag becomes a disk-full incident, when the outbox pattern earns its place. Pipecode.ai is Leetcode for Data Engineering — pattern-first practice tuned for the production trade-offs senior data engineers actually face.

Practice SQL problems →
Practice streaming problems →

Top comments (0)