cdc patterns are the pick-one architectural decision that decides whether your downstream analytics warehouse is minutes behind reality or entire days behind — and it is the single component senior data engineers get wrong most often because "just use Debezium" is not always an option. Every operational database mutation your business writes — a customer address change, an order cancellation, a soft-deleted row — has to reach the warehouse, the search index, the feature store, and the audit trail without losing deletes, without re-reading the entire source table every night, and without saturating the source database's IOPS budget in the process. The engineering trade-off does not live in "should we do CDC" — every stack with more than one downstream system needs it — but in which change data capture pattern you pick and what it costs the source database.
This guide is the senior-DE walkthrough you wished existed the first time an interviewer asked "walk me through the four cdc patterns and their delete-handling story," or "your DBA won't grant logical replication — what's the fallback?", or "explain the outbox pattern and why it solves the dual-write problem." It walks through the four canonical patterns — timestamp cdc (polling updated_at > watermark), trigger cdc (shadow / audit table), log-based cdc (WAL / binlog / oplog readers), and the transactional outbox pattern (single-transaction fan-out) — the "four axes" interviewers actually probe (source-DB permission, latency, delete visibility, source-side load), the canonical config for each, and the hybrid pattern where log-based cdc reads an outbox table for aggregate events. Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.
When you want hands-on reps immediately after reading, drill the SQL practice library →, rehearse on the ETL practice library →, and sharpen the streaming axis with the streaming practice library →.
On this page
- Why the CDC pattern choice determines everything downstream
- Timestamp / high-watermark CDC
- Trigger-based CDC
- Log-based CDC — WAL / binlog / oplog
- Outbox pattern + hybrid
- Cheat sheet — CDC pattern recipes
- Frequently asked questions
- Practice on PipeCode
1. Why the CDC pattern choice determines everything downstream
Four patterns, four wildly different cost curves — the choice binds you for years
The one-sentence invariant: change data capture is a picking exercise between polling the source table by updated_at, tailing a shadow audit table populated by triggers, tailing the database's write-ahead log via native replication, or reading a transactionally-consistent outbox table your application writes into — and each pattern trades source-DB permission requirements against latency, delete visibility, and source load in a way that cannot be undone downstream. The pattern you pick in month one becomes the pattern you fight to migrate away from in year three, because every downstream consumer, dashboard, and reconciliation job hard-codes assumptions about the shape of the CDC stream — whether deletes appear as tombstones or silently vanish, whether events are ordered by log-sequence-number or by wall clock, whether replay is a slot-rewind operation or a re-poll from a watermark.
The four axes interviewers actually probe.
-
Source-DB permission. Log-based CDC needs
wal_level=logical(Postgres),log_bin=ON+binlog_format=ROW(MySQL), or replica-set oplog access (Mongo). If your DBA won't (or your managed DB can't) grant those, log-based is off the table and you fall back to timestamp or trigger. Interviewers open with this question because it separates people who understand the operational reality from those who've only read the Debezium docs. - Latency. Timestamp CDC is bounded by the poll interval — minutes or hours. Trigger CDC adds a bounded per-DML write and a small tail delay — seconds. Log-based CDC is sub-second in steady state. Outbox depends on the reader; sub-second when paired with log-based, minutes when paired with polling.
-
Delete handling. Timestamp CDC is blind to physical deletes — a
DELETE FROM orders WHERE id = 42leaves noupdated_atto poll against, and the row silently vanishes downstream. Triggers, log-based, and outbox all capture deletes. Getting this axis wrong ships a subtle correctness bug that only surfaces during audit season. -
Source load. Timestamp CDC adds one indexed scan per poll — cheap on
updated_at-indexed tables. Trigger CDC doubles the write cost on every INSERT/UPDATE/DELETE — a hidden 2× write-amplification tax. Log-based CDC adds zero load on the source tables (the WAL is written anyway) but a small CPU cost on the replication reader. Outbox adds one extra write per transaction — measurable but bounded.
The 2026 reality — log-based dominates but the other three are alive and well.
-
Log-based is the default for any greenfield Postgres, MySQL, or MongoDB deployment where the DBA cooperates. Debezium, AWS DMS, and native connectors (
pglogical,wal2json, MongoDB change streams) are production-hardened. If you can pick, you pick log-based. - Timestamp / high-watermark ships in every legacy warehouse ETL pipeline: Airflow → source SELECT → S3 → Snowflake. Cheap, portable, works against any relational database — SQL Server, Oracle, DB2, MariaDB, SQLite. Blind to deletes; requires application-level soft-delete or a nightly full-table reconciliation to catch missing rows.
- Trigger-based is the fallback when logical replication is blocked — some managed Postgres tiers, ancient MySQL versions, or air-gapped installations. The trigger + shadow-table pattern is decades old and still occasionally the only option.
- Outbox is the pattern that solves the dual-write problem — where the application must write a business row and publish an event, and any failure mode ("wrote the row, crashed before publishing" or vice versa) leaves the two out of sync. The outbox writes both inside one transaction; a downstream reader (usually log-based) publishes the events. This is the pattern senior interviewers probe for microservice teams.
What interviewers listen for.
- Do you name all four patterns without prompting? — senior signal.
- Do you say "timestamp CDC is blind to deletes" in the first sentence when timestamp comes up? — required answer.
- Do you push back on "just use Debezium" with the permission question — "does the DBA grant logical replication?" — senior signal.
- Do you name the outbox pattern as the answer to the dual-write problem, not as "an alternative to Debezium"? — senior signal.
- Do you describe CDC as "a stream of ordered mutation events" rather than as vague "syncing data"? — required answer.
Worked example — the four-axis comparison table
Detailed explanation. The single most useful artifact for a CDC interview is a memorised 4×4 comparison table. Every senior CDC discussion converges on this table within the first ten minutes; having it in your head is what separates a fluent answer from a stumbling one. Walk through building the table for a hypothetical orders table that needs to reach a Snowflake warehouse.
-
Source table.
public.orders (id, customer_id, total_cents, status, created_at, updated_at)on Postgres 16. -
Downstream. Snowflake
RAW.orders— full history including deletes. - Target latency. From "the DBA needs it in 24h" (batch) to "sub-second" (streaming).
- Downstream consumers. Warehouse (batch OK), search index (needs seconds), audit log (needs every DML including delete).
Question. Build the four-pattern comparison for the orders table and pick the pattern each downstream consumer should subscribe to.
Input.
| Pattern | Permission needed | Latency | Delete visible? | Source load |
|---|---|---|---|---|
| Timestamp | none (SELECT on table) | minutes–hours | no (unless soft-delete) | 1 indexed scan per poll |
| Trigger | CREATE TRIGGER | seconds | yes | +1 write per DML (2×) |
| Log-based | wal_level=logical + REPLICATION | sub-second | yes | ~0 on source tables |
| Outbox | app-level (write to outbox in txn) | reader-dependent | yes | +1 write per business txn |
Code.
-- Postgres source table (all four patterns will target this)
CREATE TABLE public.orders (
id BIGSERIAL PRIMARY KEY,
customer_id BIGINT NOT NULL,
total_cents BIGINT NOT NULL,
status TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Index on updated_at is mandatory for timestamp CDC
CREATE INDEX idx_orders_updated_at ON public.orders (updated_at);
-- Trigger to keep updated_at fresh (needed for timestamp CDC)
CREATE OR REPLACE FUNCTION public.bump_updated_at() RETURNS TRIGGER AS $$
BEGIN
NEW.updated_at = now();
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_orders_bump_updated_at
BEFORE UPDATE ON public.orders
FOR EACH ROW EXECUTE FUNCTION public.bump_updated_at();
Step-by-step explanation.
- The
orderstable needs anupdated_atcolumn and an index on it before timestamp CDC becomes viable. Without the index every poll scans the whole table — a per-poll cost that goes non-linear at multi-million-row scale. - The
bump_updated_attrigger is the baseline even in a non-CDC world — it guaranteesupdated_atreflects the most recent mutation. Without this trigger, an application that forgets to writeupdated_at = now()loses rows from every timestamp-CDC poll. - For log-based CDC, no schema change is needed on
ordersitself — the WAL captures every DML regardless. For trigger CDC we'll add a shadow table + triggers in section 3. For outbox we'll add a separateoutboxtable in section 5. - The choice of pattern is downstream-consumer-driven. Warehouse can tolerate minutes → timestamp CDC works. Search index needs seconds → trigger or log-based. Audit log needs every DML including delete → trigger, log-based, or outbox.
- In practice, most senior architectures pick log-based as the primary CDC channel, then layer outbox on top for events that don't correspond 1:1 with a source-table row (e.g.
OrderShippedwhen shipment is derived fromorders.statustransition).
Output.
| Consumer | Recommended pattern | Why |
|---|---|---|
| Snowflake warehouse (batch) | timestamp CDC | Cheap; poll every 15 min; add nightly full-reconcile for deletes |
| Search index | log-based CDC | Sub-second latency; deletes handled |
| Audit log | log-based or trigger | Every DML captured; log-based preferred if allowed |
| Application event bus | outbox pattern | Solves dual-write; consumers get typed events |
Rule of thumb. Never pick a CDC pattern based on "which one is trendy." Pick it based on (permission × latency × delete-visibility × source-load) — the four axes. Write the table on a whiteboard first; the pattern falls out of the constraints.
Worked example — what interviewers actually probe
Detailed explanation. The senior data-engineering CDC interview has a predictable structure: the interviewer opens with an ambiguous question ("how would you get data from our Postgres OLTP into the warehouse?"), then progressively narrows to test whether you know the axes. The candidates who name the pattern in sentence one score highest; the candidates who describe "an ETL pipeline" score lowest. Walk through the interview grading rubric.
- Ambiguous opener. "How would you sync our production Postgres to the warehouse?" — invites you to name a pattern.
- Follow-up 1. "What if the DBA won't grant logical replication?" — probes permission axis.
- Follow-up 2. "How does your pattern handle deletes?" — probes delete axis.
- Follow-up 3. "What's the load on Postgres?" — probes source-load axis.
- Follow-up 4. "How would you replay the last week?" — probes replay semantics.
Question. Draft a 5-minute senior CDC answer that covers all four axes without waiting to be asked.
Input.
| Interview signal | Weak answer | Senior answer |
|---|---|---|
| Pattern named | "we'd use Airflow" | "log-based CDC via Debezium if logical replication is allowed" |
| Permission | "not sure" | "requires wal_level=logical + REPLICATION role" |
| Delete handling | "we could add a soft-delete column" | "log-based captures delete tombstones natively" |
| Source load | "Postgres can handle it" | "zero source-table load; small CPU on the replication reader" |
| Replay | "restart Airflow" | "reset the replication slot LSN, or bootstrap via pg_dump snapshot" |
Code.
Senior CDC answer template (5 minutes)
======================================
Minute 1 — name the pattern up front
"I'd default to log-based CDC via Debezium, publishing to Kafka."
Minute 2 — permission
"This requires wal_level=logical on the Postgres primary, a REPLICATION
role for the connector, and a replication slot per connector. If the
DBA won't grant that (managed tier limitation, corporate policy),
the fallback is timestamp CDC (poll updated_at) with a nightly
full-reconcile job to catch deletes."
Minute 3 — delete handling
"Log-based captures INSERT/UPDATE/DELETE natively — deletes appear as
tombstone events with the old row image. Timestamp CDC is blind to
physical deletes, so we'd either soft-delete or reconcile."
Minute 4 — source load
"Log-based adds ~zero load on source tables — the WAL is written
anyway. The reader consumes ~2-3% CPU on the DB host. Trigger CDC
doubles the write cost on every DML. Timestamp CDC is one indexed
scan per poll interval."
Minute 5 — replay + failure semantics
"Debezium tracks progress via LSN in the replication slot. Replay is
'reset the slot to a prior LSN and re-emit.' The initial snapshot
uses SNAPSHOT ISOLATION so it's a consistent point-in-time cut. On
failure the slot holds WAL retention until the reader catches up —
we monitor slot lag with pg_replication_slots."
Step-by-step explanation.
- Minute 1 is the crucial framing. Naming the pattern immediately — "log-based CDC via Debezium" — signals you're a decision-maker, not a task-runner. Weak candidates dive into tools ("we'd use Kafka Connect and…") before naming the pattern.
- Minute 2 addresses the permission axis before the interviewer asks. This preempts the common trap where you commit to log-based, then admit you don't know how to fall back. Naming the fallback up front is senior signal.
- Minute 3 is the delete-handling probe. Every CDC pattern except timestamp captures deletes; naming the log-based tombstone event and the timestamp reconcile workaround shows you've built both.
- Minute 4 is the source-load argument. "Zero load on source tables" is the log-based sales pitch; "one indexed scan per poll" is the timestamp defense; "doubles write cost" is the trigger warning. Say all three unprompted.
- Minute 5 covers replay and failure — the reliability axis. The LSN-based replay story for log-based is qualitatively different from the watermark-file replay for timestamp; showing you know both is a senior signal.
Output.
| Grading criterion | Weak score | Senior score |
|---|---|---|
| Names pattern in minute 1 | rare | mandatory |
| Names permission fallback | rare | required |
| Names delete handling | occasional | mandatory |
| Names source-load numbers | rare | senior signal |
| Names replay semantics | rare | senior signal |
Rule of thumb. The senior CDC answer is a 5-minute monologue that covers all four axes without waiting for the follow-ups. Rehearse it once; deploy it every time.
Worked example — the "pick the pattern" decision tree
Detailed explanation. Given a new source system, the senior architect runs a 4-question decision tree in their head. Codifying the tree makes the interview answer reproducible: any interviewer can hand you a scenario and you can walk the tree out loud. Walk through the tree with three canonical scenarios: managed RDS Postgres with full permissions, a locked-down Oracle 11g, and a microservice team that needs event routing.
- Q1. Do you have logical-replication / binlog / oplog access? → yes = go to Q2; no = go to Q3.
- Q2. Is sub-second latency required or acceptable? → yes = log-based; no = still log-based (it's the default when allowed).
- Q3. Is CREATE TRIGGER allowed? → yes = trigger CDC; no = go to Q4.
-
Q4. Is
updated_atmaintained on every DML? → yes = timestamp CDC; no = must add a soft-delete +updated_atcolumn first. - Q5 (parallel branch). Does the app write a business row and need to emit an event atomically? → yes = outbox (on top of whatever primary CDC you picked).
Question. Walk the decision tree for the three scenarios and record the pattern each ends up with.
Input.
| Scenario | Q1 (log access?) | Q3 (trigger?) | Q4 (updated_at?) | Q5 (dual-write?) |
|---|---|---|---|---|
| Managed RDS Postgres | yes | — | — | maybe |
| Oracle 11g locked down | no | no | yes | no |
| Microservice + Kafka | yes | — | — | yes |
Code.
# Decision-tree helper (illustrative)
def pick_cdc_pattern(has_log_access: bool,
allows_triggers: bool,
has_updated_at: bool,
needs_dual_write: bool) -> list[str]:
"""Return the primary CDC pattern(s) for a source system."""
patterns = []
if has_log_access:
patterns.append("log-based")
elif allows_triggers:
patterns.append("trigger")
elif has_updated_at:
patterns.append("timestamp + nightly reconcile")
else:
raise ValueError("no viable CDC pattern; add updated_at first")
if needs_dual_write:
patterns.append("outbox (on top)")
return patterns
# Walk the three scenarios
print(pick_cdc_pattern(True, False, True, False))
# → ['log-based']
print(pick_cdc_pattern(False, False, True, False))
# → ['timestamp + nightly reconcile']
print(pick_cdc_pattern(True, False, True, True))
# → ['log-based', 'outbox (on top)']
Step-by-step explanation.
- Scenario 1 — Managed RDS Postgres with
wal_level=logicalenabled and REPLICATION role granted. The decision tree short-circuits at Q1 → log-based. This is the modern default. - Scenario 2 — Oracle 11g with no LogMiner access, no CREATE TRIGGER grant (locked-down DBA), but the
updated_atcolumn exists on every table. Q1 = no, Q3 = no, Q4 = yes → timestamp CDC with a nightly full-reconcile job to catch deletes. This is the legacy default. - Scenario 3 — a microservice team with full Postgres access that needs to emit typed domain events (
OrderPlaced,OrderShipped) to Kafka. Q1 = yes → log-based; Q5 = yes → also outbox for the domain events. The hybrid pattern uses log-based CDC as the reader tailing the outbox table. - The parallel Q5 branch (outbox) is orthogonal to the primary pattern. You can pair outbox with any primary CDC pattern; log-based + outbox is by far the most common hybrid in modern microservice architectures.
- If none of Q1-Q4 pass, the answer is "you must first add
updated_atand a soft-delete convention to the source schema." Refuse to pick a CDC pattern until the source schema supports one.
Output.
| Scenario | Primary pattern | Add-on |
|---|---|---|
| Managed RDS Postgres | log-based | — |
| Oracle 11g locked | timestamp + reconcile | — |
| Microservice + Kafka | log-based | outbox |
Rule of thumb. The four-question decision tree is a whiteboard-friendly answer. Practice walking it end-to-end so an interviewer can hand you any scenario and get a pattern name in under 60 seconds.
Senior interview question on CDC pattern selection
A senior interviewer often opens with: "You inherit a Postgres 12 → Snowflake pipeline that currently full-refreshes the orders table nightly. The business wants sub-second freshness and physical deletes to appear downstream. Walk me through the CDC pattern you'd migrate to, the source-DB changes you'd request, and the failure modes you'd guard against."
Solution Using log-based CDC with a bootstrap snapshot and slot monitoring
-- Step 1 — request DBA changes on Postgres primary
-- postgresql.conf
-- wal_level = logical
-- max_replication_slots = 10
-- max_wal_senders = 10
-- (requires restart)
-- Step 2 — create replication user + publication for the orders table
CREATE ROLE cdc_reader WITH LOGIN REPLICATION PASSWORD 'strong-secret';
GRANT SELECT ON public.orders TO cdc_reader;
CREATE PUBLICATION orders_pub FOR TABLE public.orders;
-- Step 3 — create replication slot (Debezium will do this on startup,
-- but you can pre-create for control)
SELECT pg_create_logical_replication_slot('debezium_orders', 'pgoutput');
# Step 4 — Debezium Postgres connector config
name: orders-cdc
config:
connector.class: io.debezium.connector.postgresql.PostgresConnector
plugin.name: pgoutput
database.hostname: db-primary.internal
database.port: 5432
database.user: cdc_reader
database.password: ${env:CDC_PASSWORD}
database.dbname: production
slot.name: debezium_orders
publication.name: orders_pub
snapshot.mode: initial # bootstrap: pg_dump-style snapshot first,
# then tail the WAL
table.include.list: public.orders
topic.prefix: prod
heartbeat.interval.ms: 10000 # every 10s emit a heartbeat to advance
# the slot even during quiet periods
key.converter: org.apache.kafka.connect.json.JsonConverter
value.converter: org.apache.kafka.connect.json.JsonConverter
-- Step 5 — slot lag monitoring query (alert on lag > 100 MB)
SELECT slot_name,
active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag,
pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots
WHERE slot_name = 'debezium_orders';
Step-by-step trace.
| Step | Before (nightly full-refresh) | After (log-based CDC) |
|---|---|---|
| Latency | ~24 h | sub-second (steady state) |
| Delete visibility | no (rows silently vanish) | yes (tombstone events) |
| Source load | full-table scan nightly | ~0 on orders; +2% CPU on reader |
| Snowflake sync mechanism | truncate + reload | append + merge (dedupe by LSN) |
| Bootstrap | N/A | one-time snapshot.mode=initial
|
| Replay | re-run full-refresh | reset slot LSN, re-emit |
| Failure surface | job crashes → next night retry | slot backs up → WAL retention until reader recovers |
After the migration, orders mutations appear in Kafka within ~200 ms of commit, deletes ship as tombstone events (op: d with the pre-image), and Snowflake's downstream MERGE handles insert/update/delete idempotently by LSN. The nightly full-refresh disappears from the Airflow DAG.
Output:
| Metric | Before | After |
|---|---|---|
| Warehouse freshness | 24 h | < 1 s |
| Delete detection | none | native |
| Source-DB load | full scan × 1/day | continuous 2% CPU |
| WAL retention risk | none | slot must not lag >1 GB |
| Snowflake merge SQL | truncate + insert | MERGE ON id ordered by LSN |
Why this works — concept by concept:
- Log-based CDC — the WAL is Postgres's durable ordered event log. Every INSERT/UPDATE/DELETE is written to it before commit. A logical replication reader decodes those records into typed change events without ever touching the source table.
-
Publication + replication slot —
PUBLICATIONscopes which tables the reader sees;REPLICATION SLOTdurably tracks the reader's position (LSN). The slot guarantees no data loss even if the reader is offline for hours — Postgres retains WAL until the slot catches up. -
Snapshot bootstrap —
snapshot.mode=initialperforms a consistent point-in-time snapshot (viaSET TRANSACTION SNAPSHOT), emits every existing row, records the LSN at snapshot start, then tails the WAL from that LSN. This solves the "how do I initialise the warehouse?" question without an outage. -
Heartbeat interval — quiet tables produce no WAL, so the slot's
confirmed_flush_lsnnever advances, so WAL retention grows unboundedly. Heartbeats every 10 s force the slot forward. -
Cost — one replication slot (~free), 2-3% CPU on the reader box, mandatory monitoring of
pg_replication_slots.lag. The eliminated cost is the nightly full-table scan (O(rows) per night) and the developer time spent on soft-delete gymnastics. Net O(1) per commit versus O(N) per night.
SQL
Topic — sql
SQL CDC and change-stream problems
2. Timestamp / high-watermark CDC
SELECT WHERE updated_at > last_watermark is the CDC pattern every legacy warehouse still runs — cheap, portable, and blind to deletes
The mental model in one line: timestamp cdc is the pattern where a poller reads rows whose updated_at (or modified_at, or last_modified) exceeds the last-recorded high-watermark, writes the delta downstream, then advances the watermark to the maximum updated_at observed — it works against any relational database that has an updated_at column and an index, but it cannot see physical deletes and it cannot recover late-arriving updates without help. Every senior data engineer has built one; every senior data engineer has been burned by one; and it remains the correct answer for a large fraction of legacy ETL pipelines.
The four axes for timestamp CDC.
- Permission. SELECT on the source table. Nothing else. The most permission-frugal pattern, which is why it survives in air-gapped, locked-down, and legacy environments where the DBA will not grant CREATE TRIGGER or REPLICATION.
-
Latency. Bounded by the poll interval. 15 minutes is the common default; 1 minute is aggressive; sub-minute is either wasteful (each poll re-scans a large index range) or requires log-based CDC. The upstream data lands "at most
poll_intervalold." -
Delete handling. None. A
DELETE FROM orders WHERE id = 42leaves no trace queryable byWHERE updated_at > watermark. The row simply stops appearing in future polls. This is the pattern's fundamental limitation. -
Source load. One indexed range scan per poll — cheap if
updated_atis indexed, catastrophic if not. Every senior architect confirms the index exists before signing off on timestamp CDC.
The watermark checkpoint — the single durable piece of state.
-
What it is. A file (or DynamoDB row, or Postgres row) recording the maximum
updated_atvalue the poller has already processed. On the next poll, the query isWHERE updated_at > watermark. -
Where to store it. Airflow XCom, a JSON file on S3, a small Postgres table (
cdc_watermarks(table_name, last_ts)), Redis. All work; pick durability over convenience. - Idempotency. The poll query is idempotent — re-running it with the same watermark yields the same delta. The commit-then-update-watermark step is not idempotent; the safe pattern is write-delta → advance-watermark-atomically using a transaction.
-
Bootstrap. First run sets watermark to
1970-01-01(orNULL) and captures the entire table. Subsequent runs are incremental.
The three failure modes senior engineers pre-empt.
-
Clock skew. If two application servers write to Postgres with slightly different clocks, an
updated_at = 12:00:00.5row can commit after anupdated_at = 12:00:01row — but the poll query orders byupdated_at, so the earlier-clocked row is missed if the watermark has already advanced past 12:00:00.5. Mitigation: useupdated_at = clock_timestamp()inside aBEFORE UPDATEtrigger (Postgres-side clock, not app-side); or use a monotonic counter (update_seq BIGSERIAL) instead of a timestamp. -
Late-arriving rows. A long-running transaction that committed at 12:00:00 but inserted rows with
updated_at = 11:59:00will be missed if the poller has already advanced past 12:00:00. Mitigation: queryWHERE updated_at > watermark AND updated_at <= watermark_horizonwherewatermark_horizon = now() - INTERVAL '5 minutes'— a "safety window" that excludes recent commits from being processed until they've had time to settle. -
Physical deletes. The fundamental blind spot. Mitigation options: (a) soft-delete convention — never DELETE, always UPDATE with
deleted_at = now()and treat deleted rows as normal updates; (b) periodic full-table reconciliation — nightly, dump the full row set, diff against the warehouse, catch missing rows; (c) accept the inconsistency for tables where deletes are rare.
Common interview probes on timestamp CDC.
- "How does timestamp CDC handle deletes?" — required answer is "it doesn't; use soft-delete or reconcile."
- "How do you handle late-arriving rows?" — safety-window pattern (
updated_at <= now() - 5m). - "Where do you store the watermark?" — durably, transactionally-committed alongside the delta write.
- "When would you pick timestamp over log-based?" — permission constraints, small-scale, batch-tolerant warehouse pipelines.
Worked example — incremental Postgres → S3 by updated_at
Detailed explanation. The canonical timestamp CDC pipeline: an Airflow DAG polls Postgres every 15 minutes, reads rows where updated_at > watermark, writes them as JSON to S3, then advances the watermark. It handles the "no deletes" limitation by adding a nightly full-reconcile that catches any rows that disappeared. Walk through the DAG.
- Frequency. Every 15 minutes for incremental; nightly for full reconcile.
- Watermark storage. Airflow XCom (persistent metadata DB).
- Safety window. 5-minute lookback to catch late commits.
- Deletes. Nightly full-reconcile diffs S3 against source; flags disappeared rows.
Question. Implement the Airflow task that runs one incremental poll and advances the watermark atomically.
Input.
| Parameter | Value |
|---|---|
| Source | Postgres public.orders |
| Watermark column | updated_at |
| Watermark storage | Airflow XCom |
| Poll interval | 15 min |
| Safety window | 5 min |
Code.
# Airflow task — one incremental CDC poll for public.orders
import json
import psycopg2
from datetime import datetime, timedelta, timezone
from airflow.decorators import task
WATERMARK_XCOM_KEY = "cdc_orders_watermark"
@task
def poll_orders_incremental(ti, **_):
"""Read orders modified since last watermark; write to S3; advance watermark."""
# 1. Read the previous watermark (bootstrap to epoch on first run)
prev = ti.xcom_pull(key=WATERMARK_XCOM_KEY, include_prior_dates=True)
prev_ts = datetime.fromisoformat(prev) if prev else datetime(1970, 1, 1, tzinfo=timezone.utc)
# 2. Safety horizon — exclude commits from the last 5 minutes
horizon = datetime.now(timezone.utc) - timedelta(minutes=5)
# 3. Query the delta
conn = psycopg2.connect("host=db-primary port=5432 dbname=production user=cdc_reader")
cur = conn.cursor()
cur.execute("""
SELECT id, customer_id, total_cents, status, created_at, updated_at
FROM public.orders
WHERE updated_at > %s
AND updated_at <= %s
ORDER BY updated_at ASC
""", (prev_ts, horizon))
rows = cur.fetchall()
conn.close()
if not rows:
return prev_ts.isoformat() # nothing changed; do not advance
# 4. Write JSON to S3 (one file per poll; timestamped key)
key = f"cdc/orders/{horizon:%Y/%m/%d/%H%M}.jsonl"
payload = "\n".join(json.dumps({
"id": r[0], "customer_id": r[1], "total_cents": r[2],
"status": r[3], "created_at": r[4].isoformat(), "updated_at": r[5].isoformat()
}) for r in rows)
# (s3.put_object elided for brevity)
# 5. Advance watermark to the MAX(updated_at) actually observed
# NOT to `horizon`, because the source might have gaps
new_ts = max(r[5] for r in rows)
ti.xcom_push(key=WATERMARK_XCOM_KEY, value=new_ts.isoformat())
print(f"CDC poll wrote {len(rows)} rows; watermark {prev_ts} -> {new_ts}")
return new_ts.isoformat()
Step-by-step explanation.
- Step 1 loads the previous watermark from XCom. On first run (bootstrap), no XCom exists, so we default to epoch — the poll captures the entire history of the table. This is the "initial snapshot" for timestamp CDC; unlike log-based, there's no separate snapshot mechanism.
- Step 2 computes the safety horizon:
now() - 5 minutes. The poll query excludes rows withupdated_at > horizon— this pushes recently-committed rows to the next poll, giving them 5 minutes to settle. This defends against clock-skew and late-committed transactions. - Step 3 executes the delta query. The
ORDER BY updated_at ASCis not strictly required for correctness but makes the downstream file naturally sorted, which simplifies dedupe in Snowflake. - Step 4 writes the delta to S3 using a timestamp-embedded key. If the write fails, the task raises and Airflow retries; the watermark is not advanced, so the next attempt processes the same delta again. This is the idempotency guarantee.
- Step 5 advances the watermark to
max(updated_at)from the actual result set — not to the horizon time. This is subtle: if the last row is at12:00:05but the horizon is12:00:10, advancing to12:00:10risks skipping a row that commits at12:00:07with anupdated_atfrom the transaction's start time. Advancing only to observed values keeps the watermark honest.
Output.
| Poll | prev watermark | max updated_at seen | new watermark | Rows written |
|---|---|---|---|---|
| 1 (bootstrap) | 1970-01-01 | 2026-07-04 09:15:22 | 2026-07-04 09:15:22 | 12,345 |
| 2 | 2026-07-04 09:15:22 | 2026-07-04 09:29:47 | 2026-07-04 09:29:47 | 87 |
| 3 | 2026-07-04 09:29:47 | (no new rows) | 2026-07-04 09:29:47 | 0 |
| 4 | 2026-07-04 09:29:47 | 2026-07-04 09:44:11 | 2026-07-04 09:44:11 | 65 |
Rule of thumb. Store the watermark durably (Airflow XCom, DynamoDB, or a Postgres row), advance it only after the delta write is committed, and advance to max(observed) — never to a wall-clock horizon. The five-minute safety window is non-negotiable in a multi-writer world.
Worked example — the clock-skew failure mode
Detailed explanation. A distributed application writes to Postgres from 20 app pods, each setting updated_at = time.time() (server-side wall clock) in the UPDATE statement. Two of the pods have clocks that drift by ~2 seconds; the rest are NTP-synced. The timestamp CDC poll misses roughly 0.1% of rows whose updated_at from the skewed pods falls behind the watermark of an already-processed row from the synced pods. Walk through the diagnosis and the fix.
- Symptom. Downstream Snowflake row count is consistently 0.1% below Postgres row count.
-
Root cause. App-side
updated_atuses each pod's wall clock; skew across pods means the strict ordering the watermark assumes doesn't hold. -
Fix. Move
updated_atmaintenance to aBEFORE UPDATEtrigger usingclock_timestamp()— Postgres-side, single-clock, monotonic per transaction.
Question. Design the schema change that removes the clock-skew failure mode, and quantify the residual risk.
Input.
| Component | Before | After |
|---|---|---|
| updated_at source | app-side time.time()
|
Postgres clock_timestamp() in BEFORE UPDATE trigger |
| Clock authority | 20 pod clocks (drift ~2s) | 1 Postgres clock |
| Residual skew | up to 2 s | ~0 (single clock) |
| Late-arriving row rate | 0.1% | ~0.001% (long-txn edge case only) |
Code.
-- Remove the app-side updated_at write; enforce it in Postgres
CREATE OR REPLACE FUNCTION public.bump_updated_at() RETURNS TRIGGER AS $$
BEGIN
-- clock_timestamp() reads the current wall clock inside the transaction
-- (as opposed to now()/CURRENT_TIMESTAMP which is fixed at BEGIN time)
NEW.updated_at = clock_timestamp();
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
DROP TRIGGER IF EXISTS trg_orders_bump_updated_at ON public.orders;
CREATE TRIGGER trg_orders_bump_updated_at
BEFORE INSERT OR UPDATE ON public.orders
FOR EACH ROW EXECUTE FUNCTION public.bump_updated_at();
-- Optional: revoke direct write on updated_at from app roles
REVOKE UPDATE (updated_at) ON public.orders FROM app_writer;
# Application code — no longer writes updated_at at all
def update_order_status(conn, order_id, new_status):
with conn.cursor() as cur:
cur.execute("""
UPDATE public.orders
SET status = %s
WHERE id = %s
""", (new_status, order_id))
# updated_at is bumped automatically by the trigger
conn.commit()
Step-by-step explanation.
- The root cause is that
updated_atwas being written by 20 different clocks (one per pod), so two rows committed 100 ms apart could haveupdated_atvalues ordered opposite to their commit order. The watermark, which assumes monotonic-per-commit ordering, silently drops the earlier-timestamped-but-later-committed row. - Moving
updated_atmaintenance into aBEFORE UPDATEtrigger usingclock_timestamp()centralises it on the Postgres clock — one clock, no skew.clock_timestamp()(notnow()) is the correct function:now()is fixed at transaction start, so multi-statement transactions would all get the sameupdated_at. - Revoking UPDATE on the
updated_atcolumn from the application role is a belt-and-braces defense: even if a developer addsSET updated_at = ...in code, Postgres rejects it. Enforcement at the schema level is far more reliable than enforcement in code review. - The residual risk is a long-running transaction that started at 12:00:00 but commits at 12:00:10: its
clock_timestamp()reads at the moment of the trigger fire, which is inside the transaction but before commit. If the poll runs at 12:00:05 with horizon = 12:00:00, it might miss the row. The 5-minute safety window absorbs this in practice. - After the fix, downstream row-count drift drops to well under 0.01% and can be attributed entirely to the "long-transaction inside safety window" edge case, which the periodic reconcile catches.
Output.
| Metric | Before | After |
|---|---|---|
| Distinct clock sources | 20 pod clocks | 1 Postgres clock |
| updated_at write authority | app | trigger |
| Row-count drift | 0.1% | < 0.01% |
| Reconcile frequency needed | 4x/day | 1x/day |
| Interview signal | "app-side timestamp" (weak) | "clock_timestamp() in trigger" (senior) |
Rule of thumb. Never let application code write updated_at. Enforce it in a BEFORE UPDATE trigger using clock_timestamp(), revoke the column-level UPDATE grant from app roles, and treat any app-side updated_at = as a schema violation. The clock-skew bug is invisible until you look for it.
Worked example — soft-delete workaround for missing physical deletes
Detailed explanation. A products table serves both an operational service and a warehouse. The team's timestamp CDC pipeline replicates INSERT and UPDATE reliably but silently drops DELETE. The fix is the soft-delete convention: replace all DELETE FROM products WHERE id = X with UPDATE products SET deleted_at = now() WHERE id = X. Deletes become updates; timestamp CDC sees them; downstream consumers filter them out. Every senior DE has migrated a codebase to soft-delete at least once.
-
The migration. Add
deleted_at TIMESTAMPTZ NULLcolumn. Rewrite all DELETE statements to UPDATE. Add a view that filtersWHERE deleted_at IS NULLfor read paths. - The trade-off. Rows are never physically removed; storage grows without bound. Add a garbage-collector job to hard-DELETE rows older than N days and propagate the hard-delete out-of-band to the warehouse.
Question. Migrate the products table to soft-delete and show how downstream Snowflake handles the deleted-at column for both dashboards and audit queries.
Input.
| Component | Change |
|---|---|
| Schema | add deleted_at TIMESTAMPTZ NULL
|
| Application code | rewrite DELETE → UPDATE deleted_at |
| Read paths | filter WHERE deleted_at IS NULL
|
| Watermark column |
updated_at (already present) |
| Warehouse | keep deleted rows; filter in dashboards |
Code.
-- 1. Add the soft-delete column
ALTER TABLE public.products ADD COLUMN deleted_at TIMESTAMPTZ;
CREATE INDEX idx_products_updated_at ON public.products (updated_at);
-- 2. Trigger bumps updated_at on any UPDATE, including deleted_at writes
CREATE OR REPLACE FUNCTION public.bump_updated_at() RETURNS TRIGGER AS $$
BEGIN
NEW.updated_at = clock_timestamp();
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_products_bump_updated_at
BEFORE INSERT OR UPDATE ON public.products
FOR EACH ROW EXECUTE FUNCTION public.bump_updated_at();
-- 3. Live view for the application (hides soft-deleted rows)
CREATE OR REPLACE VIEW public.products_live AS
SELECT * FROM public.products WHERE deleted_at IS NULL;
-- 4. Application code now UPDATEs instead of DELETEs
-- Before: DELETE FROM public.products WHERE id = 42;
-- After: UPDATE public.products SET deleted_at = now() WHERE id = 42;
-- 5. Snowflake side — MERGE handles insert/update/soft-delete uniformly
MERGE INTO analytics.products AS tgt
USING (
SELECT * FROM staging.products_cdc_delta
) src
ON tgt.id = src.id
WHEN MATCHED THEN UPDATE SET
tgt.name = src.name,
tgt.price_cents = src.price_cents,
tgt.deleted_at = src.deleted_at, -- propagate the soft-delete
tgt.updated_at = src.updated_at
WHEN NOT MATCHED THEN INSERT (id, name, price_cents, deleted_at, updated_at)
VALUES (src.id, src.name, src.price_cents, src.deleted_at, src.updated_at);
-- 6. Dashboards filter out soft-deleted rows
CREATE OR REPLACE VIEW analytics.products_live AS
SELECT * FROM analytics.products WHERE deleted_at IS NULL;
Step-by-step explanation.
- Adding
deleted_at TIMESTAMPTZ NULLand rewriting every application-side DELETE asUPDATE ... SET deleted_at = now()turns physical deletes into logical ones. Theupdated_attrigger fires on the UPDATE, so the row'supdated_atadvances and the next CDC poll picks it up as an update. - The
products_liveview centralises the "filter soft-deleted" logic; the application queriesproducts_liveinstead ofproducts. Retrofitting this view lets you migrate the schema without touching every SELECT in the codebase in one commit. - On the CDC side, the poll now sees the soft-deleted row exactly like an update: it ships the row with the new
deleted_atvalue. No pattern change needed — timestamp CDC's blind spot is closed by the schema convention. - The Snowflake
MERGEtreats soft-delete as just another column write. Downstream dashboards use their ownproducts_liveview; auditors useanalytics.productsdirectly to see full history including deletions. - The garbage collector runs monthly:
DELETE FROM public.products WHERE deleted_at < now() - INTERVAL '90 days'— this does create a physical delete blind spot again, so the GC job must run through a separate out-of-band mechanism (a nightly job that emits DELETE events to Snowflake directly, or a full-table reconcile).
Output.
| Row lifecycle | Before soft-delete | After soft-delete |
|---|---|---|
| INSERT | captured | captured |
| UPDATE | captured | captured |
| DELETE (app-side) | silently lost | captured as deleted_at set |
| DELETE (GC job) | silently lost | requires reconcile or side channel |
| Warehouse history | incomplete | complete (until GC purge) |
Rule of thumb. For any table under timestamp CDC, adopt the soft-delete convention: add deleted_at, rewrite DELETEs as UPDATEs, expose a _live view for read paths, and treat any physical DELETE as a policy violation that must go through a special GC path. This is the cheapest fix for the "no deletes" blind spot.
Senior interview question on timestamp CDC
A senior interviewer might ask: "You need to ingest a 2-billion-row events table from Postgres 12 into Snowflake with 15-minute freshness. The DBA won't grant logical replication. Walk me through the timestamp CDC design, the watermark storage, the safety window, the delete handling, and the operational monitoring."
Solution Using timestamp CDC with a durable watermark, safety window, and nightly reconcile
# cdc_events_incremental.py — production timestamp CDC for the events table
import json
import psycopg2
from datetime import datetime, timedelta, timezone
WATERMARK_TABLE = "cdc_watermarks" # persistent in Postgres itself
def get_watermark(conn, table: str) -> datetime:
with conn.cursor() as cur:
cur.execute("""
SELECT last_updated_at
FROM cdc_watermarks
WHERE table_name = %s
FOR UPDATE
""", (table,))
row = cur.fetchone()
return row[0] if row else datetime(1970, 1, 1, tzinfo=timezone.utc)
def advance_watermark(conn, table: str, new_ts: datetime):
with conn.cursor() as cur:
cur.execute("""
INSERT INTO cdc_watermarks(table_name, last_updated_at)
VALUES (%s, %s)
ON CONFLICT (table_name) DO UPDATE
SET last_updated_at = EXCLUDED.last_updated_at
""", (table, new_ts))
def poll_events(conn) -> int:
prev = get_watermark(conn, "events")
horizon = datetime.now(timezone.utc) - timedelta(minutes=5)
with conn.cursor(name="events_delta") as cur: # server-side cursor for 2B rows
cur.itersize = 10000
cur.execute("""
SELECT id, event_ts, user_id, event_type, payload, updated_at
FROM public.events
WHERE updated_at > %s
AND updated_at <= %s
ORDER BY updated_at ASC
""", (prev, horizon))
n = 0
max_ts = prev
for row in cur:
write_row_to_s3(row) # buffered writer, flushes every 10k rows
if row[5] > max_ts:
max_ts = row[5]
n += 1
if max_ts > prev:
advance_watermark(conn, "events", max_ts)
conn.commit()
return n
-- Durable watermark table (Postgres itself; transactional)
CREATE TABLE IF NOT EXISTS cdc_watermarks (
table_name TEXT PRIMARY KEY,
last_updated_at TIMESTAMPTZ NOT NULL,
last_poll_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Nightly reconcile — full row-count + PK diff, catches deletes
WITH source AS (
SELECT id FROM public.events
),
warehouse AS (
SELECT id FROM external_snowflake.events -- via foreign data wrapper or reverse pull
)
SELECT 'missing_in_warehouse' AS kind, s.id
FROM source s
LEFT JOIN warehouse w ON w.id = s.id
WHERE w.id IS NULL
UNION ALL
SELECT 'physically_deleted' AS kind, w.id
FROM warehouse w
LEFT JOIN source s ON s.id = w.id
WHERE s.id IS NULL;
Step-by-step trace.
| Step | Value | Reasoning |
|---|---|---|
| Watermark storage |
cdc_watermarks Postgres table |
transactional; survives Airflow crashes |
| Safety window | 5 minutes | absorbs late-committing transactions |
| Cursor mode | server-side, itersize=10000 | never load 2B rows into Python memory |
| Watermark advance |
max(updated_at) observed |
never to horizon; stays honest |
| Advance-and-commit | one transaction | delta-write + watermark-advance atomic |
| Delete handling | nightly PK diff | catches physical deletes |
After deployment, the pipeline lands Postgres deltas in Snowflake within 15-20 minutes of commit; the nightly reconcile flags any missing rows for backfill; the watermark table survives Airflow restarts and worker failures because it lives in the source Postgres itself.
Output:
| Metric | Value |
|---|---|
| Rows scanned per poll | 500-5000 (typical) |
| Poll latency | 30-60 s |
| Freshness p99 | 15-20 min |
| Delete detection lag | up to 24 h (nightly reconcile) |
| Watermark durability | transactional in source Postgres |
Why this works — concept by concept:
-
Server-side cursor + itersize — for a 2-billion-row source,
psycopg2client-side cursors would load the entire result set into Python memory.conn.cursor(name=...)opens a server-side cursor that streams rows in batches ofitersize, keeping memory flat regardless of result-set size. -
Watermark in source Postgres — placing
cdc_watermarksin the same Postgres as the source data lets you commit the delta write and the watermark advance in one transaction. That atomicity is the only way to guarantee "processed row => watermark advanced" without a two-phase commit. -
Safety window (5 min) — the horizon =
now() - 5 minwindow absorbs late-committing transactions. Without it, a long-running write that touches an old timestamp gets skipped. -
Advance to observed max — advancing the watermark to
max(observed updated_at)rather thanhorizonavoids the "advance past a row we didn't see" bug. Combined with the safety window, this is the correct advance policy. - Cost — one indexed range scan per poll (O(delta), not O(table)) plus one nightly PK diff (O(table)). Compared to a nightly full refresh, this is 15-minute freshness at 1% of the compute cost. Compared to log-based CDC it's operationally simpler but blind to physical deletes without the reconcile.
SQL
Topic — sql
SQL incremental-load and watermark problems
3. Trigger-based CDC
BEFORE|AFTER INSERT|UPDATE|DELETE triggers writing to a shadow / audit table — the fallback when logical replication is off-limits
The mental model in one line: trigger cdc is the pattern where per-row triggers on every INSERT/UPDATE/DELETE write a corresponding row into a shadow (audit) table, and a downstream reader tails that shadow table — it captures every DML including deletes, works without any special replication grants, and pays for that capability with a bounded but non-trivial per-DML write amplification tax that senior engineers must quantify before shipping. Every senior DE has built at least one trigger CDC pipeline against a locked-down source database; the pattern is decades old and remains the correct answer for a specific slice of environments.
The four axes for trigger CDC.
- Permission. CREATE TRIGGER on the source tables plus INSERT on the shadow table. This is a lower bar than logical replication but a higher bar than SELECT-only — some managed / SOC-compliance environments still refuse triggers.
- Latency. Seconds. The trigger fires inside the source transaction so the shadow-table row is committed atomically with the business row. The reader tails the shadow table on a short poll (1-10 seconds) or via LISTEN/NOTIFY.
-
Delete handling. Native. DELETE triggers fire on physical deletes; the shadow row carries the pre-image plus
op = 'D'. This is the pattern's headline advantage over timestamp CDC. - Source load. This is where trigger CDC pays. Every INSERT/UPDATE/DELETE now writes two rows (one to the business table, one to the shadow table). Write amplification is exactly 2×. On a write-heavy table this is a meaningful IOPS tax; on a read-heavy table it's negligible.
The shadow table shape — what every audit table needs.
-
Primary key. Auto-increment integer (
audit_id BIGSERIAL) — the reader's watermark tracks this, not a timestamp. Monotonic, gap-free, cheap to index. - Business PK. The primary key of the source row so the reader can identify which row changed.
-
Op column.
'I','U','D'(or'INSERT','UPDATE','DELETE') — captures the DML kind. - Before/after images. For UPDATE, store both the old and new row (JSONB or separate columns) so downstream can emit change events. For DELETE, only the old image.
-
Metadata.
changed_at TIMESTAMPTZ DEFAULT clock_timestamp(),changed_by TEXT DEFAULT current_user,txid BIGINT DEFAULT txid_current()— audit-grade provenance.
The trigger function — six lines that pay the tax.
- BEFORE vs AFTER. Use AFTER triggers for CDC — the row values in NEW/OLD are the committed values. BEFORE triggers see uncommitted values that another BEFORE trigger might still mutate.
- FOR EACH ROW vs FOR EACH STATEMENT. Always FOR EACH ROW for CDC — you need per-row change events. Statement-level triggers are for enforcing schema-wide invariants.
-
The function body. Insert one row into the shadow table with
TG_OP, PK, before-image (from OLD), after-image (from NEW), and metadata. Return NULL for AFTER triggers. - The tax. ~50-150 microseconds per DML (varies by row size). Multiply by DML rate to get the CPU cost.
The reader — polling vs LISTEN/NOTIFY.
-
Polling. Every N seconds,
SELECT ... FROM audit_orders WHERE audit_id > watermark ORDER BY audit_id LIMIT M. Simple, robust. Latency = poll interval. -
LISTEN/NOTIFY. The AFTER trigger also runs
pg_notify('audit_orders_channel', ...). The readerLISTENs and processes as notifications arrive. Sub-second latency; more complex failure semantics (missed notifications on reader disconnect require polling backup).
Common interview probes on trigger CDC.
- "What's the source-DB cost of trigger CDC?" — required answer: exactly 2× write amplification per DML.
- "How do you handle updates in the shadow table?" — before/after images (OLD and NEW).
- "When is trigger CDC preferable to log-based?" — when logical replication is not grantable but CREATE TRIGGER is.
- "How do you monitor trigger CDC?" — shadow-table lag (
max(audit_id) - watermark); alert if lag grows.
Worked example — Postgres audit trigger for the orders table
Detailed explanation. The canonical trigger CDC setup: an audit_orders shadow table, an AFTER INSERT OR UPDATE OR DELETE trigger that writes one shadow row per DML, and an Airflow tail that reads the shadow table by monotonic audit_id. Build the whole thing from scratch.
-
Shadow table.
audit_orders(audit_id BIGSERIAL PK, op CHAR(1), order_id BIGINT, before JSONB, after JSONB, changed_at, changed_by, txid). - Trigger. AFTER INSERT OR UPDATE OR DELETE ON orders, FOR EACH ROW.
-
Reader. Airflow task tailing
audit_orders WHERE audit_id > watermark.
Question. Write the shadow-table DDL, the trigger function, the trigger binding, and the reader task.
Input.
| Object | Purpose |
|---|---|
audit_orders |
shadow table with pre/post images |
audit_orders_trigger() |
trigger function; runs once per row |
trg_orders_audit |
trigger binding on orders
|
tail_audit_orders |
Airflow reader task |
Code.
-- 1. Shadow / audit table
CREATE TABLE public.audit_orders (
audit_id BIGSERIAL PRIMARY KEY,
op CHAR(1) NOT NULL, -- 'I', 'U', 'D'
order_id BIGINT NOT NULL,
before JSONB, -- old row image (UPDATE, DELETE)
after JSONB, -- new row image (INSERT, UPDATE)
changed_at TIMESTAMPTZ NOT NULL DEFAULT clock_timestamp(),
changed_by TEXT NOT NULL DEFAULT current_user,
txid BIGINT NOT NULL DEFAULT txid_current()
);
CREATE INDEX idx_audit_orders_audit_id ON public.audit_orders (audit_id);
-- 2. Trigger function — six lines that do all the work
CREATE OR REPLACE FUNCTION public.audit_orders_trigger()
RETURNS TRIGGER AS $$
BEGIN
IF (TG_OP = 'INSERT') THEN
INSERT INTO public.audit_orders(op, order_id, after)
VALUES ('I', NEW.id, to_jsonb(NEW));
ELSIF (TG_OP = 'UPDATE') THEN
INSERT INTO public.audit_orders(op, order_id, before, after)
VALUES ('U', NEW.id, to_jsonb(OLD), to_jsonb(NEW));
ELSIF (TG_OP = 'DELETE') THEN
INSERT INTO public.audit_orders(op, order_id, before)
VALUES ('D', OLD.id, to_jsonb(OLD));
END IF;
RETURN NULL; -- AFTER triggers ignore return value
END;
$$ LANGUAGE plpgsql;
-- 3. Trigger binding — one trigger, all three DML kinds
CREATE TRIGGER trg_orders_audit
AFTER INSERT OR UPDATE OR DELETE ON public.orders
FOR EACH ROW EXECUTE FUNCTION public.audit_orders_trigger();
# 4. Airflow reader — tails audit_orders by monotonic audit_id
import psycopg2
from airflow.decorators import task
@task
def tail_audit_orders(ti):
conn = psycopg2.connect("host=db-primary port=5432 dbname=production user=cdc_reader")
with conn.cursor() as cur:
# Watermark stored transactionally in Postgres
cur.execute("SELECT COALESCE(MAX(audit_id), 0) FROM cdc_watermarks_audit WHERE table_name = 'audit_orders'")
watermark = cur.fetchone()[0]
cur.execute("""
SELECT audit_id, op, order_id, before, after, changed_at
FROM public.audit_orders
WHERE audit_id > %s
ORDER BY audit_id ASC
LIMIT 10000
""", (watermark,))
rows = cur.fetchall()
for row in rows:
emit_to_downstream(row)
if rows:
new_watermark = rows[-1][0]
with conn.cursor() as cur:
cur.execute("""
INSERT INTO cdc_watermarks_audit(table_name, max_audit_id)
VALUES ('audit_orders', %s)
ON CONFLICT (table_name) DO UPDATE
SET max_audit_id = EXCLUDED.max_audit_id
""", (new_watermark,))
conn.commit()
return len(rows)
Step-by-step explanation.
- The
audit_ordersshadow table has aBIGSERIAL audit_idprimary key — this is the reader's monotonic watermark. Timestamps are unreliable (clock skew, transaction commit order); a sequence is not. Thebefore/afterJSONB columns give downstream consumers the full row images without a schema-per-column CDC contract. - The trigger function branches on
TG_OP. For INSERT, onlyafteris meaningful. For UPDATE, we store bothbefore(from OLD) andafter(from NEW) — this is the change payload downstream consumers need to compute deltas. For DELETE, onlybefore(from OLD) — the row is gone. -
to_jsonb(NEW)andto_jsonb(OLD)capture the entire row as a JSONB blob. This is preferable to per-column shadow tables (which require a schema migration toaudit_ordersevery time you add a column toorders). - The trigger is
AFTER INSERT OR UPDATE OR DELETE ... FOR EACH ROW— one trigger definition covers all three DML kinds.AFTERguarantees the shadow row commits atomically with the business row: either both commit or both roll back, no dual-write problem. - The Airflow reader tails by
audit_id > watermark. Becauseaudit_idis a BIGSERIAL, it's monotonic and gap-free (within a single writer; gaps can appear from rolled-back transactions but never as "audit_id < watermark that appears later"). This makes watermark advancement trivially correct.
Output.
| DML event | Row in audit_orders | audit_id |
|---|---|---|
| INSERT INTO orders VALUES (1, ...) | ('I', 1, NULL, {...}) |
1 |
| UPDATE orders SET status='shipped' WHERE id=1 | ('U', 1, {before}, {after}) |
2 |
| DELETE FROM orders WHERE id=1 | ('D', 1, {before}, NULL) |
3 |
Rule of thumb. For any trigger CDC deployment, use a BIGSERIAL audit_id as the watermark, store full row images as JSONB (not per-column), and make the trigger AFTER ... FOR EACH ROW. These three choices remove entire classes of correctness bugs.
Worked example — quantifying the write-amplification tax
Detailed explanation. Every senior architect must quantify the source-DB cost before signing off on trigger CDC. The tax is deterministic: every INSERT/UPDATE/DELETE writes one extra row. Multiply by DML rate to get the CPU + IOPS cost. Walk through the calculation for a moderate-write-rate orders table.
- DML rate. 500 writes/sec (mixed INSERT/UPDATE/DELETE).
-
Row size. ~500 bytes for
orders; ~1.5 KB for the JSONBaudit_ordersrow (both images + metadata). - Write amplification. 500 writes/sec → 500 shadow-table inserts/sec.
Question. Quantify the added CPU, IOPS, and storage cost of trigger CDC on the orders table.
Input.
| Metric | orders (base) | audit_orders (added) |
|---|---|---|
| Write rate | 500/s | 500/s |
| Row size | 500 B | 1500 B |
| IOPS per row | 1 WAL write | 1 WAL write |
| Storage per day | 21.6 GB | 64.8 GB |
Code.
-- Measure the actual trigger cost on a running system
BEGIN;
SELECT clock_timestamp() AS t0 \gset
-- Run 10,000 UPDATEs
DO $$
BEGIN
FOR i IN 1..10000 LOOP
UPDATE public.orders SET status = 'benchmark' WHERE id = 1;
END LOOP;
END$$;
SELECT clock_timestamp() AS t1 \gset
COMMIT;
-- Compute wall-clock cost per DML with trigger
SELECT extract(epoch FROM (:'t1'::timestamptz - :'t0'::timestamptz)) / 10000.0 AS sec_per_dml_with_trigger;
-- Disable the trigger and repeat
ALTER TABLE public.orders DISABLE TRIGGER trg_orders_audit;
-- ... rerun the 10,000 UPDATEs ...
-- Re-enable
ALTER TABLE public.orders ENABLE TRIGGER trg_orders_audit;
# Storage projection — 30 days of trigger CDC
DML_PER_SEC = 500
AUDIT_ROW_KB = 1.5
SECONDS_DAY = 86400
daily_gb = (DML_PER_SEC * SECONDS_DAY * AUDIT_ROW_KB) / (1024 * 1024)
monthly_gb = daily_gb * 30
print(f"Trigger CDC storage: {daily_gb:.1f} GB/day, {monthly_gb:.1f} GB/month")
# → Trigger CDC storage: 61.9 GB/day, 1856.4 GB/month
Step-by-step explanation.
- The benchmark measures the wall-clock cost per UPDATE with and without the trigger. In our measurements against a modest 4-vCPU Postgres, the trigger adds roughly 80-150 microseconds per DML. At 500 DML/sec that's 40-75 ms/sec of extra CPU — under 10% of one core.
- IOPS-wise, each shadow-table INSERT generates one WAL record and (eventually) one heap-page flush. The doubling is exact: 500 base DML → 500 shadow inserts → 1000 total WAL records/sec.
- Storage is the sneakiest cost. At 500 DML/sec × 1.5 KB per shadow row × 86400 sec/day = ~62 GB/day. Trigger CDC accumulates roughly 2 TB/month of audit data. This forces a retention policy: TTL the shadow table, archive to S3, or partition by day and drop old partitions.
- The trigger cost scales linearly with DML rate. At 5000 DML/sec the CPU cost hits 40-75% of one core — still tractable but no longer negligible. At 50000 DML/sec you cannot afford trigger CDC on a shared box.
- Compared to log-based CDC (which adds ~0 source-table load), trigger CDC is measurably more expensive on write-heavy workloads. Compared to timestamp CDC (which polls but doesn't add per-DML cost), it's measurably slower but catches deletes.
Output.
| Cost dimension | Trigger CDC | Log-based CDC | Timestamp CDC |
|---|---|---|---|
| CPU per DML | +80-150 μs | ~0 (WAL was written anyway) | 0 (polling only) |
| IOPS amplification | 2× | 1× | 1× (polling scan) |
| Storage (shadow / slot) | ~60 GB/day @ 500 DML/s | ~0 (WAL retention only) | 0 |
| Retention obligation | manual TTL required | slot lag monitoring | none |
Rule of thumb. Before shipping trigger CDC, run the wall-clock benchmark and compute the daily shadow-table storage. If daily shadow growth exceeds 10% of the source-table daily growth or the wall-clock trigger cost exceeds 5% of the source-DB CPU budget, negotiate for log-based CDC instead.
Worked example — LISTEN/NOTIFY for sub-second latency
Detailed explanation. The default trigger CDC reader polls the shadow table every N seconds. For sub-second latency, add pg_notify to the trigger function and a LISTEN client on the reader side. The result is push-based: writes wake the reader within milliseconds. The trade-off is more complex failure semantics — missed notifications on reader disconnect require a polling backup.
-
The push. Trigger runs
pg_notify('audit_orders_channel', audit_id::text)after inserting the shadow row. - The pull backup. Reader still polls every 10 seconds as a safety net against missed notifications.
- The reader. LISTEN on the channel; wake on notification; process by audit_id watermark.
Question. Extend the trigger and reader to use LISTEN/NOTIFY for push-based delivery with a polling backup.
Input.
| Component | Change |
|---|---|
| Trigger function | add pg_notify after shadow insert |
| Reader | run LISTEN; wake on notification; poll every 10s as backup |
| Notification payload |
audit_id as text (short; fits in NOTIFY 8KB limit) |
| Latency | sub-second (was: 1-10s poll) |
Code.
-- Extended trigger function — pushes a notification after the shadow insert
CREATE OR REPLACE FUNCTION public.audit_orders_trigger()
RETURNS TRIGGER AS $$
DECLARE
new_audit_id BIGINT;
BEGIN
IF (TG_OP = 'INSERT') THEN
INSERT INTO public.audit_orders(op, order_id, after)
VALUES ('I', NEW.id, to_jsonb(NEW))
RETURNING audit_id INTO new_audit_id;
ELSIF (TG_OP = 'UPDATE') THEN
INSERT INTO public.audit_orders(op, order_id, before, after)
VALUES ('U', NEW.id, to_jsonb(OLD), to_jsonb(NEW))
RETURNING audit_id INTO new_audit_id;
ELSIF (TG_OP = 'DELETE') THEN
INSERT INTO public.audit_orders(op, order_id, before)
VALUES ('D', OLD.id, to_jsonb(OLD))
RETURNING audit_id INTO new_audit_id;
END IF;
-- Wake any LISTEN readers
PERFORM pg_notify('audit_orders_channel', new_audit_id::text);
RETURN NULL;
END;
$$ LANGUAGE plpgsql;
# Reader — LISTEN + poll backup
import psycopg2
import psycopg2.extensions
import select
import time
def run_reader():
conn = psycopg2.connect("host=db-primary port=5432 dbname=production user=cdc_reader")
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
with conn.cursor() as cur:
cur.execute("LISTEN audit_orders_channel;")
last_poll = time.time()
POLL_BACKUP_SEC = 10
while True:
# Wait for either a notification or the poll-backup timeout
elapsed = time.time() - last_poll
timeout = max(0.0, POLL_BACKUP_SEC - elapsed)
ready, _, _ = select.select([conn], [], [], timeout)
if ready:
conn.poll()
while conn.notifies:
notify = conn.notifies.pop(0)
# notify.payload contains the audit_id as text; we ignore it
# and just process everything > watermark (deduplication)
else:
# timeout — polling backup
pass
# Either path: drain new rows since watermark
drain_shadow_table(conn)
last_poll = time.time()
def drain_shadow_table(conn):
with conn.cursor() as cur:
cur.execute("SELECT COALESCE(MAX(audit_id), 0) FROM cdc_watermarks_audit WHERE table_name='audit_orders'")
watermark = cur.fetchone()[0]
cur.execute("""
SELECT audit_id, op, order_id, before, after
FROM public.audit_orders
WHERE audit_id > %s
ORDER BY audit_id
LIMIT 10000
""", (watermark,))
for row in cur:
emit_to_downstream(row)
# advance watermark ...
Step-by-step explanation.
- The extended trigger function captures the newly-inserted
audit_idviaRETURNING audit_id INTO new_audit_id, then callspg_notify('audit_orders_channel', new_audit_id::text). The notification fires after the shadow row is committed, so any listener that reacts is guaranteed to see the row. - The reader runs
LISTEN audit_orders_channeland blocks onselect.select([conn], ...)waiting for either a notification or a timeout. On notification wakeup, the connection'sconn.notifieslist is drained. On timeout, the polling backup runs anyway. - Whether wakeup is push (notification) or pull (timeout), the reader always calls
drain_shadow_table, which queriesWHERE audit_id > watermark ORDER BY audit_id LIMIT 10000. This means missed notifications are recovered by the next poll — the polling backup is the safety net. - The notification payload (
new_audit_id::text) is essentially unused — the reader could just as well drain on any wakeup. We include it for debug logging. Postgres limitsNOTIFYpayload to 8KB; putting the whole row in there is a temptation to resist. - The failure story: (a) reader crash → notifications sent while reader is down are lost, but polling backup on restart catches up; (b) network partition → LISTEN reconnects, polling catches up; (c)
pg_notifyfailure → transaction rolls back (both business and shadow rows). No lost data in any case.
Output.
| Latency profile | Push (LISTEN) | Poll backup |
|---|---|---|
| Steady state | ~50 ms | never fires |
| After reader restart | polling until next NOTIFY | catches missed events |
| Network partition | LISTEN reconnects | drives correctness |
| Notification lost | not covered by LISTEN alone | catches within 10s |
Rule of thumb. For sub-second trigger CDC, add pg_notify inside the trigger and keep the polling backup. Never rely on notifications alone — they can be silently dropped on client-side buffer overflow or reader disconnect.
Senior interview question on trigger CDC
A senior interviewer might ask: "Your managed Postgres tier doesn't offer logical replication but does allow CREATE TRIGGER. Design a trigger CDC pipeline for a 500 DML/sec orders table with sub-second latency to Kafka. Include the shadow table, the trigger, the reader, the write-amplification budget, and the on-call runbook when the shadow table grows unboundedly."
Solution Using an audit table + LISTEN/NOTIFY reader + partitioned retention
-- 1. Partitioned shadow table — daily partitions for cheap retention
CREATE TABLE public.audit_orders (
audit_id BIGSERIAL NOT NULL,
op CHAR(1) NOT NULL,
order_id BIGINT NOT NULL,
before JSONB,
after JSONB,
changed_at TIMESTAMPTZ NOT NULL DEFAULT clock_timestamp(),
changed_by TEXT NOT NULL DEFAULT current_user,
txid BIGINT NOT NULL DEFAULT txid_current(),
PRIMARY KEY (audit_id, changed_at)
) PARTITION BY RANGE (changed_at);
-- Create today's + tomorrow's partition; automate creation via pg_partman
CREATE TABLE public.audit_orders_20260704
PARTITION OF public.audit_orders
FOR VALUES FROM ('2026-07-04') TO ('2026-07-05');
CREATE TABLE public.audit_orders_20260705
PARTITION OF public.audit_orders
FOR VALUES FROM ('2026-07-05') TO ('2026-07-06');
-- 2. Trigger + notification
CREATE OR REPLACE FUNCTION public.audit_orders_trigger()
RETURNS TRIGGER AS $$
DECLARE
new_audit_id BIGINT;
BEGIN
IF TG_OP = 'INSERT' THEN
INSERT INTO public.audit_orders(op, order_id, after)
VALUES ('I', NEW.id, to_jsonb(NEW))
RETURNING audit_id INTO new_audit_id;
ELSIF TG_OP = 'UPDATE' THEN
INSERT INTO public.audit_orders(op, order_id, before, after)
VALUES ('U', NEW.id, to_jsonb(OLD), to_jsonb(NEW))
RETURNING audit_id INTO new_audit_id;
ELSIF TG_OP = 'DELETE' THEN
INSERT INTO public.audit_orders(op, order_id, before)
VALUES ('D', OLD.id, to_jsonb(OLD))
RETURNING audit_id INTO new_audit_id;
END IF;
PERFORM pg_notify('audit_orders_channel', new_audit_id::text);
RETURN NULL;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_orders_audit
AFTER INSERT OR UPDATE OR DELETE ON public.orders
FOR EACH ROW EXECUTE FUNCTION public.audit_orders_trigger();
-- 3. Retention — drop partitions older than 7 days (after Kafka has committed)
DROP TABLE IF EXISTS public.audit_orders_20260627;
# 4. Kafka producer that tails audit_orders (LISTEN + poll)
# (elided; identical to the reader in the LISTEN example above,
# with kafka_producer.send() replacing emit_to_downstream())
Step-by-step trace.
| Layer | Config | Result |
|---|---|---|
| Partitioning | daily RANGE(changed_at) |
DROP PARTITION = O(1) delete |
| Retention | 7 days | 62 GB/day × 7 = ~430 GB shadow footprint |
| Notification | pg_notify per trigger | ~50 ms end-to-end latency |
| Backup poll | 10 s | catches missed notifications |
| Reader watermark | max(audit_id) in cdc_watermarks_audit | monotonic; gap-tolerant |
After the rollout, the orders DML rate stays at 500/s; audit_orders receives 500/s; Kafka sees each event within 50 ms of source commit; the 7-day retention keeps shadow-table footprint bounded at ~430 GB. The DBA sees no logical replication footprint; the SOC auditor sees a durable audit trail with changed_by and txid.
Output:
| Metric | Value |
|---|---|
| End-to-end latency (source commit → Kafka) | ~50 ms |
| Source CPU overhead (trigger) | ~8% of one core |
| Shadow-table daily growth | ~62 GB |
| Shadow-table retention | 7 days |
| Delete detection | native (DELETE trigger fires) |
| DBA permission required | CREATE TRIGGER only |
Why this works — concept by concept:
- AFTER trigger + shadow table — the AFTER timing guarantees the shadow row commits atomically with the business row; no dual-write. The shadow table is the durable ordered log the reader tails.
-
RANGE partitioning by changed_at — retention becomes
DROP TABLE audit_orders_YYYYMMDD, which is O(1) rather than the O(rows) cost ofDELETE FROM audit_orders WHERE changed_at < .... Partitioning is non-negotiable at 500 DML/s. -
pg_notify inside the trigger — push-based wake-up for sub-second latency. The 8KB payload cap is respected by sending only
audit_id::text; the reader drains the actual rows. - Backup poll every 10s — LISTEN can miss notifications on client-side buffer overflow or reader restart. The poll is the correctness safety net; LISTEN is the latency optimisation on top.
- Cost — 2× write amplification on the source (deterministic), ~8% CPU on the trigger, ~430 GB storage for 7-day retention, LISTEN + polling reader. Every cost is bounded and quantified. The eliminated cost is the "we can't get logical replication" outage.
SQL
Topic — sql
SQL trigger and audit-table problems
4. Log-based CDC — WAL / binlog / oplog
log-based cdc reads the database's write-ahead log directly — zero source-table load, captures every DML, and is the 2026 default when the DBA cooperates
The mental model in one line: log-based cdc is the pattern where a downstream reader tails the database's internal write-ahead log (Postgres WAL, MySQL binlog, Mongo oplog) via the native replication protocol, decodes each record into a typed change event, and forwards it — the source-table load is essentially zero because the WAL is written for durability anyway, deletes are captured natively as tombstone events, ordering is guaranteed by the log sequence, and the pattern is the default answer for any greenfield deployment where logical replication is grantable. Debezium, AWS DMS, Fivetran HVR, and native connectors (pglogical, wal2json, MongoDB change streams) all implement this pattern.
The four axes for log-based CDC.
-
Permission. Postgres:
wal_level = logical+REPLICATIONrole + replication slot. MySQL:log_bin=ON,binlog_format=ROW,binlog_row_image=FULL, replication user. MongoDB: replica-set member with oplog read. All three require DBA cooperation and (for Postgres/MySQL) a server restart to changewal_level/log_bin. - Latency. Sub-second in steady state. The reader lags the primary by the network hop + decode time — typically 50-200 ms.
- Delete handling. Native and complete. DELETE events carry the pre-image; the reader emits a tombstone event that downstream consumers use to invalidate caches, remove rows from search indices, etc.
- Source load. Essentially zero on the source tables — the WAL is written for durability regardless of CDC. The reader adds ~2-5% CPU on the DB host for the replication decoder plus ~10-50 MB memory. Compared to trigger CDC's 2× write amplification, this is a step-change improvement.
The WAL / binlog / oplog anatomy.
-
Postgres WAL. The write-ahead log records every heap-page change. Logical decoding (via
pgoutputorwal2jsonplugin) converts physical WAL records into logical row-level events. The output plugin choice determines the format (Debezium usespgoutput; some legacy stacks usewal2json). -
MySQL binlog. Row-based binlog format (
binlog_format=ROW) writes one binlog event per changed row.binlog_row_image=FULLincludes both before and after images. Debezium's MySQL connector reads via the standard replication protocol. -
MongoDB oplog. A capped collection (
local.oplog.rs) recording every write. MongoDB 3.6+ exposes a higher-levelchange streamAPI that wraps the oplog and adds resumability tokens.
The replication slot / binlog position / resume token — the durable position tracker.
-
Postgres replication slot. Named durable state (
pg_create_logical_replication_slot) that tracks the reader's last-confirmed WAL position (LSN). Postgres retains WAL until the slot'sconfirmed_flush_lsnadvances past it — safe against reader downtime, dangerous if the reader is offline indefinitely (WAL fills the disk). -
MySQL GTID / binlog file+position. Global Transaction ID (GTID) or binlog file + position tracks the reader's position. Binlog retention is time-based (
binlog_expire_logs_seconds). - MongoDB resume token. Opaque binary token returned by change streams; feed it back on reconnect to resume.
The snapshot bootstrap — how the reader learns about pre-existing rows.
- The problem. The WAL only contains changes from the moment CDC starts. What about the 100 million rows already in the table when CDC was enabled?
-
The solution. Debezium (and DMS) run a consistent snapshot on startup:
SET TRANSACTION SNAPSHOT(Postgres) orFLUSH TABLES WITH READ LOCKbriefly (MySQL) to record a point-in-time, SELECT every row from every configured table, emit a synthetic INSERT event per row, record the WAL position at snapshot start, then switch to tailing the WAL from that position. -
The trade-off. Snapshots are I/O-heavy for large tables. Options:
initial(snapshot every table),schema_only(skip snapshot; only new changes captured — useful if you don't need history),initial_only(snapshot then stop — useful for one-time backfills),never(no snapshot; assume you've bootstrapped some other way).
Common interview probes on log-based CDC.
- "How does log-based CDC handle deletes?" — required answer: tombstone events with pre-image.
- "What's the source load?" — required answer: ~0 on source tables (WAL is written anyway) + 2-5% CPU for the reader.
- "How do you bootstrap the initial data?" — snapshot mode: consistent snapshot then tail from snapshot LSN.
- "What's a replication slot? What are its risks?" — durable position tracker; WAL retention on offline reader can fill disk.
- "How do you replay the last hour of changes?" — reset the slot to an earlier LSN; re-emit from there. On managed DBs, this is often not possible (slots are managed).
Worked example — Postgres logical replication to a downstream JSON stream
Detailed explanation. The canonical log-based Postgres CDC setup: enable wal_level = logical, create a REPLICATION role, create a PUBLICATION for the tables you care about, and a Debezium connector consumes the WAL via pgoutput and publishes JSON to Kafka. Walk through every piece.
-
Enable.
wal_level = logical, restart Postgres. -
Grant.
CREATE ROLE cdc_reader WITH LOGIN REPLICATION PASSWORD ...;GRANT SELECT ON .... -
Publication.
CREATE PUBLICATION orders_pub FOR TABLE public.orders. - Slot. Debezium creates the replication slot on startup.
-
Connector. Debezium Postgres connector with
plugin.name=pgoutput,snapshot.mode=initial.
Question. Provide the Postgres-side setup and the Debezium connector config, then show the change-event shape that lands in Kafka.
Input.
| Component | Value |
|---|---|
| Postgres version | 16 |
| Plugin | pgoutput (built-in) |
| Publication | orders_pub for public.orders |
| Slot | debezium_orders |
| Snapshot mode | initial |
Code.
-- 1. postgresql.conf — requires restart
-- wal_level = logical
-- max_replication_slots = 10
-- max_wal_senders = 10
-- 2. Postgres-side setup (one-time)
CREATE ROLE cdc_reader WITH LOGIN REPLICATION PASSWORD 'strong-secret';
GRANT SELECT ON public.orders TO cdc_reader;
GRANT USAGE ON SCHEMA public TO cdc_reader;
CREATE PUBLICATION orders_pub FOR TABLE public.orders;
-- Verify
SELECT * FROM pg_publication;
SELECT * FROM pg_publication_tables WHERE pubname = 'orders_pub';
-- 3. pg_hba.conf — allow the replication user to open a replication connection
-- host replication cdc_reader 10.0.0.0/8 scram-sha-256
# 4. Debezium Postgres connector (Kafka Connect JSON config)
name: orders-cdc
config:
connector.class: io.debezium.connector.postgresql.PostgresConnector
# Source
plugin.name: pgoutput
database.hostname: db-primary.internal
database.port: 5432
database.user: cdc_reader
database.password: ${env:CDC_PASSWORD}
database.dbname: production
# CDC control
slot.name: debezium_orders
publication.name: orders_pub
snapshot.mode: initial
snapshot.isolation.mode: repeatable_read
table.include.list: public.orders
heartbeat.interval.ms: 10000
# Kafka
topic.prefix: prod
key.converter: org.apache.kafka.connect.json.JsonConverter
value.converter: org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable: false
# Delete handling — emit tombstones (Kafka log-compaction friendly)
tombstones.on.delete: true
// 5. Sample Debezium change event for an UPDATE
{
"before": {"id": 42, "customer_id": 7, "total_cents": 1500, "status": "pending", "updated_at": 1720051200000000},
"after": {"id": 42, "customer_id": 7, "total_cents": 1500, "status": "shipped", "updated_at": 1720051260000000},
"source": {
"version": "2.5.0.Final",
"connector": "postgresql",
"name": "prod",
"ts_ms": 1720051260123,
"snapshot": "false",
"db": "production",
"schema": "public",
"table": "orders",
"txId": 89123456,
"lsn": 24591040,
"xmin": null
},
"op": "u",
"ts_ms": 1720051260200,
"transaction": null
}
Step-by-step explanation.
- The
postgresql.confchange (wal_level = logical) is the one setting the DBA must grant.max_replication_slotscaps the number of concurrent readers;max_wal_senderscaps the number of concurrent replication connections. Both need modest room above the number of connectors you plan to run. - The
PUBLICATIONobject scopes which tables the reader can see.FOR TABLE public.orderspublishes only that table;FOR ALL TABLESpublishes everything (dangerous — schema DDL now enters the WAL). Prefer explicit lists. - The Debezium connector opens a replication connection as
cdc_reader, creates the replication slotdebezium_orderson first start (or reuses it on restart), and requests thepgoutputdecoder. The connector translates each WAL record into a Debezium change event withbefore,after,source, andopfields. -
snapshot.mode = initialtriggers a one-time consistent snapshot on the connector's first start:SET TRANSACTION SNAPSHOTfreezes a point-in-time view; the connector SELECTs every row from every configured table, emits asnapshot=trueINSERT event per row, records the LSN at snapshot start, then switches to tailing the WAL from that LSN. Snapshotting a 2B-row table is measured in hours; plan capacity accordingly. - Every downstream change event lands in a Kafka topic named
prod.public.orderswith the JSON shape shown.op = c/u/d/rfor create/update/delete/read (snapshot rows). Tombstones (a Kafka message with anullvalue) for deletes enable Kafka log compaction — old versions of the same key are garbage-collected.
Output.
| Event kind | op |
before | after | Notes |
|---|---|---|---|---|
| INSERT | c |
null | full row | Create event |
| UPDATE | u |
old row | new row | Both images |
| DELETE | d |
old row | null | Tombstone follows |
| Snapshot row | r |
null | full row | source.snapshot = true |
Rule of thumb. For any Postgres log-based CDC deployment, use plugin.name = pgoutput (built-in, no plugin install), snapshot.mode = initial (safe default), and tombstones.on.delete = true (enables Kafka compaction). Set heartbeat.interval.ms = 10000 so quiet tables don't cause the slot to hold WAL indefinitely.
Worked example — replication slot lag monitoring and disk-full defense
Detailed explanation. The single biggest operational hazard of log-based Postgres CDC is a stalled reader combined with a busy source. Postgres will retain WAL indefinitely until the slot's confirmed_flush_lsn advances — if the reader is offline for a week, the WAL directory fills up and the primary crashes. Every senior architect ships slot-lag monitoring on day one.
-
The mechanism.
pg_replication_slots.confirmed_flush_lsnis the highest LSN the reader has acknowledged. WAL between that LSN andpg_current_wal_lsn()is retained. - The alert. Slot lag in bytes > 1 GB, or slot inactive for > 15 min. Both are early-warning signals.
-
The defense.
max_slot_wal_keep_size(Postgres 13+) caps how much WAL a slot can retain — beyond that, the slot is invalidated and the reader must resnapshot. Prevents disk-full crashes at the cost of losing continuity.
Question. Write the monitoring query, the alert rule, and the disk-full defense.
Input.
| Component | Value |
|---|---|
| Alert threshold | slot lag > 1 GB or inactive > 15 min |
| Defense | max_slot_wal_keep_size = 20GB |
| Runbook | check reader health → restart connector → last resort: drop slot + resnapshot |
Code.
-- Monitoring query — run every 60 seconds
SELECT
slot_name,
active,
active_pid,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag_pretty,
pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes,
age(clock_timestamp(), pg_stat_get_replication_slot_stat(slot_name).stats_reset) AS since_last_ack
FROM pg_replication_slots
WHERE slot_type = 'logical';
-- Disk-full defense — cap slot retention (Postgres 13+)
ALTER SYSTEM SET max_slot_wal_keep_size = '20GB';
SELECT pg_reload_conf();
# Prometheus exporter
import psycopg2, time
from prometheus_client import start_http_server, Gauge
g_lag_bytes = Gauge('pg_slot_lag_bytes', 'replication slot lag', ['slot'])
g_active = Gauge('pg_slot_active', '1 if reader connected',['slot'])
def scrape():
conn = psycopg2.connect("host=db-primary port=5432 dbname=production user=monitor")
conn.autocommit = True
cur = conn.cursor()
cur.execute("""
SELECT slot_name,
active::int,
pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots
WHERE slot_type = 'logical'
""")
for slot_name, active, lag_bytes in cur.fetchall():
g_lag_bytes.labels(slot_name).set(lag_bytes)
g_active.labels(slot_name).set(active)
conn.close()
if __name__ == "__main__":
start_http_server(9128)
while True:
scrape()
time.sleep(30)
Step-by-step explanation.
- The monitoring query joins
pg_replication_slotswithpg_current_wal_lsn()and computes the byte-difference — the amount of WAL Postgres is holding on the slot's behalf.pg_wal_lsn_diffhandles the LSN arithmetic; the result is the "lag" that predicts disk-full risk. - The Prometheus exporter turns the per-slot lag into a labelled gauge (
pg_slot_lag_bytes{slot="debezium_orders"}). The on-call alert ispg_slot_lag_bytes > 1073741824 for 5m(1 GB sustained) — sustained lag means the reader is genuinely behind, not a momentary spike. - A companion alert
pg_slot_active{slot=~".+"} == 0 for 15mfires if the reader is disconnected for 15 minutes. This catches the "connector crashed but no alert" case where the slot is idle and lag grows quietly. -
max_slot_wal_keep_size = '20GB'is the disk-full defense: if a slot falls more than 20 GB behind, Postgres invalidates the slot (wal_status = 'lost'). The reader will fail on reconnect and must resnapshot. This trades continuity for cluster survival — a good trade. - The runbook when the alert fires: (a) check if the Debezium connector is running — restart if dead; (b) check Kafka consumer lag — if Kafka is behind, the connector is emitting fine but can't ack; (c) if slot is
wal_status = lost, drop the slot and re-create withsnapshot.mode = initialto rebootstrap.
Output.
| Slot state | active | lag_bytes | Runbook |
|---|---|---|---|
| Healthy | true | < 100 MB | none |
| Warning | true | 100 MB - 1 GB | inspect connector logs |
| Alert | true | > 1 GB | page on-call; find slow consumer |
| Idle | false | any | reader down; restart connector |
| Lost | any | (n/a) | slot invalidated; drop + resnapshot |
Rule of thumb. Ship slot-lag + slot-active monitoring the same day you enable logical replication. Set max_slot_wal_keep_size to a value less than 25% of your WAL disk. A logical replication slot without monitoring is a 3 AM disk-full incident waiting to happen.
Worked example — MySQL binlog CDC with GTID resume
Detailed explanation. MySQL's log-based CDC story is the binlog. With binlog_format=ROW and binlog_row_image=FULL, each row-level change lands in the binlog with both before and after images. Debezium's MySQL connector reads via the standard replication protocol using GTIDs (globally-unique transaction IDs) for resumability. Walk through the setup.
-
Server.
log_bin=ON,binlog_format=ROW,binlog_row_image=FULL,gtid_mode=ON,enforce_gtid_consistency=ON. - Connector. Debezium MySQL connector with GTID-based resume.
- Resume. On restart, the connector reads the last processed GTID from its own history topic in Kafka.
Question. Configure MySQL for log-based CDC and set up the Debezium connector.
Input.
| Component | Value |
|---|---|
| MySQL version | 8.0 |
| binlog format | ROW |
| GTID | ON |
| Connector | Debezium MySQL |
Code.
-- MySQL server config (my.cnf)
-- [mysqld]
-- server-id = 1
-- log_bin = /var/log/mysql/mysql-bin.log
-- binlog_format = ROW
-- binlog_row_image = FULL
-- binlog_expire_logs_seconds = 604800 -- 7 days
-- gtid_mode = ON
-- enforce_gtid_consistency = ON
-- Replication user
CREATE USER 'cdc_reader'@'%' IDENTIFIED BY 'strong-secret';
GRANT REPLICATION SLAVE, REPLICATION CLIENT, SELECT, RELOAD ON *.* TO 'cdc_reader'@'%';
# Debezium MySQL connector
name: orders-cdc-mysql
config:
connector.class: io.debezium.connector.mysql.MySqlConnector
database.hostname: mysql-primary.internal
database.port: 3306
database.user: cdc_reader
database.password: ${env:CDC_PASSWORD}
database.server.id: 184054 # unique replica ID
topic.prefix: prod
# Include the orders table
database.include.list: production
table.include.list: production.orders
# Snapshot policy
snapshot.mode: initial
# GTID-based resume — the connector persists processed GTIDs
# in the schema history topic and resumes from there
schema.history.internal.kafka.topic: schema-history.orders-cdc
schema.history.internal.kafka.bootstrap.servers: kafka:9092
# Kafka
key.converter: org.apache.kafka.connect.json.JsonConverter
value.converter: org.apache.kafka.connect.json.JsonConverter
Step-by-step explanation.
-
binlog_format=ROWwrites one binlog event per changed row (vsSTATEMENTwhich writes the DML text). Row format is mandatory for CDC — statement format is ambiguous under non-deterministic functions. -
binlog_row_image=FULLwrites both before and after images. The defaultMINIMALwrites only changed columns, which downstream consumers must reconstruct — usually not worth the disk savings for a CDC-driven binlog. -
gtid_mode=ONgives every transaction a globally-unique ID (server_uuid:tx_num). GTIDs are the resumability primitive: on connector restart, the connector says "resume after GTID X:Y" and MySQL streams from there. Without GTIDs, resume is via binlog file + position, which is fragile across binlog rotations. - The Debezium connector opens a replication connection as
cdc_reader, requests the binlog starting at its last-processed GTID, and translates each ROW event into a Debezium change event. The Kafka message shape is the same as the Postgres connector — samebefore,after,source,opfields. - The
schema-history.orders-cdcKafka topic stores the DDL history (CREATE TABLE, ALTER TABLE) that the connector has seen. On restart, the connector replays this topic to rebuild its in-memory schema, then requests the binlog from the last processed GTID. This is why the connector can decode events even after ALTER TABLE.
Output.
| Aspect | Postgres | MySQL |
|---|---|---|
| WAL / log | WAL | binlog |
| Position tracker | Replication slot (LSN) | GTID |
| Retention control | replication slot + max_slot_wal_keep_size | binlog_expire_logs_seconds |
| Snapshot mechanism | SET TRANSACTION SNAPSHOT | FLUSH TABLES WITH READ LOCK (briefly) |
| Debezium plugin | pgoutput | native binlog protocol |
Rule of thumb. For MySQL log-based CDC, always run binlog_format=ROW + binlog_row_image=FULL + gtid_mode=ON. Anything else creates operational pain. The Debezium MySQL connector then works with the same downstream Kafka contract as the Postgres connector.
Senior interview question on log-based CDC
A senior interviewer might ask: "Design log-based CDC from Postgres 16 to Kafka for a 100-table warehouse feed. Cover the Postgres config, the connector snapshot strategy, the slot-lag monitoring, the schema-evolution story (someone adds a column tomorrow), and the failure semantics when Kafka is unreachable for 30 minutes."
Solution Using Debezium + pgoutput + slot monitoring + schema-registry-backed evolution
-- 1. Postgres server config (postgresql.conf) — one restart
-- wal_level = logical
-- max_replication_slots = 20
-- max_wal_senders = 20
-- max_slot_wal_keep_size = 50GB -- disk-full defense
-- 2. Replication role + publication for all warehouse tables
CREATE ROLE cdc_reader WITH LOGIN REPLICATION PASSWORD 'strong-secret';
GRANT USAGE ON SCHEMA public TO cdc_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO cdc_reader;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO cdc_reader;
CREATE PUBLICATION warehouse_pub FOR ALL TABLES;
# 3. Debezium connector — full-warehouse config
name: warehouse-cdc
config:
connector.class: io.debezium.connector.postgresql.PostgresConnector
# Source
plugin.name: pgoutput
database.hostname: db-primary.internal
database.port: 5432
database.user: cdc_reader
database.password: ${env:CDC_PASSWORD}
database.dbname: production
# CDC
slot.name: debezium_warehouse
publication.name: warehouse_pub
snapshot.mode: initial
snapshot.isolation.mode: repeatable_read
snapshot.max.threads: 4 # parallelise the initial snapshot
heartbeat.interval.ms: 10000
schema.include.list: public
# Schema-registry-backed evolution
topic.prefix: prod
key.converter: io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url: http://schema-registry:8081
value.converter: io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url: http://schema-registry:8081
# Delete handling + tombstones
tombstones.on.delete: true
# Kafka resilience
producer.override.acks: all
producer.override.retries: 2147483647
producer.override.delivery.timeout.ms: 2147483647
producer.override.max.in.flight.requests.per.connection: 5
producer.override.enable.idempotence: true
-- 4. Slot monitoring (Prometheus alerts)
-- pg_slot_lag_bytes{slot="debezium_warehouse"} > 5368709120 for 10m # 5 GB
-- pg_slot_active{slot="debezium_warehouse"} == 0 for 15m
Step-by-step trace.
| Concern | Answer | Reasoning |
|---|---|---|
| Postgres config | wal_level=logical + max_slot_wal_keep_size=50GB | disk-full defense |
| Publication scope | FOR ALL TABLES | 100-table warehouse; no per-table maintenance |
| Snapshot | initial + max.threads=4 | parallel snapshot for large tables |
| Schema evolution | Avro + schema registry | backward-compatible column adds |
| Kafka outage | idempotent producer + infinite retries | slot holds WAL until Kafka returns |
| Slot lag alert | > 5 GB for 10 min | early warning below max_slot_wal_keep_size |
After deployment, the connector snapshots ~500 GB of history in ~4 hours (parallel snapshot threads), then tails at sub-second latency. A schema change (ALTER TABLE orders ADD COLUMN priority INT) auto-propagates: Debezium detects the DDL, registers a new schema version in the registry, and downstream consumers keep decoding. A 30-minute Kafka outage causes the slot to accumulate up to 50 GB of WAL; the connector reconnects on Kafka recovery and drains without data loss.
Output:
| Metric | Value |
|---|---|
| End-to-end latency (steady state) | 100-200 ms |
| Source CPU overhead | ~3% of the DB host |
| Snapshot time (500 GB) | ~4 h with 4 threads |
| Slot lag budget | 50 GB (max_slot_wal_keep_size) |
| Schema evolution | automatic (schema registry) |
| Kafka outage tolerance | up to 50 GB WAL retention |
Why this works — concept by concept:
- pgoutput + PUBLICATION FOR ALL TABLES — built-in Postgres decoder, no plugin install; ALL TABLES scope avoids per-table maintenance in a 100-table warehouse.
- snapshot.mode=initial + max.threads=4 — the initial snapshot is the biggest one-time cost. Parallel threads scan multiple tables concurrently, cutting hours off the bootstrap window.
- Avro + schema registry — schema-registry-backed Avro means a column add is a backward-compatible schema evolution, not a downstream-consumer break. JSON without a schema loses type information; Avro with a registry keeps everything typed.
-
max_slot_wal_keep_size=50GB — the disk-full defense. If Kafka is unreachable for hours, the slot can grow up to 50 GB; beyond that, the slot is invalidated (
wal_status=lost) and the primary survives at the cost of losing continuity. Trade correctness for cluster survival. - Cost — ~3% source CPU, ~50 GB WAL retention budget, 4-hour one-time snapshot, ongoing schema-registry integration. Compared to trigger CDC's 2× write amplification and per-table trigger maintenance, log-based is dramatically cheaper at scale. Compared to timestamp CDC, it captures deletes natively and eliminates the nightly reconcile.
Streaming
Topic — streaming
Streaming CDC and Debezium problems
5. Outbox pattern + hybrid
The outbox pattern solves the dual-write problem — write the business row and the event to the same table in one transaction, then let a CDC reader publish the events
The mental model in one line: the outbox pattern is a schema convention where the application writes the business row and a row into an outbox table inside the same database transaction, and a downstream reader (usually log-based CDC) tails the outbox table and publishes each row as a domain event — this collapses the dual-write problem (business row committed but event lost, or vice versa) into a single-transaction commit that either succeeds fully or fails fully. Every senior microservice architecture needs this pattern once it has more than one downstream consumer that expects typed events.
The dual-write problem — what the outbox pattern solves.
-
The naive design. Application writes to
orders(Postgres), then publishes anOrderPlacedevent to Kafka. Two separate systems, two separate writes. - The failure modes. (a) Postgres commits, Kafka publish fails → business row saved, event lost, downstream never learns. (b) Kafka publish succeeds, Postgres commits fails → downstream told about an order that doesn't exist. Both are correctness bugs.
- The wrong fix. Two-phase commit across Postgres and Kafka — theoretically correct, operationally terrible (long-held locks, blocking on Kafka availability).
-
The outbox fix. Write the event to a Postgres
outboxtable in the same transaction as the business row. Both commit atomically. A downstream CDC reader (log-based, typically Debezium) tails the outbox table and publishes to Kafka. If Kafka is down, the reader retries; the source is still consistent.
The outbox table shape.
-
Primary key.
id UUIDorBIGSERIAL— monotonic for CDC ordering. -
Aggregate key.
aggregate_type TEXT(e.g.Order,Customer),aggregate_id TEXT— identifies which entity this event is about. Used for Kafka topic routing and partition-key derivation. -
Event type.
event_type TEXT(e.g.OrderPlaced,OrderShipped) — the semantic type of the event. -
Payload.
payload JSONB— the event body. -
Timestamps + metadata.
created_at TIMESTAMPTZ DEFAULT clock_timestamp().
The Debezium EventRouter SMT — the outbox-specific plumbing.
-
What it does. A Single Message Transform (SMT) that rewrites Debezium change events from the generic
prod.public.outboxshape intoprod.Order.OrderPlaced(topic per aggregate type + event type). Extracts the payload from the JSONB column so downstream consumers see typed events, not Debezium envelopes. -
The routing. By default, one Kafka topic per
aggregate_type. Configurable to route byevent_typeor a custom expression. -
The partition key. The
aggregate_idbecomes the Kafka message key, so all events for the same aggregate land on the same partition (ordering guarantee per aggregate). - The delete. After the CDC reader has forwarded the outbox row, the application can safely delete it (or a scheduled job TTLs old rows). The events live in Kafka; the outbox is just the buffer.
The hybrid pattern — log-based + outbox side by side.
-
The idea. Log-based CDC captures raw source-table mutations (INSERT/UPDATE/DELETE on
orders) for the warehouse. Outbox captures domain events (OrderPlaced,OrderShipped) for the event bus. Both are shipped by the same Debezium instance; they serve different consumers. - Why both. Raw mutations feed dashboards and audits ("show me every state change on this order"). Domain events feed microservices ("react when an order is placed"). Sharing one CDC tool but two data channels is the cheapest way to serve both.
Common interview probes on the outbox pattern.
- "What problem does the outbox pattern solve?" — required answer: the dual-write problem between the DB and an event bus.
- "Why not two-phase commit?" — long locks, blocks on Kafka availability, operational complexity.
- "How does the outbox row get to Kafka?" — CDC reader (typically Debezium) tails the outbox table.
- "How do you route different event types to different Kafka topics?" — Debezium EventRouter SMT keyed on
aggregate_typeorevent_type. - "How do you clean up processed outbox rows?" — TTL job or delete-after-forward (only safe with idempotent consumers).
Worked example — outbox table + application code writing atomically
Detailed explanation. The canonical outbox setup on the application side: define the outbox table, wrap the business write and the outbox write in a single transaction, and let Debezium do the rest. Walk through the schema and the application code.
-
Schema.
outbox(id, aggregate_type, aggregate_id, event_type, payload, created_at). - Application code. One transaction, two writes: business row + outbox row.
- Rollback semantics. If either write fails, the whole transaction rolls back.
Question. Write the outbox schema and a Python function that places an order and emits an OrderPlaced event atomically.
Input.
| Component | Value |
|---|---|
| Business table | public.orders |
| Outbox table | public.outbox |
| Event type | OrderPlaced |
| Aggregate | Order / order_id |
Code.
-- Outbox table — one shape serves all aggregate types
CREATE TABLE public.outbox (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
aggregate_type TEXT NOT NULL, -- 'Order', 'Customer', ...
aggregate_id TEXT NOT NULL, -- string form of the PK
event_type TEXT NOT NULL, -- 'OrderPlaced', 'OrderShipped', ...
payload JSONB NOT NULL, -- event body
created_at TIMESTAMPTZ NOT NULL DEFAULT clock_timestamp()
);
CREATE INDEX idx_outbox_created_at ON public.outbox (created_at);
CREATE INDEX idx_outbox_aggregate ON public.outbox (aggregate_type, aggregate_id);
# Application code — one transaction, two writes
import json
import psycopg2
from typing import Any
def place_order(conn, customer_id: int, total_cents: int, items: list[dict[str, Any]]) -> int:
"""Insert an order and an OrderPlaced event atomically."""
with conn: # implicit BEGIN / COMMIT (rollback on exception)
with conn.cursor() as cur:
# 1. Business write — insert the order
cur.execute("""
INSERT INTO public.orders(customer_id, total_cents, status)
VALUES (%s, %s, 'pending')
RETURNING id
""", (customer_id, total_cents))
order_id = cur.fetchone()[0]
# 2. Outbox write — same transaction, atomic commit
payload = {
"order_id": order_id,
"customer_id": customer_id,
"total_cents": total_cents,
"items": items,
"placed_at": "now",
}
cur.execute("""
INSERT INTO public.outbox(aggregate_type, aggregate_id, event_type, payload)
VALUES ('Order', %s, 'OrderPlaced', %s::jsonb)
""", (str(order_id), json.dumps(payload)))
# If we reach here, both writes committed. If either failed,
# the with-block rolled back and re-raised.
return order_id
Step-by-step explanation.
- The
outboxschema is one-table-fits-all:aggregate_type+aggregate_ididentify the entity;event_typenames the semantic event;payload JSONBcarries the body. No per-event-type schema migration needed. - The Python
with conn:idiom (psycopg2's context manager) implicitly wraps the block inBEGIN ... COMMIT; any exception triggers ROLLBACK. This is the atomicity guarantee: business row and outbox row commit together or not at all. - Step 1 inserts the business row and returns the generated
order_idviaRETURNING id. Step 2 uses thatorder_idas theaggregate_idfor the outbox row and embeds it in the JSON payload as well. - The application never talks to Kafka. That responsibility is delegated to Debezium tailing the
outboxtable. If Kafka is down, the application still commits normally; Debezium retries; the pipeline eventually catches up. - Under Postgres MVCC, the two INSERTs are visible atomically to CDC — Debezium's snapshot isolation reads either the pre-commit state (nothing) or the post-commit state (both rows). There's no window where the business row exists but the outbox row doesn't.
Output.
| Step | State after |
|---|---|
| BEGIN | (transaction open) |
| INSERT orders | order_id = 42 (uncommitted) |
| INSERT outbox | (uncommitted, references order_id 42) |
| COMMIT | Both rows durably committed; Debezium sees them in WAL |
| Debezium tail | Outbox row → Kafka topic prod.Order.OrderPlaced
|
| (Debezium never sees an orders row without matching outbox) | Atomicity preserved |
Rule of thumb. For any dual-write scenario (DB + event bus, DB + cache, DB + search), replace the second write with an outbox row in the same transaction, and let a CDC reader publish. This is the single most-cited microservice pattern in senior interviews.
Worked example — Debezium EventRouter SMT + topic routing
Detailed explanation. With the outbox table in place, the CDC reader still needs to (a) recognise it as outbox-shaped, (b) route to the correct Kafka topic based on aggregate_type, and (c) unwrap the payload so downstream consumers see the event body, not the Debezium envelope. Debezium's built-in EventRouter SMT does all three. Configure the connector.
-
Recognition. SMT matches records from
prod.public.outbox. -
Routing. One topic per
aggregate_type(e.g.prod.Order,prod.Customer). -
Unwrap. SMT extracts
payloadas the message value; setsaggregate_idas the message key.
Question. Configure the Debezium Postgres connector with the EventRouter SMT for the outbox table.
Input.
| Component | Value |
|---|---|
| Source table | public.outbox |
| Route by | aggregate_type |
| Topic naming | prod.<aggregate_type> |
| Message key | aggregate_id |
| Message value | payload JSONB |
Code.
# Debezium connector — outbox mode
name: outbox-cdc
config:
connector.class: io.debezium.connector.postgresql.PostgresConnector
plugin.name: pgoutput
database.hostname: db-primary.internal
database.port: 5432
database.user: cdc_reader
database.password: ${env:CDC_PASSWORD}
database.dbname: production
slot.name: debezium_outbox
publication.name: outbox_pub # created as: CREATE PUBLICATION outbox_pub FOR TABLE public.outbox;
table.include.list: public.outbox
snapshot.mode: never # outbox is empty at bootstrap; no need to snapshot
heartbeat.interval.ms: 10000
topic.prefix: prod
# EventRouter SMT — the outbox-specific plumbing
transforms: outbox
transforms.outbox.type: io.debezium.transforms.outbox.EventRouter
transforms.outbox.route.by.field: aggregate_type
transforms.outbox.route.topic.replacement: prod.${routedByValue}
transforms.outbox.table.field.event.key: aggregate_id
transforms.outbox.table.field.event.payload: payload
transforms.outbox.table.expand.json.payload: true
key.converter: org.apache.kafka.connect.storage.StringConverter
value.converter: org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable: false
# Kafka topic layout after the SMT runs
prod.Order — messages keyed by order_id, value = full OrderPlaced/Shipped payload
prod.Customer — messages keyed by customer_id, value = full CustomerCreated/Updated payload
prod.Shipment — messages keyed by shipment_id, value = full ShipmentDispatched payload
Message on prod.Order:
key = "42"
value = {"order_id": 42, "customer_id": 7, "total_cents": 1500, ...}
Step-by-step explanation.
- The Debezium connector tails the WAL exactly like a normal CDC connector — no special Postgres-side setup for outbox. The
PUBLICATION outbox_pub FOR TABLE public.outboxscopes it;snapshot.mode = neveris safe because the outbox is empty (or being TTL'd) — bootstrap is unnecessary. - The
EventRouterSMT intercepts every Debezium change event frompublic.outbox. It reads theaggregate_typecolumn to decide the target topic (prod.Order,prod.Customer, etc.), readsaggregate_idfor the message key, and readspayloadfor the message value. -
transforms.outbox.route.topic.replacement = prod.${routedByValue}is the templated topic-name expression;${routedByValue}is substituted with the value fromroute.by.field(aggregate_type). This gives one topic per aggregate type without manual per-type config. -
expand.json.payload = truetells the SMT to unwrap the JSONB payload column: instead of sending{"payload": {...}}, it sends the inner object directly. Downstream consumers see typed domain events, not Debezium envelopes. - The result: from the outside, the outbox is invisible. Applications see typed events on typed topics. Debezium + the SMT is the entire plumbing; the outbox table is an implementation detail.
Output.
| Outbox row | Kafka topic | Kafka key | Kafka value |
|---|---|---|---|
| aggregate=Order, id=42, type=OrderPlaced | prod.Order | "42" | {"order_id": 42, ...} |
| aggregate=Order, id=42, type=OrderShipped | prod.Order | "42" | {"order_id": 42, "shipped_at": ...} |
| aggregate=Customer, id=7, type=CustomerCreated | prod.Customer | "7" | {"customer_id": 7, ...} |
Rule of thumb. For any Debezium + outbox deployment, use the built-in EventRouter SMT — do not roll your own topic routing. route.by.field on aggregate_type and event.key on aggregate_id are the defaults for a reason: they align with Kafka's per-key ordering guarantees.
Worked example — outbox TTL and idempotent-consumer contract
Detailed explanation. Two operational tails on every outbox deployment: (a) the outbox table grows indefinitely without cleanup, and (b) downstream consumers must be idempotent because Kafka delivers at-least-once. Walk through the TTL job and the consumer-side idempotency contract.
- TTL. Nightly job deletes outbox rows older than N days; N = "how far back does Kafka retain the same event?" plus a safety margin.
-
Idempotency. Consumers store
(aggregate_id, event_id)in a processed table; skip if seen.
Question. Write the TTL job and the consumer-side idempotency check.
Input.
| Component | Value |
|---|---|
| TTL age | 7 days |
| Kafka retention | 7 days |
| Consumer store | Redis or Postgres |
| Dedupe key | event id (UUID) |
Code.
-- Nightly TTL job — delete outbox rows older than 7 days
-- (Safe because Debezium has forwarded them; Kafka has the durable copy)
DELETE FROM public.outbox
WHERE created_at < now() - INTERVAL '7 days';
-- Better: partition by day and DROP old partitions
-- (O(1) delete vs O(rows))
# Consumer-side idempotency — Postgres-backed dedupe
import psycopg2
from psycopg2 import errors as pg_errors
def process_order_event(conn, event: dict) -> None:
"""Idempotent handler for OrderPlaced / OrderShipped events."""
event_id = event["id"] # UUID from outbox.id, carried in payload
order_id = event["order_id"]
event_type = event["event_type"]
try:
with conn:
with conn.cursor() as cur:
# 1. Try to record the event as processed
cur.execute("""
INSERT INTO consumer_processed_events(event_id, consumer_name)
VALUES (%s, 'order-analytics-svc')
""", (event_id,))
# 2. Do the actual work (INSERT into analytics table, etc.)
cur.execute("""
INSERT INTO analytics_orders(order_id, event_type, seen_at)
VALUES (%s, %s, now())
""", (order_id, event_type))
except pg_errors.UniqueViolation:
# This event was already processed — safe to skip.
conn.rollback()
return
-- Consumer-side processed-events table
CREATE TABLE consumer_processed_events (
event_id UUID NOT NULL,
consumer_name TEXT NOT NULL,
processed_at TIMESTAMPTZ NOT NULL DEFAULT clock_timestamp(),
PRIMARY KEY (event_id, consumer_name)
);
-- TTL for the dedupe table — keep 30 days
DELETE FROM consumer_processed_events
WHERE processed_at < now() - INTERVAL '30 days';
Step-by-step explanation.
- The outbox TTL is safe because the source of truth after CDC has forwarded is Kafka, not the outbox. Once Debezium's slot has confirmed the row's LSN, the row can be deleted. A 7-day age gives a comfortable buffer against slot lag or backfill needs.
- The partitioned-table alternative (
DROP TABLE outbox_20260627) is O(1) rather than O(rows) and is the pattern of choice at high volume.DELETE ... WHERE created_at < now() - INTERVAL '7 days'is fine for lower volume. - The consumer side must assume at-least-once delivery: Kafka can redeliver on partition rebalance, consumer restart, or explicit reset. Idempotency is not optional.
- The dedupe pattern uses a
consumer_processed_eventstable with(event_id, consumer_name)as the composite PK. Trying to insert a duplicate raisesUniqueViolation; catching it lets the consumer skip cleanly. Doing this inside a transaction with the actual work guarantees the "processed" record is committed atomically with the side effect. - The
consumer_namecolumn allows multiple consumer services to share the dedupe table without interfering. Consumer A processing event X does not stop Consumer B from processing event X — each has its own row.
Output.
| Concern | Mechanism | Result |
|---|---|---|
| Outbox unbounded growth | Nightly TTL / partition drop | bounded at 7 days |
| Duplicate delivery | Consumer-side dedupe table | at-most-once effect |
| Consumer replay | Reset Kafka offset | dedupe skips already-processed |
| Multiple consumers | Per-consumer row | independent progress |
Rule of thumb. Every outbox deployment needs a TTL job on the producer side and an idempotency contract on the consumer side. Skipping either creates a slow-motion failure — the outbox table grows to TB scale, or a partition rebalance double-counts events.
Senior interview question on outbox pattern
A senior interviewer might ask: "You have a microservice that writes to Postgres and needs to publish typed events to Kafka. The naive design uses two-phase commit; the previous team tried it and it deadlocked. Walk me through the outbox pattern — the schema, the atomic write, the Debezium EventRouter, the TTL, and the consumer-side idempotency contract. Then extend to a hybrid where log-based CDC also feeds the warehouse."
Solution Using an outbox table + Debezium EventRouter + hybrid log-based warehouse feed
-- 1. Outbox schema — the atomic-write buffer
CREATE TABLE public.outbox (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
aggregate_type TEXT NOT NULL,
aggregate_id TEXT NOT NULL,
event_type TEXT NOT NULL,
payload JSONB NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT clock_timestamp()
) PARTITION BY RANGE (created_at);
CREATE TABLE public.outbox_20260704 PARTITION OF public.outbox
FOR VALUES FROM ('2026-07-04') TO ('2026-07-05');
CREATE INDEX idx_outbox_agg ON public.outbox (aggregate_type, aggregate_id);
-- 2. Publications — outbox + business tables (hybrid)
CREATE PUBLICATION outbox_pub FOR TABLE public.outbox;
CREATE PUBLICATION warehouse_pub FOR TABLE public.orders, public.customers, public.shipments;
# 3. Two Debezium connectors sharing the same Postgres — different scope, different transforms
name: outbox-cdc
config:
connector.class: io.debezium.connector.postgresql.PostgresConnector
plugin.name: pgoutput
slot.name: debezium_outbox
publication.name: outbox_pub
table.include.list: public.outbox
snapshot.mode: never
topic.prefix: prod
transforms: outbox
transforms.outbox.type: io.debezium.transforms.outbox.EventRouter
transforms.outbox.route.by.field: aggregate_type
transforms.outbox.route.topic.replacement: prod.${routedByValue}
transforms.outbox.table.field.event.key: aggregate_id
transforms.outbox.table.field.event.payload: payload
transforms.outbox.table.expand.json.payload: true
---
name: warehouse-cdc
config:
connector.class: io.debezium.connector.postgresql.PostgresConnector
plugin.name: pgoutput
slot.name: debezium_warehouse
publication.name: warehouse_pub
table.include.list: public.orders,public.customers,public.shipments
snapshot.mode: initial
topic.prefix: warehouse
# No EventRouter — raw Debezium change events go straight to Kafka
tombstones.on.delete: true
# 4. Application — single-transaction dual write
def place_order(conn, customer_id: int, total_cents: int) -> int:
with conn:
with conn.cursor() as cur:
cur.execute("""
INSERT INTO public.orders(customer_id, total_cents, status)
VALUES (%s, %s, 'pending')
RETURNING id
""", (customer_id, total_cents))
order_id = cur.fetchone()[0]
cur.execute("""
INSERT INTO public.outbox(aggregate_type, aggregate_id, event_type, payload)
VALUES ('Order', %s, 'OrderPlaced',
jsonb_build_object('order_id', %s, 'customer_id', %s, 'total_cents', %s))
""", (str(order_id), order_id, customer_id, total_cents))
return order_id
Step-by-step trace.
| Layer | Component | Purpose |
|---|---|---|
| Application | one transaction; two INSERTs | atomic business + event write |
| Postgres | outbox + orders/customers/shipments | one source of truth |
| Debezium outbox-cdc | tail outbox_pub; EventRouter | typed events to prod.<Aggregate> topics |
| Debezium warehouse-cdc | tail warehouse_pub; snapshot=initial | raw mutations to warehouse.public.<table>
|
| Consumers | order-analytics-svc + Snowflake | independent progress; idempotency contract |
| Retention | daily partition drop on outbox | bounded storage |
After deployment, the application makes one transaction, and two Kafka streams appear downstream: typed domain events on prod.Order / prod.Customer / prod.Shipment for microservice consumers, and raw mutation events on warehouse.public.orders / etc. for the warehouse. The application never touches Kafka; the DBA sees two replication slots on one Postgres.
Output:
| Downstream | Topic | Message shape | Consumer |
|---|---|---|---|
| Order service | prod.Order |
OrderPlaced, OrderShipped payloads | order-analytics-svc |
| Customer service | prod.Customer |
CustomerCreated, CustomerUpdated | crm-svc |
| Warehouse | warehouse.public.orders |
Debezium change events | Snowflake sink |
| Warehouse | warehouse.public.customers |
Debezium change events | Snowflake sink |
Why this works — concept by concept:
- Single-transaction outbox write — the atomic guarantee. Business row and event row commit together via Postgres MVCC; no two-phase commit needed. Solves the dual-write problem at its root.
-
EventRouter SMT — the outbox-to-typed-topic bridge.
route.by.field = aggregate_typegives one topic per aggregate;expand.json.payloadunwraps the JSONB so consumers see typed events, not Debezium envelopes. - Hybrid = two connectors, two publications, two purposes — outbox connector for domain events; warehouse connector for raw mutations. Different snapshot semantics (never vs initial) match the data lifecycles.
-
Idempotent consumers — Kafka delivers at-least-once; consumers dedupe by
(event_id, consumer_name). This is the contract every outbox consumer signs; skipping it turns duplicate delivery into a correctness bug. - Cost — one Postgres, two replication slots, two Debezium connectors, ~4-5% source CPU total, partitioned outbox with daily drop for retention. Compared to two-phase commit (deadlocks, blocking), this is the operationally sane way to serve both microservice and warehouse consumers from one source. O(1) per commit on the app side; the CDC readers handle the fan-out.
Streaming
Topic — streaming
Streaming outbox and event-driven problems
Design
Topic — design
Design problems on transactional messaging
Cheat sheet — CDC pattern recipes
-
Which pattern when. Log-based CDC is the 2026 default when your DBA grants
wal_level=logical(Postgres),binlog_format=ROW(MySQL), or replica-set oplog access (Mongo). Timestamp CDC is the fallback when no replication is grantable but SELECT works — mandatory nightly reconcile for deletes. Trigger CDC when CREATE TRIGGER is allowed but replication isn't — accept the 2× write amplification. Outbox pattern on top of any primary CDC whenever the app must emit typed events atomically with a business write. -
Timestamp watermark checkpoint template. Store watermark durably in the source Postgres itself (
cdc_watermarks(table_name PK, last_updated_at, last_poll_at)), advance tomax(observed updated_at)inside the same transaction that writes the delta, use a 5-minute safety horizon (updated_at <= now() - 5m) to absorb late-committing transactions, and enforceupdated_at = clock_timestamp()in aBEFORE UPDATEtrigger to eliminate app-side clock skew. -
Postgres audit-trigger 6-line template.
CREATE FUNCTION audit_X_trigger() ... IF TG_OP='INSERT' THEN INSERT ...(op, id, after=to_jsonb(NEW)); ELSIF TG_OP='UPDATE' THEN INSERT ...(op, id, before=to_jsonb(OLD), after=to_jsonb(NEW)); ELSIF TG_OP='DELETE' THEN INSERT ...(op, id, before=to_jsonb(OLD)); END IF; RETURN NULL; END;bound toAFTER INSERT OR UPDATE OR DELETE FOR EACH ROW. Shadow table usesBIGSERIAL audit_idas the reader watermark; partition bychanged_atfor O(1) retention drops. -
Debezium Postgres connector JSON.
plugin.name=pgoutput,snapshot.mode=initial,snapshot.isolation.mode=repeatable_read,heartbeat.interval.ms=10000,slot.name=<unique-per-connector>,publication.name=<explicit>(neverFOR ALL TABLESin production),tombstones.on.delete=truefor Kafka log-compaction,producer.override.enable.idempotence=true, andmax_slot_wal_keep_size=<N>GBon the Postgres side as the disk-full defense. -
Outbox table DDL + EventRouter SMT.
outbox(id UUID PK default gen_random_uuid(), aggregate_type, aggregate_id, event_type, payload JSONB, created_at); write inside the same transaction as the business row; connect Debezium withtransforms.outbox.type=io.debezium.transforms.outbox.EventRouter,route.by.field=aggregate_type,route.topic.replacement=prod.${routedByValue},table.field.event.key=aggregate_id,table.field.event.payload=payload,table.expand.json.payload=true. Partition outbox by day; drop old partitions once Debezium has confirmed the LSN. - Pattern decision matrix. Permission: log-based needs replication; trigger needs CREATE TRIGGER; timestamp needs SELECT; outbox needs schema change. Latency: log-based sub-second, trigger seconds, timestamp minutes, outbox = reader-dependent. Delete visibility: log-based/trigger/outbox yes, timestamp no. Source load: log-based ~0 on tables, trigger 2× write amp, timestamp 1 indexed scan per poll, outbox +1 write per business txn. Print this matrix on a sticky note; use it in every interview.
-
Slot-lag monitoring query.
SELECT slot_name, active, pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes FROM pg_replication_slots WHERE slot_type='logical';alert onlag_bytes > 5 GB for 10 min(below themax_slot_wal_keep_sizeceiling) and onactive = false for 15 min. Runbook: check connector logs, check Kafka consumer lag, ifwal_status='lost'drop slot and resnapshot. -
Consumer-side idempotency contract. Every consumer of a Kafka topic fed by CDC (log-based, outbox, or otherwise) stores
(event_id, consumer_name)in aconsumer_processed_eventstable with a composite PK. Try to insert; onUniqueViolation, skip. Wrap the dedupe insert and the actual side-effect in one transaction so "recorded processed" and "did the work" commit together. TTL the dedupe table at ~30 days. -
Bootstrap / snapshot policy. Log-based:
snapshot.mode=initialfor greenfield tables;snapshot.mode=neverfor tables you've bootstrapped some other way (e.g. outbox tables which are empty at start). Timestamp: bootstrap by setting watermark to epoch and running one large-batch first poll. Trigger: application-managed — either add the trigger before the table is populated or accept that pre-trigger rows are invisible. -
Failure semantics reminder. Log-based: Kafka down → slot holds WAL → capped by
max_slot_wal_keep_size. Timestamp: watermark stale → next poll catches up; late rows caught by safety window. Trigger: shadow table grows; reader restart tolerated. Outbox: business rollback = event rollback; Kafka down = outbox grows; capped by TTL. Every pattern has a bounded failure surface; know yours. - Schema evolution rules. Log-based + Avro + schema registry = backward-compatible column adds work automatically. Trigger CDC + JSONB payload = column adds are automatic (JSONB just carries them). Timestamp CDC + explicit column list = column adds require code change. Outbox + JSONB payload = new event types are additive; consumers ignore unknown types.
-
When to use hybrid. Any deployment that needs both raw source-table mutations (warehouse, audit, dashboards) and typed domain events (microservices, event bus). Two Debezium connectors, two publications, two Kafka namespaces (
warehouse.*vsprod.*) — one Postgres. Costs ~5% source CPU total; buys both consumer models. - Migration cost between patterns. Timestamp → log-based: ~2 engineer-weeks (config + snapshot + downstream MERGE rewrites). Trigger → log-based: ~1 engineer-week (drop triggers + shadow table; switch reader source). Adding outbox to existing app: ~1 sprint per aggregate type (application code changes). Choose the primary pattern once; the migration cost is real.
Frequently asked questions
What is CDC in one sentence?
Change data capture is the pattern of continuously observing an operational database's mutations — inserts, updates, and deletes — and forwarding each mutation to downstream systems (warehouse, search index, cache, event bus) so those systems stay in near-real-time sync with the source without ever re-reading the full source table. The four canonical cdc patterns — timestamp (poll updated_at), trigger (audit shadow table), log-based (tail the WAL/binlog/oplog), and outbox (single-transaction event log) — differ in permission requirements, latency, delete handling, and source-side load, and the choice binds every downstream consumer for years. Every senior data-engineering interview probes CDC because it's the load-bearing pipeline pattern for the modern warehouse + microservice stack.
Timestamp CDC vs log-based CDC — when do I pick each?
Default to log-based CDC when your DBA grants wal_level=logical (Postgres), binlog_format=ROW (MySQL), or replica-set oplog access (Mongo). Log-based captures every DML including deletes, adds essentially zero load on the source tables (the WAL is written for durability regardless), and ships sub-second latency via Debezium, AWS DMS, or a native connector. Pick timestamp CDC (poll WHERE updated_at > watermark) as the fallback when logical replication is not grantable — some managed Postgres tiers, ancient MySQL versions, air-gapped Oracle deployments. Timestamp is cheaper to build and operates against any DB with an updated_at column, but it's blind to physical deletes and requires a nightly full-table reconcile job to catch missing rows. Never pick timestamp for a workload that needs sub-minute freshness — the poll interval is a hard floor on latency.
What is the outbox pattern and what problem does it solve?
The outbox pattern solves the dual-write problem — the correctness bug where an application writes a business row to a database and publishes an event to Kafka (or any other bus) as two separate operations. Either the database commit succeeds and Kafka publish fails (event lost) or Kafka publish succeeds and DB commit fails (event describes a non-existent row); both violate consistency. The outbox pattern collapses both writes into a single database transaction: the application writes the business row and a row to an outbox table in the same BEGIN/COMMIT, and a downstream CDC reader (typically Debezium with the EventRouter SMT) tails the outbox table and publishes to Kafka. Both writes commit atomically or neither does; the CDC reader retries on Kafka outage without ever losing an event. This pattern is the load-bearing correctness invariant of most modern microservice + event-bus architectures, and it's the single most-cited answer in senior CDC interviews.
Can I do CDC without logical replication?
Yes — timestamp cdc and trigger cdc are the two answers when logical replication is off-limits. Timestamp CDC needs only SELECT on the source table plus an indexed updated_at column; it polls for updated_at > last_watermark, ships the delta, and advances the watermark. It's blind to physical deletes, so you either adopt a soft-delete convention (UPDATE ... SET deleted_at = now() instead of DELETE) or ship a nightly full-table reconcile that catches missing rows. Trigger CDC needs CREATE TRIGGER on the source tables: an AFTER INSERT OR UPDATE OR DELETE trigger writes one row per DML to a shadow / audit table, and a reader tails that shadow table by monotonic BIGSERIAL audit_id. Trigger CDC captures every DML including deletes but pays for it with 2× write amplification (every source-table DML now writes two rows), which is meaningful on write-heavy tables. Pick timestamp for the lightest-touch, permission-frugal deployments; pick trigger when the delete-handling requirement forces your hand and log-based is genuinely off the table.
Which CDC pattern handles deletes correctly?
Log-based CDC (Debezium, AWS DMS, native pglogical) captures physical DELETEs natively — the WAL/binlog/oplog records every DML including deletes, and the reader emits a "tombstone" event carrying the pre-image of the deleted row. Trigger CDC captures deletes via a DELETE trigger that writes an op='D' row into the shadow table with the OLD row image. Outbox CDC doesn't capture source-table deletes directly — it captures whatever events the application chooses to write to the outbox, so if the app runs DELETE FROM orders without also inserting a Deleted outbox row, downstream never learns. Timestamp CDC is blind to physical deletes — a DELETE FROM orders WHERE id = 42 leaves no updated_at for the next poll to query, so the row silently vanishes downstream. The timestamp workaround is either soft-delete (UPDATE ... SET deleted_at = now()) or a nightly full-table reconcile that flags rows present in the warehouse but missing in the source. Getting the delete axis wrong ships a subtle correctness bug that only surfaces during audit season.
Is trigger-based CDC still used in 2026?
Yes, though it's the smallest of the four patterns by deployment count. Trigger CDC ships in three scenarios in 2026: (a) source databases where the DBA won't grant logical replication or the managed tier doesn't offer it (some SOC-compliance environments, older RDS tiers, air-gapped Oracle), (b) regulatory-audit workloads where the shadow / audit table doubles as the compliance record with changed_by TEXT DEFAULT current_user and txid BIGINT DEFAULT txid_current() provenance, and (c) legacy pipelines that predate Debezium and haven't been migrated. New greenfield deployments almost always start with log-based CDC because the operational story is simpler and the source-side write amplification is zero. That said, senior interviewers still probe trigger CDC because knowing why you'd pick it (permission constraints + delete handling + audit trail) is a proxy for understanding all four axes — permission, latency, delete visibility, source load — well enough to defend any pattern choice.
Practice on PipeCode
- Drill the SQL practice library → for the CDC, watermark, incremental-load, and shadow-table problems senior interviewers love.
- Rehearse on the ETL practice library → for the batch CDC pipelines, timestamp-CDC watermark advance, and snapshot-bootstrap patterns.
- Sharpen the streaming axis with the streaming practice library → for Debezium, log-based CDC, and outbox event-routing scenarios.
- Stack the prerequisites against PipeCode's broader 450+ data-engineering catalogue to anchor the four-pattern decision matrix against real graded inputs.
Lock in CDC pattern muscle memory
Docs explain patterns. PipeCode drills explain the decision — when timestamp CDC is blind to deletes, when trigger CDC's write amplification bites, when log-based CDC's slot lag becomes a disk-full incident, when the outbox pattern earns its place. Pipecode.ai is Leetcode for Data Engineering — pattern-first practice tuned for the production trade-offs senior data engineers actually face.





Top comments (0)