Gowtham Potureddi

Posted on May 27

Change Data Capture (CDC) for Data Engineering Interviews: Debezium, Log-Based vs Trigger-Based, Kafka Connect

#sql #python #interview #dataengineering

change data capture interview questions appear in every senior data engineering loop because CDC is the substrate that connects every transactional database to every analytical destination in modern data platforms. Interviewers don't stop at "what is CDC?" — they probe whether you understand log based cdc vs trigger based cdc as a real trade-off, debezium architecture as the open-source workhorse, the dual writes anti-pattern as a common failure mode, and cdc to snowflake / cdc to bigquery patterns as the production reality.

This guide walks through the seven CDC primitives that show up most often in data engineering interview questions at FAANG, fintech, and SaaS shops. Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works. By the end you'll be able to defend log-based CDC over polling, explain the outbox pattern as the fix for dual writes, walk Debezium through snapshot + streaming phases, and ship a replay-safe op-aware MERGE sink — the exact shape cdc data engineer interview rounds reward when debezium kafka connect and cdc schema evolution come up.

When you want hands-on reps immediately after reading, drill the streaming practice library →, browse real-time analytics drills →, or rehearse database SQL problems → for the warehouse-MERGE side.

On this page

Why CDC shows up in every senior data engineering interview
The three CDC strategies — query-based, trigger-based, log-based
Debezium architecture — snapshot then stream
CDC into Kafka — topic-per-table, key choice, partitioning
Schema evolution and the dual-writes trap
CDC sinks — warehouses, lakehouses, op-aware MERGE
Operational CDC — backfills, snapshots, replication lag, observability
Choosing the right CDC primitive (cheat sheet)
Frequently asked questions
Practice on PipeCode

1. Why CDC shows up in every senior data engineering interview

CDC is the only sustainable way to ship transactional data to analytical destinations

The one-sentence invariant: CDC streams every INSERT, UPDATE, and DELETE from a source database to one or more downstream destinations, in near real time, without scanning the source on every load. Once you internalise that — every row change, ordered, replayable, with the original operation type — every change data capture interview questions prompt resolves to "which CDC strategy, which sink, which schema evolution mode."

Why CDC matters in 2026 DE interviews.

Real-time analytics. Dashboards that lag by hours are dead; product analytics, fraud detection, and ops need sub-minute freshness.
Microservices integration. Hundreds of services with their own databases need to feed a shared data plane without coupling.
Warehouse / lakehouse hydration. Snowflake, BigQuery, Databricks all expect a CDC feed for low-latency tables.
Event-driven architectures. CDC events feed downstream consumers (search index, cache invalidation, ML features) without dual writes.

The four CDC concepts every interview opens with.

Capture. Detect every row change on the source.
Order. Preserve the order of changes per key (so the downstream end state is correct).
Deliver. Push events to the destination (Kafka, S3, target warehouse).
Apply. Materialise the changes downstream (MERGE on PK, append-only history, etc.).

What interviewers listen for.

Do you reach for log-based CDC the moment "low latency + low source-DB load" comes up? — senior signal.
Do you mention the outbox pattern when "dual writes" are discussed? — senior-architectural signal.
Do you say MERGE on primary key when "warehouse sink" is asked? — required answer.
Do you bring up snapshot + streaming when "initial backfill" comes up? — Debezium-fluency signal.

Worked example — pick a CDC strategy for PostgreSQL → Snowflake

Detailed explanation. A common opener: "We have an OLTP PostgreSQL with 200 tables; we need them replicated into Snowflake with < 60-second lag and minimal source-DB load. What's your architecture?" The senior answer is log-based CDC via Debezium → Kafka → Snowflake Sink Connector with MERGE.

Question. Sketch the architecture and justify each component.

Code (component view).

# Debezium PostgreSQL source connector (Kafka Connect)
name: postgres-cdc-source
config:
  connector.class: io.debezium.connector.postgresql.PostgresConnector
  database.hostname: postgres-primary
  database.user: replication_user
  database.dbname: production
  topic.prefix: cdc
  table.include.list: public.orders,public.customers,public.products
  slot.name: debezium_slot
  plugin.name: pgoutput
  snapshot.mode: initial
  schema.history.internal.kafka.bootstrap.servers: kafka:9092

# Snowflake sink connector
name: snowflake-cdc-sink
config:
  connector.class: com.snowflake.kafka.connector.SnowflakeSinkConnector
  topics.regex: cdc\.public\..*
  snowflake.database.name: PROD
  snowflake.schema.name: RAW
  key.converter: io.confluent.connect.avro.AvroConverter
  value.converter: io.confluent.connect.avro.AvroConverter
  value.converter.schema.registry.url: http://schema-registry:8081
  buffer.flush.time: 30

Step-by-step explanation.

Debezium reads the PostgreSQL WAL (Write-Ahead Log) via a logical replication slot. Source-DB load is the cost of one streaming reader — comparable to a replica.
Snapshot mode = initial captures every existing row once at connector startup, then transitions to streaming the WAL from the snapshot's LSN cutoff.
Each change becomes a Kafka message with key = primary key and value = { before, after, op, ts_ms, source }.
Topic-per-table convention — cdc.public.orders, cdc.public.customers, etc. — keeps namespaces clean.
Snowflake Sink Connector buffers messages, bulk-loads to a staging table, then MERGE INTO target keyed on the primary key. Idempotent on retry.

Output.

Component	Role	Latency contribution
Postgres WAL	source-of-truth log	0ms
Debezium	tail WAL → Kafka	<1s
Kafka	durable buffer	<100ms
Sink connector	buffer + MERGE	up to `buffer.flush.time` (30s)
Snowflake	final destination	<5s for MERGE

Rule of thumb. Log-based + Kafka + topic-per-table + op-aware MERGE is the default 2026 CDC architecture. Reach for query-based polling only when WAL access isn't available.

CDC interview question on the four-component split

A senior probe: "Why split CDC into Debezium + Kafka + Sink? Why not write a single application that reads Postgres and writes Snowflake?"

Solution Using a decoupled Kafka-centric architecture

[ Postgres WAL ]
       │
       ▼
[ Debezium ] ─── slot+publication ───┐
                                     │
                                     ▼
                              [ Kafka topics ]   ← durable buffer; multiple consumers
                                     │
                  ┌──────────────────┼──────────────────┐
                  ▼                  ▼                  ▼
        [ Snowflake sink ]   [ BigQuery sink ]   [ Search indexer ]

Step-by-step trace.

Decision	Single-app approach	Kafka-centric approach
1. Multiple downstream destinations	duplicate work in app	each sink subscribes independently
2. Replay history	re-snapshot or custom log	Kafka retention replay
3. Backpressure on slow sink	source-DB pressure	Kafka absorbs
4. Add a new consumer	new app or fork	new sink connector
5. Schema evolution	manual	Schema Registry

Output:

Concern	Single app	Kafka-centric
New consumer	new code, new app	new sink config
Replay history	re-snapshot	reset offset
Slow consumer	source DB feels it	Kafka decouples
Schema evolution	per-app handling	Registry-managed

Why this works — concept by concept:

Decoupled capture vs delivery — Debezium captures into Kafka; sinks pull from Kafka. Each side scales independently.
Kafka as durable buffer — retention is the replay surface; consumers can pause without affecting the source DB.
Topic-per-table convention — natural partitioning of the CDC namespace; permits per-table retention and access controls.
Schema Registry — centralises Avro/Protobuf schema evolution across producers and consumers.
Cost — one extra hop (Kafka) adds <1s latency and substantial operational simplicity. Almost always worth it.

Streaming
Topic — streaming pipelines
CDC pipeline design problems

Practice →

2. The three CDC strategies — query-based, trigger-based, log-based

`log based cdc vs trigger based cdc` is the most-asked CDC interview question — know all three

The mental model: all three strategies turn "what changed since last time" into a stream of events, but they pay different costs.

Query-based (polling) CDC.

How: SELECT * FROM orders WHERE updated_at > $last_max ORDER BY updated_at on a cadence.
Pros: trivial; no special DB permissions; works on every database.
Cons:
- Cannot detect DELETEs unless you also poll a tombstone table.
- Misses fast intra-second updates that share the same updated_at.
- Source-DB load scales with poll frequency.
- Ordering within the polling window is best-effort.
When acceptable: low-volume tables, daily lag is fine, you don't care about DELETEs.

Trigger-based CDC.

How: DB triggers on every INSERT/UPDATE/DELETE write a row to an audit_orders table; a consumer polls or tails the audit table.
Pros: catches DELETEs, exact ordering, works on any RDBMS.
Cons:
- Source-DB write amplification — every write becomes 2 writes.
- Triggers can corrupt OLTP performance at scale.
- Audit-table growth needs aggressive pruning.
When acceptable: small-volume mission-critical tables on a DB without WAL-access; auditing requirements.

Log-based CDC.

How: read the database's transaction log directly — Postgres WAL, MySQL binlog, Oracle redo log, SQL Server CDC tables.
Pros: lowest source-DB load (one streaming reader, like a replica); lowest latency (sub-second); captures DELETEs; strict per-LSN ordering.
Cons: requires WAL access (REPLICATION role on Postgres, binlog_format=ROW on MySQL); more setup; tool dependency (Debezium, Fivetran, AWS DMS).
When: the 2026 default for any production CDC pipeline.

Cost matrix.

Property	Query-based	Trigger-based	Log-based
Latency	minutes–hours	seconds	sub-second
Source-DB load	medium–high	high	low
Captures DELETE	no	yes	yes
Ordering	best-effort	insertion order	log order (LSN)
Setup difficulty	easy	medium	medium
OLTP impact	reads only	write amplification	one reader

Worked example — convert a polling pipeline to log-based CDC

Detailed explanation. A team runs an Airflow DAG every 15 minutes that does SELECT * FROM orders WHERE updated_at > $last. It misses ~3% of fast updates and never catches DELETEs. Convert to Debezium.

Question. Show the polling SQL anti-pattern and the Debezium replacement.

Input (polling anti-pattern).

@task
def poll_orders():
    last = Variable.get("orders_last_max")  # ⚠ stateful, fragile
    rows = pg.execute(f"""
        SELECT * FROM orders
        WHERE updated_at > '{last}'
        ORDER BY updated_at
    """)
    target.upsert(rows, key="order_id")
    Variable.set("orders_last_max", max(r["updated_at"] for r in rows))
# Misses: DELETEs · fast intra-second updates · holes during DB outages.

Code (Debezium replacement).

name: postgres-orders-cdc
config:
  connector.class: io.debezium.connector.postgresql.PostgresConnector
  database.hostname: postgres-primary
  database.dbname: production
  topic.prefix: cdc
  table.include.list: public.orders
  slot.name: debezium_orders
  plugin.name: pgoutput
  snapshot.mode: initial

Step-by-step explanation.

The polling query reads rows where updated_at > last_max — anything missing an updated_at column or two updates in the same millisecond is invisible.
DELETEs leave no trace in the source table, so the polling consumer never sees them.
Debezium tails the WAL — every row mutation, including DELETE, emits an event with op=c|u|d|r.
Per-key ordering is preserved by the LSN of each WAL record.
Backfills are free: change snapshot.mode to initial and Debezium re-snapshots the entire table once.

Output (event stream comparison).

Source event	Polling result	Debezium event
INSERT order_id=1	captured	`op=c, key=1, after={...}`
UPDATE order_id=1 (1ms later)	possibly missed	`op=u, key=1, before={...}, after={...}`
DELETE order_id=1	invisible	`op=d, key=1, before={...}, after=null`
INSERT order_id=2 during DB blip	depends on retry	`op=c, key=2` (after WAL recovers)

Rule of thumb. Whenever you see "SELECT … WHERE updated_at > last" in production code, it's an upgrade candidate.

CDC interview question on choosing between trigger-based and log-based

The probe: "We're on Oracle and the DBA won't grant WAL access. CDC anyway?"

Solution Using trigger-based audit tables with bounded retention

CREATE TABLE audit_orders (
    audit_id    BIGSERIAL PRIMARY KEY,
    op_type     VARCHAR(1) NOT NULL,  -- I, U, D
    captured_at TIMESTAMPTZ DEFAULT NOW(),
    order_id    BIGINT NOT NULL,
    before_row  JSONB,
    after_row   JSONB
);

CREATE OR REPLACE TRIGGER trg_orders_audit
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW
EXECUTE FUNCTION audit_orders_fn();

-- consumer
SELECT * FROM audit_orders
WHERE audit_id > :last_seen
ORDER BY audit_id
LIMIT 10000;

Step-by-step trace.

Event	Trigger fires	Audit row	Consumer view
INSERT id=1	yes	`op=I, before=null, after={...}`	sees op=I
UPDATE id=1	yes	`op=U, before={...}, after={...}`	sees op=U
DELETE id=1	yes	`op=D, before={...}, after=null`	sees op=D
consumer crashes	—	rows persist	re-reads from last seen audit_id

Output:

Property	Value
Captures DELETE	yes
Ordering	by `audit_id` (insertion order)
Source-DB write amp	2× (one row + one audit row)
Audit-table growth	unbounded — needs retention
Pruning	nightly DELETE FROM audit_orders WHERE captured_at < NOW() - 30 days

Why this works — concept by concept:

Triggers as the capture mechanism — every DML automatically writes the audit row inside the same transaction. Atomic by definition.
Audit table as the queue — consumer just polls WHERE audit_id > last_seen. Per-key ordering preserved.
Bounded retention — without pruning, the audit table grows forever and starts dominating storage. 30-day retention is a common policy.
Write amplification cost — every DML now writes 2 rows. For low-volume tables this is fine; for high-volume tables this can double the DB load.
Cost — write throughput halved at the source DB; storage = O(retention × write_rate × row_size). Acceptable for small tables; log-based is much cheaper at high volume.

Streaming
Topic — real-time analytics
CDC strategy selection problems

Practice →

3. Debezium architecture — snapshot then stream

`debezium architecture` is "one initial snapshot, then forever streaming the WAL"

Debezium is the open-source log-based CDC platform built on top of Kafka Connect. It supports PostgreSQL, MySQL, MariaDB, Oracle, SQL Server, MongoDB, Db2, Cassandra, and more.

The Debezium event shape.

{
  "before": {"order_id": 42, "status": "paid"},
  "after":  {"order_id": 42, "status": "shipped"},
  "source": {"connector": "postgresql", "ts_ms": 1716840000000, "lsn": "0/16D2EBC0"},
  "op": "u",
  "ts_ms": 1716840000123
}

op is one of c (create / INSERT), u (update), d (delete), r (read — snapshot).
before is null for INSERTs; after is null for DELETEs.
source carries the LSN / binlog offset for replay positioning.

Snapshot modes.

initial (default) — full table snapshot once, then stream the WAL from the snapshot's LSN cutoff. Most common.
never — never snapshot; start streaming from the current WAL position. Use when historical state is already populated.
when_needed — snapshot only if a snapshot hasn't been completed yet.
schema_only — capture schemas, no data. Useful for schema-evolution registration.
incremental (Debezium 1.6+) — chunk-based snapshot that runs concurrently with streaming. Best for very large tables.

Source-specific quirks.

PostgreSQL: needs wal_level=logical, a publication, and a replication slot. Slot grows the WAL until the consumer commits — slot lag is the #1 ops issue.
MySQL: needs binlog_format=ROW, binlog_row_image=FULL. GTID mode helps with replication-source failover.
Oracle: requires LogMiner or XStream APIs; XStream needs the GoldenGate license.
SQL Server: needs sys.sp_cdc_enable_db and per-table cdc_enable_table. Reads from system CDC tables, not the t-log directly.

Worked example — debug "Debezium fell behind the source"

Detailed explanation. A senior probe: "Our Debezium connector is hours behind. Where do you start?"

Question. Diagnostic walkthrough for Debezium lag on PostgreSQL.

Code (diagnostic commands).

-- 1. Replication slot lag (bytes of WAL retained for the slot)
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS slot_lag
FROM pg_replication_slots
WHERE slot_name = 'debezium_orders';

-- 2. Slot is consuming WAL? (if confirmed_flush_lsn isn't advancing → consumer is dead)
SELECT slot_name, confirmed_flush_lsn, last_used FROM pg_replication_slots;

-- 3. Source write rate
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0')) AS wal_position;

# 4. Connector status from Kafka Connect REST
curl -s http://kafka-connect:8083/connectors/postgres-orders-cdc/status | jq

# 5. Connector lag metrics from JMX
echo "MillisecondsBehindSource: $(jmxterm-get debezium ...)"

Step-by-step explanation.

Slot lag is the canonical signal. If it grows unboundedly, the consumer isn't keeping up (or is paused).
Connector status — RUNNING vs FAILED vs PAUSED. A failed connector retains the slot but stops consuming, so WAL piles up.
MillisecondsBehindSource JMX metric — exposes the gap between WAL writes and Kafka publishes.
Kafka consumer lag on the Snowflake sink — cdc.public.orders topic. If the sink is slow, the messages back up in Kafka but the Debezium source side is fine.
Disk space — if WAL is filling the source-DB disk, you have minutes before things get serious.

Output (decision matrix).

Symptom	Likely cause	Fix
Slot lag growing, connector status `FAILED`	crashed connector	restart; fix root error
Slot lag growing, status `RUNNING`	downstream Kafka slow / full	scale Kafka, drain sink
Slot lag stable, sink consumer lag growing	sink throughput	scale sink workers
WAL filling disk fast	hot writes + slow consumer	bump connector resources or kill slot temporarily
Schema-history topic compaction issues	misconfigured retention	switch to compacted topic

Rule of thumb. Slot lag is your North Star. Alert on slot_lag > 1GB for production CDC.

CDC interview question on snapshot strategies for large tables

The probe: "We need to CDC a 2 TB orders table. The default initial snapshot would take 12 hours and hold a row-level lock — what do you do?"

Solution Using `incremental` snapshot

config:
  ...
  snapshot.mode: when_needed
  snapshot.locking.mode: none
  incremental.snapshot.chunk.size: 10000
  incremental.snapshot.watermarking.strategy: insert_insert

Step-by-step trace.

Step	Action	Effect
1	Connector starts; streaming phase begins immediately	WAL events flow from the current LSN
2	Operator sends a `signal` message requesting incremental snapshot for `public.orders`	Debezium opens a chunked snapshot loop
3	Snapshot loop reads `orders` in 10k-row chunks ordered by PK	each chunk emits `op=r` events into the topic
4	Streaming + snapshot run concurrently	replay-safe because both emit to the same topic with consistent keys
5	Snapshot completes; the loop terminates	only streaming continues

Output:

Approach	Lock time	Initial latency	Disk impact
`initial` snapshot	minutes–hours of row-level lock	hours before streaming starts	huge slot growth during snapshot
`incremental` snapshot	none (no lock)	streaming starts immediately	bounded slot growth

Why this works — concept by concept:

Chunked snapshot — instead of one giant SELECT, Debezium reads chunks of N rows ordered by PK. Each chunk is a tiny transaction.
Concurrent with streaming — the streaming phase keeps emitting WAL events for the latest writes; the snapshot phase emits historical reads. Both land on the same topic with the same key.
Watermarking — Debezium writes insert/insert watermark events around each chunk so consumers can reason about chunk boundaries if needed.
No source-DB lock — chunks are PK-range queries, served from the read replica or with LOCK MODE NONE.
Cost — slightly more code in the connector; substantially less operational pain for huge tables.

Streaming
Topic — streaming pipelines
Debezium operational design problems

Practice →

4. CDC into Kafka — topic-per-table, key choice, partitioning

Topic-per-table + key=PK + partition by key = ordered, replayable, partitionable CDC

The CDC-into-Kafka layout matters as much as the CDC source itself. Wrong key choice → wrong ordering. Wrong topic shape → impossible-to-manage namespaces. Wrong partition count → bottlenecked replication.

The three CDC-on-Kafka conventions.

Topic-per-table — one source table → one Kafka topic, named {prefix}.{schema}.{table} (e.g. cdc.public.orders).
Key = primary key — every CDC event uses the row's PK as the Kafka key. All events for one row land in one partition.
Partition count tuned to the table's write rate — high-traffic tables get more partitions; low-traffic ones get fewer.

Tombstones.

DELETE events emit two messages per row by default: the op=d event AND a tombstone (null value, same key).
Tombstones tell log-compacted topics to drop the key entirely on the next compaction round.
Useful when you want the topic to materialise as "current state per key."

Schema considerations.

Avro / Protobuf / JSON Schema via the Schema Registry — the Debezium-recommended path.
Schema id is the 4-byte header on every message — consumers fetch the schema by id.
JSON without schema is fine for development but loses schema evolution discipline.

Outbox-pattern topics.

When using the outbox pattern (see §5), Debezium reads the outbox table and re-keys/re-topics each event using SMTs (Single Message Transforms).
Result: one outbox feed → many domain-event topics (orders-events, payment-events, etc.).

Worked example — convert a flat all-tables topic into topic-per-table

Detailed explanation. A naive Debezium config writes every table to one giant topic. The senior fix splits by table using a RegexRouter SMT.

Question. Take a topic.prefix=cdc config and ensure each source table lands on its own topic.

Code (correct config).

config:
  connector.class: io.debezium.connector.postgresql.PostgresConnector
  topic.prefix: cdc
  table.include.list: public.orders,public.customers,public.products
  # By default Debezium writes to ${topic.prefix}.${schema}.${table}, e.g. cdc.public.orders
  # No router needed for the default behaviour. But if you want to flatten the namespace:
  transforms: route
  transforms.route.type: org.apache.kafka.connect.transforms.RegexRouter
  transforms.route.regex: cdc\.public\.(.*)
  transforms.route.replacement: warehouse.$1

Step-by-step explanation.

Default Debezium → topic name = cdc.public.orders (prefix . schema . table).
The RegexRouter SMT renames cdc.public.orders to warehouse.orders for cleaner downstream subscriber UX.
The key is unchanged — still the row's primary key.
Each topic gets its own partition count via auto.create.topics.enable + num.partitions per topic (or pre-create with explicit kafka-topics.sh --create).
Tombstones are emitted by default — useful for compacted topics, disable with tombstones.on.delete=false if your sink can't handle them.

Output.

Source table	Default Kafka topic	After router
public.orders	cdc.public.orders	warehouse.orders
public.customers	cdc.public.customers	warehouse.customers
public.products	cdc.public.products	warehouse.products

Rule of thumb. Topic-per-table with the natural Debezium naming convention. Use a RegexRouter only when the prefix needs to be cleaner for downstream consumers.

CDC interview question on partition count for a hot table

The probe: "The events table writes 50k rows/sec. How many partitions?"

Solution Using `events.count / consumer_throughput` math

# pre-create the topic with the right partition count
kafka-topics.sh --bootstrap-server kafka:9092 \
  --create --topic cdc.public.events \
  --partitions 30 \
  --replication-factor 3 \
  --config min.insync.replicas=2 \
  --config retention.ms=604800000

Step-by-step trace.

Metric	Value	Source
Write rate	50k rows/sec	source DB
Single-consumer throughput	~5k rows/sec	benchmark of the Snowflake sink
Required partitions	50k / 5k = 10	math
Over-provision factor	3×	future growth
Final partition count	30	rounded

Output:

Decision	Value	Why
Partitions	30	50k/5k = 10 × 3 over-provision
Replication	3	survive one broker loss
Key	`event_id` (PK)	per-event ordering
Retention	7 days	replay window

Why this works — concept by concept:

Partition count = parallelism cap — N consumers max per consumer group. 30 partitions means 30 parallel sink workers.
Key = PK preserves per-row ordering — all changes to event 42 land on the same partition, so the sink applies them in source order.
Pre-create the topic — never rely on auto-create for production CDC topics; you'll get default partition counts that don't match your throughput.
Retention as replay window — 7 days means you can blow up the sink and rebuild it from Kafka without re-snapshotting the source DB.
Cost — partitions cost a few KB of metadata each; under-provisioning costs throughput. Over-provision is cheap.

Streaming
Topic — streaming pipelines
CDC topic + partition design

Practice →

5. Schema evolution and the dual-writes trap

`cdc schema evolution` is solved by Schema Registry + compatibility modes; `dual writes` is solved by the outbox pattern

Schema evolution and dual writes are two of the most common CDC failure modes — and both have well-known, named solutions.

Schema evolution.

BACKWARD (default in Confluent Registry) — new schemas can READ old data. Consumers can upgrade before producers.
FORWARD — old consumers can READ data from new producers.
FULL — both directions.
NONE — anything goes (don't).
For CDC topics, BACKWARD is the typical choice: producers (Debezium) add columns, sinks (Snowflake) handle additive changes.

The dual-writes anti-pattern.

# ⚠️ anti-pattern
def place_order(order):
    db.execute("INSERT INTO orders VALUES (...)")    # ① durable in DB
    kafka.send("order_events", order.to_event())     # ② durable in Kafka
# Either step can fail independently — split-brain state.

DB succeeds, Kafka fails → ghost order with no event.
DB fails, Kafka sees the event → duplicate event downstream.
Two durable systems, two transactions — no atomicity.

The outbox pattern — the fix.

# ✓ outbox pattern
def place_order(order):
    with db.transaction():
        db.execute("INSERT INTO orders VALUES (...)")
        db.execute("INSERT INTO outbox (event_type, payload) VALUES ('order_placed', ...)")
    # Both rows commit atomically — single transaction, single durable system.
# Debezium tails the outbox table and produces the Kafka event.

ONE transaction, ONE durable system (the DB).
Debezium does the Kafka publishing as a separate concern.
Routing the outbox table to per-event-type topics is a 5-line SMT (EventRouter).

Worked example — implement the outbox pattern with Debezium's `EventRouter` SMT

Detailed explanation. The outbox table holds raw event payloads; Debezium routes each row to the correct domain topic using an SMT.

Question. Design the outbox table and the Debezium config to route order_placed events to order-events topic.

Input (outbox table).

CREATE TABLE outbox (
    id          BIGSERIAL PRIMARY KEY,
    aggregate   VARCHAR(64) NOT NULL,   -- e.g. 'order'
    aggregate_id VARCHAR(64) NOT NULL,  -- e.g. '42' (the order_id)
    event_type  VARCHAR(64) NOT NULL,   -- e.g. 'order_placed'
    payload     JSONB NOT NULL,
    created_at  TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_outbox_created ON outbox(created_at);

Code (Debezium config with EventRouter SMT).

config:
  connector.class: io.debezium.connector.postgresql.PostgresConnector
  database.hostname: postgres-primary
  database.dbname: production
  topic.prefix: outbox
  table.include.list: public.outbox
  transforms: outbox
  transforms.outbox.type: io.debezium.transforms.outbox.EventRouter
  transforms.outbox.route.topic.replacement: ${routedByValue}-events
  transforms.outbox.route.by.field: aggregate
  transforms.outbox.table.field.event.id: id
  transforms.outbox.table.field.event.key: aggregate_id
  transforms.outbox.table.field.event.payload: payload

Step-by-step explanation.

The service writes orders row + outbox row in one transaction.
Debezium tails the WAL and sees both rows as separate events. Only the outbox event matters for downstream.
The EventRouter SMT reads the aggregate column → routes to order-events topic.
The Kafka key is aggregate_id (the order's natural key).
The Kafka value is the payload JSON column (already-serialised event).

Output (event flow).

Source row	Outbox row	Kafka topic	Kafka key	Kafka value
orders.id=42	outbox.id=101, aggregate=order, type=order_placed	order-events	42	`{"order_id":42,...}`
orders.id=42 (UPDATE)	outbox.id=102, aggregate=order, type=order_shipped	order-events	42	`{"order_id":42,"status":"shipped"}`

Rule of thumb. Whenever your service needs to write to a database AND publish an event, use the outbox pattern. Never trust two durable systems to commit in lockstep without it.

CDC interview question on backward-compatible schema changes

The probe: "We need to add a column to the source table. How do we evolve the CDC pipeline?"

Solution Using BACKWARD compatibility + nullable additions

-- additive change in the source DB
ALTER TABLE orders ADD COLUMN promo_code VARCHAR(20) NULL;

# Schema Registry policy
"subject": "cdc.public.orders-value"
"compatibility": "BACKWARD"

Step-by-step trace.

Step	Action	Effect
1	DBA runs `ALTER TABLE orders ADD COLUMN promo_code`	column appears in Postgres
2	Next write fires Debezium WAL	new Avro schema registered (v2) with extra nullable field
3	Schema Registry checks compatibility: BACKWARD → new schema can read old data	passes
4	Consumer running on schema v1 reads message with schema v2	new field appears as `null` (default)
5	Consumer upgraded to schema v2	reads the column normally

Output:

Schema change	BACKWARD compatible?	Allowed without consumer upgrade?
Add nullable column	yes	yes
Add column with default	yes	yes
Remove column	no (only `FORWARD`)	no
Rename column	no	no
Change column type	depends on type pair	no in general

Why this works — concept by concept:

BACKWARD compatibility — new schema can read old data; safe for the producer to evolve while consumers catch up.
Additive changes are always safe — adding a nullable column has zero impact on existing consumers.
Destructive changes (drop, rename) require coordination — pause producers, drain consumers, deploy new schema, resume.
Schema Registry as the gate — every new schema is checked before it's accepted; impossible to push a breaking schema accidentally.
Cost — Schema Registry adds one lookup per new schema id (cached); operational overhead is the disciplined compatibility mode.

Streaming
Topic — real-time analytics
Outbox + schema evolution problems

Practice →

6. CDC sinks — warehouses, lakehouses, op-aware MERGE

`cdc to snowflake` and `cdc to bigquery` = bulk stage + op-aware MERGE on primary key

The mental model: sinks accumulate Kafka events into a staging table, then apply them with a single op-aware MERGE keyed on the primary key. The MERGE is what makes the sink idempotent — replaying the topic produces the same end state.

The five sink choices.

Snowflake Sink Connector (Confluent / Snowflake) — buffers + MERGE.
BigQuery Sink Connector — buffers + MERGE (or stream-insert with dedup).
Databricks Auto Loader + DLT — Delta tables with CDC merge.
Iceberg / Hudi / Delta Lake — lakehouse formats with native CDC merge ops.
Custom consumer — when none of the above fit.

Op-aware MERGE.

MERGE INTO target t
USING stg s
ON t.id = s.id
WHEN MATCHED AND s.op = 'd' THEN DELETE
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED AND s.op != 'd' THEN INSERT ...;

The op column from Debezium drives which branch the MERGE takes:

op=c → INSERT
op=u → UPDATE
op=d → DELETE
op=r (snapshot) → UPSERT (treat as INSERT or UPDATE)

Replay safety.

Re-consume the topic from offset 0 → MERGE applies the same final state.
No duplicate rows, no missed deletes.
Idempotent by construction.

Worked example — apply Debezium events to a Snowflake target

Detailed explanation. Walk through how a three-event sequence (INSERT, UPDATE, DELETE) for order_id=42 ends up in the target table.

Question. Trace op=c, op=u, op=d for order_id=42 through the Snowflake sink. Show the staging table and the MERGE output.

Input (Kafka topic cdc.public.orders).

// msg 1
{"op":"c","after":{"order_id":42,"status":"paid","amount":100}}
// msg 2
{"op":"u","before":{"status":"paid"},"after":{"order_id":42,"status":"shipped","amount":100}}
// msg 3
{"op":"d","before":{"order_id":42,"status":"shipped","amount":100},"after":null}

Code (sink processing).

-- Staging table populated by the sink connector batch
SELECT * FROM stg;
-- ┌──────────┬────┬───────────┬────────┐
-- │ order_id │ op │ status    │ amount │
-- ├──────────┼────┼───────────┼────────┤
-- │       42 │ c  │ paid      │    100 │
-- │       42 │ u  │ shipped   │    100 │
-- │       42 │ d  │ <null>    │ <null> │
-- └──────────┴────┴───────────┴────────┘

-- pre-MERGE collapse: keep only the LAST event per key (by Kafka offset)
WITH last_per_key AS (
    SELECT order_id, op, status, amount,
           ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY kafka_offset DESC) AS rn
    FROM stg
)
MERGE INTO orders t
USING (SELECT * FROM last_per_key WHERE rn = 1) s
ON t.order_id = s.order_id
WHEN MATCHED AND s.op = 'd' THEN DELETE
WHEN MATCHED THEN UPDATE SET status = s.status, amount = s.amount
WHEN NOT MATCHED AND s.op != 'd' THEN INSERT (order_id, status, amount)
                                  VALUES (s.order_id, s.status, s.amount);

Step-by-step explanation.

The sink connector buffers messages and writes them to a staging table.
Pre-MERGE collapse — for the same order_id, only the LAST event matters; intermediate UPDATE before DELETE doesn't need to be applied.
The MERGE walks each row in the collapsed staging table.
For order_id=42 the last event is op=d → matched → DELETE.
End state: row is gone from orders. Replay produces the same result.

Output (final orders table).

order_id	status	amount
(empty — id=42 deleted)

Rule of thumb. Pre-MERGE collapse + op-aware MERGE = replay-safe sink. Never write a sink that applies each event individually without collapsing.

CDC interview question on append-only history vs current state

A senior probe: "Do we land CDC as current state, or as append-only history?"

Solution Using two-tier — append-only `_raw` plus materialised current state

-- Tier 1: append-only CDC log
CREATE TABLE orders_raw (
    kafka_offset BIGINT,
    op           VARCHAR(1),
    captured_at  TIMESTAMPTZ,
    order_id     BIGINT,
    status       VARCHAR(20),
    amount       NUMERIC(12,2)
);

-- Tier 2: materialised view of current state (refreshed by MERGE)
CREATE MATERIALIZED VIEW orders_current AS
WITH last_per_key AS (
    SELECT order_id, op, status, amount,
           ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY kafka_offset DESC) AS rn
    FROM orders_raw
)
SELECT order_id, status, amount
FROM last_per_key
WHERE rn = 1 AND op != 'd';

Step-by-step trace.

Tier	Stored	Use case
Tier 1 (`orders_raw`)	every event ever, append-only	audit, replay, point-in-time queries
Tier 2 (`orders_current`)	current state per key	dashboards, joins

Output:

Question	Answer with single-tier MERGE	Answer with two-tier
What was the order's status at 14:00 yesterday?	impossible (state was overwritten)	query `orders_raw`
What's the current status?	query `orders`	query `orders_current`
Replay history	possible if Kafka has it	possible from `orders_raw`
Storage cost	low	high (every event)

Why this works — concept by concept:

Append-only raw layer — preserves full history; storage cost = O(events × row_size), but storage is cheap and audits are priceless.
Materialised current state — derived from raw via a windowed MERGE; refreshable as often as the SLA needs.
Decoupled audit from current state — different consumers ask different questions; one layer for each.
Replay confidence — Tier 1 itself is replay-safe (append-only on kafka_offset); Tier 2 can be rebuilt from Tier 1 at any time.
Cost — extra storage in exchange for time-travel. Acceptable for tables that benefit from audit history.

Streaming
Topic — database
CDC sink design problems (MERGE, replay)

Practice →

7. Operational CDC — backfills, snapshots, replication lag, observability

Production CDC is replication-slot management plus four metrics

CDC interview rounds often end with operational realism: how do you spot a problem, how do you backfill safely, how do you survive a primary failover.

The four CDC metrics every senior team monitors.

Replication slot lag (bytes) — bytes of WAL retained for the slot. Growing = consumer not keeping up.
Connector lag (MillisecondsBehindSource) — gap between WAL writes and Kafka publishes.
Consumer lag — partition-by-partition Kafka consumer lag on every sink.
Sink throughput (rows/sec) — actual application rate at the destination.

Backfills.

For a known-good window, just re-snapshot via snapshot.mode=initial on a new connector (different slot).
For a partial backfill, use incremental snapshot scoped to a subset of tables.
For a brand-new sink, set the consumer offset to earliest on the Kafka topic (provided retention covers the window).

Primary failover.

PostgreSQL: physical replication followers don't carry logical slots by default. Failover to a streaming replica requires pg_create_logical_replication_slot on the new primary and re-syncing — or use a tool like pg_failover_slots.
MySQL: GTID mode + binlog filename/position tracking. Debezium auto-resumes from the recorded position if the new primary's binlog covers it.
Operationally: always know the LSN/GTID of the last delivered event so you can rebuild the slot on the new primary.

Observability stack.

Debezium exposes JMX metrics → Prometheus → Grafana dashboards.
Schema Registry has REST endpoints for schema-version inventory.
Kafka brokers expose consumer-lag metrics.
Source DB exposes pg_replication_slots / equivalent.

Worked example — design the CDC operations runbook

Detailed explanation. A senior probe might be: "Write the on-call runbook for our CDC pipeline."

Question. Document the three most-paged scenarios and the steps to resolve.

Code (runbook structure).

# CDC On-Call Runbook

## Scenario A — replication slot lag growing
Symptom: `slot_lag > 1GB` for > 5 min.
1. Check connector status: GET /connectors/{name}/status. If FAILED, see logs and restart.
2. Check connector lag (`MillisecondsBehindSource`). If high, scale connector resources.
3. Check Kafka consumer lag on downstream sinks. If they're behind, scale sink workers.
4. If WAL fills source-DB disk in < 30 min, emergency: drop the slot, force-clear, accept data loss, re-snapshot.

## Scenario B — connector keeps crashing
Symptom: Status: FAILED repeatedly.
1. Read the connector log; look for `OutOfMemoryError`, `SchemaParseException`, `NotEnoughReplicasException`.
2. Common fix: increase heap for the Kafka Connect worker.
3. Common fix: schema-history topic compaction; recreate it as compacted.
4. Common fix: skip a poison-pill message via `errors.tolerance=all` + DLQ.

## Scenario C — sink applying slowly
Symptom: Kafka consumer lag growing, source-DB slot lag stable.
1. Check sink throughput; benchmark single-partition consumer.
2. Add partitions to the topic (rebalance required).
3. Scale sink workers (Kafka Connect tasks).
4. Tune `buffer.flush.time` / `buffer.flush.count` for the sink.

Step-by-step explanation.

Distinguish source-side problems (slot lag, Debezium failure) from sink-side problems (consumer lag, throughput).
Every scenario has a low-risk first action (read status / logs) and a high-risk last action (drop the slot, force-clear, re-snapshot).
Sink scaling problems are usually solved by partition count + worker count; rarely by sink-side connector config alone.
Poison-pill messages are handled via the dead-letter queue pattern — errors.tolerance=all and an explicit DLQ topic.
Make the runbook executable — paste-able commands, not prose.

Output (runbook quality check).

Question	Yes / No
Distinguishes source-side from sink-side?	Yes
Includes the "drop the slot" emergency?	Yes
Gives explicit commands?	Yes
Names the relevant metrics?	Yes
Covers schema-history compaction?	Yes

Rule of thumb. A runbook that doesn't include the disaster scenario (drop the slot) isn't a runbook.

CDC interview question on primary failover

The probe: "What happens to your CDC pipeline if Postgres primary fails over?"

Solution Using `pg_failover_slots` + binlog-position bookkeeping

# Install the pg_failover_slots extension on the primary and all replicas
CREATE EXTENSION pg_failover_slots;

# Promote a replica
pg_ctl promote -D /var/lib/postgresql/data

# Verify the slot was promoted to the new primary
SELECT slot_name, active, restart_lsn FROM pg_replication_slots;
# Debezium should reconnect within seconds, resuming from restart_lsn

Step-by-step trace.

Step	Without pg_failover_slots	With pg_failover_slots
1	Primary fails	Primary fails
2	Replica promoted	Replica promoted
3	New primary has NO slots	New primary has the slot, advanced to last-known LSN
4	Debezium tries to reconnect	Debezium reconnects to the new primary
5	Connector fails (no slot)	Connector resumes from where it left off
6	Manual recovery: snapshot from scratch	Zero downtime, no re-snapshot

Output:

Concern	Without	With
Downtime	hours (re-snapshot 200 tables)	seconds
Data loss	possible (events between failure and new slot)	none
Operational complexity	high	low
Extension required	no	yes (pg_failover_slots)

Why this works — concept by concept:

pg_failover_slots — replicates the logical slot state to standby servers; on failover, the new primary already has the slot at the right LSN.
Restart from restart_lsn — Debezium reconnects and resumes streaming without missing any WAL records.
Zero re-snapshot — no need to re-read the full table; the slot already covers the gap.
Operationally simple — one extension, one command on the new primary, no manual intervention.
Cost — minor write overhead on the primary to maintain slot state on standbys; massive win on failover.

Streaming
Topic — streaming pipelines
CDC ops + failover problems

Practice →

Choosing the right CDC primitive (cheat sheet)

Need low latency + low source-DB load? Log-based CDC (Debezium). Not query polling.
Source DB has no WAL access? Trigger-based CDC with a bounded-retention audit table.
Need atomic DB + event write? Outbox pattern + Debezium tailing the outbox table. Never dual writes.
Need topic-per-table? Default Debezium naming convention ({prefix}.{schema}.{table}).
Need replay-safe sink? Op-aware MERGE on primary key + pre-MERGE collapse to last event per key.
Need audit history? Two-tier: append-only _raw table + materialised _current view derived from raw.
Need schema evolution? Schema Registry + BACKWARD compatibility for additive changes.
Need to backfill a huge table? incremental snapshot mode — concurrent with streaming, no lock.
Need failover survival? pg_failover_slots on Postgres; GTID mode on MySQL.
Need to monitor? Slot lag + connector lag + consumer lag + sink throughput.

Frequently asked questions

What is Change Data Capture and why is log-based CDC the default choice?

CDC streams every INSERT, UPDATE, and DELETE from a source database to one or more downstream destinations, in near real time. Log-based CDC reads the database's transaction log (Postgres WAL, MySQL binlog, Oracle redo log) directly — it has the lowest source-DB load, the lowest latency (sub-second), captures DELETEs, and preserves strict per-key ordering by LSN. It's the 2026 default because the alternatives — query polling and DB triggers — either miss DELETEs or impose heavy write-amplification on the source.

How does Debezium work and what does the snapshot phase do?

Debezium is the open-source log-based CDC platform built on Kafka Connect. On startup it runs a snapshot phase — a one-time read of the entire table to capture existing rows — then transitions to streaming mode, tailing the WAL forever and emitting one Kafka event per row mutation. Each event carries before, after, op (c/u/d/r), and source LSN. For very large tables, use incremental snapshot mode, which runs chunked snapshots concurrently with streaming and avoids long-held locks.

What's the dual writes problem and how does the outbox pattern solve it?

Dual writes is the anti-pattern of writing to a database AND publishing to a message broker in two separate steps — either step can fail independently, leaving the system in an inconsistent state. The outbox pattern fixes this: write the business row AND a row in an outbox table inside a single DB transaction. Then have Debezium tail the outbox table and publish the events to Kafka. Only one durable system per transaction, atomic by definition. Debezium's EventRouter SMT routes outbox rows to per-event-type topics.

How does CDC into Kafka work — what's the topic and key convention?

The convention is topic-per-table (Debezium defaults to {prefix}.{schema}.{table}) and key = primary key. This guarantees that all events for one row land in the same partition, preserving per-key ordering. Partition count is sized to the table's write rate divided by single-consumer throughput, with 2–3× over-provisioning for growth. DELETEs emit two messages by default — the op=d event plus a tombstone (null value, same key) — which lets compacted topics drop the key entirely.

How do CDC sinks apply Debezium events to a target warehouse?

CDC sinks buffer Kafka messages into a staging table, collapse them to "last event per primary key" (using Kafka offset as the tiebreaker), then run an op-aware MERGE keyed on PK: WHEN MATCHED AND op='d' THEN DELETE, WHEN MATCHED THEN UPDATE, WHEN NOT MATCHED THEN INSERT. This MERGE is what makes the sink replay-safe — re-consuming the topic from offset 0 produces the same final state. Many teams also keep a Tier-1 append-only _raw table to preserve full history for audit and point-in-time queries.

What's the difference between BACKWARD and FORWARD schema compatibility for CDC?

BACKWARD compatibility (the default for CDC) means the new schema can READ data written with the old schema — so producers can upgrade before consumers. FORWARD means old consumers can READ data written with new producers (less common). FULL means both. For CDC, BACKWARD is right because the database adds columns (a producer change), and the sink picks up the new column on its next deploy. Always combine with Schema Registry to gate every schema change — destructive changes (drop, rename) require coordinated upgrades.

Practice on PipeCode

Drill the streaming practice library → for end-to-end CDC pipeline questions.
Rehearse real-time analytics drills → for windowed dedup and event-time scenarios.
Sharpen database SQL problems → for the warehouse-MERGE side.
For the broader interview surface, read top data engineering interview questions →.
Stack the prerequisites with the only 5 skills you need to become a data engineer →.
Reinforce the compute side with Apache Spark internals for DE interviews →.
For the design-round muscles, work through ETL system design for DE interviews →.
To pair CDC with table modelling, browse data modelling for DE interviews →.

Pipecode.ai is Leetcode for Data Engineering — every CDC concept above ships with hands-on practice rooms where you design log-based CDC pipelines, configure real Debezium connectors, and trace op-aware MERGEs. Start with the streaming library and work outward; PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine.

Practice streaming now →
Real-time analytics drills →

Top comments (1)

PracHub • May 27

The preference for log-based CDC using Debezium with PostgreSQL's WAL is efficient, especially with low latency demands. How do you handle schema changes during CDC? Do you recommend a strategy for managing evolving schemas while keeping data flows reliable? For those preparing for these discussions, our question bank on prachub.com includes company-tagged CDC topics seen in recent interviews. It's aligned with what interviewers are currently asking, helping candidates get ready for both basic and detailed questions.

1. Why CDC shows up in every senior data engineering interview

CDC is the only sustainable way to ship transactional data to analytical destinations

Worked example — pick a CDC strategy for PostgreSQL → Snowflake

CDC interview question on the four-component split

Solution Using a decoupled Kafka-centric architecture

2. The three CDC strategies — query-based, trigger-based, log-based

log based cdc vs trigger based cdc is the most-asked CDC interview question — know all three

Worked example — convert a polling pipeline to log-based CDC

CDC interview question on choosing between trigger-based and log-based

Solution Using trigger-based audit tables with bounded retention

3. Debezium architecture — snapshot then stream

debezium architecture is "one initial snapshot, then forever streaming the WAL"

Worked example — debug "Debezium fell behind the source"

CDC interview question on snapshot strategies for large tables

Solution Using incremental snapshot

4. CDC into Kafka — topic-per-table, key choice, partitioning

Topic-per-table + key=PK + partition by key = ordered, replayable, partitionable CDC

Worked example — convert a flat all-tables topic into topic-per-table

CDC interview question on partition count for a hot table

Solution Using events.count / consumer_throughput math

5. Schema evolution and the dual-writes trap

cdc schema evolution is solved by Schema Registry + compatibility modes; dual writes is solved by the outbox pattern

Worked example — implement the outbox pattern with Debezium's EventRouter SMT

CDC interview question on backward-compatible schema changes

Solution Using BACKWARD compatibility + nullable additions

6. CDC sinks — warehouses, lakehouses, op-aware MERGE

cdc to snowflake and cdc to bigquery = bulk stage + op-aware MERGE on primary key

Worked example — apply Debezium events to a Snowflake target

CDC interview question on append-only history vs current state

Solution Using two-tier — append-only _raw plus materialised current state

7. Operational CDC — backfills, snapshots, replication lag, observability

Production CDC is replication-slot management plus four metrics

Worked example — design the CDC operations runbook

CDC interview question on primary failover

Solution Using pg_failover_slots + binlog-position bookkeeping

Choosing the right CDC primitive (cheat sheet)

Frequently asked questions

What is Change Data Capture and why is log-based CDC the default choice?

How does Debezium work and what does the snapshot phase do?

What's the dual writes problem and how does the outbox pattern solve it?

How does CDC into Kafka work — what's the topic and key convention?

How do CDC sinks apply Debezium events to a target warehouse?

What's the difference between BACKWARD and FORWARD schema compatibility for CDC?

Practice on PipeCode

`log based cdc vs trigger based cdc` is the most-asked CDC interview question — know all three

`debezium architecture` is "one initial snapshot, then forever streaming the WAL"

Solution Using `incremental` snapshot

Solution Using `events.count / consumer_throughput` math

`cdc schema evolution` is solved by Schema Registry + compatibility modes; `dual writes` is solved by the outbox pattern

Worked example — implement the outbox pattern with Debezium's `EventRouter` SMT

`cdc to snowflake` and `cdc to bigquery` = bulk stage + op-aware MERGE on primary key

Solution Using two-tier — append-only `_raw` plus materialised current state

Solution Using `pg_failover_slots` + binlog-position bookkeeping