DEV Community

Cover image for Change Data Capture (CDC) for Data Engineering Interviews: Debezium, Log-Based vs Trigger-Based, Kafka Connect
Gowtham Potureddi
Gowtham Potureddi

Posted on

Change Data Capture (CDC) for Data Engineering Interviews: Debezium, Log-Based vs Trigger-Based, Kafka Connect

change data capture interview questions appear in every senior data engineering loop because CDC is the substrate that connects every transactional database to every analytical destination in modern data platforms. Interviewers don't stop at "what is CDC?" — they probe whether you understand log based cdc vs trigger based cdc as a real trade-off, debezium architecture as the open-source workhorse, the dual writes anti-pattern as a common failure mode, and cdc to snowflake / cdc to bigquery patterns as the production reality.

This guide walks through the seven CDC primitives that show up most often in data engineering interview questions at FAANG, fintech, and SaaS shops. Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works. By the end you'll be able to defend log-based CDC over polling, explain the outbox pattern as the fix for dual writes, walk Debezium through snapshot + streaming phases, and ship a replay-safe op-aware MERGE sink — the exact shape cdc data engineer interview rounds reward when debezium kafka connect and cdc schema evolution come up.

PipeCode blog header for a Change Data Capture interview prep guide for data engineers — bold white headline 'Change Data Capture · CDC' with subtitle 'Debezium · log-based vs trigger-based · Kafka Connect' and a minimal source-DB-to-Kafka-to-warehouse flow on a dark gradient with purple, green, and orange accents and a small pipecode.ai attribution.

When you want hands-on reps immediately after reading, drill the streaming practice library →, browse real-time analytics drills →, or rehearse database SQL problems → for the warehouse-MERGE side.


On this page


1. Why CDC shows up in every senior data engineering interview

CDC is the only sustainable way to ship transactional data to analytical destinations

The one-sentence invariant: CDC streams every INSERT, UPDATE, and DELETE from a source database to one or more downstream destinations, in near real time, without scanning the source on every load. Once you internalise that — every row change, ordered, replayable, with the original operation type — every change data capture interview questions prompt resolves to "which CDC strategy, which sink, which schema evolution mode."

Why CDC matters in 2026 DE interviews.

  • Real-time analytics. Dashboards that lag by hours are dead; product analytics, fraud detection, and ops need sub-minute freshness.
  • Microservices integration. Hundreds of services with their own databases need to feed a shared data plane without coupling.
  • Warehouse / lakehouse hydration. Snowflake, BigQuery, Databricks all expect a CDC feed for low-latency tables.
  • Event-driven architectures. CDC events feed downstream consumers (search index, cache invalidation, ML features) without dual writes.

The four CDC concepts every interview opens with.

  • Capture. Detect every row change on the source.
  • Order. Preserve the order of changes per key (so the downstream end state is correct).
  • Deliver. Push events to the destination (Kafka, S3, target warehouse).
  • Apply. Materialise the changes downstream (MERGE on PK, append-only history, etc.).

What interviewers listen for.

  • Do you reach for log-based CDC the moment "low latency + low source-DB load" comes up? — senior signal.
  • Do you mention the outbox pattern when "dual writes" are discussed? — senior-architectural signal.
  • Do you say MERGE on primary key when "warehouse sink" is asked? — required answer.
  • Do you bring up snapshot + streaming when "initial backfill" comes up? — Debezium-fluency signal.

Worked example — pick a CDC strategy for PostgreSQL → Snowflake

Detailed explanation. A common opener: "We have an OLTP PostgreSQL with 200 tables; we need them replicated into Snowflake with < 60-second lag and minimal source-DB load. What's your architecture?" The senior answer is log-based CDC via Debezium → Kafka → Snowflake Sink Connector with MERGE.

Question. Sketch the architecture and justify each component.

Code (component view).

# Debezium PostgreSQL source connector (Kafka Connect)
name: postgres-cdc-source
config:
  connector.class: io.debezium.connector.postgresql.PostgresConnector
  database.hostname: postgres-primary
  database.user: replication_user
  database.dbname: production
  topic.prefix: cdc
  table.include.list: public.orders,public.customers,public.products
  slot.name: debezium_slot
  plugin.name: pgoutput
  snapshot.mode: initial
  schema.history.internal.kafka.bootstrap.servers: kafka:9092

# Snowflake sink connector
name: snowflake-cdc-sink
config:
  connector.class: com.snowflake.kafka.connector.SnowflakeSinkConnector
  topics.regex: cdc\.public\..*
  snowflake.database.name: PROD
  snowflake.schema.name: RAW
  key.converter: io.confluent.connect.avro.AvroConverter
  value.converter: io.confluent.connect.avro.AvroConverter
  value.converter.schema.registry.url: http://schema-registry:8081
  buffer.flush.time: 30
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Debezium reads the PostgreSQL WAL (Write-Ahead Log) via a logical replication slot. Source-DB load is the cost of one streaming reader — comparable to a replica.
  2. Snapshot mode = initial captures every existing row once at connector startup, then transitions to streaming the WAL from the snapshot's LSN cutoff.
  3. Each change becomes a Kafka message with key = primary key and value = { before, after, op, ts_ms, source }.
  4. Topic-per-table convention — cdc.public.orders, cdc.public.customers, etc. — keeps namespaces clean.
  5. Snowflake Sink Connector buffers messages, bulk-loads to a staging table, then MERGE INTO target keyed on the primary key. Idempotent on retry.

Output.

Component Role Latency contribution
Postgres WAL source-of-truth log 0ms
Debezium tail WAL → Kafka <1s
Kafka durable buffer <100ms
Sink connector buffer + MERGE up to buffer.flush.time (30s)
Snowflake final destination <5s for MERGE

Rule of thumb. Log-based + Kafka + topic-per-table + op-aware MERGE is the default 2026 CDC architecture. Reach for query-based polling only when WAL access isn't available.

CDC interview question on the four-component split

A senior probe: "Why split CDC into Debezium + Kafka + Sink? Why not write a single application that reads Postgres and writes Snowflake?"

Solution Using a decoupled Kafka-centric architecture

[ Postgres WAL ]
       │
       ▼
[ Debezium ] ─── slot+publication ───┐
                                     │
                                     ▼
                              [ Kafka topics ]   ← durable buffer; multiple consumers
                                     │
                  ┌──────────────────┼──────────────────┐
                  ▼                  ▼                  ▼
        [ Snowflake sink ]   [ BigQuery sink ]   [ Search indexer ]
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Decision Single-app approach Kafka-centric approach
1. Multiple downstream destinations duplicate work in app each sink subscribes independently
2. Replay history re-snapshot or custom log Kafka retention replay
3. Backpressure on slow sink source-DB pressure Kafka absorbs
4. Add a new consumer new app or fork new sink connector
5. Schema evolution manual Schema Registry

Output:

Concern Single app Kafka-centric
New consumer new code, new app new sink config
Replay history re-snapshot reset offset
Slow consumer source DB feels it Kafka decouples
Schema evolution per-app handling Registry-managed

Why this works — concept by concept:

  • Decoupled capture vs delivery — Debezium captures into Kafka; sinks pull from Kafka. Each side scales independently.
  • Kafka as durable buffer — retention is the replay surface; consumers can pause without affecting the source DB.
  • Topic-per-table convention — natural partitioning of the CDC namespace; permits per-table retention and access controls.
  • Schema Registry — centralises Avro/Protobuf schema evolution across producers and consumers.
  • Cost — one extra hop (Kafka) adds <1s latency and substantial operational simplicity. Almost always worth it.

Streaming
Topic — streaming pipelines
CDC pipeline design problems

Practice →


2. The three CDC strategies — query-based, trigger-based, log-based

log based cdc vs trigger based cdc is the most-asked CDC interview question — know all three

Comparison diagram of the three CDC strategies — Query-based (polling SELECT WHERE updated_at > last_max), Trigger-based (DB triggers write to an audit/change table on every DML), and Log-based (Debezium reads the WAL / binlog / redo log); three side-by-side cards summarising latency, source-DB load, captures deletes, and ordering for each; on a light PipeCode card.

The mental model: all three strategies turn "what changed since last time" into a stream of events, but they pay different costs.

Query-based (polling) CDC.

  • How: SELECT * FROM orders WHERE updated_at > $last_max ORDER BY updated_at on a cadence.
  • Pros: trivial; no special DB permissions; works on every database.
  • Cons:
    • Cannot detect DELETEs unless you also poll a tombstone table.
    • Misses fast intra-second updates that share the same updated_at.
    • Source-DB load scales with poll frequency.
    • Ordering within the polling window is best-effort.
  • When acceptable: low-volume tables, daily lag is fine, you don't care about DELETEs.

Trigger-based CDC.

  • How: DB triggers on every INSERT/UPDATE/DELETE write a row to an audit_orders table; a consumer polls or tails the audit table.
  • Pros: catches DELETEs, exact ordering, works on any RDBMS.
  • Cons:
    • Source-DB write amplification — every write becomes 2 writes.
    • Triggers can corrupt OLTP performance at scale.
    • Audit-table growth needs aggressive pruning.
  • When acceptable: small-volume mission-critical tables on a DB without WAL-access; auditing requirements.

Log-based CDC.

  • How: read the database's transaction log directly — Postgres WAL, MySQL binlog, Oracle redo log, SQL Server CDC tables.
  • Pros: lowest source-DB load (one streaming reader, like a replica); lowest latency (sub-second); captures DELETEs; strict per-LSN ordering.
  • Cons: requires WAL access (REPLICATION role on Postgres, binlog_format=ROW on MySQL); more setup; tool dependency (Debezium, Fivetran, AWS DMS).
  • When: the 2026 default for any production CDC pipeline.

Cost matrix.

Property Query-based Trigger-based Log-based
Latency minutes–hours seconds sub-second
Source-DB load medium–high high low
Captures DELETE no yes yes
Ordering best-effort insertion order log order (LSN)
Setup difficulty easy medium medium
OLTP impact reads only write amplification one reader

Worked example — convert a polling pipeline to log-based CDC

Detailed explanation. A team runs an Airflow DAG every 15 minutes that does SELECT * FROM orders WHERE updated_at > $last. It misses ~3% of fast updates and never catches DELETEs. Convert to Debezium.

Question. Show the polling SQL anti-pattern and the Debezium replacement.

Input (polling anti-pattern).

@task
def poll_orders():
    last = Variable.get("orders_last_max")  # ⚠ stateful, fragile
    rows = pg.execute(f"""
        SELECT * FROM orders
        WHERE updated_at > '{last}'
        ORDER BY updated_at
    """)
    target.upsert(rows, key="order_id")
    Variable.set("orders_last_max", max(r["updated_at"] for r in rows))
# Misses: DELETEs · fast intra-second updates · holes during DB outages.
Enter fullscreen mode Exit fullscreen mode

Code (Debezium replacement).

name: postgres-orders-cdc
config:
  connector.class: io.debezium.connector.postgresql.PostgresConnector
  database.hostname: postgres-primary
  database.dbname: production
  topic.prefix: cdc
  table.include.list: public.orders
  slot.name: debezium_orders
  plugin.name: pgoutput
  snapshot.mode: initial
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The polling query reads rows where updated_at > last_max — anything missing an updated_at column or two updates in the same millisecond is invisible.
  2. DELETEs leave no trace in the source table, so the polling consumer never sees them.
  3. Debezium tails the WAL — every row mutation, including DELETE, emits an event with op=c|u|d|r.
  4. Per-key ordering is preserved by the LSN of each WAL record.
  5. Backfills are free: change snapshot.mode to initial and Debezium re-snapshots the entire table once.

Output (event stream comparison).

Source event Polling result Debezium event
INSERT order_id=1 captured op=c, key=1, after={...}
UPDATE order_id=1 (1ms later) possibly missed op=u, key=1, before={...}, after={...}
DELETE order_id=1 invisible op=d, key=1, before={...}, after=null
INSERT order_id=2 during DB blip depends on retry op=c, key=2 (after WAL recovers)

Rule of thumb. Whenever you see "SELECT … WHERE updated_at > last" in production code, it's an upgrade candidate.

CDC interview question on choosing between trigger-based and log-based

The probe: "We're on Oracle and the DBA won't grant WAL access. CDC anyway?"

Solution Using trigger-based audit tables with bounded retention

CREATE TABLE audit_orders (
    audit_id    BIGSERIAL PRIMARY KEY,
    op_type     VARCHAR(1) NOT NULL,  -- I, U, D
    captured_at TIMESTAMPTZ DEFAULT NOW(),
    order_id    BIGINT NOT NULL,
    before_row  JSONB,
    after_row   JSONB
);

CREATE OR REPLACE TRIGGER trg_orders_audit
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW
EXECUTE FUNCTION audit_orders_fn();

-- consumer
SELECT * FROM audit_orders
WHERE audit_id > :last_seen
ORDER BY audit_id
LIMIT 10000;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Event Trigger fires Audit row Consumer view
INSERT id=1 yes op=I, before=null, after={...} sees op=I
UPDATE id=1 yes op=U, before={...}, after={...} sees op=U
DELETE id=1 yes op=D, before={...}, after=null sees op=D
consumer crashes rows persist re-reads from last seen audit_id

Output:

Property Value
Captures DELETE yes
Ordering by audit_id (insertion order)
Source-DB write amp 2× (one row + one audit row)
Audit-table growth unbounded — needs retention
Pruning nightly DELETE FROM audit_orders WHERE captured_at < NOW() - 30 days

Why this works — concept by concept:

  • Triggers as the capture mechanism — every DML automatically writes the audit row inside the same transaction. Atomic by definition.
  • Audit table as the queue — consumer just polls WHERE audit_id > last_seen. Per-key ordering preserved.
  • Bounded retention — without pruning, the audit table grows forever and starts dominating storage. 30-day retention is a common policy.
  • Write amplification cost — every DML now writes 2 rows. For low-volume tables this is fine; for high-volume tables this can double the DB load.
  • Cost — write throughput halved at the source DB; storage = O(retention × write_rate × row_size). Acceptable for small tables; log-based is much cheaper at high volume.

Streaming
Topic — real-time analytics
CDC strategy selection problems

Practice →


3. Debezium architecture — snapshot then stream

debezium architecture is "one initial snapshot, then forever streaming the WAL"

Diagram of Debezium architecture — a source database (Postgres) with WAL on the left, a Debezium connector inside a Kafka Connect worker in the middle (with snapshot phase and streaming phase shown), a Kafka cluster with topic-per-table on the right, and Schema Registry attached storing Avro schemas; arrows show the snapshot phase as a one-time read and the streaming phase as continuous WAL consumption; on a light PipeCode card.

Debezium is the open-source log-based CDC platform built on top of Kafka Connect. It supports PostgreSQL, MySQL, MariaDB, Oracle, SQL Server, MongoDB, Db2, Cassandra, and more.

The Debezium event shape.

{
  "before": {"order_id": 42, "status": "paid"},
  "after":  {"order_id": 42, "status": "shipped"},
  "source": {"connector": "postgresql", "ts_ms": 1716840000000, "lsn": "0/16D2EBC0"},
  "op": "u",
  "ts_ms": 1716840000123
}
Enter fullscreen mode Exit fullscreen mode
  • op is one of c (create / INSERT), u (update), d (delete), r (read — snapshot).
  • before is null for INSERTs; after is null for DELETEs.
  • source carries the LSN / binlog offset for replay positioning.

Snapshot modes.

  • initial (default) — full table snapshot once, then stream the WAL from the snapshot's LSN cutoff. Most common.
  • never — never snapshot; start streaming from the current WAL position. Use when historical state is already populated.
  • when_needed — snapshot only if a snapshot hasn't been completed yet.
  • schema_only — capture schemas, no data. Useful for schema-evolution registration.
  • incremental (Debezium 1.6+) — chunk-based snapshot that runs concurrently with streaming. Best for very large tables.

Source-specific quirks.

  • PostgreSQL: needs wal_level=logical, a publication, and a replication slot. Slot grows the WAL until the consumer commits — slot lag is the #1 ops issue.
  • MySQL: needs binlog_format=ROW, binlog_row_image=FULL. GTID mode helps with replication-source failover.
  • Oracle: requires LogMiner or XStream APIs; XStream needs the GoldenGate license.
  • SQL Server: needs sys.sp_cdc_enable_db and per-table cdc_enable_table. Reads from system CDC tables, not the t-log directly.

Worked example — debug "Debezium fell behind the source"

Detailed explanation. A senior probe: "Our Debezium connector is hours behind. Where do you start?"

Question. Diagnostic walkthrough for Debezium lag on PostgreSQL.

Code (diagnostic commands).

-- 1. Replication slot lag (bytes of WAL retained for the slot)
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS slot_lag
FROM pg_replication_slots
WHERE slot_name = 'debezium_orders';

-- 2. Slot is consuming WAL? (if confirmed_flush_lsn isn't advancing → consumer is dead)
SELECT slot_name, confirmed_flush_lsn, last_used FROM pg_replication_slots;

-- 3. Source write rate
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0')) AS wal_position;
Enter fullscreen mode Exit fullscreen mode
# 4. Connector status from Kafka Connect REST
curl -s http://kafka-connect:8083/connectors/postgres-orders-cdc/status | jq

# 5. Connector lag metrics from JMX
echo "MillisecondsBehindSource: $(jmxterm-get debezium ...)"
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Slot lag is the canonical signal. If it grows unboundedly, the consumer isn't keeping up (or is paused).
  2. Connector statusRUNNING vs FAILED vs PAUSED. A failed connector retains the slot but stops consuming, so WAL piles up.
  3. MillisecondsBehindSource JMX metric — exposes the gap between WAL writes and Kafka publishes.
  4. Kafka consumer lag on the Snowflake sink — cdc.public.orders topic. If the sink is slow, the messages back up in Kafka but the Debezium source side is fine.
  5. Disk space — if WAL is filling the source-DB disk, you have minutes before things get serious.

Output (decision matrix).

Symptom Likely cause Fix
Slot lag growing, connector status FAILED crashed connector restart; fix root error
Slot lag growing, status RUNNING downstream Kafka slow / full scale Kafka, drain sink
Slot lag stable, sink consumer lag growing sink throughput scale sink workers
WAL filling disk fast hot writes + slow consumer bump connector resources or kill slot temporarily
Schema-history topic compaction issues misconfigured retention switch to compacted topic

Rule of thumb. Slot lag is your North Star. Alert on slot_lag > 1GB for production CDC.

CDC interview question on snapshot strategies for large tables

The probe: "We need to CDC a 2 TB orders table. The default initial snapshot would take 12 hours and hold a row-level lock — what do you do?"

Solution Using incremental snapshot

config:
  ...
  snapshot.mode: when_needed
  snapshot.locking.mode: none
  incremental.snapshot.chunk.size: 10000
  incremental.snapshot.watermarking.strategy: insert_insert
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Action Effect
1 Connector starts; streaming phase begins immediately WAL events flow from the current LSN
2 Operator sends a signal message requesting incremental snapshot for public.orders Debezium opens a chunked snapshot loop
3 Snapshot loop reads orders in 10k-row chunks ordered by PK each chunk emits op=r events into the topic
4 Streaming + snapshot run concurrently replay-safe because both emit to the same topic with consistent keys
5 Snapshot completes; the loop terminates only streaming continues

Output:

Approach Lock time Initial latency Disk impact
initial snapshot minutes–hours of row-level lock hours before streaming starts huge slot growth during snapshot
incremental snapshot none (no lock) streaming starts immediately bounded slot growth

Why this works — concept by concept:

  • Chunked snapshot — instead of one giant SELECT, Debezium reads chunks of N rows ordered by PK. Each chunk is a tiny transaction.
  • Concurrent with streaming — the streaming phase keeps emitting WAL events for the latest writes; the snapshot phase emits historical reads. Both land on the same topic with the same key.
  • Watermarking — Debezium writes insert/insert watermark events around each chunk so consumers can reason about chunk boundaries if needed.
  • No source-DB lock — chunks are PK-range queries, served from the read replica or with LOCK MODE NONE.
  • Cost — slightly more code in the connector; substantially less operational pain for huge tables.

Streaming
Topic — streaming pipelines
Debezium operational design problems

Practice →


4. CDC into Kafka — topic-per-table, key choice, partitioning

Topic-per-table + key=PK + partition by key = ordered, replayable, partitionable CDC

The CDC-into-Kafka layout matters as much as the CDC source itself. Wrong key choice → wrong ordering. Wrong topic shape → impossible-to-manage namespaces. Wrong partition count → bottlenecked replication.

The three CDC-on-Kafka conventions.

  • Topic-per-table — one source table → one Kafka topic, named {prefix}.{schema}.{table} (e.g. cdc.public.orders).
  • Key = primary key — every CDC event uses the row's PK as the Kafka key. All events for one row land in one partition.
  • Partition count tuned to the table's write rate — high-traffic tables get more partitions; low-traffic ones get fewer.

Tombstones.

  • DELETE events emit two messages per row by default: the op=d event AND a tombstone (null value, same key).
  • Tombstones tell log-compacted topics to drop the key entirely on the next compaction round.
  • Useful when you want the topic to materialise as "current state per key."

Schema considerations.

  • Avro / Protobuf / JSON Schema via the Schema Registry — the Debezium-recommended path.
  • Schema id is the 4-byte header on every message — consumers fetch the schema by id.
  • JSON without schema is fine for development but loses schema evolution discipline.

Outbox-pattern topics.

  • When using the outbox pattern (see §5), Debezium reads the outbox table and re-keys/re-topics each event using SMTs (Single Message Transforms).
  • Result: one outbox feed → many domain-event topics (orders-events, payment-events, etc.).

Worked example — convert a flat all-tables topic into topic-per-table

Detailed explanation. A naive Debezium config writes every table to one giant topic. The senior fix splits by table using a RegexRouter SMT.

Question. Take a topic.prefix=cdc config and ensure each source table lands on its own topic.

Code (correct config).

config:
  connector.class: io.debezium.connector.postgresql.PostgresConnector
  topic.prefix: cdc
  table.include.list: public.orders,public.customers,public.products
  # By default Debezium writes to ${topic.prefix}.${schema}.${table}, e.g. cdc.public.orders
  # No router needed for the default behaviour. But if you want to flatten the namespace:
  transforms: route
  transforms.route.type: org.apache.kafka.connect.transforms.RegexRouter
  transforms.route.regex: cdc\.public\.(.*)
  transforms.route.replacement: warehouse.$1
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Default Debezium → topic name = cdc.public.orders (prefix . schema . table).
  2. The RegexRouter SMT renames cdc.public.orders to warehouse.orders for cleaner downstream subscriber UX.
  3. The key is unchanged — still the row's primary key.
  4. Each topic gets its own partition count via auto.create.topics.enable + num.partitions per topic (or pre-create with explicit kafka-topics.sh --create).
  5. Tombstones are emitted by default — useful for compacted topics, disable with tombstones.on.delete=false if your sink can't handle them.

Output.

Source table Default Kafka topic After router
public.orders cdc.public.orders warehouse.orders
public.customers cdc.public.customers warehouse.customers
public.products cdc.public.products warehouse.products

Rule of thumb. Topic-per-table with the natural Debezium naming convention. Use a RegexRouter only when the prefix needs to be cleaner for downstream consumers.

CDC interview question on partition count for a hot table

The probe: "The events table writes 50k rows/sec. How many partitions?"

Solution Using events.count / consumer_throughput math

# pre-create the topic with the right partition count
kafka-topics.sh --bootstrap-server kafka:9092 \
  --create --topic cdc.public.events \
  --partitions 30 \
  --replication-factor 3 \
  --config min.insync.replicas=2 \
  --config retention.ms=604800000
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Metric Value Source
Write rate 50k rows/sec source DB
Single-consumer throughput ~5k rows/sec benchmark of the Snowflake sink
Required partitions 50k / 5k = 10 math
Over-provision factor future growth
Final partition count 30 rounded

Output:

Decision Value Why
Partitions 30 50k/5k = 10 × 3 over-provision
Replication 3 survive one broker loss
Key event_id (PK) per-event ordering
Retention 7 days replay window

Why this works — concept by concept:

  • Partition count = parallelism cap — N consumers max per consumer group. 30 partitions means 30 parallel sink workers.
  • Key = PK preserves per-row ordering — all changes to event 42 land on the same partition, so the sink applies them in source order.
  • Pre-create the topic — never rely on auto-create for production CDC topics; you'll get default partition counts that don't match your throughput.
  • Retention as replay window — 7 days means you can blow up the sink and rebuild it from Kafka without re-snapshotting the source DB.
  • Cost — partitions cost a few KB of metadata each; under-provisioning costs throughput. Over-provision is cheap.

Streaming
Topic — streaming pipelines
CDC topic + partition design

Practice →


5. Schema evolution and the dual-writes trap

cdc schema evolution is solved by Schema Registry + compatibility modes; dual writes is solved by the outbox pattern

Schema evolution and dual writes are two of the most common CDC failure modes — and both have well-known, named solutions.

Side-by-side diagram of the dual writes anti-pattern vs the outbox pattern — left: a service writes to Postgres AND publishes to Kafka separately, marked with a red 'inconsistent on partial failure' warning; right: a service writes to a single Postgres transaction containing both the business row and an outbox row, with Debezium tailing the outbox table and producing the Kafka event, marked with a green 'atomic' checkmark; on a light PipeCode card.

Schema evolution.

  • BACKWARD (default in Confluent Registry) — new schemas can READ old data. Consumers can upgrade before producers.
  • FORWARD — old consumers can READ data from new producers.
  • FULL — both directions.
  • NONE — anything goes (don't).
  • For CDC topics, BACKWARD is the typical choice: producers (Debezium) add columns, sinks (Snowflake) handle additive changes.

The dual-writes anti-pattern.

# ⚠️ anti-pattern
def place_order(order):
    db.execute("INSERT INTO orders VALUES (...)")    # ① durable in DB
    kafka.send("order_events", order.to_event())     # ② durable in Kafka
# Either step can fail independently — split-brain state.
Enter fullscreen mode Exit fullscreen mode
  • DB succeeds, Kafka fails → ghost order with no event.
  • DB fails, Kafka sees the event → duplicate event downstream.
  • Two durable systems, two transactions — no atomicity.

The outbox pattern — the fix.

# ✓ outbox pattern
def place_order(order):
    with db.transaction():
        db.execute("INSERT INTO orders VALUES (...)")
        db.execute("INSERT INTO outbox (event_type, payload) VALUES ('order_placed', ...)")
    # Both rows commit atomically — single transaction, single durable system.
# Debezium tails the outbox table and produces the Kafka event.
Enter fullscreen mode Exit fullscreen mode
  • ONE transaction, ONE durable system (the DB).
  • Debezium does the Kafka publishing as a separate concern.
  • Routing the outbox table to per-event-type topics is a 5-line SMT (EventRouter).

Worked example — implement the outbox pattern with Debezium's EventRouter SMT

Detailed explanation. The outbox table holds raw event payloads; Debezium routes each row to the correct domain topic using an SMT.

Question. Design the outbox table and the Debezium config to route order_placed events to order-events topic.

Input (outbox table).

CREATE TABLE outbox (
    id          BIGSERIAL PRIMARY KEY,
    aggregate   VARCHAR(64) NOT NULL,   -- e.g. 'order'
    aggregate_id VARCHAR(64) NOT NULL,  -- e.g. '42' (the order_id)
    event_type  VARCHAR(64) NOT NULL,   -- e.g. 'order_placed'
    payload     JSONB NOT NULL,
    created_at  TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_outbox_created ON outbox(created_at);
Enter fullscreen mode Exit fullscreen mode

Code (Debezium config with EventRouter SMT).

config:
  connector.class: io.debezium.connector.postgresql.PostgresConnector
  database.hostname: postgres-primary
  database.dbname: production
  topic.prefix: outbox
  table.include.list: public.outbox
  transforms: outbox
  transforms.outbox.type: io.debezium.transforms.outbox.EventRouter
  transforms.outbox.route.topic.replacement: ${routedByValue}-events
  transforms.outbox.route.by.field: aggregate
  transforms.outbox.table.field.event.id: id
  transforms.outbox.table.field.event.key: aggregate_id
  transforms.outbox.table.field.event.payload: payload
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The service writes orders row + outbox row in one transaction.
  2. Debezium tails the WAL and sees both rows as separate events. Only the outbox event matters for downstream.
  3. The EventRouter SMT reads the aggregate column → routes to order-events topic.
  4. The Kafka key is aggregate_id (the order's natural key).
  5. The Kafka value is the payload JSON column (already-serialised event).

Output (event flow).

Source row Outbox row Kafka topic Kafka key Kafka value
orders.id=42 outbox.id=101, aggregate=order, type=order_placed order-events 42 {"order_id":42,...}
orders.id=42 (UPDATE) outbox.id=102, aggregate=order, type=order_shipped order-events 42 {"order_id":42,"status":"shipped"}

Rule of thumb. Whenever your service needs to write to a database AND publish an event, use the outbox pattern. Never trust two durable systems to commit in lockstep without it.

CDC interview question on backward-compatible schema changes

The probe: "We need to add a column to the source table. How do we evolve the CDC pipeline?"

Solution Using BACKWARD compatibility + nullable additions

-- additive change in the source DB
ALTER TABLE orders ADD COLUMN promo_code VARCHAR(20) NULL;
Enter fullscreen mode Exit fullscreen mode
# Schema Registry policy
"subject": "cdc.public.orders-value"
"compatibility": "BACKWARD"
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Action Effect
1 DBA runs ALTER TABLE orders ADD COLUMN promo_code column appears in Postgres
2 Next write fires Debezium WAL new Avro schema registered (v2) with extra nullable field
3 Schema Registry checks compatibility: BACKWARD → new schema can read old data passes
4 Consumer running on schema v1 reads message with schema v2 new field appears as null (default)
5 Consumer upgraded to schema v2 reads the column normally

Output:

Schema change BACKWARD compatible? Allowed without consumer upgrade?
Add nullable column yes yes
Add column with default yes yes
Remove column no (only FORWARD) no
Rename column no no
Change column type depends on type pair no in general

Why this works — concept by concept:

  • BACKWARD compatibility — new schema can read old data; safe for the producer to evolve while consumers catch up.
  • Additive changes are always safe — adding a nullable column has zero impact on existing consumers.
  • Destructive changes (drop, rename) require coordination — pause producers, drain consumers, deploy new schema, resume.
  • Schema Registry as the gate — every new schema is checked before it's accepted; impossible to push a breaking schema accidentally.
  • Cost — Schema Registry adds one lookup per new schema id (cached); operational overhead is the disciplined compatibility mode.

Streaming
Topic — real-time analytics
Outbox + schema evolution problems

Practice →


6. CDC sinks — warehouses, lakehouses, op-aware MERGE

cdc to snowflake and cdc to bigquery = bulk stage + op-aware MERGE on primary key

Diagram of a CDC sink applying Debezium events into a Snowflake target table via MERGE — left: Kafka topic 'cdc.public.orders' showing three records (op=c, op=u, op=d) keyed by order_id; middle: a Kafka Connect Snowflake sink connector card; right: a target 'orders' table in Snowflake with the row's final state (id=42 deleted, id=43 inserted, id=42 updated then deleted etc.); on a light PipeCode card.

The mental model: sinks accumulate Kafka events into a staging table, then apply them with a single op-aware MERGE keyed on the primary key. The MERGE is what makes the sink idempotent — replaying the topic produces the same end state.

The five sink choices.

  • Snowflake Sink Connector (Confluent / Snowflake) — buffers + MERGE.
  • BigQuery Sink Connector — buffers + MERGE (or stream-insert with dedup).
  • Databricks Auto Loader + DLT — Delta tables with CDC merge.
  • Iceberg / Hudi / Delta Lake — lakehouse formats with native CDC merge ops.
  • Custom consumer — when none of the above fit.

Op-aware MERGE.

MERGE INTO target t
USING stg s
ON t.id = s.id
WHEN MATCHED AND s.op = 'd' THEN DELETE
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED AND s.op != 'd' THEN INSERT ...;
Enter fullscreen mode Exit fullscreen mode

The op column from Debezium drives which branch the MERGE takes:

  • op=c → INSERT
  • op=u → UPDATE
  • op=d → DELETE
  • op=r (snapshot) → UPSERT (treat as INSERT or UPDATE)

Replay safety.

  • Re-consume the topic from offset 0 → MERGE applies the same final state.
  • No duplicate rows, no missed deletes.
  • Idempotent by construction.

Worked example — apply Debezium events to a Snowflake target

Detailed explanation. Walk through how a three-event sequence (INSERT, UPDATE, DELETE) for order_id=42 ends up in the target table.

Question. Trace op=c, op=u, op=d for order_id=42 through the Snowflake sink. Show the staging table and the MERGE output.

Input (Kafka topic cdc.public.orders).

// msg 1
{"op":"c","after":{"order_id":42,"status":"paid","amount":100}}
// msg 2
{"op":"u","before":{"status":"paid"},"after":{"order_id":42,"status":"shipped","amount":100}}
// msg 3
{"op":"d","before":{"order_id":42,"status":"shipped","amount":100},"after":null}
Enter fullscreen mode Exit fullscreen mode

Code (sink processing).

-- Staging table populated by the sink connector batch
SELECT * FROM stg;
-- ┌──────────┬────┬───────────┬────────┐
-- │ order_id │ op │ status    │ amount │
-- ├──────────┼────┼───────────┼────────┤
-- │       42 │ c  │ paid      │    100 │
-- │       42 │ u  │ shipped   │    100 │
-- │       42 │ d  │ <null>    │ <null> │
-- └──────────┴────┴───────────┴────────┘

-- pre-MERGE collapse: keep only the LAST event per key (by Kafka offset)
WITH last_per_key AS (
    SELECT order_id, op, status, amount,
           ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY kafka_offset DESC) AS rn
    FROM stg
)
MERGE INTO orders t
USING (SELECT * FROM last_per_key WHERE rn = 1) s
ON t.order_id = s.order_id
WHEN MATCHED AND s.op = 'd' THEN DELETE
WHEN MATCHED THEN UPDATE SET status = s.status, amount = s.amount
WHEN NOT MATCHED AND s.op != 'd' THEN INSERT (order_id, status, amount)
                                  VALUES (s.order_id, s.status, s.amount);
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The sink connector buffers messages and writes them to a staging table.
  2. Pre-MERGE collapse — for the same order_id, only the LAST event matters; intermediate UPDATE before DELETE doesn't need to be applied.
  3. The MERGE walks each row in the collapsed staging table.
  4. For order_id=42 the last event is op=d → matched → DELETE.
  5. End state: row is gone from orders. Replay produces the same result.

Output (final orders table).

order_id status amount
(empty — id=42 deleted)

Rule of thumb. Pre-MERGE collapse + op-aware MERGE = replay-safe sink. Never write a sink that applies each event individually without collapsing.

CDC interview question on append-only history vs current state

A senior probe: "Do we land CDC as current state, or as append-only history?"

Solution Using two-tier — append-only _raw plus materialised current state

-- Tier 1: append-only CDC log
CREATE TABLE orders_raw (
    kafka_offset BIGINT,
    op           VARCHAR(1),
    captured_at  TIMESTAMPTZ,
    order_id     BIGINT,
    status       VARCHAR(20),
    amount       NUMERIC(12,2)
);

-- Tier 2: materialised view of current state (refreshed by MERGE)
CREATE MATERIALIZED VIEW orders_current AS
WITH last_per_key AS (
    SELECT order_id, op, status, amount,
           ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY kafka_offset DESC) AS rn
    FROM orders_raw
)
SELECT order_id, status, amount
FROM last_per_key
WHERE rn = 1 AND op != 'd';
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Tier Stored Use case
Tier 1 (orders_raw) every event ever, append-only audit, replay, point-in-time queries
Tier 2 (orders_current) current state per key dashboards, joins

Output:

Question Answer with single-tier MERGE Answer with two-tier
What was the order's status at 14:00 yesterday? impossible (state was overwritten) query orders_raw
What's the current status? query orders query orders_current
Replay history possible if Kafka has it possible from orders_raw
Storage cost low high (every event)

Why this works — concept by concept:

  • Append-only raw layer — preserves full history; storage cost = O(events × row_size), but storage is cheap and audits are priceless.
  • Materialised current state — derived from raw via a windowed MERGE; refreshable as often as the SLA needs.
  • Decoupled audit from current state — different consumers ask different questions; one layer for each.
  • Replay confidence — Tier 1 itself is replay-safe (append-only on kafka_offset); Tier 2 can be rebuilt from Tier 1 at any time.
  • Cost — extra storage in exchange for time-travel. Acceptable for tables that benefit from audit history.

Streaming
Topic — database
CDC sink design problems (MERGE, replay)

Practice →


7. Operational CDC — backfills, snapshots, replication lag, observability

Production CDC is replication-slot management plus four metrics

CDC interview rounds often end with operational realism: how do you spot a problem, how do you backfill safely, how do you survive a primary failover.

The four CDC metrics every senior team monitors.

  • Replication slot lag (bytes) — bytes of WAL retained for the slot. Growing = consumer not keeping up.
  • Connector lag (MillisecondsBehindSource) — gap between WAL writes and Kafka publishes.
  • Consumer lag — partition-by-partition Kafka consumer lag on every sink.
  • Sink throughput (rows/sec) — actual application rate at the destination.

Backfills.

  • For a known-good window, just re-snapshot via snapshot.mode=initial on a new connector (different slot).
  • For a partial backfill, use incremental snapshot scoped to a subset of tables.
  • For a brand-new sink, set the consumer offset to earliest on the Kafka topic (provided retention covers the window).

Primary failover.

  • PostgreSQL: physical replication followers don't carry logical slots by default. Failover to a streaming replica requires pg_create_logical_replication_slot on the new primary and re-syncing — or use a tool like pg_failover_slots.
  • MySQL: GTID mode + binlog filename/position tracking. Debezium auto-resumes from the recorded position if the new primary's binlog covers it.
  • Operationally: always know the LSN/GTID of the last delivered event so you can rebuild the slot on the new primary.

Observability stack.

  • Debezium exposes JMX metrics → Prometheus → Grafana dashboards.
  • Schema Registry has REST endpoints for schema-version inventory.
  • Kafka brokers expose consumer-lag metrics.
  • Source DB exposes pg_replication_slots / equivalent.

Worked example — design the CDC operations runbook

Detailed explanation. A senior probe might be: "Write the on-call runbook for our CDC pipeline."

Question. Document the three most-paged scenarios and the steps to resolve.

Code (runbook structure).

# CDC On-Call Runbook

## Scenario A — replication slot lag growing
Symptom: `slot_lag > 1GB` for > 5 min.
1. Check connector status: GET /connectors/{name}/status. If FAILED, see logs and restart.
2. Check connector lag (`MillisecondsBehindSource`). If high, scale connector resources.
3. Check Kafka consumer lag on downstream sinks. If they're behind, scale sink workers.
4. If WAL fills source-DB disk in < 30 min, emergency: drop the slot, force-clear, accept data loss, re-snapshot.

## Scenario B — connector keeps crashing
Symptom: Status: FAILED repeatedly.
1. Read the connector log; look for `OutOfMemoryError`, `SchemaParseException`, `NotEnoughReplicasException`.
2. Common fix: increase heap for the Kafka Connect worker.
3. Common fix: schema-history topic compaction; recreate it as compacted.
4. Common fix: skip a poison-pill message via `errors.tolerance=all` + DLQ.

## Scenario C — sink applying slowly
Symptom: Kafka consumer lag growing, source-DB slot lag stable.
1. Check sink throughput; benchmark single-partition consumer.
2. Add partitions to the topic (rebalance required).
3. Scale sink workers (Kafka Connect tasks).
4. Tune `buffer.flush.time` / `buffer.flush.count` for the sink.
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Distinguish source-side problems (slot lag, Debezium failure) from sink-side problems (consumer lag, throughput).
  2. Every scenario has a low-risk first action (read status / logs) and a high-risk last action (drop the slot, force-clear, re-snapshot).
  3. Sink scaling problems are usually solved by partition count + worker count; rarely by sink-side connector config alone.
  4. Poison-pill messages are handled via the dead-letter queue pattern — errors.tolerance=all and an explicit DLQ topic.
  5. Make the runbook executable — paste-able commands, not prose.

Output (runbook quality check).

Question Yes / No
Distinguishes source-side from sink-side? Yes
Includes the "drop the slot" emergency? Yes
Gives explicit commands? Yes
Names the relevant metrics? Yes
Covers schema-history compaction? Yes

Rule of thumb. A runbook that doesn't include the disaster scenario (drop the slot) isn't a runbook.

CDC interview question on primary failover

The probe: "What happens to your CDC pipeline if Postgres primary fails over?"

Solution Using pg_failover_slots + binlog-position bookkeeping

# Install the pg_failover_slots extension on the primary and all replicas
CREATE EXTENSION pg_failover_slots;

# Promote a replica
pg_ctl promote -D /var/lib/postgresql/data

# Verify the slot was promoted to the new primary
SELECT slot_name, active, restart_lsn FROM pg_replication_slots;
# Debezium should reconnect within seconds, resuming from restart_lsn
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Without pg_failover_slots With pg_failover_slots
1 Primary fails Primary fails
2 Replica promoted Replica promoted
3 New primary has NO slots New primary has the slot, advanced to last-known LSN
4 Debezium tries to reconnect Debezium reconnects to the new primary
5 Connector fails (no slot) Connector resumes from where it left off
6 Manual recovery: snapshot from scratch Zero downtime, no re-snapshot

Output:

Concern Without With
Downtime hours (re-snapshot 200 tables) seconds
Data loss possible (events between failure and new slot) none
Operational complexity high low
Extension required no yes (pg_failover_slots)

Why this works — concept by concept:

  • pg_failover_slots — replicates the logical slot state to standby servers; on failover, the new primary already has the slot at the right LSN.
  • Restart from restart_lsn — Debezium reconnects and resumes streaming without missing any WAL records.
  • Zero re-snapshot — no need to re-read the full table; the slot already covers the gap.
  • Operationally simple — one extension, one command on the new primary, no manual intervention.
  • Cost — minor write overhead on the primary to maintain slot state on standbys; massive win on failover.

Streaming
Topic — streaming pipelines
CDC ops + failover problems

Practice →


Choosing the right CDC primitive (cheat sheet)

  • Need low latency + low source-DB load? Log-based CDC (Debezium). Not query polling.
  • Source DB has no WAL access? Trigger-based CDC with a bounded-retention audit table.
  • Need atomic DB + event write? Outbox pattern + Debezium tailing the outbox table. Never dual writes.
  • Need topic-per-table? Default Debezium naming convention ({prefix}.{schema}.{table}).
  • Need replay-safe sink? Op-aware MERGE on primary key + pre-MERGE collapse to last event per key.
  • Need audit history? Two-tier: append-only _raw table + materialised _current view derived from raw.
  • Need schema evolution? Schema Registry + BACKWARD compatibility for additive changes.
  • Need to backfill a huge table? incremental snapshot mode — concurrent with streaming, no lock.
  • Need failover survival? pg_failover_slots on Postgres; GTID mode on MySQL.
  • Need to monitor? Slot lag + connector lag + consumer lag + sink throughput.

Frequently asked questions

What is Change Data Capture and why is log-based CDC the default choice?

CDC streams every INSERT, UPDATE, and DELETE from a source database to one or more downstream destinations, in near real time. Log-based CDC reads the database's transaction log (Postgres WAL, MySQL binlog, Oracle redo log) directly — it has the lowest source-DB load, the lowest latency (sub-second), captures DELETEs, and preserves strict per-key ordering by LSN. It's the 2026 default because the alternatives — query polling and DB triggers — either miss DELETEs or impose heavy write-amplification on the source.

How does Debezium work and what does the snapshot phase do?

Debezium is the open-source log-based CDC platform built on Kafka Connect. On startup it runs a snapshot phase — a one-time read of the entire table to capture existing rows — then transitions to streaming mode, tailing the WAL forever and emitting one Kafka event per row mutation. Each event carries before, after, op (c/u/d/r), and source LSN. For very large tables, use incremental snapshot mode, which runs chunked snapshots concurrently with streaming and avoids long-held locks.

What's the dual writes problem and how does the outbox pattern solve it?

Dual writes is the anti-pattern of writing to a database AND publishing to a message broker in two separate steps — either step can fail independently, leaving the system in an inconsistent state. The outbox pattern fixes this: write the business row AND a row in an outbox table inside a single DB transaction. Then have Debezium tail the outbox table and publish the events to Kafka. Only one durable system per transaction, atomic by definition. Debezium's EventRouter SMT routes outbox rows to per-event-type topics.

How does CDC into Kafka work — what's the topic and key convention?

The convention is topic-per-table (Debezium defaults to {prefix}.{schema}.{table}) and key = primary key. This guarantees that all events for one row land in the same partition, preserving per-key ordering. Partition count is sized to the table's write rate divided by single-consumer throughput, with 2–3× over-provisioning for growth. DELETEs emit two messages by default — the op=d event plus a tombstone (null value, same key) — which lets compacted topics drop the key entirely.

How do CDC sinks apply Debezium events to a target warehouse?

CDC sinks buffer Kafka messages into a staging table, collapse them to "last event per primary key" (using Kafka offset as the tiebreaker), then run an op-aware MERGE keyed on PK: WHEN MATCHED AND op='d' THEN DELETE, WHEN MATCHED THEN UPDATE, WHEN NOT MATCHED THEN INSERT. This MERGE is what makes the sink replay-safe — re-consuming the topic from offset 0 produces the same final state. Many teams also keep a Tier-1 append-only _raw table to preserve full history for audit and point-in-time queries.

What's the difference between BACKWARD and FORWARD schema compatibility for CDC?

BACKWARD compatibility (the default for CDC) means the new schema can READ data written with the old schema — so producers can upgrade before consumers. FORWARD means old consumers can READ data written with new producers (less common). FULL means both. For CDC, BACKWARD is right because the database adds columns (a producer change), and the sink picks up the new column on its next deploy. Always combine with Schema Registry to gate every schema change — destructive changes (drop, rename) require coordinated upgrades.

Practice on PipeCode

Pipecode.ai is Leetcode for Data Engineering — every CDC concept above ships with hands-on practice rooms where you design log-based CDC pipelines, configure real Debezium connectors, and trace op-aware MERGEs. Start with the streaming library and work outward; PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine.

Practice streaming now →
Real-time analytics drills →

Top comments (0)