Gowtham Potureddi

Posted on Jun 30

PostgreSQL Logical Replication for CDC: Slots, Publications & Conflict Handling

#sql #python #interview #dataengineering

postgres logical replication is the quiet plumbing under almost every modern CDC pipeline — and the part of Postgres senior data engineers most often handwave through until a phantom replication slot fills the source disk at 3am and the on-call shift becomes a forensic investigation. Streaming replication ships byte-for-byte WAL between Postgres instances; pg_logical ships decoded SQL-level changes — INSERT / UPDATE / DELETE rows with column values — out to anything that can speak the replication protocol. That single shift turns Postgres into a real change-data-capture source for Kafka, Snowflake, an audit warehouse, or another Postgres on a newer major version.

This guide is the senior-DE walkthrough you wished existed the first time an interviewer asked "explain postgres cdc end-to-end" or "what does a replication slot actually hold, and why is it dangerous?" It walks through logical decoding (WAL → output plugin → change events), the slot machinery (pgoutput vs wal2json, max_slot_wal_keep_size, the phantom-slot anti-pattern), publications + REPLICA IDENTITY (why your UPDATEs silently disappear without a primary key), and the production CDC pattern teams ship with debezium postgres — snapshot-then-stream, heartbeat topics, signal-table ad-hoc snapshots, and idempotent sinks for the inevitable replay. Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.

When you want hands-on reps immediately after reading, drill the SQL practice library →, rehearse on the ETL practice library →, and sharpen the streaming axis with the streaming practice library →.

On this page

Why logical replication is the CDC backbone
Logical replication architecture
Replication slots — power and danger
Publications and REPLICA IDENTITY
CDC into Kafka via Debezium
Cheat sheet — logical replication recipes
Frequently asked questions
Practice on PipeCode

1. Why logical replication is the CDC backbone

Physical replication ships bytes, `postgres logical replication` ships row-level changes — the moment you want Postgres data anywhere other than another identical Postgres, you need the logical path

The one-sentence invariant: physical (streaming) replication ships the byte-for-byte WAL between two identical Postgres clusters; logical replication decodes the WAL into per-row INSERT/UPDATE/DELETE events that any consumer — Postgres, Kafka, Snowflake, a custom Go service — can apply. Every other downstream CDC tool (Debezium, Fivetran, Airbyte, AWS DMS, Striim, Snowflake's Postgres connector) is a wrapper on top of logical decoding. Once you internalise "physical = bytes, logical = decoded changes," the entire pg_logical interview surface collapses to a sequence of consequences from that one structural difference.

Three axes interviewers actually probe.

Decoding. Logical replication runs an output plugin inside the Postgres backend that walks the WAL and emits a stream of decoded change events. pgoutput is the in-tree binary protocol used by Postgres-to-Postgres subscriptions and by Debezium 2.x by default; wal2json emits JSON and is the historical default for tools that wanted a human-readable wire format; decoderbufs emits Protobuf for binary efficiency. The choice of plugin determines what your consumer code looks like.
Slot ownership. Every consumer of logical decoding holds a replication slot — a persistent cursor into the WAL that prevents Postgres from recycling segments the consumer has not yet read. The slot is the source of every "Postgres CDC ate my disk" outage you have ever read about. Slot lifecycle is the senior interview probe.
Replica identity. UPDATEs and DELETEs need a way to identify the row that changed. With a primary key, Postgres ships the PK columns in the change event. Without one, you must explicitly choose REPLICA IDENTITY FULL (ship the entire previous row) or REPLICA IDENTITY USING INDEX (ship a non-PK unique index). Get this wrong and your sink silently misses updates.

The DSL split — same WAL, different consumer wire format.

pgoutput ships a compact binary protocol native to Postgres. Used by every Postgres-to-Postgres CREATE SUBSCRIPTION and by Debezium 2.x by default. Smallest wire size, fastest decoding, slightly opaque to debug.
wal2json ships a JSON object per change with table name, op type, and column values. Easy to debug; larger wire size; the historical CDC default for the wider Postgres ecosystem.
decoderbufs ships Protobuf; the Debezium 1.x default. Compact and language-agnostic, but tied to a Debezium-shipped plugin binary that has to be loaded into Postgres explicitly.

The 2026 reality — what changed since 2022.

PG15 added row filtering (CREATE PUBLICATION p FOR TABLE orders WHERE (status = 'shipped')) and column lists (FOR TABLE orders (id, total)). Before PG15 you replicated whole tables; now you can ship a subset to a downstream that does not need everything.
PG16 added bi-directional logical replication support (with conflict handling), real parallel apply workers on the subscriber, and proper logical decoding from standbys (huge for read-replica-style CDC).
Debezium 2.x switched the default plugin from decoderbufs to pgoutput, removed the requirement for a custom plugin binary, and now supports READ_ONLY snapshot mode on Postgres replicas.
Fivetran, Airbyte, AWS DMS, Snowflake's Postgres connector all sit on top of pgoutput or wal2json. Knowing the underlying plugin protocol is what differentiates a senior interview answer from a vendor-comparison spec sheet.

What interviewers listen for.

Do you say "logical replication = WAL → output plugin → replication slot → subscription" in the first sentence? — senior signal.
Do you mention "the slot retains WAL until the consumer reads it; if the consumer dies, WAL fills the disk" unprompted? — senior signal.
Do you describe REPLICA IDENTITY as "what columns the source ships for UPDATE/DELETE" rather than "a setting"? — required answer.
Do you push back on "just use Fivetran" with "Fivetran is a managed wrapper on top of pgoutput; you still need to understand slots and replica identity"? — senior signal.

Worked example — physical vs logical replication side-by-side

Detailed explanation. The first interview question on Postgres replication is almost always a comparison. Physical replication uses streaming WAL between two Postgres clusters of the same major version with identical schemas and identical OIDs — it is the high-availability primitive. Logical replication uses logical decoding to ship row-level changes between any two consumers, regardless of major version, schema, or even database engine. The two solve different problems; teams often confuse them.

Question. Compare physical (streaming) replication and logical replication on five axes: granularity, version compatibility, target type, schema requirement, and what happens to DDL.

Input.

Axis	Physical	Logical
Granularity	whole cluster	per-table, per-DML
Major-version skew	same	any
Target	another Postgres	Postgres / Kafka / anything
Schema	identical	can differ
DDL	replicates automatically	does not replicate (must be applied separately)

Code.

-- Physical replication setup (primary)
ALTER SYSTEM SET wal_level = 'replica';
ALTER SYSTEM SET max_wal_senders = 10;
SELECT pg_create_physical_replication_slot('standby_1');

-- On standby
-- recovery.conf / postgresql.auto.conf:
-- primary_conninfo = 'host=primary user=replicator'
-- primary_slot_name = 'standby_1'

-- Logical replication setup (publisher)
ALTER SYSTEM SET wal_level = 'logical';
CREATE PUBLICATION orders_pub FOR TABLE orders;

-- On subscriber (separate Postgres, possibly newer major version)
CREATE SUBSCRIPTION orders_sub
  CONNECTION 'host=publisher dbname=app user=replicator'
  PUBLICATION orders_pub;

Step-by-step explanation.

Physical replication requires wal_level = replica (the default since PG10). The primary streams WAL bytes to the standby; the standby applies them with a WAL replay process. The standby is a binary clone — same OIDs, same TIDs, same on-disk layout.
Logical replication requires wal_level = logical. The primary still writes the same WAL, but a logical decoding process walks the WAL and emits decoded change events. The subscriber applies those events with regular SQL DML (INSERT … ON CONFLICT, UPDATE, DELETE).
Because logical replication uses normal SQL on the subscriber, the subscriber can be on a different major version (PG13 → PG16 is fine), a different schema (extra columns on the subscriber side are tolerated), or even a different system (Kafka via Debezium, Snowflake via Fivetran).
The biggest gotcha: logical replication does NOT replicate DDL. If you ALTER TABLE orders ADD COLUMN tax NUMERIC on the publisher, the subscriber will not see the column; the next INSERT that includes a tax value will fail with "column does not exist." You must apply DDL on both sides (or use a tool like pg_dump --schema-only to keep them in sync).
Physical = HA / read replicas / point-in-time recovery. Logical = CDC / cross-version upgrades / data lake ingestion. Pick by what you need; don't try to use one for the other's job.

Output.

Use case	Pick
Hot standby for failover	physical
Read replica for the same cluster	physical
Postgres 13 → Postgres 16 zero-downtime upgrade	logical
Stream changes to Kafka	logical (via Debezium)
Snowflake / Iceberg / BigQuery sync	logical (via Fivetran / Airbyte / DMS)
Audit log warehouse	logical

Rule of thumb. If you want byte-for-byte cloning between two Postgres of the same version, use physical. The moment you want changes anywhere else — different major version, different engine, different schema, different downstream tool — you want logical.

Worked example — picking an output plugin

Detailed explanation. The output plugin is a Postgres extension loaded at WAL-decoding time. Three plugins dominate: pgoutput (in-tree binary), wal2json (out-of-tree JSON), decoderbufs (Debezium 1.x Protobuf). The choice affects wire size, debuggability, and which tools you can plug in downstream.

Question. Pick the right output plugin for three scenarios: Postgres-to-Postgres subscription, Debezium 2.x into Kafka, and a Python script that wants human-readable change events.

Input.

Scenario	Consumer	Wire format preference
PG → PG subscription	Postgres subscriber	binary, native
Debezium 2.x → Kafka	Debezium connector	binary, fast
Python script	custom replication client	JSON, debuggable

Code.

-- pgoutput — the in-tree default
CREATE PUBLICATION p FOR TABLE orders;
-- (no plugin selection needed; pgoutput is implicit for SUBSCRIPTION)

-- wal2json — installed as an extension on the publisher
-- CREATE EXTENSION wal2json;  -- pre-installed in most managed Postgres
SELECT pg_create_logical_replication_slot('my_slot', 'wal2json');

-- Stream changes as JSON
SELECT data
  FROM pg_logical_slot_get_changes('my_slot', NULL, NULL,
                                   'include-xids', '1',
                                   'pretty-print', '1');

-- decoderbufs — Debezium 1.x style (Protobuf)
SELECT pg_create_logical_replication_slot('debezium_slot', 'decoderbufs');

Step-by-step explanation.

For PG → PG subscriptions, you do not pick a plugin at all — CREATE SUBSCRIPTION always uses pgoutput. It is in-tree, binary, and tightly coupled to the Postgres apply worker.
For Debezium 2.x → Kafka, the recommended plugin is pgoutput (the Debezium default since 2.0). It removes the need for an out-of-tree plugin binary on the Postgres host — every modern Postgres ships pgoutput automatically. Debezium 1.x used decoderbufs and required loading the Debezium-shipped .so file; that step is gone.
For a Python replication client, wal2json is the friendlier choice. The wire format is one JSON object per change, with table name, op type, and column values — easy to consume with psycopg.replication or a similar library, and you can pretty-print and inspect in psql while debugging.
Performance: pgoutput and decoderbufs are both binary and within ~10% of each other on throughput. wal2json is roughly 1.5–2x larger on the wire and noticeably slower to decode, but the debugging win is large for hand-rolled consumers.
The plugin is loaded inside the Postgres backend process — bad plugins can degrade WAL-sender CPU. Production teams pick pgoutput unless they have a specific reason for JSON.

Output.

Plugin	Format	Default for	When to pick
`pgoutput`	binary, native	PG ↔ PG subscriptions; Debezium 2.x	most production CDC
`wal2json`	JSON	hand-rolled clients, ad-hoc inspection	debuggability, custom tooling
`decoderbufs`	Protobuf	Debezium 1.x	legacy Debezium deployments

Rule of thumb. Default to pgoutput. Switch to wal2json only when you are writing a custom consumer and the JSON wire format is genuinely useful for debugging.

Worked example — bi-directional logical replication on PG16

Detailed explanation. PG16 added the missing primitives for symmetric two-node logical replication — origin = 'NONE' to prevent loops, and a real conflict-resolution layer. Before PG16 you had to hand-roll loop prevention with triggers. Now it is a one-flag affair, with pg_subscription tracking the origin of every applied change.

Question. Set up bi-directional logical replication between two Postgres 16 nodes such that writes on either side replicate to the other, without infinite loops.

Input.

Node	Role	Writes
pg-a	publisher + subscriber	local writes from app A
pg-b	publisher + subscriber	local writes from app B

Code.

-- On pg-a
CREATE PUBLICATION pub_a FOR TABLE orders;
CREATE SUBSCRIPTION sub_b
  CONNECTION 'host=pg-b dbname=app user=replicator'
  PUBLICATION pub_b
  WITH (origin = 'NONE', copy_data = false);

-- On pg-b
CREATE PUBLICATION pub_b FOR TABLE orders;
CREATE SUBSCRIPTION sub_a
  CONNECTION 'host=pg-a dbname=app user=replicator'
  PUBLICATION pub_a
  WITH (origin = 'NONE', copy_data = false);

Step-by-step explanation.

Each node creates its own publication of the same table (orders) and subscribes to the other node's publication. Without loop prevention, an insert on pg-a would replicate to pg-b, which would re-emit it back to pg-a, and so on forever.
The flag origin = 'NONE' is the loop breaker. Each applied change is tagged with its origin (the upstream publication node). The apply worker on a node only re-emits a change if its origin is "local" — never if it came from another logical replication source.
copy_data = false skips the initial snapshot. Both sides already have the same data (assume a careful operator); you only want to stream incremental changes from now on.
Conflicts are still possible: two app servers writing to the same row at the same time will produce divergent values. PG16 gives you an ALTER SUBSCRIPTION … SET (disable_on_error = true) knob and a pg_subscription_origin_status view to inspect conflict state. Production setups add a write-key partitioning scheme on top (each row has a "home" node).
Bi-directional replication is powerful for active-active deployments but multiplies the operational surface. The vast majority of CDC pipelines remain one-way (source → CDC → sink) — bi-directional is the niche case.

Output.

Knob	Default	What it does
`origin = 'NONE'`	`'ANY'`	only re-emit changes that originated locally — breaks replication loops
`copy_data = false`	`true`	skip the initial snapshot at subscription create time
`disable_on_error = true`	`false`	pause the subscription on apply error instead of retrying forever
`streaming = 'parallel'`	`'off'`	apply large transactions in parallel on the subscriber (PG16+)

Rule of thumb. Bi-directional logical replication is a real PG16+ feature, but reach for it only when you genuinely need active-active. For 95% of CDC pipelines, one-way source → sink is simpler and safer.

Senior interview question on logical replication scope

A senior interviewer often opens with: "Your team wants to replicate Postgres into Snowflake and into a Kafka topic for the real-time fraud team. Walk me through how you would design the logical replication setup — one slot or two, one publication or two, and why."

Solution Using two slots and two publications with shared output plugin

-- Two publications — Snowflake and fraud team have different needs
CREATE PUBLICATION snowflake_pub
  FOR TABLE orders, payments, customers, products
  WITH (publish = 'insert, update, delete, truncate');

CREATE PUBLICATION fraud_pub
  FOR TABLE payments WHERE (amount > 100),
  TABLE login_events
  WITH (publish = 'insert, update');

-- Two slots — independent consumer cursors
-- Snowflake uses the Fivetran Postgres connector (creates its own slot)
-- Fraud team uses Debezium Postgres connector (creates its own slot)

-- Inspect:
SELECT slot_name, plugin, slot_type, database, active, wal_status,
       confirmed_flush_lsn, restart_lsn
  FROM pg_replication_slots;

Step-by-step trace.

Step	Decision	Why
1. One publication or two?	Two	Snowflake needs all tables + truncate; fraud team only needs `payments` (filtered) + `login_events`
2. One slot or two?	Two	Slots are per-consumer cursors; one shared slot would force both consumers to advance at the slowest one (= disk fill if either stalls)
3. Output plugin	`pgoutput` for both	Standard, binary, supported by Debezium 2.x and Fivetran's Postgres connector
4. Row filter on `fraud_pub`	`WHERE amount > 100`	reduces WAL volume to the fraud team by ~5x
5. Monitoring	poll `pg_replication_slots` every 60s	alert if `pg_current_wal_lsn() - restart_lsn > 50 GB`

After the design pass, both consumers are independent — the fraud team can stall for a few minutes without blocking Snowflake, and Snowflake can have a multi-hour outage without disturbing fraud (within max_slot_wal_keep_size).

Output:

Layer	Snowflake	Fraud team
Publication	`snowflake_pub` (4 tables, full DML + truncate)	`fraud_pub` (filtered `payments` + `login_events`)
Slot	`fivetran_slot`	`debezium_slot`
Plugin	`pgoutput`	`pgoutput`
WAL volume	100%	~20% (row filter)
Independence	yes (own slot)	yes (own slot)

Why this works — concept by concept:

Two publications, two intents — Snowflake wants the full table set for downstream analytics; fraud wants a filtered slice in real time. Two publications keep the wire payload minimal for each consumer.
One slot per consumer — a slot is a cursor into the WAL. If two consumers share a slot, the slot advances only when the slowest consumer commits — meaning a slow consumer holds WAL for the fast one. Independent slots decouple their lifetimes.
pgoutput across the board — both downstream tools (Fivetran's Postgres connector and Debezium 2.x) speak pgoutput natively. No per-consumer plugin choice, no .so files to deploy.
Row filter at publication time — pushing the amount > 100 filter into the publication means the fraud slot never sees the small-amount events; less WAL pressure, less network egress, fewer events to drop downstream.
Cost — slots add O(WAL volume) retention pressure on the source; two slots = O(slowest of the two). Monitor restart_lsn per slot and alert on lag thresholds; this turns a midnight outage into a 4pm Slack ping.

SQL
Topic — SQL
Postgres CDC SQL practice problems

Practice →

ETL Topic — ETL CDC pipeline design problems

Practice →

2. Logical replication architecture

`logical decoding` is a four-stage pipeline — WAL → output plugin → replication slot → subscription — and every CDC tool plugs into one of those stages

The mental model in one line: logical replication is a pipeline of four components — the WAL on the source, an output plugin that decodes WAL into change events, a replication slot that holds those events durably until consumed, and a subscription (or external consumer) that applies them downstream. Once you say that out loud, every pg_logical interview question becomes a deduction from "WAL is the source of truth; the slot is the cursor; the plugin is the wire format; the subscription is the apply worker."

The four stages, end-to-end.

WAL (write-ahead log). Every Postgres modification — INSERT, UPDATE, DELETE, DDL — appends records to the WAL before touching the heap. WAL is the source of truth for both crash recovery and replication. wal_level = logical adds extra information to each WAL record so the decoder can reconstruct the row-level change (not just the byte change). The extra info costs ~10–15% WAL volume compared to wal_level = replica.
Output plugin. A loadable Postgres extension that reads WAL records and emits a stream of decoded change events. pgoutput (in-tree), wal2json (JSON), decoderbufs (Protobuf). The plugin runs inside the Postgres backend process — specifically inside the WAL sender backend that handles the replication connection.
Replication slot. A persistent server-side cursor into the WAL. Holds two LSNs — restart_lsn (the WAL position we must keep for restart) and confirmed_flush_lsn (the position the consumer has confirmed). The slot prevents Postgres from recycling WAL until the consumer has acked. Survives restarts.
Subscription / external consumer. Either a Postgres CREATE SUBSCRIPTION (Postgres-to-Postgres) or an external consumer using the replication protocol (Debezium, Fivetran, custom Python with psycopg.replication, etc.). The subscription opens a streaming connection, advances the slot as it applies, and is responsible for crash safety on the consumer side.

The replication protocol — what wire actually carries.

The publisher exposes a special connection type — the replication connection — that you reach with psql 'replication=database host=… user=…' or with a client library's replication mode. This is distinct from a regular SQL connection.
On a replication connection you can issue START_REPLICATION SLOT my_slot LOGICAL 0/0 (proto_version '4', publication_names 'my_pub') to begin streaming.
The server then pushes a sequence of XLogData messages (containing decoded change events) and periodic Primary keepalive messages. The client acks progress with Standby status update messages, advancing confirmed_flush_lsn on the slot.

Snapshot + stream — how new subscribers catch up.

A brand-new subscription needs to start somewhere. Postgres ships an initial snapshot — a serialisable read of every published table at the slot's consistent_point LSN — followed by streaming every WAL change after that point.
Debezium calls this snapshot.mode = 'initial' (the default). Once the snapshot is done, it switches to streaming mode and reads from the slot.
The snapshot uses regular SQL SELECT * FROM …; for very large tables this can take hours. Production teams often pre-populate the target via pg_dump and start the subscription with copy_data = false.

DDL — the great gotcha.

Logical replication does not replicate DDL. ALTER TABLE orders ADD COLUMN tax NUMERIC on the publisher does not run on the subscriber.
The next INSERT that includes the tax column either fails (if the subscriber column does not exist) or silently drops the value (if the publication does not include the column in its column list).
Production teams keep schemas in sync via a separate migration tool (Liquibase, Flyway, Sqitch) applied to both sides. Apply DDL on the subscriber first, then on the publisher — this avoids "column missing on subscriber" errors during the apply.

Common interview probes on the architecture.

"Where does logical decoding run — in the backend, the WAL writer, a separate process?" — inside the WAL sender backend that handles the replication connection.
"What is the difference between restart_lsn and confirmed_flush_lsn?" — restart_lsn is the earliest WAL position the slot must keep (for restart safety); confirmed_flush_lsn is the position the consumer has confirmed it durably applied.
"Does logical replication replicate DDL?" — no, you must apply DDL on both sides via a separate tool.
"How does a new subscriber catch up — replay WAL from offset 0?" — no, it takes a snapshot at the consistent_point LSN and then streams from there.

Worked example — full-loop PG → PG logical replication in 6 commands

Detailed explanation. The canonical "set up logical replication" question. Two Postgres instances, one table, six SQL commands. The output is a continuous row-level stream from the publisher to the subscriber. Memorising this sequence is the floor for the postgres logical replication interview surface.

Question. Set up logical replication of an orders table from pg-pub to pg-sub. Show every step and explain the role of each command.

Input.

Side	Schema
pg-pub	`orders (id int PK, customer_id int, amount numeric, status text)`
pg-sub	`orders` table created with identical schema (DDL applied separately)

Code.

-- 1. On pg-pub — set wal_level and restart
ALTER SYSTEM SET wal_level = 'logical';
ALTER SYSTEM SET max_wal_senders = 10;
ALTER SYSTEM SET max_replication_slots = 10;
-- pg_ctl restart   (wal_level needs a restart)

-- 2. On pg-pub — create a publication
CREATE PUBLICATION orders_pub FOR TABLE orders;

-- 3. On pg-pub — create a replication user
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'secret';
GRANT SELECT ON orders TO replicator;
-- pg_hba.conf: host replication replicator pg-sub-ip/32 scram-sha-256

-- 4. On pg-sub — create the target table (DDL is NOT replicated)
CREATE TABLE orders (id int PRIMARY KEY, customer_id int,
                     amount numeric, status text);

-- 5. On pg-sub — create the subscription
CREATE SUBSCRIPTION orders_sub
  CONNECTION 'host=pg-pub dbname=app user=replicator password=secret'
  PUBLICATION orders_pub;

-- 6. On pg-sub — verify it is streaming
SELECT subname, subenabled, subslotname FROM pg_subscription;
SELECT * FROM pg_stat_subscription;

Step-by-step explanation.

wal_level = 'logical' instructs Postgres to record extra info in every WAL record so the decoder can reconstruct row-level changes. Without this, logical decoding fails with "logical decoding requires wal_level >= logical." The setting requires a server restart.
CREATE PUBLICATION orders_pub FOR TABLE orders advertises the orders table as a logical replication source. The default publish list is 'insert, update, delete, truncate'. You can scope to subset DML (WITH (publish = 'insert') for insert-only feeds).
The replication user needs the REPLICATION role attribute and LOGIN. pg_hba.conf must allow the user to connect with the special replication database name (this is hardcoded; you do not create a database literally named replication).
The subscriber's table is regular — Postgres applies normal SQL DML on this side. You MUST create the table before the subscription, otherwise the snapshot phase fails with "relation does not exist."
CREATE SUBSCRIPTION orders_sub does three things atomically: connects to the publisher, creates a logical replication slot on the publisher (default name = subscription name), and starts a snapshot+stream. Within seconds, pg_stat_subscription.received_lsn starts advancing on the subscriber.
pg_subscription shows the configured subscription; pg_stat_subscription shows live activity (last received LSN, last applied LSN, worker PID). Use both to verify the loop is healthy.

Output.

Component	Lives on	What it does
`wal_level = logical`	publisher	adds row-level info to WAL records
`PUBLICATION orders_pub`	publisher	advertises which tables + DML to ship
`REPLICATION` role	publisher	grants the consumer permission to open a replication connection
`pg_hba.conf` rule	publisher	network-level allow for the replication user
`CREATE TABLE orders`	subscriber	the target table — DDL is not replicated, you create it by hand
`SUBSCRIPTION orders_sub`	subscriber	creates the slot on the publisher and starts streaming

Rule of thumb. If you can recite this 6-command sequence from memory in an interview, you have demonstrated end-to-end fluency with postgres logical replication. Every CDC tool (Debezium, Fivetran, Airbyte) collapses to a wrapper on top of steps 1–3 — they create the slot and stream from it, just like step 5 does.

Worked example — peeking at decoded changes without a subscriber

Detailed explanation. A great debugging trick: you can attach a logical slot to a SQL function and read decoded change events directly in psql, no Kafka, no Debezium, no subscriber. This is the fastest way to see what your pgoutput or wal2json plugin actually emits — useful in interviews when an interviewer pushes "show me what a change event looks like on the wire."

Question. Create a logical slot with the wal2json plugin, run a few INSERTs and UPDATEs, and read the decoded change events using pg_logical_slot_get_changes. Explain the difference between _get_ and _peek_ variants.

Input.

Operation	Row
INSERT	`(1, 'alice', 100)`
UPDATE	`id=1 SET amount = 150`
DELETE	`id=1`

Code.

-- Create the slot (one-time)
SELECT pg_create_logical_replication_slot('debug_slot', 'wal2json');

-- Run some DML
INSERT INTO orders (id, customer, amount) VALUES (1, 'alice', 100);
UPDATE orders SET amount = 150 WHERE id = 1;
DELETE FROM orders WHERE id = 1;

-- Drain the slot — _get_ ADVANCES the slot
SELECT lsn, xid, data
  FROM pg_logical_slot_get_changes('debug_slot', NULL, NULL,
                                   'include-xids', '1',
                                   'pretty-print', '0',
                                   'include-timestamp', '1');

-- Peek without advancing — _peek_ leaves the slot in place
SELECT data FROM pg_logical_slot_peek_changes('debug_slot', NULL, NULL);

-- Drop when done — slots that linger fill the disk
SELECT pg_drop_replication_slot('debug_slot');

Step-by-step explanation.

pg_create_logical_replication_slot('debug_slot', 'wal2json') creates a slot bound to the wal2json plugin. From this moment on, Postgres retains WAL on this slot's behalf — even if you never read from it. (This is the dangerous bit.)
pg_logical_slot_get_changes('debug_slot', NULL, NULL, …) reads decoded changes since the last position and advances confirmed_flush_lsn. WAL up to that point can now be recycled. The first two NULLs are upto_lsn and upto_nchanges — leaving them NULL drains everything.
The output for the INSERT looks like {"change":[{"kind":"insert","schema":"public","table":"orders","columnnames":["id","customer","amount"],"columnvalues":[1,"alice",100]}]}. The UPDATE is similar with "kind":"update" and an extra "oldkeys" block (or "oldtuple" if REPLICA IDENTITY FULL).
pg_logical_slot_peek_changes reads the same data but does not advance. Useful when you want to inspect the wire without affecting consumer progress.
pg_drop_replication_slot('debug_slot') releases the slot. Forgetting this step is the #1 way to fill your disk — a leftover slot retains WAL forever until dropped.

Output (one decoded event).

LSN	XID	Decoded JSON
0/16ABCDE	7423	`{"change":[{"kind":"insert","table":"orders","columnvalues":[1,"alice",100]}]}`
0/16ABF12	7424	`{"change":[{"kind":"update","table":"orders","columnvalues":[1,"alice",150],"oldkeys":{"keyvalues":[1]}}]}`
0/16AC056	7425	`{"change":[{"kind":"delete","table":"orders","oldkeys":{"keyvalues":[1]}}]}`

Rule of thumb. When debugging a CDC pipeline that "isn't emitting events," create a parallel slot with wal2json, peek at the same DML, and confirm Postgres is decoding correctly. If pg_logical_slot_peek_changes shows the events, the bug is downstream (in Debezium / Fivetran / the apply worker). If it does not, the bug is in wal_level, REPLICA IDENTITY, or the publication.

Worked example — row filtering and column lists since PG15

Detailed explanation. Before PG15, a publication was "all rows of these tables." Since PG15, you can scope to a WHERE clause per table (row filter) and to a subset of columns (column list). This dramatically reduces wire size on tenant-multi pipelines where each tenant only wants its own rows.

Question. Create a publication that ships only status = 'shipped' orders, and only the (id, customer_id, amount) columns — not notes or internal_score.

Input.

Column	Ship?
id (PK)	yes
customer_id	yes
amount	yes
status	no (filter only)
notes	no
internal_score	no

Code.

-- Row filter + column list (PG15+)
CREATE PUBLICATION shipped_orders_pub
  FOR TABLE orders (id, customer_id, amount)
  WHERE (status = 'shipped');

-- Verify
SELECT pubname, prrelid::regclass, prqual, prattrs
  FROM pg_publication p
  JOIN pg_publication_rel pr ON p.oid = pr.prpubid
 WHERE pubname = 'shipped_orders_pub';

Step-by-step explanation.

FOR TABLE orders (id, customer_id, amount) is the column list — only these three columns appear in the change event payload. status is referenced in the filter but does not ship to the subscriber.
WHERE (status = 'shipped') is the row filter — only rows where status = 'shipped' produce events. An UPDATE that transitions from 'pending' → 'shipped' emits an INSERT-equivalent change to the subscriber (since the row newly matches the filter). A transition from 'shipped' → 'cancelled' emits a DELETE-equivalent (since the row no longer matches).
Filters must be deterministic — no now(), no random(), no user-defined function unless it's marked IMMUTABLE. Postgres rejects non-deterministic filters at CREATE PUBLICATION time.
Filters CANNOT reference columns excluded from the column list. If you filter on internal_score > 0.8, you must include internal_score in the column list (even if you don't want to ship it — workaround: include + drop on the subscriber side).
Row filters are evaluated during decoding, so they reduce wire bytes from the publisher to the subscriber. Network savings compound on multi-tenant pipelines.

Output.

Source row	After row filter + column list
`(1, 42, 100, 'pending', 'note', 0.7)`	(filtered out — status != 'shipped')
`(2, 43, 50, 'shipped', 'note2', 0.9)`	`(2, 43, 50)` ships
`(3, 44, 200, 'shipped', NULL, NULL)`	`(3, 44, 200)` ships
`(4, 45, 30, 'cancelled', '', 0.1)`	(filtered out — status != 'shipped')

Rule of thumb. Use row filters to slice multi-tenant data per consumer; use column lists to redact PII from analytics-only sinks. Both compose; both reduce wire volume; both are PG15+ only.

Senior interview question on the architecture

A senior interviewer might ask: "A new logical replication subscription has been running for 30 minutes; pg_stat_subscription.received_lsn is advancing on the subscriber, but no rows are appearing in the target table. Walk me through how you debug this."

Solution Using staged inspection of WAL, slot, snapshot, apply worker

-- Step 1 — confirm WAL is being generated on the source
SELECT pg_current_wal_lsn();   -- should advance every second under load

-- Step 2 — confirm the slot is active and not stale
SELECT slot_name, active, active_pid, restart_lsn, confirmed_flush_lsn,
       pg_current_wal_lsn() - restart_lsn AS retained_bytes
  FROM pg_replication_slots
 WHERE slot_name = 'orders_sub';

-- Step 3 — confirm the publication includes the right tables
SELECT pubname, tablename
  FROM pg_publication_tables
 WHERE pubname = 'orders_pub';

-- Step 4 — on subscriber, check the snapshot phase
SELECT subname, srrelid::regclass, srsubstate, srsublsn
  FROM pg_subscription s
  JOIN pg_subscription_rel sr ON s.oid = sr.srsubid
 WHERE subname = 'orders_sub';
-- srsubstate values: i=init, d=copy in progress, s=synchronized, r=ready

-- Step 5 — on subscriber, look at apply worker logs
-- tail -f /var/log/postgresql/postgresql.log | grep -i logical

Step-by-step trace.

Step	What you check	What "broken" looks like
1. WAL is moving	`pg_current_wal_lsn()` increments	no WAL = no source activity
2. Slot is active	`active = true`, `confirmed_flush_lsn` advances	`active = false` = consumer disconnected
3. Publication has the table	`pg_publication_tables` row exists	empty = `FOR TABLE orders` was scoped wrong
4. Snapshot done	`srsubstate = 'r'` (ready) on the subscriber	`'i'` or `'d'` = snapshot still running
5. Apply worker healthy	no error in log	"could not apply" = downstream bug

In our scenario, Step 4 is the smoking gun — srsubstate = 'd' means the snapshot is still copying the table. received_lsn advances during snapshot but no rows commit on the subscriber side until snapshot completes. The fix is patience (for small tables) or copy_data = false plus a pre-populated target (for huge tables).

Output:

Phase	`srsubstate`	What rows you see on the subscriber
init	`i`	none
copy in progress	`d`	none (snapshot is one big serialisable read)
synchronized	`s`	snapshot done, streaming caught up
ready	`r`	live streaming, rows appear seconds after publisher commits

Why this works — concept by concept:

Slot vs snapshot vs apply — three independent stages that can each stall for different reasons. The slot is on the publisher; the snapshot is the initial table copy; the apply worker is the SQL DML on the subscriber. Diagnosing the right stage is half the work.
received_lsn vs applied_lsn — received_lsn advances as bytes arrive at the subscriber. applied_lsn (or latest_end_lsn) advances as rows commit. The two diverge during a snapshot.
srsubstate is the truth — the per-table subscription state is in pg_subscription_rel. Always check this view first; it tells you which phase each table is in.
pg_hba.conf is the silent killer — if the replication user can connect but cannot SELECT (missing GRANT), the snapshot fails halfway. Always check the publisher's role grants.
Cost — debugging is O(stages × tables). With 5 stages and N tables, the matrix is small enough to walk top-down in a single 10-minute investigation. Senior signal is doing this systematically rather than guessing.

ETL
Topic — ETL
Logical replication debug drills

Practice →

SQL Topic — SQL Postgres internals SQL problems

Practice →

3. Replication slots — power and danger

A `replication slot` is a persistent cursor into the WAL — useful for crash-safe consumers, lethal when the consumer dies and the slot lives on

The mental model in one line: a replication slot is a server-side cursor that promises "I will not throw away WAL beyond this LSN until the named consumer has acked it" — which is exactly what makes CDC consumers crash-safe, and exactly what fills your disk when a consumer dies and nobody drops the slot. Once you internalise "a slot retains WAL," every senior replication slot interview question becomes a deduction from "who drops dead slots, and how do you bound the damage."

The two LSNs every slot tracks.

restart_lsn — the earliest WAL position Postgres must retain so the consumer can restart from a clean point. Postgres can recycle WAL strictly older than restart_lsn for every slot. The minimum restart_lsn across all slots controls how much WAL accumulates on disk.
confirmed_flush_lsn — the position the consumer has confirmed it has durably applied. The next stream restart will resume from here. For logical slots, confirmed_flush_lsn advances on every Standby status update from the consumer.
The relationship: restart_lsn ≤ confirmed_flush_lsn ≤ pg_current_wal_lsn(). Slot lag is pg_current_wal_lsn() - restart_lsn.

Slot types — physical vs logical.

Physical slot — used by streaming (physical) replication. Holds WAL for a hot-standby connection. Cursor advances with the standby's pg_stat_replication.flush_lsn.
Logical slot — used by logical replication. Bound to an output plugin (pgoutput, wal2json, …) and a database. Cursor advances on consumer acks.
A given Postgres has max_replication_slots slots total (default 10). Each slot, of either kind, retains WAL until released.

The phantom slot — the production failure mode.

A consumer disconnects (Debezium pod crashed, Fivetran connector paused, custom script terminated). The slot remains on the publisher, marked active = false.
The slot still holds restart_lsn at the consumer's last ack. WAL beyond that point keeps accumulating.
Hours later, pg_wal/ is 200GB and the database starts refusing writes ("no space left on device"). This is the phantom slot anti-pattern. Every senior Postgres outage post-mortem you have ever read has a section on it.
The fix: monitor pg_replication_slots.confirmed_flush_lsn and alert if a slot has not advanced in N minutes; drop the slot when the consumer is gone for good.

max_slot_wal_keep_size — the brake.

Introduced in PG13. Configurable cap on how much WAL a slot can retain. When exceeded, Postgres marks the slot invalidated (wal_status = 'lost') and allows WAL recycling.
This trades consumer correctness for source survival: an invalidated slot must be dropped + recreated + the consumer must replay from a snapshot. But the source database stays up.
Set this. Always. max_slot_wal_keep_size = 50GB (or whatever fits 4–8 hours of WAL on your busiest hour) is the standard production recipe.

Slot inspection views.

pg_replication_slots — every slot, its plugin, its LSNs, its active flag, its wal_status (reserved, extended, unreserved, lost).
pg_stat_replication — every live replication connection (physical AND logical). state, sent_lsn, flush_lsn, replay_lsn.
pg_stat_replication_slots — per-slot stats (spilled transactions, stream bytes). Useful for tuning logical_decoding_work_mem.

Common interview probes on slots.

"What does a replication slot actually hold — bytes? offsets?" — two LSNs (restart_lsn, confirmed_flush_lsn) and metadata. The WAL itself is in pg_wal/; the slot just prevents recycling.
"What happens to a slot when the consumer disconnects?" — active = false but the slot persists. WAL retention continues until you drop or invalidate the slot.
"What is max_slot_wal_keep_size?" — the per-cluster cap on how much WAL a slot can retain. Beyond it, the slot is invalidated.
"How do you safely drop an active slot?" — terminate the WAL sender backend first (pg_terminate_backend(active_pid)), then pg_drop_replication_slot.

Worked example — inspecting a healthy slot

Detailed explanation. The first thing you do on any production Postgres CDC pipeline is inspect the slot. Three views — pg_replication_slots, pg_stat_replication, pg_current_wal_lsn — tell you whether the consumer is alive, how far behind it is, and how much WAL it is currently holding.

Question. Show the SQL that gives you a one-row-per-slot snapshot of slot health, including the lag in bytes and a derived "minutes of WAL" estimate.

Input.

Slot	restart_lsn	confirmed_flush_lsn	active	wal_status
debezium_slot	0/3A00 0000	0/3A00 0000	true	reserved
fivetran_slot	0/38FF 8000	0/38FF C000	true	reserved

Code.

SELECT slot_name,
       plugin,
       slot_type,
       active,
       active_pid,
       wal_status,
       pg_size_pretty(pg_current_wal_lsn() - restart_lsn) AS retained_wal,
       pg_size_pretty(pg_current_wal_lsn() - confirmed_flush_lsn) AS lag,
       pg_current_wal_lsn() AS source_lsn,
       restart_lsn,
       confirmed_flush_lsn
  FROM pg_replication_slots
 ORDER BY (pg_current_wal_lsn() - restart_lsn) DESC;

Step-by-step explanation.

pg_current_wal_lsn() is the publisher's current WAL position. Subtracting restart_lsn gives the retained WAL bytes — the actual disk pressure on pg_wal/.
pg_current_wal_lsn() - confirmed_flush_lsn gives the consumer's lag — how far behind real time it is. The difference between this and retained_wal is the gap between "consumer ack" and "what restart needs."
active = true means a WAL sender backend is connected. active = false is the danger signal — slot persisting without a consumer is the phantom-slot scenario.
wal_status ladders from reserved (healthy) → extended (over wal_keep_size, still allowed) → unreserved (about to hit max_slot_wal_keep_size) → lost (invalidated; slot must be dropped).
pg_size_pretty(...) formats the byte count as 1.2 GB / 300 MB — easier to skim than raw bytes. Production dashboards do this and alert when any slot crosses 1 GB retained.

Output.

slot_name	active	retained_wal	lag	wal_status
debezium_slot	true	14 MB	0 bytes	reserved
fivetran_slot	true	270 MB	256 MB	reserved
zombie_slot	false	18 GB	18 GB	extended

Rule of thumb. Two thresholds. Page on retained_wal > 5 GB for any slot. Auto-drop on active = false AND retained_wal > 20 GB. The first warns; the second protects the database from running out of disk.

Worked example — engineering a phantom slot (and then fixing it)

Detailed explanation. The fastest way to internalise the phantom-slot failure mode is to engineer one. Create a slot, do some DML, kill the consumer, watch WAL pile up, then inspect and clean up. Production teams run this drill as part of on-call training.

Question. Simulate a phantom slot on a test Postgres: create a slot, generate WAL, never read from it, observe pg_wal/ growth, then drop the slot to release the WAL.

Input.

Step	Action
1	create slot, no consumer ever attaches
2	run a tight INSERT loop for a minute
3	inspect `pg_wal/` size
4	drop the slot
5	inspect again — should drop back

Code.

-- 1. Create the slot — no consumer attaches
SELECT pg_create_logical_replication_slot('phantom_slot', 'pgoutput');

-- 2. Generate WAL
DO $$
BEGIN
  FOR i IN 1..1000000 LOOP
    INSERT INTO orders(id, customer_id, amount, status)
      VALUES (i, (i % 1000)+1, (random()*1000)::numeric(10,2), 'pending');
    IF i % 50000 = 0 THEN COMMIT; END IF;
  END LOOP;
END$$;

-- 3. Inspect
SELECT slot_name, pg_size_pretty(pg_current_wal_lsn() - restart_lsn) AS retained
  FROM pg_replication_slots WHERE slot_name = 'phantom_slot';

-- Shell:  du -sh /var/lib/postgresql/16/main/pg_wal

-- 4. Drop the slot — releases WAL
SELECT pg_drop_replication_slot('phantom_slot');

-- 5. Force a checkpoint so WAL files are recycled
CHECKPOINT;

-- Shell again: du -sh ... — should now shrink

Step-by-step explanation.

The slot exists, plugin = pgoutput, but no client ever calls START_REPLICATION SLOT phantom_slot LOGICAL …. The restart_lsn is frozen at slot-creation time.
The INSERT loop generates ~150–300 MB of WAL on a small box. Without a slot, Postgres recycles WAL every checkpoint. With our phantom slot, none of it can be recycled — pg_wal/ grows by the full ~300 MB.
pg_replication_slots shows retained ≈ 300 MB and active = false. The active = false flag is the signal that a real on-call alert would fire on.
pg_drop_replication_slot('phantom_slot') removes the slot. Postgres no longer needs to retain its WAL.
Issuing CHECKPOINT forces WAL recycling. pg_wal/ shrinks. The database is back to normal.

Output.

Stage	`pg_wal/` size	slot count	slot active?
Before INSERT loop	64 MB	1 (phantom)	false
After INSERT loop (slot still there)	320 MB	1 (phantom)	false
After drop slot + CHECKPOINT	64 MB	0	n/a

Rule of thumb. Phantom slots are the most preventable Postgres CDC outage. Three guardrails kill it for good: (1) max_slot_wal_keep_size = 50GB; (2) on-call alert on active = false for any logical slot > 5 minutes; (3) the consumer's deploy script always either reuses the existing slot or drops the old one before creating a new one.

Worked example — `max_slot_wal_keep_size` as a circuit breaker

Detailed explanation. max_slot_wal_keep_size (PG13+) is the brake. When a slot's retained WAL exceeds the cap, Postgres invalidates the slot and recycles WAL anyway. The slot can be dropped + recreated; the consumer must replay from a snapshot. The database stays up — that is the win.

Question. Set max_slot_wal_keep_size = 8GB, then drive a slot past the cap (no consumer reading), and observe the invalidation.

Input.

Knob	Value
`max_slot_wal_keep_size`	8GB
WAL generated with no consumer	12GB

Code.

-- 1. Set the cap (requires reload, not restart)
ALTER SYSTEM SET max_slot_wal_keep_size = '8GB';
SELECT pg_reload_conf();

-- 2. Create a slot, generate > 8GB WAL with no consumer attached
SELECT pg_create_logical_replication_slot('cap_test', 'pgoutput');

-- ... DML loop until pg_wal/ ~ 9GB ...

-- 3. Observe invalidation
SELECT slot_name, wal_status, active,
       pg_size_pretty(pg_current_wal_lsn() - restart_lsn) AS retained
  FROM pg_replication_slots WHERE slot_name = 'cap_test';
-- wal_status transitions: reserved → extended → unreserved → lost

-- 4. The slot is now 'lost' — drop it and recreate
SELECT pg_drop_replication_slot('cap_test');
SELECT pg_create_logical_replication_slot('cap_test', 'pgoutput');
-- consumer must replay from a snapshot

Step-by-step explanation.

max_slot_wal_keep_size = '8GB' caps each slot's retention at 8GB. pg_reload_conf() applies it without restart.
As the slot's retained WAL grows past 8GB, Postgres flips wal_status from reserved to extended (between wal_keep_size and max_slot_wal_keep_size) to unreserved (about to exceed the cap) to lost (cap exceeded, WAL recycled, slot invalidated).
When wal_status = 'lost', the slot can no longer be used for replication — the WAL it needed for restart has been recycled. Any consumer attempting to attach to this slot gets an error.
The recovery: drop the slot, create a new one, and the consumer must do a fresh snapshot. This is painful but bounded — the database does not run out of disk.
The trade-off: a strict cap protects the database but breaks consumers that go offline longer than the cap covers. Set the cap to 4–8x your worst-case consumer downtime; alert at 50% of the cap so on-call gets a chance to intervene before invalidation.

Output.

Retained WAL	`wal_status`	Slot usable?	Database health
2 GB	reserved	yes	healthy
6 GB	extended	yes (over `wal_keep_size`)	warning
7.5 GB	unreserved	yes (close to cap)	page on-call
9 GB	lost	NO (invalidated)	database stays up; consumer must reinit

Rule of thumb. max_slot_wal_keep_size is the difference between "database falls over" and "consumer needs to replay." Always pick the latter. Set the cap to cover your acceptable consumer downtime and tune up if you keep invalidating slots, not the other way around.

Senior interview question on slot operations

A senior interviewer might ask: "A CDC consumer (Debezium) crashed three days ago and the team never noticed. pg_wal/ is now at 180GB and writes are starting to slow. Walk me through how you stabilise the database without losing the CDC stream."

Solution Using triage, invalidation, snapshot replay

-- Step 1 — identify the dead slot
SELECT slot_name, active, active_pid, wal_status,
       pg_size_pretty(pg_current_wal_lsn() - restart_lsn) AS retained,
       confirmed_flush_lsn
  FROM pg_replication_slots
 WHERE active = false
 ORDER BY (pg_current_wal_lsn() - restart_lsn) DESC;

-- Step 2 — decide: revive the consumer (preserves stream) OR drop (loses gap)
-- Quick-revive path:
--   start a new Debezium pod with the SAME slot name
--   Debezium picks up from confirmed_flush_lsn, replays the 3-day gap

-- Crisis path (database about to die):
SELECT pg_drop_replication_slot('debezium_slot');
CHECKPOINT;                         -- recycle the WAL immediately
-- Recreate + snapshot from scratch:
--   delete Debezium connector state in Kafka Connect
--   redeploy with snapshot.mode='initial'
--   Debezium re-snapshots all tables, then streams from current WAL

Step-by-step trace.

Step	Decision	Trade-off
1. Diagnose	`active = false` slot with 180 GB retained	confirms phantom
2. Revive vs drop	revive if consumer can come back within hours; drop if not	revive preserves stream; drop loses the gap
3. Revive path	redeploy Debezium with same slot name	catches up via WAL replay (slow but lossless)
4. Drop + snapshot path	`pg_drop_replication_slot` → `CHECKPOINT` → restart with `snapshot.mode='initial'`	database recovers immediately; snapshot reseeds
5. Post-mortem	add alert on `active = false` for any logical slot > 5 min	prevents the next outage

In our scenario, the revive path is preferred IF Debezium can come back quickly — but with 180GB of WAL retained, the replay alone will take hours, during which the database is still at risk. The pragmatic call is drop + re-snapshot unless the downstream genuinely cannot tolerate a snapshot reseed.

Output:

Path	Database recovery	Consumer impact
Revive same slot	hours (WAL replay)	zero data loss
Drop + re-snapshot	seconds (after `CHECKPOINT`)	downstream sees a brief "reset" — duplicates possible until first commit
Do nothing	database runs out of disk	full outage on the source

Why this works — concept by concept:

Slot retention is the root cause — the slot's restart_lsn is pinning 180GB of WAL. Until the slot is either advanced or dropped, that WAL stays. No amount of VACUUM or CHECKPOINT reclaims it.
Revive preserves history — Debezium's consumer state in Kafka Connect remembers the last LSN it acked; reattaching the same slot resumes from there. Lossless but slow on huge gaps.
Drop forces a snapshot reseed — the downstream loses the unflushed deltas, but Debezium's snapshot.mode = 'initial' re-reads every table at the new consistent_point and resumes from there. Bounded recovery time.
Alert on active = false — the single highest-ROI alert in a Postgres CDC stack. Every minute of detection delay = MBs to GBs of additional WAL retention.
Cost — drop-path is O(snapshot), dominated by table size. Revive-path is O(WAL gap), dominated by retained WAL. Pick whichever is smaller; in 180 GB-of-WAL territory, snapshot usually wins.

SQL
Topic — SQL
Replication slot SQL problems

Practice →

ETL Topic — ETL Postgres CDC operations drills

Practice →

4. Publications and REPLICA IDENTITY

A `publication subscription` pair picks what gets replicated; REPLICA IDENTITY picks what columns the source ships for UPDATEs and DELETEs — get both wrong and your sink silently drops rows

The mental model in one line: a publication is the contract on the publisher ("these tables, these DML operations, optionally these rows and columns"); REPLICA IDENTITY is the contract on each table ("for UPDATE/DELETE, here is how to identify the row that changed") — and both must align with the subscriber's apply logic, or rows go missing without errors. Once you say that out loud, every senior publication subscription interview probe becomes a question about which knob is wrong.

Publications — the publisher-side contract.

CREATE PUBLICATION p FOR TABLE orders, payments — list of tables to ship.
WITH (publish = 'insert, update, delete, truncate') — DML scope. Default is all four. Setting publish = 'insert' ships only INSERTs (audit-log pattern).
WHERE (status = 'shipped') — row filter (PG15+). Only matching rows ship.
(id, amount) after the table name — column list (PG15+). Only listed columns ship.
FOR ALL TABLES — every table in the database; auto-includes future tables. Used by Debezium when you want full-database CDC.

REPLICA IDENTITY — the per-table ID contract.

DEFAULT — ships the primary key columns in the change event. The standard case. Requires a PRIMARY KEY constraint on the table.
FULL — ships every column of the previous row state. Used when you have no PK or when downstream needs the full pre-image. Doubles WAL volume for UPDATEs and DELETEs.
USING INDEX idx_name — ships the columns of a specified UNIQUE index. Useful when the table has a natural unique key but no formal PK.
NOTHING — ships nothing for UPDATE/DELETE. INSERTs still ship. The table effectively becomes INSERT-only for logical replication.

The classic gotcha — UPDATE on a table without PK.

Default REPLICA IDENTITY = DEFAULT, which requires a PK.
Run UPDATE orders SET amount = 100 WHERE order_id = 5 on a table without a PK.
The UPDATE succeeds on the publisher. On the subscriber, the apply worker logs cannot find row to update because no row identifier was shipped.
The row "disappears" from the subscriber's view. Every senior Postgres CDC engineer has seen this bug at least once.
The fix: either add a PK (always preferred) or ALTER TABLE … REPLICA IDENTITY FULL.

Row filters and column lists — combining them.

Filters and column lists compose: FOR TABLE orders (id, customer_id, amount) WHERE (status = 'shipped').
A row that changes from matching the filter to not matching emits a DELETE on the subscriber (the row is no longer in the published set).
A row that changes from not matching to matching emits an INSERT.
Filters must reference only columns in the column list (or columns that are part of the REPLICA IDENTITY).
Filters must be IMMUTABLE — no now(), no random(), no user-defined function unless explicitly marked.

TRUNCATE and DDL — the special cases.

TRUNCATE is replicated only if the publication's publish list includes truncate (default yes). The subscriber TRUNCATES its copy.
DDL is NOT replicated in logical replication. ALTER TABLE … ADD COLUMN … runs on the publisher only. The subscriber must add the column separately (and ideally first, so the next INSERT does not fail).
Sequences are not replicated either. Use CREATE PUBLICATION … FOR ALL SEQUENCES (PG16+) only for the special case; most teams skip it.

Common interview probes on publications + REPLICA IDENTITY.

"Why does my UPDATE on a table without PK silently disappear on the subscriber?" — REPLICA IDENTITY DEFAULT requires a PK; without one, no row identifier is shipped.
"What does REPLICA IDENTITY FULL cost?" — doubles WAL for UPDATE/DELETE (must record the entire previous row state).
"How do you replicate a subset of a table?" — CREATE PUBLICATION p FOR TABLE t (c1, c2) WHERE (cond) (PG15+).
"Does logical replication handle DDL?" — no, apply DDL on both sides via a migration tool.

Worked example — REPLICA IDENTITY FULL for a table with no primary key

Detailed explanation. A legacy audit_log table never had a PK — every row is a tuple of (actor, action, timestamp, payload) with natural duplicates. Logical replication of UPDATEs and DELETEs requires REPLICA IDENTITY. The fix: REPLICA IDENTITY FULL.

Question. Set up logical replication of audit_log (no PK) such that UPDATEs and DELETEs replicate correctly to the subscriber.

Input.

Column	Type
actor	text
action	text
ts	timestamptz
payload	jsonb

Code.

-- 1. Mark the table with REPLICA IDENTITY FULL
ALTER TABLE audit_log REPLICA IDENTITY FULL;

-- 2. Verify
SELECT relname, CASE relreplident
                  WHEN 'd' THEN 'default (PK)'
                  WHEN 'f' THEN 'full'
                  WHEN 'i' THEN 'using index'
                  WHEN 'n' THEN 'nothing'
                END AS replica_identity
  FROM pg_class WHERE relname = 'audit_log';

-- 3. Now create the publication
CREATE PUBLICATION audit_pub FOR TABLE audit_log;

Step-by-step explanation.

REPLICA IDENTITY FULL instructs Postgres to record the entire previous row state in WAL whenever the row is UPDATED or DELETED. The change event now ships all four columns plus the new values, so the subscriber can locate the row by full-row equality.
pg_class.relreplident exposes the per-table setting. d = DEFAULT, f = FULL, i = USING INDEX, n = NOTHING.
Adding the publication is now safe — UPDATEs and DELETEs ship enough info for the subscriber to apply them.
The cost is WAL volume. For an UPDATE on a row with payload as a 4KB JSON blob, WAL records the entire 4KB twice (old + new). On high-update tables this can double WAL size; budget accordingly.
The right long-term fix is to add a real PK (e.g., a BIGSERIAL audit_id column). REPLICA IDENTITY FULL is a tactical bridge while you can't backfill a PK.

Output.

Knob	WAL volume cost on UPDATE	Subscriber locates row by
`REPLICA IDENTITY DEFAULT` (PK)	normal	primary key
`REPLICA IDENTITY FULL`	~2x	full previous row
`REPLICA IDENTITY USING INDEX uidx`	~1.2x	index columns
`REPLICA IDENTITY NOTHING`	UPDATE/DELETE error	n/a — silent drop

Rule of thumb. Add a PK if you possibly can — that is always the cheapest. Use REPLICA IDENTITY FULL only when a PK is impractical (legacy schemas, write-once audit tables). Never leave REPLICA IDENTITY NOTHING on a table being replicated; the rows will silently vanish on the subscriber.

Worked example — subset replication with row filter

Detailed explanation. A multi-tenant SaaS replicates Postgres into a per-tenant Snowflake schema. Only the rows where tenant_id = $current_tenant should ship to each tenant's slot. Row filters (PG15+) do this cleanly.

Question. Create a publication that ships only tenant_id = 42's orders + payments, with column lists that exclude internal columns.

Input.

Column	orders	payments
id (PK)	yes	yes
tenant_id	yes (filter only)	yes (filter only)
customer_id	yes (ship)	—
amount	yes (ship)	yes (ship)
internal_notes	yes (skip)	yes (skip)

Code.

CREATE PUBLICATION tenant_42_pub
  FOR TABLE orders   (id, customer_id, amount) WHERE (tenant_id = 42),
      TABLE payments (id, amount)              WHERE (tenant_id = 42)
  WITH (publish = 'insert, update, delete');

Step-by-step explanation.

Each table can have its own row filter and column list, all inside the same publication.
The filter tenant_id = 42 is evaluated during WAL decoding on the publisher — only matching rows ever enter the slot's event stream. Network and downstream both win.
The column list (id, customer_id, amount) ships only those columns. tenant_id is referenced in the filter but does not ship (PG15+ allows filter-only columns when REPLICA IDENTITY DEFAULT covers them; if not, you must include the filter column in the column list).
publish = 'insert, update, delete' is the default for non-truncate DML. Truncate is excluded here because TRUNCATE is table-wide and would clobber other tenants on the subscriber side.
A row that flips tenant_id from 42 to 99 emits a DELETE on the subscriber (the row left the published set). A row that flips from 99 to 42 emits an INSERT. The wire correctly represents the membership change.

Output.

Action on publisher	Effect on tenant 42's subscriber
INSERT `(id=1, tenant_id=42, amount=100)`	INSERT `(id=1, amount=100, customer_id=…)`
INSERT `(id=2, tenant_id=99, amount=50)`	nothing (filtered out)
UPDATE `id=1 tenant_id 42 → 99`	DELETE `id=1` (left the published set)
UPDATE `id=2 tenant_id 99 → 42`	INSERT `(id=2, amount=50, customer_id=…)` (joined the published set)
DELETE `id=1` (when tenant_id=42)	DELETE `id=1`

Rule of thumb. Row filters are the cleanest tenant-isolation primitive Postgres ships. Per-tenant publication + per-tenant slot is the standard multi-tenant CDC architecture. Avoid putting the tenant filter on the subscriber side — it wastes WAL and network and creates a lateral data leak risk if the subscriber's filter has a bug.

Worked example — column list to redact PII from an analytics sink

Detailed explanation. A customers table contains email, phone, and address columns subject to PII regulations. The analytics warehouse sink only needs customer_id, signup_date, segment. Column lists (PG15+) let you redact PII at the publication layer, so the data never reaches the analytics warehouse.

Question. Publish customers to the analytics warehouse with only (customer_id, signup_date, segment), dropping email, phone, address entirely.

Input.

Column	Ship to analytics?
customer_id (PK)	yes
signup_date	yes
segment	yes
email	NO (PII)
phone	NO (PII)
address	NO (PII)

Code.

CREATE PUBLICATION customers_analytics_pub
  FOR TABLE customers (customer_id, signup_date, segment);

-- Verify the column list
SELECT pubname, attname
  FROM pg_publication p
  JOIN pg_publication_rel pr ON p.oid = pr.prpubid
  JOIN pg_class c ON pr.prrelid = c.oid
  JOIN pg_attribute a ON a.attrelid = c.oid AND a.attnum = ANY(pr.prattrs::int2[])
 WHERE pubname = 'customers_analytics_pub'
 ORDER BY a.attnum;

Step-by-step explanation.

FOR TABLE customers (customer_id, signup_date, segment) is the column list. Only these three columns ship — email, phone, address never appear in the change event payload.
PII redaction at the publication layer is stronger than column-level GRANTs because no privileged downstream tool can introspect the column. The data physically does not leave the publisher.
The column list must include the REPLICA IDENTITY columns (the PK). Postgres rejects publications that omit them with "REPLICA IDENTITY columns must be in the column list."
UPDATE of a non-listed column (e.g., a phone-number change) still emits a change event on WAL, but the event only contains the listed columns. If the row's listed columns did not change, the subscriber sees an UPDATE with no actual data change — harmless but slightly wasteful.
Column lists compose with row filters — you can ship tenant_id = 42 rows with only PII-free columns, double-locked.

Output.

Action on publisher	Wire event to analytics
INSERT `(id=1, email=a@x.com, phone=…, address=…, segment='gold', signup_date=…)`	INSERT `(id=1, segment='gold', signup_date=…)`
UPDATE `id=1 SET phone='…'`	UPDATE `id=1` with no real data change (`segment` and `signup_date` unchanged)
UPDATE `id=1 SET segment='platinum'`	UPDATE `id=1 SET segment='platinum'`
DELETE `id=1`	DELETE `id=1`

Rule of thumb. PII redaction via column list is the strongest physical guarantee Postgres offers. Combined with a separate non-analytics publication for the OLTP pipeline, you get an "untrusted analytics sink" pattern that survives a downstream compromise.

Senior interview question on UPDATE replication

A senior interviewer might ask: "You set up logical replication on an events table that has no primary key. INSERTs are showing up on the subscriber but UPDATEs and DELETEs are not. Explain why, and walk through every option for fixing it."

Solution Using REPLICA IDENTITY analysis + the right knob for the schema

-- Diagnose: what is REPLICA IDENTITY for the table?
SELECT relname, relreplident
  FROM pg_class WHERE relname = 'events';
-- relreplident = 'd' (DEFAULT) + no PK = no row identifier shipped for UPDATE/DELETE

-- Fix option A — add a primary key (the right answer if possible)
ALTER TABLE events ADD COLUMN event_id BIGSERIAL PRIMARY KEY;

-- Fix option B — use an existing unique index (if you have one)
CREATE UNIQUE INDEX events_uniq ON events(actor, action, ts);
ALTER TABLE events REPLICA IDENTITY USING INDEX events_uniq;

-- Fix option C — REPLICA IDENTITY FULL (last resort)
ALTER TABLE events REPLICA IDENTITY FULL;
-- 2x WAL cost on UPDATEs, full-row scan on the subscriber side

Step-by-step trace.

Step	Diagnostic	Decision
1. Inspect REPLICA IDENTITY	`relreplident = 'd'`, no PK	UPDATEs ship no row id → subscriber cannot find row
2. Can we add a PK?	yes — a `BIGSERIAL event_id` makes sense	pick option A
3. If table is write-once + huge	adding a PK locks the table for hours	pick option B (existing unique index) or C (FULL)
4. If no unique index either	natural duplicates exist	pick option C (FULL) and accept the 2x WAL cost
5. After fix	verify with `pg_class.relreplident` flipped to `d` / `i` / `f`	re-run an UPDATE → check subscriber

In our scenario, option A is the right answer if the schema can take it. If the table is multi-TB and live, option B (existing unique index) is the cheapest middle ground. Option C is reserved for genuinely PK-less tables where you cannot add a unique index either.

Output:

Option	WAL cost on UPDATE	Subscriber locates row by	Effort
A — add PK	normal	PK	medium (DDL + maybe backfill)
B — REPLICA IDENTITY USING INDEX	~1.2x	index columns	low (DDL only)
C — REPLICA IDENTITY FULL	~2x	full previous row	low (single ALTER)
Do nothing	n/a — UPDATEs silently vanish	—	zero — but the bug is unsurvivable

Why this works — concept by concept:

REPLICA IDENTITY is per-table, not per-publication — you set it once on the table; every publication that includes the table inherits it.
DEFAULT requires a PK — the most common bug is "everyone assumed REPLICA IDENTITY DEFAULT works" without realising the table has no PK. The fix is always an explicit ALTER.
FULL is a tax, not a feature — useful for legacy tables but doubles WAL. Treat as the last resort, not the first.
USING INDEX bridges the gap — when the table has a natural unique key but no formal PK, USING INDEX gives you the cheaper REPLICA IDENTITY without a PK constraint.
Cost — WAL cost scales with O(updates × full_row_size) for FULL, O(updates × index_size) for USING INDEX, O(updates × pk_size) for DEFAULT. On a heavy-update table the difference is large; pick the cheapest viable option.

SQL
Topic — SQL
REPLICA IDENTITY SQL problems

Practice →

ETL Topic — ETL CDC publication design drills

Practice →

5. CDC into Kafka via Debezium

`debezium postgres` is a Kafka Connect source that talks the logical replication protocol — snapshot first, then stream, with heartbeats to keep idle slots fresh and a signal table for ad-hoc snapshots

The mental model in one line: Debezium is a Kafka Connect source connector that opens a logical replication connection to Postgres, creates (or reuses) a slot, takes an initial snapshot of every published table, and then streams every WAL change as a Kafka record on a per-table topic — with heartbeats to advance idle slots and a signal table to trigger ad-hoc snapshots. Once you internalise "Debezium is a pgoutput consumer with Kafka Connect ergonomics," every production postgres cdc interview question reduces to "what Debezium knob covers this case."

Debezium Postgres connector — the moving parts.

Kafka Connect — runs the connector. Provides offset storage, schema-history storage, REST API for config, distributed worker scaling.
Debezium PG connector — Java code that opens a JDBC connection (for snapshot) AND a replication connection (for streaming).
pgoutput plugin — the Debezium 2.x default. Reads WAL on the publisher and emits change events.
Per-table Kafka topics — Debezium emits each row change to {topic-prefix}.{schema}.{table} (e.g., app.public.orders). One topic per table.
Schema registry — Avro / Protobuf schemas published per-table. Snapshot phase creates the schemas; streaming phase emits records with them.

Snapshot + stream — the two phases.

Snapshot mode initial (default) — at connector start, Debezium opens a serialisable transaction on the publisher, takes a snapshot LSN (consistent_point), reads every published table via SELECT *, and emits each row as a Kafka INSERT event tagged with op = 'r' (read). When the snapshot finishes, Debezium switches to streaming mode and reads from the slot starting at consistent_point.
Snapshot mode never — skip the snapshot entirely, start streaming from the current LSN. Use when you have pre-loaded the target via pg_dump.
Snapshot mode always — always re-snapshot on connector start. Useful for debugging; never use in production.
Snapshot mode initial_only — snapshot once, never stream. One-off bulk load.

Heartbeats — keeping idle slots fresh.

If a published table has no DML for hours, the slot's confirmed_flush_lsn does not advance. Other (unrelated) WAL on the database still accumulates; the slot holds it.
The fix: heartbeat.interval.ms = 30000 — every 30s Debezium writes a no-op row to a debezium_heartbeat table on the publisher (which is included in the publication), which advances the slot.
Without heartbeats, an idle CDC table on a busy database can pin GBs of WAL. With heartbeats, the slot never falls behind.

Signal table — the ad-hoc operations channel.

A small table debezium_signal (id text, type text, data text) that Debezium watches.
Insert a row like ('snapshot-1', 'execute-snapshot', '{"data-collections":["public.orders"]}') and Debezium runs an incremental snapshot of orders on the next polling tick.
Used for re-snapshotting a single table after a schema migration, without restarting the whole connector or losing your place on the slot.

Schema evolution — the great gotcha.

A new column added on the publisher: Debezium detects the schema change at the next WAL record, updates the Kafka schema, and starts emitting events with the new column populated.
A column dropped on the publisher: Debezium emits subsequent events with the column omitted, but the Kafka schema may still reference it (depending on registry compat mode).
A column type changed (text → int): often a breaking schema change. Debezium may halt with a SchemaChangeException. Plan migrations as two-step (add new column → backfill → drop old column) to avoid in-flight breaks.

Common interview probes on Debezium.

"How does Debezium catch a new subscriber up — replay all WAL?" — no, initial snapshot via SELECT * at consistent_point, then stream from there.
"What is a heartbeat for?" — advance the slot when the published tables are idle so other-table WAL does not pile up.
"How do you re-snapshot one table after a schema migration?" — insert a row into the signal table requesting an incremental snapshot of that table.
"How do you guarantee exactly-once into the Kafka topic?" — at-least-once + idempotent consumer; the source connector cannot itself give EOS, only the consumer can.

Worked example — minimal Debezium PG connector config

Detailed explanation. The smallest Debezium connector that actually works. JSON posted to the Kafka Connect REST API. Highlights the must-have knobs: connection string, slot name, publication name, snapshot mode, heartbeat interval.

Question. Write a Debezium Postgres connector config for an orders + payments CDC pipeline, with heartbeat and a single slot, using pgoutput.

Input.

Setting	Value
publisher	pg-prod:5432
database	app
tables	public.orders, public.payments
target	Kafka topic prefix `prod.app`
heartbeat	30s

Code.

{
  "name": "pg-orders-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",

    "database.hostname": "pg-prod",
    "database.port":     "5432",
    "database.user":     "debezium",
    "database.password": "${file:/secrets/debezium.pwd}",
    "database.dbname":   "app",

    "topic.prefix":           "prod.app",
    "schema.include.list":    "public",
    "table.include.list":     "public.orders,public.payments",

    "plugin.name":            "pgoutput",
    "slot.name":              "debezium_orders_slot",
    "publication.name":       "debezium_orders_pub",
    "publication.autocreate.mode": "filtered",

    "snapshot.mode":          "initial",
    "snapshot.isolation.mode": "serializable",

    "heartbeat.interval.ms":  "30000",
    "heartbeat.action.query": "INSERT INTO public.debezium_heartbeat(ts) VALUES (now()) ON CONFLICT (ts) DO NOTHING",

    "signal.data.collection": "public.debezium_signal",

    "key.converter":   "io.confluent.connect.avro.AvroConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081"
  }
}

Step-by-step explanation.

connector.class = PostgresConnector and tasks.max = 1 — Debezium PG always runs a single task because the slot itself is single-consumer.
plugin.name = 'pgoutput', slot.name, publication.name are the three knobs that wire Debezium to the logical replication primitives. With publication.autocreate.mode = 'filtered' Debezium creates a publication scoped to table.include.list if one does not exist.
snapshot.mode = 'initial' — on first start, Debezium snapshots both tables under a serialisable transaction, emitting each row with op = 'r'. On subsequent restarts, Debezium picks up from the slot's last LSN and skips the snapshot.
heartbeat.interval.ms = 30000 + heartbeat.action.query writes a no-op row every 30s to keep the slot advancing on idle tables. The ON CONFLICT … DO NOTHING keeps the heartbeat table from growing unbounded.
signal.data.collection = 'public.debezium_signal' tells Debezium which table to watch for ad-hoc commands (incremental snapshots, log markers).
The Avro converter + schema registry wires Kafka records to a typed schema; downstream consumers in Java / Python / etc. get typed access without parsing JSON.

Output (per-table Kafka topics emitted).

Table	Topic
`public.orders`	`prod.app.public.orders`
`public.payments`	`prod.app.public.payments`
heartbeat	`prod.app.public.debezium_heartbeat`
schema changes	`prod.app.schema-changes`

Rule of thumb. Minimal viable Debezium = connector + slot + publication + snapshot.mode=initial + heartbeat. Add signal table for operability. Add Avro + schema registry for typed downstream consumption. Skip schema registry only if all your downstream consumers can speak JSON.

Worked example — signal-table ad-hoc snapshot after a schema change

Detailed explanation. A new column tax_amount was added to orders on the publisher. The historical rows do not yet have the column populated in Kafka. The fix: trigger an incremental snapshot via the signal table, which re-reads every row of orders and emits them as Kafka events with the new schema.

Question. Use Debezium's signal table to trigger an incremental snapshot of public.orders after a schema migration added tax_amount.

Input.

Step	Action
1	On publisher: `ALTER TABLE orders ADD COLUMN tax_amount numeric`
2	On subscriber (Kafka downstream): `ALTER TABLE orders ADD COLUMN tax_amount numeric` (or schema evolution in the warehouse)
3	Backfill `tax_amount` on the publisher historical rows
4	Trigger incremental snapshot via signal table

Code.

-- On the publisher
ALTER TABLE orders ADD COLUMN tax_amount numeric;
UPDATE orders SET tax_amount = round(amount * 0.08, 2);

-- Trigger Debezium incremental snapshot of orders
INSERT INTO debezium_signal(id, type, data)
VALUES ('snapshot-orders-2026-06-22',
        'execute-snapshot',
        '{
           "data-collections": ["public.orders"],
           "type": "incremental"
         }');

-- Debezium picks up the signal on its next poll (sub-second)
-- and starts a chunk-by-chunk SELECT * of orders.
-- Each row is emitted to Kafka with op='r' (read).
-- Streaming continues concurrently — Debezium interleaves chunks with live changes.

Step-by-step explanation.

The schema migration on the publisher adds the column. Debezium notices the schema change at the next WAL event and updates the Kafka topic schema. New INSERTs already carry tax_amount.
Historical rows on the publisher still need their tax_amount populated in Kafka. A full re-snapshot would work but takes the connector offline. Instead, use the signal table.
The signal row is an INSERT into public.debezium_signal. Debezium polls that table every ~5s and picks up the request.
The incremental snapshot reads orders in chunks (configurable via incremental.snapshot.chunk.size), emitting each row with op = 'r'. Critically, streaming continues during the snapshot — concurrent INSERTs / UPDATEs / DELETEs interleave with snapshot chunks; Debezium's chunk algorithm handles the consistency.
Downstream consumers see snapshot rows tagged with op = 'r'. Idempotent consumers ignore re-reads of rows they already have; non-idempotent consumers may produce duplicates and need an upsert sink.

Output.

Phase	What Debezium emits
Pre-signal	live INSERT / UPDATE / DELETE events with `op = c/u/d`
Signal received	log marker + start of incremental snapshot
Snapshot chunk 1	rows 1..1000 with `op = 'r'` (interleaved with live events)
Snapshot chunk 2	rows 1001..2000 with `op = 'r'` (interleaved)
Snapshot done	log marker, back to streaming-only

Rule of thumb. Use incremental snapshots whenever you need to re-read a table without restarting the connector. The chunk-by-chunk + interleaved-live design is one of Debezium's best operational features; learn it for senior interviews because it is a common probe.

Worked example — idempotent sink on at-least-once delivery

Detailed explanation. Debezium's source side is at-least-once: on restart, it may re-emit events between the last acked LSN and the latest position. End-to-end exactly-once requires either Kafka transactions all the way through (rare in CDC) or an idempotent sink that absorbs duplicates. The Postgres CDC standard is upsert-by-PK on the sink.

Question. Write a sink consumer that absorbs duplicate Debezium events on orders into a Postgres target table idempotently, regardless of how many times Debezium replays.

Input — Kafka topic prod.app.public.orders.

op	id (PK)	customer_id	amount
c	1	42	100
u	1	42	150
u	1	42	150 (duplicate replay)
d	1	—	—

Code.

# Idempotent sink — upsert on PK
import psycopg
from confluent_kafka import Consumer
import json

consumer = Consumer({'bootstrap.servers': 'kafka:9092',
                     'group.id': 'orders-sink',
                     'enable.auto.commit': False})
consumer.subscribe(['prod.app.public.orders'])

conn = psycopg.connect("host=warehouse dbname=dw user=loader")

UPSERT_SQL = """
INSERT INTO dw.orders (id, customer_id, amount, source_lsn)
VALUES (%(id)s, %(customer_id)s, %(amount)s, %(lsn)s)
ON CONFLICT (id) DO UPDATE
   SET customer_id = EXCLUDED.customer_id,
       amount      = EXCLUDED.amount,
       source_lsn  = EXCLUDED.source_lsn
 WHERE EXCLUDED.source_lsn > dw.orders.source_lsn;
"""
DELETE_SQL = "DELETE FROM dw.orders WHERE id = %(id)s AND source_lsn < %(lsn)s;"

while True:
    msg = consumer.poll(1.0)
    if msg is None: continue
    evt = json.loads(msg.value())
    payload = evt['payload']
    op = payload['op']
    lsn = payload['source']['lsn']
    after = payload.get('after') or {}
    before = payload.get('before') or {}

    with conn.cursor() as cur:
        if op in ('c', 'r', 'u'):
            cur.execute(UPSERT_SQL, {**after, 'lsn': lsn})
        elif op == 'd':
            cur.execute(DELETE_SQL, {'id': before['id'], 'lsn': lsn})
    conn.commit()
    consumer.commit(msg)

Step-by-step explanation.

The sink table has a source_lsn column. Every upsert records the LSN of the source event. On a replay, the LSN of the duplicate event is <= the existing source_lsn, so the WHERE EXCLUDED.source_lsn > dw.orders.source_lsn clause skips it.
INSERT / READ / UPDATE all funnel into the same upsert — Debezium's op tag indicates the source operation but the apply logic on the sink is "upsert the row." This is the idempotency trick.
DELETE is also LSN-guarded — only delete the row if the source LSN is newer than the last recorded one. Protects against out-of-order replay where a stale DELETE arrives after a fresh INSERT.
consumer.commit(msg) is after the Postgres commit — at-least-once on the Kafka side, with the database commit providing the deduplication boundary. If the consumer crashes between DB commit and Kafka commit, the next start replays the message and the upsert is a no-op.
Cost: each replayed event still pays a round-trip to Postgres for the dedup check. For high-volume streams, batch the upserts with executemany and pre-filter duplicates on a sliding window of recently-seen LSNs.

Output.

Event	sink table state after
`c id=1 amount=100 lsn=10`	`{id=1, amount=100, source_lsn=10}`
`u id=1 amount=150 lsn=20`	`{id=1, amount=150, source_lsn=20}`
`u id=1 amount=150 lsn=20` (replay)	unchanged (LSN 20 not > 20)
`d id=1 lsn=30`	row deleted (LSN 30 > 20)

Rule of thumb. The Debezium contract is at-least-once on the source side; idempotency lives in the sink. Always include the source LSN (or transaction ID) on the sink row and guard upserts + deletes with it. This is the senior-DE pattern that distinguishes a working CDC pipeline from one that silently corrupts data on every replay.

Senior interview question on Debezium production patterns

A senior interviewer might ask: "Walk me through how you'd ship a Postgres-to-Kafka CDC pipeline to production. What are the five production knobs you tune, and what failure does each one cover?"

Solution Using the five-knob production checklist

Debezium PG → Kafka production knobs
====================================

1. slot.name + publication.name (explicit, stable)
   Covers: slot leakage on connector redeploy.
   Anti-pattern: letting Debezium auto-name (random name on each restart).

2. snapshot.mode = 'initial' + signal table enabled
   Covers: schema migrations, single-table re-snapshot without full restart.
   Anti-pattern: snapshot.mode='always' (re-snapshots every restart, kills throughput).

3. heartbeat.interval.ms = 30000 + heartbeat.action.query
   Covers: idle-table slot stall when other tables generate WAL.
   Anti-pattern: no heartbeat — slot lag grows when CDC tables are idle.

4. max_slot_wal_keep_size on Postgres + monitoring on retained WAL
   Covers: phantom-slot disk-fill outage.
   Anti-pattern: unbounded slot retention.

5. Idempotent sink with source-LSN dedup
   Covers: at-least-once replay duplicates.
   Anti-pattern: trusting "exactly-once" without an LSN guard on the sink.

Step-by-step trace.

Knob	Default	Production value	Failure it covers
`slot.name`	random	explicit, e.g. `debezium_orders_slot`	slot leak on redeploy
`publication.name`	`dbz_publication`	explicit, e.g. `debezium_orders_pub`	conflicts with other connectors
`snapshot.mode`	`initial`	`initial`	full reseed only on first start
`heartbeat.interval.ms`	0 (disabled)	30000	idle-table slot stall
`max_slot_wal_keep_size` (PG)	-1 (unbounded)	8GB–50GB	phantom-slot disk-fill
Sink upsert with `source_lsn` guard	n/a (you write this)	required	duplicate replay

After applying all five, the pipeline survives a connector restart (knob 1, 2, 3), survives the source database under WAL pressure (knob 4), and survives the at-least-once replay (knob 5). This is the production-grade debezium postgres baseline.

Output:

Failure mode	Without knobs	With all five knobs
Connector redeploy leaks slots	yes	no (explicit slot name reused)
Schema migration breaks the connector	yes	no (signal-table incremental snapshot)
Slot lag grows on idle CDC tables	yes	no (heartbeat)
Phantom slot fills disk	yes	bounded at `max_slot_wal_keep_size`
Replay duplicates corrupt the sink	yes	no (LSN guard)

Why this works — concept by concept:

Explicit names are the operability primitive — slots and publications outlive the connector; naming them explicitly makes them inspectable and reattachable.
Heartbeat advances the slot — the slot's confirmed_flush_lsn advances only on consumer ack of some table in the publication. Heartbeat manufactures activity on a dummy table, keeping the slot moving even when the real tables are idle.
Signal table = ad-hoc operations channel — incremental snapshot, log marker, custom commands. The lowest-friction way to give Debezium an instruction without restarting it.
max_slot_wal_keep_size protects the source — the brake of last resort. Pair with a monitoring alert at 50% of the cap so you intervene before invalidation.
Cost — knobs 1–4 are configuration only (zero ongoing cost). Knob 5 (idempotent sink) is one extra column + one extra WHERE clause per upsert. The total cost of "production-grade" is a one-time setup; the cost of skipping it is a midnight outage.

ETL
Topic — ETL
Debezium production drills

Practice →

Streaming
Topic — streaming
Postgres CDC streaming problems

Practice →

Cheat sheet — logical replication recipes

Full-loop 6-command setup. On publisher: ALTER SYSTEM SET wal_level = 'logical' + restart; CREATE PUBLICATION p FOR TABLE t; CREATE ROLE replicator REPLICATION LOGIN. On subscriber: CREATE TABLE t (…) (same schema); CREATE SUBSCRIPTION s CONNECTION '…' PUBLICATION p; verify via pg_stat_subscription. Memorise this — it is the floor of every postgres logical replication interview.
Slot health one-liner. SELECT slot_name, active, wal_status, pg_size_pretty(pg_current_wal_lsn() - restart_lsn) AS retained, pg_size_pretty(pg_current_wal_lsn() - confirmed_flush_lsn) AS lag FROM pg_replication_slots ORDER BY 4 DESC;. Run this every time you suspect a phantom slot.
REPLICA IDENTITY ladder. DEFAULT (PK; cheapest) → USING INDEX uidx (existing unique index; ~1.2x WAL) → FULL (entire previous row; ~2x WAL) → NOTHING (never use on replicated tables — UPDATEs silently vanish).
Slot disk-fill prevention. max_slot_wal_keep_size = 8GB (or 4–8x your worst-case consumer downtime) + monitoring alert on active = false for any logical slot > 5 minutes. The two together kill the phantom-slot outage.
Pick a plugin. Default to pgoutput (in-tree, binary, supported by Debezium 2.x and Postgres native subscriptions). Switch to wal2json only for custom consumers that benefit from JSON readability. decoderbufs is legacy Debezium 1.x.
Row filter + column list (PG15+). CREATE PUBLICATION p FOR TABLE t (c1, c2, c3) WHERE (tenant_id = $X). Filters must reference REPLICA IDENTITY columns; column list must include them. Combine for per-tenant + PII-redacted feeds.
DDL is NOT replicated. Apply schema changes on both sides via a migration tool (Liquibase / Flyway / Sqitch). Add columns on the subscriber FIRST to avoid in-flight apply errors.
Debezium minimum config. plugin.name=pgoutput, explicit slot.name, explicit publication.name, snapshot.mode=initial, heartbeat.interval.ms=30000, signal.data.collection=public.debezium_signal. Five knobs cover 90% of production cases.
Heartbeat keeps the slot moving. heartbeat.interval.ms = 30000 + INSERT INTO debezium_heartbeat (ts) VALUES (now()) action query. Without it, idle CDC tables on a busy database silently pin GBs of WAL.
Signal-table snapshot. INSERT INTO debezium_signal(id, type, data) VALUES ('id-123', 'execute-snapshot', '{"data-collections":["public.orders"], "type":"incremental"}') — re-snapshot one table without restarting the connector.
Idempotent sink with LSN guard. Every sink row carries a source_lsn column; every upsert is WHERE EXCLUDED.source_lsn > target.source_lsn. Absorbs at-least-once replays without corruption.
Postgres → Snowflake / Iceberg / BigQuery. Use a managed CDC tool (Fivetran, Airbyte, AWS DMS) on top of pgoutput. You still own the slot, the publication, and the REPLICA IDENTITY — the vendor only owns the consumer.
Postgres → Postgres major-version upgrade. Set up logical replication from old (PG13) to new (PG16) with matching schemas, let it catch up, then cut over with a brief application pause. The shortest-downtime Postgres major upgrade pattern in 2026.
PG16 parallel apply. CREATE SUBSCRIPTION … WITH (streaming = 'parallel') applies large transactions in parallel on the subscriber. Big win for write-heavy publishers.
Postgres on RDS / Cloud SQL / Aurora. All major managed Postgres services expose wal_level = logical via a parameter group flag and ship pgoutput. Slot lifecycle and REPLICA IDENTITY semantics are identical to self-hosted; the management plane differs.

Frequently asked questions

What is the difference between physical and logical replication in Postgres?

Physical (streaming) replication ships byte-for-byte WAL between two Postgres clusters of the same major version with identical schemas — used for high-availability standbys and read replicas. Logical replication uses logical decoding (pgoutput / wal2json / decoderbufs) to ship decoded row-level changes that any consumer can apply — same Postgres, a newer Postgres major version, Kafka via Debezium, Snowflake via Fivetran, anything. Physical = HA / read replicas / PITR; logical = CDC / cross-version upgrades / data lakes. The big asymmetry: logical replication does NOT replicate DDL — you apply schema changes on both sides separately. Picking physical when you needed logical (or vice versa) is the most common postgres logical replication design mistake.

Why do UPDATEs and DELETEs fail without a primary key?

Because REPLICA IDENTITY DEFAULT requires a PRIMARY KEY — Postgres ships the PK columns in the change event so the subscriber can locate the row. Without a PK and with default REPLICA IDENTITY, UPDATE and DELETE events ship no row identifier; the subscriber's apply worker logs cannot find row to update (or silently drops the operation, depending on Postgres version). The fix is one of three: add a real PK (best), set REPLICA IDENTITY USING INDEX <unique_index> (cheap if you have one), or REPLICA IDENTITY FULL (~2x WAL volume — last resort for legacy tables). Every senior Postgres CDC engineer has hit this bug at least once; it is a guaranteed interview probe.

What is a replication slot and why is it dangerous?

A replication slot is a persistent server-side cursor that promises "I will not throw away WAL beyond this LSN until the named consumer has acked it" — which is exactly what makes logical replication crash-safe, and exactly what fills your disk when a consumer dies and nobody drops the slot. The dangerous case is the phantom slot: a consumer (Debezium pod, Fivetran connector, custom script) disconnects, the slot remains marked active = false, and WAL keeps accumulating until the database runs out of disk. Two guardrails kill this outage: max_slot_wal_keep_size = 50GB (PG13+) as the brake, plus an on-call alert on active = false for any logical slot > 5 minutes. Both are mandatory production knobs; skipping either is the most preventable Postgres outage.

Can Postgres logical replication replicate DDL?

No. Logical replication only ships DML (INSERT, UPDATE, DELETE, TRUNCATE if enabled). DDL — ALTER TABLE … ADD COLUMN …, new tables, schema migrations — must be applied on both sides manually or via a separate migration tool (Liquibase, Flyway, Sqitch). The recommended pattern: apply schema changes on the subscriber FIRST, then on the publisher. This avoids the failure mode where the publisher emits an INSERT with a new column the subscriber does not have. PG16 added some infrastructure for future DDL replication, but as of 2026 it remains application-managed. Treating DDL like DML is the second-most-common pg_logical mistake (after the phantom slot).

How do I prevent WAL from filling the disk in a CDC pipeline?

Stack four defences. (1) max_slot_wal_keep_size = 8GB-50GB on Postgres — caps per-slot retention; beyond it the slot is invalidated but the database stays up. (2) Monitor pg_replication_slots.active — alert immediately if any logical slot reports active = false for more than 5 minutes. (3) Heartbeat in your CDC consumer (Debezium: heartbeat.interval.ms = 30000 plus heartbeat.action.query) — keeps slots advancing even when published tables are idle. (4) Auto-drop dead slots — if a slot reports active = false for > 30 minutes AND no recovery is in progress, drop it. The combination converts the phantom-slot midnight outage into a 4pm Slack ping. Skip any one of the four and you eventually get the outage anyway.

Postgres CDC — Debezium vs Fivetran vs Snowflake's Postgres connector?

All three sit on top of pgoutput and read from a Postgres replication slot — the underlying primitive is identical. Debezium is open-source, runs on your Kafka Connect, gives you full control of the slot, the publication, the snapshot policy, and the destination Kafka topic structure — best when you have a Kafka platform and want full control. Fivetran is managed, hides the slot lifecycle, picks pgoutput, manages the schema in Snowflake / BigQuery — best when you want zero ops and are willing to pay per-row. Snowflake's native Postgres connector is even more vertically integrated — Snowflake owns both ends, so the schema-mapping story is simpler — best when you are a Snowflake-first shop. The senior interview answer: regardless of vendor, you still own wal_level = logical, the slot, the publication, the REPLICA IDENTITY, and max_slot_wal_keep_size. Pick the vendor on consumer ergonomics; the Postgres-side responsibilities are identical.

Practice on PipeCode

Drill the ETL practice library → for the snapshot + stream + idempotent-sink family of CDC probes.
Rehearse on the streaming practice library → for the Kafka topic + at-least-once + replay patterns that sit downstream of Debezium.
Sharpen the SQL practice library → for the pg_replication_slots / pg_stat_subscription introspection queries every senior interview probes.
Stack the joins practice library → for the publisher-subscriber co-ordination + multi-table CDC drills.
Layer the aggregation practice library → for the windowed downstream consumers that turn CDC streams into business metrics.

Lock in Postgres CDC muscle memory

Postgres docs explain slots + publications. PipeCode drills explain the decision — when REPLICA IDENTITY FULL is unavoidable, when a phantom slot fills your disk, when Debezium beats Fivetran for cost. Pipecode.ai is Leetcode for Data Engineering — pattern-first practice tuned for the production trade-offs senior data engineers actually face.

Practice ETL problems →
Practice streaming problems →

1. Why logical replication is the CDC backbone

Physical replication ships bytes, postgres logical replication ships row-level changes — the moment you want Postgres data anywhere other than another identical Postgres, you need the logical path

Worked example — physical vs logical replication side-by-side

Worked example — picking an output plugin

Worked example — bi-directional logical replication on PG16

Senior interview question on logical replication scope

Solution Using two slots and two publications with shared output plugin

2. Logical replication architecture

logical decoding is a four-stage pipeline — WAL → output plugin → replication slot → subscription — and every CDC tool plugs into one of those stages

Worked example — full-loop PG → PG logical replication in 6 commands

Worked example — peeking at decoded changes without a subscriber

Worked example — row filtering and column lists since PG15

Senior interview question on the architecture

Solution Using staged inspection of WAL, slot, snapshot, apply worker

3. Replication slots — power and danger

A replication slot is a persistent cursor into the WAL — useful for crash-safe consumers, lethal when the consumer dies and the slot lives on

Worked example — inspecting a healthy slot

Worked example — engineering a phantom slot (and then fixing it)

Worked example — max_slot_wal_keep_size as a circuit breaker

Senior interview question on slot operations

Solution Using triage, invalidation, snapshot replay

4. Publications and REPLICA IDENTITY

A publication subscription pair picks what gets replicated; REPLICA IDENTITY picks what columns the source ships for UPDATEs and DELETEs — get both wrong and your sink silently drops rows

Worked example — REPLICA IDENTITY FULL for a table with no primary key

Worked example — subset replication with row filter

Worked example — column list to redact PII from an analytics sink

Senior interview question on UPDATE replication

Solution Using REPLICA IDENTITY analysis + the right knob for the schema

5. CDC into Kafka via Debezium

debezium postgres is a Kafka Connect source that talks the logical replication protocol — snapshot first, then stream, with heartbeats to keep idle slots fresh and a signal table for ad-hoc snapshots

Worked example — minimal Debezium PG connector config

Worked example — signal-table ad-hoc snapshot after a schema change

Worked example — idempotent sink on at-least-once delivery

Senior interview question on Debezium production patterns

Solution Using the five-knob production checklist

Cheat sheet — logical replication recipes

Frequently asked questions

What is the difference between physical and logical replication in Postgres?

Why do UPDATEs and DELETEs fail without a primary key?

What is a replication slot and why is it dangerous?

Can Postgres logical replication replicate DDL?

How do I prevent WAL from filling the disk in a CDC pipeline?

Postgres CDC — Debezium vs Fivetran vs Snowflake's Postgres connector?

Practice on PipeCode

Lock in Postgres CDC muscle memory

Physical replication ships bytes, `postgres logical replication` ships row-level changes — the moment you want Postgres data anywhere other than another identical Postgres, you need the logical path

`logical decoding` is a four-stage pipeline — WAL → output plugin → replication slot → subscription — and every CDC tool plugs into one of those stages

A `replication slot` is a persistent cursor into the WAL — useful for crash-safe consumers, lethal when the consumer dies and the slot lives on

Worked example — `max_slot_wal_keep_size` as a circuit breaker

A `publication subscription` pair picks what gets replicated; REPLICA IDENTITY picks what columns the source ships for UPDATEs and DELETEs — get both wrong and your sink silently drops rows

`debezium postgres` is a Kafka Connect source that talks the logical replication protocol — snapshot first, then stream, with heartbeats to keep idle slots fresh and a signal table for ad-hoc snapshots