postgres logical replication is the quiet plumbing under almost every modern CDC pipeline — and the part of Postgres senior data engineers most often handwave through until a phantom replication slot fills the source disk at 3am and the on-call shift becomes a forensic investigation. Streaming replication ships byte-for-byte WAL between Postgres instances; pg_logical ships decoded SQL-level changes — INSERT / UPDATE / DELETE rows with column values — out to anything that can speak the replication protocol. That single shift turns Postgres into a real change-data-capture source for Kafka, Snowflake, an audit warehouse, or another Postgres on a newer major version.
This guide is the senior-DE walkthrough you wished existed the first time an interviewer asked "explain postgres cdc end-to-end" or "what does a replication slot actually hold, and why is it dangerous?" It walks through logical decoding (WAL → output plugin → change events), the slot machinery (pgoutput vs wal2json, max_slot_wal_keep_size, the phantom-slot anti-pattern), publications + REPLICA IDENTITY (why your UPDATEs silently disappear without a primary key), and the production CDC pattern teams ship with debezium postgres — snapshot-then-stream, heartbeat topics, signal-table ad-hoc snapshots, and idempotent sinks for the inevitable replay. Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.
When you want hands-on reps immediately after reading, drill the SQL practice library →, rehearse on the ETL practice library →, and sharpen the streaming axis with the streaming practice library →.
On this page
- Why logical replication is the CDC backbone
- Logical replication architecture
- Replication slots — power and danger
- Publications and REPLICA IDENTITY
- CDC into Kafka via Debezium
- Cheat sheet — logical replication recipes
- Frequently asked questions
- Practice on PipeCode
1. Why logical replication is the CDC backbone
Physical replication ships bytes, postgres logical replication ships row-level changes — the moment you want Postgres data anywhere other than another identical Postgres, you need the logical path
The one-sentence invariant: physical (streaming) replication ships the byte-for-byte WAL between two identical Postgres clusters; logical replication decodes the WAL into per-row INSERT/UPDATE/DELETE events that any consumer — Postgres, Kafka, Snowflake, a custom Go service — can apply. Every other downstream CDC tool (Debezium, Fivetran, Airbyte, AWS DMS, Striim, Snowflake's Postgres connector) is a wrapper on top of logical decoding. Once you internalise "physical = bytes, logical = decoded changes," the entire pg_logical interview surface collapses to a sequence of consequences from that one structural difference.
Three axes interviewers actually probe.
-
Decoding. Logical replication runs an output plugin inside the Postgres backend that walks the WAL and emits a stream of decoded change events.
pgoutputis the in-tree binary protocol used by Postgres-to-Postgres subscriptions and by Debezium 2.x by default;wal2jsonemits JSON and is the historical default for tools that wanted a human-readable wire format;decoderbufsemits Protobuf for binary efficiency. The choice of plugin determines what your consumer code looks like. - Slot ownership. Every consumer of logical decoding holds a replication slot — a persistent cursor into the WAL that prevents Postgres from recycling segments the consumer has not yet read. The slot is the source of every "Postgres CDC ate my disk" outage you have ever read about. Slot lifecycle is the senior interview probe.
-
Replica identity. UPDATEs and DELETEs need a way to identify the row that changed. With a primary key, Postgres ships the PK columns in the change event. Without one, you must explicitly choose
REPLICA IDENTITY FULL(ship the entire previous row) orREPLICA IDENTITY USING INDEX(ship a non-PK unique index). Get this wrong and your sink silently misses updates.
The DSL split — same WAL, different consumer wire format.
-
pgoutputships a compact binary protocol native to Postgres. Used by every Postgres-to-PostgresCREATE SUBSCRIPTIONand by Debezium 2.x by default. Smallest wire size, fastest decoding, slightly opaque to debug. -
wal2jsonships a JSON object per change with table name, op type, and column values. Easy to debug; larger wire size; the historical CDC default for the wider Postgres ecosystem. -
decoderbufsships Protobuf; the Debezium 1.x default. Compact and language-agnostic, but tied to a Debezium-shipped plugin binary that has to be loaded into Postgres explicitly.
The 2026 reality — what changed since 2022.
-
PG15 added row filtering (
CREATE PUBLICATION p FOR TABLE orders WHERE (status = 'shipped')) and column lists (FOR TABLE orders (id, total)). Before PG15 you replicated whole tables; now you can ship a subset to a downstream that does not need everything. - PG16 added bi-directional logical replication support (with conflict handling), real parallel apply workers on the subscriber, and proper logical decoding from standbys (huge for read-replica-style CDC).
-
Debezium 2.x switched the default plugin from
decoderbufstopgoutput, removed the requirement for a custom plugin binary, and now supportsREAD_ONLYsnapshot mode on Postgres replicas. -
Fivetran, Airbyte, AWS DMS, Snowflake's Postgres connector all sit on top of
pgoutputorwal2json. Knowing the underlying plugin protocol is what differentiates a senior interview answer from a vendor-comparison spec sheet.
What interviewers listen for.
- Do you say "logical replication = WAL → output plugin → replication slot → subscription" in the first sentence? — senior signal.
- Do you mention "the slot retains WAL until the consumer reads it; if the consumer dies, WAL fills the disk" unprompted? — senior signal.
- Do you describe REPLICA IDENTITY as "what columns the source ships for UPDATE/DELETE" rather than "a setting"? — required answer.
- Do you push back on "just use Fivetran" with "Fivetran is a managed wrapper on top of
pgoutput; you still need to understand slots and replica identity"? — senior signal.
Worked example — physical vs logical replication side-by-side
Detailed explanation. The first interview question on Postgres replication is almost always a comparison. Physical replication uses streaming WAL between two Postgres clusters of the same major version with identical schemas and identical OIDs — it is the high-availability primitive. Logical replication uses logical decoding to ship row-level changes between any two consumers, regardless of major version, schema, or even database engine. The two solve different problems; teams often confuse them.
Question. Compare physical (streaming) replication and logical replication on five axes: granularity, version compatibility, target type, schema requirement, and what happens to DDL.
Input.
| Axis | Physical | Logical |
|---|---|---|
| Granularity | whole cluster | per-table, per-DML |
| Major-version skew | same | any |
| Target | another Postgres | Postgres / Kafka / anything |
| Schema | identical | can differ |
| DDL | replicates automatically | does not replicate (must be applied separately) |
Code.
-- Physical replication setup (primary)
ALTER SYSTEM SET wal_level = 'replica';
ALTER SYSTEM SET max_wal_senders = 10;
SELECT pg_create_physical_replication_slot('standby_1');
-- On standby
-- recovery.conf / postgresql.auto.conf:
-- primary_conninfo = 'host=primary user=replicator'
-- primary_slot_name = 'standby_1'
-- Logical replication setup (publisher)
ALTER SYSTEM SET wal_level = 'logical';
CREATE PUBLICATION orders_pub FOR TABLE orders;
-- On subscriber (separate Postgres, possibly newer major version)
CREATE SUBSCRIPTION orders_sub
CONNECTION 'host=publisher dbname=app user=replicator'
PUBLICATION orders_pub;
Step-by-step explanation.
- Physical replication requires
wal_level = replica(the default since PG10). The primary streams WAL bytes to the standby; the standby applies them with a WAL replay process. The standby is a binary clone — same OIDs, same TIDs, same on-disk layout. - Logical replication requires
wal_level = logical. The primary still writes the same WAL, but a logical decoding process walks the WAL and emits decoded change events. The subscriber applies those events with regular SQL DML (INSERT … ON CONFLICT,UPDATE,DELETE). - Because logical replication uses normal SQL on the subscriber, the subscriber can be on a different major version (PG13 → PG16 is fine), a different schema (extra columns on the subscriber side are tolerated), or even a different system (Kafka via Debezium, Snowflake via Fivetran).
- The biggest gotcha: logical replication does NOT replicate DDL. If you
ALTER TABLE orders ADD COLUMN tax NUMERICon the publisher, the subscriber will not see the column; the next INSERT that includes ataxvalue will fail with "column does not exist." You must apply DDL on both sides (or use a tool likepg_dump --schema-onlyto keep them in sync). - Physical = HA / read replicas / point-in-time recovery. Logical = CDC / cross-version upgrades / data lake ingestion. Pick by what you need; don't try to use one for the other's job.
Output.
| Use case | Pick |
|---|---|
| Hot standby for failover | physical |
| Read replica for the same cluster | physical |
| Postgres 13 → Postgres 16 zero-downtime upgrade | logical |
| Stream changes to Kafka | logical (via Debezium) |
| Snowflake / Iceberg / BigQuery sync | logical (via Fivetran / Airbyte / DMS) |
| Audit log warehouse | logical |
Rule of thumb. If you want byte-for-byte cloning between two Postgres of the same version, use physical. The moment you want changes anywhere else — different major version, different engine, different schema, different downstream tool — you want logical.
Worked example — picking an output plugin
Detailed explanation. The output plugin is a Postgres extension loaded at WAL-decoding time. Three plugins dominate: pgoutput (in-tree binary), wal2json (out-of-tree JSON), decoderbufs (Debezium 1.x Protobuf). The choice affects wire size, debuggability, and which tools you can plug in downstream.
Question. Pick the right output plugin for three scenarios: Postgres-to-Postgres subscription, Debezium 2.x into Kafka, and a Python script that wants human-readable change events.
Input.
| Scenario | Consumer | Wire format preference |
|---|---|---|
| PG → PG subscription | Postgres subscriber | binary, native |
| Debezium 2.x → Kafka | Debezium connector | binary, fast |
| Python script | custom replication client | JSON, debuggable |
Code.
-- pgoutput — the in-tree default
CREATE PUBLICATION p FOR TABLE orders;
-- (no plugin selection needed; pgoutput is implicit for SUBSCRIPTION)
-- wal2json — installed as an extension on the publisher
-- CREATE EXTENSION wal2json; -- pre-installed in most managed Postgres
SELECT pg_create_logical_replication_slot('my_slot', 'wal2json');
-- Stream changes as JSON
SELECT data
FROM pg_logical_slot_get_changes('my_slot', NULL, NULL,
'include-xids', '1',
'pretty-print', '1');
-- decoderbufs — Debezium 1.x style (Protobuf)
SELECT pg_create_logical_replication_slot('debezium_slot', 'decoderbufs');
Step-by-step explanation.
- For PG → PG subscriptions, you do not pick a plugin at all —
CREATE SUBSCRIPTIONalways usespgoutput. It is in-tree, binary, and tightly coupled to the Postgres apply worker. - For Debezium 2.x → Kafka, the recommended plugin is
pgoutput(the Debezium default since 2.0). It removes the need for an out-of-tree plugin binary on the Postgres host — every modern Postgres shipspgoutputautomatically. Debezium 1.x useddecoderbufsand required loading the Debezium-shipped.sofile; that step is gone. - For a Python replication client,
wal2jsonis the friendlier choice. The wire format is one JSON object per change, with table name, op type, and column values — easy to consume withpsycopg.replicationor a similar library, and you can pretty-print and inspect inpsqlwhile debugging. - Performance:
pgoutputanddecoderbufsare both binary and within ~10% of each other on throughput.wal2jsonis roughly 1.5–2x larger on the wire and noticeably slower to decode, but the debugging win is large for hand-rolled consumers. - The plugin is loaded inside the Postgres backend process — bad plugins can degrade WAL-sender CPU. Production teams pick
pgoutputunless they have a specific reason for JSON.
Output.
| Plugin | Format | Default for | When to pick |
|---|---|---|---|
pgoutput |
binary, native | PG ↔ PG subscriptions; Debezium 2.x | most production CDC |
wal2json |
JSON | hand-rolled clients, ad-hoc inspection | debuggability, custom tooling |
decoderbufs |
Protobuf | Debezium 1.x | legacy Debezium deployments |
Rule of thumb. Default to pgoutput. Switch to wal2json only when you are writing a custom consumer and the JSON wire format is genuinely useful for debugging.
Worked example — bi-directional logical replication on PG16
Detailed explanation. PG16 added the missing primitives for symmetric two-node logical replication — origin = 'NONE' to prevent loops, and a real conflict-resolution layer. Before PG16 you had to hand-roll loop prevention with triggers. Now it is a one-flag affair, with pg_subscription tracking the origin of every applied change.
Question. Set up bi-directional logical replication between two Postgres 16 nodes such that writes on either side replicate to the other, without infinite loops.
Input.
| Node | Role | Writes |
|---|---|---|
| pg-a | publisher + subscriber | local writes from app A |
| pg-b | publisher + subscriber | local writes from app B |
Code.
-- On pg-a
CREATE PUBLICATION pub_a FOR TABLE orders;
CREATE SUBSCRIPTION sub_b
CONNECTION 'host=pg-b dbname=app user=replicator'
PUBLICATION pub_b
WITH (origin = 'NONE', copy_data = false);
-- On pg-b
CREATE PUBLICATION pub_b FOR TABLE orders;
CREATE SUBSCRIPTION sub_a
CONNECTION 'host=pg-a dbname=app user=replicator'
PUBLICATION pub_a
WITH (origin = 'NONE', copy_data = false);
Step-by-step explanation.
- Each node creates its own publication of the same table (
orders) and subscribes to the other node's publication. Without loop prevention, an insert on pg-a would replicate to pg-b, which would re-emit it back to pg-a, and so on forever. - The flag
origin = 'NONE'is the loop breaker. Each applied change is tagged with its origin (the upstream publication node). The apply worker on a node only re-emits a change if its origin is "local" — never if it came from another logical replication source. -
copy_data = falseskips the initial snapshot. Both sides already have the same data (assume a careful operator); you only want to stream incremental changes from now on. - Conflicts are still possible: two app servers writing to the same row at the same time will produce divergent values. PG16 gives you an
ALTER SUBSCRIPTION … SET (disable_on_error = true)knob and apg_subscription_origin_statusview to inspect conflict state. Production setups add a write-key partitioning scheme on top (each row has a "home" node). - Bi-directional replication is powerful for active-active deployments but multiplies the operational surface. The vast majority of CDC pipelines remain one-way (
source → CDC → sink) — bi-directional is the niche case.
Output.
| Knob | Default | What it does |
|---|---|---|
origin = 'NONE' |
'ANY' |
only re-emit changes that originated locally — breaks replication loops |
copy_data = false |
true |
skip the initial snapshot at subscription create time |
disable_on_error = true |
false |
pause the subscription on apply error instead of retrying forever |
streaming = 'parallel' |
'off' |
apply large transactions in parallel on the subscriber (PG16+) |
Rule of thumb. Bi-directional logical replication is a real PG16+ feature, but reach for it only when you genuinely need active-active. For 95% of CDC pipelines, one-way source → sink is simpler and safer.
Senior interview question on logical replication scope
A senior interviewer often opens with: "Your team wants to replicate Postgres into Snowflake and into a Kafka topic for the real-time fraud team. Walk me through how you would design the logical replication setup — one slot or two, one publication or two, and why."
Solution Using two slots and two publications with shared output plugin
-- Two publications — Snowflake and fraud team have different needs
CREATE PUBLICATION snowflake_pub
FOR TABLE orders, payments, customers, products
WITH (publish = 'insert, update, delete, truncate');
CREATE PUBLICATION fraud_pub
FOR TABLE payments WHERE (amount > 100),
TABLE login_events
WITH (publish = 'insert, update');
-- Two slots — independent consumer cursors
-- Snowflake uses the Fivetran Postgres connector (creates its own slot)
-- Fraud team uses Debezium Postgres connector (creates its own slot)
-- Inspect:
SELECT slot_name, plugin, slot_type, database, active, wal_status,
confirmed_flush_lsn, restart_lsn
FROM pg_replication_slots;
Step-by-step trace.
| Step | Decision | Why |
|---|---|---|
| 1. One publication or two? | Two | Snowflake needs all tables + truncate; fraud team only needs payments (filtered) + login_events
|
| 2. One slot or two? | Two | Slots are per-consumer cursors; one shared slot would force both consumers to advance at the slowest one (= disk fill if either stalls) |
| 3. Output plugin |
pgoutput for both |
Standard, binary, supported by Debezium 2.x and Fivetran's Postgres connector |
4. Row filter on fraud_pub
|
WHERE amount > 100 |
reduces WAL volume to the fraud team by ~5x |
| 5. Monitoring | poll pg_replication_slots every 60s |
alert if pg_current_wal_lsn() - restart_lsn > 50 GB
|
After the design pass, both consumers are independent — the fraud team can stall for a few minutes without blocking Snowflake, and Snowflake can have a multi-hour outage without disturbing fraud (within max_slot_wal_keep_size).
Output:
| Layer | Snowflake | Fraud team |
|---|---|---|
| Publication |
snowflake_pub (4 tables, full DML + truncate) |
fraud_pub (filtered payments + login_events) |
| Slot | fivetran_slot |
debezium_slot |
| Plugin | pgoutput |
pgoutput |
| WAL volume | 100% | ~20% (row filter) |
| Independence | yes (own slot) | yes (own slot) |
Why this works — concept by concept:
- Two publications, two intents — Snowflake wants the full table set for downstream analytics; fraud wants a filtered slice in real time. Two publications keep the wire payload minimal for each consumer.
- One slot per consumer — a slot is a cursor into the WAL. If two consumers share a slot, the slot advances only when the slowest consumer commits — meaning a slow consumer holds WAL for the fast one. Independent slots decouple their lifetimes.
-
pgoutput across the board — both downstream tools (Fivetran's Postgres connector and Debezium 2.x) speak
pgoutputnatively. No per-consumer plugin choice, no.sofiles to deploy. -
Row filter at publication time — pushing the
amount > 100filter into the publication means the fraud slot never sees the small-amount events; less WAL pressure, less network egress, fewer events to drop downstream. -
Cost — slots add
O(WAL volume)retention pressure on the source; two slots =O(slowest of the two). Monitorrestart_lsnper slot and alert on lag thresholds; this turns a midnight outage into a 4pm Slack ping.
SQL
Topic — SQL
Postgres CDC SQL practice problems
2. Logical replication architecture
logical decoding is a four-stage pipeline — WAL → output plugin → replication slot → subscription — and every CDC tool plugs into one of those stages
The mental model in one line: logical replication is a pipeline of four components — the WAL on the source, an output plugin that decodes WAL into change events, a replication slot that holds those events durably until consumed, and a subscription (or external consumer) that applies them downstream. Once you say that out loud, every pg_logical interview question becomes a deduction from "WAL is the source of truth; the slot is the cursor; the plugin is the wire format; the subscription is the apply worker."
The four stages, end-to-end.
-
WAL (write-ahead log). Every Postgres modification — INSERT, UPDATE, DELETE, DDL — appends records to the WAL before touching the heap. WAL is the source of truth for both crash recovery and replication.
wal_level = logicaladds extra information to each WAL record so the decoder can reconstruct the row-level change (not just the byte change). The extra info costs ~10–15% WAL volume compared towal_level = replica. -
Output plugin. A loadable Postgres extension that reads WAL records and emits a stream of decoded change events.
pgoutput(in-tree),wal2json(JSON),decoderbufs(Protobuf). The plugin runs inside the Postgres backend process — specifically inside the WAL sender backend that handles the replication connection. -
Replication slot. A persistent server-side cursor into the WAL. Holds two LSNs —
restart_lsn(the WAL position we must keep for restart) andconfirmed_flush_lsn(the position the consumer has confirmed). The slot prevents Postgres from recycling WAL until the consumer has acked. Survives restarts. -
Subscription / external consumer. Either a Postgres
CREATE SUBSCRIPTION(Postgres-to-Postgres) or an external consumer using the replication protocol (Debezium, Fivetran, custom Python withpsycopg.replication, etc.). The subscription opens a streaming connection, advances the slot as it applies, and is responsible for crash safety on the consumer side.
The replication protocol — what wire actually carries.
- The publisher exposes a special connection type — the replication connection — that you reach with
psql 'replication=database host=… user=…'or with a client library's replication mode. This is distinct from a regular SQL connection. - On a replication connection you can issue
START_REPLICATION SLOT my_slot LOGICAL 0/0 (proto_version '4', publication_names 'my_pub')to begin streaming. - The server then pushes a sequence of
XLogDatamessages (containing decoded change events) and periodicPrimary keepalivemessages. The client acks progress withStandby status updatemessages, advancingconfirmed_flush_lsnon the slot.
Snapshot + stream — how new subscribers catch up.
- A brand-new subscription needs to start somewhere. Postgres ships an initial snapshot — a serialisable read of every published table at the slot's
consistent_pointLSN — followed by streaming every WAL change after that point. - Debezium calls this
snapshot.mode = 'initial'(the default). Once the snapshot is done, it switches to streaming mode and reads from the slot. - The snapshot uses regular SQL
SELECT * FROM …; for very large tables this can take hours. Production teams often pre-populate the target viapg_dumpand start the subscription withcopy_data = false.
DDL — the great gotcha.
- Logical replication does not replicate DDL.
ALTER TABLE orders ADD COLUMN tax NUMERICon the publisher does not run on the subscriber. - The next INSERT that includes the
taxcolumn either fails (if the subscriber column does not exist) or silently drops the value (if the publication does not include the column in its column list). - Production teams keep schemas in sync via a separate migration tool (Liquibase, Flyway, Sqitch) applied to both sides. Apply DDL on the subscriber first, then on the publisher — this avoids "column missing on subscriber" errors during the apply.
Common interview probes on the architecture.
- "Where does logical decoding run — in the backend, the WAL writer, a separate process?" — inside the WAL sender backend that handles the replication connection.
- "What is the difference between
restart_lsnandconfirmed_flush_lsn?" —restart_lsnis the earliest WAL position the slot must keep (for restart safety);confirmed_flush_lsnis the position the consumer has confirmed it durably applied. - "Does logical replication replicate DDL?" — no, you must apply DDL on both sides via a separate tool.
- "How does a new subscriber catch up — replay WAL from offset 0?" — no, it takes a snapshot at the
consistent_pointLSN and then streams from there.
Worked example — full-loop PG → PG logical replication in 6 commands
Detailed explanation. The canonical "set up logical replication" question. Two Postgres instances, one table, six SQL commands. The output is a continuous row-level stream from the publisher to the subscriber. Memorising this sequence is the floor for the postgres logical replication interview surface.
Question. Set up logical replication of an orders table from pg-pub to pg-sub. Show every step and explain the role of each command.
Input.
| Side | Schema |
|---|---|
| pg-pub | orders (id int PK, customer_id int, amount numeric, status text) |
| pg-sub |
orders table created with identical schema (DDL applied separately) |
Code.
-- 1. On pg-pub — set wal_level and restart
ALTER SYSTEM SET wal_level = 'logical';
ALTER SYSTEM SET max_wal_senders = 10;
ALTER SYSTEM SET max_replication_slots = 10;
-- pg_ctl restart (wal_level needs a restart)
-- 2. On pg-pub — create a publication
CREATE PUBLICATION orders_pub FOR TABLE orders;
-- 3. On pg-pub — create a replication user
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'secret';
GRANT SELECT ON orders TO replicator;
-- pg_hba.conf: host replication replicator pg-sub-ip/32 scram-sha-256
-- 4. On pg-sub — create the target table (DDL is NOT replicated)
CREATE TABLE orders (id int PRIMARY KEY, customer_id int,
amount numeric, status text);
-- 5. On pg-sub — create the subscription
CREATE SUBSCRIPTION orders_sub
CONNECTION 'host=pg-pub dbname=app user=replicator password=secret'
PUBLICATION orders_pub;
-- 6. On pg-sub — verify it is streaming
SELECT subname, subenabled, subslotname FROM pg_subscription;
SELECT * FROM pg_stat_subscription;
Step-by-step explanation.
-
wal_level = 'logical'instructs Postgres to record extra info in every WAL record so the decoder can reconstruct row-level changes. Without this, logical decoding fails with "logical decoding requires wal_level >= logical." The setting requires a server restart. -
CREATE PUBLICATION orders_pub FOR TABLE ordersadvertises theorderstable as a logical replication source. The defaultpublishlist is'insert, update, delete, truncate'. You can scope to subset DML (WITH (publish = 'insert')for insert-only feeds). -
The replication user needs the
REPLICATIONrole attribute andLOGIN.pg_hba.confmust allow the user to connect with the specialreplicationdatabase name (this is hardcoded; you do not create a database literally namedreplication). - The subscriber's table is regular — Postgres applies normal SQL DML on this side. You MUST create the table before the subscription, otherwise the snapshot phase fails with "relation does not exist."
-
CREATE SUBSCRIPTION orders_subdoes three things atomically: connects to the publisher, creates a logical replication slot on the publisher (default name = subscription name), and starts a snapshot+stream. Within seconds,pg_stat_subscription.received_lsnstarts advancing on the subscriber. -
pg_subscriptionshows the configured subscription;pg_stat_subscriptionshows live activity (last received LSN, last applied LSN, worker PID). Use both to verify the loop is healthy.
Output.
| Component | Lives on | What it does |
|---|---|---|
wal_level = logical |
publisher | adds row-level info to WAL records |
PUBLICATION orders_pub |
publisher | advertises which tables + DML to ship |
REPLICATION role |
publisher | grants the consumer permission to open a replication connection |
pg_hba.conf rule |
publisher | network-level allow for the replication user |
CREATE TABLE orders |
subscriber | the target table — DDL is not replicated, you create it by hand |
SUBSCRIPTION orders_sub |
subscriber | creates the slot on the publisher and starts streaming |
Rule of thumb. If you can recite this 6-command sequence from memory in an interview, you have demonstrated end-to-end fluency with postgres logical replication. Every CDC tool (Debezium, Fivetran, Airbyte) collapses to a wrapper on top of steps 1–3 — they create the slot and stream from it, just like step 5 does.
Worked example — peeking at decoded changes without a subscriber
Detailed explanation. A great debugging trick: you can attach a logical slot to a SQL function and read decoded change events directly in psql, no Kafka, no Debezium, no subscriber. This is the fastest way to see what your pgoutput or wal2json plugin actually emits — useful in interviews when an interviewer pushes "show me what a change event looks like on the wire."
Question. Create a logical slot with the wal2json plugin, run a few INSERTs and UPDATEs, and read the decoded change events using pg_logical_slot_get_changes. Explain the difference between _get_ and _peek_ variants.
Input.
| Operation | Row |
|---|---|
| INSERT | (1, 'alice', 100) |
| UPDATE | id=1 SET amount = 150 |
| DELETE | id=1 |
Code.
-- Create the slot (one-time)
SELECT pg_create_logical_replication_slot('debug_slot', 'wal2json');
-- Run some DML
INSERT INTO orders (id, customer, amount) VALUES (1, 'alice', 100);
UPDATE orders SET amount = 150 WHERE id = 1;
DELETE FROM orders WHERE id = 1;
-- Drain the slot — _get_ ADVANCES the slot
SELECT lsn, xid, data
FROM pg_logical_slot_get_changes('debug_slot', NULL, NULL,
'include-xids', '1',
'pretty-print', '0',
'include-timestamp', '1');
-- Peek without advancing — _peek_ leaves the slot in place
SELECT data FROM pg_logical_slot_peek_changes('debug_slot', NULL, NULL);
-- Drop when done — slots that linger fill the disk
SELECT pg_drop_replication_slot('debug_slot');
Step-by-step explanation.
-
pg_create_logical_replication_slot('debug_slot', 'wal2json')creates a slot bound to thewal2jsonplugin. From this moment on, Postgres retains WAL on this slot's behalf — even if you never read from it. (This is the dangerous bit.) -
pg_logical_slot_get_changes('debug_slot', NULL, NULL, …)reads decoded changes since the last position and advancesconfirmed_flush_lsn. WAL up to that point can now be recycled. The first two NULLs areupto_lsnandupto_nchanges— leaving them NULL drains everything. - The output for the INSERT looks like
{"change":[{"kind":"insert","schema":"public","table":"orders","columnnames":["id","customer","amount"],"columnvalues":[1,"alice",100]}]}. The UPDATE is similar with"kind":"update"and an extra"oldkeys"block (or"oldtuple"ifREPLICA IDENTITY FULL). -
pg_logical_slot_peek_changesreads the same data but does not advance. Useful when you want to inspect the wire without affecting consumer progress. -
pg_drop_replication_slot('debug_slot')releases the slot. Forgetting this step is the #1 way to fill your disk — a leftover slot retains WAL forever until dropped.
Output (one decoded event).
| LSN | XID | Decoded JSON |
|---|---|---|
| 0/16ABCDE | 7423 | {"change":[{"kind":"insert","table":"orders","columnvalues":[1,"alice",100]}]} |
| 0/16ABF12 | 7424 | {"change":[{"kind":"update","table":"orders","columnvalues":[1,"alice",150],"oldkeys":{"keyvalues":[1]}}]} |
| 0/16AC056 | 7425 | {"change":[{"kind":"delete","table":"orders","oldkeys":{"keyvalues":[1]}}]} |
Rule of thumb. When debugging a CDC pipeline that "isn't emitting events," create a parallel slot with wal2json, peek at the same DML, and confirm Postgres is decoding correctly. If pg_logical_slot_peek_changes shows the events, the bug is downstream (in Debezium / Fivetran / the apply worker). If it does not, the bug is in wal_level, REPLICA IDENTITY, or the publication.
Worked example — row filtering and column lists since PG15
Detailed explanation. Before PG15, a publication was "all rows of these tables." Since PG15, you can scope to a WHERE clause per table (row filter) and to a subset of columns (column list). This dramatically reduces wire size on tenant-multi pipelines where each tenant only wants its own rows.
Question. Create a publication that ships only status = 'shipped' orders, and only the (id, customer_id, amount) columns — not notes or internal_score.
Input.
| Column | Ship? |
|---|---|
| id (PK) | yes |
| customer_id | yes |
| amount | yes |
| status | no (filter only) |
| notes | no |
| internal_score | no |
Code.
-- Row filter + column list (PG15+)
CREATE PUBLICATION shipped_orders_pub
FOR TABLE orders (id, customer_id, amount)
WHERE (status = 'shipped');
-- Verify
SELECT pubname, prrelid::regclass, prqual, prattrs
FROM pg_publication p
JOIN pg_publication_rel pr ON p.oid = pr.prpubid
WHERE pubname = 'shipped_orders_pub';
Step-by-step explanation.
-
FOR TABLE orders (id, customer_id, amount)is the column list — only these three columns appear in the change event payload.statusis referenced in the filter but does not ship to the subscriber. -
WHERE (status = 'shipped')is the row filter — only rows wherestatus = 'shipped'produce events. An UPDATE that transitions from'pending'→'shipped'emits an INSERT-equivalent change to the subscriber (since the row newly matches the filter). A transition from'shipped'→'cancelled'emits a DELETE-equivalent (since the row no longer matches). - Filters must be deterministic — no
now(), norandom(), no user-defined function unless it's markedIMMUTABLE. Postgres rejects non-deterministic filters atCREATE PUBLICATIONtime. - Filters CANNOT reference columns excluded from the column list. If you filter on
internal_score > 0.8, you must includeinternal_scorein the column list (even if you don't want to ship it — workaround: include + drop on the subscriber side). - Row filters are evaluated during decoding, so they reduce wire bytes from the publisher to the subscriber. Network savings compound on multi-tenant pipelines.
Output.
| Source row | After row filter + column list |
|---|---|
(1, 42, 100, 'pending', 'note', 0.7) |
(filtered out — status != 'shipped') |
(2, 43, 50, 'shipped', 'note2', 0.9) |
(2, 43, 50) ships |
(3, 44, 200, 'shipped', NULL, NULL) |
(3, 44, 200) ships |
(4, 45, 30, 'cancelled', '', 0.1) |
(filtered out — status != 'shipped') |
Rule of thumb. Use row filters to slice multi-tenant data per consumer; use column lists to redact PII from analytics-only sinks. Both compose; both reduce wire volume; both are PG15+ only.
Senior interview question on the architecture
A senior interviewer might ask: "A new logical replication subscription has been running for 30 minutes; pg_stat_subscription.received_lsn is advancing on the subscriber, but no rows are appearing in the target table. Walk me through how you debug this."
Solution Using staged inspection of WAL, slot, snapshot, apply worker
-- Step 1 — confirm WAL is being generated on the source
SELECT pg_current_wal_lsn(); -- should advance every second under load
-- Step 2 — confirm the slot is active and not stale
SELECT slot_name, active, active_pid, restart_lsn, confirmed_flush_lsn,
pg_current_wal_lsn() - restart_lsn AS retained_bytes
FROM pg_replication_slots
WHERE slot_name = 'orders_sub';
-- Step 3 — confirm the publication includes the right tables
SELECT pubname, tablename
FROM pg_publication_tables
WHERE pubname = 'orders_pub';
-- Step 4 — on subscriber, check the snapshot phase
SELECT subname, srrelid::regclass, srsubstate, srsublsn
FROM pg_subscription s
JOIN pg_subscription_rel sr ON s.oid = sr.srsubid
WHERE subname = 'orders_sub';
-- srsubstate values: i=init, d=copy in progress, s=synchronized, r=ready
-- Step 5 — on subscriber, look at apply worker logs
-- tail -f /var/log/postgresql/postgresql.log | grep -i logical
Step-by-step trace.
| Step | What you check | What "broken" looks like |
|---|---|---|
| 1. WAL is moving |
pg_current_wal_lsn() increments |
no WAL = no source activity |
| 2. Slot is active |
active = true, confirmed_flush_lsn advances |
active = false = consumer disconnected |
| 3. Publication has the table |
pg_publication_tables row exists |
empty = FOR TABLE orders was scoped wrong |
| 4. Snapshot done |
srsubstate = 'r' (ready) on the subscriber |
'i' or 'd' = snapshot still running |
| 5. Apply worker healthy | no error in log | "could not apply" = downstream bug |
In our scenario, Step 4 is the smoking gun — srsubstate = 'd' means the snapshot is still copying the table. received_lsn advances during snapshot but no rows commit on the subscriber side until snapshot completes. The fix is patience (for small tables) or copy_data = false plus a pre-populated target (for huge tables).
Output:
| Phase | srsubstate |
What rows you see on the subscriber |
|---|---|---|
| init | i |
none |
| copy in progress | d |
none (snapshot is one big serialisable read) |
| synchronized | s |
snapshot done, streaming caught up |
| ready | r |
live streaming, rows appear seconds after publisher commits |
Why this works — concept by concept:
- Slot vs snapshot vs apply — three independent stages that can each stall for different reasons. The slot is on the publisher; the snapshot is the initial table copy; the apply worker is the SQL DML on the subscriber. Diagnosing the right stage is half the work.
-
received_lsn vs applied_lsn —
received_lsnadvances as bytes arrive at the subscriber.applied_lsn(orlatest_end_lsn) advances as rows commit. The two diverge during a snapshot. -
srsubstate is the truth — the per-table subscription state is in
pg_subscription_rel. Always check this view first; it tells you which phase each table is in. -
pg_hba.conf is the silent killer — if the replication user can connect but cannot SELECT (missing
GRANT), the snapshot fails halfway. Always check the publisher's role grants. -
Cost — debugging is
O(stages × tables). With 5 stages and N tables, the matrix is small enough to walk top-down in a single 10-minute investigation. Senior signal is doing this systematically rather than guessing.
ETL
Topic — ETL
Logical replication debug drills
3. Replication slots — power and danger
A replication slot is a persistent cursor into the WAL — useful for crash-safe consumers, lethal when the consumer dies and the slot lives on
The mental model in one line: a replication slot is a server-side cursor that promises "I will not throw away WAL beyond this LSN until the named consumer has acked it" — which is exactly what makes CDC consumers crash-safe, and exactly what fills your disk when a consumer dies and nobody drops the slot. Once you internalise "a slot retains WAL," every senior replication slot interview question becomes a deduction from "who drops dead slots, and how do you bound the damage."
The two LSNs every slot tracks.
-
restart_lsn— the earliest WAL position Postgres must retain so the consumer can restart from a clean point. Postgres can recycle WAL strictly older thanrestart_lsnfor every slot. The minimumrestart_lsnacross all slots controls how much WAL accumulates on disk. -
confirmed_flush_lsn— the position the consumer has confirmed it has durably applied. The next stream restart will resume from here. For logical slots,confirmed_flush_lsnadvances on everyStandby status updatefrom the consumer. - The relationship:
restart_lsn ≤ confirmed_flush_lsn ≤ pg_current_wal_lsn(). Slot lag ispg_current_wal_lsn() - restart_lsn.
Slot types — physical vs logical.
-
Physical slot — used by streaming (physical) replication. Holds WAL for a hot-standby connection. Cursor advances with the standby's
pg_stat_replication.flush_lsn. -
Logical slot — used by logical replication. Bound to an output plugin (
pgoutput,wal2json, …) and a database. Cursor advances on consumer acks. - A given Postgres has
max_replication_slotsslots total (default 10). Each slot, of either kind, retains WAL until released.
The phantom slot — the production failure mode.
- A consumer disconnects (Debezium pod crashed, Fivetran connector paused, custom script terminated). The slot remains on the publisher, marked
active = false. - The slot still holds
restart_lsnat the consumer's last ack. WAL beyond that point keeps accumulating. - Hours later,
pg_wal/is 200GB and the database starts refusing writes ("no space left on device"). This is the phantom slot anti-pattern. Every senior Postgres outage post-mortem you have ever read has a section on it. - The fix: monitor
pg_replication_slots.confirmed_flush_lsnand alert if a slot has not advanced in N minutes; drop the slot when the consumer is gone for good.
max_slot_wal_keep_size — the brake.
- Introduced in PG13. Configurable cap on how much WAL a slot can retain. When exceeded, Postgres marks the slot invalidated (
wal_status = 'lost') and allows WAL recycling. - This trades consumer correctness for source survival: an invalidated slot must be dropped + recreated + the consumer must replay from a snapshot. But the source database stays up.
- Set this. Always.
max_slot_wal_keep_size = 50GB(or whatever fits 4–8 hours of WAL on your busiest hour) is the standard production recipe.
Slot inspection views.
-
pg_replication_slots— every slot, its plugin, its LSNs, itsactiveflag, itswal_status(reserved,extended,unreserved,lost). -
pg_stat_replication— every live replication connection (physical AND logical).state,sent_lsn,flush_lsn,replay_lsn. -
pg_stat_replication_slots— per-slot stats (spilled transactions, stream bytes). Useful for tuninglogical_decoding_work_mem.
Common interview probes on slots.
- "What does a replication slot actually hold — bytes? offsets?" — two LSNs (
restart_lsn,confirmed_flush_lsn) and metadata. The WAL itself is inpg_wal/; the slot just prevents recycling. - "What happens to a slot when the consumer disconnects?" —
active = falsebut the slot persists. WAL retention continues until you drop or invalidate the slot. - "What is
max_slot_wal_keep_size?" — the per-cluster cap on how much WAL a slot can retain. Beyond it, the slot is invalidated. - "How do you safely drop an active slot?" — terminate the WAL sender backend first (
pg_terminate_backend(active_pid)), thenpg_drop_replication_slot.
Worked example — inspecting a healthy slot
Detailed explanation. The first thing you do on any production Postgres CDC pipeline is inspect the slot. Three views — pg_replication_slots, pg_stat_replication, pg_current_wal_lsn — tell you whether the consumer is alive, how far behind it is, and how much WAL it is currently holding.
Question. Show the SQL that gives you a one-row-per-slot snapshot of slot health, including the lag in bytes and a derived "minutes of WAL" estimate.
Input.
| Slot | restart_lsn | confirmed_flush_lsn | active | wal_status |
|---|---|---|---|---|
| debezium_slot | 0/3A00 0000 | 0/3A00 0000 | true | reserved |
| fivetran_slot | 0/38FF 8000 | 0/38FF C000 | true | reserved |
Code.
SELECT slot_name,
plugin,
slot_type,
active,
active_pid,
wal_status,
pg_size_pretty(pg_current_wal_lsn() - restart_lsn) AS retained_wal,
pg_size_pretty(pg_current_wal_lsn() - confirmed_flush_lsn) AS lag,
pg_current_wal_lsn() AS source_lsn,
restart_lsn,
confirmed_flush_lsn
FROM pg_replication_slots
ORDER BY (pg_current_wal_lsn() - restart_lsn) DESC;
Step-by-step explanation.
-
pg_current_wal_lsn()is the publisher's current WAL position. Subtractingrestart_lsngives the retained WAL bytes — the actual disk pressure onpg_wal/. -
pg_current_wal_lsn() - confirmed_flush_lsngives the consumer's lag — how far behind real time it is. The difference between this andretained_walis the gap between "consumer ack" and "what restart needs." -
active = truemeans a WAL sender backend is connected.active = falseis the danger signal — slot persisting without a consumer is the phantom-slot scenario. -
wal_statusladders fromreserved(healthy) →extended(overwal_keep_size, still allowed) →unreserved(about to hitmax_slot_wal_keep_size) →lost(invalidated; slot must be dropped). -
pg_size_pretty(...)formats the byte count as1.2 GB/300 MB— easier to skim than raw bytes. Production dashboards do this and alert when any slot crosses1 GBretained.
Output.
| slot_name | active | retained_wal | lag | wal_status |
|---|---|---|---|---|
| debezium_slot | true | 14 MB | 0 bytes | reserved |
| fivetran_slot | true | 270 MB | 256 MB | reserved |
| zombie_slot | false | 18 GB | 18 GB | extended |
Rule of thumb. Two thresholds. Page on retained_wal > 5 GB for any slot. Auto-drop on active = false AND retained_wal > 20 GB. The first warns; the second protects the database from running out of disk.
Worked example — engineering a phantom slot (and then fixing it)
Detailed explanation. The fastest way to internalise the phantom-slot failure mode is to engineer one. Create a slot, do some DML, kill the consumer, watch WAL pile up, then inspect and clean up. Production teams run this drill as part of on-call training.
Question. Simulate a phantom slot on a test Postgres: create a slot, generate WAL, never read from it, observe pg_wal/ growth, then drop the slot to release the WAL.
Input.
| Step | Action |
|---|---|
| 1 | create slot, no consumer ever attaches |
| 2 | run a tight INSERT loop for a minute |
| 3 | inspect pg_wal/ size |
| 4 | drop the slot |
| 5 | inspect again — should drop back |
Code.
-- 1. Create the slot — no consumer attaches
SELECT pg_create_logical_replication_slot('phantom_slot', 'pgoutput');
-- 2. Generate WAL
DO $$
BEGIN
FOR i IN 1..1000000 LOOP
INSERT INTO orders(id, customer_id, amount, status)
VALUES (i, (i % 1000)+1, (random()*1000)::numeric(10,2), 'pending');
IF i % 50000 = 0 THEN COMMIT; END IF;
END LOOP;
END$$;
-- 3. Inspect
SELECT slot_name, pg_size_pretty(pg_current_wal_lsn() - restart_lsn) AS retained
FROM pg_replication_slots WHERE slot_name = 'phantom_slot';
-- Shell: du -sh /var/lib/postgresql/16/main/pg_wal
-- 4. Drop the slot — releases WAL
SELECT pg_drop_replication_slot('phantom_slot');
-- 5. Force a checkpoint so WAL files are recycled
CHECKPOINT;
-- Shell again: du -sh ... — should now shrink
Step-by-step explanation.
- The slot exists, plugin =
pgoutput, but no client ever callsSTART_REPLICATION SLOT phantom_slot LOGICAL …. Therestart_lsnis frozen at slot-creation time. - The INSERT loop generates ~150–300 MB of WAL on a small box. Without a slot, Postgres recycles WAL every checkpoint. With our phantom slot, none of it can be recycled —
pg_wal/grows by the full ~300 MB. -
pg_replication_slotsshowsretained ≈ 300 MBandactive = false. Theactive = falseflag is the signal that a real on-call alert would fire on. -
pg_drop_replication_slot('phantom_slot')removes the slot. Postgres no longer needs to retain its WAL. - Issuing
CHECKPOINTforces WAL recycling.pg_wal/shrinks. The database is back to normal.
Output.
| Stage |
pg_wal/ size |
slot count | slot active? |
|---|---|---|---|
| Before INSERT loop | 64 MB | 1 (phantom) | false |
| After INSERT loop (slot still there) | 320 MB | 1 (phantom) | false |
| After drop slot + CHECKPOINT | 64 MB | 0 | n/a |
Rule of thumb. Phantom slots are the most preventable Postgres CDC outage. Three guardrails kill it for good: (1) max_slot_wal_keep_size = 50GB; (2) on-call alert on active = false for any logical slot > 5 minutes; (3) the consumer's deploy script always either reuses the existing slot or drops the old one before creating a new one.
Worked example — max_slot_wal_keep_size as a circuit breaker
Detailed explanation. max_slot_wal_keep_size (PG13+) is the brake. When a slot's retained WAL exceeds the cap, Postgres invalidates the slot and recycles WAL anyway. The slot can be dropped + recreated; the consumer must replay from a snapshot. The database stays up — that is the win.
Question. Set max_slot_wal_keep_size = 8GB, then drive a slot past the cap (no consumer reading), and observe the invalidation.
Input.
| Knob | Value |
|---|---|
max_slot_wal_keep_size |
8GB |
| WAL generated with no consumer | 12GB |
Code.
-- 1. Set the cap (requires reload, not restart)
ALTER SYSTEM SET max_slot_wal_keep_size = '8GB';
SELECT pg_reload_conf();
-- 2. Create a slot, generate > 8GB WAL with no consumer attached
SELECT pg_create_logical_replication_slot('cap_test', 'pgoutput');
-- ... DML loop until pg_wal/ ~ 9GB ...
-- 3. Observe invalidation
SELECT slot_name, wal_status, active,
pg_size_pretty(pg_current_wal_lsn() - restart_lsn) AS retained
FROM pg_replication_slots WHERE slot_name = 'cap_test';
-- wal_status transitions: reserved → extended → unreserved → lost
-- 4. The slot is now 'lost' — drop it and recreate
SELECT pg_drop_replication_slot('cap_test');
SELECT pg_create_logical_replication_slot('cap_test', 'pgoutput');
-- consumer must replay from a snapshot
Step-by-step explanation.
-
max_slot_wal_keep_size = '8GB'caps each slot's retention at 8GB.pg_reload_conf()applies it without restart. - As the slot's retained WAL grows past 8GB, Postgres flips
wal_statusfromreservedtoextended(betweenwal_keep_sizeandmax_slot_wal_keep_size) tounreserved(about to exceed the cap) tolost(cap exceeded, WAL recycled, slot invalidated). - When
wal_status = 'lost', the slot can no longer be used for replication — the WAL it needed for restart has been recycled. Any consumer attempting to attach to this slot gets an error. - The recovery: drop the slot, create a new one, and the consumer must do a fresh snapshot. This is painful but bounded — the database does not run out of disk.
- The trade-off: a strict cap protects the database but breaks consumers that go offline longer than the cap covers. Set the cap to 4–8x your worst-case consumer downtime; alert at 50% of the cap so on-call gets a chance to intervene before invalidation.
Output.
| Retained WAL | wal_status |
Slot usable? | Database health |
|---|---|---|---|
| 2 GB | reserved | yes | healthy |
| 6 GB | extended | yes (over wal_keep_size) |
warning |
| 7.5 GB | unreserved | yes (close to cap) | page on-call |
| 9 GB | lost | NO (invalidated) | database stays up; consumer must reinit |
Rule of thumb. max_slot_wal_keep_size is the difference between "database falls over" and "consumer needs to replay." Always pick the latter. Set the cap to cover your acceptable consumer downtime and tune up if you keep invalidating slots, not the other way around.
Senior interview question on slot operations
A senior interviewer might ask: "A CDC consumer (Debezium) crashed three days ago and the team never noticed. pg_wal/ is now at 180GB and writes are starting to slow. Walk me through how you stabilise the database without losing the CDC stream."
Solution Using triage, invalidation, snapshot replay
-- Step 1 — identify the dead slot
SELECT slot_name, active, active_pid, wal_status,
pg_size_pretty(pg_current_wal_lsn() - restart_lsn) AS retained,
confirmed_flush_lsn
FROM pg_replication_slots
WHERE active = false
ORDER BY (pg_current_wal_lsn() - restart_lsn) DESC;
-- Step 2 — decide: revive the consumer (preserves stream) OR drop (loses gap)
-- Quick-revive path:
-- start a new Debezium pod with the SAME slot name
-- Debezium picks up from confirmed_flush_lsn, replays the 3-day gap
-- Crisis path (database about to die):
SELECT pg_drop_replication_slot('debezium_slot');
CHECKPOINT; -- recycle the WAL immediately
-- Recreate + snapshot from scratch:
-- delete Debezium connector state in Kafka Connect
-- redeploy with snapshot.mode='initial'
-- Debezium re-snapshots all tables, then streams from current WAL
Step-by-step trace.
| Step | Decision | Trade-off |
|---|---|---|
| 1. Diagnose |
active = false slot with 180 GB retained |
confirms phantom |
| 2. Revive vs drop | revive if consumer can come back within hours; drop if not | revive preserves stream; drop loses the gap |
| 3. Revive path | redeploy Debezium with same slot name | catches up via WAL replay (slow but lossless) |
| 4. Drop + snapshot path |
pg_drop_replication_slot → CHECKPOINT → restart with snapshot.mode='initial'
|
database recovers immediately; snapshot reseeds |
| 5. Post-mortem | add alert on active = false for any logical slot > 5 min |
prevents the next outage |
In our scenario, the revive path is preferred IF Debezium can come back quickly — but with 180GB of WAL retained, the replay alone will take hours, during which the database is still at risk. The pragmatic call is drop + re-snapshot unless the downstream genuinely cannot tolerate a snapshot reseed.
Output:
| Path | Database recovery | Consumer impact |
|---|---|---|
| Revive same slot | hours (WAL replay) | zero data loss |
| Drop + re-snapshot | seconds (after CHECKPOINT) |
downstream sees a brief "reset" — duplicates possible until first commit |
| Do nothing | database runs out of disk | full outage on the source |
Why this works — concept by concept:
-
Slot retention is the root cause — the slot's
restart_lsnis pinning 180GB of WAL. Until the slot is either advanced or dropped, that WAL stays. No amount ofVACUUMorCHECKPOINTreclaims it. - Revive preserves history — Debezium's consumer state in Kafka Connect remembers the last LSN it acked; reattaching the same slot resumes from there. Lossless but slow on huge gaps.
-
Drop forces a snapshot reseed — the downstream loses the unflushed deltas, but Debezium's
snapshot.mode = 'initial're-reads every table at the newconsistent_pointand resumes from there. Bounded recovery time. -
Alert on
active = false— the single highest-ROI alert in a Postgres CDC stack. Every minute of detection delay = MBs to GBs of additional WAL retention. -
Cost — drop-path is
O(snapshot), dominated by table size. Revive-path isO(WAL gap), dominated by retained WAL. Pick whichever is smaller; in 180 GB-of-WAL territory, snapshot usually wins.
SQL
Topic — SQL
Replication slot SQL problems
4. Publications and REPLICA IDENTITY
A publication subscription pair picks what gets replicated; REPLICA IDENTITY picks what columns the source ships for UPDATEs and DELETEs — get both wrong and your sink silently drops rows
The mental model in one line: a publication is the contract on the publisher ("these tables, these DML operations, optionally these rows and columns"); REPLICA IDENTITY is the contract on each table ("for UPDATE/DELETE, here is how to identify the row that changed") — and both must align with the subscriber's apply logic, or rows go missing without errors. Once you say that out loud, every senior publication subscription interview probe becomes a question about which knob is wrong.
Publications — the publisher-side contract.
-
CREATE PUBLICATION p FOR TABLE orders, payments— list of tables to ship. -
WITH (publish = 'insert, update, delete, truncate')— DML scope. Default is all four. Settingpublish = 'insert'ships only INSERTs (audit-log pattern). -
WHERE (status = 'shipped')— row filter (PG15+). Only matching rows ship. -
(id, amount)after the table name — column list (PG15+). Only listed columns ship. -
FOR ALL TABLES— every table in the database; auto-includes future tables. Used by Debezium when you want full-database CDC.
REPLICA IDENTITY — the per-table ID contract.
-
DEFAULT— ships the primary key columns in the change event. The standard case. Requires a PRIMARY KEY constraint on the table. -
FULL— ships every column of the previous row state. Used when you have no PK or when downstream needs the full pre-image. Doubles WAL volume for UPDATEs and DELETEs. -
USING INDEX idx_name— ships the columns of a specified UNIQUE index. Useful when the table has a natural unique key but no formal PK. -
NOTHING— ships nothing for UPDATE/DELETE. INSERTs still ship. The table effectively becomes INSERT-only for logical replication.
The classic gotcha — UPDATE on a table without PK.
- Default REPLICA IDENTITY = DEFAULT, which requires a PK.
- Run
UPDATE orders SET amount = 100 WHERE order_id = 5on a table without a PK. - The UPDATE succeeds on the publisher. On the subscriber, the apply worker logs
cannot find row to updatebecause no row identifier was shipped. - The row "disappears" from the subscriber's view. Every senior Postgres CDC engineer has seen this bug at least once.
- The fix: either add a PK (always preferred) or
ALTER TABLE … REPLICA IDENTITY FULL.
Row filters and column lists — combining them.
- Filters and column lists compose:
FOR TABLE orders (id, customer_id, amount) WHERE (status = 'shipped'). - A row that changes from matching the filter to not matching emits a DELETE on the subscriber (the row is no longer in the published set).
- A row that changes from not matching to matching emits an INSERT.
- Filters must reference only columns in the column list (or columns that are part of the REPLICA IDENTITY).
- Filters must be IMMUTABLE — no
now(), norandom(), no user-defined function unless explicitly marked.
TRUNCATE and DDL — the special cases.
-
TRUNCATE is replicated only if the publication's
publishlist includestruncate(default yes). The subscriber TRUNCATES its copy. -
DDL is NOT replicated in logical replication.
ALTER TABLE … ADD COLUMN …runs on the publisher only. The subscriber must add the column separately (and ideally first, so the next INSERT does not fail). -
Sequences are not replicated either. Use
CREATE PUBLICATION … FOR ALL SEQUENCES(PG16+) only for the special case; most teams skip it.
Common interview probes on publications + REPLICA IDENTITY.
- "Why does my UPDATE on a table without PK silently disappear on the subscriber?" — REPLICA IDENTITY DEFAULT requires a PK; without one, no row identifier is shipped.
- "What does
REPLICA IDENTITY FULLcost?" — doubles WAL for UPDATE/DELETE (must record the entire previous row state). - "How do you replicate a subset of a table?" —
CREATE PUBLICATION p FOR TABLE t (c1, c2) WHERE (cond)(PG15+). - "Does logical replication handle DDL?" — no, apply DDL on both sides via a migration tool.
Worked example — REPLICA IDENTITY FULL for a table with no primary key
Detailed explanation. A legacy audit_log table never had a PK — every row is a tuple of (actor, action, timestamp, payload) with natural duplicates. Logical replication of UPDATEs and DELETEs requires REPLICA IDENTITY. The fix: REPLICA IDENTITY FULL.
Question. Set up logical replication of audit_log (no PK) such that UPDATEs and DELETEs replicate correctly to the subscriber.
Input.
| Column | Type |
|---|---|
| actor | text |
| action | text |
| ts | timestamptz |
| payload | jsonb |
Code.
-- 1. Mark the table with REPLICA IDENTITY FULL
ALTER TABLE audit_log REPLICA IDENTITY FULL;
-- 2. Verify
SELECT relname, CASE relreplident
WHEN 'd' THEN 'default (PK)'
WHEN 'f' THEN 'full'
WHEN 'i' THEN 'using index'
WHEN 'n' THEN 'nothing'
END AS replica_identity
FROM pg_class WHERE relname = 'audit_log';
-- 3. Now create the publication
CREATE PUBLICATION audit_pub FOR TABLE audit_log;
Step-by-step explanation.
-
REPLICA IDENTITY FULLinstructs Postgres to record the entire previous row state in WAL whenever the row is UPDATED or DELETED. The change event now ships all four columns plus the new values, so the subscriber can locate the row by full-row equality. -
pg_class.relreplidentexposes the per-table setting.d= DEFAULT,f= FULL,i= USING INDEX,n= NOTHING. - Adding the publication is now safe — UPDATEs and DELETEs ship enough info for the subscriber to apply them.
- The cost is WAL volume. For an UPDATE on a row with
payloadas a 4KB JSON blob, WAL records the entire 4KB twice (old + new). On high-update tables this can double WAL size; budget accordingly. - The right long-term fix is to add a real PK (e.g., a
BIGSERIAL audit_idcolumn). REPLICA IDENTITY FULL is a tactical bridge while you can't backfill a PK.
Output.
| Knob | WAL volume cost on UPDATE | Subscriber locates row by |
|---|---|---|
REPLICA IDENTITY DEFAULT (PK) |
normal | primary key |
REPLICA IDENTITY FULL |
~2x | full previous row |
REPLICA IDENTITY USING INDEX uidx |
~1.2x | index columns |
REPLICA IDENTITY NOTHING |
UPDATE/DELETE error | n/a — silent drop |
Rule of thumb. Add a PK if you possibly can — that is always the cheapest. Use REPLICA IDENTITY FULL only when a PK is impractical (legacy schemas, write-once audit tables). Never leave REPLICA IDENTITY NOTHING on a table being replicated; the rows will silently vanish on the subscriber.
Worked example — subset replication with row filter
Detailed explanation. A multi-tenant SaaS replicates Postgres into a per-tenant Snowflake schema. Only the rows where tenant_id = $current_tenant should ship to each tenant's slot. Row filters (PG15+) do this cleanly.
Question. Create a publication that ships only tenant_id = 42's orders + payments, with column lists that exclude internal columns.
Input.
| Column | orders | payments |
|---|---|---|
| id (PK) | yes | yes |
| tenant_id | yes (filter only) | yes (filter only) |
| customer_id | yes (ship) | — |
| amount | yes (ship) | yes (ship) |
| internal_notes | yes (skip) | yes (skip) |
Code.
CREATE PUBLICATION tenant_42_pub
FOR TABLE orders (id, customer_id, amount) WHERE (tenant_id = 42),
TABLE payments (id, amount) WHERE (tenant_id = 42)
WITH (publish = 'insert, update, delete');
Step-by-step explanation.
- Each table can have its own row filter and column list, all inside the same publication.
- The filter
tenant_id = 42is evaluated during WAL decoding on the publisher — only matching rows ever enter the slot's event stream. Network and downstream both win. - The column list
(id, customer_id, amount)ships only those columns.tenant_idis referenced in the filter but does not ship (PG15+ allows filter-only columns when REPLICA IDENTITY DEFAULT covers them; if not, you must include the filter column in the column list). -
publish = 'insert, update, delete'is the default for non-truncate DML. Truncate is excluded here because TRUNCATE is table-wide and would clobber other tenants on the subscriber side. - A row that flips
tenant_idfrom42to99emits a DELETE on the subscriber (the row left the published set). A row that flips from99to42emits an INSERT. The wire correctly represents the membership change.
Output.
| Action on publisher | Effect on tenant 42's subscriber |
|---|---|
INSERT (id=1, tenant_id=42, amount=100)
|
INSERT (id=1, amount=100, customer_id=…)
|
INSERT (id=2, tenant_id=99, amount=50)
|
nothing (filtered out) |
UPDATE id=1 tenant_id 42 → 99
|
DELETE id=1 (left the published set) |
UPDATE id=2 tenant_id 99 → 42
|
INSERT (id=2, amount=50, customer_id=…) (joined the published set) |
DELETE id=1 (when tenant_id=42) |
DELETE id=1
|
Rule of thumb. Row filters are the cleanest tenant-isolation primitive Postgres ships. Per-tenant publication + per-tenant slot is the standard multi-tenant CDC architecture. Avoid putting the tenant filter on the subscriber side — it wastes WAL and network and creates a lateral data leak risk if the subscriber's filter has a bug.
Worked example — column list to redact PII from an analytics sink
Detailed explanation. A customers table contains email, phone, and address columns subject to PII regulations. The analytics warehouse sink only needs customer_id, signup_date, segment. Column lists (PG15+) let you redact PII at the publication layer, so the data never reaches the analytics warehouse.
Question. Publish customers to the analytics warehouse with only (customer_id, signup_date, segment), dropping email, phone, address entirely.
Input.
| Column | Ship to analytics? |
|---|---|
| customer_id (PK) | yes |
| signup_date | yes |
| segment | yes |
| NO (PII) | |
| phone | NO (PII) |
| address | NO (PII) |
Code.
CREATE PUBLICATION customers_analytics_pub
FOR TABLE customers (customer_id, signup_date, segment);
-- Verify the column list
SELECT pubname, attname
FROM pg_publication p
JOIN pg_publication_rel pr ON p.oid = pr.prpubid
JOIN pg_class c ON pr.prrelid = c.oid
JOIN pg_attribute a ON a.attrelid = c.oid AND a.attnum = ANY(pr.prattrs::int2[])
WHERE pubname = 'customers_analytics_pub'
ORDER BY a.attnum;
Step-by-step explanation.
-
FOR TABLE customers (customer_id, signup_date, segment)is the column list. Only these three columns ship —email,phone,addressnever appear in the change event payload. - PII redaction at the publication layer is stronger than column-level GRANTs because no privileged downstream tool can introspect the column. The data physically does not leave the publisher.
- The column list must include the REPLICA IDENTITY columns (the PK). Postgres rejects publications that omit them with "REPLICA IDENTITY columns must be in the column list."
- UPDATE of a non-listed column (e.g., a phone-number change) still emits a change event on WAL, but the event only contains the listed columns. If the row's listed columns did not change, the subscriber sees an UPDATE with no actual data change — harmless but slightly wasteful.
- Column lists compose with row filters — you can ship
tenant_id = 42rows with only PII-free columns, double-locked.
Output.
| Action on publisher | Wire event to analytics |
|---|---|
INSERT (id=1, email=a@x.com, phone=…, address=…, segment='gold', signup_date=…)
|
INSERT (id=1, segment='gold', signup_date=…)
|
UPDATE id=1 SET phone='…'
|
UPDATE id=1 with no real data change (segment and signup_date unchanged) |
UPDATE id=1 SET segment='platinum'
|
UPDATE id=1 SET segment='platinum'
|
DELETE id=1
|
DELETE id=1
|
Rule of thumb. PII redaction via column list is the strongest physical guarantee Postgres offers. Combined with a separate non-analytics publication for the OLTP pipeline, you get an "untrusted analytics sink" pattern that survives a downstream compromise.
Senior interview question on UPDATE replication
A senior interviewer might ask: "You set up logical replication on an events table that has no primary key. INSERTs are showing up on the subscriber but UPDATEs and DELETEs are not. Explain why, and walk through every option for fixing it."
Solution Using REPLICA IDENTITY analysis + the right knob for the schema
-- Diagnose: what is REPLICA IDENTITY for the table?
SELECT relname, relreplident
FROM pg_class WHERE relname = 'events';
-- relreplident = 'd' (DEFAULT) + no PK = no row identifier shipped for UPDATE/DELETE
-- Fix option A — add a primary key (the right answer if possible)
ALTER TABLE events ADD COLUMN event_id BIGSERIAL PRIMARY KEY;
-- Fix option B — use an existing unique index (if you have one)
CREATE UNIQUE INDEX events_uniq ON events(actor, action, ts);
ALTER TABLE events REPLICA IDENTITY USING INDEX events_uniq;
-- Fix option C — REPLICA IDENTITY FULL (last resort)
ALTER TABLE events REPLICA IDENTITY FULL;
-- 2x WAL cost on UPDATEs, full-row scan on the subscriber side
Step-by-step trace.
| Step | Diagnostic | Decision |
|---|---|---|
| 1. Inspect REPLICA IDENTITY |
relreplident = 'd', no PK |
UPDATEs ship no row id → subscriber cannot find row |
| 2. Can we add a PK? | yes — a BIGSERIAL event_id makes sense |
pick option A |
| 3. If table is write-once + huge | adding a PK locks the table for hours | pick option B (existing unique index) or C (FULL) |
| 4. If no unique index either | natural duplicates exist | pick option C (FULL) and accept the 2x WAL cost |
| 5. After fix | verify with pg_class.relreplident flipped to d / i / f
|
re-run an UPDATE → check subscriber |
In our scenario, option A is the right answer if the schema can take it. If the table is multi-TB and live, option B (existing unique index) is the cheapest middle ground. Option C is reserved for genuinely PK-less tables where you cannot add a unique index either.
Output:
| Option | WAL cost on UPDATE | Subscriber locates row by | Effort |
|---|---|---|---|
| A — add PK | normal | PK | medium (DDL + maybe backfill) |
| B — REPLICA IDENTITY USING INDEX | ~1.2x | index columns | low (DDL only) |
| C — REPLICA IDENTITY FULL | ~2x | full previous row | low (single ALTER) |
| Do nothing | n/a — UPDATEs silently vanish | — | zero — but the bug is unsurvivable |
Why this works — concept by concept:
- REPLICA IDENTITY is per-table, not per-publication — you set it once on the table; every publication that includes the table inherits it.
- DEFAULT requires a PK — the most common bug is "everyone assumed REPLICA IDENTITY DEFAULT works" without realising the table has no PK. The fix is always an explicit ALTER.
- FULL is a tax, not a feature — useful for legacy tables but doubles WAL. Treat as the last resort, not the first.
- USING INDEX bridges the gap — when the table has a natural unique key but no formal PK, USING INDEX gives you the cheaper REPLICA IDENTITY without a PK constraint.
-
Cost — WAL cost scales with
O(updates × full_row_size)for FULL,O(updates × index_size)for USING INDEX,O(updates × pk_size)for DEFAULT. On a heavy-update table the difference is large; pick the cheapest viable option.
SQL
Topic — SQL
REPLICA IDENTITY SQL problems
5. CDC into Kafka via Debezium
debezium postgres is a Kafka Connect source that talks the logical replication protocol — snapshot first, then stream, with heartbeats to keep idle slots fresh and a signal table for ad-hoc snapshots
The mental model in one line: Debezium is a Kafka Connect source connector that opens a logical replication connection to Postgres, creates (or reuses) a slot, takes an initial snapshot of every published table, and then streams every WAL change as a Kafka record on a per-table topic — with heartbeats to advance idle slots and a signal table to trigger ad-hoc snapshots. Once you internalise "Debezium is a pgoutput consumer with Kafka Connect ergonomics," every production postgres cdc interview question reduces to "what Debezium knob covers this case."
Debezium Postgres connector — the moving parts.
- Kafka Connect — runs the connector. Provides offset storage, schema-history storage, REST API for config, distributed worker scaling.
- Debezium PG connector — Java code that opens a JDBC connection (for snapshot) AND a replication connection (for streaming).
-
pgoutputplugin — the Debezium 2.x default. Reads WAL on the publisher and emits change events. -
Per-table Kafka topics — Debezium emits each row change to
{topic-prefix}.{schema}.{table}(e.g.,app.public.orders). One topic per table. - Schema registry — Avro / Protobuf schemas published per-table. Snapshot phase creates the schemas; streaming phase emits records with them.
Snapshot + stream — the two phases.
-
Snapshot mode
initial(default) — at connector start, Debezium opens a serialisable transaction on the publisher, takes a snapshot LSN (consistent_point), reads every published table viaSELECT *, and emits each row as a Kafka INSERT event tagged withop = 'r'(read). When the snapshot finishes, Debezium switches to streaming mode and reads from the slot starting atconsistent_point. -
Snapshot mode
never— skip the snapshot entirely, start streaming from the current LSN. Use when you have pre-loaded the target viapg_dump. -
Snapshot mode
always— always re-snapshot on connector start. Useful for debugging; never use in production. -
Snapshot mode
initial_only— snapshot once, never stream. One-off bulk load.
Heartbeats — keeping idle slots fresh.
- If a published table has no DML for hours, the slot's
confirmed_flush_lsndoes not advance. Other (unrelated) WAL on the database still accumulates; the slot holds it. - The fix:
heartbeat.interval.ms = 30000— every 30s Debezium writes a no-op row to adebezium_heartbeattable on the publisher (which is included in the publication), which advances the slot. - Without heartbeats, an idle CDC table on a busy database can pin GBs of WAL. With heartbeats, the slot never falls behind.
Signal table — the ad-hoc operations channel.
- A small table
debezium_signal (id text, type text, data text)that Debezium watches. - Insert a row like
('snapshot-1', 'execute-snapshot', '{"data-collections":["public.orders"]}')and Debezium runs an incremental snapshot oforderson the next polling tick. - Used for re-snapshotting a single table after a schema migration, without restarting the whole connector or losing your place on the slot.
Schema evolution — the great gotcha.
- A new column added on the publisher: Debezium detects the schema change at the next WAL record, updates the Kafka schema, and starts emitting events with the new column populated.
- A column dropped on the publisher: Debezium emits subsequent events with the column omitted, but the Kafka schema may still reference it (depending on registry compat mode).
- A column type changed (
text→int): often a breaking schema change. Debezium may halt with aSchemaChangeException. Plan migrations as two-step (add new column → backfill → drop old column) to avoid in-flight breaks.
Common interview probes on Debezium.
- "How does Debezium catch a new subscriber up — replay all WAL?" — no, initial snapshot via
SELECT *atconsistent_point, then stream from there. - "What is a heartbeat for?" — advance the slot when the published tables are idle so other-table WAL does not pile up.
- "How do you re-snapshot one table after a schema migration?" — insert a row into the signal table requesting an incremental snapshot of that table.
- "How do you guarantee exactly-once into the Kafka topic?" — at-least-once + idempotent consumer; the source connector cannot itself give EOS, only the consumer can.
Worked example — minimal Debezium PG connector config
Detailed explanation. The smallest Debezium connector that actually works. JSON posted to the Kafka Connect REST API. Highlights the must-have knobs: connection string, slot name, publication name, snapshot mode, heartbeat interval.
Question. Write a Debezium Postgres connector config for an orders + payments CDC pipeline, with heartbeat and a single slot, using pgoutput.
Input.
| Setting | Value |
|---|---|
| publisher | pg-prod:5432 |
| database | app |
| tables | public.orders, public.payments |
| target | Kafka topic prefix prod.app
|
| heartbeat | 30s |
Code.
{
"name": "pg-orders-cdc",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"tasks.max": "1",
"database.hostname": "pg-prod",
"database.port": "5432",
"database.user": "debezium",
"database.password": "${file:/secrets/debezium.pwd}",
"database.dbname": "app",
"topic.prefix": "prod.app",
"schema.include.list": "public",
"table.include.list": "public.orders,public.payments",
"plugin.name": "pgoutput",
"slot.name": "debezium_orders_slot",
"publication.name": "debezium_orders_pub",
"publication.autocreate.mode": "filtered",
"snapshot.mode": "initial",
"snapshot.isolation.mode": "serializable",
"heartbeat.interval.ms": "30000",
"heartbeat.action.query": "INSERT INTO public.debezium_heartbeat(ts) VALUES (now()) ON CONFLICT (ts) DO NOTHING",
"signal.data.collection": "public.debezium_signal",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://schema-registry:8081"
}
}
Step-by-step explanation.
-
connector.class = PostgresConnectorandtasks.max = 1— Debezium PG always runs a single task because the slot itself is single-consumer. -
plugin.name = 'pgoutput',slot.name,publication.nameare the three knobs that wire Debezium to the logical replication primitives. Withpublication.autocreate.mode = 'filtered'Debezium creates a publication scoped totable.include.listif one does not exist. -
snapshot.mode = 'initial'— on first start, Debezium snapshots both tables under a serialisable transaction, emitting each row withop = 'r'. On subsequent restarts, Debezium picks up from the slot's last LSN and skips the snapshot. -
heartbeat.interval.ms = 30000+heartbeat.action.querywrites a no-op row every 30s to keep the slot advancing on idle tables. TheON CONFLICT … DO NOTHINGkeeps the heartbeat table from growing unbounded. -
signal.data.collection = 'public.debezium_signal'tells Debezium which table to watch for ad-hoc commands (incremental snapshots, log markers). - The Avro converter + schema registry wires Kafka records to a typed schema; downstream consumers in Java / Python / etc. get typed access without parsing JSON.
Output (per-table Kafka topics emitted).
| Table | Topic |
|---|---|
public.orders |
prod.app.public.orders |
public.payments |
prod.app.public.payments |
| heartbeat | prod.app.public.debezium_heartbeat |
| schema changes | prod.app.schema-changes |
Rule of thumb. Minimal viable Debezium = connector + slot + publication + snapshot.mode=initial + heartbeat. Add signal table for operability. Add Avro + schema registry for typed downstream consumption. Skip schema registry only if all your downstream consumers can speak JSON.
Worked example — signal-table ad-hoc snapshot after a schema change
Detailed explanation. A new column tax_amount was added to orders on the publisher. The historical rows do not yet have the column populated in Kafka. The fix: trigger an incremental snapshot via the signal table, which re-reads every row of orders and emits them as Kafka events with the new schema.
Question. Use Debezium's signal table to trigger an incremental snapshot of public.orders after a schema migration added tax_amount.
Input.
| Step | Action |
|---|---|
| 1 | On publisher: ALTER TABLE orders ADD COLUMN tax_amount numeric
|
| 2 | On subscriber (Kafka downstream): ALTER TABLE orders ADD COLUMN tax_amount numeric (or schema evolution in the warehouse) |
| 3 | Backfill tax_amount on the publisher historical rows |
| 4 | Trigger incremental snapshot via signal table |
Code.
-- On the publisher
ALTER TABLE orders ADD COLUMN tax_amount numeric;
UPDATE orders SET tax_amount = round(amount * 0.08, 2);
-- Trigger Debezium incremental snapshot of orders
INSERT INTO debezium_signal(id, type, data)
VALUES ('snapshot-orders-2026-06-22',
'execute-snapshot',
'{
"data-collections": ["public.orders"],
"type": "incremental"
}');
-- Debezium picks up the signal on its next poll (sub-second)
-- and starts a chunk-by-chunk SELECT * of orders.
-- Each row is emitted to Kafka with op='r' (read).
-- Streaming continues concurrently — Debezium interleaves chunks with live changes.
Step-by-step explanation.
- The schema migration on the publisher adds the column. Debezium notices the schema change at the next WAL event and updates the Kafka topic schema. New INSERTs already carry
tax_amount. - Historical rows on the publisher still need their
tax_amountpopulated in Kafka. A full re-snapshot would work but takes the connector offline. Instead, use the signal table. - The signal row is an INSERT into
public.debezium_signal. Debezium polls that table every ~5s and picks up the request. - The incremental snapshot reads
ordersin chunks (configurable viaincremental.snapshot.chunk.size), emitting each row withop = 'r'. Critically, streaming continues during the snapshot — concurrent INSERTs / UPDATEs / DELETEs interleave with snapshot chunks; Debezium's chunk algorithm handles the consistency. - Downstream consumers see snapshot rows tagged with
op = 'r'. Idempotent consumers ignore re-reads of rows they already have; non-idempotent consumers may produce duplicates and need an upsert sink.
Output.
| Phase | What Debezium emits |
|---|---|
| Pre-signal | live INSERT / UPDATE / DELETE events with op = c/u/d
|
| Signal received | log marker + start of incremental snapshot |
| Snapshot chunk 1 | rows 1..1000 with op = 'r' (interleaved with live events) |
| Snapshot chunk 2 | rows 1001..2000 with op = 'r' (interleaved) |
| Snapshot done | log marker, back to streaming-only |
Rule of thumb. Use incremental snapshots whenever you need to re-read a table without restarting the connector. The chunk-by-chunk + interleaved-live design is one of Debezium's best operational features; learn it for senior interviews because it is a common probe.
Worked example — idempotent sink on at-least-once delivery
Detailed explanation. Debezium's source side is at-least-once: on restart, it may re-emit events between the last acked LSN and the latest position. End-to-end exactly-once requires either Kafka transactions all the way through (rare in CDC) or an idempotent sink that absorbs duplicates. The Postgres CDC standard is upsert-by-PK on the sink.
Question. Write a sink consumer that absorbs duplicate Debezium events on orders into a Postgres target table idempotently, regardless of how many times Debezium replays.
Input — Kafka topic prod.app.public.orders.
| op | id (PK) | customer_id | amount |
|---|---|---|---|
| c | 1 | 42 | 100 |
| u | 1 | 42 | 150 |
| u | 1 | 42 | 150 (duplicate replay) |
| d | 1 | — | — |
Code.
# Idempotent sink — upsert on PK
import psycopg
from confluent_kafka import Consumer
import json
consumer = Consumer({'bootstrap.servers': 'kafka:9092',
'group.id': 'orders-sink',
'enable.auto.commit': False})
consumer.subscribe(['prod.app.public.orders'])
conn = psycopg.connect("host=warehouse dbname=dw user=loader")
UPSERT_SQL = """
INSERT INTO dw.orders (id, customer_id, amount, source_lsn)
VALUES (%(id)s, %(customer_id)s, %(amount)s, %(lsn)s)
ON CONFLICT (id) DO UPDATE
SET customer_id = EXCLUDED.customer_id,
amount = EXCLUDED.amount,
source_lsn = EXCLUDED.source_lsn
WHERE EXCLUDED.source_lsn > dw.orders.source_lsn;
"""
DELETE_SQL = "DELETE FROM dw.orders WHERE id = %(id)s AND source_lsn < %(lsn)s;"
while True:
msg = consumer.poll(1.0)
if msg is None: continue
evt = json.loads(msg.value())
payload = evt['payload']
op = payload['op']
lsn = payload['source']['lsn']
after = payload.get('after') or {}
before = payload.get('before') or {}
with conn.cursor() as cur:
if op in ('c', 'r', 'u'):
cur.execute(UPSERT_SQL, {**after, 'lsn': lsn})
elif op == 'd':
cur.execute(DELETE_SQL, {'id': before['id'], 'lsn': lsn})
conn.commit()
consumer.commit(msg)
Step-by-step explanation.
- The sink table has a
source_lsncolumn. Every upsert records the LSN of the source event. On a replay, the LSN of the duplicate event is<=the existingsource_lsn, so theWHERE EXCLUDED.source_lsn > dw.orders.source_lsnclause skips it. - INSERT / READ / UPDATE all funnel into the same upsert — Debezium's
optag indicates the source operation but the apply logic on the sink is "upsert the row." This is the idempotency trick. - DELETE is also LSN-guarded — only delete the row if the source LSN is newer than the last recorded one. Protects against out-of-order replay where a stale DELETE arrives after a fresh INSERT.
-
consumer.commit(msg)is after the Postgrescommit— at-least-once on the Kafka side, with the database commit providing the deduplication boundary. If the consumer crashes between DB commit and Kafka commit, the next start replays the message and the upsert is a no-op. - Cost: each replayed event still pays a round-trip to Postgres for the dedup check. For high-volume streams, batch the upserts with
executemanyand pre-filter duplicates on a sliding window of recently-seen LSNs.
Output.
| Event | sink table state after |
|---|---|
c id=1 amount=100 lsn=10 |
{id=1, amount=100, source_lsn=10} |
u id=1 amount=150 lsn=20 |
{id=1, amount=150, source_lsn=20} |
u id=1 amount=150 lsn=20 (replay) |
unchanged (LSN 20 not > 20) |
d id=1 lsn=30 |
row deleted (LSN 30 > 20) |
Rule of thumb. The Debezium contract is at-least-once on the source side; idempotency lives in the sink. Always include the source LSN (or transaction ID) on the sink row and guard upserts + deletes with it. This is the senior-DE pattern that distinguishes a working CDC pipeline from one that silently corrupts data on every replay.
Senior interview question on Debezium production patterns
A senior interviewer might ask: "Walk me through how you'd ship a Postgres-to-Kafka CDC pipeline to production. What are the five production knobs you tune, and what failure does each one cover?"
Solution Using the five-knob production checklist
Debezium PG → Kafka production knobs
====================================
1. slot.name + publication.name (explicit, stable)
Covers: slot leakage on connector redeploy.
Anti-pattern: letting Debezium auto-name (random name on each restart).
2. snapshot.mode = 'initial' + signal table enabled
Covers: schema migrations, single-table re-snapshot without full restart.
Anti-pattern: snapshot.mode='always' (re-snapshots every restart, kills throughput).
3. heartbeat.interval.ms = 30000 + heartbeat.action.query
Covers: idle-table slot stall when other tables generate WAL.
Anti-pattern: no heartbeat — slot lag grows when CDC tables are idle.
4. max_slot_wal_keep_size on Postgres + monitoring on retained WAL
Covers: phantom-slot disk-fill outage.
Anti-pattern: unbounded slot retention.
5. Idempotent sink with source-LSN dedup
Covers: at-least-once replay duplicates.
Anti-pattern: trusting "exactly-once" without an LSN guard on the sink.
Step-by-step trace.
| Knob | Default | Production value | Failure it covers |
|---|---|---|---|
slot.name |
random | explicit, e.g. debezium_orders_slot
|
slot leak on redeploy |
publication.name |
dbz_publication |
explicit, e.g. debezium_orders_pub
|
conflicts with other connectors |
snapshot.mode |
initial |
initial |
full reseed only on first start |
heartbeat.interval.ms |
0 (disabled) | 30000 | idle-table slot stall |
max_slot_wal_keep_size (PG) |
-1 (unbounded) | 8GB–50GB | phantom-slot disk-fill |
Sink upsert with source_lsn guard |
n/a (you write this) | required | duplicate replay |
After applying all five, the pipeline survives a connector restart (knob 1, 2, 3), survives the source database under WAL pressure (knob 4), and survives the at-least-once replay (knob 5). This is the production-grade debezium postgres baseline.
Output:
| Failure mode | Without knobs | With all five knobs |
|---|---|---|
| Connector redeploy leaks slots | yes | no (explicit slot name reused) |
| Schema migration breaks the connector | yes | no (signal-table incremental snapshot) |
| Slot lag grows on idle CDC tables | yes | no (heartbeat) |
| Phantom slot fills disk | yes | bounded at max_slot_wal_keep_size
|
| Replay duplicates corrupt the sink | yes | no (LSN guard) |
Why this works — concept by concept:
- Explicit names are the operability primitive — slots and publications outlive the connector; naming them explicitly makes them inspectable and reattachable.
-
Heartbeat advances the slot — the slot's
confirmed_flush_lsnadvances only on consumer ack of some table in the publication. Heartbeat manufactures activity on a dummy table, keeping the slot moving even when the real tables are idle. - Signal table = ad-hoc operations channel — incremental snapshot, log marker, custom commands. The lowest-friction way to give Debezium an instruction without restarting it.
- max_slot_wal_keep_size protects the source — the brake of last resort. Pair with a monitoring alert at 50% of the cap so you intervene before invalidation.
-
Cost — knobs 1–4 are configuration only (zero ongoing cost). Knob 5 (idempotent sink) is one extra column + one extra
WHEREclause per upsert. The total cost of "production-grade" is a one-time setup; the cost of skipping it is a midnight outage.
ETL
Topic — ETL
Debezium production drills
Streaming
Topic — streaming
Postgres CDC streaming problems
Cheat sheet — logical replication recipes
-
Full-loop 6-command setup. On publisher:
ALTER SYSTEM SET wal_level = 'logical'+ restart;CREATE PUBLICATION p FOR TABLE t;CREATE ROLE replicator REPLICATION LOGIN. On subscriber:CREATE TABLE t (…)(same schema);CREATE SUBSCRIPTION s CONNECTION '…' PUBLICATION p; verify viapg_stat_subscription. Memorise this — it is the floor of everypostgres logical replicationinterview. -
Slot health one-liner.
SELECT slot_name, active, wal_status, pg_size_pretty(pg_current_wal_lsn() - restart_lsn) AS retained, pg_size_pretty(pg_current_wal_lsn() - confirmed_flush_lsn) AS lag FROM pg_replication_slots ORDER BY 4 DESC;. Run this every time you suspect a phantom slot. -
REPLICA IDENTITY ladder.
DEFAULT(PK; cheapest) →USING INDEX uidx(existing unique index; ~1.2x WAL) →FULL(entire previous row; ~2x WAL) →NOTHING(never use on replicated tables — UPDATEs silently vanish). -
Slot disk-fill prevention.
max_slot_wal_keep_size = 8GB(or 4–8x your worst-case consumer downtime) + monitoring alert onactive = falsefor any logical slot > 5 minutes. The two together kill the phantom-slot outage. -
Pick a plugin. Default to
pgoutput(in-tree, binary, supported by Debezium 2.x and Postgres native subscriptions). Switch towal2jsononly for custom consumers that benefit from JSON readability.decoderbufsis legacy Debezium 1.x. -
Row filter + column list (PG15+).
CREATE PUBLICATION p FOR TABLE t (c1, c2, c3) WHERE (tenant_id = $X). Filters must reference REPLICA IDENTITY columns; column list must include them. Combine for per-tenant + PII-redacted feeds. - DDL is NOT replicated. Apply schema changes on both sides via a migration tool (Liquibase / Flyway / Sqitch). Add columns on the subscriber FIRST to avoid in-flight apply errors.
-
Debezium minimum config.
plugin.name=pgoutput, explicitslot.name, explicitpublication.name,snapshot.mode=initial,heartbeat.interval.ms=30000,signal.data.collection=public.debezium_signal. Five knobs cover 90% of production cases. -
Heartbeat keeps the slot moving.
heartbeat.interval.ms = 30000+INSERT INTO debezium_heartbeat (ts) VALUES (now())action query. Without it, idle CDC tables on a busy database silently pin GBs of WAL. -
Signal-table snapshot.
INSERT INTO debezium_signal(id, type, data) VALUES ('id-123', 'execute-snapshot', '{"data-collections":["public.orders"], "type":"incremental"}')— re-snapshot one table without restarting the connector. -
Idempotent sink with LSN guard. Every sink row carries a
source_lsncolumn; every upsert isWHERE EXCLUDED.source_lsn > target.source_lsn. Absorbs at-least-once replays without corruption. -
Postgres → Snowflake / Iceberg / BigQuery. Use a managed CDC tool (Fivetran, Airbyte, AWS DMS) on top of
pgoutput. You still own the slot, the publication, and the REPLICA IDENTITY — the vendor only owns the consumer. - Postgres → Postgres major-version upgrade. Set up logical replication from old (PG13) to new (PG16) with matching schemas, let it catch up, then cut over with a brief application pause. The shortest-downtime Postgres major upgrade pattern in 2026.
-
PG16 parallel apply.
CREATE SUBSCRIPTION … WITH (streaming = 'parallel')applies large transactions in parallel on the subscriber. Big win for write-heavy publishers. -
Postgres on RDS / Cloud SQL / Aurora. All major managed Postgres services expose
wal_level = logicalvia a parameter group flag and shippgoutput. Slot lifecycle and REPLICA IDENTITY semantics are identical to self-hosted; the management plane differs.
Frequently asked questions
What is the difference between physical and logical replication in Postgres?
Physical (streaming) replication ships byte-for-byte WAL between two Postgres clusters of the same major version with identical schemas — used for high-availability standbys and read replicas. Logical replication uses logical decoding (pgoutput / wal2json / decoderbufs) to ship decoded row-level changes that any consumer can apply — same Postgres, a newer Postgres major version, Kafka via Debezium, Snowflake via Fivetran, anything. Physical = HA / read replicas / PITR; logical = CDC / cross-version upgrades / data lakes. The big asymmetry: logical replication does NOT replicate DDL — you apply schema changes on both sides separately. Picking physical when you needed logical (or vice versa) is the most common postgres logical replication design mistake.
Why do UPDATEs and DELETEs fail without a primary key?
Because REPLICA IDENTITY DEFAULT requires a PRIMARY KEY — Postgres ships the PK columns in the change event so the subscriber can locate the row. Without a PK and with default REPLICA IDENTITY, UPDATE and DELETE events ship no row identifier; the subscriber's apply worker logs cannot find row to update (or silently drops the operation, depending on Postgres version). The fix is one of three: add a real PK (best), set REPLICA IDENTITY USING INDEX <unique_index> (cheap if you have one), or REPLICA IDENTITY FULL (~2x WAL volume — last resort for legacy tables). Every senior Postgres CDC engineer has hit this bug at least once; it is a guaranteed interview probe.
What is a replication slot and why is it dangerous?
A replication slot is a persistent server-side cursor that promises "I will not throw away WAL beyond this LSN until the named consumer has acked it" — which is exactly what makes logical replication crash-safe, and exactly what fills your disk when a consumer dies and nobody drops the slot. The dangerous case is the phantom slot: a consumer (Debezium pod, Fivetran connector, custom script) disconnects, the slot remains marked active = false, and WAL keeps accumulating until the database runs out of disk. Two guardrails kill this outage: max_slot_wal_keep_size = 50GB (PG13+) as the brake, plus an on-call alert on active = false for any logical slot > 5 minutes. Both are mandatory production knobs; skipping either is the most preventable Postgres outage.
Can Postgres logical replication replicate DDL?
No. Logical replication only ships DML (INSERT, UPDATE, DELETE, TRUNCATE if enabled). DDL — ALTER TABLE … ADD COLUMN …, new tables, schema migrations — must be applied on both sides manually or via a separate migration tool (Liquibase, Flyway, Sqitch). The recommended pattern: apply schema changes on the subscriber FIRST, then on the publisher. This avoids the failure mode where the publisher emits an INSERT with a new column the subscriber does not have. PG16 added some infrastructure for future DDL replication, but as of 2026 it remains application-managed. Treating DDL like DML is the second-most-common pg_logical mistake (after the phantom slot).
How do I prevent WAL from filling the disk in a CDC pipeline?
Stack four defences. (1) max_slot_wal_keep_size = 8GB-50GB on Postgres — caps per-slot retention; beyond it the slot is invalidated but the database stays up. (2) Monitor pg_replication_slots.active — alert immediately if any logical slot reports active = false for more than 5 minutes. (3) Heartbeat in your CDC consumer (Debezium: heartbeat.interval.ms = 30000 plus heartbeat.action.query) — keeps slots advancing even when published tables are idle. (4) Auto-drop dead slots — if a slot reports active = false for > 30 minutes AND no recovery is in progress, drop it. The combination converts the phantom-slot midnight outage into a 4pm Slack ping. Skip any one of the four and you eventually get the outage anyway.
Postgres CDC — Debezium vs Fivetran vs Snowflake's Postgres connector?
All three sit on top of pgoutput and read from a Postgres replication slot — the underlying primitive is identical. Debezium is open-source, runs on your Kafka Connect, gives you full control of the slot, the publication, the snapshot policy, and the destination Kafka topic structure — best when you have a Kafka platform and want full control. Fivetran is managed, hides the slot lifecycle, picks pgoutput, manages the schema in Snowflake / BigQuery — best when you want zero ops and are willing to pay per-row. Snowflake's native Postgres connector is even more vertically integrated — Snowflake owns both ends, so the schema-mapping story is simpler — best when you are a Snowflake-first shop. The senior interview answer: regardless of vendor, you still own wal_level = logical, the slot, the publication, the REPLICA IDENTITY, and max_slot_wal_keep_size. Pick the vendor on consumer ergonomics; the Postgres-side responsibilities are identical.
Practice on PipeCode
- Drill the ETL practice library → for the snapshot + stream + idempotent-sink family of CDC probes.
- Rehearse on the streaming practice library → for the Kafka topic + at-least-once + replay patterns that sit downstream of Debezium.
- Sharpen the SQL practice library → for the
pg_replication_slots/pg_stat_subscriptionintrospection queries every senior interview probes. - Stack the joins practice library → for the publisher-subscriber co-ordination + multi-table CDC drills.
- Layer the aggregation practice library → for the windowed downstream consumers that turn CDC streams into business metrics.
Lock in Postgres CDC muscle memory
Postgres docs explain slots + publications. PipeCode drills explain the decision — when REPLICA IDENTITY FULL is unavoidable, when a phantom slot fills your disk, when Debezium beats Fivetran for cost. Pipecode.ai is Leetcode for Data Engineering — pattern-first practice tuned for the production trade-offs senior data engineers actually face.





Top comments (0)