137Foundry

Posted on Jun 11

How to Set Up Debezium With PostgreSQL for Production CDC

#data #postgres #automation

Debezium is the open-source change data capture connector that most teams reach for first when they want to stream PostgreSQL changes into Kafka. The five-minute getting-started demo on the Debezium homepage works on the first try, which is part of the reason it gets adopted so widely. Getting Debezium into production, where it has to survive restarts and schema changes and consumer lag, is a longer conversation.

This is a step-by-step guide to the configuration that matters once you move past the demo. Each step is one specific decision; getting any of them wrong creates a production failure mode that is hard to debug after the fact.

Step 1: configure the PostgreSQL WAL level

Debezium reads the PostgreSQL write-ahead log (WAL) through logical decoding. The default wal_level in PostgreSQL 10 and later is replica, which records enough for crash recovery and physical replication but not enough for logical decoding.

Set the WAL level to logical in the PostgreSQL configuration:

wal_level = logical
max_wal_senders = 10
max_replication_slots = 10

The max_wal_senders setting governs how many concurrent replication connections the database can support. The max_replication_slots setting governs how many replication slots can exist at once. Both default to 10 in PostgreSQL 10 and later, so the values shown here are the defaults. That budget is shared with physical standbys and base backups, and it will block Debezium once it is used up. Set both to roughly 2x the number of Debezium connectors you expect to run, plus whatever your replicas need.

These settings require a database restart. Plan for that.
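Before scheduling the restart, it can help to check the current values against what you plan to run. The following is a minimal sketch, not part of Debezium: check_wal_settings is a hypothetical helper, and it assumes you have already read the three settings back from the database (for example with SHOW wal_level) into a plain dictionary.

```python
def check_wal_settings(settings: dict[str, str], connectors: int) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK.

    `settings` is assumed to hold the values read back from
    SHOW wal_level, SHOW max_wal_senders, and SHOW max_replication_slots.
    """
    problems = []
    if settings.get("wal_level") != "logical":
        problems.append("wal_level must be 'logical'")

    # The 2x headroom is this guide's rule of thumb, not a PostgreSQL requirement.
    needed = 2 * connectors
    for name in ("max_wal_senders", "max_replication_slots"):
        value = int(settings.get(name, "0"))
        if value < needed:
            problems.append(f"{name}={value} is below the suggested {needed}")
    return problems


if __name__ == "__main__":
    current = {
        "wal_level": "replica",
        "max_wal_senders": "10",
        "max_replication_slots": "4",
    }
    for problem in check_wal_settings(current, connectors=3):
        print(problem)
```

An empty result means the three settings already satisfy the rule of thumb and no restart is needed for this step.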

Step 2: create a replication user with the right privileges

Do not run Debezium as a superuser. Create a dedicated user with the minimum privileges needed:

CREATE ROLE debezium WITH REPLICATION LOGIN PASSWORD 'strong-password-here';
GRANT CONNECT ON DATABASE your_db TO debezium;
GRANT USAGE ON SCHEMA public TO debezium;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO debezium;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO debezium;

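REPLICATION is a role attribute rather than a grantable privilege, which is why it is set in CREATE ROLE. SELECT on the tables is needed for the initial snapshot phase. The ALTER DEFAULT PRIVILEGES line makes tables created later readable as well, but it only applies to tables created by the role that runs the statement; if a different role owns your migrations, run it as that role or add FOR ROLE. Without it, you have to grant manually every time the schema grows. Creating a publication needs more privileges than this role has, so create the publication ahead of time as an administrator if you want to keep the connector role minimal.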
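If the captured tables live in more than one schema, the same three grants have to be repeated per schema. A small sketch of generating them follows; grant_statements is a hypothetical helper, and the role, database, and schema names in the usage lines are placeholders. It only builds the SQL text; running it is left to your migration tooling.

```python
def quote_ident(name: str) -> str:
    """Quote a PostgreSQL identifier by doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def grant_statements(role: str, database: str, schemas: list[str]) -> list[str]:
    """Build the read-only grants from this step for every listed schema."""
    r = quote_ident(role)
    statements = [f"GRANT CONNECT ON DATABASE {quote_ident(database)} TO {r};"]
    for schema in schemas:
        s = quote_ident(schema)
        statements += [
            f"GRANT USAGE ON SCHEMA {s} TO {r};",
            f"GRANT SELECT ON ALL TABLES IN SCHEMA {s} TO {r};",
            f"ALTER DEFAULT PRIVILEGES IN SCHEMA {s} GRANT SELECT ON TABLES TO {r};",
        ]
    return statements


if __name__ == "__main__":
    for statement in grant_statements("debezium", "your_db", ["public", "billing"]):
        print(statement)
```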

Step 3: pick a logical decoding output plugin

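PostgreSQL needs an output plugin to translate the binary WAL into a stream of logical events. Current Debezium releases support two:

pgoutput is the built-in plugin shipped with PostgreSQL 10 and later. Use this for new deployments.
decoderbufs is a Protobuf-based plugin maintained by the Debezium community. It has to be installed on the database server, so use it only if you have a specific reason to.

Older guides also list wal2json, a third-party plugin. Debezium deprecated it and removed support in version 2.0, so it is only an option if you are pinned to an older Debezium release.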

For most production setups, pgoutput is the right answer. Configure it in the Debezium connector with plugin.name=pgoutput.

Step 4: choose a slot name and publication name carefully

Debezium creates a PostgreSQL replication slot when it first connects and reads changes through it on every subsequent connection. The slot name is durable; once created, it stays until you explicitly drop it.

This matters because:

The slot holds WAL space until the consumer (Debezium) has confirmed processing. If Debezium is down, the slot blocks WAL cleanup and the WAL grows.
The slot is identified by name, and only one consumer can read from it at a time. If you replace Debezium with a different connector instance, the new instance can resume from the slot only if it is configured with the same slot name.
Dropping the slot loses your CDC position. The next consumer has to re-snapshot.

Pick a slot name that includes the connector's purpose: debezium_main_to_warehouse rather than slot1. PostgreSQL only allows lowercase letters, digits, and underscores in slot names. Same for the publication name. Name them so that someone looking at the database in three years can figure out what they are.
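Steps 3 and 4 come together in the connector configuration. The sketch below assembles the relevant properties and rejects slot names PostgreSQL would refuse. connector_config is a hypothetical helper and the publication name in the usage lines is a placeholder; the property keys (plugin.name, slot.name, publication.name) are Debezium's. A real configuration also needs connection and topic settings that are omitted here.

```python
import json
import re

# PostgreSQL slot names: lowercase letters, digits, underscores, up to 63 characters.
_SLOT_NAME = re.compile(r"[a-z0-9_]{1,63}")


def connector_config(slot_name: str, publication_name: str) -> dict[str, str]:
    """Return the plugin, slot, and publication part of a connector config."""
    if not _SLOT_NAME.fullmatch(slot_name):
        raise ValueError(f"invalid replication slot name: {slot_name!r}")
    return {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "slot.name": slot_name,
        "publication.name": publication_name,
    }


if __name__ == "__main__":
    config = connector_config("debezium_main_to_warehouse", "debezium_main_to_warehouse_pub")
    print(json.dumps(config, indent=2))
```

Validating the name up front turns a connector startup failure into an error at deploy time.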

Step 5: configure heartbeat events

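PostgreSQL only emits WAL records when something changes. If your captured tables are low-throughput, Debezium can go minutes at a time without receiving an event. During those idle periods, Debezium's reported position does not advance, which makes downstream monitoring think the consumer is stuck. It also means the slot cannot confirm progress, so if other tables or databases on the same server are busy, WAL piles up behind the slot even though Debezium is healthy.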

The fix is heartbeat events. Configure Debezium to insert a periodic heartbeat into a dedicated heartbeat table on the source database. The heartbeat insertion produces a WAL record that Debezium reads and emits as a heartbeat message. The downstream monitoring sees the heartbeat and knows the consumer is healthy.

In the Debezium connector configuration:

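heartbeat.interval.ms=30000
heartbeat.action.query=INSERT INTO debezium_heartbeat (id, ts) VALUES (1, now()) ON CONFLICT (id) DO UPDATE SET ts = EXCLUDED.ts

The upsert keeps the table at a single row. A plain insert keyed on the timestamp would add a new row every interval and grow without bound.

Create the heartbeat table once, and let the connector role write to it:

CREATE TABLE debezium_heartbeat (id int PRIMARY KEY, ts timestamptz NOT NULL);
GRANT SELECT, INSERT, UPDATE ON debezium_heartbeat TO debezium;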
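On the monitoring side, the check reduces to comparing the time of the last heartbeat message with the configured interval. A minimal sketch, assuming your monitoring already records when the last heartbeat arrived; heartbeat_is_stale and the three-missed-beats tolerance are illustrative choices, not Debezium settings.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_is_stale(
    last_seen: datetime,
    now: datetime,
    interval_ms: int = 30_000,  # matches heartbeat.interval.ms above
    missed_beats: int = 3,      # assumed tolerance before alerting
) -> bool:
    """True when more than `missed_beats` heartbeat intervals have passed."""
    return now - last_seen > timedelta(milliseconds=interval_ms * missed_beats)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(heartbeat_is_stale(now - timedelta(seconds=45), now))
    print(heartbeat_is_stale(now - timedelta(minutes=5), now))
```

Tolerating a few missed beats avoids paging on a single delayed message while still catching a connector that has actually stopped.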

Step 6: handle the initial snapshot

When Debezium first connects, it does an initial snapshot of every table in the configured publication. The snapshot is consistent with the WAL position at the moment of the snapshot, which is what makes the handoff to ongoing streaming work correctly.

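The snapshot phase can take a long time for large tables. During this phase, Debezium reads the tables inside one long-running transaction. That does not block ordinary reads and writes, but depending on the Debezium version it can hold locks that block schema changes, and a transaction that stays open for hours holds back vacuum while the slot retains WAL. Some workloads cannot tolerate that.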

Options:

snapshot.mode=initial (default): do a full snapshot on first start. Standard for fresh setups.
snapshot.mode=never: skip the snapshot and start streaming immediately. Use this when you have already loaded the data through some other path.
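snapshot.mode=schema_only: capture the schema but not the data. Use when you only care about changes from this point forward. On the PostgreSQL connector this behavior has gone by other names (never in older releases, no_data in newer ones), so confirm the accepted values in the documentation for your Debezium version.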

For most production setups, initial is the right choice on first start. Plan for the snapshot phase to take time proportional to the table size; large tables (hundreds of millions of rows) may take hours.

Step 7: configure Kafka topic naming

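By default, Debezium creates one Kafka topic per source table, named <server-name>.<schema>.<table>. The server name is set with topic.prefix in Debezium 2.x and with database.server.name in earlier releases. The default is usually correct, but think about it before you start streaming, because changing topic names later is painful.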

The server name is the most important piece. It identifies the source database in the Kafka cluster. Pick a name that includes the environment (prod-app-db, staging-warehouse-db) so production and staging traffic do not collide.

If you have hundreds of tables and most of them do not need CDC, use the table.include.list configuration to capture only the tables that matter. The Kafka topic count compounds fast; running Debezium across an entire OLTP database creates topic sprawl that is hard to clean up later.
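To see how the prefix and the include list determine the topic set, the sketch below derives both from one list of tables. topic_name and include_list are hypothetical helpers, and the table names in the usage lines are made up. It assumes the default naming scheme described above, with no topic routing transforms configured.

```python
def topic_name(prefix: str, schema: str, table: str) -> str:
    """Default Debezium topic name for one captured table."""
    return f"{prefix}.{schema}.{table}"


def include_list(tables: list[tuple[str, str]]) -> str:
    """Value for table.include.list: comma-separated schema.table entries.

    Debezium treats each entry as a regular expression, so the dot is
    escaped here to match a literal dot only.
    """
    return ",".join(f"{schema}\\.{table}" for schema, table in tables)


if __name__ == "__main__":
    captured = [("public", "orders"), ("public", "customers")]
    print(include_list(captured))
    for schema, table in captured:
        print(topic_name("prod-app-db", schema, table))
```

Generating both values from the same list keeps the expected topics reviewable before the connector creates any of them.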

Step 8: monitor the replication slot lag

This is the operational concern that bites teams hardest. The PostgreSQL replication slot tracks how far behind the consumer is. If the consumer falls behind for too long, the WAL grows and eventually fills the disk.

Monitor pg_replication_slots.confirmed_flush_lsn against pg_current_wal_lsn(). The difference is the consumer lag in bytes. Alert when the lag exceeds a meaningful fraction of your available WAL retention.

SELECT slot_name,
       pg_current_wal_lsn() - confirmed_flush_lsn AS lag_bytes
FROM pg_replication_slots
WHERE slot_type = 'logical';

Run this query every minute. Write the result to your monitoring system. Alert when lag exceeds 100 MB or so (calibrated to your write volume).
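If your monitoring agent collects the raw LSN values rather than the computed difference, the same arithmetic can be done client-side. An LSN is printed as two hexadecimal numbers separated by a slash, the high and low 32 bits of a 64-bit byte position. The sketch below is a minimal illustration; the function names are hypothetical and the 100 MB default mirrors the starting point suggested above.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(current_lsn: str, confirmed_flush_lsn: str) -> int:
    """Bytes of WAL written but not yet confirmed by the consumer."""
    return lsn_to_int(current_lsn) - lsn_to_int(confirmed_flush_lsn)


def should_alert(lag: int, threshold_bytes: int = 100 * 1024 * 1024) -> bool:
    """True when lag exceeds the threshold; calibrate it to your write volume."""
    return lag > threshold_bytes


if __name__ == "__main__":
    lag = lag_bytes("16/B374D848", "16/A0000000")
    print(lag, should_alert(lag))
```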

Step 9: set up dead-letter handling on the consumer

Even with everything configured correctly, some events will fail to process downstream: schema mismatches, transient consumer errors, malformed data. Without a dead-letter strategy, these events block the pipeline.

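Configure the downstream consumer to route failed events to a dedicated dead-letter topic, and set up a separate process to inspect that topic periodically. Debezium itself is a source connector, so this lives on the consuming side. If the consumer is a Kafka Connect sink connector, dead-letter routing is built in and documented under error handling in the Kafka Connect documentation; a hand-written consumer has to implement the pattern itself.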

The cleanup process is non-trivial. Some dead-letter events are recoverable (retry after fixing the consumer); some are not (data is genuinely malformed). Plan for both cases.
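For a Kafka Connect sink connector, dead-letter routing comes down to a few error-handling properties. The sketch below adds them to an existing sink configuration. with_dead_letter is a hypothetical helper, and the connector class and topic names in the usage lines are placeholders; the errors.* keys are Kafka Connect's own and apply to sink connectors only.

```python
def with_dead_letter(sink_config: dict[str, str], dlq_topic: str) -> dict[str, str]:
    """Return a copy of a sink connector config with dead-letter routing enabled."""
    merged = dict(sink_config)
    merged.update(
        {
            # Keep processing after a failed record instead of stopping the task.
            "errors.tolerance": "all",
            # Send the failed record to this topic.
            "errors.deadletterqueue.topic.name": dlq_topic,
            # Attach the failure reason as record headers for later triage.
            "errors.deadletterqueue.context.headers.enable": "true",
        }
    )
    return merged


if __name__ == "__main__":
    sink = {"connector.class": "com.example.WarehouseSinkConnector"}  # placeholder class
    print(with_dead_letter(sink, "dlq.prod-app-db.warehouse"))
```

The context headers are what make the recoverable versus unrecoverable triage practical, since they carry the exception that caused each record to be rejected.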

Step 10: test failure recovery before going live

Before declaring the pipeline production-ready, simulate the failure modes:

Stop the Kafka cluster for an hour. Watch the replication slot lag grow. Verify the source database tolerates the lag without filling the WAL disk.
Restart Debezium. Verify it resumes from the slot without missing changes.
Apply a schema change to the source database (add a column, rename a column). Verify the downstream consumer handles it correctly.
Insert a bad row that the downstream consumer will reject. Verify it ends up in the dead-letter topic and the pipeline keeps moving.

The pipeline that has passed these tests is the one that survives production. The pipeline that has not been tested under failure is the one that produces a 3am page.

When this is the wrong tool

A few situations where Debezium plus PostgreSQL plus Kafka is the wrong answer:

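If your source database is not one Debezium has a connector for, different tooling is required. The supported list is longer than most people expect: PostgreSQL, MySQL, MariaDB, SQL Server, MongoDB, Oracle, Db2, and several others. Check it before ruling Debezium out.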

If you do not want to run Kafka, the operational cost of standing up a Kafka cluster just for CDC is real. Managed alternatives (AWS DMS, GCP Datastream, Airbyte, various managed Debezium services) handle CDC without requiring you to run Kafka yourself.

If your latency tolerance is high (minutes to hours) and your throughput is low (tens of writes per second), trigger-based CDC or timestamp polling are simpler answers. The longer guide How to Implement Change Data Capture Without Polling Your Database covers the decision rule for picking between patterns.

For a broader read on the 137Foundry data integration approach, the services page covers our build-vs-buy framework for CDC infrastructure specifically.

The short version

Debezium plus PostgreSQL plus Kafka is the production-grade open-source CDC stack. The setup is more involved than the demo suggests, the operational burden is real, and the result is a CDC pipeline that handles the things polling cannot: deletes, transaction boundaries, sub-second latency, full change history.

The teams that succeed with this stack are the ones that treat the steps above as a checklist, not a suggestion. Skip any one of them and you ship a pipeline that works in development and fails in production. Run them all and you ship a pipeline that survives.

Worth the setup time. The data quality on the other side is durable.
