PostgreSQL Replication Slots: Create, Monitor & Troubleshoot
A replication slot guarantees that WAL (Write-Ahead Log) segments are retained until a consumer has processed them. Without slots, a slow or disconnected replica can fall behind, the primary recycles the WAL it needed, and you have to reinitialize the replica from scratch.
Slots solve that problem. But they create a worse one: if the consumer disappears, the slot retains WAL indefinitely, the disk fills, and the primary goes down.
The Failure Mode Everyone Hits
Here's the scenario:
- You set up streaming replication with a slot, or logical replication creates one automatically
- The replica goes down (maintenance, failure, decommission)
- Nobody drops the slot on the primary
- WAL accumulates: 5 GB... 20 GB... 100 GB...
- Disk fills, PostgreSQL can't write new WAL, all transactions stall
- The primary is effectively down
The slot was created to protect replication. It ended up taking down the primary.
With multiple slots, it's worse. Each slot independently retains WAL. If 4 of 5 slots are current but one is lagging, that single lagging slot determines the WAL retention floor for the entire system.
Checking Your Slots Right Now
-- List all slots with lag and activity status
SELECT
slot_name,
slot_type,
database,
active,
active_pid,
pg_size_pretty(
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
) AS retained_wal_size,
wal_status
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
The columns that matter:
-
active: is anything consuming this slot? -
retained_wal_size: how much WAL is this slot holding? -
wal_status: should bereserved. Iflost, the slot fell behind and its WAL was recycled.
Any slot where active = false and retained_wal_size is growing needs immediate investigation.
Check WAL directory size:
SELECT
pg_size_pretty(sum(size)) AS total_wal_size,
count(*) AS wal_file_count
FROM pg_ls_waldir();
Find orphaned slots (no consumer, significant lag):
SELECT
s.slot_name,
s.slot_type,
s.active,
pg_size_pretty(
pg_wal_lsn_diff(pg_current_wal_lsn(), s.restart_lsn)
) AS retained_wal_size,
sub.subname AS subscription_name
FROM pg_replication_slots s
LEFT JOIN pg_subscription sub ON sub.subslotname = s.slot_name
WHERE s.active = false
AND pg_wal_lsn_diff(pg_current_wal_lsn(), s.restart_lsn) > 1024 * 1024 * 100
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), s.restart_lsn) DESC;
A slot with no matching subscription and growing WAL is almost certainly orphaned.
Fixing Slot Problems
Drop an orphaned slot
-- Confirm it's orphaned
SELECT slot_name, active, active_pid, slot_type
FROM pg_replication_slots
WHERE slot_name = 'old_replica_slot';
-- Drop it (immediately allows WAL recycling)
SELECT pg_drop_replication_slot('old_replica_slot');
Physical slot lagging (streaming replication)
Check the replica. If it's running but lagging, it may need more resources to apply WAL faster. If it's down, either bring it back or drop the slot:
SELECT
client_addr,
state,
pg_size_pretty(
pg_wal_lsn_diff(sent_lsn, replay_lsn)
) AS replay_lag
FROM pg_stat_replication;
Logical slot lagging (subscriptions, CDC)
Check the subscription on the subscriber:
SELECT subname, subenabled, subslotname
FROM pg_subscription;
-- Re-enable if disabled
ALTER SUBSCRIPTION my_subscription ENABLE;
If the subscriber was permanently removed, drop the slot on the primary. Logical slots can't be reassigned.
The Safety Valve: max_slot_wal_keep_size
This is the single most important configuration for replication slot safety:
-- Limit WAL retained per slot (PostgreSQL 13+)
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
SELECT pg_reload_conf();
When a slot exceeds this limit, PostgreSQL invalidates it rather than filling the disk. The consumer needs re-initialization, but the primary stays alive.
Choose a value large enough for temporary network issues (2-5x your hourly WAL generation rate) but small enough that a dead slot can't fill your disk.
If this parameter is unset (the default), there is no limit. A single inactive slot can consume all available disk space.
Creating Slots Properly
-- Physical slot (streaming replication)
SELECT pg_create_physical_replication_slot('replica_east_1');
-- Logical slot (CDC, logical replication)
SELECT pg_create_logical_replication_slot('cdc_debezium', 'pgoutput');
Name slots descriptively. Include the consumer name or purpose so it's obvious what each slot is for when you audit them later. cdc_debezium is better than slot_1.
Physical vs Logical Slots
| Physical | Logical | |
|---|---|---|
| Purpose | Streaming replication to standbys | Logical decoding for subscriptions, CDC |
| WAL format | Raw bytes | Decoded into row-level changes |
| CPU overhead | Low | Higher (decoding process) |
| Flexibility | Full cluster copy | Per-table, cross-version, schema-independent |
| Orphan risk | Lower (tied to standby) | Higher (created by external tools) |
Logical slots are particularly prone to orphaning because they're often created by tools (Debezium, pglogical, pg_recvlogical) that manage their own lifecycle. If the tool is uninstalled or misconfigured, the slot persists with no consumer.
Prevention Checklist
-
Set
max_slot_wal_keep_sizeto 5-20 GB. Non-negotiable. - Monitor slot activity continuously. Alert on inactive slots and growing WAL retention.
- Document every slot's consumer. When decommissioning a consumer, drop the slot as part of the procedure.
-
Keep
max_replication_slotsclose to actual count. Default is 10. If you use 3, set it to 5 -- an unexpected slot is easier to spot. - Audit slots weekly. Compare active slots against documented consumers. Investigate any mismatch.
-
Name slots descriptively.
replica_east_1andcdc_debeziumare self-documenting.slot_1is not.
Top comments (0)