DEV Community

Cover image for PostgreSQL Replication Slots: Create, Monitor & Troubleshoot
Philip McClarence
Philip McClarence

Posted on

PostgreSQL Replication Slots: Create, Monitor & Troubleshoot

PostgreSQL Replication Slots: Create, Monitor & Troubleshoot

A replication slot guarantees that WAL (Write-Ahead Log) segments are retained until a consumer has processed them. Without slots, a slow or disconnected replica can fall behind, the primary recycles the WAL it needed, and you have to reinitialize the replica from scratch.

Slots solve that problem. But they create a worse one: if the consumer disappears, the slot retains WAL indefinitely, the disk fills, and the primary goes down.

The Failure Mode Everyone Hits

Here's the scenario:

  1. You set up streaming replication with a slot, or logical replication creates one automatically
  2. The replica goes down (maintenance, failure, decommission)
  3. Nobody drops the slot on the primary
  4. WAL accumulates: 5 GB... 20 GB... 100 GB...
  5. Disk fills, PostgreSQL can't write new WAL, all transactions stall
  6. The primary is effectively down

The slot was created to protect replication. It ended up taking down the primary.

With multiple slots, it's worse. Each slot independently retains WAL. If 4 of 5 slots are current but one is lagging, that single lagging slot determines the WAL retention floor for the entire system.

Checking Your Slots Right Now

-- List all slots with lag and activity status
SELECT
    slot_name,
    slot_type,
    database,
    active,
    active_pid,
    pg_size_pretty(
        pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
    ) AS retained_wal_size,
    wal_status
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
Enter fullscreen mode Exit fullscreen mode

The columns that matter:

  • active: is anything consuming this slot?
  • retained_wal_size: how much WAL is this slot holding?
  • wal_status: should be reserved. If lost, the slot fell behind and its WAL was recycled.

Any slot where active = false and retained_wal_size is growing needs immediate investigation.

Check WAL directory size:

SELECT
    pg_size_pretty(sum(size)) AS total_wal_size,
    count(*) AS wal_file_count
FROM pg_ls_waldir();
Enter fullscreen mode Exit fullscreen mode

Find orphaned slots (no consumer, significant lag):

SELECT
    s.slot_name,
    s.slot_type,
    s.active,
    pg_size_pretty(
        pg_wal_lsn_diff(pg_current_wal_lsn(), s.restart_lsn)
    ) AS retained_wal_size,
    sub.subname AS subscription_name
FROM pg_replication_slots s
LEFT JOIN pg_subscription sub ON sub.subslotname = s.slot_name
WHERE s.active = false
    AND pg_wal_lsn_diff(pg_current_wal_lsn(), s.restart_lsn) > 1024 * 1024 * 100
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), s.restart_lsn) DESC;
Enter fullscreen mode Exit fullscreen mode

A slot with no matching subscription and growing WAL is almost certainly orphaned.

Fixing Slot Problems

Drop an orphaned slot

-- Confirm it's orphaned
SELECT slot_name, active, active_pid, slot_type
FROM pg_replication_slots
WHERE slot_name = 'old_replica_slot';

-- Drop it (immediately allows WAL recycling)
SELECT pg_drop_replication_slot('old_replica_slot');
Enter fullscreen mode Exit fullscreen mode

Physical slot lagging (streaming replication)

Check the replica. If it's running but lagging, it may need more resources to apply WAL faster. If it's down, either bring it back or drop the slot:

SELECT
    client_addr,
    state,
    pg_size_pretty(
        pg_wal_lsn_diff(sent_lsn, replay_lsn)
    ) AS replay_lag
FROM pg_stat_replication;
Enter fullscreen mode Exit fullscreen mode

Logical slot lagging (subscriptions, CDC)

Check the subscription on the subscriber:

SELECT subname, subenabled, subslotname
FROM pg_subscription;

-- Re-enable if disabled
ALTER SUBSCRIPTION my_subscription ENABLE;
Enter fullscreen mode Exit fullscreen mode

If the subscriber was permanently removed, drop the slot on the primary. Logical slots can't be reassigned.

The Safety Valve: max_slot_wal_keep_size

This is the single most important configuration for replication slot safety:

-- Limit WAL retained per slot (PostgreSQL 13+)
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
SELECT pg_reload_conf();
Enter fullscreen mode Exit fullscreen mode

When a slot exceeds this limit, PostgreSQL invalidates it rather than filling the disk. The consumer needs re-initialization, but the primary stays alive.

Choose a value large enough for temporary network issues (2-5x your hourly WAL generation rate) but small enough that a dead slot can't fill your disk.

If this parameter is unset (the default), there is no limit. A single inactive slot can consume all available disk space.

Creating Slots Properly

-- Physical slot (streaming replication)
SELECT pg_create_physical_replication_slot('replica_east_1');

-- Logical slot (CDC, logical replication)
SELECT pg_create_logical_replication_slot('cdc_debezium', 'pgoutput');
Enter fullscreen mode Exit fullscreen mode

Name slots descriptively. Include the consumer name or purpose so it's obvious what each slot is for when you audit them later. cdc_debezium is better than slot_1.

Physical vs Logical Slots

Physical Logical
Purpose Streaming replication to standbys Logical decoding for subscriptions, CDC
WAL format Raw bytes Decoded into row-level changes
CPU overhead Low Higher (decoding process)
Flexibility Full cluster copy Per-table, cross-version, schema-independent
Orphan risk Lower (tied to standby) Higher (created by external tools)

Logical slots are particularly prone to orphaning because they're often created by tools (Debezium, pglogical, pg_recvlogical) that manage their own lifecycle. If the tool is uninstalled or misconfigured, the slot persists with no consumer.

Prevention Checklist

  1. Set max_slot_wal_keep_size to 5-20 GB. Non-negotiable.
  2. Monitor slot activity continuously. Alert on inactive slots and growing WAL retention.
  3. Document every slot's consumer. When decommissioning a consumer, drop the slot as part of the procedure.
  4. Keep max_replication_slots close to actual count. Default is 10. If you use 3, set it to 5 -- an unexpected slot is easier to spot.
  5. Audit slots weekly. Compare active slots against documented consumers. Investigate any mismatch.
  6. Name slots descriptively. replica_east_1 and cdc_debezium are self-documenting. slot_1 is not.

Top comments (0)