Interactive Replication Topology: See Your Cluster at a Glance
Replication lag is one of those metrics that everyone monitors as a single number — and that single number hides almost everything you need to know to actually fix the problem. Is the lag caused by network, storage, or CPU? The answer changes your remediation completely, and the default PostgreSQL views do not make it obvious.
The Problem
You manage a PostgreSQL primary with three read replicas. One of the replicas has a cascading standby attached to it. Replication lag is spiking and your application team is reporting stale reads.
To diagnose the problem, you need to answer several questions at once: Which replica is lagging? Is it write lag (the primary has sent WAL but the replica has not received it), flush lag (the replica received it but has not flushed to disk), or replay lag (flushed but not yet applied)? Is the cascading standby affected by the same lag, or is it independently lagging from its upstream replica? Are all replication slots active, or is an inactive slot causing WAL retention to balloon?
Answering these questions means connecting to each node separately. You run pg_stat_replication on the primary to see what it knows about each replica. You run pg_stat_wal_receiver on each replica to see the receiver's perspective. You check pg_replication_slots on the primary to verify slot health. Each query returns raw LSN positions and interval values that you mentally compare across nodes. With three replicas and a cascading standby, that is at least five separate connections and eight queries before you have a picture of the cluster state.
During an incident, this is too slow.
How to Detect It
On the primary, pg_stat_replication shows the current state of each connected standby:
-- Run on the PRIMARY: check replication lag to each standby
SELECT
client_addr,
application_name,
state,
sent_lsn,
write_lsn,
flush_lsn,
replay_lsn,
write_lag,
flush_lag,
replay_lag,
sync_state
FROM pg_stat_replication
ORDER BY client_addr;
On each replica, pg_stat_wal_receiver shows the receiver's perspective:
-- Run on each REPLICA: check WAL receiver status
SELECT
status,
received_lsn,
last_msg_send_time,
last_msg_receipt_time,
latest_end_lsn,
slot_name,
conninfo
FROM pg_stat_wal_receiver;
Check replication slot health — inactive slots prevent WAL recycling and cause disk usage to grow indefinitely:
-- Run on the PRIMARY: check replication slot health
SELECT
slot_name,
slot_type,
active,
pg_size_pretty(
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
) AS retained_wal_size,
wal_status
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
The raw data is all there, but it is distributed across nodes, expressed in LSN positions that require mental arithmetic, and presented as flat rows with no visual representation of the topology.
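The LSN arithmetic that pg_wal_lsn_diff() does on the server can be reproduced client-side, which helps when comparing positions collected from different nodes. A minimal Python sketch (the example LSN values are illustrative):

```python
# Minimal sketch: replicate pg_wal_lsn_diff() client-side so raw LSN
# strings from different nodes can be compared without mental hex math.
# A pg_lsn value like '16/B374D848' is a 64-bit WAL byte position
# written as two 32-bit hex halves separated by '/'.

def lsn_to_bytes(lsn: str) -> int:
    """Convert a pg_lsn string to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def lsn_diff(a: str, b: str) -> int:
    """Bytes of WAL between two positions (a - b), like pg_wal_lsn_diff()."""
    return lsn_to_bytes(a) - lsn_to_bytes(b)

# Example: how far a replica's replay position trails the primary.
primary_lsn = "16/B374D848"   # pg_current_wal_lsn() on the primary
replay_lsn = "16/B3700000"    # replay_lsn reported for the replica
print(lsn_diff(primary_lsn, replay_lsn), "bytes of unreplayed WAL")
```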
Making Sense of the Data
When monitoring replication, the key is breaking lag into its stages and understanding what each one means. Note that pg_stat_replication measures write_lag, flush_lag, and replay_lag from the same starting point (the primary flushing WAL locally), so the columns are cumulative rather than per-stage:
- Write lag (primary flushed WAL until the replica wrote it) points to network issues between primary and replica.
- Flush lag minus write lag (the replica wrote WAL but has not yet flushed it to disk) suggests slow storage on the replica.
- Replay lag minus flush lag (flushed but not yet applied) means the replica is CPU-bound on WAL replay, which is common when the primary runs many parallel writes but the replica applies them single-threaded.
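Because the three lag columns in pg_stat_replication share a starting point (the primary's local WAL flush), subtracting adjacent columns isolates each stage. A hypothetical triage helper along those lines (the function name and labels are illustrative, not a PostgreSQL API):

```python
# Hypothetical triage helper (names are illustrative, not a PostgreSQL
# API): given the three cumulative lag columns from pg_stat_replication,
# report which stage dominates and the usual root cause to check first.

def classify_lag(write_lag: float, flush_lag: float, replay_lag: float) -> str:
    """All inputs in seconds, measured from the primary's local flush."""
    stages = {
        "network (primary -> replica transfer)": write_lag,
        "storage (replica WAL flush)": flush_lag - write_lag,
        "cpu (single-threaded WAL replay)": replay_lag - flush_lag,
    }
    return max(stages, key=stages.get)

# Replay dominates: the replica receives and flushes quickly
# but cannot apply WAL fast enough.
print(classify_lag(0.05, 0.06, 4.2))
```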
Trend lines matter more than point-in-time values. Is lag stable, growing, or recovering? A stable 500ms of replay lag is normal for a busy system. A replay lag that grows 100ms every minute is a replica falling behind that will eventually become unusable for reads.
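One way to encode the trend check is a least-squares slope over recent (time, lag) samples; the sampling interval and any alert threshold are assumptions to tune for your system:

```python
# Sketch: distinguish stable lag from steadily growing lag by fitting a
# slope to recent (elapsed_seconds, replay_lag_seconds) samples.
# The sample data below is illustrative.

def lag_slope(samples):
    """Least-squares slope: seconds of added lag per second of wall time."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_l = sum(l for _, l in samples) / n
    num = sum((t - mean_t) * (l - mean_l) for t, l in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den

stable = [(0, 0.50), (60, 0.48), (120, 0.52), (180, 0.50)]
falling_behind = [(0, 0.5), (60, 0.6), (120, 0.7), (180, 0.8)]
print(lag_slope(stable))          # near zero: routine
print(lag_slope(falling_behind))  # positive: replica is falling behind
```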
For cascading standbys, check whether lag is inherited from the upstream replica or independent. If the upstream replica has 3 seconds of replay lag and the cascading standby has 3.5 seconds, the standby is fine — it is just inheriting the upstream lag plus 500ms of its own.
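That subtraction is simple enough to script. A sketch, where the 1-second threshold for "independently lagging" is an illustrative assumption:

```python
# Sketch of the cascading-standby check from the text: a standby's own
# contribution is its total lag minus what it inherits from its
# upstream replica. The 1-second threshold is an assumption to tune.

def own_lag(cascading_lag: float, upstream_lag: float,
            threshold: float = 1.0):
    """Return (standby's own lag in seconds, independently_lagging)."""
    own = cascading_lag - upstream_lag
    return own, own > threshold

# The example from the text: 3.5s total, 3.0s inherited from upstream,
# so only 0.5s is the standby's own - it is fine.
print(own_lag(3.5, 3.0))
```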
How to Fix It
The fix depends on which lag component is elevated:
High write lag (network) — Check bandwidth and latency between primary and replica. Verify wal_sender_timeout and wal_receiver_timeout are not too aggressive for your network conditions. For cross-region replication, ensure adequate bandwidth for your WAL generation rate.
High flush lag (storage) — The replica's storage cannot keep up with incoming WAL. Move the replica's WAL directory (pg_wal) to faster storage such as NVMe. Check for I/O contention from queries running on the replica.
High replay lag (CPU) — The replica is applying WAL single-threaded and falling behind. In PostgreSQL 16+, logical replication can apply large transactions in parallel via max_parallel_apply_workers_per_subscription. For physical replication, reduce the write rate on the primary during peak times or add more replicas to distribute read load.
Inactive replication slots retaining WAL indefinitely:
-- Drop an inactive slot that is retaining excessive WAL
SELECT pg_drop_replication_slot('stale_replica_slot');
-- Or set a maximum WAL retention size (PG 13+)
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
SELECT pg_reload_conf();
Cascading standby disconnected — Check primary_conninfo on the cascading standby. Verify the upstream replica allows replication connections (max_wal_senders not exhausted, pg_hba.conf permits the connection).
How to Prevent It
Monitor all three lag types separately — write, flush, and replay. Total replication lag hides the root cause. A replica showing 5 seconds of total lag could be a network problem, a storage problem, or a CPU problem, and each requires a different fix.
Set max_slot_wal_keep_size to prevent inactive slots from consuming all available disk. Without this setting, a single inactive slot will retain WAL indefinitely until the primary runs out of disk space and stops accepting writes.
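A sketch of a pre-invalidation alert, assuming the retained byte count comes from the pg_replication_slots query shown earlier and that the warning fraction is tuned locally:

```python
# Sketch: alert before an inactive slot reaches max_slot_wal_keep_size,
# at which point PostgreSQL invalidates the slot (wal_status 'lost')
# and the replica must be rebuilt. The 80% warning fraction is an
# illustrative assumption.

def slot_wal_alert(retained_bytes: int, max_slot_wal_keep_bytes: int,
                   warn_fraction: float = 0.8) -> str:
    if retained_bytes >= max_slot_wal_keep_bytes:
        return "critical: slot at risk of invalidation"
    if retained_bytes >= warn_fraction * max_slot_wal_keep_bytes:
        return "warning: retained WAL approaching limit"
    return "ok"

GIB = 1024 ** 3
# 9 GiB retained against a 10 GiB limit: time to act.
print(slot_wal_alert(9 * GIB, 10 * GIB))
```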
Alert on replication slot inactivity. A slot that becomes inactive usually means a replica has disconnected — either crashed, lost network connectivity, or was decommissioned without dropping its slot.
Catching a lagging replica at 2 seconds of delay is routine maintenance; catching it at 2 hours of delay is an incident.