Gabriel Anhaia

The Replication Lag Question That Shows Up in Most Senior Interviews


A user updates their profile photo. The POST /me returns 200. The page reloads, and the old photo is back. Three seconds later it flips to the new one. The bug ticket says "intermittent UI flicker on profile save." The actual cause is async replication, and the ticket will get reopened twice before someone names it.

The senior interview question that exposes this is some variant of "you've added a read replica, what breaks?" Junior answers stop at "we offload reads." Senior answers map out the read-your-writes problem, propose three different mitigations with their trade-offs, sketch monitoring, and only then write code.

The split between those two answers is what the panel scores.

Why "read your writes" breaks under async replication

Primary-replica with asynchronous replication is the default in every major managed Postgres service. The primary commits, the WAL ships to the replica, the replica applies it. In healthy steady state that takes single-digit to tens of milliseconds. Under write spikes, vacuum storms, or replica-side query contention, it can stretch to seconds or minutes. Cloud SQL's documentation calls out network bandwidth, write volume, and replica CPU as the three main drivers.

The user-visible bug looks like this:

  1. Client writes to primary at T=0. Primary commits.
  2. Primary returns 200 at T=20ms.
  3. Client immediately reads. Load balancer routes to replica.
  4. Replica is at WAL position from T=−200ms. Returns the old row.

Async replication gives you eventual consistency across the cluster. Your application (which expects "I wrote this, I should see it") was written for a stronger model. The mismatch is the entire bug class.
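To see it in isolation, here is a minimal repro sketch with psycopg, assuming direct connections to each node and a hypothetical users table; under lag, the replica read still returns the old row:

import psycopg

async def demo_stale_read(primary_dsn: str, replica_dsn: str) -> None:
    p = await psycopg.AsyncConnection.connect(primary_dsn)
    r = await psycopg.AsyncConnection.connect(replica_dsn)
    try:
        # Write on the primary and commit.
        await p.execute(
            "UPDATE users SET photo_url = %s WHERE id = %s",
            ("new.jpg", 42),
        )
        await p.commit()
        # Read back immediately from the replica.
        cur = await r.execute(
            "SELECT photo_url FROM users WHERE id = %s", (42,)
        )
        print(await cur.fetchone())  # under lag: still the old URL
    finally:
        await p.close()
        await r.close()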

The four canonical mitigations, in order of operational cost:

  • Read from primary always for write paths. Boring, expensive, defensible.
  • Sticky sessions / sticky reads. Pin recent writers to the primary for N seconds.
  • WAL-position read tokens. Track the LSN at write time, force the replica to wait.
  • Synchronous or semi-synchronous replication. Make the primary wait for the replica before acking.

You will get asked to compare at least two. Most candidates skip the third entirely.

Monitoring lag with pg_stat_replication

Before you mitigate anything, you measure. pg_stat_replication is the canonical view. It exists on the primary and shows one row per connected replica.

SELECT
    application_name,
    client_addr,
    state,
    pg_wal_lsn_diff(
        pg_current_wal_lsn(), sent_lsn
    ) AS sent_lag_bytes,
    pg_wal_lsn_diff(
        pg_current_wal_lsn(), write_lsn
    ) AS write_lag_bytes,
    pg_wal_lsn_diff(
        pg_current_wal_lsn(), flush_lsn
    ) AS flush_lag_bytes,
    pg_wal_lsn_diff(
        pg_current_wal_lsn(), replay_lsn
    ) AS replay_lag_bytes,
    write_lag,
    flush_lag,
    replay_lag,
    sync_state
FROM pg_stat_replication;

Five things this query tells you:

  • sent_lag_bytes — WAL the primary hasn't transmitted yet. Network or sender bottleneck.
  • write_lag_bytes — WAL received but not yet written to the replica's OS cache.
  • flush_lag_bytes — WAL written but not yet fsynced on the replica.
  • replay_lag_bytes — WAL flushed but not yet applied to the replica's data files. This is the one your application sees. Replay can fall behind even when the network is fine, because applying WAL conflicts with read queries on the replica. The built-in write_lag, flush_lag, and replay_lag columns report the same gaps as elapsed time.
  • sync_state — async, potential, sync, or quorum. Tells you whether semi-sync is actually engaging.

Alert on replay_lag > 5s for read replicas serving user traffic, and replay_lag_bytes > 256MB as a circuit-breaker for routing decisions. The Cybertec walkthrough is a thorough public reference for these columns and what changes them.
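As a sketch, the byte-based circuit breaker can be a single query on the primary; the 256MB threshold and the replicas_ok alias are assumptions to tune, not a standard:

-- True only when every streaming replica has replayed to within
-- 256 MB of the primary's current WAL position. Zero rows
-- (no replicas attached) coalesces to false: fail closed.
SELECT COALESCE(bool_and(
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)
        < 256 * 1024 * 1024
), false) AS replicas_ok
FROM pg_stat_replication
WHERE state = 'streaming';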

The replica side has a counterpart query. Run on the replica:

SELECT
    pg_is_in_recovery(),
    pg_last_wal_receive_lsn(),
    pg_last_wal_replay_lsn(),
    EXTRACT(EPOCH FROM (
        now() - pg_last_xact_replay_timestamp()
    )) AS replay_age_seconds;

replay_age_seconds is what you graph in Grafana. It is the number you'll quote in the postmortem. One caveat: pg_last_xact_replay_timestamp() is the timestamp of the last replayed transaction, so on an idle primary this number grows with no actual lag. Guard the alert by checking that pg_last_wal_receive_lsn() and pg_last_wal_replay_lsn() actually differ before firing.

Sticky reads via per-user write timestamps

The pragmatic mitigation. After a user writes, route their reads to the primary for a window long enough to outlast typical replica lag. A common starting point is 5 seconds. Chatty UIs tend to tune down to 2–3 seconds; quiet back-office tools sit closer to 30. Pick from your own lag percentiles, not from a blog post.

The state lives in Redis, keyed by user ID. Setting it on every write is one Redis SET. Checking it on every read is one Redis GET. Both are sub-millisecond.

import time
from dataclasses import dataclass

import psycopg
import redis.asyncio as redis

WRITE_STICKINESS_SECONDS = 5

@dataclass
class Pools:
    primary: psycopg.AsyncConnection
    replica: psycopg.AsyncConnection
    cache: redis.Redis

class ReadRouter:
    def __init__(self, pools: Pools):
        self.pools = pools

    def _key(self, user_id: str) -> str:
        return f"recent_write:{user_id}"

    async def mark_write(self, user_id: str) -> None:
        # The TTL doubles as the stickiness window: when the key
        # expires, the user's reads move back to the replica.
        await self.pools.cache.set(
            self._key(user_id),
            int(time.time()),
            ex=WRITE_STICKINESS_SECONDS,
        )

    async def pick(self, user_id: str) -> psycopg.AsyncConnection:
        recent = await self.pools.cache.get(self._key(user_id))
        if recent is not None:
            return self.pools.primary
        return self.pools.replica

    async def execute_write(
        self, user_id: str, sql: str, params: tuple
    ) -> None:
        async with self.pools.primary.cursor() as cur:
            await cur.execute(sql, params)
        await self.pools.primary.commit()
        await self.mark_write(user_id)

    async def execute_read(
        self, user_id: str, sql: str, params: tuple
    ) -> list:
        conn = await self.pick(user_id)
        async with conn.cursor() as cur:
            await cur.execute(sql, params)
            return await cur.fetchall()

The contract: execute_write always hits the primary and stamps the user. execute_read consults Redis first; if the user wrote in the last 5 seconds, route to primary, otherwise replica. The TTL on the Redis key is the stickiness window.

What this catches: the profile photo bug, the "I just sent a chat message and the channel doesn't show it" bug, the "I deleted a draft and it reappeared" bug. All single-user, write-then-read flows.

What it misses: cross-user reads (you commented on my post, I refresh, my replica hasn't seen your write yet). For those, you need either a per-resource sticky window keyed by post_id (sketched below), or the WAL-position approach in the next section.
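A minimal sketch of the per-resource variant, extending the ReadRouter above; ResourceStickyRouter and its method names are hypothetical:

class ResourceStickyRouter(ReadRouter):
    def _resource_key(self, resource_id: str) -> str:
        return f"recent_write:resource:{resource_id}"

    async def mark_resource_write(self, resource_id: str) -> None:
        # Stamp the written resource (e.g. a post_id), so any
        # reader of it is routed to the primary during the
        # window, not just the user who wrote.
        await self.pools.cache.set(
            self._resource_key(resource_id),
            int(time.time()),
            ex=WRITE_STICKINESS_SECONDS,
        )

    async def pick_for_resource(
        self, resource_id: str
    ) -> psycopg.AsyncConnection:
        recent = await self.pools.cache.get(
            self._resource_key(resource_id)
        )
        return self.pools.primary if recent is not None else self.pools.replica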

When you need WAL-position tokens

The stricter pattern, used by systems that cannot tolerate even a 5-second window of stale-read risk on shared resources. Capture the LSN at write time, propagate it through the response, and have the replica refuse to serve reads behind that LSN.

-- After commit on the primary
SELECT pg_current_wal_lsn();

You return that LSN to the client (header, body, cookie). On the next read, the client sends it back. Your read router does:

import asyncio
import time

async def wait_for_lsn(replica, target_lsn: str, timeout: float = 2.0) -> bool:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        async with replica.cursor() as cur:
            # pg_wal_lsn_diff returns bytes (replay - target);
            # >= 0 means the replica is at or past target_lsn.
            # Don't compare LSNs as strings: works for same-length
            # values, breaks across WAL segments.
            await cur.execute(
                "SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), %s::pg_lsn)",
                (target_lsn,),
            )
            row = await cur.fetchone()
        if row and row[0] >= 0:
            return True
        await asyncio.sleep(0.01)
    return False

If the replica catches up within the timeout, serve the read. Otherwise fall back to the primary. Comparable patterns appear in AWS Aurora's session-pinned reads. CockroachDB's bounded-staleness follower reads via AS OF SYSTEM TIME solve a related problem (let the read tell the system how stale is too stale), though they're a time-travel read against a chosen timestamp, not an LSN-token wait. The LSN-token pattern is stronger than sticky reads and harder to operate; pick it when sticky reads are too coarse.
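Tying it together, a sketch of a token-aware read path, assuming the Pools object and wait_for_lsn from above; execute_read_with_token and how the token reaches it (header, cookie) are assumptions:

async def execute_read_with_token(
    pools: Pools, token_lsn: str | None, sql: str, params: tuple
) -> list:
    # Prefer the replica, but only once it has caught up to the
    # client's LSN token; on timeout, fall back to the primary.
    conn = pools.replica
    if token_lsn and not await wait_for_lsn(pools.replica, token_lsn):
        conn = pools.primary
    async with conn.cursor() as cur:
        await cur.execute(sql, params)
        return await cur.fetchall()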

Semi-sync as the structural answer

Sticky reads patch the symptom. Semi-synchronous replication addresses the cause: the primary acks the client only after at least one replica has confirmed the WAL.

Postgres exposes this via synchronous_commit and synchronous_standby_names. Set on the primary:

synchronous_commit = remote_apply
synchronous_standby_names = 'ANY 1 (replica_a, replica_b)'

ANY 1 means the primary commits when any one of the listed replicas confirms. Two details matter. First, the level: synchronous_commit = on only waits for the replica to flush the WAL to disk, which still leaves a replay gap; remote_apply waits until the replica has actually applied it, which is what read-your-writes needs. Second, with ANY 1 only the confirming replica carries the guarantee; to route reads freely across all listed replicas, you need all of them in the quorum (ANY 2 here) or routing that knows which replica confirmed. Get both right and reads on a confirming replica satisfy read-your-writes by construction.

The cost is latency. Every write now waits on the slowest synchronous replica's round-trip, and with remote_apply, on its replay as well. As a rough order of magnitude: typically single-digit milliseconds same-AZ, tens of milliseconds cross-region. Measure your own. You also lose write availability if all synchronous replicas are unreachable: commits will block until one returns.

The right operational shape: synchronous replicas in the same AZ for read-your-writes, asynchronous replicas across regions for disaster recovery. The sync_state column in pg_stat_replication tells you which is which.
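A quick way to check which replica is playing which role, run on the primary:

SELECT application_name, sync_state, replay_lag
FROM pg_stat_replication;
-- With an ANY n (quorum) setting, listed replicas report 'quorum';
-- with FIRST n (priority) they report 'sync' or 'potential';
-- replicas not in synchronous_standby_names stay 'async'.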

What the senior answer sounds like

The shape of the answer the panel is listening for:

  1. Name the consistency model you actually want. Read-your-writes? Monotonic reads? Linearizability? Most product flows you'll defend in an interview need read-your-writes plus monotonic reads.
  2. Show you can measure lag with pg_stat_replication and know which column matters (replay_lag).
  3. Pick the cheapest mitigation that satisfies the requirement. Per-user sticky reads for single-user flows, per-resource sticky reads when one row has many readers, WAL tokens when even a short stickiness window is too coarse, and semi-sync when the SLA is hard.
  4. Mention the failure mode of your pick. Sticky reads break under cache eviction. WAL tokens break under timeout fallback. Semi-sync breaks under replica unavailability.
  5. Sketch the monitoring before the panel asks.

The candidates who get the offer don't write more code. They name the consistency model in the first 60 seconds and treat the implementation as the boring part.

If this was useful

Replication lag is one of those topics that sits at the intersection of database internals and system design (easy to handwave, hard to actually ship). System Design Pocket Guide: Fundamentals covers the consistency-model vocabulary the panel is listening for, and Database Playbook goes deep on the Postgres-specific knobs that decide whether your replicas serve reads or fall over under load: synchronous_commit, hot-standby feedback, conflict resolution.
