Franck Pachot
Read‑your‑writes on replicas: PostgreSQL WAIT FOR LSN and MongoDB Causal Consistency

In databases designed for high availability and scalability, secondary nodes can fall behind the primary. Typically, a quorum of nodes is updated synchronously to guarantee durability while maintaining availability, and the remaining standby instances are eventually consistent to tolerate partial failures. To balance availability with performance, synchronous replicas acknowledge a write once it is durable and recoverable, even if it is not yet readable.

As a result, if your application writes data and then immediately queries another node, it may still see stale data.

Here’s a common anomaly: you commit an order on the primary and then try to retrieve it from a reporting system. The order is missing because the read replica has not yet applied the write.

PostgreSQL and MongoDB tackle this problem in different ways:

  • PostgreSQL 19 should introduce a WAIT FOR LSN command, allowing applications to explicitly coordinate reads after writes.
  • MongoDB provides causal consistency within sessions using the afterClusterTime read concern.

Both approaches track when your write occurred and ensure subsequent reads observe at least that point. Let’s look at how each database does this.


PostgreSQL: WAIT FOR LSN (PG19 - currently in development)

PostgreSQL records every change in the Write‑Ahead Log (WAL). Each WAL record has a Log Sequence Number (LSN): a 64‑bit position, typically displayed as two hexadecimal halves such as 0/40002A0 (high/low 32 bits).
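Because the two hex halves map directly to the high and low 32 bits, an application can parse an LSN into a single 64-bit integer for comparison. A minimal JavaScript sketch (the helper names are illustrative, not part of any driver):

```javascript
// Parse a textual LSN such as "0/40002A0" into one 64-bit value:
// high 32 bits before the slash, low 32 bits after it.
function parseLsn(lsn) {
  const [hi, lo] = lsn.split("/");
  return (BigInt("0x" + hi) << 32n) | BigInt("0x" + lo);
}

// Format it back, matching PostgreSQL's upper-case hex display.
function formatLsn(value) {
  const hi = (value >> 32n).toString(16).toUpperCase();
  const lo = (value & 0xffffffffn).toString(16).toUpperCase();
  return `${hi}/${lo}`;
}
```

With BigInt values, comparing two LSNs is a plain `<` or `>`, which is all a client needs to track the highest position it has written.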

Streaming replication ships WAL records from the primary to standbys, which then:

  1. Write WAL records to disk
  2. Flush them to durable storage
  3. Replay them, applying changes to data files

The write position determines what can be recovered after a database crash. The flush position defines the recovery point for a compute instance failure. The replay position determines what queries can see on a standby.

A recent commit for the next PostgreSQL release (19) adds a WAIT FOR LSN command to allow a session to block until one of these points reaches a target LSN:

  • standby_write → WAL written to disk on the standby (not yet flushed)
  • standby_flush → WAL flushed to durable storage on the standby
  • standby_replay (default) → WAL replayed into data files and visible to readers
  • primary_flush → WAL flushed on the primary (useful when synchronous_commit = off and a durability barrier is needed)

A typical flow is to write on the primary, commit, and then fetch the current WAL insert LSN:

pg19rw=# BEGIN;
BEGIN
pg19rw=*# INSERT INTO orders VALUES (123, 'widget');
INSERT 0 1
pg19rw=*# COMMIT;
COMMIT
pg19rw=# SELECT pg_current_wal_insert_lsn();
 pg_current_wal_insert_lsn
---------------------------
 0/18724C0
(1 row)

That LSN is then used to block reads on a replica until it has caught up:


pg19ro=# WAIT FOR LSN '0/18724C0'
  WITH (MODE 'standby_replay', TIMEOUT '2s');


This LSN‑based read‑your‑writes pattern in PostgreSQL requires extra round‑trips: capturing the LSN on the primary and explicitly waiting on the standby. For many workloads, reading from the primary is simpler and faster.

The pattern becomes valuable when expensive reads must be offloaded to replicas while still preserving read‑your‑writes semantics, or in event‑driven and CQRS designs where the LSN itself serves as a change marker for downstream consumers.
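Putting the two steps together, the client-side flow might look like the following sketch. It assumes the PG19 WAIT FOR LSN syntax shown above and two driver-like connections (`primary`, `replica`) whose `query` method resembles node-postgres; it is an illustration, not a finished implementation:

```javascript
// Hypothetical read-your-writes flow: write on the primary, capture its
// WAL insert LSN, wait for replay on the replica, then read there.
// `primary` and `replica` stand in for real driver connections.
async function readYourWrites(primary, replica, orderId) {
  // 1. Write and commit on the primary
  await primary.query("INSERT INTO orders VALUES ($1, 'widget')", [orderId]);

  // 2. Capture the LSN that covers this commit
  const { rows } = await primary.query("SELECT pg_current_wal_insert_lsn() AS lsn");
  const lsn = rows[0].lsn;

  // 3. Block on the replica until that LSN is replayed (PG19 syntax)
  await replica.query(`WAIT FOR LSN '${lsn}' WITH (MODE 'standby_replay', TIMEOUT '2s')`);

  // 4. This read now observes the write
  return replica.query("SELECT * FROM orders WHERE id = $1", [orderId]);
}
```

In a CQRS design, step 2's LSN is exactly the change marker that would be handed to downstream consumers instead of being consumed inline.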


MongoDB: Causal Consistency

While PostgreSQL reasons in WAL positions, MongoDB tracks causality using oplog timestamps and a hybrid logical clock.

In a replica set, each write on the primary produces an entry in local.oplog.rs, a capped collection. These entries are rewritten to be idempotent (for example, $inc becomes $set) so they can be safely reapplied. Each entry carries a Hybrid Logical Clock (HLC) timestamp that combines physical time with a logical counter, producing a monotonically increasing cluster time. Replica set members apply oplog entries in timestamp order.
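As an illustration only (MongoDB's internal clock is more involved), a hybrid logical clock can be sketched as a pair of wall-clock seconds and a logical counter, constructed so that timestamps never move backwards even if the physical clock stalls:

```javascript
// Toy hybrid logical clock: each tick is { t: wallSeconds, i: counter }.
// If the wall clock has not advanced, bump the counter instead, so the
// (t, i) pairs are strictly monotonically increasing.
function makeHlc(now = () => Math.floor(Date.now() / 1000)) {
  let last = { t: 0, i: 0 };
  return {
    next() {
      const wall = now();
      last = wall > last.t ? { t: wall, i: 0 } : { t: last.t, i: last.i + 1 };
      return { ...last };
    },
  };
}

// Order two HLC timestamps: physical part first, counter as tie-breaker.
function hlcCompare(a, b) {
  return a.t !== b.t ? a.t - b.t : a.i - b.i;
}
```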

Because MongoDB allows concurrent writes, temporary “oplog holes” can appear: a write with a later timestamp may commit before another write with an earlier timestamp. A naïve reader scanning the oplog could skip the earlier operation.

MongoDB prevents this by tracking an oplogReadTimestamp, the highest hole‑free point in the oplog. Secondaries are prevented from reading past this point until all prior operations are visible, ensuring causal consistency even in the presence of concurrent commits.
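A toy model (a deliberate simplification of the real mechanism, with numeric timestamps) makes the rule concrete: the hole-free read point is the highest timestamp before the earliest operation that has not yet committed.

```javascript
// Toy model: each slot is a reserved oplog timestamp plus a committed flag.
// The hole-free read point stops at the first gap left by an uncommitted slot.
function holeFreeReadPoint(slots) {
  const ordered = [...slots].sort((a, b) => a.ts - b.ts);
  let point = 0;
  for (const slot of ordered) {
    if (!slot.committed) break; // a hole: nothing past it is safely readable
    point = slot.ts;
  }
  return point;
}
```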

Causal consistency in MongoDB is enforced by attaching an afterClusterTime to reads:

  • Drivers track the operationTime of the last operation in a session.
  • When a session is created with causalConsistency: true, the driver automatically includes an afterClusterTime equal to the highest known cluster time on subsequent reads.
  • The server blocks the read until its cluster time has advanced beyond afterClusterTime.
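
The bookkeeping the driver performs can be sketched as follows (an assumption-laden simplification with scalar cluster times; real drivers exchange signed BSON `{clusterTime, signature}` documents):

```javascript
// Simplified session bookkeeping: remember the highest cluster time seen
// in any server reply, and stamp it onto every subsequent read concern.
function causalSession() {
  let clusterTime = 0; // real drivers use BSON timestamps, not numbers
  return {
    noteResponse(reply) {
      clusterTime = Math.max(clusterTime, reply.clusterTime);
    },
    readConcernFor(level = "local") {
      return clusterTime > 0
        ? { level, afterClusterTime: clusterTime }
        : { level };
    },
  };
}
```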

With any read preference that allows reading from secondaries as well as the primary, this guarantees read‑your‑writes behavior:


// Start a causally consistent session
const session = client.startSession({ causalConsistency: true });

const coll = db.collection("orders");

// Write in this session
await coll.insertOne({ id: 123, product: "widget" }, { session });

// The driver automatically injects afterClusterTime into the read concern
const order = await coll.findOne({ id: 123 }, { session });


Causal consistency is not limited to snapshot reads. It applies across read concern levels. The key point is that the session ensures later reads observe at least the effects of earlier writes, regardless of which replica serves the read.


Conclusion

Here is a simplified comparison:

| Feature | PostgreSQL WAIT FOR LSN | MongoDB Causal Consistency |
|---|---|---|
| Clock type | Physical byte offset in the WAL (LSN) | Hybrid Logical Clock (HLC) |
| Mechanism | Block until write/flush/replay LSN is reached | Block until afterClusterTime is visible |
| Tracking | Application captures the LSN | Driver tracks operationTime |
| Granularity | WAL record position | Oplog timestamp |
| Replication model | Physical streaming | Logical oplog application |
| Hole handling | N/A (serialized WAL) | oplogReadTimestamp |
| Failover handling | Error unless NO_THROW is set | Session continues, bounded by replication state |

Both PostgreSQL’s WAIT FOR LSN and MongoDB’s causal consistency ensure reads can observe prior writes, but at different layers:

  • PostgreSQL offers manual, WAL‑level precision.
  • MongoDB provides automatic, session‑level guarantees.

If you want read‑your‑writes semantics to “just work” without additional coordination calls, MongoDB’s session‑based model is a strong fit. Despite persistent myths about consistency, MongoDB delivers strong consistency in a horizontally scalable system with a simple developer experience.
