MongoDB Guests for MongoDB

Posted on May 21 • Edited on May 26

Comparing Replication and Failover in PostgreSQL and MongoDB

#database #distributedsystems #postgres #systemdesign

This article was written by Arek Borucki.

Every production database eventually faces the same question: what happens when the machine running it dies? An EC2 instance gets retired with three minutes’ warning. A data center loses connectivity to its peers. The database keeps the company’s data, and the company cannot wait for someone to drive to a colocation facility. So databases learned to replicate themselves. To keep more than one copy of the data, on more than one machine, ready to take over when something breaks.

PostgreSQL and MongoDB both solve this problem. They both replicate data. They both promote a new primary when the current one fails. But they took radically different paths to get there, and those paths reflect when each system was born and what its designers believed about how databases should work. PostgreSQL grew out of an academic project in the 1990s, in an era when high availability meant a tape backup and a phone call to a DBA. MongoDB was conceived in 2009, after a decade of internet companies discovering that machines fail constantly and recovery cannot involve humans. Their replication systems carry that history in their bones.

This article carefully walks through both systems. You will learn how PostgreSQL streams write-ahead logs to standby servers and how MongoDB tails its operation log on every secondary. You will see why PostgreSQL needs Patroni or repmgr to fail over automatically, while MongoDB has Raft consensus built into its replica set protocol. You will understand why a transaction in flight on a PostgreSQL primary simply dies during failover, while a prepared transaction on a MongoDB primary survives and resumes on the new primary. By the end, you will have a clear framework for choosing between them. Not based on marketing pages, but on which trade-offs actually fit your system.

Understanding the philosophical divide

The most important thing to know about PostgreSQL replication is that it does not include automatic failover. Read that sentence twice. PostgreSQL can stream every byte of its write-ahead log to a standby server in real time, keep that standby in lockstep with the primary, and let you read from it. But if the primary dies at three in the morning, PostgreSQL itself does not promote the standby. A human, or a piece of software written by humans, must notice the failure, decide which standby is the most up-to-date, and run the promotion command.

This is not an oversight. It is a design choice that goes back to PostgreSQL’s origins as the Berkeley POSTGRES project in the mid-1980s and its long incubation through the 1990s. Database research at the time treated replication as an engineering problem with many possible solutions and no single right answer. Should the primary wait for the standby to acknowledge writes, or should it commit and replicate asynchronously? Should the standby be a hot spare, ready to serve reads, or a cold backup? Should failover require human judgment, or should it be automatic? PostgreSQL refused to choose. It built the primitive, write-ahead logs, replication slots, hot standbys, and left the high-level policy to the operator.

MongoDB started from the opposite assumption. It was designed in 2009, in the wreckage of the early-2000s internet, where companies like Google, Amazon, and Facebook had spent a decade learning that machines fail constantly. A typical Google datacenter sees thousands of disk failures, hundreds of machine failures, and dozens of rack failures per year, as documented in Jeff Dean’s famous 2009 talk on cluster reliability. By 2009, the operational lesson was clear: if your database requires a human to recover from a single machine failure, your database is broken. MongoDB’s designers chose to bake high availability into the database itself. A MongoDB replica set is not a primary plus a few read replicas; it is a small cluster that elects its own leader, detects its own failures, and recovers from them without human intervention.

A useful analogy is the difference between a car that comes with a spare tire and a car that comes with run-flat tires. PostgreSQL hands you a spare tire and a jack: it tells you exactly how to replace the wheel, but you have to be the one who notices the flat and pulls over. MongoDB ships run-flat tires that detect a puncture and let the car keep driving. Both approaches keep you on the road. The first gives you more control over what happens when something goes wrong; the second works at three in the morning when nobody is paying attention.

This difference shapes everything else in this chapter. When PostgreSQL operators want automatic failover, they install Patroni, repmgr, or pg_auto_failover. External systems that wrap PostgreSQL and add the missing logic. These tools watch the primary, detect when it stops responding, run an election among the surviving standbys, and promote the winner. Some of them use etcd or Consul to coordinate, treating PostgreSQL itself as a passive replicated state machine that responds to instructions from the outside. The PostgreSQL process knows nothing about its peers; it just streams WAL to whoever asks. When a MongoDB operator wants automatic failover, they do nothing. It is already there.

Why has PostgreSQL not absorbed this functionality? The community has debated it for years, and the answer is partly philosophical and partly practical.

Philosophically, PostgreSQL favors small, composable primitives over integrated systems. Adding automatic failover to core PostgreSQL would mean adding consensus algorithms, leader election, and cluster membership management. Substantial complexity that the community has been reluctant to take on, when external tools already do the job. Practically, automatic failover requires answering questions that depend on the deployment: how many failures can the cluster tolerate, what counts as the primary being down, and how should clients discover the new primary. PostgreSQL’s answer has been to expose enough hooks that external tools can implement any policy, rather than commit to one.

The trade-off is real. With PostgreSQL, you get flexibility. You can build exactly the HA system your environment needs, or pick from several mature options. You also get complexity. Patroni alone is a substantial system, and getting it right requires understanding etcd, distributed consensus, network partitions, and split-brain scenarios. A misconfigured Patroni cluster can cheerfully promote two primaries during a network partition, and only careful fencing prevents the resulting data divergence. With MongoDB, you get simplicity. The rs.initiate() command and three nodes give you a working HA cluster, and you give up flexibility. The MongoDB replica set protocol is what it is; you cannot swap out its election algorithm or its heartbeat interval beyond the configuration knobs the database exposes.

Warning: the apparent simplicity of MongoDB replication can mislead operators into deploying minimum configurations that fail in surprising ways. A two-node replica set looks reasonable until you realize that any partition leaves both nodes unable to form a majority, so neither can serve writes. An odd number of voting members is not optional advice; it is a structural requirement of the protocol. PostgreSQL operators face the equivalent trap when they configure synchronous replication with a single standby and discover that losing the standby blocks all writes on the primary. Both systems have failure modes that follow from their consistency guarantees, and both reward operators who understand those guarantees before deploying.

Throughout this article, you will see the philosophical divide play out at every level. PostgreSQL has two replication mechanisms because operators wanted both physical and logical replication. MongoDB has one. PostgreSQL’s synchronous replication is configured through cryptic strings in synchronous_standby_names; MongoDB exposes write concerns at the query level. PostgreSQL clients connect to a hostname, and MongoDB drivers maintain a topology view of the cluster. Each system is internally consistent, but they answer different questions, and the answers reflect the era in which each was built.

Tracing how PostgreSQL replicates data

PostgreSQL replication starts with the write-ahead log. Every change to the database, every INSERT, every UPDATE, every CREATE INDEX, is written to the WAL before it touches the actual data files. The WAL exists for crash recovery: if the database process dies between writing a row and flushing it to disk, replaying the WAL on restart reconstructs the lost change. This is how every modern relational database achieves durability. What PostgreSQL noticed, sometime in the early 2000s, is that the same WAL stream that lets a single server recover from a crash can be sent to a second server to keep it in sync. Replication, in PostgreSQL, is fundamentally a side effect of crash recovery extended across the network.

PostgreSQL exposes two replication mechanisms built on top of the WAL: physical replication and logical replication. They operate concurrently, on the same WAL stream, and serve different use cases. Understanding the difference between them clarifies why PostgreSQL has accumulated the configuration knobs and external tools it has.

Physical replication: the WAL on the wire

Physical replication ships the WAL byte-for-byte from the primary to one or more standby servers. The standby is, in a literal sense, a replica of the primary at the binary level. Every page in the standby’s data files matches the corresponding page on the primary. This makes physical replication conceptually simple: the primary writes its WAL; a process called walsender streams those records to the standby; the standby’s walreceiver process receives them; and the recovery system on the standby applies them as if it were replaying a crash. The standby is always in recovery mode and only stops being a standby when it is promoted to primary.

Streaming replication is the modern way to ship WAL. The standby connects to the primary via the replication protocol, identifies itself, and asks for WAL records starting from a specific log sequence number (LSN). The primary streams records as they are generated. To prevent the primary from deleting WAL files that a standby still needs, PostgreSQL uses replication slots: small pieces of state on the primary that track how far each standby has consumed. As long as a slot exists, the primary will not delete WAL beyond the slot’s position. This is critical because a standby that falls too far behind can otherwise discover that the WAL it needs has been recycled, at which point it cannot catch up without a fresh base backup.

PostgreSQL also supports cascading replication, where a standby acts as a relay, streaming WAL it has received to other downstream standbys. This reduces network bandwidth on the primary and lets you build replication trees that span multiple data centers without each standby holding a connection back to the origin. Cascading replication carries a subtlety: the downstream standbys can never be more current than the relay standby, which means failover policies must account for the relay sometimes lagging behind the primary.

Synchronous replication adds durability guarantees on top of streaming. By default, the primary acknowledges a commit as soon as the WAL is flushed to its own disk; the standby may receive and apply the change later. Synchronous replication makes the primary wait for one or more standbys to acknowledge the write before reporting success to the client. This is configured through synchronous_standby_names, a string that lists which standbys count as synchronous and how many of them must acknowledge. PostgreSQL supports two methods. FIRST n (s1, s2, s3) requires the first n standbys in the list, in priority order, to confirm. ANY n (s1, s2, s3) requires any n of the listed standbys to confirm. A quorum-style policy that tolerates individual standby failures better. The synchronous_commit parameter further refines what acknowledgment means: remote_write requires the standby to write the WAL to the filesystem cache (surviving a PostgreSQL crash, but not necessarily an OS crash), while on requires the WAL to be flushed to disk, and remote_apply requires it to actually apply the change so subsequent reads on the standby see it.

A common mistake is to enable synchronous replication with a single standby and a strict synchronous_commit value, then panic when writes hang during a standby restart. The primary, by design, waits for the synchronous standby to confirm the commit before acknowledging success to the client. If the standby is gone, the primary blocks. Production deployments either configure ANY-style synchrony with multiple standbys or use synchronous_commit = remote_write to balance durability with availability. The right setting depends on whether you would rather lose a recent transaction during a primary crash or stop accepting writes during a standby outage.

Logical replication: the row-level alternative

Physical replication is powerful, but it has constraints that follow from its byte-level nature. The standby must run the same major version of PostgreSQL as the primary; you cannot replicate from version 14 to version 16 because the on-disk page formats might differ. The standby replicates the entire PostgreSQL cluster (that is, the entire database instance, including all databases in it); you cannot replicate just one database or table. The standby is read-only; you cannot write to it without breaking replication. These constraints are fine for high availability, where you want a hot spare identical to the primary, but they make physical replication unsuitable for use cases like cross-version upgrades, partial replication for analytics, or feeding data into systems that are not PostgreSQL.

Built-in logical replication, introduced in PostgreSQL 10 and refined heavily since, solves these problems by replicating at the row level instead of the page level. Where physical replication says, “here is the byte change at offset 8192 in file 16384,” logical replication says, “here is an INSERT into table users with these column values.” This higher-level representation lets logical replication cross PostgreSQL major versions, replicate selected tables, and lets subscribers process the changes through plugins or external systems.

The architecture is publish-subscribe. On the publisher node, you create a publication that defines which tables to replicate. On the subscriber node, you create a subscription that connects to the publisher and asks for one or more publications. The publisher records each change in its WAL as usual, and a separate logical decoding process extracts the high-level row changes from the WAL and sends them to the subscriber, which applies them to its own tables.

-- On the publisher
CREATE PUBLICATION mypub FOR TABLE users, departments;

-- On the subscriber
CREATE SUBSCRIPTION mysub
  CONNECTION 'dbname=foo host=publisher.local user=repuser'
  PUBLICATION mypub;

Logical replication uses several worker processes coordinated by the logical replication launcher (ApplyLauncherMain in the source). The Apply Worker, one per subscription, receives changes from the publisher and applies them. The Table Sync Worker handles the initial copy of each table’s existing data when the subscription is first created. The Sequence Sync Worker synchronizes sequence values. The Parallel Apply Worker can apply streaming transactions in parallel, which matters for large transactions that would otherwise serialize on the subscriber.

Behind the scenes, logical decoding does the heavy lifting. The reorder buffer assembles individual WAL records into complete transactions, spooling large transactions to disk if they exceed memory. The pgoutput plugin transforms the assembled changes into the logical replication protocol that subscribers understand. Snapshot-build (snapbuild.c) maintains historical catalog snapshots so that a row decoded from old WAL is interpreted using the schema that existed when the row was written, not the current schema. These mechanisms make logical decoding work across schema changes, but they also make it more expensive than physical replication.

Logical replication has an important limitation that confuses operators on first encounter: it does not replicate DDL. A CREATE TABLE on the publisher does not propagate to the subscriber. Schema changes must be applied separately on each node, and they must be applied carefully. If a column is added to the publisher and not yet to the subscriber, replication of new rows including that column will fail until the subscriber catches up. The same applies to TRUNCATE behavior, sequence values for newly created sequences, and other DDL-adjacent operations. Logical replication is therefore well-suited for replicating data between known schemas but poorly suited as a general-purpose schema mirror.

Replication origins, exposed through the pg_replication_origin catalog, track where each subscriber is in each publisher’s WAL stream. Each origin records both the remote LSN it has consumed and the local LSN where the change was applied. This metadata is what makes logical replication safe across crashes: if the subscriber dies mid-apply, it knows on restart exactly which transaction to resume. Replication origins also prevent infinite loops in bidirectional replication setups. A change that came in from origin A is tagged with that origin, so the subscriber knows not to send it back.

Setting up replication, in either form, requires explicit configuration. Physical replication needs wal_level = replica, max_wal_senders, and a base backup made with pg_basebackup. Logical replication needs wal_level = logical and one of the available output plugins (almost always pgoutput in modern deployments). PostgreSQL ships utilities like pg_basebackup and pg_createsubscriber to make these workflows easier, but the operator is still in charge of running them in the right order and pointing the standby or subscriber at the right primary. There is no equivalent to MongoDB’s rs.initiate() command that creates a working replicated cluster from scratch.

Tracing how MongoDB replicates data

MongoDB replication looks deceptively simple from the outside. You start three mongod processes, run rs.initiate() against one of them, add the other two with rs.add(), and wait a few seconds. The first node becomes primary, the other two become secondaries, and the cluster begins serving traffic. There is no separate replication daemon, no external coordination service, and no configuration file string that lists which nodes count as synchronous. Everything runs inside the same mongod process. Behind that simplicity sits a careful design that turns Raft consensus, an operation log, and pull-based replication into a system that genuinely needs no human intervention to recover from a primary failure.

The oplog: a log of operations, not bytes

The heart of MongoDB replication is the oplog, a capped collection named oplog.rs in the local database that exists on every replica set member. Where PostgreSQL’s WAL records byte-level changes to data pages, the oplog records logical operations: insert this document into this collection, update this document with these field changes, drop this index. Each oplog entry carries a timestamp called an OpTime that uniquely identifies the operation in the cluster’s history. Secondaries replicate by pulling entries from the primary’s oplog and applying them to their own copies of the data.

The oplog, being a capped collection, has consequences. It is a fixed-size circular buffer, sized at startup or via the replSetResizeOplog command. When it fills up, the oldest entries are overwritten. This means a secondary that falls too far behind can discover that the oplog entries it needs have been overwritten, at which point it cannot catch up incrementally and must perform a full initial sync. Sizing the oplog is therefore an operational decision: too small, and a brief secondary outage forces a multi-hour resync; too large, and you waste storage on history nobody will replay. The MongoDB documentation recommends sizing the oplog to cover at least 24 to 72 hours of write traffic in production, but the right number depends on your tolerance for resync events.

The mechanics of write replication are straightforward. When a write arrives at the primary, it goes through the same path as on a standalone mongod: the storage engine commits the change to disk, and an OpObserver inserts a corresponding document into the oplog. Secondaries pull oplog entries continuously through a long-running connection. The pull-based model differs from PostgreSQL’s push-based streaming and reflects MongoDB’s assumption that secondaries should be in charge of their own progress, rather than relying on the primary to keep them up to date. If a secondary lags or disconnects, the primary keeps writing without slowing down; the secondary catches up at its own pace when it reconnects.

Initial sync handles the case where a new node joins the replica set or an existing secondary has fallen too far behind to catch up incrementally. The new secondary connects to a sync source (usually a healthy secondary, but it can be the primary), reads every collection in every database, and tails the oplog from the time it started reading. Initial sync is expensive. It can take hours or days for a multi-terabyte cluster, but the protocol is robust to interruption: if the new secondary disconnects mid-sync, it can resume rather than start over.

The replication coordinator and the topology coordinator

Inside the mongod process, MongoDB replication runs through two components that are easier to understand once you know they were deliberately factored apart. The ReplicationCoordinatorImpl is the central orchestrator. It owns the public API, commands like replSetGetStatus, replSetReconfig, and the implicit machinery that runs when a write arrives. It manages the lifecycle of replication: starting it up at server boot, transitioning between member states (PRIMARY, SECONDARY, RECOVERING, ROLLBACK), and shutting down cleanly. It tracks states like _memberState, _myLastAppliedOpTime, and _myLastWrittenOpTime, which other components query to make decisions.

The TopologyCoordinator is the brain of the cluster. It owns the election algorithm, the heartbeat policy, the sync source selection logic, and the rules for deciding when to step down. Critically, it performs no I/O. It is a pure decision engine that takes inputs (heartbeat responses, configuration changes, election timeouts) and produces decisions (start an election, vote for this candidate, step down). The separation matters because it makes the consensus logic testable: you can drive the TopologyCoordinator with synthetic events and verify its decisions without spinning up a real network.

This split mirrors how Raft is described in academic literature, which is not an accident. MongoDB’s replica set protocol implements a variant of Raft, the consensus algorithm published by Diego Ongaro and John Ousterhout in 2014. Raft was designed specifically to be easier to understand than its predecessor, Paxos, and MongoDB adopted it after a few years of running its earlier protocol (sometimes called PV0), which revealed edge cases that Raft’s formalism made easier to reason about. The current protocol version, often referred to as PV1 in the source code, applies Raft-style rules: every member has a term, every election increments the term, and votes are tallied by the majority of the voting members.

Heartbeats: knowing who is alive

Replica set members maintain their view of the cluster through heartbeats. Each member sends periodic heartbeat messages to every other member, asking for status. The default heartbeat interval is two seconds, and members are considered unreachable if they fail to respond within ten seconds (the electionTimeoutMillis default, also configurable). Heartbeats carry more than liveliness information: they include each member’s OpTime, term, configuration version, and state. This means every node knows which member is primary, who is most up-to-date, and whether a configuration change is in flight, just by observing the heartbeats.

The TopologyCoordinator maintains a MemberData structure for each peer, recording ping times, last heartbeat response, and health status. When a heartbeat response arrives, _handleHeartbeatResponse updates this structure and, if necessary, triggers state changes. If the primary stops responding for long enough, secondaries notice through their MemberData and start the election process. The election timeout is critical here: too short and transient network blips trigger spurious failovers, too long and real failures take a long time to detect. The default of ten seconds reflects a compromise that works for most deployments but can be tuned for environments with stricter latency requirements.

Pull-based replication and sync source selection

Secondaries replicate by pulling oplog entries from a chosen sync source. The choice of sync source is not fixed: a secondary picks among available members based on factors like ping time, oplog freshness, replication chain depth, and configuration preferences (members can be tagged to influence selection). A secondary may sync from another secondary rather than from the primary, which is conceptually similar to PostgreSQL’s cascading replication and reduces load on the primary in large clusters.

The pull model has one important property: it makes the secondary responsible for its own progress. If the secondary slows down for any reason, like disk pressure or network congestion, the primary continues serving writes without blocking. Replication lag accumulates on the secondary side and is visible in metrics like replSetGetStatus output and the oplog window. PostgreSQL’s push model can be configured for similar non-blocking behavior, but its synchronous mode actively ties the primary’s commit latency to the standby’s acknowledgment, which the MongoDB write concern model handles differently (covered in the next section).

A subtlety worth noting: oplog entries are designed to be idempotent. The same operation can be applied twice without changing the final state. This matters for replication safety. If a secondary crashes mid-apply, it can replay the last few operations on restart without corrupting data. PostgreSQL’s WAL records have similar properties (they are designed to be replayed), but the bookkeeping is different: PostgreSQL tracks LSN positions, MongoDB tracks OpTime, and both protocols include enough information to resume after an unclean shutdown.

Comparing durability and synchronous semantics

A write is durable when, after the database tells the client the write succeeded, the database can recover that write through any single failure. The two systems mean roughly the same thing by durability, but they let you control it through completely different interfaces, and the differences reveal each system’s assumptions about who should make durability decisions.

PostgreSQL exposes durability through the synchronous_commit and synchronous_standby_names parameters. These are server-side settings: the database administrator chooses them once for the cluster (or per role, or per session via SET), and every transaction inherits the choice. The model is rigid by design. If your deployment policy says writes must reach two standbys, the database enforces that universally. The synchronous_commit values define the durability level: off means commits are not even written to disk before acknowledging (fast, lossy), local means flushed to the primary’s disk only, on means the synchronous standbys have written but not necessarily applied, remote_write means the standbys have received the data, and remote_apply means the standbys have actually applied the change. The synchronous_standby_names string then defines which standbys count and how many must confirm.

MongoDB takes the opposite approach. Durability is specified per write, through the writeConcern parameter. A client can issue an insert with { w: "majority", j: true } and the server will wait until a majority of voting members have written the change to disk before acknowledging. Another client, on the same database, can issue a write with { w: 1 } and get an acknowledgment as soon as the primary has applied it. This is more flexible than PostgreSQL’s server-wide setting, and it makes the durability decision a property of the operation rather than the deployment. A user-facing write that creates an account uses { w: "majority", j: true }; a background job that records click metrics might use { w: 1 } because losing a click is acceptable.

Both systems converge on the same durability primitive: a quorum of disks has flushed the WAL or oplog. PostgreSQL with synchronous_commit = on and synchronous_standby_names = 'ANY 1 (s1, s2, s3)' gives you a durability guarantee similar to MongoDB with w: 2 and j: true. The difference is who chose: the operator who set up the cluster, or the developer who wrote the query. There is no universally right answer. PostgreSQL’s model is easier to audit. You can read the configuration and know what the cluster guarantees, but harder to vary by operation. MongoDB’s model is more expressive but requires every developer to think about write concerns, and code that omits the write concern silently uses the default ({ w: "majority" }).

The read side has a similar pattern. PostgreSQL supports reads on hot standbys, with hot_standby_feedback to let the standby tell the primary not to vacuum rows the standby is still using. Reads on standbys may be stale due to some replication lag. MongoDB exposes read preferences (primary, primaryPreferred, secondary, secondaryPreferred, nearest) and read concerns (local, available, majority, snapshot, linearizable) on a per-query basis. Reading with readConcern: "majority" means the data has been committed to a majority of nodes, which is stronger than reading from a possibly lagging secondary in PostgreSQL terms. linearizable read concern is even stronger. It guarantees you see the latest write, at the cost of additional round-trips. Again, the philosophical pattern repeats: PostgreSQL gives you knobs at the deployment level, MongoDB exposes them at the query level.

Why the divergence? The answer is partly about how each system thinks about clients. PostgreSQL was designed when SQL was the universal interface, and applications rarely shipped with database-level configuration awareness; the DBA controlled durability because the application could not be expected to know. MongoDB was designed when applications were richer, drivers shipped with sophisticated configuration, and developers were comfortable specifying durability per operation. Neither approach is wrong. They reflect different ideas about where the contract between the database and application sits.

A practical warning applies to both systems. Operators sometimes configure synchronous replication or majority write concern and then deploy a cluster that cannot satisfy the requirement. Too few standbys, an arbiter that does not store data and therefore cannot acknowledge majority writes, and a synchronous standby that is offline. The result is the primary blocking on writes, which presents to applications as latency or timeouts. Test your durability configuration against the failure modes you actually expect: take down a standby in a staging cluster, see what happens, then decide whether the behavior is acceptable.

Detecting failure

Replication keeps copies of the data; failure detection decides when to use them. The contrast between the two systems is sharpest here because PostgreSQL has essentially no failure detection in core, and MongoDB has it built into every node.

In MongoDB, every member of a replica set sends periodic heartbeats to every other member. The default heartbeat interval is two seconds, configurable via the settings.heartbeatIntervalMillis configuration. Heartbeats are bidirectional: every member is both a sender and a receiver, and every member maintains its own view of the cluster’s health.

When a member fails to respond to heartbeats for longer than electionTimeoutMillis (default 10 seconds), peers consider it unreachable. If the unreachable member is the primary, eligible secondaries begin the election process to choose a new one.

The heartbeat data is rich. A heartbeat response carries the responder’s state (PRIMARY, SECONDARY, etc.), its term, its OpTime, its configuration version, and various health metrics. The TopologyCoordinator on each node uses this stream of information to maintain a consistent view of the cluster. Because every node sees the same heartbeat traffic, every node can independently decide what to do when something changes. There is no single coordinator that must itself stay alive for the cluster to make decisions. This is the essence of distributed consensus: each member runs the same algorithm on the same inputs and arrives at the same conclusion.

PostgreSQL has nothing equivalent in core. The walreceiver on a standby maintains a connection to the primary and detects when the connection breaks, but it does not act on that detection. It simply tries to reconnect. The primary does not actively monitor its standbys for liveness; it only knows about a standby that has connected and is consuming WAL. There is no protocol for a standby to vote, no protocol for primary handoff, and no protocol for cluster membership beyond the operator-managed standby connection. PostgreSQL is, in this sense, a single-node database that streams its WAL to bystanders.

External tools fill the gap. Patroni, the most popular HA solution for PostgreSQL, runs alongside each Postgres instance and uses a distributed key-value store (typically etcd or Consul) to maintain cluster state. Each Patroni agent registers its node, monitors the local Postgres process, and writes liveness updates to the key-value store. If the agent on the primary fails to update its leader key within a configured TTL, the key expires, and the Patroni agents on the standbys race to acquire it through a compare-and-swap operation. The winner promotes its local Postgres to primary; the losers reconfigure themselves to follow the new primary. The leader election logic lives entirely in Patroni and etcd. PostgreSQL itself only sees commands like pg_promote() or recovery configuration changes.

repmgr, a different tool, takes a different approach. It uses a daemon (repmgrd) on each node that monitors the primary and coordinates failover through repmgr metadata stored in the database itself. The trade-offs are subtle: Patroni has a robust external coordination layer (etcd) at the cost of running an additional service; repmgr has fewer moving parts but relies on the standby cluster reaching consensus through reads of the primary’s metadata, which is harder when the primary is the failed node. pg_auto_failover, a newer option from Microsoft, uses a separate monitor process that all nodes report to, again with its own trade-offs around the monitor itself becoming a single point of failure.

The result is that failure detection in PostgreSQL is a deployment decision, not a database decision. Two PostgreSQL deployments running the same software can have completely different failure detection semantics depending on which HA tool wraps them, what its timeout values are, and how its coordination layer is configured. A MongoDB deployment’s failure detection is essentially the same on every cluster of the same version, because it is part of the database. This makes MongoDB easier to reason about for an operator who is running it for the first time. It makes PostgreSQL more flexible for an operator who needs failure detection that integrates with broader infrastructure — a Kubernetes cluster, an existing etcd-based coordination plane, or a custom orchestrator.

A common failure mode in PostgreSQL HA setups is the network partition that splits the etcd cluster from the database cluster. Patroni cannot make decisions without a quorum in etcd, so the database cluster effectively freezes, not because the database itself failed, but because the coordination layer cannot reach a quorum. MongoDB has a similar pattern, but the coordination layer and the database layer are the same nodes, so a partition that affects coordination affects the database in identical ways. The integrated approach has fewer surprises in partition scenarios but cannot offload coordination to a separate, possibly more reliable, layer.

Electing a new leader

When the primary is gone, the cluster needs to pick a new one. The procedure has to satisfy two constraints. First, it must terminate. A cluster that endlessly debates who should be primary is unusable. Second, it must produce exactly one primary. A cluster with two primaries can accept conflicting writes that diverge silently, a condition called split-brain that is one of the worst failure modes in distributed systems. Both PostgreSQL ecosystems and MongoDB solve this through consensus algorithms, but they put the consensus in very different places.

How MongoDB elects a primary

MongoDB uses a Raft-derived election protocol, implemented in the TopologyCoordinator and exercised through the ReplicationCoordinatorImpl. When a secondary detects that the primary has been unreachable longer than the election timeout, it transitions through a series of states designed to avoid wasting votes on candidates that cannot win.

The first phase is the dry run. Before requesting actual votes, the candidate sends a vote request marked as dry_run = true to all voting members. Each member responds with whether it would have voted for this candidate based on the current state. Specifically, whether the candidate is at least as up-to-date as the responder, and whether the responder has not already voted in this term. A dry run that returns a majority of yes responses means the candidate has a reasonable chance of winning a real election. A dry run that fails (typically because some other node thinks it is more up-to-date) lets the candidate back off without burning a term number.

If the dry run succeeds, the candidate increments its term, votes for itself, and sends a real vote request, ReplSetRequestVotesArgs, in the wire protocol. Voting members evaluate three things: (1) is the candidate at least as up-to-date as I am, measured by OpTime; (2) has the candidate’s term advanced beyond my current term; and (3) have I already voted in this term. If all three conditions are met, the voter grants the vote. Once the candidate collects a majority of votes from voting members, it transitions to the PRIMARY state, and the cluster recognizes it through the next round of heartbeats.

A few details matter for understanding what can and cannot happen. Only voting members count toward the majority. A member configured with votes: 0 (often used for analytics secondaries) is not part of the quorum and cannot vote in elections. Arbiters are voting members who hold no data. They exist solely to break ties in even-numbered clusters and are themselves a source of operational confusion because they affect quorum calculations without contributing data redundancy. The terminology can mislead: a five-node replica set with three data-bearing voting members and two arbiters has the same fault tolerance as a three-node cluster, not a five-node cluster, because the arbiters cannot serve data.

The dry run phase is a refinement of the stock Raft, designed to mitigate a specific bad pattern. In plain Raft, a partitioned node that loses contact with the rest of the cluster will keep starting elections, incrementing its term each time. When the partition heals, this node’s high term number forces the surviving primary to step down even though that primary was happily serving traffic. The dry run prevents the partitioned node from incrementing its term unless it can plausibly win an election, which means the high-term churn never happens.

How PostgreSQL elects a primary

PostgreSQL itself does not elect anything. Promoting a standby to primary is a single operation: pg_promote() (since Postgres 12) or, equivalently, calling the trigger file mechanism. The promoted node stops being in recovery, accepts writes, and starts producing its own WAL. The election logic, deciding which standby to promote, ensuring no other standby is also being promoted, and redirecting clients, lives entirely outside PostgreSQL.

In a Patroni-managed cluster, the election is run by Patroni agents using the underlying distributed coordination service. The protocol typically goes like this. Each Patroni agent periodically renews a leader key in etcd with a TTL of, say, 30 seconds. As long as the primary’s Patroni agent is alive and the primary database is healthy (Patroni runs local health checks), the key is renewed, and the leader is stable. If the primary fails or its agent stops renewing, the leader key expires. The Patroni agents on the standbys race to acquire the now-empty key through a compare-and-swap operation. Whichever standby wins the race and is also healthy and sufficiently caught up, it promotes its local PostgreSQL and writes itself as the new leader. Critically, the compare-and-swap is atomic in etcd, so exactly one standby can win.

The fault tolerance of this scheme depends on the etcd cluster, not on PostgreSQL. An etcd cluster of three nodes tolerates one failure; an etcd cluster of five tolerates two. If the etcd cluster loses quorum, no leader change can happen, and the existing leader (if any) eventually loses its lock. Whether this is acceptable depends on whether you can keep etcd more reliable than PostgreSQL itself. In Kubernetes deployments, etcd is often a separate cluster used for many things; in dedicated Postgres deployments, you might run etcd on the same machines as Patroni, and a failure that takes out a database node also takes out the local etcd member.

There is a subtlety around catch-up. Patroni and similar tools must decide whether to promote a standby that has slightly stale WAL or wait for it to catch up first. Promoting a stale standby loses recent writes that did not make it across; waiting risks extending the outage. Patroni’s default behavior favors availability — it will promote a standby that is reasonably caught up rather than wait — but you can configure it to be stricter. MongoDB faces the same trade-off and resolves it by prioritizing and requiring that the elected primary be at least as up-to-date as a majority of voting members. The mechanisms differ, but the choice of availability versus consistency is the same.

Preventing two primaries

Both systems take split-brain seriously, and both rely on majority requirements to prevent it. In MongoDB, a candidate must receive votes from a majority of voting members. If the cluster is partitioned into a minority and a majority partition, only the majority partition can elect a new primary; the minority cannot reach quorum and remains read-only or unavailable. The old primary, if it survives in the minority partition, eventually steps down on its own (the protocol requires the primary to maintain heartbeat contact with a majority of voters; if that contact is lost for the heartbeat timeout, the primary steps down).

In a Patroni-managed PostgreSQL cluster, split-brain prevention has two layers. The etcd lock prevents two Patroni agents from both believing they are the leader. The optional fencing mechanism (Patroni’s rewind feature, plus tools like pg_rewind) handles the case where an old primary returns after a network partition heals. MongoDB has a similar mechanism called rollback: when a stepped-down primary rejoins the cluster and discovers its tail of oplog was not replicated to the new primary, the unreplicated operations are rolled back and saved to a separate file for human inspection. Both rollback mechanisms are last resorts and should not happen in correctly configured clusters.

Handling failover at the client layer

A failover that the database handles is only useful if applications notice and adapt. The two systems take different approaches to the client side, and the differences shape how applications must be written.

MongoDB drivers maintain a topology view of the cluster. When an application connects to a MongoDB replica set, the driver does not connect to a single hostname. It connects to a list of seed hosts and discovers the rest of the cluster through the isMaster command (renamed hello in newer protocol versions). The driver tracks which member is currently primary, and routes writes to it; reads go to whichever member matches the configured read preference. When the primary changes, the driver detects the change through subsequent isMaster responses and updates its routing. From the application’s perspective, this happens transparently. A write that arrived during the failover sees a brief error, but with retryable writes enabled, it is retried automatically against the new primary.

Retryable writes are an important safety net. The MongoDB protocol assigns each write a unique transaction number derived from the client’s session. When the driver retries a failed write, the server checks whether the operation has already been applied (using the transaction number to look up the operation in a special transactions collection). If yes, the server reconstructs the original response without re-executing the write; if no, it executes the write normally. This makes retries safe against double-application, even in edge cases where the original write succeeded but the response did not reach the client. PostgreSQL has nothing equivalent at the protocol level; retrying a failed INSERT in PostgreSQL is the application’s responsibility, and idempotency is a property of the SQL the application chooses to write.

PostgreSQL clients connect to a hostname and port. There is no protocol-level concept of a cluster. If the primary fails and a standby takes over, the client must somehow be redirected to the new primary, and PostgreSQL itself does nothing to make that happen. Production deployments solve this in one of three ways.

The classic approach is a virtual IP (VIP) that floats between the primary and standby. Clients connect to the VIP; whichever node currently holds the VIP is the active primary. VIPs can be managed through Pacemaker, keepalived, or a cloud load balancer. The pattern is simple but has subtle failure modes. The VIP must move atomically (no two nodes can hold it simultaneously), and clients connected to the VIP at the time of failover see their connections drop, which they must then reconnect.

A second approach uses a connection pooler with topology awareness. PgBouncer in transaction-pooling mode, deployed in front of the cluster, can be reconfigured to point at the new primary on failover. The application sees a stable PgBouncer endpoint while the actual primary changes behind the scenes. This decouples the application from the cluster topology and tends to work well with HA tools like Patroni, which can update the pooler’s configuration during promotion.

The third approach pushes topology awareness into the client driver. JDBC supports multi-host connection strings (jdbc:postgresql://host1,host2,host3/db?targetServerType=primary), and similar mechanisms exist in many language ecosystems. The driver tries each host until it finds one that is currently the primary, retrying as needed. This is closer in spirit to MongoDB’s driver-level topology, but it lacks the protocol support to detect role changes proactively. The driver only learns about the new topology when a query fails, not before.

The practical consequence is that writing failover-tolerant applications looks different on the two systems. On MongoDB, you connect to the cluster, enable retryable writes (on by default in modern drivers), and let the driver handle most of the rest. Your application code is mostly indistinguishable from code that talks to a single mongod. In PostgreSQL, choose your failover routing strategy (VIP, pooler, or multi-host), ensure your transactions are idempotent or wrap them in retry logic, and verify that your client library handles connection drops the way you expect. Neither approach is intrinsically better, but the work that has to happen in application code is meaningfully larger in PostgreSQL.

A worth-mentioning subtlety: read preference handling. Both systems let clients route reads to non-primary nodes, but they handle staleness differently. MongoDB drivers honor read preferences (primary, secondary, secondaryPreferred, nearest) and read concerns (local, majority, snapshot, linearizable) on a per-query basis, with the driver picking the right node. PostgreSQL clients connecting through a multi-host string can specify targetServerType=preferSecondary, but they cannot specify a read concern; staleness is whatever the standby happens to be at, with no protocol-level guarantee. PostgreSQL’s hot_standby_feedback parameter helps prevent vacuum-induced query cancellations on standbys, but the application has no way to ask for a strict freshness guarantee at query time.

Surveying the operational reality

Running these systems in production looks different. Both PostgreSQL and MongoDB demand operational competence, but the shape of the work is different enough that an operator skilled in one will need to relearn habits to be effective with the other.

A typical PostgreSQL HA stack includes more components than first-time operators expect. You have PostgreSQL itself. The primary and standbys, each with its own postgresql.conf, pg_hba.conf, and recovery configuration. You have Patroni or repmgr running alongside each Postgres instance, with its own configuration file and process supervision. You have etcd or Consul (often three or five nodes for fault tolerance), with its own deployment, security, and backup considerations. You have a connection-routing layer — PgBouncer, HAProxy, a cloud load balancer, or some combination. That needs to be told about topology changes. You have monitoring, e.g., pg_exporter for Prometheus, Patroni’s own metrics, and etcd’s metrics. Each producing its own view of cluster health. Configuration drift is a real risk: PostgreSQL configuration changes have to be coordinated with Patroni, which has its own opinions about which settings it owns versus which ones the operator can change directly.

A typical MongoDB HA deployment has fewer moving parts. You have mongod processes. Three or five of them. Each running a single configuration file. You have client drivers that handle topology discovery automatically. You have monitoring (Prometheus exporters, mongostat, the built-in $listSessions and serverStatus commands). That is, in many cases, it. There is no separate orchestration daemon, no external coordination service, no connection router. The simplicity is an asset for small operations teams and a liability for large ones, because it means you cannot mix-and-match components the way you can with PostgreSQL.

The flip side: PostgreSQL’s flexibility means each shop runs a slightly different stack, and skills are partially transferable. An operator who knows Patroni can move between teams and apply roughly the same playbook. An operator who has only used vanilla MongoDB has fewer options to deploy, but the operations are more uniform across deployments. The rs.status() output is the same on every replica set, the failover behavior is the same, the metrics names are the same. If you join a team running MongoDB and want to know how to fail over, the answer is in the documentation; if you join a team running PostgreSQL with Patroni, you need to read both the PostgreSQL and Patroni documentation, plus the team’s internal runbooks for how their etcd cluster is laid out.

Upgrades

Major version upgrades are a real pain point in PostgreSQL. Physical replication does not work across major versions, so the standard upgrade path involves either pg_upgrade (which upgrades a single instance in place, requiring downtime) or logical replication (set up a new version cluster, replicate data into it, switch over). Each approach has trade-offs. pg_upgrade is fast for small clusters and complex for large ones; logical replication is more robust for large clusters but requires schema compatibility and careful coordination. Tools like pg_easy_replicate automate parts of the logical replication path, but the operator still has to think about the cutover.

MongoDB upgrades are simpler in the common case. Replica sets support rolling upgrades: you upgrade one secondary, wait for it to rejoin, repeat for the other secondaries, then step down the primary and upgrade it last. Each version of MongoDB documents its compatibility window with the previous version, and feature compatibility flags let you upgrade the binaries before enabling new on-disk features. There are still pitfalls. Some upgrades require running specific scripts to update internal collections, and major version jumps may require intermediate stops, but the basic flow is straightforward and well-trodden.

Observability and debugging

Both systems expose detailed cluster state, but through different surfaces. PostgreSQL exposes pg_stat_replication on the primary (showing each connected standby and its lag) and pg_stat_wal_receiver on the standby (showing the receiver’s state). Patroni adds its own /cluster endpoint (or patronictl list output) that shows the topology, including which node is the leader. MongoDB exposes rs.status() and rs.printSecondaryReplicationInfo(), each producing a JSON document that captures the entire replica set’s state. The MongoDB output is denser and more uniform; the PostgreSQL output is more accurate to the actual database state, but requires you to know which catalogs to query.

Replication lag is monitored differently. In PostgreSQL, lag is measured in WAL bytes (replay_lsn vs current_lsn) or in time (now() - last_replay_timestamp). In MongoDB, lag is measured in OpTime difference, which is roughly equivalent to time since the OpTime is itself a Timestamp. Both systems can produce confusing numbers when a cluster is idle: a PostgreSQL standby that is fully caught up reports zero lag, but the timestamp-based metric drifts upward in idle periods because the primary has not generated new WAL. MongoDB’s OpTime gap behaves similarly. Operators learn to distinguish “real lag” from “cluster is idle” by checking whether traffic is flowing.

Choosing between them

After seven sections of comparison, the temptation is to declare one system the winner. The honest answer is that neither system wins on its own merits; they win for specific deployments. The right way to choose is to start with the workload and the team, then ask which system’s replication and failover model fits.

Where PostgreSQL’s model wins

PostgreSQL’s philosophy of primitives over policy plays to its strengths in environments that already have strong infrastructure conventions. If your platform team runs etcd for other purposes, integrating Patroni is straightforward and gives you HA that fits your existing operational mental model. If your application has heterogeneous replication needs, a synchronous standby in the same datacenter, an asynchronous standby for disaster recovery in another region, and a logical replica for analytics with a subset of tables. PostgreSQL’s independent replication mechanisms let you build the exact topology you need without compromise. MongoDB’s replica sets are uniform; you cannot mix synchronous and asynchronous nodes within a set the way PostgreSQL’s synchronous_standby_names allows.

Regulatory environments often favor PostgreSQL for similar reasons. Auditors like to see explicit configuration: this row in the postgresql.conf file says writes must be confirmed by two standbys, and you can read it. MongoDB’s per-query write concerns are auditable too, but the audit involves tracing through application code rather than reading server-side configuration. Both can satisfy compliance requirements, but PostgreSQL’s server-side approach is easier to centralize.

Workloads that benefit from logical replication — ETL pipelines, change data capture into Kafka or other streams, and gradual migrations between database versions — are often easier on PostgreSQL than on MongoDB. PostgreSQL’s logical replication is mature, supported by the database itself, and integrated with widely used tools (Debezium, pg_logical, pglogical). MongoDB has change streams, which solve the same problem with a different shape, but they live at a higher level and are tightly coupled to the MongoDB protocol; they are great for MongoDB-aware consumers and less useful when you need to feed data into systems that want SQL semantics.

Where MongoDB’s model wins

MongoDB’s built-in HA is a real asset for teams that do not have specialized database operations expertise. A four-person backend team can deploy a three-node MongoDB replica set in production and trust that automatic failover will work without ever reading about Raft. The same team running PostgreSQL with Patroni would need to learn etcd, distributed consensus failure modes, fencing, and a half-dozen Patroni configuration parameters before they could be confident the cluster would recover correctly from a primary failure. Both teams can succeed, but the MongoDB team starts higher up the learning curve.

Cloud-native workloads often fit MongoDB’s model well. A Kubernetes deployment that uses StatefulSets, headless services, and DNS-based discovery can run a MongoDB replica set with minimal extra infrastructure. The cluster discovers itself through DNS, drivers handle topology, and the operator does not need to maintain a separate coordination layer. PostgreSQL operators on Kubernetes typically run a Postgres operator (CloudNativePG, Stolon, Zalando’s Postgres Operator, Crunchy Data’s operator) that wraps Patroni or implements similar logic, which is not much extra work but is one more dependency to keep track of.

Workloads that want fine-grained durability per operation — some writes need majority acknowledgment, others can fire-and-forget — fit MongoDB’s write concerns naturally. PostgreSQL can simulate this with synchronous_commit overrides per session, but the model is more awkward. Conversely, an application that wants every write to have the same durability guarantee finds PostgreSQL’s server-side configuration easier to reason about.

A practical decision framework

When you cannot decide based on workload features alone, fall back to operational fit. Ask three questions. First, what does your team know? An ops team with deep PostgreSQL experience will run PostgreSQL well, even with the additional HA tooling overhead. A team with MongoDB experience will deliver more reliably on MongoDB. Skills compound; the database you are best at running is the database you should run, all else being equal.

Second, what does your platform look like? If your platform already provides Patroni-style coordination (etcd, Consul, Zookeeper), running PostgreSQL on top of it is straightforward. If your platform is opinionated about minimizing dependencies (small Kubernetes clusters, edge deployments, single-tenant workloads), MongoDB’s self-contained model is a better fit. The infrastructure you do not have is sometimes the deciding factor.

Third, what is your worst-case failure mode? PostgreSQL with Patroni has well-understood failure modes (etcd partition, fencing failure, pg_rewind divergence) that experienced operators can debug. MongoDB has its own (rollback files, network partitions that strand minorities, oplog window exhaustion) that experienced MongoDB operators can debug. The systems are roughly comparable in reliability when run by experienced teams. If your team is new to both, the simpler operational model of MongoDB usually wins, because there is less to misconfigure.

A final note: the choice is rarely permanent. Both systems are mature, both have migration tools, and both can be replaced if the original choice turns out wrong. The cost of switching is real but not catastrophic for most workloads. The cost of choosing badly and then refusing to revisit the decision is much higher. If you choose PostgreSQL and your team finds itself spending half its time tuning Patroni, that is a signal worth acting on. If you choose MongoDB and discover you really need cross-version logical replication, that is worth acting on too. The replication and failover models in this chapter are a starting point for the choice, not the end of it.

Summary

PostgreSQL exposes replication primitives (WAL streaming, replication slots, hot standbys) and leaves automatic failover to external tools such as Patroni, repmgr, pg_auto_failover, or Kubernetes-native operators like CloudNativePG.
MongoDB ships replication and automatic failover as part of the database itself, implemented through the ReplicationCoordinator, the TopologyCoordinator, and a Raft-based election protocol.
PostgreSQL supports two replication mechanisms: physical (byte-for-byte WAL streaming, used for HA) and logical (publish-subscribe row-level replication, used for cross-version upgrades and selective replication).
MongoDB has a single replication model based on the oplog, a capped collection that records logical operations and is tailed by secondaries through pull-based replication.
Synchronous durability is configured at the cluster level in PostgreSQL through synchronous_standby_names (with FIRST and ANY methods) and at the per-write level in MongoDB through write concerns.
Failure detection in MongoDB happens through a heartbeat protocol that every node runs internally; PostgreSQL has no equivalent in core and relies on external monitors for detection.
Leader election in MongoDB uses Raft consensus with a dry-run phase to prevent unnecessary term increments; PostgreSQL leader election runs in external tools (Patroni with etcd, repmgr, or pg_auto_failover).
Split-brain prevention in both systems requires a majority of voting members; without a majority, the cluster cannot elect a primary or accept new writes.
MongoDB drivers maintain a topology view of the cluster and handle failover transparently, with retryable writes providing automatic safe retry across primary changes.
PostgreSQL clients connect to a hostname and require external mechanisms (VIPs, connection poolers like PgBouncer, or multi-host JDBC strings) to route connections after a failover.
In-flight transactions on a failed PostgreSQL primary are lost and must be retried by the application; MongoDB unprepared transactions are explicitly aborted, while prepared transactions survive stepdown for sharded two-phase commit safety.
A typical PostgreSQL HA stack involves PostgreSQL plus Patroni plus etcd plus a connection-routing layer; a typical MongoDB HA deployment involves only the mongod processes and standard drivers.
Major-version upgrades are easier in MongoDB (rolling upgrades within a replica set) than in PostgreSQL (where physical replication does not span major versions and pg_upgrade or logical replication is required).
Choose PostgreSQL when you need replication flexibility, mature logical replication tooling, or fit with existing infrastructure conventions (etcd, Consul, Kubernetes operators).
Choose MongoDB when you want HA without a separate coordination layer, per-operation durability semantics, or operational simplicity for teams without deep database operations expertise.

Top comments (1)

Some comments may only be visible to logged-in visitors. Sign in to view all comments.