PostgreSQL High Availability: Patroni, pg_auto_failover, and Streaming Replication Compared
At 3 AM, the primary goes down. The application throws connection errors. Pages start returning 500s. Your phone buzzes with alerts you probably should have set up differently.
The question in that moment is not whether you have a replica. Most teams do. The question is whether that replica can take over automatically, without data loss, in under 30 seconds. And if you have never tested that scenario -- really tested it, not just assumed it works -- then you do not have high availability. You have a warm backup and a hope-based failover strategy.
I have watched teams discover this distinction the hard way. A standby that was silently 6 hours behind because nobody monitored replication lag. A Patroni cluster where the DCS (distributed configuration store) was a single etcd node that went down with the primary. A manual failover runbook that nobody had practiced, taking 45 minutes to execute at 3 AM with shaking hands.
PostgreSQL high availability is a solved problem in 2026. But choosing the right solution -- and more importantly, operating it correctly -- requires understanding the trade-offs. This guide covers everything from basic streaming replication to fully automated failover with Patroni, including the monitoring that makes any HA strategy actually work.
RPO and RTO: The Two Numbers That Define Your HA Strategy
Before choosing tools, define what "high availability" means for your application. Two metrics frame every decision:
Recovery Point Objective (RPO) -- How much data can you afford to lose? An RPO of zero means no committed transactions are lost during failover. An RPO of 5 minutes means you accept losing up to 5 minutes of writes. RPO determines whether you need synchronous or asynchronous replication.
Recovery Time Objective (RTO) -- How quickly must the database be available again? An RTO of 4 hours means restoring from a backup is acceptable. An RTO of 30 seconds means you need automated failover with health checking.
Here is how these map to real architectures:
| Strategy | RPO | RTO | Complexity |
|---|---|---|---|
| Backup restore (pg_basebackup) | Hours | Hours | Low |
| Async streaming replication + manual failover | Seconds to minutes | 5-30 minutes | Medium |
| Sync streaming replication + manual failover | Zero | 5-30 minutes | Medium |
| Patroni (async) | Seconds | 10-30 seconds | High |
| Patroni (sync) | Zero | 10-30 seconds | High |
| pg_auto_failover | Configurable | 10-30 seconds | Medium |
| Managed (RDS Multi-AZ, Aurora) | Zero (usually) | 30-120 seconds | Low |
Most production systems need an RTO under 60 seconds and an RPO as close to zero as the application can tolerate. If your answer is "we just restore from a nightly backup," you do not have HA -- you have disaster recovery. That is fine for development and staging, but production traffic demands more.
Streaming Replication: The Foundation of PostgreSQL HA
Every PostgreSQL HA solution builds on streaming replication. Understanding it deeply makes everything else make sense.
Streaming replication works by shipping Write-Ahead Log (WAL) records from the primary to one or more standbys in near real-time. The standby replays those WAL records to maintain an identical copy of the database.
Setting Up a Standby
On the primary, configure postgresql.conf:
```ini
# Primary configuration
wal_level = replica            # Required for replication
max_wal_senders = 10           # Max concurrent replication connections
max_replication_slots = 10     # Prevent WAL removal before replica consumes it
wal_keep_size = 1GB            # Keep WAL segments as a safety net
hot_standby = on               # Allow read queries on standby
```
Create a replication user and configure pg_hba.conf:
```sql
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'secure_password';
```

```
# pg_hba.conf - allow replication connections from standby
host    replication    replicator    10.0.1.0/24    scram-sha-256
```
Take a base backup and start the standby:
```bash
pg_basebackup -h primary-host -U replicator -D /var/lib/postgresql/data -Fp -Xs -P -R
```
The `-R` flag creates `standby.signal` and adds connection info to `postgresql.auto.conf`, so the standby knows where to stream WAL from.
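For reference, the `postgresql.auto.conf` written by `-R` ends up with a `primary_conninfo` line along these lines (the host and password here just echo the illustrative values above):

```ini
# postgresql.auto.conf (generated by pg_basebackup -R; values are illustrative)
primary_conninfo = 'host=primary-host port=5432 user=replicator password=secure_password'
```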
Asynchronous vs Synchronous Replication
By default, streaming replication is asynchronous. The primary commits transactions without waiting for the standby to confirm receipt. This means:
- Zero performance impact on the primary
- The standby is typically milliseconds behind
- During failover, you may lose the most recent transactions that had not yet been shipped
For synchronous replication, configure the primary:
```ini
synchronous_standby_names = 'FIRST 1 (standby1, standby2)'
synchronous_commit = on
```
Now the primary waits for at least one standby to confirm it has written the WAL to disk before reporting a commit as successful. This gives you RPO = 0, but every write incurs the network round-trip latency to the standby. For a standby in the same data center, that is typically 0.5-2ms of added latency per commit. Across regions, it can be 30-100ms -- which is often unacceptable.
A pragmatic middle ground is synchronous_commit = remote_write, where the standby confirms it has received the WAL in its OS buffer cache (not yet flushed to disk). This protects against primary failure but not simultaneous primary + standby failure.
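Another detail worth knowing: `synchronous_commit` can be changed per session or even per transaction, so critical writes keep RPO = 0 while low-value writes skip the synchronous wait. A sketch (the `audit_log` table is hypothetical):

```sql
-- Cluster default stays synchronous; this one transaction opts out of
-- the synchronous wait (its writes can be lost on failover).
BEGIN;
SET LOCAL synchronous_commit = 'off';
INSERT INTO audit_log (event) VALUES ('low-value event');  -- hypothetical table
COMMIT;
```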
The Limitation: Manual Failover
Plain streaming replication does not handle failover automatically. When the primary goes down, someone must:
- Verify the primary is actually down (not just a network blip)
- Ensure the standby has replayed all available WAL
- Promote the standby: `pg_ctl promote` or `SELECT pg_promote()`
- Reconfigure applications to connect to the new primary
- Rebuild the old primary as a standby (pg_rewind or fresh base backup)
This process takes 5-30 minutes with a practiced team and a good runbook. At 3 AM with an adrenaline-fueled on-call engineer, it takes longer. And every minute of downtime costs money, trust, or both.
This is why automated failover tools exist.
Patroni: The Industry Standard for Automated HA
Patroni is the tool most teams reach for when they need production-grade PostgreSQL HA. Developed by Zalando and used at scale by GitLab, Zalando itself, and hundreds of other organizations, Patroni handles leader election, automatic failover, planned switchover, and cluster bootstrapping.
How Patroni Works
Patroni runs as a daemon alongside each PostgreSQL instance. It uses a distributed configuration store (DCS) -- etcd, ZooKeeper, or Consul -- to coordinate leader election and store cluster state.
```text
        ┌─────────────────────────────────────────┐
        │              etcd cluster               │
        │        (3 or 5 nodes for quorum)        │
        └───────┬────────────┬────────────┬───────┘
                │            │            │
           ┌────▼────┐  ┌────▼────┐  ┌────▼────┐
           │ Patroni │  │ Patroni │  │ Patroni │
           │  Agent  │  │  Agent  │  │  Agent  │
           ├─────────┤  ├─────────┤  ├─────────┤
           │ PG Node │  │ PG Node │  │ PG Node │
           │(Primary)│  │(Standby)│  │(Standby)│
           └─────────┘  └─────────┘  └─────────┘
```
The leader acquires a lock in the DCS. If the leader fails to renew the lock (because the node crashed, the network partitioned, or PostgreSQL became unresponsive), the remaining nodes hold an election. The standby with the least replication lag wins and is promoted.
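As a toy model (not Patroni's actual code), the lease-and-election mechanics can be sketched in a few lines of Python: the leader must keep renewing a TTL-bound key in the DCS, and once the key expires, the healthiest standby under the lag threshold takes the lock.

```python
import time

class ToyDCS:
    """Toy in-memory stand-in for etcd: a single leader key with a TTL."""
    def __init__(self):
        self.leader = None
        self.expires_at = 0.0

    def acquire_or_renew(self, node, ttl, now=None):
        now = time.monotonic() if now is None else now
        # The lock is free if unset, expired, or already held by this node.
        if self.leader in (None, node) or now >= self.expires_at:
            self.leader, self.expires_at = node, now + ttl
            return True
        return False

def elect(dcs, candidates, max_lag_bytes, ttl, now=None):
    """Promote the least-lagged candidate, mimicking maximum_lag_on_failover."""
    eligible = [c for c in candidates if c["lag_bytes"] <= max_lag_bytes]
    for c in sorted(eligible, key=lambda c: c["lag_bytes"]):
        if dcs.acquire_or_renew(c["name"], ttl, now=now):
            return c["name"]
    return None

dcs = ToyDCS()
dcs.acquire_or_renew("node1", ttl=30, now=0.0)   # node1 holds the leader lock
# node1 crashes and stops renewing; by t=31 the 30-second lease has expired.
standbys = [{"name": "node2", "lag_bytes": 2_000_000},
            {"name": "node3", "lag_bytes": 4096}]
new_leader = elect(dcs, standbys, max_lag_bytes=1_048_576, ttl=30, now=31.0)
print(new_leader)  # node3: node2 exceeds the 1MB lag threshold
```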
Patroni Configuration
A minimal Patroni configuration:
```yaml
scope: postgres-cluster
name: node1

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.1.1:8008

etcd3:
  hosts: 10.0.2.1:2379,10.0.2.2:2379,10.0.2.3:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # 1MB - won't promote a replica lagging more than this
    postgresql:
      use_pg_rewind: true
      parameters:
        wal_level: replica
        max_wal_senders: 10
        max_replication_slots: 10
        hot_standby: on
        wal_log_hints: on  # Required for pg_rewind
  initdb:
    - encoding: UTF8
    - data-checksums

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.1.1:5432
  data_dir: /var/lib/postgresql/data
  authentication:
    superuser:
      username: postgres
      password: 'secure_password'
    replication:
      username: replicator
      password: 'secure_password'
```
Why Patroni Works Well
Configurable failover policy. The `maximum_lag_on_failover` setting prevents promoting a standby that is too far behind. You can also define a priority order for which standby should be promoted first.
Planned switchover. `patronictl switchover` lets you move the primary role to another node gracefully -- essential for maintenance, OS upgrades, or rebalancing.
pg_rewind integration. After failover, the old primary can rejoin the cluster without a full base backup using `pg_rewind`, which rewinds the old primary's timeline to match the new one.
REST API. Patroni exposes a REST API on each node reporting its role and health. Load balancers like HAProxy use this to route traffic: `GET /primary` returns 200 on the leader, `GET /replica` returns 200 on standbys.
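The routing decision a load balancer makes on top of those endpoints is simple to state: send writes to whichever node answers 200 on `/primary`. A sketch, with the hypothetical `check_status` function standing in for an HTTP GET against port 8008:

```python
def pick_primary(nodes, check_status):
    """Return the first node whose GET /primary check returns 200.

    check_status(node, path) -> HTTP status code; in production this
    would be a real HTTP request to the Patroni REST API on each node.
    """
    for node in nodes:
        if check_status(node, "/primary") == 200:
            return node
    return None

# Simulated cluster state: node2 currently holds the leader lock.
statuses = {("node1", "/primary"): 503, ("node2", "/primary"): 200,
            ("node3", "/primary"): 503}
primary = pick_primary(["node1", "node2", "node3"],
                       lambda n, p: statuses[(n, p)])
print(primary)  # node2
```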
When Patroni Is Too Much
Patroni's main cost is the DCS. Running a reliable etcd cluster means 3-5 additional nodes (or colocating etcd on the PostgreSQL nodes, which has its own trade-offs). If etcd loses quorum, Patroni cannot perform failover -- and depending on configuration, it may demote the primary to prevent split-brain.
For small teams or simpler deployments, this operational overhead can be significant. That is where pg_auto_failover comes in.
pg_auto_failover: Simpler Automated HA
pg_auto_failover, originally developed by Microsoft (now under the Citus open-source umbrella), takes a different approach. Instead of a distributed consensus system, it uses a dedicated monitor node to track cluster health and orchestrate failover.
Architecture
```text
            ┌──────────────┐
            │   Monitor    │
            │  (pg_auto_   │
            │   failover)  │
            └──────┬───────┘
                   │
          ┌────────┴────────┐
          │                 │
     ┌────▼────┐       ┌────▼────┐
     │ PG Node │ ───►  │ PG Node │
     │(Primary)│       │(Standby)│
     └─────────┘       └─────────┘
```
The monitor is a small PostgreSQL instance that tracks the state of each node. When the primary becomes unreachable, the monitor promotes the standby. The monitor itself is a single point of failure, but if it goes down, the PostgreSQL nodes continue operating -- you just lose the ability to perform automatic failover until the monitor recovers.
Setup
```bash
# On the monitor node
pg_autoctl create monitor \
  --pgdata /var/lib/postgresql/monitor \
  --pgport 5000

# On the primary
pg_autoctl create postgres \
  --pgdata /var/lib/postgresql/data \
  --pgport 5432 \
  --monitor postgres://autoctl_node@monitor-host:5000/pg_auto_failover \
  --name node1

# On the standby
pg_autoctl create postgres \
  --pgdata /var/lib/postgresql/data \
  --pgport 5432 \
  --monitor postgres://autoctl_node@monitor-host:5000/pg_auto_failover \
  --name node2
```
When to Choose pg_auto_failover Over Patroni
Fewer moving parts. No etcd, ZooKeeper, or Consul to manage. The monitor is a single PostgreSQL instance.
Simpler operations. The state machine is explicit and easy to reason about. Each node transitions through well-defined states (SINGLE, PRIMARY, SECONDARY, WAIT_STANDBY, etc.).
Two-node setups. pg_auto_failover handles primary + standby gracefully without the quorum concerns that make Patroni tricky with only two PostgreSQL nodes.
Trade-off. The monitor is a single point of failure for the failover mechanism. If the monitor is down when the primary fails, no automatic failover occurs. You can mitigate this by running the monitor with its own standby, but at that point the complexity gap with Patroni narrows.
Managed HA: RDS, Aurora, Cloud SQL, and Azure
If you run PostgreSQL on a major cloud provider, HA is a checkbox. But the implementations vary significantly.
Amazon RDS Multi-AZ maintains a synchronous standby in a different availability zone. Failover takes 60-120 seconds and involves a DNS update (the endpoint stays the same, but connections are dropped). The standby is not available for read queries -- it is a cold standby used only for failover.
Amazon Aurora uses a shared storage layer that replicates data across three AZs automatically. Read replicas connect to the same storage, so replication lag is typically under 100ms. Failover takes 30-60 seconds. Aurora is not wire-compatible with vanilla PostgreSQL in all edge cases, but for most applications it is transparent.
Google Cloud SQL offers regional HA with automatic failover. Similar to RDS Multi-AZ in concept but with generally faster failover times (under 60 seconds in most cases). Read replicas are available for read scaling.
Azure Flexible Server provides zone-redundant HA with synchronous replication and automatic failover. The standby is in a different AZ, and failover typically completes in 60-120 seconds.
The Trade-Offs of Managed HA
Pros: Zero operational burden for replication setup, automatic failover, integrated backups, no DCS to manage.
Cons: Less control over failover behavior, higher cost (you pay for the standby instance), potential vendor lock-in, and connection drops during failover that your application must handle gracefully.
The biggest hidden issue with managed HA is that failover still drops all active connections. Your application needs connection retry logic, and any in-flight transactions are rolled back. If your connection pooler or ORM does not handle reconnection gracefully, a 30-second failover can cascade into minutes of application errors.
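The retry logic that survives a failover is simple but easy to get wrong: bounded attempts, exponential backoff, and jitter so every client does not hammer the new primary at once. A minimal sketch, with `connect_fn` standing in for your driver's connect call:

```python
import random
import time

def connect_with_retry(connect_fn, attempts=8, base_delay=0.25, max_delay=5.0):
    """Retry a connection with exponential backoff and jitter.

    connect_fn: zero-argument callable that returns a connection or raises.
    The delay caps at max_delay so a 30-second failover does not turn
    every client into a synchronized retry storm.
    """
    for attempt in range(attempts):
        try:
            return connect_fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter

# Simulate a failover: two attempts fail, then the new primary answers.
calls = {"n": 0}
def fake_connect():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("primary unavailable")
    return "connection-to-new-primary"

conn = connect_with_retry(fake_connect, base_delay=0.01)
print(conn)  # connection-to-new-primary, after two failed attempts
```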
Connection Routing for HA: The Missing Piece
Automated failover is only half the problem. The other half is ensuring your application connects to the right node after failover. This is where connection pooling and routing become essential.
HAProxy with Patroni
HAProxy can use Patroni's REST API to route traffic:
```
listen postgres-primary
    bind *:5000
    option httpchk GET /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2
    server node1 10.0.1.1:5432 check port 8008
    server node2 10.0.1.2:5432 check port 8008
    server node3 10.0.1.3:5432 check port 8008

listen postgres-replicas
    bind *:5001
    balance roundrobin
    option httpchk GET /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2
    server node1 10.0.1.1:5432 check port 8008
    server node2 10.0.1.2:5432 check port 8008
    server node3 10.0.1.3:5432 check port 8008
```
Applications connect to port 5000 for writes and 5001 for reads. When the primary changes, HAProxy detects it within seconds through the health checks and reroutes traffic.
PgBouncer for Connection Management
PgBouncer does not handle failover routing itself, but it is critical for HA deployments because it manages connection lifecycle. During a failover, PgBouncer's `server_login_retry` setting controls how quickly it reconnects to the new primary. Place PgBouncer behind HAProxy or use it with a VIP (virtual IP) that follows the primary.
libpq Connection Strings
PostgreSQL's native client library supports multiple hosts in the connection string:
```
postgresql://host1:5432,host2:5432,host3:5432/mydb?target_session_attrs=read-write
```

The client tries each host in order and connects to the first one that matches `target_session_attrs=read-write` (which is the primary). This is the simplest approach for applications that use libpq directly, but it adds connection latency during failover as the client iterates through hosts.
Monitoring Your HA Setup: The Non-Negotiable
Here is the uncomfortable truth: every HA architecture described above can fail silently. Replication can stall. The standby can fall hours behind. WAL can accumulate on the primary until the disk fills. The DCS can lose quorum without anyone noticing.
The worst time to discover your replica is 6 hours behind is during a failover. By then it is too late -- you either accept the data loss or you wait for the replica to catch up while the application stays down.
Tools like myDBA.dev continuously monitor replication lag, replication slot status, WAL accumulation, and replica health. The replication topology visualization shows your entire cluster at a glance -- primary, standbys, replication direction, and lag for each connection -- so you know immediately if something drifts.
Alerts should fire on these conditions at minimum:
- Replication lag exceeds your RPO. If your RPO is 10 seconds, alert at 5 seconds of lag.
- Replication slot inactive. An inactive slot causes WAL accumulation on the primary.
- WAL accumulation growing. The primary's `pg_wal` directory should remain roughly constant in size. Growth means something is not consuming WAL.
- Standby not in streaming state. The standby should be in the `streaming` state, not stuck in `startup` or `catchup` for extended periods.
```sql
-- Check replication status on the primary
SELECT
    client_addr,
    state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication;

-- Check replication slot health
SELECT
    slot_name,
    active,
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_wal_bytes,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal_size
FROM pg_replication_slots;
```
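If you export these numbers to an external alerting system, the LSN arithmetic is straightforward: an LSN such as `16/B374D848` is a 64-bit WAL position printed as two hexadecimal halves, and byte lag is the difference between two positions, which is what `pg_wal_lsn_diff` computes. A minimal sketch in Python (the 1 MB alert threshold is illustrative):

```python
def lsn_to_bytes(lsn):
    """Convert a PostgreSQL LSN string like '16/B374D848' to a byte offset."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def replay_lag_bytes(sent_lsn, replay_lsn):
    """Equivalent of pg_wal_lsn_diff(sent_lsn, replay_lsn)."""
    return lsn_to_bytes(sent_lsn) - lsn_to_bytes(replay_lsn)

lag = replay_lag_bytes("16/B374D848", "16/B3700000")
print(lag)              # 317512 bytes behind
print(lag > 1_048_576)  # False: under an illustrative 1MB alert threshold
```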
Testing Failover: The Step Everyone Skips
If you have not tested failover in the last month, you do not know if it works. Configuration drift, software updates, network changes, and disk space issues can all break a failover that worked perfectly three months ago.
Build a Failover Test Routine
For Patroni:
```bash
# Planned switchover (safe, graceful)
patronictl switchover --leader node1 --candidate node2 --scheduled now --force

# Simulate primary failure (harder test)
# On the primary node:
sudo systemctl stop postgresql

# Watch Patroni logs on the standby to confirm promotion
```
For pg_auto_failover:
```bash
# Check current state
pg_autoctl show state

# Perform planned failover
pg_autoctl perform failover
```
For managed services: Most cloud providers support a "reboot with failover" option in the console or CLI. Use it.
What to Measure During Testing
- Time to detect failure. How long before the HA system notices the primary is gone?
- Time to promote. How long does the standby take to become the new primary?
- Time to reconnect. How long before the application establishes connections to the new primary?
- Data loss. Were any committed transactions lost? Check by writing a known sequence of values before the failover and verifying they all exist after.
- Application behavior. Did the application handle dropped connections gracefully, or did it crash, leak connections, or enter a retry storm?
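The detection, promotion, and reconnect windows above can be captured with a once-per-second probe loop run alongside the drill. A simplified sketch, with `probe` standing in for a real write-connection attempt:

```python
def measure_outage(probe, samples):
    """Return (outage_start, recovery_time) over a sequence of timestamps.

    probe(t) -> True if a write connection succeeds at time t.
    Records the first failing timestamp and the first success after it,
    i.e. the application-visible downtime window during a failover drill.
    """
    down_at = up_at = None
    for t in samples:
        ok = probe(t)
        if not ok and down_at is None:
            down_at = t
        elif ok and down_at is not None and up_at is None:
            up_at = t
            break
    return down_at, up_at

# Simulated drill: primary killed at t=5, standby serving writes at t=23.
down, up = measure_outage(lambda t: not (5 <= t < 23), range(60))
print(f"downtime: {up - down} seconds")  # downtime: 18 seconds
```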
Document the results. Set a calendar reminder to test again next month. Failover testing is not a one-time event -- it is a practice.
Comparison: Choosing the Right HA Solution
| Feature | Streaming Replication (Manual) | Patroni | pg_auto_failover | Managed (RDS/Aurora) |
|---|---|---|---|---|
| Automatic failover | No | Yes | Yes | Yes |
| Typical RTO | 5-30 min | 10-30 sec | 10-30 sec | 30-120 sec |
| Zero data loss (RPO=0) | With sync replication | Configurable | Configurable | Usually yes |
| Read replicas | Yes | Yes | Yes | Yes (varies) |
| External dependencies | None | etcd/ZK/Consul (3-5 nodes) | Monitor node (1) | None (managed) |
| Operational complexity | Low setup, high failover | High setup, low failover | Medium | Very low |
| Cost | Lowest | Medium (DCS infra) | Low-medium | Highest |
| Control over failover | Full | High | Medium | Low |
| Multi-region | Manual | Supported | Limited | Easy (managed) |
| Battle-tested at scale | Everywhere | GitLab, Zalando | Growing | AWS, GCP, Azure |
Decision Framework
Start with your constraints and work backward:
"We cannot manage infrastructure and just need it to work."
Use a managed service. RDS Multi-AZ for standard workloads, Aurora if you need fast read scaling or sub-minute failover. Accept the higher cost as the price of operational simplicity.
"We need automatic failover and run our own infrastructure."
Patroni if you have the team to operate an etcd cluster and want maximum control. pg_auto_failover if you want simpler operations and can tolerate a single monitor node as a (mitigable) point of failure.
"We are on a tight budget and can tolerate manual failover."
Streaming replication with a well-documented, regularly tested runbook. Invest the savings into monitoring so you know the instant something goes wrong. A good monitoring setup with alerting (like myDBA.dev) makes manual failover viable by giving you early warning and clear visibility into cluster state.
"We need zero data loss across regions."
Synchronous replication with Patroni or a managed service like Aurora Global Database. Accept the write latency cost. Test thoroughly because cross-region synchronous replication adds meaningful latency to every commit.
The Bottom Line
PostgreSQL high availability is not about picking the fanciest tool. It is about understanding your RPO and RTO requirements, choosing the simplest architecture that meets them, and then investing heavily in monitoring and testing.
The teams I have seen handle failovers well share three traits: they monitor replication lag continuously, they test failover monthly, and they have runbooks that even a sleep-deprived on-call engineer can follow at 3 AM.
The teams that struggle? They set up replication once, assumed it worked, and found out it did not at the worst possible moment.
Do not be that team. Set up your HA architecture, monitor it relentlessly, and test it before production forces you to.

