PostgreSQL High Availability: Patroni, pg_auto_failover, and Streaming Replication Compared
At 3 AM, the primary goes down. The application throws connection errors. Pages start returning 500s. Your phone buzzes with alerts you probably should have set up differently.
The question in that moment is not whether you have a replica. Most teams do. The question is whether that replica can take over automatically, without data loss, in under 30 seconds. And if you have never tested that scenario -- really tested it, not just assumed it works -- then you do not have high availability. You have a warm backup and a hope-based failover strategy.
I have watched teams discover this distinction the hard way. A standby that was silently 6 hours behind because nobody monitored replication lag. A Patroni cluster where the DCS (distributed configuration store) was a single etcd node that went down with the primary. A manual failover runbook that nobody had practiced, taking 45 minutes to execute at 3 AM with shaking hands.
PostgreSQL high availability is a solved problem in 2026. But choosing the right solution -- and more importantly, operating it correctly -- requires understanding the trade-offs. This guide covers everything from basic streaming replication to fully automated failover with Patroni, including the monitoring that makes any HA strategy actually work.
RPO and RTO: The Two Numbers That Define Your HA Strategy
Before choosing tools, define what "high availability" means for your application. Two metrics frame every decision:
Recovery Point Objective (RPO) -- How much data can you afford to lose? An RPO of zero means no committed transactions are lost during failover. An RPO of 5 minutes means you accept losing up to 5 minutes of writes. RPO determines whether you need synchronous or asynchronous replication.
Recovery Time Objective (RTO) -- How quickly must the database be available again? An RTO of 4 hours means restoring from a backup is acceptable. An RTO of 30 seconds means you need automated failover with health checking.
Here is how these map to real architectures:
| Strategy | RPO | RTO | Complexity |
|---|---|---|---|
| Backup restore (pg_basebackup) | Hours | Hours | Low |
| Async streaming replication + manual failover | Seconds to minutes | 5-30 minutes | Medium |
| Sync streaming replication + manual failover | Zero | 5-30 minutes | Medium |
| Patroni (async) | Seconds | 10-30 seconds | High |
| Patroni (sync) | Zero | 10-30 seconds | High |
| pg_auto_failover | Configurable | 10-30 seconds | Medium |
| Managed (RDS Multi-AZ, Aurora) | Zero (usually) | 30-120 seconds | Low |
Most production systems need an RTO under 60 seconds and an RPO as close to zero as the application can tolerate. If your answer is "we just restore from a nightly backup," you do not have HA -- you have disaster recovery. That is fine for development and staging, but production traffic demands more.
Streaming Replication: The Foundation of PostgreSQL HA
Every PostgreSQL HA solution builds on streaming replication. Understanding it deeply makes everything else make sense.
Streaming replication works by shipping Write-Ahead Log (WAL) records from the primary to one or more standbys in near real-time. The standby replays those WAL records to maintain an identical copy of the database.
Setting Up a Standby
On the primary, configure postgresql.conf:
```ini
# Primary configuration
wal_level = replica            # Required for replication
max_wal_senders = 10           # Max concurrent replication connections
max_replication_slots = 10     # Prevent WAL removal before replica consumes it
wal_keep_size = 1GB            # Keep WAL segments as a safety net
hot_standby = on               # Allow read queries on standby
```
Create a replication user and configure pg_hba.conf:
```sql
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'secure_password';
```

```
# pg_hba.conf - allow replication connections from standby
host    replication    replicator    10.0.1.0/24    scram-sha-256
```
Take a base backup and start the standby:
```bash
pg_basebackup -h primary-host -U replicator -D /var/lib/postgresql/data -Fp -Xs -P -R
```
The `-R` flag creates `standby.signal` and adds connection info to `postgresql.auto.conf`, so the standby knows where to stream WAL from.
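For reference, the `postgresql.auto.conf` written by `-R` ends up with a `primary_conninfo` line along these lines (the host and password here just echo the illustrative values above):

```ini
# postgresql.auto.conf (generated by pg_basebackup -R; values are illustrative)
primary_conninfo = 'host=primary-host port=5432 user=replicator password=secure_password'
```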
Asynchronous vs Synchronous Replication
By default, streaming replication is asynchronous. The primary commits transactions without waiting for the standby to confirm receipt. This means:
- Zero performance impact on the primary
- The standby is typically milliseconds behind
- During failover, you may lose the most recent transactions that had not yet been shipped
For synchronous replication, configure the primary:
```ini
synchronous_standby_names = 'FIRST 1 (standby1, standby2)'
synchronous_commit = on
```
Now the primary waits for at least one standby to confirm it has written the WAL to disk before reporting a commit as successful. This gives you RPO = 0, but every write incurs the network round-trip latency to the standby. For a standby in the same data center, that is typically 0.5-2ms of added latency per commit. Across regions, it can be 30-100ms -- which is often unacceptable.
A pragmatic middle ground is synchronous_commit = remote_write, where the standby confirms it has received the WAL in its OS buffer cache (not yet flushed to disk). This protects against primary failure but not simultaneous primary + standby failure.
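Another detail worth knowing: `synchronous_commit` can be changed per session or even per transaction, so critical writes keep RPO = 0 while low-value writes skip the synchronous wait. A sketch (the `audit_log` table is hypothetical):

```sql
-- Cluster default stays synchronous; this one transaction opts out of
-- the synchronous wait (its writes can be lost on failover).
BEGIN;
SET LOCAL synchronous_commit = 'off';
INSERT INTO audit_log (event) VALUES ('low-value event');  -- hypothetical table
COMMIT;
```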
The Limitation: Manual Failover
Plain streaming replication does not handle failover automatically. When the primary goes down, someone must:
- Verify the primary is actually down (not just a network blip)
- Ensure the standby has replayed all available WAL
- Promote the standby: `pg_ctl promote` or `SELECT pg_promote()`
- Reconfigure applications to connect to the new primary
- Rebuild the old primary as a standby (pg_rewind or fresh base backup)
This process takes 5-30 minutes with a practiced team and a good runbook. At 3 AM with an adrenaline-fueled on-call engineer, it takes longer. And every minute of downtime costs money, trust, or both.
This is why automated failover tools exist.
Patroni: The Industry Standard for Automated HA
Patroni is the tool most teams reach for when they need production-grade PostgreSQL HA. Developed by Zalando and used at scale by GitLab, Zalando itself, and hundreds of other organizations, Patroni handles leader election, automatic failover, planned switchover, and cluster bootstrapping.
How Patroni Works
Patroni runs as a daemon alongside each PostgreSQL instance. It uses a distributed configuration store (DCS) -- etcd, ZooKeeper, or Consul -- to coordinate leader election and store cluster state.
```text
        ┌─────────────────────────────────────────┐
        │              etcd cluster               │
        │        (3 or 5 nodes for quorum)        │
        └───────┬────────────┬────────────┬───────┘
                │            │            │
           ┌────▼────┐  ┌────▼────┐  ┌────▼────┐
           │ Patroni │  │ Patroni │  │ Patroni │
           │  Agent  │  │  Agent  │  │  Agent  │
           ├─────────┤  ├─────────┤  ├─────────┤
           │ PG Node │  │ PG Node │  │ PG Node │
           │(Primary)│  │(Standby)│  │(Standby)│
           └─────────┘  └─────────┘  └─────────┘
```
The leader acquires a lock in the DCS. If the leader fails to renew the lock (because the node crashed, the network partitioned, or PostgreSQL became unresponsive), the remaining nodes hold an election. The standby with the least replication lag wins and is promoted.
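As a toy model (not Patroni's actual code), the lease-and-election mechanics can be sketched in a few lines of Python: the leader must keep renewing a TTL-bound key in the DCS, and once the key expires, the healthiest standby under the lag threshold takes the lock.

```python
import time

class ToyDCS:
    """Toy in-memory stand-in for etcd: a single leader key with a TTL."""
    def __init__(self):
        self.leader = None
        self.expires_at = 0.0

    def acquire_or_renew(self, node, ttl, now=None):
        now = time.monotonic() if now is None else now
        # The lock is free if unset, expired, or already held by this node.
        if self.leader in (None, node) or now >= self.expires_at:
            self.leader, self.expires_at = node, now + ttl
            return True
        return False

def elect(dcs, candidates, max_lag_bytes, ttl, now=None):
    """Promote the least-lagged candidate, mimicking maximum_lag_on_failover."""
    eligible = [c for c in candidates if c["lag_bytes"] <= max_lag_bytes]
    for c in sorted(eligible, key=lambda c: c["lag_bytes"]):
        if dcs.acquire_or_renew(c["name"], ttl, now=now):
            return c["name"]
    return None

dcs = ToyDCS()
dcs.acquire_or_renew("node1", ttl=30, now=0.0)   # node1 holds the leader lock
# node1 crashes and stops renewing; by t=31 the 30-second lease has expired.
standbys = [{"name": "node2", "lag_bytes": 2_000_000},
            {"name": "node3", "lag_bytes": 4096}]
new_leader = elect(dcs, standbys, max_lag_bytes=1_048_576, ttl=30, now=31.0)
print(new_leader)  # node3: node2 exceeds the 1MB lag threshold
```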
Patroni Configuration
A minimal Patroni configuration:
```yaml
scope: postgres-cluster
name: node1

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.1.1:8008

etcd3:
  hosts: 10.0.2.1:2379,10.0.2.2:2379,10.0.2.3:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # 1MB - won't promote a replica lagging more than this
    postgresql:
      use_pg_rewind: true
      parameters:
        wal_level: replica
        max_wal_senders: 10
        max_replication_slots: 10
        hot_standby: on
        wal_log_hints: on  # Required for pg_rewind
  initdb:
    - encoding: UTF8
    - data-checksums

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.1.1:5432
  data_dir: /var/lib/postgresql/data
  authentication:
    superuser:
      username: postgres
      password: 'secure_password'
    replication:
      username: replicator
      password: 'secure_password'
```
Why Patroni Works Well
Configurable failover policy. The `maximum_lag_on_failover` setting prevents promoting a standby that is too far behind. You can also define a priority order for which standby should be promoted first.
Planned switchover. `patronictl switchover` lets you move the primary role to another node gracefully -- essential for maintenance, OS upgrades, or rebalancing.
pg_rewind integration. After failover, the old primary can rejoin the cluster without a full base backup using `pg_rewind`, which rewinds the old primary's timeline to match the new one.
REST API. Patroni exposes a REST API on each node reporting its role and health. Load balancers like HAProxy use this to route traffic: `GET /primary` returns 200 on the leader, `GET /replica` returns 200 on standbys.
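The routing decision a load balancer makes on top of those endpoints is simple to state: send writes to whichever node answers 200 on `/primary`. A sketch, with the hypothetical `check_status` function standing in for an HTTP GET against port 8008:

```python
def pick_primary(nodes, check_status):
    """Return the first node whose GET /primary check returns 200.

    check_status(node, path) -> HTTP status code; in production this
    would be a real HTTP request to the Patroni REST API on each node.
    """
    for node in nodes:
        if check_status(node, "/primary") == 200:
            return node
    return None

# Simulated cluster state: node2 currently holds the leader lock.
statuses = {("node1", "/primary"): 503, ("node2", "/primary"): 200,
            ("node3", "/primary"): 503}
primary = pick_primary(["node1", "node2", "node3"],
                       lambda n, p: statuses[(n, p)])
print(primary)  # node2
```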
When Patroni Is Too Much
Patroni's main cost is the DCS. Running a reliable etcd cluster means 3-5 additional nodes (or colocating etcd on the PostgreSQL nodes, which has its own trade-offs). If etcd loses quorum, Patroni cannot perform failover -- and depending on configuration, it may demote the primary to prevent split-brain.
For small teams or simpler deployments, this operational overhead can be significant. That is where pg_auto_failover comes in.
pg_auto_failover: Simpler Automated HA
pg_auto_failover, originally developed by Microsoft (now under the Citus open-source umbrella), takes a different approach. Instead of a distributed consensus system, it uses a dedicated monitor node to track cluster health and orchestrate failover.
Architecture
```text
            ┌──────────────┐
            │   Monitor    │
            │  (pg_auto_   │
            │   failover)  │
            └──────┬───────┘
                   │
          ┌────────┴────────┐
          │                 │
     ┌────▼────┐       ┌────▼────┐
     │ PG Node │ ───►  │ PG Node │
     │(Primary)│       │(Standby)│
     └─────────┘       └─────────┘
```
The monitor is a small PostgreSQL instance that tracks the state of each node. When the primary becomes unreachable, the monitor promotes the standby. The monitor itself is a single point of failure, but if it goes down, the PostgreSQL nodes continue operating -- you just lose the ability to perform automatic failover until the monitor recovers.
Setup
```bash
# On the monitor node
pg_autoctl create monitor \
  --pgdata /var/lib/postgresql/monitor \
  --pgport 5000

# On the primary
pg_autoctl create postgres \
  --pgdata /var/lib/postgresql/data \
  --pgport 5432 \
  --monitor postgres://autoctl_node@monitor-host:5000/pg_auto_failover \
  --name node1

# On the standby
pg_autoctl create postgres \
  --pgdata /var/lib/postgresql/data \
  --pgport 5432 \
  --monitor postgres://autoctl_node@monitor-host:5000/pg_auto_failover \
  --name node2
```
When to Choose pg_auto_failover Over Patroni
Fewer moving parts. No etcd, ZooKeeper, or Consul to manage. The monitor is a single PostgreSQL instance.
Simpler operations. The state machine is explicit and easy to reason about. Each node transitions through well-defined states (SINGLE, PRIMARY, SECONDARY, WAIT_STANDBY, etc.).
Two-node setups. pg_auto_failover handles primary + standby gracefully without the quorum concerns that make Patroni tricky with only two PostgreSQL nodes.
Trade-off. The monitor is a single point of failure for the failover mechanism. If the monitor is down when the primary fails, no automatic failover occurs. You can mitigate this by running the monitor with its own standby, but at that point the complexity gap with Patroni narrows.
Managed HA: RDS, Aurora, Cloud SQL, and Azure
If you run PostgreSQL on a major cloud provider, HA is a checkbox. But the implementations vary significantly.
Amazon RDS Multi-AZ maintains a synchronous standby in a different availability zone. Failover takes 60-120 seconds and involves a DNS update (the endpoint stays the same, but connections are dropped). The standby is not available for read queries -- it is a cold standby used only for failover.
Amazon Aurora uses a shared storage layer that replicates data across three AZs automatically. Read replicas connect to the same storage, so replication lag is typically under 100ms. Failover takes 30-60 seconds. Aurora is not wire-compatible with vanilla PostgreSQL in all edge cases, but for most applications it is transparent.
Google Cloud SQL offers regional HA with automatic failover. Similar to RDS Multi-AZ in concept but with generally faster failover times (under 60 seconds in most cases). Read replicas are available for read scaling.
Azure Flexible Server provides zone-redundant HA with synchronous replication and automatic failover. The standby is in a different AZ, and failover typically completes in 60-120 seconds.
The Trade-Offs of Managed HA
Pros: Zero operational burden for replication setup, automatic failover, integrated backups, no DCS to manage.
Cons: Less control over failover behavior, higher cost (you pay for the standby instance), potential vendor lock-in, and connection drops during failover that your application must handle gracefully.
The biggest hidden issue with managed HA is that failover still drops all active connections. Your application needs connection retry logic, and any in-flight transactions are rolled back. If your connection pooler or ORM does not handle reconnection gracefully, a 30-second failover can cascade into minutes of application errors.
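The retry logic that survives a failover is simple but easy to get wrong: bounded attempts, exponential backoff, and jitter so every client does not hammer the new primary at once. A minimal sketch, with `connect_fn` standing in for your driver's connect call:

```python
import random
import time

def connect_with_retry(connect_fn, attempts=8, base_delay=0.25, max_delay=5.0):
    """Retry a connection with exponential backoff and jitter.

    connect_fn: zero-argument callable that returns a connection or raises.
    The delay caps at max_delay so a 30-second failover does not turn
    every client into a synchronized retry storm.
    """
    for attempt in range(attempts):
        try:
            return connect_fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter

# Simulate a failover: two attempts fail, then the new primary answers.
calls = {"n": 0}
def fake_connect():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("primary unavailable")
    return "connection-to-new-primary"

conn = connect_with_retry(fake_connect, base_delay=0.01)
print(conn)  # connection-to-new-primary, after two failed attempts
```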
Connection Routing for HA: The Missing Piece
Automated failover is only half the problem. The other half is ensuring your application connects to the right node after failover. This is where connection pooling and routing become essential.
HAProxy with Patroni
HAProxy can use Patroni's REST API to route traffic:
```
listen postgres-primary
    bind *:5000
    option httpchk GET /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2
    server node1 10.0.1.1:5432 check port 8008
    server node2 10.0.1.2:5432 check port 8008
    server node3 10.0.1.3:5432 check port 8008

listen postgres-replicas
    bind *:5001
    balance roundrobin
    option httpchk GET /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2
    server node1 10.0.1.1:5432 check port 8008
    server node2 10.0.1.2:5432 check port 8008
    server node3 10.0.1.3:5432 check port 8008
```
Applications connect to port 5000 for writes and 5001 for reads. When the primary changes, HAProxy detects it within seconds through the health checks and reroutes traffic.
PgBouncer for Connection Management
PgBouncer does not handle failover routing itself, but it is critical for HA deployments because it manages connection lifecycle. During a failover, PgBouncer's `server_login_retry` setting controls how quickly it reconnects to the new primary. Place PgBouncer behind HAProxy or use it with a VIP (virtual IP) that follows the primary.
libpq Connection Strings
PostgreSQL's native client library supports multiple hosts in the connection string:
```
postgresql://host1:5432,host2:5432,host3:5432/mydb?target_session_attrs=read-write
```

The client tries each host in order and connects to the first one that matches `target_session_attrs=read-write` (which is the primary). This is the simplest approach for applications that use libpq directly, but it adds connection latency during failover as the client iterates through hosts.
Monitoring Your HA Setup: The Non-Negotiable
Here is the uncomfortable truth: every HA architecture described above can fail silently. Replication can stall. The standby can fall hours behind. WAL can accumulate on the primary until the disk fills. The DCS can lose quorum without anyone noticing.
The worst time to discover your replica is 6 hours behind is during a failover. By then it is too late -- you either accept the data loss or you wait for the replica to catch up while the application stays down.
Tools like myDBA.dev continuously monitor replication lag, replication slot status, WAL accumulation, and replica health. The replication topology visualization shows your entire cluster at a glance -- primary, standbys, replication direction, and lag for each connection -- so you know immediately if something drifts.
Alerts should fire on these conditions at minimum:
- Replication lag exceeds your RPO. If your RPO is 10 seconds, alert at 5 seconds of lag.
- Replication slot inactive. An inactive slot causes WAL accumulation on the primary.
- WAL accumulation growing. The primary's `pg_wal` directory should remain roughly constant in size. Growth means something is not consuming WAL.
- Standby not in streaming state. The standby should be in the `streaming` state, not stuck in `startup` or `catchup` for extended periods.
```sql
-- Check replication status on the primary
SELECT
    client_addr,
    state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication;

-- Check replication slot health
SELECT
    slot_name,
    active,
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_wal_bytes,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal_size
FROM pg_replication_slots;
```
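If you export these numbers to an external alerting system, the LSN arithmetic is straightforward: an LSN such as `16/B374D848` is a 64-bit WAL position printed as two hexadecimal halves, and byte lag is the difference between two positions, which is what `pg_wal_lsn_diff` computes. A minimal sketch in Python (the 1 MB alert threshold is illustrative):

```python
def lsn_to_bytes(lsn):
    """Convert a PostgreSQL LSN string like '16/B374D848' to a byte offset."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def replay_lag_bytes(sent_lsn, replay_lsn):
    """Equivalent of pg_wal_lsn_diff(sent_lsn, replay_lsn)."""
    return lsn_to_bytes(sent_lsn) - lsn_to_bytes(replay_lsn)

lag = replay_lag_bytes("16/B374D848", "16/B3700000")
print(lag)              # 317512 bytes behind
print(lag > 1_048_576)  # False: under an illustrative 1MB alert threshold
```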
Testing Failover: The Step Everyone Skips
If you have not tested failover in the last month, you do not know if it works. Configuration drift, software updates, network changes, and disk space issues can all break a failover that worked perfectly three months ago.
Build a Failover Test Routine
For Patroni:
```bash
# Planned switchover (safe, graceful)
patronictl switchover --leader node1 --candidate node2 --scheduled now --force

# Simulate primary failure (harder test)
# On the primary node:
sudo systemctl stop postgresql

# Watch Patroni logs on the standby to confirm promotion
```
For pg_auto_failover:
```bash
# Check current state
pg_autoctl show state

# Perform planned failover
pg_autoctl perform failover
```
For managed services: Most cloud providers support a "reboot with failover" option in the console or CLI. Use it.
What to Measure During Testing
- Time to detect failure. How long before the HA system notices the primary is gone?
- Time to promote. How long does the standby take to become the new primary?
- Time to reconnect. How long before the application establishes connections to the new primary?
- Data loss. Were any committed transactions lost? Check by writing a known sequence of values before the failover and verifying they all exist after.
- Application behavior. Did the application handle dropped connections gracefully, or did it crash, leak connections, or enter a retry storm?
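The detection, promotion, and reconnect windows above can be captured with a once-per-second probe loop run alongside the drill. A simplified sketch, with `probe` standing in for a real write-connection attempt:

```python
def measure_outage(probe, samples):
    """Return (outage_start, recovery_time) over a sequence of timestamps.

    probe(t) -> True if a write connection succeeds at time t.
    Records the first failing timestamp and the first success after it,
    i.e. the application-visible downtime window during a failover drill.
    """
    down_at = up_at = None
    for t in samples:
        ok = probe(t)
        if not ok and down_at is None:
            down_at = t
        elif ok and down_at is not None and up_at is None:
            up_at = t
            break
    return down_at, up_at

# Simulated drill: primary killed at t=5, standby serving writes at t=23.
down, up = measure_outage(lambda t: not (5 <= t < 23), range(60))
print(f"downtime: {up - down} seconds")  # downtime: 18 seconds
```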
Document the results. Set a calendar reminder to test again next month. Failover testing is not a one-time event -- it is a practice.
Comparison: Choosing the Right HA Solution
| Feature | Streaming Replication (Manual) | Patroni | pg_auto_failover | Managed (RDS/Aurora) |
|---|---|---|---|---|
| Automatic failover | No | Yes | Yes | Yes |
| Typical RTO | 5-30 min | 10-30 sec | 10-30 sec | 30-120 sec |
| Zero data loss (RPO=0) | With sync replication | Configurable | Configurable | Usually yes |
| Read replicas | Yes | Yes | Yes | Yes (varies) |
| External dependencies | None | etcd/ZK/Consul (3-5 nodes) | Monitor node (1) | None (managed) |
| Operational complexity | Low setup, high failover | High setup, low failover | Medium | Very low |
| Cost | Lowest | Medium (DCS infra) | Low-medium | Highest |
| Control over failover | Full | High | Medium | Low |
| Multi-region | Manual | Supported | Limited | Easy (managed) |
| Battle-tested at scale | Everywhere | GitLab, Zalando | Growing | AWS, GCP, Azure |
Decision Framework
Start with your constraints and work backward:
"We cannot manage infrastructure and just need it to work."
Use a managed service. RDS Multi-AZ for standard workloads, Aurora if you need fast read scaling or sub-minute failover. Accept the higher cost as the price of operational simplicity.
"We need automatic failover and run our own infrastructure."
Patroni if you have the team to operate an etcd cluster and want maximum control. pg_auto_failover if you want simpler operations and can tolerate a single monitor node as a (mitigable) point of failure.
"We are on a tight budget and can tolerate manual failover."
Streaming replication with a well-documented, regularly tested runbook. Invest the savings into monitoring so you know the instant something goes wrong. A good monitoring setup with alerting (like myDBA.dev) makes manual failover viable by giving you early warning and clear visibility into cluster state.
"We need zero data loss across regions."
Synchronous replication with Patroni or a managed service like Aurora Global Database. Accept the write latency cost. Test thoroughly because cross-region synchronous replication adds meaningful latency to every commit.
The Bottom Line
PostgreSQL high availability is not about picking the fanciest tool. It is about understanding your RPO and RTO requirements, choosing the simplest architecture that meets them, and then investing heavily in monitoring and testing.
The teams I have seen handle failovers well share three traits: they monitor replication lag continuously, they test failover monthly, and they have runbooks that even a sleep-deprived on-call engineer can follow at 3 AM.
The teams that struggle? They set up replication once, assumed it worked, and found out it did not at the worst possible moment.
Do not be that team. Set up your HA architecture, monitor it relentlessly, and test it before production forces you to.

