Your application is live. Customers are using it. The database goes down.
How long before traffic routes around the failure? Ten seconds? Five minutes? Never, because you're paged at 2 AM and have to manually promote a replica while the on-call engineer Slacks you asking if the database is "doing a thing"?
PostgreSQL high availability is one of those topics that looks straightforward in blog posts and turns out to be deeply humbling when you actually implement it in production. This guide covers how PostgreSQL HA actually works, the main tools people use, what typically goes wrong, and when the complexity of DIY HA stops being worth it.
What "High Availability" Means for PostgreSQL
High availability means your database keeps serving requests even when individual components fail. For PostgreSQL, that typically requires three things working together:
- Data replication — at least one copy of your data exists on a server other than the primary
- Failure detection — something notices when the primary is unreachable
- Automatic failover — the replica promotes itself to primary without a human in the loop
PostgreSQL ships with excellent replication primitives but no built-in automatic failover. The replication part is solid and well-understood. The failover part is where teams get into trouble.
Streaming Replication: The Foundation
PostgreSQL replication is based on Write-Ahead Log (WAL) shipping. Every write to the primary is first written to the WAL. Replicas connect to the primary and stream that WAL in near-real-time, replaying it to stay current.
Setting up a basic standby looks like this in postgresql.conf on the primary:
# postgresql.conf on the primary
wal_level = replica
max_wal_senders = 3
wal_keep_size = 1GB
And in pg_hba.conf, you allow the replica to connect for replication:
# pg_hba.conf on the primary
host replication replicator 10.0.0.2/32 scram-sha-256
The replica is created from a base backup of the primary (for example with pg_basebackup), contains an empty standby.signal file, and points at the primary via primary_conninfo in its configuration. Once streaming, the replica is typically only milliseconds behind.
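As a sketch, the replica side might look like this (the address, user, and application_name are illustrative):

```conf
# postgresql.conf on the replica
primary_conninfo = 'host=10.0.0.1 port=5432 user=replicator application_name=replica1'
```

The application_name here is what synchronous_standby_names matches against if you later enable synchronous replication.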
Synchronous vs Asynchronous Replication
By default, PostgreSQL replication is asynchronous: the primary commits a transaction and returns success to the client before confirming the replica received the data. If the primary dies at exactly the wrong moment, you can lose the last few transactions.
Synchronous replication waits for at least one replica to confirm it has received and written the WAL before reporting the commit as successful:
# postgresql.conf on the primary
synchronous_standby_names = 'replica1'
synchronous_commit = on
This gives you zero-RPO (recovery point objective) — no committed data is ever lost. The tradeoff is latency: every write waits for a round trip to the replica. On a local network this is typically 1–5ms. Across availability zones it can be 10–30ms depending on the cloud provider.
Most production setups use synchronous replication for the hot standby and asynchronous replication for additional read replicas or DR standbys that are geographically distant.
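That mixed topology is expressed through synchronous_standby_names: only standbys listed there can be synchronous, and anything unlisted replicates asynchronously. A sketch (the standby name is illustrative):

```conf
# postgresql.conf on the primary: replica1 is the synchronous hot standby;
# unlisted standbys (read replicas, distant DR nodes) stay asynchronous
synchronous_standby_names = 'FIRST 1 (replica1)'
```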
The Problem: PostgreSQL Doesn't Fail Over Itself
With streaming replication running, you have your data in two places. But if the primary goes down, PostgreSQL doesn't automatically promote the replica. You have two choices:
- Manual failover — someone runs `pg_ctl promote` or `SELECT pg_promote()` on the replica. Fast if someone is awake, catastrophic if not.
- Automated failover via an HA tool — a separate process watches the primary and promotes the replica when it detects a failure.
Almost every production PostgreSQL HA setup uses one of three tools for automated failover: Patroni, pg_auto_failover, or repmgr. They all solve the same problem; they have meaningfully different complexity and tradeoff profiles.
Patroni: The Industry Standard (and Why It's Hard)
Patroni is what most teams with serious PostgreSQL HA requirements end up using. It's battle-tested, highly configurable, and runs at scale. It's also genuinely complex to operate.
Patroni uses a distributed consensus store — either etcd, Consul, or ZooKeeper — to maintain cluster state and elect a primary. A minimal production Patroni setup is 3 etcd nodes + 2–3 PostgreSQL nodes + HAProxy. That's 5–6 servers before you've even started.
A minimal patroni.yml:
scope: prod-cluster
namespace: /db/
name: pg-node-1

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.0.1:8008

etcd3:
  hosts: 10.0.1.1:2379,10.0.1.2:2379,10.0.1.3:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    synchronous_mode: true

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.0.1:5432
  data_dir: /var/lib/postgresql/data
HAProxy needs health checks against the Patroni REST API (port 8008), not just the PostgreSQL port — that's how it knows which node is the current primary.
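A sketch of the corresponding HAProxy config (addresses and server names are illustrative). The health check hits Patroni's /primary endpoint, which returns 200 only on the current leader, so traffic always follows the leader after a failover:

```conf
# haproxy.cfg — route writes to whichever node Patroni reports as primary
listen postgres_primary
    bind *:5432
    mode tcp
    option httpchk GET /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server pg-node-1 10.0.0.1:5432 check port 8008
    server pg-node-2 10.0.0.2:5432 check port 8008
```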
Patroni is genuinely good software. But "run Patroni in production" is a weeks-long project, not an afternoon task.
pg_auto_failover: Simpler, More Opinionated
pg_auto_failover uses a dedicated monitor node instead of etcd:
# On the monitor node
pg_autoctl create monitor \
--pgdata /var/lib/postgresql/monitor \
--pgport 5000 \
--hostname monitor.internal
# On the primary node
pg_autoctl create postgres \
--pgdata /var/lib/postgresql/data \
--monitor postgres://autoctl_node@monitor.internal:5000/pg_auto_failover \
--hostname primary.internal \
--pgport 5432
Easier to set up than Patroni, but the monitor is a single point of failure and it's less flexible for complex topologies. Good choice for teams that want something working quickly without running etcd.
What Goes Wrong in Production
Split-Brain
The most dangerous failure mode: a network partition causes both nodes to think they're primary and accept writes. Patroni prevents this with etcd distributed locks — a node can only be primary if it holds the lock. If it can't reach etcd, it demotes itself. pg_auto_failover prevents it by centralizing all promotion decisions in the monitor.
Replication Lag at Failover Time
With async replication, a lagged replica that promotes will be missing the last N transactions. Patroni's maximum_lag_on_failover controls this — but set it too conservatively and failover blocks entirely if all replicas are lagged after a network partition.
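The lag that maximum_lag_on_failover compares against is a byte distance between WAL positions (LSNs). A minimal sketch of that arithmetic — the LSN values and threshold here are illustrative:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '0/3000060' to an absolute byte position.

    The part before the slash is the high 32 bits of the position,
    the part after is the low 32 bits, both in hex.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def safe_to_promote(primary_lsn: str, replica_lsn: str,
                    max_lag_bytes: int = 1048576) -> bool:
    """True if the replica is within max_lag_bytes of the primary's last known position."""
    lag = lsn_to_bytes(primary_lsn) - lsn_to_bytes(replica_lsn)
    return lag <= max_lag_bytes

# A replica 96 bytes behind is promotable; one 16 MB behind is not.
print(safe_to_promote("0/3000060", "0/3000000"))   # True
print(safe_to_promote("1/00000000", "0/FF000000"))  # False
```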
Application Not Reconnecting
After failover, point your connection string at a VIP or load balancer, not a node's IP:
postgresql://ha-proxy.internal:5432/mydb?target_session_attrs=read-write
The target_session_attrs=read-write parameter tells libpq to reject connections to read-only servers, which helps clients find the current primary automatically.
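On the client side, the remaining work is retrying until the load balancer points at the new primary. A minimal sketch of that retry loop, where connect_fn stands in for whatever connect call your driver provides:

```python
import time

def connect_with_retry(connect_fn, max_attempts=10, base_delay=0.5, max_delay=5.0):
    """Call connect_fn until it succeeds, backing off exponentially between attempts.

    During a failover window, early attempts hit a dead or read-only node and
    raise; once the proxy routes to the new primary, connect_fn succeeds.
    """
    for attempt in range(max_attempts):
        try:
            return connect_fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay)

# Simulate a failover: the first two attempts fail, the third succeeds.
attempts = {"n": 0}
def fake_connect():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("primary not available yet")
    return "connection"

print(connect_with_retry(fake_connect, base_delay=0.05))  # prints: connection
```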
Testing Your Failover
# Kill the primary while watching application logs
sudo systemctl stop postgresql@17-main
# Check the cluster state
patronictl -c /etc/patroni/config.yml list
If you haven't tested your failover, you don't have HA. You have a plan that might work.
The Failover Time Question
With default settings, a Patroni failover completes in roughly 30–40 seconds; tuning the TTL down gets it closer to 10–15. The timeline:
- Primary unreachable (0s)
- Patroni TTL expires, primary declared dead (default: 30s)
- Replica acquires DCS lock and promotes (1–2s)
- HAProxy detects new primary (2–5s)
- Application reconnects (depends on pool settings)
Reducing TTL speeds detection but increases false positives from transient network blips. Most teams settle on 15–30 seconds.
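The end-to-end budget is roughly additive across those steps, which makes the TTL the dominant term. A back-of-the-envelope sketch with illustrative values:

```python
def failover_budget(ttl, promote=2.0, haproxy_check_interval=3.0,
                    haproxy_fall=3, reconnect=1.0):
    """Worst-case seconds from primary death to the application writing again.

    ttl: Patroni leader-key TTL (worst-case detection time)
    promote: time for the replica to promote after acquiring the DCS lock
    haproxy_check_interval * haproxy_fall: time for HAProxy to flip routing
    reconnect: client pool reconnect time
    """
    return ttl + promote + haproxy_check_interval * haproxy_fall + reconnect

print(failover_budget(ttl=30))  # 42.0 — default TTL
print(failover_budget(ttl=15))  # 27.0 — more aggressive TTL
```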
When Managed PostgreSQL High Availability Makes Sense
DIY HA works. Large companies run Patroni at massive scale. But you're running 5+ nodes, maintaining etcd, and debugging edge cases at the worst possible time.
As we've broken down before, the real cost of self-hosted PostgreSQL HA includes infrastructure, tooling, and engineering time — and engineering time dominates.
Rivestack's HA clusters handle streaming replication, automatic failover, and connection routing automatically. Failover happens in seconds, your connection string doesn't change, and there's no etcd cluster to maintain. HA clusters start at $99/month with NVMe storage, automated backups, point-in-time recovery, and monitoring included.
What to Do If You're Starting From Scratch
For a managed setup: Try Rivestack — spin up an HA cluster, verify failover works from the dashboard, and move on to building your application.
For a self-managed setup:
- Start with pg_auto_failover for 2–3 nodes and fast setup
- Move to Patroni if you need multi-datacenter support or fine-grained control
- Use HAProxy with Patroni REST API health checks (port 8008) for connection routing
- Enable synchronous replication if your application can't tolerate data loss
- Test failover in staging before relying on it in production
The Bottom Line
PostgreSQL has excellent HA building blocks. Streaming replication is fast, reliable, and well-understood. The gap is automated failover, and Patroni or pg_auto_failover fills it — but adds real operational complexity.
Test your failover before you need it. The worst time to discover your replica is 10 minutes behind is during an actual outage.
If you want to skip the infrastructure work, Rivestack handles the HA layer for you — including pgvector if you're building AI applications. See the getting started guide if that's your stack.
Originally published at rivestack.io