DEV Community

丁久
丁久

Posted on • Originally published at dingjiu1989-hue.github.io

Database High Availability: Failover, Standby Types, Health Checks

This article was originally published on AI Study Room. For the full version with working code examples and related articles, visit the original post.

Database High Availability: Failover, Standby Types, Health Checks

Database High Availability: Failover, Standby Types, Health Checks

High availability (HA) ensures your database remains accessible despite hardware failures, network partitions, or software crashes. This article covers failover mechanisms, standby types, health monitoring, and automated recovery.

Availability Metrics

Availability is measured in "nines":

| Uptime | Downtime/Year | Classification | |--------|---------------|----------------| | 99% | 87.6 hours | Basic | | 99.9% (three nines) | 8.76 hours | Standard | | 99.99% (four nines) | 52.56 minutes | Enterprise | | 99.999% (five nines) | 5.26 minutes | Mission-critical |

Achieving higher availability requires increasingly sophisticated (and expensive) infrastructure.

Standby Types

Hot Standby

A hot standby accepts read queries while replicating from the primary. PostgreSQL's hot_standby = on enables this:

standby postgresql.conf

hot_standby = on

hot_standby_feedback = on

  • Failover time : Seconds (already running, just promote).

  • Resource usage : Full database server resources.

  • Use case : Production read replicas with fast failover.

Warm Standby

A warm standby is running but does not accept connections until promoted:

warm standby: accept no connections

hot_standby = off

  • Failover time : Tens of seconds (start postmaster + recovery).

  • Resource usage : Moderate (receives WAL but no query processing).

  • Use case : Cost-sensitive environments.

Cold Standby

A cold standby is an offline copy of the data. It must be started from scratch during failover:

  • Failover time : Minutes to hours (restore from backup, start database).

  • Resource usage : Minimal (only storage).

  • Use case : Disaster recovery only.

Failover Strategies

Manual Failover

The operator promotes a standby and redirects applications:

Promote standby

pg_ctl promote -D /var/lib/postgresql/data

Update application DNS or connection string

This may take minutes and requires human intervention

Pros : Simple, operator has full control. Cons : Slow (minutes), error-prone under pressure.

Automated Failover with Patroni

Patroni is the most popular HA solution for PostgreSQL. It uses distributed consensus (etcd, Consul, or Zookeeper) to elect a new leader:

patroni.yml

scope: myapp-db

namespace: /service/

name: pg-node-1

consul:

host: consul.service.consul:8500

register_service: true

postgresql:

listen: 0.0.0.0:5432

connect_address: pg-node-1:5432

data_dir: /var/lib/postgresql/data

bin_dir: /usr/lib/postgresql/16/bin

authentication:

replication:

username: replicator

password: secure_password

superuser:

username: postgres

password: secure_password

parameters:

max_connections: 200

use_pg_rewind: true

use_slots: true

watchdog:

mode: required

device: /dev/watchdog

Health Checks

Patroni performs regular health checks:

  • pg_isready : Checks if PostgreSQL is accepting connections.

2\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\. Replication lag : If lag exceeds a threshold, the standby is not eligible for promotion. 3\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\. Consensus health : The node must be reachable to the DCS (etcd/Consul).

Manual health check

pg_isready -h localhost -p 5432

localhost:5432 - accepting connections

Check replication status

psql -c "SELECT * FROM pg_stat_replication;"

Failover Flow

  • Patroni detects primary is down (missed health check + no DCS lock).

2\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\


Read the full article on AI Study Room for complete code examples, comparison tables, and related resources.

Found this useful? Check out more developer guides and tool comparisons on AI Study Room.

Top comments (0)