
Brian Misachi

HA Postgres with Patroni and Barman

In the last post we set up an HA 3-node Postgres cluster. We configured physical replication, where WAL records are streamed from the primary and replayed on the standbys. We were also able to manually fail over to one of the standbys when the primary was shut down. A lot of manual steps were involved, which can be time-consuming, error-prone and could potentially lead to data loss.

This post shows how Patroni helps automate an HA Postgres cluster and manage failover when the cluster becomes unhealthy. Barman will be used for backup and recovery management. Taking it further, we will add monitoring of the cluster using tools like Grafana and Prometheus. Exciting!

We will be using scripts from this repository. Clone the repository to your local filesystem and follow along.

First, create 3 nodes: testPG.1, testPG.2 and testPG.3. You can create as many nodes (containers) as your host allows. The run_pg.sh script can be used for this part. Each node will have the Postgres server (version 18beta1) and Patroni installed. If this is the first time running the script, it will also create an ETCD v3 container to act as the DCS for the cluster.

# ARCHIVE_DIR and BACKUP_DIR should be separate directories

$ export ARCHIVE_DIR=/path/to/archive # Replace
$ sudo chown -R 991:991 $ARCHIVE_DIR # Ensure Postgres user owns the directory
$ export BACKUP_DIR=/path/to/backup/data  # Replace
$ sudo chown -R 991:991 $BACKUP_DIR # Ensure Postgres user owns the directory
$ nohup ./run_pg.sh 1 5432 > /tmp/test.1 &  # nohup ./run_pg.sh <node number> <port> &
$ nohup ./run_pg.sh 2 5433 > /tmp/test.2 & # node 2
$ nohup ./run_pg.sh 3 5434 > /tmp/test.3 & # node 3

That's it. With the above commands we now have a 3-node cluster. Check the current state of the cluster and its members:

$ docker exec testPG.1 bash -c "patronictl -c patroni_config.yml list cluster1"
+ Cluster: cluster1 (7539602363152794219) --+----+-----------+----------------------+
| Member | Host       | Role    | State     | TL | Lag in MB | Tags                 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node1  | 172.17.0.3 | Leader  | running   |  2 |           | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node2  | 172.17.0.4 | Replica | streaming |  2 |         0 | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node3  | 172.17.0.5 | Replica | streaming |  2 |         0 | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+

Node1 (testPG.1) is the primary server. Nodes 2 and 3 are standbys, actively streaming changes from the primary. Since we have no data yet, neither standby is lagging behind the primary.

Failover

When the cluster becomes unhealthy, e.g. the primary goes down for some reason, the patroni processes on each node coordinate and, through leader election, eventually select a healthy standby to become the new primary. This rarely requires human intervention, unless something far out of the ordinary has happened, such as the DCS being down or an issue with the standbys.
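Under the hood, the leader lock is just a key in the DCS. If you want to peek at it, you can query etcd directly. This is only a sketch: it assumes the etcd container created by run_pg.sh is named etcd, that etcdctl speaks the v3 API, and that Patroni is using its default /service namespace with the scope cluster1.

$ docker exec etcd etcdctl get --prefix --keys-only /service/cluster1/  # list the keys Patroni keeps in the DCS
$ docker exec etcd etcdctl get /service/cluster1/leader                 # the leader key holds the name of the current primary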

Let's bring down the current primary to see how patroni handles failover.

$ CMD=`docker exec testPG.1 bash -c "ps -C patroni | grep patroni"` && docker exec testPG.1 bash -c "kill `echo $CMD | awk '{print $1}'`"

We send a SIGTERM signal to allow the patroni process to cleanly shut down the Postgres server. Since this is the primary, a final shutdown checkpoint is performed, flushing to disk any data written since the last checkpoint. This helps reduce the amount of WAL that has to be replayed during recovery on the next restart.

$ tail /tmp/test.1 -n 15
....
2025-08-18 11:34:37,128 INFO: no action. I am (node1), the leader with the lock
2025-08-18 11:34:40.692 UTC [40] LOG:  received fast shutdown request
2025-08-18 11:34:40.694 UTC [40] LOG:  aborting any active transactions
....
2025-08-18 11:34:40.696 UTC [44] LOG:  shutting down
2025-08-18 11:34:40.708 UTC [44] LOG:  checkpoint starting: shutdown immediate
2025-08-18 11:34:40.758 UTC [44] LOG:  checkpoint complete: wrote 0 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.051 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16383 kB, estimate=16383 kB; lsn=0/8000028, redo lsn=0/8000028
2025-08-18 11:34:40.822 UTC [40] LOG:  database system is shut down
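A quick way to confirm the clean shutdown is to look at the control file with pg_controldata; it should report the cluster state as "shut down". A sketch, using the same binary and data directory paths as the rest of this post:

$ docker exec testPG.1 bash -c "/usr/local/pgsql/bin/pg_controldata /usr/local/pgsql/data | grep 'cluster state'"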

The node1 patroni process releases the leader lock, allowing the other nodes to compete for it. The old primary is removed from the cluster's member list. If we check the current list we get:

$ docker exec testPG.1 bash -c "patronictl -c patroni_config.yml list cluster1"
+ Cluster: cluster1 (7539602363152794219) --+----+-----------+----------------------+
| Member | Host       | Role    | State     | TL | Lag in MB | Tags                 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node2  | 172.17.0.4 | Leader  | running   |  3 |           | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node3  | 172.17.0.5 | Replica | streaming |  3 |         0 | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+

Notice node1 is gone and we have a new leader, node2. We can then proceed to fix node1 before we bring it back to the cluster.

Once the patroni process on each node detects the change in the cluster (the missing primary), the first healthy node to acquire the leader lock becomes the new primary, while the remaining standbys begin to follow the new leader. In this case node2 (testPG.2) becomes the new primary and node3 (testPG.3) starts following it.

Now you can send a SIGKILL signal to terminate the node1 patroni process

$ CMD=`docker exec testPG.1 bash -c "ps -C patroni | grep patroni"` && docker exec testPG.1 bash -c "kill -9 `echo $CMD | awk '{print $1}'`"

The old primary can be put back into the cluster after being fixed. When this is done, it first attempts to acquire the leader lock; on detecting that the lock is already held by another server, it quickly transitions to a standby and follows the current leader.

$ nohup ./run_pg.sh 1 > /tmp/test.1 &  # Bring old primary back up
....
2025-08-18 13:31:03,430 INFO: Lock owner: node2; I am node1
2025-08-18 13:31:03,430 INFO: establishing a new patroni heartbeat connection to postgres
cp: cannot stat '/home/postgres/.tmp/00000004.history': No such file or directory
2025-08-18 13:31:03.432 UTC [970] LOG:  waiting for WAL to become available at 0/9000018
2025-08-18 13:31:03,491 INFO: no action. I am (node1), a secondary, and following a leader (node2)
cp: cannot stat '/home/postgres/.tmp/000000030000000000000009': No such file or directory
2025-08-18 13:31:08.433 UTC [996] LOG:  started streaming WAL from primary at 0/9000000 on timeline 3
2025-08-18 13:31:14,023 INFO: no action. I am (node1), a secondary, and following a leader (node2)
2025-08-18 13:31:23,934 INFO: no action. I am (node1), a secondary, and following a leader (node2)

And now node1 has been added back to the cluster, as a standby.

$ docker exec testPG.1 bash -c "patronictl -c patroni_config.yml list cluster1"
+ Cluster: cluster1 (7539602363152794219) --+----+-----------+----------------------+
| Member | Host       | Role    | State     | TL | Lag in MB | Tags                 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node1  | 172.17.0.3 | Replica | streaming |  3 |         0 | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node2  | 172.17.0.4 | Leader  | running   |  3 |           | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node3  | 172.17.0.5 | Replica | streaming |  3 |         0 | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
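Besides patronictl, each node's role can be checked through Patroni's REST API. This is a sketch that assumes the default REST API port 8008 and that curl is available inside the containers:

$ docker exec testPG.1 bash -c "curl -s http://localhost:8008/health"                                     # returns 200 with a JSON status when Postgres is running
$ docker exec testPG.1 bash -c "curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8008/replica"   # 200 only if this node is a running replica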

Switchover

To promote node1 back to the primary role, we can use the switchover command, given that the cluster is healthy.

$ docker exec testPG.1 bash -c "patronictl -c patroni_config.yml switchover cluster1 --leader node2 --candidate node1 --force"  # Using force to skip the prompts from patroni
Current cluster topology
+ Cluster: cluster1 (7539602363152794219) --+----+-----------+----------------------+
| Member | Host       | Role    | State     | TL | Lag in MB | Tags                 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node1  | 172.17.0.3 | Replica | streaming |  3 |         0 | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node2  | 172.17.0.4 | Leader  | running   |  3 |           | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node3  | 172.17.0.5 | Replica | streaming |  3 |         0 | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
2025-08-18 14:19:31.73068 Successfully switched over to "node1"
+ Cluster: cluster1 (7539602363152794219) ------------+----+-----------+----------------------+
| Member | Host       | Role    | State               | TL | Lag in MB | Tags                 |
+--------+------------+---------+---------------------+----+-----------+----------------------+
| node1  | 172.17.0.3 | Leader  | running             |  3 |           | clonefrom: true      |
|        |            |         |                     |    |           | failover_priority: 1 |
+--------+------------+---------+---------------------+----+-----------+----------------------+
| node2  | 172.17.0.4 | Replica | stopping            |    |   unknown | clonefrom: true      |
|        |            |         |                     |    |           | failover_priority: 1 |
+--------+------------+---------+---------------------+----+-----------+----------------------+
| node3  | 172.17.0.5 | Replica | in archive recovery |  3 |         0 | clonefrom: true      |
|        |            |         |                     |    |           | failover_priority: 1 |
+--------+------------+---------+---------------------+----+-----------+----------------------+

Right after the command returns, the cluster is still in a transient state: node2 is stopping and node3 is in archive recovery. The switchover process, as seen from the old primary's (node2) logs, looks like this. The output is truncated for clarity.

$ tail /tmp/test.2 -n 100
2025-08-18 14:19:28,597 INFO: received switchover request with leader=node2 candidate=node1 scheduled_at=None
2025-08-18 14:19:28,608 INFO: Got response from node1
....
2025-08-18 14:19:28,658 INFO: Lock owner: node2; I am node2
2025-08-18 14:19:28,759 INFO: switchover: demoting myself
2025-08-18 14:19:28,759 INFO: Demoting self (graceful)
2025-08-18 14:19:28.764 UTC [39] LOG:  checkpoint starting: immediate force wait
....
2025-08-18 14:19:31,033 INFO: Leader key released
2025-08-18 14:19:31,082 INFO: Lock owner: node1; I am node2
2025-08-18 14:19:31,082 INFO: switchover: demote in progress
....

Shortly after, the cluster is back to a consistent state, with the original leader node1 restored and node2 and node3 as followers.

$ docker exec testPG.1 bash -c "patronictl -c patroni_config.yml list cluster1"
+ Cluster: cluster1 (7539602363152794219) --+----+-----------+----------------------+
| Member | Host       | Role    | State     | TL | Lag in MB | Tags                 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node1  | 172.17.0.3 | Leader  | running   |  4 |           | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node2  | 172.17.0.4 | Replica | streaming |  4 |         0 | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node3  | 172.17.0.5 | Replica | streaming |  4 |         0 | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
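Switchovers don't have to run immediately; patronictl can also schedule one for a maintenance window. A sketch only, where the timestamp is just an example:

$ docker exec testPG.1 bash -c "patronictl -c patroni_config.yml switchover cluster1 --leader node1 --candidate node2 --scheduled '2025-08-19T02:00' --force"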

Having changed the leader multiple times, we can use patroni's history command to view the leadership changes within the cluster:

$ docker exec testPG.1 bash -c "patronictl -c patroni_config.yml history cluster1"
+----+-----------+------------------------------+----------------------------------+------------+
| TL |       LSN | Reason                       | Timestamp                        | New Leader |
+----+-----------+------------------------------+----------------------------------+------------+
|  1 | 117440672 | no recovery target specified | 2025-08-18T09:23:50.986078+00:00 | node1      |
|  2 | 134217888 | no recovery target specified | 2025-08-18T11:34:45.003395+00:00 | node2      |
|  3 | 167772320 | no recovery target specified | 2025-08-18T14:19:31.253098+00:00 | node1      |
+----+-----------+------------------------------+----------------------------------+------------+

The result shows there have been 3 leadership changes, alternating between node1 and node2, along with the times at which they occurred.
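The TL column maps to Postgres timelines, so you can cross-check the timeline a node is currently on from Postgres itself:

$ docker exec testPG.1 bash -c "/usr/local/pgsql/bin/psql -U patroni_super -d postgres -c \"SELECT timeline_id FROM pg_control_checkpoint();\""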

Patroni is an extremely handy solution for managing a highly available Postgres cluster.

Backups

A truly HA setup requires data to be backed up in a separate location, so that if you lose data or cannot fully recover from the cluster itself, the backups can be used for recovery instead. The backup solution you pick depends on your business requirements.

Any backup solution should at least use the pg_basebackup utility. Remember that pg_dump and pg_dumpall are not backup tools.
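For reference, this is roughly what a manual base backup with pg_basebackup looks like. A sketch only: the target directory is just an example, and it uses the streaming_barman replication role we create in the next step (you'll be prompted for its password unless a .pgpass entry exists):

$ docker exec -it testPG.1 bash -c "/usr/local/pgsql/bin/pg_basebackup -h 172.17.0.3 -U streaming_barman -D /tmp/base_backup -X stream -c fast -P"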

Barman is an easy-to-use solution for managing backups. To set up Barman, first create the required roles by executing the queries below on the primary node:

$ docker exec testPG.1 bash -c "/usr/local/pgsql/bin/psql -U patroni_super -d postgres -c \"CREATE USER streaming_barman WITH REPLICATION ENCRYPTED PASSWORD 'streaming_barman'; CREATE USER barman WITH SUPERUSER ENCRYPTED PASSWORD 'barman';\""

Then run the ./barman.sh script. When it completes, a new container will be running with barman installed in it. The script configures backups for the primary node. The BACKUP_DIR directory is shared between the barman container and the primary node (node1) container via a docker volume. We'll later use this directory to recover the primary node from barman's backups.

Check the current state of barman.

$ docker exec -t barman bash -c "source /var/lib/barman/.bashrc && barman check node1"
Server node1:
        PostgreSQL: OK
        superuser or standard user with backup privileges: OK
        PostgreSQL streaming: OK
        wal_level: OK
        replication slot: OK
        directories: OK
        retention policy settings: OK
        backup maximum age: OK (no last_backup_maximum_age provided)
        backup minimum size: OK (22.9 MiB)
        wal maximum age: OK (no last_wal_maximum_age provided)
        wal size: OK (0 B)
        compression settings: OK
        failed backups: OK (there are 0 failed backups)
        minimum redundancy requirements: OK (have 1 non-incremental backups, expected at least 0)
        pg_basebackup: OK
        pg_basebackup compatible: OK
        pg_basebackup supports tablespaces mapping: OK
        systemid coherence: OK
        pg_receivexlog: OK
        pg_receivexlog compatible: OK
        receive-wal running: OK
        archive_mode: OK
        archive_command: OK
        continuous archiving: OK
        archiver errors: OK

Everything seems to be working as expected.
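If the continuous archiving checks ever fail right after setup, it is usually because no WAL segment has been archived yet. You can force a WAL switch and wait for the segment to land in the archive (a sketch using the same barman container):

$ docker exec -t barman bash -c "source /var/lib/barman/.bashrc && barman switch-wal --force --archive node1"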

Check whether any backups exist. The barman.sh script creates an initial backup before the setup completes:

$ docker exec -t barman bash -c "source /var/lib/barman/.bashrc && barman list-backups node1"
node1 20250818T183402 'first-backup' - F - Mon Aug 18 18:34:04 2025 - Size: 38.9 MiB - WAL Size: 0 B - WAITING_FOR_WALS
node1 20250818T183028 'first-backup' - F - Mon Aug 18 18:30:30 2025 - Size: 54.9 MiB - WAL Size: 16.0 MiB - WAITING_FOR_WALS
node1 20250817T174259 'first-backup' - F - Sun Aug 17 17:43:01 2025 - Size: 54.9 MiB - WAL Size: 16.0 MiB

Creating a new backup is as easy as running the command below. All backups are stored under /var/lib/barman/node1/base in the barman container.

$ docker exec -t barman bash -c "source /var/lib/barman/.bashrc && barman backup --name first-backup node1"
Starting backup using postgres method for server node1 in /var/lib/barman/node1/base/20250818T183402
Backup start at LSN: 0/C0000C8 (00000004000000000000000C, 000000C8)
Starting backup copy via pg_basebackup for 20250818T183402
Copy done (time: 1 second)
Finalising the backup.
....
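To inspect a single backup in more detail (begin/end WAL, size, timeline), use show-backup; latest is a shortcut for the most recent backup:

$ docker exec -t barman bash -c "source /var/lib/barman/.bashrc && barman show-backup node1 latest"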

Restoring from backup

A backup is only valid once you have tested it and confirmed you can recover from it properly. We can test this on the primary node (node1).

Create a new table foo and insert 100 records into it:

$ docker exec testPG.1 bash -c "/usr/local/pgsql/bin/psql -U patroni_super -d postgres -c \"CREATE TABLE foo(id SERIAL PRIMARY KEY, k INT NOT NULL); INSERT INTO foo(k) SELECT i FROM generate_series(1, 100) i;\""
CREATE TABLE
INSERT 0 100

Run a new backup with barman

$ docker exec -t barman bash -c "source /var/lib/barman/.bashrc && barman backup --name first-backup node1"
Starting backup using postgres method for server node1 in /var/lib/barman/node1/base/20250818T185835
Backup start at LSN: 0/E02A6F0 (00000004000000000000000E, 0002A6F0)
Starting backup copy via pg_basebackup for 20250818T185835
Copy done (time: 7 seconds)
Finalising the backup.
Backup size: 22.9 MiB
Backup end at LSN: 0/10000060 (000000040000000000000010, 00000060)
Backup completed (start time: 2025-08-18 18:58:35.575776, elapsed time: 7 seconds)
Processing xlog segments from streaming for node1 (batch size: 3)
        00000004000000000000000E
        00000004000000000000000F
        000000040000000000000010
Processing xlog segments from file archival for node1
        00000004000000000000000E
        00000004000000000000000F
        00000004000000000000000F.00000028.backup
        000000040000000000000010

Now let's put the cluster in a maintenance mode of sorts to prevent automatic failover. With that done, we can shut down the primary node and then delete its data directory. We'll use barman to recover the data (remember the BACKUP_DIR we used earlier).

$ docker exec testPG.1 bash -c "patronictl -c patroni_config.yml pause cluster1 --wait"
'pause' request sent, waiting until it is recognized by all nodes
Success: cluster management is paused

The cluster is paused.

Stop the Postgres server and remove the data directory

$ docker exec testPG.1 bash -c "/usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data/ stop"
waiting for server to shut down.... done
server stopped
$ docker exec testPG.1 bash -c "rm -rf /usr/local/pgsql/data/*"

The BACKUP_DIR directory should now be empty

$ ls -la $BACKUP_DIR
total 8
drwx------ 2     991     991 4096 Aug 18 22:35 .
drwxrwxrwx 4 vagrant vagrant 4096 Aug 17 20:01 ..

Let's restore the data and test if the primary can fully recover when we bring it back up. We use the latest backup we have from barman.

$ docker exec -t barman bash -c "source /var/lib/barman/.bashrc && barman cron && barman recover node1 latest /home/postgres/.backup"
Starting WAL archiving for server node1
Starting streaming archiver for server node1
Starting check-backup for backup 20250818T183402 of server node1
Processing xlog segments from file archival for node1
        000000020000000000000008
Starting local restore for server node1 using backup 20250818T185835
Destination directory: /home/postgres/.backup
Copying the base backup.
Copying required WAL segments.
Generating archive status files
Identify dangerous settings in destination directory.

IMPORTANT
These settings have been modified to prevent data losses

postgresql.conf line 4: archive_command = false
postgresql.conf line 27: recovery_target = None
postgresql.conf line 28: recovery_target_lsn = None
postgresql.conf line 29: recovery_target_name = None
postgresql.conf line 30: recovery_target_time = None
postgresql.conf line 31: recovery_target_timeline = None
postgresql.conf line 32: recovery_target_xid = None
....
Restore operation completed (start time: 2025-08-18 19:42:46.109241+00:00, elapsed time: less than one second)

The backup has been restored to the data location of the primary server. Now we can restart the primary server.
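Before restarting, you can confirm the restore actually landed in the shared BACKUP_DIR (which was empty a moment ago) by listing it again:

$ ls -la $BACKUP_DIR | head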

$ nohup ./run_pg.sh 1 > /tmp/test.1 &

If you get a "Postgresql is not running." warning, start the server with

$ docker exec testPG.1 bash -c "/usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data/ start"

That's it. The primary is up and is still the leader of the cluster. Let's check whether our data is intact.

$ docker exec testPG.1 bash -c "/usr/local/pgsql/bin/psql -U patroni_super -d postgres -c \"SELECT count(*) FROM foo;\""
 count
-------
   100
(1 row)

Yes. We have recovered all the data from the backups.

But we are still in a maintenance state.

$ docker exec testPG.1 bash -c "patronictl -c patroni_config.yml list cluster1"
+ Cluster: cluster1 (7539602363152794219) --+----+-----------+----------------------+
| Member | Host       | Role    | State     | TL | Lag in MB | Tags                 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node1  | 172.17.0.3 | Leader  | running   |  4 |           | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node2  | 172.17.0.4 | Replica | streaming |  4 |         0 | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node3  | 172.17.0.5 | Replica | streaming |  4 |         0 | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
 Maintenance mode: on

Resume the cluster to exit the maintenance state:

$ docker exec testPG.1 bash -c "patronictl -c patroni_config.yml resume cluster1 --wait"
'resume' request sent, waiting until it is recognized by all nodes
Success: cluster management is resumed

Now automatic failover is back in place and any one of the healthy standbys can replace the primary, when needed.

Monitoring

Finally, you can add monitoring and visualize the cluster by running the ./grafana.sh script and following the steps outlined here.
You'll need to create a new role for the Postgres exporter to use when gathering metrics from Postgres for Prometheus. The exporter is configured to expose only the default metrics; if additional metrics are needed, check the flags for enabling some of the disabled collectors.

$ docker exec testPG.1 bash -c "/usr/local/pgsql/bin/psql -U patroni_super -d postgres -c \"CREATE USER prom_pg_exporter WITH SUPERUSER ENCRYPTED PASSWORD 'prom_pg_exporter';\""

The grafana.sh script will install both Grafana for visualization and Prometheus to be used for collecting and storing metrics from Postgres. You can add as many dashboards as you see fit for your use.
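Once everything is up, a quick way to confirm Prometheus has something to scrape is to hit the exporter's metrics endpoint directly. This assumes the Postgres exporter is published on its default port 9187 on the host:

$ curl -s http://localhost:9187/metrics | grep '^pg_up'   # pg_up 1 means the exporter can reach Postgres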
