Brian Misachi

High Availability Postgres

High Availability (HA) refers to a system's ability to remain accessible to its users 100 per cent of the time (or as close to 100 as possible). It is usually achieved through redundancy: when a component in the system fails, and failover is correctly implemented, another healthy component can replace the faulty one and continue servicing requests with little to no human intervention. An HA system can be built with a single primary (leader) component and one or more secondary components (followers) that provide the required redundancy. At any given time, one of the secondary components can take the place of the primary if the system requires it.

This post will focus on building an HA cluster with Postgres. The HA solutions offered by Postgres can be categorized into synchronous and asynchronous. In a synchronous solution, when the database receives a commit command from a user, it writes the changes made up to that point to its own permanent storage and then waits for confirmation from one or more configured standbys that the data has been flushed to their own permanent storage before returning success to the user. In an asynchronous solution, the database returns success once the data has been flushed to its own permanent storage only; the data is then streamed to the standby servers later. This means there is a real possibility of data loss when using an asynchronous solution.
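For reference, synchronous replication is controlled with a couple of settings on the primary. The snippet below is only a sketch and is not part of the asynchronous setup built in this post; the standby names are hypothetical and must match the application_name each standby uses in its primary_conninfo.

# postgresql.conf (sketch, not used in this post's setup)
synchronous_commit = on                                      # wait for standby confirmation at commit
synchronous_standby_names = 'FIRST 1 (standby1, standby2)'   # at least one of these standbys must confirm the flush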

The way data is moved between the primary and the standbys is through the process of replication. Replication can be achieved synchronously or asynchronously. There are two types of replication available in Postgres.

  1. Physical replication
  2. Logical replication

The main difference between the two is that in physical replication the exact WAL records, and their locations, are streamed to the standbys. Physical replication is more performant than logical replication since the WAL data is streamed as is, with no modifications. Logical replication deals with publishers and subscribers: modifications to data are formatted (as a change to a row with the DML type, the old row value and the new row value) and streamed to subscribers, which pull data from the publications they are subscribed to. The biggest advantage of logical replication is that it works across major versions of Postgres.
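For completeness, a minimal logical replication sketch looks like the following. The table and connection details are hypothetical, and this post does not use logical replication.

-- On the publishing server (hypothetical table "orders")
CREATE PUBLICATION my_pub FOR TABLE orders;

-- On the subscribing server (the table definition must already exist there)
CREATE SUBSCRIPTION my_sub
    CONNECTION 'host=publisher-host port=5432 dbname=mydb user=repl password=repl'
    PUBLICATION my_pub;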

This post will focus on asynchronous (physical) replication. The secondary servers will be operated as hot standbys, meaning they can accept read queries and, when healthy, are candidates to replace the primary server. Docker containers will be used for isolation (mimicking different physical nodes); each container represents a separate node. All the containers will run Postgres version 18beta1 compiled from source.

Create the Postgres image by running the ./build.sh script below:

#! /bin/bash

set -ex

IMG=postgres/test-0.0.1
IMG_ID=`docker images ${IMG} -q`
PG_TAG=REL_18_BETA1

if [ "${IMG_ID}" = "" ]; then
  if [ ! -d "postgres-${PG_TAG}" ]; then
    wget https://github.com/postgres/postgres/archive/refs/tags/${PG_TAG}.tar.gz && tar -xzf ${PG_TAG}.tar.gz
  fi
  ID=991
  USR=postgres
  USR_HOME=/home/postgres

  cat > Dockerfile << EOF
FROM ubuntu:latest
RUN groupadd -g ${ID} ${USR} && useradd -r -u ${ID} -g ${USR} ${USR}
ADD postgres-${PG_TAG} ${USR_HOME}
WORKDIR ${USR_HOME}
RUN chown -R ${USR}:${USR} ${USR_HOME}
RUN apt-get update && apt-get install -y g++ zlib1g-dev make curl tar gzip perl liblz4-dev libreadline-dev flex bison libicu-dev liburing-dev
RUN apt-get install --reinstall -y pkg-config && ./configure  --with-liburing --enable-debug --with-lz4 && make -j4 && make all && make install
RUN echo "export PATH=/usr/local/pgsql/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" >> /etc/bash.bashrc && \
      chown -R ${USR}:${USR} /usr/local/pgsql
USER ${USR}
EOF

  docker build -t ${IMG}:latest .
  rm Dockerfile
  rm -rf postgres-${PG_TAG} ${PG_TAG}.tar.gz
else
  echo "Image ${IMG} already exists with ID ${IMG_ID}"
fi

With the image built, create the first container for running the primary. We will call it testPG.1

$ export ARCHIVE_DIR=~/mydir # Replace
$ mkdir -p $ARCHIVE_DIR
$ sudo chown -R 991:991 $ARCHIVE_DIR # Ensure Postgres user owns the directory
$ docker run -d --name testPG.1 -p 5432:5432 --mount type=bind,source=${ARCHIVE_DIR},target=/home/postgres/.tmp --restart=on-failure postgres/test-0.0.1:latest bash -c 'tail /dev/null -f'

Remember to update ARCHIVE_DIR to a valid directory in your local filesystem.

Once the first node is up and running, a few configurations need to be changed:

$ docker exec -it testPG.1 bash  # Attach to a tty so commands can be executed inside the container
postgres@e67b763ec2fb:~$ echo "export PGDATA=/usr/local/pgsql/data" >> ~/.bashrc && source ~/.bashrc  
postgres@e67b763ec2fb:~$ pg_ctl -l logfile initdb  # Only run this if the data directory does not already exist
postgres@e67b763ec2fb:~$ echo "host    replication repl 172.17.0.0/16 md5" >> $PGDATA/pg_hba.conf  # Add replication user to the `pg_hba.conf` file. Replace address 172.17.0.0 appropriately. Use the command "ifconfig docker0 | grep -w "inet" | awk '{print $2}' | sed 's/[0-9].[0-9]$/0.0/'" to get right interface address on your system
postgres@e67b763ec2fb:~$ cat >> $PGDATA/postgresql.conf << EOF
wal_level = replica  # Allow physical replication
archive_mode = on  # Turn on WAL archiving
archive_command = 'test ! -f /home/postgres/.tmp/%f && cp %p /home/postgres/.tmp/%f'  # Send old WAL files to archive location
restore_command = 'cp /home/postgres/.tmp/%f %p'  # Restore WALs from archive
wal_keep_size = 512  # Amount of WALs in MBs to keep before removal
listen_addresses = '*'  # Allow all. Not safe at all
EOF
postgres@e67b763ec2fb:~$ pg_ctl -l logfile start
postgres@e67b763ec2fb:~$ psql -c "CREATE USER repl WITH REPLICATION ENCRYPTED PASSWORD 'repl';" 
postgres@e67b763ec2fb:~$ exit

Now our soon-to-be primary node should be up and running.
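If you want to sanity-check WAL archiving at this point, forcing a WAL segment switch and then looking at pg_stat_archiver should show a segment being copied to the archive directory. This is only a rough check, assuming the server is reachable on localhost inside the container:

$ docker exec testPG.1 bash -c "/usr/local/pgsql/bin/psql -h localhost -c \"SELECT pg_switch_wal();\""
$ docker exec testPG.1 bash -c "/usr/local/pgsql/bin/psql -h localhost -c \"SELECT archived_count, last_archived_wal FROM pg_stat_archiver;\""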

Next, create another container as the second node. We will call it testPG.2. Remember to use the same ARCHIVE_DIR location.

$ export ARCHIVE_DIR=~/mydir # Same directory used for testPG.1
$ docker run -d --name testPG.2 -p 5433:5432 --mount type=bind,source=${ARCHIVE_DIR},target=/home/postgres/.tmp --restart=on-failure postgres/test-0.0.1:latest bash -c 'tail /dev/null -f'

In the testPG.2 node, remove the data directory if it already exists, then use the pg_basebackup tool to copy data from the primary server (testPG.1) and begin the replication process.

$ echo "export PGDATA=/usr/local/pgsql/data" >> ~/.bashrc && source ~/.bashrc
$ docker exec testPG.2 bash -c "/usr/local/pgsql/bin/pg_ctl -D $PGDATA -l logfile stop && rm -rf $PGDATA"

# If you get a "pg_ctl: directory "/usr/local/pgsql/data" does not exist" message, that is fine. The directory hasn't yet been created

$ SERV_IP=`docker inspect --format='{{ .NetworkSettings.IPAddress }}' testPG.1` && docker exec -u root testPG.2 bash -c "echo '$SERV_IP:5432:replication:repl:repl' > /home/postgres/.pgpass && chmod 0600 /home/postgres/.pgpass && chown postgres /home/postgres/.pgpass"

$ SERV_IP=`docker inspect --format='{{ .NetworkSettings.IPAddress }}' testPG.1` && docker exec -t testPG.2 bash -c "/usr/local/pgsql/bin/pg_basebackup -D $PGDATA -h $SERV_IP -p 5432 -Xs -R -P -U repl && /usr/local/pgsql/bin/pg_ctl -D $PGDATA -l logfile start"
23724/23724 kB (100%), 1/1 tablespace
waiting for server to start.... done
server started 

The server immediately goes into standby mode and starts receiving WAL records from the primary(testPG.1). Check processes running under testPG.1

$ docker exec testPG.1 bash -c "ps -faux | grep postgres"
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
postgres     276  0.0  0.4 208844 24484 ?        Ss   17:46   0:00 /usr/local/pgsql/bin/postgres
....
postgres     285  0.0  0.0 208976  4384 ?        Ss   17:46   0:00  \_ postgres: archiver last was 000000010000000000000005.00000060.backup
postgres     286  0.0  0.1 210424  5248 ?        Ss   17:46   0:00  \_ postgres: logical replication launcher
postgres     319  0.0  0.1 210684  9256 ?        Ss   17:48   0:00  \_ postgres: walsender repl 172.17.0.1(47148) streaming 0/6000168

There is a single WAL sender process streaming records to the standby server.
In the testPG.2 node, a WAL receiver process is running which receives streamed WAL records from the primary and re-applies the changes on the standby server.

$ docker exec testPG.2 bash -c "ps -faux | grep postgres"
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
....
postgres      31  0.0  0.4 208844 24588 ?        Ss   17:48   0:00 /usr/local/pgsql/bin/postgres
....
postgres      37  0.0  0.1 209596  6040 ?        Ss   17:48   0:00  \_ postgres: startup recovering 000000010000000000000006
postgres      46  0.0  0.0 209452  4800 ?        Ss   17:48   0:00  \_ postgres: walreceiver streaming 0/6000168
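The same information is available through SQL. As a rough check, pg_stat_replication on the primary lists the connected standbys, while pg_stat_wal_receiver on a standby shows the upstream server it is streaming from:

$ docker exec testPG.1 bash -c "/usr/local/pgsql/bin/psql -h localhost -c \"SELECT client_addr, state, sent_lsn, replay_lsn FROM pg_stat_replication;\""
$ docker exec testPG.2 bash -c "/usr/local/pgsql/bin/psql -h localhost -c \"SELECT status, sender_host, sender_port FROM pg_stat_wal_receiver;\""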

Currently, we have one primary server and one standby server. The primary services both read and write requests while the standby is limited to read-only requests. Read requests are mostly SELECT-like queries from users, while write requests involve modifications to data (DDL, DML queries).
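To see the read-only restriction in action, any write attempted against the standby is rejected (the table name here is made up):

$ docker exec testPG.2 bash -c "/usr/local/pgsql/bin/psql -h localhost -c \"CREATE TABLE demo(id int);\""
ERROR:  cannot execute CREATE TABLE in a read-only transaction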

We can easily add more standbys to our cluster as needed, to help with load distribution and to reduce the chance of a total service outage should both the primary and the existing standby become unavailable at the same time. Adding another standby (testPG.3) is straightforward and similar to what we did when setting up testPG.2.

$ docker run -d --name testPG.3 -p 5434:5432 --mount type=bind,source=${ARCHIVE_DIR},target=/home/postgres/.tmp --restart=on-failure postgres/test-0.0.1:latest bash -c 'tail /dev/null -f'

$ SERV_IP=`docker inspect --format='{{ .NetworkSettings.IPAddress }}' testPG.1` && docker exec -u root testPG.3 bash -c "echo '$SERV_IP:5432:replication:repl:repl' > /home/postgres/.pgpass && chmod 0600 /home/postgres/.pgpass && chown postgres /home/postgres/.pgpass"

$ SERV_IP=`docker inspect --format='{{ .NetworkSettings.IPAddress }}' testPG.1` && docker exec -t testPG.3 bash -c "/usr/local/pgsql/bin/pg_basebackup -D $PGDATA -h $SERV_IP -p 5432 -Xs -R -P -U repl && /usr/local/pgsql/bin/pg_ctl -D $PGDATA -l logfile start"
23724/23724 kB (100%), 1/1 tablespace
waiting for server to start.... done
server started 

With the second standby server testPG.3 up and running, the primary server testPG.1 should now have two WAL sender processes streaming data to the standbys

$ docker exec testPG.1 bash -c "ps -faux | grep postgres"
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
....
postgres      17  0.0  0.4 208844 24460 ?        Ss   10:12   0:00 /usr/local/pgsql/bin/postgres
....
postgres      34  0.0  0.2 211036 12732 ?        Ss   10:13   0:00  \_ postgres: walsender repl 172.17.0.1(60918) streaming 0/8000060
postgres      82  0.0  0.1 210568  7852 ?        Ss   10:22   0:00  \_ postgres: walsender repl 172.17.0.1(57804) streaming 0/8000060

Failover

In a highly available system, when the primary fails, one of the healthy standbys needs to replace it in order to minimize service disruption. In our case, one of the standbys gets promoted to the new primary and starts servicing incoming write queries from users. Let's promote the first standby server, testPG.2. Later we can make the second standby, testPG.3, start following the new primary (testPG.2). We'll then bring up the old primary server, testPG.1, as a standby following the new leader.

First, shut down the current primary (testPG.1):

$ docker exec testPG.1 bash -c "/usr/local/pgsql/bin/pg_ctl -D $PGDATA -l logfile stop"

Now we do not have a primary in our cluster. We need to act quickly and promote a healthy standby to take its place and start accepting writes.
Next, promote the first standby (testPG.2):

$ docker exec testPG.2 bash -c "/usr/local/pgsql/bin/pg_ctl -D $PGDATA -l logfile promote"
waiting for server to promote.... done
server promoted
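To confirm the role change, pg_is_in_recovery() should now return false on testPG.2:

$ docker exec testPG.2 bash -c "/usr/local/pgsql/bin/psql -h localhost -c \"SELECT pg_is_in_recovery();\""
 pg_is_in_recovery
-------------------
 f
(1 row)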

We have a new primary, which is good. But the second standby does not yet recognize it; we need to make it follow the new leader by updating its connection string to the primary. First, get the current connection information on testPG.3:

$ docker exec testPG.3 bash -c "/usr/local/pgsql/bin/psql -h localhost -c \"SHOW primary_conninfo;\""
                          primary_conninfo
------------------------------------------------------------------
 user=repl password=repl channel_binding=disable host=172.17.0.1 port=5432 sslmode=disable sslnegotiation=postgres sslcompression=0 sslcertmode=disable sslsni=1 ssl_min_protocol_version=TLSv1.2 gssencmode=disable krbsrvname=postgres gssdelegation=0 target_session_attrs=any load_balance_hosts=disable
(1 row)

The port still references the old primary; we need to change it. Copy the entire string and update the port to 5433, the host port published for the testPG.2 server.

$ DOCKER_BRIDGE_INTERFACE=`ifconfig docker0 | grep -w "inet" | awk '{print $2}'` && \
    docker exec testPG.3 bash -c "/usr/local/pgsql/bin/psql -h localhost -c \"ALTER SYSTEM SET primary_conninfo = '\
        user=repl password=repl channel_binding=disable host=$DOCKER_BRIDGE_INTERFACE \
        port=5433 sslmode=disable sslnegotiation=postgres sslcompression=0 \
        sslcertmode=disable sslsni=1 ssl_min_protocol_version=TLSv1.2 \
        gssencmode=disable krbsrvname=postgres gssdelegation=0 \
        target_session_attrs=any load_balance_hosts=disable';\""
ALTER SYSTEM

$ docker exec testPG.3 bash -c "/usr/local/pgsql/bin/psql -h localhost -c \"SELECT pg_reload_conf();\"" # Reload configurations
 pg_reload_conf
----------------
 t
(1 row)
$ docker exec testPG.3 bash -c "/usr/local/pgsql/bin/psql -h localhost -c \"SHOW primary_conninfo;\""
                          primary_conninfo
------------------------------------------------------------------
 user=repl password=repl channel_binding=disable host=172.17.0.1 port=5433 sslmode=disable sslnegotiation=postgres sslcompression=0 sslcertmode=disable sslsni=1 ssl_min_protocol_version=TLSv1.2 gssencmode=disable krbsrvname=postgres gssdelegation=0 target_session_attrs=any load_balance_hosts=disable
(1 row)

The standby is now following the new leader. On testPG.2 a WAL sender process is now streaming data to the standby, whose connection arrives from 172.17.0.1(54356) through the published port, which is what we want.

$ docker exec testPG.2 bash -c "ps -faux | grep postgres"
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
....
postgres      17  0.0  0.4 208844 24732 ?        Ss   10:13   0:00 /usr/local/pgsql/bin/postgres
....
postgres    1166  0.0  0.1 210568  7800 ?        Ss   11:27   0:00  \_ postgres: walsender repl 172.17.0.1(54356) streaming 0/90001E0
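As a quick end-to-end check (with a made-up table), a write on the new primary should show up on the standby shortly after, since replication is asynchronous:

$ docker exec testPG.2 bash -c "/usr/local/pgsql/bin/psql -h localhost -c \"CREATE TABLE failover_check(id int);\""
CREATE TABLE
$ docker exec testPG.3 bash -c "/usr/local/pgsql/bin/psql -h localhost -c \"SELECT count(*) FROM failover_check;\""
 count
-------
     0
(1 row)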

Now we can bring back the old primary (testPG.1) and add it to the cluster as a standby server.
Before restarting the server, ensure an empty standby.signal file exists in the data directory. This makes it start up as a standby, after which we can connect to it and update its connection string. The standby.signal file did not need to be created manually for the other standbys because pg_basebackup creates it automatically (along with the primary connection settings) after it finishes copying data to the new location.
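For instance, on testPG.3 the files written by pg_basebackup -R are visible in the data directory (a quick check, using the data directory path from this setup):

$ docker exec testPG.3 bash -c "ls /usr/local/pgsql/data | grep -E 'standby.signal|postgresql.auto.conf'"
postgresql.auto.conf
standby.signal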

$ docker exec testPG.1 bash -c "touch $PGDATA/standby.signal && /usr/local/pgsql/bin/pg_ctl -D $PGDATA -l logfile start"

The server should now be up and running as a standby. But it still does not know how to reach the new primary (testPG.2) server.

$ docker exec testPG.1 bash -c "/usr/local/pgsql/bin/psql -h localhost -c \"SHOW primary_conninfo;\""
 primary_conninfo
------------------

(1 row)

The connection string is empty. This is because testPG.1 was the initial primary and never needed to stream data from another server, so no connection string was required. It now needs to be set with valid key-value pairs:

$ DOCKER_BRIDGE_INTERFACE=`ifconfig docker0 | grep -w "inet" | awk '{print $2}'` && \
    docker exec testPG.1 bash -c "/usr/local/pgsql/bin/psql -h localhost -c \"ALTER SYSTEM SET primary_conninfo = '\
        user=repl password=repl channel_binding=disable host=$DOCKER_BRIDGE_INTERFACE \
        port=5433 sslmode=disable sslnegotiation=postgres sslcompression=0 \
        sslcertmode=disable sslsni=1 ssl_min_protocol_version=TLSv1.2 \
        gssencmode=disable krbsrvname=postgres gssdelegation=0 \
        target_session_attrs=any load_balance_hosts=disable';\""
ALTER SYSTEM

$ docker exec testPG.1 bash -c "/usr/local/pgsql/bin/psql -h localhost -c \"SELECT pg_reload_conf();\"" # Reload configurations
 pg_reload_conf
----------------
 t
(1 row)

testPG.1 now follows the new leader, testPG.2, and is receiving streamed WAL records, as shown below by the WAL receiver process running and connected to the primary.

$ docker exec testPG.1 bash -c "ps -faux | grep postgres"
....
postgres     226  0.0  0.4 208844 24736 ?        Ss   11:39   0:00 /usr/local/pgsql/bin/postgres
....
postgres     232  0.0  0.1 210620  6308 ?        Ss   11:39   0:00  \_ postgres: startup recovering 000000020000000000000009
postgres     836  0.0  0.0 210488  4728 ?        Ss   11:47   0:00  \_ postgres: walreceiver streaming 0/90001E0
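At this point, pg_stat_replication on the new primary should report two walsender connections, one for each standby:

$ docker exec testPG.2 bash -c "/usr/local/pgsql/bin/psql -h localhost -c \"SELECT client_addr, client_port, state FROM pg_stat_replication;\""

Two rows, one per standby, are expected.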

We have successfully built a "3-node" cluster with one primary and two standbys using Docker containers. We have also managed to fail over to a standby when the primary became unreachable, and to reconfigure the remaining standbys to follow the new primary.
The most notable part of the whole process is the amount of manual intervention required, especially for failover. Tools such as Patroni and repmgr exist to make setting up and managing HA less painful. HA using Patroni will be covered in another post.
