Summary of Chapter 11: "Streaming Replication" from the book "The Internals of PostgreSQL" Part-2

#postgres #apacheage #database #agenssql

This blog aims to assist you in understanding the final concepts of Chapter 11 [Streaming Replication] from the book The Internals of PostgreSQL.

Note: Make sure you have a thorough understanding of Chapter 11 Part-1 and the basics of PostgreSQL before proceeding to Chapter 11 Part-2, as they form the foundation for this discussion.

So, let's continue:

Behavior When a Failure Occurs

This section describes how the primary server behaves when a synchronous standby server fails, and how to deal with the situation.
Even if a synchronous standby server fails and can no longer return an ACK response, the primary server continues to wait for a response indefinitely. As a result, running transactions cannot commit and subsequent query processing cannot start.
There are two ways to avoid such a situation.
One is to use multiple standby servers to increase system availability.
The other is to manually switch from synchronous to asynchronous mode by performing the following steps:
(1) Set the parameter synchronous_standby_names to an empty string.
synchronous_standby_names = ''
(2) Execute the pg_ctl command with the reload option.
postgres> pg_ctl -D $PGDATA reload

Managing Multiple-Standby Servers

sync_priority and sync_state

The primary server assigns the sync_priority and sync_state attributes to all managed standby servers and treats each standby server according to its respective values.
The sync_priority attribute indicates the priority of the standby server in synchronous mode. The lower the value, the higher the priority. The special value 0 means that the standby server is 'in asynchronous mode'.
The priorities of the standby servers are assigned in the order listed in the primary server's configuration parameter synchronous_standby_names.
The sync_state attribute indicates the state of the standby server. It can be one of the following values:
sync: The standby server is in synchronous mode and is the highest priority standby server that is currently working.
potential: The standby server is in synchronous mode and is a lower priority standby server that is currently working. If the current sync standby server fails, this standby server will be promoted to sync state.
async: The standby server is in asynchronous mode. (It will never be in 'sync' or 'potential' mode.)
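The rules above can be sketched as a small Python model. This is only an illustration, not PostgreSQL's implementation: the function name classify_standbys, the StandbyStatus type, and the standby names are hypothetical, and the model assumes exactly one standby is in the sync state at a time, as described in this section.

```python
from dataclasses import dataclass


@dataclass
class StandbyStatus:
    name: str
    sync_priority: int
    sync_state: str


def classify_standbys(synchronous_standby_names, connected):
    """Assign sync_priority and sync_state to each connected standby.

    Simplified model (an assumption of this sketch): one 'sync' standby only.
    """
    # Priority is the 1-based position in synchronous_standby_names.
    priority = {name: i + 1 for i, name in enumerate(synchronous_standby_names)}
    # Listed standbys that are currently connected, highest priority first.
    listed = sorted((n for n in connected if n in priority), key=priority.get)

    result = []
    for name in connected:
        p = priority.get(name, 0)  # 0 means "not listed", i.e. asynchronous
        if p == 0:
            state = "async"
        elif name == listed[0]:
            state = "sync"  # highest-priority working standby
        else:
            state = "potential"
        result.append(StandbyStatus(name, p, state))
    return result


# Hypothetical setup: two listed standbys plus one unlisted standby.
for status in classify_standbys(["standby1", "standby2"],
                                ["standby1", "standby2", "standby3"]):
    print(status)
```

In this hypothetical setup, standby1 gets priority 1 and state sync, standby2 gets priority 2 and state potential, and standby3 gets priority 0 and state async. On a real primary, the same attributes can be read from the sync_priority and sync_state columns of the pg_stat_replication view.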

How the Primary Manages Multiple Standbys

The primary server waits for ACK responses from the synchronous standby server alone.
In other words, the primary server confirms only the synchronous standby's writing and flushing of WAL data. Streaming replication, therefore, ensures that only the synchronous standby is in a consistent and synchronous state with the primary.
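As a rough illustration of this rule, the sketch below models the commit decision on the primary. It is a simplification under stated assumptions: LSNs are represented as plain integers, the function name commit_can_return and the standby names are hypothetical, and only the flushed position reported by the synchronous standby is considered.

```python
def commit_can_return(commit_lsn, flushed_lsn_by_standby, sync_standby):
    """Return True once the sync standby has acknowledged WAL up to commit_lsn.

    flushed_lsn_by_standby maps a standby name to the last WAL position it
    reported as flushed. Positions of potential and async standbys are ignored.
    """
    # A standby that has not reported anything yet counts as position 0.
    return flushed_lsn_by_standby.get(sync_standby, 0) >= commit_lsn


# Hypothetical state: the async standby is ahead, but the sync one is behind.
acks = {"standby1": 90, "standby2": 120}
print(commit_can_return(100, acks, sync_standby="standby1"))
```

Here the commit still has to wait even though standby2 has already flushed past the commit position, because only the ACK from the synchronous standby counts.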

Managing multiple standby servers in PostgreSQL is depicted in the figure below:

Behavior When a Failure Occurs

When either a potential or an asynchronous standby server has failed, the primary server terminates the walsender process connected to the failed standby and continues all processing.
In other words, transaction processing on the primary server is not affected by the failure of either type of standby server.
When a synchronous standby server has failed, the primary server terminates the walsender process connected to the failed standby, and replaces the synchronous standby with the highest priority potential standby.
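The replacement logic can be sketched in a few lines of Python. This is a simplified model and not PostgreSQL source code: handle_standby_failure and the standby names are hypothetical, and "terminating the walsender" is represented simply by dropping the failed standby from the list of connected ones.

```python
def handle_standby_failure(synchronous_standby_names, connected, failed):
    """Return (remaining standbys, current sync standby or None) after a failure."""
    # Dropping the entry stands in for terminating the walsender process
    # that was connected to the failed standby.
    remaining = [name for name in connected if name != failed]
    # Listed standbys that are still connected, in priority order.
    candidates = [n for n in synchronous_standby_names if n in remaining]
    new_sync = candidates[0] if candidates else None
    return remaining, new_sync


names = ["standby1", "standby2"]           # hypothetical configuration
connected = ["standby1", "standby2", "standby3"]

# The sync standby fails: the highest-priority potential standby takes over.
print(handle_standby_failure(names, connected, failed="standby1"))
# An async standby fails: the sync standby is unchanged.
print(handle_standby_failure(names, connected, failed="standby3"))
```

When no listed standby remains, the sketch returns None for the sync standby. That corresponds to the situation described at the start of this post, where the primary keeps waiting until synchronous_standby_names is changed.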

The replacement of the synchronous standby server in PostgreSQL is depicted in the figure below:

Detecting Failures of Standby Servers

Streaming replication uses two common failure detection procedures that do not require any special hardware.
(1) Failure detection of standby server process:
When a connection drop between the walsender and walreceiver is detected, the primary server immediately determines that the standby server or walreceiver process is faulty.
When a low-level network function returns an error by failing to write or read the socket interface of the walreceiver, the primary server also immediately determines its failure.
(2) Failure detection of hardware and networks:
If a walreceiver does not return anything within the time set for the parameter wal_sender_timeout (default 60 seconds), the primary server determines that the standby server is faulty.
Unlike the process failures described above, which are detected immediately, this kind of failure takes some time to confirm. If a standby server can no longer send any response (e.g., because of a hardware failure on the standby or a network failure), the primary server needs up to wal_sender_timeout seconds to determine that the standby is dead.
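The timeout check can be illustrated with a minimal Python sketch. The function name standby_timed_out and the use of plain second counters are assumptions of this example. The 60-second default comes from the text above, and treating a value of zero as "timeout disabled" follows the documented behavior of wal_sender_timeout.

```python
def standby_timed_out(last_reply_at, now, wal_sender_timeout=60.0):
    """Return True if nothing arrived from the walreceiver within the timeout.

    last_reply_at and now are timestamps in seconds (for example, values
    taken from time.monotonic()).
    """
    if wal_sender_timeout <= 0:
        # A value of zero disables the timeout mechanism.
        return False
    return (now - last_reply_at) > wal_sender_timeout


# Hypothetical timeline: the last reply arrived at t = 1000 seconds.
print(standby_timed_out(last_reply_at=1000.0, now=1030.0))  # still within 60 s
print(standby_timed_out(last_reply_at=1000.0, now=1061.0))  # silent for over 60 s
```

The second call shows why this kind of detection is slower than a dropped connection: the primary only learns about the failure once the silence has lasted longer than wal_sender_timeout.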