<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Brian Misachi</title>
    <description>The latest articles on DEV Community by Brian Misachi (@misachi).</description>
    <link>https://dev.to/misachi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2136791%2Fec1826a0-1366-4857-a2ff-9cacdeb0bcf2.jpg</url>
      <title>DEV Community: Brian Misachi</title>
      <link>https://dev.to/misachi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/misachi"/>
    <language>en</language>
    <item>
      <title>From Postgres to Iceberg</title>
      <dc:creator>Brian Misachi</dc:creator>
      <pubDate>Wed, 05 Nov 2025 17:04:33 +0000</pubDate>
      <link>https://dev.to/misachi/from-postgres-to-iceberg-g4</link>
      <guid>https://dev.to/misachi/from-postgres-to-iceberg-g4</guid>
      <description>&lt;p&gt;Modern applications generate a lot of data mostly stored in ACID compliant OLTP storage systems like Postgres, Mysql, SQLite etc. Once collected, this data needs to be queried and processed for analytics or any other  purposes to get meaningful insights for business decision-making. The data to be queried could be stored in different databases and in order to access all the data you would need to run queries on multiple different databases. This is less efficient and adds extra complexity to the system.&lt;/p&gt;

&lt;p&gt;One simple solution is a central OLTP database that collates all the data and handles all your analytical queries. It can be another Postgres instance acting as a data warehouse. This works, but it has a few issues, such as hard requirements on the shape of the data (it must be structured) and scaling limits.&lt;/p&gt;

&lt;p&gt;Data lakes offer a different solution. A data lake provides storage for all kinds of data: files in different formats (e.g. CSV, JSON, Apache Parquet, Avro, PDFs), images, videos etc. The data can be stored in its raw form and processed later (as in Extract Load Transform) for analytics or machine learning. Data lakes are usually built on object stores like S3, Google Cloud Storage, Azure Blob Storage, and Minio. The standout advantage of a data lake is the promise of virtually limitless storage and the ability to store almost anything. Once data is in the lake and the necessary transformations have been applied, it needs to be exposed to various stakeholders (machine learning engineers, analysts, "AI people", execs etc.) to derive business value from it.&lt;/p&gt;

&lt;p&gt;Data warehouses built on "traditional" OLTP databases like Postgres, which already organize data in tables and ship with built-in SQL query engines, make this part easier. For a data lake, where the data to be queried is scattered across multiple files, querying gets complicated. The data lake for the most part only gives you storage; you have to bring your own compute layer to query and process the data. Luckily, a number of solutions already exist.&lt;/p&gt;

&lt;p&gt;A common solution is to use an open table format like &lt;a href="https://iceberg.apache.org" rel="noopener noreferrer"&gt;Apache Iceberg&lt;/a&gt; (others are &lt;a href="https://delta.io/" rel="noopener noreferrer"&gt;Delta Lake&lt;/a&gt; and &lt;a href="https://hudi.apache.org/" rel="noopener noreferrer"&gt;Apache Hudi&lt;/a&gt;). With these tools you get the benefits of traditional database functionality on your data lake, i.e. ACID guarantees and transactions. The &lt;a href="https://iceberg.apache.org/spec/" rel="noopener noreferrer"&gt;Iceberg specification&lt;/a&gt; defines an open table format that lets related data, stored as separate files in a distributed storage system, be accessed as one table.&lt;/p&gt;

&lt;h3&gt;
  
  
  Iceberg Design
&lt;/h3&gt;

&lt;p&gt;The design for Iceberg is as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fad4sr9uoqi4rhdpj7ino.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fad4sr9uoqi4rhdpj7ino.png" alt="Iceberg Specification" width="800" height="826"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The design is pretty simple. The catalog holds metadata about the tables managed by Iceberg. A snapshot holds the state of the table at a point in time; any change to the table results in a new snapshot being generated. The snapshot stores the manifest list, which is a list of manifest files, and each data file is tracked by a manifest file.&lt;/p&gt;
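&lt;p&gt;To make the hierarchy concrete, here is a toy sketch in plain Python (not the actual Iceberg libraries, which store these structures as JSON and Avro files in object storage) of how a reader resolves the current snapshot down to its data files:&lt;/p&gt;

```python
# Toy model of the Iceberg metadata hierarchy (illustrative only).
from dataclasses import dataclass, field

@dataclass
class ManifestFile:
    # Each manifest tracks a set of data files (plus per-file stats in real Iceberg).
    data_files: list

@dataclass
class Snapshot:
    snapshot_id: int
    manifest_list: list  # the manifest list: a list of ManifestFile entries

@dataclass
class TableMetadata:
    current_snapshot_id: int
    snapshots: dict = field(default_factory=dict)

    def data_files_for_current_snapshot(self):
        # A reader walks: metadata -- current snapshot -- manifest list
        # -- manifests -- data files.
        snap = self.snapshots[self.current_snapshot_id]
        files = []
        for manifest in snap.manifest_list:
            files.extend(manifest.data_files)
        return files

meta = TableMetadata(current_snapshot_id=2)
meta.snapshots[1] = Snapshot(1, [ManifestFile(["f1.parquet"])])
meta.snapshots[2] = Snapshot(2, [ManifestFile(["f1.parquet"]),
                                 ManifestFile(["f2.parquet"])])
print(meta.data_files_for_current_snapshot())  # ['f1.parquet', 'f2.parquet']
```

&lt;p&gt;Note how the older snapshot stays intact: readers pinned to snapshot 1 still resolve to the original file set, which is what enables time travel.&lt;/p&gt;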

&lt;p&gt;Iceberg is able to efficiently manage large amounts of data stored in the data lake. The data layer supports storing data in open formats like &lt;a href="https://parquet.apache.org/" rel="noopener noreferrer"&gt;Apache Parquet&lt;/a&gt; or &lt;a href="https://avro.apache.org/" rel="noopener noreferrer"&gt;Avro&lt;/a&gt;. Apache Parquet is an open columnar data format for efficient data storage and retrieval, so you automatically get the benefits of columnar storage for your analytical workloads. Engines like Apache Spark, Apache Flink, Presto, and Trino can be used as the compute layer for querying and processing.&lt;/p&gt;

&lt;p&gt;There are two methods for getting data into the data lake: batching or streaming directly from the source. Batching is less complicated and can be scheduled, say, nightly. Streaming, on the other hand, is more involved and requires more effort to get right. A common setup is &lt;a href="https://debezium.io/documentation/reference/3.3/connectors/postgresql.html" rel="noopener noreferrer"&gt;Debezium&lt;/a&gt; for change data capture (CDC) with an event streaming platform like &lt;a href="https://kafka.apache.org/" rel="noopener noreferrer"&gt;Apache Kafka&lt;/a&gt; or &lt;a href="https://www.redpanda.com/" rel="noopener noreferrer"&gt;Redpanda&lt;/a&gt;.&lt;/p&gt;
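&lt;p&gt;For a feel of what a CDC pipeline moves around, here is a simplified, Debezium-style change event, trimmed for illustration (real events also carry a schema block and richer source metadata):&lt;/p&gt;

```python
import json

# A simplified Debezium-style change event for an INSERT into `employee`.
# The field values here are made up for illustration.
event_json = '''
{
  "payload": {
    "before": null,
    "after": {"id": 1, "name": "Sam Evans", "salary": "55000.00"},
    "source": {"table": "employee", "lsn": 24500432},
    "op": "c",
    "ts_ms": 1762362273000
  }
}
'''

event = json.loads(event_json)
payload = event["payload"]
# Debezium encodes the operation as c=create, u=update, d=delete, r=snapshot read.
op_names = {"c": "insert", "u": "update", "d": "delete", "r": "snapshot read"}
print(op_names[payload["op"]], payload["after"]["name"])  # insert Sam Evans
```

&lt;p&gt;A consumer typically switches on &lt;code&gt;op&lt;/code&gt; and applies &lt;code&gt;after&lt;/code&gt; (or, for deletes, &lt;code&gt;before&lt;/code&gt;) to the downstream table.&lt;/p&gt;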

&lt;p&gt;In the remainder of this post, we'll set up a simple pipeline to stream data from a Postgres server to an Iceberg table. The setup will use Postgres (we love Postgres here) as a SQL catalog, &lt;a href="https://github.com/minio/minio" rel="noopener noreferrer"&gt;Minio&lt;/a&gt;, an open source S3-compatible object store, the Apache Parquet format for the data files, and Trino as the processing framework (compute layer).&lt;/p&gt;

&lt;h3&gt;
  
  
  Change Data Capture Tool
&lt;/h3&gt;

&lt;p&gt;We'll also use a simple &lt;a href="https://github.com/misachi/pg_rusted_wire" rel="noopener noreferrer"&gt;CDC tool&lt;/a&gt; I have been working on. It implements the Postgres wire protocol and uses logical replication to capture DML changes and stream them to the data lake. It is written mostly in Rust, but the part that uploads data to the data lake is done in &lt;a href="https://github.com/misachi/pg_rusted_wire/blob/main/py_iceberg.py" rel="noopener noreferrer"&gt;Python&lt;/a&gt;. This makes it easier to do any additional processing of the data in Python before uploading it to the data lake (ETL), or, as in ELT, to upload the data as-is and perform the transformations later. The Python data ecosystem is rich and provides a lot of excellent tools for working with data.&lt;/p&gt;

&lt;p&gt;Streaming can result in an explosion of very small files in the data lake. Transferring every record captured by the CDC tool causes a lot of network chatter and slows down data processing. It helps to buffer the data locally and only send it over the network in larger batches. This can speed up the pipeline (data is first saved on local disk) and also reduce network costs. &lt;a href="https://github.com/misachi/pg_rusted_wire" rel="noopener noreferrer"&gt;pg_rusted_wire&lt;/a&gt; uses segments, fixed-size file chunks, to accumulate data received from Postgres. Data in the active segment is uploaded to the data lake once the segment reaches the configured size or the configured upload interval elapses. Both the interval and the segment size can be adjusted with the constants &lt;code&gt;MAX_WRITE_INTERVAL&lt;/code&gt; and &lt;code&gt;MAX_SEGMENT_SIZE&lt;/code&gt; respectively, in &lt;code&gt;wire.rs&lt;/code&gt;.&lt;/p&gt;
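&lt;p&gt;The flush-on-size-or-interval idea can be sketched in a few lines of Python. This is a simplified model, not the actual Rust implementation; the constant names mirror the tool's settings but the values here are made up:&lt;/p&gt;

```python
import time

MAX_SEGMENT_SIZE = 64 * 1024   # bytes; flush when the active segment fills up
MAX_WRITE_INTERVAL = 30.0      # seconds; flush even if the segment is small

class SegmentBuffer:
    """Accumulate records locally; ship them in large batches."""
    def __init__(self, upload):
        self.upload = upload               # callback that ships a batch
        self.buf = bytearray()
        self.last_flush = time.monotonic()

    def append(self, record: bytes):
        self.buf.extend(record)
        now = time.monotonic()
        # Flush on size or on elapsed time, whichever comes first.
        if len(self.buf) >= MAX_SEGMENT_SIZE or now - self.last_flush >= MAX_WRITE_INTERVAL:
            self.flush(now)

    def flush(self, now=None):
        if self.buf:
            self.upload(bytes(self.buf))   # one network round trip per batch
            self.buf.clear()
        self.last_flush = now if now is not None else time.monotonic()

batches = []
buf = SegmentBuffer(batches.append)
for _ in range(70000):         # 70,000 one-byte records: crosses 64 KB once
    buf.append(b"x")
print(len(batches), len(batches[0]))  # 1 65536
```

&lt;p&gt;Instead of 70,000 tiny uploads, one 64 KB batch goes over the network; the stragglers wait in the buffer for the next size or interval trigger.&lt;/p&gt;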

&lt;h3&gt;
  
  
  Setting up the Data Lake
&lt;/h3&gt;

&lt;p&gt;To begin, clone the repository &lt;a href="https://github.com/misachi/iceberg-trino-example" rel="noopener noreferrer"&gt;here&lt;/a&gt; and execute the &lt;code&gt;run.sh&lt;/code&gt; bash script. Once the script has completed you should see three services running with a &lt;code&gt;healthy&lt;/code&gt; status -- Trino takes a few seconds (~20s) to start up and transition to a healthy state. Navigate to &lt;code&gt;http://localhost:9001&lt;/code&gt; in your browser to open the Minio object storage UI. Use the username &lt;code&gt;minio_user&lt;/code&gt; and password &lt;code&gt;pass1234&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwtu586woooz0k285dg92.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwtu586woooz0k285dg92.png" alt="Minio UI" width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The default bucket &lt;code&gt;warehouse&lt;/code&gt; should be empty. A new catalog called &lt;code&gt;example_cat&lt;/code&gt; has also been created. The catalog should be empty since we have not added any tables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -it iceberg-trino-example-metastore-1 psql -U repl_user -d iceberg
psql (18.0)
Type "help" for help.

iceberg=# table iceberg_tables; -- Empty
 catalog_name | table_namespace | table_name | metadata_location | previous_metadata_location
--------------+-----------------+------------+-------------------+----------------------------
(0 rows)

iceberg=# table iceberg_namespace_properties; -- Empty
 catalog_name | namespace | property_key | property_value
--------------+-----------+--------------+----------------
(0 rows)

iceberg=#
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can now create a schema on Iceberg to hold our table. We also create a new table called &lt;code&gt;employee&lt;/code&gt; to store employee details. The &lt;a href="https://trino.io/" rel="noopener noreferrer"&gt;Trino&lt;/a&gt; processing engine will be used to perform these actions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -it iceberg-trino-example-trino_connector-1 trino
trino&amp;gt; CREATE SCHEMA example_cat.example_schema WITH (location = 's3://warehouse/test'); 
CREATE SCHEMA
trino&amp;gt; CREATE TABLE example_cat.example_schema.employee
     (
       id INTEGER,
       name VARCHAR,
       salary DECIMAL(10,2)
     )
     WITH (
       format = 'PARQUET'
     );
CREATE TABLE
trino&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we have a schema &lt;code&gt;example_schema&lt;/code&gt;, a table &lt;code&gt;employee&lt;/code&gt; and a catalog named &lt;code&gt;example_cat&lt;/code&gt;. Accessing the table requires the full path in the &lt;code&gt;&amp;lt;catalog_name&amp;gt;.&amp;lt;schema_name&amp;gt;.&amp;lt;table_name&amp;gt;&lt;/code&gt; format. In our case, the table name is &lt;code&gt;example_cat.example_schema.employee&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The bucket created earlier should now have some data in it: metadata files describing the state of the table that was just created. One is a metadata file and the other is a snapshot file (prefixed with "snap"). There is no manifest file yet since we have no data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv1xk0ijh0ig8px3scsk7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv1xk0ijh0ig8px3scsk7.png" alt="Iceberg-Metadata" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's add some data to the table&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -it iceberg-trino-example-trino_connector-1 trino
trino&amp;gt; INSERT INTO example_cat.example_schema.employee (id, name, salary) VALUES (1, 'Sam Evans', 55000);
    INSERT INTO example_cat.example_schema.employee (id, name, salary) VALUES (2, 'James Bond', 12000);
INSERT: 1 row

Query 20251031_121000_00003_dw2kt, FINISHED, 1 node
Splits: 23 total, 23 done (100.00%)
2.66 [0 rows, 0B] [0 rows/s, 0B/s]

INSERT: 1 row

Query 20251031_121003_00004_dw2kt, FINISHED, 1 node
Splits: 23 total, 23 done (100.00%)
0.78 [0 rows, 0B] [0 rows/s, 0B/s]

trino&amp;gt; SELECT * FROM example_cat.example_schema.employee; -- Query the data
 id |    name    |  salary
----+------------+----------
  1 | Sam Evans  | 55000.00
  2 | James Bond | 12000.00
(2 rows)

Query 20251031_121103_00005_dw2kt, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0.57 [2 rows, 1KB] [3 rows/s, 1.77KB/s]

trino&amp;gt; exit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;More metadata files are generated. We can browse them in the object store UI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87c154gmt55t24ckues7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87c154gmt55t24ckues7.png" alt="More metadata" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A new directory named &lt;code&gt;data&lt;/code&gt; has also been created to store the data files.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn25kv3hdbsnaly7t3c9b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn25kv3hdbsnaly7t3c9b.png" alt="Data directory" width="800" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are two Parquet data files. We inserted two records in separate transactions, and each transaction created its own data file and metadata files: when the state of a table changes (delete, insert, update) a new snapshot is created to track the new state. A new metadata file (which stores the location of the snapshot) and a new manifest file (to track the new data file) are also created.&lt;/p&gt;

&lt;p&gt;We can drop the table to clear all the files associated with it, including the metadata and data files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -it iceberg-trino-example-trino_connector-1 trino
trino&amp;gt; DROP TABLE example_cat.example_schema.employee;
DROP TABLE
trino&amp;gt; exit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Streaming Data to Iceberg
&lt;/h3&gt;

&lt;p&gt;We can now set up a simple pipeline with &lt;a href="https://github.com/misachi/pg_rusted_wire" rel="noopener noreferrer"&gt;pg_rusted_wire&lt;/a&gt; to stream data from Postgres to Iceberg. In this example, we will use the same Postgres instance both as the catalog store and as the data source. The tool uses logical streaming replication to capture table changes from Postgres.&lt;/p&gt;

&lt;p&gt;First, prepare the data source for logical streaming.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -it iceberg-trino-example-metastore-1 psql -U repl_user -d iceberg -c "ALTER SYSTEM SET wal_level = 'logical'"
ALTER SYSTEM

$ docker restart iceberg-trino-example-metastore-1
iceberg-trino-example-metastore-1 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then create the &lt;code&gt;employee&lt;/code&gt; table from which we will get data. We also create a publication on the table, to which the CDC tool will subscribe to capture changes made to the table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -it iceberg-trino-example-metastore-1 psql -U repl_user -d postgres
postgres=# CREATE TABLE employee
( id serial primary key,
  name varchar,
  salary decimal(10,2)
);
CREATE TABLE
postgres=# INSERT INTO employee (name, salary) select 'Mkamze Mwatela' || i, i*200 from generate_series(1, 100000) i;
INSERT 0 100000
postgres=# CREATE PUBLICATION pub1 FOR TABLE employee;
CREATE PUBLICATION
postgres=# exit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;employee&lt;/code&gt; table has been created and has 100k records.&lt;/p&gt;

&lt;p&gt;Let's also recreate the &lt;code&gt;employee&lt;/code&gt; table on Iceberg. The data streamed from Postgres will be stored in this table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -it iceberg-trino-example-trino_connector-1 trino
trino&amp;gt; CREATE TABLE example_cat.example_schema.employee
     (
       id INTEGER,
       name VARCHAR,
       salary DECIMAL(10,2)
     )
     WITH (
       format = 'PARQUET'
     );
CREATE TABLE
trino&amp;gt; exit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, create a file called &lt;code&gt;config&lt;/code&gt; in a directory of your choosing, e.g. &lt;code&gt;/home/your_user/.tmp/config&lt;/code&gt;. The file contains credentials for connecting to the object store and the catalog store. Copy the details below into the file.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Use the command &lt;code&gt;ifconfig docker0 | grep -w "inet" | awk '{print $2}'&lt;/code&gt; to get the Docker host IP address. Replace &lt;code&gt;host_ip&lt;/code&gt; below with the output of the command.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;S3_SECRET_KEY='pass1234'
S3_ENDPOINT='http://host_ip:9000'
S3_ACCESS_KEY='minio_user'
CATALOG_URI='postgresql+psycopg2://repl_user:pass1234@host_ip:5432/iceberg'
TABLE_NAME='example_cat.example_schema.employee'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
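&lt;p&gt;A minimal parser for this &lt;code&gt;KEY='value'&lt;/code&gt; file format looks like the sketch below (how the real tool parses its config may differ; this is just to show the shape of the file):&lt;/p&gt;

```python
def parse_config(text):
    """Parse KEY='value' lines into a dict, ignoring blanks and comments."""
    cfg = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        cfg[key.strip()] = value.strip().strip("'\"")  # drop surrounding quotes
    return cfg

sample = """
S3_SECRET_KEY='pass1234'
S3_ENDPOINT='http://host_ip:9000'
S3_ACCESS_KEY='minio_user'
CATALOG_URI='postgresql+psycopg2://repl_user:pass1234@host_ip:5432/iceberg'
TABLE_NAME='example_cat.example_schema.employee'
"""

cfg = parse_config(sample)
print(cfg["TABLE_NAME"])  # example_cat.example_schema.employee
```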



&lt;p&gt;Open a new terminal window and clone the repo at &lt;a href="https://github.com/misachi/pg_rusted_wire" rel="noopener noreferrer"&gt;https://github.com/misachi/pg_rusted_wire&lt;/a&gt;. Run the example CDC tool as shown below.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;/home/your_user/.tmp/&lt;/code&gt; should be the directory that contains the &lt;code&gt;config&lt;/code&gt; file described above.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git clone https://github.com/misachi/pg_rusted_wire.git &amp;amp;&amp;amp; cd pg_rusted_wire

$ DOCKER_IP=`ifconfig docker0 | grep -w "inet" | awk '{print $2}'` &amp;amp;&amp;amp; cargo run --example lrepl -- -u repl_user -P pass1234 -H $DOCKER_IP -d postgres -p 5432 --table employee --publication pub1 --config-dir /home/your_user/.tmp/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If there are no errors, then after a few seconds (depending on the configured &lt;code&gt;MAX_WRITE_INTERVAL&lt;/code&gt; setting) the initial data in the Postgres &lt;code&gt;employee&lt;/code&gt; table should have been copied to the &lt;code&gt;employee&lt;/code&gt; Iceberg table. We can check using Trino with the SQL command below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -it iceberg-trino-example-trino_connector-1 trino
trino&amp;gt; SELECT COUNT(*) FROM example_cat.example_schema.employee;
 _col0
--------
 100000
(1 row)

Query 20251102_201645_00003_ack9h, FINISHED, 1 node
Splits: 10 total, 10 done (100.00%)
0.41 [100K rows, 0B] [242K rows/s, 0B/s]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All 100k records in the Postgres table have been copied to Iceberg.&lt;/p&gt;

&lt;h4&gt;
  
  
  Inserts
&lt;/h4&gt;

&lt;p&gt;We can then insert additional data into our Postgres &lt;code&gt;employee&lt;/code&gt; table and it will be streamed to Iceberg.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -it iceberg-trino-example-metastore-1 psql -U repl_user -d postgres -c "INSERT INTO employee (name, salary) select 'Manjaz' || i, i*200 from generate_series(1, 5) i"
INSERT 0 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five more records have been inserted into the table. The record count on the Iceberg table should now be 100,005.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -it iceberg-trino-example-trino_connector-1 trino
trino&amp;gt; SELECT COUNT(*) FROM example_cat.example_schema.employee;
 _col0
--------
 100005
(1 row)

Query 20251102_202608_00004_ack9h, FINISHED, 1 node
Splits: 11 total, 11 done (100.00%)
0.40 [100K rows, 0B] [253K rows/s, 0B/s]

trino&amp;gt; exit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The object storage web UI now shows something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5gkng4x8qyb38c5o6h5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5gkng4x8qyb38c5o6h5.png" alt="100k records" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two data files have been generated. One is from the initial copy of 100k records, with size 1.2MB. The second data file holds the 5 records inserted later, with size 1.3KB.&lt;/p&gt;

&lt;h4&gt;
  
  
  Updates
&lt;/h4&gt;

&lt;p&gt;When a record is updated in the Postgres table, the change is captured and streamed to Iceberg.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -it iceberg-trino-example-metastore-1 psql -U repl_user -d postgres
psql (18.0)
Type "help" for help.

postgres=# SELECT * FROM employee WHERE name like '%Manjaz%'; -- Initial salary records for the 5 employees
   id   |  name   | salary
--------+---------+---------
 100001 | Manjaz1 |  200.00
 100002 | Manjaz2 |  400.00
 100003 | Manjaz3 |  600.00
 100004 | Manjaz4 |  800.00
 100005 | Manjaz5 | 1000.00
(5 rows)

postgres=# UPDATE employee SET salary = 4000 WHERE name like '%Manjaz%';
UPDATE 5
postgres=# SELECT * FROM employee WHERE name like '%Manjaz%';
   id   |  name   | salary
--------+---------+---------
 100001 | Manjaz1 | 4000.00
 100002 | Manjaz2 | 4000.00
 100003 | Manjaz3 | 4000.00
 100004 | Manjaz4 | 4000.00
 100005 | Manjaz5 | 4000.00
(5 rows)
postgres=# exit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The employees with names that start with &lt;code&gt;Manjaz&lt;/code&gt; have all been updated.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -it iceberg-trino-example-trino_connector-1 trino
trino&amp;gt; SELECT COUNT(*) FROM example_cat.example_schema.employee;
 _col0
--------
 100005
(1 row)

Query 20251102_204321_00012_ack9h, FINISHED, 1 node
Splits: 11 total, 11 done (100.00%)
0.51 [100K rows, 0B] [195K rows/s, 0B/s]

trino&amp;gt; SELECT * FROM example_cat.example_schema.employee WHERE name like '%Manjaz%';
   id   |  name   | salary
--------+---------+---------
 100001 | Manjaz1 | 4000.00
 100002 | Manjaz2 | 4000.00
 100003 | Manjaz3 | 4000.00
 100004 | Manjaz4 | 4000.00
 100005 | Manjaz5 | 4000.00
(5 rows)

Query 20251102_204403_00013_ack9h, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
1.24 [100K rows, 1.21MB] [80.4K rows/s, 996KB/s]

trino&amp;gt; exit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The updates to the &lt;code&gt;employee&lt;/code&gt; table in Postgres have been successfully streamed to the Iceberg table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcvbpi208tegazhzf8sdc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcvbpi208tegazhzf8sdc.png" alt="Updates" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Minio object storage browser now shows 3 data files. All the updated records were in the data file named &lt;code&gt;00000-0-14307206-2120-4897-a3ef-fddd94900de2.parquet&lt;/code&gt; since they were created in a single transaction. A new data file named &lt;code&gt;00000-0-5af272e2-4e18-4828-a66f-b6667e982e16.parquet&lt;/code&gt; has been created as a result of the update. UPDATEs are done in copy-on-write mode by default: the data file is not changed in place; instead, the affected data is read into memory, updated, and written out to a new file.&lt;/p&gt;
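&lt;p&gt;The copy-on-write behaviour can be illustrated with a toy model (plain Python, not the actual engine; the file naming just mimics what we saw in the bucket):&lt;/p&gt;

```python
import uuid

data_files = {}          # file name -- list of (id, name, salary) rows
snapshots = []           # each snapshot lists the data files it references

def write_file(rows):
    # Fabricated naming in the spirit of the files shown in the Minio browser.
    name = "00000-0-" + str(uuid.uuid4()) + ".parquet"
    data_files[name] = rows
    return name

# Initial insert: one data file, one snapshot referencing it.
f1 = write_file([(100001, "Manjaz1", 200.00)])
snapshots.append([f1])

# Copy-on-write UPDATE: read the affected file, apply the change in memory,
# and write a brand-new file. The old file is untouched and still reachable
# through the older snapshot (enabling time travel), until it is expired.
old_rows = data_files[f1]
new_rows = [(i, n, 4000.00) for (i, n, _salary) in old_rows]
f2 = write_file(new_rows)
snapshots.append([f2])   # the new snapshot points at the new file only

print(len(data_files), data_files[f2][0][2])  # 2 4000.0
```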

&lt;h4&gt;
  
  
  Deletes
&lt;/h4&gt;

&lt;p&gt;Deleting records works similarly to updates and results in records being removed from the Iceberg table. Since a delete changes the state of the table, a new snapshot, and hence a new metadata file, will be generated. A new data file is also created, with a new manifest file pointing to it.&lt;/p&gt;

&lt;p&gt;Postgres is purpose-built to handle OLTP workloads and it does that really well. Its row-based storage design might not suit all analytical workloads, depending on your specific needs and data size. OLAP data stores designed specifically for analytics come in handy when Postgres no longer meets your needs. We have shown one such solution: a data lake storing raw data in many files, with Apache Iceberg treating related files as a single table you can run queries on. A data lake's ability to handle large, diverse datasets makes it an essential foundation for modern data-driven decision-making. Combined with open table formats like Apache Iceberg, data lakes become even more powerful, offering reliable data management, faster queries, and a scalable path toward advanced analytics.&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Postgres Replication Slots</title>
      <dc:creator>Brian Misachi</dc:creator>
      <pubDate>Fri, 29 Aug 2025 13:57:00 +0000</pubDate>
      <link>https://dev.to/misachi/postgres-replication-slots-31ml</link>
      <guid>https://dev.to/misachi/postgres-replication-slots-31ml</guid>
      <description>&lt;p&gt;The &lt;a href="https://www.postgresql.org/docs/current/wal-intro.html" rel="noopener noreferrer"&gt;WAL(Redo log)&lt;/a&gt; is essential to normal database operations. In some sense "The LOG is the DATABASE" has some truth to it. The WAL stores a log of all activities performed in the database. In other words, it maintains the state of the database system. Having the WAL is useful in many ways and one of the most important role is in replication. &lt;/p&gt;

&lt;p&gt;Replication in Postgres makes significant use of the WAL when copying data changes from one database server to another. Every data change, once committed, is replicated to all configured standbys by reading the WAL. It is important that the WAL is stored durably and kept long enough for new changes to be replicated to standbys.&lt;/p&gt;

&lt;p&gt;Postgres aims to be as resource efficient as possible. WAL files can build up quickly and, if not well managed, can fill up available disk space, bringing the system to a halt. Postgres handles this by safely deleting older WAL files during a checkpoint to recover disk space. Once the REDO location is confirmed to be ahead of a WAL file (i.e. data has been flushed up to that point), that WAL file and all older files are removed or recycled as part of the database cleanup process. An issue with this process is that a WAL file can be deleted before its changes are safely replicated downstream to standbys, leading to inconsistency between the primary/publisher and a standby/subscriber. When this occurs, replication normally stops until the issue is fixed or a new full backup is taken from the primary node.&lt;/p&gt;

&lt;h2&gt;
  
  
  Delete Not WAL
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Archiving
&lt;/h3&gt;

&lt;p&gt;Postgres offers a few solutions for keeping WALs long enough for standbys to access them before they are deleted. One such solution is &lt;a href="https://www.postgresql.org/docs/current/continuous-archiving.html#CONTINUOUS-ARCHIVING" rel="noopener noreferrer"&gt;continuous archiving&lt;/a&gt;. In archiving mode, old WAL files are copied to a separate location before being recycled from the &lt;code&gt;pg_wal&lt;/code&gt; directory. The archive location can be a local disk or remote server. Standbys can still access the archive location during recovery and stay consistent with the primary server. Archiving was used in the post &lt;a href="https://dev.to/misachi/high-availability-postgres-4ahe"&gt;HA Postgres&lt;/a&gt; to store old WAL files when setting up the HA cluster.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Archiving is primarily used for backups and PITR and so can be used in addition to other solutions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Slots
&lt;/h3&gt;

&lt;p&gt;Another solution for preventing premature WAL deletion is &lt;a href="https://www.postgresql.org/docs/current/warm-standby.html#STREAMING-REPLICATION-SLOTS" rel="noopener noreferrer"&gt;replication slots&lt;/a&gt;. A replication slot maintains the state of the WAL being accessed by a standby or subscriber that is tracking changes from the primary. The slot prevents deletion of WAL files and removal of old tuple versions that are still required downstream. The replication slot is a simple data structure, represented on disk with the following fields&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// in src/include/replication/slot.h

typedef struct ReplicationSlotPersistentData
{
    NameData    name;
    Oid         database;
    ReplicationSlotPersistency persistency;
    TransactionId xmin;
    TransactionId catalog_xmin;
    XLogRecPtr  restart_lsn;
    ReplicationSlotInvalidationCause invalidated;
    XLogRecPtr  confirmed_flush;
    XLogRecPtr  two_phase_at;
    bool        two_phase;
    NameData    plugin;
    char        synced;
    bool        failover;
} ReplicationSlotPersistentData;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The most interesting field in the struct is &lt;code&gt;restart_lsn&lt;/code&gt;, which indicates the location from which data will be read. &lt;code&gt;confirmed_flush&lt;/code&gt; points to the WAL location that has been flushed and acknowledged by the standby. &lt;code&gt;xmin&lt;/code&gt; maintains the transaction horizon: tuple versions with transaction ids newer than the value of &lt;code&gt;xmin&lt;/code&gt; cannot be removed by vacuum.&lt;/p&gt;

&lt;p&gt;Creating a replication slot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT * FROM pg_create_logical_replication_slot('test_slot', 'test_decoding', false, true);
 slot_name |    lsn
-----------+-----------
 test_slot | 0/17ABC18
(1 row)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To view all the replication slots available in the database we can query the  &lt;code&gt;pg_replication_slots&lt;/code&gt; view&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT slot_name, database, active, xmin, restart_lsn, confirmed_flush_lsn, synced FROM pg_replication_slots;
 slot_name | database | active | xmin | restart_lsn | confirmed_flush_lsn | synced
-----------+----------+--------+------+-------------+---------------------+--------
 test_slot | postgres | f      |      | 0/17ABBE0   | 0/17ABC18           | f
(1 row)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It displays the field values from the slot data structure shown above. The next WAL position to be read starts at location &lt;code&gt;0/17ABBE0&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A new slot called &lt;code&gt;test_slot&lt;/code&gt; has been created using the test logical decoding output plugin &lt;a href="https://www.postgresql.org/docs/current/test-decoding.html" rel="noopener noreferrer"&gt;test_decoding&lt;/a&gt;. The slot is a permanent one (third argument &lt;code&gt;false&lt;/code&gt;), so it will be persisted to disk. It is also configured to allow decoding of prepared transactions, if used (fourth argument &lt;code&gt;true&lt;/code&gt;). &lt;/p&gt;
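&lt;p&gt;For comparison, a physical slot, the kind a streaming standby would use, only tracks WAL positions and takes no output plugin. A quick illustration (the slot name here is made up):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT pg_create_physical_replication_slot('standby_slot');
 pg_create_physical_replication_slot
-------------------------------------
 (standby_slot,)
(1 row)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;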

&lt;p&gt;Create a table and put some data in it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# CREATE TABLE foo(id SERIAL, k INT NOT NULL);
CREATE TABLE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a logical slot, DDL changes are not replicated. When we check for new changes from the slot, only the &lt;code&gt;BEGIN&lt;/code&gt; and &lt;code&gt;COMMIT&lt;/code&gt; statements are output.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT * FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL);
    lsn    | xid |    data
-----------+-----+------------
 0/17B1550 | 764 | BEGIN 764
 0/17D3B50 | 764 | COMMIT 764
(2 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now if we insert some data, we'll get some meaningful decoded output&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# INSERT INTO foo VALUES (1,1);
INSERT 0 1
postgres=# SELECT * FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL);
    lsn    | xid |                         data
-----------+-----+------------------------------------------------------
 0/17B1550 | 764 | BEGIN 764
 0/17D3B50 | 764 | COMMIT 764
 0/17D6CE0 | 765 | BEGIN 765
 0/17D6CE0 | 765 | table public.foo: INSERT: id[integer]:1 k[integer]:1
 0/17D6D50 | 765 | COMMIT 765
(5 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have not actually consumed any data from the slot. The &lt;code&gt;pg_logical_slot_peek_changes&lt;/code&gt; function lets us look at the data that is yet to be replicated from the slot, without updating the location from which the next read would start. If we check the state of the slot, it should have the same values as before.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT slot_name, database, active, xmin, restart_lsn, confirmed_flush_lsn, synced FROM pg_replication_slots;
 slot_name | database | active | xmin | restart_lsn | confirmed_flush_lsn | synced
-----------+----------+--------+------+-------------+---------------------+--------
 test_slot | postgres | f      |      | 0/17ABBE0   | 0/17ABC18           | f
(1 row)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can read the changes from the WAL at &lt;code&gt;0/17ABBE0&lt;/code&gt; and advance the slot's &lt;code&gt;restart_lsn&lt;/code&gt; position with the &lt;code&gt;pg_logical_slot_get_changes&lt;/code&gt; function.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;pg_logical_slot_get_changes&lt;/code&gt; and &lt;code&gt;pg_logical_slot_peek_changes&lt;/code&gt; functions are similar: both share the same logic in the &lt;a href="https://github.com/postgres/postgres/blob/f5d07085822a144afb169d0f422e25689cb6209f/src/backend/replication/logical/logicalfuncs.c#L99C1-L100C1" rel="noopener noreferrer"&gt;pg_logical_slot_get_changes_guts&lt;/a&gt; function. The major difference is that &lt;code&gt;pg_logical_slot_peek_changes&lt;/code&gt; does not move the &lt;code&gt;restart_lsn&lt;/code&gt; position when it is done (&lt;code&gt;confirm=false&lt;/code&gt; in &lt;code&gt;pg_logical_slot_get_changes_guts&lt;/code&gt;). The condition check &lt;a href="https://github.com/postgres/postgres/blob/f5d07085822a144afb169d0f422e25689cb6209f/src/backend/replication/logical/logicalfuncs.c#L289" rel="noopener noreferrer"&gt;here&lt;/a&gt; determines whether &lt;code&gt;restart_lsn&lt;/code&gt; should be updated once reading from the WAL is done.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if (ctx-&amp;gt;reader-&amp;gt;EndRecPtr != InvalidXLogRecPtr &amp;amp;&amp;amp; confirm)
{
    LogicalConfirmReceivedLocation(ctx-&amp;gt;reader-&amp;gt;EndRecPtr);
    ReplicationSlotMarkDirty();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The actual update of &lt;code&gt;restart_lsn&lt;/code&gt; is done in the &lt;a href="https://github.com/postgres/postgres/blob/f5d07085822a144afb169d0f422e25689cb6209f/src/backend/replication/logical/logical.c#L1879" rel="noopener noreferrer"&gt;LogicalConfirmReceivedLocation&lt;/a&gt; function, when the next expected restart point is valid.&lt;br&gt;
If the slot's state is updated, the slot is marked dirty in &lt;code&gt;ReplicationSlotMarkDirty&lt;/code&gt; and is scheduled for flushing to disk at the next checkpoint. &lt;/p&gt;

&lt;p&gt;We can now consume from the slot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL);
    lsn    | xid |                         data
-----------+-----+------------------------------------------------------
 0/17B1550 | 764 | BEGIN 764
 0/17D3B50 | 764 | COMMIT 764
 0/17D6CE0 | 765 | BEGIN 765
 0/17D6CE0 | 765 | table public.foo: INSERT: id[integer]:1 k[integer]:1
 0/17D6D50 | 765 | COMMIT 765
(5 rows)
postgres=# SELECT slot_name, database, active, xmin, restart_lsn, confirmed_flush_lsn, synced FROM pg_replication_slots;
 slot_name | database | active | xmin | restart_lsn | confirmed_flush_lsn | synced
-----------+----------+--------+------+-------------+---------------------+--------
 test_slot | postgres | f      |      | 0/17ABD78   | 0/17D6E90           | f
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the &lt;code&gt;restart_lsn&lt;/code&gt; location has been updated to &lt;code&gt;0/17ABD78&lt;/code&gt;. All the changes have now been consumed and are reflected in the slot's state. There are no new changes at the new &lt;code&gt;restart_lsn&lt;/code&gt; location.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL);
 lsn | xid | data
-----+-----+------
(0 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  WAL Preservation
&lt;/h4&gt;

&lt;p&gt;The main purpose of replication slots is to prevent deletion of WAL that is still needed. Any WAL segment older than the one pointed to by &lt;code&gt;restart_lsn&lt;/code&gt; is scheduled for deletion during the next checkpoint cycle.&lt;/p&gt;
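&lt;p&gt;To see which segment file a slot is pinning, &lt;code&gt;restart_lsn&lt;/code&gt; can be mapped to a segment name with &lt;code&gt;pg_walfile_name&lt;/code&gt; (the exact name depends on your timeline and position):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT slot_name, restart_lsn, pg_walfile_name(restart_lsn) FROM pg_replication_slots;
 slot_name | restart_lsn |     pg_walfile_name
-----------+-------------+--------------------------
 test_slot | 0/17ABD78   | 000000010000000000000001
(1 row)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;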

&lt;p&gt;Let's check it out. At this point there is only one WAL segment file &lt;code&gt;000000010000000000000001&lt;/code&gt; with the inode number &lt;code&gt;516146&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ls -lhi $PGDATA/pg_wal
total 17M
516146 -rw------- 1 postgres postgres  16M Aug 28 17:37 000000010000000000000001
516118 drwx------ 2 postgres postgres 4.0K Aug 28 17:36 archive_status
516119 drwx------ 2 postgres postgres 4.0K Aug 28 17:36 summaries
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add more data to the &lt;code&gt;foo&lt;/code&gt; table to fill up the active WAL segment file. The &lt;code&gt;min_wal_size&lt;/code&gt; setting is at its default of 80MB.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# INSERT INTO foo SELECT i, i FROM generate_series(1, 247580) i;
INSERT 0 247580
postgres=# SELECT slot_name, database, active, xmin, restart_lsn, confirmed_flush_lsn, synced FROM pg_replication_slots;
 slot_name | database | active | xmin | restart_lsn | confirmed_flush_lsn | synced
-----------+----------+--------+------+-------------+---------------------+--------
 test_slot | postgres | f      |      | 0/17ABD78   | 0/17D6E90           | f
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The slot's state has not changed and &lt;code&gt;restart_lsn&lt;/code&gt; is still pointing to a location in the original WAL segment. Therefore, the &lt;code&gt;000000010000000000000001&lt;/code&gt; segment file will not be removed immediately even after a checkpoint.&lt;br&gt;
A new WAL segment file &lt;code&gt;000000010000000000000002&lt;/code&gt; has also been created&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ls -lhi $PGDATA/pg_wal
total 33M
516146 -rw------- 1 postgres postgres  16M Aug 28 17:43 000000010000000000000001
518594 -rw------- 1 postgres postgres  16M Aug 28 17:43 000000010000000000000002
516118 drwx------ 2 postgres postgres 4.0K Aug 28 17:42 archive_status
516119 drwx------ 2 postgres postgres 4.0K Aug 28 17:42 summaries
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In order for the original segment file to be deleted, we need to consume from the &lt;code&gt;test_slot&lt;/code&gt; slot using the &lt;code&gt;pg_logical_slot_get_changes&lt;/code&gt; function. With the &lt;code&gt;wal_recycle&lt;/code&gt; setting turned on, the &lt;code&gt;000000010000000000000001&lt;/code&gt; segment file will be recycled (its inode number remains the same) and renamed to &lt;code&gt;000000010000000000000003&lt;/code&gt;. New WAL records will then be stored in it. You can try it out to see this in action.&lt;/p&gt;
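&lt;p&gt;If the decoded changes themselves are not needed, the slot can also be fast-forwarded without returning them, using &lt;code&gt;pg_replication_slot_advance&lt;/code&gt;. This is just a sketch; you would not normally do it on a slot a real subscriber depends on, since the skipped changes are lost to that subscriber:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Move the slot up to the current WAL insert position
postgres=# SELECT pg_replication_slot_advance('test_slot', pg_current_wal_lsn());
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;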

&lt;h4&gt;
  
  
  Idle Slots and System Resources
&lt;/h4&gt;

&lt;p&gt;It is good practice to delete unused slots in order to release the resources they hold. A slot uses shared memory to store its &lt;a href="https://github.com/postgres/postgres/blob/da9f9f75e5ce27a45878ffa262156d18f0046188/src/include/replication/slot.h#L162" rel="noopener noreferrer"&gt;in-memory state&lt;/a&gt;. Unused slots consume memory and hold up the cleanup of old WAL files, leading to disk bloat. The situation can quickly get worse on a busy server: WAL files newer than &lt;code&gt;restart_lsn&lt;/code&gt; are preserved for as long as the slot's state is not updated. &lt;/p&gt;
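&lt;p&gt;As a safety net against unbounded growth, the &lt;code&gt;max_slot_wal_keep_size&lt;/code&gt; setting (Postgres 13 and later; unlimited by default) caps how much WAL slots may retain, and a slot that falls too far behind is invalidated instead of filling the disk. The value below is illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# postgresql.conf -- size the cap to your disk and to how far behind consumers may fall
max_slot_wal_keep_size = 10GB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;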

&lt;p&gt;After adding more data to the table, multiple WAL files are still stored on disk even after a checkpoint.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ls -lhi $PGDATA/pg_wal
total 161M
516146 -rw------- 1 postgres postgres  16M Aug 28 17:43 000000010000000000000001
518594 -rw------- 1 postgres postgres  16M Aug 28 17:44 000000010000000000000002
518596 -rw------- 1 postgres postgres  16M Aug 28 17:44 000000010000000000000003
518597 -rw------- 1 postgres postgres  16M Aug 28 17:44 000000010000000000000004
518598 -rw------- 1 postgres postgres  16M Aug 28 17:44 000000010000000000000005
518595 -rw------- 1 postgres postgres  16M Aug 28 17:44 000000010000000000000006
518599 -rw------- 1 postgres postgres  16M Aug 28 17:44 000000010000000000000007
518600 -rw------- 1 postgres postgres  16M Aug 28 17:44 000000010000000000000008
518601 -rw------- 1 postgres postgres  16M Aug 28 17:44 000000010000000000000009
518602 -rw------- 1 postgres postgres  16M Aug 28 17:44 00000001000000000000000A
516118 drwx------ 2 postgres postgres 4.0K Aug 28 17:42 archive_status
516119 drwx------ 2 postgres postgres 4.0K Aug 28 17:42 summaries
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The current active WAL segment is &lt;code&gt;00000001000000000000000A&lt;/code&gt; and none of the older segments have been deleted. The size of the &lt;code&gt;pg_wal&lt;/code&gt; directory will continue to grow unbounded if the slot remains inactive. To free up space, we would need to get rid of the slot.&lt;/p&gt;
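&lt;p&gt;Before dropping anything, the &lt;code&gt;pg_replication_slots&lt;/code&gt; view can also tell you how close a slot is to causing trouble: &lt;code&gt;wal_status&lt;/code&gt; shows whether the required WAL is still reserved, and &lt;code&gt;safe_wal_size&lt;/code&gt; shows how many more bytes can be written before the slot is at risk (both available since Postgres 13; &lt;code&gt;safe_wal_size&lt;/code&gt; is NULL unless &lt;code&gt;max_slot_wal_keep_size&lt;/code&gt; is set):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT slot_name, wal_status, safe_wal_size FROM pg_replication_slots;
 slot_name | wal_status | safe_wal_size
-----------+------------+---------------
 test_slot | reserved   |
(1 row)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;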

&lt;p&gt;In order to clearly show the space reclamation, turn off recycling of WALs (on by default)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ echo "wal_recycle = off" &amp;gt;&amp;gt; $PGDATA/postgresql.conf
$ pg_ctl -l logfile restart  # Assumes data directory is already set to PGDATA
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Drop the replication slot&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT pg_drop_replication_slot('test_slot');
 pg_drop_replication_slot
--------------------------

(1 row)
-- Request a checkpoint immediately to cleanup older WAL files.
postgres=# CHECKPOINT;
CHECKPOINT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The slot has been successfully deleted. The &lt;code&gt;pg_wal&lt;/code&gt; directory has also been freed of old WAL files&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ls -lhi $PGDATA/pg_wal
total 17M
518602 -rw------- 1 postgres postgres  16M Aug 28 17:52 00000001000000000000000A
516118 drwx------ 2 postgres postgres 4.0K Aug 28 17:48 archive_status
516119 drwx------ 2 postgres postgres 4.0K Aug 28 17:48 summaries
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The size of the &lt;code&gt;pg_wal&lt;/code&gt; directory is now 17MB, down from 161MB: roughly an 89% reduction. That makes a big difference, especially in a large database where WAL files fill up quickly. The &lt;code&gt;pg_replication_slots&lt;/code&gt; view can be used to quickly find and delete idle slots.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT slot_name, database, active, xmin, restart_lsn, confirmed_flush_lsn, synced FROM pg_replication_slots WHERE active = 'f';
 slot_name  | database | active | xmin | restart_lsn | confirmed_flush_lsn | synced
------------+----------+--------+------+-------------+---------------------+--------
 test_slot1 | postgres | f      |      | 0/A116D08   | 0/A116D40           | f
 test_slot2 | postgres | f      |      | 0/A116D40   | 0/A116D78           | f
(2 rows)

-- Delete based on how long the slot has been idle(e.g 5 minutes)
postgres=# SELECT slot_name, pg_drop_replication_slot(slot_name) FROM pg_replication_slots WHERE (extract(epoch from now() - inactive_since) / 60) &amp;gt; 5 AND active = 'f';
 slot_name  | pg_drop_replication_slot
------------+--------------------------
 test_slot1 |
 test_slot2 |
(2 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compared to other solutions, replication slots provide a smoother experience for handling replication and monitoring replication progress. &lt;/p&gt;

</description>
      <category>postgres</category>
      <category>replication</category>
      <category>slots</category>
      <category>wal</category>
    </item>
    <item>
      <title>HA Postgres with Patroni and Barman</title>
      <dc:creator>Brian Misachi</dc:creator>
      <pubDate>Tue, 19 Aug 2025 19:53:39 +0000</pubDate>
      <link>https://dev.to/misachi/ha-postgres-with-patroni-and-barman-ien</link>
      <guid>https://dev.to/misachi/ha-postgres-with-patroni-and-barman-ien</guid>
      <description>&lt;p&gt;In the last post we managed to set up a HA 3-node Postgres cluster. We set up physical replication where WAL records are streamed from the primary to be replayed on the standbys. We were also able to manually failover to one of the standbys when the primary was shutdown. A lot of manual processes were involved which can be time consuming, error prone and could potentially lead to data loss. &lt;/p&gt;

&lt;p&gt;This post will show how &lt;a href="https://patroni.readthedocs.io/en/latest/index.html" rel="noopener noreferrer"&gt;Patroni&lt;/a&gt; can help automate an HA Postgres cluster and manage failover when the cluster becomes unhealthy. &lt;a href="https://docs.pgbarman.org/release/3.14.1/index.html" rel="noopener noreferrer"&gt;Barman&lt;/a&gt; will be used for backup and recovery management. Taking it further, we will add monitoring of the cluster using tools like Grafana and Prometheus. Exciting!&lt;/p&gt;

&lt;p&gt;We will be using scripts from this &lt;a href="https://github.com/misachi/ha_postgres" rel="noopener noreferrer"&gt;repository&lt;/a&gt;. Clone the repository to your local filesystem and follow along. &lt;/p&gt;

&lt;p&gt;First create 3 nodes &lt;code&gt;testPG.1&lt;/code&gt;, &lt;code&gt;testPG.2&lt;/code&gt; and &lt;code&gt;testPG.3&lt;/code&gt;. You can create as many nodes (containers) as your host allows. The &lt;code&gt;run_pg.sh&lt;/code&gt; script can be used for this part. Each node created will have Postgres server version 18beta1 and Patroni installed. If this is the first time running the script, it will also create an ETCD v3 container node as the DCS for the cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ARCHIVE_DIR and BACKUP_DIR should be separate directories

$ export ARCHIVE_DIR=/path/to/archive # Replace
$ sudo chown -R 991:991 $ARCHIVE_DIR # Ensure Postgres user owns the directory
$ export BACKUP_DIR=/path/to/backup/data  # Replace
$ sudo chown -R 991:991 $BACKUP_DIR # Ensure Postgres user owns the directory
$ nohup ./run_pg.sh 1 5432 &amp;gt; /tmp/test.1 &amp;amp;  # nohup ./run_pg.sh &amp;lt;node number&amp;gt; &amp;lt;port&amp;gt; &amp;amp;
$ nohup ./run_pg.sh 2 5433 &amp;gt; /tmp/test.2 &amp;amp; # node 2
$ nohup ./run_pg.sh 3 5434 &amp;gt; /tmp/test.3 &amp;amp; # node 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. With the above commands we now have a 3-node cluster. Check the current state of the cluster and its members&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "patronictl -c patroni_config.yml list cluster1"
+ Cluster: cluster1 (7539602363152794219) --+----+-----------+----------------------+
| Member | Host       | Role    | State     | TL | Lag in MB | Tags                 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node1  | 172.17.0.3 | Leader  | running   |  2 |           | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node2  | 172.17.0.4 | Replica | streaming |  2 |         0 | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node3  | 172.17.0.5 | Replica | streaming |  2 |         0 | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Node1 (testPG.1) is the primary server. Nodes 2 and 3 are standbys, actively streaming changes from the primary. Since we have no data yet, neither standby is lagging behind the primary. &lt;/p&gt;

&lt;h3&gt;
  
  
  Failover
&lt;/h3&gt;

&lt;p&gt;When the cluster becomes unhealthy, e.g. the primary goes down for some reason, the patroni processes on each node coordinate and, by leader election, select a healthy standby node to become the new primary. This part rarely requires human intervention, unless something far out of the ordinary has happened, such as the DCS being down or an issue with the standbys.&lt;/p&gt;

&lt;p&gt;Let's bring down the current primary to see how patroni handles failover.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ CMD=`docker exec testPG.1 bash -c "ps -C patroni | grep patroni"` &amp;amp;&amp;amp; docker exec testPG.1 bash -c "kill `echo $CMD | awk '{print $1}'`"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We send a SIGTERM signal to allow the patroni process to cleanly shut down the Postgres server. Since this is the primary, Patroni performs a final checkpoint, flushing to disk any writes done since the last checkpoint. This can help reduce the amount of WAL to be replayed in recovery during the next restart.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ tail /tmp/test.1 -n 15
....
2025-08-18 11:34:37,128 INFO: no action. I am (node1), the leader with the lock
2025-08-18 11:34:40.692 UTC [40] LOG:  received fast shutdown request
2025-08-18 11:34:40.694 UTC [40] LOG:  aborting any active transactions
....
2025-08-18 11:34:40.696 UTC [44] LOG:  shutting down
2025-08-18 11:34:40.708 UTC [44] LOG:  checkpoint starting: shutdown immediate
2025-08-18 11:34:40.758 UTC [44] LOG:  checkpoint complete: wrote 0 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.051 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16383 kB, estimate=16383 kB; lsn=0/8000028, redo lsn=0/8000028
2025-08-18 11:34:40.822 UTC [40] LOG:  database system is shut down
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The node1 patroni process releases the leader lock, allowing other nodes to compete for it. The old primary is removed from the cluster's member list. If we check the current list we get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "patronictl -c patroni_config.yml list cluster1"
+ Cluster: cluster1 (7539602363152794219) --+----+-----------+----------------------+
| Member | Host       | Role    | State     | TL | Lag in MB | Tags                 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node2  | 172.17.0.4 | Leader  | running   |  3 |           | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node3  | 172.17.0.5 | Replica | streaming |  3 |         0 | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice &lt;code&gt;node1&lt;/code&gt; is gone and we have a new leader, &lt;code&gt;node2&lt;/code&gt;. We can then proceed to fix node1 before we bring it back to the cluster.&lt;/p&gt;

&lt;p&gt;Once the patroni process on each node detects the change in the cluster (a missing primary), the first healthy node to acquire the leader lock becomes the new primary while the remaining standbys begin to follow the new leader. In this case node2 (testPG.2) becomes the new primary while node3 (testPG.3) begins to follow it.&lt;/p&gt;

&lt;p&gt;Now you can send a SIGKILL signal to terminate the node1 patroni process&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ CMD=`docker exec testPG.1 bash -c "ps -C patroni | grep patroni"` &amp;amp;&amp;amp; docker exec testPG.1 bash -c "kill -9 `echo $CMD | awk '{print $1}'`"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The old primary can be put back into the cluster after being fixed. When this is done, it first attempts to acquire the leader lock; on detecting that the lock is already held by another server, it quickly transitions to a standby and follows the current leader.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ nohup ./run_pg.sh 1 &amp;gt; /tmp/test.1 &amp;amp;  # Bring old primary back up
....
2025-08-18 13:31:03,430 INFO: Lock owner: node2; I am node1
2025-08-18 13:31:03,430 INFO: establishing a new patroni heartbeat connection to postgres
cp: cannot stat '/home/postgres/.tmp/00000004.history': No such file or directory
2025-08-18 13:31:03.432 UTC [970] LOG:  waiting for WAL to become available at 0/9000018
2025-08-18 13:31:03,491 INFO: no action. I am (node1), a secondary, and following a leader (node2)
cp: cannot stat '/home/postgres/.tmp/000000030000000000000009': No such file or directory
2025-08-18 13:31:08.433 UTC [996] LOG:  started streaming WAL from primary at 0/9000000 on timeline 3
2025-08-18 13:31:14,023 INFO: no action. I am (node1), a secondary, and following a leader (node2)
2025-08-18 13:31:23,934 INFO: no action. I am (node1), a secondary, and following a leader (node2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now &lt;code&gt;node1&lt;/code&gt; has been added back to the cluster, as a standby.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "patronictl -c patroni_config.yml list cluster1"
+ Cluster: cluster1 (7539602363152794219) --+----+-----------+----------------------+
| Member | Host       | Role    | State     | TL | Lag in MB | Tags                 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node1  | 172.17.0.3 | Replica | streaming |  3 |         0 | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node2  | 172.17.0.4 | Leader  | running   |  3 |           | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node3  | 172.17.0.5 | Replica | streaming |  3 |         0 | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  SwitchOver
&lt;/h3&gt;

&lt;p&gt;In order to promote &lt;code&gt;node1&lt;/code&gt; back to the primary role we can use the &lt;code&gt;switchover&lt;/code&gt; command, given that this is a healthy cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "patronictl -c patroni_config.yml switchover cluster1 --leader node2 --candidate node1 --force"  # Using force to skip the prompts from patroni
Current cluster topology
+ Cluster: cluster1 (7539602363152794219) --+----+-----------+----------------------+
| Member | Host       | Role    | State     | TL | Lag in MB | Tags                 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node1  | 172.17.0.3 | Replica | streaming |  3 |         0 | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node2  | 172.17.0.4 | Leader  | running   |  3 |           | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node3  | 172.17.0.5 | Replica | streaming |  3 |         0 | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
2025-08-18 14:19:31.73068 Successfully switched over to "node1"
+ Cluster: cluster1 (7539602363152794219) ------------+----+-----------+----------------------+
| Member | Host       | Role    | State               | TL | Lag in MB | Tags                 |
+--------+------------+---------+---------------------+----+-----------+----------------------+
| node1  | 172.17.0.3 | Leader  | running             |  3 |           | clonefrom: true      |
|        |            |         |                     |    |           | failover_priority: 1 |
+--------+------------+---------+---------------------+----+-----------+----------------------+
| node2  | 172.17.0.4 | Replica | stopping            |    |   unknown | clonefrom: true      |
|        |            |         |                     |    |           | failover_priority: 1 |
+--------+------------+---------+---------------------+----+-----------+----------------------+
| node3  | 172.17.0.5 | Replica | in archive recovery |  3 |         0 | clonefrom: true      |
|        |            |         |                     |    |           | failover_priority: 1 |
+--------+------------+---------+---------------------+----+-----------+----------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cluster is still in a transitional state at this point.&lt;br&gt;
The switchover process, as seen from the old primary's (node2) logs, looks like this. The output is truncated for clarity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ tail /tmp/test.2 -n 100
2025-08-18 14:19:28,597 INFO: received switchover request with leader=node2 candidate=node1 scheduled_at=None
2025-08-18 14:19:28,608 INFO: Got response from node1
....
2025-08-18 14:19:28,658 INFO: Lock owner: node2; I am node2
2025-08-18 14:19:28,759 INFO: switchover: demoting myself
2025-08-18 14:19:28,759 INFO: Demoting self (graceful)
2025-08-18 14:19:28.764 UTC [39] LOG:  checkpoint starting: immediate force wait
....
2025-08-18 14:19:31,033 INFO: Leader key released
2025-08-18 14:19:31,082 INFO: Lock owner: node1; I am node2
2025-08-18 14:19:31,082 INFO: switchover: demote in progress
....
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cluster is back to a consistent state, with the original leader &lt;code&gt;node1&lt;/code&gt; restored and &lt;code&gt;node2&lt;/code&gt; and &lt;code&gt;node3&lt;/code&gt; as followers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "patronictl -c patroni_config.yml list cluster1"
+ Cluster: cluster1 (7539602363152794219) --+----+-----------+----------------------+
| Member | Host       | Role    | State     | TL | Lag in MB | Tags                 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node1  | 172.17.0.3 | Leader  | running   |  4 |           | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node2  | 172.17.0.4 | Replica | streaming |  4 |         0 | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node3  | 172.17.0.5 | Replica | streaming |  4 |         0 | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since the leader has changed multiple times, it is worth knowing that Patroni also provides a way to view leadership changes within the cluster&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "patronictl -c patroni_config.yml history cluster1"
+----+-----------+------------------------------+----------------------------------+------------+
| TL |       LSN | Reason                       | Timestamp                        | New Leader |
+----+-----------+------------------------------+----------------------------------+------------+
|  1 | 117440672 | no recovery target specified | 2025-08-18T09:23:50.986078+00:00 | node1      |
|  2 | 134217888 | no recovery target specified | 2025-08-18T11:34:45.003395+00:00 | node2      |
|  3 | 167772320 | no recovery target specified | 2025-08-18T14:19:31.253098+00:00 | node1      |
+----+-----------+------------------------------+----------------------------------+------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result shows three leadership changes between &lt;code&gt;node1&lt;/code&gt; and &lt;code&gt;node2&lt;/code&gt;, along with when each change occurred. &lt;/p&gt;
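&lt;p&gt;Note that the &lt;code&gt;LSN&lt;/code&gt; column in the history output is printed as a plain decimal number, while Postgres logs and views such as &lt;code&gt;pg_stat_replication&lt;/code&gt; use the hexadecimal &lt;code&gt;high/low&lt;/code&gt; &lt;code&gt;pg_lsn&lt;/code&gt; notation. Converting between the two is simple arithmetic; a minimal sketch (the helper name is mine, not part of patronictl):&lt;/p&gt;

```shell
#!/bin/bash
# Convert a decimal LSN (as printed by `patronictl history`) to the
# hexadecimal high/low pg_lsn notation Postgres uses elsewhere.
lsn_to_pg() {
  printf '%X/%X\n' $(( $1 / 4294967296 )) $(( $1 % 4294967296 ))  # split at 2^32
}

lsn_to_pg 117440672   # the timeline 1 switch point from the history above
```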

&lt;p&gt;Patroni is an extremely handy solution for managing a highly available Postgres cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Backups
&lt;/h2&gt;

&lt;p&gt;A truly HA solution requires data to be backed up in a separate location, so that if you lose data or cannot fully recover from the cluster itself, the backups can be used for recovery instead. The backup solution you pick depends on your business requirements.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Any solution for backups should at least use the &lt;code&gt;pg_basebackup&lt;/code&gt; utility. Remember that &lt;code&gt;pg_dump&lt;/code&gt; and &lt;code&gt;pg_dumpall&lt;/code&gt; are not backup tools&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Barman is an easy-to-use tool for managing backups. To set it up, first create the required roles by executing the queries below on the primary node&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "/usr/local/pgsql/bin/psql -U patroni_super -d postgres -c \"CREATE USER streaming_barman WITH REPLICATION ENCRYPTED PASSWORD 'streaming_barman'; CREATE USER barman WITH SUPERUSER ENCRYPTED PASSWORD 'barman';\""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run the &lt;code&gt;./barman.sh&lt;/code&gt; script. When it completes, a new container will be running with Barman installed in it. The script configures backups for the primary node. The &lt;code&gt;BACKUP_DIR&lt;/code&gt; directory is shared between the Barman container and the primary node (node1) container via a Docker volume. We'll later use &lt;code&gt;BACKUP_DIR&lt;/code&gt; to recover the primary node from Barman's backups. &lt;/p&gt;

&lt;p&gt;Check the current state of Barman.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -t barman bash -c "source /var/lib/barman/.bashrc &amp;amp;&amp;amp; barman check node1"
Server node1:
        PostgreSQL: OK
        superuser or standard user with backup privileges: OK
        PostgreSQL streaming: OK
        wal_level: OK
        replication slot: OK
        directories: OK
        retention policy settings: OK
        backup maximum age: OK (no last_backup_maximum_age provided)
        backup minimum size: OK (22.9 MiB)
        wal maximum age: OK (no last_wal_maximum_age provided)
        wal size: OK (0 B)
        compression settings: OK
        failed backups: OK (there are 0 failed backups)
        minimum redundancy requirements: OK (have 1 non-incremental backups, expected at least 0)
        pg_basebackup: OK
        pg_basebackup compatible: OK
        pg_basebackup supports tablespaces mapping: OK
        systemid coherence: OK
        pg_receivexlog: OK
        pg_receivexlog compatible: OK
        receive-wal running: OK
        archive_mode: OK
        archive_command: OK
        continuous archiving: OK
        archiver errors: OK
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything seems to be working as expected.&lt;/p&gt;

&lt;p&gt;Check whether any backups exist. The &lt;code&gt;barman.sh&lt;/code&gt; script creates an initial backup before setup completes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -t barman bash -c "source /var/lib/barman/.bashrc &amp;amp;&amp;amp; barman list-backups node1"
node1 20250818T183402 'first-backup' - F - Mon Aug 18 18:34:04 2025 - Size: 38.9 MiB - WAL Size: 0 B - WAITING_FOR_WALS
node1 20250818T183028 'first-backup' - F - Mon Aug 18 18:30:30 2025 - Size: 54.9 MiB - WAL Size: 16.0 MiB - WAITING_FOR_WALS
node1 20250817T174259 'first-backup' - F - Sun Aug 17 17:43:01 2025 - Size: 54.9 MiB - WAL Size: 16.0 MiB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
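&lt;p&gt;The second field of each line is the backup ID, which can be passed to commands such as &lt;code&gt;barman recover&lt;/code&gt; in place of the &lt;code&gt;latest&lt;/code&gt; shortcut. A sketch of pulling the most recent ID out of such a listing (the listing is inlined here as trimmed sample text; normally you would pipe the barman command itself):&lt;/p&gt;

```shell
#!/bin/bash
# Pick the most recent backup ID from `barman list-backups` style output.
listing="node1 20250818T183402 first-backup - F - Mon Aug 18 18:34:04 2025
node1 20250818T183028 first-backup - F - Mon Aug 18 18:30:30 2025
node1 20250817T174259 first-backup - F - Sun Aug 17 17:43:01 2025"

# The second field is the backup ID; IDs are timestamps, so a lexical
# sort is also a chronological sort.
latest=$(echo "$listing" | awk '{print $2}' | sort -r | head -n 1)
echo "$latest"
```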



&lt;p&gt;Also, creating a new backup is as easy as running the command below. All backups are stored under &lt;code&gt;/var/lib/barman/node1/base&lt;/code&gt; in the Barman node (container).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -t barman bash -c "source /var/lib/barman/.bashrc &amp;amp;&amp;amp; barman backup --name first-backup node1"
Starting backup using postgres method for server node1 in /var/lib/barman/node1/base/20250818T183402
Backup start at LSN: 0/C0000C8 (00000004000000000000000C, 000000C8)
Starting backup copy via pg_basebackup for 20250818T183402
Copy done (time: 1 second)
Finalising the backup.
....
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
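&lt;p&gt;The name in parentheses next to the start LSN is the WAL segment file that contains it. Segment filenames encode the timeline and a segment number derived from the LSN. A sketch of that mapping, assuming the default 16 MiB segment size (the helper name is mine):&lt;/p&gt;

```shell
#!/bin/bash
# Map (timeline, LSN) to the WAL segment filename that holds that LSN,
# assuming the default 16 MiB segment size.
wal_segment_name() {
  local tli=$1 lsn=$2
  local hi=$(( 16#${lsn%%/*} )) lo=$(( 16#${lsn##*/} ))
  local segno=$(( (hi * 4294967296 + lo) / 16777216 ))   # total bytes / 16 MiB
  # Filename layout: 8 hex digits each for timeline, high and low segment id
  printf '%08X%08X%08X\n' "$tli" $(( segno / 256 )) $(( segno % 256 ))
}

wal_segment_name 4 0/C0000C8   # the start LSN of the backup above
```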



&lt;h3&gt;
  
  
  Restoring from backup
&lt;/h3&gt;

&lt;p&gt;A backup is only valid after you have tested it and confirmed you can recover from it. We can test this on the primary node (node1).&lt;/p&gt;

&lt;p&gt;Create a new table &lt;code&gt;foo&lt;/code&gt; and insert 100 records into it&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "/usr/local/pgsql/bin/psql -U patroni_super
-d postgres -c \"CREATE TABLE foo(id SERIAL PRIMARY KEY, k INT NOT NULL); INSERT INTO foo(k) SELECT i FROM generate_series(1, 100) i;
\""
CREATE TABLE
INSERT 0 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run a new backup with barman&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -t barman bash -c "source /var/lib/barman/.bashrc &amp;amp;&amp;amp; barman backup --name first-backup node1"
Starting backup using postgres method for server node1 in /var/lib/barman/node1/base/20250818T185835
Backup start at LSN: 0/E02A6F0 (00000004000000000000000E, 0002A6F0)
Starting backup copy via pg_basebackup for 20250818T185835
Copy done (time: 7 seconds)
Finalising the backup.
Backup size: 22.9 MiB
Backup end at LSN: 0/10000060 (000000040000000000000010, 00000060)
Backup completed (start time: 2025-08-18 18:58:35.575776, elapsed time: 7 seconds)
Processing xlog segments from streaming for node1 (batch size: 3)
        00000004000000000000000E
        00000004000000000000000F
        000000040000000000000010
Processing xlog segments from file archival for node1
        00000004000000000000000E
        00000004000000000000000F
        00000004000000000000000F.00000028.backup
        000000040000000000000010
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's put the cluster in maintenance mode to prevent automatic failover. With that done, we can shut down the primary node and delete its data directory. We'll then use Barman to recover the data (remember the &lt;code&gt;BACKUP_DIR&lt;/code&gt; we used earlier).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "patronictl -c patroni_config.yml pause cluster1 --wait"
'pause' request sent, waiting until it is recognized by all nodes
Success: cluster management is paused
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cluster is paused.&lt;/p&gt;

&lt;p&gt;Stop the Postgres server and remove the data directory&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "/usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data/ stop"
waiting for server to shut down.... done
server stopped
$ docker exec testPG.1 bash -c "rm -rf /usr/local/pgsql/data/*"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;BACKUP_DIR&lt;/code&gt; directory should now be empty&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ls -la $BACKUP_DIR
total 8
drwx------ 2     991     991 4096 Aug 18 22:35 .
drwxrwxrwx 4 vagrant vagrant 4096 Aug 17 20:01 ..
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's restore the data and test whether the primary can fully recover when we bring it back up. We use the &lt;code&gt;latest&lt;/code&gt; backup we have from Barman.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -t barman bash -c "source /var/lib/barman/.bashrc &amp;amp;&amp;amp; barman cron &amp;amp;&amp;amp; barman recover node1 latest /home/postgres/.backup"
Starting WAL archiving for server node1
Starting streaming archiver for server node1
Starting check-backup for backup 20250818T183402 of server node1
Processing xlog segments from file archival for node1
        000000020000000000000008
Starting local restore for server node1 using backup 20250818T185835
Destination directory: /home/postgres/.backup
Copying the base backup.
Copying required WAL segments.
Generating archive status files
Identify dangerous settings in destination directory.

IMPORTANT
These settings have been modified to prevent data losses

postgresql.conf line 4: archive_command = false
postgresql.conf line 27: recovery_target = None
postgresql.conf line 28: recovery_target_lsn = None
postgresql.conf line 29: recovery_target_name = None
postgresql.conf line 30: recovery_target_time = None
postgresql.conf line 31: recovery_target_timeline = None
postgresql.conf line 32: recovery_target_xid = None
....
Restore operation completed (start time: 2025-08-18 19:42:46.109241+00:00, elapsed time: less than one second)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The backup has been restored to the data location of the primary server. Now we can restart the primary server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ nohup ./run_pg.sh 1 &amp;gt; /tmp/test.1 &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you get a "Postgresql is not running." warning, start the server with&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "/usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data/ start"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The primary is up and is still the leader of the cluster. Let's check if our data is intact.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "/usr/local/pgsql/bin/psql -U patroni_super -d postgres -c \"SELECT count(*) FROM foo;\""
 count
-------
   100
(1 row)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yes. We have recovered all the data from the backups.&lt;/p&gt;

&lt;p&gt;But we are still in a maintenance state.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "patronictl -c patroni_config.yml list cluster1"
+ Cluster: cluster1 (7539602363152794219) --+----+-----------+----------------------+
| Member | Host       | Role    | State     | TL | Lag in MB | Tags                 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node1  | 172.17.0.3 | Leader  | running   |  4 |           | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node2  | 172.17.0.4 | Replica | streaming |  4 |         0 | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node3  | 172.17.0.5 | Replica | streaming |  4 |         0 | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
 Maintenance mode: on
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Resume the cluster to exit the maintenance state&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "patronictl -c patroni_config.yml resume cluster1 --wait"
'resume' request sent, waiting until it is recognized by all nodes
Success: cluster management is resumed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now automatic failover is back in place and any one of the healthy standbys can replace the primary when needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring
&lt;/h2&gt;

&lt;p&gt;Finally, you can add monitoring and visualize the cluster by running the script &lt;code&gt;./grafana.sh&lt;/code&gt; and following the steps outlined &lt;a href="https://github.com/misachi/ha_postgres?tab=readme-ov-file#ha_postgres" rel="noopener noreferrer"&gt;here&lt;/a&gt;. &lt;br&gt;
You'll need to create a new role for the &lt;a href="https://github.com/prometheus-community/postgres_exporter" rel="noopener noreferrer"&gt;Postgres exporter&lt;/a&gt; to use when exporting metrics from Postgres to Prometheus. The exporter is configured to expose only the default metrics. If additional metrics are needed, check the &lt;a href="https://github.com/prometheus-community/postgres_exporter" rel="noopener noreferrer"&gt;flags&lt;/a&gt; for enabling some of the disabled collectors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "/usr/local/pgsql/bin/psql -U patroni_super -d postgres -c \"CREATE USER prom_pg_exporter WITH SUPERUSER ENCRYPTED PASSWORD 'prom_pg_exporter';\""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;grafana.sh&lt;/code&gt; script installs both Grafana for visualization and Prometheus for collecting and storing metrics from Postgres. You can add as many dashboards as you see fit for your use.&lt;/p&gt;
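&lt;p&gt;Under the hood, Prometheus needs a scrape job pointing at the exporter. A minimal sketch of such a job (the job name and target address are illustrative; 9187 is the exporter's default listen port, so check what &lt;code&gt;grafana.sh&lt;/code&gt; actually configures):&lt;/p&gt;

```yaml
# prometheus.yml (fragment): scrape the postgres_exporter periodically.
scrape_configs:
  - job_name: "postgres"
    scrape_interval: 15s
    static_configs:
      - targets: ["172.17.0.3:9187"]   # illustrative address; 9187 is the exporter default
```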

</description>
      <category>postgres</category>
      <category>highavailability</category>
      <category>patroni</category>
      <category>barman</category>
    </item>
    <item>
      <title>High Availability Postgres</title>
      <dc:creator>Brian Misachi</dc:creator>
      <pubDate>Tue, 19 Aug 2025 19:52:53 +0000</pubDate>
      <link>https://dev.to/misachi/high-availability-postgres-4ahe</link>
      <guid>https://dev.to/misachi/high-availability-postgres-4ahe</guid>
<description>&lt;p&gt;High Availability (HA) refers to a system's ability to be accessible to its users 100 percent of the time (or as close to 100 as possible). It is usually achieved through proper redundancy: when a component in the system fails, and failover is correctly implemented, another healthy component can replace the faulty one and continue servicing requests with little to no human intervention. An HA system can be implemented with a single Primary (leader) component and other Secondary components (followers) that provide the required redundancy. At any given time, one of the secondary components can take the place of the primary if the system requires it. &lt;/p&gt;

&lt;p&gt;This post will focus on building an HA cluster with Postgres. The HA solutions offered by Postgres can be categorized into synchronous and asynchronous. In a synchronous solution, when the database receives a commit command from a user, it writes the changes made up to that point to its own permanent storage, then waits for confirmation from one or more configured standbys that the data has been flushed to their own permanent storage before returning success to the user. In an asynchronous solution, the database returns success once the data has been flushed to its own permanent storage only; data is streamed to standby servers at a later time, so there is a window in which data can be lost if the primary fails. &lt;/p&gt;
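&lt;p&gt;In Postgres, the choice between the two modes is configuration rather than architecture. A minimal sketch of the settings involved (the standby names and quorum are illustrative):&lt;/p&gt;

```conf
# postgresql.conf (fragment)
# Wait for at least one of the named standbys to confirm the flush before
# a COMMIT returns; leave this empty ('') for asynchronous replication.
synchronous_standby_names = 'FIRST 1 (node2, node3)'
synchronous_commit = on   # 'remote_apply', 'remote_write' and 'local' trade safety for latency
```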

&lt;p&gt;The way data is moved between the primary and the standbys is through the process of replication. Replication can be achieved synchronously or asynchronously. There are two types of replication available in Postgres.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.postgresql.org/docs/current/warm-standby.html#STREAMING-REPLICATION" rel="noopener noreferrer"&gt;Physical replication&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.postgresql.org/docs/current/logical-replication.html" rel="noopener noreferrer"&gt;Logical replication&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The main difference between the two is that in physical replication the exact WAL records, at the exact WAL locations, are streamed to standbys. Physical replication is more performant than logical replication since the WAL data is streamed as-is, with no modifications. Logical replication deals with publishers and subscribers: modifications to data are formatted (as a change to a row with the DML type, the old row value and the new row value) and streamed to subscribers, which pull data from the publications they are subscribed to. The biggest advantage of logical replication is the ability to work across major versions of Postgres. &lt;/p&gt;

&lt;p&gt;This post will focus on asynchronous (physical) replication. The secondary servers will be operated as hot standbys, meaning they can accept read queries and, when healthy, are candidates to replace the primary server. Docker containers will be used for isolation (mimicking different physical nodes) for all servers; each container represents a separate node. All the containers will be running Postgres version 18beta1 compiled from source. &lt;/p&gt;

&lt;p&gt;Create the Postgres image by running the &lt;code&gt;./build.sh&lt;/code&gt; script below&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#! /bin/bash

set -ex

IMG=postgres/test-0.0.1
IMG_ID=`docker images ${IMG} -q`
PG_TAG=REL_18_BETA1

if [ "${IMG_ID}" = "" ]; then
  if [ ! -d "postgres-${PG_TAG}" ]; then
    wget https://github.com/postgres/postgres/archive/refs/tags/${PG_TAG}.tar.gz &amp;amp;&amp;amp; tar -xzf ${PG_TAG}.tar.gz
  fi
  ID=991
  USR=postgres
  USR_HOME=/home/postgres

  cat &amp;gt; Dockerfile &amp;lt;&amp;lt; EOF
FROM ubuntu:latest
RUN groupadd -g ${ID} ${USR} &amp;amp;&amp;amp; useradd -r -u ${ID} -g ${USR} ${USR}
ADD postgres-${PG_TAG} ${USR_HOME}
WORKDIR ${USR_HOME}
RUN chown -R ${USR}:${USR} ${USR_HOME}
RUN apt-get update &amp;amp;&amp;amp; apt-get install -y g++ zlib1g-dev make curl tar gzip perl liblz4-dev libreadline-dev flex bison libicu-dev liburing-dev
RUN apt-get install --reinstall -y pkg-config &amp;amp;&amp;amp; ./configure  --with-liburing --enable-debug --with-lz4 &amp;amp;&amp;amp; make -j4 &amp;amp;&amp;amp; make all &amp;amp;&amp;amp; make install
RUN echo "export PATH=/usr/local/pgsql/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" &amp;gt;&amp;gt; /etc/bash.bashrc &amp;amp;&amp;amp; \
      chown -R ${USR}:${USR} /usr/local/pgsql
USER ${USR}
EOF

  docker build -t ${IMG}:latest .
  rm Dockerfile
  rm -rf postgres-${PG_TAG} ${PG_TAG}.tar.gz
else
  echo "Image ${IMG} already exists with ID ${IMG_ID}"
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the image built, create the first container for running the primary. We will call it &lt;code&gt;testPG.1&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ export ARCHIVE_DIR=~/mydir # Replace
$ mkdir -p $ARCHIVE_DIR
$ sudo chown -R 991:991 $ARCHIVE_DIR # Ensure Postgres user owns the directory
$ docker run -d --name testPG.1 -p 5432:5432 --mount type=bind,source=${ARCHIVE_DIR},target=/home/postgres/.tmp --restart=on-failure postgres/test-0.0.1:latest bash -c 'tail /dev/null -f'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Remember to update &lt;code&gt;ARCHIVE_DIR&lt;/code&gt; to a valid directory in your local filesystem.&lt;/p&gt;

&lt;p&gt;Once the first node is up and running, a few configuration settings need to be changed&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -it testPG.1 bash  # Attach to a tty so commands can be executed inside the container
postgres@e67b763ec2fb:~$ echo "export PGDATA=/usr/local/pgsql/data" &amp;gt;&amp;gt; ~/.bashrc &amp;amp;&amp;amp; source ~/.bashrc  
postgres@e67b763ec2fb:~$ pg_ctl -l logfile initdb  # Only run this if the data directory does not already exist
postgres@e67b763ec2fb:~$ echo "host    replication repl 172.17.0.0/16 md5" &amp;gt;&amp;gt; $PGDATA/pg_hba.conf  # Add replication user to the `pg_hba.conf` file. Replace address 172.17.0.0 appropriately. Use the command "ifconfig docker0 | grep -w "inet" | awk '{print $2}' | sed 's/[0-9].[0-9]$/0.0/'" to get right interface address on your system
postgres@e67b763ec2fb:~$ cat &amp;gt;&amp;gt; $PGDATA/postgresql.conf &amp;lt;&amp;lt; EOF
wal_level = replica  # Allow physical replication
archive_mode = on  # Turn on WAL archiving
archive_command = 'test ! -f /home/postgres/.tmp/%f &amp;amp;&amp;amp; cp %p /home/postgres/.tmp/%f'  # Send old WAL files to archive location
restore_command = 'cp /home/postgres/.tmp/%f %p'  # Restore WALs from archive
wal_keep_size = 512  # Amount of WALs in MBs to keep before removal
listen_addresses = '*'  # Allow all. Not safe at all
EOF
postgres@e67b763ec2fb:~$ pg_ctl -l logfile start
postgres@e67b763ec2fb:~$ psql -c "CREATE USER repl WITH REPLICATION ENCRYPTED PASSWORD 'repl';" 
postgres@e67b763ec2fb:~$ exit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
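&lt;p&gt;The &lt;code&gt;archive_command&lt;/code&gt; above is worth unpacking: Postgres substitutes &lt;code&gt;%p&lt;/code&gt; with the path of the WAL segment to archive and &lt;code&gt;%f&lt;/code&gt; with its filename, and the &lt;code&gt;test ! -f&lt;/code&gt; guard makes the command fail rather than overwrite a segment that was already archived. A local simulation of that behaviour, using temporary directories instead of the real WAL and archive locations (the function name is mine):&lt;/p&gt;

```shell
#!/bin/bash
# Simulate: archive_command = 'test ! -f ARCHIVE/%f && cp %p ARCHIVE/%f'
wal_dir=$(mktemp -d)       # stands in for $PGDATA/pg_wal
archive_dir=$(mktemp -d)   # stands in for /home/postgres/.tmp

seg="000000010000000000000001"
echo "fake wal data" > "$wal_dir/$seg"

archive_wal() {   # a non-zero exit tells Postgres the segment was NOT archived
  test ! -f "$archive_dir/$(basename "$1")" && cp "$1" "$archive_dir/$(basename "$1")"
}

archive_wal "$wal_dir/$seg" && echo "archived $seg"
archive_wal "$wal_dir/$seg" || echo "refused: $seg already in the archive"
```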



&lt;p&gt;Now our soon-to-be primary node should be up and running.&lt;/p&gt;

&lt;p&gt;Next, create another container as the second node. We will call it &lt;code&gt;testPG.2&lt;/code&gt;. Remember to use the same &lt;code&gt;ARCHIVE_DIR&lt;/code&gt; location.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ export ARCHIVE_DIR=/mydir
$ docker run -d --name testPG.2 -p 5433:5432 --mount type=bind,source=${ARCHIVE_DIR},target=/home/postgres/.tmp --restart=on-failure postgres/test-0.0.1:latest bash -c 'tail /dev/null -f'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the &lt;code&gt;testPG.2&lt;/code&gt; node, remove the data directory if it already exists, then use the &lt;code&gt;pg_basebackup&lt;/code&gt; tool to copy data from the primary server (testPG.1) and begin the replication process.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ echo "export PGDATA=/usr/local/pgsql/data" &amp;gt;&amp;gt; ~/.bashrc &amp;amp;&amp;amp; source ~/.bashrc
$ docker exec testPG.2 bash -c "/usr/local/pgsql/bin/pg_ctl -D $PGDATA -l logfile stop &amp;amp;&amp;amp; rm -rf $PGDATA"

# If you get a "pg_ctl: directory "/usr/local/pgsql/data" does not exist" message, that is fine. The directory hasn't yet been created

$ SERV_IP=`docker inspect --format='{{ .NetworkSettings.IPAddress }}' testPG.1` &amp;amp;&amp;amp; docker exec -u root testPG.2 bash -c "echo '$SERV_IP:5432:replication:repl:repl' &amp;gt; /home/postgres/.pgpass &amp;amp;&amp;amp; chmod 0600 /home/postgres/.pgpass &amp;amp;&amp;amp; chown postgres /home/postgres/.pgpass"

$ SERV_IP=`docker inspect --format='{{ .NetworkSettings.IPAddress }}' testPG.1` &amp;amp;&amp;amp; docker exec -t testPG.2 bash -c "/usr/local/pgsql/bin/pg_basebackup -D $PGDATA -h $SERV_IP -p 5432 -Xs -R -P -U repl &amp;amp;&amp;amp; /usr/local/pgsql/bin/pg_ctl -D $PGDATA -l logfile start"
23724/23724 kB (100%), 1/1 tablespace
waiting for server to start.... done
server started 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
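&lt;p&gt;The &lt;code&gt;.pgpass&lt;/code&gt; line written above follows Postgres's standard password-file format, one entry per line (the address shown here is illustrative):&lt;/p&gt;

```conf
# ~/.pgpass format: hostname:port:database:username:password
# The literal word `replication` in the database field matches physical
# replication connections, which is what pg_basebackup and the
# walreceiver open.
172.17.0.3:5432:replication:repl:repl
```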



&lt;p&gt;The server immediately goes into standby mode and starts receiving WAL records from the primary (testPG.1). Check the processes running under testPG.1&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "ps -faux | grep postgres"
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
postgres     276  0.0  0.4 208844 24484 ?        Ss   17:46   0:00 /usr/local/pgsql/bin/postgres
....
postgres     285  0.0  0.0 208976  4384 ?        Ss   17:46   0:00  \_ postgres: archiver last was 000000010000000000000005.00000060.backup
postgres     286  0.0  0.1 210424  5248 ?        Ss   17:46   0:00  \_ postgres: logical replication launcher
postgres     319  0.0  0.1 210684  9256 ?        Ss   17:48   0:00  \_ postgres: walsender repl 172.17.0.1(47148) streaming 0/6000168
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is a single WAL sender process streaming records to the standby server.&lt;br&gt;
In the &lt;code&gt;testPG.2&lt;/code&gt; node, a WAL receiver process is running which receives streamed WAL records from the primary and re-applies the changes on the standby server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.2 bash -c "ps -faux | grep postgres"
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
....
postgres      31  0.0  0.4 208844 24588 ?        Ss   17:48   0:00 /usr/local/pgsql/bin/postgres
....
postgres      37  0.0  0.1 209596  6040 ?        Ss   17:48   0:00  \_ postgres: startup recovering 000000010000000000000006
postgres      46  0.0  0.0 209452  4800 ?        Ss   17:48   0:00  \_ postgres: walreceiver streaming 0/6000168
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Currently, we have one primary server and one standby server. The primary services both read and write requests while the standby is limited to read-only requests. Read requests are mostly SELECT-like queries from users, while write requests involve modifications to data (DDL, DML queries).&lt;/p&gt;

&lt;p&gt;We can easily add more standbys to our cluster as needed, to help with load distribution and to reduce the chances of a service outage should both servers (primary and standby) become unavailable at the same time. Adding another standby (testPG.3) is straightforward and similar to what we did when setting up &lt;code&gt;testPG.2&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker run -d --name testPG.3 -p 5434:5432 --mount type=bind,source=${ARCHIVE_DIR},target=/home/postgres/.tmp --restart=on-failure postgres/test-0.0.1:latest bash -c 'tail /dev/null -f'

$ SERV_IP=`docker inspect --format='{{ .NetworkSettings.IPAddress }}' testPG.1` &amp;amp;&amp;amp; docker exec -u root testPG.3 bash -c "echo '$SERV_IP:5432:replication:repl:repl' &amp;gt; /home/postgres/.pgpass &amp;amp;&amp;amp; chmod 0600 /home/postgres/.pgpass &amp;amp;&amp;amp; chown postgres /home/postgres/.pgpass"

$ SERV_IP=`docker inspect --format='{{ .NetworkSettings.IPAddress }}' testPG.1` &amp;amp;&amp;amp; docker exec -t testPG.3 bash -c "/usr/local/pgsql/bin/pg_basebackup -D $PGDATA -h $SERV_IP -p 5432 -Xs -R -P -U repl &amp;amp;&amp;amp; /usr/local/pgsql/bin/pg_ctl -D $PGDATA -l logfile start"
23724/23724 kB (100%), 1/1 tablespace
waiting for server to start.... done
server started 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the second standby server &lt;code&gt;testPG.3&lt;/code&gt; up and running, the primary server &lt;code&gt;testPG.1&lt;/code&gt; should now have two WAL sender processes streaming data to the standbys&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "ps -faux | grep postgres"
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
....
postgres      17  0.0  0.4 208844 24460 ?        Ss   10:12   0:00 /usr/local/pgsql/bin/postgres
....
postgres      34  0.0  0.2 211036 12732 ?        Ss   10:13   0:00  \_ postgres: walsender repl 172.17.0.1(60918) streaming 0/8000060
postgres      82  0.0  0.1 210568  7852 ?        Ss   10:22   0:00  \_ postgres: walsender repl 172.17.0.1(57804) streaming 0/8000060
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Failover
&lt;/h2&gt;

&lt;p&gt;In a highly available system, when the primary fails, one of the healthy standbys needs to replace it in order to minimize service disruptions. In our case, one of the standbys gets promoted to the new primary and starts servicing incoming write queries from users. Let's promote the first standby server &lt;code&gt;testPG.2&lt;/code&gt;. Later we can make the second standby &lt;code&gt;testPG.3&lt;/code&gt; start following the new primary (testPG.2). We'll then bring up the old primary server &lt;code&gt;testPG.1&lt;/code&gt; as a standby, following the new leader.&lt;/p&gt;

&lt;p&gt;First, shut down the current primary (testPG.1)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "/usr/local/pgsql/bin/pg_ctl -D $PGDATA -l logfile stop"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we do not have a primary in our cluster. We need to act quickly to promote a healthy standby to take its place and start accepting writes.&lt;br&gt;
Next, promote the first standby (testPG.2)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.2 bash -c "/usr/local/pgsql/bin/pg_ctl -D $PGDATA -l logfile promote"
waiting for server to promote.... done
server promoted
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have a new primary, which is good. But our second standby still does not recognize it, so we need to make it follow the new leader by updating its primary connection string. Get the current connection information on &lt;code&gt;testPG.3&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.3 bash -c "/usr/local/pgsql/bin/psql -h localhost -c \"SHOW primary_conninfo;\""
                                                                                                                                              primary_conninfo

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------
 user=repl password=repl channel_binding=disable host=172.17.0.1 port=5432 sslmode=disable sslnegotiation=postgres sslcompression=0 sslcertmode=disable sslsni=1 ssl_min_protocol_version=TLSv1.2 gssencmode=disa
ble krbsrvname=postgres gssdelegation=0 target_session_attrs=any load_balance_hosts=disable
(1 row)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The port still references the old primary. We need to change that. Copy the entire string and update the port to &lt;code&gt;5433&lt;/code&gt; for the &lt;code&gt;testPG.2&lt;/code&gt; server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ DOCKER_BRIDGE_INTERFACE=`ifconfig docker0 | grep -w "inet" | awk '{print $2}'` &amp;amp;&amp;amp; \
    docker exec testPG.3 bash -c "/usr/local/pgsql/bin/psql -h localhost -c \"ALTER SYSTEM SET primary_conninfo = '\
        user=repl password=repl channel_binding=disable host=$DOCKER_BRIDGE_INTERFACE \
        port=5433 sslmode=disable sslnegotiation=postgres sslcompression=0 \
        sslcertmode=disable sslsni=1 ssl_min_protocol_version=TLSv1.2 \
        gssencmode=disable krbsrvname=postgres gssdelegation=0 \
        target_session_attrs=any load_balance_hosts=disable';\""
ALTER SYSTEM

$ docker exec testPG.3 bash -c "/usr/local/pgsql/bin/psql -h localhost -c \"SELECT pg_reload_conf();\"" # Reload configurations
 pg_reload_conf
----------------
 t
(1 row)
$ docker exec testPG.3 bash -c "/usr/local/pgsql/bin/psql -h localhost -c \"SHOW primary_conninfo;\""
                                                                                                                                              primary_conninfo

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------
 user=repl password=repl channel_binding=disable host=172.17.0.1 port=5433 sslmode=disable sslnegotiation=postgres sslcompression=0 sslcertmode=disable sslsni=1 ssl_min_protocol_version=TLSv1.2 gssencmode=disable krbsrvname=postgres gssdelegation=0 target_session_attrs=any load_balance_hosts=disable
(1 row)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The standby is now following the new leader. Data is being streamed from the new primary (testPG.2) over the 172.17.0.1:54356 socket to the standby, which is what we want.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.2 bash -c "ps -faux | grep postgres"
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
....
postgres      17  0.0  0.4 208844 24732 ?        Ss   10:13   0:00 /usr/local/pgsql/bin/postgres
....
postgres    1166  0.0  0.1 210568  7800 ?        Ss   11:27   0:00  \_ postgres: walsender repl 172.17.0.1(54356) streaming 0/90001E0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can bring back the old primary (testPG.1) and add it to the cluster as a standby server.&lt;br&gt;
Before restarting the server, ensure an empty &lt;code&gt;standby.signal&lt;/code&gt; file exists in the data directory. This ensures it starts up as a standby, after which we &lt;br&gt;
can connect to it to update its connection string. The &lt;code&gt;standby.signal&lt;/code&gt; file was not created manually for the earlier standbys because &lt;code&gt;pg_basebackup&lt;/code&gt; creates it automatically after it is done copying data to the new location.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "touch $PGDATA/standby.signal &amp;amp;&amp;amp; /usr/local/pgsql/bin/pg_ctl -D $PGDATA -l logfile start"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The server should now be up and running as a standby. But it still does not know how to reach the new primary (testPG.2) server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "/usr/local/pgsql/bin/psql -h localhost -c \"SHOW primary_conninfo;\""
 primary_conninfo
------------------

(1 row)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The connection string is empty. This is because the &lt;code&gt;testPG.1&lt;/code&gt; server was the initial primary and did not need to stream data from another server, hence no connection string was required. The connection string needs to be updated with valid key-value pairs&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ DOCKER_BRIDGE_INTERFACE=`ifconfig docker0 | grep -w "inet" | awk '{print $2}'` &amp;amp;&amp;amp; \
    docker exec testPG.1 bash -c "/usr/local/pgsql/bin/psql -h localhost -c \"ALTER SYSTEM SET primary_conninfo = '\
        user=repl password=repl channel_binding=disable host=$DOCKER_BRIDGE_INTERFACE \
        port=5433 sslmode=disable sslnegotiation=postgres sslcompression=0 \
        sslcertmode=disable sslsni=1 ssl_min_protocol_version=TLSv1.2 \
        gssencmode=disable krbsrvname=postgres gssdelegation=0 \
        target_session_attrs=any load_balance_hosts=disable';\""
ALTER SYSTEM

$ docker exec testPG.1 bash -c "/usr/local/pgsql/bin/psql -h localhost -c \"SELECT pg_reload_conf();\"" # Reload configurations
 pg_reload_conf
----------------
 t
(1 row)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;testPG.1&lt;/code&gt; now follows the new leader, &lt;code&gt;testPG.2&lt;/code&gt;, and is able to receive streamed WAL records, as shown below by the WAL receiver process running and connected to the primary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "ps -faux | grep postgres"
....
postgres     226  0.0  0.4 208844 24736 ?        Ss   11:39   0:00 /usr/local/pgsql/bin/postgres
....
postgres     232  0.0  0.1 210620  6308 ?        Ss   11:39   0:00  \_ postgres: startup recovering 000000020000000000000009
postgres     836  0.0  0.0 210488  4728 ?        Ss   11:47   0:00  \_ postgres: walreceiver streaming 0/90001E0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have successfully built a "3-node" cluster with one primary and two standbys using Docker containers. We have also managed to fail over to a standby when the primary was unreachable, and to reconfigure the remaining standby to follow the new primary.&lt;br&gt;
The most notable part of the whole process is the manual intervention required, especially for failover. Tools such as &lt;a href="https://patroni.readthedocs.io/en/latest/index.html" rel="noopener noreferrer"&gt;Patroni&lt;/a&gt; or &lt;a href="https://www.repmgr.org/" rel="noopener noreferrer"&gt;Repmgr&lt;/a&gt; exist to make the process of setting up and managing HA less painful. HA using Patroni will be covered in another &lt;a href="https://dev.to/misachi/ha-postgres-with-patroni-and-barman-ien"&gt;post&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>highavailability</category>
    </item>
    <item>
      <title>Fun with Postgres Recovery</title>
      <dc:creator>Brian Misachi</dc:creator>
      <pubDate>Wed, 09 Jul 2025 23:14:59 +0000</pubDate>
      <link>https://dev.to/misachi/fun-with-postgres-recovery-f5e</link>
      <guid>https://dev.to/misachi/fun-with-postgres-recovery-f5e</guid>
      <description>&lt;p&gt;I recently had an issue with a test Postgres server that didn't shut down cleanly and could not restart because it was stuck in recovery. I tried all the quick tricks I could get from browsing the internet but all of them failed. I ended up using &lt;a href="https://www.postgresql.org/docs/current/app-pgresetwal.html" rel="noopener noreferrer"&gt;pg_resetwal&lt;/a&gt; to get to a point where the database could be able to start up again. This meant losing some data that had not been written to disk before the crash. I decided to look into how recovery really works for Postgres, which ended up in yet another extension &lt;a href="https://github.com/misachi/pg_wal_recovery" rel="noopener noreferrer"&gt;pg_wal_recovery&lt;/a&gt;.  In this post, I'll demonstrate a simple table recovery using Postgres. I'll be using &lt;code&gt;/usr/local/pgsql/data&lt;/code&gt; as my Postgres data location. If this is different for you, then you can change it.&lt;/p&gt;

&lt;p&gt;First, the definition of recovery [I googled :) ]: data recovery is the process of restoring a database to a consistent state after a crash or power loss. Changes to data are usually stored separately as a sequential log; the actual data pages (table data) can be written to permanent storage later. If a crash occurs before the data is flushed to disk, the database can be left in an inconsistent state, with changes in the log that are not reflected in the table files. During the next restart, the database needs to "fix" this inconsistency: it reads the log and replays every record from the last REDO location in order to restore the database to a consistent state.&lt;/p&gt;
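&lt;p&gt;To make the replay idea concrete, here is a toy sketch in Go (my own illustration, not Postgres code): committed changes live in a sequential log, the in-memory "page" is lost in a crash, and replaying the log from the REDO point rebuilds it&lt;br&gt;&lt;/p&gt;

```go
package main

import "fmt"

// record is one entry in the toy sequential log: a committed
// key/value change that may not yet be reflected in the "table".
type record struct {
	key, val int
}

// replay applies every log record from the redo position onwards
// to the page, bringing it back to a consistent state.
func replay(log []record, redo int, page map[int]int) {
	for _, r := range log[redo:] {
		page[r.key] = r.val
	}
}

func main() {
	// Three committed changes were logged before the crash.
	log := []record{{1, 100}, {2, 200}, {3, 300}}
	redo := 0 // the last checkpoint flushed nothing

	page := map[int]int{} // the crash wiped the unflushed page
	replay(log, redo, page)
	fmt.Println(page[3]) // 300: the "lost" rows are back
}
```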

&lt;p&gt;Postgres uses the &lt;a href="https://www.postgresql.org/docs/current/wal-intro.html" rel="noopener noreferrer"&gt;WAL&lt;/a&gt; to log changes made to tables before the data is flushed to permanent storage during a &lt;a href="https://www.cybertec-postgresql.com/en/postgresql-what-is-a-checkpoint/" rel="noopener noreferrer"&gt;checkpoint&lt;/a&gt; or by the background writer. The WAL plays a very important role in the operation of database activities such as replication, point-in-time-recovery(PITR), recovery, backups and others.&lt;/p&gt;

&lt;p&gt;Postgres uses a control file &lt;code&gt;PGDATA/global/pg_control&lt;/code&gt; to store state information required by various parts of the database. This is where details like checkpoint location, redo location, the database state value(enum), the next transaction ID to be assigned, next OID and many more are stored. A copy of the file data is stored in shared memory for easy access and is routinely synced with the data on disk. When the database starts up it reads the control file and checks values of REDO and Checkpoint locations. If these two values are the same and the database was shut down cleanly, then normal startup process continues. If not, the recovery process follows. See &lt;code&gt;StartupProcessMain&lt;/code&gt; in &lt;code&gt;src/backend/postmaster/startup.c&lt;/code&gt; and &lt;code&gt;StartupXLOG&lt;/code&gt; in &lt;code&gt;src/backend/access/transam/xlog.c&lt;/code&gt;. Also see &lt;code&gt;InitWalRecovery&lt;/code&gt; and &lt;code&gt;PerformWalRecovery&lt;/code&gt; in &lt;code&gt;transam/xlogrecovery.c&lt;/code&gt; for the low-level implementations. &lt;/p&gt;

&lt;p&gt;The control file data can be read using the &lt;a href="https://www.postgresql.org/docs/current/app-pgcontroldata.html" rel="noopener noreferrer"&gt;pg_controldata&lt;/a&gt; tool. A sample output is as follows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres@f63dffa121c7:~$ pg_controldata -D /usr/local/pgsql/data/
pg_control version number:            1800
Catalog version number:               202505071
Database system identifier:           7504776984629272596
Database cluster state:               in production
pg_control last modified:             Tue Jul  8 21:38:34 2025
Latest checkpoint location:           16/18031820
Latest checkpoint's REDO location:    16/18031820
Latest checkpoint's REDO WAL file:    000000010000001600000018
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          1:1847483885
Latest checkpoint's NextOID:          443131
.....
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case, both REDO and Checkpoint point to the same location in the WAL file and the database is up and running as shown in the value of &lt;code&gt;Database cluster state: in production&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now if we run the &lt;code&gt;CHECKPOINT&lt;/code&gt; command, the checkpoint and redo positions in the control file will change.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# CHECKPOINT;
CHECKPOINT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The control file data has changed&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pg_control version number:            1800
Catalog version number:               202505071
Database system identifier:           7504776984629272596
Database cluster state:               in production
pg_control last modified:             Tue Jul  8 22:06:08 2025
Latest checkpoint location:           16/18031928
Latest checkpoint's REDO location:    16/180318D0
Latest checkpoint's REDO WAL file:    000000010000001600000018
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          1:1847483885
.....
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looking at how much data has changed before and after the checkpoint command we get&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# select '16/18031928'::pg_lsn - '16/18031820'::pg_lsn as checkpoint_diff_bytes;
 checkpoint_diff_bytes
-----------------------
                   264
(1 row)

postgres=# select '16/180318D0'::pg_lsn - '16/18031820'::pg_lsn as redo_diff_bytes;
 redo_diff_bytes
-----------------
             176
(1 row)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
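&lt;p&gt;The &lt;code&gt;pg_lsn&lt;/code&gt; subtraction above is plain byte arithmetic: an LSN such as &lt;code&gt;16/18031928&lt;/code&gt; is a 64-bit WAL byte position written as two hexadecimal halves. A small Go sketch (my own illustration, not part of Postgres) reproduces the numbers&lt;br&gt;&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// lsnToUint64 converts an LSN string like "16/18031928" into the
// 64-bit WAL byte position it represents: high<<32 | low.
func lsnToUint64(lsn string) uint64 {
	parts := strings.Split(lsn, "/")
	hi, _ := strconv.ParseUint(parts[0], 16, 32)
	lo, _ := strconv.ParseUint(parts[1], 16, 32)
	return hi<<32 | lo
}

func main() {
	// Same values as the pg_lsn query above.
	fmt.Println(lsnToUint64("16/18031928") - lsnToUint64("16/18031820")) // checkpoint_diff_bytes: 264
	fmt.Println(lsnToUint64("16/180318D0") - lsnToUint64("16/18031820")) // redo_diff_bytes: 176
}
```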



&lt;p&gt;We can see which records were added to the WAL when the checkpoint command was called using the &lt;code&gt;pg_waldump&lt;/code&gt; tool&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres@f63dffa121c7:~$ pg_waldump --start=16/18031820 /usr/local/pgsql/data/pg_wal/000000010000001600000018
rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 16/18031820, prev 16/180317E8, desc: CHECKPOINT_SHUTDOWN redo 16/18031820; tli 1; prev tli 1; fpw true; wal_level replica; xid 1:1847483885; oid 443131; multi 1; offset 0; oldest xid 2147483649 in DB 5; oldest multi 1 in DB 5; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 16/18031898, prev 16/18031820, desc: RUNNING_XACTS nextXid 1847483885 latestCompletedXid 1847483884 oldestRunningXid 1847483885
rmgr: XLOG        len (rec/tot):     30/    30, tx:          0, lsn: 16/180318D0, prev 16/18031898, desc: CHECKPOINT_REDO wal_level replica
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 16/180318F0, prev 16/180318D0, desc: RUNNING_XACTS nextXid 1847483885 latestCompletedXid 1847483884 oldestRunningXid 1847483885
rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 16/18031928, prev 16/180318F0, desc: CHECKPOINT_ONLINE redo 16/180318D0; tli 1; prev tli 1; fpw true; wal_level replica; xid 1:1847483885; oid 443131; multi 1; offset 0; oldest xid 2147483649 in DB 5; oldest multi 1 in DB 5; oldest/newest commit timestamp xid: 0/0; oldest running xid 1847483885; online
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 16/180319A0, prev 16/18031928, desc: RUNNING_XACTS nextXid 1847483885 latestCompletedXid 1847483884 oldestRunningXid 1847483885
pg_waldump: error: error in WAL record at 16/180319A0: invalid record length at 16/180319D8: expected at least 24, got 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No actual data records have been added. The redo record has moved from position 16/18031820 to 16/180318D0, while the checkpoint record has moved from 16/18031820 to 16/18031928. Now we proceed to the demo.&lt;/p&gt;

&lt;p&gt;Open a client session, create a table &lt;code&gt;foo&lt;/code&gt; and insert some data into it. Then do a "dirty" shutdown of the server before any checkpoint is run&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# DROP TABLE IF EXISTS foo;
NOTICE:  table "foo" does not exist, skipping
DROP TABLE
postgres=# CREATE TABLE foo(id SERIAL, k INT NOT NULL);
CREATE TABLE
postgres=# INSERT INTO foo SELECT i, i*100 FROM generate_series(1, 100) i;
INSERT 0 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In another terminal session, kill the server using the &lt;code&gt;--mode=immediate&lt;/code&gt; option&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres@f63dffa121c7:~$ pg_ctl -D /usr/local/pgsql/data -l ~/logfile stop --mode=immediate
waiting for server to shut down.... done
server stopped
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Table &lt;code&gt;foo&lt;/code&gt; should now have 100 records since the transaction committed, right? Not really. &lt;/p&gt;

&lt;p&gt;On commit, a user can only be sure that data has been stored in the WAL, not in the table files. Before the database can be restarted, we need to do two things: copy the control file and copy the WAL segment file to another location. Create a directory named &lt;code&gt;global&lt;/code&gt; inside the new location and copy the control file into the &lt;code&gt;global&lt;/code&gt; directory. Then copy the last active WAL segment file (shown in the &lt;code&gt;Latest checkpoint's REDO WAL file&lt;/code&gt; row when you run &lt;code&gt;pg_controldata -D /usr/local/pgsql/data/&lt;/code&gt;) to the same location. Ensure the postgres user has read/write access to the location. I'll use the &lt;code&gt;/tmp&lt;/code&gt; directory in this case. The final structure should look like this&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres@f63dffa121c7:~$ cp /usr/local/pgsql/data/global/pg_control /tmp/global/
postgres@f63dffa121c7:~$ cp /usr/local/pgsql/data/pg_wal/000000010000001600000018 /tmp/
postgres@f63dffa121c7:~$ ls /tmp/
000000010000001600000018  global
postgres@f63dffa121c7:~$ ls /tmp/global/
pg_control
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now clear the WAL data using the &lt;a href="https://www.postgresql.org/docs/current/app-pgresetwal.html" rel="noopener noreferrer"&gt;pg_resetwal&lt;/a&gt; tool. By default, when Postgres starts up it first checks whether it needs to go into recovery mode, and it will use the WAL to recover any un-checkpointed data. Clearing the WAL prevents this, since manual recovery will be done later using the &lt;code&gt;pg_wal_recovery&lt;/code&gt; extension.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres@f63dffa121c7:~$ pg_resetwal -D /usr/local/pgsql/data/ -f
Write-ahead log reset
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;pg_resetwal&lt;/strong&gt; should never be used in a production database. It has destructive side effects that can lead to data loss. Its usage should be limited to situations where the user knows what they are doing and they are aware of the risks involved.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Looking at the control data now, the checkpoint and redo locations match. If there were any WAL segment files in the &lt;code&gt;pg_wal&lt;/code&gt; directory prior to running the command, they have been removed. Also, a new WAL segment file has been created &lt;code&gt;000000010000001600000019&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres@f63dffa121c7:~$ pg_controldata -D /usr/local/pgsql/data/
pg_control version number:            1800
Catalog version number:               202505071
Database system identifier:           7504776984629272596
Database cluster state:               shut down
pg_control last modified:             Wed Jul  9 10:46:20 2025
Latest checkpoint location:           16/19000028
Latest checkpoint's REDO location:    16/19000028
Latest checkpoint's REDO WAL file:    000000010000001600000019
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          1:1847483887
Latest checkpoint's NextOID:          451323
.....
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the database can be restarted since we know it won't go into automatic recovery mode.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres@f63dffa121c7:~$ pg_ctl -D /usr/local/pgsql/data -l ~/logfile start
waiting for server to start.... done
server started
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once it is up, we can check the &lt;code&gt;foo&lt;/code&gt; table to find out if there is any data in it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT * FROM foo;
 id | k
----+---
(0 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It returned no rows. Does it mean the data inserted before has been lost? Well....Not really.&lt;br&gt;
The data was committed so we can be sure that, at the very least, the data has been logged in the WAL. We can check this using the &lt;code&gt;pg_wal_recovery&lt;/code&gt; extension.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT * FROM wal_list_records('/tmp');
WARNING:  invalid record length at 16/18065410: expected at least 24, got 0
      wal_file_name       |     wal_type      | wal_record
--------------------------+-------------------+-------------
 000000010000001600000018 | CHECKPOINT_REDO   | 16/1805D4E0
 000000010000001600000018 | RUNNING_XACTS     | 16/1805D500
 000000010000001600000018 | CHECKPOINT_ONLINE | 16/1805D538
 000000010000001600000018 | RUNNING_XACTS     | 16/1805D5B0
 000000010000001600000018 | FPI_FOR_HINT      | 16/1805D5E8
 000000010000001600000018 | FPI_FOR_HINT      | 16/1805E0C0
 000000010000001600000018 | FPI_FOR_HINT      | 16/1805F880
 000000010000001600000018 | RUNNING_XACTS     | 16/18060D00
 000000010000001600000018 | INSERT+INIT       | 16/18060D38
 000000010000001600000018 | INSERT            | 16/18060D78
 000000010000001600000018 | INSERT            | 16/18060DB8
 000000010000001600000018 | INSERT            | 16/18060DF8
 000000010000001600000018 | INSERT            | 16/18060E38
 .....
 000000010000001600000018 | INSERT            | 16/18062550
 000000010000001600000018 | INSERT            | 16/18062590
 000000010000001600000018 | INSERT            | 16/180625D0
 000000010000001600000018 | INSERT            | 16/18062610
 000000010000001600000018 | COMMIT            | 16/18062650
 000000010000001600000018 | INSERT            | 16/18062678
 000000010000001600000018 | INSERT_LEAF       | 16/18063CD8
 000000010000001600000018 | INSERT            | 16/18064268
 000000010000001600000018 | INSERT_LEAF       | 16/180644B8
 000000010000001600000018 | INPLACE           | 16/180644F8
 000000010000001600000018 | COMMIT            | 16/18065388
(115 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The extension read WAL records starting from &lt;code&gt;16/1805D4E0&lt;/code&gt;, the last REDO location in the control file that was copied earlier. There are 100 INSERT records from offset &lt;code&gt;16/18060D38&lt;/code&gt; to &lt;code&gt;16/18062610&lt;/code&gt; in the WAL segment file. This matches the number of inserts done just before the database was shut down. We can check this using the &lt;a href="https://www.postgresql.org/docs/current/pgwaldump.html" rel="noopener noreferrer"&gt;pg_waldump&lt;/a&gt; tool, which provides far more information about the WAL records&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres@f63dffa121c7:~$ pg_waldump --start=16/18060D38 --end=16/18062610  /tmp/000000010000001600000018
rmgr: Heap        len (rec/tot):     63/    63, tx: 1847483887, lsn: 16/18060D38, prev 16/18060D00, desc: INSERT+INIT off: 1, flags: 0x00, blkref #0: rel 1663/5/443132 blk 0
rmgr: Heap        len (rec/tot):     63/    63, tx: 1847483887, lsn: 16/18060D78, prev 16/18060D38, desc: INSERT off: 2, flags: 0x00, blkref #0: rel 1663/5/443132 blk 0
rmgr: Heap        len (rec/tot):     63/    63, tx: 1847483887, lsn: 16/18060DB8, prev 16/18060D78, desc: INSERT off: 3, flags: 0x00, blkref #0: rel 1663/5/443132 blk 0
.....
rmgr: Heap        len (rec/tot):     63/    63, tx: 1847483887, lsn: 16/18062590, prev 16/18062550, desc: INSERT off: 98, flags: 0x00, blkref #0: rel 1663/5/443132 blk 0
rmgr: Heap        len (rec/tot):     63/    63, tx: 1847483887, lsn: 16/180625D0, prev 16/18062590, desc: INSERT off: 99, flags: 0x00, blkref #0: rel 1663/5/443132 blk 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The last thing to check before replaying the WAL records is the data file (table &lt;code&gt;foo&lt;/code&gt;) to confirm whether any data is present. For this, the &lt;a href="https://www.postgresql.org/docs/9.0/pageinspect.html" rel="noopener noreferrer"&gt;pageinspect&lt;/a&gt; extension comes in handy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT lp, t_ctid, t_xmin, t_xmax FROM heap_page_items(get_raw_page('foo', 0)) WHERE lp_len &amp;gt; 0;
 lp | t_ctid | t_xmin | t_xmax
----+--------+--------+--------
(0 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we query the contents of the first page (zero-based) in the data file. The &lt;code&gt;lp&lt;/code&gt; (line pointer) is the item number of the record within the page. &lt;code&gt;t_ctid&lt;/code&gt; is the location of the record within the data file, e.g. (0,3) can be read as the third item on the first page. &lt;code&gt;t_xmin&lt;/code&gt; and &lt;code&gt;t_xmax&lt;/code&gt; are the transaction IDs of the inserting and deleting transactions respectively.&lt;/p&gt;

&lt;p&gt;The table is empty. We have to replay the WAL in order to get the data into the table. The &lt;code&gt;pg_wal_recovery&lt;/code&gt; extension will be used for this purpose. It will output the last record replayed in the WAL when done.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# select * FROM wal_recover('/tmp');
WARNING:  invalid record length at 16/18065410: expected at least 24, got 0
 wal_type | wal_record
----------+-------------
 COMMIT   | 16/18065388
(1 row)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "lost" data should now appear in the data files. Again, we can run the query using the helper functions from the &lt;code&gt;pageinspect&lt;/code&gt; extension&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT lp, t_ctid, t_xmin, t_xmax FROM heap_page_items(get_raw_page('foo', 0)) WHERE lp_len &amp;gt; 0;
 lp  | t_ctid  |   t_xmin   | t_xmax
-----+---------+------------+--------
   1 | (0,1)   | 1847483887 |      0
   2 | (0,2)   | 1847483887 |      0
   3 | (0,3)   | 1847483887 |      0
.....
  97 | (0,97)  | 1847483887 |      0
  98 | (0,98)  | 1847483887 |      0
  99 | (0,99)  | 1847483887 |      0
 100 | (0,100) | 1847483887 |      0
(100 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data is now present in the table file. In this case, all 100 records fit in a single page (each page is 8KB by default). After a restart, querying the &lt;code&gt;foo&lt;/code&gt; table returns all 100 records. The data has been restored.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT * FROM foo;
 id  |   k
-----+-------
   1 |   100
   2 |   200
.....
  99 |  9900
 100 | 10000
(100 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this post I demonstrated how to perform a simple data recovery using the WAL. In real-life systems, the chances of having to do recovery like this are very low. Postgres already ships with robust recovery infrastructure that has been battle-tested for many years, so automatic recovery would handle this for you anyway. Building the extension made me go deeper into the lower-level parts of recovery in Postgres. The extension is for educational purposes only and is not suitable for a production database, or any database holding data you care about.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>recovery</category>
    </item>
    <item>
      <title>Tracking Postgres "fsyncs" with bpftrace</title>
      <dc:creator>Brian Misachi</dc:creator>
      <pubDate>Mon, 19 May 2025 21:12:56 +0000</pubDate>
      <link>https://dev.to/misachi/tracking-postgres-fsyncs-with-bpftrace-4ofi</link>
      <guid>https://dev.to/misachi/tracking-postgres-fsyncs-with-bpftrace-4ofi</guid>
      <description>&lt;p&gt;Data systems such as Postgres make strong guarantees when it comes to durability. The aim is to reduce the chances of data loss at any cost. The &lt;code&gt;fsync&lt;/code&gt; and &lt;code&gt;fdatasync&lt;/code&gt; system calls are often used to provide this guarantee to the user that data has been safely flushed to the storage device. The user can then be certain that in any case of power failure they would not lose their data. &lt;/p&gt;

&lt;p&gt;However, these calls are not cheap: an application blocks while waiting for the system call to return. &lt;a href="https://github.com/sirupsen/napkin-math?tab=readme-ov-file#numbers" rel="noopener noreferrer"&gt;Napkin math&lt;/a&gt; latency numbers put it at around 1ms. I ran a test of the &lt;code&gt;fsync&lt;/code&gt; system call on my laptop just to compare whether I'd get the same latency values. The &lt;a href="https://bpftrace.org/" rel="noopener noreferrer"&gt;bpftrace&lt;/a&gt; script I used looks like this&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
BEGIN
{
    printf("Tracing fsync latency... Hit Ctrl-C to end.\n");
}

tracepoint:syscalls:sys_enter_fsync,
tracepoint:syscalls:sys_enter_fdatasync
{
    @start[tid] = nsecs;
}

tracepoint:syscalls:sys_exit_fsync,
tracepoint:syscalls:sys_exit_fdatasync 
/@start[tid]/
{
    cat("/proc/%d/cmdline", pid);  // Who is making the call?
    printf(" ==&amp;gt; %d\n", (nsecs - @start[tid]));
    delete(@start, tid);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The program making the &lt;code&gt;fsync&lt;/code&gt; call is written in &lt;a href="https://go.dev/" rel="noopener noreferrer"&gt;Golang&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package main

import (
    "log"
    "os"
)

func main() {
    buf := []byte("hello world")
    file, err := os.OpenFile("/tmp/testfile.txt", os.O_CREATE|os.O_WRONLY, 0777)
    if err != nil {
        log.Fatalf("OpenFile: %v", err)
    }
    defer file.Close()

    if _, err := file.Write(buf); err != nil {
        log.Fatalf("Write: %v", err)
    }

    if err := file.Sync(); err != nil { // Flush to disk
        log.Fatalf("Sync: %v", err)
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running the Go program a couple of times, the results I got were as follows:&lt;/p&gt;

&lt;p&gt;Run &lt;code&gt;bpftrace --unsafe fsync_lat.bt&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Attaching 5 probes...
Tracing fsync latency... Hit Ctrl-C to end.
 ==&amp;gt; 1095741
 ==&amp;gt; 967711
 ==&amp;gt; 1086844
 ==&amp;gt; 1095359
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not too far from the napkin value: the measured values are in nanoseconds, so each call took roughly 1ms.&lt;/p&gt;

&lt;p&gt;Having done this, I wanted to know how often Postgres flushes data to disk. My assumption at this point was that a single transaction makes at least one fsync call to ensure durability(for every commit a WAL record is generated and flushed to disk). I wanted to verify whether this holds in Postgres. If one fsync call takes 1ms, then you'd expect around 1000 transactions per second in a single-threaded scenario. This is an over-simplification because transactions do more than make fsync calls, so the latency would be over 1ms in a real system.&lt;/p&gt;

&lt;p&gt;The bpftrace script to track fsync calls made by Postgres&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BEGIN
{
    printf("Tracing Postgres fsync calls... Hit Ctrl-C to end.\n");
    printf("%-6s %6s %s\n", "PID", "FD", "ProcName &amp;lt; - &amp;gt; Filename");
}

tracepoint:syscalls:sys_enter_fsync,tracepoint:syscalls:sys_enter_fdatasync
/comm == "postgres"/ {
    @procs[pid] = count();
    @sum["total"] = count();

    if (!@files_fsyncd[args-&amp;gt;fd]) {
        @files_fsyncd[args-&amp;gt;fd] = 1;  // Mark FD as seen

        printf("\n%-6d %6d ", pid, args-&amp;gt;fd);
        cat("/proc/%d/cmdline", pid);
        printf(" &amp;lt; - &amp;gt; ");
        system("readlink /proc/%d/fd/%d", pid, args-&amp;gt;fd); // Get filename from fd
    }
}

END {
    clear(@files_fsyncd);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script fires when a Postgres process makes an &lt;code&gt;fsync&lt;/code&gt; or &lt;code&gt;fdatasync&lt;/code&gt; call. It displays the process name(for example the checkpointer, walwriter etc.), the process ID, the total number of calls made by all Postgres processes, the file name and the file descriptor number. &lt;/p&gt;

&lt;p&gt;The stats from running a single client, using sysbench for a write-only workload for about 600 seconds, are as follows&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SQL statistics:
    queries performed:
        read:                            0
        write:                           1435372
        other:                           717686
        total:                           2153058
    transactions:                        358843 (598.07 per sec.)
    queries:                             2153058 (3588.42 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The client performs 358843 transactions while issuing 2153058 queries. That is around 6 queries per transaction.&lt;/p&gt;

&lt;p&gt;Here is the result from running the bpftrace script &lt;code&gt;bpftrace --unsafe fsync_files.bt&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Attaching 4 probes...
Tracing fsync calls... Hit Ctrl-C to end.
PID        FD ProcName &amp;lt; - &amp;gt; Filename
6202       14 postgres: sbtest sbtest 192.168.33.10(50984) idle &amp;lt; - &amp;gt; /usr/local/pgsql/data/pg_wal/0000000100000010000000ED

3094        6 postgres: walwriter  &amp;lt; - &amp;gt; /usr/local/pgsql/data/pg_wal/0000000100000010000000ED

3091        7 postgres: checkpointer  &amp;lt; - &amp;gt;
6212        8 postgres: autovacuum worker sbtest &amp;lt; - &amp;gt; /usr/local/pgsql/data/pg_wal/000000010000001100000001

3091       12 postgres: checkpointer  &amp;lt; - &amp;gt; /usr/local/pgsql/data/base/16416/2841

3091       16 postgres: checkpointer  &amp;lt; - &amp;gt; /usr/local/pgsql/data/base/16416/24741

3091        9 postgres: checkpointer  &amp;lt; - &amp;gt; /usr/local/pgsql/data/base/16416/2840

3091       13 postgres: checkpointer  &amp;lt; - &amp;gt; /usr/local/pgsql/data/base/16416/24730

3091       11 postgres: checkpointer  &amp;lt; - &amp;gt; /usr/local/pgsql/data/base/16416/2840_vm

3091       18 postgres: checkpointer  &amp;lt; - &amp;gt; /usr/local/pgsql/data/pg_wal

3091       17 postgres: checkpointer  &amp;lt; - &amp;gt; /usr/local/pgsql/data/base/16416/24759

3091       10 postgres: checkpointer  &amp;lt; - &amp;gt; /usr/local/pgsql/data/base/16416/2840_fsm

3091       15 postgres: checkpointer  &amp;lt; - &amp;gt; /usr/local/pgsql/data/base/16416/24730_vm

3091       19 postgres: checkpointer  &amp;lt; - &amp;gt;
^C


@procs[6253]: 1
@procs[6257]: 1
@procs[6255]: 1
@procs[6221]: 1
@procs[6223]: 1
@procs[6251]: 1
@procs[6259]: 1
@procs[6212]: 1
@procs[6216]: 1
@procs[6262]: 1
@procs[3091]: 152
@procs[3094]: 253
@procs[6202]: 358898
@sum[total]: 359313
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The client process with PID 6202 is reported as having made 358898 fsync calls. This is around the same number reported by sysbench above: every transaction made an fsync call. But the actual number of transactions per second is about 598 when the ideal number is 1000. What could be the reason for the difference? I had a few thoughts(there could be more...)&lt;/p&gt;

&lt;p&gt;One reason could be the number of queries performed in a single transaction(6 on average), which adds query-processing overhead. Also, other processes compete for shared resources with this client. The &lt;code&gt;checkpointer&lt;/code&gt;, &lt;code&gt;walwriter&lt;/code&gt;, &lt;code&gt;autovacuum&lt;/code&gt; and &lt;code&gt;bgwriter&lt;/code&gt; are background processes that Postgres runs to perform other essential services. The checkpointer and walwriter(PIDs 3091 and 3094) made a combined 405 fsync calls. Checkpointing flushes all data that is in memory to disk. This can be a lot of data, especially if there are many dirty buffers still in memory, and flushing lots of data at once can be very slow and increase response times in the system.&lt;/p&gt;

&lt;p&gt;Other factors, such as query processing(which depends on the number of queries in the transaction), locks, memory, IO and network, can increase transaction processing times. Transaction processing time, therefore, is not just the cost of an fsync call and would likely exceed 1ms in a real system. In a highly concurrent system, one optimization a database can use is to group multiple commits and issue a single fsync call to flush all their data at once, which reduces the number of IOs.&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>bpftrace</category>
    </item>
    <item>
      <title>Postgres: Ordered Queries and the Planner</title>
      <dc:creator>Brian Misachi</dc:creator>
      <pubDate>Thu, 01 May 2025 22:52:20 +0000</pubDate>
      <link>https://dev.to/misachi/postgres-ordered-queries-and-the-planner-1151</link>
      <guid>https://dev.to/misachi/postgres-ordered-queries-and-the-planner-1151</guid>
      <description>&lt;p&gt;Most SQL queries require the results of the particular query to be ordered in a some way. The returned data can still be implicitly ordered by the primary key(if present) if the user does not apply their own ordering to the query. In Postgres and other SQL like databases, the way to order the result set from a query is by using the &lt;code&gt;ORDER BY&lt;/code&gt; clause.&lt;/p&gt;

&lt;p&gt;Sorting rows has an additional cost attached to it, which can lead to interesting plan choices when executing a query. The planner generates multiple paths for query execution, with each path representing a different way to get the results for the query. The path with the lowest cost is checked for whether it already produces sorted results; if not, a sort step is added and the final costs are re-compared against the other paths. If it is already sorted, it remains the cheapest path and is used to execute the query.&lt;/p&gt;

&lt;p&gt;B+tree indexes keep their entries sorted, so a primary key backed by a B+tree index makes its values available in sorted order. When building index paths the planner considers both forward and backward scans on the index, so sorting in ascending or descending order makes no significant difference. Sorting on an indexed field in most cases performs better than on non-indexed fields.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# CREATE TABLE bar(id1 INT, id2 INT, id3 INT, id4 INT, descr TEXT);  -- new table
postgres=# INSERT INTO bar SELECT i, i*3, i+i, i*2, 'hello' || i FROM generate_series(1, 10000000) i; -- 10M records
postgres=# CREATE INDEX idx_id1_id2_id3 ON bar(id1, id2, id3);

postgres=# explain analyze select * from bar where id1 &amp;gt;= 1000000 and id1 &amp;lt; 2000001 order by id1; -- there is an index in id1
                                                                QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_id1_id2_id3 on bar  (cost=0.43..168152.90 rows=998886 width=28) (actual time=0.046..139.920 rows=1000001.00 loops=1)
   Index Cond: ((id1 &amp;gt;= 1000000) AND (id1 &amp;lt; 2000001))
   Index Searches: 1
   Buffers: shared hit=11188
 Planning Time: 0.126 ms
 Execution Time: 187.269 ms
(6 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At a minimum 2 IOs are issued: the index is scanned, then a heap lookup fetches the columns that are not already in the index. There is no explicit sort step, and CPU usage is low since no comparisons are needed -- the returned data is already ordered on the sort key.&lt;/p&gt;

&lt;p&gt;Using a non-indexed field&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# explain analyze select * from bar where id1 &amp;gt;= 1000000 and id1 &amp;lt; 2000001 order by id4;
                                                                   QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=291593.20..294090.41 rows=998886 width=28) (actual time=404.509..509.841 rows=1000001.00 loops=1)
   Sort Key: id4
   Sort Method: external merge  Disk: 38208kB
   Buffers: shared hit=11188, temp read=9547 written=9564
   -&amp;gt;  Index Scan using idx_id1_id2_id3 on bar  (cost=0.43..168152.90 rows=998886 width=28) (actual time=0.033..142.248 rows=1000001.00 loops=1)
         Index Cond: ((id1 &amp;gt;= 1000000) AND (id1 &amp;lt; 2000001))
         Index Searches: 1
         Buffers: shared hit=11188
 Planning Time: 0.112 ms
 Execution Time: 565.804 ms
(10 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In addition to the index and heap IOs, there is an explicit sort step at the top level, leading to higher CPU usage for key comparisons. Also, every query only has &lt;code&gt;work_mem&lt;/code&gt; amount of memory for sorting purposes. Here the returned data does not fit in the allocated &lt;code&gt;work_mem&lt;/code&gt;, so the sort step spills to disk: 9547 temp pages are read from disk and 9564 pages written out. &lt;/p&gt;

&lt;p&gt;In an attempt to reduce the disk IOs during sorting, we can increase the &lt;code&gt;work_mem&lt;/code&gt; size(Remember &lt;code&gt;work_mem&lt;/code&gt; is per query, so be careful)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SET work_mem = 102400;
SET
postgres=# SHOW work_mem;
 work_mem
----------
 100MB
(1 row)

postgres=# explain analyze select * from bar where id1 &amp;gt;= 1000000 and id1 &amp;lt; 2000001 order by id4;
                                                                    QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=209251.10..211748.31 rows=998886 width=28) (actual time=245.392..296.690 rows=1000001.00 loops=1)
   Sort Key: id4
   Sort Method: quicksort  Memory: 71452kB
   Buffers: shared hit=10089
   -&amp;gt;  Bitmap Heap Scan on bar  (cost=21199.02..109712.31 rows=998886 width=28) (actual time=26.995..129.239 rows=1000001.00 loops=1)
         Recheck Cond: ((id1 &amp;gt;= 1000000) AND (id1 &amp;lt; 2000001))
         Heap Blocks: exact=7354
         Buffers: shared hit=10089
         -&amp;gt;  Bitmap Index Scan on idx_id1_id2_id3  (cost=0.00..20949.30 rows=998886 width=0) (actual time=25.790..25.791 rows=1000001.00 loops=1)
               Index Cond: ((id1 &amp;gt;= 1000000) AND (id1 &amp;lt; 2000001))
               Index Searches: 1
               Buffers: shared hit=2735
 Planning Time: 0.141 ms
 Execution Time: 348.114 ms
(14 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is some improvement in execution time since the sorting is now done entirely in memory(quicksort), and the plan also changes to a &lt;code&gt;bitmap scan&lt;/code&gt;. It is not advisable to 'blindly' increase &lt;code&gt;work_mem&lt;/code&gt;: it can mislead the planner into choosing an inferior plan, and it can lead to sub-optimal memory usage -- each query may use up to a &lt;code&gt;work_mem&lt;/code&gt; sized chunk of memory for each sort or hash operation. It is used here only for demo purposes.&lt;/p&gt;
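&lt;p&gt;If the larger &lt;code&gt;work_mem&lt;/code&gt; is only needed for a single query, one way to contain the effect(not shown in the session above) is to scope it to a transaction with &lt;code&gt;SET LOCAL&lt;/code&gt;, which reverts automatically at commit or rollback:&lt;/p&gt;

```sql
BEGIN;
SET LOCAL work_mem = '100MB';  -- applies only inside this transaction
EXPLAIN ANALYZE SELECT * FROM bar WHERE id1 &gt;= 1000000 AND id1 &lt; 2000001 ORDER BY id4;
COMMIT;  -- work_mem reverts to its previous value here
```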

&lt;p&gt;Using both indexed and non-indexed fields&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# explain analyze select * from bar where id1 &amp;gt;= 1000000 and id1 &amp;lt; 2000001 order by id1, id4;
                                                                   QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------
 Incremental Sort  (cost=0.61..208698.77 rows=998886 width=28) (actual time=0.061..293.947 rows=1000001.00 loops=1)
   Sort Key: id1, id4
   Presorted Key: id1
   Full-sort Groups: 31251  Sort Method: quicksort  Average Memory: 26kB  Peak Memory: 26kB
   Buffers: shared hit=10089
   -&amp;gt;  Index Scan using idx_id1_id2_id3 on bar  (cost=0.43..163748.90 rows=998886 width=28) (actual time=0.041..144.617 rows=1000001.00 loops=1)
         Index Cond: ((id1 &amp;gt;= 1000000) AND (id1 &amp;lt; 2000001))
         Index Searches: 1
         Buffers: shared hit=10089
 Planning Time: 0.112 ms
 Execution Time: 341.344 ms
(11 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The planner opts for an &lt;code&gt;incremental sort&lt;/code&gt; plan, which applies when the input is already sorted on a prefix of the sort key. In this case, the sort key has two fields, &lt;code&gt;id1&lt;/code&gt; and &lt;code&gt;id4&lt;/code&gt;. There is an index on &lt;code&gt;id1&lt;/code&gt;, so the rows arrive already sorted on that column; instead of doing a full sort on both columns, each group of pre-sorted rows only needs to be sorted on the &lt;code&gt;id4&lt;/code&gt; column. This saves CPU cycles and performs better than sorting on both fields from scratch. There have been &lt;a href="https://www.postgresql.org/message-id/flat/d2f06ddc-a8d4-48ea-893b-a95255c632b9%40loxodata.com" rel="noopener noreferrer"&gt;past cases&lt;/a&gt; where incremental sort has produced plans with issues. Incremental sort can be turned off with the &lt;code&gt;SET enable_incremental_sort = off&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;In short, it helps to sort on columns that are already ordered, and the incremental sort plan can help when only some of the columns in an order-by clause are ordered. A helpful way to build indexes is to use a &lt;a href="https://www.postgresql.org/docs/current/indexes-multicolumn.html" rel="noopener noreferrer"&gt;multiple column index&lt;/a&gt; or even &lt;a href="https://www.postgresql.org/docs/current/indexes-index-only-scans.html" rel="noopener noreferrer"&gt;covering indexes&lt;/a&gt;, with the first key column used to limit the data returned and the following columns used to aid sorting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM test WHERE fld1 &amp;gt;= [start_limit] AND fld1 &amp;lt; [end_limit]  ORDER BY fld2; -- index on both fld1 and fld2; CREATE INDEX idx_f1_f2 ON test(fld1, fld2);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unused indexes can lead to unnecessary bloat and have &lt;a href="https://www.percona.com/blog/postgresql-indexes-can-hurt-you-negative-effects-and-the-costs-involved/" rel="noopener noreferrer"&gt;negative effects on performance&lt;/a&gt;. It is important to add only useful columns to the index, to keep the index as small as possible. Test and measure to determine what works best for your case.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>queryperformance</category>
    </item>
    <item>
      <title>Postgres: Index Scans</title>
      <dc:creator>Brian Misachi</dc:creator>
      <pubDate>Sun, 06 Apr 2025 17:35:47 +0000</pubDate>
      <link>https://dev.to/misachi/postgres-index-scans-552g</link>
      <guid>https://dev.to/misachi/postgres-index-scans-552g</guid>
      <description>&lt;p&gt;Using an index to help improve query performance is common practice. An index is a structure organized in such a way to help make it cheaper to access data records stored on disk. One example of such a structure is the b+tree. In Postgres case, the data stored on disk is referred to as the &lt;code&gt;heap&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Having an index on a table does not guarantee that it will be used to retrieve data for a query. It is important to first test and measure the effect of adding an index, to find out if it helps improve performance for a target query. In the worst case, blindly adding an index for every query performance problem encountered may further degrade performance, since indexes do not come cheap either. There can be cases where the planner chooses a different, non-index plan to execute a query even though an index exists that matches the query filter and/or the order-by clause. The best tool for investigating this is the &lt;code&gt;EXPLAIN&lt;/code&gt; command. To demonstrate a trivial scenario:&lt;/p&gt;

&lt;p&gt;First we create a table and an index&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# CREATE TABLE foo(id1 INT, id2 INT, id3 INT, id4 INT, descr TEXT);  -- new table
postgres=# INSERT INTO foo SELECT i, i*3, i+i, i*2, 'hello' || i FROM generate_series(1, 10000000) i; -- 10M records
postgres=# CREATE INDEX idx_id1_id2_id3 ON foo(id1, id2, id3);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run &lt;code&gt;EXPLAIN&lt;/code&gt; on a query&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM foo WHERE id1 &amp;gt; 1000;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The path selected by the planner is shown below&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Seq Scan on foo  (cost=0.00..198530.00 rows=9999027 width=28) (actual time=0.283..2260.973 rows=9999000 loops=1)
   Filter: (id1 &amp;gt; 1000)
   Rows Removed by Filter: 1000
   Buffers: shared hit=16274 read=57256
 Planning:
   Buffers: shared read=4
 Planning Time: 0.386 ms
 Execution Time: 2906.885 ms
(8 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The query is looking for all the records where &lt;code&gt;id1&lt;/code&gt; field value is greater than 1000. &lt;/p&gt;

&lt;p&gt;Let's dig into the details of the plan: the planner picks a sequential scan as the cheapest path for executing the query. The sequential scan has zero startup cost, and the total cost of executing the query is estimated at 198530.00. The number of rows to be fetched is estimated at 9999027, while the real number of rows retrieved is 9999000(very close). A curious mind might wonder why an index scan is ignored when there is an index on the &lt;code&gt;id1&lt;/code&gt; field. When using an index, the database reads the index structure first and then, if the data to be returned is not in the index, reads the relevant data pages from disk. This means the database would potentially have to perform 2 disk IOs for each index entry: one in the index file and another in the heap to get the data. This can be really expensive when many rows are returned, and it adds to the cost of an index scan. A sequential scan only needs a single disk read per page that is not already in the buffer pool. Often the index fits entirely in memory, which speeds up index lookups.&lt;/p&gt;
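&lt;p&gt;One way to avoid the per-row heap lookup entirely is an index-only scan: when the query touches only columns stored in the index(and the visibility map is largely set), Postgres can answer from the index alone. A hypothetical query against the same table -- whether the planner actually picks an index-only scan still depends on the costs:&lt;/p&gt;

```sql
-- Only indexed columns are selected, so the heap fetch can be skipped
EXPLAIN (ANALYZE, BUFFERS) SELECT id1, id2, id3 FROM foo WHERE id1 &gt; 1000;
```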

&lt;p&gt;The &lt;code&gt;EXPLAIN&lt;/code&gt; command can only show the cheapest plan chosen by the planner out of all the plans that were considered. If the user needs to investigate why a particular plan was not chosen, they are limited in what they can get from the &lt;code&gt;EXPLAIN&lt;/code&gt; command, because the planner throws away the plans it considers inferior and those plans are never made available for debugging. I made an extension, &lt;a href="https://github.com/misachi/pg_all_plans" rel="noopener noreferrer"&gt;PG_ALL_PLANS&lt;/a&gt;, to show all plans considered by the planner. Using the extension with our query above&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT * FROM show_all_plans('EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM foo WHERE id1 &amp;gt; 1000');
                                                                  query_plans
-----------------------------------------------------------------------------------------------------------------------------------------------
 -------------------------------Plan 1-------------------------------
 Seq Scan on foo  (cost=0.00..198530.00 rows=9999027 width=28) (actual time=0.250..2603.390 rows=9999000 loops=1)
   Filter: (id1 &amp;gt; 1000)
   Rows Removed by Filter: 1000
   Buffers: shared hit=16182 read=57348
 Planning:
   Buffers: shared hit=47 read=4 dirtied=3
 Planning Time: 0.939 ms
 Execution Time: 3263.912 ms

 -------------------------------Plan 2-------------------------------
 Index Scan using idx_id1_id2_id3 on foo  (cost=0.43..498815.28 rows=9999027 width=28) (actual time=0.444..3309.395 rows=9999000 loops=1)
   Index Cond: (id1 &amp;gt; 1000)
   Buffers: shared hit=2 read=111835 written=8120
 Planning:
   Buffers: shared hit=16229 read=57352 dirtied=3
 Planning Time: 0.947 ms
 Execution Time: 3981.410 ms

 -------------------------------Plan 3-------------------------------
 Bitmap Heap Scan on foo  (cost=231288.89..429813.47 rows=9999027 width=28) (actual time=1319.351..3678.529 rows=9999000 loops=1)
   Recheck Cond: (id1 &amp;gt; 1000)
   Rows Removed by Index Recheck: 35
   Heap Blocks: exact=40497 lossy=33026
   Buffers: shared read=111837
   -&amp;gt;  Bitmap Index Scan on idx_id1_id2_id3  (cost=0.00..228789.14 rows=9999027 width=0) (actual time=1301.783..1301.783 rows=9999000 loops=1)
         Index Cond: (id1 &amp;gt; 1000)
         Buffers: shared read=38314
 Planning:
   Buffers: shared hit=16231 read=169190 dirtied=3 written=8120
 Planning Time: 0.951 ms
 Execution Time: 4338.259 ms

 -------------------------------Plan 4-------------------------------
 Gather  (cost=1000.00..1126516.03 rows=9999027 width=28) (actual time=0.524..1658.062 rows=9999000 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=16228 read=57302
   -&amp;gt;  Parallel Seq Scan on foo  (cost=0.00..125613.33 rows=4166261 width=28) (actual time=0.118..917.219 rows=3333000 loops=3)
         Filter: (id1 &amp;gt; 1000)
         Rows Removed by Filter: 333
         Buffers: shared hit=16228 read=57302
 Planning:
   Buffers: shared hit=16231 read=281027 dirtied=3 written=8120
 Planning Time: 0.952 ms
 Execution Time: 2434.354 ms

(47 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are 4 plans displayed now, including the plain sequential scan plan; the others are the plans thrown away by the planner. From this, we can see that the index scan is more expensive than the sequential scan: it has a higher startup cost(0.43) and total cost(498815.28) for retrieving all the tuples. One interesting part is that the Bitmap Heap Scan has a lower total cost(429813.47) than the index scan but is considered inferior due to its higher startup cost(231288.89). The extension can be a great tool for digging deeper into the choices made by the planner when executing a query.&lt;/p&gt;
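&lt;p&gt;The startup cost matters most when only the first few rows are needed. As a hypothetical example(not run in this post), adding a &lt;code&gt;LIMIT&lt;/code&gt; makes the planner scale each path's cost by the fraction of rows actually fetched, so a path with a low startup cost, like the index scan, becomes much more attractive:&lt;/p&gt;

```sql
-- Only 10 rows are needed, so a path that can produce its first row
-- cheaply wins even if its total cost for all rows is higher
EXPLAIN SELECT * FROM foo WHERE id1 &gt; 1000 ORDER BY id1 LIMIT 10;
```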

&lt;p&gt;We can attempt to persuade the planner to pick the index scan plan by limiting the rows returned to those between 1000 and 1.5M. This makes the predicate more selective, so only a small fraction of the rows is returned.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT * FROM show_all_plans('EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM foo WHERE id1 &amp;gt; 1000 AND id1 &amp;lt; 1500000');
                                                                query_plans
--------------------------------------------------------------------------------------------------------------------------------------------
 -------------------------------Plan 1-------------------------------
 Index Scan using idx_id1_id2_id3 on foo  (cost=0.43..187892.55 rows=1498365 width=28) (actual time=0.023..548.997 rows=1498999 loops=1)
   Index Cond: ((id1 &amp;gt; 1000) AND (id1 &amp;lt; 1500000))
   Buffers: shared hit=2 read=16768
 Planning:
   Buffers: shared read=4
 Planning Time: 0.170 ms
 Execution Time: 651.542 ms

 -------------------------------Plan 2-------------------------------
 Seq Scan on foo  (cost=0.00..223530.00 rows=1498365 width=28) (actual time=0.204..2333.970 rows=1498999 loops=1)
   Filter: ((id1 &amp;gt; 1000) AND (id1 &amp;lt; 1500000))
   Rows Removed by Filter: 8501001
   Buffers: shared hit=10696 read=62834
 Planning:
   Buffers: shared hit=2 read=16772
 Planning Time: 0.174 ms
 Execution Time: 2432.406 ms

 -------------------------------Plan 3-------------------------------
 Bitmap Heap Scan on foo  (cost=38406.68..205106.49 rows=1498365 width=28) (actual time=236.667..539.696 rows=1498999 loops=1)
   Recheck Cond: ((id1 &amp;gt; 1000) AND (id1 &amp;lt; 1500000))
   Heap Blocks: exact=11023
   Buffers: shared hit=171 read=16599
   -&amp;gt;  Bitmap Index Scan on idx_id1_id2_id3  (cost=0.00..38032.08 rows=1498365 width=0) (actual time=233.444..233.445 rows=1498999 loops=1)
         Index Cond: ((id1 &amp;gt; 1000) AND (id1 &amp;lt; 1500000))
         Buffers: shared hit=171 read=5576
 Planning:
   Buffers: shared hit=10698 read=79606
 Planning Time: 0.182 ms
 Execution Time: 638.010 ms

 -------------------------------Plan 4-------------------------------
 Gather  (cost=1000.00..286866.50 rows=1498365 width=28) (actual time=0.518..2226.432 rows=1498999 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=11023 read=62507
   -&amp;gt;  Parallel Seq Scan on foo  (cost=0.00..136030.00 rows=624319 width=28) (actual time=0.093..805.573 rows=499666 loops=3)
         Filter: ((id1 &amp;gt; 1000) AND (id1 &amp;lt; 1500000))
         Rows Removed by Filter: 2833667
         Buffers: shared hit=11023 read=62507
 Planning:
   Buffers: shared hit=10869 read=96205
 Planning Time: 0.182 ms
 Execution Time: 2349.584 ms

(46 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the planner chooses the index scan as it has a lower total cost(187892.55) than the sequential plan(223530.00). The index scan, despite potentially doing 2 IOs per entry, still manages to be cheaper than the sequential scan, mainly due to the reduced number of rows it has to scan(1498365 compared to 9999027) and hence the reduced number of page reads. Also, the sequential scan reads a large number of rows which it ends up throwing away when filtering the results(8501001 rows).&lt;/p&gt;

&lt;p&gt;It is good practice to log and analyze slow queries to find the performance bottlenecks before adding an index or applying any other technique. Always check whether your indexes are being &lt;a href="https://www.cybertec-postgresql.com/en/get-rid-of-your-unused-indexes/" rel="noopener noreferrer"&gt;used at all&lt;/a&gt;. Indexes occupy disk space and can degrade performance, especially for write-intensive workloads, so it only makes sense to add an index to a relation if it is used in the execution of the relevant queries. &lt;/p&gt;
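&lt;p&gt;A quick way to spot candidates for removal(a generic catalog query, not from this post) is the &lt;code&gt;pg_stat_user_indexes&lt;/code&gt; view, where &lt;code&gt;idx_scan&lt;/code&gt; counts how often each index has been used:&lt;/p&gt;

```sql
-- Indexes that have never been scanned since statistics were last reset
SELECT relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0;
```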

</description>
      <category>database</category>
      <category>postgres</category>
      <category>index</category>
    </item>
    <item>
      <title>Making Postgres better with OrioleDB</title>
      <dc:creator>Brian Misachi</dc:creator>
      <pubDate>Fri, 28 Feb 2025 23:10:38 +0000</pubDate>
      <link>https://dev.to/misachi/making-postgres-better-with-orioledb-49cp</link>
      <guid>https://dev.to/misachi/making-postgres-better-with-orioledb-49cp</guid>
      <description>&lt;p&gt;One of the biggest advantages Postgres has over lots of other DBMSs out there is the fact that it allows external improvements or new features, without requiring any interactions with the core codebase, via extensions. There are many extensions each with a unique purpose(I have also written two trivial extensions &lt;a href="https://github.com/misachi/pg_wal_ext" rel="noopener noreferrer"&gt;pg_wal_ext&lt;/a&gt; and &lt;a href="https://github.com/misachi/pg_table_bloat" rel="noopener noreferrer"&gt;pg_table_bloat&lt;/a&gt; )&lt;/p&gt;

&lt;p&gt;One extension that especially stands out is &lt;a href="https://www.orioledb.com/" rel="noopener noreferrer"&gt;OrioleDB&lt;/a&gt;. It provides an alternative storage engine for Postgres. Postgres comes with only one storage engine, based on the heap. This is different from MySQL, which ships with several storage engines: InnoDB, MyISAM etc.&lt;/p&gt;

&lt;p&gt;Postgres's heap-based storage engine works well in most cases, but it also presents a number of issues, some of which are: &lt;a href="https://www.cs.cmu.edu/~pavlo/blog/2023/04/the-part-of-postgresql-we-hate-the-most.html" rel="noopener noreferrer"&gt;bloat introduced by how updates are handled&lt;/a&gt;, &lt;a href="https://www.postgresql.org/docs/17/sql-vacuum.html" rel="noopener noreferrer"&gt;the need for garbage collection (vacuum)&lt;/a&gt; and &lt;a href="https://www.cybertec-postgresql.com/en/transaction-id-wraparound-a-walk-on-the-wild-side/" rel="noopener noreferrer"&gt;transaction ID wraparound&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;OrioleDB comes with the promise of solving the issues presented by the Postgres heap. I tested it out and the results were pretty good. I'll describe the process I used and the results below.&lt;/p&gt;

&lt;p&gt;First you need a patched version of Postgres: &lt;a href="https://github.com/orioledb/postgres/archive/refs/tags/patches16_33.tar.gz" rel="noopener noreferrer"&gt;16 (tag: patches16_33)&lt;/a&gt; or &lt;a href="https://github.com/orioledb/postgres/archive/refs/tags/patches17_5.tar.gz" rel="noopener noreferrer"&gt;17 (tag: patches17_5)&lt;/a&gt;. The build steps are available on &lt;a href="https://github.com/orioledb/orioledb?tab=readme-ov-file#build-from-source" rel="noopener noreferrer"&gt;github&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The test is a read-only workload on 16 tables, each with 25M records (approx 94GB for each setup). I used a server with 64GB RAM, 20 cores and a 500GB NVMe SSD (ext4 filesystem). I use my setup &lt;a href="https://github.com/misachi/postgres-scripts" rel="noopener noreferrer"&gt;scripts&lt;/a&gt; to build and install Postgres. The benchmarking tool used is &lt;a href="https://github.com/akopytov/sysbench?tab=readme-ov-file#building-and-installing-from-source" rel="noopener noreferrer"&gt;sysbench&lt;/a&gt;. OrioleDB is built from source from the main branch at commit &lt;code&gt;0c484c4&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The install steps for OrioleDB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Install dependencies
apt update
apt install python3 python3-dev python3-pip python3-setuptools python3-testresources libzstd1 libzstd-dev libssl-dev libcurl4-openssl-dev

git clone https://github.com/orioledb/orioledb
cd orioledb
git reset --hard 0c484c4 # optional
make USE_PGXS=1 ORIOLEDB_PATCHSET_VERSION=5
echo "shared_preload_libraries = 'orioledb.so'" &amp;gt;&amp;gt; /usr/local/pgsql/data/postgresql.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The buffer pool for both setups is set to 16GB, i.e. &lt;code&gt;shared_buffers&lt;/code&gt; for heap and &lt;code&gt;orioledb.main_buffers&lt;/code&gt; for OrioleDB.&lt;/p&gt;
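&lt;p&gt;For parity, the relevant lines in each setup's &lt;code&gt;postgresql.conf&lt;/code&gt; would look like this (a sketch; only the 16GB figure is from the test above):&lt;/p&gt;

```
# heap setup
shared_buffers = '16GB'

# OrioleDB setup
orioledb.main_buffers = '16GB'
```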

&lt;p&gt;To run the actual tests, I used &lt;a href="https://github.com/misachi/sysbench-graphing-tests" rel="noopener noreferrer"&gt;https://github.com/misachi/sysbench-graphing-tests&lt;/a&gt;. Once everything has been set up, running &lt;code&gt;./run.threads &amp;lt;run number&amp;gt; pgsql&lt;/code&gt; should execute the tests. The script ensures the database is warmed up before running the tests.&lt;/p&gt;

&lt;p&gt;I did 2 test runs for each setup: 2 runs of Postgres heap and 2 runs of Postgres with the OrioleDB extension. The results are shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8pynmoo67y2h0i5ois15.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8pynmoo67y2h0i5ois15.png" alt="result" width="800" height="799"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Runs labelled &lt;code&gt;out/res2&lt;/code&gt; and &lt;code&gt;out/res3&lt;/code&gt; show results for normal Postgres with the heap-based storage engine, while &lt;code&gt;res4&lt;/code&gt; and &lt;code&gt;res5&lt;/code&gt; show results for Postgres with OrioleDB.&lt;/p&gt;

&lt;p&gt;The impressive bit: OrioleDB outperforms the Postgres heap while using less CPU and memory. This can be attributed partly to OrioleDB's lock-free page reads (and thus reduced contention) and its index-organized tables.&lt;/p&gt;

&lt;p&gt;Memory usage(MB) for Postgres heap&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvs2zkbrur47doz1v4rom.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvs2zkbrur47doz1v4rom.png" alt="heap" width="800" height="796"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CPU usage for Postgres heap&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fao426cjbdg4mvwvkhwm6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fao426cjbdg4mvwvkhwm6.png" alt="heap" width="800" height="792"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Memory usage(MB) for OrioleDB&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fydwjnukzvlo7r5i3r4vi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fydwjnukzvlo7r5i3r4vi.png" alt="Orioledb" width="800" height="801"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CPU usage for OrioleDB&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqyse13lpg312l4mxijv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqyse13lpg312l4mxijv.png" alt="Orioledb" width="800" height="804"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>storageengines</category>
    </item>
    <item>
      <title>Overwriting data</title>
      <dc:creator>Brian Misachi</dc:creator>
      <pubDate>Thu, 14 Nov 2024 09:56:32 +0000</pubDate>
      <link>https://dev.to/misachi/page-sized-writes-58m6</link>
      <guid>https://dev.to/misachi/page-sized-writes-58m6</guid>
      <description>&lt;p&gt;Ensuring durability when performing writes is not as trivial as &lt;a href="https://dev.to/misachi/durable-writes-4j5f"&gt;it may seem&lt;/a&gt;. If not handled well, it might lead to data loss. There is another thing that needs to be taken care of when performing writes: page alignment.&lt;/p&gt;

&lt;p&gt;On most Linux-based operating systems, data is operated on in units of pages. The typical size of a single page is 4 kilobytes (4096 bytes), although the physical storage device (e.g. the SSD) might operate on physical blocks of a different size. All requests (read/write) are handled in page sizes (4KB) by the kernel. For example, a request to read 10 bytes will eventually result in a read of 4096 bytes. The excess 4086 bytes are wasted if not needed.&lt;/p&gt;
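&lt;p&gt;The page-granular behaviour is easy to model: whatever byte range a request touches gets expanded to whole pages. A minimal sketch in Python (assuming the typical 4096-byte page; verify yours with &lt;code&gt;getconf PAGESIZE&lt;/code&gt;):&lt;/p&gt;

```python
# Model of the kernel's page-granular I/O: a request covering
# [offset, offset + length) is expanded to whole pages.
PAGE_SIZE = 4096  # typical page size; verify with `getconf PAGESIZE`

def pages_touched(offset, length):
    first = offset // PAGE_SIZE                            # first page index
    last = (offset + length + PAGE_SIZE - 1) // PAGE_SIZE  # one past the last
    return last - first

print(pages_touched(0, 10))     # 1: a 10-byte read costs a full 4096-byte page
print(pages_touched(4090, 10))  # 2: a tiny write straddling a boundary costs two
```

&lt;p&gt;Note that a small write straddling a page boundary makes the kernel touch two full pages, which is part of why unaligned overwrites are expensive.&lt;/p&gt;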

&lt;p&gt;With this understanding, I set out to test the effects of writing data that is not aligned to the page size (less than 4KB). It turns out that overwriting data with small, unaligned writes can be very slow. It's important to mention that the tests were done on a limited machine: a 5GB-memory, 80GB-disk, 8-core Ubuntu Jammy VM, using the ext4 filesystem on kernel version 5.15.0-124-generic.&lt;/p&gt;

&lt;p&gt;There were no major differences when writing new data to a file for aligned and unaligned writes (~2.1 secs). Overwriting existing data in a file gave more interesting results. I used the code &lt;a href="https://github.com/misachi/file_io" rel="noopener noreferrer"&gt;here&lt;/a&gt; for the tests.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;There is a 5 second sleep added in both tests. I added it to get enough profiling time, as the tests ended before the profiler could start. With this, 5 seconds can be shaved off the real time of both tests, meaning 6 seconds of real time for aligned writes and 2 minutes 27 seconds for unaligned writes. I'm sure there is a better way to do it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Aligned writes results(run with &lt;code&gt;#define IS_ALIGNED_PAGE 1&lt;/code&gt;)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ time make stalls

gcc -g main.c -o ~/.tmp/file_io/stalls.o
~/.tmp/file_io/stalls.o
Time Taken: 6.195574 seconds

real    0m10.954s
user    0m0.315s
sys     0m6.003s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looking at the result for aligned writes, it takes 6.3s in total to complete. Of this, 0.3s is spent in user land, with 6s spent by the system in kernel space (syscalls). To make sense of the results, I profiled and collected the stack traces for analysis. The call stacks can be visualized in the &lt;a href="https://www.brendangregg.com/flamegraphs.html" rel="noopener noreferrer"&gt;flamegraphs&lt;/a&gt; shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhypeiam6hoe8oxkzbegt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhypeiam6hoe8oxkzbegt.png" alt="page aligned write flamegraph" width="800" height="328"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Of the 7974536us duration, about 29% is used for servicing fsync calls. The thin tall tower to the far right is the visualized stack for the write calls. We can see the time is mostly dominated by fsync calls. Ignore the futex section, which is mostly time consumed during the sleep explained above.&lt;/p&gt;

&lt;p&gt;Pages are being written to disk with the &lt;code&gt;ext4_writepages&lt;/code&gt; function -- the filesystem's function for writing dirty pages back to disk. Since we are writing in page-size (4KB) units, there is no need to read the pages from disk before updating them and writing them back to disk again. The section issuing page writes is highlighted below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikbu7bhs0iqcrqw4ecvl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikbu7bhs0iqcrqw4ecvl.png" alt="page aligned write flamegraph" width="800" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unaligned writes results(run with &lt;code&gt;#define IS_ALIGNED_PAGE 0&lt;/code&gt;)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ time make stalls

gcc -g main.c -o ~/.tmp/file_io/stalls.o
~/.tmp/file_io/stalls.o
Time Taken: 33.140095 seconds

real    2m32.503s
user    0m1.326s
sys     0m31.955s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;About 1 minute 50 seconds cannot be accounted for. This is probably time spent blocking while performing read I/O. Let's see if the profiling data supports this.&lt;/p&gt;
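&lt;p&gt;The rough arithmetic behind that figure, with the times rounded from the &lt;code&gt;time&lt;/code&gt; output above and the 5 second profiler sleep subtracted (the rounding means it lands near, not exactly on, that estimate):&lt;/p&gt;

```python
# Wall-clock accounting for the unaligned run (seconds, rounded from
# the `time` output above). Whatever real time is not spent on the CPU
# or sleeping is time the process was blocked, mostly on read I/O.
real_time = 2 * 60 + 32.5  # 2m32.503s
user_time = 1.3            # 0m1.326s
sys_time = 32.0            # 0m31.955s
sleep_time = 5.0           # profiler warm-up sleep

blocked = real_time - user_time - sys_time - sleep_time
print(round(blocked))  # 114: roughly two minutes spent off-CPU
```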

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8v0au22y2a208aswz94e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8v0au22y2a208aswz94e.png" alt="unaligned write flamegraph" width="800" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For unaligned writes, the stacks are dominated by page reads. This happens because we are issuing writes in sizes smaller than the page size. The filesystem detects this and has to read the full page from disk and update it in memory before writing it back to disk, leading to write stalls. This can be seen in the function &lt;code&gt;__wait_on_buffer&lt;/code&gt;: writes wait on page blocks being read, causing increased latency.&lt;/p&gt;

&lt;p&gt;The thin tall tower to the far left is the stack for the fsync call. Fsync calls flush dirty pages to disk.&lt;br&gt;
The waiting on reads can be clearly seen as shown&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqpddb3qaf7jzydv9ik0s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqpddb3qaf7jzydv9ik0s.png" alt="unaligned write flamegraph" width="800" height="269"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The filesystem code attempts to write to the page but checks whether it is writing a full page or just a portion of it. If it is not a full page, a page read request is issued. Depending on the page size and disk block size, this can be a single request (pagesize == blocksize) or multiple requests (pagesize &amp;gt; blocksize). The code for the &lt;code&gt;ext4_block_write_begin&lt;/code&gt; function is shown below. Parts of the function have been removed to highlight the important bits.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;static int ext4_block_write_begin(struct page *page, loff_t pos, unsigned len,
                  get_block_t *get_block)
{
    /* Code removed here for clarity */

    for (bh = head, block_start = 0; bh != head || !block_start;
        block++, block_start = block_end, bh = bh-&amp;gt;b_this_page) {

        /* Code removed here for clarity */

        if (!buffer_mapped(bh)) {
            WARN_ON(bh-&amp;gt;b_size != blocksize);
            err = get_block(inode, block, bh, 1);
            if (err)
                break;
            if (buffer_new(bh)) {
                if (PageUptodate(page)) {
                    clear_buffer_new(bh);
                    set_buffer_uptodate(bh);
                    mark_buffer_dirty(bh);
                    continue;
                }
                if (block_end &amp;gt; to || block_start &amp;lt; from)
                    zero_user_segments(page, to, block_end,
                               block_start, from);
                continue;
            }
        }
        if (PageUptodate(page)) {
            set_buffer_uptodate(bh);
            continue;
        }
        if (!buffer_uptodate(bh) &amp;amp;&amp;amp; !buffer_delay(bh) &amp;amp;&amp;amp;
            !buffer_unwritten(bh) &amp;amp;&amp;amp;
            (block_start &amp;lt; from || block_end &amp;gt; to)) {
            ext4_read_bh_lock(bh, 0, false);
            wait[nr_wait++] = bh;
        }
    }
    /*
     * If we issued read requests, let them complete.
     */
    for (i = 0; i &amp;lt; nr_wait; i++) {
        wait_on_buffer(wait[i]);
        if (!buffer_uptodate(wait[i]))
            err = -EIO;
    }

    /* Code removed here for clarity */

    return err;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The check for size is done at this section&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if (!buffer_uptodate(bh) &amp;amp;&amp;amp; !buffer_delay(bh) &amp;amp;&amp;amp;
            !buffer_unwritten(bh) &amp;amp;&amp;amp;
            (block_start &amp;lt; from || block_end &amp;gt; to)) 
{
    ext4_read_bh_lock(bh, 0, false);
    wait[nr_wait++] = bh;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the write is smaller than the required size, a disk read request is initiated in the call &lt;code&gt;ext4_read_bh_lock(bh, 0, false)&lt;/code&gt;.&lt;br&gt;
Later on, the read request is waited on in &lt;code&gt;wait_on_buffer&lt;/code&gt; if it hasn't yet completed. All pending read requests for the disk blocks of the particular page are waited on before processing can continue. See the code snippet below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for (i = 0; i &amp;lt; nr_wait; i++) {
    wait_on_buffer(wait[i]);
    if (!buffer_uptodate(wait[i]))
        err = -EIO;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code is from kernel v5.15.121.&lt;/p&gt;

&lt;p&gt;The key takeaway: overwriting data can be inefficient and take longer if it isn't done in chunks that match the size of the system's memory pages (typically 4KB). Working in units that don't match the page size can result in extra processing overhead, slowing down system performance.&lt;/p&gt;

&lt;p&gt;I've also developed a basic &lt;a href="https://github.com/misachi/ext0fs" rel="noopener noreferrer"&gt;filesystem&lt;/a&gt; that can help you get familiar with some filesystem concepts.&lt;/p&gt;

</description>
      <category>linux</category>
      <category>filesystems</category>
    </item>
    <item>
      <title>Durable writes</title>
      <dc:creator>Brian Misachi</dc:creator>
      <pubDate>Wed, 16 Oct 2024 18:08:24 +0000</pubDate>
      <link>https://dev.to/misachi/durable-writes-4j5f</link>
      <guid>https://dev.to/misachi/durable-writes-4j5f</guid>
      <description>&lt;p&gt;When writing data to a file, one of the things you have to deal with is ensuring the data is reliably persisted to a non-volatile storage device. Storage devices like magnetic disks(HDD), SSDs, persistent memory(NV RAM) etc. are used to offer some form of durable storage, such that once data is stored in the device it is "guaranteed" to exist even after an outage(e.g. power loss). The storage device can be local to your system or on a network to be accessed remotely.&lt;/p&gt;

&lt;p&gt;When updating a file using the &lt;code&gt;write&lt;/code&gt; system call, an easy mistake to make is to assume the data is immediately "saved" to permanent storage. Is this assumption correct? Well...it depends :). I will attempt to answer the question by describing what happens when a user attempts to &lt;code&gt;write&lt;/code&gt; to a file.&lt;/p&gt;

&lt;h2&gt;
  
  
  Abstractions everywhere...
&lt;/h2&gt;

&lt;p&gt;Multiple caching layers exist between user applications and the storage device. Caching can start from a language library like glibc and extend all the way to the write-back cache in the storage device, mostly for performance reasons. These caches temporarily store data before sending it to the storage device.&lt;br&gt;
Performing I/O is expensive and can easily take up to hundreds of milliseconds or even seconds (on some slower disks). This can degrade performance for user applications, as the CPU has to wait for data transfer to and from disk storage, wasting cycles while it sits idle waiting for I/O to complete. Caching keeps data in memory, closer to the CPU, reducing latency and keeping the CPU busy doing meaningful work.&lt;/p&gt;

&lt;p&gt;There are various levels of caching that the system provides as shown in the figure below&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmbvdukurwoo8h8mxvm0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmbvdukurwoo8h8mxvm0.png" alt="Cache layers" width="609" height="504"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As depicted in the diagram above, caching enables fast reads (reduced I/O to retrieve data from slow storage) and also improves write performance, since the user does not need to wait for the actual writes to the storage device to complete. Writes to disk are taken care of by background kernel threads (bdflush et al.). Interestingly, even writing the data to the storage device might not guarantee durability. How? Some storage devices have an internal write-back cache where data is stored for a short period before being flushed to disk. The heuristic used to flush the data may differ between disk vendors and is opaque to the kernel: even if the kernel issues a sync command to the device, it can't guarantee that the data is on permanent storage when the write returns.&lt;/p&gt;

&lt;p&gt;With this, a scenario can occur where a user issues a write and the system experiences a power loss before the data reaches the storage device. At this point, the data could still be in one of the cache layers and there is a possibility for data loss. This means that after the system is recovered, some of the data the user thought was "stored to durable storage" might not be available.&lt;/p&gt;

&lt;h2&gt;
  
  
  What now?
&lt;/h2&gt;

&lt;p&gt;This does not mean there is nothing we can do to reduce the chances of data loss. The scenario above mostly describes buffered writes. One possible way to reduce the possibility of data loss is using unbuffered writes, bypassing the page cache and interfacing "directly" with the underlying storage device. I use "directly" loosely here since this is not usually the case as you still need to go through the filesystem when opening/writing to a file. Unbuffered writes can be done by &lt;a href="https://man7.org/linux/man-pages/man2/open.2.html" rel="noopener noreferrer"&gt;opening&lt;/a&gt; the file  with &lt;code&gt;O_SYNC&lt;/code&gt; and &lt;code&gt;O_DIRECT&lt;/code&gt; flags.&lt;br&gt;
The type of filesystem being used also matters -- &lt;a href="https://opensource.com/article/17/5/introduction-ext4-filesystem" rel="noopener noreferrer"&gt;ext4 &lt;/a&gt; filesystem behavior may be different from &lt;a href="https://openzfs.org/wiki/Main_Page" rel="noopener noreferrer"&gt;zfs&lt;/a&gt;, for example when handling the flags above and general handling of writes(&lt;code&gt;O_DIRECT&lt;/code&gt; flag is not yet supported by some filesystems).&lt;/p&gt;

&lt;p&gt;Calling &lt;code&gt;fsync&lt;/code&gt; or &lt;code&gt;fdatasync&lt;/code&gt; ensures data is flushed at least to the storage device. The difference is that &lt;code&gt;fsync&lt;/code&gt; flushes both data and metadata to the underlying storage device. The data is what the user wants to write to the file, while the metadata is the details about the file, such as &lt;a href="https://en.wikipedia.org/wiki/Inode#:~:text=The%20inode%20(index%20node)%20is,locations%20of%20the%20object's%20data." rel="noopener noreferrer"&gt;inode&lt;/a&gt; data like the file modification time, file size etc. Because &lt;code&gt;fdatasync&lt;/code&gt; only flushes the file data, it performs better than &lt;code&gt;fsync&lt;/code&gt;, with the caveat that it might cause failures during recovery due to missing or incorrect metadata.&lt;/p&gt;

&lt;p&gt;However, &lt;code&gt;fsync&lt;/code&gt; and &lt;code&gt;fdatasync&lt;/code&gt; are synchronous system calls and should be used sparingly for performance reasons, since the application has to wait for the data to be flushed out to disk. One way to optimize this is to batch writes and sync at intervals, combining multiple write/sync calls into a single syscall.&lt;/p&gt;
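&lt;p&gt;A minimal sketch of the batching idea in Python, where &lt;code&gt;os.write&lt;/code&gt; and &lt;code&gt;os.fsync&lt;/code&gt; map directly to the underlying syscalls: buffer several logical records, then issue one write and one fsync for the whole batch instead of a syscall pair per record:&lt;/p&gt;

```python
import os
import tempfile

def write_batch(fd, records):
    # One write syscall and one flush for the whole batch, instead of
    # a write + fsync pair per record.
    os.write(fd, b"".join(records))
    os.fsync(fd)  # data (and metadata) pushed at least to the storage device

fd, path = tempfile.mkstemp()
try:
    write_batch(fd, [b"rec1", b"rec2", b"rec3"])
    print(os.path.getsize(path))  # 12: all three records persisted
finally:
    os.close(fd)
    os.unlink(path)
```

&lt;p&gt;The trade-off is the window between batches: records buffered but not yet synced can still be lost on a crash, so the batch interval bounds how much data you are willing to lose.&lt;/p&gt;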

&lt;p&gt;So, back to the question above: Is it correct to assume data is written to the storage device after a &lt;code&gt;write&lt;/code&gt; call? The answer still remains...it depends. &lt;/p&gt;

</description>
      <category>linux</category>
      <category>filesystems</category>
    </item>
    <item>
      <title>Debugging Postgres</title>
      <dc:creator>Brian Misachi</dc:creator>
      <pubDate>Tue, 08 Oct 2024 08:43:35 +0000</pubDate>
      <link>https://dev.to/misachi/debugging-postgres-3cll</link>
      <guid>https://dev.to/misachi/debugging-postgres-3cll</guid>
      <description>&lt;p&gt;Debugging Postgres can be daunting at first, due to the size and complexity of the project. Luckily, there are tools that can help make the process bearable. One such tool is the debugger: &lt;em&gt;A debugger or debugging tool is a computer program used to test and debug other programs (the "target" program)&lt;/em&gt; [&lt;a href="https://en.wikipedia.org/wiki/Debugger" rel="noopener noreferrer"&gt;Wikipedia&lt;/a&gt;].&lt;/p&gt;

&lt;p&gt;This post is about using a debugger to achieve two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Exploring the Postgres codebase&lt;/li&gt;
&lt;li&gt;Tracking bugs in Postgres&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;First, you'll need to set up Postgres as described &lt;a href="https://dev.to/misachi/developing-with-postgres-4ej4"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Exploring the Postgres codebase
&lt;/h2&gt;

&lt;p&gt;When working on a large project such as Postgres, you might want to follow particular code paths in order to understand how some functionality is implemented from start to end. For example, I want to follow the function calls for insertions in the database -- which parts of the code are called from the point of issuing an INSERT query to when I get back the query results.&lt;br&gt;
Debuggers come in handy for this use case, compared to looking through the codebase in your editor on your own, trying to figure out which function called another.&lt;/p&gt;

&lt;p&gt;I will be using the GNU Debugger(GDB) for this post. You can use any debugger you are most comfortable with.&lt;/p&gt;

&lt;p&gt;Let's get started by creating a dummy database and table to insert data into&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# -- Creating test database
postgres=# CREATE DATABASE testdb;
CREATE DATABASE
postgres=# -- Connect to testdb
postgres=# \c testdb
You are now connected to database "testdb" as user "postgres".
testdb=# -- Create test table for inserting to
testdb=# CREATE TABLE test_tbl(id SERIAL PRIMARY KEY, name TEXT);
CREATE TABLE
testdb=#
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we need to get the process ID of the backend serving our client with this command&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ps -faux | grep postgres
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This should display something that looks like the following&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
postgres      11  0.0  0.3 161752 18028 ?        Ss   06:50   0:00 /usr/local/pgsql/bin/postgres -D /usr/local/pgsql/data
postgres     359  0.0  0.4 162000 20628 ?        Ss   11:32   0:00  \_ postgres: checkpointer
postgres     360  0.0  0.0 161904  4368 ?        Ss   11:32   0:00  \_ postgres: background writer
postgres     361  0.0  0.1 161752  7600 ?        Ss   11:32   0:00  \_ postgres: walwriter
postgres     362  0.0  0.1 163384  6896 ?        Ss   11:32   0:00  \_ postgres: autovacuum launcher
postgres     363  0.0  0.1 163332  5360 ?        Ss   11:32   0:00  \_ postgres: logical replication launcher
postgres     373  0.0  0.3 164484 17728 ?        Ss   11:49   0:00  \_ postgres: postgres testdb [local] idle
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From the above, the Postgres process we are interested in is the last one, &lt;code&gt;postgres testdb [local] idle&lt;/code&gt;, with Process ID (PID) 373.&lt;br&gt;
The other processes shown are simply Postgres backends tasked with various background functions, e.g. checkpointing data, writing WAL data from buffers to disk, vacuuming bloated tables etc. -- we are not interested in these for this post.&lt;/p&gt;

&lt;p&gt;I can now start to trace an insert statement. I will have 2 terminals open. One for running the client(psql) and another for hooking the debugger to the client process.&lt;br&gt;
In the terminal for running the debugger, I will run the following command &lt;code&gt;gdb &amp;lt;command&amp;gt;&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gdb /usr/local/pgsql/bin/postgres
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This should launch gdb using the Postgres executable. This will open up the gdb prompt on to the screen.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/local/pgsql/bin/postgres...
(gdb) 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Attach to the PID we got earlier -- 373.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(gdb) attach 373
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the terminal running the client(psql), I will run the INSERT command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;testdb=# -- Inserting an apple :)
testdb=# INSERT INTO test_tbl(name) VALUES ('apples');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The client will then freeze as the debugger takes control.&lt;/p&gt;

&lt;p&gt;I will switch to the debugger terminal and begin stepping through the code. First I need to determine which function to break/stop at. For this example I will use the &lt;code&gt;heap_insert&lt;/code&gt; function as the breakpoint since I am doing an INSERT, so I know for sure the data will need to be inserted into the heap (think of the heap as the disk space where table data is stored in Postgres).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(gdb) b heap_insert
Breakpoint 1 at 0x56040e35e7b1: file heapam.c, line 1819.
(gdb) c
Continuing.

Breakpoint 1, heap_insert (relation=0x7f1a6e5ca360, tup=0x560410b68e10, cid=0, options=0, bistate=0x0) at heapam.c:1819
1819    {
(gdb) bt
#0  heap_insert (relation=0x7f1a6e5ca360, tup=0x560410b68e10, cid=0, options=0, bistate=0x0) at heapam.c:1819
#1  0x000056040e36d1b3 in heapam_tuple_insert (relation=0x7f1a6e5ca360, slot=0x560410b68d08, cid=0, options=0, bistate=0x0) at heapam_handler.c:252
#2  0x000056040e5feb8c in table_tuple_insert (rel=0x7f1a6e5ca360, slot=0x560410b68d08, cid=0, options=0, bistate=0x0) at ../../../src/include/access/tableam.h:1400
#3  0x000056040e60090d in ExecInsert (context=0x7ffd763931e0, resultRelInfo=0x560410b680b8, slot=0x560410b68d08, canSetTag=true, inserted_tuple=0x0, insert_destrel=0x0) at nodeModifyTable.c:1133
#4  0x000056040e604acd in ExecModifyTable (pstate=0x560410b67eb0) at nodeModifyTable.c:3790
#5  0x000056040e5c8a49 in ExecProcNodeFirst (node=0x560410b67eb0) at execProcnode.c:464
#6  0x000056040e5bcd58 in ExecProcNode (node=0x560410b67eb0) at ../../../src/include/executor/executor.h:273
#7  0x000056040e5bf68f in ExecutePlan (estate=0x560410b67c68, planstate=0x560410b67eb0, use_parallel_mode=false, operation=CMD_INSERT, sendTuples=false, numberTuples=0, direction=ForwardScanDirection,
    dest=0x560410b64ee8, execute_once=true) at execMain.c:1670
#8  0x000056040e5bd2ec in standard_ExecutorRun (queryDesc=0x560410b5a038, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:365
#9  0x000056040e5bd173 in ExecutorRun (queryDesc=0x560410b5a038, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:309
#10 0x000056040e844858 in ProcessQuery (plan=0x560410b64d98, sourceText=0x560410a67bb8 "INSERT INTO test_tbl(name) VALUES ('apples');", params=0x0, queryEnv=0x0, dest=0x560410b64ee8, qc=0x7ffd76393640)
    at pquery.c:160
#11 0x000056040e8461f5 in PortalRunMulti (portal=0x560410ae4fe8, isTopLevel=true, setHoldSnapshot=false, dest=0x560410b64ee8, altdest=0x560410b64ee8, qc=0x7ffd76393640) at pquery.c:1277
#12 0x000056040e84578b in PortalRun (portal=0x560410ae4fe8, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x560410b64ee8, altdest=0x560410b64ee8, qc=0x7ffd76393640) at pquery.c:791
#13 0x000056040e83eaa8 in exec_simple_query (query_string=0x560410a67bb8 "INSERT INTO test_tbl(name) VALUES ('apples');") at postgres.c:1272
#14 0x000056040e8437d0 in PostgresMain (dbname=0x560410a9ee80 "testdb", username=0x560410a9ee68 "postgres") at postgres.c:4652
#15 0x000056040e77bf0b in BackendRun (port=0x560410a911e0) at postmaster.c:4439
#16 0x000056040e77b797 in BackendStartup (port=0x560410a911e0) at postmaster.c:4167
#17 0x000056040e777eb3 in ServerLoop () at postmaster.c:1781
#18 0x000056040e777767 in PostmasterMain (argc=3, argv=0x560410a635b0) at postmaster.c:1465
#19 0x000056040e63bb69 in main (argc=3, argv=0x560410a635b0) at main.c:198
(gdb)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set a breakpoint with &lt;code&gt;b heap_insert&lt;/code&gt;. Next, allow execution to continue until the breakpoint is hit, with &lt;code&gt;c&lt;/code&gt;. Finally, to see how execution reached this point (the breakpoint), use &lt;code&gt;bt&lt;/code&gt; for a backtrace.&lt;/p&gt;

&lt;p&gt;Execution starts from the &lt;code&gt;main&lt;/code&gt; function and flows through the executor functions (Exec*) down to the table and heap functions. The source files and line numbers for all call sites are also shown.&lt;br&gt;
As you can see, this is a powerful tool to help find your way through complex codebases very quickly.&lt;/p&gt;
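
&lt;p&gt;Once stopped at the breakpoint, a few more standard gdb commands are useful for inspecting state and stepping onward. The session below is illustrative -- the printed values will differ from run to run, and &lt;code&gt;relation-&amp;gt;rd_rel-&amp;gt;relname&lt;/code&gt; assumes the standard Postgres &lt;code&gt;RelationData&lt;/code&gt; layout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(gdb) p relation-&amp;gt;rd_rel-&amp;gt;relname   # print the name of the table being inserted into
(gdb) n                              # execute the next line inside heap_insert
(gdb) finish                         # run until heap_insert returns to its caller
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
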
&lt;h2&gt;
  
  
  Tracking bugs in Postgres
&lt;/h2&gt;

&lt;p&gt;When a Postgres backend process encounters a problem it may crash or issue a warning, depending on the severity of the problem; the client (e.g. psql) then sees its connection drop. &lt;br&gt;
I will investigate a segmentation fault (SIGSEGV) as an example.&lt;/p&gt;

&lt;p&gt;I will be using the &lt;a href="https://github.com/misachi/pg_wal_ext" rel="noopener noreferrer"&gt;PG_WAL_EXT extension&lt;/a&gt; for this example. In the &lt;code&gt;pg_xlog_records&lt;/code&gt; function, I will comment out the allocation for the reader object defined in &lt;code&gt;XLogReaderState *xlog_reader&lt;/code&gt;. This will cause a segmentation fault when the object (&lt;code&gt;xlog_reader&lt;/code&gt;) is dereferenced, since no memory will have been allocated for it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// xlog_reader = XLogReaderAllocate(wal_segment_size, directory,
    //                                  XL_ROUTINE(.page_read = &amp;amp;read_xlog_page,
    //                                             .segment_open = wal_segment_open,
    //                                             .segment_close = wal_segment_close),
    //                                  &amp;amp;private);

    // if (xlog_reader == NULL)
    //     ereport(ERROR,
    //             (errcode(ERRCODE_OUT_OF_MEMORY),
    //              errmsg("out of memory"),
    //              errdetail("Failed while allocating a WAL reading processor.")));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This should trigger a segfault when the query is run. The client connection will be reset and a crash message shown as below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# select * from pg_xlog_records('/usr/local/pgsql/data/pg_wal/000000010000000000000001');
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
postgres=#
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
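
&lt;p&gt;Note that core files are often disabled by default. They must be enabled before the crash happens, and where the kernel writes them is system-dependent. A rough sketch for Linux (exact settings vary by distribution and container setup):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# allow core files of unlimited size in the shell that starts the server
ulimit -c unlimited

# show where the kernel writes core files (the pattern varies per system)
cat /proc/sys/kernel/core_pattern
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
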



&lt;p&gt;To debug this, I will open the debugger again using the command format &lt;code&gt;gdb &amp;lt;command&amp;gt; -c &amp;lt;core-file&amp;gt;&lt;/code&gt;, as below&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gdb /usr/local/pgsql/bin/postgres -c /tmp/core.postgres.1728338434.415
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;/usr/local/pgsql/bin/postgres&lt;/code&gt; is the path to the Postgres executable and &lt;code&gt;/tmp/core.postgres.1728338434.415&lt;/code&gt; is the path to the core dump file.&lt;/p&gt;

&lt;p&gt;The output from the command above looks something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F68qojcjj3790qgjx8rx7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F68qojcjj3790qgjx8rx7.png" alt="Command Output" width="800" height="181"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first frame (highlighted) shows the file (pg_wal.c) and the exact line (194) where the error occurred.&lt;/p&gt;
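
&lt;p&gt;From here, the faulting frame can be examined directly. These are standard gdb commands; the exact output depends on the particular crash:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(gdb) frame 0        # select the frame where the fault occurred
(gdb) list           # show the source code around the faulting line
(gdb) p xlog_reader  # inspect the pointer -- an unallocated value confirms the cause
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
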

&lt;p&gt;This is a trivial example meant to show how to go about debugging Postgres. Other issues might not be as straightforward and may require looking at multiple frames and following deep call stacks to determine the root cause of an issue with the Postgres server.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
    </item>
  </channel>
</rss>
