<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Brian Misachi</title>
    <description>The latest articles on DEV Community by Brian Misachi (@misachi).</description>
    <link>https://dev.to/misachi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2136791%2Fec1826a0-1366-4857-a2ff-9cacdeb0bcf2.jpg</url>
      <title>DEV Community: Brian Misachi</title>
      <link>https://dev.to/misachi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/misachi"/>
    <language>en</language>
    <item>
      <title>From Postgres to Iceberg</title>
      <dc:creator>Brian Misachi</dc:creator>
      <pubDate>Wed, 05 Nov 2025 17:04:33 +0000</pubDate>
      <link>https://dev.to/misachi/from-postgres-to-iceberg-g4</link>
      <guid>https://dev.to/misachi/from-postgres-to-iceberg-g4</guid>
      <description>&lt;p&gt;Modern applications generate a lot of data mostly stored in ACID compliant OLTP storage systems like Postgres, Mysql, SQLite etc. Once collected, this data needs to be queried and processed for analytics or any other  purposes to get meaningful insights for business decision-making. The data to be queried could be stored in different databases and in order to access all the data you would need to run queries on multiple different databases. This is less efficient and adds extra complexity to the system.&lt;/p&gt;

&lt;p&gt;One simple solution is a central OLTP database that collates all the data and handles all your analytical queries. It can be another Postgres instance acting as a data warehouse. This works, but it has a few issues, such as hard requirements on the shape of the data (it must be structured) and scaling limits.&lt;/p&gt;

&lt;p&gt;Data lakes offer a different solution. A data lake provides storage for all kinds of data: files in different formats (e.g. CSV, JSON, Apache Parquet, Avro, PDFs), images, videos etc. The data can be stored in its raw form and processed later (as in Extract Load Transform) for analytics or machine learning. Data lakes are usually built on object stores like S3, Google Cloud Storage, Azure Blob Storage, and Minio. The standout advantage of a data lake is the promise of virtually limitless storage and the ability to store almost anything. Once data is in the lake and the necessary transformations have been applied, it needs to be exposed to various stakeholders (machine learning engineers, analysts, "AI people", execs etc.) to derive business value from it.&lt;/p&gt;

&lt;p&gt;Data warehouses built on "traditional" OLTP databases like Postgres, which already organize data in tables and ship with built-in SQL query engines, make this part easier. For a data lake, where the data to be queried is scattered across multiple files, querying gets complicated. The data lake for the most part only gives you storage; you have to bring your own compute layer to query and process the data. Luckily, a number of solutions already exist.&lt;/p&gt;

&lt;p&gt;A common solution is to use an open table format like &lt;a href="https://iceberg.apache.org" rel="noopener noreferrer"&gt;Apache Iceberg&lt;/a&gt; (others are &lt;a href="https://delta.io/" rel="noopener noreferrer"&gt;Delta Lake&lt;/a&gt; and &lt;a href="https://hudi.apache.org/" rel="noopener noreferrer"&gt;Apache Hudi&lt;/a&gt;). With these tools you get the benefits of traditional database functionality on your data lake, i.e. ACID guarantees and transactions. The &lt;a href="https://iceberg.apache.org/spec/" rel="noopener noreferrer"&gt;Iceberg specification&lt;/a&gt; defines an open table format that lets related data, stored as separate files in a distributed storage system, be accessed as one table.&lt;/p&gt;

&lt;h3&gt;
  
  
  Iceberg Design
&lt;/h3&gt;

&lt;p&gt;The design for Iceberg is as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fad4sr9uoqi4rhdpj7ino.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fad4sr9uoqi4rhdpj7ino.png" alt="Iceberg Specification" width="800" height="826"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The design is pretty simple. The catalog holds metadata about the tables managed by Iceberg. A snapshot holds the state of the table at a point in time; any change to the table results in a new snapshot being generated. The snapshot stores the manifest list, which is a list of manifest files, and each data file is tracked by a manifest file.&lt;/p&gt;
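&lt;p&gt;To make the hierarchy concrete, here is a toy sketch in plain Python (not the actual Iceberg libraries, which store these structures as JSON and Avro files in object storage) of how a reader resolves the current snapshot down to its data files:&lt;/p&gt;

```python
# Toy model of the Iceberg metadata hierarchy (illustrative only).
from dataclasses import dataclass, field

@dataclass
class ManifestFile:
    # Each manifest tracks a set of data files (plus per-file stats in real Iceberg).
    data_files: list

@dataclass
class Snapshot:
    snapshot_id: int
    manifest_list: list  # the manifest list: a list of ManifestFile entries

@dataclass
class TableMetadata:
    current_snapshot_id: int
    snapshots: dict = field(default_factory=dict)

    def data_files_for_current_snapshot(self):
        # A reader walks: metadata -- current snapshot -- manifest list
        # -- manifests -- data files.
        snap = self.snapshots[self.current_snapshot_id]
        files = []
        for manifest in snap.manifest_list:
            files.extend(manifest.data_files)
        return files

meta = TableMetadata(current_snapshot_id=2)
meta.snapshots[1] = Snapshot(1, [ManifestFile(["f1.parquet"])])
meta.snapshots[2] = Snapshot(2, [ManifestFile(["f1.parquet"]),
                                 ManifestFile(["f2.parquet"])])
print(meta.data_files_for_current_snapshot())  # ['f1.parquet', 'f2.parquet']
```

&lt;p&gt;Note how the older snapshot stays intact: readers pinned to snapshot 1 still resolve to the original file set, which is what enables time travel.&lt;/p&gt;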

&lt;p&gt;Iceberg is able to efficiently manage large amounts of data stored in the data lake. The data layer supports storing data in open formats like &lt;a href="https://parquet.apache.org/" rel="noopener noreferrer"&gt;Apache Parquet&lt;/a&gt; or &lt;a href="https://avro.apache.org/" rel="noopener noreferrer"&gt;Avro&lt;/a&gt;. Apache Parquet is an open columnar data format for efficient data storage and retrieval, so you automatically get the benefits of columnar storage for your analytical workloads. Engines like Apache Spark, Apache Flink, Presto, and Trino can be used as the compute layer for querying and processing.&lt;/p&gt;

&lt;p&gt;There are two methods for getting data into the data lake: batching or streaming directly from the source. Batching is less complicated and can be scheduled, say, nightly. Streaming, on the other hand, is more involved and requires more effort to get right. A common setup is &lt;a href="https://debezium.io/documentation/reference/3.3/connectors/postgresql.html" rel="noopener noreferrer"&gt;Debezium&lt;/a&gt; for change data capture (CDC) with an event streaming platform like &lt;a href="https://kafka.apache.org/" rel="noopener noreferrer"&gt;Apache Kafka&lt;/a&gt; or &lt;a href="https://www.redpanda.com/" rel="noopener noreferrer"&gt;Redpanda&lt;/a&gt;.&lt;/p&gt;
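&lt;p&gt;For a feel of what a CDC pipeline moves around, here is a simplified, Debezium-style change event, trimmed for illustration (real events also carry a schema block and richer source metadata):&lt;/p&gt;

```python
import json

# A simplified Debezium-style change event for an INSERT into `employee`.
# The field values here are made up for illustration.
event_json = '''
{
  "payload": {
    "before": null,
    "after": {"id": 1, "name": "Sam Evans", "salary": "55000.00"},
    "source": {"table": "employee", "lsn": 24500432},
    "op": "c",
    "ts_ms": 1762362273000
  }
}
'''

event = json.loads(event_json)
payload = event["payload"]
# Debezium encodes the operation as c=create, u=update, d=delete, r=snapshot read.
op_names = {"c": "insert", "u": "update", "d": "delete", "r": "snapshot read"}
print(op_names[payload["op"]], payload["after"]["name"])  # insert Sam Evans
```

&lt;p&gt;A consumer typically switches on &lt;code&gt;op&lt;/code&gt; and applies &lt;code&gt;after&lt;/code&gt; (or, for deletes, &lt;code&gt;before&lt;/code&gt;) to the downstream table.&lt;/p&gt;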

&lt;p&gt;In the remainder of this post, we'll set up a simple pipeline to stream data from a Postgres server to an Iceberg table. The setup will use Postgres (we love Postgres here) as a SQL catalog, &lt;a href="https://github.com/minio/minio" rel="noopener noreferrer"&gt;Minio&lt;/a&gt;, an open source S3-compatible object store, the Apache Parquet format for the data files, and Trino as the processing framework (compute layer).&lt;/p&gt;

&lt;h3&gt;
  
  
  Change Data Capture Tool
&lt;/h3&gt;

&lt;p&gt;We'll also use a simple &lt;a href="https://github.com/misachi/pg_rusted_wire" rel="noopener noreferrer"&gt;CDC tool&lt;/a&gt; I have been working on. It implements the Postgres wire protocol and uses logical replication to capture DML changes and stream them to the data lake. It is written mostly in Rust, but the part that uploads data to the data lake is done in &lt;a href="https://github.com/misachi/pg_rusted_wire/blob/main/py_iceberg.py" rel="noopener noreferrer"&gt;Python&lt;/a&gt;. This makes it easier to do any additional processing of the data in Python before uploading it to the data lake (ETL), or, as in ELT, to upload the data as-is and perform the transformations later. The Python data ecosystem is rich and provides a lot of excellent tools for working with data.&lt;/p&gt;

&lt;p&gt;Streaming can result in an explosion of very small files in the data lake. Transferring every record captured by the CDC tool causes a lot of network chatter and slows down data processing. It helps to buffer the data locally and only send it over the network in larger batches. This can speed up the pipeline (data is first saved on local disk) and also reduce network costs. &lt;a href="https://github.com/misachi/pg_rusted_wire" rel="noopener noreferrer"&gt;pg_rusted_wire&lt;/a&gt; uses segments, fixed-size file chunks, to accumulate data received from Postgres. Data in the active segment is uploaded to the data lake once the segment reaches the configured size or the configured upload interval elapses. Both the interval and the segment size can be adjusted with the constants &lt;code&gt;MAX_WRITE_INTERVAL&lt;/code&gt; and &lt;code&gt;MAX_SEGMENT_SIZE&lt;/code&gt; respectively, in &lt;code&gt;wire.rs&lt;/code&gt;.&lt;/p&gt;
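&lt;p&gt;The flush-on-size-or-interval idea can be sketched in a few lines of Python. This is a simplified model, not the actual Rust implementation; the constant names mirror the tool's settings but the values here are made up:&lt;/p&gt;

```python
import time

MAX_SEGMENT_SIZE = 64 * 1024   # bytes; flush when the active segment fills up
MAX_WRITE_INTERVAL = 30.0      # seconds; flush even if the segment is small

class SegmentBuffer:
    """Accumulate records locally; ship them in large batches."""
    def __init__(self, upload):
        self.upload = upload               # callback that ships a batch
        self.buf = bytearray()
        self.last_flush = time.monotonic()

    def append(self, record: bytes):
        self.buf.extend(record)
        now = time.monotonic()
        # Flush on size or on elapsed time, whichever comes first.
        if len(self.buf) >= MAX_SEGMENT_SIZE or now - self.last_flush >= MAX_WRITE_INTERVAL:
            self.flush(now)

    def flush(self, now=None):
        if self.buf:
            self.upload(bytes(self.buf))   # one network round trip per batch
            self.buf.clear()
        self.last_flush = now if now is not None else time.monotonic()

batches = []
buf = SegmentBuffer(batches.append)
for _ in range(70000):         # 70,000 one-byte records: crosses 64 KB once
    buf.append(b"x")
print(len(batches), len(batches[0]))  # 1 65536
```

&lt;p&gt;Instead of 70,000 tiny uploads, one 64 KB batch goes over the network; the stragglers wait in the buffer for the next size or interval trigger.&lt;/p&gt;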

&lt;h3&gt;
  
  
  Setting up the Data Lake
&lt;/h3&gt;

&lt;p&gt;To begin, clone the repository &lt;a href="https://github.com/misachi/iceberg-trino-example" rel="noopener noreferrer"&gt;here&lt;/a&gt; and execute the &lt;code&gt;run.sh&lt;/code&gt; bash script. Once the script has completed you should see three services running with a &lt;code&gt;healthy&lt;/code&gt; status -- Trino takes a few seconds (~20s) to start up and transition to a healthy state. Navigate to &lt;code&gt;http://localhost:9001&lt;/code&gt; in your browser to open the Minio object storage UI. Use the username &lt;code&gt;minio_user&lt;/code&gt; and password &lt;code&gt;pass1234&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwtu586woooz0k285dg92.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwtu586woooz0k285dg92.png" alt="Minio UI" width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The default bucket &lt;code&gt;warehouse&lt;/code&gt; should be empty. A new catalog called &lt;code&gt;example_cat&lt;/code&gt; has also been created. The catalog should be empty since we have not added any tables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -it iceberg-trino-example-metastore-1 psql -U repl_user -d iceberg
psql (18.0)
Type "help" for help.

iceberg=# table iceberg_tables; -- Empty
 catalog_name | table_namespace | table_name | metadata_location | previous_metadata_location
--------------+-----------------+------------+-------------------+----------------------------
(0 rows)

iceberg=# table iceberg_namespace_properties; -- Empty
 catalog_name | namespace | property_key | property_value
--------------+-----------+--------------+----------------
(0 rows)

iceberg=#
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can now create a schema on Iceberg to hold our table. We also create a new table called &lt;code&gt;employee&lt;/code&gt; to store employee details. The &lt;a href="https://trino.io/" rel="noopener noreferrer"&gt;Trino&lt;/a&gt; processing engine will be used to perform these actions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -it iceberg-trino-example-trino_connector-1 trino
trino&amp;gt; CREATE SCHEMA example_cat.example_schema WITH (location = 's3://warehouse/test'); 
CREATE SCHEMA
trino&amp;gt; CREATE TABLE example_cat.example_schema.employee
     (
       id INTEGER,
       name VARCHAR,
       salary DECIMAL(10,2)
     )
     WITH (
       format = 'PARQUET'
     );
CREATE TABLE
trino&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we have a schema &lt;code&gt;example_schema&lt;/code&gt;, a table &lt;code&gt;employee&lt;/code&gt; and a catalog named &lt;code&gt;example_cat&lt;/code&gt;. Accessing the table requires the full path in the &lt;code&gt;&amp;lt;catalog_name&amp;gt;.&amp;lt;schema_name&amp;gt;.&amp;lt;table_name&amp;gt;&lt;/code&gt; format. In our case, the table name is &lt;code&gt;example_cat.example_schema.employee&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The bucket created earlier should now have some data in it: metadata files describing the state of the table that was just created. One is a metadata file and the other is a snapshot file (prefixed with "snap"). There is no manifest file yet since we have no data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv1xk0ijh0ig8px3scsk7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv1xk0ijh0ig8px3scsk7.png" alt="Iceberg-Metadata" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's add some data to the table&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -it iceberg-trino-example-trino_connector-1 trino
trino&amp;gt; INSERT INTO example_cat.example_schema.employee (id, name, salary) VALUES (1, 'Sam Evans', 55000);
    INSERT INTO example_cat.example_schema.employee (id, name, salary) VALUES (2, 'James Bond', 12000);
INSERT: 1 row

Query 20251031_121000_00003_dw2kt, FINISHED, 1 node
Splits: 23 total, 23 done (100.00%)
2.66 [0 rows, 0B] [0 rows/s, 0B/s]

INSERT: 1 row

Query 20251031_121003_00004_dw2kt, FINISHED, 1 node
Splits: 23 total, 23 done (100.00%)
0.78 [0 rows, 0B] [0 rows/s, 0B/s]

trino&amp;gt; SELECT * FROM example_cat.example_schema.employee; -- Query the data
 id |    name    |  salary
----+------------+----------
  1 | Sam Evans  | 55000.00
  2 | James Bond | 12000.00
(2 rows)

Query 20251031_121103_00005_dw2kt, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0.57 [2 rows, 1KB] [3 rows/s, 1.77KB/s]

trino&amp;gt; exit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;More metadata files are generated. We can browse them in the object store UI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87c154gmt55t24ckues7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87c154gmt55t24ckues7.png" alt="More metadata" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A new directory named &lt;code&gt;data&lt;/code&gt; has also been created to store the data files.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn25kv3hdbsnaly7t3c9b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn25kv3hdbsnaly7t3c9b.png" alt="Data directory" width="800" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are two Parquet data files. We inserted two records in separate transactions, and each transaction created its own data file and metadata files: when the state of a table changes (delete, insert, update) a new snapshot is created to track the new state. A new metadata file (which stores the location of the snapshot) and a new manifest file (to track the new data file) are also created.&lt;/p&gt;

&lt;p&gt;We can drop the table to clear all the files associated with it, including the metadata and data files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -it iceberg-trino-example-trino_connector-1 trino
trino&amp;gt; DROP TABLE example_cat.example_schema.employee;
DROP TABLE
trino&amp;gt; exit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Streaming Data to Iceberg
&lt;/h3&gt;

&lt;p&gt;We can now set up a simple pipeline with &lt;a href="https://github.com/misachi/pg_rusted_wire" rel="noopener noreferrer"&gt;pg_rusted_wire&lt;/a&gt; to stream data from Postgres to Iceberg. In this example, we will use the same Postgres instance both as the catalog store and as the data source. The tool uses logical streaming replication to capture table changes from Postgres.&lt;/p&gt;

&lt;p&gt;First, prepare the data source for logical streaming.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -it iceberg-trino-example-metastore-1 psql -U repl_user -d iceberg -c "ALTER SYSTEM SET wal_level = 'logical'"
ALTER SYSTEM

$ docker restart iceberg-trino-example-metastore-1
iceberg-trino-example-metastore-1 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then create the &lt;code&gt;employee&lt;/code&gt; table from which we will get data. We also create a publication on the table, to which the CDC tool will subscribe to capture changes made to the table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -it iceberg-trino-example-metastore-1 psql -U repl_user -d postgres
postgres=# CREATE TABLE employee
( id serial primary key,
  name varchar,
  salary decimal(10,2)
);
CREATE TABLE
postgres=# INSERT INTO employee (name, salary) select 'Mkamze Mwatela' || i, i*200 from generate_series(1, 100000) i;
INSERT 0 100000
postgres=# CREATE PUBLICATION pub1 FOR TABLE employee;
CREATE PUBLICATION
postgres=# exit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;employee&lt;/code&gt; table has been created and has 100k records.&lt;/p&gt;

&lt;p&gt;Let's also recreate the &lt;code&gt;employee&lt;/code&gt; table on Iceberg. The data streamed from Postgres will be stored in this table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -it iceberg-trino-example-trino_connector-1 trino
trino&amp;gt; CREATE TABLE example_cat.example_schema.employee
     (
       id INTEGER,
       name VARCHAR,
       salary DECIMAL(10,2)
     )
     WITH (
       format = 'PARQUET'
     );
CREATE TABLE
trino&amp;gt; exit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, create a file called &lt;code&gt;config&lt;/code&gt; in a directory of your choosing, e.g. &lt;code&gt;/home/your_user/.tmp/config&lt;/code&gt;. The file contains credentials for connecting to the object store and the catalog store. Copy the details below into the file.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Use the command &lt;code&gt;ifconfig docker0 | grep -w "inet" | awk '{print $2}'&lt;/code&gt; to get the Docker host IP address. Replace &lt;code&gt;host_ip&lt;/code&gt; below with the output of the command.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;S3_SECRET_KEY='pass1234'
S3_ENDPOINT='http://host_ip:9000'
S3_ACCESS_KEY='minio_user'
CATALOG_URI='postgresql+psycopg2://repl_user:pass1234@host_ip:5432/iceberg'
TABLE_NAME='example_cat.example_schema.employee'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
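&lt;p&gt;A minimal parser for this &lt;code&gt;KEY='value'&lt;/code&gt; file format looks like the sketch below (how the real tool parses its config may differ; this is just to show the shape of the file):&lt;/p&gt;

```python
def parse_config(text):
    """Parse KEY='value' lines into a dict, ignoring blanks and comments."""
    cfg = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        cfg[key.strip()] = value.strip().strip("'\"")  # drop surrounding quotes
    return cfg

sample = """
S3_SECRET_KEY='pass1234'
S3_ENDPOINT='http://host_ip:9000'
S3_ACCESS_KEY='minio_user'
CATALOG_URI='postgresql+psycopg2://repl_user:pass1234@host_ip:5432/iceberg'
TABLE_NAME='example_cat.example_schema.employee'
"""

cfg = parse_config(sample)
print(cfg["TABLE_NAME"])  # example_cat.example_schema.employee
```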



&lt;p&gt;Open a new terminal window and clone the repo at &lt;a href="https://github.com/misachi/pg_rusted_wire" rel="noopener noreferrer"&gt;https://github.com/misachi/pg_rusted_wire&lt;/a&gt;. Run the example CDC tool as shown below.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;/home/your_user/.tmp/&lt;/code&gt; should be the directory that contains the &lt;code&gt;config&lt;/code&gt; file described above.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git clone https://github.com/misachi/pg_rusted_wire.git &amp;amp;&amp;amp; cd pg_rusted_wire

$ DOCKER_IP=`ifconfig docker0 | grep -w "inet" | awk '{print $2}'` &amp;amp;&amp;amp; cargo run --example lrepl -- -u repl_user -P pass1234 -H $DOCKER_IP -d postgres -p 5432 --table employee --publication pub1 --config-dir /home/your_user/.tmp/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If there are no errors, then after a few seconds (depending on the configured &lt;code&gt;MAX_WRITE_INTERVAL&lt;/code&gt; setting) the initial data in the Postgres &lt;code&gt;employee&lt;/code&gt; table should have been copied to the &lt;code&gt;employee&lt;/code&gt; Iceberg table. We can check using Trino with the SQL command below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -it iceberg-trino-example-trino_connector-1 trino
trino&amp;gt; SELECT COUNT(*) FROM example_cat.example_schema.employee;
 _col0
--------
 100000
(1 row)

Query 20251102_201645_00003_ack9h, FINISHED, 1 node
Splits: 10 total, 10 done (100.00%)
0.41 [100K rows, 0B] [242K rows/s, 0B/s]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All 100k records in the Postgres table have been copied to Iceberg.&lt;/p&gt;

&lt;h4&gt;
  
  
  Inserts
&lt;/h4&gt;

&lt;p&gt;We can then insert additional data into our Postgres &lt;code&gt;employee&lt;/code&gt; table and it will be streamed to Iceberg.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -it iceberg-trino-example-metastore-1 psql -U repl_user -d postgres -c "INSERT INTO employee (name, salary) select 'Manjaz' || i, i*200 from generate_series(1, 5) i"
INSERT 0 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five more records have been inserted into the table. The record count on the Iceberg table should now be 100,005.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -it iceberg-trino-example-trino_connector-1 trino
trino&amp;gt; SELECT COUNT(*) FROM example_cat.example_schema.employee;
 _col0
--------
 100005
(1 row)

Query 20251102_202608_00004_ack9h, FINISHED, 1 node
Splits: 11 total, 11 done (100.00%)
0.40 [100K rows, 0B] [253K rows/s, 0B/s]

trino&amp;gt; exit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The object storage web UI now shows something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5gkng4x8qyb38c5o6h5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5gkng4x8qyb38c5o6h5.png" alt="100k records" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two data files have been generated. One is from the initial copy of 100k records, with size 1.2MB. The second data file holds the 5 records inserted later, with size 1.3KB.&lt;/p&gt;

&lt;h4&gt;
  
  
  Updates
&lt;/h4&gt;

&lt;p&gt;When a record is updated in the Postgres table, the change is captured and streamed to Iceberg.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -it iceberg-trino-example-metastore-1 psql -U repl_user -d postgres
psql (18.0)
Type "help" for help.

postgres=# SELECT * FROM employee WHERE name like '%Manjaz%'; -- Initial salary records for the 5 employees
   id   |  name   | salary
--------+---------+---------
 100001 | Manjaz1 |  200.00
 100002 | Manjaz2 |  400.00
 100003 | Manjaz3 |  600.00
 100004 | Manjaz4 |  800.00
 100005 | Manjaz5 | 1000.00
(5 rows)

postgres=# UPDATE employee SET salary = 4000 WHERE name like '%Manjaz%';
UPDATE 5
postgres=# SELECT * FROM employee WHERE name like '%Manjaz%';
   id   |  name   | salary
--------+---------+---------
 100001 | Manjaz1 | 4000.00
 100002 | Manjaz2 | 4000.00
 100003 | Manjaz3 | 4000.00
 100004 | Manjaz4 | 4000.00
 100005 | Manjaz5 | 4000.00
(5 rows)
postgres=# exit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The employees with names that start with &lt;code&gt;Manjaz&lt;/code&gt; have all been updated.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -it iceberg-trino-example-trino_connector-1 trino
trino&amp;gt; SELECT COUNT(*) FROM example_cat.example_schema.employee;
 _col0
--------
 100005
(1 row)

Query 20251102_204321_00012_ack9h, FINISHED, 1 node
Splits: 11 total, 11 done (100.00%)
0.51 [100K rows, 0B] [195K rows/s, 0B/s]

trino&amp;gt; SELECT * FROM example_cat.example_schema.employee WHERE name like '%Manjaz%';
   id   |  name   | salary
--------+---------+---------
 100001 | Manjaz1 | 4000.00
 100002 | Manjaz2 | 4000.00
 100003 | Manjaz3 | 4000.00
 100004 | Manjaz4 | 4000.00
 100005 | Manjaz5 | 4000.00
(5 rows)

Query 20251102_204403_00013_ack9h, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
1.24 [100K rows, 1.21MB] [80.4K rows/s, 996KB/s]

trino&amp;gt; exit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The updates to the &lt;code&gt;employee&lt;/code&gt; table in Postgres have been successfully streamed to the Iceberg table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcvbpi208tegazhzf8sdc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcvbpi208tegazhzf8sdc.png" alt="Updates" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Minio object storage browser now shows 3 data files. All the updated records were in the data file named &lt;code&gt;00000-0-14307206-2120-4897-a3ef-fddd94900de2.parquet&lt;/code&gt; since they were created in a single transaction. A new data file named &lt;code&gt;00000-0-5af272e2-4e18-4828-a66f-b6667e982e16.parquet&lt;/code&gt; has been created as a result of the update. UPDATEs are done in copy-on-write mode by default: the data file is not changed in place; instead, the affected data is read into memory, updated, and written out to a new file.&lt;/p&gt;
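&lt;p&gt;The copy-on-write behaviour can be illustrated with a toy model (plain Python, not the actual engine; the file naming just mimics what we saw in the bucket):&lt;/p&gt;

```python
import uuid

data_files = {}          # file name -- list of (id, name, salary) rows
snapshots = []           # each snapshot lists the data files it references

def write_file(rows):
    # Fabricated naming in the spirit of the files shown in the Minio browser.
    name = "00000-0-" + str(uuid.uuid4()) + ".parquet"
    data_files[name] = rows
    return name

# Initial insert: one data file, one snapshot referencing it.
f1 = write_file([(100001, "Manjaz1", 200.00)])
snapshots.append([f1])

# Copy-on-write UPDATE: read the affected file, apply the change in memory,
# and write a brand-new file. The old file is untouched and still reachable
# through the older snapshot (enabling time travel), until it is expired.
old_rows = data_files[f1]
new_rows = [(i, n, 4000.00) for (i, n, _salary) in old_rows]
f2 = write_file(new_rows)
snapshots.append([f2])   # the new snapshot points at the new file only

print(len(data_files), data_files[f2][0][2])  # 2 4000.0
```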

&lt;h4&gt;
  
  
  Deletes
&lt;/h4&gt;

&lt;p&gt;Deleting records works similarly to updates and results in records being removed from the Iceberg table. Since a delete changes the state of the table, a new snapshot, and hence a new metadata file, will be generated. A new data file is also created, with a new manifest file pointing to it.&lt;/p&gt;

&lt;p&gt;Postgres is purpose-built to handle OLTP workloads and it does that really well. Its row-based storage design might not suit all analytical workloads, depending on your specific needs and data size. OLAP data stores designed specifically for analytics come in handy when Postgres no longer meets your needs. We have shown one such solution: a data lake storing raw data in many files, with Apache Iceberg treating related files as a single table you can run queries on. A data lake's ability to handle large, diverse datasets makes it an essential foundation for modern data-driven decision-making. Combined with open table formats like Apache Iceberg, data lakes become even more powerful, offering reliable data management, faster queries, and a scalable path toward advanced analytics.&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Postgres Replication Slots</title>
      <dc:creator>Brian Misachi</dc:creator>
      <pubDate>Fri, 29 Aug 2025 13:57:00 +0000</pubDate>
      <link>https://dev.to/misachi/postgres-replication-slots-31ml</link>
      <guid>https://dev.to/misachi/postgres-replication-slots-31ml</guid>
      <description>&lt;p&gt;The &lt;a href="https://www.postgresql.org/docs/current/wal-intro.html" rel="noopener noreferrer"&gt;WAL(Redo log)&lt;/a&gt; is essential to normal database operations. In some sense "The LOG is the DATABASE" has some truth to it. The WAL stores a log of all activities performed in the database. In other words, it maintains the state of the database system. Having the WAL is useful in many ways and one of the most important role is in replication. &lt;/p&gt;

&lt;p&gt;Replication in Postgres makes significant use of the WAL when copying data changes from one database server to another. Every data change, once committed, is replicated to all configured standbys by reading the WAL. It is important that the WAL is stored durably and kept long enough for new changes to be replicated to standbys.&lt;/p&gt;

&lt;p&gt;Postgres aims to be as resource efficient as possible. WAL files can build up quickly and, if not well managed, can fill up available disk space, bringing the system to a halt. Postgres handles this by safely deleting older WAL files during a checkpoint to recover disk space. Once the REDO location is confirmed to be ahead of a WAL file (i.e. data has been flushed up to that point), that WAL file and all older files are removed or recycled as part of the database cleanup process. An issue with this process is that a WAL file can be deleted before its changes are safely replicated downstream to standbys, leading to inconsistency between the primary/publisher and a standby/subscriber. When this occurs, replication normally stops until the issue is fixed or a new full backup is taken from the primary node.&lt;/p&gt;

&lt;h2&gt;
  
  
  Delete Not WAL
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Archiving
&lt;/h3&gt;

&lt;p&gt;Postgres offers a few solutions for keeping WALs long enough for standbys to access them before they are deleted. One such solution is &lt;a href="https://www.postgresql.org/docs/current/continuous-archiving.html#CONTINUOUS-ARCHIVING" rel="noopener noreferrer"&gt;continuous archiving&lt;/a&gt;. In archiving mode, old WAL files are copied to a separate location before being recycled from the &lt;code&gt;pg_wal&lt;/code&gt; directory. The archive location can be a local disk or remote server. Standbys can still access the archive location during recovery and stay consistent with the primary server. Archiving was used in the post &lt;a href="https://dev.to/misachi/high-availability-postgres-4ahe"&gt;HA Postgres&lt;/a&gt; to store old WAL files when setting up the HA cluster.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Archiving is primarily used for backups and PITR and so can be used in addition to other solutions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Slots
&lt;/h3&gt;

&lt;p&gt;Another solution for preventing premature WAL deletion is &lt;a href="https://www.postgresql.org/docs/current/warm-standby.html#STREAMING-REPLICATION-SLOTS" rel="noopener noreferrer"&gt;replication slots&lt;/a&gt;. A replication slot maintains the state of the WAL being accessed by a standby or subscriber that is tracking changes from the primary. The slot prevents deletion of WAL files and removal of old tuple versions that are still required downstream. The replication slot is a simple data structure, represented on disk with the following fields&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// in src/include/replication/slot.h

typedef struct ReplicationSlotPersistentData
{
    NameData    name;
    Oid         database;
    ReplicationSlotPersistency persistency;
    TransactionId xmin;
    TransactionId catalog_xmin;
    XLogRecPtr  restart_lsn;
    ReplicationSlotInvalidationCause invalidated;
    XLogRecPtr  confirmed_flush;
    XLogRecPtr  two_phase_at;
    bool        two_phase;
    NameData    plugin;
    char        synced;
    bool        failover;
} ReplicationSlotPersistentData;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The most interesting field in the struct is &lt;code&gt;restart_lsn&lt;/code&gt;, which indicates the location from which data will be read. &lt;code&gt;confirmed_flush&lt;/code&gt; points to the WAL location that has been flushed and acknowledged by the standby. &lt;code&gt;xmin&lt;/code&gt; maintains the transaction horizon: tuple versions with transaction ids newer than the value of &lt;code&gt;xmin&lt;/code&gt; cannot be removed by vacuum.&lt;/p&gt;

&lt;p&gt;Creating a replication slot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT * FROM pg_create_logical_replication_slot('test_slot', 'test_decoding', false, true);
 slot_name |    lsn
-----------+-----------
 test_slot | 0/17ABC18
(1 row)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To view all the replication slots available in the database we can query the  &lt;code&gt;pg_replication_slots&lt;/code&gt; view&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT slot_name, database, active, xmin, restart_lsn, confirmed_flush_lsn, synced FROM pg_replication_slots;
 slot_name | database | active | xmin | restart_lsn | confirmed_flush_lsn | synced
-----------+----------+--------+------+-------------+---------------------+--------
 test_slot | postgres | f      |      | 0/17ABBE0   | 0/17ABC18           | f
(1 row)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It displays the field values from the slot data structure shown above. The next WAL position to be read starts at location &lt;code&gt;0/17ABBE0&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A new slot called &lt;code&gt;test_slot&lt;/code&gt; has been created using the test logical decoding output plugin &lt;a href="https://www.postgresql.org/docs/current/test-decoding.html" rel="noopener noreferrer"&gt;test_decoding&lt;/a&gt;. The slot is a permanent one (third argument &lt;code&gt;false&lt;/code&gt;), so it will be persisted to disk. It is also configured to allow decoding of prepared transactions, if used (fourth argument &lt;code&gt;true&lt;/code&gt;). &lt;/p&gt;
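&lt;p&gt;For comparison, a physical slot, the kind a streaming standby would use, only tracks WAL positions and takes no output plugin. A quick illustration (the slot name here is made up):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT pg_create_physical_replication_slot('standby_slot');
 pg_create_physical_replication_slot
-------------------------------------
 (standby_slot,)
(1 row)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;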

&lt;p&gt;Create a table and put some data in it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# CREATE TABLE foo(id SERIAL, k INT NOT NULL);
CREATE TABLE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a logical slot, DDL changes are not replicated. When we check for new changes from the slot, only the &lt;code&gt;BEGIN&lt;/code&gt; and &lt;code&gt;COMMIT&lt;/code&gt; statements are output.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT * FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL);
    lsn    | xid |    data
-----------+-----+------------
 0/17B1550 | 764 | BEGIN 764
 0/17D3B50 | 764 | COMMIT 764
(2 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now if we insert some data, we'll get some meaningful decoded output&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# INSERT INTO foo VALUES (1,1);
INSERT 0 1
postgres=# SELECT * FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL);
    lsn    | xid |                         data
-----------+-----+------------------------------------------------------
 0/17B1550 | 764 | BEGIN 764
 0/17D3B50 | 764 | COMMIT 764
 0/17D6CE0 | 765 | BEGIN 765
 0/17D6CE0 | 765 | table public.foo: INSERT: id[integer]:1 k[integer]:1
 0/17D6D50 | 765 | COMMIT 765
(5 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have not actually consumed any data from the slot. The &lt;code&gt;pg_logical_slot_peek_changes&lt;/code&gt; function lets us look at the data that is yet to be replicated from the slot, without updating the location from which the next read would start. If we check the state of the slot, it should have the same values as before.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT slot_name, database, active, xmin, restart_lsn, confirmed_flush_lsn, synced FROM pg_replication_slots;
 slot_name | database | active | xmin | restart_lsn | confirmed_flush_lsn | synced
-----------+----------+--------+------+-------------+---------------------+--------
 test_slot | postgres | f      |      | 0/17ABBE0   | 0/17ABC18           | f
(1 row)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can read the changes from the WAL at &lt;code&gt;0/17ABBE0&lt;/code&gt; and advance the slot's &lt;code&gt;restart_lsn&lt;/code&gt; position with the &lt;code&gt;pg_logical_slot_get_changes&lt;/code&gt; function.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;pg_logical_slot_get_changes&lt;/code&gt; and &lt;code&gt;pg_logical_slot_peek_changes&lt;/code&gt; functions are similar: both share the same logic in the &lt;a href="https://github.com/postgres/postgres/blob/f5d07085822a144afb169d0f422e25689cb6209f/src/backend/replication/logical/logicalfuncs.c#L99C1-L100C1" rel="noopener noreferrer"&gt;pg_logical_slot_get_changes_guts&lt;/a&gt; function. The major difference is that &lt;code&gt;pg_logical_slot_peek_changes&lt;/code&gt; does not move the &lt;code&gt;restart_lsn&lt;/code&gt; position when it is done (&lt;code&gt;confirm=false&lt;/code&gt; in &lt;code&gt;pg_logical_slot_get_changes_guts&lt;/code&gt;). The condition check &lt;a href="https://github.com/postgres/postgres/blob/f5d07085822a144afb169d0f422e25689cb6209f/src/backend/replication/logical/logicalfuncs.c#L289" rel="noopener noreferrer"&gt;here&lt;/a&gt; determines whether &lt;code&gt;restart_lsn&lt;/code&gt; should be updated once reading from the WAL is done.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if (ctx-&amp;gt;reader-&amp;gt;EndRecPtr != InvalidXLogRecPtr &amp;amp;&amp;amp; confirm)
{
    LogicalConfirmReceivedLocation(ctx-&amp;gt;reader-&amp;gt;EndRecPtr);
    ReplicationSlotMarkDirty();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The actual update of &lt;code&gt;restart_lsn&lt;/code&gt; is done in the &lt;a href="https://github.com/postgres/postgres/blob/f5d07085822a144afb169d0f422e25689cb6209f/src/backend/replication/logical/logical.c#L1879" rel="noopener noreferrer"&gt;LogicalConfirmReceivedLocation&lt;/a&gt; function, when the next expected restart point is valid.&lt;br&gt;
If the slot's state is updated, the slot is marked dirty in &lt;code&gt;ReplicationSlotMarkDirty&lt;/code&gt; and is scheduled for flushing to disk at the next checkpoint. &lt;/p&gt;

&lt;p&gt;We can now consume from the slot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL);
    lsn    | xid |                         data
-----------+-----+------------------------------------------------------
 0/17B1550 | 764 | BEGIN 764
 0/17D3B50 | 764 | COMMIT 764
 0/17D6CE0 | 765 | BEGIN 765
 0/17D6CE0 | 765 | table public.foo: INSERT: id[integer]:1 k[integer]:1
 0/17D6D50 | 765 | COMMIT 765
(5 rows)
postgres=# SELECT slot_name, database, active, xmin, restart_lsn, confirmed_flush_lsn, synced FROM pg_replication_slots;
 slot_name | database | active | xmin | restart_lsn | confirmed_flush_lsn | synced
-----------+----------+--------+------+-------------+---------------------+--------
 test_slot | postgres | f      |      | 0/17ABD78   | 0/17D6E90           | f
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the &lt;code&gt;restart_lsn&lt;/code&gt; location has been updated to &lt;code&gt;0/17ABD78&lt;/code&gt;. All the changes have now been consumed and are reflected in the slot's state. There are no new changes at the new &lt;code&gt;restart_lsn&lt;/code&gt; location.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT * FROM pg_logical_slot_get_changes('test_slot', NULL, NULL);
 lsn | xid | data
-----+-----+------
(0 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  WAL Preservation
&lt;/h4&gt;

&lt;p&gt;The main purpose of replication slots is to prevent deletion of WAL that is still needed. Any WAL segment older than the one pointed to by &lt;code&gt;restart_lsn&lt;/code&gt; is scheduled for deletion during the next checkpoint cycle.&lt;/p&gt;
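&lt;p&gt;To see which segment file a slot is pinning, &lt;code&gt;restart_lsn&lt;/code&gt; can be mapped to a segment name with &lt;code&gt;pg_walfile_name&lt;/code&gt; (the exact name depends on your timeline and position):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT slot_name, restart_lsn, pg_walfile_name(restart_lsn) FROM pg_replication_slots;
 slot_name | restart_lsn |     pg_walfile_name
-----------+-------------+--------------------------
 test_slot | 0/17ABD78   | 000000010000000000000001
(1 row)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;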

&lt;p&gt;Let's check it out. At this point there is only one WAL segment file &lt;code&gt;000000010000000000000001&lt;/code&gt; with the inode number &lt;code&gt;516146&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ls -lhi $PGDATA/pg_wal
total 17M
516146 -rw------- 1 postgres postgres  16M Aug 28 17:37 000000010000000000000001
516118 drwx------ 2 postgres postgres 4.0K Aug 28 17:36 archive_status
516119 drwx------ 2 postgres postgres 4.0K Aug 28 17:36 summaries
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add more data to the &lt;code&gt;foo&lt;/code&gt; table to fill up the active WAL segment file. The &lt;code&gt;min_wal_size&lt;/code&gt; setting is at its default of 80MB.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# INSERT INTO foo SELECT i, i FROM generate_series(1, 247580) i;
INSERT 0 247580
postgres=# SELECT slot_name, database, active, xmin, restart_lsn, confirmed_flush_lsn, synced FROM pg_replication_slots;
 slot_name | database | active | xmin | restart_lsn | confirmed_flush_lsn | synced
-----------+----------+--------+------+-------------+---------------------+--------
 test_slot | postgres | f      |      | 0/17ABD78   | 0/17D6E90           | f
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The slot's state has not changed and &lt;code&gt;restart_lsn&lt;/code&gt; is still pointing to a location in the original WAL segment. Therefore, the &lt;code&gt;000000010000000000000001&lt;/code&gt; segment file will not be removed immediately even after a checkpoint.&lt;br&gt;
A new WAL segment file &lt;code&gt;000000010000000000000002&lt;/code&gt; has also been created&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ls -lhi $PGDATA/pg_wal
total 33M
516146 -rw------- 1 postgres postgres  16M Aug 28 17:43 000000010000000000000001
518594 -rw------- 1 postgres postgres  16M Aug 28 17:43 000000010000000000000002
516118 drwx------ 2 postgres postgres 4.0K Aug 28 17:42 archive_status
516119 drwx------ 2 postgres postgres 4.0K Aug 28 17:42 summaries
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In order for the original segment file to be deleted, we need to consume from the &lt;code&gt;test_slot&lt;/code&gt; slot using the &lt;code&gt;pg_logical_slot_get_changes&lt;/code&gt; function. With the &lt;code&gt;wal_recycle&lt;/code&gt; setting turned on, the &lt;code&gt;000000010000000000000001&lt;/code&gt; segment file will be recycled (its inode number remains the same) and renamed to &lt;code&gt;000000010000000000000003&lt;/code&gt;. New WAL records will then be stored in it. You can try it out to see this in action.&lt;/p&gt;
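&lt;p&gt;If the decoded changes themselves are not needed, the slot can also be fast-forwarded without returning them, using &lt;code&gt;pg_replication_slot_advance&lt;/code&gt;. This is just a sketch; you would not normally do it on a slot a real subscriber depends on, since the skipped changes are lost to that subscriber:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Move the slot up to the current WAL insert position
postgres=# SELECT pg_replication_slot_advance('test_slot', pg_current_wal_lsn());
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;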

&lt;h4&gt;
  
  
  Idle Slots and System Resources
&lt;/h4&gt;

&lt;p&gt;It is good practice to delete unused slots in order to release the resources they hold. A slot uses shared memory to store its &lt;a href="https://github.com/postgres/postgres/blob/da9f9f75e5ce27a45878ffa262156d18f0046188/src/include/replication/slot.h#L162" rel="noopener noreferrer"&gt;in-memory state&lt;/a&gt;. Unused slots consume memory and hold up the cleanup of old WAL files, leading to disk bloat. The situation can quickly get worse on a busy server: WAL files newer than &lt;code&gt;restart_lsn&lt;/code&gt; are preserved for as long as the slot's state is not updated. &lt;/p&gt;
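&lt;p&gt;As a safety net against unbounded growth, the &lt;code&gt;max_slot_wal_keep_size&lt;/code&gt; setting (Postgres 13 and later; unlimited by default) caps how much WAL slots may retain, and a slot that falls too far behind is invalidated instead of filling the disk. The value below is illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# postgresql.conf -- size the cap to your disk and to how far behind consumers may fall
max_slot_wal_keep_size = 10GB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;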

&lt;p&gt;After adding more data to the table, multiple WAL files are still stored on disk even after a checkpoint.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ls -lhi $PGDATA/pg_wal
total 161M
516146 -rw------- 1 postgres postgres  16M Aug 28 17:43 000000010000000000000001
518594 -rw------- 1 postgres postgres  16M Aug 28 17:44 000000010000000000000002
518596 -rw------- 1 postgres postgres  16M Aug 28 17:44 000000010000000000000003
518597 -rw------- 1 postgres postgres  16M Aug 28 17:44 000000010000000000000004
518598 -rw------- 1 postgres postgres  16M Aug 28 17:44 000000010000000000000005
518595 -rw------- 1 postgres postgres  16M Aug 28 17:44 000000010000000000000006
518599 -rw------- 1 postgres postgres  16M Aug 28 17:44 000000010000000000000007
518600 -rw------- 1 postgres postgres  16M Aug 28 17:44 000000010000000000000008
518601 -rw------- 1 postgres postgres  16M Aug 28 17:44 000000010000000000000009
518602 -rw------- 1 postgres postgres  16M Aug 28 17:44 00000001000000000000000A
516118 drwx------ 2 postgres postgres 4.0K Aug 28 17:42 archive_status
516119 drwx------ 2 postgres postgres 4.0K Aug 28 17:42 summaries
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The current active WAL segment is &lt;code&gt;00000001000000000000000A&lt;/code&gt; and none of the older segments have been deleted. The size of the &lt;code&gt;pg_wal&lt;/code&gt; directory will continue to grow unbounded if the slot remains inactive. To free up space, we would need to get rid of the slot.&lt;/p&gt;
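&lt;p&gt;Before dropping anything, the &lt;code&gt;pg_replication_slots&lt;/code&gt; view can also tell you how close a slot is to causing trouble: &lt;code&gt;wal_status&lt;/code&gt; shows whether the required WAL is still reserved, and &lt;code&gt;safe_wal_size&lt;/code&gt; shows how many more bytes can be written before the slot is at risk (both available since Postgres 13; &lt;code&gt;safe_wal_size&lt;/code&gt; is NULL unless &lt;code&gt;max_slot_wal_keep_size&lt;/code&gt; is set):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT slot_name, wal_status, safe_wal_size FROM pg_replication_slots;
 slot_name | wal_status | safe_wal_size
-----------+------------+---------------
 test_slot | reserved   |
(1 row)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;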

&lt;p&gt;In order to clearly show the space reclamation, turn off recycling of WALs (on by default)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ echo "wal_recycle = off" &amp;gt;&amp;gt; $PGDATA/postgresql.conf
$ pg_ctl -l logfile restart  # Assumes data directory is already set to PGDATA
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Drop the replication slot&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT pg_drop_replication_slot('test_slot');
 pg_drop_replication_slot
--------------------------

(1 row)
-- Request a checkpoint immediately to cleanup older WAL files.
postgres=# CHECKPOINT;
CHECKPOINT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The slot has been successfully deleted. The &lt;code&gt;pg_wal&lt;/code&gt; directory has also been freed of old WAL files&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ls -lhi $PGDATA/pg_wal
total 17M
518602 -rw------- 1 postgres postgres  16M Aug 28 17:52 00000001000000000000000A
516118 drwx------ 2 postgres postgres 4.0K Aug 28 17:48 archive_status
516119 drwx------ 2 postgres postgres 4.0K Aug 28 17:48 summaries
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The size of the &lt;code&gt;pg_wal&lt;/code&gt; directory is now 17MB, down from 161MB: roughly an 89% reduction. That makes a big difference, especially in a large database where WAL files fill up quickly. The &lt;code&gt;pg_replication_slots&lt;/code&gt; view can be used to quickly find and delete idle slots.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT slot_name, database, active, xmin, restart_lsn, confirmed_flush_lsn, synced FROM pg_replication_slots WHERE active = 'f';
 slot_name  | database | active | xmin | restart_lsn | confirmed_flush_lsn | synced
------------+----------+--------+------+-------------+---------------------+--------
 test_slot1 | postgres | f      |      | 0/A116D08   | 0/A116D40           | f
 test_slot2 | postgres | f      |      | 0/A116D40   | 0/A116D78           | f
(2 rows)

-- Delete based on how long the slot has been idle(e.g 5 minutes)
postgres=# SELECT slot_name, pg_drop_replication_slot(slot_name) FROM pg_replication_slots WHERE (extract(epoch from now() - inactive_since) / 60) &amp;gt; 5 AND active = 'f';
 slot_name  | pg_drop_replication_slot
------------+--------------------------
 test_slot1 |
 test_slot2 |
(2 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compared to other solutions, replication slots provide a smoother experience for handling replication and monitoring replication progress. &lt;/p&gt;

</description>
      <category>postgres</category>
      <category>replication</category>
      <category>slots</category>
      <category>wal</category>
    </item>
    <item>
      <title>HA Postgres with Patroni and Barman</title>
      <dc:creator>Brian Misachi</dc:creator>
      <pubDate>Tue, 19 Aug 2025 19:53:39 +0000</pubDate>
      <link>https://dev.to/misachi/ha-postgres-with-patroni-and-barman-ien</link>
      <guid>https://dev.to/misachi/ha-postgres-with-patroni-and-barman-ien</guid>
      <description>&lt;p&gt;In the last post we managed to set up a HA 3-node Postgres cluster. We set up physical replication where WAL records are streamed from the primary to be replayed on the standbys. We were also able to manually failover to one of the standbys when the primary was shutdown. A lot of manual processes were involved which can be time consuming, error prone and could potentially lead to data loss. &lt;/p&gt;

&lt;p&gt;This post will show how &lt;a href="https://patroni.readthedocs.io/en/latest/index.html" rel="noopener noreferrer"&gt;Patroni&lt;/a&gt; can help automate an HA Postgres cluster and manage failover when the cluster becomes unhealthy. &lt;a href="https://docs.pgbarman.org/release/3.14.1/index.html" rel="noopener noreferrer"&gt;Barman&lt;/a&gt; will be used for backup and recovery management. Taking it further, we will add monitoring of the cluster using tools like Grafana and Prometheus. Exciting!&lt;/p&gt;

&lt;p&gt;We will be using scripts from this &lt;a href="https://github.com/misachi/ha_postgres" rel="noopener noreferrer"&gt;repository&lt;/a&gt;. Clone the repository to your local filesystem and follow along. &lt;/p&gt;

&lt;p&gt;First create 3 nodes &lt;code&gt;testPG.1&lt;/code&gt;, &lt;code&gt;testPG.2&lt;/code&gt; and &lt;code&gt;testPG.3&lt;/code&gt;. You can create as many nodes (containers) as your host allows. The &lt;code&gt;run_pg.sh&lt;/code&gt; script can be used for this part. Each node created will have Postgres server version 18beta1 and Patroni installed. If this is the first time running the script, it will also create an ETCD v3 container node as the DCS for the cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ARCHIVE_DIR and BACKUP_DIR should be separate directories

$ export ARCHIVE_DIR=/path/to/archive # Replace
$ sudo chown -R 991:991 $ARCHIVE_DIR # Ensure Postgres user owns the directory
$ export BACKUP_DIR=/path/to/backup/data  # Replace
$ sudo chown -R 991:991 $BACKUP_DIR # Ensure Postgres user owns the directory
$ nohup ./run_pg.sh 1 5432 &amp;gt; /tmp/test.1 &amp;amp;  # nohup ./run_pg.sh &amp;lt;node number&amp;gt; &amp;lt;port&amp;gt; &amp;amp;
$ nohup ./run_pg.sh 2 5433 &amp;gt; /tmp/test.2 &amp;amp; # node 2
$ nohup ./run_pg.sh 3 5434 &amp;gt; /tmp/test.3 &amp;amp; # node 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. With the above commands we now have a 3-node cluster. Check the current state of the cluster and its members&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "patronictl -c patroni_config.yml list cluster1"
+ Cluster: cluster1 (7539602363152794219) --+----+-----------+----------------------+
| Member | Host       | Role    | State     | TL | Lag in MB | Tags                 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node1  | 172.17.0.3 | Leader  | running   |  2 |           | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node2  | 172.17.0.4 | Replica | streaming |  2 |         0 | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node3  | 172.17.0.5 | Replica | streaming |  2 |         0 | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Node1 (testPG.1) is the primary server. Nodes 2 and 3 are standbys, actively streaming changes from the primary. Since we have no data yet, neither standby is lagging behind the primary. &lt;/p&gt;

&lt;h3&gt;
  
  
  Failover
&lt;/h3&gt;

&lt;p&gt;When the cluster becomes unhealthy, e.g. the primary goes down for some reason, the patroni processes on each node coordinate and, by leader election, select a healthy standby node to become the new primary. This part rarely requires human intervention, unless something far out of the ordinary has happened, such as the DCS being down or an issue with the standbys.&lt;/p&gt;

&lt;p&gt;Let's bring down the current primary to see how patroni handles failover.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ CMD=`docker exec testPG.1 bash -c "ps -C patroni | grep patroni"` &amp;amp;&amp;amp; docker exec testPG.1 bash -c "kill `echo $CMD | awk '{print $1}'`"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We send a SIGTERM signal to allow the patroni process to cleanly shut down the Postgres server. Since this is the primary, Patroni performs a final checkpoint, flushing to disk any writes done since the last checkpoint. This can help reduce the amount of WAL to be replayed in recovery during the next restart.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ tail /tmp/test.1 -n 15
....
2025-08-18 11:34:37,128 INFO: no action. I am (node1), the leader with the lock
2025-08-18 11:34:40.692 UTC [40] LOG:  received fast shutdown request
2025-08-18 11:34:40.694 UTC [40] LOG:  aborting any active transactions
....
2025-08-18 11:34:40.696 UTC [44] LOG:  shutting down
2025-08-18 11:34:40.708 UTC [44] LOG:  checkpoint starting: shutdown immediate
2025-08-18 11:34:40.758 UTC [44] LOG:  checkpoint complete: wrote 0 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.051 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16383 kB, estimate=16383 kB; lsn=0/8000028, redo lsn=0/8000028
2025-08-18 11:34:40.822 UTC [40] LOG:  database system is shut down
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The node1 patroni process releases the leader lock, allowing other nodes to compete for it. The old primary is removed from the cluster's member list. If we check the current list we get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "patronictl -c patroni_config.yml list cluster1"
+ Cluster: cluster1 (7539602363152794219) --+----+-----------+----------------------+
| Member | Host       | Role    | State     | TL | Lag in MB | Tags                 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node2  | 172.17.0.4 | Leader  | running   |  3 |           | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node3  | 172.17.0.5 | Replica | streaming |  3 |         0 | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice &lt;code&gt;node1&lt;/code&gt; is gone and we have a new leader, &lt;code&gt;node2&lt;/code&gt;. We can then proceed to fix node1 before we bring it back to the cluster.&lt;/p&gt;

&lt;p&gt;Once the patroni process on each node detects the change in the cluster (a missing primary), the first healthy node to acquire the leader lock becomes the new primary while the remaining standbys begin to follow the new leader. In this case node2 (testPG.2) becomes the new primary while node3 (testPG.3) begins to follow it.&lt;/p&gt;

&lt;p&gt;Now you can send a SIGKILL signal to terminate the node1 patroni process&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ CMD=`docker exec testPG.1 bash -c "ps -C patroni | grep patroni"` &amp;amp;&amp;amp; docker exec testPG.1 bash -c "kill -9 `echo $CMD | awk '{print $1}'`"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The old primary can be put back into the cluster after being fixed. When this is done, it first attempts to acquire the leader lock; on detecting that the lock is already held by another server, it quickly transitions to a standby and follows the current leader.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ nohup ./run_pg.sh 1 &amp;gt; /tmp/test.1 &amp;amp;  # Bring old primary back up
....
2025-08-18 13:31:03,430 INFO: Lock owner: node2; I am node1
2025-08-18 13:31:03,430 INFO: establishing a new patroni heartbeat connection to postgres
cp: cannot stat '/home/postgres/.tmp/00000004.history': No such file or directory
2025-08-18 13:31:03.432 UTC [970] LOG:  waiting for WAL to become available at 0/9000018
2025-08-18 13:31:03,491 INFO: no action. I am (node1), a secondary, and following a leader (node2)
cp: cannot stat '/home/postgres/.tmp/000000030000000000000009': No such file or directory
2025-08-18 13:31:08.433 UTC [996] LOG:  started streaming WAL from primary at 0/9000000 on timeline 3
2025-08-18 13:31:14,023 INFO: no action. I am (node1), a secondary, and following a leader (node2)
2025-08-18 13:31:23,934 INFO: no action. I am (node1), a secondary, and following a leader (node2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now &lt;code&gt;node1&lt;/code&gt; has been added back to the cluster, as a standby.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "patronictl -c patroni_config.yml list cluster1"
+ Cluster: cluster1 (7539602363152794219) --+----+-----------+----------------------+
| Member | Host       | Role    | State     | TL | Lag in MB | Tags                 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node1  | 172.17.0.3 | Replica | streaming |  3 |         0 | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node2  | 172.17.0.4 | Leader  | running   |  3 |           | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node3  | 172.17.0.5 | Replica | streaming |  3 |         0 | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  SwitchOver
&lt;/h3&gt;

&lt;p&gt;In order to promote &lt;code&gt;node1&lt;/code&gt; back to the primary role we can use the &lt;code&gt;switchover&lt;/code&gt; command, given that this is a healthy cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "patronictl -c patroni_config.yml switchover cluster1 --leader node2 --candidate node1 --force"  # Using force to skip the prompts from patroni
Current cluster topology
+ Cluster: cluster1 (7539602363152794219) --+----+-----------+----------------------+
| Member | Host       | Role    | State     | TL | Lag in MB | Tags                 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node1  | 172.17.0.3 | Replica | streaming |  3 |         0 | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node2  | 172.17.0.4 | Leader  | running   |  3 |           | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node3  | 172.17.0.5 | Replica | streaming |  3 |         0 | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
2025-08-18 14:19:31.73068 Successfully switched over to "node1"
+ Cluster: cluster1 (7539602363152794219) ------------+----+-----------+----------------------+
| Member | Host       | Role    | State               | TL | Lag in MB | Tags                 |
+--------+------------+---------+---------------------+----+-----------+----------------------+
| node1  | 172.17.0.3 | Leader  | running             |  3 |           | clonefrom: true      |
|        |            |         |                     |    |           | failover_priority: 1 |
+--------+------------+---------+---------------------+----+-----------+----------------------+
| node2  | 172.17.0.4 | Replica | stopping            |    |   unknown | clonefrom: true      |
|        |            |         |                     |    |           | failover_priority: 1 |
+--------+------------+---------+---------------------+----+-----------+----------------------+
| node3  | 172.17.0.5 | Replica | in archive recovery |  3 |         0 | clonefrom: true      |
|        |            |         |                     |    |           | failover_priority: 1 |
+--------+------------+---------+---------------------+----+-----------+----------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cluster is still in a transitional state at this point.&lt;br&gt;
The switchover process, as seen from the old primary's (node2) logs, looks like this. The output is truncated for clarity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ tail /tmp/test.2 -n 100
2025-08-18 14:19:28,597 INFO: received switchover request with leader=node2 candidate=node1 scheduled_at=None
2025-08-18 14:19:28,608 INFO: Got response from node1
....
2025-08-18 14:19:28,658 INFO: Lock owner: node2; I am node2
2025-08-18 14:19:28,759 INFO: switchover: demoting myself
2025-08-18 14:19:28,759 INFO: Demoting self (graceful)
2025-08-18 14:19:28.764 UTC [39] LOG:  checkpoint starting: immediate force wait
....
2025-08-18 14:19:31,033 INFO: Leader key released
2025-08-18 14:19:31,082 INFO: Lock owner: node1; I am node2
2025-08-18 14:19:31,082 INFO: switchover: demote in progress
....
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cluster is back to a consistent state, with the original leader &lt;code&gt;node1&lt;/code&gt; restored and &lt;code&gt;node2&lt;/code&gt; and &lt;code&gt;node3&lt;/code&gt; as followers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "patronictl -c patroni_config.yml list cluster1"
+ Cluster: cluster1 (7539602363152794219) --+----+-----------+----------------------+
| Member | Host       | Role    | State     | TL | Lag in MB | Tags                 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node1  | 172.17.0.3 | Leader  | running   |  4 |           | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node2  | 172.17.0.4 | Replica | streaming |  4 |         0 | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node3  | 172.17.0.5 | Replica | streaming |  4 |         0 | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since the leader has changed multiple times, it is worth knowing that Patroni also provides a way to view leadership changes within the cluster&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "patronictl -c patroni_config.yml history cluster1"
+----+-----------+------------------------------+----------------------------------+------------+
| TL |       LSN | Reason                       | Timestamp                        | New Leader |
+----+-----------+------------------------------+----------------------------------+------------+
|  1 | 117440672 | no recovery target specified | 2025-08-18T09:23:50.986078+00:00 | node1      |
|  2 | 134217888 | no recovery target specified | 2025-08-18T11:34:45.003395+00:00 | node2      |
|  3 | 167772320 | no recovery target specified | 2025-08-18T14:19:31.253098+00:00 | node1      |
+----+-----------+------------------------------+----------------------------------+------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result shows three leadership changes between &lt;code&gt;node1&lt;/code&gt; and &lt;code&gt;node2&lt;/code&gt;, along with when each change occurred. &lt;/p&gt;
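&lt;p&gt;Note that the &lt;code&gt;LSN&lt;/code&gt; column in the history output is printed as a plain decimal number, while Postgres logs and views such as &lt;code&gt;pg_stat_replication&lt;/code&gt; use the hexadecimal &lt;code&gt;high/low&lt;/code&gt; &lt;code&gt;pg_lsn&lt;/code&gt; notation. Converting between the two is simple arithmetic; a minimal sketch (the helper name is mine, not part of patronictl):&lt;/p&gt;

```shell
#!/bin/bash
# Convert a decimal LSN (as printed by `patronictl history`) to the
# hexadecimal high/low pg_lsn notation Postgres uses elsewhere.
lsn_to_pg() {
  printf '%X/%X\n' $(( $1 / 4294967296 )) $(( $1 % 4294967296 ))  # split at 2^32
}

lsn_to_pg 117440672   # the timeline 1 switch point from the history above
```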

&lt;p&gt;Patroni is an extremely handy solution for managing a highly available Postgres cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Backups
&lt;/h2&gt;

&lt;p&gt;A truly HA solution requires data to be backed up in a separate location, so that if you lose data or cannot fully recover from the cluster itself, the backups can be used for recovery instead. The backup solution you pick depends on your business requirements.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Any solution for backups should at least use the &lt;code&gt;pg_basebackup&lt;/code&gt; utility. Remember that &lt;code&gt;pg_dump&lt;/code&gt; and &lt;code&gt;pg_dumpall&lt;/code&gt; are not backup tools&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Barman is an easy-to-use tool for managing backups. To set it up, first create the required roles by executing the queries below on the primary node&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "/usr/local/pgsql/bin/psql -U patroni_super -d postgres -c \"CREATE USER streaming_barman WITH REPLICATION ENCRYPTED PASSWORD 'streaming_barman'; CREATE USER barman WITH SUPERUSER ENCRYPTED PASSWORD 'barman';\""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run the &lt;code&gt;./barman.sh&lt;/code&gt; script. When it completes, a new container will be running with Barman installed in it. The script configures backups for the primary node. The &lt;code&gt;BACKUP_DIR&lt;/code&gt; directory is shared between the Barman container and the primary node (node1) container via a Docker volume. We'll later use &lt;code&gt;BACKUP_DIR&lt;/code&gt; to recover the primary node from Barman's backups. &lt;/p&gt;

&lt;p&gt;Check the current state of Barman.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -t barman bash -c "source /var/lib/barman/.bashrc &amp;amp;&amp;amp; barman check node1"
Server node1:
        PostgreSQL: OK
        superuser or standard user with backup privileges: OK
        PostgreSQL streaming: OK
        wal_level: OK
        replication slot: OK
        directories: OK
        retention policy settings: OK
        backup maximum age: OK (no last_backup_maximum_age provided)
        backup minimum size: OK (22.9 MiB)
        wal maximum age: OK (no last_wal_maximum_age provided)
        wal size: OK (0 B)
        compression settings: OK
        failed backups: OK (there are 0 failed backups)
        minimum redundancy requirements: OK (have 1 non-incremental backups, expected at least 0)
        pg_basebackup: OK
        pg_basebackup compatible: OK
        pg_basebackup supports tablespaces mapping: OK
        systemid coherence: OK
        pg_receivexlog: OK
        pg_receivexlog compatible: OK
        receive-wal running: OK
        archive_mode: OK
        archive_command: OK
        continuous archiving: OK
        archiver errors: OK
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything seems to be working as expected.&lt;/p&gt;

&lt;p&gt;Check whether any backups exist. The &lt;code&gt;barman.sh&lt;/code&gt; script creates an initial backup before setup completes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -t barman bash -c "source /var/lib/barman/.bashrc &amp;amp;&amp;amp; barman list-backups node1"
node1 20250818T183402 'first-backup' - F - Mon Aug 18 18:34:04 2025 - Size: 38.9 MiB - WAL Size: 0 B - WAITING_FOR_WALS
node1 20250818T183028 'first-backup' - F - Mon Aug 18 18:30:30 2025 - Size: 54.9 MiB - WAL Size: 16.0 MiB - WAITING_FOR_WALS
node1 20250817T174259 'first-backup' - F - Sun Aug 17 17:43:01 2025 - Size: 54.9 MiB - WAL Size: 16.0 MiB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
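&lt;p&gt;The second field of each line is the backup ID, which can be passed to commands such as &lt;code&gt;barman recover&lt;/code&gt; in place of the &lt;code&gt;latest&lt;/code&gt; shortcut. A sketch of pulling the most recent ID out of such a listing (the listing is inlined here as trimmed sample text; normally you would pipe the barman command itself):&lt;/p&gt;

```shell
#!/bin/bash
# Pick the most recent backup ID from `barman list-backups` style output.
listing="node1 20250818T183402 first-backup - F - Mon Aug 18 18:34:04 2025
node1 20250818T183028 first-backup - F - Mon Aug 18 18:30:30 2025
node1 20250817T174259 first-backup - F - Sun Aug 17 17:43:01 2025"

# The second field is the backup ID; IDs are timestamps, so a lexical
# sort is also a chronological sort.
latest=$(echo "$listing" | awk '{print $2}' | sort -r | head -n 1)
echo "$latest"
```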



&lt;p&gt;Also, creating a new backup is as easy as running the command below. All backups are stored under &lt;code&gt;/var/lib/barman/node1/base&lt;/code&gt; in the Barman node (container).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -t barman bash -c "source /var/lib/barman/.bashrc &amp;amp;&amp;amp; barman backup --name first-backup node1"
Starting backup using postgres method for server node1 in /var/lib/barman/node1/base/20250818T183402
Backup start at LSN: 0/C0000C8 (00000004000000000000000C, 000000C8)
Starting backup copy via pg_basebackup for 20250818T183402
Copy done (time: 1 second)
Finalising the backup.
....
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
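&lt;p&gt;The name in parentheses next to the start LSN is the WAL segment file that contains it. Segment filenames encode the timeline and a segment number derived from the LSN. A sketch of that mapping, assuming the default 16 MiB segment size (the helper name is mine):&lt;/p&gt;

```shell
#!/bin/bash
# Map (timeline, LSN) to the WAL segment filename that holds that LSN,
# assuming the default 16 MiB segment size.
wal_segment_name() {
  local tli=$1 lsn=$2
  local hi=$(( 16#${lsn%%/*} )) lo=$(( 16#${lsn##*/} ))
  local segno=$(( (hi * 4294967296 + lo) / 16777216 ))   # total bytes / 16 MiB
  # Filename layout: 8 hex digits each for timeline, high and low segment id
  printf '%08X%08X%08X\n' "$tli" $(( segno / 256 )) $(( segno % 256 ))
}

wal_segment_name 4 0/C0000C8   # the start LSN of the backup above
```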



&lt;h3&gt;
  
  
  Restoring from backup
&lt;/h3&gt;

&lt;p&gt;A backup is only valid after you have tested it and confirmed you can recover from it. We can test this on the primary node (node1).&lt;/p&gt;

&lt;p&gt;Create a new table &lt;code&gt;foo&lt;/code&gt; and insert 100 records into it&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "/usr/local/pgsql/bin/psql -U patroni_super
-d postgres -c \"CREATE TABLE foo(id SERIAL PRIMARY KEY, k INT NOT NULL); INSERT INTO foo(k) SELECT i FROM generate_series(1, 100) i;
\""
CREATE TABLE
INSERT 0 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run a new backup with barman&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -t barman bash -c "source /var/lib/barman/.bashrc &amp;amp;&amp;amp; barman backup --name first-backup node1"
Starting backup using postgres method for server node1 in /var/lib/barman/node1/base/20250818T185835
Backup start at LSN: 0/E02A6F0 (00000004000000000000000E, 0002A6F0)
Starting backup copy via pg_basebackup for 20250818T185835
Copy done (time: 7 seconds)
Finalising the backup.
Backup size: 22.9 MiB
Backup end at LSN: 0/10000060 (000000040000000000000010, 00000060)
Backup completed (start time: 2025-08-18 18:58:35.575776, elapsed time: 7 seconds)
Processing xlog segments from streaming for node1 (batch size: 3)
        00000004000000000000000E
        00000004000000000000000F
        000000040000000000000010
Processing xlog segments from file archival for node1
        00000004000000000000000E
        00000004000000000000000F
        00000004000000000000000F.00000028.backup
        000000040000000000000010
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's put the cluster in maintenance mode to prevent automatic failover. With that done, we can shut down the primary node and delete its data directory. We'll then use Barman to recover the data (remember the &lt;code&gt;BACKUP_DIR&lt;/code&gt; we used earlier).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "patronictl -c patroni_config.yml pause cluster1 --wait"
'pause' request sent, waiting until it is recognized by all nodes
Success: cluster management is paused
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cluster is paused.&lt;/p&gt;

&lt;p&gt;Stop the Postgres server and remove the data directory&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "/usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data/ stop"
waiting for server to shut down.... done
server stopped
$ docker exec testPG.1 bash -c "rm -rf /usr/local/pgsql/data/*"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;BACKUP_DIR&lt;/code&gt; directory should now be empty&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ls -la $BACKUP_DIR
total 8
drwx------ 2     991     991 4096 Aug 18 22:35 .
drwxrwxrwx 4 vagrant vagrant 4096 Aug 17 20:01 ..
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's restore the data and test whether the primary can fully recover when we bring it back up. We use the &lt;code&gt;latest&lt;/code&gt; backup we have from Barman.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -t barman bash -c "source /var/lib/barman/.bashrc &amp;amp;&amp;amp; barman cron &amp;amp;&amp;amp; barman recover node1 latest /home/postgres/.backup"
Starting WAL archiving for server node1
Starting streaming archiver for server node1
Starting check-backup for backup 20250818T183402 of server node1
Processing xlog segments from file archival for node1
        000000020000000000000008
Starting local restore for server node1 using backup 20250818T185835
Destination directory: /home/postgres/.backup
Copying the base backup.
Copying required WAL segments.
Generating archive status files
Identify dangerous settings in destination directory.

IMPORTANT
These settings have been modified to prevent data losses

postgresql.conf line 4: archive_command = false
postgresql.conf line 27: recovery_target = None
postgresql.conf line 28: recovery_target_lsn = None
postgresql.conf line 29: recovery_target_name = None
postgresql.conf line 30: recovery_target_time = None
postgresql.conf line 31: recovery_target_timeline = None
postgresql.conf line 32: recovery_target_xid = None
....
Restore operation completed (start time: 2025-08-18 19:42:46.109241+00:00, elapsed time: less than one second)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The backup has been restored to the data location of the primary server. Now we can restart the primary server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ nohup ./run_pg.sh 1 &amp;gt; /tmp/test.1 &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you get a "Postgresql is not running." warning, start the server with&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "/usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data/ start"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The primary is up and is still the leader of the cluster. Let's check if our data is intact.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "/usr/local/pgsql/bin/psql -U patroni_super -d postgres -c \"SELECT count(*) FROM foo;\""
 count
-------
   100
(1 row)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yes. We have recovered all the data from the backups.&lt;/p&gt;

&lt;p&gt;But we are still in a maintenance state.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "patronictl -c patroni_config.yml list cluster1"
+ Cluster: cluster1 (7539602363152794219) --+----+-----------+----------------------+
| Member | Host       | Role    | State     | TL | Lag in MB | Tags                 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node1  | 172.17.0.3 | Leader  | running   |  4 |           | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node2  | 172.17.0.4 | Replica | streaming |  4 |         0 | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
| node3  | 172.17.0.5 | Replica | streaming |  4 |         0 | clonefrom: true      |
|        |            |         |           |    |           | failover_priority: 1 |
+--------+------------+---------+-----------+----+-----------+----------------------+
 Maintenance mode: on
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Resume the cluster to exit the maintenance state&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "patronictl -c patroni_config.yml resume cluster1 --wait"
'resume' request sent, waiting until it is recognized by all nodes
Success: cluster management is resumed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now automatic failover is back in place and any one of the healthy standbys can replace the primary when needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring
&lt;/h2&gt;

&lt;p&gt;Finally, you can add monitoring and visualize the cluster by running the script &lt;code&gt;./grafana.sh&lt;/code&gt; and following the steps outlined &lt;a href="https://github.com/misachi/ha_postgres?tab=readme-ov-file#ha_postgres" rel="noopener noreferrer"&gt;here&lt;/a&gt;. &lt;br&gt;
You'll need to create a new role for the &lt;a href="https://github.com/prometheus-community/postgres_exporter" rel="noopener noreferrer"&gt;Postgres exporter&lt;/a&gt; to use when exporting metrics from Postgres to Prometheus. The exporter is configured to expose only the default metrics. If additional metrics are needed, check the &lt;a href="https://github.com/prometheus-community/postgres_exporter" rel="noopener noreferrer"&gt;flags&lt;/a&gt; for enabling some of the disabled collectors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "/usr/local/pgsql/bin/psql -U patroni_super -d postgres -c \"CREATE USER prom_pg_exporter WITH SUPERUSER ENCRYPTED PASSWORD 'prom_pg_exporter';\""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;grafana.sh&lt;/code&gt; script installs both Grafana for visualization and Prometheus for collecting and storing metrics from Postgres. You can add as many dashboards as you see fit for your use.&lt;/p&gt;
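&lt;p&gt;Under the hood, Prometheus needs a scrape job pointing at the exporter. A minimal sketch of such a job (the job name and target address are illustrative; 9187 is the exporter's default listen port, so check what &lt;code&gt;grafana.sh&lt;/code&gt; actually configures):&lt;/p&gt;

```yaml
# prometheus.yml (fragment): scrape the postgres_exporter periodically.
scrape_configs:
  - job_name: "postgres"
    scrape_interval: 15s
    static_configs:
      - targets: ["172.17.0.3:9187"]   # illustrative address; 9187 is the exporter default
```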

</description>
      <category>postgres</category>
      <category>highavailability</category>
      <category>patroni</category>
      <category>barman</category>
    </item>
    <item>
      <title>High Availability Postgres</title>
      <dc:creator>Brian Misachi</dc:creator>
      <pubDate>Tue, 19 Aug 2025 19:52:53 +0000</pubDate>
      <link>https://dev.to/misachi/high-availability-postgres-4ahe</link>
      <guid>https://dev.to/misachi/high-availability-postgres-4ahe</guid>
<description>&lt;p&gt;High Availability (HA) refers to a system's ability to be accessible to its users 100 percent of the time (or as close to 100 as possible). It is usually achieved through proper redundancy: when a component in the system fails, and failover is correctly implemented, another healthy component can replace the faulty one and continue servicing requests with little to no human intervention. An HA system can be implemented with a single Primary (leader) component and other Secondary components (followers) that provide the required redundancy. At any given time, one of the secondary components can take the place of the primary if the system requires it. &lt;/p&gt;

&lt;p&gt;This post will focus on building an HA cluster with Postgres. The HA solutions offered by Postgres can be categorized into synchronous and asynchronous. In a synchronous solution, when the database receives a commit command from a user, it writes the changes made up to that point to its own permanent storage, then waits for confirmation from one or more configured standbys that the data has been flushed to their own permanent storage before returning success to the user. In an asynchronous solution, the database returns success once the data has been flushed to its own permanent storage only; data is streamed to standby servers at a later time, so there is a window in which data can be lost if the primary fails. &lt;/p&gt;
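&lt;p&gt;In Postgres, the choice between the two modes is configuration rather than architecture. A minimal sketch of the settings involved (the standby names and quorum are illustrative):&lt;/p&gt;

```conf
# postgresql.conf (fragment)
# Wait for at least one of the named standbys to confirm the flush before
# a COMMIT returns; leave this empty ('') for asynchronous replication.
synchronous_standby_names = 'FIRST 1 (node2, node3)'
synchronous_commit = on   # 'remote_apply', 'remote_write' and 'local' trade safety for latency
```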

&lt;p&gt;The way data is moved between the primary and the standbys is through the process of replication. Replication can be achieved synchronously or asynchronously. There are two types of replication available in Postgres.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.postgresql.org/docs/current/warm-standby.html#STREAMING-REPLICATION" rel="noopener noreferrer"&gt;Physical replication&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.postgresql.org/docs/current/logical-replication.html" rel="noopener noreferrer"&gt;Logical replication&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The main difference between the two is that in physical replication the exact WAL records, at the exact WAL locations, are streamed to standbys. Physical replication is more performant than logical replication since the WAL data is streamed as-is, with no modifications. Logical replication deals with publishers and subscribers: modifications to data are formatted (as a change to a row with the DML type, the old row value and the new row value) and streamed to subscribers, which pull data from the publications they are subscribed to. The biggest advantage of logical replication is the ability to work across major versions of Postgres. &lt;/p&gt;

&lt;p&gt;This post will focus on asynchronous (physical) replication. The secondary servers will be operated as hot standbys, meaning they can accept read queries and, when healthy, are candidates to replace the primary server. Docker containers will be used for isolation (mimicking different physical nodes) for all servers; each container represents a separate node. All the containers will be running Postgres version 18beta1 compiled from source. &lt;/p&gt;

&lt;p&gt;Create the Postgres image by running the &lt;code&gt;./build.sh&lt;/code&gt; script below&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#! /bin/bash

set -ex

IMG=postgres/test-0.0.1
IMG_ID=`docker images ${IMG} -q`
PG_TAG=REL_18_BETA1

if [ "${IMG_ID}" = "" ]; then
  if [ ! -d "postgres-${PG_TAG}" ]; then
    wget https://github.com/postgres/postgres/archive/refs/tags/${PG_TAG}.tar.gz &amp;amp;&amp;amp; tar -xzf ${PG_TAG}.tar.gz
  fi
  ID=991
  USR=postgres
  USR_HOME=/home/postgres

  cat &amp;gt; Dockerfile &amp;lt;&amp;lt; EOF
FROM ubuntu:latest
RUN groupadd -g ${ID} ${USR} &amp;amp;&amp;amp; useradd -r -u ${ID} -g ${USR} ${USR}
ADD postgres-${PG_TAG} ${USR_HOME}
WORKDIR ${USR_HOME}
RUN chown -R ${USR}:${USR} ${USR_HOME}
RUN apt-get update &amp;amp;&amp;amp; apt-get install -y g++ zlib1g-dev make curl tar gzip perl liblz4-dev libreadline-dev flex bison libicu-dev liburing-dev
RUN apt-get install --reinstall -y pkg-config &amp;amp;&amp;amp; ./configure  --with-liburing --enable-debug --with-lz4 &amp;amp;&amp;amp; make -j4 &amp;amp;&amp;amp; make all &amp;amp;&amp;amp; make install
RUN echo "export PATH=/usr/local/pgsql/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" &amp;gt;&amp;gt; /etc/bash.bashrc &amp;amp;&amp;amp; \
      chown -R ${USR}:${USR} /usr/local/pgsql
USER ${USR}
EOF

  docker build -t ${IMG}:latest .
  rm Dockerfile
  rm -rf postgres-${PG_TAG} ${PG_TAG}.tar.gz
else
  echo "Image ${IMG} already exists with ID ${IMG_ID}"
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the image built, create the first container for running the primary. We will call it &lt;code&gt;testPG.1&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ export ARCHIVE_DIR=~/mydir # Replace
$ mkdir -p $ARCHIVE_DIR
$ sudo chown -R 991:991 $ARCHIVE_DIR # Ensure Postgres user owns the directory
$ docker run -d --name testPG.1 -p 5432:5432 --mount type=bind,source=${ARCHIVE_DIR},target=/home/postgres/.tmp --restart=on-failure postgres/test-0.0.1:latest bash -c 'tail /dev/null -f'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Remember to update &lt;code&gt;ARCHIVE_DIR&lt;/code&gt; to a valid directory in your local filesystem.&lt;/p&gt;

&lt;p&gt;Once the first node is up and running, a few configuration settings need to be changed&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec -it testPG.1 bash  # Attach to a tty so commands can be executed inside the container
postgres@e67b763ec2fb:~$ echo "export PGDATA=/usr/local/pgsql/data" &amp;gt;&amp;gt; ~/.bashrc &amp;amp;&amp;amp; source ~/.bashrc  
postgres@e67b763ec2fb:~$ pg_ctl -l logfile initdb  # Only run this if the data directory does not already exist
postgres@e67b763ec2fb:~$ echo "host    replication repl 172.17.0.0/16 md5" &amp;gt;&amp;gt; $PGDATA/pg_hba.conf  # Add replication user to the `pg_hba.conf` file. Replace address 172.17.0.0 appropriately. Use the command "ifconfig docker0 | grep -w "inet" | awk '{print $2}' | sed 's/[0-9].[0-9]$/0.0/'" to get right interface address on your system
postgres@e67b763ec2fb:~$ cat &amp;gt;&amp;gt; $PGDATA/postgresql.conf &amp;lt;&amp;lt; EOF
wal_level = replica  # Allow physical replication
archive_mode = on  # Turn on WAL archiving
archive_command = 'test ! -f /home/postgres/.tmp/%f &amp;amp;&amp;amp; cp %p /home/postgres/.tmp/%f'  # Send old WAL files to archive location
restore_command = 'cp /home/postgres/.tmp/%f %p'  # Restore WALs from archive
wal_keep_size = 512  # Amount of WALs in MBs to keep before removal
listen_addresses = '*'  # Allow all. Not safe at all
EOF
postgres@e67b763ec2fb:~$ pg_ctl -l logfile start
postgres@e67b763ec2fb:~$ psql -c "CREATE USER repl WITH REPLICATION ENCRYPTED PASSWORD 'repl';" 
postgres@e67b763ec2fb:~$ exit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
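&lt;p&gt;The &lt;code&gt;archive_command&lt;/code&gt; above is worth unpacking: Postgres substitutes &lt;code&gt;%p&lt;/code&gt; with the path of the WAL segment to archive and &lt;code&gt;%f&lt;/code&gt; with its filename, and the &lt;code&gt;test ! -f&lt;/code&gt; guard makes the command fail rather than overwrite a segment that was already archived. A local simulation of that behaviour, using temporary directories instead of the real WAL and archive locations (the function name is mine):&lt;/p&gt;

```shell
#!/bin/bash
# Simulate: archive_command = 'test ! -f ARCHIVE/%f && cp %p ARCHIVE/%f'
wal_dir=$(mktemp -d)       # stands in for $PGDATA/pg_wal
archive_dir=$(mktemp -d)   # stands in for /home/postgres/.tmp

seg="000000010000000000000001"
echo "fake wal data" > "$wal_dir/$seg"

archive_wal() {   # a non-zero exit tells Postgres the segment was NOT archived
  test ! -f "$archive_dir/$(basename "$1")" && cp "$1" "$archive_dir/$(basename "$1")"
}

archive_wal "$wal_dir/$seg" && echo "archived $seg"
archive_wal "$wal_dir/$seg" || echo "refused: $seg already in the archive"
```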



&lt;p&gt;Now our soon-to-be primary node should be up and running.&lt;/p&gt;

&lt;p&gt;Next, create another container as the second node. We will call it &lt;code&gt;testPG.2&lt;/code&gt;. Remember to use the same &lt;code&gt;ARCHIVE_DIR&lt;/code&gt; location.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ export ARCHIVE_DIR=/mydir
$ docker run -d --name testPG.2 -p 5433:5432 --mount type=bind,source=${ARCHIVE_DIR},target=/home/postgres/.tmp --restart=on-failure postgres/test-0.0.1:latest bash -c 'tail /dev/null -f'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the &lt;code&gt;testPG.2&lt;/code&gt; node, remove the data directory if it already exists, then use the &lt;code&gt;pg_basebackup&lt;/code&gt; tool to copy data from the primary server (testPG.1) and begin the replication process.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ echo "export PGDATA=/usr/local/pgsql/data" &amp;gt;&amp;gt; ~/.bashrc &amp;amp;&amp;amp; source ~/.bashrc
$ docker exec testPG.2 bash -c "/usr/local/pgsql/bin/pg_ctl -D $PGDATA -l logfile stop &amp;amp;&amp;amp; rm -rf $PGDATA"

# If you get a "pg_ctl: directory "/usr/local/pgsql/data" does not exist" message, that is fine. The directory hasn't yet been created

$ SERV_IP=`docker inspect --format='{{ .NetworkSettings.IPAddress }}' testPG.1` &amp;amp;&amp;amp; docker exec -u root testPG.2 bash -c "echo '$SERV_IP:5432:replication:repl:repl' &amp;gt; /home/postgres/.pgpass &amp;amp;&amp;amp; chmod 0600 /home/postgres/.pgpass &amp;amp;&amp;amp; chown postgres /home/postgres/.pgpass"

$ SERV_IP=`docker inspect --format='{{ .NetworkSettings.IPAddress }}' testPG.1` &amp;amp;&amp;amp; docker exec -t testPG.2 bash -c "/usr/local/pgsql/bin/pg_basebackup -D $PGDATA -h $SERV_IP -p 5432 -Xs -R -P -U repl &amp;amp;&amp;amp; /usr/local/pgsql/bin/pg_ctl -D $PGDATA -l logfile start"
23724/23724 kB (100%), 1/1 tablespace
waiting for server to start.... done
server started 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
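&lt;p&gt;The &lt;code&gt;.pgpass&lt;/code&gt; line written above follows Postgres's standard password-file format, one entry per line (the address shown here is illustrative):&lt;/p&gt;

```conf
# ~/.pgpass format: hostname:port:database:username:password
# The literal word `replication` in the database field matches physical
# replication connections, which is what pg_basebackup and the
# walreceiver open.
172.17.0.3:5432:replication:repl:repl
```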



&lt;p&gt;The server immediately goes into standby mode and starts receiving WAL records from the primary (testPG.1). Check the processes running under testPG.1&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "ps -faux | grep postgres"
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
postgres     276  0.0  0.4 208844 24484 ?        Ss   17:46   0:00 /usr/local/pgsql/bin/postgres
....
postgres     285  0.0  0.0 208976  4384 ?        Ss   17:46   0:00  \_ postgres: archiver last was 000000010000000000000005.00000060.backup
postgres     286  0.0  0.1 210424  5248 ?        Ss   17:46   0:00  \_ postgres: logical replication launcher
postgres     319  0.0  0.1 210684  9256 ?        Ss   17:48   0:00  \_ postgres: walsender repl 172.17.0.1(47148) streaming 0/6000168
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is a single WAL sender process streaming records to the standby server.&lt;br&gt;
In the &lt;code&gt;testPG.2&lt;/code&gt; node, a WAL receiver process is running which receives streamed WAL records from the primary and re-applies the changes on the standby server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.2 bash -c "ps -faux | grep postgres"
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
....
postgres      31  0.0  0.4 208844 24588 ?        Ss   17:48   0:00 /usr/local/pgsql/bin/postgres
....
postgres      37  0.0  0.1 209596  6040 ?        Ss   17:48   0:00  \_ postgres: startup recovering 000000010000000000000006
postgres      46  0.0  0.0 209452  4800 ?        Ss   17:48   0:00  \_ postgres: walreceiver streaming 0/6000168
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Currently, we have one primary server and one standby server. The primary services both read and write requests while the standby is limited to read-only requests. Read requests are mostly SELECT-like queries from users, while write requests involve modifications to data (DDL, DML queries).&lt;/p&gt;

&lt;p&gt;We can easily add more standbys to our cluster as needed, to help with load distribution and to reduce the chances of a service outage should both servers (primary and standby) become unavailable at the same time. Adding another standby (testPG.3) is straightforward and similar to what we did when setting up &lt;code&gt;testPG.2&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker run -d --name testPG.3 -p 5434:5432 --mount type=bind,source=${ARCHIVE_DIR},target=/home/postgres/.tmp --restart=on-failure postgres/test-0.0.1:latest bash -c 'tail /dev/null -f'

$ SERV_IP=`docker inspect --format='{{ .NetworkSettings.IPAddress }}' testPG.1` &amp;amp;&amp;amp; docker exec -u root testPG.3 bash -c "echo '$SERV_IP:5432:replication:repl:repl' &amp;gt; /home/postgres/.pgpass &amp;amp;&amp;amp; chmod 0600 /home/postgres/.pgpass &amp;amp;&amp;amp; chown postgres /home/postgres/.pgpass"

$ SERV_IP=`docker inspect --format='{{ .NetworkSettings.IPAddress }}' testPG.1` &amp;amp;&amp;amp; docker exec -t testPG.3 bash -c "/usr/local/pgsql/bin/pg_basebackup -D $PGDATA -h $SERV_IP -p 5432 -Xs -R -P -U repl &amp;amp;&amp;amp; /usr/local/pgsql/bin/pg_ctl -D $PGDATA -l logfile start"
23724/23724 kB (100%), 1/1 tablespace
waiting for server to start.... done
server started 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the second standby server &lt;code&gt;testPG.3&lt;/code&gt; up and running, the primary server &lt;code&gt;testPG.1&lt;/code&gt; should now have two WAL sender processes streaming data to the standbys&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "ps -faux | grep postgres"
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
....
postgres      17  0.0  0.4 208844 24460 ?        Ss   10:12   0:00 /usr/local/pgsql/bin/postgres
....
postgres      34  0.0  0.2 211036 12732 ?        Ss   10:13   0:00  \_ postgres: walsender repl 172.17.0.1(60918) streaming 0/8000060
postgres      82  0.0  0.1 210568  7852 ?        Ss   10:22   0:00  \_ postgres: walsender repl 172.17.0.1(57804) streaming 0/8000060
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Failover
&lt;/h2&gt;

&lt;p&gt;In a highly available system, when the primary fails, one of the healthy standbys needs to replace it in order to minimize service disruptions. In our case, one of the standbys gets promoted to the new primary and starts servicing incoming write queries from users. Let's promote the first standby server &lt;code&gt;testPG.2&lt;/code&gt;. Later we can make the second standby &lt;code&gt;testPG.3&lt;/code&gt; start following the new primary (testPG.2). We'll then bring up the old primary server &lt;code&gt;testPG.1&lt;/code&gt; as a standby, following the new leader.&lt;/p&gt;

&lt;p&gt;First, shut down the current primary (testPG.1)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "/usr/local/pgsql/bin/pg_ctl -D $PGDATA -l logfile stop"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we do not have a primary in our cluster. We need to act quickly to promote a healthy standby to take its place and start accepting writes.&lt;br&gt;
Next, promote the first standby (testPG.2)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.2 bash -c "/usr/local/pgsql/bin/pg_ctl -D $PGDATA -l logfile promote"
waiting for server to promote.... done
server promoted
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have a new primary, which is good. But our second standby still does not recognize it, so we need to make it follow the new leader by updating its primary connection string. Get the current connection information on &lt;code&gt;testPG.3&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.3 bash -c "/usr/local/pgsql/bin/psql -h localhost -c \"SHOW primary_conninfo;\""
                                                                                                                                              primary_conninfo

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------
 user=repl password=repl channel_binding=disable host=172.17.0.1 port=5432 sslmode=disable sslnegotiation=postgres sslcompression=0 sslcertmode=disable sslsni=1 ssl_min_protocol_version=TLSv1.2 gssencmode=disa
ble krbsrvname=postgres gssdelegation=0 target_session_attrs=any load_balance_hosts=disable
(1 row)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The port still references the old primary. We need to change that. Copy the entire string and update the port to &lt;code&gt;5433&lt;/code&gt; for the &lt;code&gt;testPG.2&lt;/code&gt; server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ DOCKER_BRIDGE_INTERFACE=`ifconfig docker0 | grep -w "inet" | awk '{print $2}'` &amp;amp;&amp;amp; \
    docker exec testPG.3 bash -c "/usr/local/pgsql/bin/psql -h localhost -c \"ALTER SYSTEM SET primary_conninfo = '\
        user=repl password=repl channel_binding=disable host=$DOCKER_BRIDGE_INTERFACE \
        port=5433 sslmode=disable sslnegotiation=postgres sslcompression=0 \
        sslcertmode=disable sslsni=1 ssl_min_protocol_version=TLSv1.2 \
        gssencmode=disable krbsrvname=postgres gssdelegation=0 \
        target_session_attrs=any load_balance_hosts=disable';\""
ALTER SYSTEM

$ docker exec testPG.3 bash -c "/usr/local/pgsql/bin/psql -h localhost -c \"SELECT pg_reload_conf();\"" # Reload configurations
 pg_reload_conf
----------------
 t
(1 row)
$ docker exec testPG.3 bash -c "/usr/local/pgsql/bin/psql -h localhost -c \"SHOW primary_conninfo;\""
                                                                                                                                              primary_conninfo

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------
 user=repl password=repl channel_binding=disable host=172.17.0.1 port=5433 sslmode=disable sslnegotiation=postgres sslcompression=0 sslcertmode=disable sslsni=1 ssl_min_protocol_version=TLSv1.2 gssencmode=disable krbsrvname=postgres gssdelegation=0 target_session_attrs=any load_balance_hosts=disable
(1 row)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The standby is now following the new leader. Data is being streamed from the new primary (testPG.2) over the 172.17.0.1:54356 socket to the standby, which is what we want.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.2 bash -c "ps -faux | grep postgres"
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
....
postgres      17  0.0  0.4 208844 24732 ?        Ss   10:13   0:00 /usr/local/pgsql/bin/postgres
....
postgres    1166  0.0  0.1 210568  7800 ?        Ss   11:27   0:00  \_ postgres: walsender repl 172.17.0.1(54356) streaming 0/90001E0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can bring back the old primary (testPG.1) and add it to the cluster as a standby server.&lt;br&gt;
Before restarting the server, ensure an empty &lt;code&gt;standby.signal&lt;/code&gt; file exists in the data directory. This ensures it starts up as a standby, after which we &lt;br&gt;
can connect to it to update its connection string. The &lt;code&gt;standby.signal&lt;/code&gt; file was not created manually for the earlier standbys because &lt;code&gt;pg_basebackup&lt;/code&gt; creates it automatically after it is done copying data to the new location.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "touch $PGDATA/standby.signal &amp;amp;&amp;amp; /usr/local/pgsql/bin/pg_ctl -D $PGDATA -l logfile start"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The server should now be up and running as a standby. But it still does not know how to reach the new primary (testPG.2) server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "/usr/local/pgsql/bin/psql -h localhost -c \"SHOW primary_conninfo;\""
 primary_conninfo
------------------

(1 row)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The connection string is empty. This is because the &lt;code&gt;testPG.1&lt;/code&gt; server was the initial primary and did not need to stream data from another server, hence no connection string was required. The connection string needs to be updated with valid key-value pairs&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ DOCKER_BRIDGE_INTERFACE=`ifconfig docker0 | grep -w "inet" | awk '{print $2}'` &amp;amp;&amp;amp; \
    docker exec testPG.1 bash -c "/usr/local/pgsql/bin/psql -h localhost -c \"ALTER SYSTEM SET primary_conninfo = '\
        user=repl password=repl channel_binding=disable host=$DOCKER_BRIDGE_INTERFACE \
        port=5433 sslmode=disable sslnegotiation=postgres sslcompression=0 \
        sslcertmode=disable sslsni=1 ssl_min_protocol_version=TLSv1.2 \
        gssencmode=disable krbsrvname=postgres gssdelegation=0 \
        target_session_attrs=any load_balance_hosts=disable';\""
ALTER SYSTEM

$ docker exec testPG.1 bash -c "/usr/local/pgsql/bin/psql -h localhost -c \"SELECT pg_reload_conf();\"" # Reload configurations
 pg_reload_conf
----------------
 t
(1 row)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;testPG.1&lt;/code&gt; now follows the new leader, &lt;code&gt;testPG.2&lt;/code&gt;, and is able to receive streamed WAL records, as shown below by the WAL receiver process running and connected to the primary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker exec testPG.1 bash -c "ps -faux | grep postgres"
....
postgres     226  0.0  0.4 208844 24736 ?        Ss   11:39   0:00 /usr/local/pgsql/bin/postgres
....
postgres     232  0.0  0.1 210620  6308 ?        Ss   11:39   0:00  \_ postgres: startup recovering 000000020000000000000009
postgres     836  0.0  0.0 210488  4728 ?        Ss   11:47   0:00  \_ postgres: walreceiver streaming 0/90001E0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have successfully built a "3-node" cluster with one primary and two standbys using Docker containers. We have also managed to fail over to a standby when the primary was unreachable, and to reconfigure the remaining standby to follow the new primary.&lt;br&gt;
The most notable part of the whole process is the manual intervention required, especially for failover. Tools such as &lt;a href="https://patroni.readthedocs.io/en/latest/index.html" rel="noopener noreferrer"&gt;Patroni&lt;/a&gt; or &lt;a href="https://www.repmgr.org/" rel="noopener noreferrer"&gt;Repmgr&lt;/a&gt; exist to make the process of setting up and managing HA less painful. HA using Patroni will be covered in another &lt;a href="https://dev.to/misachi/ha-postgres-with-patroni-and-barman-ien"&gt;post&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>highavailability</category>
    </item>
    <item>
      <title>Fun with Postgres Recovery</title>
      <dc:creator>Brian Misachi</dc:creator>
      <pubDate>Wed, 09 Jul 2025 23:14:59 +0000</pubDate>
      <link>https://dev.to/misachi/fun-with-postgres-recovery-f5e</link>
      <guid>https://dev.to/misachi/fun-with-postgres-recovery-f5e</guid>
      <description>&lt;p&gt;I recently had an issue with a test Postgres server that didn't shut down cleanly and could not restart because it was stuck in recovery. I tried all the quick tricks I could get from browsing the internet but all of them failed. I ended up using &lt;a href="https://www.postgresql.org/docs/current/app-pgresetwal.html" rel="noopener noreferrer"&gt;pg_resetwal&lt;/a&gt; to get to a point where the database could be able to start up again. This meant losing some data that had not been written to disk before the crash. I decided to look into how recovery really works for Postgres, which ended up in yet another extension &lt;a href="https://github.com/misachi/pg_wal_recovery" rel="noopener noreferrer"&gt;pg_wal_recovery&lt;/a&gt;.  In this post, I'll demonstrate a simple table recovery using Postgres. I'll be using &lt;code&gt;/usr/local/pgsql/data&lt;/code&gt; as my Postgres data location. If this is different for you, then you can change it.&lt;/p&gt;

&lt;p&gt;First, the definition of recovery [I googled :) ]: data recovery is the process of restoring a database to a consistent state after a crash or power loss. Changes to data are usually stored separately as a sequential log; the actual data pages (table data) can be written to permanent storage later. If a crash occurs before the data is flushed to disk, the database can be left in an inconsistent state, with changes in the log that are not reflected in the table files. During the next restart, the database needs to "fix" this inconsistency: it reads the log and replays every record from the last REDO location in order to restore the database to a consistent state.&lt;/p&gt;
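&lt;p&gt;To make the replay idea concrete, here is a toy sketch in Go (my own illustration, not Postgres code): committed changes live in a sequential log, the in-memory "page" is lost in a crash, and replaying the log from the REDO point rebuilds it&lt;br&gt;&lt;/p&gt;

```go
package main

import "fmt"

// record is one entry in the toy sequential log: a committed
// key/value change that may not yet be reflected in the "table".
type record struct {
	key, val int
}

// replay applies every log record from the redo position onwards
// to the page, bringing it back to a consistent state.
func replay(log []record, redo int, page map[int]int) {
	for _, r := range log[redo:] {
		page[r.key] = r.val
	}
}

func main() {
	// Three committed changes were logged before the crash.
	log := []record{{1, 100}, {2, 200}, {3, 300}}
	redo := 0 // the last checkpoint flushed nothing

	page := map[int]int{} // the crash wiped the unflushed page
	replay(log, redo, page)
	fmt.Println(page[3]) // 300: the "lost" rows are back
}
```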

&lt;p&gt;Postgres uses the &lt;a href="https://www.postgresql.org/docs/current/wal-intro.html" rel="noopener noreferrer"&gt;WAL&lt;/a&gt; to log changes made to tables before the data is flushed to permanent storage during a &lt;a href="https://www.cybertec-postgresql.com/en/postgresql-what-is-a-checkpoint/" rel="noopener noreferrer"&gt;checkpoint&lt;/a&gt; or by the background writer. The WAL plays a very important role in the operation of database activities such as replication, point-in-time-recovery(PITR), recovery, backups and others.&lt;/p&gt;

&lt;p&gt;Postgres uses a control file &lt;code&gt;PGDATA/global/pg_control&lt;/code&gt; to store state information required by various parts of the database. This is where details like checkpoint location, redo location, the database state value(enum), the next transaction ID to be assigned, next OID and many more are stored. A copy of the file data is stored in shared memory for easy access and is routinely synced with the data on disk. When the database starts up it reads the control file and checks values of REDO and Checkpoint locations. If these two values are the same and the database was shut down cleanly, then normal startup process continues. If not, the recovery process follows. See &lt;code&gt;StartupProcessMain&lt;/code&gt; in &lt;code&gt;src/backend/postmaster/startup.c&lt;/code&gt; and &lt;code&gt;StartupXLOG&lt;/code&gt; in &lt;code&gt;src/backend/access/transam/xlog.c&lt;/code&gt;. Also see &lt;code&gt;InitWalRecovery&lt;/code&gt; and &lt;code&gt;PerformWalRecovery&lt;/code&gt; in &lt;code&gt;transam/xlogrecovery.c&lt;/code&gt; for the low-level implementations. &lt;/p&gt;

&lt;p&gt;The control file data can be read using the &lt;a href="https://www.postgresql.org/docs/current/app-pgcontroldata.html" rel="noopener noreferrer"&gt;pg_controldata&lt;/a&gt; tool. A sample output is as follows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres@f63dffa121c7:~$ pg_controldata -D /usr/local/pgsql/data/
pg_control version number:            1800
Catalog version number:               202505071
Database system identifier:           7504776984629272596
Database cluster state:               in production
pg_control last modified:             Tue Jul  8 21:38:34 2025
Latest checkpoint location:           16/18031820
Latest checkpoint's REDO location:    16/18031820
Latest checkpoint's REDO WAL file:    000000010000001600000018
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          1:1847483885
Latest checkpoint's NextOID:          443131
.....
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case, both REDO and Checkpoint point to the same location in the WAL file and the database is up and running as shown in the value of &lt;code&gt;Database cluster state: in production&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now if we run the &lt;code&gt;CHECKPOINT&lt;/code&gt; command, the checkpoint and redo positions in the control file will change.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# CHECKPOINT;
CHECKPOINT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The control file data has changed&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pg_control version number:            1800
Catalog version number:               202505071
Database system identifier:           7504776984629272596
Database cluster state:               in production
pg_control last modified:             Tue Jul  8 22:06:08 2025
Latest checkpoint location:           16/18031928
Latest checkpoint's REDO location:    16/180318D0
Latest checkpoint's REDO WAL file:    000000010000001600000018
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          1:1847483885
.....
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looking at how much data has changed before and after the checkpoint command we get&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# select '16/18031928'::pg_lsn - '16/18031820'::pg_lsn as checkpoint_diff_bytes;
 checkpoint_diff_bytes
-----------------------
                   264
(1 row)

postgres=# select '16/180318D0'::pg_lsn - '16/18031820'::pg_lsn as redo_diff_bytes;
 redo_diff_bytes
-----------------
             176
(1 row)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
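&lt;p&gt;The &lt;code&gt;pg_lsn&lt;/code&gt; subtraction above is plain byte arithmetic: an LSN such as &lt;code&gt;16/18031928&lt;/code&gt; is a 64-bit WAL byte position written as two hexadecimal halves. A small Go sketch (my own illustration, not part of Postgres) reproduces the numbers&lt;br&gt;&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// lsnToUint64 converts an LSN string like "16/18031928" into the
// 64-bit WAL byte position it represents: high<<32 | low.
func lsnToUint64(lsn string) uint64 {
	parts := strings.Split(lsn, "/")
	hi, _ := strconv.ParseUint(parts[0], 16, 32)
	lo, _ := strconv.ParseUint(parts[1], 16, 32)
	return hi<<32 | lo
}

func main() {
	// Same values as the pg_lsn query above.
	fmt.Println(lsnToUint64("16/18031928") - lsnToUint64("16/18031820")) // checkpoint_diff_bytes: 264
	fmt.Println(lsnToUint64("16/180318D0") - lsnToUint64("16/18031820")) // redo_diff_bytes: 176
}
```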



&lt;p&gt;We can see which records were added to the WAL when the checkpoint command was called using the &lt;code&gt;pg_waldump&lt;/code&gt; tool&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres@f63dffa121c7:~$ pg_waldump --start=16/18031820 /usr/local/pgsql/data/pg_wal/000000010000001600000018
rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 16/18031820, prev 16/180317E8, desc: CHECKPOINT_SHUTDOWN redo 16/18031820; tli 1; prev tli 1; fpw true; wal_level replica; xid 1:1847483885; oid 443131; multi 1; offset 0; oldest xid 2147483649 in DB 5; oldest multi 1 in DB 5; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 16/18031898, prev 16/18031820, desc: RUNNING_XACTS nextXid 1847483885 latestCompletedXid 1847483884 oldestRunningXid 1847483885
rmgr: XLOG        len (rec/tot):     30/    30, tx:          0, lsn: 16/180318D0, prev 16/18031898, desc: CHECKPOINT_REDO wal_level replica
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 16/180318F0, prev 16/180318D0, desc: RUNNING_XACTS nextXid 1847483885 latestCompletedXid 1847483884 oldestRunningXid 1847483885
rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 16/18031928, prev 16/180318F0, desc: CHECKPOINT_ONLINE redo 16/180318D0; tli 1; prev tli 1; fpw true; wal_level replica; xid 1:1847483885; oid 443131; multi 1; offset 0; oldest xid 2147483649 in DB 5; oldest multi 1 in DB 5; oldest/newest commit timestamp xid: 0/0; oldest running xid 1847483885; online
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 16/180319A0, prev 16/18031928, desc: RUNNING_XACTS nextXid 1847483885 latestCompletedXid 1847483884 oldestRunningXid 1847483885
pg_waldump: error: error in WAL record at 16/180319A0: invalid record length at 16/180319D8: expected at least 24, got 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No actual data records have been added. The redo record has moved from position 16/18031820 to 16/180318D0, while the checkpoint record has moved from 16/18031820 to 16/18031928. Now we proceed to the demo.&lt;/p&gt;

&lt;p&gt;Open a client session, create a table &lt;code&gt;foo&lt;/code&gt; and insert some data into it. Then do a "dirty" shutdown of the server before any checkpoint is run&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# DROP TABLE IF EXISTS foo;
NOTICE:  table "foo" does not exist, skipping
DROP TABLE
postgres=# CREATE TABLE foo(id SERIAL, k INT NOT NULL);
CREATE TABLE
postgres=# INSERT INTO foo SELECT i, i*100 FROM generate_series(1, 100) i;
INSERT 0 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In another terminal session, kill the server using the &lt;code&gt;--mode=immediate&lt;/code&gt; option&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres@f63dffa121c7:~$ pg_ctl -D /usr/local/pgsql/data -l ~/logfile stop --mode=immediate
waiting for server to shut down.... done
server stopped
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Table &lt;code&gt;foo&lt;/code&gt; should now have 100 records since the transaction committed, right? Not really. &lt;/p&gt;

&lt;p&gt;On commit, a user can only be sure that data has been stored in the WAL, not in the table files. Before the database can be restarted, we need to do two things: copy the control file and copy the WAL segment file to another location. Create a directory named &lt;code&gt;global&lt;/code&gt; inside the new location and copy the control file into the &lt;code&gt;global&lt;/code&gt; directory. Then copy the last active WAL segment file (shown in the &lt;code&gt;Latest checkpoint's REDO WAL file&lt;/code&gt; row when you run &lt;code&gt;pg_controldata -D /usr/local/pgsql/data/&lt;/code&gt;) to the same location. Ensure the postgres user has read/write access to the location. I'll use the &lt;code&gt;/tmp&lt;/code&gt; directory in this case. The final structure should look like this&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres@f63dffa121c7:~$ cp /usr/local/pgsql/data/global/pg_control /tmp/global/
postgres@f63dffa121c7:~$ cp /usr/local/pgsql/data/pg_wal/000000010000001600000018 /tmp/
postgres@f63dffa121c7:~$ ls /tmp/
000000010000001600000018  global
postgres@f63dffa121c7:~$ ls /tmp/global/
pg_control
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now clear the WAL data using the &lt;a href="https://www.postgresql.org/docs/current/app-pgresetwal.html" rel="noopener noreferrer"&gt;pg_resetwal&lt;/a&gt; tool. By default, when Postgres starts up it first checks whether it needs to go into recovery mode, and it will use the WAL to recover any un-checkpointed data. Clearing the WAL prevents this, since manual recovery will be done later using the &lt;code&gt;pg_wal_recovery&lt;/code&gt; extension.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres@f63dffa121c7:~$ pg_resetwal -D /usr/local/pgsql/data/ -f
Write-ahead log reset
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;pg_resetwal&lt;/strong&gt; should never be used in a production database. It has destructive side effects that can lead to data loss. Its usage should be limited to situations where the user knows what they are doing and they are aware of the risks involved.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Looking at the control data now, the checkpoint and redo locations match. If there were any WAL segment files in the &lt;code&gt;pg_wal&lt;/code&gt; directory prior to running the command, they have been removed. Also, a new WAL segment file has been created &lt;code&gt;000000010000001600000019&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres@f63dffa121c7:~$ pg_controldata -D /usr/local/pgsql/data/
pg_control version number:            1800
Catalog version number:               202505071
Database system identifier:           7504776984629272596
Database cluster state:               shut down
pg_control last modified:             Wed Jul  9 10:46:20 2025
Latest checkpoint location:           16/19000028
Latest checkpoint's REDO location:    16/19000028
Latest checkpoint's REDO WAL file:    000000010000001600000019
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          1:1847483887
Latest checkpoint's NextOID:          451323
.....
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the database can be restarted since we know it won't go into automatic recovery mode.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres@f63dffa121c7:~$ pg_ctl -D /usr/local/pgsql/data -l ~/logfile start
waiting for server to start.... done
server started
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once it is up, we can check the &lt;code&gt;foo&lt;/code&gt; table to find out if there is any data in it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT * FROM foo;
 id | k
----+---
(0 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It returned no rows. Does it mean the data inserted before has been lost? Well....Not really.&lt;br&gt;
The data was committed so we can be sure that, at the very least, the data has been logged in the WAL. We can check this using the &lt;code&gt;pg_wal_recovery&lt;/code&gt; extension.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT * FROM wal_list_records('/tmp');
WARNING:  invalid record length at 16/18065410: expected at least 24, got 0
      wal_file_name       |     wal_type      | wal_record
--------------------------+-------------------+-------------
 000000010000001600000018 | CHECKPOINT_REDO   | 16/1805D4E0
 000000010000001600000018 | RUNNING_XACTS     | 16/1805D500
 000000010000001600000018 | CHECKPOINT_ONLINE | 16/1805D538
 000000010000001600000018 | RUNNING_XACTS     | 16/1805D5B0
 000000010000001600000018 | FPI_FOR_HINT      | 16/1805D5E8
 000000010000001600000018 | FPI_FOR_HINT      | 16/1805E0C0
 000000010000001600000018 | FPI_FOR_HINT      | 16/1805F880
 000000010000001600000018 | RUNNING_XACTS     | 16/18060D00
 000000010000001600000018 | INSERT+INIT       | 16/18060D38
 000000010000001600000018 | INSERT            | 16/18060D78
 000000010000001600000018 | INSERT            | 16/18060DB8
 000000010000001600000018 | INSERT            | 16/18060DF8
 000000010000001600000018 | INSERT            | 16/18060E38
 .....
 000000010000001600000018 | INSERT            | 16/18062550
 000000010000001600000018 | INSERT            | 16/18062590
 000000010000001600000018 | INSERT            | 16/180625D0
 000000010000001600000018 | INSERT            | 16/18062610
 000000010000001600000018 | COMMIT            | 16/18062650
 000000010000001600000018 | INSERT            | 16/18062678
 000000010000001600000018 | INSERT_LEAF       | 16/18063CD8
 000000010000001600000018 | INSERT            | 16/18064268
 000000010000001600000018 | INSERT_LEAF       | 16/180644B8
 000000010000001600000018 | INPLACE           | 16/180644F8
 000000010000001600000018 | COMMIT            | 16/18065388
(115 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The extension read WAL records starting from &lt;code&gt;16/1805D4E0&lt;/code&gt;, the last REDO location in the control file that was copied earlier. There are 100 INSERT records from offset &lt;code&gt;16/18060D38&lt;/code&gt; to &lt;code&gt;16/18062610&lt;/code&gt; in the WAL segment file. This matches the number of inserts done just before the database was shut down. We can check this using the &lt;a href="https://www.postgresql.org/docs/current/pgwaldump.html" rel="noopener noreferrer"&gt;pg_waldump&lt;/a&gt; tool, which provides far more information about the WAL records&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres@f63dffa121c7:~$ pg_waldump --start=16/18060D38 --end=16/18062610  /tmp/000000010000001600000018
rmgr: Heap        len (rec/tot):     63/    63, tx: 1847483887, lsn: 16/18060D38, prev 16/18060D00, desc: INSERT+INIT off: 1, flags: 0x00, blkref #0: rel 1663/5/443132 blk 0
rmgr: Heap        len (rec/tot):     63/    63, tx: 1847483887, lsn: 16/18060D78, prev 16/18060D38, desc: INSERT off: 2, flags: 0x00, blkref #0: rel 1663/5/443132 blk 0
rmgr: Heap        len (rec/tot):     63/    63, tx: 1847483887, lsn: 16/18060DB8, prev 16/18060D78, desc: INSERT off: 3, flags: 0x00, blkref #0: rel 1663/5/443132 blk 0
.....
rmgr: Heap        len (rec/tot):     63/    63, tx: 1847483887, lsn: 16/18062590, prev 16/18062550, desc: INSERT off: 98, flags: 0x00, blkref #0: rel 1663/5/443132 blk 0
rmgr: Heap        len (rec/tot):     63/    63, tx: 1847483887, lsn: 16/180625D0, prev 16/18062590, desc: INSERT off: 99, flags: 0x00, blkref #0: rel 1663/5/443132 blk 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The last thing to check before replaying the WAL records is the data file (table &lt;code&gt;foo&lt;/code&gt;) to confirm whether any data is present. For this, the &lt;a href="https://www.postgresql.org/docs/9.0/pageinspect.html" rel="noopener noreferrer"&gt;pageinspect&lt;/a&gt; extension comes in handy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT lp, t_ctid, t_xmin, t_xmax FROM heap_page_items(get_raw_page('foo', 0)) WHERE lp_len &amp;gt; 0;
 lp | t_ctid | t_xmin | t_xmax
----+--------+--------+--------
(0 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we query the contents of the first page (zero-based) in the data file. The &lt;code&gt;lp&lt;/code&gt; (line pointer) is the item number of the record within the page. &lt;code&gt;t_ctid&lt;/code&gt; is the location of the record within the data file, e.g. (0,3) can be read as the third item on the first page. &lt;code&gt;t_xmin&lt;/code&gt; and &lt;code&gt;t_xmax&lt;/code&gt; are the transaction IDs of the inserting and deleting transactions respectively.&lt;/p&gt;

&lt;p&gt;The table is empty. We have to replay the WAL in order to get the data into the table. The &lt;code&gt;pg_wal_recovery&lt;/code&gt; extension will be used for this purpose. It will output the last record replayed in the WAL when done.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# select * FROM wal_recover('/tmp');
WARNING:  invalid record length at 16/18065410: expected at least 24, got 0
 wal_type | wal_record
----------+-------------
 COMMIT   | 16/18065388
(1 row)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "lost" data should now appear in the data files. Again, we can run the query using the helper functions from the &lt;code&gt;pageinspect&lt;/code&gt; extension&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT lp, t_ctid, t_xmin, t_xmax FROM heap_page_items(get_raw_page('foo', 0)) WHERE lp_len &amp;gt; 0;
 lp  | t_ctid  |   t_xmin   | t_xmax
-----+---------+------------+--------
   1 | (0,1)   | 1847483887 |      0
   2 | (0,2)   | 1847483887 |      0
   3 | (0,3)   | 1847483887 |      0
.....
  97 | (0,97)  | 1847483887 |      0
  98 | (0,98)  | 1847483887 |      0
  99 | (0,99)  | 1847483887 |      0
 100 | (0,100) | 1847483887 |      0
(100 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data is now present in the table file. In this case, all 100 records fit in a single page (each page is 8KB by default). After a restart, querying the &lt;code&gt;foo&lt;/code&gt; table returns all 100 records. The data has been restored.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT * FROM foo;
 id  |   k
-----+-------
   1 |   100
   2 |   200
.....
  99 |  9900
 100 | 10000
(100 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this post I demonstrated how to perform a simple data recovery using the WAL. In real-life systems, the chances of having to do recovery like this are very low. Postgres already ships with robust recovery infrastructure that has been battle-tested for many years, so automatic recovery would handle this for you anyway. Building the extension made me go deeper into the lower-level parts of recovery in Postgres. The extension is for educational purposes only and is not suitable for a production database, or any database holding data you care about.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>recovery</category>
    </item>
    <item>
      <title>Tracking Postgres "fsyncs" with bpftrace</title>
      <dc:creator>Brian Misachi</dc:creator>
      <pubDate>Mon, 19 May 2025 21:12:56 +0000</pubDate>
      <link>https://dev.to/misachi/tracking-postgres-fsyncs-with-bpftrace-4ofi</link>
      <guid>https://dev.to/misachi/tracking-postgres-fsyncs-with-bpftrace-4ofi</guid>
      <description>&lt;p&gt;Data systems such as Postgres make strong guarantees when it comes to durability. The aim is to reduce the chances of data loss at any cost. The &lt;code&gt;fsync&lt;/code&gt; and &lt;code&gt;fdatasync&lt;/code&gt; system calls are often used to provide this guarantee to the user that data has been safely flushed to the storage device. The user can then be certain that in any case of power failure they would not lose their data. &lt;/p&gt;

&lt;p&gt;However, these calls are not cheap: an application blocks while waiting for the system call to return. &lt;a href="https://github.com/sirupsen/napkin-math?tab=readme-ov-file#numbers" rel="noopener noreferrer"&gt;Napkin math&lt;/a&gt; latency numbers put it at around 1ms. I ran a test of the &lt;code&gt;fsync&lt;/code&gt; system call on my laptop just to compare whether I'd get the same latency values. The &lt;a href="https://bpftrace.org/" rel="noopener noreferrer"&gt;bpftrace&lt;/a&gt; script I used looks like this&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
BEGIN
{
    printf("Tracing fsync latency... Hit Ctrl-C to end.\n");
}

tracepoint:syscalls:sys_enter_fsync,
tracepoint:syscalls:sys_enter_fdatasync
{
    @start[tid] = nsecs;
}

tracepoint:syscalls:sys_exit_fsync,
tracepoint:syscalls:sys_exit_fdatasync 
/@start[tid]/
{
    cat("/proc/%d/cmdline", pid);  // Who is making the call?
    printf(" ==&amp;gt; %d\n", (nsecs - @start[tid]));
    delete(@start, tid);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The program making the &lt;code&gt;fsync&lt;/code&gt; call is written in &lt;a href="https://go.dev/" rel="noopener noreferrer"&gt;Golang&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package main

import (
    "log"
    "os"
)

func main() {
    buf := []byte("hello world")
    file, err := os.OpenFile("/tmp/testfile.txt", os.O_CREATE|os.O_WRONLY, 0777)
    if err != nil {
        log.Fatalf("OpenFile: %v", err)
    }
    defer file.Close()

    if _, err := file.Write(buf); err != nil {
        log.Fatalf("Write: %v", err)
    }

    if err := file.Sync(); err != nil { // Flush to disk
        log.Fatalf("Sync: %v", err)
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running the Go program a couple of times, the results I got were as follows:&lt;/p&gt;

&lt;p&gt;Run &lt;code&gt;bpftrace --unsafe fsync_lat.bt&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Attaching 5 probes...
Tracing fsync latency... Hit Ctrl-C to end.
 ==&amp;gt; 1095741
 ==&amp;gt; 967711
 ==&amp;gt; 1086844
 ==&amp;gt; 1095359
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not too far from the napkin value: the measured values are in nanoseconds, so each call took roughly 1ms.&lt;/p&gt;

&lt;p&gt;Having done this, I wanted to know how often Postgres flushes data to disk. My assumption at this point was that a single transaction makes at least one fsync call to ensure durability(for every commit a WAL record is generated and flushed to disk). I wanted to verify whether this holds in Postgres. If one fsync call takes 1ms, then you'd expect around 1000 transactions per second in a single-threaded scenario. This is an over-simplification because transactions do more than make fsync calls, so the latency would be over 1ms in a real system.&lt;/p&gt;

&lt;p&gt;The bpftrace script to track fsync calls made by Postgres&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BEGIN
{
    printf("Tracing Postgres fsync calls... Hit Ctrl-C to end.\n");
    printf("%-6s %6s %s\n", "PID", "FD", "ProcName &amp;lt; - &amp;gt; Filename");
}

tracepoint:syscalls:sys_enter_fsync,tracepoint:syscalls:sys_enter_fdatasync
/comm == "postgres"/ {
    @procs[pid] = count();
    @sum["total"] = count();

    if (!@files_fsyncd[args-&amp;gt;fd]) {
        @files_fsyncd[args-&amp;gt;fd] = 1;  // Mark FD as seen

        printf("\n%-6d %6d ", pid, args-&amp;gt;fd);
        cat("/proc/%d/cmdline", pid);
        printf(" &amp;lt; - &amp;gt; ");
        system("readlink /proc/%d/fd/%d", pid, args-&amp;gt;fd); // Get filename from fd
    }
}

END {
    clear(@files_fsyncd);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script fires when a Postgres process makes an &lt;code&gt;fsync&lt;/code&gt; or &lt;code&gt;fdatasync&lt;/code&gt; call. It displays the process name(for example the checkpointer, walwriter etc.), the process ID, the total number of calls made by all Postgres processes, the file name and the file descriptor number. &lt;/p&gt;

&lt;p&gt;The stats from running a single client, using sysbench for a write-only workload for about 600 seconds, are as follows&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SQL statistics:
    queries performed:
        read:                            0
        write:                           1435372
        other:                           717686
        total:                           2153058
    transactions:                        358843 (598.07 per sec.)
    queries:                             2153058 (3588.42 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The client performs 358843 transactions while issuing 2153058 queries. That is around 6 queries per transaction.&lt;/p&gt;

&lt;p&gt;Here is the result from running the bpftrace script &lt;code&gt;bpftrace --unsafe fsync_files.bt&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Attaching 4 probes...
Tracing fsync calls... Hit Ctrl-C to end.
PID        FD ProcName &amp;lt; - &amp;gt; Filename
6202       14 postgres: sbtest sbtest 192.168.33.10(50984) idle &amp;lt; - &amp;gt; /usr/local/pgsql/data/pg_wal/0000000100000010000000ED

3094        6 postgres: walwriter  &amp;lt; - &amp;gt; /usr/local/pgsql/data/pg_wal/0000000100000010000000ED

3091        7 postgres: checkpointer  &amp;lt; - &amp;gt;
6212        8 postgres: autovacuum worker sbtest &amp;lt; - &amp;gt; /usr/local/pgsql/data/pg_wal/000000010000001100000001

3091       12 postgres: checkpointer  &amp;lt; - &amp;gt; /usr/local/pgsql/data/base/16416/2841

3091       16 postgres: checkpointer  &amp;lt; - &amp;gt; /usr/local/pgsql/data/base/16416/24741

3091        9 postgres: checkpointer  &amp;lt; - &amp;gt; /usr/local/pgsql/data/base/16416/2840

3091       13 postgres: checkpointer  &amp;lt; - &amp;gt; /usr/local/pgsql/data/base/16416/24730

3091       11 postgres: checkpointer  &amp;lt; - &amp;gt; /usr/local/pgsql/data/base/16416/2840_vm

3091       18 postgres: checkpointer  &amp;lt; - &amp;gt; /usr/local/pgsql/data/pg_wal

3091       17 postgres: checkpointer  &amp;lt; - &amp;gt; /usr/local/pgsql/data/base/16416/24759

3091       10 postgres: checkpointer  &amp;lt; - &amp;gt; /usr/local/pgsql/data/base/16416/2840_fsm

3091       15 postgres: checkpointer  &amp;lt; - &amp;gt; /usr/local/pgsql/data/base/16416/24730_vm

3091       19 postgres: checkpointer  &amp;lt; - &amp;gt;
^C


@procs[6253]: 1
@procs[6257]: 1
@procs[6255]: 1
@procs[6221]: 1
@procs[6223]: 1
@procs[6251]: 1
@procs[6259]: 1
@procs[6212]: 1
@procs[6216]: 1
@procs[6262]: 1
@procs[3091]: 152
@procs[3094]: 253
@procs[6202]: 358898
@sum[total]: 359313
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The client process with PID 6202 is reported as having made 358898 fsync calls. This is around the same number reported by sysbench above: every transaction made an fsync call. But the actual number of transactions per second is about 598 when the ideal number is 1000. What could be the reason for the difference? I had a few thoughts(there could be more...)&lt;/p&gt;

&lt;p&gt;One reason could be the number of queries performed in a single transaction(6 on average), which adds query-processing overhead. Also, other processes compete for shared resources with this client. The &lt;code&gt;checkpointer&lt;/code&gt;, &lt;code&gt;walwriter&lt;/code&gt;, &lt;code&gt;autovacuum&lt;/code&gt; and &lt;code&gt;bgwriter&lt;/code&gt; are background processes that Postgres runs to perform other essential services. The checkpointer and walwriter(PIDs 3091 and 3094) made a combined 405 fsync calls. Checkpointing flushes all data that is in memory to disk. This can be a lot of data, especially if there are many dirty buffers still in memory, and flushing lots of data at once can be very slow and increase response times in the system.&lt;/p&gt;

&lt;p&gt;Other factors, such as query processing(which depends on the number of queries in the transaction), locks, memory, IO and network, can increase transaction processing times. Transaction processing time, therefore, is not just the cost of an fsync call and would likely exceed 1ms in a real system. In a highly concurrent system, one optimization a database can use is to group multiple commits and issue a single fsync call to flush all their data at once, which reduces the number of IOs.&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>bpftrace</category>
    </item>
    <item>
      <title>Postgres: Ordered Queries and the Planner</title>
      <dc:creator>Brian Misachi</dc:creator>
      <pubDate>Thu, 01 May 2025 22:52:20 +0000</pubDate>
      <link>https://dev.to/misachi/postgres-ordered-queries-and-the-planner-1151</link>
      <guid>https://dev.to/misachi/postgres-ordered-queries-and-the-planner-1151</guid>
      <description>&lt;p&gt;Most SQL queries require the results of the particular query to be ordered in a some way. The returned data can still be implicitly ordered by the primary key(if present) if the user does not apply their own ordering to the query. In Postgres and other SQL like databases, the way to order the result set from a query is by using the &lt;code&gt;ORDER BY&lt;/code&gt; clause.&lt;/p&gt;

&lt;p&gt;Sorting rows has an additional cost attached to it, which can lead to interesting plan choices when executing a query. The planner generates multiple paths for query execution, with each path representing a different way to get the results for the query. The path with the lowest cost is checked for whether it already produces sorted results; if not, a sort step is added and the final costs are re-compared against the other paths. If it is already sorted, it remains the cheapest path and is used to execute the query.&lt;/p&gt;

&lt;p&gt;B+tree indexes keep their entries sorted, so a primary key backed by a B+tree index makes its values available in sorted order. When building index paths the planner considers both forward and backward scans on the index, so sorting in ascending or descending order makes no significant difference. Sorting on an indexed field in most cases performs better than on non-indexed fields.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# CREATE TABLE bar(id1 INT, id2 INT, id3 INT, id4 INT, descr TEXT);  -- new table
postgres=# INSERT INTO bar SELECT i, i*3, i+i, i*2, 'hello' || i FROM generate_series(1, 10000000) i; -- 10M records
postgres=# CREATE INDEX idx_id1_id2_id3 ON bar(id1, id2, id3);

postgres=# explain analyze select * from bar where id1 &amp;gt;= 1000000 and id1 &amp;lt; 2000001 order by id1; -- there is an index in id1
                                                                QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_id1_id2_id3 on bar  (cost=0.43..168152.90 rows=998886 width=28) (actual time=0.046..139.920 rows=1000001.00 loops=1)
   Index Cond: ((id1 &amp;gt;= 1000000) AND (id1 &amp;lt; 2000001))
   Index Searches: 1
   Buffers: shared hit=11188
 Planning Time: 0.126 ms
 Execution Time: 187.269 ms
(6 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At a minimum 2 IOs are issued: the index is scanned, then a heap lookup fetches the columns that are not already in the index. There is no explicit sort step, and CPU usage is low since no comparisons are needed -- the returned data is already ordered on the sort key.&lt;/p&gt;

&lt;p&gt;Using a non-indexed field&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# explain analyze select * from bar where id1 &amp;gt;= 1000000 and id1 &amp;lt; 2000001 order by id4;
                                                                   QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=291593.20..294090.41 rows=998886 width=28) (actual time=404.509..509.841 rows=1000001.00 loops=1)
   Sort Key: id4
   Sort Method: external merge  Disk: 38208kB
   Buffers: shared hit=11188, temp read=9547 written=9564
   -&amp;gt;  Index Scan using idx_id1_id2_id3 on bar  (cost=0.43..168152.90 rows=998886 width=28) (actual time=0.033..142.248 rows=1000001.00 loops=1)
         Index Cond: ((id1 &amp;gt;= 1000000) AND (id1 &amp;lt; 2000001))
         Index Searches: 1
         Buffers: shared hit=11188
 Planning Time: 0.112 ms
 Execution Time: 565.804 ms
(10 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In addition to the index and heap IOs, there is an explicit sort step at the top level, leading to higher CPU usage for key comparisons. Also, every query only has &lt;code&gt;work_mem&lt;/code&gt; amount of memory for sorting purposes. Here the returned data does not fit in the allocated &lt;code&gt;work_mem&lt;/code&gt;, so the sort step spills to disk: 9547 temp pages are read from disk and 9564 pages written out. &lt;/p&gt;

&lt;p&gt;In an attempt to reduce the disk IOs during sorting, we can increase the &lt;code&gt;work_mem&lt;/code&gt; size(Remember &lt;code&gt;work_mem&lt;/code&gt; is per query, so be careful)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SET work_mem = 102400;
SET
postgres=# SHOW work_mem;
 work_mem
----------
 100MB
(1 row)

postgres=# explain analyze select * from bar where id1 &amp;gt;= 1000000 and id1 &amp;lt; 2000001 order by id4;
                                                                    QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=209251.10..211748.31 rows=998886 width=28) (actual time=245.392..296.690 rows=1000001.00 loops=1)
   Sort Key: id4
   Sort Method: quicksort  Memory: 71452kB
   Buffers: shared hit=10089
   -&amp;gt;  Bitmap Heap Scan on bar  (cost=21199.02..109712.31 rows=998886 width=28) (actual time=26.995..129.239 rows=1000001.00 loops=1)
         Recheck Cond: ((id1 &amp;gt;= 1000000) AND (id1 &amp;lt; 2000001))
         Heap Blocks: exact=7354
         Buffers: shared hit=10089
         -&amp;gt;  Bitmap Index Scan on idx_id1_id2_id3  (cost=0.00..20949.30 rows=998886 width=0) (actual time=25.790..25.791 rows=1000001.00 loops=1)
               Index Cond: ((id1 &amp;gt;= 1000000) AND (id1 &amp;lt; 2000001))
               Index Searches: 1
               Buffers: shared hit=2735
 Planning Time: 0.141 ms
 Execution Time: 348.114 ms
(14 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is some improvement in execution time since the sorting is now done entirely in memory(quicksort), and the plan also changes to a &lt;code&gt;bitmap scan&lt;/code&gt;. It is not advisable to 'blindly' increase &lt;code&gt;work_mem&lt;/code&gt;: it can mislead the planner into choosing an inferior plan, and it can lead to sub-optimal memory usage -- each query may use up to a &lt;code&gt;work_mem&lt;/code&gt; sized chunk of memory for each sort or hash operation. It is used here only for demo purposes.&lt;/p&gt;
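&lt;p&gt;If the larger &lt;code&gt;work_mem&lt;/code&gt; is only needed for a single query, one way to contain the effect(not shown in the session above) is to scope it to a transaction with &lt;code&gt;SET LOCAL&lt;/code&gt;, which reverts automatically at commit or rollback:&lt;/p&gt;

```sql
BEGIN;
SET LOCAL work_mem = '100MB';  -- applies only inside this transaction
EXPLAIN ANALYZE SELECT * FROM bar WHERE id1 &gt;= 1000000 AND id1 &lt; 2000001 ORDER BY id4;
COMMIT;  -- work_mem reverts to its previous value here
```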

&lt;p&gt;Using both indexed and non-indexed fields&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# explain analyze select * from bar where id1 &amp;gt;= 1000000 and id1 &amp;lt; 2000001 order by id1, id4;
                                                                   QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------
 Incremental Sort  (cost=0.61..208698.77 rows=998886 width=28) (actual time=0.061..293.947 rows=1000001.00 loops=1)
   Sort Key: id1, id4
   Presorted Key: id1
   Full-sort Groups: 31251  Sort Method: quicksort  Average Memory: 26kB  Peak Memory: 26kB
   Buffers: shared hit=10089
   -&amp;gt;  Index Scan using idx_id1_id2_id3 on bar  (cost=0.43..163748.90 rows=998886 width=28) (actual time=0.041..144.617 rows=1000001.00 loops=1)
         Index Cond: ((id1 &amp;gt;= 1000000) AND (id1 &amp;lt; 2000001))
         Index Searches: 1
         Buffers: shared hit=10089
 Planning Time: 0.112 ms
 Execution Time: 341.344 ms
(11 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The planner opts for an &lt;code&gt;incremental sort&lt;/code&gt; plan, which applies when the input is already sorted on a prefix of the sort key. In this case, the sort key has two fields, &lt;code&gt;id1&lt;/code&gt; and &lt;code&gt;id4&lt;/code&gt;. There is an index on &lt;code&gt;id1&lt;/code&gt;, so the rows arrive already sorted on that column; instead of doing a full sort on both columns, each group of pre-sorted rows only needs to be sorted on the &lt;code&gt;id4&lt;/code&gt; column. This saves CPU cycles and performs better than sorting on both fields from scratch. There have been &lt;a href="https://www.postgresql.org/message-id/flat/d2f06ddc-a8d4-48ea-893b-a95255c632b9%40loxodata.com" rel="noopener noreferrer"&gt;past cases&lt;/a&gt; where incremental sort has produced plans with issues. Incremental sort can be turned off with the &lt;code&gt;SET enable_incremental_sort = off&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;In short, it helps to sort on columns that are already ordered, and the incremental sort plan can help when only some of the columns in an order-by clause are ordered. A helpful way to build indexes is to use a &lt;a href="https://www.postgresql.org/docs/current/indexes-multicolumn.html" rel="noopener noreferrer"&gt;multiple column index&lt;/a&gt; or even &lt;a href="https://www.postgresql.org/docs/current/indexes-index-only-scans.html" rel="noopener noreferrer"&gt;covering indexes&lt;/a&gt;, with the first key column used to limit the data returned and the following columns used to aid sorting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM test WHERE fld1 &amp;gt;= [start_limit] AND fld1 &amp;lt; [end_limit]  ORDER BY fld2; -- index on both fld1 and fld2; CREATE INDEX idx_f1_f2 ON test(fld1, fld2);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unused indexes can lead to unnecessary bloat and have &lt;a href="https://www.percona.com/blog/postgresql-indexes-can-hurt-you-negative-effects-and-the-costs-involved/" rel="noopener noreferrer"&gt;negative effects on performance&lt;/a&gt;. It is important to add only useful columns to the index, to keep the index as small as possible. Test and measure to determine what works best for your case.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>queryperformance</category>
    </item>
    <item>
      <title>Postgres: Index Scans</title>
      <dc:creator>Brian Misachi</dc:creator>
      <pubDate>Sun, 06 Apr 2025 17:35:47 +0000</pubDate>
      <link>https://dev.to/misachi/postgres-index-scans-552g</link>
      <guid>https://dev.to/misachi/postgres-index-scans-552g</guid>
      <description>&lt;p&gt;Using an index to help improve query performance is common practice. An index is a structure organized in such a way to help make it cheaper to access data records stored on disk. One example of such a structure is the b+tree. In Postgres case, the data stored on disk is referred to as the &lt;code&gt;heap&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Having an index on a table does not guarantee that it will be used to retrieve data for a query. It is important to first test and measure the effect of adding an index, to find out if it helps improve performance for a target query. In the worst case, blindly adding an index for every query performance problem encountered may further degrade performance, since indexes do not come cheap either. There can be cases where the planner chooses a different, non-index plan to execute a query even though an index exists that matches the query filter and/or the order-by clause. The best tool for investigating this is the &lt;code&gt;EXPLAIN&lt;/code&gt; command. To demonstrate a trivial scenario:&lt;/p&gt;

&lt;p&gt;First we create a table and an index&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# CREATE TABLE foo(id1 INT, id2 INT, id3 INT, id4 INT, descr TEXT);  -- new table
postgres=# INSERT INTO foo SELECT i, i*3, i+i, i*2, 'hello' || i FROM generate_series(1, 10000000) i; -- 10M records
postgres=# CREATE INDEX idx_id1_id2_id3 ON foo(id1, id2, id3);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run &lt;code&gt;EXPLAIN&lt;/code&gt; on a query&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM foo WHERE id1 &amp;gt; 1000;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The path selected by the planner is shown below&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Seq Scan on foo  (cost=0.00..198530.00 rows=9999027 width=28) (actual time=0.283..2260.973 rows=9999000 loops=1)
   Filter: (id1 &amp;gt; 1000)
   Rows Removed by Filter: 1000
   Buffers: shared hit=16274 read=57256
 Planning:
   Buffers: shared read=4
 Planning Time: 0.386 ms
 Execution Time: 2906.885 ms
(8 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The query is looking for all the records where &lt;code&gt;id1&lt;/code&gt; field value is greater than 1000. &lt;/p&gt;

&lt;p&gt;Let's dig into the details of the plan: the planner picks a sequential scan as the cheapest path for executing the query. The sequential scan has zero startup cost, and the total cost of executing the query is estimated at 198530.00. The number of rows to be fetched is estimated at 9999027, while the real number of rows retrieved is 9999000(very close). A curious mind might wonder why an index scan is ignored when there is an index on the &lt;code&gt;id1&lt;/code&gt; field. When using an index, the database reads the index structure first and then, if the data to be returned is not in the index, reads the relevant data pages from disk. This means the database would potentially have to perform 2 disk IOs for each index entry: one in the index file and another in the heap to get the data. This can be really expensive when many rows are returned, and it adds to the cost of an index scan. A sequential scan only needs a single disk read per page that is not already in the buffer pool. Often the index fits entirely in memory, which speeds up index lookups.&lt;/p&gt;
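&lt;p&gt;One way to avoid the per-row heap lookup entirely is an index-only scan: when the query touches only columns stored in the index(and the visibility map is largely set), Postgres can answer from the index alone. A hypothetical query against the same table -- whether the planner actually picks an index-only scan still depends on the costs:&lt;/p&gt;

```sql
-- Only indexed columns are selected, so the heap fetch can be skipped
EXPLAIN (ANALYZE, BUFFERS) SELECT id1, id2, id3 FROM foo WHERE id1 &gt; 1000;
```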

&lt;p&gt;The &lt;code&gt;EXPLAIN&lt;/code&gt; command can only show the cheapest plan chosen by the planner out of all the plans that were considered. If the user needs to investigate why a particular plan was not chosen, they are limited in what they can get from the &lt;code&gt;EXPLAIN&lt;/code&gt; command, because the planner throws away the plans it considers inferior and those plans are never made available for debugging. I made an extension, &lt;a href="https://github.com/misachi/pg_all_plans" rel="noopener noreferrer"&gt;PG_ALL_PLANS&lt;/a&gt;, to show all plans considered by the planner. Using the extension with our query above&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT * FROM show_all_plans('EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM foo WHERE id1 &amp;gt; 1000');
                                                                  query_plans
-----------------------------------------------------------------------------------------------------------------------------------------------
 -------------------------------Plan 1-------------------------------
 Seq Scan on foo  (cost=0.00..198530.00 rows=9999027 width=28) (actual time=0.250..2603.390 rows=9999000 loops=1)
   Filter: (id1 &amp;gt; 1000)
   Rows Removed by Filter: 1000
   Buffers: shared hit=16182 read=57348
 Planning:
   Buffers: shared hit=47 read=4 dirtied=3
 Planning Time: 0.939 ms
 Execution Time: 3263.912 ms

 -------------------------------Plan 2-------------------------------
 Index Scan using idx_id1_id2_id3 on foo  (cost=0.43..498815.28 rows=9999027 width=28) (actual time=0.444..3309.395 rows=9999000 loops=1)
   Index Cond: (id1 &amp;gt; 1000)
   Buffers: shared hit=2 read=111835 written=8120
 Planning:
   Buffers: shared hit=16229 read=57352 dirtied=3
 Planning Time: 0.947 ms
 Execution Time: 3981.410 ms

 -------------------------------Plan 3-------------------------------
 Bitmap Heap Scan on foo  (cost=231288.89..429813.47 rows=9999027 width=28) (actual time=1319.351..3678.529 rows=9999000 loops=1)
   Recheck Cond: (id1 &amp;gt; 1000)
   Rows Removed by Index Recheck: 35
   Heap Blocks: exact=40497 lossy=33026
   Buffers: shared read=111837
   -&amp;gt;  Bitmap Index Scan on idx_id1_id2_id3  (cost=0.00..228789.14 rows=9999027 width=0) (actual time=1301.783..1301.783 rows=9999000 loops=1)
         Index Cond: (id1 &amp;gt; 1000)
         Buffers: shared read=38314
 Planning:
   Buffers: shared hit=16231 read=169190 dirtied=3 written=8120
 Planning Time: 0.951 ms
 Execution Time: 4338.259 ms

 -------------------------------Plan 4-------------------------------
 Gather  (cost=1000.00..1126516.03 rows=9999027 width=28) (actual time=0.524..1658.062 rows=9999000 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=16228 read=57302
   -&amp;gt;  Parallel Seq Scan on foo  (cost=0.00..125613.33 rows=4166261 width=28) (actual time=0.118..917.219 rows=3333000 loops=3)
         Filter: (id1 &amp;gt; 1000)
         Rows Removed by Filter: 333
         Buffers: shared hit=16228 read=57302
 Planning:
   Buffers: shared hit=16231 read=281027 dirtied=3 written=8120
 Planning Time: 0.952 ms
 Execution Time: 2434.354 ms

(47 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are 4 plans displayed now, including the plain sequential scan plan; the others are the plans thrown away by the planner. From this, we can see that the index scan is more expensive than the sequential scan: it has a higher startup cost(0.43) and total cost(498815.28) for retrieving all the tuples. One interesting part is that the Bitmap Heap Scan has a lower total cost(429813.47) than the index scan but is considered inferior due to its higher startup cost(231288.89). The extension can be a great tool for digging deeper into the choices made by the planner when executing a query.&lt;/p&gt;
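&lt;p&gt;The startup cost matters most when only the first few rows are needed. As a hypothetical example(not run in this post), adding a &lt;code&gt;LIMIT&lt;/code&gt; makes the planner scale each path's cost by the fraction of rows actually fetched, so a path with a low startup cost, like the index scan, becomes much more attractive:&lt;/p&gt;

```sql
-- Only 10 rows are needed, so a path that can produce its first row
-- cheaply wins even if its total cost for all rows is higher
EXPLAIN SELECT * FROM foo WHERE id1 &gt; 1000 ORDER BY id1 LIMIT 10;
```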

&lt;p&gt;We can attempt to persuade the planner to pick the index scan plan by limiting the rows returned to those between 1000 and 1.5M. This makes the predicate more selective, so only a small fraction of the rows is returned.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# SELECT * FROM show_all_plans('EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM foo WHERE id1 &amp;gt; 1000 AND id1 &amp;lt; 1500000');
                                                                query_plans
--------------------------------------------------------------------------------------------------------------------------------------------
 -------------------------------Plan 1-------------------------------
 Index Scan using idx_id1_id2_id3 on foo  (cost=0.43..187892.55 rows=1498365 width=28) (actual time=0.023..548.997 rows=1498999 loops=1)
   Index Cond: ((id1 &amp;gt; 1000) AND (id1 &amp;lt; 1500000))
   Buffers: shared hit=2 read=16768
 Planning:
   Buffers: shared read=4
 Planning Time: 0.170 ms
 Execution Time: 651.542 ms

 -------------------------------Plan 2-------------------------------
 Seq Scan on foo  (cost=0.00..223530.00 rows=1498365 width=28) (actual time=0.204..2333.970 rows=1498999 loops=1)
   Filter: ((id1 &amp;gt; 1000) AND (id1 &amp;lt; 1500000))
   Rows Removed by Filter: 8501001
   Buffers: shared hit=10696 read=62834
 Planning:
   Buffers: shared hit=2 read=16772
 Planning Time: 0.174 ms
 Execution Time: 2432.406 ms

 -------------------------------Plan 3-------------------------------
 Bitmap Heap Scan on foo  (cost=38406.68..205106.49 rows=1498365 width=28) (actual time=236.667..539.696 rows=1498999 loops=1)
   Recheck Cond: ((id1 &amp;gt; 1000) AND (id1 &amp;lt; 1500000))
   Heap Blocks: exact=11023
   Buffers: shared hit=171 read=16599
   -&amp;gt;  Bitmap Index Scan on idx_id1_id2_id3  (cost=0.00..38032.08 rows=1498365 width=0) (actual time=233.444..233.445 rows=1498999 loops=1)
         Index Cond: ((id1 &amp;gt; 1000) AND (id1 &amp;lt; 1500000))
         Buffers: shared hit=171 read=5576
 Planning:
   Buffers: shared hit=10698 read=79606
 Planning Time: 0.182 ms
 Execution Time: 638.010 ms

 -------------------------------Plan 4-------------------------------
 Gather  (cost=1000.00..286866.50 rows=1498365 width=28) (actual time=0.518..2226.432 rows=1498999 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=11023 read=62507
   -&amp;gt;  Parallel Seq Scan on foo  (cost=0.00..136030.00 rows=624319 width=28) (actual time=0.093..805.573 rows=499666 loops=3)
         Filter: ((id1 &amp;gt; 1000) AND (id1 &amp;lt; 1500000))
         Rows Removed by Filter: 2833667
         Buffers: shared hit=11023 read=62507
 Planning:
   Buffers: shared hit=10869 read=96205
 Planning Time: 0.182 ms
 Execution Time: 2349.584 ms

(46 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the planner chooses the index scan as it has a lower total cost(187892.55) than the sequential plan(223530.00). The index scan, despite potentially doing 2 IOs per entry, still manages to be cheaper than the sequential scan, mainly due to the reduced number of rows it has to scan(1498365 compared to 9999027) and hence the reduced number of page reads. Also, the sequential scan reads a large number of rows which it ends up throwing away when filtering the results(8501001 rows).&lt;/p&gt;

&lt;p&gt;It is good practice to log and analyze slow queries to find the performance bottlenecks before adding an index or applying any other technique. Always check whether your indexes are being &lt;a href="https://www.cybertec-postgresql.com/en/get-rid-of-your-unused-indexes/" rel="noopener noreferrer"&gt;used at all&lt;/a&gt;. Indexes occupy disk space and can degrade performance, especially for write-intensive workloads, so it only makes sense to add an index to a relation if it is used in the execution of the relevant queries. &lt;/p&gt;
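&lt;p&gt;A quick way to spot candidates for removal(a generic catalog query, not from this post) is the &lt;code&gt;pg_stat_user_indexes&lt;/code&gt; view, where &lt;code&gt;idx_scan&lt;/code&gt; counts how often each index has been used:&lt;/p&gt;

```sql
-- Indexes that have never been scanned since statistics were last reset
SELECT relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0;
```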

</description>
      <category>database</category>
      <category>postgres</category>
      <category>index</category>
    </item>
    <item>
      <title>Making Postgres better with OrioleDB</title>
      <dc:creator>Brian Misachi</dc:creator>
      <pubDate>Fri, 28 Feb 2025 23:10:38 +0000</pubDate>
      <link>https://dev.to/misachi/making-postgres-better-with-orioledb-49cp</link>
      <guid>https://dev.to/misachi/making-postgres-better-with-orioledb-49cp</guid>
      <description>&lt;p&gt;One of the biggest advantages Postgres has over lots of other DBMSs out there is the fact that it allows external improvements or new features, without requiring any interactions with the core codebase, via extensions. There are many extensions each with a unique purpose(I have also written two trivial extensions &lt;a href="https://github.com/misachi/pg_wal_ext" rel="noopener noreferrer"&gt;pg_wal_ext&lt;/a&gt; and &lt;a href="https://github.com/misachi/pg_table_bloat" rel="noopener noreferrer"&gt;pg_table_bloat&lt;/a&gt; )&lt;/p&gt;

&lt;p&gt;One extension that especially stands out is &lt;a href="https://www.orioledb.com/" rel="noopener noreferrer"&gt;OrioleDB&lt;/a&gt;. It provides an alternative storage engine for Postgres. Postgres comes with only one storage engine, based on the heap. This is different from MySQL, which ships with several storage engines: InnoDB, MyISAM etc.&lt;/p&gt;

&lt;p&gt;Postgres's heap-based storage engine works well in most cases, but it also presents a number of issues, some of which are: &lt;a href="https://www.cs.cmu.edu/~pavlo/blog/2023/04/the-part-of-postgresql-we-hate-the-most.html" rel="noopener noreferrer"&gt;bloat introduced by how updates are handled&lt;/a&gt;, &lt;a href="https://www.postgresql.org/docs/17/sql-vacuum.html" rel="noopener noreferrer"&gt;the need for garbage collection (vacuum)&lt;/a&gt; and &lt;a href="https://www.cybertec-postgresql.com/en/transaction-id-wraparound-a-walk-on-the-wild-side/" rel="noopener noreferrer"&gt;transaction ID wraparound&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;OrioleDB comes with the promise of solving the issues presented by the Postgres heap. I tested it out and the results were pretty good. I'll describe the process I used and the results below.&lt;/p&gt;

&lt;p&gt;First you need a patched version of Postgres: &lt;a href="https://github.com/orioledb/postgres/archive/refs/tags/patches16_33.tar.gz" rel="noopener noreferrer"&gt;16 (tag: patches16_33)&lt;/a&gt; or &lt;a href="https://github.com/orioledb/postgres/archive/refs/tags/patches17_5.tar.gz" rel="noopener noreferrer"&gt;17 (tag: patches17_5)&lt;/a&gt;. The build steps are available on &lt;a href="https://github.com/orioledb/orioledb?tab=readme-ov-file#build-from-source" rel="noopener noreferrer"&gt;github&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The test is a read-only workload on 16 tables, each with 25M records (approx 94GB for each setup). I used a server with 64GB RAM, 20 cores and a 500GB NVMe SSD (ext4 filesystem). I use my setup &lt;a href="https://github.com/misachi/postgres-scripts" rel="noopener noreferrer"&gt;scripts&lt;/a&gt; to build and install Postgres. The benchmarking tool used is &lt;a href="https://github.com/akopytov/sysbench?tab=readme-ov-file#building-and-installing-from-source" rel="noopener noreferrer"&gt;sysbench&lt;/a&gt;. OrioleDB is built from source from the main branch at commit &lt;code&gt;0c484c4&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The install steps for OrioleDB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Install dependencies
apt update
apt install python3 python3-dev python3-pip python3-setuptools python3-testresources libzstd1 libzstd-dev libssl-dev libcurl4-openssl-dev

git clone https://github.com/orioledb/orioledb
cd orioledb
git reset --hard 0c484c4 # optional
make USE_PGXS=1 ORIOLEDB_PATCHSET_VERSION=5
echo "shared_preload_libraries = 'orioledb.so'" &amp;gt;&amp;gt; /usr/local/pgsql/data/postgresql.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The buffer pool for both setups is set to 16GB, i.e. &lt;code&gt;shared_buffers&lt;/code&gt; for heap and &lt;code&gt;orioledb.main_buffers&lt;/code&gt; for OrioleDB.&lt;/p&gt;
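&lt;p&gt;For parity, the relevant lines in each setup's &lt;code&gt;postgresql.conf&lt;/code&gt; would look like this (a sketch; only the 16GB figure is from the test above):&lt;/p&gt;

```
# heap setup
shared_buffers = '16GB'

# OrioleDB setup
orioledb.main_buffers = '16GB'
```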

&lt;p&gt;To run the actual tests, I used &lt;a href="https://github.com/misachi/sysbench-graphing-tests" rel="noopener noreferrer"&gt;https://github.com/misachi/sysbench-graphing-tests&lt;/a&gt;. Once everything has been set up, running &lt;code&gt;./run.threads &amp;lt;run number&amp;gt; pgsql&lt;/code&gt; should execute the tests. The script ensures the database is warmed up before running the tests.&lt;/p&gt;

&lt;p&gt;I did 2 test runs for each setup: 2 runs of Postgres heap and 2 runs of Postgres with the OrioleDB extension. The results are shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8pynmoo67y2h0i5ois15.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8pynmoo67y2h0i5ois15.png" alt="result" width="800" height="799"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Runs labelled &lt;code&gt;out/res2&lt;/code&gt; and &lt;code&gt;out/res3&lt;/code&gt; show results for normal Postgres with the heap-based storage engine, while &lt;code&gt;res4&lt;/code&gt; and &lt;code&gt;res5&lt;/code&gt; show results for Postgres with OrioleDB.&lt;/p&gt;

&lt;p&gt;The impressive bit: OrioleDB outperforms the Postgres heap while using less CPU and memory. This can be attributed partly to OrioleDB's lock-free page reads (and thus reduced contention) and its index-organized tables.&lt;/p&gt;

&lt;p&gt;Memory usage(MB) for Postgres heap&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvs2zkbrur47doz1v4rom.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvs2zkbrur47doz1v4rom.png" alt="heap" width="800" height="796"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CPU usage for Postgres heap&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fao426cjbdg4mvwvkhwm6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fao426cjbdg4mvwvkhwm6.png" alt="heap" width="800" height="792"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Memory usage(MB) for OrioleDB&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fydwjnukzvlo7r5i3r4vi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fydwjnukzvlo7r5i3r4vi.png" alt="Orioledb" width="800" height="801"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CPU usage for OrioleDB&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqyse13lpg312l4mxijv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqyse13lpg312l4mxijv.png" alt="Orioledb" width="800" height="804"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>storageengines</category>
    </item>
    <item>
      <title>Overwriting data</title>
      <dc:creator>Brian Misachi</dc:creator>
      <pubDate>Thu, 14 Nov 2024 09:56:32 +0000</pubDate>
      <link>https://dev.to/misachi/page-sized-writes-58m6</link>
      <guid>https://dev.to/misachi/page-sized-writes-58m6</guid>
      <description>&lt;p&gt;Ensuring durability when performing writes is not as trivial as &lt;a href="https://dev.to/misachi/durable-writes-4j5f"&gt;it may seem&lt;/a&gt;. If not handled well, it might lead to data loss. There is another thing that needs to be taken care of when performing writes: page alignment.&lt;/p&gt;

&lt;p&gt;On most Linux-based operating systems, data is operated on in units of pages. The typical size of a single page is 4 kilobytes (4096 bytes), although the physical storage device (e.g. the SSD) might operate on physical blocks of a different size. All requests (read/write) are handled in page sizes (4KB) by the kernel. For example, a request to read 10 bytes will eventually result in a read of 4096 bytes. The excess 4086 bytes are wasted if not needed.&lt;/p&gt;
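&lt;p&gt;The page-granular behaviour is easy to model: whatever byte range a request touches gets expanded to whole pages. A minimal sketch in Python (assuming the typical 4096-byte page; verify yours with &lt;code&gt;getconf PAGESIZE&lt;/code&gt;):&lt;/p&gt;

```python
# Model of the kernel's page-granular I/O: a request covering
# [offset, offset + length) is expanded to whole pages.
PAGE_SIZE = 4096  # typical page size; verify with `getconf PAGESIZE`

def pages_touched(offset, length):
    first = offset // PAGE_SIZE                            # first page index
    last = (offset + length + PAGE_SIZE - 1) // PAGE_SIZE  # one past the last
    return last - first

print(pages_touched(0, 10))     # 1: a 10-byte read costs a full 4096-byte page
print(pages_touched(4090, 10))  # 2: a tiny write straddling a boundary costs two
```

&lt;p&gt;Note that a small write straddling a page boundary makes the kernel touch two full pages, which is part of why unaligned overwrites are expensive.&lt;/p&gt;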

&lt;p&gt;With this understanding, I set out to test the effects of writing data that is not aligned to the page size (less than 4KB). It turns out that overwriting data with small, unaligned writes can be very slow. It's important to mention that the tests were done on a limited machine: a 5GB-memory, 80GB-disk, 8-core Ubuntu Jammy VM, using the ext4 filesystem on kernel version 5.15.0-124-generic.&lt;/p&gt;

&lt;p&gt;There were no major differences when writing new data to a file for aligned and unaligned writes (~2.1 secs). Overwriting existing data in a file gave more interesting results. I used the code &lt;a href="https://github.com/misachi/file_io" rel="noopener noreferrer"&gt;here&lt;/a&gt; for the tests.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;There is a 5 second sleep added in both tests. I added it to get enough profiling time, as the tests ended before the profiler could start. With this, 5 seconds can be shaved off the real time of both tests, meaning 6 seconds of real time for aligned writes and 2 minutes 27 seconds for unaligned writes. I'm sure there is a better way to do it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Aligned writes results(run with &lt;code&gt;#define IS_ALIGNED_PAGE 1&lt;/code&gt;)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ time make stalls

gcc -g main.c -o ~/.tmp/file_io/stalls.o
~/.tmp/file_io/stalls.o
Time Taken: 6.195574 seconds

real    0m10.954s
user    0m0.315s
sys     0m6.003s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looking at the result for aligned writes, it takes 6.3s in total to complete. Of this, 0.3s is spent in user land, with 6s spent by the system in kernel space (syscalls). To make sense of the results, I profiled and collected the stack traces for analysis. The call stacks can be visualized in the &lt;a href="https://www.brendangregg.com/flamegraphs.html" rel="noopener noreferrer"&gt;flamegraphs&lt;/a&gt; shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhypeiam6hoe8oxkzbegt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhypeiam6hoe8oxkzbegt.png" alt="page aligned write flamegraph" width="800" height="328"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Of the 7974536us duration, about 29% is used for servicing fsync calls. The thin tall tower to the far right is the visualized stack for the write calls. We can see the time is mostly dominated by fsync calls. Ignore the futex section, which is mostly time consumed during the sleep explained above.&lt;/p&gt;

&lt;p&gt;Pages are being written to disk with the &lt;code&gt;ext4_writepages&lt;/code&gt; function -- the filesystem's function for writing dirty pages back to disk. Since we are writing in page-size (4KB) units, there is no need to read the pages from disk before updating them and writing them back to disk again. The section issuing page writes is highlighted below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikbu7bhs0iqcrqw4ecvl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikbu7bhs0iqcrqw4ecvl.png" alt="page aligned write flamegraph" width="800" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unaligned writes results(run with &lt;code&gt;#define IS_ALIGNED_PAGE 0&lt;/code&gt;)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ time make stalls

gcc -g main.c -o ~/.tmp/file_io/stalls.o
~/.tmp/file_io/stalls.o
Time Taken: 33.140095 seconds

real    2m32.503s
user    0m1.326s
sys     0m31.955s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;About 1 minute 50 seconds cannot be accounted for. This is probably time spent blocking while performing read I/O. Let's see if the profiling data supports this.&lt;/p&gt;
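&lt;p&gt;The rough arithmetic behind that figure, with the times rounded from the &lt;code&gt;time&lt;/code&gt; output above and the 5 second profiler sleep subtracted (the rounding means it lands near, not exactly on, that estimate):&lt;/p&gt;

```python
# Wall-clock accounting for the unaligned run (seconds, rounded from
# the `time` output above). Whatever real time is not spent on the CPU
# or sleeping is time the process was blocked, mostly on read I/O.
real_time = 2 * 60 + 32.5  # 2m32.503s
user_time = 1.3            # 0m1.326s
sys_time = 32.0            # 0m31.955s
sleep_time = 5.0           # profiler warm-up sleep

blocked = real_time - user_time - sys_time - sleep_time
print(round(blocked))  # 114: roughly two minutes spent off-CPU
```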

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8v0au22y2a208aswz94e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8v0au22y2a208aswz94e.png" alt="unaligned write flamegraph" width="800" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For unaligned writes, the stacks are dominated by page reads. This happens because we are issuing writes in sizes smaller than the page size. The filesystem detects this and has to read the full page from disk and update it in memory before writing it back to disk, leading to write stalls. This can be seen in the function &lt;code&gt;__wait_on_buffer&lt;/code&gt;: writes wait on page blocks being read, causing increased latency.&lt;/p&gt;

&lt;p&gt;The thin tall tower to the far left is the stack for the fsync call. Fsync calls flush dirty pages to disk.&lt;br&gt;
The waiting on reads can be clearly seen as shown&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqpddb3qaf7jzydv9ik0s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqpddb3qaf7jzydv9ik0s.png" alt="unaligned write flamegraph" width="800" height="269"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The filesystem code attempts to write to the page but checks whether it is writing a full page or just a portion of it. If it is not a full page, a page read request is issued. Depending on the page size and disk block size, this can be a single request (pagesize == blocksize) or multiple requests (pagesize &amp;gt; blocksize). The code for the &lt;code&gt;ext4_block_write_begin&lt;/code&gt; function is shown below. Parts of the function have been removed to highlight the important bits.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;static int ext4_block_write_begin(struct page *page, loff_t pos, unsigned len,
                  get_block_t *get_block)
{
    /* Code removed here for clarity */

    for (bh = head, block_start = 0; bh != head || !block_start;
        block++, block_start = block_end, bh = bh-&amp;gt;b_this_page) {

        /* Code removed here for clarity */

        if (!buffer_mapped(bh)) {
            WARN_ON(bh-&amp;gt;b_size != blocksize);
            err = get_block(inode, block, bh, 1);
            if (err)
                break;
            if (buffer_new(bh)) {
                if (PageUptodate(page)) {
                    clear_buffer_new(bh);
                    set_buffer_uptodate(bh);
                    mark_buffer_dirty(bh);
                    continue;
                }
                if (block_end &amp;gt; to || block_start &amp;lt; from)
                    zero_user_segments(page, to, block_end,
                               block_start, from);
                continue;
            }
        }
        if (PageUptodate(page)) {
            set_buffer_uptodate(bh);
            continue;
        }
        if (!buffer_uptodate(bh) &amp;amp;&amp;amp; !buffer_delay(bh) &amp;amp;&amp;amp;
            !buffer_unwritten(bh) &amp;amp;&amp;amp;
            (block_start &amp;lt; from || block_end &amp;gt; to)) {
            ext4_read_bh_lock(bh, 0, false);
            wait[nr_wait++] = bh;
        }
    }
    /*
     * If we issued read requests, let them complete.
     */
    for (i = 0; i &amp;lt; nr_wait; i++) {
        wait_on_buffer(wait[i]);
        if (!buffer_uptodate(wait[i]))
            err = -EIO;
    }

    /* Code removed here for clarity */

    return err;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The check for size is done at this section&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if (!buffer_uptodate(bh) &amp;amp;&amp;amp; !buffer_delay(bh) &amp;amp;&amp;amp;
            !buffer_unwritten(bh) &amp;amp;&amp;amp;
            (block_start &amp;lt; from || block_end &amp;gt; to)) 
{
    ext4_read_bh_lock(bh, 0, false);
    wait[nr_wait++] = bh;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the write is smaller than the required size, a disk read request is initiated in the call &lt;code&gt;ext4_read_bh_lock(bh, 0, false)&lt;/code&gt;.&lt;br&gt;
Later on, the read request is waited on in &lt;code&gt;wait_on_buffer&lt;/code&gt; if it hasn't yet completed. All pending read requests for the disk blocks of the particular page are waited on before processing can continue. See the code snippet below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for (i = 0; i &amp;lt; nr_wait; i++) {
    wait_on_buffer(wait[i]);
    if (!buffer_uptodate(wait[i]))
        err = -EIO;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code is from kernel v5.15.121.&lt;/p&gt;

&lt;p&gt;The key takeaway: overwriting data can be inefficient and take longer if it isn't done in chunks that match the size of the system's memory pages (typically 4KB). Working in units that don't match the page size can result in extra processing overhead, slowing down system performance.&lt;/p&gt;

&lt;p&gt;I've also developed a basic &lt;a href="https://github.com/misachi/ext0fs" rel="noopener noreferrer"&gt;filesystem&lt;/a&gt; that can help you get familiar with some filesystem concepts.&lt;/p&gt;

</description>
      <category>linux</category>
      <category>filesystems</category>
    </item>
    <item>
      <title>Durable writes</title>
      <dc:creator>Brian Misachi</dc:creator>
      <pubDate>Wed, 16 Oct 2024 18:08:24 +0000</pubDate>
      <link>https://dev.to/misachi/durable-writes-4j5f</link>
      <guid>https://dev.to/misachi/durable-writes-4j5f</guid>
      <description>&lt;p&gt;When writing data to a file, one of the things you have to deal with is ensuring the data is reliably persisted to a non-volatile storage device. Storage devices like magnetic disks(HDD), SSDs, persistent memory(NV RAM) etc. are used to offer some form of durable storage, such that once data is stored in the device it is "guaranteed" to exist even after an outage(e.g. power loss). The storage device can be local to your system or on a network to be accessed remotely.&lt;/p&gt;

&lt;p&gt;When updating a file using the &lt;code&gt;write&lt;/code&gt; system call, an easy mistake to make is to assume the data is immediately "saved" to permanent storage. Is this assumption correct? Well...it depends :). I will attempt to answer the question by describing what happens when a user attempts to &lt;code&gt;write&lt;/code&gt; to a file.&lt;/p&gt;

&lt;h2&gt;
  
  
  Abstractions everywhere...
&lt;/h2&gt;

&lt;p&gt;Multiple caching layers exist between user applications and the storage device. Caching can start from a language library like glibc and extend all the way to the write-back cache in the storage device, mostly for performance reasons. These caches temporarily store data before sending it to the storage device.&lt;br&gt;
Performing I/O is expensive and can easily take up to hundreds of milliseconds or even seconds (on some slower disks). This can degrade performance for user applications, as the CPU has to wait for data transfer to and from disk storage, wasting cycles while it sits idle waiting for I/O to complete. Caching keeps data in memory, closer to the CPU, reducing latency and keeping the CPU busy doing meaningful work.&lt;/p&gt;

&lt;p&gt;There are various levels of caching that the system provides as shown in the figure below&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmbvdukurwoo8h8mxvm0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmbvdukurwoo8h8mxvm0.png" alt="Cache layers" width="609" height="504"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As depicted in the diagram above, caching enables fast reads (reduced I/O to retrieve data from slow storage) and also improves write performance, since the user does not need to wait for the actual writes to the storage device to complete. Writes to disk are taken care of by background kernel threads (bdflush et al.). Interestingly, even writing the data to the storage device might not guarantee durability. How? Some storage devices have an internal write-back cache where data is stored for a short period before being flushed to disk. The heuristic used to flush the data may differ between disk vendors and is opaque to the kernel: even if the kernel issues a sync command to the device, it can't guarantee that the data is on permanent storage when the write returns.&lt;/p&gt;

&lt;p&gt;With this, a scenario can occur where a user issues a write and the system experiences a power loss before the data reaches the storage device. At this point, the data could still be in one of the cache layers and there is a possibility for data loss. This means that after the system is recovered, some of the data the user thought was "stored to durable storage" might not be available.&lt;/p&gt;

&lt;h2&gt;
  
  
  What now?
&lt;/h2&gt;

&lt;p&gt;This does not mean there is nothing we can do to reduce the chances of data loss. The scenario above mostly describes buffered writes. One possible way to reduce the possibility of data loss is using unbuffered writes, bypassing the page cache and interfacing "directly" with the underlying storage device. I use "directly" loosely here since this is not usually the case as you still need to go through the filesystem when opening/writing to a file. Unbuffered writes can be done by &lt;a href="https://man7.org/linux/man-pages/man2/open.2.html" rel="noopener noreferrer"&gt;opening&lt;/a&gt; the file  with &lt;code&gt;O_SYNC&lt;/code&gt; and &lt;code&gt;O_DIRECT&lt;/code&gt; flags.&lt;br&gt;
The type of filesystem being used also matters -- &lt;a href="https://opensource.com/article/17/5/introduction-ext4-filesystem" rel="noopener noreferrer"&gt;ext4 &lt;/a&gt; filesystem behavior may be different from &lt;a href="https://openzfs.org/wiki/Main_Page" rel="noopener noreferrer"&gt;zfs&lt;/a&gt;, for example when handling the flags above and general handling of writes(&lt;code&gt;O_DIRECT&lt;/code&gt; flag is not yet supported by some filesystems).&lt;/p&gt;

&lt;p&gt;Calling &lt;code&gt;fsync&lt;/code&gt; or &lt;code&gt;fdatasync&lt;/code&gt; ensures data is flushed at least to the storage device. The difference is that &lt;code&gt;fsync&lt;/code&gt; flushes both data and metadata to the underlying storage device. The data is what the user wants to write to the file, while the metadata is the details about the file, such as &lt;a href="https://en.wikipedia.org/wiki/Inode#:~:text=The%20inode%20(index%20node)%20is,locations%20of%20the%20object's%20data." rel="noopener noreferrer"&gt;inode&lt;/a&gt; data like the file modification time, file size etc. Because &lt;code&gt;fdatasync&lt;/code&gt; only flushes the file data, it performs better than &lt;code&gt;fsync&lt;/code&gt;, with the caveat that it might cause failures during recovery due to missing or incorrect metadata.&lt;/p&gt;

&lt;p&gt;However, &lt;code&gt;fsync&lt;/code&gt; and &lt;code&gt;fdatasync&lt;/code&gt; are synchronous system calls and should be used sparingly for performance reasons, since the application has to wait for the data to be flushed out to disk. One way to optimize this is to batch writes and sync at intervals, combining multiple write/sync calls into a single syscall.&lt;/p&gt;
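&lt;p&gt;A minimal sketch of the batching idea in Python, where &lt;code&gt;os.write&lt;/code&gt; and &lt;code&gt;os.fsync&lt;/code&gt; map directly to the underlying syscalls: buffer several logical records, then issue one write and one fsync for the whole batch instead of a syscall pair per record:&lt;/p&gt;

```python
import os
import tempfile

def write_batch(fd, records):
    # One write syscall and one flush for the whole batch, instead of
    # a write + fsync pair per record.
    os.write(fd, b"".join(records))
    os.fsync(fd)  # data (and metadata) pushed at least to the storage device

fd, path = tempfile.mkstemp()
try:
    write_batch(fd, [b"rec1", b"rec2", b"rec3"])
    print(os.path.getsize(path))  # 12: all three records persisted
finally:
    os.close(fd)
    os.unlink(path)
```

&lt;p&gt;The trade-off is the window between batches: records buffered but not yet synced can still be lost on a crash, so the batch interval bounds how much data you are willing to lose.&lt;/p&gt;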

&lt;p&gt;So, back to the question above: Is it correct to assume data is written to the storage device after a &lt;code&gt;write&lt;/code&gt; call? The answer still remains...it depends. &lt;/p&gt;

</description>
      <category>linux</category>
      <category>filesystems</category>
    </item>
    <item>
      <title>Debugging Postgres</title>
      <dc:creator>Brian Misachi</dc:creator>
      <pubDate>Tue, 08 Oct 2024 08:43:35 +0000</pubDate>
      <link>https://dev.to/misachi/debugging-postgres-3cll</link>
      <guid>https://dev.to/misachi/debugging-postgres-3cll</guid>
      <description>&lt;p&gt;Debugging Postgres can be daunting at first, due to the size and complexity of the project. Luckily, there are tools that can help make the process bearable. One such tool is the debugger: &lt;em&gt;A debugger or debugging tool is a computer program used to test and debug other programs (the "target" program)&lt;/em&gt; [&lt;a href="https://en.wikipedia.org/wiki/Debugger" rel="noopener noreferrer"&gt;Wikipedia&lt;/a&gt;].&lt;/p&gt;

&lt;p&gt;This post is about using a debugger to achieve two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Exploring the Postgres codebase&lt;/li&gt;
&lt;li&gt;Tracking bugs in Postgres&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;First, you'll need to set up Postgres as described &lt;a href="https://dev.to/misachi/developing-with-postgres-4ej4"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Exploring the Postgres codebase
&lt;/h2&gt;

&lt;p&gt;When working on a large project such as Postgres, you might want to follow particular code paths in order to understand how some functionality is implemented from start to end. For example, I want to follow the function calls for insertions in the database -- which parts of the code are called from the point of issuing an INSERT query to when I get back the query results.&lt;br&gt;
Debuggers come in handy for this use case, compared to looking through the codebase in your editor on your own, trying to figure out which function called another.&lt;/p&gt;

&lt;p&gt;I will be using the GNU Debugger(GDB) for this post. You can use any debugger you are most comfortable with.&lt;/p&gt;

&lt;p&gt;Let's get started by creating a dummy database and table to insert data into&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# -- Creating test database
postgres=# CREATE DATABASE testdb;
CREATE DATABASE
postgres=# -- Connect to testdb
postgres=# \c testdb
You are now connected to database "testdb" as user "postgres".
testdb=# -- Create test table for inserting to
testdb=# CREATE TABLE test_tbl(id SERIAL PRIMARY KEY, name TEXT);
CREATE TABLE
testdb=#
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we need to get the process ID of the backend serving our client with this command&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ps -faux | grep postgres
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This should display something that looks like the following&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
postgres      11  0.0  0.3 161752 18028 ?        Ss   06:50   0:00 /usr/local/pgsql/bin/postgres -D /usr/local/pgsql/data
postgres     359  0.0  0.4 162000 20628 ?        Ss   11:32   0:00  \_ postgres: checkpointer
postgres     360  0.0  0.0 161904  4368 ?        Ss   11:32   0:00  \_ postgres: background writer
postgres     361  0.0  0.1 161752  7600 ?        Ss   11:32   0:00  \_ postgres: walwriter
postgres     362  0.0  0.1 163384  6896 ?        Ss   11:32   0:00  \_ postgres: autovacuum launcher
postgres     363  0.0  0.1 163332  5360 ?        Ss   11:32   0:00  \_ postgres: logical replication launcher
postgres     373  0.0  0.3 164484 17728 ?        Ss   11:49   0:00  \_ postgres: postgres testdb [local] idle
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From the above, the Postgres process we are interested in is the last one, &lt;code&gt;postgres testdb [local] idle&lt;/code&gt;, with Process ID (PID) 373.&lt;br&gt;
The other processes shown are simply Postgres backends tasked with various background functions, e.g. checkpointing data, writing WAL data from buffers to disk, vacuuming bloated tables etc. -- we are not interested in these for this post.&lt;/p&gt;

&lt;p&gt;I can now start to trace an insert statement. I will have 2 terminals open. One for running the client(psql) and another for hooking the debugger to the client process.&lt;br&gt;
In the terminal for running the debugger, I will run the following command &lt;code&gt;gdb &amp;lt;command&amp;gt;&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gdb /usr/local/pgsql/bin/postgres
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This should launch gdb using the Postgres executable. This will open up the gdb prompt on to the screen.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/local/pgsql/bin/postgres...
(gdb) 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Attach to the PID we got earlier -- 373.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(gdb) attach 373
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the terminal running the client(psql), I will run the INSERT command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;testdb=# -- Inserting an apple :)
testdb=# INSERT INTO test_tbl(name) VALUES ('apples');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The client will then freeze as the debugger takes control.&lt;/p&gt;

&lt;p&gt;I will switch to the debugger terminal and begin stepping through the code. First I need to determine which function to break/stop at. For this example I will use the &lt;code&gt;heap_insert&lt;/code&gt; function as the breakpoint since I am doing an INSERT, so I know for sure the data will need to be inserted into the heap (think of the heap as the disk space where table data is stored in Postgres).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(gdb) b heap_insert
Breakpoint 1 at 0x56040e35e7b1: file heapam.c, line 1819.
(gdb) c
Continuing.

Breakpoint 1, heap_insert (relation=0x7f1a6e5ca360, tup=0x560410b68e10, cid=0, options=0, bistate=0x0) at heapam.c:1819
1819    {
(gdb) bt
#0  heap_insert (relation=0x7f1a6e5ca360, tup=0x560410b68e10, cid=0, options=0, bistate=0x0) at heapam.c:1819
#1  0x000056040e36d1b3 in heapam_tuple_insert (relation=0x7f1a6e5ca360, slot=0x560410b68d08, cid=0, options=0, bistate=0x0) at heapam_handler.c:252
#2  0x000056040e5feb8c in table_tuple_insert (rel=0x7f1a6e5ca360, slot=0x560410b68d08, cid=0, options=0, bistate=0x0) at ../../../src/include/access/tableam.h:1400
#3  0x000056040e60090d in ExecInsert (context=0x7ffd763931e0, resultRelInfo=0x560410b680b8, slot=0x560410b68d08, canSetTag=true, inserted_tuple=0x0, insert_destrel=0x0) at nodeModifyTable.c:1133
#4  0x000056040e604acd in ExecModifyTable (pstate=0x560410b67eb0) at nodeModifyTable.c:3790
#5  0x000056040e5c8a49 in ExecProcNodeFirst (node=0x560410b67eb0) at execProcnode.c:464
#6  0x000056040e5bcd58 in ExecProcNode (node=0x560410b67eb0) at ../../../src/include/executor/executor.h:273
#7  0x000056040e5bf68f in ExecutePlan (estate=0x560410b67c68, planstate=0x560410b67eb0, use_parallel_mode=false, operation=CMD_INSERT, sendTuples=false, numberTuples=0, direction=ForwardScanDirection,
    dest=0x560410b64ee8, execute_once=true) at execMain.c:1670
#8  0x000056040e5bd2ec in standard_ExecutorRun (queryDesc=0x560410b5a038, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:365
#9  0x000056040e5bd173 in ExecutorRun (queryDesc=0x560410b5a038, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:309
#10 0x000056040e844858 in ProcessQuery (plan=0x560410b64d98, sourceText=0x560410a67bb8 "INSERT INTO test_tbl(name) VALUES ('apples');", params=0x0, queryEnv=0x0, dest=0x560410b64ee8, qc=0x7ffd76393640)
    at pquery.c:160
#11 0x000056040e8461f5 in PortalRunMulti (portal=0x560410ae4fe8, isTopLevel=true, setHoldSnapshot=false, dest=0x560410b64ee8, altdest=0x560410b64ee8, qc=0x7ffd76393640) at pquery.c:1277
#12 0x000056040e84578b in PortalRun (portal=0x560410ae4fe8, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x560410b64ee8, altdest=0x560410b64ee8, qc=0x7ffd76393640) at pquery.c:791
#13 0x000056040e83eaa8 in exec_simple_query (query_string=0x560410a67bb8 "INSERT INTO test_tbl(name) VALUES ('apples');") at postgres.c:1272
#14 0x000056040e8437d0 in PostgresMain (dbname=0x560410a9ee80 "testdb", username=0x560410a9ee68 "postgres") at postgres.c:4652
#15 0x000056040e77bf0b in BackendRun (port=0x560410a911e0) at postmaster.c:4439
#16 0x000056040e77b797 in BackendStartup (port=0x560410a911e0) at postmaster.c:4167
#17 0x000056040e777eb3 in ServerLoop () at postmaster.c:1781
#18 0x000056040e777767 in PostmasterMain (argc=3, argv=0x560410a635b0) at postmaster.c:1465
#19 0x000056040e63bb69 in main (argc=3, argv=0x560410a635b0) at main.c:198
(gdb)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set a breakpoint with &lt;code&gt;b heap_insert&lt;/code&gt;. Next, allow execution to continue until the breakpoint is hit, with &lt;code&gt;c&lt;/code&gt;. Finally, to see how execution reached this point (the breakpoint), use &lt;code&gt;bt&lt;/code&gt; for a backtrace.&lt;/p&gt;

&lt;p&gt;Execution starts from the &lt;code&gt;main&lt;/code&gt; function and flows through the executor functions (Exec*) down to the table and heap functions. The source files and line numbers for all call sites are also shown.&lt;br&gt;
As you can see, this is a powerful tool to help find your way through complex codebases very quickly.&lt;/p&gt;
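
&lt;p&gt;Once stopped at the breakpoint, a few more standard gdb commands are useful for inspecting state and stepping onward. The session below is illustrative -- the printed values will differ from run to run, and &lt;code&gt;relation-&amp;gt;rd_rel-&amp;gt;relname&lt;/code&gt; assumes the standard Postgres &lt;code&gt;RelationData&lt;/code&gt; layout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(gdb) p relation-&amp;gt;rd_rel-&amp;gt;relname   # print the name of the table being inserted into
(gdb) n                              # execute the next line inside heap_insert
(gdb) finish                         # run until heap_insert returns to its caller
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
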
&lt;h2&gt;
  
  
  Tracking bugs in Postgres
&lt;/h2&gt;

&lt;p&gt;When a Postgres backend process encounters a problem it may crash or issue a warning, depending on the severity of the problem; the client (e.g. psql) then sees its connection drop. &lt;br&gt;
I will investigate a segmentation fault (SIGSEGV) as an example.&lt;/p&gt;

&lt;p&gt;I will be using the &lt;a href="https://github.com/misachi/pg_wal_ext" rel="noopener noreferrer"&gt;PG_WAL_EXT extension&lt;/a&gt; for this example. In the &lt;code&gt;pg_xlog_records&lt;/code&gt; function, I will comment out the allocation for the reader object defined in &lt;code&gt;XLogReaderState *xlog_reader&lt;/code&gt;. This will cause a segmentation fault when the object (&lt;code&gt;xlog_reader&lt;/code&gt;) is dereferenced, since no memory will have been allocated for it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// xlog_reader = XLogReaderAllocate(wal_segment_size, directory,
    //                                  XL_ROUTINE(.page_read = &amp;amp;read_xlog_page,
    //                                             .segment_open = wal_segment_open,
    //                                             .segment_close = wal_segment_close),
    //                                  &amp;amp;private);

    // if (xlog_reader == NULL)
    //     ereport(ERROR,
    //             (errcode(ERRCODE_OUT_OF_MEMORY),
    //              errmsg("out of memory"),
    //              errdetail("Failed while allocating a WAL reading processor.")));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This should trigger a segfault when the query is run. The client connection will be reset and a crash message shown as below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgres=# select * from pg_xlog_records('/usr/local/pgsql/data/pg_wal/000000010000000000000001');
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
postgres=#
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
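
&lt;p&gt;Note that core files are often disabled by default. They must be enabled before the crash happens, and where the kernel writes them is system-dependent. A rough sketch for Linux (exact settings vary by distribution and container setup):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# allow core files of unlimited size in the shell that starts the server
ulimit -c unlimited

# show where the kernel writes core files (the pattern varies per system)
cat /proc/sys/kernel/core_pattern
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
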



&lt;p&gt;To debug this, I will open the debugger again using the command format &lt;code&gt;gdb &amp;lt;command&amp;gt; -c &amp;lt;core-file&amp;gt;&lt;/code&gt;, as below&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gdb /usr/local/pgsql/bin/postgres -c /tmp/core.postgres.1728338434.415
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;/usr/local/pgsql/bin/postgres&lt;/code&gt; is the path to the Postgres executable and &lt;code&gt;/tmp/core.postgres.1728338434.415&lt;/code&gt; is the path to the core dump file.&lt;/p&gt;

&lt;p&gt;The output from the command above looks something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F68qojcjj3790qgjx8rx7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F68qojcjj3790qgjx8rx7.png" alt="Command Output" width="800" height="181"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first frame (highlighted) shows the file (pg_wal.c) and the exact line (194) where the error occurred.&lt;/p&gt;
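
&lt;p&gt;From here, the faulting frame can be examined directly. These are standard gdb commands; the exact output depends on the particular crash:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(gdb) frame 0        # select the frame where the fault occurred
(gdb) list           # show the source code around the faulting line
(gdb) p xlog_reader  # inspect the pointer -- an unallocated value confirms the cause
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
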

&lt;p&gt;This is a trivial example meant to show how to go about debugging Postgres. Other issues might not be as straightforward and may require looking at multiple frames and following deep call stacks to determine the root cause of an issue with the Postgres server.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
    </item>
  </channel>
</rss>
