I recently had an issue with a test Postgres server that didn't shut down cleanly and could not restart because it was stuck in recovery. I tried all the quick tricks I could find by browsing the internet, but all of them failed. I ended up using pg_resetwal to get to a point where the database could start up again. This meant losing some data that had not been written to disk before the crash. I decided to look into how recovery really works in Postgres, which resulted in yet another extension, pg_wal_recovery. In this post, I'll demonstrate a simple table recovery using Postgres. I'll be using /usr/local/pgsql/data as my Postgres data location. If this is different for you, adjust the paths accordingly.
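If you are not sure where your data directory is, a running server will tell you:
SHOW data_directory;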
First, the definition of recovery (I googled :) ): Data recovery is the process of restoring a database to a consistent state after a crash or power loss. Changes to data are usually stored separately in a sequential log, while the actual data pages (table data) can be written to permanent storage later. When a crash occurs before the data is flushed to disk, the database can be left in an inconsistent state where there are changes in the log that are not reflected in the table files. During the next restart, the database needs to "fix" this inconsistency: it reads the log and replays every record from the last REDO location in order to restore the database to a consistent state.
Postgres uses the WAL (write-ahead log) to record changes made to tables before the data is flushed to permanent storage during a checkpoint or by the background writer. The WAL plays a very important role in database activities such as replication, point-in-time recovery (PITR), crash recovery, backups and others.
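As a quick orientation, the current end of the WAL and the segment file it lives in can be checked on a running primary with standard built-in functions:
-- current write position in the WAL and the segment file that contains it
SELECT pg_current_wal_lsn() AS current_lsn,
       pg_walfile_name(pg_current_wal_lsn()) AS current_wal_file;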
Postgres uses a control file, PGDATA/global/pg_control, to store state information required by various parts of the database. This is where details like the checkpoint location, the REDO location, the database state value (an enum), the next transaction ID to be assigned, the next OID and many more are stored. A copy of the file data is kept in shared memory for easy access and is routinely synced with the data on disk. When the database starts up, it reads the control file and checks the values of the REDO and checkpoint locations. If these two values are the same and the database was shut down cleanly, the normal startup process continues. If not, the recovery process follows. See StartupProcessMain in src/backend/postmaster/startup.c and StartupXLOG in src/backend/access/transam/xlog.c. Also see InitWalRecovery and PerformWalRecovery in transam/xlogrecovery.c for the low-level implementations.
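On a running server, much of this information can also be queried from SQL via pg_control_checkpoint(); a minimal example (column names as found in recent Postgres versions):
-- checkpoint and REDO details straight from the control data
SELECT checkpoint_lsn, redo_lsn, redo_wal_file, timeline_id
FROM pg_control_checkpoint();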
The control file data can be read using the pg_controldata tool. A sample output is as follows.
postgres@f63dffa121c7:~$ pg_controldata -D /usr/local/pgsql/data/
pg_control version number: 1800
Catalog version number: 202505071
Database system identifier: 7504776984629272596
Database cluster state: in production
pg_control last modified: Tue Jul 8 21:38:34 2025
Latest checkpoint location: 16/18031820
Latest checkpoint's REDO location: 16/18031820
Latest checkpoint's REDO WAL file: 000000010000001600000018
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 1:1847483885
Latest checkpoint's NextOID: 443131
.....
In this case, both the REDO and checkpoint locations point to the same position in the WAL file and the database is up and running, as shown by the value of Database cluster state: in production.
Now if we run the CHECKPOINT command, the checkpoint and REDO positions in the control file will change.
postgres=# CHECKPOINT;
CHECKPOINT
The control file data has changed:
pg_control version number: 1800
Catalog version number: 202505071
Database system identifier: 7504776984629272596
Database cluster state: in production
pg_control last modified: Tue Jul 8 22:06:08 2025
Latest checkpoint location: 16/18031928
Latest checkpoint's REDO location: 16/180318D0
Latest checkpoint's REDO WAL file: 000000010000001600000018
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 1:1847483885
.....
Looking at how far the checkpoint and REDO locations have moved as a result of the CHECKPOINT command, we get:
postgres=# select '16/18031928'::pg_lsn - '16/18031820'::pg_lsn as checkpoint_diff_bytes;
checkpoint_diff_bytes
-----------------------
264
(1 row)
postgres=# select '16/180318D0'::pg_lsn - '16/18031820'::pg_lsn as redo_diff_bytes;
redo_diff_bytes
-----------------
176
(1 row)
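The same arithmetic can be done without copying LSNs by hand, since pg_control_checkpoint() exposes both locations as pg_lsn values; a minimal sketch:
-- distance between the latest checkpoint record and its REDO location
SELECT checkpoint_lsn - redo_lsn AS checkpoint_to_redo_bytes
FROM pg_control_checkpoint();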
Using the pg_waldump tool, we can see which records were added to the WAL when the CHECKPOINT command was issued:
postgres@f63dffa121c7:~$ pg_waldump --start=16/18031820 /usr/local/pgsql/data/pg_wal/000000010000001600000018
rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 16/18031820, prev 16/180317E8, desc: CHECKPOINT_SHUTDOWN redo 16/18031820; tli 1; prev tli 1; fpw true; wal_level replica; xid 1:1847483885; oid 443131; multi 1; offset 0; oldest xid 2147483649 in DB 5; oldest multi 1 in DB 5; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 16/18031898, prev 16/18031820, desc: RUNNING_XACTS nextXid 1847483885 latestCompletedXid 1847483884 oldestRunningXid 1847483885
rmgr: XLOG len (rec/tot): 30/ 30, tx: 0, lsn: 16/180318D0, prev 16/18031898, desc: CHECKPOINT_REDO wal_level replica
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 16/180318F0, prev 16/180318D0, desc: RUNNING_XACTS nextXid 1847483885 latestCompletedXid 1847483884 oldestRunningXid 1847483885
rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 16/18031928, prev 16/180318F0, desc: CHECKPOINT_ONLINE redo 16/180318D0; tli 1; prev tli 1; fpw true; wal_level replica; xid 1:1847483885; oid 443131; multi 1; offset 0; oldest xid 2147483649 in DB 5; oldest multi 1 in DB 5; oldest/newest commit timestamp xid: 0/0; oldest running xid 1847483885; online
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 16/180319A0, prev 16/18031928, desc: RUNNING_XACTS nextXid 1847483885 latestCompletedXid 1847483884 oldestRunningXid 1847483885
pg_waldump: error: error in WAL record at 16/180319A0: invalid record length at 16/180319D8: expected at least 24, got 0
No actual data records have been added. The REDO record has moved from 16/18031820 to 16/180318D0, while the checkpoint record has moved from 16/18031820 to 16/18031928. Now we can proceed to the demo.
Open a client session, create a table foo and insert some data into it, then do a "dirty" shutdown of the server before any checkpoint is run.
postgres=# DROP TABLE IF EXISTS foo;
NOTICE: table "foo" does not exist, skipping
DROP TABLE
postgres=# CREATE TABLE foo(id SERIAL, k INT NOT NULL);
CREATE TABLE
postgres=# INSERT INTO foo SELECT i, i*100 FROM generate_series(1, 100) i;
INSERT 0 100
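As an optional aside, before killing the server it can be handy to note which on-disk file backs foo; the relfilenode at the end of this path is the same number that later appears in the blkref field of the pg_waldump output:
-- path of the table's data file, relative to the data directory
SELECT pg_relation_filepath('foo');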
In another terminal session, kill the server using the --mode=immediate option:
postgres@f63dffa121c7:~$ pg_ctl -D /usr/local/pgsql/data -l ~/logfile stop --mode=immediate
waiting for server to shut down.... done
server stopped
Table foo should now have 100 records since the transaction committed, right? Not really.
On commit, a user can only be sure that the data has been stored in the WAL, not in the table files. Before the database can be restarted, we need to do two things: copy the control file and copy the WAL segment file to another location. Create a directory named global inside the new location and copy the control file into the global directory. Then copy the last active WAL segment file (shown in the Latest checkpoint's REDO WAL file row when you run pg_controldata -D /usr/local/pgsql/data/) to the same location. Ensure the postgres user has read/write access to the location. I'll use the /tmp directory in this case. The final structure should look like this:
postgres@f63dffa121c7:~$ cp /usr/local/pgsql/data/global/pg_control /tmp/global/
postgres@f63dffa121c7:~$ cp /usr/local/pgsql/data/pg_wal/000000010000001600000018 /tmp/
postgres@f63dffa121c7:~$ ls /tmp/
000000010000001600000018 global
postgres@f63dffa121c7:~$ ls /tmp/global/
pg_control
Now clear the WAL data using the pg_resetwal tool. This is because, by default, when Postgres starts up it first checks whether it needs to go into recovery mode and, if so, uses the WAL to recover any un-checkpointed data. Manual recovery will be done later using the pg_wal_recovery extension.
postgres@f63dffa121c7:~$ pg_resetwal -D /usr/local/pgsql/data/ -f
Write-ahead log reset
pg_resetwal should never be used on a production database. It has destructive side effects that can lead to data loss. Its usage should be limited to situations where the user knows what they are doing and is aware of the risks involved.
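If you do have to reach for it, the --dry-run (-n) option prints the values pg_resetwal would write without changing anything, which makes for a useful last sanity check:
# show what pg_resetwal would do without modifying the cluster
pg_resetwal --dry-run -D /usr/local/pgsql/data/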
Looking at the control data now, the checkpoint and REDO locations match. If there were any WAL segment files in the pg_wal directory prior to running the command, they have been removed. Also, a new WAL segment file, 000000010000001600000019, has been created.
postgres@f63dffa121c7:~$ pg_controldata -D /usr/local/pgsql/data/
pg_control version number: 1800
Catalog version number: 202505071
Database system identifier: 7504776984629272596
Database cluster state: shut down
pg_control last modified: Wed Jul 9 10:46:20 2025
Latest checkpoint location: 16/19000028
Latest checkpoint's REDO location: 16/19000028
Latest checkpoint's REDO WAL file: 000000010000001600000019
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 1:1847483887
Latest checkpoint's NextOID: 451323
.....
Now the database can be restarted since we know it won't go into automatic recovery mode.
postgres@f63dffa121c7:~$ pg_ctl -D /usr/local/pgsql/data -l ~/logfile start
waiting for server to start.... done
server started
Once it is up, we can check the foo table to find out if there is any data in it.
postgres=# SELECT * FROM foo;
id | k
----+---
(0 rows)
It returned no rows. Does this mean the data inserted earlier has been lost? Well... not really.
The data was committed, so we can be sure that, at the very least, it has been logged in the WAL. We can check this using the pg_wal_recovery extension.
postgres=# SELECT * FROM wal_list_records('/tmp');
WARNING: invalid record length at 16/18065410: expected at least 24, got 0
wal_file_name | wal_type | wal_record
--------------------------+-------------------+-------------
000000010000001600000018 | CHECKPOINT_REDO | 16/1805D4E0
000000010000001600000018 | RUNNING_XACTS | 16/1805D500
000000010000001600000018 | CHECKPOINT_ONLINE | 16/1805D538
000000010000001600000018 | RUNNING_XACTS | 16/1805D5B0
000000010000001600000018 | FPI_FOR_HINT | 16/1805D5E8
000000010000001600000018 | FPI_FOR_HINT | 16/1805E0C0
000000010000001600000018 | FPI_FOR_HINT | 16/1805F880
000000010000001600000018 | RUNNING_XACTS | 16/18060D00
000000010000001600000018 | INSERT+INIT | 16/18060D38
000000010000001600000018 | INSERT | 16/18060D78
000000010000001600000018 | INSERT | 16/18060DB8
000000010000001600000018 | INSERT | 16/18060DF8
000000010000001600000018 | INSERT | 16/18060E38
.....
000000010000001600000018 | INSERT | 16/18062550
000000010000001600000018 | INSERT | 16/18062590
000000010000001600000018 | INSERT | 16/180625D0
000000010000001600000018 | INSERT | 16/18062610
000000010000001600000018 | COMMIT | 16/18062650
000000010000001600000018 | INSERT | 16/18062678
000000010000001600000018 | INSERT_LEAF | 16/18063CD8
000000010000001600000018 | INSERT | 16/18064268
000000010000001600000018 | INSERT_LEAF | 16/180644B8
000000010000001600000018 | INPLACE | 16/180644F8
000000010000001600000018 | COMMIT | 16/18065388
(115 rows)
The extension read WAL records starting from 16/1805D4E0, the last REDO location in the control file that was copied earlier. There are 100 WAL records from offset 16/18060D38 to 16/18062610 in the WAL segment file. This matches the number of inserts that were done just before the database was shut down. We can check this using the pg_waldump tool, which provides far more information about the WAL records:
postgres@f63dffa121c7:~$ pg_waldump --start=16/18060D38 --end=16/18062610 /tmp/000000010000001600000018
rmgr: Heap len (rec/tot): 63/ 63, tx: 1847483887, lsn: 16/18060D38, prev 16/18060D00, desc: INSERT+INIT off: 1, flags: 0x00, blkref #0: rel 1663/5/443132 blk 0
rmgr: Heap len (rec/tot): 63/ 63, tx: 1847483887, lsn: 16/18060D78, prev 16/18060D38, desc: INSERT off: 2, flags: 0x00, blkref #0: rel 1663/5/443132 blk 0
rmgr: Heap len (rec/tot): 63/ 63, tx: 1847483887, lsn: 16/18060DB8, prev 16/18060D78, desc: INSERT off: 3, flags: 0x00, blkref #0: rel 1663/5/443132 blk 0
.....
rmgr: Heap len (rec/tot): 63/ 63, tx: 1847483887, lsn: 16/18062590, prev 16/18062550, desc: INSERT off: 98, flags: 0x00, blkref #0: rel 1663/5/443132 blk 0
rmgr: Heap len (rec/tot): 63/ 63, tx: 1847483887, lsn: 16/180625D0, prev 16/18062590, desc: INSERT off: 99, flags: 0x00, blkref #0: rel 1663/5/443132 blk 0
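Assuming wal_list_records returns the columns shown earlier and that wal_record casts to pg_lsn (an assumption about the extension's output, not documented behaviour), the insert count can also be confirmed with a quick query:
-- count the heap insert records belonging to the foo transaction
SELECT count(*) AS insert_records
FROM wal_list_records('/tmp')
WHERE wal_type IN ('INSERT', 'INSERT+INIT')
  AND wal_record::pg_lsn BETWEEN '16/18060D38' AND '16/18062610';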
The last thing to check before replaying the WAL records is the data file (table foo), to confirm whether there is any data present. For this, the pageinspect extension comes in handy.
postgres=# SELECT lp, t_ctid, t_xmin, t_xmax FROM heap_page_items(get_raw_page('foo', 0)) WHERE lp_len > 0;
lp | t_ctid | t_xmin | t_xmax
----+--------+--------+--------
(0 rows)
Here we query the contents of the first page (zero-based) of the data file. The lp (line pointer) is the item number of the record within the page. t_ctid is the location of the record within the data file, e.g. (0,3) can be read as the third item on the first page. t_xmin and t_xmax are the transaction IDs of the inserting and deleting transactions respectively.
The table is empty. We have to replay the WAL in order to get the data into the table. The pg_wal_recovery extension will be used for this purpose. When done, it outputs the last WAL record that was replayed.
postgres=# select * FROM wal_recover('/tmp');
WARNING: invalid record length at 16/18065410: expected at least 24, got 0
wal_type | wal_record
----------+-------------
COMMIT | 16/18065388
(1 row)
The "lost" data should now appear in the data files. Again, we can run the query using the helper functions from the pageinspect extension:
postgres=# SELECT lp, t_ctid, t_xmin, t_xmax FROM heap_page_items(get_raw_page('foo', 0)) WHERE lp_len > 0;
lp | t_ctid | t_xmin | t_xmax
-----+---------+------------+--------
1 | (0,1) | 1847483887 | 0
2 | (0,2) | 1847483887 | 0
3 | (0,3) | 1847483887 | 0
.....
97 | (0,97) | 1847483887 | 0
98 | (0,98) | 1847483887 | 0
99 | (0,99) | 1847483887 | 0
100 | (0,100) | 1847483887 | 0
(100 rows)
The data is now present in the table file. In this case, all 100 records fit in a single page (each page is 8 kB in size). After a restart, querying the foo table now returns all 100 records. The data has been restored.
postgres=# SELECT * FROM foo;
id | k
-----+-------
1 | 100
2 | 200
.....
99 | 9900
100 | 10000
(100 rows)
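As a final cross-check, the same row metadata that pageinspect exposed is visible through the ordinary system columns; xmin on these rows should match the inserting transaction ID 1847483887 seen earlier:
-- row identity and visibility information via system columns
SELECT ctid, xmin, xmax, id, k FROM foo LIMIT 3;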
In this post I demonstrated how to perform a simple data recovery using the WAL. In real-life systems, the chances of having to do a recovery like this are very low. Postgres already ships with a robust recovery infrastructure that has been battle-tested for many years, so automatic recovery would handle this for you anyway. Building the extension made me dig deeper into the lower-level parts of recovery in Postgres. The extension is for educational purposes only and is not suitable for a production database, or any database holding data you care about.