I recently had an issue with a test Postgres server that didn't shut down cleanly and could not restart because it was stuck in recovery. I tried all the quick tricks I could find by browsing the internet, but all of them failed. I ended up using pg_resetwal to get to a point where the database could start up again. This meant losing some data that had not been written to disk before the crash. I decided to look into how recovery really works in Postgres, which resulted in yet another extension, pg_wal_recovery. In this post, I'll demonstrate a simple table recovery using Postgres. I'll be using /usr/local/pgsql/data as my Postgres data location. If this is different for you, adjust the paths accordingly.
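If you are not sure where your data directory is, a running server will tell you:
SHOW data_directory;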
First, the definition of recovery (I googled :) ): Data recovery is the process of restoring a database to a consistent state after a crash or power loss. Changes to data are usually stored separately in a sequential log, while the actual data pages (table data) can be written to permanent storage later. When a crash occurs before the data is flushed to disk, the database can be left in an inconsistent state where there are changes in the log that are not reflected in the table files. During the next restart, the database needs to "fix" this inconsistency: it reads the log and replays every record from the last REDO location in order to restore the database to a consistent state.
Postgres uses the WAL (write-ahead log) to record changes made to tables before the data is flushed to permanent storage during a checkpoint or by the background writer. The WAL plays a very important role in database activities such as replication, point-in-time recovery (PITR), crash recovery, backups and others.
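As a quick orientation, the current end of the WAL and the segment file it lives in can be checked on a running primary with standard built-in functions:
-- current write position in the WAL and the segment file that contains it
SELECT pg_current_wal_lsn() AS current_lsn,
       pg_walfile_name(pg_current_wal_lsn()) AS current_wal_file;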
Postgres uses a control file, PGDATA/global/pg_control, to store state information required by various parts of the database. This is where details like the checkpoint location, the REDO location, the database state value (an enum), the next transaction ID to be assigned, the next OID and many more are stored. A copy of the file data is kept in shared memory for easy access and is routinely synced with the data on disk. When the database starts up, it reads the control file and checks the values of the REDO and checkpoint locations. If these two values are the same and the database was shut down cleanly, the normal startup process continues. If not, the recovery process follows. See StartupProcessMain in src/backend/postmaster/startup.c and StartupXLOG in src/backend/access/transam/xlog.c. Also see InitWalRecovery and PerformWalRecovery in transam/xlogrecovery.c for the low-level implementations.
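On a running server, much of this information can also be queried from SQL via pg_control_checkpoint(); a minimal example (column names as found in recent Postgres versions):
-- checkpoint and REDO details straight from the control data
SELECT checkpoint_lsn, redo_lsn, redo_wal_file, timeline_id
FROM pg_control_checkpoint();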
The control file data can be read using the pg_controldata tool. A sample output is as follows.
postgres@f63dffa121c7:~$ pg_controldata -D /usr/local/pgsql/data/
pg_control version number: 1800
Catalog version number: 202505071
Database system identifier: 7504776984629272596
Database cluster state: in production
pg_control last modified: Tue Jul 8 21:38:34 2025
Latest checkpoint location: 16/18031820
Latest checkpoint's REDO location: 16/18031820
Latest checkpoint's REDO WAL file: 000000010000001600000018
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 1:1847483885
Latest checkpoint's NextOID: 443131
.....
In this case, both the REDO and checkpoint locations point to the same position in the WAL file and the database is up and running, as shown by the value of Database cluster state: in production.
Now if we run the CHECKPOINT command, the checkpoint and REDO positions in the control file will change.
postgres=# CHECKPOINT;
CHECKPOINT
The control file data has changed:
pg_control version number: 1800
Catalog version number: 202505071
Database system identifier: 7504776984629272596
Database cluster state: in production
pg_control last modified: Tue Jul 8 22:06:08 2025
Latest checkpoint location: 16/18031928
Latest checkpoint's REDO location: 16/180318D0
Latest checkpoint's REDO WAL file: 000000010000001600000018
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 1:1847483885
.....
Looking at how far the checkpoint and REDO locations have moved as a result of the CHECKPOINT command, we get:
postgres=# select '16/18031928'::pg_lsn - '16/18031820'::pg_lsn as checkpoint_diff_bytes;
checkpoint_diff_bytes
-----------------------
264
(1 row)
postgres=# select '16/180318D0'::pg_lsn - '16/18031820'::pg_lsn as redo_diff_bytes;
redo_diff_bytes
-----------------
176
(1 row)
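The same arithmetic can be done without copying LSNs by hand, since pg_control_checkpoint() exposes both locations as pg_lsn values; a minimal sketch:
-- distance between the latest checkpoint record and its REDO location
SELECT checkpoint_lsn - redo_lsn AS checkpoint_to_redo_bytes
FROM pg_control_checkpoint();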
Using the pg_waldump tool, we can see which records were added to the WAL when the CHECKPOINT command was issued:
postgres@f63dffa121c7:~$ pg_waldump --start=16/18031820 /usr/local/pgsql/data/pg_wal/000000010000001600000018
rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 16/18031820, prev 16/180317E8, desc: CHECKPOINT_SHUTDOWN redo 16/18031820; tli 1; prev tli 1; fpw true; wal_level replica; xid 1:1847483885; oid 443131; multi 1; offset 0; oldest xid 2147483649 in DB 5; oldest multi 1 in DB 5; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 16/18031898, prev 16/18031820, desc: RUNNING_XACTS nextXid 1847483885 latestCompletedXid 1847483884 oldestRunningXid 1847483885
rmgr: XLOG len (rec/tot): 30/ 30, tx: 0, lsn: 16/180318D0, prev 16/18031898, desc: CHECKPOINT_REDO wal_level replica
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 16/180318F0, prev 16/180318D0, desc: RUNNING_XACTS nextXid 1847483885 latestCompletedXid 1847483884 oldestRunningXid 1847483885
rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: 16/18031928, prev 16/180318F0, desc: CHECKPOINT_ONLINE redo 16/180318D0; tli 1; prev tli 1; fpw true; wal_level replica; xid 1:1847483885; oid 443131; multi 1; offset 0; oldest xid 2147483649 in DB 5; oldest multi 1 in DB 5; oldest/newest commit timestamp xid: 0/0; oldest running xid 1847483885; online
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 16/180319A0, prev 16/18031928, desc: RUNNING_XACTS nextXid 1847483885 latestCompletedXid 1847483884 oldestRunningXid 1847483885
pg_waldump: error: error in WAL record at 16/180319A0: invalid record length at 16/180319D8: expected at least 24, got 0
No actual data records have been added. The REDO record has moved from 16/18031820 to 16/180318D0, while the checkpoint record has moved from 16/18031820 to 16/18031928. Now we can proceed to the demo.
Open a client session, create a table foo and insert some data into it, then do a "dirty" shutdown of the server before any checkpoint is run.
postgres=# DROP TABLE IF EXISTS foo;
NOTICE: table "foo" does not exist, skipping
DROP TABLE
postgres=# CREATE TABLE foo(id SERIAL, k INT NOT NULL);
CREATE TABLE
postgres=# INSERT INTO foo SELECT i, i*100 FROM generate_series(1, 100) i;
INSERT 0 100
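As an optional aside, before killing the server it can be handy to note which on-disk file backs foo; the relfilenode at the end of this path is the same number that later appears in the blkref field of the pg_waldump output:
-- path of the table's data file, relative to the data directory
SELECT pg_relation_filepath('foo');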
In another terminal session, kill the server using the --mode=immediate option:
postgres@f63dffa121c7:~$ pg_ctl -D /usr/local/pgsql/data -l ~/logfile stop --mode=immediate
waiting for server to shut down.... done
server stopped
Table foo should now have 100 records since the transaction committed, right? Not really.
On commit, a user can only be sure that the data has been stored in the WAL, not in the table files. Before the database can be restarted, we need to do two things: copy the control file and copy the WAL segment file to another location. Create a directory named global inside the new location and copy the control file into the global directory. Then copy the last active WAL segment file (shown in the Latest checkpoint's REDO WAL file row when you run pg_controldata -D /usr/local/pgsql/data/) to the same location. Ensure the postgres user has read/write access to the location. I'll use the /tmp directory in this case. The final structure should look like this:
postgres@f63dffa121c7:~$ cp /usr/local/pgsql/data/global/pg_control /tmp/global/
postgres@f63dffa121c7:~$ cp /usr/local/pgsql/data/pg_wal/000000010000001600000018 /tmp/
postgres@f63dffa121c7:~$ ls /tmp/
000000010000001600000018 global
postgres@f63dffa121c7:~$ ls /tmp/global/
pg_control
Now clear the WAL data using the pg_resetwal tool. This is because, by default, when Postgres starts up it first checks whether it needs to go into recovery mode and, if so, uses the WAL to recover any un-checkpointed data. Manual recovery will be done later using the pg_wal_recovery extension.
postgres@f63dffa121c7:~$ pg_resetwal -D /usr/local/pgsql/data/ -f
Write-ahead log reset
pg_resetwal should never be used on a production database. It has destructive side effects that can lead to data loss. Its usage should be limited to situations where the user knows what they are doing and is aware of the risks involved.
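If you do have to reach for it, the --dry-run (-n) option prints the values pg_resetwal would write without changing anything, which makes for a useful last sanity check:
# show what pg_resetwal would do without modifying the cluster
pg_resetwal --dry-run -D /usr/local/pgsql/data/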
Looking at the control data now, the checkpoint and REDO locations match. If there were any WAL segment files in the pg_wal directory prior to running the command, they have been removed. Also, a new WAL segment file, 000000010000001600000019, has been created.
postgres@f63dffa121c7:~$ pg_controldata -D /usr/local/pgsql/data/
pg_control version number: 1800
Catalog version number: 202505071
Database system identifier: 7504776984629272596
Database cluster state: shut down
pg_control last modified: Wed Jul 9 10:46:20 2025
Latest checkpoint location: 16/19000028
Latest checkpoint's REDO location: 16/19000028
Latest checkpoint's REDO WAL file: 000000010000001600000019
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 1:1847483887
Latest checkpoint's NextOID: 451323
.....
Now the database can be restarted since we know it won't go into automatic recovery mode.
postgres@f63dffa121c7:~$ pg_ctl -D /usr/local/pgsql/data -l ~/logfile start
waiting for server to start.... done
server started
Once it is up, we can check the foo table to find out if there is any data in it.
postgres=# SELECT * FROM foo;
id | k
----+---
(0 rows)
It returned no rows. Does this mean the data inserted earlier has been lost? Well... not really.
The data was committed, so we can be sure that, at the very least, it has been logged in the WAL. We can check this using the pg_wal_recovery extension.
postgres=# SELECT * FROM wal_list_records('/tmp');
WARNING: invalid record length at 16/18065410: expected at least 24, got 0
wal_file_name | wal_type | wal_record
--------------------------+-------------------+-------------
000000010000001600000018 | CHECKPOINT_REDO | 16/1805D4E0
000000010000001600000018 | RUNNING_XACTS | 16/1805D500
000000010000001600000018 | CHECKPOINT_ONLINE | 16/1805D538
000000010000001600000018 | RUNNING_XACTS | 16/1805D5B0
000000010000001600000018 | FPI_FOR_HINT | 16/1805D5E8
000000010000001600000018 | FPI_FOR_HINT | 16/1805E0C0
000000010000001600000018 | FPI_FOR_HINT | 16/1805F880
000000010000001600000018 | RUNNING_XACTS | 16/18060D00
000000010000001600000018 | INSERT+INIT | 16/18060D38
000000010000001600000018 | INSERT | 16/18060D78
000000010000001600000018 | INSERT | 16/18060DB8
000000010000001600000018 | INSERT | 16/18060DF8
000000010000001600000018 | INSERT | 16/18060E38
.....
000000010000001600000018 | INSERT | 16/18062550
000000010000001600000018 | INSERT | 16/18062590
000000010000001600000018 | INSERT | 16/180625D0
000000010000001600000018 | INSERT | 16/18062610
000000010000001600000018 | COMMIT | 16/18062650
000000010000001600000018 | INSERT | 16/18062678
000000010000001600000018 | INSERT_LEAF | 16/18063CD8
000000010000001600000018 | INSERT | 16/18064268
000000010000001600000018 | INSERT_LEAF | 16/180644B8
000000010000001600000018 | INPLACE | 16/180644F8
000000010000001600000018 | COMMIT | 16/18065388
(115 rows)
The extension read WAL records starting from 16/1805D4E0, the last REDO location in the control file that was copied earlier. There are 100 WAL records from offset 16/18060D38 to 16/18062610 in the WAL segment file. This matches the number of inserts that were done just before the database was shut down. We can check this using the pg_waldump tool, which provides far more information about the WAL records:
postgres@f63dffa121c7:~$ pg_waldump --start=16/18060D38 --end=16/18062610 /tmp/000000010000001600000018
rmgr: Heap len (rec/tot): 63/ 63, tx: 1847483887, lsn: 16/18060D38, prev 16/18060D00, desc: INSERT+INIT off: 1, flags: 0x00, blkref #0: rel 1663/5/443132 blk 0
rmgr: Heap len (rec/tot): 63/ 63, tx: 1847483887, lsn: 16/18060D78, prev 16/18060D38, desc: INSERT off: 2, flags: 0x00, blkref #0: rel 1663/5/443132 blk 0
rmgr: Heap len (rec/tot): 63/ 63, tx: 1847483887, lsn: 16/18060DB8, prev 16/18060D78, desc: INSERT off: 3, flags: 0x00, blkref #0: rel 1663/5/443132 blk 0
.....
rmgr: Heap len (rec/tot): 63/ 63, tx: 1847483887, lsn: 16/18062590, prev 16/18062550, desc: INSERT off: 98, flags: 0x00, blkref #0: rel 1663/5/443132 blk 0
rmgr: Heap len (rec/tot): 63/ 63, tx: 1847483887, lsn: 16/180625D0, prev 16/18062590, desc: INSERT off: 99, flags: 0x00, blkref #0: rel 1663/5/443132 blk 0
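Assuming wal_list_records returns the columns shown earlier and that wal_record casts to pg_lsn (an assumption about the extension's output, not documented behaviour), the insert count can also be confirmed with a quick query:
-- count the heap insert records belonging to the foo transaction
SELECT count(*) AS insert_records
FROM wal_list_records('/tmp')
WHERE wal_type IN ('INSERT', 'INSERT+INIT')
  AND wal_record::pg_lsn BETWEEN '16/18060D38' AND '16/18062610';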
The last thing to check before replaying the WAL records is the data file (table foo), to confirm whether there is any data present. For this, the pageinspect extension comes in handy.
postgres=# SELECT lp, t_ctid, t_xmin, t_xmax FROM heap_page_items(get_raw_page('foo', 0)) WHERE lp_len > 0;
lp | t_ctid | t_xmin | t_xmax
----+--------+--------+--------
(0 rows)
Here we query the contents of the first page (zero-based) of the data file. The lp (line pointer) is the item number of the record within the page. t_ctid is the location of the record within the data file, e.g. (0,3) can be read as the third item on the first page. t_xmin and t_xmax are the transaction IDs of the inserting and deleting transactions respectively.
The table is empty. We have to replay the WAL in order to get the data into the table. The pg_wal_recovery extension will be used for this purpose. When done, it outputs the last WAL record that was replayed.
postgres=# select * FROM wal_recover('/tmp');
WARNING: invalid record length at 16/18065410: expected at least 24, got 0
wal_type | wal_record
----------+-------------
COMMIT | 16/18065388
(1 row)
The "lost" data should now appear in the data files. Again, we can run the query using the helper functions from the pageinspect extension:
postgres=# SELECT lp, t_ctid, t_xmin, t_xmax FROM heap_page_items(get_raw_page('foo', 0)) WHERE lp_len > 0;
lp | t_ctid | t_xmin | t_xmax
-----+---------+------------+--------
1 | (0,1) | 1847483887 | 0
2 | (0,2) | 1847483887 | 0
3 | (0,3) | 1847483887 | 0
.....
97 | (0,97) | 1847483887 | 0
98 | (0,98) | 1847483887 | 0
99 | (0,99) | 1847483887 | 0
100 | (0,100) | 1847483887 | 0
(100 rows)
The data is now present in the table file. In this case, all 100 records fit in a single page (each page is 8 kB in size). After a restart, querying the foo table now returns all 100 records. The data has been restored.
postgres=# SELECT * FROM foo;
id | k
-----+-------
1 | 100
2 | 200
.....
99 | 9900
100 | 10000
(100 rows)
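As a final cross-check, the same row metadata that pageinspect exposed is visible through the ordinary system columns; xmin on these rows should match the inserting transaction ID 1847483887 seen earlier:
-- row identity and visibility information via system columns
SELECT ctid, xmin, xmax, id, k FROM foo LIMIT 3;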
In this post I demonstrated how to perform a simple data recovery using the WAL. In real-life systems, the chances of having to do a recovery like this are very low. Postgres already ships with a robust recovery infrastructure that has been battle-tested for many years, so automatic recovery would handle this for you anyway. Building the extension made me dig deeper into the lower-level parts of recovery in Postgres. The extension is for educational purposes only and is not suitable for a production database, or any database holding data you care about.