PostgreSQL Point-in-Time Recovery with pgBackRest

#postgres #tutorial #devops #database

PostgreSQL Point-in-Time Recovery with pgBackRest

pg_dump gives you a snapshot at the moment you ran it. If your last dump was 6 hours ago and someone accidentally deletes a production table, those 6 hours are gone. Even with hourly dumps, you lose everything between the last dump and the incident. For a database processing thousands of transactions per minute, that gap is devastating. Point-in-time recovery (PITR) eliminates that gap -- restoring your database to any specific second by replaying the write-ahead log on top of a base backup.

How PITR Works

Two mechanisms combine:

Base backups -- periodic snapshots of all database files
WAL archiving -- continuous streaming of every WAL segment to a backup repository

The WAL records every change made to the database. By replaying WAL segments from a base backup forward to a target timestamp, you reconstruct the exact state at that moment. If the last archived WAL is 30 seconds old, your maximum data loss is 30 seconds -- not 6 hours.

pgBackRest is the standard tool for this. It handles base backups (full, incremental, differential), WAL archiving, retention, verification, and recovery -- with parallel compression, encryption, and remote repository support.

Detecting Whether You're Protected

Check if WAL archiving is even enabled:

SELECT name, setting
FROM pg_settings
WHERE name IN (
    'archive_mode',
    'archive_command',
    'archive_timeout',
    'wal_level'
);

You need archive_mode = on and wal_level = replica (or logical). If archive_mode is off, PITR is impossible.

Check for archiving failures:

SELECT
    archived_count,
    failed_count,
    last_archived_wal,
    last_archived_time,
    last_failed_wal,
    last_failed_time,
    now() - last_archived_time AS archive_lag
FROM pg_stat_archiver;

A non-zero failed_count or archive_lag greater than a few minutes means the pipeline is broken. WAL segments are accumulating on the primary and will eventually fill the disk.

Verify backup freshness:

pgbackrest info --stanza=main
pgbackrest verify --stanza=main

Setting Up pgBackRest

Install and Configure

# Debian/Ubuntu
sudo apt-get install pgbackrest

# RHEL/Rocky
sudo dnf install pgbackrest

Create /etc/pgbackrest/pgbackrest.conf:

[main]
pg1-path=/var/lib/postgresql/18/main

[global]
repo1-path=/var/lib/pgbackrest
repo1-retention-full=2
repo1-retention-diff=7
repo1-cipher-type=aes-256-cbc
repo1-cipher-pass=your-secure-encryption-passphrase

process-max=4
compress-type=zst
compress-level=6

log-level-console=info
log-level-file=detail

Configure PostgreSQL

Add to postgresql.conf:

wal_level = replica
archive_mode = on
archive_command = 'pgbackrest --stanza=main archive-push %p'
archive_timeout = 60

The archive_timeout = 60 forces a WAL switch every 60 seconds even if the segment isn't full. This caps maximum data loss at 60 seconds.

Restart PostgreSQL, then initialize the stanza:

pgbackrest --stanza=main stanza-create
pgbackrest --stanza=main check

Schedule Backups

# Full backup (weekly)
pgbackrest --stanza=main --type=full backup

# Differential (daily -- changes since last full)
pgbackrest --stanza=main --type=diff backup

# Incremental (every 6 hours -- changes since last any backup)
pgbackrest --stanza=main --type=incr backup

Cron schedule:

0 2 * * 0  pgbackrest --stanza=main --type=full backup
0 2 * * 1-6  pgbackrest --stanza=main --type=diff backup
0 */6 * * *  pgbackrest --stanza=main --type=incr backup

Performing Recovery

When disaster strikes, restore to a specific timestamp:

sudo systemctl stop postgresql

pgbackrest --stanza=main --type=time \
    --target="2026-02-28 14:30:00+00" \
    --target-action=promote \
    restore

sudo systemctl start postgresql

Set --target to just before the incident. --target-action=promote opens the database for read-write after recovery.

You can also restore to a named restore point:

-- Create before a risky operation
SELECT pg_create_restore_point('before_schema_migration');

pgbackrest --stanza=main --type=name \
    --target="before_schema_migration" \
    --target-action=promote \
    restore

Test Recovery Regularly

This is the most critical step. Schedule monthly recovery tests to a standby server:

pgbackrest --stanza=main --type=time \
    --target="2026-02-28 12:00:00+00" \
    --target-action=promote \
    --pg1-path=/var/lib/postgresql/18/test_recovery \
    restore

Verify the restored database contains expected data at the target timestamp. If recovery fails, fix the configuration before you need it in an emergency. Record the actual recovery time -- that's your real RTO.

Prevention

Build PITR into infrastructure from day one. Every production PostgreSQL database should have WAL archiving before its first production write.

Monitor three metrics continuously:

Archive lag -- alert if > 5 minutes
Failed archive count -- any non-zero value requires investigation
Backup age -- alert if exceeding your backup interval + buffer

An untested backup is not a backup. Test quarterly at minimum. Document the procedure, the expected recovery time, and who executes it. Run end-to-end: restore, replay WAL, verify data, record duration.

Store backups off-host. A backup on the same disk is destroyed by the same failure. Use S3, Azure Blob, GCS, or a separate server. Enable encryption.

Originally published at mydba.dev/blog/postgres-point-in-time-recovery