DEV Community

beefed.ai

Posted on • Originally published at beefed.ai

Backup and Disaster Recovery Strategies for PostgreSQL

  • Defining RTO, RPO and backup objectives
  • Implementing base backups and WAL archiving for reliable PITR
  • Using pgBackRest and Barman for automated, verifiable backups
  • Snapshots and storage-consistent backups: practical caveats
  • Testing restores, validating backups, and runbook discipline
  • Practical recovery checklist and runbook templates
  • Sources

Backups are only valuable when you can restore them reliably, quickly, and to the right point in time. Everything that follows focuses on making recoveries deterministic: measurable objectives, archive-complete base backups, tooling that verifies itself, and disciplined restore drills.

You operate clusters with production SLAs, live data growth, and shared storage that sometimes misbehaves. Symptoms I see most often: base backups that look complete but miss WAL segments, archive_command silently returning success even though files never arrive, snapshot workflows that omit the WAL directory, and runbooks that only exist in three people's heads. Those symptoms produce long, uncertain restores and embarrassing post-mortems — not merely invoices for extra storage.

Defining RTO, RPO and backup objectives

  • Recovery Time Objective (RTO) — the maximum tolerable downtime for an application or system component; the clock starts at the incident detection/notification and ends when the system meets its operational acceptance criteria. This is the common NIST definition used in enterprise continuity planning.
  • Recovery Point Objective (RPO) — the point in time to which data must be recovered after an outage (i.e., the maximum acceptable data loss measured in time). This determines how frequently your recovery points (backups / WAL archives / replicas) must be created.

Translate RTO/RPO into technical constraints and acceptance criteria:

  • RPO drives how you capture data: frequent logical dumps, base backups + WAL shipping, streaming replication, or synchronous replication.
  • RTO drives how you restore: automated restore tooling, pre-seeded warm standbys, or manual restores from cold data.

Practical mapping examples (illustrative, not prescriptive):

  • RPO = minutes → WAL shipping + streaming replication (near-zero data loss requires synchronous or synchronous-like patterns).
  • RPO = hours → frequent base backups or WAL-archive windows measured against business tolerance.
  • RTO = minutes → warm standby with automated promotion and DNS/traffic automation.
  • RTO = hours → scripted restore to clean host plus validated PITR procedures.

Make these objectives explicit in the runbook and tie them to acceptance tests (what constitutes “service restored” — connection health, query latency, or business process tests).
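As a hedged sketch of that translation (names and thresholds here are illustrative, not official guidance), you can encode the RPO-to-mechanism mapping so it lives next to the runbook instead of in someone's head. Note that `archive_timeout` bounds how long a partially filled WAL segment can sit unarchived, so it must fit well inside the RPO budget:

```python
# Illustrative mapping from an RPO budget to a capture mechanism and an
# upper bound on archive_timeout. The thresholds are examples, not policy.
def archiving_plan(rpo_minutes: int) -> dict:
    if rpo_minutes == 0:
        capture = "synchronous replication"            # zero data loss tolerated
    elif rpo_minutes <= 15:
        capture = "streaming replication + WAL archiving"
    else:
        capture = "base backups + periodic WAL archiving"
    # keep archive_timeout well under the RPO so archiving lag fits the budget
    timeout_s = 0 if rpo_minutes == 0 else max(60, rpo_minutes * 60 // 2)
    return {"capture": capture, "max_archive_timeout_s": timeout_s}

print(archiving_plan(15))
```

The point of a helper like this is not the numbers; it is that the objectives become reviewable code rather than tribal knowledge.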

For authoritative definitions, see the NIST CSRC Glossary entries for Recovery Time Objective and Recovery Point Objective (links in Sources).

Implementing base backups and WAL archiving for reliable PITR

Point-in-time recovery (PITR) depends on two pillars: a base backup and a complete WAL archive starting at that base backup. PostgreSQL’s continuous-archiving model uses the WAL to roll a file-system-level backup forward to a chosen time or LSN. The manual on continuous archiving describes the model and tradeoffs (you must keep WAL back to the base backup).

Key elements and concrete steps

  • Base backups:

    • Use pg_basebackup for cluster-level binary base backups; it is easy to automate and runs over the replication protocol. pg_basebackup handles entering and leaving backup mode for you and can fetch WAL as part of the backup.
    • Example (tar-formatted, include WAL via streaming):
    sudo -u postgres pg_basebackup -D /var/lib/pgsql/backups/base \
      -Ft -z -P -X stream --max-rate=100M
    

    -X stream pushes WAL via the replication stream so the backup is immediately usable for PITR.

  • WAL archiving:

    • Set wal_level = replica (or higher), archive_mode = on, and configure an archive_command that copies completed WAL segments to durable storage. Monitor archive_timeout and WAL archive arrival.
    • Minimal postgresql.conf snippet:
    wal_level = replica
    max_wal_senders = 4
    archive_mode = on
    archive_command = 'test ! -f /mnt/archive/%f && cp %p /mnt/archive/%f'
    archive_timeout = 60
    

    PostgreSQL calls archive_command only for completed WAL segments; the command must return zero only when the segment has been safely stored, and non-zero on any failure so the server retries.

  • Important runtime details:

    • pg_basebackup runs over the replication protocol and requires a user with REPLICATION privileges and pg_hba.conf access.
    • When relying on filesystem snapshots you must either (a) create an atomic snapshot that includes the entire data directory and WAL directory, or (b) bracket the snapshot with pg_start_backup() / pg_stop_backup() (or their newer pg_backup_start/pg_backup_stop) so PostgreSQL writes the correct backup metadata. Cloud snapshot guidance frequently demonstrates this sequence.
    • Recovery will replay all WAL segments from the base backup to the recovery target — long WAL history equals longer replay times. Account for replay time when sizing RTO.
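Because a bare `cp` in archive_command can report success before the bytes are durable, a small wrapper is worth the effort. This is a hedged sketch (the script path and environment variable are our assumptions, not a standard): it stages under a temporary name, publishes with an atomic rename, fsyncs the directory, and exits non-zero on any failure so PostgreSQL keeps retrying.

```python
# Sketch of an archive_command helper. Wire it in as, for example:
#   archive_command = '/usr/local/bin/archive_wal.py %p %f'
import os
import shutil
import sys

def archive_wal(wal_path: str, wal_name: str, archive_dir: str) -> None:
    dest = os.path.join(archive_dir, wal_name)
    if os.path.exists(dest):
        # A duplicate segment name usually means two clusters share one
        # archive, or a timeline mixup: fail loudly instead of overwriting.
        raise FileExistsError(f"{wal_name} already archived")
    tmp = dest + ".part"
    shutil.copy2(wal_path, tmp)   # stage under a temporary name first
    os.replace(tmp, dest)         # then publish with an atomic rename
    dir_fd = os.open(archive_dir, os.O_RDONLY)
    try:
        os.fsync(dir_fd)          # make the rename itself durable
    finally:
        os.close(dir_fd)

if __name__ == "__main__" and len(sys.argv) >= 3:
    try:
        archive_wal(sys.argv[1], sys.argv[2],
                    os.environ.get("ARCHIVE_DIR", "/mnt/archive"))
    except Exception as exc:
        print(f"archive_wal failed: {exc}", file=sys.stderr)
        sys.exit(1)   # non-zero return makes PostgreSQL retry the segment
```

Refusing to overwrite an existing segment is deliberate: silent overwrites are how two clusters corrupt a shared archive without anyone noticing.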

Important: WAL archiving works only when archiving is complete and monitored; an unmonitored archive_command that returns success will not save you. Use tooling that validates WAL arrival.
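PostgreSQL's built-in `pg_stat_archiver` view exposes the signals to alert on. A query along these lines (the 15-minute staleness threshold is an example; tie it to your RPO) can be scraped by your monitoring agent:

```sql
-- Alert when archiving is currently failing, or when the last successful
-- archive is older than the staleness budget.
SELECT failed_count,
       last_archived_wal,
       last_archived_time,
       last_failed_time,
       coalesce(last_failed_time > last_archived_time, false) AS failing_now,
       now() - last_archived_time > interval '15 minutes'     AS stale
FROM pg_stat_archiver;
```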

Using pgBackRest and Barman for automated, verifiable backups

Hand-rolled scripts scale poorly. Two mature, widely used automation frameworks are pgBackRest and Barman; both support base backups, WAL archiving, PITR, and verification hooks, but they differ in operational model and integrations.

Comparison at a glance

| Feature | pgBackRest | Barman |
| --- | --- | --- |
| Repository types | posix, S3, GCS, Azure (multi-repo support) | posix; cloud snapshot integration; WAL "incoming" directory + storage catalog |
| WAL integration | archive-push / archive-get, e.g. `archive_command = 'pgbackrest --stanza=X archive-push %p'` | `barman-wal-archive` utility, or rsync/ssh in `archive_command` |
| Incremental/differential | Incremental and differential backups, merge/expire logic, retention controls | File-level incremental options, snapshot support, retention policy configuration |
| Verification & checks | `check`, `info`, `verify`; automatic checks on stanza-create/backup | `barman check`, `barman check-backup`, `barman list-backups` |
| PITR support | `restore` with `--type=time` / `--type=lsn` targets; writes `restore_command` into the restored configuration | `barman recover` with `--target-time` / `--target-lsn` options |

Core pgBackRest workflow (practical):

  1. Configure pgbackrest.conf and repository paths on backup host.
  2. Configure PostgreSQL archive_command:
   archive_command = 'pgbackrest --stanza=demo archive-push %p'
   archive_mode = on
   wal_level = replica

This lets PostgreSQL hand WAL segments to pgBackRest’s archive-push.

  3. Create stanza and validate:
   sudo -u postgres pgbackrest --stanza=demo stanza-create
   sudo -u postgres pgbackrest --stanza=demo check
   sudo -u postgres pgbackrest --stanza=demo info

Use check regularly in automated monitoring to detect missing WALs before a restore is needed.
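One low-effort way to do that (the schedule, stanza name, and alerting path are illustrative) is a cron entry that runs the check and escalates on failure through syslog or whatever your alerting stack ingests:

```
# /etc/cron.d/pgbackrest-check -- detect broken archiving before a restore needs it
*/15 * * * * postgres pgbackrest --stanza=demo check || logger -p daemon.err -t pgbackrest "stanza check failed"
```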

Core Barman workflow (practical):

  • Barman expects WALs via archive_command (rsync/barman-wal-archive) or streaming; it offers barman check, barman backup, barman list-backups, barman recover, and a cron/cron-style maintenance process. Example archive_command for Barman:
  archive_command = 'barman-wal-archive backup pg %p'
  archive_mode = on
  wal_level = replica

Barman’s check-backup verifies that WALs required for a base backup are present.

Retention and expiration:

  • pgBackRest exposes fine-grained repo-retention-* settings to expire backups and WAL segments safely; WAL required for retained backups will be preserved. Use expire to enforce retention during maintenance windows.
  • Barman uses retention_policy and wal_retention_policy and its cron process to manage retention and stale backups.

Caveat: do not treat retention as the same as a backup copy — retention controls when old backups/WAL are expired; maintain separate immutable offsite copies for long-term archival if regulation requires it.
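A minimal pgBackRest retention configuration might look like this (paths, stanza name, and counts are illustrative; check the configuration reference for your version):

```ini
# pgbackrest.conf -- keep two full backups; WAL needed to restore any
# retained backup is preserved automatically until those backups expire
[global]
repo1-path=/var/lib/pgbackrest
repo1-retention-full=2
repo1-retention-full-type=count

[demo]
pg1-path=/var/lib/pgsql/data
```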

Snapshots and storage-consistent backups: practical caveats

Snapshots (LVM, EBS, ZFS, or cloud volume snapshots) can be fast and space-efficient, but correctness depends on atomicity and inclusion:

  • A filesystem snapshot is acceptable for PostgreSQL if it captures the entire data directory (including all tablespaces and WAL) atomically at a single point in time; in that case PostgreSQL crash recovery semantics make the snapshot usable without pg_start_backup/pg_stop_backup. For many common snapshot mechanisms this atomicity is not guaranteed.
  • Cloud providers' snapshot workflows typically recommend bracketing snapshot creation with the PostgreSQL backup API (e.g., SELECT pg_backup_start(...) / SELECT pg_backup_stop(), or pg_start_backup() / pg_stop_backup() in older versions) to ensure a recoverable base with a consistent WAL boundary; many cloud guides (AWS FSx, GCP snapshots) demonstrate exactly that sequence.

Snapshot workflow example (safe pattern):

-- on primary (Postgres 14 or earlier)
SELECT pg_start_backup('snap-2025-12-20', true);
/* trigger snapshot at hypervisor or cloud provider here */
-- ensure snapshot completed on storage side
SELECT pg_stop_backup();

On PostgreSQL 15+, the low-level API name changed to pg_backup_start/pg_backup_stop and the semantics differ slightly (session needs to remain open). Consult your Postgres version documentation when scripting.
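On 15+ the same bracketing looks like this (the label is illustrative; the key difference is that both calls must happen in one session that stays open in between):

```sql
-- PostgreSQL 15+: run both calls in ONE session and keep it open throughout
SELECT pg_backup_start('snap-2025-12-20', fast => true);
-- trigger the storage snapshot here, from the same orchestration job
SELECT labelfile FROM pg_backup_stop(wait_for_archive => true);
-- write the returned labelfile content to backup_label inside the snapshot copy
```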

Operational rules:

  • When the snapshot mechanism is known to be atomic across the entire data area (all tablespaces and WAL), snapshot-only workflows are feasible.
  • When atomicity is uncertain, always use the backup API to create the backup label and ensure WALs from the start of the base backup are archived.
  • Keep archive_command and WAL ingestion monitored and test the ability to read WAL segments from the snapshot timeline (some object stores support repository time-slicing for recovery).

Testing restores, validating backups, and runbook discipline

A backup that hasn't been restored is not a backup — it’s a hope. Prove recoverability frequently and measure results.

Automated verifications (examples)

  • pgBackRest:

    • pgbackrest --stanza=demo check → validates stanza, WAL push capability, and archiving path.
    • pgbackrest --stanza=demo info → shows backup catalog and WAL coverage.
    • Periodically perform a full restore in an isolated environment:
    sudo -u postgres pgbackrest --stanza=demo --delta \
      --type=time --target="2025-12-01 10:00:00+00" --target-action=promote restore
    

    pgBackRest will generate restore_command entries in postgresql.auto.conf so PostgreSQL can fetch WAL via pgbackrest --stanza=demo archive-get %f "%p".

  • Barman:

    • barman check <server> and barman check-backup <server> <backup_id> to confirm required WAL segments exist for a base backup.
    • Restore to a staging host:
    barman recover pg latest /var/lib/postgresql/recover
    # then start Postgres and validate
    

Restore test protocol (do this often for critical systems)

  1. Prepare an isolated staging host with the same PostgreSQL major version and a comparable OS and storage layout.
  2. Confirm the latest backup is complete: pgbackrest --stanza=... info or barman list-backups.
  3. Restore a full backup and perform a PITR to a non-destructive checkpoint (e.g., a recent time).
  4. Start PostgreSQL in recovery and run a short acceptance test suite:
    • user-facing API health checks
    • SQL integrity checks: row counts, checksum queries, and a sample of business transactions validated against pre-captured hashes
  5. Measure:
    • Time from start to “DB accepts connections” (RTO candidate)
    • Time to run acceptance tests
    • WAL replay throughput (MB/s) and total WAL replay time
  6. Record results and failure modes; update runbook entries and playbooks.

Automate the test and schedule it according to criticality: many teams run lightweight restores weekly and full restores quarterly or monthly for production; more critical services require more frequent full drills.
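To make the measurement step repeatable, a tiny harness (the function and field names are ours, not an established tool) can compute the drill metrics from timestamps and WAL volume recorded during each restore test, so RTO evidence accumulates across drills:

```python
# Compute restore-drill metrics from timestamps captured by the drill harness.
from datetime import datetime

def drill_metrics(started: datetime, db_ready: datetime,
                  tests_done: datetime, wal_replayed_mb: float) -> dict:
    restore_s = (db_ready - started).total_seconds()
    accept_s = (tests_done - db_ready).total_seconds()
    return {
        "restore_seconds": restore_s,        # RTO candidate: start -> accepts connections
        "acceptance_seconds": accept_s,      # time to run the acceptance suite
        "wal_replay_mb_per_s": round(wal_replayed_mb / restore_s, 2) if restore_s else None,
    }

print(drill_metrics(datetime(2025, 12, 1, 10, 0),
                    datetime(2025, 12, 1, 10, 40),
                    datetime(2025, 12, 1, 10, 55),
                    wal_replayed_mb=24_000))
```

Tracking `wal_replay_mb_per_s` over time is what tells you when growing WAL history has quietly pushed replay time past your RTO.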

Runbook discipline: what a production restore playbook must contain (minimum)

  • Owner and escalation contacts (names, roles, phones/pagers)
  • RTO and RPO definitions and acceptance criteria for each service
  • Exact commands to validate backups (commands + expected outputs)
  • Exact commands to restore (with variables for stanza, backup_id, target_time)
  • Verification checklist (connectivity tests, sample queries, application smoke tests)
  • Post-restore cleanup and retention steps
  • Post-mortem and update checklist (who writes the after-action report, where to store it)

Callout: Treat the runbook as code: version it, review changes like any other change, and verify that multiple people can follow it successfully.

Practical recovery checklist and runbook templates

Below are compact artifacts you can copy into your operations documentation and adapt.

Minimal nightly verification (example):

  • [ ] pgbackrest --stanza=prod info shows a successful full backup within retention window.
  • [ ] pgbackrest --stanza=prod check exits success (or alert).
  • [ ] Confirm latest base backup has backup_label and related .backup file in archive.
  • [ ] Confirm archive_command return-code monitoring is integrated with alerting (Nagios/Prometheus).
  • [ ] Sample WAL restore test (weekly): restore to staging host and run smoke tests (see restore protocol above).

Sample restore runbook (skeleton, fill variables and secrets out-of-band)

# Recovery runbook: PostgreSQL (production)
meta:
  db_stanza: prod
  expected_pg_version: 16
  rto_target_minutes: 120
  rpo_target_minutes: 15
contacts:
  oncall: alice@example.com
  dba: dba_team_pager
prereqs:
  - staging host with same PG major version
  - network routes open between repo and staging
steps:
  - name: validate-backup
    cmd: "sudo -u postgres pgbackrest --stanza=${db_stanza} info"
    success: "shows last backup state 'full' and 'ok'"
  - name: restore-base
    cmd: |
      sudo -u postgres pgbackrest --stanza=${db_stanza} --delta \
        --type=time --target="${TARGET_TIME}" --target-action=promote restore
    timeout: 3600
  - name: start-postgres
    cmd: "sudo systemctl start postgresql"
    wait_for: "port 5432 reachable"
  - name: smoke-tests
    cmd: "psql -U smoke -d appdb -c 'SELECT count(*) FROM important_table;'"
    success: "expected counts within tolerance"
postmortem:
  - collect logs: /var/log/postgresql, pgbackrest logs
  - record timings: start_time, pg_ready_time, smoke_completed_time
  - update runbook with deviations

Quick checklist for PITR restore with pgBackRest (commands)

# verify backups and WAL coverage
sudo -u postgres pgbackrest --stanza=prod info
sudo -u postgres pgbackrest --stanza=prod check

# restore to target time
sudo -u postgres pgbackrest --stanza=prod --delta \
  --type=time --target="2025-12-01 12:34:00+00" \
  --target-action=promote restore
# start and validate
sudo systemctl start postgresql
psql -U appuser -d appdb -c "SELECT 1;"

Run a Barman PITR (commands)

# list backups
barman list-backups prod

# verify backup WAL coverage (auto-invoked by cron, but can be run manually)
barman check prod
barman check-backup prod <backup_id>

# recover latest to directory
barman recover prod latest /var/lib/postgresql/recover

Testing frequency & metrics to capture

  • Capture: restore duration (seconds), WAL replay speed (MB/s), data verification duration, and correctness results (pass/fail).
  • Typical cadence: lightweight verification (catalog checks + check) every night; targeted PITR (staging restore) monthly for high-impact clusters, quarterly for lower-impact — adjust according to your RTO/RPO and regulation. Record and track metrics across drills so SLAs are demonstrable.

A final operational point: automation and monitoring matter more than boutique settings. Use check and info outputs in automated health checks, export them to your monitoring stack, and ensure alert thresholds exist for failed archiving, missing WALs, or retention exhaustion.
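pgBackRest's `info` command supports `--output=json`, which makes it easy to feed into an exporter. The payload shape below is approximated from that output and should be checked against your pgBackRest version; the sample data and function name are ours:

```python
# Sketch: compute the age of the newest backup from `pgbackrest info --output=json`
# so monitoring can alert when it exceeds the backup-frequency policy.
import json

def latest_backup_age_s(info_json: str, now: float):
    """Return seconds since the newest backup finished, or None if none exist."""
    stanzas = json.loads(info_json)
    stops = [b["timestamp"]["stop"]
             for s in stanzas for b in s.get("backup", [])]
    return (now - max(stops)) if stops else None

# Hypothetical payload, shaped like pgBackRest's JSON output (epoch seconds).
sample = json.dumps([{"name": "prod",
                      "backup": [{"type": "full",
                                  "timestamp": {"start": 1764576000,
                                                "stop": 1764579600}}]}])
print(latest_backup_age_s(sample, now=1764583200), "seconds since last backup")
```

Exporting this single number per stanza, plus the exit code of `check`, covers most of the alerting surface described above.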

Recoverability is a measurable property; instrument it, test it, measure it, and codify it into runbooks and schedules.

Sources

NIST CSRC — Recovery Time Objective - Definition and authoritative context for RTO used in continuity planning.

NIST CSRC — Recovery Point Objective - Definition and authoritative context for RPO that determines backup frequency and tolerable data loss.

PostgreSQL: Continuous Archiving and Point-in-Time Recovery (PITR) - Explanation of how base backups plus WAL archives enable point-in-time recovery, tradeoffs for replay duration, and the backup history file behavior.

PostgreSQL: pg_basebackup documentation - How pg_basebackup works, its options (-X stream, compression, progress), and replication/permission requirements.

pgBackRest — User Guide & Command Reference - Stanza creation, archive-push/archive-get usage, check, info, restore examples and repository/retention configuration.

Barman Manual — Backup and Recovery Manager for PostgreSQL - Barman commands (barman check, barman backup, barman recover, barman check-backup), barman-wal-archive guidance, and snapshot integrations.

PostgreSQL: Write Ahead Log — archive_command and archiving parameters - Runtime-config settings: wal_level, archive_mode, archive_command, archive_timeout, and caveats about archiver behavior.

pgBackRest — Configuration (retention options) - repo-retention-full, repo-retention-archive and how expiration interacts with WAL retention for safe PITR.

AWS Storage Blog — FSx for OpenZFS and PostgreSQL snapshot guidance - Example snapshot workflow using PostgreSQL backup API around a storage snapshot (demonstrates pg_backup_start / pg_backup_stop usage in cloud snapshot contexts).
