- Defining RTO, RPO and backup objectives
- Implementing base backups and WAL archiving for reliable PITR
- Using pgBackRest and Barman for automated, verifiable backups
- Snapshots and storage-consistent backups: practical caveats
- Testing restores, validating backups, and runbook discipline
- Practical recovery checklist and runbook templates
- Sources
Backups are only valuable when you can restore them reliably, quickly, and to the right point in time. Everything that follows focuses on making recoveries deterministic: measurable objectives, archive-complete base backups, tooling that verifies itself, and disciplined restore drills.
You operate clusters with production SLAs, live data growth, and shared storage that sometimes misbehaves. Symptoms I see most often: base backups that look complete but miss WAL segments, archive_command silently returning success even though files never arrive, snapshot workflows that omit the WAL directory, and runbooks that only exist in three people's heads. Those symptoms produce long, uncertain restores and embarrassing post-mortems — not merely invoices for extra storage.
Defining RTO, RPO and backup objectives
- Recovery Time Objective (RTO) — the maximum tolerable downtime for an application or system component; the clock starts at the incident detection/notification and ends when the system meets its operational acceptance criteria. This is the common NIST definition used in enterprise continuity planning.
- Recovery Point Objective (RPO) — the point in time to which data must be recovered after an outage (i.e., the maximum acceptable data loss measured in time). This determines how frequently your recovery points (backups / WAL archives / replicas) must be created.
Translate RTO/RPO into technical constraints and acceptance criteria:
- RPO drives how you capture data: frequent logical dumps, base backups + WAL shipping, streaming replication, or synchronous replication.
- RTO drives how you restore: automated restore tooling, pre-seeded warm standbys, or manual restores from cold data.
Practical mapping examples (illustrative, not prescriptive):
- RPO = minutes → WAL shipping + streaming replication (near-zero data loss requires synchronous or synchronous-like patterns).
- RPO = hours → frequent base backups or WAL-archive windows measured against business tolerance.
- RTO = minutes → warm standby with automated promotion and DNS/traffic automation.
- RTO = hours → scripted restore to clean host plus validated PITR procedures.
Make these objectives explicit in the runbook and tie them to acceptance tests (what constitutes “service restored” — connection health, query latency, or business process tests).
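The acceptance criteria above can be enforced mechanically in drill automation. A minimal sketch of such a check; the function name, minute-based inputs, and thresholds are illustrative assumptions, not a standard tool:

```shell
#!/usr/bin/env bash
# Compare measured drill results against RTO/RPO targets (all in minutes).
# check_objectives <restore_minutes> <data_loss_minutes> <rto_min> <rpo_min>
check_objectives() {
  local restore_min=$1 loss_min=$2 rto_min=$3 rpo_min=$4 fail=0
  if (( restore_min > rto_min )); then
    echo "FAIL: restore took ${restore_min}m, RTO target is ${rto_min}m"
    fail=1
  fi
  if (( loss_min > rpo_min )); then
    echo "FAIL: lost ${loss_min}m of data, RPO target is ${rpo_min}m"
    fail=1
  fi
  (( fail == 0 )) && echo "PASS: drill within RTO/RPO targets"
  return "$fail"
}

# Example: a drill restored in 95 minutes with 5 minutes of data loss,
# judged against RTO=120m / RPO=15m.
check_objectives 95 5 120 15
```

Running a check like this at the end of every restore drill turns RTO/RPO from slideware into a pass/fail signal your monitoring can track.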
(Definitions: NIST CSRC Glossary, Recovery Time Objective and Recovery Point Objective.)
Implementing base backups and WAL archiving for reliable PITR
Point-in-time recovery (PITR) depends on two pillars: a base backup and a complete WAL archive starting at that base backup. PostgreSQL’s continuous-archiving model uses the WAL to roll a file-system-level backup forward to a chosen time or LSN. The manual on continuous archiving describes the model and tradeoffs (you must keep WAL back to the base backup).
Key elements and concrete steps
- Base backups: use `pg_basebackup` for cluster-level binary base backups; it is easy to automate, speaks the replication protocol, handles entering and leaving backup mode, and can fetch WAL as part of the backup. Example (tar format, compressed, WAL streamed alongside):

  ```
  sudo -u postgres pg_basebackup -D /var/lib/pgsql/backups/base \
    -Ft -z -P -X stream --max-rate=100M
  ```

  `-X stream` transfers WAL over the replication stream so the backup is immediately usable for PITR.
- WAL archiving: set `wal_level = replica` (or higher) and `archive_mode = on`, and configure an `archive_command` that copies completed WAL segments to durable storage. Monitor `archive_timeout` and WAL archive arrival. Minimal `postgresql.conf` snippet:

  ```
  wal_level = replica
  max_wal_senders = 4
  archive_mode = on
  archive_command = 'test ! -f /mnt/archive/%f && cp %p /mnt/archive/%f'
  archive_timeout = 60
  ```

  PostgreSQL calls `archive_command` only for completed WAL segments; the command must return zero on success and non-zero only on failure.
- Important runtime details:
  - `pg_basebackup` runs over the replication protocol and requires a user with the `REPLICATION` privilege and a matching `pg_hba.conf` entry.
  - When relying on filesystem snapshots, either (a) take an atomic snapshot covering the entire data directory and WAL directory, or (b) bracket the snapshot with `pg_start_backup()`/`pg_stop_backup()` (or the newer `pg_backup_start()`/`pg_backup_stop()`) so PostgreSQL writes the correct backup metadata. Cloud snapshot guidance frequently demonstrates this sequence.
  - Recovery replays every WAL segment from the base backup to the recovery target, so a long WAL history means a long replay. Account for replay time when sizing RTO.

Important: WAL archiving only helps when archiving is complete and monitored; an unmonitored `archive_command` that always returns success will not save you. Use tooling that validates WAL arrival.
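To make the "return non-zero only on failure" contract concrete, here is a hedged sketch of a stricter archive script; the wrapper name and paths are assumptions, not part of PostgreSQL:

```shell
#!/usr/bin/env bash
# Sketch of an archive_command wrapper: copy to a temp name, flush, then
# rename atomically, and fail loudly so PostgreSQL retries the segment.
# Install as e.g. /usr/local/bin/archive-wal.sh and configure:
#   archive_command = '/usr/local/bin/archive-wal.sh %p %f'
archive_wal() {
  local src=$1 fname=$2 dest_dir=${3:-/mnt/archive}
  # Never overwrite: a segment already archived under the same name
  # usually signals two clusters sharing one archive, or a reused name.
  if [ -e "$dest_dir/$fname" ]; then
    echo "ERROR: $fname already archived" >&2
    return 1
  fi
  cp "$src" "$dest_dir/.$fname.tmp" || return 1
  sync "$dest_dir/.$fname.tmp" 2>/dev/null || sync
  mv "$dest_dir/.$fname.tmp" "$dest_dir/$fname" || return 1
}
```

The temp-name-plus-rename pattern means a reader of the archive directory never sees a half-copied segment, and any failure propagates as a non-zero exit so PostgreSQL keeps the WAL and retries.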
Using pgBackRest and Barman for automated, verifiable backups
Hand-rolled scripts scale poorly. Two mature, widely used automation frameworks are pgBackRest and Barman; both support base backups, WAL archiving, PITR, and verification hooks, but they embody different operational models and integrations.
Comparison at a glance
| Feature | pgBackRest | Barman |
|---|---|---|
| Repository types | posix, S3, GCS, Azure (multi-repo support) | posix; cloud-snapshot integration; WAL incoming directory + storage catalog |
| WAL integration | `archive-push`/`archive-get`, e.g. `archive_command = 'pgbackrest --stanza=X archive-push %p'` | `barman-wal-archive` utility or rsync/ssh in `archive_command` |
| Incremental/differential support | Incremental and differential backups, merge/expire logic, retention controls | File-level/incremental options, snapshot support, retention policy configuration |
| Verification & checks | `pgbackrest check`, `info`, `verify`; automatic checks on stanza-create/backup | `barman check`, `barman check-backup`, `barman list-backups`, `barman recover` |
| PITR support | Full restore with `--type=time` or `--type=lsn`; generates `restore_command` entries | `barman recover` with `--target-time` (and related target options) |
Core pgBackRest workflow (practical):

- Configure `pgbackrest.conf` and repository paths on the backup host.
- Configure the PostgreSQL `archive_command`:

  ```
  archive_command = 'pgbackrest --stanza=demo archive-push %p'
  archive_mode = on
  wal_level = replica
  ```

  This lets PostgreSQL hand completed WAL segments to pgBackRest's `archive-push`.
- Create the stanza and validate:

  ```
  sudo -u postgres pgbackrest --stanza=demo stanza-create
  sudo -u postgres pgbackrest --stanza=demo check
  sudo -u postgres pgbackrest --stanza=demo info
  ```

  Run `check` regularly from automated monitoring to detect missing WALs before a restore is needed.
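One way to wire `check` into monitoring is to convert its exit code into a Prometheus textfile metric. The wrapper function and metric name below are assumptions; the pgBackRest invocation shown in the comment is the real command:

```shell
#!/usr/bin/env bash
# Sketch: run a backup-tool health check and expose its result as a
# node_exporter textfile metric (0 = ok, 1 = failed).
export_check_metric() {
  local metric_file=$1; shift
  local status=0
  "$@" >/dev/null 2>&1 || status=1
  printf 'backup_check_failed %d\n' "$status" > "$metric_file"
  return "$status"
}

# Example wiring (cron-friendly):
# export_check_metric /var/lib/node_exporter/pgbackrest.prom \
#   pgbackrest --stanza=prod check
```

An alert on `backup_check_failed == 1` then catches a broken archive path hours after it breaks, not weeks later during a restore.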
Core Barman workflow (practical):

- Barman receives WALs via `archive_command` (rsync/ssh or `barman-wal-archive`) or streaming; it provides `barman check`, `barman backup`, `barman list-backups`, `barman recover`, and a cron-style maintenance process.
- Example `archive_command` for Barman (here `backup` is the Barman host and `pg` the configured server name):

  ```
  archive_command = 'barman-wal-archive backup pg %p'
  archive_mode = on
  wal_level = replica
  ```

- Barman's `check-backup` verifies that the WAL segments required for a base backup are present.
Retention and expiration:

- pgBackRest exposes fine-grained `repo-retention-*` settings to expire backups and WAL segments safely; WAL required by retained backups is preserved. Use `expire` to enforce retention during maintenance windows.
- Barman uses `retention_policy` and `wal_retention_policy`, with its `cron` process applying them to stale backups.

Caveat: retention is not the same as a backup copy. Retention only controls when old backups and WAL are expired; maintain separate immutable offsite copies for long-term archival if regulation requires it.
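As an illustrative starting point, a minimal `pgbackrest.conf` with retention configured might look like this; the stanza name, paths, and retention counts are assumptions to adapt, not recommendations:

```ini
# /etc/pgbackrest/pgbackrest.conf -- illustrative values only
[global]
repo1-path=/var/lib/pgbackrest
# keep two full backups, and enough WAL to restore both
repo1-retention-full=2
repo1-retention-archive=2

[demo]
pg1-path=/var/lib/pgsql/data
```

Because `repo1-retention-archive` is tied to retained backups, expiring a full backup also releases the WAL that only that backup needed.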
Snapshots and storage-consistent backups: practical caveats
Snapshots (LVM, EBS, ZFS, or cloud volume snapshots) can be fast and space-efficient, but correctness depends on atomicity and inclusion:
- A filesystem snapshot is acceptable for PostgreSQL if it captures the entire data directory (including all tablespaces and WAL) atomically at a single point in time; crash-recovery semantics then make the snapshot usable without `pg_start_backup`/`pg_stop_backup`. Many common snapshot mechanisms do not guarantee that atomicity.
- Cloud providers' snapshot workflows typically recommend bracketing snapshot creation with the PostgreSQL backup API (`SELECT pg_backup_start(...)`/`SELECT pg_backup_stop()`, or `pg_start_backup()`/`pg_stop_backup()` in older versions) to ensure a recoverable base with a consistent WAL boundary; many cloud guides (AWS FSx, GCP snapshots) demonstrate exactly that sequence.
Snapshot workflow example (safe pattern):

```
-- on primary (PostgreSQL 14 or earlier)
SELECT pg_start_backup('snap-2025-12-20', true);
/* trigger the snapshot at the hypervisor or cloud provider here,
   and confirm on the storage side that it completed */
SELECT pg_stop_backup();
```

On PostgreSQL 15+, the low-level API was renamed to `pg_backup_start`/`pg_backup_stop` and the semantics differ slightly (the session must remain open between the two calls). Consult the documentation for your PostgreSQL version when scripting.
Operational rules:
- When the snapshot mechanism is known to be truly atomic across the entire data area, snapshot-only workflows are feasible.
- When atomicity is uncertain, always use the backup API so PostgreSQL writes the backup label, and ensure WALs from the start of the base backup are archived.
- Keep `archive_command` and WAL ingestion monitored, and test the ability to read WAL segments from the snapshot timeline (some object stores support repository time-slicing for recovery).
Testing restores, validating backups, and runbook discipline
A backup that hasn't been restored is not a backup — it’s a hope. Prove recoverability frequently and measure results.
Automated verifications (examples)

- pgBackRest:
  - `pgbackrest --stanza=demo check` validates the stanza, WAL push capability, and the archiving path.
  - `pgbackrest --stanza=demo info` shows the backup catalog and WAL coverage.
  - Periodically perform a full restore in an isolated environment:

    ```
    sudo -u postgres pgbackrest --stanza=demo --delta \
      --type=time --target="2025-12-01 10:00:00+00" --target-action=promote restore
    ```

    pgBackRest writes `restore_command` entries into `postgresql.auto.conf` so PostgreSQL can fetch WAL via `pgbackrest --stanza=demo archive-get %f "%p"`.
- Barman:
  - `barman check <server>` and `barman check-backup <server> <backup_id>` confirm that the WAL segments required for a base backup exist.
  - Restore to a staging host:

    ```
    barman recover pg latest /var/lib/postgresql/recover
    # then start PostgreSQL and validate
    ```
Restore test protocol (do this often for critical systems)
- Prepare an isolated staging host with the same OS and PostgreSQL major version and a comparable storage layout.
- Confirm the latest backup is complete: `pgbackrest --stanza=... info` or `barman list-backups`.
- Restore a full backup and perform a PITR to a non-destructive checkpoint (e.g., a recent time).
- Start PostgreSQL in recovery and run a short acceptance test suite:
- user-facing API health checks
- SQL integrity checks: row counts, checksum queries, and a sample of business transactions validated against pre-captured hashes
- Measure:
- Time from start to “DB accepts connections” (RTO candidate)
- Time to run acceptance tests
- WAL replay throughput (MB/s) and total WAL replay time
- Record results and failure modes; update runbook entries and playbooks.
Automate the test and schedule it according to criticality: many teams run lightweight restores weekly and full restores quarterly or monthly for production; more critical services require more frequent full drills.
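The timings above are easiest to trust when they are captured automatically during the drill. A sketch; the CSV layout, function name, and the commands in the example are illustrative assumptions:

```shell
#!/usr/bin/env bash
# Sketch: time each drill phase and append a CSV row
# (date, phase, seconds, exit_status) so RTO numbers accumulate across drills.
drill_phase() {
  local csv=$1 phase=$2; shift 2
  local start end status=0
  start=$(date +%s)
  "$@" || status=$?
  end=$(date +%s)
  printf '%s,%s,%d,%d\n' "$(date -u +%F)" "$phase" $((end - start)) "$status" >> "$csv"
  return "$status"
}

# Example usage during a drill:
# drill_phase drill.csv restore  sudo -u postgres pgbackrest --stanza=prod restore
# drill_phase drill.csv start_pg sudo systemctl start postgresql
```

Feeding the resulting CSV into your monitoring stack makes the "RTO candidate" measurement a trend line instead of a one-off note in a post-mortem.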
Runbook discipline: what a production restore playbook must contain (minimum)
- Owner and escalation contacts (names, roles, phones/pagers)
- RTO and RPO definitions and acceptance criteria for each service
- Exact commands to validate backups (commands + expected outputs)
- Exact commands to restore (with variables for `stanza`, `backup_id`, `target_time`)
- Verification checklist (connectivity tests, sample queries, application smoke tests)
- Post-restore cleanup and retention steps
- Post-mortem and update checklist (who writes the after-action report, where to store it)
Callout: Treat the runbook as code: version it in a repository, review changes to it, and ensure multiple people can follow it successfully.
Practical recovery checklist and runbook templates
Below are compact artifacts you can copy into your operations documentation and adapt.
Minimal nightly verification (example):

- [ ] `pgbackrest --stanza=prod info` shows a successful full backup within the retention window.
- [ ] `pgbackrest --stanza=prod check` exits successfully (otherwise alert).
- [ ] The latest base backup has a `backup_label` and its related `.backup` file in the archive.
- [ ] `archive_command` return-code monitoring is integrated with alerting (Nagios/Prometheus).
- [ ] Sample WAL restore test (weekly): restore to a staging host and run smoke tests (see the restore protocol above).
Sample restore runbook (skeleton; fill variables and secrets out-of-band):

```yaml
# Recovery runbook: PostgreSQL (production)
meta:
  db_stanza: prod
  expected_pg_version: 16
  rto_target_minutes: 120
  rpo_target_minutes: 15
contacts:
  oncall: alice@example.com
  dba: dba_team_pager
prereqs:
  - staging host with same PG major version
  - network routes open between repo and staging
steps:
  - name: validate-backup
    cmd: "sudo -u postgres pgbackrest --stanza=${db_stanza} info"
    success: "shows last backup state 'full' and 'ok'"
  - name: restore-base
    cmd: |
      sudo -u postgres pgbackrest --stanza=${db_stanza} --delta \
        --type=time --target="${TARGET_TIME}" --target-action=promote restore
    timeout: 3600
  - name: start-postgres
    cmd: "sudo systemctl start postgresql"
    wait_for: "port 5432 reachable"
  - name: smoke-tests
    cmd: "psql -U smoke -d appdb -c 'SELECT count(*) FROM important_table;'"
    success: "expected counts within tolerance"
postmortem:
  - collect logs: /var/log/postgresql, pgbackrest logs
  - record timings: start_time, pg_ready_time, smoke_completed_time
  - update runbook with deviations
```
Quick checklist for PITR restore with pgBackRest (commands):

```
# verify backups and WAL coverage
sudo -u postgres pgbackrest --stanza=prod info
sudo -u postgres pgbackrest --stanza=prod check

# restore to target time
sudo -u postgres pgbackrest --stanza=prod --delta \
  --type=time --target="2025-12-01 12:34:00+00" \
  --target-action=promote restore

# start and validate
sudo systemctl start postgresql
psql -U appuser -d appdb -c "SELECT 1;"
```
Run a Barman PITR (commands):

```
# list backups
barman list-backups prod

# verify backup WAL coverage (run by cron maintenance, but can be run manually)
barman check prod
barman check-backup prod <backup_id>

# recover latest to a directory
barman recover prod latest /var/lib/postgresql/recover
```
Testing frequency & metrics to capture
- Capture: restore duration (seconds), WAL replay speed (MB/s), data verification duration, and correctness results (pass/fail).
- Typical cadence: lightweight verification (catalog checks + `check`) every night; targeted PITR (staging restore) monthly for high-impact clusters and quarterly for lower-impact ones. Adjust to your RTO/RPO and regulatory requirements, and record metrics across drills so SLAs are demonstrable.
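For the WAL replay throughput metric, a small helper can compute MB/s from drill logs. This sketch assumes the default 16 MB WAL segment size; adjust the third argument for clusters initialized with a custom `--wal-segsize`:

```shell
#!/usr/bin/env bash
# Compute WAL replay throughput (MB/s) from segments replayed and elapsed time.
wal_replay_mbps() {
  local segments=$1 seconds=$2 seg_mb=${3:-16}
  awk -v s="$segments" -v t="$seconds" -v m="$seg_mb" \
      'BEGIN { printf "%.1f\n", (s * m) / t }'
}

wal_replay_mbps 1200 600   # 1200 segments in 10 minutes -> prints 32.0
```

Tracking this number per drill tells you directly how much WAL history your RTO budget can absorb.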
A final operational point: automation and monitoring matter more than boutique settings. Use `check` and `info` outputs in automated health checks, export them to your monitoring stack, and ensure alert thresholds exist for failed archiving, missing WALs, or retention exhaustion.
Recoverability is a measurable property; instrument it, test it, measure it, and codify it into runbooks and schedules.
Sources
- NIST CSRC Glossary — Recovery Time Objective: definition and authoritative context for RTO used in continuity planning.
- NIST CSRC Glossary — Recovery Point Objective: definition and authoritative context for RPO, which determines backup frequency and tolerable data loss.
- PostgreSQL: Continuous Archiving and Point-in-Time Recovery (PITR): how base backups plus WAL archives enable point-in-time recovery, tradeoffs for replay duration, and backup history file behavior.
- PostgreSQL: pg_basebackup documentation: how pg_basebackup works, its options (`-X stream`, compression, progress), and replication/permission requirements.
- pgBackRest — User Guide & Command Reference: stanza creation, archive-push/archive-get usage, check, info, restore examples, and repository/retention configuration.
- Barman Manual — Backup and Recovery Manager for PostgreSQL: Barman commands (barman check, barman backup, barman recover, barman check-backup), barman-wal-archive guidance, and snapshot integrations.
- PostgreSQL: Write-Ahead Log — archiving parameters: `wal_level`, `archive_mode`, `archive_command`, `archive_timeout`, and caveats about archiver behavior.
- pgBackRest — Configuration (retention options): `repo-retention-full`, `repo-retention-archive`, and how expiration interacts with WAL retention for safe PITR.
- AWS Storage Blog — FSx for OpenZFS and PostgreSQL snapshot guidance: example snapshot workflow using the PostgreSQL backup API around a storage snapshot (demonstrates `pg_backup_start`/`pg_backup_stop` in cloud snapshot contexts).