Timevolt

Posted on Jun 25

The Empire Strikes Back: Mastering Database Backups & Disaster Recovery

#devops #docker #kubernetes #cicd

The Quest Begins (The "Why")

I still remember the night our staging database went dark. A rogue migration script had wiped a critical table, and the only “backup” we had was a nightly pg_dump that we’d never actually tried to restore. When the alarm blared at 2 a.m., I felt like I was standing in front of a closed gate with no key—frantic, helpless, and wondering if we’d ever see our data again. That moment was the dragon I needed to slay: unreliable backups and a disaster‑recovery plan that lived only on paper.

The reality hit hard: a simple dump isn’t enough when you need point‑in‑time recovery, or when your storage layer fails. If you’ve ever felt that sinking feeling when a DELETE statement runs without a WHERE clause, you know exactly what I mean. It’s time to upgrade our gear and learn the real spells of database resilience.

The Revelation (The Insight)

The treasure I uncovered wasn’t a single tool—it was a mindset shift. Backups aren’t just about copying files; they’re about capturing a continuous stream of changes so you can rewind to any moment, like rewinding a movie to the exact frame you need. For PostgreSQL (the principles apply to most SQL databases), that means combining:

Base backups – a snapshot of the entire data directory at a given point.
Write‑Ahead Log (WAL) archiving – a relentless stream of every transaction that happens after the base backup.
Retention policies – keeping enough WAL files to restore to any point within your recovery window.

When you have both, you can restore the base backup and then “replay” the WAL to roll forward to the exact second before disaster struck. It’s like having a save‑point system in a video game: you save your progress (base backup) and then keep logging every move (WAL) so you can reload from any save point.

The magic happens in the configuration. Turn on WAL archiving, point it at a reliable storage location (S3, GCS, or even a mounted NFS share), and schedule regular base backups. Suddenly, your recovery point objective (RPO) drops from “overnight” to “minutes,” and your recovery time objective (RTO) shrinks because you’re not rebuilding from scratch—you’re just replaying logs.

Wielding the Power (Code & Examples)

The Struggle: Naive Nightly Dump

# cron job that runs at 02:00 every day
0 2 * * * pg_dump -U backupuser -Fc mydb > /backups/mydb_$(date +\%F).dump

Why it’s weak:

Only a logical dump; restoring means replaying DDL and DML, which can be slow.
No way to recover to a time between dumps.
If the dump file corrupts, you’re out of luck.
No encryption or integrity checks by default.

The Victory: Base Backup + WAL Streaming

First, enable WAL archiving in postgresql.conf:

wal_level = replica          # needed for archiving
archive_mode = on
archive_command = 'aws s3 cp %p s3://my-db-wal-backups/%f'

(Feel free to replace the aws s3 cp line with gsutil cp or any copy command that fits your cloud.)

Now take a base backup with pg_basebackup. This streams the data files while simultaneously starting WAL archiving:

# Create a base backup labeled "weekly"
pg_basebackup -h localhost -U backupuser -D /backups/base_backup_weekly \
    -Ft -z -P -R

-Ft creates a tar archive.
-z compresses it on the fly.
-P shows progress.
-R writes a standby.signal file and postgresql.auto.conf so the directory can be used as a standby (perfect for restore).

Store that tarball somewhere safe (again, S3 works well). Then, set up a retention policy—for example, keep the last 7 daily base backups and WAL files for the last 30 days. A simple lifecycle rule in your object store handles the cleanup.

Restoring to a Point‑In‑Time

Suppose disaster strikes at 2025-09-24 14:37:00 UTC. Here’s how you rewind:

Stop the PostgreSQL instance (if it’s still running):

   pg_ctl stop -D /var/lib/postgresql/data

Clear the data directory (or copy it elsewhere for safety):

   rm -rf /var/lib/postgresql/data/*

Extract the most recent base backup before the target time:

   # Assume we picked the backup from 2025-09-23 02:00:00
   tar -xzf /backups/base_backup_2025-09-23.tar.gz -C /var/lib/postgresql/data

Create a recovery.conf (PostgreSQL 12‑) or recovery.signal + postgresql.auto.conf (13+) to tell PostgreSQL to recover until our target timestamp:

   # recovery.signal (just an empty file to enable recovery mode)
   touch /var/lib/postgresql/data/recovery.signal

   # postgresql.auto.conf (append these lines)
   restore_command = 'aws s3 cp s3://my-db-wal-backups/%f %p'
   recovery_target_time = '2025-09-24 14:37:00 UTC'
   recovery_target_action = 'pause'

Start PostgreSQL—it will read the base backup, then replay WAL from the archive until it hits the timestamp, then pause:

   pg_ctl start -D /var/lib/postgresql/data

Verify your data is exactly as it was at 14:36:59, then either let it continue (promote to primary) or clone a fresh instance for further inspection.

Common trap #1: Forgetting to set restore_command. Without it, PostgreSQL will sit there waiting for WAL that never arrives, and the start will hang.

Common trap #2: Using a base backup that’s newer than your target time. You must pick a backup older than the point you want to recover to, otherwise you can’t replay backwards.

Quick Test Script (Dev‑Only)

If you want to try this on a laptop, here’s a minimal script that spins up a temporary PostgreSQL cluster, takes a base backup, archives WAL to a local folder, and then restores to a specific time:

#!/usr/bin/env bash
set -euo pipefail

PGDATA=/tmp/pgdata
BACKUP=/tmp/backup
WALDIR=/tmp/wal
mkdir -p $PGDATA $BACKUP $WALDIR

# initdb
initdb -D $PGDATA
# enable archiving
cat >> $PGDATA/postgresql.conf <<EOF
wal_level = replica
archive_mode = on
archive_command = 'cp %p $WALDIR/%f'
EOF

# start postgres
pg_ctl start -D $PGDATA -l logfile

# create a table and insert some rows
psql -p 5432 -c "CREATE TABLE demo(id serial primary key, val text);"
psql -p 5432 -c "INSERT INTO demo(val) VALUES ('first');"

# take base backup
pg_basebackup -h localhost -U postgres -D $BACKUP -Ft -z -P -R

# insert another row after backup
psql -p 5432 -c "INSERT INTO demo(val) VALUES ('second');"

# stop pg
pg_ctl stop -D $PGDATA

# --- restore to point before the second insert ---
rm -rf $PGDATA/*
tar -xzf $BACKUP/base.tar.gz -C $PGDATA
touch $PGDATA/recovery.signal
cat >> $PGDATA/postgresql.auto.conf <<EOF
restore_command = 'cp $WALDIR/%f %p'
recovery_target_time = '$(date -u -d "1 second ago" +"%Y-%m-%d %H:%M:%S")'
recovery_target_action = 'pause'
EOF

pg_ctl start -D $PGDATA -l logfile
psql -p 5432 -c "SELECT * FROM demo;"
# you should see only the 'first' row
pg_ctl stop -D $PGDATA
echo "Success! Point‑in‑time restore works."

Run it, watch the output, and feel the thrill of turning a potential data loss into a controlled experiment.

Why This New Power Matters

With this setup, you’re no longer gambling on a nightly dump. You can:

Recover to any second within your WAL retention window—critical for financial systems, audit trails, or any compliance‑heavy workload.
Reduce downtime because you’re not rebuilding from scratch; you’re just replaying logs.
Test your DR plan regularly by spinning up a clone from the latest base backup and running a point‑in‑time restore—no production impact.
Sleep better knowing that even if a human error or a storage glitch occurs, you have a verifiable path back to a known good state.

It’s the difference between carrying a paper map and having a live GPS with turn‑by‑turn navigation. You still need to know how to read the map, but the GPS tells you exactly where you are and how to get where you need to go—fast.

Your Turn: Embark on Your Own Quest

Now that you’ve seen the spell, it’s time to cast it yourself. Challenge:

Take a throwaway PostgreSQL instance (Docker works great), enable WAL archiving, schedule a base backup, and then deliberately delete a row after the backup. Try restoring to the moment just before the deletion using the steps above. Share your experience—what tripped you up, what felt like uncovering a secret level, and how your confidence in backups shifted.

If you hit a snag, drop a comment below. Let’s learn from each other’s quests and keep our data safe, one WAL file at a time. Happy backing up! 🚀

DEV Community