Timevolt

Posted on Jul 3

Backup Quest: A Lord of the Rings Adventure

#devops #docker #kubernetes #cicd

The Quest Begins (The "Why")

I still remember the night our production database decided to take an unscheduled nap. It was 2 a.m., the alerts were screaming, and I was staring at a blank screen wondering if I’d just lost a week’s worth of user data. Turns out, the nightly pg_dump cron job had silently failed because the disk filled up — and we had no way to roll forward to the point just before the crash. I felt like Frodo staring at the cracks of Doom, realizing the One Ring (our data) was slipping away, and we had no backup plan to save it.

That moment kicked off a quest: find a backup strategy that isn’t just a safety net but a true disaster‑recovery (DR) shield. I wanted something that could give me sub‑second RPO (Recovery Point Objective) and minutes‑level RTO (Recovery Time Objective) without turning my ops team into full‑time archivists.

The Revelation (The Insight)

The treasure I uncovered wasn’t a single magic spell — it was a layered approach:

Physical base backups (filesystem snapshots or pg_basebackup) taken daily.
Continuous WAL archiving (Write‑Ahead Log) streamed to an off‑site store, giving us point‑in‑time recovery.
Automated validation – a nightly restore test to a temporary instance, because a backup you can’t restore is just a fancy paperweight.
Clear RPO/RTO targets and a run‑book that anyone on call can follow, even at 3 a.m.

Think of it like the Fellowship: the base backup is Gandalf providing the initial strength, the WAL stream is the relentless march of the Rohirrim arriving just in time, and the restore test is Aragorn checking the sword before battle. Miss any piece, and the whole quest falters.

Wielding the Power (Code & Examples)

The Struggle – Naïve Cron Job

# /etc/cron.d/db-backup
0 2 * * * postgres pg_dump -Fc mydb > /backups/mydb_$(date +\%F).dump

Problems:

No verification that the dump succeeded.
If the disk fills, the job silently truncates.
Restore means pulling a dump, creating a fresh DB, and loading – hours of downtime.

The Victory – Physical Base + WAL‑E (or wal-g)

First, enable WAL archiving in postgresql.conf:

wal_level = replica
archive_mode = on
archive_command = 'wal-g wal-push %p'
archive_timeout = 300   # switch WAL file every 5 min if idle

Now schedule a base backup (we’ll use wal-g for simplicity, but pg_basebackup works too):

# Daily base backup at 01:00
0 1 * * * postgres wal-g backup-push /var/lib/postgresql/14/main

And a continuous WAL push (the archive_command above already does this automatically, but we also run a lazy‑push cron for safety):

# Push any WAL files that didn’t make it via archive_command
*/5 * * * * postgres wal-g wal-push /var/lib/postgresql/14/main/pg_wal/*

Restore Script – Point‑in‑Time Recovery

#!/usr/bin/env bash
set -euo pipefail

TARGET_TIME="2025-09-24 01:30:00"   # whatever you need
RESTORE_DIR="/tmp/pitr_restore"
BASE_BACKUP=$(wal-g backup-list | tail -1 | awk '{print $1}')

# 1️⃣ Fetch the latest base backup
wal-g backup-fetch $RESTORE_DIR $BASE_BACKUP

# 2️⃣ Create recovery.conf (or postgresql.auto.conf for PG12+)
cat > "$RESTORE_DIR/postgresql.auto.conf" <<EOF
restore_command = 'wal-g wal-fetch %f %p'
recovery_target_time = '$TARGET_TIME'
recovery_target_action = 'pause'
EOF

# 3️⃣ Start PostgreSQL in restore mode
pg_ctl -D "$RESTORE_DIR" -l logfile start

echo "🚀 Restore started. Point‑in‑time recovery to $TARGET_TIME"
echo "Check $RESTORE_DIR/logfile for progress."

Why this works:

The base backup gives us a consistent snapshot.
WAL‑E streams every transaction log to an S3‑compatible bucket (or GCS, Azure Blob).
recovery_target_time tells PostgreSQL to replay WAL only up to the moment we specify, then pause — perfect for undoing a fat‑finger drop table.

Traps to Avoid

Trap	What it looks like	How to dodge it
“Set‑and‑forget” backups	You run `pg_dump` nightly but never test a restore.	Schedule a monthly restore to a staging DB and run your smoke tests.
Storing WAL archives on the same volume	If the disk dies, you lose both base backup and WALs.	Use a separate object store or at least a different NAS mount with its own replication.
Ignoring retention policy	Old backups pile up, eating storage and slowing down `wal-g backup-list`.	Configure `wal-g delete` to keep, say, 30 days of base backups and unlimited WALs (or a compliance‑driven window).

Why This New Power Matters

With this setup, I’ve gone from praying the nightly dump didn’t fail to having a self‑healing data shield. When our analytics team accidentally truncated a key table at 4 p.m., I ran the restore script, pointed it to 15 minutes before the mistake, and had the database back online in under ten minutes — no angry users, no late‑night pizza runs.

Beyond the peace of mind, the practice pays off in real business terms:

Compliance: Auditors love seeing documented RPO/RTO and regular restore tests.
Cost efficiency: Object storage is cheap; you keep only what you need.
Team confidence: On‑call engineers can follow a clear run‑book instead of guessing which backup file to grab.

It’s like discovering a hidden stash of lembas bread on a long trek — you keep moving, knowing you’ve got sustenance for whatever comes next.

Your Turn – Embark on Your Own Backup Quest

Here’s a challenge: spin up a spare PostgreSQL instance (Docker works fine), enable WAL archiving to a local folder mimicking an S3 bucket, take a base backup, and then try to restore to a point five minutes in the future. Did it work? How long did it take? Share your results or any hiccups you hit — let’s turn this into a fellowship of backup warriors!

May your backups be strong, and your restores swift. 🚀

DEV Community