The Backup That Wasn't
We had backups. Daily snapshots to S3. Perfectly configured. Never tested.
When we needed to restore after a data corruption incident, we discovered the backups had been silently failing for 3 weeks. The S3 bucket policy had changed, and nobody noticed.
Never again.
Rule 1: Test Your Restores
Backups don't matter. Restores matter.
```bash
#!/bin/bash
# weekly-restore-test.sh -- runs every Sunday at 3am
# alert and notify are our site-specific paging/chat helpers.
set -euo pipefail

TEST_DB="restore_test_$(date +%Y%m%d)"

# Step 1: Download the latest backup
if ! aws s3 cp s3://backups/prod-db/latest.sql.gz /tmp/restore-test.sql.gz; then
  alert "CRITICAL: Cannot download backup from S3"
  exit 1
fi

# Step 2: Restore into a fresh test database
createdb -h test-db-host -U admin "$TEST_DB"
if ! gunzip -c /tmp/restore-test.sql.gz | psql -h test-db-host -U admin -d "$TEST_DB"; then
  alert "CRITICAL: Backup restore failed"
  exit 1
fi

# Step 3: Validate row counts against production
PROD_COUNT=$(psql -h prod-db -U admin -At -c "SELECT count(*) FROM users")
TEST_COUNT=$(psql -h test-db-host -U admin -d "$TEST_DB" -At -c "SELECT count(*) FROM users")
DIFF=$(( PROD_COUNT - TEST_COUNT ))

if [ "$DIFF" -gt 100 ]; then
  alert "WARNING: Restored DB has $DIFF fewer rows than production"
fi

# Step 4: Cleanup
psql -h test-db-host -U admin -c "DROP DATABASE $TEST_DB"
rm /tmp/restore-test.sql.gz

notify "Backup restore test PASSED. Rows: prod=$PROD_COUNT test=$TEST_COUNT"
```
We run this weekly. It's caught 4 backup issues in the past year.
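The "Sunday at 3am" schedule from the script header maps to a cron entry along these lines (the install path, log location, and run-as user are assumptions, not from our actual setup):

```shell
# /etc/cron.d/restore-test -- hypothetical install path and log location
0 3 * * 0  postgres  /opt/scripts/weekly-restore-test.sh >> /var/log/restore-test.log 2>&1
```

Redirecting both stdout and stderr into a log file means a failed run leaves evidence even if the alerting helper itself is what broke.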
Rule 2: Monitor Replication Lag
```sql
-- PostgreSQL replication lag, per connected replica
SELECT
  client_addr,
  state,
  pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS send_lag_bytes,
  pg_wal_lsn_diff(sent_lsn, write_lsn) AS write_lag_bytes,
  pg_wal_lsn_diff(write_lsn, flush_lsn) AS flush_lag_bytes,
  pg_wal_lsn_diff(flush_lsn, replay_lsn) AS replay_lag_bytes,
  write_lag,    -- interval columns (PostgreSQL 10+); use these
  flush_lag,    -- for the second-based thresholds below
  replay_lag
FROM pg_stat_replication;
```
Alert on:
- Replication lag > 1 second (warning)
- Replication lag > 10 seconds (critical)
- Replica disconnected (critical)
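The byte-based query is good for spotting a stuck replica, but the second-based thresholds above are easiest to check from the replica itself with `pg_last_xact_replay_timestamp()`. A minimal sketch of the threshold logic; the psql invocation in the comment and the host name are assumptions:

```shell
# classify_lag SECONDS -> ok | warning | critical
# In production the input would come from the replica, e.g. (hypothetical host):
#   psql -h replica-host -At -c \
#     "SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 0)::int"
classify_lag() {
  local lag=$1
  if [ "$lag" -gt 10 ]; then
    echo "critical"        # page someone
  elif [ "$lag" -gt 1 ]; then
    echo "warning"         # chat alert / ticket
  else
    echo "ok"
  fi
}
```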
Rule 3: Connection Pool Management
```ini
# pgbouncer.ini
[pgbouncer]
pool_mode = transaction        # transaction pooling, not session!
max_client_conn = 1000
default_pool_size = 25
reserve_pool_size = 5
reserve_pool_timeout = 3
server_idle_timeout = 600

# Monitoring
stats_period = 60
```
Without a connection pooler, every application connection holds a full PostgreSQL process. With 10 services each opening 20 connections, that's 200 PostgreSQL backends. With PgBouncer in transaction mode, you need maybe 25.
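The arithmetic behind that claim, as a sketch using the numbers from the paragraph and the pool sizes from the config above:

```shell
# Without a pooler: every client connection is a dedicated PostgreSQL backend.
services=10
conns_per_service=20
direct_backends=$(( services * conns_per_service ))

# With PgBouncer in transaction mode, server connections are capped by the
# pool configuration, no matter how many clients connect.
default_pool_size=25
reserve_pool_size=5
pooled_backends_max=$(( default_pool_size + reserve_pool_size ))

echo "direct=$direct_backends pooled_max=$pooled_backends_max"
```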
Rule 4: Query Performance Monitoring
```sql
-- Find the worst queries
SELECT
  queryid,
  calls,
  round(total_exec_time::numeric, 2) AS total_ms,
  round(mean_exec_time::numeric, 2) AS avg_ms,
  round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 2) AS pct,
  left(query, 100) AS query_preview
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```
Run this daily. The top 3 queries usually account for 60%+ of database load.
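Note that `pg_stat_statements` is an extension and is not enabled by default; a one-time setup is needed before the query above returns anything:

```sql
-- In postgresql.conf (requires a server restart):
--   shared_preload_libraries = 'pg_stat_statements'
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
```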
Rule 5: Safe Schema Migrations
```sql
-- Dangerous: adding a column with a volatile default rewrites the whole
-- table under an ACCESS EXCLUSIVE lock (before PostgreSQL 11, ANY default
-- caused a rewrite)
ALTER TABLE orders ADD COLUMN tracking_id VARCHAR(100) NOT NULL DEFAULT gen_random_uuid()::text;

-- Dangerous: plain CREATE INDEX blocks writes for the whole build
CREATE INDEX idx_orders_tracking ON orders(tracking_id);

-- Safe: CREATE INDEX CONCURRENTLY builds without blocking writes
CREATE INDEX CONCURRENTLY idx_orders_tracking ON orders(tracking_id);

-- Safe migration pattern:
-- 1. Add a nullable column with no default (instant, no table rewrite)
ALTER TABLE orders ADD COLUMN tracking_id VARCHAR(100);

-- 2. Backfill in batches (short transactions, no long-held lock)
UPDATE orders SET tracking_id = generate_tracking_id(id)
WHERE id BETWEEN 1 AND 10000;

-- 3. Add the NOT NULL constraint when done (brief ACCESS EXCLUSIVE lock)
ALTER TABLE orders ALTER COLUMN tracking_id SET NOT NULL;
```
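Step 2 is one batch; in practice you loop over the key range. A sketch of the batch-range logic, with the actual database calls left as comments (host, credentials, and the `generate_tracking_id` helper are placeholders, as in the pattern above):

```shell
# batch_ranges MAX_ID BATCH_SIZE -> one "start end" pair per line
batch_ranges() {
  local max=$1 batch=$2 start=1
  while [ "$start" -le "$max" ]; do
    echo "$start $(( start + batch - 1 ))"
    start=$(( start + batch ))
  done
}

# Driver (commented out -- needs a real database):
# batch_ranges "$(psql -At -c 'SELECT COALESCE(max(id), 0) FROM orders')" 10000 |
# while read -r start end; do
#   psql -c "UPDATE orders SET tracking_id = generate_tracking_id(id)
#            WHERE id BETWEEN $start AND $end AND tracking_id IS NULL"
#   sleep 1   # let autovacuum and replicas keep up
# done
```

The `tracking_id IS NULL` guard makes the backfill safe to re-run after an interruption.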
The Database SRE Dashboard
Panel 1: Active connections vs max_connections (saturation)
Panel 2: Query latency p50/p95/p99
Panel 3: Replication lag (seconds)
Panel 4: Transactions per second
Panel 5: Cache hit ratio (should be > 99%)
Panel 6: Disk usage + growth rate + days until full
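Panel 1, for example, can be driven by a query like this (one way to do it, not the only one):

```sql
SELECT count(*) AS connections,
       current_setting('max_connections')::int AS max_connections,
       round(100.0 * count(*) / current_setting('max_connections')::int, 1) AS pct_used
FROM pg_stat_activity;
```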
If your cache hit ratio drops below 99%, you either need more memory or have a query pattern problem.
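The cache hit ratio itself comes straight out of `pg_stat_database`:

```sql
SELECT round(
         100.0 * sum(blks_hit) / nullif(sum(blks_hit) + sum(blks_read), 0),
         2
       ) AS cache_hit_pct
FROM pg_stat_database;
```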
If you want AI-powered database monitoring that predicts issues before they impact users, check out what we're building at Nova AI Ops.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com