Database Reliability: The SRE Approach to Keeping Data Safe

Samson Tanimawo

The Backup That Wasn't

We had backups. Daily snapshots to S3. Perfectly configured. Never tested.

When we needed to restore after a data corruption incident, we discovered the backups had been silently failing for 3 weeks. The S3 bucket policy had changed, and nobody noticed.

Never again.

Rule 1: Test Your Restores

Backups don't matter. Restores matter.

#!/bin/bash
# weekly-restore-test.sh runs every Sunday at 3am
# (alert and notify are site-specific helper commands)
set -euo pipefail

TEST_DB="restore_test_$(date +%Y%m%d)"

# Step 1: Download latest backup
if ! aws s3 cp s3://backups/prod-db/latest.sql.gz /tmp/restore-test.sql.gz; then
  alert "CRITICAL: Cannot download backup from S3"
  exit 1
fi

# Step 2: Restore into a fresh test database
createdb -h test-db-host -U admin "$TEST_DB"
if ! gunzip -c /tmp/restore-test.sql.gz | psql -h test-db-host -U admin -d "$TEST_DB"; then
  alert "CRITICAL: Backup restore failed"
  exit 1
fi

# Step 3: Validate data (-t -A gives bare, unpadded output)
PROD_COUNT=$(psql -h prod-db -U admin -t -A -c "SELECT count(*) FROM users")
TEST_COUNT=$(psql -h test-db-host -U admin -d "$TEST_DB" -t -A -c "SELECT count(*) FROM users")

DIFF=$(( PROD_COUNT - TEST_COUNT ))
if [ "$DIFF" -gt 100 ]; then
  alert "WARNING: Restored DB has $DIFF fewer rows than production"
fi

# Step 4: Cleanup
dropdb -h test-db-host -U admin "$TEST_DB"
rm /tmp/restore-test.sql.gz

notify "Backup restore test PASSED. Rows: prod=$PROD_COUNT test=$TEST_COUNT"

We run this weekly. It's caught 4 backup issues in the past year.

Rule 2: Monitor Replication Lag

-- PostgreSQL replication lag
SELECT
  client_addr,
  state,
  pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS send_lag_bytes,
  pg_wal_lsn_diff(sent_lsn, write_lsn) AS write_lag_bytes,
  pg_wal_lsn_diff(write_lsn, flush_lsn) AS flush_lag_bytes,
  pg_wal_lsn_diff(flush_lsn, replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;

Alert on:

  • Replication lag > 1 second (warning)
  • Replication lag > 10 seconds (critical)
  • Replica disconnected (critical)
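The query above reports lag in WAL bytes, while the alert thresholds are in seconds. On PostgreSQL 10+, `pg_stat_replication` also exposes interval-typed `write_lag`, `flush_lag`, and `replay_lag` columns, so you can feed something like `extract(epoch from replay_lag)` into your alerting. A minimal sketch of the threshold mapping, where `classify_lag` is a hypothetical helper taking whole seconds:

```shell
#!/bin/bash
# classify_lag is a hypothetical helper: map a replication lag reading
# (integer seconds, e.g. from extract(epoch from replay_lag)) to the
# alert levels listed above. Wire the result to your alerting system.
classify_lag() {
  local lag_seconds=$1
  if [ -z "$lag_seconds" ]; then
    echo "critical"   # no reading usually means the replica is disconnected
  elif [ "$lag_seconds" -gt 10 ]; then
    echo "critical"
  elif [ "$lag_seconds" -gt 1 ]; then
    echo "warning"
  else
    echo "ok"
  fi
}
```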

Rule 3: Connection Pool Management

; pgbouncer.ini
[pgbouncer]
; transaction pooling, not session: server connections are
; returned to the pool as soon as each transaction ends
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
reserve_pool_size = 5
reserve_pool_timeout = 3
server_idle_timeout = 600

; monitoring: log pool stats every 60 seconds
stats_period = 60

Without a connection pooler, every application connection ties up a dedicated PostgreSQL backend process. With 10 services each opening 20 connections, that's 200 PostgreSQL backends. With PgBouncer in transaction mode, a pool of around 25 server connections can serve them all.
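As a starting point for `default_pool_size`, a common heuristic (from the PostgreSQL wiki's connection-count guidance) is cores times two plus the effective spindle count. `pool_size` below is a hypothetical helper, and the result is a starting point to benchmark and tune, not a rule:

```shell
#!/bin/bash
# pool_size is a hypothetical helper implementing the (cores * 2 + spindles)
# heuristic for an initial default_pool_size; measure and adjust from there.
pool_size() {
  local cores=$1 spindles=$2
  echo $(( cores * 2 + spindles ))
}
```

For an 8-core server on a single SSD-backed volume, that suggests a pool of 17, which is in the same ballpark as the 25 used above.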

Rule 4: Query Performance Monitoring

-- Find the worst queries
SELECT
  queryid,
  calls,
  round(total_exec_time::numeric, 2) AS total_ms,
  round(mean_exec_time::numeric, 2) AS avg_ms,
  round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 2) AS pct,
  left(query, 100) AS query_preview
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

Run this daily. It requires the pg_stat_statements extension (add it to shared_preload_libraries, then run CREATE EXTENSION pg_stat_statements). The top 3 queries usually account for 60%+ of database load.

Rule 5: Safe Schema Migrations

-- Dangerous: adding a column with a DEFAULT rewrites the entire table
-- under an ACCESS EXCLUSIVE lock on PostgreSQL < 11
ALTER TABLE orders ADD COLUMN tracking_id VARCHAR(100) NOT NULL DEFAULT '';

-- Safe: CREATE INDEX CONCURRENTLY builds the index without blocking writes
CREATE INDEX CONCURRENTLY idx_orders_tracking ON orders(tracking_id);

-- Safe migration pattern:
-- 1. Add a nullable column with no default (instant, only a brief lock)
ALTER TABLE orders ADD COLUMN tracking_id VARCHAR(100);

-- 2. Backfill in batches (no long-held lock)
UPDATE orders SET tracking_id = generate_tracking_id(id)
WHERE id BETWEEN 1 AND 10000;

-- 3. Add NOT NULL constraint when ready (requires ACCESS EXCLUSIVE briefly)
ALTER TABLE orders ALTER COLUMN tracking_id SET NOT NULL;
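Step 2's single UPDATE covers only the first batch. A small driver can generate the full series of batched statements to pipe into psql, ideally with a short pause between batches so replication keeps up. `emit_backfill_batches` is a hypothetical helper and the batch size is illustrative:

```shell
#!/bin/bash
# emit_backfill_batches is a hypothetical helper: emit one UPDATE per id
# range for the backfill in step 2. Pipe the output into psql, ideally
# sleeping between batches to avoid starving replication.
emit_backfill_batches() {
  local max_id=$1 batch=$2 start=1 end
  while [ "$start" -le "$max_id" ]; do
    end=$(( start + batch - 1 ))
    echo "UPDATE orders SET tracking_id = generate_tracking_id(id) WHERE id BETWEEN $start AND $end;"
    start=$(( end + 1 ))
  done
}
```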

The Database SRE Dashboard

Panel 1: Active connections vs max_connections (saturation)
Panel 2: Query latency p50/p95/p99
Panel 3: Replication lag (seconds)
Panel 4: Transactions per second
Panel 5: Cache hit ratio (should be > 99%)
Panel 6: Disk usage + growth rate + days until full

If your cache hit ratio drops below 99%, you either need more memory or have a query pattern problem.
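To spot-check that threshold, pull the blks_hit and blks_read counters with `SELECT blks_hit, blks_read FROM pg_stat_database WHERE datname = current_database();` and compute the ratio. `cache_hit_check` below is a hypothetical helper that takes those two counters (assumed nonzero) and flags a drop below 99%:

```shell
#!/bin/bash
# cache_hit_check is a hypothetical helper: given blks_hit and blks_read
# from pg_stat_database, print the hit ratio (percent) and an ok/ALERT flag.
cache_hit_check() {
  local hit=$1 read=$2
  awk -v h="$hit" -v r="$read" 'BEGIN {
    ratio = 100 * h / (h + r)
    printf "%.2f %s\n", ratio, (ratio < 99 ? "ALERT" : "ok")
  }'
}
```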

If you want AI-powered database monitoring that predicts issues before they impact users, check out what we're building at Nova AI Ops.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
