This experiment is designed to test and validate the pgrwl tool in
real conditions: https://github.com/pgrwl/pgrwlInstead of synthetic examples, we simulate a real-world failure and
verify that recovery actually works end-to-end.
Letβs do something slightly uncomfortable.
We're going to simulate a database crash and recovery.
Think disk failure. Whole server gone.
And then bring it back - byte for byte - as if nothing happened.
Not "some" data.
Not "close enough".
Everything.
The Myth of "Backups"
Most people think:
"I have a backup, so I'm safe."
That's... half true.
A base backup is just a snapshot - a frozen picture of your
database at one moment.
But databases don't sit still.
Every insert, update, delete - all of that happens after your
backup.
So where does that data live?
-> In WAL (Write-Ahead Log)
The Real Rule
If you remember one thing from this post, let it be this:
Recovery = Base Backup + WAL
Without WAL:
- your backup is outdated
- your data is incomplete
- your recovery is a lie
The Experiment
Warning: this is not intended to be run on production environment
Note: For simplicity, both the database and the backup tool are running on the same machine. In production, you should never store backups on the same host where the database is running.
We'll prove this using a bunch of simple shell commands.
Note: env-vars are omitted for simplicity.
A full working script will be attached at the end of the article.
Step 1 --- Build a Database From Nothing
log "Initializing PostgreSQL cluster..."
initdb -D "$PGDATA" -A trust \
--auth-local=trust \
--auth-host=trust >/dev/null
log "Starting PostgreSQL..."
pg_ctl -D "$PGDATA" -l "$WORKDIR/pg.log" start >/dev/null
wait_for_postgres
log "Creating physical replication slot: $REPL_SLOT"
psql -d postgres -v ON_ERROR_STOP=1 \
-c "select pg_create_physical_replication_slot('$REPL_SLOT');" >/dev/null
log "Creating test database: $DBNAME"
createdb "$DBNAME"
We start from zero.
Step 2 --- Start Capturing WAL
log "Writing pgrwl configuration..."
cat >"$PGRWL_CONFIG" <<EOF
{
"main": {
"listen_port": 7070,
"directory": "$WAL_ARCHIVE_DIR"
},
"receiver": {
"slot": "$REPL_SLOT",
"no_loop": true
},
"log": {
"level": "debug",
"format": "text",
"add_source": false
}
}
EOF
log "Starting pgrwl receiver..."
pgrwl daemon -m receive -c "$PGRWL_CONFIG" >"$WORKDIR/pgrwl-receive.log" 2>&1 &
Starting pgrwl in a receive mode
Step 3 --- Take a Base Backup
log "Creating base backup..."
pgrwl backup -c "$PGRWL_CONFIG"
This is your snapshot in time.
Backup created by using PostgreSQL replication protocol (i.e. additional tools are not required).
Step 4 --- Populate DB
log "Initializing pgbench data (scale=10 ~ about 1 million rows in pgbench_accounts)..."
pgbench -i -s 10 "$DBNAME"
log "Running pgbench workload..."
pgbench -c 4 -j 2 -t 200 "$DBNAME"
All this data exists ONLY in WAL.
Step 5 --- Save the Truth
log "Dumping cluster state before destruction..."
pg_dumpall --quote-all-identifiers --restrict-key=0 >"$WORKDIR/before.sql"
This dump becomes our ground truth.
After restore + WAL replay, we expect the cluster to match this state.
Step 6 --- Delete Everything
log "Stopping PostgreSQL and pgrwl receiver..."
stop_postgres
stop_pgrwl_receive
log "Removing original PGDATA to simulate data loss..."
rm -rf "$PGDATA"
Database gone.
- No tables.
- No data.
- No second chances.
Only backup and WAL remain.
Step 7 --- Restore the Base Backup
log "Restoring PGDATA from base backup..."
pgrwl restore --dest="$PGDATA" -c "$PGRWL_CONFIG"
chmod 0750 "$PGDATA"
chown -R postgres:postgres "$PGDATA"
# recovery.signal tells PostgreSQL to start in archive recovery mode.
touch "$PGDATA/recovery.signal"
We are back to snapshot state only.
Step 8 --- Replay History
log "Starting pgrwl restore server..."
pgrwl daemon -m serve -c "$PGRWL_CONFIG" >"$WORKDIR/pgrwl-serve.log" 2>&1 &
PGRWL_SERVE_PID=$!
cat >>"$PGDATA/postgresql.conf" <<EOF
restore_command = 'pgrwl restore-command --serve-addr=127.0.0.1:7070 %f %p'
EOF
log "Starting restored PostgreSQL cluster..."
pg_ctl -D "$PGDATA" -l "$WORKDIR/postgres-restored.log" start >/dev/null
wait_for_postgres
wait_until_out_of_recovery
Start pgrwl in serve mode for restore_command, run cluster, PostgreSQL starts replaying WAL.
It is replaying history.
Every insert.
Every update.
Every commit.
Reconstructed from WAL.
Step 9 --- Did It Work?
log "Dumping cluster state after recovery..."
pg_dumpall --quote-all-identifiers --restrict-key=0 >"$WORKDIR/after.sql"
log "Comparing dumps..."
if diff -u "$WORKDIR/before.sql" "$WORKDIR/after.sql" >"$WORKDIR/dump.diff"; then
log "SUCCESS: restored cluster matches original state"
echo "before: $WORKDIR/before.sql"
echo "after : $WORKDIR/after.sql"
echo "diff : $WORKDIR/dump.diff (empty)"
else
echo
echo "FAIL: restored cluster differs from original state"
echo "See diff: $WORKDIR/dump.diff"
exit 1
fi
If there is no diff:
We recovered every single transaction.
Not approximately. Not logically. Exactly.
Mental Model
Think Git:
- backup = commit
- WAL = commits after
- recovery = replay commits
Final Thought
If you don't understand WAL, you don't understand PostgreSQL recovery.
Using docker environment for integration tests
#!/usr/bin/env bash
set -Eeuo pipefail
# setup docker-compose env
cd /tmp
git clone https://github.com/pgrwl/pgrwl.git
cd pgrwl/test/integration/environ
make restart
# exec into container
docker exec -it pg-primary bash
# run tests
su - postgres
cd scripts/tests
bash 011-basic-flow.sh
Full Script
#!/usr/bin/env bash
set -Eeuo pipefail
###############################################################################
# Simple 'Point In Time Recovery' tutorial with pgrwl
#
# What this script demonstrates:
#
# 1. Start a fresh PostgreSQL cluster
# 2. Start pgrwl in WAL receiver mode
# 3. Take a base backup
# 4. Generate more data AFTER the base backup
# 5. Save a logical dump of the final database state
# 6. Destroy PGDATA (simulate disaster)
# 7. Restore from the base backup
# 8. Replay archived WAL files
# 9. Compare the restored database with the original state
#
# Main idea:
#
# A base backup is only a snapshot at one point in time.
# All changes made after that snapshot live in WAL.
# To recover to the latest committed transaction, we need BOTH:
#
# - the base backup
# - the WAL generated after the backup
#
###############################################################################
###############################################################################
# Configuration
###############################################################################
PGDATA="/tmp/pgrwl-basic/pgdata"
WAL_ARCHIVE_DIR="/tmp/pgrwl-basic/wal-archive"
PGRWL_CONFIG="/tmp/pgrwl-basic/pgrwl-config.json"
DBNAME="bench"
REPL_SLOT="pgrwl_v5"
export PGHOST="localhost"
export PGPORT="5432"
export PGUSER="postgres"
export PGPASSWORD="postgres"
PGRWL_RECEIVE_PID=""
PGRWL_SERVE_PID=""
###############################################################################
# Small helper functions
###############################################################################
log() {
printf '\n[%s] %s\n' "$(date '+%F %T')" "$*"
}
die() {
echo "ERROR: $*" >&2
exit 1
}
wait_for_postgres() {
log "Waiting for PostgreSQL to accept connections..."
for _ in $(seq 1 120); do
if pg_isready -h "$PGHOST" -p "$PGPORT" -U "$PGUSER" >/dev/null 2>&1; then
return 0
fi
sleep 1
done
die "PostgreSQL did not become ready in time"
}
wait_until_out_of_recovery() {
log "Waiting for PostgreSQL to finish recovery..."
for _ in $(seq 1 120); do
if psql -d postgres -Atqc "select pg_is_in_recovery()" 2>/dev/null | grep -q '^f$'; then
return 0
fi
sleep 1
done
die "PostgreSQL did not finish recovery in time"
}
stop_postgres() {
if [[ -d "$PGDATA" ]]; then
log "Stopping PostgreSQL..."
pg_ctl -D "$PGDATA" -m immediate stop >/dev/null 2>&1 || true
fi
}
stop_pgrwl_receive() {
if [[ -n "${PGRWL_RECEIVE_PID:-}" ]]; then
log "Stopping pgrwl receiver..."
kill "$PGRWL_RECEIVE_PID" >/dev/null 2>&1 || true
wait "$PGRWL_RECEIVE_PID" >/dev/null 2>&1 || true
PGRWL_RECEIVE_PID=""
fi
}
stop_pgrwl_serve() {
if [[ -n "${PGRWL_SERVE_PID:-}" ]]; then
log "Stopping pgrwl restore server..."
kill "$PGRWL_SERVE_PID" >/dev/null 2>&1 || true
wait "$PGRWL_SERVE_PID" >/dev/null 2>&1 || true
PGRWL_SERVE_PID=""
fi
}
cleanup() {
stop_pgrwl_receive
stop_pgrwl_serve
}
trap cleanup EXIT
###############################################################################
# Phase 0. Start from a clean state
###############################################################################
log "Cleaning up old processes and files..."
sudo pkill -9 postgres || true
sudo pkill -9 pgrwl || true
sudo rm -rf "/tmp/pgrwl-basic"
log "Preparing work directory: /tmp/pgrwl-basic"
mkdir -p "/tmp/pgrwl-basic" "$WAL_ARCHIVE_DIR"
###############################################################################
# Phase 1. Create and start a fresh PostgreSQL cluster
###############################################################################
log "Initializing PostgreSQL cluster..."
initdb -D "$PGDATA" -A trust --auth-local=trust --auth-host=trust >/dev/null
cat >>"$PGDATA/postgresql.conf" <<EOF
listen_addresses = '*'
# Settings required for WAL streaming / archiving style workflows
wal_level = replica
max_wal_senders = 10
max_replication_slots = 10
wal_keep_size = 64MB
# Durability settings
fsync = on
synchronous_commit = on
full_page_writes = on
# Basic logging settings
log_directory = '/tmp/pgrwl-basic'
log_filename = 'pg.log'
log_lock_waits = on
log_temp_files = 0
log_checkpoints = on
log_connections = off
log_destination = 'stderr'
log_error_verbosity = 'DEFAULT' # TERSE, DEFAULT, VERBOSE
log_hostname = off
log_min_messages = 'WARNING' # DEBUG5, DEBUG4, DEBUG3, DEBUG2, DEBUG1, INFO, NOTICE, WARNING, ERROR, LOG, FATAL, PANIC
log_timezone = 'Asia/Aqtau'
log_line_prefix = '%t [%p-%l] %r %q%u@%d '
EOF
log "Starting PostgreSQL..."
pg_ctl -D "$PGDATA" -l "/tmp/pgrwl-basic/pg.log" start >/dev/null
wait_for_postgres
log "Creating physical replication slot: $REPL_SLOT"
psql -d postgres -v ON_ERROR_STOP=1 \
-c "select pg_create_physical_replication_slot('$REPL_SLOT');" >/dev/null
log "Creating test database: $DBNAME"
createdb "$DBNAME"
###############################################################################
# Phase 2. Configure and start pgrwl in receive mode
###############################################################################
log "Writing pgrwl configuration..."
cat >"$PGRWL_CONFIG" <<EOF
{
"main": {
"listen_port": 7070,
"directory": "$WAL_ARCHIVE_DIR"
},
"receiver": {
"slot": "$REPL_SLOT",
"no_loop": true
},
"log": {
"level": "debug",
"format": "text",
"add_source": false
}
}
EOF
log "Starting pgrwl receiver..."
pgrwl daemon -m receive -c "$PGRWL_CONFIG" >"/tmp/pgrwl-basic/pgrwl-receive.log" 2>&1 &
PGRWL_RECEIVE_PID=$!
# Give the receiver a moment to connect and begin streaming.
sleep 3
###############################################################################
# Phase 3. Take a base backup
###############################################################################
log "Creating base backup..."
pgrwl backup -c "$PGRWL_CONFIG"
###############################################################################
# Phase 4. Generate data AFTER the base backup
#
# This is the important part.
# If we recover only from the base backup, these changes would be lost.
# They survive only because the WAL receiver captures the WAL stream.
###############################################################################
log "Initializing pgbench data (scale=10 ~ about 1 million rows in pgbench_accounts)..."
pgbench -i -s 10 "$DBNAME"
log "Running pgbench workload..."
pgbench -c 4 -j 2 -t 200 "$DBNAME"
###############################################################################
# Phase 5. Save the final logical state before disaster
#
# This dump becomes our ground truth.
# After restore + WAL replay, we expect the cluster to match this state.
###############################################################################
log "Dumping cluster state before destruction..."
pg_dumpall --quote-all-identifiers --restrict-key=0 >"/tmp/pgrwl-basic/before.sql"
###############################################################################
# Phase 6. Force PostgreSQL to emit final WAL and let receiver catch up
###############################################################################
log "Forcing checkpoint and WAL switch..."
psql -d postgres -v ON_ERROR_STOP=1 -c "checkpoint;" >/dev/null
psql -d postgres -v ON_ERROR_STOP=1 -c "select pg_switch_wal();" >/dev/null
# Give pgrwl time to receive the last WAL segment(s).
sleep 3
###############################################################################
# Phase 7. Simulate disaster
###############################################################################
log "Stopping PostgreSQL and pgrwl receiver..."
stop_postgres
stop_pgrwl_receive
log "Removing original PGDATA to simulate data loss..."
rm -rf "$PGDATA"
###############################################################################
# Phase 8. Restore the base backup
###############################################################################
log "Restoring PGDATA from base backup..."
pgrwl restore --dest="$PGDATA" -c "$PGRWL_CONFIG"
chmod 0750 "$PGDATA"
chown -R postgres:postgres "$PGDATA"
# recovery.signal tells PostgreSQL to start in archive recovery mode.
touch "$PGDATA/recovery.signal"
###############################################################################
# Phase 9. Start pgrwl in serve mode for restore_command
###############################################################################
log "Starting pgrwl restore server..."
pgrwl daemon -m serve -c "$PGRWL_CONFIG" >"/tmp/pgrwl-basic/pgrwl-serve.log" 2>&1 &
PGRWL_SERVE_PID=$!
cat >>"$PGDATA/postgresql.conf" <<EOF
restore_command = 'pgrwl restore-command --serve-addr=127.0.0.1:7070 %f %p'
EOF
###############################################################################
# Phase 10. Start restored PostgreSQL and let it replay WAL
###############################################################################
log "Starting restored PostgreSQL cluster..."
pg_ctl -D "$PGDATA" -l "/tmp/pgrwl-basic/postgres-restored.log" start >/dev/null
wait_for_postgres
wait_until_out_of_recovery
###############################################################################
# Phase 11. Dump restored state and compare
###############################################################################
log "Dumping cluster state after recovery..."
pg_dumpall --quote-all-identifiers --restrict-key=0 >"/tmp/pgrwl-basic/after.sql"
log "Comparing dumps..."
if diff -u "/tmp/pgrwl-basic/before.sql" "/tmp/pgrwl-basic/after.sql" >"/tmp/pgrwl-basic/dump.diff"; then
log "SUCCESS: restored cluster matches original state"
echo "before: /tmp/pgrwl-basic/before.sql"
echo "after : /tmp/pgrwl-basic/after.sql"
echo "diff : /tmp/pgrwl-basic/dump.diff (empty)"
else
echo
echo "FAIL: restored cluster differs from original state"
echo "See diff: /tmp/pgrwl-basic/dump.diff"
exit 1
fi
Top comments (0)