HelperX

Posted on Jun 16

SQLite Backup Strategy for Production SaaS: WAL, Litestream, Recovery Tests

#database #sqlite #devops #architecture

Running SQLite in production for a multi-tenant SaaS means giving up the operational comfort of a managed database. No automatic backups, no point-in-time recovery dashboards, no on-call DBA. You build all of that yourself, or you find out the hard way during your first incident.

We've been running SQLite for HelperX — 200+ slot databases plus a global database — for over a year. In that time we've had two real recovery events and a few dozen test recoveries. Here's the backup architecture that survived contact with production.

The constraints we needed to satisfy

The backup strategy had to hit four numbers:

Metric	Target	Why
RPO (data loss tolerance)	< 60 seconds	We're processing user actions; missing the last hour is unacceptable
RTO (restore time)	< 15 minutes per slot	Customers expect their slot back fast
Cost	< $10/month total	Doesn't justify cloud DB pricing if backup eats the savings
Operational overhead	< 1 hour/week	Solo founder; no DBA budget

These constraints rule out a few common patterns:

Daily sqlite3 .backup cron jobs — RPO of 24 hours is way too lossy
Full filesystem snapshots only — slow restore, costly storage
Block-level replication — overkill, expensive, requires specialized infra
Managed backup services — most don't support SQLite, the ones that do are pricey

What works in our constraints: WAL-mode SQLite + Litestream for continuous streaming + filesystem snapshots as a secondary backup + monthly recovery tests.

Why WAL mode is the foundation

Before any backup tool, you need WAL (Write-Ahead Logging) mode enabled. WAL changes how SQLite handles writes — instead of writing directly to the main database file, changes go to a .wal file first and are eventually checkpointed to the main file.

For backups, WAL has two important properties:

Reads don't block writes and vice versa. A backup tool can read the database concurrently with application writes. Without WAL, a backup would block all writes (or fail).
Litestream and similar tools watch the WAL file to stream incremental changes to a remote target. This is what makes continuous backup possible.

Enabling WAL on every slot database:

function getSlotDb(slotId) {
  const db = new Database(`data/slots/${slotId}.db`);
  db.pragma('journal_mode = WAL');
  db.pragma('synchronous = NORMAL'); // important for backup tooling
  db.pragma('wal_autocheckpoint = 1000'); // checkpoint every 1000 pages
  db.pragma('busy_timeout = 5000');
  return db;
}

synchronous = NORMAL is a careful trade-off: it's faster than FULL, slightly less durable on crash, but compatible with how Litestream needs to read the WAL. We accept the trade because our continuous streaming gives us better effective durability than synchronous = FULL without streaming.

Litestream: the primary backup mechanism

Litestream is a small Go binary that watches a SQLite WAL file and streams its contents to a remote target (S3, B2, GCS, SFTP). It's the closest thing SQLite has to write-ahead replication.

Architecture:

SQLite (with WAL)  →  Litestream sidecar  →  S3/B2 bucket
                          │
                          └─ uploads WAL segments every N seconds

Our Litestream config:

# litestream.yml
dbs:
  - path: /app/data/global.db
    replicas:
      - type: s3
        bucket: helperx-backups
        path: global
        endpoint: s3.us-west-001.backblazeb2.com
        sync-interval: 30s
        retention: 168h
        retention-check-interval: 1h
        snapshot-interval: 24h

  - path: /app/data/slots/slot_abc.db
    replicas:
      - type: s3
        bucket: helperx-backups
        path: slots/slot_abc
        endpoint: s3.us-west-001.backblazeb2.com
        sync-interval: 30s
        retention: 168h
        snapshot-interval: 24h

  # ... one block per slot, generated dynamically

The configuration is generated by a small script when slots are created or deleted:

function regenerateLitestreamConfig() {
  const slots = db.prepare('SELECT id FROM slots WHERE active = 1').all();
  const config = {
    dbs: [
      { path: '/app/data/global.db', replicas: [GLOBAL_REPLICA_CONFIG] },
      ...slots.map(({ id }) => ({
        path: `/app/data/slots/${id}.db`,
        replicas: [{
          type: 's3',
          bucket: 'helperx-backups',
          path: `slots/${id}`,
          endpoint: S3_ENDPOINT,
          'sync-interval': '30s',
          retention: '168h',
          'snapshot-interval': '24h',
        }],
      })),
    ],
  };
  fs.writeFileSync('/etc/litestream.yml', yaml.dump(config));
  execSync('systemctl reload litestream');
}

Whenever a slot is added or removed, Litestream gets a hot reload of its config. No restart needed.

What Litestream actually backs up

Litestream uploads:

Periodic snapshots (every 24 hours) — a complete copy of the database at that point in time
WAL segments — small files representing the incremental changes since the last snapshot, uploaded every sync-interval

To restore, Litestream combines the most recent snapshot with all subsequent WAL segments. This is how you get sub-minute RPO without doing a full backup every minute.

Storage cost

Backblaze B2 pricing: $0.005 per GB/month for storage, $0.01 per GB egress (only paid on restores).

Our backup footprint:

200 slots × ~3 MB compressed snapshot = ~600 MB
24 hours of WAL segments × all slots = ~150 MB
Global database snapshot + WAL: ~5 MB
Retention overhead (7 days of history): ~5x multiplier = ~3.8 GB total

Monthly cost: 3.8 GB × $0.005 = $0.019. Practically free.

Egress is the more interesting line. If we restored everything once a month: 3.8 GB × $0.01 = $0.038. Also negligible.

We added Backblaze API calls for class A/B operations, which Litestream uses frequently. That ran us another $0.50/month. Total: under $1/month, well under our $10 budget.

The actual RPO with Litestream

With sync-interval: 30s, the worst-case RPO is roughly 30-45 seconds. WAL changes that happened in the last 30 seconds may not yet be uploaded when disaster strikes. Acceptable for our use case; for stricter requirements you can lower sync-interval to 10s at the cost of more API calls.

Filesystem snapshots: the secondary backup

Litestream is great for normal recovery scenarios. It does not protect against:

Logical corruption — a buggy migration that corrupts data across all slots
Operator error — DELETE FROM audit_log; without a WHERE clause
Litestream itself failing — bugs, misconfiguration, S3 outages

For these cases, we run filesystem snapshots of the entire data/ directory once per day:

#!/bin/bash
# /etc/cron.daily/snapshot-data.sh

set -euo pipefail

DATE=$(date +%Y%m%d-%H%M)
SNAPSHOT_DIR="/backup/snapshots/$DATE"

mkdir -p "$SNAPSHOT_DIR"

# Pause writes briefly using a lock
flock /app/data/.snapshot-lock -c "
  rsync -a --link-dest=/backup/snapshots/latest /app/data/ '$SNAPSHOT_DIR/'
"

# Update latest symlink
ln -sfn "$SNAPSHOT_DIR" /backup/snapshots/latest

# Prune old snapshots (keep 7 daily, 4 weekly, 3 monthly)
find /backup/snapshots -maxdepth 1 -type d -mtime +7 -name '20*-*' -exec rm -rf {} \;

# Encrypt and upload to off-site
tar czf - "$SNAPSHOT_DIR" | \
  gpg --symmetric --batch --passphrase-file /etc/backup-pass --cipher-algo AES256 | \
  aws s3 cp - "s3://helperx-snapshots/$DATE.tar.gz.gpg"

A few important details:

rsync --link-dest makes snapshots cheap. Unchanged files are hardlinked, not copied. A 480MB data directory only uses ~50MB of new disk per snapshot when most files haven't changed.
flock briefly serializes with the application to ensure a consistent snapshot. The lock is held for the duration of the rsync, typically 2-3 seconds.
GPG encryption before upload — the snapshot is encrypted on disk before it leaves our server. The B2 bucket holds only ciphertext.
The local copy is the fast restore path. Off-site is for disaster scenarios.

This catches the failure modes Litestream misses. If a bad migration runs at 02:30, the 02:00 snapshot has the pre-migration state and we can restore from there.

Recovery: the playbook

A backup that's never been tested is not a backup. Here are the documented recovery procedures.

Scenario 1: Single slot database corrupted

The most common scenario. A specific slot's database is unreadable, but everything else is fine.

# 1. Stop the worker process for this slot
curl -X POST http://localhost:3000/admin/slots/$SLOT_ID/stop

# 2. Move the corrupted database aside (don't delete yet)
mv /app/data/slots/$SLOT_ID.db /tmp/$SLOT_ID.db.corrupt

# 3. Restore from Litestream
litestream restore \
  -o /app/data/slots/$SLOT_ID.db \
  s3://helperx-backups/slots/$SLOT_ID

# 4. Verify integrity
sqlite3 /app/data/slots/$SLOT_ID.db 'PRAGMA integrity_check;'

# 5. Restart the worker
curl -X POST http://localhost:3000/admin/slots/$SLOT_ID/start

# 6. Sanity check via app dashboard

Typical recovery time: 2-4 minutes per slot. Well under our RTO target.

Scenario 2: Global database lost or corrupted

The global database holds user accounts, billing, plan info. Losing it is worse than losing any single slot.

# 1. Put the application in maintenance mode
curl -X POST http://localhost:3000/admin/maintenance/on

# 2. Stop the application
systemctl stop helperx

# 3. Restore from Litestream (use the closest point in time)
litestream restore \
  -o /app/data/global.db \
  s3://helperx-backups/global

# 4. Run integrity check
sqlite3 /app/data/global.db 'PRAGMA integrity_check;'

# 5. Run a cross-check script
node scripts/verify-global-restore.js

# 6. Restart and disable maintenance mode
systemctl start helperx
curl -X POST http://localhost:3000/admin/maintenance/off

Recovery time: 7-12 minutes.

The verify-global-restore.js script does sanity checks:

// scripts/verify-global-restore.js
const db = new Database('/app/data/global.db', { readonly: true });

const userCount = db.prepare('SELECT COUNT(*) as c FROM users').get().c;
const slotCount = db.prepare('SELECT COUNT(*) as c FROM slots WHERE active = 1').get().c;
const recentActivity = db.prepare(`
  SELECT COUNT(*) as c FROM users WHERE last_login > datetime('now', '-7 days')
`).get().c;

console.log(`Restored: ${userCount} users, ${slotCount} active slots, ${recentActivity} recent logins`);

if (userCount < 100 || slotCount < 100) {
  console.error('SUSPICIOUS: counts look too low');
  process.exit(1);
}

// Verify a known recent user exists
const canary = db.prepare("SELECT id FROM users WHERE email = ?").get('canary@helperx.app');
if (!canary) {
  console.error('FAIL: canary user not found');
  process.exit(1);
}

console.log('OK: restore verified');

The canary user is a real account we use only as a restore probe. If it's not present, the restore didn't include the data we expected.

Scenario 3: Full disaster — server is gone

Server-level failure. We need to spin up a new instance and restore everything.

# 1. Provision a new server, install dependencies
# 2. Pull the application code
git clone https://github.com/helperx/helperx-app.git
cd helperx-app && npm install

# 3. Restore global database
litestream restore -o /app/data/global.db s3://helperx-backups/global

# 4. Get list of slots from restored global db
SLOTS=$(sqlite3 /app/data/global.db 'SELECT id FROM slots WHERE active = 1')

# 5. Restore each slot in parallel
for slot in $SLOTS; do
  (litestream restore -o /app/data/slots/$slot.db s3://helperx-backups/slots/$slot) &
done
wait

# 6. Verify all restored
node scripts/verify-full-restore.js

# 7. Update DNS, start application

Recovery time: 45-90 minutes. Most of it is the parallel slot restores. With 200 slots, even at 30-60 seconds per restore, the parallelization gets it done in roughly the time of the slowest 5-10 slots.

This is the only scenario where we'd realistically miss our RTO target. We accept it because full disasters are extremely rare; the prior likelihood is dominated by Scenarios 1 and 2.

The monthly recovery test

This is the part most teams skip. We don't. Once a month, on the first Saturday at 09:00, we run a full recovery test against a staging environment.

The test:

Provision a fresh staging server
Run the Scenario 3 recovery procedure against the production backups
Run a battery of verification scripts
Tear down the staging environment

The verification scripts check:

All slot databases pass PRAGMA integrity_check
The most recent audit log entry is from within the expected RPO window
A randomly-chosen slot has the expected number of recent actions
The global database's user count matches a sanity range
Litestream config matches the slot list from the restored global db

The test takes about 90 minutes and costs ~$1 in cloud compute. We've caught real issues twice:

Once: Litestream had silently stopped backing up 3 slots due to a config drift. The recovery test failed because those slots' restores returned old data.
Once: a recent migration added a column the verification script didn't know about. Restore succeeded but verification flagged it — we updated the script.

Both bugs would have been catastrophic in a real incident. The monthly test caught them in a safe environment.

What we don't back up

Some things deliberately aren't in the backup:

The WAL itself. Litestream handles it; we don't snapshot it separately.
Application logs — rotated to a separate log aggregation system, not backed up to S3.
Temp directories. No durability requirement.
Browser session caches — large and regeneratable.
node_modules — regenerated from package-lock.json on restore.

Cutting these out keeps the backup footprint small and the restore fast.

Key takeaways

WAL mode is the prerequisite for any sane SQLite backup strategy.
Litestream gives you sub-minute RPO at near-zero cost for most workloads.
Filesystem snapshots are your secondary defense against logical corruption that Litestream can't help with.
Encrypt snapshots before they leave the server. Storage provider zero-knowledge is non-negotiable for user data.
Recovery is three scenarios, not one. Document each separately.
Monthly recovery tests are the only way to know your backups work. Skip them and you'll find out the hard way.
Verification scripts must include canaries — a known record that should always be present.
Total backup cost can be under $5/month for a typical multi-tenant SQLite SaaS. The math is much better than managed databases.

A backup strategy is only as good as the recovery test. Run them. The day you need them, you won't have time to discover they were broken.

HelperX ships with a recommended Litestream config and recovery scripts as part of the self-hosted distribution. Free 30-day trial — full source, no markup on infrastructure.

DEV Community