pgbackrest is unmaintained: what I'm doing with my Postgres backups in production now
A backup is basically like writing a phone number on a napkin. The napkin exists, the number is there, you feel covered. But the day you actually need to call and the napkin has dissolved at the bottom of a jeans pocket that went through the wash — that's the day you realize you never had a recovery plan. You had the illusion of one.
That's what the HN thread about pgbackrest made me see: I had a wet napkin.
pgbackrest alternative postgres backup production: the context that actually matters
The thread hit 425 points on Hacker News with a comment that left little room for doubt: the primary maintainer no longer has time, PRs are piling up unreviewed, and the project's direction is on indefinite pause. It's not an abandoned repo yet — but it's not something you'd want to stake a database holding real user data on.
I was using it. Not as a first line, but as part of the incremental backup flow I built two years ago after an AI agent deleted my production database. After that episode I swore I'd never depend on a single recovery mechanism again. pgbackrest was the layer handling incremental backups with compression and time-based retention. It worked. Until it stopped making sense to keep depending on something with no active maintainer.
My thesis: the death of an infra project isn't the problem itself — it's the smoke detector telling you you've never actually tested recovering anything. The problem existed before. The HN thread just made it visible.
What I evaluated and how I thought about it
Before jumping to whatever the trendy alternative was, I forced myself to define what I actually needed. My stack: PostgreSQL 16 running on Railway, ~4.2 GB database, WAL archiving enabled, 30-day retention, and an informal RTO of "under 2 hours" that I had never concretely measured.
Three options I evaluated seriously:
1. WAL-G
Open source and actively maintained, originally developed at Citus Data (later absorbed into Microsoft) as the successor to WAL-E, with native support for S3, GCS, Azure, and local filesystem. The most concrete advantage: the binary is self-contained, with no weird dependencies to wrangle.
# Basic install on Debian/Ubuntu
curl -L https://github.com/wal-g/wal-g/releases/latest/download/wal-g-pg-ubuntu-20.04-amd64.tar.gz \
  | tar -xz -C /usr/local/bin/
# The tarball ships the binary under its release name; rename it so the
# later commands find it on PATH as plain wal-g
mv /usr/local/bin/wal-g-pg-ubuntu-20.04-amd64 /usr/local/bin/wal-g
# Minimum environment variables for S3
export WALG_S3_PREFIX="s3://my-backup-bucket/postgres"
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export PGDATA="/var/lib/postgresql/16/main"
# Full base backup
wal-g backup-push $PGDATA
# List available backups
wal-g backup-list --detail
# Fetch the latest base backup (PITR itself happens via WAL replay when Postgres starts in recovery)
wal-g backup-fetch $PGDATA LATEST
What surprised me: in my restore tests, WAL-G took 18 minutes to recover the 4.2 GB from S3, including WAL application up to the target point in time. I measured that number three times with a simple script.
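The "simple script" was nothing clever. A minimal sketch of what I ran, with the bucket prefix and staging data directory as placeholders for my setup:
#!/bin/bash
# Time a full base-backup fetch into an empty staging directory.
# Note: this measures the fetch only; WAL replay happens when Postgres
# starts against this directory in recovery mode.
set -euo pipefail
export WALG_S3_PREFIX="s3://my-backup-bucket/postgres"
STAGING_PGDATA="/var/lib/postgresql/16/staging"
rm -rf "$STAGING_PGDATA" && mkdir -p "$STAGING_PGDATA"
START=$(date +%s)
wal-g backup-fetch "$STAGING_PGDATA" LATEST
END=$(date +%s)
echo "backup-fetch took $((END - START)) seconds" | tee -a /var/log/pg_restore_timing.log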
2. Barman (Backup and Recovery Manager)
Maintained by EnterpriseDB, far more mature in terms of operational interface. The configuration curve is steep — there's a separate Barman server acting as backup receiver, which means additional infrastructure.
# Basic barman.conf (on the dedicated Barman server)
[barman]
barman_home = /var/lib/barman
barman_user = barman
log_file = /var/log/barman/barman.log
compression = gzip
# hardlinks for efficient incremental backups
reuse_backup = link
[my-postgres-server]
description = "Main production"
conninfo = host=postgres-host user=barman dbname=postgres
backup_method = rsync
# the rsync method runs over SSH, so Barman needs to know how to reach the host
ssh_command = ssh postgres@postgres-host
archiver = on
retention_policy = RECOVERY WINDOW OF 30 DAYS
# Check configuration
barman check my-postgres-server
# Backup
barman backup my-postgres-server
# List
barman list-backup my-postgres-server
# Restore
barman recover my-postgres-server latest /var/lib/postgresql/16/main \
--target-time "2025-07-10 14:30:00"
Barman's restore time in the same scenario: 31 minutes. Nearly double. The main reason is that Barman uses rsync by default and has coordination overhead between servers. With backup_method = postgres (streaming) it comes down, but it still doesn't win.
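If you do go the streaming route, the switch is a couple of lines in the server section. The option names below are Barman's; the host and user values are placeholders:
[my-postgres-server]
description = "Main production"
conninfo = host=postgres-host user=barman dbname=postgres
# pg_basebackup over the streaming protocol instead of rsync+SSH
backup_method = postgres
streaming_conninfo = host=postgres-host user=streaming_barman dbname=postgres
streaming_archiver = on
slot_name = barman
retention_policy = RECOVERY WINDOW OF 30 DAYS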
3. Plain pg_dump with manual rotation
The most honest option. The one everyone knows, nobody wants to use for serious production, and the one that often is the only thing left standing when everything else fails.
#!/bin/bash
# pg_dump backup script — no magic, no external dependencies
# Saved as /usr/local/bin/pg_daily_backup.sh
set -euo pipefail
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
DB_NAME="my_database"
BACKUP_DIR="/mnt/backups/postgres"
RETENTION_DAYS=14
# Compressed backup (custom format, restorable with pg_restore)
mkdir -p "$BACKUP_DIR"
pg_dump -Fc \
  --no-password \
  -h "$PGHOST" \
  -U "$PGUSER" \
  -d "$DB_NAME" \
  > "$BACKUP_DIR/${DB_NAME}_${TIMESTAMP}.dump"
# Automatic rotation
find "$BACKUP_DIR" -name "*.dump" -mtime +$RETENTION_DAYS -delete
# Log the actual dump size
du -sh "$BACKUP_DIR/${DB_NAME}_${TIMESTAMP}.dump" >> /var/log/pg_backup.log
echo "Backup completed: ${TIMESTAMP}" >> /var/log/pg_backup.log
Restoring that dump (with pg_restore, since it's custom format) takes 23 minutes for the 4.2 GB. Faster than Barman, slower than WAL-G, but with no PITR (Point in Time Recovery). If you need to recover to 14:37 and your closest backup is from 14:00, you just lost 37 minutes of data. That trade-off is the one that really stings in real production.
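For completeness, getting one of those dumps back is pg_restore against a fresh database. A sketch that assumes the same names as the backup script above (the dump filename is just an example):
# Restore a custom-format dump into a freshly created database
createdb -h "$PGHOST" -U "$PGUSER" my_database_restored
pg_restore -h "$PGHOST" -U "$PGUSER" \
  -d my_database_restored \
  --jobs=4 \
  --no-owner \
  "/mnt/backups/postgres/my_database_20250710_030000.dump"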
What popularity benchmarks don't tell you
The problem with choosing infra tools by GitHub stars or how often people mention them on Reddit is that popularity measures adoption, not fitness for your specific case. pgbackrest has over 3,000 stars. That did nothing for me when I needed to understand how long my particular database would actually take to recover.
What helped: measuring. Three distinct scenarios:
| Tool | Full restore (4.2 GB) | PITR available | Infra overhead | Storage cost (30 days) |
|---|---|---|---|---|
| WAL-G | 18 min | Yes | Minimal | ~$0.92/mo on S3 |
| Barman | 31 min | Yes | Dedicated server | ~$0.92/mo + EC2 |
| pg_dump | 23 min | No | None | ~$0.85/mo on S3 |
Storage cost on S3 is nearly identical because WAL-G compresses aggressively and the archived WAL is reasonably small for a database without heavy write throughput. But if I were on Railway's or Supabase's managed Postgres offering, external WAL archiving would either come pre-solved by the platform or simply not be available for manual configuration.
That detail made me revisit why I moved certain things to my own infrastructure — control over how and where you store your recovery data is not a minor concern when the provider decides which features you get to touch.
The mistakes I made (and that you'll make if you don't check right now)
Mistake 1: I never actually restored anything
I'd had backups running for two years. I never did a full restore to a staging environment to measure real time. The number I had in my head ("Postgres backup, under an hour") was completely made up. My first real restore to a clean environment took 47 minutes with pgbackrest — almost double what I'd assumed.
If you haven't run a full restore in the last month, you don't have a recovery plan. You have a wet napkin.
Mistake 2: confusing backup with archive
WAL archiving and base backups are two different things that work together. If you only have WAL archiving without a recent base backup, restore time will be proportional to how many WAL files you need to apply from the last base backup. In my case, with a weekly base backup and continuous WAL, the worst case was 7 days of WAL — several additional minutes of replay.
# See how many WAL segments exist since the last backup
# In WAL-G:
wal-g wal-show
# Expected output — pay attention to the "segments" count
# +---------------------------+-----------+----------+
# | Start                     | End       | Segments |
# +---------------------------+-----------+----------+
# | 2025-07-04T03:00:00+00:00 | current   | 1842     |
# +---------------------------+-----------+----------+
# 1842 segments = non-trivial replay time
Mistake 3: ignoring WAL size in production
My database is 4.2 GB of data, but it generates roughly 180 MB of WAL per day. Over 30 days: ~5.4 GB of additional archived WAL. If you don't measure it, the storage cost creeps up silently. On S3 it's cheap — on other providers it can surprise you.
-- Measure WAL generation in the last 24 hours
SELECT
count(*) as wal_files_generated,
pg_size_pretty(sum(size)) as total_size
FROM pg_ls_waldir()
WHERE modification > now() - interval '24 hours';
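The pg_ls_waldir() query is an approximation, since segments get recycled and renamed. If you want an exact byte count, snapshot the WAL position and diff it a day later. A minimal sketch; the literal LSN below stands in for whatever value you recorded 24 hours earlier:
-- Step 1: record the current WAL position somewhere
SELECT pg_current_wal_lsn();
-- Step 2, 24 hours later: exact bytes of WAL generated since the snapshot.
-- Replace '0/5A000000' with the LSN recorded in step 1.
SELECT pg_size_pretty(
  pg_wal_lsn_diff(pg_current_wal_lsn(), '0/5A000000'::pg_lsn)
) AS wal_generated_last_24h;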
This kind of measurement is exactly what production logs reveal when you force yourself to look at them cold, without the adrenaline of an active incident.
FAQ: pgbackrest alternative postgres backup production
Is WAL-G a drop-in replacement for pgbackrest?
Functionally yes, in most cases. Both handle base backups + WAL archiving with PITR. The main difference is in configuration: WAL-G is simpler to get started (one binary, environment variables) while pgbackrest has a more expressive config file. If you already have pgbackrest configured, migrating to WAL-G means rewriting the config and doing a full base backup from scratch — you can't reuse existing backups in a different format.
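The cutover itself is short if you can live with a brief overlap where both tools keep their own archives. Roughly, with my placeholder bucket name:
# 1. Point WAL-G at a new prefix and take a fresh full base backup
export WALG_S3_PREFIX="s3://my-backup-bucket/postgres-walg"
wal-g backup-push /var/lib/postgresql/16/main
# 2. Switch archive_command from pgbackrest to WAL-G in postgresql.conf:
#    archive_command = 'wal-g wal-push %p'
#    then reload: SELECT pg_reload_conf();
# 3. Keep the old pgbackrest repository around (read-only) until its
#    retention window expires, so PITR coverage never has a gap.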
Is Barman still worth it if you already have dedicated infra?
Yes, especially if you manage multiple Postgres instances and need a centralized operational interface with auditing. The overhead of a separate Barman server pays off when you're managing 5+ instances from one place. For a single instance like mine, it's overkill with a real cost.
Is pg_dump enough for production or just for development?
Depends on your RTO and RPO. If you can tolerate data loss up to N hours (where N is your dump frequency) and a 20–40 minute restore doesn't break any SLA, pg_dump with automated rotation is completely valid. The real limitation is the absence of PITR: you can't recover to an exact point between two dumps. For critical transactional databases, that's usually unacceptable.
How do I configure WAL archiving on Railway or Supabase?
On Railway with a custom Postgres install you can configure archive_mode = on and archive_command if you have access to postgresql.conf. On Supabase, WAL archiving is internal to the service — you can use Point in Time Recovery within the platform but you can't export WAL to external storage directly. That's a recovery vendor lock-in worth evaluating based on how critical your data is.
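For the self-hosted Railway case, the postgresql.conf side is just two lines, shown here with WAL-G's archive command since that's what I ended up using:
# postgresql.conf: ship every completed WAL segment to object storage
archive_mode = on
archive_command = 'wal-g wal-push %p'
# and for restores, the matching fetch command:
# restore_command = 'wal-g wal-fetch %f %p'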
What base backup frequency makes sense?
For most cases: daily base backup + continuous WAL. A weekly base backup with continuous WAL is acceptable if the database grows slowly (under 500 MB/day of WAL). With daily base backups, WAL replay during restore is minimal. With weekly, in the worst case you need to apply 7 days of WAL — that can add tens of minutes depending on write volume.
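In WAL-G terms, daily base backups plus pruning is one small script on a cron. A sketch; the schedule and the retention count are my choices, not anything the tool prescribes:
#!/bin/bash
# /usr/local/bin/walg_daily.sh
# Run from cron, e.g. "0 3 * * * postgres /usr/local/bin/walg_daily.sh"
# Daily full base backup, then keep only the last 30 fulls.
set -euo pipefail
export PGDATA="/var/lib/postgresql/16/main"
wal-g backup-push "$PGDATA"
wal-g delete retain FULL 30 --confirm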
Is it worth waiting to see if pgbackrest picks up maintenance again?
No. In infra systems, a maintainer who announces they don't have time rarely comes back with more energy. The window between "project without active maintenance" and "critical vulnerability with no patch" can be short. Migrate now, with time, not during an incident. The cost of migrating calmly is infinitely lower than migrating under pressure — something I learned the night I wiped a production server with rm -rf in my first week on the job.
What I chose and why it's not the right answer for everyone
I went with WAL-G + pg_dump as a second line.
WAL-G handles incremental backups with PITR. pg_dump runs every 24 hours to a separate S3 bucket as an independent fallback — no third-party tool dependencies, no special binaries, just pg_dump and aws s3 cp. If WAL-G disappeared tomorrow, I have yesterday's dump.
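The second line is deliberately boring. Something in this spirit, with the fallback bucket name as a placeholder:
# Daily fallback: dump straight to a separate bucket, nothing but pg_dump and the AWS CLI
pg_dump -Fc -h "$PGHOST" -U "$PGUSER" -d my_database \
  | aws s3 cp - "s3://my-fallback-bucket/postgres/my_database_$(date +%Y%m%d).dump"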
The criterion I used wasn't "which tool has more stars" — it was "how fast can I recover and how many dependencies are in the recovery chain." Fewer dependencies in the critical path is better. When you have to restore a database, every additional piece that can fail is a problem you don't need.
What I wouldn't choose for my case: Barman on a single instance. The operational overhead doesn't close the deal. It can make sense for teams with multiple databases and a dedicated DBA — but that's not my reality.
The uncomfortable part of all this: I spent two years with pgbackrest without ever measuring a real restore. The HN thread didn't break my infrastructure — it broke my false sense of security. And in the long run, that was better than keeping a wet napkin in my pocket.
If you want to audit what else might be in "works until it doesn't" state in your data layer, the post about migrating from Notion to Markdown has some of that same flavor: the silent dependency that only hurts when you try to leave.
And if you make the switch to WAL-G, measure the restore. Don't assume it. The real number is always different from the imagined one.
Migrating from pgbackrest or evaluating your options? Reach out — I'm putting together a repo of real WAL-G configs specifically for Railway.
This article was originally published on juanchi.dev