
Juan Torchia

Originally published at juanchi.dev

Barman Replacing pgbackrest: I Migrated My Postgres Backups in Production and Here's What I Found


The weekend I migrated from Vercel to Railway — the same one I mentioned when I talked about cold starts — I spent nearly twelve hours reading Postgres logs I'd never had to read that seriously before. It wasn't a tutorial. It was real production, real data, and the underlying question was always the same: if this blows up right now, how long does it take me to get back?

That left me with a backup obsession I never had back when I was managing shared hosting servers at 19. At that job, the backup strategy was "let's hope nothing happens" combined with provider snapshots that nobody had ever actually verified. I learned that the hard way when a client lost contact form data and the snapshot was three days old. Nothing critical, but the fear stuck.

So when I wrote that pgbackrest had stopped being maintained and started looking at alternatives, I wasn't coming from academic curiosity. I was coming from that old fear that stays with you once a restore has failed you.

Barman started trending on Hacker News (source) almost immediately after. The timing was suspiciously perfect. The community adopted it with near-instant enthusiasm: "finally a serious alternative" posts, Twitter threads with five-minute benchmarks, the classic technical FOMO. And I, in full honest-critic mode, sat down to do the actual migration before forming an opinion.

What I found is not what the community is telling you.


Barman PostgreSQL Backup in Production: What It Promises and What It Delivers

My thesis: Barman is technically solid, better documented than pgbackrest in 2025, and has an active team behind it (2ndQuadrant/EDB maintains it). But the HN conversation systematically omits the real operational cost of configuring it in a modern containerized stack. You swap a maintenance problem for an operational complexity problem. Nobody tells you that until you're already inside.

Barman — Backup and Recovery Manager — has existed since 2011. It's not new. What's new is that everyone suddenly "discovered" it because pgbackrest entered zombie mode. That doesn't automatically make it the right answer for every stack.

My concrete setup before the migration:

# Previous state - pgbackrest on Railway
# PostgreSQL 16 in Railway container
# Database: ~4.2 GB on disk
# Frequency: daily full backup + continuous WAL archiving
# Tested restore time: 18 minutes (measured, not estimated)
# Last active pgbackrest version: 2.49 (no relevant commits in 8 months)

The number I cared most about was that 18-minute restore time. It's the only number that matters when something explodes in production. Everything else is marketing.


The Real Migration: Commands, Friction, and the Numbers I Got

Barman on Railway is not trivial. The core problem is that Barman assumes an architectural model where the backup server has direct SSH access to the Postgres server. In a containerized stack, that simply doesn't exist the same way.

# Base Barman installation (on dedicated server or separate container)
sudo apt-get install barman barman-cli

# Check version — important for PG16 compatibility
barman --version
# Output: Barman 3.10.0 (confirmed pg16 support)

# /etc/barman.conf — base configuration
[barman]
barman_user = barman
configuration_files_directory = /etc/barman.d
barman_home = /var/lib/barman
# Directory where backups go — in my case, Railway persistent volume
log_file = /var/log/barman/barman.log
log_level = INFO
compression = gzip
# This matters: backup_method streaming avoids SSH on Railway
backup_method = streaming
streaming_archiver = on

# /etc/barman.d/railway-postgres.conf — specific server configuration
[railway-postgres]
description = "PostgreSQL 16 production Railway"
conninfo = host=<RAILWAY_HOST> user=barman dbname=postgres
streaming_conninfo = host=<RAILWAY_HOST> user=streaming_barman
backup_method = streaming
streaming_archiver = on
slot_name = barman_streaming_slot
# Without this parameter, Barman will complain constantly
streaming_archiver_name = barman_receive_wal
# Retention: 7 full backups or 14 days, whichever comes first
retention_policy = RECOVERY WINDOW OF 14 DAYS

The first gotcha: Barman needs two separate users in Postgres — one for the regular connection and another specifically for streaming replication. The main README doesn't clarify that. You find it in the extended documentation after thirty minutes of cryptic errors.

-- In PostgreSQL: create the users required by Barman
-- (md5 auth below means both users need passwords; placeholders are mine)
CREATE USER barman WITH SUPERUSER PASSWORD '<BARMAN_PASSWORD>';
CREATE USER streaming_barman WITH REPLICATION PASSWORD '<STREAMING_PASSWORD>';

-- Adjust pg_hba.conf to allow both connections
-- host replication streaming_barman <BARMAN_IP>/32 md5
-- host all barman <BARMAN_IP>/32 md5
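One note on authentication: the conninfo strings above don't carry a password. A common way to supply them is a .pgpass file on the Barman host. The placeholders here are mine, and keep in mind that libpq matches streaming connections against the literal database name replication:

# On the Barman host, as the barman user. The .pgpass format is
# hostname:port:database:username:password
echo '<RAILWAY_HOST>:5432:postgres:barman:<BARMAN_PASSWORD>' >> ~/.pgpass
echo '<RAILWAY_HOST>:5432:replication:streaming_barman:<STREAMING_PASSWORD>' >> ~/.pgpass
chmod 600 ~/.pgpass   # libpq ignores the file unless permissions are 0600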

After sorting that out, the first full backup ran without issues:

# Run initial backup
barman backup railway-postgres

# Relevant output (measured in my case):
# Starting backup for server railway-postgres
# Backup start at LSN: 0/8A000028
# Copying files...
# Backup size: 4.1 GB (vs 4.2 GB with pgbackrest — minimal difference with gzip)
# Elapsed time: 12 minutes 34 seconds
# Backup end at LSN: 0/8A001FF8
# Backup completed (start time: ..., elapsed time: 12 minutes, 34 seconds)

4.1 GB in 12 minutes and 34 seconds. With pgbackrest the same full backup took 14 minutes using lz4 compression. Barman with gzip is marginally slower on backup but produces similar file sizes. Not a difference that justifies anything on its own.

The number that matters — the restore:

# Test restore to staging server
barman recover railway-postgres latest /tmp/postgres-restore-test \
  --target-time "2025-07-14 10:00:00" \
  --remote-ssh-command "ssh postgres@staging"

# Measured time: 23 minutes 17 seconds
# (vs 18 minutes with pgbackrest — 5 minutes slower)

There's the uncomfortable number: the restore is 29% slower than with pgbackrest in my specific configuration using streaming backup. Why? Because backup_method = streaming in Barman is simpler to configure on Railway, but it's not as efficient as Barman's traditional rsync method. And the rsync method requires SSH, which on Railway is an additional headache.
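For contrast, this is roughly what the rsync-based configuration looks like when you do have SSH between servers. It's a sketch for a hypothetical VPS, not my Railway config; the parameters themselves are real Barman options:

# /etc/barman.d/vps-postgres.conf (hypothetical server, for comparison)
[vps-postgres]
description = "PostgreSQL 16 on a plain VPS with SSH access"
ssh_command = ssh postgres@<VPS_HOST>
conninfo = host=<VPS_HOST> user=barman dbname=postgres
backup_method = rsync
# Deduplicate unchanged files against the previous backup via hard links
reuse_backup = link
archiver = on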


The Gotchas Nobody Mentions in the HN Posts

1. Barman's mental model is pre-cloud.

Barman was designed for a world where you have a physical or virtual Postgres server and a separate backup server with SSH between them. That model is crystal clear and the tool executes it perfectly. But if you work with Railway, Render, Fly.io, or any modern containerized platform, you'll constantly be fighting against that assumed architecture.

This isn't a fatal criticism — there are solutions. But it's extra work that the enthusiastic posts don't mention. Same thing that happened with the supply chain attack on PyTorch Lightning: the ecosystem celebrates the tool until you hit the edge case nobody documented.

2. WAL archiving with a streaming slot has a non-trivial resource footprint.

-- Monitor replication slot usage (run this on your Postgres)
SELECT slot_name, active, restart_lsn, confirmed_flush_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
FROM pg_replication_slots
WHERE slot_name = 'barman_streaming_slot';

-- In my case during the first 24 hours:
-- slot_name              | active | lag
-- barman_streaming_slot  | t      | 2.3 MB  <- normal
-- After a Barman container restart:
-- barman_streaming_slot  | f      | 847 MB  <- accumulated while Barman was down

An inactive replication slot accumulates WAL. If the Barman container restarts and you don't bring it back up quickly, Postgres starts retaining WAL indefinitely. On Railway that can grow until it fills the disk and takes the whole server down. It's a risk that with pgbackrest + file-based WAL archiving was more controllable, because you could set a retention timeout without directly affecting Postgres.
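The mitigation on my side was making sure barman cron runs every minute, since that's the process that respawns receive-wal after a restart. The Debian package normally installs an entry like the one below, but inside a container there may be no cron daemon actually executing it, so verify it fires:

# /etc/cron.d/barman: the entry the Debian package normally ships.
# barman cron restarts the receive-wal process if it died with the container;
# in a containerized setup, confirm a cron daemon is really running this.
* * * * * barman /usr/bin/barman cron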

3. barman check lies when it's partially configured.

barman check railway-postgres

# Misleading output:
# Server railway-postgres:
#   PostgreSQL: OK
#   is_superuser: OK
#   PostgreSQL streaming: OK
#   wal_level: OK
#   replication slot: OK
#   directories: OK
#   retention policy settings: OK
#   backup maximum age: FAILED (no target...)  <- this error is actually a warning
#   encryption: OK (disabled)
#   backup minimum size: OK
#   wal compression: OK
#   ssh: FAILED (SSH connection is not active)  <- expected with streaming, but check fails anyway
# FAILED (see log for details)

The check reports FAILED even though the backup works perfectly with streaming. If you set up automatic alerts based on barman check output, you'll get constant false positives. I had to customize the monitoring script to ignore SSH checks when the method is streaming.
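For reference, a minimal sketch of that kind of wrapper. It parses the human-readable output shown above, so treat it as a starting point and verify it against your own Barman version before wiring it into alerting:

#!/usr/bin/env bash
# Sketch: treat barman check as healthy when the only FAILED lines are
# the ssh check (expected with backup_method = streaming) and the
# summary line it drags along.
SERVER="railway-postgres"

# barman check exits non-zero when anything fails, so capture the output
# and inspect it ourselves instead of trusting the exit code
OUTPUT="$(barman check "$SERVER" 2>&1 || true)"

if echo "$OUTPUT" | grep -v -e 'ssh:' -e '^FAILED' | grep -q 'FAILED'; then
    echo "barman check $SERVER: real failure" >&2
    echo "$OUTPUT" >&2
    exit 2
fi
echo "barman check $SERVER: OK (ssh check ignored)"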

4. The official documentation is good but Stack Overflow is outdated.

Barman 3.x changed substantially from 2.x. Most SO answers are for old versions. Parameters changed, some were deprecated, and there are configurations that generate warnings in 3.x that were silent in 2.x. Minor, but when you're debugging at 11pm with a broken backup, the difference matters.

Reminded me of managing the internet café and diagnosing connection drops with the place packed. There was no Stack Overflow in 2005. You learned from the logs or you didn't learn. That discipline of reading logs first before reaching for a search engine saved me more time here than I expected.


FAQ: Barman PostgreSQL Backup in Production

Does Barman work well on Railway or containerized platforms?

It works, but it requires extra effort. Barman's native model assumes SSH between servers. On Railway, the recommended method is backup_method = streaming, which avoids SSH but has limitations on restore speed and requires careful management of replication slots. If you're coming from a traditional VPS, the experience is much smoother.

What's the difference between Barman and pgbackrest in 2025?

The main difference today isn't technical — it's about maintenance: pgbackrest entered low-activity mode, while Barman is actively maintained by EDB (EnterpriseDB). Technically, pgbackrest has better support for parallel compression (lz4, zstd) and faster restore times in similar configurations. Barman has better streaming replication integration and more complete official documentation.
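To make the compression point concrete, these are the pgbackrest knobs I mean. The options are real pgbackrest settings; the values are illustrative, not a recommendation:

# pgbackrest.conf: parallel zstd compression across multiple processes
[global]
compress-type=zst
compress-level=3
process-max=4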

How long does a full restore take with Barman on a ~4 GB database?

In my Railway stack with backup_method = streaming and gzip, the restore took 23 minutes and 17 seconds. With traditional rsync configuration (requires SSH), typical reported times are 12–15 minutes for the same size. The difference comes from the backup method, not from Barman itself.

Is it safe to use Barman replication slots in production?

With the right precautions, yes. The concrete risk is that an inactive slot retains WAL indefinitely. Implement monitoring on pg_replication_slots to alert when lag exceeds a reasonable threshold (I use 500 MB as warning, 1 GB as critical). If the Barman container can go down, that monitoring is mandatory.
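A minimal sketch of that check, assuming psql can reach your Postgres through the usual libpq environment variables (PGHOST, PGUSER, and so on). Exit codes follow the Nagios convention:

#!/usr/bin/env bash
# Slot-lag check: warn at 500 MB of retained WAL, critical at 1 GB
WARN_BYTES=$((500 * 1024 * 1024))
CRIT_BYTES=$((1024 * 1024 * 1024))

# Bytes of WAL retained by the slot; empty if the slot doesn't exist
LAG=$(psql -At -c "
  SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
  FROM pg_replication_slots
  WHERE slot_name = 'barman_streaming_slot';")

if [ -z "$LAG" ]; then
    echo "CRITICAL: slot barman_streaming_slot not found"; exit 2
elif [ "$LAG" -ge "$CRIT_BYTES" ]; then
    echo "CRITICAL: slot retaining $LAG bytes of WAL"; exit 2
elif [ "$LAG" -ge "$WARN_BYTES" ]; then
    echo "WARNING: slot retaining $LAG bytes of WAL"; exit 1
fi
echo "OK: slot lag $LAG bytes"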

Does Barman completely replace pgbackrest for every use case?

No. If you have baremetal or a VPS with free SSH access between servers, Barman is an excellent option and probably better maintained today. If you're on a cloud-native containerized stack, Barman works but with friction. In that case it's also worth evaluating pgBackRest with self-managed maintenance, pg_dump for smaller databases with your own retention logic, or provider-specific solutions like Railway's automatic backups. There's no universal answer.

What happens if Barman loses its connection to Postgres during an active backup?

Barman detects the interruption and marks the backup as failed. It doesn't leave the backup in a corrupted state — that's a genuinely good design point. The next full backup runs from scratch. What it doesn't do automatically is retry: you need an external cron job or scheduler that checks the status and retries if the day's backup didn't complete.
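A sketch of the retry logic I mean, relying only on the fact that barman backup exits non-zero on failure. Schedule this from cron instead of calling barman backup directly:

#!/usr/bin/env bash
# Run the nightly backup, retry once after a pause if it fails,
# and exit non-zero so the scheduler or alerting can escalate
SERVER="railway-postgres"

if ! barman backup "$SERVER"; then
    echo "backup of $SERVER failed, retrying in 10 minutes" >&2
    sleep 600
    barman backup "$SERVER" || { echo "retry also failed, alert a human" >&2; exit 1; }
fi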


My Conclusion: The Community Is Celebrating Too Fast

Barman is good. I'm not saying don't use it. EDB maintains it actively, the official documentation is clear, and for the architectural model it was designed for — servers with free SSH between them — it's probably the best open-source tool available in 2025.

What I don't accept is the uncritical enthusiasm from the HN thread. The "pgbackrest died, use Barman" narrative that circulated that week ignores three concrete things I measured myself: the restore being 29% slower in containerized stacks, the real risk of inactive replication slots, and the configuration friction waiting for you if you're coming from a cloud-native architecture.

What I do buy: if you have a VPS or baremetal with free SSH access, Barman is worth migrating to today. If you're on Railway like me, the story is more complicated and the trade-off is real.

The operational vendor lock-in I mentioned in my thesis is exactly this: Barman makes you dependent on its operational model. It's not data lock-in — you can recover the backups without Barman if the files are accessible. But operationally, if the Barman server goes down, you need to bring it back up exactly the same way for it to work. That's operational debt worth putting on the table before you migrate.
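To be concrete about "recoverable without Barman": this is roughly the on-disk layout under barman_home on a Barman 3.x install. Verify the paths on your own setup before relying on them:

# Roughly the layout under barman_home (Barman 3.x; verify on your install):
#
#   /var/lib/barman/railway-postgres/base/<backup_id>/data/   <- plain copy of PGDATA
#   /var/lib/barman/railway-postgres/wals/                    <- archived WAL segments
#
# Worst case, you copy data/ into place and point restore_command at wals/
# by hand. Slow and manual, but not data lock-in.
ls /var/lib/barman/railway-postgres/base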

What I'd do right now if I were starting from scratch on Railway: I'd first evaluate whether Railway's automatic backup solution covers my SLAs before adding operational complexity. If it doesn't, Barman with streaming. With replication slot monitoring from day one. And with a documented, timed restore test before calling the migration done.
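And the restore test is literally this: wrap the recover in time and write the number down where the team can see it. The drill path is just an example:

# Timed restore drill; the destination directory is illustrative
time barman recover railway-postgres latest /tmp/restore-drill-$(date +%F)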

The number that matters is still the restore time. Everything else is documentation.


Source: Hacker News


This article was originally published on juanchi.dev
