Alex Kernel

Posted on Jun 18

Automate Database Backups Across a Server Fleet with AI: 7 Recipes for 2026

#claude #mcp #backend

If you are still SSH-ing into four boxes to confirm last night's dump actually
ran, automated database backups across servers are exactly the kind of chore
that should be running on autopilot — not eating your evenings. In this guide we
wire up postgres backup automation for a small fleet (one primary, two
streaming replicas, an offsite backup box) and drive the whole thing from a
single AI interface using the MCP server
remote-agents. No agent
controller, no central scheduler to babysit — every host runs its own cron and
reports back per-host.

In short: every database host runs a lightweight agent connected to an
encrypted relay room. Your AI assistant (Claude or opencode) sees the whole
fleet as one machine and calls tools like schedule_add, fleet_exec,
file_stat, fleet_git, send_file, and set_mode. Cron jobs live on the
host, so a nightly pg_dump keeps running even if the relay link drops.
Payloads are end-to-end encrypted (AES-GCM-256) — the relay forwards
ciphertext blind.

Architecture: four hosts, one room
Installing agents and tagging db hosts
Quick summary of the seven recipes
Why and when to use this
Recipe 1 — Tag db hosts and look around in plan mode
Recipe 2 — Schedule a nightly pg_dump with retention
Recipe 3 — Verify backups are fresh across the fleet
Recipe 4 — Run a schema migration across replicas safely
Recipe 5 — Replication-lag and health checks
Recipe 6 — Read a prod .env read-only
Recipe 7 — Ship a dump host→host, SHA-256 verified
FAQ
Wrapping up

Architecture: four hosts, one room

Let's take a realistic small-SaaS database tier: a Postgres primary, two
streaming replicas for read scaling and failover, and an offsite box that only
stores compressed dumps. We'll also nod to a Windows SQL Server host at the end
to show the cross-platform story.

Host	Role	Tag	Stack
`db-primary`	Postgres primary (writes)	`db,primary`	PostgreSQL 16, Linux
`db-replica-1`	Streaming replica	`db,replica`	PostgreSQL 16, Linux
`db-replica-2`	Streaming replica	`db,replica`	PostgreSQL 16, Linux
`backup-box`	Offsite dump target	`backup`	rsync, gzip, Linux
`sql-win`	SQL Server (optional)	`db,windows`	SQL Server 2022, Windows

All agents join one relay room (say dbfleet). Tags let you address a group
in a single call: target="db,replica" hits both replicas, target="os:linux"
hits every Linux box, and target="all" sweeps the whole fleet. Note that
multi-tag targets match a host carrying either tag, so db,primary resolves
to every database host.

            ┌────────────── AI (Claude / opencode) ──────────────┐
            │            remote-agents (MCP, stdio)               │
            └───────────────────────┬────────────────────────────┘
                                    │ wss:// (E2E-encrypted)
                            ┌───────┴────────┐  relay (CF Worker or self-hosted)
                            │   room=dbfleet  │
        ┌───────────┬───────┴────┬────────────┬─────────────┐
   db-primary   db-replica-1  db-replica-2  backup-box   sql-win
  (db,primary)  (db,replica)  (db,replica)  (backup)    (db,windows)

The relay is interchangeable: use the hosted Cloudflare Worker, or run your own
Rust relay with remote-agents-relay --bind 0.0.0.0:8080 and point agents at
ws://your-host:8080. Either way the relay only ever sees encrypted frames.

Installing agents and tagging db hosts

On each host you install the package once and start the agent with the right
tags. For 24/7 database servers you'll want the background service form so the
agent survives reboots:

# once on every machine
npm i -g remote-agents

# the Postgres primary
remote-agents run --relay wss://<relay> --room dbfleet --token <secret> \
  --name db-primary --tags db,primary

# the two replicas
remote-agents run ... --name db-replica-1 --tags db,replica
remote-agents run ... --name db-replica-2 --tags db,replica

# the offsite backup target
remote-agents run ... --name backup-box --tags backup

# install as a systemd service for 24/7 hosts (instead of `run`)
remote-agents install --relay wss://<relay> --room dbfleet --token <secret> \
  --name db-primary --tags db,primary

Then confirm the whole fleet is online. Ask your AI:

"List the agents in the room."

Under the hood that calls list_agents, which returns each peer's OS family,
distro, kernel, shell, tags, and an update_available flag when a newer agent
version is out. It's a flat peer network — there is no controller node; every
host both executes and dispatches.

Also read: Run a security audit across your whole fleet

Quick summary of the seven recipes

# Recipe	What it does
1. Plan-mode look	Tag db hosts, set them to read-only `plan`, and inspect Postgres safely before touching anything.
2. Nightly pg_dump	`schedule_add` a `0 3 * * * *` host-local cron that dumps + gzips and prunes anything older than 14 days.
3. Verify freshness	`fleet_exec` + `file_stat` to confirm the dump exists, its size, and its mtime on every host at once.
4. Replica migration	`fleet_git` pull migrations + `fleet_exec` across `db,replica` with per-host results so a failed node is obvious.
5. Lag / health check	`fleet_exec target="db,replica"` running `pg_stat_replication` to catch lag and broken streaming.
6. Read prod config	`read_file` a production `.env` in `plan` mode — zero write risk.
7. Ship a dump	`send_file` a dump from primary to `backup-box` over a direct UDP channel, SHA-256 verified.

Why and when to use this

Database fleets are where "I'll just SSH in real quick" goes to die. A single
forgotten retention prune fills a disk; a silent pg_dump failure isn't noticed
until the day you need the dump; a schema migration applied to the primary but
not the replicas causes mysterious read errors hours later. Here's where driving
the fleet through one AI interface pays off:

Cron that survives the network. schedule_add installs the cron on the host itself, so your nightly 0 3 * * * * dump runs even if the relay link is down, your laptop is closed, or the AI session has long ended. The schedule is not tied to your connection.
Per-host truth in one call. fleet_exec and file_stat give you the dump size and mtime on all four boxes in a single round trip instead of four SSH sessions. One failing host does not sink the batch — it just shows up red in the aggregated result.
Replica maintenance without drift. fleet_git + fleet_exec against the db,replica tag apply the same migration to both replicas and report each one separately, so a half-applied change is impossible to miss.
Safe-by-default reads. plan mode makes a host read-only, so you can pull a prod .env or run pg_stat_replication during an incident with no chance of fat-fingering a write.
Cross-platform reach. target="os:linux" hits the Postgres boxes; a Windows SQL Server host joins the same room and answers os:windows targeting with a sqlcmd backup instead of pg_dump.

This is not a replacement for a declarative backup orchestrator or a managed
RDS-style service. It shines for self-hosted databases, small-to-mid fleets,
dev/staging tiers, and the operational glue around backups that nobody wants to
maintain in YAML.

Recipe 1 — Tag db hosts and look around in plan mode

Before automating anything, look first and touch nothing. Set every database
host to plan mode — read-only read_file, git_status, and safe exec
only — and get oriented.

Flip the database hosts to read-only.
Confirm Postgres is up and check current data directory size.
Eyeball where existing backups (if any) land.

set_mode    target="db,replica"   mode=plan
set_mode    agent_id=db-primary   mode=plan

fleet_exec  target="db,primary"   command="systemctl is-active postgresql && psql -tAc 'SELECT version();'"
fleet_exec  target="db,primary"   command="du -sh /var/lib/postgresql/16/main; df -h /var"
list_dir    agent_id=backup-box   path=/srv/backups

Tip: plan mode still allows a read-only exec, so psql -tAc 'SELECT ...'
and du -sh work fine. The moment a command would write, the agent rejects it.
Start every incident here — you literally cannot break prod from plan.

The aggregated reply comes back per host: you'll see db-primary reporting
active and a 240 GB data directory, while a replica that's catching up might
report differently. That per-host shape is the whole point — no guessing which
box you're looking at.

Recipe 2 — Schedule a nightly pg_dump with retention

This is the core of scheduled pg_dump automation. We install a host-local
cron on db-primary that dumps the database, gzips it, and prunes anything
older than 14 days. Because schedule_add writes the cron to the host, it
keeps firing at 03:00 every night whether or not anyone is connected.

Remember the cron is a 6-field spec — sec min hour day month dow — so
"3 AM nightly" is 0 0 3 * * *.

Switch the primary to edit so the agent may create the dump file and the wrapper script.
Add the nightly dump-and-prune schedule.
List schedules to confirm it registered.

# the command the cron runs on db-primary (one line, gzip + 14-day prune)
pg_dump -Fc -U postgres app_prod \
  | gzip > /srv/backups/app_prod-$(date +\%F).sql.gz \
  && find /srv/backups -name 'app_prod-*.sql.gz' -mtime +14 -delete

set_mode      agent_id=db-primary  mode=edit

schedule_add  agent_id=db-primary  name=nightly-pgdump \
  cron="0 0 3 * * *" \
  command="pg_dump -Fc -U postgres app_prod | gzip > /srv/backups/app_prod-$(date +%F).sql.gz && find /srv/backups -name 'app_prod-*.sql.gz' -mtime +14 -delete"

schedule_list agent_id=db-primary

The -Fc custom format gives you a compressed, pg_restore-friendly archive; a
240 GB cluster typically lands around a 30–45 GB gzipped dump depending on how
much of it is indexes and TOAST. The -mtime +14 -delete clause keeps roughly
two weeks of dumps and nothing more, so the disk doesn't quietly fill.

Note: for a mysqldump cron the only thing that changes is the command —
mysqldump --single-transaction app_prod | gzip > ... — wrapped in the exact
same schedule_add. And for the Windows SQL Server host you'd schedule a
sqlcmd -Q "BACKUP DATABASE ..." instead. The scheduler, retention, and
verification flow are identical across all of them.

To remove the schedule later (say you've migrated to WAL archiving):

schedule_remove  agent_id=db-primary  name=nightly-pgdump

Recipe 3 — Verify backups are fresh across the fleet

A backup you never check is a backup you don't have. This recipe answers the only
question that matters at 9 AM: did last night's dump actually run, and is it the
right size, on every host? We combine fleet_exec to find the newest dump with
file_stat to read its exact size and mtime.

Find the newest dump file on each db host.
file_stat that file for an authoritative size and modification time.
Flag anything older than ~26 hours as stale.

fleet_exec  target="db,primary"  \
  command="ls -t /srv/backups/app_prod-*.sql.gz 2>/dev/null | head -n1"

file_stat   agent_id=db-primary  path=/srv/backups/app_prod-2026-06-18.sql.gz

fleet_exec  target="db,primary"  \
  command="find /srv/backups -name 'app_prod-*.sql.gz' -mmin -1560 -printf '%p %s bytes\n' || echo NO_FRESH_BACKUP"

file_stat returns the size and mtime directly, which is more trustworthy than
parsing ls output. The -mmin -1560 window (26 hours) gives the 03:00 job some
slack — anything that didn't write a fresh file in that window prints
NO_FRESH_BACKUP, and because results are aggregated per host, a replica
that skipped its dump stands out instantly while the healthy boxes report a clean
~42 GB file.

Important: size is your early-warning signal. A dump that suddenly drops
from 42 GB to 200 KB almost always means pg_dump errored out mid-run (bad
credentials, a dropped connection, a full disk) and gzipped only the error.
Watching size over time catches silent failures that a green exit code hides.

You can fold this into a once-a-day self-check by wrapping the find in another
schedule_add that appends to a log — same host-local cron pattern as Recipe 2.

Also read: Build a cross-platform CI test farm

Recipe 4 — Run a schema migration across replicas safely

Replicas in database fleet management drift the moment a migration lands on
one box but not another. With fleet_git to pull the migration repo and
fleet_exec to apply it across the db,replica tag, both replicas move in
lockstep and you get a per-host verdict.

On true streaming replicas you don't run DDL directly (they're read-only and
replay WAL from the primary). This recipe fits the common real-world setup where
"replica" hosts also run migration tooling against a logical target, or where you
roll out an out-of-band maintenance script. Adjust to your topology.

Pull the latest migrations on both replicas at once.
Apply them, capturing per-host success/failure.
Re-check that both ended on the same schema version.

fleet_git   target="db,replica"  op=pull  repo=/srv/migrations  remote=origin  branch=main

fleet_exec  target="db,replica"  \
  command="cd /srv/migrations && ./run_migrations.sh 2>&1 | tail -n 5 || echo MIGRATION_FAILED"

fleet_exec  target="db,replica"  \
  command="psql -tAc 'SELECT max(version) FROM schema_migrations;'"

The payoff is the per-host aggregation: if db-replica-2 returns
MIGRATION_FAILED while db-replica-1 reports the new version, you know exactly
which node to fix — no scrolling through interleaved SSH output trying to figure
out whose error you're reading. One failing replica does not block the batch from
finishing on the healthy one.

Tip: put the hosts in edit mode (not bypass) for migrations. In edit,
the agent auto-creates a backup before any overwrite, so a botched
write_file to a config or script leaves you the original. Drop back to plan
the moment you're done.

Recipe 5 — Replication-lag and health checks

The single most useful replica maintenance check is replication lag: how far
behind the primary each replica is. From the primary you query
pg_stat_replication; from each replica you check pg_last_wal_replay_lag.
Both are one fleet_exec away.

From the primary, list connected replicas and their byte lag.
From each replica, read its replay delay.
Alert on anything beyond your threshold.

fleet_exec  target="db,primary"  \
  command="psql -xtAc \"SELECT client_addr, state, pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes FROM pg_stat_replication;\""

fleet_exec  target="db,replica"  \
  command="psql -tAc 'SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag;'"

fleet_exec  target="db,replica"  \
  command="psql -tAc 'SELECT CASE WHEN pg_is_in_recovery() THEN 1 ELSE 0 END;' | grep -q 1 && echo OK_RECOVERY || echo NOT_A_REPLICA"

Run from plan mode — these are all read-only queries, so there's zero risk even
on a busy primary. The first query, sent to db-primary, shows you both replicas
as rows with their state (streaming is what you want) and lag_bytes. The
second, fanned out across the db,replica tag, returns each replica's replay lag
side by side. A replica that has fallen out of streaming shows up either as a
missing row in the primary's view or a ballooning replay_lag.

Note: a healthy LAN replica usually sits under a few hundred milliseconds
of replay lag. A sudden jump to tens of seconds (or a replica vanishing from
pg_stat_replication entirely) means streaming broke — often a WAL retention
or network issue. Catching it here, across both replicas in one call, beats
discovering it when a read query returns stale data.

Recipe 6 — Read a prod .env read-only

Sometimes you just need to confirm a connection string or a backup target path
without any chance of editing it. With the host in plan mode, read_file
hands you the file contents and nothing about the session can write.

Ensure the host is in plan mode.
read_file the config.
Optionally grep for one key with a read-only exec.

set_mode   agent_id=db-primary  mode=plan

read_file  agent_id=db-primary  path=/srv/app/.env

exec       agent_id=db-primary  command="grep -E '^(PGHOST|PGDATABASE|BACKUP_DIR)=' /srv/app/.env"

Because the agent is read-only, even if you (or the AI) followed up with a
write_file, it would be rejected. On top of that, a hard deny-list —
covering paths like /etc/shadow and /boot — applies on every host in
every mode, including bypass, so the truly sensitive system files are never
readable or writable through the agent regardless of how you set things up.

Important: plan mode is the right default for production database hosts.
Keep them in plan day to day and only flip a single host to edit for the
specific window you're changing something — then flip it straight back. It
turns "I'll be careful" into "I literally can't write right now."

Recipe 7 — Ship a dump host→host, SHA-256 verified

Finally, get the dump off the primary and onto the offsite backup-box.
send_file streams a file host→host over a direct UDP data channel (opened on
demand, with automatic relay fallback) and verifies it end-to-end with a
SHA-256 checksum. Because the receiving side performs a write, the
destination host needs edit or bypass.

Put backup-box in edit so it can write the incoming file.
send_file the latest dump from db-primary to backup-box.
Confirm arrival and size with file_stat on the receiver.

set_mode     agent_id=backup-box  mode=edit

send_file    agent_id=db-primary  \
  path=/srv/backups/app_prod-2026-06-18.sql.gz \
  to=backup-box \
  dest=/srv/offsite/app_prod-2026-06-18.sql.gz

file_stat    agent_id=backup-box  path=/srv/offsite/app_prod-2026-06-18.sql.gz

The transfer goes peer-to-peer where the network allows, so a 42 GB dump doesn't
have to round-trip through the relay — and if a direct UDP path can't be
established, it transparently falls back to the relay so the copy still
completes. The SHA-256 check means you find out about a corrupted transfer
immediately, not when a restore fails three weeks later. If you'd rather pull
from the receiving side, transfer_get is the mirror-image tool.

Tip: in the browser fleet-chat panel there's a Files view that does the same
move with a live progress bar, plus a binary-safe chunked download of any file
through the relay. Handy when you want to eyeball a dump's size or grab one to
your laptop without dropping to the CLI.

Set backup-box back to plan when the copy is done, so the offsite store is
read-only at rest:

set_mode  agent_id=backup-box  mode=plan

FAQ

How is this different from a cron job I'd write myself?

It's the same idea, but the cron is installed and managed through one AI
interface across the whole fleet, and schedule_add registers it on the host
so it survives relay outages and reboots. The difference is operational: you add,
list, verify, and remove backup schedules on four boxes from one place, and you
get per-host confirmation that each dump actually ran, instead of trusting four
independent crontabs you never look at.

Will my nightly pg_dump still run if the relay or my laptop is offline?

Yes. schedule_add writes a host-local cron that lives entirely on the database
server. The relay link and your AI session are only needed to create, list, or
remove the schedule — not to run it. Your 0 0 3 * * * dump fires at 03:00
regardless of whether anything is connected.

Is it safe to point this at a production database?

Start every host in plan mode, which is read-only — you can run
pg_stat_replication, read_file a .env, and check dump freshness with zero
write risk. Flip a single host to edit only for the specific change you're
making (edit auto-backs-up before overwriting), and a hard deny-list on paths
like /etc/shadow and /boot applies even in bypass. All command payloads and
results are end-to-end encrypted; the relay forwards ciphertext and never sees
your data or keys.

Can I back up MySQL or SQL Server the same way?

Yes — the workflow is database-agnostic. Swap the command inside schedule_add:
mysqldump --single-transaction ... | gzip for MySQL, or
sqlcmd -Q "BACKUP DATABASE ..." on a Windows host targeted via os:windows.
The scheduling, 14-day retention prune, freshness verification, and SHA-256
transfer all stay identical.

How do I keep the agents themselves up to date across the fleet?

Run fleet_update_check, which tells you which idle hosts have a newer agent
version available, then run npm i -g remote-agents@latest on those hosts. It
won't interrupt a host mid-task, so you can update the fleet without disrupting a
running backup or migration.

Wrapping up

One MCP interface, five hosts, zero manual SSH sessions: a nightly scheduled
pg_dump with 14-day retention runs host-local on the primary, freshness is
verified across the fleet with file_stat, replicas get migrations and lag checks
in lockstep, prod config is read safely in plan mode, and the dump lands offsite
with a SHA-256 guarantee. That's what automated database backups across servers
should feel like in 2026 — boring, observable, and running whether you're watching
or not. Add cross-platform targeting and the same recipes cover your MySQL and SQL
Server hosts too.

Install: npm i -g remote-agents →
package on npm ·
source and docs

DEV Community