April 24, 2026. I logged into the Hetzner panel, clicked into the CPX31, typed the confirmation, and pulled the trigger. The little green dot went grey. My redundancy was gone.
I had a laptop, an email account, and social media. Four months ago I didn't know what systemctl status meant. Now I'm out here making infrastructure decisions like I know what I'm doing. I don't. Senior devs will spot ten amateur moves in this post and I'll learn from every one of them.
Here's the math that pushed me. I had two boxes running. A primary CPX31 in Falkenstein doing the heavy work (x402 routes, the bot inference glue, a Postgres I keep telling myself I'll tune one day) and a second CPX31 that was supposed to be the failover. The failover was beautifully configured. I'd set up the cron, the pg_basebackup chain, a little health-check script that pinged the primary every 30 seconds and would flip DNS if things went sideways.
It never flipped. Not once. The primary has been up since I rebuilt it in February. My "failover" was a server I paid for monthly so I could feel safe.
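For anyone curious what the flip decision in a health check like that can look like, here is a minimal sketch. It is not my actual script. The class name, the threshold of three consecutive failures, and the "flip only once" rule are all assumptions for illustration; the real probe and the DNS API call are left out.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Tracks health-check results and decides when to flip DNS.

    failure_threshold is a hypothetical value; the post only says the
    primary was pinged every 30 seconds.
    """
    failure_threshold: int = 3
    consecutive_failures: int = 0
    flipped: bool = False

    def record(self, healthy: bool) -> bool:
        """Record one check. Returns True only on the check that should trigger the flip."""
        if self.flipped:
            # Already failed over; never flip a second time automatically.
            return False
        if healthy:
            # Any successful check resets the failure streak.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.flipped = True
            return True
        return False
```

Requiring a streak of failures instead of a single miss keeps one dropped ping from moving traffic. With a 30 second interval and a threshold of three, the flip would come roughly a minute and a half after the primary stops answering.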
The surgery is on the calendar. August 11. Every dollar I'm not spending on a second server is a dollar that buys me a day on the other side of that table, when I can't type for a while and the bots have to keep earning without me babysitting them. That math is not abstract. I am racing the calendar.
So I made the call. Simplicity won.
What "simplicity" actually looks like now:
- One CPX31. 4 vCPU, 8 GB RAM, 160 GB SSD, in Falkenstein.
- Nightly pg_dump piped straight into a Hetzner Storage Box over SSH. Costs almost nothing.
- A second nightly job that rsyncs the bot code, the env files (gpg-encrypted), and the Caddy config to the same storage box.
- A documented rebuild script. Literally a markdown file titled REBUILD.md that walks me through bringing a fresh Ubuntu 24.04 box back to a working state in under an hour. I tested it on a temporary CX22 on Tuesday. 47 minutes from apt update to the first x402 request succeeding.
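The nightly dump step can be sketched roughly like this. Treat it as an illustration under assumptions, not my real job: the database name, the remote login, the target directory, and the file naming are all hypothetical, and whether the storage box's shell accepts a remote `cat >` is something to verify before relying on it.

```python
import datetime
import subprocess


def dump_filename(day: datetime.date, db_name: str) -> str:
    # Date-stamped name so each night's dump is kept separately (assumed convention).
    return f"{db_name}-{day.isoformat()}.dump"


def build_commands(db_name: str, remote: str, remote_dir: str, day: datetime.date):
    """Return (pg_dump command, ssh command) for a streamed backup."""
    # Custom format is compressed and can be restored with pg_restore.
    dump_cmd = ["pg_dump", "--format=custom", db_name]
    remote_path = f"{remote_dir}/{dump_filename(day, db_name)}"
    # Assumption: the remote side allows writing stdin to a file via cat.
    ssh_cmd = ["ssh", remote, f"cat > {remote_path}"]
    return dump_cmd, ssh_cmd


def run_backup(db_name: str, remote: str, remote_dir: str) -> None:
    """Pipe pg_dump straight into ssh without writing a local file."""
    dump_cmd, ssh_cmd = build_commands(
        db_name, remote, remote_dir, datetime.date.today()
    )
    dump = subprocess.Popen(dump_cmd, stdout=subprocess.PIPE)
    upload = subprocess.Popen(ssh_cmd, stdin=dump.stdout)
    # Close our copy of the pipe so pg_dump sees a broken pipe if ssh exits early.
    dump.stdout.close()
    upload_rc = upload.wait()
    dump_rc = dump.wait()
    if dump_rc != 0 or upload_rc != 0:
        # Fail loudly so cron mail or monitoring can catch a bad night.
        raise RuntimeError(f"backup failed: pg_dump={dump_rc}, ssh={upload_rc}")
```

Calling `run_backup("botdb", "backup-user@storagebox.example", "pg")` from a cron-driven script would stream one dump per night. Checking both exit codes matters because a pipeline can look fine when only the upload half succeeded.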
The theory: if the primary dies, I spin up a new box, run the script, restore last night's dump, point DNS, and I'm back. Worst case I lose a few hours of request logs and some webhook state I can replay. Best case nobody notices.
That's the theory.
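One way to put a number on what a restore from last night's dump would lose is to measure the gap between the last dump and the moment the box dies. This is a back-of-the-envelope sketch: the function name and the 03:00 dump time in the usage note are assumptions, and it ignores how long the dump itself takes to run.

```python
from datetime import datetime, time, timedelta


def data_loss_window(failure_at: datetime, dump_time: time) -> timedelta:
    """Time between the most recent nightly dump and the failure."""
    last_dump = datetime.combine(failure_at.date(), dump_time)
    if last_dump > failure_at:
        # Today's dump has not run yet, so the latest one is yesterday's.
        last_dump -= timedelta(days=1)
    return failure_at - last_dump
```

With a hypothetical 03:00 dump, a failure at 14:30 leaves an 11.5 hour window, and a failure at 02:00 leaves 23 hours, so the window depends heavily on when the failure lands relative to the dump.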
The doubt is real. Every morning I open Uptime Kuma and there's that one green dot where there used to be two, and my brain does the thing where it whispers "today is the day the disk fails." I'm still learning what production-grade actually means, and I know in my gut that "one box and a nightly dump" is not what a real ops person would call resilient. But I'm one guy. I am not Netflix. My uptime SLA is to myself.
The other thing I keep coming back to: complexity has a cost beyond the invoice. Two servers meant two patch schedules, two sets of certs to renew, two firewalls to keep in sync, twice the surface where I could quietly break something at 2 AM and not notice for a week. One server means when something is wrong, I know exactly where to look.
A Safety Pack sold while I was writing this. They add up little by little like pennies. That's the whole engine right now, and the engine only needs one room to run in.
So here's the honest question I want to put to anyone reading this who has actually run things in production for real money:
At what scale of monthly revenue did you stop treating a nightly dump + a rebuild script as "good enough" and decide you actually needed hot standby?