Zero-downtime deploys and one-click rollback for self-hosted apps — no Kubernetes

#selfhosted #devops #docker #deployment

The most awkward moment in self-hosted deployment is the couple of seconds after you run docker stop old && docker run new — to the outside world, your service is a string of 502s. A user who clicks in during that window sees a big red error page.

Rollback is even messier. Something breaks in prod, and now you're frantically digging for "which image tag was the last good one?" or squinting at a list of SHAs in git log trying to guess which one was fine.

Small teams, homelabs, internal services — most don't have Kubernetes' rolling updates and one-click rollback. Are you just stuck tolerating that window? I used to think so. Then I actually took it apart, and the conclusion is: zero-downtime deployment isn't about K8s-grade scheduling at all. It's three small things — stage the new version first, flip to it with an atomic operation, and flip straight back if it's wrong. Here's how the pieces work.

1. The essence: compress "the switch" into one atomic action

First, see clearly why stop old → run new returns 502s: between those two commands there's a vacuum where the old one is gone and the new one isn't up yet. No matter how short the window, any request that lands in it fails.

The key mental shift is to split a deploy into two phases: "stage" and "switch".

Stage phase: put the complete new version next to the old one (a new directory / a new image). During this, the old version keeps serving, untouched — zero impact. It can be slow; nobody's affected.
Switch phase: point from old to new. This step must be as atomic as possible, ideally a single kernel-guaranteed operation, so the "vacuum" shrinks to nothing or, at worst, a single instant.

Once you internalize that, the rest is just finding the "atomic switch action" for each artifact type.

2. File artifacts (jar / dist / static site): atomic symlink is the standard answer

If you deploy a jar, a frontend dist, or a static site, the standard solution is an atomic symlink switch:

releases/
  20260701-a1b2c3/   <- this deploy's new version
  20260630-f9e8d7/   <- previous version
current -> releases/20260701-a1b2c3   <- one symlink; the app only knows "current"

On deploy, you first upload the new version into releases/<new>/ while leaving the current symlink completely untouched. Once the new version is fully staged, you flip the symlink. There's an easy-to-miss detail in how you flip it:

# ❌ Can have a window: ln -sfn may be implemented as unlink + symlink (two steps), so for one instant current doesn't exist
ln -sfn releases/new current

# ✅ Atomic: create a temp symlink, then mv -T does a single rename(2) overwrite
ln -sfn releases/new current.tmp && mv -T current.tmp current

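Underneath, mv -T is a single rename(2) syscall, which the kernel guarantees is atomic: there is never an instant where current points at nothing. At any moment, current points either at the complete old version or the complete new one, with no in-between state. Two conditions apply: -T is a GNU coreutils option, so check your platform's mv, and current.tmp must sit on the same filesystem as current (it does here, since both live in the same directory).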
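To make the stage/switch split concrete, here is a minimal sketch of both phases as shell functions. The names stage_release, switch_release, and previous.txt are illustrative assumptions, not something from a particular tool, and the sketch assumes GNU coreutils for mv -T.

```shell
#!/usr/bin/env bash
# Sketch only: stage a release next to the live one, then flip `current`
# with a single rename(2). Function and file names are hypothetical.

# stage_release SRC_DIR APP_ROOT RELEASE_ID
# Copies the new version into releases/<id>/ without touching `current`.
stage_release() {
  local src="$1" app_root="$2" id="$3"
  mkdir -p "$app_root/releases/$id" &&
    cp -a "$src/." "$app_root/releases/$id/"
}

# switch_release APP_ROOT RELEASE_ID
# Points `current` at an already staged release.
switch_release() {
  local app_root="$1" id="$2"
  # Refuse to switch to a release that was never staged.
  [ -d "$app_root/releases/$id" ] || return 1
  # Remember the outgoing target so a rollback knows where to point.
  if [ -L "$app_root/current" ]; then
    readlink "$app_root/current" > "$app_root/previous.txt"
  fi
  # Build the new symlink under a temp name, then rename it over `current`.
  ln -sfn "releases/$id" "$app_root/current.tmp" &&
    mv -T "$app_root/current.tmp" "$app_root/current"
}
```

A deploy would then call stage_release (slow, harmless) followed by switch_release (fast, atomic). Rolling back is the same switch_release call with the older release ID.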

An honest but important caveat: an atomic symlink doesn't mean zero process restart. If your app is a long-running process (say a jar that loads into memory and stays there), it keeps running the old code after the flip; you still have to reload or restart it so it picks up whatever current now points at. The symlink guarantees "current always points at a complete version." It does not guarantee "the process never restarts." Don't conflate the two.

3. Containers: pull first, then swap — shrink the window to seconds

Containers don't have the symlink layer, but the idea is identical — stage first, then switch:

docker pull <new image>: pull the new image locally. During this, the old container keeps running — zero impact. While you're at it, docker inspect the old container's current image and record it as your rollback anchor.
docker rm -f <old> && docker run -d <new>: remove old, start new, two commands back-to-back to keep the window minimal.
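The two steps above can be sketched as one shell function. The function name swap_container and the anchor-file argument are assumptions for illustration; the docker subcommands and the inspect format string are standard Docker CLI.

```shell
#!/usr/bin/env bash
# Sketch only: pull first, record the rollback anchor, then swap.
# swap_container and the anchor file are hypothetical names.

# swap_container NAME NEW_IMAGE ANCHOR_FILE [extra `docker run` args...]
swap_container() {
  local name="$1" image="$2" anchor_file="$3" old_image
  shift 3
  # Stage: the old container keeps serving while the pull runs.
  docker pull "$image" || return 1
  # Anchor: the image reference the running container was created from.
  # On a first-ever deploy there is no old container, so nothing is written.
  if old_image=$(docker inspect --format '{{.Config.Image}}' "$name" 2>/dev/null); then
    printf '%s\n' "$old_image" > "$anchor_file"
  fi
  # Switch: remove and run back-to-back. The gap is short, but it is a gap.
  # The rm result is ignored so a first deploy (nothing to remove) still runs.
  docker rm -f "$name" >/dev/null 2>&1
  docker run -d --name "$name" "$@" "$image"
}
```

For example, swap_container web registry.example.com/app:v2 /var/lib/deploy/web.anchor -p 8080:80 (the image name and paths are placeholders). Because tags can be re-pushed, recording the image ID via '{{.Image}}' instead of the tag is a stricter anchor, at the cost of being less readable.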

The honest trade-off: between rm and the new container being fully ready, there's a brief "no container" window — usually seconds. For the vast majority of self-hosted scenarios, a few seconds is entirely acceptable. But don't pretend it's zero-window. To truly get zero-window you need blue-green: start the new container, health-check it, flip the reverse-proxy upstream from old to new, then remove the old one. The cost is running two copies during the switch, plus reverse-proxy coordination. Rolling (seconds-long window) vs. blue-green (zero window, double the resources) — pick by how critical the service is. Don't reflexively reach for the most complex option.
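For comparison, a blue-green switch might look like the sketch below. It rests on several assumptions that are not from this post: the app listens on port 8080 inside the container, the new container is published on a spare localhost port, and flip_upstream is a hypothetical hook you write yourself to repoint your reverse proxy.

```shell
#!/usr/bin/env bash
# Sketch only: start new, verify on the side, flip traffic, remove old.
# ASSUMPTIONS: container port 8080, and flip_upstream HOST:PORT is a
# HYPOTHETICAL function you must define for your own reverse proxy.

# blue_green IMAGE OLD_NAME NEW_NAME NEW_HOST_PORT HEALTH_PATH
blue_green() {
  local image="$1" old="$2" new="$3" port="$4" health_path="$5" i
  # Both copies run side by side from here until the old one is removed.
  docker run -d --name "$new" -p "127.0.0.1:${port}:8080" "$image" >/dev/null || return 1
  for i in 1 2 3 4 5 6 7 8 9 10; do
    if curl -fsS --max-time 2 -o /dev/null "http://127.0.0.1:${port}${health_path}"; then
      # Healthy: traffic moves only now, so users never hit an unverified version.
      if flip_upstream "127.0.0.1:${port}"; then
        docker rm -f "$old" >/dev/null
        return 0
      fi
      break
    fi
    sleep 1
  done
  # Never became healthy, or the flip failed: discard new, old keeps serving.
  docker rm -f "$new" >/dev/null
  return 1
}
```

The extra moving parts (a second port, a proxy hook, two live containers) are exactly the cost described above, which is why the simpler swap is a reasonable default for less critical services.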

4. Where the health check goes: after the switch, roll back on failure

The new version is live — how do you know it didn't crash? Probe a health URI immediately after the switch, and roll back to the previous version the moment it's unhealthy.

One directional point to be clear about: this is "verify after switching," not "canary before switching." The new version is already serving the instant you switch; the health check exists to "catch a broken deploy and back it out fast," not to "gate release until it's verified good." If you want "gate until verified good," what you need is blue-green (verify health on the side, flip traffic only once it passes).

For a small team, "seconds-fast rollback after switch" is usually a better deal than "canary before switch": it is far simpler to implement, and the only cost is a broken version being exposed for a few seconds. Your reverse proxy (e.g. Caddy) can add a second layer with active health checks (health_uri + health_interval), which probe each upstream on a timer and stop routing to any that fail. That is one more safety net.
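A post-switch probe with rollback on failure can be sketched in a few lines of shell. The names wait_healthy and verify_or_rollback, the retry count, and the delays are assumptions chosen for illustration; the rollback action is whatever command you pass in.

```shell
#!/usr/bin/env bash
# Sketch only: verify after switching, back out on failure.
# Function names and default retry values are hypothetical.

# wait_healthy URL [ATTEMPTS] [DELAY_SECONDS]
wait_healthy() {
  local url="$1" attempts="${2:-10}" delay="${3:-1}" i=0
  while [ "$i" -lt "$attempts" ]; do
    # curl -f turns HTTP status >= 400 into a non-zero exit code.
    curl -fsS --max-time 2 -o /dev/null "$url" && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# verify_or_rollback URL [ROLLBACK_CMD...]
# Returns 0 if healthy. Otherwise runs the rollback command and returns 1,
# so the caller still sees that this deploy failed.
verify_or_rollback() {
  local url="$1"
  shift
  wait_healthy "$url" && return 0
  echo "health check failed for $url, rolling back" >&2
  "$@"
  return 1
}
```

Combined with a symlink switch, this could be called as verify_or_rollback http://127.0.0.1:8080/health my_rollback_command, where my_rollback_command is your own function that flips current back. With no rollback command at all (a first-ever deploy), the function simply reports failure.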

5. One-click rollback: don't dig for image tags — have the tool "re-ship the last success"

The easiest way to get rollback wrong is to rely on human memory — "was the last good image tag v1.3.2 or v1.3.1?" People are least reliable exactly when something's on fire.

The right approach: the system already stores the artifact + the target-server list for every deploy. Rollback = find the "last deploy that fully succeeded across all servers" and re-ship its artifact to the same targets. The whole thing reuses the normal deploy path, so rollback itself runs the health check and can roll back again — it's not a special side-channel.

The honest trade-offs you need to accept:

One-click rollback = redeploying a historical version, so its time and cost ≈ a fresh deploy (re-pull the image / re-upload the files). It is not "instantly switch back to the old container." If you truly want a seconds-fast switch-back, you'd keep the old container around undeleted — a different trade-off (it sits there consuming resources).
It depends on history: you can only roll back to the "last success" among the recent N deploys. If you've deleted the old tag from your image registry, pull can't get it back and rollback fails. Don't rush to prune old images.
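Finding the "last deploy that fully succeeded across all servers" is a small query over deploy history. The sketch below assumes a hypothetical record format, one tab-separated line per server per deploy (deploy_id, artifact, server, status), appended in chronological order. A real tool would more likely query its database, but the selection logic is the same.

```shell
#!/usr/bin/env bash
# Sketch only: pick the newest deploy in which every server reported "ok".
# The log format is a HYPOTHETICAL tab-separated file:
#   deploy_id <TAB> artifact <TAB> server <TAB> status

# last_good_deploy LOG_FILE [SKIP_ID]
# Prints "deploy_id<TAB>artifact", or exits non-zero if none qualifies.
# SKIP_ID lets the caller exclude the deploy being rolled back from.
last_good_deploy() {
  awk -F '\t' -v skip="${2:-}" '
    !($1 in seen) { seen[$1] = 1; order[++n] = $1; artifact[$1] = $2 }
    $4 != "ok"    { failed[$1] = 1 }   # one bad server disqualifies the deploy
    $1 == skip    { failed[$1] = 1 }
    END {
      for (i = n; i >= 1; i--) {
        id = order[i]
        if (!(id in failed)) { print id "\t" artifact[id]; exit 0 }
      }
      exit 1
    }
  ' "$1"
}
```

The artifact it prints is what gets fed back into the normal deploy path, which is why the rollback inherits the same pull, switch, and health-check steps.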

One real gotcha worth logging: to save display space, deploy records store the 7-char short SHA. But go-git's CommitObject API only accepts the 40-char full SHA — hand it a short SHA to look up a commit and you get "commit unreachable," and the feature silently degrades. The fix is to ResolveRevision first (turning a short SHA / branch name / tag into a full hash), then fetch the commit object. This "stored an abbreviation to save space, have to expand it at use time" trap is exactly the kind of thing you fall into when hand-rolling your own deploy tool. Note it down.
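The same "expand before use" step can be illustrated with the git CLI instead of go-git (a deliberate substitution to keep the example in shell): git rev-parse turns a short SHA, branch name, or tag into the full commit hash. The function name resolve_commit is an assumption.

```shell
#!/usr/bin/env bash
# Sketch only: expand an abbreviated revision to a full commit hash using
# the git CLI. This mirrors the go-git fix in spirit, not in API.

# resolve_commit REPO_DIR REV
# Prints the full hash of the commit REV names, or fails quietly if REV
# is unknown, ambiguous, or not a commit.
resolve_commit() {
  git -C "$1" rev-parse --verify --quiet "$2^{commit}"
}
```

Storing the short form for display and resolving it at the point of use keeps the records compact. Storing the full hash and abbreviating only when rendering avoids the trap entirely, and is arguably the safer default.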

6. A few things that will bite you

Atomic symlink is only atomic for "the switch step"; whether the app needs a restart to pick up the new version is a separate concern — wire up a reload for that.
Container rm+run has a seconds-long window — don't advertise it as zero-downtime. For true zero-window, go blue-green and eat the double-resource cost.
The health check is post-switch verification — a failure means the broken version was already exposed for a moment. Fine? Use rolling. Not fine? Go blue-green.
A first-ever deploy has no "previous version" to roll back to — a health failure is a hard failure that needs manual intervention. Don't expect the first ship to auto-recover.
Rollback cost ≈ a fresh deploy, not a free instant switch. Treat it as "one automated redeploy," not an "undo button."

7. I ended up building this into a tool

None of the pieces are complex on their own, but you have to wire every one yourself: mv -T for atomic symlinks, pull-then-swap for containers, choosing blue-green vs. rolling, post-switch health checks, auto-rollback on failure, one-click re-ship of the last success, ResolveRevision for short SHAs... I packed all of this, along with the CI builds, multi-server deploys, Caddy auto-HTTPS, and per-PR preview environments from earlier posts, into my single-binary Go deployment tool Pipewright:

File artifacts switch via atomic symlink; containers pull-then-swap with a recorded rollback anchor; blue-green / rolling both available as strategies;
runs a health check automatically after the switch and auto-rolls-back to the previous version if unhealthy;
one-click rollback: click once in the panel, it locates the last all-servers-success deploy and re-ships it — no memorizing image tags;
short SHAs in deploy records and code diffs are auto-resolved via ResolveRevision, so no "commit unreachable."

Open source, MIT, single binary, no runtime dependencies (frontend baked in via embed.FS, SQLite by default). Aimed at individual developers and small teams self-hosting.

Repo: https://github.com/huangchengsir/pipewright

But even if you never use it, the core takeaway holds: zero-downtime deploy = stage first + atomic switch + roll back after; one-click rollback = re-ship the last successful version. No Kubernetes required — self-hosting can have this experience too. It really isn't magic.

Issues and pushback welcome.