I opened my SaaS dashboard and got this:
Web server is down - Error code 521
You → Cloudflare → myapp.example.com
Browser      Dallas        Host
✅ Working   ✅ Working   ❌ Error
TL;DR: A routine Docker update silently broke two things on my production server. My reverse proxy couldn't discover services anymore, and a port conflict appeared out of nowhere. Claude Code diagnosed both problems over SSH from my local machine, without being installed on the server. Total time from "it's down" to HTTP 200: 25 minutes.

First reflex: the domain. I pulled up my registrar to check if the domain was still active. It was. Then I tried another subdomain on the same domain. Loaded fine. So the domain isn't the problem.
Second thought: n8n crashed. It happens. Docker containers die, databases run out of memory, the usual suspects. I'm already mentally preparing for a docker restart and a cup of coffee.
Third thought: wait. If the other subdomain works, and both go through Cloudflare to the same server, then the server itself is responding. Which means it's not n8n. It's the reverse proxy. Traefik.
Great (not great). Traefik.
Nobody wakes up thinking "oh nice, my production server is down, I had nothing planned today anyway." Debugging a reverse proxy on a live server isn't something you do casually. It's layers on layers: Docker socket, service discovery, TLS certificates, routing rules. The kind of problem where you Google the error, get 15 answers that are close but not your setup, and lose two hours before you even find the right thread to pull.
I'd been using Claude Code for months on dev work. Building features, refactoring, debugging application code. It had impressed me enough that I thought: let me see if it can handle this. I opened it on my Mac, SSH'd into the server, and described the symptom.
Twenty-five minutes later, two bugs were fixed. Not one. Two. Same root cause, completely different symptoms. Every step below.
Three Containers Running. Zero Serving Traffic.
Claude checked the containers first. Standard triage. If the app is down, see what's actually running.
sudo docker ps -a --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
NAMES STATUS PORTS
n8n Up 2 minutes 5678/tcp
n8n_postgres Up 2 minutes (healthy) 5432/tcp
traefik Up 2 minutes
All three up. App logging "ready on port 5678", database healthy, reverse proxy running. But all three showing "Up 2 minutes." They just restarted. Something knocked them down.
Oh, and the first docker ps failed. Permission denied on the Docker socket. Claude didn't magically know I needed sudo. It tried, got the error, added sudo, moved on. This isn't magic. It's iteration.
So the app is healthy. The reverse proxy is running. But Cloudflare can't reach the origin. Claude went straight to Traefik logs. Because if three healthy containers can't reach the internet, the routing layer is the suspect.
sudo docker logs traefik --tail=30
Repeating every second:
ERR "client version 1.24 is too old. Minimum supported API version
is 1.44, please upgrade your client to a newer version"
providerName=docker
That's the smoking gun. Traefik uses the Docker socket to auto-discover services. It queries the daemon, finds containers with the right labels, builds routes dynamically. If it can't talk to the daemon, it discovers nothing. No services, no routes, Cloudflare hits the origin, gets nothing, throws 521.
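For context, that discovery mechanism is label-driven. A compose file wired for it looks roughly like this (the service names and the router rule are illustrative, not my actual config):

```yaml
services:
  traefik:
    image: traefik:v3.0
    command:
      - --providers.docker=true               # discover services via the Docker socket
      - --providers.docker.exposedbydefault=false
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro  # the API Traefik could no longer speak
  n8n:
    image: n8nio/n8n
    labels:
      - traefik.enable=true
      - traefik.http.routers.n8n.rule=Host(`myapp.example.com`)
      - traefik.http.services.n8n.loadbalancer.server.port=5678
```

When the socket handshake fails, every one of those labels is invisible to Traefik, so the router for myapp.example.com simply doesn't exist.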
Two commands, from zero visibility to root cause. That would've taken me 45 minutes of log-hopping and wrong guesses on my own, probably more if I'm being honest with myself.
Docker 29 Changed the Rules. Nobody Sent a Memo.
Claude checked versions:
sudo docker version
Docker Engine 29.2.1. API version 1.53. Minimum supported: 1.44.
My Traefik image: traefik:v3.0, released April 2024. Almost two years without updating a reverse proxy on production. Not my proudest moment. Let me be honest about it: I knew it was old. I just kept telling myself "if it works, don't touch it," which is the kind of thinking that ages really well until one day it doesn't.
Traefik v3.0 hardcodes API version 1.24 when talking to the Docker socket. Worked fine with Docker 28 and every version before it. Docker 29 raised the floor, and the handshake gets rejected. The irony: you pin a specific version to avoid surprises, and two years later that exact decision breaks everything. But if you run latest, you get a different flavor of surprise. Pick your poison.
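The incompatibility itself is just a version comparison. A minimal sketch, using the two numbers from this incident and GNU `sort -V` for version ordering:

```shell
client_api="1.24"   # what Traefik v3.0 hardcodes on the Docker socket
min_api="1.44"      # Docker Engine 29's minimum supported API version

# sort -V orders version strings numerically; if the client's version sorts
# below the daemon's minimum, the daemon rejects the handshake with the
# "client version ... is too old" error seen in the Traefik logs
lowest=$(printf '%s\n%s\n' "$client_api" "$min_api" | sort -V | head -n1)
if [ "$lowest" = "$client_api" ] && [ "$client_api" != "$min_api" ]; then
  echo "INCOMPATIBLE: client API $client_api < daemon minimum $min_api"
else
  echo "compatible"
fi
```

Swap in the values from `docker version` on your own box; anything below the daemon's minimum gets the same rejection.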
No deprecation warning at upgrade time. No "hey, this might break your reverse proxy." Just a silent minimum version bump that kills service discovery for anyone running an older Traefik.
"Critique Your Own Diagnosis. This Is Production."
I didn't just accept it. I have other servers running the same Traefik setup with zero issues. If the diagnosis was right, those should be broken too.
So I told Claude: this is a production server. Critique your own reasoning. Tell me why my other servers aren't failing. And give me a prompt I can run on the other box to verify.
This is the part I didn't expect. Claude generated a standalone verification script. Not a vague "go check the Docker version." An actual set of commands, with context comments, that I could copy-paste into an SSH session on the second server. No need to install Claude Code over there. Just SSH in, paste, read the output.
docker version --format \
'Docker {{.Server.Version}} - API min: {{.Server.MinAPIVersion}}'
docker inspect traefik --format 'Image: {{.Config.Image}}'
docker logs traefik --tail=10 2>&1 \
| grep -i "api version" || echo "No API version error"
Server B results: Docker 28.3.3, minimum API 1.24, Traefik v3.6.2, zero errors.
Two differences explained everything. Docker 28 still accepts the old API. And Traefik v3.6.1+ includes a fix, WithAPIVersionNegotiation(), that auto-negotiates instead of hardcoding 1.24. Server A had the old Docker and the old Traefik. Server B had the old Docker but a newer Traefik.
Any good engineer can follow a log trail. That part isn't hard. The hard part is knowing when to doubt the conclusion. I've seen enough production incidents where "the obvious answer" explains the symptom but misses the context entirely. Forcing Claude to defend its reasoning against a contradicting data point turned a plausible diagnosis into a confirmed one.
It's also the exact workflow I use when building CLI tools instead of relying on abstraction layers. When something breaks in a discovery chain, you don't trust the dashboard. You go inspect the actual handshake.
One Fix Applied. One New Bug Created.
Obvious fix: update Traefik to a version that can negotiate with Docker 29.
sudo sed -i 's|image: traefik:v3.0|image: traefik:v3.6.2|' \
~/traefik/docker-compose.yml
sudo docker compose -f ~/traefik/docker-compose.yml up -d --pull always
Image pulled. Container recreated. And:
failed to bind host port 0.0.0.0:443/tcp: address already in use
Fantastic 💀
Claude ran ss -tlnp | grep ':443' and found Tailscale Serve listening on the VPN IP, port 443. Traefik was trying to bind 0.0.0.0:443, all interfaces. Old Docker allowed this overlap: a specific IP and a wildcard on the same port coexisted fine. Docker 29 made port binding stricter. Same update, second breaking change.
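The conflict check itself is simple. A sketch, where `ss_output` is a captured sample line standing in for live `ss -tlnp` output (the VPN IP and PID are made up); the point is that any existing listener on :443 collides with a 0.0.0.0:443 wildcard bind under Docker 29's stricter rules:

```shell
# Sample ss -tlnp line: Tailscale Serve listening on the VPN IP, port 443
ss_output='LISTEN 0 4096 100.64.0.5:443 0.0.0.0:* users:(("tailscaled",pid=812,fd=33))'
port=443

# If anything already listens on the port, a wildcard bind will fail
if printf '%s\n' "$ss_output" | grep -q ":$port "; then
  echo "conflict on :$port - bind Traefik to a specific IP instead of 0.0.0.0"
fi
```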
Fix: bind Traefik to the public IP explicitly instead of the wildcard. Claude edited the compose file, restarted, verified. Each service on its own IP. No conflict.
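In compose terms, the fix looks roughly like this (203.0.113.7 is a placeholder for the server's public IP; my actual file differs):

```yaml
services:
  traefik:
    image: traefik:v3.6.2
    ports:
      # before: "443:443" binds 0.0.0.0:443 (all interfaces) and now
      # collides with Tailscale Serve on the VPN IP
      - "203.0.113.7:443:443"   # after: bind only the public interface
      - "203.0.113.7:80:80"
```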
curl -sI https://myapp.example.com | head -1
HTTP/2 200
Back online. Traefik logs: zero lines. When Traefik has nothing to complain about, it says nothing. That's how you know it's working.
One update. Two breaking changes. Zero warnings.
What Actually Matters if You Manage Servers
Claude Code wasn't installed on the server. It runs on my Mac. I SSH'd in, Claude saw the terminal, ran commands, read output, reasoned, ran the next command. No Node.js to maintain on production boxes, no API tokens on exposed machines. When Anthropic killed my OpenClaw infrastructure overnight, I rebuilt everything from one machine. Same principle. Centralize the brain, distribute the access.
Now the embarrassing part. While writing this article, I realized something. The day before the crash, I had rebooted the VPS. Ubuntu was showing "System restart required" and I just did it without thinking twice. That reboot is almost certainly what triggered the Docker 29 update to take effect. And I never made the connection. I didn't check anything after the reboot. Just rebooted, saw containers come back up, and moved on with my life. The bug was already there, silently waiting for Traefik to restart and try to talk to the new Docker API.
If you're a DevOps engineer, a sysadmin, or just a dev who manages their own infra, here's what I'd take from this:
After every reboot, verify. Don't just check that containers are up. Check that services are actually reachable. Here's the prompt I use now:
I just rebooted my VPS. SSH in and verify that my critical
processes are actually running - I mean n8n, my database,
and Traefik. Don't just check docker ps. Confirm they're
reachable and not throwing errors in the logs.
(Swap n8n/database with whatever you run. The point is: tell Claude what matters to you, let it figure out the checks.)
After every major package update, check your API contracts. Docker, Kubernetes, Traefik, Nginx, Postgres. An update that doesn't break your app can still break the glue between your services. Here's the prompt:
I just updated Docker (or: apt upgraded my server). Check
that all my services can still talk to each other. Look at
API version compatibility between Docker and any container
that uses the Docker socket. Flag anything that might break
silently.
Challenge the diagnosis before applying the fix. If you have multiple servers, use them. Generate a verification script, run it on the healthy box, diff the results. This is where Claude shines: it doesn't just diagnose, it produces portable checks you can run anywhere over SSH.
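The diff step is mechanical enough to sketch. The heredocs below stand in for output captured over SSH from each box (in practice something like `ssh serverA 'bash verify.sh' > serverA.txt`, hostnames being placeholders):

```shell
# Verification output from the broken server
cat > serverA.txt <<'EOF'
Docker 29.2.1 - API min: 1.44
Image: traefik:v3.0
client version 1.24 is too old
EOF

# Verification output from the healthy server
cat > serverB.txt <<'EOF'
Docker 28.3.3 - API min: 1.24
Image: traefik:v3.6.2
No API version error
EOF

# Every differing line is a candidate root cause; here all three differ
diff serverA.txt serverB.txt || true
```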
Stop memorizing commands. Start describing problems. I used to think getting better at servers meant knowing the right iptables syntax by heart or which config file lives where. The actual skill now is describing what you see clearly enough that Claude can follow the causal chain. Error 521, containers running, logs clean, origin unreachable. Four layers deep in 25 minutes. Not because I don't know Linux. Because tracking dependencies across four layers on a live server requires the kind of working memory that humans burn through fast and LLMs don't.
The server doesn't care who typed the command. But I care that it took 25 minutes instead of 3 hours.
I write about the tools and disasters that shape how indie devs actually ship. If your stack includes a VPS you're slightly afraid of, you'll feel at home.
Subscribe →
*Yeah, that cover image is AI-generated. I can diagnose a Docker API mismatch but I can't draw a server rack to save my life.