Why "is it fully fixed?" has no honest answer — a small-server story

#devops #monitoring #sre #lessons

I run a side project on a 1GB free-tier VPS. Small box, a few services, nothing fancy.

Last week I spent a session hardening its infra. After each fix I asked the same thing: is it fully fixed now? Will this stop happening? The honest answer kept coming back no, and sometimes the fix itself caused the next problem.

Here is the incident that made me retire that question.

The incident: my monitoring almost killed a healthy server

The setup:

Main box: 1GB VPS, one API service.
Watchdog: a second box pings /health every 5 minutes. If it looks down, it alerts. If it stays down, it calls the cloud API to hard-reboot the main box.

Felt safe. Self-healing, even.

Then I ran an ffmpeg job on the main box. One core, so ffmpeg took the whole CPU. During that window, /health could not answer within the 10s timeout. The watchdog read "no response" as "dead." One more failed check and it would have rebooted a perfectly healthy production box.

The server never went down. A job I started made it slow, and the thing I built to protect it almost rebooted it.

Why "is it fully fixed?" has no honest yes

1. "Fixed" is a closed question; the system is open.
You can only defend against failures you have already seen. Every change adds new surface. "Will this never happen again" has no honest yes.

2. Small servers are tightly coupled.
One core, one gig. A heavy job starves everything else. No isolation, so "this part is safe" can quietly break the part next to it.

3. The fix is a variable too.
Monitoring is part of the system. This time it was the failure. Earlier in the same session, making an alert script executable turned it into a source of false alarms. Every change made to harden things added a new way to fail.

4. Confident answers are cheap.
"Done, safe" is easy to say before checking the blast radius, whether it comes from a tool, an AI agent, or me. The reason I kept asking "is it fixed?" was that part of me did not believe the yes. I was right not to.

What actually helps

Not guarantees. Detection and reversibility.

The watchdog fix was not "make it perfect." It was: stop trusting a single failed check. Retry a few times before declaring death.

code="000"
for i in 1 2 3; do
  code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "$URL")
  [ "$code" = "200" ] && break
  [ "$i" -lt 3 ] && sleep 10
done
# only "dead" if all three fail

A transient spike recovers within 30 seconds, so three misses in a row means it is really down. And heavy jobs do not run on the box they would starve.