Why "Just Restart It" Stopped Working
A eulogy for the universal debugging technique
The Universal Truth
Every engineer has said it.
Every engineer has heard it.
Five words that have debugged more systems than all monitoring tools combined:
"Have you tried restarting it?"
It worked for decades. So well we turned it into a meme. A joke. A badge of honor.
"Did you turn it off and on again?"
We laughed because it was true.
When Restarting Made Sense
Once upon a time, a server was a physical thing.
One machine. One process. One problem.
When something broke:
Service stops responding
→ SSH into the box
→ ps aux | grep myapp
→ PID still there? Process hung?
→ kill -9 PID
→ ./start-myapp.sh
→ Everything works again
Total time: 2 minutes
Total stress: Minimal
Total sleep lost: None
Why did this work?
Because the problem was usually temporary.
A memory leak. A deadlock. A bad connection that timed out wrong.
The code had a bug, sure. But restarting reset the state to before the bug happened.
It wasn't elegant. It wasn't permanent.
But at 3 AM, that's all anyone cared about.
The First Sign of Trouble
Then we got more servers.
One box became ten.
Ten became a hundred.
Restarting stopped being a single command.
It became a deployment.
for server in $(cat servers.txt); do
  ssh "$server" "systemctl restart myapp"
done
This worked. Mostly.
Until the day it didn't.
The Cascade
I watched this happen once.
02:15 - Pager: "Database connections failing"
The on-call engineer checks the logs.
Database is overwhelmed. Too many connections.
The solution, burned into muscle memory from years of single-server debugging:
"Restart the database."
One command. One mistake.
systemctl restart postgresql
The database came back in 45 seconds.
In those 45 seconds:
- All 200 application servers lost their connection pools
- All 200 retried simultaneously, using identical retry logic
- All 200 failed their health checks
- The load balancer marked them all unhealthy
- The site went down
The database was fine.
The app servers were fine.
The connections were gone.
The restart fixed nothing and broke everything.
One restart.
47 minutes of downtime.
Why Restarting Broke
Restarting worked when:
- State lived in one place
- Dependencies were simple
- Recovery was faster than finding root cause
Restarting broke when:
- State moved to databases, caches, message queues
- Services started calling other services
- "Just restart it" became "restart everything in the right order with the right delays and pray"
A restart is no longer a local action.
It's a distributed event.
You don't restart one thing.
You restart a graph of dependencies.
What Happens When You Restart Now
You restart Service A
↓
Service A disconnects from database
↓
Database releases locks
↓
Service B loses connection to Service A
↓
Service B retries aggressively
↓
Retries overwhelm Service C
↓
Service C crashes
↓
Everything is on fire
All because you restarted "just one thing."
The Lie We Tell Ourselves
"Restarting is harmless."
It isn't.
Every restart is:
- A forced state reset
- A connection teardown
- A potential cascade trigger
- A temporary partial outage (even if small)
We accepted restarts as "free" because the cost was invisible.
Until it wasn't.
What Replaced Restarting
The industry didn't ban restarts.
It made them unnecessary.
Health checks
Detect problems before users do.
# Kubernetes liveness probe example
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
If the readiness check fails, traffic stops flowing to that instance
If the liveness check fails, the container gets replaced
Users never see the failure
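A note on naming: strictly, it's the readiness probe that stops traffic, while a failing liveness probe tells Kubernetes to replace the container. A minimal readiness probe sketch for the same hypothetical /health endpoint:
# Kubernetes readiness probe sketch (hypothetical /health endpoint)
# Failing it removes the pod from the Service's endpoints,
# so traffic stops; the liveness probe above handles replacement.
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 5
  failureThreshold: 2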
Graceful degradation
Fail partially, not completely.
Cache down? Serve stale data
Database slow? Queue writes, serve reads
Something broke? Everything else keeps running
Automatic replacement
Never restart. Always replace.
Pod dies? New one starts
Node fails? Pods move
Same binary. Clean state. No cascade
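"Always replace" isn't hand-waving. In Kubernetes it's a declared replica count that a controller keeps reconciling. A minimal sketch, assuming a hypothetical myapp image and an arbitrary three replicas:
# Deployment sketch: the controller keeps 3 replicas running.
# If a pod dies, a replacement is scheduled automatically;
# nobody SSHes in to restart anything.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myapp:1.0   # hypothetical image
          ports:
            - containerPort: 8080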
Rolling restarts
One at a time, with verification.
Restart server 1 of 10
Wait for health check
Restart server 2 of 10
Never lose capacity
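In Kubernetes terms, that pattern is a rolling update strategy rather than a hand-written loop. A sketch of the fragment that would sit under the Deployment spec above; with maxUnavailable set to 0, an old pod is only taken down after its replacement passes the readiness check:
# Rolling update sketch: replace one pod at a time,
# never dropping below full capacity.
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
In practice the roll itself is usually triggered with kubectl rollout restart, and the strategy above decides how carefully it proceeds.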
The Systems That Don't Need Restarts
Netflix doesn't restart. It terminates and replaces.
Google doesn't restart. It shifts load and repairs.
Your bank doesn't restart. It fails over to another region.
These aren't magic.
They're design choices.
They assumed from day one that "restart" was not a strategy.
The Honest Confession
I still say "have you tried restarting it?"
Sometimes it's the fastest path to "it works now."
But I don't pretend it's a fix anymore.
It's a diagnostic.
A temporary patch.
A way to buy time until the real problem reveals itself.
The difference is:
I know the difference now.
What You Can Do Monday
For your most critical service:
- Find the last time it was restarted
- Ask: "Why did that restart happen?"
- Ask: "Could we have avoided it?"
If yes, build the automation.
If no, document why (so next time you know).
For your next outage:
- Resist the restart reflex
- Check dependencies first
- Check connections second
- Check logs third
- Restart only when you understand what you're about to break
The Question
When was the last time you restarted something
and didn't know exactly what would happen when it came back?
Be honest.
This is part of a series on operations in the age of distributed systems. Next up: "The Pager Should Not Exist."