Why "Just Restart It" Stopped Working
A eulogy for the universal debugging technique
The Universal Truth
Every engineer has said it.
Every engineer has heard it.
Five words that have debugged more systems than all monitoring tools combined:
"Have you tried restarting it?"
It worked for decades. So well we turned it into a meme. A joke. A badge of honor.
"Did you turn it off and on again?"
We laughed because it was true.
When Restarting Made Sense
Once upon a time, a server was a physical thing.
One machine. One process. One problem.
When something broke:
Service stops responding
→ SSH into the box
→ ps aux | grep myapp
→ PID still there? Process hung?
→ kill -9 PID
→ ./start-myapp.sh
→ Everything works again
Total time: 2 minutes
Total stress: Minimal
Total sleep lost: None
Why did this work?
Because the problem was usually temporary.
A memory leak. A deadlock. A bad connection that timed out wrong.
The code had a bug, sure. But restarting reset the state to before the bug happened.
It wasn't elegant. It wasn't permanent.
But at 3 AM, that's all anyone cared about.
The First Sign of Trouble
Then we got more servers.
One box became ten.
Ten became a hundred.
Restarting stopped being a single command.
It became a deployment.
for server in $(cat servers.txt); do
  ssh "$server" "systemctl restart myapp"
done
This worked. Mostly.
Until the day it didn't.
The Cascade
I watched this happen once.
02:15 - Pager: "Database connections failing"
The on-call engineer checks the logs.
Database is overwhelmed. Too many connections.
The solution, burned into muscle memory from years of single-server debugging:
"Restart the database."
One command. One mistake.
systemctl restart postgresql
The database came back in 45 seconds.
In those 45 seconds:
- All 200 application servers lost their connection pools
- All 200 retried simultaneously, using identical retry logic
- All 200 failed their health checks
- The load balancer marked them all unhealthy
- The site went down
The database was fine.
The app servers were fine.
The connections were gone.
The restart fixed nothing and broke everything.
One restart.
47 minutes of downtime.
Why Restarting Broke
Restarting worked when:
- State lived in one place
- Dependencies were simple
- Recovery was faster than finding root cause
Restarting broke when:
- State moved to databases, caches, message queues
- Services started calling other services
- "Just restart it" became "restart everything in the right order with the right delays and pray"
A restart is no longer a local action.
It's a distributed event.
You don't restart one thing.
You restart a graph of dependencies.
What Happens When You Restart Now
You restart Service A
↓
Service A disconnects from database
↓
Database releases locks
↓
Service B loses connection to Service A
↓
Service B retries aggressively
↓
Retries overwhelm Service C
↓
Service C crashes
↓
Everything is on fire
All because you restarted "just one thing."
The Lie We Tell Ourselves
"Restarting is harmless."
It isn't.
Every restart is:
- A forced state reset
- A connection teardown
- A potential cascade trigger
- A temporary partial outage (even if small)
We accepted restarts as "free" because the cost was invisible.
Until it wasn't.
What Replaced Restarting
The industry didn't ban restarts.
It made them unnecessary.
Health checks
Detect problems before users do.
# Kubernetes liveness probe example
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
If the readiness check fails, traffic stops flowing to that instance
If the liveness check fails, the container gets replaced
Users never see the failure
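A note on naming: strictly, it's the readiness probe that stops traffic, while a failing liveness probe tells Kubernetes to replace the container. A minimal readiness probe sketch for the same hypothetical /health endpoint:
# Kubernetes readiness probe sketch (hypothetical /health endpoint)
# Failing it removes the pod from the Service's endpoints,
# so traffic stops; the liveness probe above handles replacement.
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 5
  failureThreshold: 2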
Graceful degradation
Fail partially, not completely.
Cache down? Serve stale data
Database slow? Queue writes, serve reads
Something broke? Everything else keeps running
Automatic replacement
Never restart. Always replace.
Pod dies? New one starts
Node fails? Pods move
Same binary. Clean state. No cascade
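"Always replace" isn't hand-waving. In Kubernetes it's a declared replica count that a controller keeps reconciling. A minimal sketch, assuming a hypothetical myapp image and an arbitrary three replicas:
# Deployment sketch: the controller keeps 3 replicas running.
# If a pod dies, a replacement is scheduled automatically;
# nobody SSHes in to restart anything.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myapp:1.0   # hypothetical image
          ports:
            - containerPort: 8080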
Rolling restarts
One at a time, with verification.
Restart server 1 of 10
Wait for health check
Restart server 2 of 10
Never lose capacity
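In Kubernetes terms, that pattern is a rolling update strategy rather than a hand-written loop. A sketch of the fragment that would sit under the Deployment spec above; with maxUnavailable set to 0, an old pod is only taken down after its replacement passes the readiness check:
# Rolling update sketch: replace one pod at a time,
# never dropping below full capacity.
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
In practice the roll itself is usually triggered with kubectl rollout restart, and the strategy above decides how carefully it proceeds.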
The Systems That Don't Need Restarts
Netflix doesn't restart. It terminates and replaces.
Google doesn't restart. It shifts load and repairs.
Your bank doesn't restart. It fails over to another region.
These aren't magic.
They're design choices.
They assumed from day one that "restart" was not a strategy.
The Honest Confession
I still say "have you tried restarting it?"
Sometimes it's the fastest path to "it works now."
But I don't pretend it's a fix anymore.
It's a diagnostic.
A temporary patch.
A way to buy time until the real problem reveals itself.
The difference is:
I know the difference now.
What You Can Do Monday
For your most critical service:
- Find the last time it was restarted
- Ask: "Why did that restart happen?"
- Ask: "Could we have avoided it?"
If yes, build the automation.
If no, document why (so next time you know).
For your next outage:
- Resist the restart reflex
- Check dependencies first
- Check connections second
- Check logs third
- Restart only when you understand what you're about to break
The Question
When was the last time you restarted something
and didn't know exactly what would happen when it came back?
Be honest.
This is part of a series on operations in the age of distributed systems. Next up: "The Pager Should Not Exist."