Why We Restart to Fix It

#reliability #erlang #freebsd #architecture

On Second Thought — Episode 10

The pager has gone off. Memory on the auth service is climbing in a way it should not be. You SSH in, you observe nothing in particular, you kubectl delete pod. The pod comes back, memory is fresh, the graph flattens. The on-call channel returns to silence. Nobody asks what was wrong.

This is the tenth episode of On Second Thought, a series about the daily routines we perform without ever quite deciding to. Today's routine is the one that runs at the top of half the world's incident response: when the machine misbehaves, restart it. Have you tried turning it off and on again. The line is a joke, until it is a runbook, until it is the production strategy, until it is the only debugging step we still know.

The Axiom

The reflex is universal. A stuck container, a frozen browser tab, a JVM that has been ruminating on a class loader for forty minutes, a Kafka consumer that fell off a partition, a connection pool that quietly stopped reaping idle handles. The runbook for half the world's incidents has three steps, and the first two are window-dressing for the third. We accept this as the natural response to a system that fails, in the same way we accept that traffic jams are simply the price of having cars. On second thought, both deserve a second thought.

The strange thing is not that we restart. Restart is, in the right architecture, a perfectly reasonable response to a particular class of fault. The strange thing is that the restart has become the diagnosis, and that we have built an entire generation of platforms around the assumption that it would.

The Origin

The reflex has two parents, and we mostly inherited only one of them.

The first is the consumer-electronics tradition, codified in the IT Crowd's catchphrase but older than the show. A Sky+ box from 2004, a Windows laptop, a wireless router: a device that has wandered into an unknown state can be returned to a known one only by power-cycling it, because the device offers no language in which to be asked where it went wrong. The assumption was reasonable for the hardware of its time. It was honest about its limits: the only legible interface to the device's interior was the off switch.

The second tradition is the one engineers built deliberately to replace the first, and it is the one we mostly forgot.

In 1986, at Ericsson's Computer Science Laboratory in Stockholm, Joe Armstrong, Robert Virding and Mike Williams began work on what became Erlang, a language for the kinds of telephone exchanges that simply could not be allowed to go down. The constraint was not academic. A switch that dropped calls cost regulatory fines and lost contracts. A switch that ran for a year between reboots was the point. The output of that work, the AXD301 ATM switch, runs on roughly two million lines of Erlang and is the system most often cited for "nine nines" of reliability: on the order of 31 milliseconds of downtime per year. The figure is contested in the way every figure of that shape is contested: whether the measurement was apples-to-apples, whether it included planned maintenance, whether the operational data was systematically collected. The architecture that produced it, however, is uncontested, and it is the architecture that matters here.
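
The 31 milliseconds is plain arithmetic, and it is worth seeing how little room the number leaves. A minimal sketch, assuming a year of 365.25 days; the function name is mine, not anything from Ericsson's reporting:

```python
# Seconds in an average year, counting leap years as a quarter day each.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # 31,557,600


def downtime_ms_per_year(nines: int) -> float:
    """Permitted downtime per year, in milliseconds, for an availability of N nines."""
    unavailability = 10.0 ** -nines  # nine nines -> 0.000000001
    return SECONDS_PER_YEAR * unavailability * 1000.0
```

Called with 9, it returns about 31.6 milliseconds. Called with 5, the figure most service-level agreements aspire to, it returns a little over five minutes.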

Armstrong's principle, on the surface, looked exactly like the consumer tradition: when a process gets into a bad state, terminate it. He called it "let it crash", and the phrase has done more damage to the idea than any critic could. Read as a slogan it sounds like the Sky+ box: when in doubt, kill it. Read as architecture, it is the opposite.

Three properties make it architecture.

First, processes are isolated. An Erlang process is not a thread, and not a coroutine; it has its own heap, its own message queue, and shares nothing mutable with any other process. When one crashes, it cannot corrupt the state of another, because there is no shared state to corrupt. A crash takes itself with it and nothing else.

Second, every worker has a supervisor. The supervisor is not a vague concept; it is a specific process, with a specific role, defined in OTP, the standard set of Erlang libraries. When a worker crashes, the crash is delivered to its supervisor as a message. The supervisor decides what to do.

Third, the supervisor decides according to a written strategy. The strategies have names: one-for-one (restart only the crashed worker), one-for-all (restart all siblings), rest-for-one (restart the crashed worker and every worker started after it). Every supervisor has a maximum restart frequency, and when the frequency is exceeded, the supervisor itself crashes, which delivers the failure to its supervisor, one level up. A failure escalates up a tree, not through a runbook. The rule that handles it was written years before the outage.
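
The three strategies and the restart limit fit in a few dozen lines. What follows is a toy model in Python, not OTP: the class, the Escalate exception, the defaults and the injectable clock are all my own assumptions. It only computes which children get restarted and when to give up; real supervisors also terminate the affected siblings first and deal with actual processes.

```python
import time
from collections import deque


class Escalate(Exception):
    """Raised when the restart limit is exceeded; the parent has to handle it."""


class Supervisor:
    def __init__(self, strategy, children, max_restarts=3, period=5.0,
                 clock=time.monotonic):
        self.strategy = strategy
        self.children = list(children)  # (name, start_fn) pairs, in start order
        self.max_restarts = max_restarts
        self.period = period            # sliding window, in seconds
        self.clock = clock
        self.recent = deque()           # timestamps of recent restarts
        for _, start in self.children:
            start()

    def _affected(self, crashed):
        names = [name for name, _ in self.children]
        i = names.index(crashed)
        if self.strategy == "one_for_one":
            return names[i:i + 1]       # only the crashed child
        if self.strategy == "one_for_all":
            return names                # every sibling
        if self.strategy == "rest_for_one":
            return names[i:]            # the crashed child and all started after it
        raise ValueError(f"unknown strategy: {self.strategy}")

    def child_exited(self, crashed):
        now = self.clock()
        self.recent.append(now)
        # Forget restarts that have fallen out of the window.
        while now - self.recent[0] > self.period:
            self.recent.popleft()
        if len(self.recent) > self.max_restarts:
            # Too many restarts too quickly: stop trying, fail upwards.
            raise Escalate(f"{crashed}: more than {self.max_restarts} "
                           f"restarts in {self.period}s")
        starts = dict(self.children)
        restarted = self._affected(crashed)
        for name in restarted:
            starts[name]()
        return restarted
```

Given children db, cache and api in that order, a rest_for_one supervisor that is told cache has exited restarts cache and api and leaves db alone. A parent supervisor would catch Escalate and treat the whole subtree as one crashed child, which is the escalation the paragraph above describes.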

Let it crash, in its proper form, is not "have you tried turning it off and on again." It is "we have already decided what to do when this fails, and we wrote it down." The restart is the same gesture. The contract underneath is wholly different.

The Cost

What we kept of let-it-crash is the let-it-crash. What we left in Stockholm is the supervisor.

The first cost lands daily. Restart is the diagnosis. The pod comes back, the alert clears, the day is shipped. The cause of the alert is not investigated, because nothing in the response made room for investigation: the runbook said restart, the restart worked, the page closed. A memory leak, file-descriptor exhaustion, lock contention, a queue backing up because a downstream service is throttling: each leaves the same heartbeat-recovery signature on a dashboard, and each needs a different fix. The restart erases the question that distinguishes them. The bug remains exactly as resident in the code as it was before the pager went off, with the small refinement that the team is now slightly more trained to ignore it.

The second cost is structural. We have built whole platforms on the assumption. Kubernetes liveness probes are, in the honest reading, a contract that the orchestrator will rotate the symptoms while the cause goes unexamined. A container that fails its liveness check is killed and restarted; a pod that fails its readiness check is quietly taken out of rotation. There is no concept, in the standard Kubernetes flow, of capturing the dying process's state, of preserving the crash for later inspection, of asking why before the replacement starts. "Self-healing" is the marketing term for this, and it is accurate in the sense that a person who takes paracetamol every four hours has a self-healing headache. The symptom keeps disappearing. The cause has not been touched.
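
The missing step is small. A sketch of what "capture before restart" could look like from inside the service, assuming a Unix-like host and Python's standard library only; the function names and the choice of fields are mine, and a real service would want far more than this (heap profiles, queue depths, a core dump):

```python
import json
import os
import resource
import threading
import time


def snapshot() -> dict:
    """Collect a few cheap facts about this process before it is killed."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    try:
        # /dev/fd lists this process's open descriptors on Linux, macOS and FreeBSD.
        open_fds = len(os.listdir("/dev/fd"))
    except OSError:
        open_fds = None  # not available on this platform
    return {
        "taken_at": time.time(),
        "pid": os.getpid(),
        "max_rss": usage.ru_maxrss,  # unit is platform-dependent (KB on Linux, bytes on macOS)
        "open_fds": open_fds,
        "threads": threading.active_count(),
        "cpu_user_s": usage.ru_utime,
    }


def restart_with_evidence(restart, path):
    """Write the snapshot to disk first, and only then perform the restart."""
    evidence = snapshot()
    with open(path, "w") as fh:
        json.dump(evidence, fh)
    restart()
    return evidence
```

The point is the ordering, not the fields: the evidence is on disk before the restart callback runs, so a leak and a descriptor exhaustion no longer look identical the morning after.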

The third cost is institutional. A team that restarts to fix gets very good at restarting and never gets good at diagnosing. The post-incident review produces a runbook with an additional command. The runbook is consulted next time the pager goes off; the additional command is added; the team's collective intuition about the system shifts from "what is this system actually doing" to "what sequence of recovery steps clears the current alert". In the worst case, the only conjecture anybody had about why the alert ever fired leaves quietly with the last engineer who maintained the service, and the new on-call rotation inherits the runbook but not the model. A few months later the system is misbehaving in a new way that the old runbook does not cover, and nobody is in a position to ask why.

The fourth cost is the one this series exists to point at: we have stopped expecting our systems to be debuggable. The restart was a shortcut, originally; we took it because diagnosing the live system was hard, and the restart was cheap, and the bug was small. We then built more software on top of that shortcut, and more on top of that, until "you cannot reasonably diagnose this in production" stopped being an embarrassment and started being a feature description. Container orchestration is, among other things, a way to ship software that nobody knows in detail and to rotate it fast enough that no one has to.

The Question

There is software in operation that does not work this way, and the alternative is older and quieter than the current default.

WhatsApp serves north of a billion users with around fifty engineers. Its backend is Erlang. The supervisor model from Stockholm runs the company's production. Crashes happen. They are caught by their supervisors. The strategies are written. The escalation tree handles the rest. The engineers do not spend their day power-cycling boxes; they spend it writing the rules under which the boxes manage themselves. It is a small, deliberate team operating a system that, by every comparable measure, ought to require many times its size to keep running. The supervisor architecture is why.

In the unixoid tradition, the FreeBSD base provides the operator's half of the same picture. init and rc.d follow the same model that Stockholm did: explicit start, explicit dependency, explicit recovery. A service has a script that says how it starts and what must be up before it starts and, when it is run under daemon(8), what to do when it dies. When a service on a FreeBSD machine misbehaves, the operator has dtrace to follow what the kernel and user-space code are actually doing, ktrace to record system calls for later inspection, procstat and fstat to read what a process is holding, post-mortem core dumps that survive the crash and can be examined at leisure, and a kernel that will, with some precision, tell you what process held what lock at what time. The reboot is available, on FreeBSD as everywhere else. It is rarely the first reach, because the system is willing to speak, and the operator has been trained to listen.

So the honest question is not whether to keep the restart. The runbooks have it for a reason and they are not foolish. The restart, in a supervisor architecture, is a perfectly normal recovery step. The question is the one we did not write down: in a system that fails, was the restart the answer, or the moment the question got dropped?

A restart, on second thought, is not a tool. It is a measurement. It tells you, with some precision, how much of the cause you decided you could afford to leave unknown.

By Vivian Voss, System Architect & Software Developer.