The Terrifying Pattern That Keeps Repeating Every Time We Trust a Single Signal With Our Lives
The history of catastrophic technological failure is, at its core, the history of systems that were designed to be certain when they should have been calibrated.
Therac-25 was a radiation therapy machine installed in North American hospitals beginning in the early 1980s. It was more advanced than its predecessors, and its safety relied on software rather than the hardware interlocks of earlier models.
Between 1985 and 1987, it delivered massive radiation overdoses to at least six patients. Three died. The cause, identified in Nancy Leveson and Clark Turner's landmark 1993 case study, was not that the software failed unpredictably. It was that the software acted with perfect confidence on single-source input that was, in specific race condition scenarios, wrong.
A race condition. In a medical device. In 1987.
The device's control software had a timing vulnerability — a race condition — where under specific operator input sequences, the system's state could become inconsistent with physical reality. The machine would calculate that it was in a certain configuration when physically it was in a different one. It then administered radiation therapy based on the calculated state rather than the actual state. The patient received a dose calibrated for a configuration that didn't exist.
Race conditions in computing are not exotic. They are among the most common and most dangerous failure modes in concurrent and distributed systems. And the specific race condition that defines the ghost offline problem in IoT monitoring—a disconnect event arriving after a reconnect event because the two events traveled different network paths with different latency—is, structurally, the same class of failure that Leveson and Turner documented in the Therac-25 case.
A system receiving events in an order that does not reflect physical reality. A system acting with confidence on that incorrect order. Consequences that scale with how consequential the automated decisions are. The IoT industry has been building systems with unmitigated race conditions in their device state processing architectures for fifteen years.
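The ghost offline race described above can be sketched in a few lines. This is an illustrative toy, not any platform's actual code: the names (`Event`, `apply_naive`, `apply_ordered`) and the two-event scenario are assumptions made for the example. It contrasts a last-write-wins handler, which trusts arrival order, with one that discards events older than the last state it committed.

```python
# Sketch of the "ghost offline" race: a disconnect event arriving after
# the reconnect it physically preceded, because the two events took
# different network paths with different latency.
from dataclasses import dataclass

@dataclass
class Event:
    device_id: str
    kind: str           # "disconnect" or "reconnect"
    occurred_at: float  # device-side timestamp (seconds)

def apply_naive(state: dict, ev: Event) -> None:
    """Last-write-wins: trusts arrival order, not occurrence order."""
    state[ev.device_id] = "offline" if ev.kind == "disconnect" else "online"

def apply_ordered(state: dict, ev: Event) -> None:
    """Discards any event older than the last one applied for this device."""
    _, last_ts = state.get(ev.device_id, (None, -1.0))
    if ev.occurred_at <= last_ts:
        return  # stale event: the race-condition case, silently dropped
    new_state = "offline" if ev.kind == "disconnect" else "online"
    state[ev.device_id] = (new_state, ev.occurred_at)

# Arrival order is inverted: the reconnect (t=10.0) arrives before the
# disconnect (t=9.5) that physically preceded it.
arrival_order = [Event("pump-1", "reconnect", 10.0),
                 Event("pump-1", "disconnect", 9.5)]

naive, ordered = {}, {}
for ev in arrival_order:
    apply_naive(naive, ev)
    apply_ordered(ordered, ev)

print(naive["pump-1"])       # "offline" -- the ghost offline
print(ordered["pump-1"][0])  # "online"  -- matches physical reality
```

Even this minimal ordering check requires a trustworthy device-side timestamp, which real deployments often lack; that is why the problem calls for arbitration across multiple signals rather than a single comparison.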
MCAS: The Cost of One Signal, No Corroboration
In October 2018, Lion Air Flight 610 departed Jakarta carrying 189 passengers and crew. Thirteen minutes after takeoff, it struck the Java Sea.
In March 2019, Ethiopian Airlines Flight 302 fell from the sky six minutes after takeoff from Addis Ababa. 346 people died across the two crashes. The root cause, documented exhaustively in subsequent investigations, was architectural.
The Boeing 737 MAX's Maneuvering Characteristics Augmentation System, MCAS, was designed to prevent aerodynamic stall by automatically pushing the aircraft's nose down when its Angle of Attack sensor indicated excessive pitch. The aircraft had two AoA sensors.
MCAS used one.
It used one AoA sensor, trusted unconditionally, without requiring corroboration from the second sensor mounted on the other side of the aircraft's nose, as the sole input to an automated system making repeated, forceful nose-down corrections that pilots, untrained in MCAS's existence, could not override in time.
A Congressional investigation found that Boeing's engineers had documented the single-sensor dependency as a single point of failure in 2015. The information was not acted upon. The aircraft was certified. It flew. The single sensor malfunctioned. The system acted with complete confidence on the malfunction. 346 people died.
The lesson is not that sensors fail. Sensors fail. The lesson is that automated systems making consequential physical decisions must evaluate the confidence of their inputs before acting — must require corroboration, must maintain calibrated uncertainty, must not act with equal conviction on high-quality evidence and degraded evidence.
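The corroboration requirement can be made concrete with a minimal gate in front of any automated action. The function name, tolerance value, and confidence labels below are illustrative assumptions, not a real API: the point is only that agreement between independent sources is checked before the system is allowed to act.

```python
# A minimal corroboration gate: act only when two independent sources agree.
def corroborated_reading(primary, secondary, tolerance):
    """Return (value, confidence) for a pair of redundant sensor readings."""
    if abs(primary - secondary) <= tolerance:
        # Sources agree within tolerance: average them, act with confidence.
        return (primary + secondary) / 2, "high"
    # Sources disagree: no trustworthy value, automated action is withheld.
    return None, "degraded"

value, confidence = corroborated_reading(12.1, 11.9, tolerance=1.0)
print(value, confidence)   # roughly 12.0, with confidence "high"

value, confidence = corroborated_reading(74.5, 12.0, tolerance=1.0)
print(value, confidence)   # None, "degraded": escalate instead of acting
```

The design choice worth noting is the failure posture: when the sources disagree, the gate returns no value at all rather than picking one, forcing the caller to degrade gracefully or escalate to a human.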
The IoT monitoring system that acts on device state events without evaluating their ordering correctness or signal quality is MCAS without the second AoA sensor.
It is Therac-25's race condition, distributed across seventeen billion sensors.
The Third Pattern: CrowdStrike, 2024
On July 19, 2024, CrowdStrike pushed a flawed content update to its Falcon endpoint detection and response software. The defect caused Windows hosts to crash to Microsoft's "Blue Screen of Death." In all, roughly 8.5 million Windows devices were affected worldwide, disrupting sectors as diverse as aviation, finance, and healthcare.
The CrowdStrike outage is not, on its surface, an IoT device state story. It is a software update validation story. But the infrastructure failure it reveals is identical in structure to the IoT arbitration gap.
A system responsible for monitoring the state of millions of devices—the CrowdStrike Falcon sensor monitoring endpoint health—was updated in a way that made it incapable of accurately reporting device state.
The monitoring layer failed.
The downstream systems that depended on accurate device state information—flight operations systems, hospital management systems, financial trading infrastructure—had no mechanism for evaluating whether the device state information they were receiving corresponded to physical reality.
The CrowdStrike forensic timeline is instructive. The faulty update went live at 04:09 UTC. CrowdStrike identified the problem and reverted it at 05:27 UTC—seventy-eight minutes later.
But by the time the reversion was deployed, the device state information reaching downstream systems for those 8.5 million endpoints was a mix of accurate reports from unaffected devices, crash reports from affected devices, and reconnection events from devices that had undergone the crash-and-restart cycle.
The downstream systems processed this mix without any capability for evaluating ordering correctness or evidence quality. The question that should haunt every IoT architect:
in the seventy-eight minutes between 04:09 and 05:27 UTC on July 19, 2024, how many automated systems made decisions based on device state information they had no way to verify was correctly ordered and accurately reported?
How many production systems paused or rerouted?
How many clinical workflows were interrupted?
How many logistics operations mis-scheduled?
*Nobody knows. The number was never measured.*
The measurement wasn't possible, because the arbitration layer that would have flagged low-confidence events during the outage window didn't exist.
The Pattern Recognition That Changes Everything
Therac-25. MCAS. CrowdStrike. Three events separated by decades, in different domains, using different technologies. The unifying pattern: Automated systems making consequential decisions on the basis of single-source input that was accepted as ground truth without corroboration or confidence evaluation.
Professor Sanjit A. Seshia of UC Berkeley, whose formal methods research has produced some of the most rigorous frameworks for designing dependable cyber-physical systems, describes the goal of "verified AI" as ensuring that automated systems have "strong, ideally provable, assurances of correctness with respect to formally-specified requirements."
The correctness requirement for a device state monitoring system is straightforward:
the device state committed to the historian should correspond, with measured confidence, to the device's actual physical state at the time of the event.
This requirement is not currently met by any major IoT monitoring platform as a standard feature. It is met only by deployments that have added an arbitration layer between event receipt and state commitment — that have built the equivalent of MCAS's second AoA sensor, the Therac-25's hardware interlock, the CrowdStrike validation gate, into their IoT monitoring stack.
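One shape such an arbitration layer could take is a scoring gate between event receipt and historian commit. Everything in this sketch is an illustrative assumption, not SignalCend's actual model or any vendor's API: the signal names, the weights, and the 0.7 threshold are placeholders for whatever a real deployment would calibrate.

```python
# Hedged sketch of an arbitration gate: each device state event is scored
# against independent signals, and low-confidence state changes are
# quarantined instead of committed to the historian.
COMMIT_THRESHOLD = 0.7  # illustrative cutoff, would be calibrated in practice

def score_event(ev: dict, last_committed_ts: float) -> float:
    """Combine independent signals into a single confidence score in [0, 1]."""
    score = 1.0
    if ev["occurred_at"] <= last_committed_ts:
        score -= 0.5  # out of order relative to committed state: likely ghost
    if ev.get("transport_delay_s", 0.0) > 30.0:
        score -= 0.2  # long network delay: higher chance of reordering
    if not ev.get("heartbeat_agrees", True):
        score -= 0.3  # corroborating channel disagrees with this event
    return max(score, 0.0)

def arbitrate(ev: dict, last_committed_ts: float,
              historian: list, quarantine: list) -> None:
    conf = score_event(ev, last_committed_ts)
    if conf >= COMMIT_THRESHOLD:
        historian.append((ev["state"], conf))   # commit with measured confidence
    else:
        quarantine.append((ev["state"], conf))  # hold for review, do not act

historian, quarantine = [], []
good = {"state": "online", "occurred_at": 100.0, "transport_delay_s": 1.0}
ghost = {"state": "offline", "occurred_at": 95.0, "transport_delay_s": 45.0,
         "heartbeat_agrees": False}
arbitrate(good, 90.0, historian, quarantine)    # committed
arbitrate(ghost, 100.0, historian, quarantine)  # quarantined

print(historian)   # [("online", 1.0)]
print(quarantine)  # [("offline", 0.0)]
```

The structural point is that the historian never receives a bare state: every committed entry carries the confidence it was committed with, which is exactly what downstream systems lacked during the CrowdStrike outage window.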
More than 180,000 publicly reachable, unique IPs are tied to the 13 most common ICS/OT protocols as of the Bitsight TRACE team's latest assessment, with global exposure growing month over month toward 200,000 IPs.
Each of those exposed systems is a physical installation that depends on IoT monitoring for operational continuity.
Each is subject to the event ordering non-determinism that has produced ghost offline events, false production stops, and corrupted audit trails.
Each is, without an arbitration layer, operating with a single-signal device state architecture whose failure mode is historically documented, technically understood, and stubbornly unaddressed.
SignalCend's five-signal arbitration model is that second AoA sensor. The question is whether the industry builds it now, or waits for its own version of the 346.