DEV Community

Vivian Voss
Vivian Voss

Posted on • Originally published at vivianvoss.net

The Unit That Crossed a Boundary: Mars Climate Orbiter, 1999

A darkened space mission control room at night in cyberpunk-noir style, lit by the cold glow of monitors. A young woman with long dark hair and pink cat-ear headphones, sits at an operator console in three-quarter view, one hand pressed to her headset, her face caught in dawning realisation. On the large screen behind her, a trajectory plot shows two arcs: a green

Tales from the Bare Metal — Episode 04

23 September 1999. The Mars Climate Orbiter fires its main engine to enter orbit around Mars, passes behind the planet as planned, and is never heard from again. The spacecraft cost 193 million dollars to build, part of a 327.6 million dollar mission. It travelled 670 million kilometres across nine months of deep space, and it was lost to a number with no unit written on it.

This is the cleanest example in the engineering record of why a quantity is not the same thing as a number.

The Incident

The Mars Climate Orbiter launched on 11 December 1998. Its job was to enter a stable orbit around Mars and study the planet's atmosphere, also serving as a communications relay for the Mars Polar Lander that would follow.

The mission proceeded normally for nine months. On 23 September 1999, the orbiter executed its Mars Orbit Insertion burn: a planned firing of the main engine to slow the craft enough for Mars' gravity to capture it. The burn was designed to place the orbiter at a closest approach of 226 km above the Martian surface, comfortably above the atmosphere.

The craft passed behind Mars, as expected, and signal was lost, as expected. It was never reacquired. Post-failure reconstruction showed that the trajectory had brought the orbiter to approximately 57 km above the surface, deep within the atmosphere. A spacecraft built for the vacuum of orbit does not survive atmospheric entry at orbital speed. It either burned up or was torn apart; either way, it was gone.

The Mars Climate Orbiter Mishap Investigation Board released its Phase I report on 10 November 1999. The root cause was stated plainly, and it was not a hardware fault, not a navigation error in the usual sense, not a launch problem. It was a unit mismatch at a software interface.

The Diagnosis

The trajectory of a spacecraft is adjusted over its journey by small thruster firings, called Angular Momentum Desaturation manoeuvres among others. Each firing produces an impulse, and the magnitude of that impulse must be fed into the navigation software so the predicted trajectory stays accurate.

Two organisations wrote the two halves of this loop. Lockheed Martin, in Colorado, built the spacecraft and the ground software that calculated the impulse from each thruster firing. NASA's Jet Propulsion Laboratory, in California, ran the navigation software that consumed those impulse figures and computed the resulting trajectory.

Lockheed Martin's software produced the impulse in pound-force seconds. This is an imperial unit: a pound-force is the force exerted by gravity on a one-pound mass, and a pound-force second is that force applied for one second.

JPL's navigation software expected the impulse in newton-seconds. This is the metric (SI) unit: a newton is the force that accelerates one kilogram at one metre per second squared, and a newton-second is that force applied for one second.

One pound-force second equals 4.45 newton-seconds. The two numbers describe the same physical impulse, but the number representing it differs by a factor of 4.45 depending on which unit you mean.

The interface specification between the two systems required metric units. Lockheed Martin's software, for the specific file in question, produced imperial. JPL's software read the imperial numbers as though they were metric, and so every impulse was interpreted as 4.45 times smaller than it actually was. The trajectory corrections were therefore systematically wrong, in the same direction, for nine months. The error accumulated until the predicted 226 km insertion altitude was, in reality, 57 km.

The single most important sentence in the whole account is this: the number was correct. The software did not miscalculate. The bits that crossed the interface were the right bits. What was missing was the label that said what those bits meant, and each side filled in the missing label with its own assumption.

The Context

A unit mismatch sounds like the kind of thing that should be caught in an afternoon. The interesting question is not how the mistake was made, but how it survived nine months and 670 million kilometres without being caught. Three conditions allowed it.

First, the interface specification existed and was clear: it required metric. The imperial output was a deviation from spec. But nothing enforced the specification at runtime. The specification was a document, not a check. A document does not stop a wrong number; it only assigns blame after the wrong number has done its work.

Second, the discrepancy was visible before the loss. Navigators at JPL noticed during the cruise that the trajectory was not behaving quite as the models predicted; small corrections were needed more often than expected. The signal was there in the data, months before arrival. It was discussed informally but never escalated into a formal anomaly investigation, partly because each individual correction was within tolerance and the cumulative drift looked like ordinary navigation noise until it was too late.

Third, there was no end-to-end test. No test ran the Lockheed Martin impulse calculation and the JPL trajectory calculation against the same manoeuvre and compared the result to an independent reference. Each system was tested in isolation and behaved correctly in isolation. The fault lived only in the handoff between them, which is precisely the region that isolated unit tests do not cover.

These three conditions recur in almost every interface failure. The spec is a document not a check; the warning signal is visible but below the escalation threshold; the test coverage stops at the boundary rather than crossing it.

The Principle

Science settled the question of units a century ago. The Système International (SI) is the one coherent measurement system the entire scientific and engineering world shares, precisely so that a number computed by one team means the same thing to another. The Mars Climate Orbiter was lost because one half of the programme had not honoured that settlement: it still computed in pound-force seconds, an imperial unit with no place in serious engineering work.

The lesson is direct. In scientific and engineering contexts, measure in metric. There is no defence for imperial units in work where a factor of 4.45 decides whether a spacecraft enters orbit or enters the ground. Every serious laboratory, the United States included, standardised on SI for exactly this reason, and the orbiter is the canonical demonstration of what the alternative costs. Imperial units are a regional convention for everyday life; they are not a tool for computing trajectories, and the moment a programme treats them as one, it has built a 4.45 into its maths and dared the universe to find it.

There is a second guard worth stating, because metric alone is not sufficient. Two systems both using metric can still fail if one means seconds and the other milliseconds, if one means metres and the other kilometres, if one means bytes and the other kibibytes. So the unit must also travel with the number, enforced where the two systems meet:

  • Types the compiler checks. Rust newtypes let you define a NewtonSeconds(f64) distinct from a PoundForceSeconds(f64); the compiler refuses to pass one where the other is expected. F# has units of measure built into the type system: a value typed float<N*s> cannot be assigned to a float<lbf*s> without an explicit conversion. The mistake becomes a compile error, which is the cheapest place a mistake can possibly be caught.
  • Fields the parser demands. A data interchange format that requires the unit as a mandatory field (not an optional comment) makes the bare number unrepresentable. You cannot serialise the quantity without stating what it is.
  • Names the variable carries. The oldest discipline, available in any language: never write timeout = 30; write timeout_ms = 30 or timeout = Duration::from_secs(30). The unit lives in the name where the next reader cannot miss it.

The first rule needs no tooling: in science, measure in metric. The second rule catches what the first cannot: name the unit at every boundary regardless. The orbiter needed both, and had neither.

On FreeBSD this discipline is visible in the small, ordinary places. dd bs=1M states the block size with its unit; the bare number would be ambiguous. sysctl values are documented with their units, and the tunables in loader.conf name their dimensions. The convention across the base system is that a number with a physical meaning is rarely written naked. This is not glamorous and it has prevented an enormous number of quiet disasters.

The boundary is where the unit must be loudest, because the boundary is exactly where two assumptions meet and discover they disagree.

Where It Travels

The Mars Climate Orbiter is a spacecraft, which makes the failure feel exotic. It is not exotic. The same failure ships in ordinary software every day:

  • Every API that passes a duration as a bare integer. Is 30 seconds or milliseconds? The function name setTimeout(30) does not say, and the two readings differ by a factor of a thousand.
  • Every configuration value that takes a size without a suffix. Is cache_size = 1000 bytes, kilobytes, or entries?
  • Every function signature with a parameter named distance, interval, rate or size and no unit anywhere in the type or the name.
  • Every CSV handed from one team's export to another team's import, where column 7 is "amount" and nobody agreed whether it is gross or net, dollars or cents.
  • Every number that means one thing on the left of an interface and another thing on the right, because the interface carried the number faithfully and the meaning not at all.

The Same Mistake, Smaller: Gigabyte and Gibibyte

The unit mismatch does not even need two systems as different as imperial and metric. It hides inside a single system, in the gap between decimal and binary.

A gigabyte, by the SI definition, is 1,000,000,000 bytes; "giga" is the standard decimal prefix for a billion, the same prefix as in gigahertz or gigawatt. A gibibyte, defined by the IEC in 1998, is 1,073,741,824 bytes: two to the power of thirty, the nearest binary round number. The two differ by about 7.4%, and the gap widens with every step up the ladder: terabyte versus tebibyte is about 10%, petabyte versus pebibyte about 12.6%.

The disk industry sells in decimal. A "1 TB" drive holds 1,000,000,000,000 bytes, exactly as labelled. Many operating systems, Windows most stubbornly, then display that capacity in binary while still calling the result "GB". The drive shows up as roughly 931 "GB", the customer concludes they have been short-changed by seven per cent, and a support ticket is born. Nobody has lied; two definitions of the same word have simply met at a boundary without introducing themselves.

It is the Mars Climate Orbiter with the stakes turned down from a spacecraft to a slightly disappointing disk. The IEC defined a complete binary ladder in 1998, parallel to the decimal one at every rung: kibibyte (KiB) beside kilobyte (kB), mebibyte (MiB) beside megabyte (MB), gibibyte (GiB) beside gigabyte (GB), and on through tebibyte (TiB), pebibyte (PiB) and beyond.

The recommendation applies to the entire scale, and it is the same discipline as everywhere else in this story. If you mean the binary value, write GiB, MiB, TiB. If you mean the decimal value, write GB, MB, TB. Never write one while meaning the other. The standard is correct, unambiguous, and has been available for over a quarter of a century. Use it: prefer the explicit binary prefixes (KiB, MiB, GiB, TiB) wherever you actually mean powers of two, which on a computer is most of the time. The only thing still keeping GB where GiB belongs is habit, and habit is precisely what put pound-force seconds where newton-seconds were expected.

A 193-million-dollar spacecraft was lost because one team did its sums in imperial units. The number was right. The unit was from the wrong century.

The lesson costs nothing to apply. In science and engineering, measure in metric; the rest of the world, and most of science within the imperial holdouts themselves, settled this long ago for exactly the reason the orbiter demonstrates. And where two systems must still meet, name the unit of every number that crosses between them, so that the compiler or the variable name holds what no shared assumption ever safely will.

Read the full article on vivianvoss.net →


By Vivian Voss, System Architect and Software Developer. Follow me on LinkedIn for daily technical writing.

Top comments (0)