DEV Community: Mickael Lamare

The $327 Million Implicit Contract

Mickael Lamare — Tue, 14 Jul 2026 06:24:42 +0000

On September 23, 1999, NASA's Mars Climate Orbiter reached Mars after a nine-month journey. It came in far lower than planned, dipped too deep into the atmosphere, and was lost.

The spacecraft worked. The navigation worked. The teams were among the best engineers on the planet. What failed was something so mundane it's almost embarrassing to say out loud:

One team produced thruster data in pound-force seconds. The navigation software expected newton-seconds.

That's it. That's the whole failure. Two systems, two teams, one assumption about units: specified on paper, but never verified and never tested across the boundary. In practice, the contract between the two systems existed only in what each team assumed the other was doing.
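One way to keep a unit assumption out of people's heads is to carry the unit in the type, so a bare number cannot cross the boundary. The sketch below is only an illustration of that idea: `Impulse`, `from_lbf_s`, and `accumulate_impulse` are hypothetical names, not anything from the actual flight or ground software. The conversion factor follows from the exact definition of the pound-force.

```python
from dataclasses import dataclass

# Exact by definition: 1 lbf = 4.4482216152605 N
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """Impulse stored in newton-seconds; the unit is part of the type."""
    newton_seconds: float

    @classmethod
    def from_newton_seconds(cls, value: float) -> "Impulse":
        return cls(float(value))

    @classmethod
    def from_lbf_s(cls, value: float) -> "Impulse":
        # The conversion happens once, at the boundary, in a named place.
        return cls(float(value) * LBF_S_TO_N_S)


def accumulate_impulse(events: list) -> float:
    """Hypothetical consumer: sums thruster impulses, in newton-seconds."""
    for event in events:
        if not isinstance(event, Impulse):
            # A bare number carries no unit, so it is rejected outright.
            raise TypeError(f"expected Impulse, got {type(event).__name__}")
    return sum(event.newton_seconds for event in events)
```

A producer working in pound-force seconds has to call `Impulse.from_lbf_s(...)` to get a value the consumer will accept, and passing a raw `2.0` fails immediately instead of being silently read as newton-seconds, a factor of roughly 4.45 off.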

Total cost: $327.6 million for the Mars Surveyor '98 program the Orbiter flew under, and years of lost scientific opportunity.

You have this exact bug in production right now

I don't know your system, but I'd bet on this: somewhere in it, two services are communicating through an agreement that exists only in someone's head.

A consumer that assumes a timestamp is UTC, while the producer sends local time. A field that one team treats as nullable and another treats as guaranteed. An amount in cents on one side, in euros on the other. An event schema that "everyone knows" — meaning the two people who wrote it three years ago, one of whom has left.

These are all the same failure as the Orbiter: an undocumented assumption at a module boundary, undetected until the worst possible moment.

In software the cost is measured in outages and regressions rather than orbital disasters, but the root cause is identical — and so is its invisibility. Implicit contracts are accurate at the moment of creation and drift silently from that moment forward.

Why "good engineers" doesn't protect you

The standard takeaway from the Orbiter story is "test more" or "communicate better." Both miss the point. NASA tested extensively. NASA communicated constantly. The failure survived all of it, because testing and communication operate on what people know to check — and nobody checks an assumption they don't know they're making.

The Lockheed Martin team didn't decide to use imperial units against spec. They built to their own internal conventions, the way every team does. The navigation team didn't skip verification. They verified against their own understanding of the interface, the way every team does. Each side was locally correct. The system was globally wrong.

This is why implicit contracts are the most dangerous artifact in a distributed system: they fail at the boundary, where neither side is looking, and the failure manifests far from its cause — in the consumer, in production, months later.

The fix costs an hour. The alternative costs the mission.

The defense is not heroics. It's a discipline: no integration between two modules without a prior, written, versioned contract. What the interface receives. What it returns. What units, what formats, what error behaviors, what guarantees. Agreed before implementation starts, not reverse-engineered after the incident.

For a typical service interface, that's a one-page schema. An hour of work, maybe two with review. Against that hour, weigh what the implicit alternative costs when it fails: days of cross-team debugging per incident, multiplied by every integration point in your system, multiplied by the years the assumption sits there waiting.
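As a rough illustration, a one-page contract of this kind can be made executable with nothing beyond the Python standard library. Everything named here is an assumption invented for the example: the `PaymentEventV1` type, its fields, and the `CONTRACT_VERSION` string. The point is only that units, formats, nullability, and version are stated once and checked at the boundary.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

CONTRACT_VERSION = "1.0.0"  # hypothetical; bumped on any breaking change


@dataclass(frozen=True)
class PaymentEventV1:
    """Hypothetical contract for one event crossing a service boundary."""
    occurred_at: datetime         # timezone-aware, UTC
    amount_cents: int             # integer minor units, never a float of euros
    currency: str                 # three-letter code such as "EUR"
    customer_note: Optional[str]  # explicitly nullable


def parse_payment_event(payload: dict) -> PaymentEventV1:
    """Validate a raw payload against the contract, failing loudly at the boundary."""
    version = payload.get("contract_version")
    if version != CONTRACT_VERSION:
        raise ValueError(f"unsupported contract_version: {version!r}")

    occurred_at = datetime.fromisoformat(payload["occurred_at"])
    if occurred_at.utcoffset() != timedelta(0):
        # Naive timestamps (no offset) and local-time offsets are both rejected.
        raise ValueError("occurred_at must carry an explicit UTC offset")

    amount = payload["amount_cents"]
    if isinstance(amount, bool) or not isinstance(amount, int):
        raise ValueError("amount_cents must be an integer number of cents")

    currency = payload["currency"]
    if not (isinstance(currency, str) and len(currency) == 3 and currency.isupper()):
        raise ValueError("currency must be a three-letter uppercase code")

    note = payload.get("customer_note")
    if note is not None and not isinstance(note, str):
        raise ValueError("customer_note must be a string or null")

    return PaymentEventV1(occurred_at, amount, currency, note)
```

A payload with `"occurred_at": "2026-07-14T08:24:42"` (no offset) or `"amount_cents": 19.99` is rejected by the consumer on arrival, rather than months later in a report nobody can reconcile.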

NASA's Mishap Investigation Board reached the same conclusion. The recommendations were not "hire better engineers." They were about process and verification at the interfaces: making the implicit explicit, structurally, so that correctness no longer depended on two teams happening to share an assumption.

A contract defined up front takes hours. A contract discovered through failure takes weeks — or, at sufficient altitude, a spacecraft.

Chaos Is a Threshold, Not an Accident

Mickael Lamare — Tue, 30 Jun 2026 10:04:46 +0000

When a distributed system descends into chaos — cascading incidents, unpredictable regressions, changes that break things nobody thought were connected — we tend to treat it as bad luck. A rough quarter. Too much pressure. The wrong hire.

It isn't luck. It's mathematics. And the mathematics says something uncomfortable: your system was always going to end up there. The only question was when.

A population equation that explains your codebase

In 1976, the biologist Robert May published a study of a deceptively simple equation describing population dynamics. One variable, one parameter: the growth rate, which you can read as the rate of interaction within the system.

What May demonstrated changed how we understand complex systems. Increase that single parameter gradually, and the system's behavior doesn't degrade gradually with it. It transforms in distinct phases:

At low values, the system is stable. Perfectly predictable.
Push the parameter past a first threshold, and the system bifurcates — it starts oscillating between states. Still predictable, but no longer simple.
Push further, and the oscillations multiply. Two states become four, four become eight.
Then, past a critical threshold: chaos. Genuine, mathematical chaos — deterministic rules producing behavior that is, in practice, unpredictable.

Here is the crucial point: the rules never changed. No external shock, no failure, no bad actor. The same simple equation, the whole way through. The only thing that changed was the rate of interaction. The structure crossed a threshold.
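The equation May studied is the logistic map: the next value is r times the current value times one minus the current value. The phases described above can be reproduced in a few lines. This is a minimal sketch; the helper name `long_run_states`, the starting value, the rounding precision, and the four sample values of r are choices made for this illustration. For reference, the first bifurcation sits at r = 3 and the onset of chaos near r = 3.57.

```python
def long_run_states(r: float, x0: float = 0.2, warmup: int = 2000,
                    sample: int = 64, digits: int = 4) -> list:
    """Iterate x -> r * x * (1 - x) and return the distinct values it settles into."""
    x = x0
    for _ in range(warmup):   # discard the transient
        x = r * x * (1 - x)
    seen = set()
    for _ in range(sample):   # record where the system keeps landing
        x = r * x * (1 - x)
        seen.add(round(x, digits))
    return sorted(seen)


for r in (2.8, 3.2, 3.5, 3.9):
    states = long_run_states(r)
    label = f"{len(states)} state(s)" if len(states) <= 8 else "no repeating pattern"
    print(f"r = {r}: {label}")
```

With these settings the four runs report one state, two states, four states, and then no repeating pattern, even though the update rule is identical in every run. Only r changed.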

This is not a metaphor

I want to be precise here, because the comparison usually gets made loosely: this is not a metaphor for software systems. It is a description of them.

Every module added without a contract is an increment of that parameter.
Every dependency left implicit is an increment of that parameter.
Every ownership gap tolerated is an increment of that parameter.

Each one feels harmless in isolation — and is. A team of five with eight services and a few informal agreements is far below the threshold. The system behaves. Intuition is enough. Everyone holds the dependency graph in their head, more or less correctly.

Then the system grows. More modules, more integrations, more people, more assumptions. The rate of interaction climbs — silently, because no individual increment is visible. And one day the system bifurcates: a deploy in one service breaks another nobody connected to it. Then oscillation: incidents start arriving in patterns nobody can explain. Then the threshold.

After that, every change is a gamble. Not because your engineers got worse. Because the structure crossed a line, and on the other side of that line, intuition stops working.

You cannot reduce the complexity. You can govern the threshold.

The instinctive response is to simplify — fewer services, fewer teams, fewer moving parts. Sometimes that's right. Usually it isn't possible: the complexity is essential, driven by what the business actually needs the system to do.

But May's equation points to a different lever. Chaos doesn't emerge from complexity itself. It emerges from ungoverned interaction. The parameter that matters isn't how many parts you have — it's how those parts are allowed to interact.

That's what structure does. Explicit, versioned contracts between modules turn unpredictable interactions into bounded ones. Named ownership turns ambiguous responsibility into immediate response. An audit trail turns archaeology into lookup.

None of this reduces the number of parts. The complexity stays. What changes is that the interactions are constrained, declared, and visible — which keeps the system navigable on the far side of the threshold.

The complexity stays. The bifurcation happens. The system stays navigable.

The threshold arrives earlier than you think

The dangerous property of this dynamic is that the system gives almost no warning. Below the threshold, everything works, and governance feels like overhead. The moment it stops working, you're already past the point where adding governance is easy — you're adding it during incidents, during growth, during turnover. The hardest possible moment.

Which means the rational time to draw the boundaries, name the owners, and write the contracts is precisely when it feels unnecessary.

That's not bureaucracy. That's reading the equation.

I'm writing a book about this — the structural conditions under which complex systems stay navigable, and the governance framework I built around them.


Three days to find who owned a failing service

Mickael Lamare — Tue, 16 Jun 2026 08:53:55 +0000

The incident started the way they always do: a monitoring alert, then another, then the dashboard turning red faster than anyone could acknowledge the pages.

P1. Highest severity. Production down. All hands.

The platform was not small. Thousands of APIs in production, a second event-driven platform on top, four environments, a cloud footprint to match. By any external measure, a sophisticated, mature engineering organization. The kind of place where you'd assume that when something breaks, somebody knows whose job it is to fix it.

Nobody did.

The question that took three days to answer

The first hours of the incident were not spent debugging. They were spent answering a question that should never need asking: whose system is this?

The investigation pulled in Microsoft support, network engineers, and a rotating cast of developers who each owned a piece of the picture but never the whole. Everyone could explain their fragment. Nobody could explain the path a request actually took from one end to the other.

What eventually emerged was worse than a bug. Load balancer traffic routed through the United States — when legal requirements mandated it stay within European borders. Somewhere in the chain, data was leaving the private network and crossing the public internet. Microsoft's position was clear and entirely fair: once traffic exits the private network, it is outside their responsibility.

The fix, when it finally came, was a DNS configuration change. In the organization's own infrastructure. One record.

Three days. A P1. An international legal exposure. Resolved by changing a DNS entry that someone should have owned, documented, and validated from day one.

The root cause was not technical

Here is the uncomfortable part: the engineers were talented. The tools were modern. The architecture had ambition. Nothing about this incident required more skill, better tooling, or smarter people.

The root cause was the absence of a named owner for a precise perimeter — and the absence of a contract that defined what that perimeter was supposed to guarantee.

Nobody had drawn the line and put a name next to it. So when the line broke, the organization spent three days discovering where the line even was. With production down across a platform of that scale, the cost was conservatively in the millions of euros. Not because anyone was incompetent. Because ownership was structurally ambiguous, and ambiguity has a price that comes due at the worst possible moment.

"Whose problem is this" is a design flaw

In every ungoverned system I've worked in, the first question after an incident is never how do we fix this. It's whose problem is this. And the answer is rarely obvious, because ownership was never assigned — it was assumed. It drifted with reorganizations, evaporated with departures, and dissolved in the spaces between teams where nobody had claimed jurisdiction.

This is the industry default. Ownership gaps don't announce themselves. They accumulate silently, invisible on every roadmap, until an incident forces them into the open. By then, the cost of the gap isn't the time to fix the bug. It's the time to find out who should.

The test is simple. Pick any service in your production environment, right now, and ask: if this fails at 3am, is there a single named person who is unambiguously accountable for it — not a team, not a channel, a name? If you hesitate, you have the same gap. You just haven't paid for it yet.
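That test can be run mechanically over a whole catalog rather than one service at a time. The sketch below assumes a hypothetical catalog shaped as a dictionary of service name to metadata with an `owner` field; the service names, the owner values, and the crude rule for spotting a team or channel (a leading `#`, or a name ending in "team") are all invented for illustration and would need adapting to however ownership is actually recorded.

```python
def unowned_services(catalog: dict) -> list:
    """Return the services that lack a single, named, individual owner."""
    gaps = []
    for name, entry in sorted(catalog.items()):
        owner = (entry.get("owner") or "").strip()
        # Crude heuristic (an assumption): channels and teams are not a name.
        looks_like_group = owner.startswith("#") or owner.lower().endswith("team")
        if not owner or looks_like_group:
            gaps.append(name)
    return gaps


# Hypothetical catalog entries, for illustration only.
catalog = {
    "billing-api": {"owner": "J. Doe"},
    "edge-dns": {"owner": "#platform-oncall"},
    "event-bus": {"owner": "Integration Team"},
    "load-balancer": {},
}

print(unowned_services(catalog))
# ['edge-dns', 'event-bus', 'load-balancer']
```

Anything the function returns is a service where the 3am question currently has no answer.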

What would have changed

A named owner for the network perimeter. A documented contract stating what that boundary guaranteed — including where traffic was allowed to flow. That's it. Not a new platform, not a reorg, not a methodology with a logo. A name and a written guarantee.
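To make "a name and a written guarantee" concrete, here is one possible shape for such a contract, written as a sketch under stated assumptions. `PerimeterContract`, `Hop`, and `violations` are hypothetical names, the region labels and hostnames are placeholders, and how the hops of a real request path get collected is deliberately left out of scope.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerimeterContract:
    """Hypothetical written guarantee for one network perimeter."""
    name: str
    owner: str                    # a named person, not a team
    allowed_regions: frozenset    # where traffic is allowed to flow
    private_network_only: bool = True


@dataclass(frozen=True)
class Hop:
    """One observed hop of a request path."""
    host: str
    region: str
    on_private_network: bool


def violations(contract: PerimeterContract, path: list) -> list:
    """List every hop that breaks the guarantee, in path order."""
    problems = []
    for hop in path:
        if hop.region not in contract.allowed_regions:
            problems.append(f"{hop.host}: region {hop.region} not allowed")
        if contract.private_network_only and not hop.on_private_network:
            problems.append(f"{hop.host}: traffic left the private network")
    return problems


# Placeholder values, for illustration only.
contract = PerimeterContract(
    name="eu-ingress",
    owner="J. Doe",
    allowed_regions=frozenset({"eu-west", "eu-north"}),
)
path = [
    Hop("gateway.example.internal", "eu-west", True),
    Hop("lb.example.net", "us-east", False),
]
for problem in violations(contract, path):
    print(f"{problem} (escalate to {contract.owner})")
```

Run against an observed path on a schedule, a check like this turns the routing question into a lookup with a name attached, instead of a discovery made during the incident.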

The incident would still have happened — entropy doesn't ask permission. But the three days of archaeology would have been thirty minutes of escalation, because the question whose system is this would have had an answer before the incident, not during it.

Every decision has a consequence. Every unowned perimeter is an accountability gap accumulating interest. The question is never whether the consequences arrive. They always do. The only question is whether you'll be in a position to contain them when they do.

I'm writing a book about this — how distributed systems lose ownership, why it always ends in an incident, and the governance framework I built to prevent it.
