This article is part of the “When Systems Age” series exploring how assumptions, automation, and architectural drift shape modern system failures.
There's a particular kind of silence before a large system fails. Not the silence of nothing happening — the opposite, actually. Everything is humming. Metrics are green. The on-call engineer is halfway through a cup of coffee. And somewhere in the stack, a belief that has been quietly wrong for eleven months is about to introduce itself.
I've been in that silence enough times to stop blaming the proximate cause.
When the postmortem gets written, it will name a culprit. An expired certificate. A deployment that bumped a queue depth past an undocumented limit. A retry storm that someone designed on purpose, for a topology that no longer exists. The culprit is real. The finding is technically accurate. And it misses the point almost entirely.
What Actually Ages
Software doesn't rot the way people mean when they say it rots. The bits don't degrade. What rots is the correspondence — the relationship between the system as built and the world it was built to operate inside.
Every non-trivial architecture is, at its foundation, a collection of propositions. Not written ones, usually. The propositions live in configuration files and runbook assumptions and the spatial memory of engineers who've since moved to other companies. They are things like: this service will never be called more than a thousand times per second, so we sized the thread pool at thirty. Or: these two systems are always deployed together, so we didn't bother with a version handshake. Or, more dangerously: we trust everything inside this subnet because in 2019 only internal services could reach it.
Propositions like those are not bugs. They were correct. The thread pool was fine for years. The coupling was invisible until deployment pipelines diverged. The trust boundary made perfect sense when the perimeter was coherent.
What changes is reality. The propositions don't update themselves.
This is what I mean by assumption drift, and it differs from technical debt in ways that matter. Technical debt you can see, at least in principle — deprecated dependencies, test coverage gaps, a module nobody's touched since the original author left. Assumption drift is submerged. The code looks fine. The architecture diagram (if one exists at all) looks consistent. The system continues to function. But the gap between what the system believes about its environment and what is actually true has been widening for months, and nobody has been measuring it, because the metrics you'd need to measure it are not the metrics anyone thought to instrument.
The Seduction of Stability
Here's the uncomfortable part: the systems most likely to harbor deep assumption drift are the ones that have been working reliably for a long time.
Stability is, perversely, a form of concealment. When a system processes requests correctly for eighteen months, you stop looking at it carefully. You stop interrogating its dependencies. You stop asking whether the access patterns from five years ago still describe what the system is actually doing. You let the workarounds calcify into infrastructure. You let the temporary IAM permissions become permanent because nobody can remember what breaks if you revoke them, and the cost of finding out feels higher than the cost of carrying them.
I've watched teams operate a service in production for two years without anyone being able to precisely answer the question: what does this thing actually depend on? Not what the documentation says. What actually happens when you trace the calls. The documentation was written during the initial deployment, before two major refactors and a cloud migration that moved three downstream dependencies to a different availability zone. Nobody updated the docs because the service kept working.
This is the operational equivalent of flying on instruments in clear weather. Everything looks fine from the cockpit. You stop cross-checking.
And then the weather changes.
Automation's Hidden Clause
Automation is where assumption drift becomes genuinely dangerous, because automation scales whatever belief is encoded inside it — correct or not.
Think about what a deployment pipeline is, mechanically. It's a set of assertions about what a valid deployment looks like, operationalized as code. It checks certain things, skips others, applies changes to systems in a specific order predicated on a specific understanding of how those systems relate to each other. When that pipeline was written, the understanding was accurate. The checks were sufficient. The order was correct.
Now imagine the architecture has evolved — new services added, dependencies reshuffled, a critical stateful component moved from one team's ownership to another's without the pipeline being updated to reflect the new blast radius. The pipeline doesn't know any of this. It executes. It marks itself successful. It has applied an outdated model of the system to a system that no longer matches that model, at machine speed, with no hesitation.
A human operator making the same series of decisions would at least pause at an unexpected output. They might notice that something seems off. They might ask a question. Automation doesn't ask questions. It doesn't feel the slight wrongness that precedes a bad outcome. It just scales.
This is not an argument against automation — I'd sooner argue against gravity. It's an argument for treating automation as a bet: you're betting that the assumptions baked into the automation remain valid. And like any bet, you need to know what you've wagered and how often to reassess whether the odds have shifted.
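To make the bet concrete, here is a minimal sketch (in Python, with hypothetical names and checks) of one way to operationalize it: automation that re-validates its encoded beliefs as a preflight step, and refuses to run at machine speed on a model that no longer holds.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Assumption:
    """A belief the automation depends on, paired with a run-time check."""
    description: str
    still_holds: Callable[[], bool]

def run_with_preflight(assumptions: list[Assumption], action: Callable[[], None]) -> None:
    """Re-check every encoded belief before acting; fail loudly if any expired."""
    stale = [a.description for a in assumptions if not a.still_holds()]
    if stale:
        raise RuntimeError(f"assumptions no longer hold, refusing to run: {stale}")
    action()
```

The mechanism is trivial; the value is that the wager becomes a list of named, checkable objects instead of implicit structure buried in the pipeline's control flow.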
Why Postmortems Miss It
The postmortem process, as practiced at most organizations, has a structural bias toward symptoms.
It's not malicious. It's practical. When a system has just failed, the people in the room have real obligations — restore service, communicate with stakeholders, identify the immediate change that unlocked the failure. Those are legitimate priorities. And they produce a legitimate artifact: a clear account of what happened, with action items that address the thing that happened.
The problem is that the thing that happened is usually not the thing that caused the failure. The thing that happened was a triggering event — it found a vulnerability. But the vulnerability was there before the trigger arrived, and it will still be there, in some form, after the trigger is patched.
I've been in postmortems where we correctly identified a misconfigured circuit breaker as the root cause, wrote an action item to fix the configuration, fixed it, and closed the ticket. Eight months later we had a substantively identical failure: a different triggering event found the same underlying belief — that a particular downstream service would recover within a thirty-second window — which was no longer true after that service was migrated to a platform with slower cold-start behavior.
We fixed the circuit breaker twice. We never questioned the thirty-second assumption. Why would we? The assumption wasn't in the postmortem. It wasn't a named thing. It was just how the system thought about the world.
What Breaks, Concretely
Let me be specific about failure modes, because generality is the enemy of action.
Trust boundaries. In 2019, many organizations built service meshes or network policies around the assumption that the internal network was trustworthy, that mutual TLS was overhead not worth paying, that the implicit perimeter was sufficient. By 2022, those same organizations had contractors, acquired companies, multi-cloud deployments, and third-party SaaS tools with internal network access through various integration mechanisms. The perimeter was meaningless. The trust boundaries hadn't moved. The systems designed to trust each other still trusted each other — including the ones that shouldn't.
Ownership maps. A service is owned by a team. The team reorganizes. Some members go to a new team, some go to a different area. The service gets assigned to whoever seems closest. The runbook has the previous team's Slack channel. The alert routing goes to a queue nobody checks on weekends. This is not hypothetical — I have personally been the on-call person for a service I had never touched, because the alert routing hadn't been updated through two reorgs and the monitoring system didn't distinguish between "team owns this alert" and "someone at this company owns this alert."
Capacity assumptions. A cache was sized for a certain volume. That volume was sensible in 2021, when the feature was new and the user base was limited. The feature grew. The cache was never resized because it was working — hit rates looked fine in aggregate. They looked fine in aggregate because the cache was still hitting on the popular items; the long tail was going to the database, which was quietly absorbing the difference, which nobody noticed until a new user cohort with different access patterns inverted the hit rate and the database began responding with five-second latencies instead of fifteen-millisecond ones.
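The aggregate-versus-cohort mechanics are easy to demonstrate. A sketch with invented numbers: the blended hit rate looks healthy while one cohort is quietly missing the cache eight times out of ten.

```python
from collections import defaultdict

def hit_rate(events):
    """events: (cohort, hit) pairs. The aggregate view everyone watches."""
    return sum(1 for _, hit in events if hit) / len(events)

def hit_rate_by_cohort(events):
    """Same data, segmented — the view that would have shown the drift."""
    buckets = defaultdict(list)
    for cohort, hit in events:
        buckets[cohort].append(hit)
    return {c: sum(hits) / len(hits) for c, hits in buckets.items()}
```

With 100 requests from an established cohort at a 95% hit rate and 10 from a new cohort at 20%, the aggregate still reads near 88% — green on any dashboard, while the database absorbs the new cohort's misses.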
Interface contracts. Service A expects a response field from Service B. Service B deprecated that field eighteen months ago, but kept emitting it for backward compatibility. A new version of Service B, deployed quietly during a maintenance window, didn't emit it anymore — the compatibility shim had been removed as part of a codebase cleanup. Service A started returning malformed data to its own callers. Nobody caught it immediately because the field was optional in A's schema, which meant it parsed fine; it just wasn't being used to populate a field that turned out to matter a great deal.
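A consumer-side contract test is one way to retire that kind of belief before it retires you. A sketch, with a hypothetical `region` field standing in for the deprecated one: the parser tolerates absence (that is what "optional" means), so the reliance has to be asserted separately.

```python
def parse_response(payload: dict) -> dict:
    # "region" is optional in the schema, so a missing field parses cleanly —
    # exactly the silence that hid the removed compatibility shim.
    return {"id": payload["id"], "region": payload.get("region")}

def contract_check(payload: dict) -> None:
    # Pin the belief explicitly: "optional in the schema" is not
    # the same as "unused by this consumer".
    if "region" not in payload:
        raise AssertionError("upstream stopped emitting a field we still consume")
```

Run `contract_check` against the upstream's staging responses in CI, and the shim removal fails a build instead of corrupting production data.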
Each of these is a story of a belief that was valid, stopped being valid, and wasn't retired.
The Measurement Problem
The reason assumption drift accumulates is that we measure the wrong things.
Performance dashboards measure latency, throughput, error rates. These are current-state metrics — they tell you whether the system is doing its job right now. They don't tell you whether the system's model of itself matches the system's actual behavior. They don't tell you whether the trust boundaries are still coherent. They don't tell you whether the ownership map reflects who would actually respond to an incident.
Some organizations do architecture reviews, but they're usually triggered by new projects rather than elapsed time. The review is for the thing being built, not for the thing that's been running for two years while the environment around it shifted.
The gap is fundamentally an information problem. You cannot revalidate assumptions you're not aware of holding. And most teams, if you asked them to enumerate the operational assumptions embedded in their largest production services, would produce a list that's both incomplete and already partially wrong before they finish writing it down.
I've started, in my own work, keeping what I'd call an assumption register — not a formal document, just a running file of decisions made about how a system relates to its environment, with a timestamp and a rough expiration condition: this is valid as long as traffic stays below X, as long as we're on a single cloud provider, as long as the deployment cycle for Service Y is less than five minutes. When the conditions are revisited — quarterly, or when a significant architectural change happens — we go back to the register and ask which of these are still true.
It doesn't catch everything. Nothing does. But it shortens the mean time between assumption expiration and assumption discovery, and that interval is where the fragility lives.
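For what it's worth, my register is barely more structured than this sketch (the field names and checks here are illustrative): each entry carries a timestamp, the belief, its expiration condition, and, where one exists, an automated check, so the quarterly review can flag expirations mechanically instead of from memory.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Optional

@dataclass
class AssumptionEntry:
    recorded: date
    belief: str
    valid_while: str                             # human-readable expiry condition
    check: Optional[Callable[[], bool]] = None   # automated check, where possible

def review(register: list[AssumptionEntry]) -> list[AssumptionEntry]:
    """Quarterly pass: return entries whose automated check now fails.

    Entries without a check still need a human to read them — the
    point of the register is that at least they get read."""
    return [e for e in register if e.check is not None and not e.check()]
```

Most entries never get an automated check, and that's fine; writing the belief down is most of the value. The check is for the handful of assumptions cheap enough to verify continuously.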
Monday Morning
So what actually changes on Monday? Not the architecture — that takes months. Not the tooling — that takes longer.
What changes is a habit of questioning.
Before you add the next alert, ask what assumption the alert is protecting, and whether that assumption is still valid. Before you extend an existing automation to handle a new case, ask whether the model of the system that the automation encodes is the model the system actually exhibits now. Before you close a postmortem, spend fifteen minutes on a question the postmortem doesn't ask: what had to be true for this triggering event to find anything to trigger?
That last question is harder and slower than writing action items. It resists being put in a ticket. But it's the question that finds the belief underneath the incident, and without it you're fixing the lock while leaving the window open.
A small, concrete discipline: when a service changes hands — a team reorg, an acquisition, a project sunset — do a one-hour review of what that service assumes about its environment before closing the handoff. Not a full architecture review. Just: what does this system trust? Who does it depend on? What operational assumptions are baked into its configuration that the new owners might not know to question?
Most teams skip this because they're busy and the service is working. That's exactly the condition that makes the review worth doing.
The instinct to move fast is reasonable. The instinct to declare a system stable and stop looking at it is where the gap opens. Stability is not a terminal state. It's a temporary alignment between a system and a world that keeps changing.
Your job — the unglamorous part of it — is to keep measuring that alignment, long after anyone stops asking you to.