Iyanu David
The Architecture Drift Nobody Measures

Systems rarely collapse suddenly. I know that sounds obvious — every engineer has read the postmortem that opens with "a cascade of small failures" — but the knowledge doesn't seem to change how we build or how we watch. We still instrument for the acute. We still treat the chronic as background noise.

Drift is chronic. And chronic things are genuinely hard to see, not because we lack the tools but because they change too slowly to register against the baseline of what we already expect.

The Word Engineers Already Own — And Why It's the Wrong One

Say "drift" in a room full of platform engineers and you'll get a specific, conditioned response: configuration drift. A node that someone SSH'd into at 2 AM and patched by hand. A Helm value overridden in production because the feature flag wasn't ready. A Terraform state that diverged from what's actually running in the account. That kind of drift is real, it causes real incidents, and — critically — it's detectable. You can diff against a desired state. You can run a reconciliation loop. You can build a dashboard and put a red cell in it when something diverges.
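Configuration drift is tractable precisely because a desired state exists to diff against. A minimal sketch of the reconciliation idea, with hypothetical state dicts standing in for what a real provider (Terraform state, the Kubernetes API) would return:

```python
# Minimal reconciliation check: diff actual state against desired state.
# The state dicts are hypothetical stand-ins for a real provider API.

def diff_state(desired: dict, actual: dict) -> dict:
    """Return the deltas a reconciliation loop would act on."""
    return {
        "missing": {k: v for k, v in desired.items() if k not in actual},
        "unexpected": {k: v for k, v in actual.items() if k not in desired},
        "changed": {
            k: {"desired": desired[k], "actual": actual[k]}
            for k in desired.keys() & actual.keys()
            if desired[k] != actual[k]
        },
    }

desired = {"replicas": 3, "image": "api:v1.4", "log_level": "info"}
actual = {"replicas": 3, "image": "api:v1.4", "log_level": "debug", "hotfix_env": "1"}

delta = diff_state(desired, actual)
# delta["changed"]["log_level"] captures the 2 AM hand-edit;
# delta["unexpected"]["hotfix_env"] captures the out-of-band addition.
```

Everything architectural drift lacks is visible here: a `desired` to compare against, and a mechanical way to compute the delta.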

Architectural drift doesn't give you that. There is no desired-state document for trust models. There is no reconciliation loop for ownership boundaries. Nobody is running `architecture --plan` and watching for a delta.

What I mean by architectural drift is subtler and nastier: it's what happens when the structure of a system — its service topology, its permission surfaces, its implicit contracts between teams, its assumptions about what can reach what and who is responsible for what — evolves away from the intent that shaped it, without anyone making a deliberate decision to change that intent.

The permissions weren't widened in a single bad commit. They were widened in fourteen reasonable commits over eight months, each one unblocking something real, each one reviewed and approved, none of them understood as part of a cumulative pattern.

That's drift.

Architecture as Fossilized Reasoning

Every architectural decision is a theory. Not a fact — a theory. A claim about the world that seemed well-supported at the time: the expected traffic profile, the team size and structure, the threat model, what failure modes were considered credible, how much operational maturity you were willing to assume. Those theories get encoded.

They get encoded in IAM policies that grant a CI service account write access to the specific S3 buckets that existed at the time of writing, plus a `*` wildcard added six months later for expediency. They get encoded in network segmentation rules designed around a monolith that has since been decomposed into eleven services. They get encoded in deployment pipelines whose scope of authority was never explicitly defined because nobody imagined the pipeline would eventually be able to trigger changes in four environments, two cloud accounts, and a Kubernetes cluster that didn't exist when the pipeline was written.
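The expedient wildcard is the kind of thing a simple static check can surface. A sketch, using illustrative IAM-style policy statements rather than anything pulled from a real account:

```python
# Flag overly broad statements in IAM-style policies.
# The policy statements below are illustrative, not real account policy.

def is_broad(stmt: dict) -> bool:
    action = stmt.get("Action", "")
    resource = stmt.get("Resource", "")
    return resource == "*" or action == "*" or action.endswith(":*")

policy = [
    {"Action": "s3:PutObject", "Resource": "arn:aws:s3:::artifacts-2021/*"},
    {"Action": "s3:*", "Resource": "*"},  # the expedient addition
]

broad = [stmt for stmt in policy if is_broad(stmt)]
# Only the second statement is flagged: a bucket-scoped PutObject with a
# key wildcard is narrow; s3:* on Resource: * is the drift.
```

A check like this doesn't judge whether the grant was reasonable at the time. It just makes the accumulated breadth visible, which is the part nobody is currently looking at.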

The decisions were reasonable. Locally rational. The problem is that local rationality doesn't compound into global coherence.
Reality keeps moving and the architecture doesn't move with it — not in any deliberate way. The gaps widen quietly. You don't get a notification.

Success Is the Accelerant

Here's the part that took me an embarrassingly long time to internalize: successful systems drift faster than struggling ones. Not slower. Faster.

When a system is visibly struggling — reliability is poor, latency is bad, deployments are flaky — there's organizational permission to stop and fix things. Engineers can make the case for redesign. The pain is a forcing function.

When a system is working, the pressure calculus inverts. Features need shipping. The platform is fine. Why would you spend two weeks re-examining service ownership when nothing is broken? That instinct is not stupid — it's a reasonable allocation of attention under real constraints. But it has a structural consequence: the things that would compound into future fragility don't get addressed, because there's no present signal that they matter.

Temporary permissions become permanent through sheer persistence — nobody revokes them because nothing has gone wrong yet. Workarounds that were introduced as three-week bridges are still running three years later, load-bearing and undocumented. Ownership of a service migrates informally when a team reorganizes, but the old team's name is still in the runbook and the PagerDuty rotation. A shared library accumulates four internal consumers, then six, then eleven, and at some point it becomes a platform primitive with none of the stability guarantees a platform primitive should carry.

Each of these is an individually manageable problem. Collectively, they constitute a system that is very hard to reason about and very easy to miscalculate during an incident.

The Nonlinearity Problem

There's a combinatorics issue hiding underneath all of this that I think doesn't get talked about enough in concrete terms.

When you add a new service, you don't add one thing. You add one service plus up to N new interaction paths, where N is the number of existing services it might touch, plus the infrastructure components it reaches, the IAM policies it requires, the CI pipeline permissions it needs, and the failure modes it introduces into any service that depends on it. The number of possible system states doesn't grow linearly with the number of components. It grows exponentially — or faster, depending on how tightly coupled the components are.
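A back-of-envelope version of that growth, assuming the worst case where any service can call any other:

```python
# Worst-case growth of possible system configurations as services are added.

def possible_edges(n: int) -> int:
    # Directed call edges between n services, excluding self-calls.
    return n * (n - 1)

def rough_state_count(n: int) -> int:
    # Treat each potential edge as simply present or absent:
    # 2^edges possible coupling configurations.
    return 2 ** possible_edges(n)

for n in (3, 5, 8, 11):
    print(n, possible_edges(n), rough_state_count(n))
```

Going from three services to eleven takes you from 6 potential edges to 110, and the configuration count from 2^6 to 2^110. The model is deliberately crude, but the shape of the curve is the point: coupling possibilities outrun anyone's ability to hold them in their head long before the service count looks alarming.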

Most observability tooling is designed to measure behavior: latency distributions, error rates, saturation, throughput. The RED method, the USE method — good frameworks, genuinely useful, not the point. What they capture is how a system performs given its current structure. They don't capture anything about the structure itself. High fan-in on a single service? Not a latency metric. Overly broad blast radius from a CI pipeline? Not an error rate. Seventeen implicit callers depending on an undocumented internal API contract? Not visible in a Grafana dashboard.
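Structural signals like fan-in are cheap to compute if you have (or can extract) a dependency graph. A sketch over hypothetical call edges:

```python
from collections import Counter

# Hypothetical service-to-service call edges as (caller, callee) pairs.
edges = [
    ("checkout", "payments"), ("orders", "payments"), ("refunds", "payments"),
    ("admin", "payments"), ("billing", "payments"),
    ("checkout", "inventory"), ("orders", "inventory"),
]

fan_in = Counter(callee for _, callee in edges)
FAN_IN_THRESHOLD = 4  # arbitrary cutoff for illustration

hotspots = {svc: n for svc, n in fan_in.items() if n >= FAN_IN_THRESHOLD}
# "payments" surfaces with fan-in 5: a structural concentration
# that no latency or error-rate dashboard will show you.
```

This is a structure metric, not a behavior metric: it can be red while every RED/USE dashboard is green.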

We've invested heavily in observability for runtime behavior and almost nothing in observability for structural health. That's a meaningful gap.

When the Org Chart Rewires and the Architecture Doesn't

Conway's Law is usually taught as a prescription — design your system the way you want your organization to look. What gets less attention is its corollary: when your organization changes, your system silently breaks that correspondence. And organizations change constantly.

A team of eight splits into two teams of four. The service that team owned is now jointly owned, which in practice means ambiguously owned. An on-call rotation that used to involve five people who all held full context now involves ten people, half of whom have only ever seen the service through the lens of recent incidents, not its original design. A new team inherits a microservice because the person who wrote it moved to a different organization, and that team's understanding of the service's upstream dependencies is incomplete in ways nobody realizes until 3 AM when a dependency behaves unexpectedly.

None of this shows up in the code. The code doesn't reorganize when the org does. The implicit knowledge — why a particular retry strategy was chosen, what constraint drove a specific timeout value, which downstream service has a known quirk that the original author knew to compensate for — that knowledge disperses. Sometimes it survives in documentation. Often it doesn't.

Organizational drift is architectural drift. The system's resilience degrades not because the code changed but because the humans who understood the code and its operational context no longer hold that understanding collectively.

Automation: The Visibility Tax

Automation is genuinely good. I want to be careful here not to slide into a lazy critique of something that has made software reliability substantially better across the industry. Automated deployments are more reliable than manual ones. Infrastructure-as-code produces more consistent environments than hand-provisioned servers. Reconciliation loops catch configuration drift faster than human audits.

But automation has a visibility tax that I don't think we account for carefully enough.

When a deployment workflow becomes complex — many stages, many conditional branches, cross-account promotion, post-deploy validation hooks — and you wrap that complexity in a pipeline that produces a green checkmark, you've created something that works reliably until its underlying assumptions are violated in a way the pipeline can't detect. The complexity didn't go away. It got abstracted behind an interface that only shows you outcomes.

Overly broad IAM permissions still work. Pipelines succeed. The blast radius isn't visible until something uses it. Fragile dependencies still pass the happy path. Retries compensate for flakiness that should be investigated. You're looking at surface reliability — the rate of green checkmarks — while structural fragility accumulates underneath.

This is not a reason to avoid automation. It is a reason to build automation that exposes its own structural surface area — pipelines that report the scope of environments they can affect, permission policies that log the breadth of their authority, dependency graphs that are computed and versioned alongside the code they describe.
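One way to implement that: the pipeline publishes a manifest of everything it can touch, and a check fails (or at least warns) when that scope grows beyond the last explicitly approved baseline. Both manifests below are hypothetical:

```python
# Compare a pipeline's current declared scope against the last
# approved baseline; any growth requires a human decision.

approved_scope = {
    "environments": {"staging", "prod-us"},
    "accounts": {"123456789012"},
}

current_scope = {
    "environments": {"staging", "prod-us", "prod-eu"},  # quietly added
    "accounts": {"123456789012", "210987654321"},
}

def scope_growth(approved: dict, current: dict) -> dict:
    """Return only the dimensions where current scope exceeds the baseline."""
    return {
        key: current[key] - approved.get(key, set())
        for key in current
        if current[key] - approved.get(key, set())
    }

growth = scope_growth(approved_scope, current_scope)
if growth:
    print("pipeline scope expanded beyond approved baseline:", growth)
```

The mechanism is the same as configuration drift detection, just pointed at authority instead of state: you restore a desired-state document for something that never had one.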

Most pipelines don't do this. It's not hard to add. It's just not where the attention has gone.

Drift Doesn't Alert

The core danger is mundane: drift doesn't page anyone.

There is no threshold for "your service boundaries have diverged meaningfully from team ownership structure." Nobody gets a Slack message that says "your trust model was designed for a 40-engineer organization and you now have 200 engineers." No SLO fires because a CI pipeline's authority has expanded beyond what was originally scoped.

The system looks stable. Dashboards are green. Deployments are succeeding. On-call load is manageable. From the outside, and even from the inside, there's no signal that the structural assumptions are aging faster than the system is being maintained.

And then something breaks. Usually something small — a bad deploy, a dependency timeout, a retry storm, a permissions misconfiguration that was always possible but hadn't been triggered. The triggering event is mundane. The consequences are not, because drift has changed the geometry of the system in ways that weren't modeled and weren't understood.

The incident investigation finds it: ownership was unclear, so mitigation was slow. A coupling that nobody had mapped amplified the failure across services that had no business being affected. A pipeline that was never scoped to the blast radius it had quietly accumulated propagated an incorrect state into three environments before anyone caught it.

The trigger was small. Drift made it catastrophic. And nobody knew the fragility was there because nothing had measured it.

What to Actually Do on Monday

I'm suspicious of framework proposals that require organizational consensus to implement, because organizational consensus takes months and architectural debt compounds daily. What follows are things a careful engineer can start doing with existing authority, this week, without waiting for a working group.

Map your blast radius. Pick one CI pipeline — ideally the one that feels the most like infrastructure rather than application deployment — and write down, concretely, every environment it can modify, every cloud account it has credentials for, every approval gate it bypasses in practice versus in theory. The act of writing it down is diagnostic. If you can't complete the list in thirty minutes because you're not sure what the pipeline can reach, that's the finding.
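The write-down can be structured data rather than prose, which makes the gaps explicit rather than papered over. A hypothetical inventory for one pipeline:

```python
# Hypothetical blast-radius inventory for one pipeline. The key idea:
# "unknown" entries are findings in themselves, not placeholders.

inventory = {
    "environments_writable": ["staging", "prod-us", "prod-eu"],
    "cloud_accounts": ["123456789012", "unknown"],  # couldn't confirm the second
    "approval_gates_bypassed": ["manual-qa (bypassed on hotfix branches)"],
    "k8s_clusters": "unknown",
}

unknowns = [k for k, v in inventory.items()
            if v == "unknown" or (isinstance(v, list) and "unknown" in v)]
# Every entry in `unknowns` is the thirty-minute finding described
# above: authority the pipeline holds that nobody can enumerate.
```

Committing a file like this next to the pipeline definition also gives you something to diff the next time someone asks the same question.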

Audit permissions by age, not by scope. Most permission review processes ask "is this permission appropriate?" which is hard to evaluate in the abstract. A more tractable question is "when was this permission last reviewed, and what was the context at that time?" Permissions that are six months old and were granted in a different organizational context are candidates for revalidation regardless of whether they've caused a problem. They are structural liabilities with no expiration date.
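The age-based filter is a few lines once grants carry a last-reviewed date. The records below are hypothetical; in practice they would come from your IAM provider's audit data or a permissions inventory:

```python
from datetime import date, timedelta

# Hypothetical grant records with a last-reviewed timestamp.
grants = [
    {"principal": "ci-deployer", "permission": "s3:*",
     "last_reviewed": date(2023, 1, 10)},
    {"principal": "on-call-role", "permission": "logs:Get*",
     "last_reviewed": date(2025, 6, 1)},
]

MAX_AGE = timedelta(days=180)  # revalidation cadence, pick your own
today = date(2025, 9, 1)

stale = [g for g in grants if today - g["last_reviewed"] > MAX_AGE]
# `stale` holds grants due for revalidation regardless of whether
# they have caused a problem yet.
```

The question "is this appropriate?" stays hard; the question "is this overdue?" becomes mechanical, and overdue is what gets the review scheduled.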

Find the undocumented load-bearing components. Every system has them — services or libraries that have accreted callers without ever being designed for broad consumption. A rough proxy: look for internal components with no SLO, no on-call owner, and more than three consumers. That combination suggests something that is being treated as infrastructure but without infrastructure-level rigor. It will matter during an incident.
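The three-field proxy translates directly into a query over whatever service catalog you have. The catalog entries here are hypothetical:

```python
# Apply the proxy from the text: no SLO, no on-call owner, and more
# than three consumers. Catalog entries are hypothetical examples.
catalog = [
    {"name": "auth-lib", "slo": None, "oncall": None, "consumers": 11},
    {"name": "payments", "slo": "99.9", "oncall": "team-pay", "consumers": 5},
    {"name": "pdf-util", "slo": None, "oncall": None, "consumers": 2},
]

def shadow_infrastructure(catalog: list) -> list:
    return [c["name"] for c in catalog
            if c["slo"] is None and c["oncall"] is None and c["consumers"] > 3]

print(shadow_infrastructure(catalog))  # → ['auth-lib']
```

The threshold of three is arbitrary; the combination of fields is what matters. High consumption with no ownership and no reliability contract is the signature of an accidental platform primitive.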

Reread your last three postmortems for structural signals. Not for the contributing factors that were already documented, but for the things the investigation surfaced and then didn't pursue: the ownership ambiguity that got noted but not resolved, the coupling that was surprising but got attributed to "we need better monitoring" rather than "we need to reduce this coupling," the permission that allowed unexpected reach and was remediated but never questioned as a class of problem.

Tie architecture review to scale events, not just feature events. Code review is continuous. Architecture review is usually either nonexistent or triggered by major new projects. The more useful trigger is scale: when your user count doubles, when your engineering org grows by 50%, when your deployment frequency crosses a threshold. Those are the moments when the gap between current architecture and current reality is most likely to be meaningful.
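Scale-triggered review can be as simple as recording the scale numbers at each review and checking the ratios since. All the figures below are hypothetical:

```python
# Flag which scale dimensions have grown enough since the last
# architecture review to warrant another one. Numbers are hypothetical.
at_last_review = {"users": 40_000, "engineers": 40, "deploys_per_week": 30}
now = {"users": 95_000, "engineers": 55, "deploys_per_week": 120}

def review_triggers(baseline: dict, current: dict,
                    growth_threshold: float = 1.5) -> list:
    """Dimensions whose growth ratio meets or exceeds the threshold."""
    return [k for k in baseline
            if current[k] / baseline[k] >= growth_threshold]

print(review_triggers(at_last_review, now))
```

Here users and deploy frequency trip the 1.5x threshold while headcount does not, which is exactly the kind of signal the calendar alone never gives you.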

None of this requires a new platform. None of it requires executive sponsorship. It requires treating structural health as a real engineering concern with the same seriousness as runtime reliability — which means measuring it, reviewing it on a cadence, and accepting that visibility into it is a form of risk management, not overhead.

The most dangerous systems I've encountered weren't the ones with poor uptime metrics. They were the ones where everyone felt confident, the dashboards were clean, and nobody had looked carefully at the structure in a year and a half. By the time the confidence was revealed to be misplaced, the conditions for failure had been present for a long time.

That's the thing about drift. It's already happened before you notice it.
