Iyanu David
Complexity Is a Liability (Until It Isn't)

Every mature system accumulates complexity the way old buildings accumulate load-bearing walls that nobody drew on the original blueprints — quietly, necessarily, and in ways that become genuinely dangerous to remove.

More services. More integrations. More policies, more environments, more abstraction layers stacked on top of abstraction layers that were themselves stacked on top of something someone built at 2am in 2019 and never properly documented. The conventional wisdom, the thing you hear in every architecture review when someone senior leans back and sighs, is that you should reduce complexity to improve reliability. And that's directionally correct. It's also incomplete in ways that cause real damage when teams take it literally.

Complexity is not inherently harmful. Unmanaged complexity is. And sometimes — this is the part that gets elided in the simplification rhetoric — removing complexity creates new fragility that won't announce itself until the worst possible moment.

The Two Kinds, and Why the Distinction Matters More Than Almost Anything
There's a taxonomy here that most organizations gesture at but rarely operationalize with any rigor. Architectural complexity falls into two categories, and conflating them is how you end up with a post-incident review that says "we over-simplified" with no apparent awareness of the irony.

The first kind is accidental complexity. Inconsistent patterns — one team using REST, another gRPC, a third using some homebrew message protocol because they hired a Kafka evangelist in 2021 and he's since moved on. Duplicate services that do nominally different things but solve the same underlying problem. Unclear ownership, which sounds like a people problem and is actually a systems problem, because unclear ownership means nobody knows which service to page when something breaks at 3am and the blast radius expands while everyone figures it out. Redundant pipelines. Legacy scaffolding that was supposed to be temporary in 2017 and has since become so load-bearing that nobody touches it, which means nobody understands it, which means it functions as a kind of institutional learned helplessness.

None of this adds capability. It's drift. It's what entropy looks like in production. It should be removed, aggressively, and the only reason it usually isn't is that removal carries risk and carries cost and the people with context to do it safely are always busy with something more urgent. This is how accidental complexity compounds. You don't build it intentionally. You inherit it incrementally, one expedient decision at a time, and then one day you're onboarding a new engineer and you realize you cannot explain why any of this exists.

The second kind is essential complexity. Multi-region failover — which means you have to deal with split-brain scenarios, with replication lag, with the gnarly question of what "current" means when your data is in three regions and the network between two of them just partitioned. Zero-trust network segmentation, which is genuinely complicated to implement and maintain and test. Fine-grained IAM boundaries. Event-driven architectures where the complexity lives in the guarantees — exactly-once delivery, ordering within partitions, dead-letter queues that someone has to actually drain. Data lineage enforcement for regulated industries where you need to be able to prove, forensically, what touched what and when.

This complexity exists because the problem space demands it. Removing it simplifies your architecture diagram. It also increases your risk profile in ways that may not be immediately visible, which is exactly what makes simplification seductive and occasionally catastrophic.

The Simplification Trap, Which Is Real and Which I Have Watched Happen
Teams pursue simplification as a blanket objective. Usually this follows a rough trigger: a new VP of Engineering who came from a startup and finds the accumulated weight of the existing system alarming. A cost-optimization initiative where someone runs a spreadsheet and finds that infrastructure spend has grown 40% year-over-year. A postmortem where the incident was traced to complexity — a cascading failure that propagated through too many services — and the action item is to have fewer services.

The typical moves: collapse microservices into a monolith, because microservices have overhead and inter-service latency and distributed transactions are a nightmare to debug. Centralize permissions, because fine-grained IAM is hard to reason about and someone just got paged because a service couldn't read from a bucket it should have been able to read from. Remove staging environments, because staging is expensive and it never quite matches production anyway. Flatten network boundaries. Reduce observability to "core metrics" because the telemetry bill is too high.

Short term, clarity genuinely improves. You can hold more of the system in your head. Deployments are faster. The architecture diagram is legible.

Long term, resilience declines. And it declines quietly, in ways that are easy to attribute to bad luck rather than structural degradation. You removed a circuit breaker here, a bulkhead there, a redundant path whose redundancy you never needed until you did. You collapsed the blast radius boundaries and didn't notice because nothing went wrong for six months.

Why does this happen? Because some complexity absorbs failure. Redundancy is complex — running multiple instances of the same service, maintaining consistency across them, handling failover, all of this is overhead that only pays off during incidents. Isolation is complex — network segmentation, separate databases per service, independent deployment pipelines, all of this costs money and time and engineering attention during normal operations. Defense-in-depth is complex — it means you have multiple mechanisms that each partially mitigate a class of failure, which means you have to understand all of those mechanisms, which means cognitive load. But these mechanisms reduce blast radius. They turn a total outage into a degraded service. They turn a compromised service into a contained breach rather than a full network traversal.
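
The isolation idea above can be made concrete with a minimal sketch. This is not any particular library's API — the `Bulkhead` class and the service names are illustrative — but it shows the mechanism: a bounded semaphore caps how much concurrency one dependency can consume, so a slow downstream saturates its own compartment instead of the whole application.

```python
import threading

class Bulkhead:
    """Caps concurrent calls to one dependency so a slow downstream
    cannot exhaust the application's shared capacity (illustrative sketch)."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Fail fast instead of queueing: if no slot is free, reject the
        # call rather than letting requests pile up behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: dependency saturated")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()

# Each downstream gets its own bulkhead, so saturation stays contained.
payments_bulkhead = Bulkhead(max_concurrent=2)
result = payments_bulkhead.call(lambda: "charged")
```

The overhead the text describes is visible even here: someone has to pick `max_concurrent` per dependency, and someone has to decide what the caller does when the bulkhead rejects.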

Complexity as Shock Absorber
High-reliability systems — the term has a specific origin in organizational theory, Weick and the high-reliability-organization researchers studying aircraft carriers and nuclear plants, with Perrow's normal-accidents work as the skeptical counterpoint, but it applies directly here — intentionally introduce structured complexity. Circuit breakers that stop calling a failing dependency rather than queuing up an ever-growing backlog of requests that will never succeed. Bulkheads that partition thread pools so a slow downstream doesn't exhaust the whole application's capacity. Rate limits, retries with exponential backoff and jitter, fallback paths that serve stale data rather than errors.
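
Retries with exponential backoff and jitter are simple enough to sketch in full. This is a generic illustration, not a specific library: the delay doubles per attempt up to a cap, and the actual sleep is a random fraction of that delay ("full jitter"), so a fleet of synchronized clients doesn't stampede a recovering dependency all at once.

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=0.1, cap=5.0):
    """Retry a flaky call with capped exponential backoff plus full jitter
    (illustrative sketch; parameter names are not from any specific library)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Sleep a random amount up to the capped exponential delay.
            delay = min(cap, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))

# A hypothetical call that fails twice, then succeeds on the third attempt.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(retry_with_backoff(flaky, base_delay=0.01))  # prints "ok" after 3 attempts
```

Even this tiny helper carries the trade-offs discussed below: too many attempts and you amplify load on a struggling dependency; too few and you surface transient errors to users.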

Every one of these mechanisms complicates control flow. You have to test them. You have to make sure they're configured correctly, because a circuit breaker with the wrong threshold will trip too eagerly during normal transient errors, or not trip at all when you actually need it. You have to reason about what the fallback behavior means for your users and your data consistency guarantees. This is real overhead.
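
That threshold question is easier to see with a minimal circuit breaker sketch. The class and its parameters are illustrative, not any specific library's API, but `failure_threshold` and `reset_timeout` are exactly the knobs that have to be tuned: too low and the breaker trips on ordinary transient errors, too high and it never trips when you need it.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch. After `failure_threshold` consecutive
    failures the circuit opens and calls fail fast; after `reset_timeout`
    seconds one trial call is allowed through (half-open)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if time.monotonic() - self._opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Past the timeout: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = time.monotonic()  # trip the breaker
            raise
        self._failures = 0       # success resets the failure count
        self._opened_at = None   # and closes the circuit
        return result

# Two consecutive failures trip this breaker; further calls fail fast.
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=60.0)

def flaky_dependency():
    raise ConnectionError("downstream down")

for _ in range(2):
    try:
        breaker.call(flaky_dependency)
    except ConnectionError:
        pass  # real failures still surface to the caller while closed
```

Note what the fallback question from the text means here: once open, `breaker.call` raises immediately, and someone still has to decide what the caller serves instead.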

But what you're buying with that overhead is failure containment. The absence of visible complexity in your system is often — not always, but often — the absence of protection. The system that looks clean in the diagram is frequently the system that cascades catastrophically when one component fails because you removed all the circuit breakers in the simplification initiative.

I think about this in terms of what I'd call structural load distribution. In a building, you spread load across multiple paths. If one fails, the structure redistributes. If you've streamlined everything through a single critical path — which is cleaner, architecturally elegant, fewer components — then you've also made that single path a single point of failure. The elegance is real. So is the fragility.

The Cost Dimension, Which Is Where Idealism Goes to Die
Complexity carries economic weight, and anyone who tells you otherwise has never had to justify an infrastructure bill to a CFO.

More services means higher infrastructure cost — more compute, more networking, more storage, more licensing. More observability means higher telemetry spend — traces and logs are not free, especially at scale, and the first thing that gets cut in a cost optimization is usually the instrumentation that would have told you about the next incident. More isolation means resource duplication. More review layers mean slower delivery, which has its own cost in opportunity terms.

So you have this genuine tension between cost optimization, operational resilience, and delivery velocity, and the three don't resolve cleanly. Over-optimizing for cost removes protective complexity — you cut the redundancy, you consolidate the services, you turn off the detailed logging, and everything looks fine right up until it doesn't. Over-optimizing for resilience bloats operational overhead — you run redundant everything, you maintain elaborate failover mechanisms for failure modes that have never actually occurred, you spend more engineering time maintaining protective infrastructure than building product.

Sustainable architecture balances both, which is genuinely hard and not a thing you achieve once and then maintain passively. It requires revisiting trade-offs as the system evolves, as load patterns change, as the business context shifts. The right amount of complexity for a system handling ten thousand requests per day is different from the right amount for one handling ten million. But those revisions require organizational discipline that most teams don't have, because the pressure is always toward the immediate cost saving or the immediate velocity gain, not the long-horizon resilience.

Cognitive Complexity Versus Structural Complexity
Here's a distinction that I think is more important than the accidental/essential taxonomy, and which gets less attention than it deserves.

A system can be structurally complex — many services, many integration points, many mechanisms — but cognitively simple, if the patterns are consistent. If every service exposes its health state the same way, deploys via the same pipeline, emits structured logs in the same format, uses the same circuit breaker library configured the same way, then an engineer who understands one service can reason about all of them. The cognitive overhead per component is low because the patterns transfer.
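
One way consistency gets enforced in practice is a shared helper that every service goes through. This is a hypothetical sketch — the helper name and field names are invented — but it shows the payoff: if every service emits logs through one function like this, every log line has the same shape, and an engineer who can grep one service can grep them all.

```python
import json
import sys
import time

def log_event(service: str, event: str, **fields) -> dict:
    """Shared structured-logging helper (illustrative sketch). Every service
    calls this, so the emitted line format is identical across the fleet."""
    record = {"ts": time.time(), "service": service, "event": event, **fields}
    # One JSON object per line, keys sorted, on stdout: the same shape everywhere.
    sys.stdout.write(json.dumps(record, sort_keys=True) + "\n")
    return record

# Two different services, one identical log shape; only the payload differs.
log_event("checkout", "order_placed", order_id="o-123", latency_ms=42)
log_event("inventory", "stock_reserved", sku="sku-9", latency_ms=7)
```

The same argument applies to health endpoints, deployment pipelines, and circuit-breaker configuration: the structural count of components doesn't drop, but the cognitive cost per component does.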

Conversely, a system can be structurally simple — few services, minimal integrations — but cognitively complex if the behaviors are unpredictable or inconsistent. One service that fails silently. Another that retries aggressively rather than circuit-breaking. A third whose logging is inconsistent depending on the code path. A fourth that has some undocumented special behavior during weekends because it was originally deployed as a batch job and still has that heritage baked into its scheduler logic. Each of these is tractable in isolation, but the accumulated cognitive load of keeping track of all the exceptions, all the "this one is different because," is genuinely degrading.

Cognitive complexity is what actually degrades reliability. You can see it in specific indicators. Engineers who are reluctant to deploy on Fridays — not because Friday deployments are inherently unsafe, but because the system's behavior during failures is unpredictable enough that they want a full week of support buffer. On-call rotations where certain services are dreaded, where people angle to not be the one holding the pager when service X alerts because service X is a nightmare to debug. Tribal knowledge clustering around individuals — "only Sarah really understands the event processing pipeline" — which means the system's reliability is now contingent on Sarah's availability and tenure. These are diagnostic signals. They tell you where cognitive complexity has accumulated to a level that's affecting human decision-making, which means it's affecting reliability.

Consistency reduces cognitive burden even when systems are large. This is why investing in internal developer platforms, standardized deployment tooling, shared libraries for common concerns — observability, circuit breaking, authentication — pays reliability dividends that are hard to measure directly but very real. You're not reducing structural complexity so much as you're reducing the cognitive overhead per unit of structural complexity.

The Entropy Curve
Over time, accidental complexity increases. This is the default trajectory if you're not actively working against it. Essential complexity stays roughly constant or grows slowly as the system takes on new capabilities. And cognitive clarity decreases, because the documentation that existed at the start is now stale, the engineers who held context have moved on, and the patterns that were once consistent have gradually diverged as teams made expedient local decisions.

Without intentional pruning — not occasional, not when things get bad enough to force a migration, but regular, systematic — accidental complexity eventually overwhelms essential complexity. You can't see the protective mechanisms through the noise anymore. The circuit breakers are buried under six layers of abstraction and nobody's sure if they're actually configured. The redundant data path exists on paper but hasn't been tested in two years and probably doesn't work.

This is when incident frequency increases. Not dramatically, not in a way that's easy to attribute to the structural decay rather than to bad luck or increased traffic. Change failure rate rises. Bugs that should have been caught by staging slip to production because staging is too different from production, or because people have started skipping it because it's flaky. MTTR expands because debugging requires tribal knowledge that's no longer distributed across the team. Teams slow down — not because the architecture was wrong when it was designed, but because it was never simplified intentionally as it evolved.

When to Reduce, When to Preserve
Simplify when redundancy no longer provides meaningful isolation — when you have two services that both have the same database as a single point of failure, so the service-level redundancy is illusory. When a service exists only to support a legacy integration that no longer exists on the other end. When two systems solve the same problem differently and the difference is historical accident rather than intentional differentiation — the merge will be painful, but the ongoing maintenance cost exceeds the migration cost. When policies overlap without increasing security. When the cost of maintaining a boundary genuinely exceeds its protective value.
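
The "illusory redundancy" check is mechanical enough to script against an inventory of your services. Everything here is hypothetical — the dependency map and service names are invented for illustration — but the idea is just set intersection: if nominally redundant services all share one dependency, that dependency is the real single point of failure.

```python
# Hypothetical dependency map: service -> infrastructure it relies on.
DEPENDENCIES = {
    "orders-a": {"orders-db", "auth"},
    "orders-b": {"orders-db", "auth"},   # nominally redundant with orders-a
    "billing": {"billing-db", "auth"},
}

def shared_single_points(services, deps=DEPENDENCIES):
    """Return dependencies common to every listed service. Anything shared
    by all of a 'redundant' pair makes that redundancy illusory."""
    common = set.intersection(*(deps[s] for s in services))
    return sorted(common)

print(shared_single_points(["orders-a", "orders-b"]))  # ['auth', 'orders-db']
```

In a real audit the map would come from your service catalog or infrastructure-as-code, not a hand-written dict, but the analysis is the same.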

The critical qualifier: simplification should be strategic, not aesthetic. The goal is not to produce a clean architecture diagram for the next engineering all-hands. The goal is to reduce accidental complexity while preserving essential complexity. These are different goals and they sometimes point in opposite directions.

Retain complexity when it limits blast radius. When it reduces cross-team coordination during incidents — because the blast radius boundary is also a coordination boundary, and contained incidents don't require the whole organization to mobilize. When it enforces explicit ownership in ways that make on-call handoffs clean. When it protects critical data paths from noisy neighbors. When it creates resilience under load rather than cascading degradation.

Removing these layers will improve short-term velocity. It usually does. The system is simpler to reason about, faster to deploy to, cheaper to run. The cost shows up later, often after the people who made the simplification decision have moved on and the people who inherited the system are dealing with incidents that the old protective complexity would have contained.

The Maturity Shift
Early-stage systems optimize for speed. This is correct. When you don't know what your system needs to be, the cost of elaborate protective mechanisms is too high — you're buying resilience for a system that might not survive to need it. Move fast, accept fragility, keep the system small enough to reason about without the scaffolding.

Mature systems must optimize for survivability. The shift is subtle and most organizations make it too late, or make it rhetorically without making it structurally. From feature velocity to failure containment — accepting that a slower, more careful deployment process is worth it because the cost of failures, in customer trust and incident response and revenue impact, exceeds the cost of the slower cycle. From architectural elegance to operational durability — preferring the solution that is debuggable at 3am over the solution that is beautiful in a whiteboard drawing. From minimal surface area to managed isolation — accepting that more components means more things to maintain, and deciding that the blast radius reduction is worth that overhead.

Complexity, in a mature system, becomes a tool. Not an accident, not something you're apologetic about, but a deliberate choice made with awareness of the trade-off. The circuit breaker isn't there because someone overengineered it. It's there because you analyzed a failure mode, decided it was worth protecting against, and built the protection. The staging environment isn't there because it's traditional. It's there because the cost of production incidents exceeds the cost of maintaining staging, and you know this because you've measured it.

What You'd Change on Monday
The practical question, which is the one that actually matters after all the analysis.

Start with a cognitive complexity audit, not a structural one. Walk through your on-call rotation and ask your engineers where they dread getting paged. Ask them which services they'd rather not touch. Identify the tribal knowledge clusters — the "only one person understands this" systems. Those are your highest-leverage intervention points, because cognitive complexity is what actually degrades reliability, and you can often address it without architectural changes by investing in documentation, runbooks, and observability improvements.

Then distinguish, rigorously, between your accidental complexity and your essential complexity. For every service, every integration, every policy layer, ask: does this provide meaningful isolation, or does it exist because nobody cleaned it up? The answer is often neither clean nor obvious. Some things are essential complexity that's accumulated accidental complexity on top of it — the core mechanism is protective, but it's surrounded by legacy scaffolding that makes it harder to understand and maintain than it should be.

Then, and only then, start simplification — but measure the blast radius before you reduce it. If you're collapsing two services into one, understand what that does to failure isolation. If you're removing a network boundary, understand what lateral movement becomes possible in an incident. If you're turning off staging, understand what your change failure rate might look like without it. These aren't reasons not to simplify. Sometimes the analysis comes back in favor of simplification. But you should be making that call consciously, with a model of the trade-off, rather than following the local pressure toward cleaner-looking systems.

The deepest pattern here isn't really about complexity. It's about the way systems accumulate history, and the way that history contains both drift and intention in proportions that are hard to distinguish from the outside. Your job, as someone who builds and maintains production systems, is to develop the judgment to tell the difference — to see which complexity is protecting you and which complexity is just noise, and to have the organizational patience to remove one while preserving the other.

Systems do not collapse under visible complexity. They collapse under neglected complexity — the complexity nobody is maintaining, nobody is questioning, nobody is sure is still doing what it was originally built to do. The answer is not minimalism. It is attention.
