DEV Community

Iyanu David

Reliability Is a Socio-Technical Problem

When systems fail, we reach for the obvious instruments. Logs. Metrics. The deployment timeline. A frantic scroll through configuration diffs at two in the morning while someone on the bridge call asks for an ETA you cannot give. The forensic instinct is understandable — code is legible, traceable, blamable in ways that feel satisfying after an outage. Find the line that caused this. Roll it back. Write the postmortem. Close the ticket.

But I've spent enough time doing this to know that the line is rarely the story.
The line is where the story ended. Where it started is usually somewhere murkier — a Slack thread nobody followed up on, an ownership boundary that two teams interpreted differently, a runbook last touched fourteen months ago by an engineer who left the company in the spring. The code ran exactly as written. The system failed anyway.

The Postmortem That Doesn't Ask Enough
Here's a pattern I keep seeing in postmortems, including ones I've written myself: the contributing factors section lists three or four items — the dependency timeout, the cascading retry amplification, the misconfigured IAM permission — and then, almost as an afterthought, something vague about "communication gaps" or "unclear ownership." The technical findings get RCA tickets and follow-up tasks. The organizational findings get a sentence and a shrug.

This is not cynicism. It's prioritization pressure. Engineers know how to write tickets for code. Writing a ticket for "our team boundaries are ambiguous and nobody owns the alert routing for the payment namespace" is harder to scope, slower to resolve, and invisible to sprint velocity metrics.

So we fix the timeout. We leave the coordination problem intact.

And six months later, a different timeout, a different team, roughly the same failure mode. The postmortem looks strangely familiar.

The technical trigger rotates. The organizational substrate stays.

Conway's Law as a Diagnostic Tool
Most engineers know Conway's Law as a software design observation: systems tend to mirror the communication structures of the organizations that build them. It's often cited as a curiosity. It should be treated as a diagnostic instrument.

When I look at a service topology and see tightly coupled dependencies between services owned by teams that rarely talk to each other, I'm not looking at a technical architecture problem in isolation. I'm looking at a map of organizational friction rendered in YAML and gRPC. The service boundary got drawn where it did because that's where the team boundary happened to be when someone was in a hurry. The tight coupling persists because the two teams have different sprint cycles, different on-call rotations, different definitions of what "breaking change" means. They're not being careless. They're operating in structures that make careful coordination expensive.

Siloed teams produce systems with ownership overlaps and fragmented responsibilities not because engineers are bad at design but because the organizational geometry makes clean boundaries hard to maintain. When a team is perpetually overloaded — ten services, three engineers, an alert queue that never quite drains — documentation is the first thing that ages. Monitoring degrades next. Not because anyone decided to neglect it, but because the backlog of urgent things always outweighs the backlog of important things, and keeping runbooks current is perpetually important and rarely urgent until the moment it becomes both simultaneously.

You cannot architect your way out of an organizational problem. The architecture expresses the organization. If you want a different architecture, you often need a different organization first.

The Cognitive Load Ceiling
There's a ceiling — not a theoretical one, a practical one with hard consequences — to how much system complexity a human brain can hold in operational readiness at any given time.

This sounds obvious. It matters in ways that aren't obvious.

Modern production systems are deeply heterogeneous. A senior SRE working a fast-moving incident might need to traverse, in a single investigation: a distributed tracing graph across eight services, a Kubernetes topology they didn't design, a Terraform module authored by a platform team whose Slack channel they're not in, an IAM permission boundary written by security six months ago and never documented in the runbook, a CI/CD pipeline with a conditional artifact promotion step that behaves differently in prod than in staging for reasons that live entirely in one engineer's head.

The engineer on call is not failing if they struggle with this. They're bumping against a cognitive load ceiling that the system's designers never accounted for — in part because no individual designer was responsible for the whole thing, which is itself the problem.

When cognitive load exceeds working capacity, reliability degrades. Diagnosis slows. Remediation becomes tentative. Escalations happen later than they should because the engineer on the hook doesn't yet know which team to escalate to, because service ownership is ambiguous, because the service catalog hasn't been updated since the reorganization.

This is not a training problem. You cannot train humans to have larger working memories. You can design systems to be less cognitively punishing — to surface intent, to contain blast radius, to fail in legible ways that constrain the diagnostic search space. Some teams do this deliberately. Most don't. Most teams are too busy keeping the system running to redesign the system for the humans who run it.

Context as Invisible Infrastructure
The things that allow an incident to be resolved quickly are often not technical. They are contextual. Who owns this service? What does it do, in plain language, and what does it do for downstream systems? What's the right escalation path if the owning team is unavailable at 3am? What changed recently? What's the known failure mode when the upstream dependency degrades?

This context is infrastructure. Not metaphorically — functionally. It is load-bearing. When it erodes, response times increase, investigations branch into dead ends, fixes become localized patches that address symptoms without understanding causes, and institutional knowledge migrates from documentation into the heads of specific individuals who eventually leave.

I've seen teams with genuinely excellent engineers spend forty-five minutes in an incident trying to answer "who do we page?" Because the service had been migrated between teams in a reorg, the PagerDuty routing hadn't been updated, the owner listed in the service catalog was a team name that no longer existed, and the person who knew all of this was on vacation. The code was fine. The context was gone. Those forty-five minutes were not recoverable.

Runbooks are not busywork. Architecture diagrams are not bureaucracy. They are a hedge against the brittleness of human memory in high-pressure situations. They're cheap to maintain and catastrophically expensive to lack when you need them.

Coordination Under Pressure Is a Skill, and Skills Require Practice
Incidents are not normal operating conditions. Time compresses. Decision-making must accelerate. Ambiguity that's tolerable in a planning meeting becomes operationally crippling when you're staring at a customer-facing degradation and a growing incident channel.

Under that pressure, the weaknesses in coordination structures surface rapidly and uncharitably.

Misaligned incentives appear: the infrastructure team wants to understand root cause before doing anything irreversible; the product team wants to restore service immediately regardless of how. Neither is wrong. They just haven't established which posture governs which situations, so the argument happens during the incident rather than before it. Authority boundaries become unclear: who can approve an emergency change? Who can pull the rollback trigger on a service owned by a team that isn't paged? Communication gaps widen: the incident commander doesn't know who to ask because the people who know the system best aren't in the bridge call because nobody knew to page them.

These aren't exotic failure modes. I've watched every one of them happen in organizations with talented, motivated engineers. The problem isn't the people. The problem is that coordination under pressure requires practiced, legible protocols — and most organizations don't treat incident coordination as a learnable skill that needs rehearsal. They treat it as something that should happen naturally when the stakes are high enough.

It doesn't. High stakes make everything harder, including coordination. Especially coordination.

Gamedays, fire drills, tabletop exercises — these feel like organizational overhead when everything is fine. They are exactly what determines whether an organization can function when everything isn't. The teams I've seen recover well from serious incidents are rarely the ones with the most technically sophisticated systems. They're the ones that have practiced being wrong together. That have worked through who says what to whom in the first fifteen minutes. That have identified the coordination breakdowns in advance, in low-stakes simulations, so they don't discover them for the first time during an actual outage.

Automation's Hidden Bargain
Automation has a compelling value proposition: eliminate manual error. Reduce toil. Increase deployment frequency. Make repetitive human judgment unnecessary.

All of this is real. And all of it comes with a less-advertised cost.

Every layer of automation increases cognitive distance between engineers and runtime behavior. When a deployment pipeline runs forty steps across three environments and conditionally applies different configurations based on whether it's a canary or a full rollout, the engineer who wrote the pipeline understands it. The engineer who inherits it, or the one who's on call at eleven on a Friday when it behaves unexpectedly, understands it much less. The automation has removed manual error from the happy path and created a diagnostic labyrinth on every unhappy path.

This is not an argument against automation. It's an argument for automation that is legible — that surfaces its own intent, that fails loudly and clearly, that doesn't distribute its logic across six Lambda functions and a state machine in ways that require a specific kind of architectural archaeology to understand. Engineers still design workflows, still define permissions, still interpret alerts, still approve changes, still diagnose anomalies. Automation changes the surface area of mistakes. It doesn't remove the human element. Sometimes it makes that element harder to find when things go wrong.
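What "legible automation" means in practice is easier to show than to define. Here's a minimal sketch of a pipeline step that narrates its own intent and fails loudly with specifics. Every name in it — the function, the environments, the artifact format — is illustrative, not any real pipeline's API:

```python
import sys

def promote_artifact(artifact: str, env: str, is_canary: bool) -> str:
    """Promote a build artifact, narrating intent so the on-call
    engineer can see *why* the pipeline chose this path.
    All names here are hypothetical, for illustration only.
    """
    # Surface intent before acting: the log line states the decision
    # and the inputs that produced it, not just the outcome.
    mode = "canary" if is_canary else "full rollout"
    print(f"[deploy] promoting {artifact} to {env} as {mode} "
          f"(is_canary={is_canary})", file=sys.stderr)

    if env not in {"staging", "prod"}:
        # Fail loudly and specifically: name the bad input and the
        # allowed values, so diagnosis starts from the error itself.
        raise ValueError(
            f"unknown environment {env!r}; expected 'staging' or 'prod'"
        )
    return f"{artifact}@{env}:{mode}"
```

The point isn't the ten lines of logic. It's that at eleven on a Friday, the error message and the intent log are the difference between a five-minute diagnosis and an archaeological dig.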

The dangerous assumption is that a highly automated system is a self-managing one. It isn't. It's a system whose failure modes have shifted from "someone did the wrong thing" to "someone designed something whose failure mode wasn't anticipated." The second category is often harder to see coming and harder to diagnose after.

Ownership Drift: The Slow Accumulation of Nobody's Problem
As systems grow — and they always grow — ownership becomes unstable. New services appear. Teams reorganize. Responsibilities migrate without documentation. Platform capabilities expand in ways that overlap with what used to be service-team responsibilities but nobody draws the boundary explicitly. An engineer who owned a critical subsystem gets promoted and their knowledge doesn't transfer in any structured way.

The result is ownership drift: a slow accumulation of services that are sort of owned by multiple teams, or sort of owned by none. Each team assumes partial responsibility. Nobody assumes full responsibility. There are no explicit handoffs because nobody recognizes a handoff is needed.

These are your reliability blind spots. When something breaks in a drifted ownership zone, the incident starts with a question that shouldn't need asking: whose problem is this? While that question is being resolved — while people are checking the service catalog, hunting through old Slack messages, reasoning about who probably knows this system — the incident is progressing. Every minute spent on "who?" is a minute not spent on "what?" and "how do we fix it?"

Deliberate ownership tracking is not a bureaucratic luxury. It is a load-bearing operational practice. It is also, frankly, one of the cheapest investments a team can make in reliability and one of the most consistently deferred.
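The cheapness of that investment is literal. A drift check is a few lines: compare the owners your catalog lists against the teams that actually exist today, and flag the gaps. The data shapes below are assumptions standing in for whatever your catalog actually records:

```python
def find_drifted_services(catalog: dict[str, str],
                          current_teams: set[str]) -> list[str]:
    """Return services whose recorded owner is missing or is no
    longer a real team. `catalog` maps service -> owning team as
    listed; `current_teams` is the org chart as it exists today.
    Schema is illustrative, not any real catalog's format.
    """
    return sorted(
        service for service, owner in catalog.items()
        if not owner or owner not in current_teams
    )

# Toy catalog: two entries drifted after a reorg.
catalog = {
    "payments-api": "payments",
    "ledger-sync": "platform-legacy",  # team dissolved in the reorg
    "alert-router": "",                # never assigned
    "checkout-web": "storefront",
}
current_teams = {"payments", "storefront", "platform"}

blind_spots = find_drifted_services(catalog, current_teams)
# blind_spots now names the services nobody will answer a page for
```

Run something like this on a schedule and the "whose problem is this?" question gets asked on a Tuesday afternoon instead of during the incident.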

What to Measure That Most Dashboards Don't Show
The reliability metrics most teams track are software metrics: uptime, error rates, deployment frequency, mean time to recovery, change failure rate. These are important. They're also lagging indicators of problems that are often socio-technical in origin.

There are other measurements worth taking:

How many services does each team own? Not as a judgment about team size, but as a signal of cognitive load distribution. A team carrying forty services cannot give each one the operational attention that reliable operation of any of them requires.

How many teams are involved in an average incident? A high number suggests unclear ownership boundaries and tight cross-team coupling. This is not a performance metric for those teams. It's a structural diagnosis.

What percentage of services have clear, current, confirmed owners? Not organizational attribution on paper, but actual acknowledged ownership with real on-call responsibilities. The gap between "someone is listed" and "someone knows they're responsible" is where incidents go to become major outages.

When was the last permission review for each critical service? Identity and access configurations are the kind of thing that accumulates technical debt silently, through incremental additions and never-quite-completed cleanups, until an escalated privilege becomes an exploit vector or an overly broad permission becomes a blast radius amplifier.

What is the alert volume per on-call engineer over a given period? Alert fatigue is not a metaphor. It is a documented degradation in signal detection that occurs when humans are exposed to high volumes of low-quality alerts. An on-call engineer who acknowledged a hundred alerts last week is a less reliable responder than an engineer who processed five meaningful ones. The reliability of the monitoring system is inseparable from the reliability of the service it monitors.
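None of these measurements requires new tooling. A sketch of what computing them might look like, assuming you can export two flat records — service-to-owner pairs from your catalog and weekly alert counts from your paging system (both shapes are assumptions):

```python
from collections import Counter

# Toy inputs; in practice these come from catalog and pager exports.
services = [
    ("payments-api", "payments"),
    ("ledger-sync", "payments"),
    ("alert-router", ""),            # no confirmed owner
    ("checkout-web", "storefront"),
]
weekly_alerts = {"alice": 104, "bob": 5}

def reliability_signals(services, weekly_alerts):
    """Structural signals the uptime dashboard doesn't show."""
    per_team = Counter(owner for _, owner in services if owner)
    owned = sum(1 for _, owner in services if owner)
    return {
        # cognitive load distribution: services carried per team
        "services_per_team": dict(per_team),
        # the gap between "listed" and "confirmed" starts here
        "ownership_coverage": owned / len(services),
        # alert fatigue: raw volume per responder, per week
        "alerts_per_engineer": dict(weekly_alerts),
    }

signals = reliability_signals(services, weekly_alerts)
```

An engineer like the hypothetical alice above, acknowledging a hundred alerts a week, is the degraded signal detector the previous paragraph describes — and that shows up in this report long before it shows up in MTTR.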

Designing for Human Limits
This is the piece that most systems skip, not from negligence but from a kind of architectural optimism: the idea that if the system is well-designed technically, humans will figure out how to operate it. They usually do, eventually, and at some cost.

Designing for human limits means treating cognitive overhead as a first-class engineering constraint. It means asking, when adding a service: does this have clear ownership, clear failure semantics, a blast radius that a team can contain? It means designing failure modes to be legible — to contain enough diagnostic signal at the moment of failure that a reasonably informed engineer can orient quickly, rather than staring at a wall of correlated alerts with no clear starting point.

It means maintaining ownership as a living artifact, not a field in a YAML file that gets set once and forgotten. It means keeping runbooks close enough to the service that they stay current — reviewed when the service changes, not when someone notices they're stale. It means, periodically, simulating the loss of context: what would a new engineer need to diagnose and recover this service? If the answer involves asking someone specific, that's a fragility worth documenting.
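"Reviewed when the service changes" is mechanically checkable. A sketch of the comparison, using dates as stand-ins for whatever your repo or catalog actually records (the schema and service names are hypothetical):

```python
from datetime import date

def stale_runbooks(last_changed: dict[str, date],
                   runbook_reviewed: dict[str, date]) -> list[str]:
    """Flag services whose runbook predates the service's last
    change, or that have no runbook review recorded at all.
    Data shapes are illustrative assumptions.
    """
    stale = []
    for service, changed in last_changed.items():
        reviewed = runbook_reviewed.get(service)
        if reviewed is None or reviewed < changed:
            stale.append(service)
    return sorted(stale)

last_changed = {
    "payments-api": date(2024, 5, 1),
    "checkout-web": date(2024, 2, 10),
}
runbook_reviewed = {
    "payments-api": date(2023, 11, 3),  # older than the last change
    "checkout-web": date(2024, 3, 1),   # reviewed after the change
}

needs_review = stale_runbooks(last_changed, runbook_reviewed)
```

Wired into CI or a weekly report, this turns "runbooks age silently" into a visible, assignable list.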

The goal is not to eliminate complexity. Production systems are complex because the problems they solve are complex. The goal is to ensure the complexity is manageable — not just by the engineers who built it, but by the engineers who inherit it, who are on call for it at 2am, who have to reason about it under pressure with imperfect information.

That is a different design target than most teams aim for. It's harder to specify, harder to test, harder to demonstrate in a sprint demo. But it's the one that matters most when something goes wrong. Which it will.

Monday Morning
So what does this actually change about how you build on Monday?

Start by making invisible things visible. Which services in your portfolio have ownership that's genuinely confirmed, not just nominally attributed? Which runbooks have been touched in the last six months? What's your alert volume per on-call engineer? These are not rhetorical questions. They have answers. Generating them is a morning's work, and the findings are usually uncomfortable and immediately actionable.

Then pick the highest-consequence ambiguity and resolve it explicitly. Not in a meeting to plan a future meeting. Decide: this service is owned by this team, this is the escalation path when that team is unavailable, this is what changes require cross-team coordination and this is the protocol for making that fast. Write it down. Put it somewhere engineers will actually find it at 2am, which is not Confluence.

And consider — seriously, not performatively — running an incident simulation. Not a polished gameday with a pre-scripted scenario. A genuinely disruptive exercise: pull the on-call engineer into a fake severity-two, give them a service they know less well than they think they do, and watch where the coordination breaks down. The gaps that emerge are the gaps that will surface in real incidents. Finding them first, in a low-stakes context, is considerably better than finding them in the other kind.

None of this is glamorous. It doesn't show up in your deployment frequency metrics or your architecture diagrams. It doesn't make for a compelling tech blog post or a satisfying sprint retrospective.

But reliability is not a software property. It is an organizational outcome. The code is part of it. The humans are part of it. The structures that shape how humans coordinate, communicate, and carry context — those are part of it too, and they are at least as determinative as the code when things go wrong.

The system will eventually remind you of this. The question is whether you choose to act before it does.
