There's a particular kind of confidence that settles over an engineering team after they've fully automated their deployment pipeline. The dashboards are green. The pipelines pass. On-call is quiet. And somewhere in the back of their minds, they've filed the whole infrastructure question under solved. This is not hubris, exactly — it's the reasonable conclusion of months of careful work. It just happens to be wrong.
Automation is a force multiplier. What most teams don't sit with long enough is what, precisely, it multiplies.
What You Actually Encoded
Every automated system is a fossilized argument. Someone, at some point, looked at a problem and made a series of judgment calls: how often to retry a failing request, at what CPU threshold to trigger a scale-out event, which IAM roles should cascade from which parent policies, what constitutes a "safe" deployment gate versus an unnecessarily paranoid one. Those judgments got encoded. They became executable. And then — because they worked — they became invisible.
This is the central mechanism people keep missing. Automation doesn't replace human judgment. It preserves human judgment and applies it at machine speed, indefinitely, without asking whether the original judgment still holds.
A retry policy written in 2019 for a monolith that handled 50k requests per day doesn't automatically recalibrate when that same retry logic gets applied, years later, to a distributed mesh of thirty microservices processing 5 million events per hour. It just runs. Faithfully. Catastrophically, sometimes.
The assumption that got encoded — that retrying N times with a fixed backoff was sensible — was probably correct at the time of writing. The system it was written for no longer exists. The automation does.
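The encoded judgment can be as small as two constants. A minimal sketch of the pattern (hypothetical names and numbers, not any particular codebase):

```python
import time

# Written circa 2019 for a ~50k req/day monolith. These numbers are
# judgments, not laws: 3 attempts and a flat 2-second pause were sensible
# for that system's failure modes. Nothing here re-checks that premise.
MAX_RETRIES = 3
BACKOFF_SECONDS = 2.0

def call_with_retry(request_fn, sleep=time.sleep):
    """Retry a failing call a fixed number of times with a fixed pause."""
    last_error = None
    for attempt in range(MAX_RETRIES):
        try:
            return request_fn()
        except ConnectionError as exc:
            last_error = exc
            sleep(BACKOFF_SECONDS)  # fixed backoff: every caller waits the same
    raise last_error
```

At 50k requests per day, those constants are harmless. Applied unchanged across a mesh handling 5 million events per hour, the same fixed backoff means every failing caller retries on the same cadence, in lockstep.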
On Blast Radius, and Why It Compounds
Manual errors have a natural blast radius constraint: a human can only do so much wrong per unit time. They get tired. They second-guess themselves. They stop and ask someone. The failure surface of a person is bounded by human cognitive throughput.
Automation has no such constraint.
A misconfigured Terraform module that gets instantiated across forty services doesn't fail forty times slower than a human would. It fails all at once, consistently, with the full authority of a system that everyone has agreed to trust. A flawed IAM permission template applied organization-wide doesn't announce itself with an error message you'd catch during code review — it propagates quietly through every role that inherits from it, waiting for the exact moment someone tries to do something the template has accidentally made impossible, or worse, accidentally made possible for everyone.
I've seen this pattern play out with auto-scaling configurations more times than I care to count. Someone sets a scaling trigger based on memory utilization, which made sense when the application had a roughly linear relationship between memory and load. Then the application changes — a new caching layer gets introduced, memory utilization flattens out at a lower baseline — and now the auto-scaler is perpetually convinced the system is underloaded. It doesn't provision enough capacity during traffic spikes. The service degrades. On-call gets paged. Someone spends three hours tracing what looks like a load balancer problem before they find a scaling threshold that hasn't been reviewed in fourteen months.
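The failure mode fits in a few lines. A sketch with a made-up threshold, to show how the same rule reads the same spike two different ways:

```python
# A scaling rule encodes a prediction: "memory tracks load." The 70%
# threshold below is hypothetical; it was accurate until a caching layer
# flattened memory usage, after which the rule reads a busy system as idle.
SCALE_OUT_MEMORY_PCT = 70.0

def desired_instances(current: int, memory_pct: float) -> int:
    """Scale out on high memory; otherwise hold (simplified: no scale-in)."""
    if memory_pct >= SCALE_OUT_MEMORY_PCT:
        return current + 1
    return current

# Before the caching layer: a traffic spike pushes memory to ~85% -> scale out.
# After: the same spike leaves memory near ~40% -> the rule holds steady
# while latency climbs, and on-call chases a "load balancer problem."
```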
Centralized logic centralizes failure. This is not a bug in automation — it's a structural property of it.
The Retry Storm, in Enough Detail to Be Useful
The retry storm pattern deserves more than a diagram. Here's what it actually looks like from the inside.
A service starts rejecting requests — say, because its database connection pool is exhausted. Upstream callers, all of which have been configured with retry logic (as they should be), begin retrying. Good. That's the intended behavior. But the retry logic doesn't know why requests are failing — it just knows they're failing and that it should try again. So it does. Immediately, or with a brief backoff. The database, which was already saturated, receives more requests. The connection pool stays exhausted. More retries fire.
Meanwhile, the auto-scaler, observing high request latency and elevated CPU on the service, decides to provision more instances. Those new instances spin up and immediately begin processing the backlogged request queue — which means they immediately hit the same saturated database. More connections requested. Same exhausted pool. The new instances are now also retrying, which means the total retry volume has increased. The database is receiving orders of magnitude more traffic than the original load would have generated.
Every component behaved exactly as designed. The retry policy did what it was supposed to do. The auto-scaler did what it was supposed to do. The connection pool limit was set to a reasonable value. The emergent behavior — a cascading amplification loop that makes a localized failure systemic — was not designed. It wasn't modeled. Nobody sat down and worked through what would happen when these three independently reasonable systems ran concurrently under a specific failure condition.
The absence of systemic modeling is the actual problem. Not the retry policy. Not the auto-scaler. The failure to ask: what does this automation look like in conversation with everything else?
Exponential backoff with jitter helps. Circuit breakers help more. But both of those are still point solutions applied to individual services, not a rethinking of how the behaviors interact at a system level.
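For concreteness, here's what those two point solutions look like in miniature. A sketch, not production code — a real circuit breaker would also need a half-open probe timer to recover automatically:

```python
import random

def backoff_with_jitter(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full jitter: sleep a random amount up to an exponentially growing cap,
    so synchronized callers don't retry in lockstep."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

class CircuitBreaker:
    """Stop calling a failing dependency after N consecutive failures,
    instead of adding retry traffic to a saturated system."""
    def __init__(self, failure_threshold: int = 5):
        self.failure_threshold = failure_threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.failure_threshold

    def call(self, fn):
        if self.open:
            raise RuntimeError("circuit open: failing fast, not retrying")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # any success resets the count
        return result
```

Both reduce amplification at the individual caller. Neither answers the system-level question of what happens when every caller's breaker trips and recovers at once.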
The Silence Problem
There's a failure mode that doesn't page anyone, doesn't trip any alerts, and doesn't appear on any dashboard. It accrues slowly, over months or years, and by the time it manifests it's been so thoroughly normalized that it's nearly impossible to distinguish from intended behavior.
Automation creates a perception of control that outlasts the validity of the assumptions it was built on.
The thresholds are outdated. But the alerts are quiet, so nobody looks. The monitoring was designed for a system architecture that's been refactored twice since the dashboards were built. The metrics that would indicate something is wrong aren't being collected, because nobody thought to add them when the new service was deployed and nobody thought to ask whether the old metrics still meant what they used to mean. The alerts that do fire get routed to a Slack channel that three people have muted because it fires too frequently and is almost always a false positive.
Automated systems fail quietly first. They degrade. They drift from their intended operating envelope, gradually, in ways that are individually imperceptible. The failure that eventually surfaces isn't usually a sudden collapse — it's the last increment in a long accumulation of small divergences. And when that failure hits, the task of diagnosing it is effectively archaeological. You're not debugging a system; you're reconstructing the history of a system from evidence it didn't know to preserve.
This is what I mean when I say automation becomes invisible. It's not that people forget it exists. It's that they stop interrogating it. It works, so nobody asks whether it's still doing what they think it's doing.
When Automation Outpaces the Team's Mental Model
A team of eight engineers builds a deployment pipeline. They all understand it. They know why the staging gate exists, what the integration test suite is actually validating, why the rollback trigger is configured the way it is. The pipeline reflects their collective understanding of the system.
The team grows. People leave, people join. The system evolves. New services get added. The pipeline gets extended, incrementally, by people who understand the new additions but not necessarily the original context. A year later, you have a pipeline that nobody fully understands end-to-end. Not because the engineers are careless — they're not — but because the complexity of the whole has grown beyond what any individual holds in their head.
Now ask: when something breaks in that pipeline, who can diagnose it?
This is not a hypothetical. I've been in incident bridges where eight engineers were staring at a failed deployment and nobody was confident what the failing stage was actually doing, why it was sequenced where it was, or what the safe intervention path looked like. The automation had accumulated organizational knowledge that was no longer distributed across any of the people responsible for it.
Automation abstracts complexity. Abstraction is useful right up until the abstraction fails and you need to understand what's underneath it.
The engineers who built the original system knew. Some of them left. The ones who joined after the fact learned to operate the system — to feed it inputs and read its outputs — without necessarily understanding the reasoning that shaped it. Which is fine, until it isn't.
Guardrails, Not Guarantees
Here's the conceptual error that gets teams into trouble: treating automation as a guarantee rather than a guardrail.
A guardrail is a constraint that makes a bad outcome less likely. A guarantee is a promise that a specific outcome will occur. These are not the same thing, and conflating them is where the real risk lives.
An automated deployment gate that checks for test coverage thresholds is a guardrail. It reduces the probability of shipping undertested code. It doesn't guarantee quality. It doesn't account for tests that pass but don't actually validate the behavior you care about. It doesn't know that the test suite you wrote eighteen months ago was designed to test a system that has since been refactored into something structurally different.
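The distinction shows up even in a toy version of that gate (hypothetical threshold; a sketch, not a real CI integration):

```python
COVERAGE_FLOOR = 80.0  # hypothetical guardrail value -- not a quality bar

def gate_passes(coverage_pct: float) -> bool:
    """The gate sees one number. It lowers the odds of shipping untested
    code; it cannot see whether the tests behind that number assert anything."""
    return coverage_pct >= COVERAGE_FLOOR

def run_checkout_flow() -> str:
    return "order-created"  # stand-in for real application code

def test_checkout_vacuous():
    run_checkout_flow()  # executed (counts as covered), but nothing asserted
```

The vacuous test above inflates coverage, the gate passes, the guardrail held — and nothing was guaranteed.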
The guardrail is only as good as the model it encodes. And models age.
The dangerous thing about treating automation as a guarantee is that it atrophies the human judgment that should be running in parallel. Teams stop asking whether the deployment gate is checking the right things because they've tacitly decided the deployment gate handles it. The gate passes, so the code ships. The implicit assumption is that if the automation didn't catch it, there's nothing to catch.
This is the illusion of safety. Not that everything is fine. That someone else — the system — is responsible for knowing whether everything is fine.
What Careful Builders Actually Do
None of this is an argument against automation. Automation is the right call. The alternative — scaling human operations linearly with system complexity — isn't viable. The question isn't whether to automate; it's whether you're maintaining an honest relationship with what you've automated.
Some patterns that hold up:
Make intent explicit and durable. Every automated rule should have a comment that explains not what it does but why it exists — what failure mode it was designed to prevent, what assumptions it makes about the system, when those assumptions should be revisited. This sounds obvious. Almost nobody does it consistently. The rule that's been "always there" is exactly the one that accumulated without documentation, and it's exactly the one you'll be most confused by when it misfires.
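One way to make that durable is to attach the why to the value itself, rather than leaving it as a bare magic number. A sketch with hypothetical fields and an invented example entry:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class AutomationRule:
    """Carry durable intent alongside an automated threshold."""
    name: str
    value: float
    why: str          # the failure mode this rule exists to prevent
    assumes: str      # the premise about the system it depends on
    review_by: date   # when that premise should be re-checked

# Hypothetical entry -- the point is the fields, not the numbers:
DB_POOL_LIMIT = AutomationRule(
    name="db_connection_pool_max",
    value=50,
    why="Prevent one service from exhausting shared DB capacity",
    assumes="~50k requests/day; one database shared by three services",
    review_by=date(2026, 6, 1),
)
```

Whether this lives in code, in a Terraform comment, or in a registry matters less than the discipline: no threshold without a stated assumption and a date to revisit it.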
Treat automated thresholds as hypotheses, not facts. A scaling threshold, a retry count, a timeout value — these are predictions about system behavior under specific conditions. Conditions change. The prediction needs to be re-evaluated. Building in a scheduled review cycle for critical automation parameters is not bureaucratic overhead; it's the minimum viable hygiene for systems that run at scale.
Design for legibility, not just correctness. An automated action that's correct but untraceable is a liability. When something goes wrong — and it will — you need to be able to reconstruct what the automation did, when it did it, and why it decided to do it. Every automated action should emit a structured log entry that captures the input condition that triggered it, the decision made, and the outcome. Not just that the auto-scaler scaled out, but what metric reading at what threshold at what time caused it to scale from how many instances to how many.
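A minimal sketch of what that log entry might contain, assuming a JSON-lines pipeline (the field names are illustrative, not a standard):

```python
import json
from datetime import datetime, timezone

def log_scaling_decision(metric: str, reading: float, threshold: float,
                         before: int, after: int) -> str:
    """Emit a structured record of an automated action: what triggered it,
    what it decided, and what changed -- reconstructable after the fact."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "action": "scale_out",
        "trigger_metric": metric,
        "reading": reading,
        "threshold": threshold,
        "instances_before": before,
        "instances_after": after,
    }
    return json.dumps(entry)
```

Six months later, that one line is the difference between "the auto-scaler did something" and "at 14:02 UTC, cpu_pct read 91 against a threshold of 85, and we went from 4 to 6 instances."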
Preserve the manual override path. Automation should degrade gracefully toward human control, not away from it. A self-healing system that's impossible to pause or override during an incident is not a resilient system — it's a runaway process with administrative access. Manual override paths rust quickly if they're never exercised; some teams have found it useful to deliberately invoke them during game days to confirm they still work and that the team still knows how to use them.
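The override itself can be simple; what matters is that it exists and gets exercised. A sketch (in a real system the pause switch would live outside the process, in shared state an operator can reach):

```python
class AutoRemediator:
    """Self-healing loop with a pause switch a human can flip mid-incident."""
    def __init__(self):
        self.paused = False
        self.actions = []

    def pause(self):
        """The manual override path -- invoke it in game days so it stays alive."""
        self.paused = True

    def resume(self):
        self.paused = False

    def remediate(self, issue: str) -> bool:
        if self.paused:
            self.actions.append(("skipped", issue))  # record, don't act
            return False
        self.actions.append(("remediated", issue))
        return True
```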
Modularize blast radius. Centralized automation templates are efficient right up until they're catastrophically wrong. The Terraform module reused across forty services is a productivity win until the day someone introduces a misconfiguration and forty services are affected simultaneously. The efficiency-to-blast-radius tradeoff doesn't have a universal answer, but it should be a deliberate choice rather than an accidental consequence of how the automation was structured.
Monday Morning
If I were sitting down on Monday to think about what to actually do with this, the honest answer is that I'd start with an inventory exercise nobody wants to do.
Pull up every piece of automation that touches production — every scaling policy, every CI gate, every IAM template, every retry configuration, every auto-remediation runbook. For each one, ask: does anyone on the current team understand why this exists? When was it last reviewed? What assumptions does it encode, and are those assumptions still accurate?
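The inventory doesn't need tooling to start — a table and two questions will do. A sketch with invented entries, flagging anything unreviewed for a year or orphaned by departures:

```python
from datetime import date

# Hypothetical inventory rows: (name, last_reviewed, owner_still_on_team)
AUTOMATION_INVENTORY = [
    ("payments-retry-policy", date(2019, 3, 1), False),
    ("web-autoscaler-thresholds", date(2024, 1, 15), True),
    ("org-iam-base-template", date(2022, 7, 9), False),
]

def stale_entries(inventory, today, max_age_days=365):
    """Flag automation nobody has reviewed recently or whose author is gone."""
    return [name for name, reviewed, owned in inventory
            if (today - reviewed).days > max_age_days or not owned]
```

Whatever survives the filter is the automation you actually understand. The rest is the fraction discussed below.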
Most teams, if they're honest, will find they can't answer those questions for a significant fraction of their automation. That fraction is risk.
The goal isn't to tear out the automation and rebuild it. It's to close the gap between what the automation assumes about the system and what the system actually is. In some cases that means updating thresholds. In some cases it means adding observability that was never there. In some cases it means documenting intent that lives only in the head of someone who left eighteen months ago.
The point is not to slow down. It's to stop accumulating invisible debt at machine speed.
Automation preserves yesterday's decisions at tomorrow's scale. That's the core of it. The decisions were probably good when they were made. The question — the one most teams aren't asking rigorously enough — is whether they're still good now, applied to a system that has grown and changed in ways that no single decision-maker fully anticipated.
Revisit what you automated. Not because automation is dangerous, but because the alternative is letting old arguments run your infrastructure indefinitely, at scale, unchallenged.
That's not reliability. That's momentum.