Ajay Devineni

Posted on Jul 1

The Production Tax: What SRE Taught Me That Code Reviews Never Could

#sre #terraform #devops #ai

A practitioner response to the "SREs are glorified sysadmins" conversation

I've been paged at 3:00 AM for an unrotated certificate that silently expired and took down an entire checkout flow.

I've watched a Kubernetes cluster pass every chaos test we threw at it, then surprise the entire team the first time a real Availability Zone went dark — on a Friday night, naturally.

I've reviewed IAM policies that looked perfectly clean in a pull request, then watched that same policy lock out a critical service account during a regional failover window because nobody had tested the assumed-role chain under actual failover conditions.

None of those failures were algorithmic. Every single one was operational.

This is the part of the job nobody explains in a system design interview.

The Staging Lie

There is a specific flavor of overconfidence that comes from watching a database migration succeed in staging. The data volumes are smaller. The read replicas aren't under live traffic. The connection pool isn't being competed for by seventeen other services simultaneously. Nobody is retrying failed writes on top of your migration window.

Staging tells you the migration logic is probably correct. It tells you almost nothing about whether the migration will survive contact with production.

The senior SREs I've worked with have a name for this gap: the staging lie. Not because anyone is being dishonest. Because the environment itself lies. It is structurally incapable of simulating the one thing that actually breaks things — real, concurrent, bursty, inconsistent human behavior at scale.

What changes when you've internalized this?

You start writing migrations differently. You design them to be pausable. You add row-count checkpoints. You instrument the migration itself, not just the application on top of it. You test rollback, not as an afterthought, but as part of the success criteria for the migration plan.

You stop treating staging success as a green light. You treat it as a starting point.

What IAM Failures Actually Teach You

The IAM lockout I mentioned above wasn't a permissions misconfiguration in the obvious sense. The policy was correct for the steady state. The problem was that nobody had modeled what the assumed-role chain looked like when the primary region was degraded and traffic was failing over.

The service account had permission to assume a role in the primary region. The role in the primary region had permission to access the resource. In failover, the service account was suddenly trying to assume a role in a region it had never touched, in an account where the trust policy hadn't been updated to reflect the failover path.

Clean in the PR review. Broken in a real incident.

The lesson isn't "audit your IAM policies more carefully." The lesson is: test your failure paths with the same rigor you test your happy paths. The failure path IS a feature. It has its own requirements. It needs its own coverage.

This is the mental shift that separates SRE work from application development in the way it actually matters. A developer's job is to make the system do the right thing. An SRE's job is to make the system do the least wrong thing when conditions are not right — and to know in advance exactly how wrong things can get.

Error Budgets Are Not a Reporting Tool

I want to push back on something I see regularly in teams that have adopted SRE vocabulary without adopting SRE discipline.

Error budgets get treated as a reporting metric. Something that lives in a dashboard, gets reviewed in a monthly reliability review, and occasionally generates a Slack message when it's burning down too fast.

That is not what an error budget is for.

An error budget is a decision-making tool. It answers a specific question in real time: do we have headroom to deploy this change right now, or are we burning reliability faster than our users have agreed to tolerate?

When the budget is healthy, you move fast. You ship features. You run experiments. You take calculated risks because you have margin to absorb them.

When the budget is tight, you slow down. You freeze non-critical deploys. You prioritize reliability work. You explicitly negotiate with the product team about what it means to ship into a degraded error budget.

The budget makes that conversation objective. It takes it out of the territory of "the SRE team is being conservative again" and puts it into the territory of "we made a commitment to our users and here is where we stand against that commitment."

Teams that get this right stop having the "ops vs. product" tension. The budget is the arbiter. It's not personal.

The War Story That Changed How I Write Code

Here is mine.

A service I was responsible for was making synchronous calls to a downstream dependency as part of a critical user-facing flow. The downstream service had a p99 latency of around 200ms under normal conditions. Fine.

During a regional brownout — not a full outage, a brownout — that downstream service's p99 climbed to 8 seconds. My service had no circuit breaker. No timeout that was actually enforced at the HTTP client level (there was a timeout configured; it wasn't working because of how the client library handled connection reuse). No fallback behavior.

The result: every request to my service held a thread for up to 8 seconds. The thread pool exhausted. My service started timing out from the perspective of the services calling it. The incident scope grew from "downstream service brownout" to "three-layer cascade across the critical path."

The brownout was 11 minutes. The recovery from my service's cascading failure took 47 minutes after the downstream came back, because the connection pool was in a state that required a rolling restart to clear.

What did I change after that?

Every external call now has an enforced timeout at the network layer, not just the application layer. Every integration has a circuit breaker. Every integration has a tested fallback path — not a "we'll figure it out" note in the runbook, but a codified fallback that is exercised in pre-production.

And I stopped thinking about retries as "something we add if we need it." Retries with exponential backoff and jitter are now a default expectation for any service I build that calls anything external. They go in on day one.

What the Best Developers Figure Out

The developers who become genuine strategic partners for platform teams are the ones who start asking the right questions before they write a line of feature code.

Not "will this work?" but "how will this fail?"

Not "what's the happy path latency?" but "what's my blast radius if this dependency degrades?"

Not "does this pass CI?" but "do I have a rollback path that doesn't require a hotfix cycle?"

These questions are not SRE questions. They are engineering questions. SRE teams have just been asking them longer, under more pressure, with more immediate feedback when the answers are wrong.

The discipline is learnable. The war stories are the curriculum.

What's yours? Drop it in the comments — the specific incident that changed how you think about reliability. The more specific, the more useful it is for everyone reading.

Tags: #SRE #DevOps #PlatformEngineering #SoftwareEngineering #CloudArchitecture #Reliability #IncidentManagement #ProductionEngineering

DEV Community

The Production Tax: What SRE Taught Me That Code Reviews Never Could

Top comments (0)