Yuto Takashi

Why SRE Investment Gets Undervalued (And How to Fix It)

Why You Should Care

If you've ever heard these questions from management:

  • "The system is working fine, why do we need more SRE budget?"
  • "Can't we just respond to incidents when they happen?"
  • "Why not just hire more developers to speed up development?"

You're not alone. SRE and Platform Engineering investments are often undervalued, and there's a structural reason for it.

This article explores why this happens and provides three concrete frameworks to justify infrastructure investment to non-technical executives.

The Police Department Analogy

Here's an interesting parallel: SRE work is similar to the work of police, fire departments, and disaster response teams.

Think about it:

  • They protect society using public funds
  • When nothing goes wrong, they train and prepare
  • But people often say "what are they even doing?" and cut their budgets
  • When problems occur, everyone asks "why didn't you prevent this?"

The value of prevention is invisible.

  • Ship a new feature → "We contributed to revenue!" (highly visible)
  • System runs 24/7 without issues → "That's expected" (taken for granted)
  • Prevented a major outage → Never happened, so nobody notices

Same with police and fire departments. Low crime rates and no fires are actually the result of their work, but it looks like "they're doing nothing."

The Negative Spiral

Here's what's scary: this structure creates a negative spiral.

Budget cuts → Staff shortage → More incidents → "SRE is incompetent" → Further budget cuts

This happens with police too: "Crime is rising, what are they doing?" → Budget cuts → Less patrol → More crime...

The cycle continues like this:

  1. Budget gets cut
  2. Fewer people, fewer tools
  3. Incidents increase
  4. "Why is SRE failing?"
  5. Even more budget cuts

The Chicken-and-Egg Problem

There's an even trickier issue: No budget until problems occur.

Pattern 1: Reactive funding

  • Major outage happens → Emergency budget approved → Fix implemented → System stable → "We're good now, right?" → Budget cut

Pattern 2: Prevention isn't valued

  • "Our database will hit limits in 6 months"
  • "But it's working now, do we really need this?"
  • (6 months later: outage)
  • "Why didn't you predict this?!"

This is exactly like earthquake-proofing budgets. Before the earthquake: "waste of money." After: "why didn't we do this?"

Platform Engineering: A New Approach

In the past 2-3 years, Platform Engineering has gained attention.

It's about building "self-service infrastructure" so developers can manage infrastructure themselves.

This emerged because of the gap between DevOps ideals and reality.

DevOps ideal (2010s)

  • "Developers do everything: build, deploy, operate!"
  • "You build it, you run it"

Reality

  • Operational burden concentrated on developers
  • Learning curve too steep (Kubernetes, Terraform, monitoring tools...)
  • Each team picks their own tools → chaos
  • Eventually, load concentrates on "the few who can operate"

In short, it was unrealistic to expect every developer to be an infrastructure expert.

SRE vs Platform Engineering

                | SRE                                | Platform Engineering
Primary Goal    | Protect service reliability        | Improve developer productivity
For Whom?       | End users (customers)              | Internal developers
Key Activities  | Incident response, SLO management  | Self-service platform, tooling

Using the police analogy:

  • Traditional SRE: Patrol cars responding to crimes
  • Platform Engineering: Installing streetlights and cameras, and empowering residents to protect themselves

In other words, SRE is shifting from "protector" to "enabler".

But Budget Issues Remain

Here's what I realized:

Changing the approach doesn't solve the "how much is enough?" problem.

Security camera example:

  • 10 cameras → "Does this even work?"
  • 100 cameras → "Do we really need that many?"
  • 1,000 cameras → "Isn't this overkill?"

Same with Platform Engineering:

  • Build CI/CD pipeline → "Too much effort?"
  • Developer portal → "How much does this license cost?"
  • Monitoring for all services → "Do small services need this?"

Moreover, Platform Engineering's value is harder to prove than SRE's incident response. You're proving "losses that didn't happen" rather than "losses that did happen."

Three Frameworks to Justify Investment

So how do you explain the need for investment?

1. Engineer Ratio Approach

Rule of thumb: 1 SRE per 10-20 developers

  • 50 developers → 3-5 SREs
  • 100 developers → 5-10 SREs

The right ratio varies with service scale and complexity.

Falling below this ratio increases the risk of entering a negative spiral.
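
As a rough illustration, the ratio above can be turned into a small calculation. This is a minimal Python sketch, not a staffing model: the function name `sre_headcount_range` and its parameters are hypothetical, and rounding up to whole people is an assumption chosen to match the examples above.

```python
import math

def sre_headcount_range(developers, devs_per_sre_min=10, devs_per_sre_max=20):
    """Return (low, high) SRE headcount for a 1-per-10-to-20 developer ratio.

    Rounding up is an assumption: a fraction of an SRE is treated as one hire.
    """
    low = math.ceil(developers / devs_per_sre_max)   # leanest ratio: 1 SRE per 20 devs
    high = math.ceil(developers / devs_per_sre_min)  # richest ratio: 1 SRE per 10 devs
    return low, high

for team_size in (50, 100):
    low, high = sre_headcount_range(team_size)
    print(f"{team_size} developers -> {low}-{high} SREs")
```

Running it for 50 and 100 developers reproduces the 3-5 and 5-10 ranges listed above. Adjust the two ratio parameters if your services are unusually simple or complex.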

2. IT Budget Percentage Approach

Rule of thumb: 10-20% of IT budget for operations (including SRE)

  • Annual IT budget $1M → $100K-$200K

This is a rough industry standard.
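
The same rule can be written as a one-line calculation. This is only a sketch: `ops_budget_range` is a hypothetical helper, and the 10 and 20 percent defaults are simply the rule of thumb quoted above, not a verified benchmark.

```python
def ops_budget_range(it_budget, low_pct=10, high_pct=20):
    """Return (low, high) annual operations budget as a share of the IT budget."""
    return it_budget * low_pct / 100, it_budget * high_pct / 100

low, high = ops_budget_range(1_000_000)
print(f"Operations budget: ${low:,.0f}-${high:,.0f}")
```

For a $1M IT budget this gives the $100K-$200K range from the example.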

3. Downtime Cost Calculation

Formula:

  1. Calculate revenue lost per hour of downtime
  2. Define acceptable annual downtime (e.g., a 99.9% availability target allows 0.1% of 8,760 hours, or 8.76 hours per year)
  3. Calculate potential annual loss
  4. Invest 10-20% of that amount

Example:

  • Hourly downtime loss: $50K
  • Annual acceptable downtime: 8.76 hours
  • Potential loss: $438K
  • Investment (10-20% of $438K): roughly $44K-$88K

If investment < expected loss, it's a rational investment.
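
The four steps of the formula can be sketched in Python so that you can plug in your own numbers. The function names, parameter names, and the 365-day year are assumptions made for illustration. The 10-20% share is the rule of thumb from step 4, not a derived figure.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, assuming a non-leap year

def allowed_downtime_hours(availability_target):
    """Hours of downtime per year permitted by an availability target (e.g. 0.999)."""
    return (1 - availability_target) * HOURS_PER_YEAR

def investment_range(hourly_loss, availability_target, low_share=0.10, high_share=0.20):
    """Return (potential annual loss, low investment, high investment)."""
    potential_loss = hourly_loss * allowed_downtime_hours(availability_target)
    return potential_loss, potential_loss * low_share, potential_loss * high_share

loss, low, high = investment_range(hourly_loss=50_000, availability_target=0.999)
print(f"Allowed downtime:      {allowed_downtime_hours(0.999):.2f} hours/year")
print(f"Potential annual loss: ${loss:,.0f}")
print(f"Suggested investment:  ${low:,.0f}-${high:,.0f}")
```

With a $50K hourly loss and a 99.9% target, this yields 8.76 hours of allowed downtime, a potential loss of $438,000, and a suggested investment of $43,800-$87,600. Changing the availability target shows how quickly the numbers move: at 99% the allowed downtime is ten times larger.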

Making It Clear for Non-Technical Executives

If your CTO or CEO has an engineering background, they'll understand. But when they don't, it gets tough.

You might not even get time to explain everything. And even if you do, they might not fully grasp it.

That's why we need to articulate the necessity of SRE and Platform Engineering at a level that non-engineers can understand.

What helps is something you can hand over and say "Read this first": a primer that builds foundational understanding.

Executive Guide Available

I created "SRE & Platform Engineering Guide for Executives" with this in mind.

The guide covers:

  • Why digital services are "cities," not "buildings"
  • What SRE is (police/fire department analogy)
  • What Platform Engineering is (roads/utilities analogy)
  • Why investment is necessary (visualizing "invisible losses")
  • How much to invest (three frameworks)
  • Common misconceptions
  • Decision-making checklist
  • Next actions

It is all written so that non-engineers can read and understand it in 10 minutes.

The complete guide is available in the original article.

Use it as a resource for conversations with leadership.

Conclusion: Infrastructure is Investment, Not Cost

SRE and Platform Engineering investment is like fire insurance.

Companies pay hundreds of thousands of dollars annually for fire insurance. Nobody says "it's wasteful because no fire happened."

Similarly:

  • Expected loss: $3M/year (outage risk)
  • Investment: $500K (SRE)
  • If investment < expected loss, it's rational

But many companies only invest in SRE after experiencing a major outage.

Police and fire departments get budgets before major incidents happen because they are recognized as "social infrastructure".

SRE should be recognized as "digital infrastructure" too.

Honestly, there's no absolute answer to "how much is enough?" It becomes a matter of organizational values and priorities.

But at least we can give decision-makers concrete material to reason with.

How much does your organization invest in SRE and Platform Engineering?


For more on this investment framework and the complete executive guide, check out the original article.

https://tielec.blog/en/tech/sre/why-sre-investment-undervalued
