Yuto Takashi

Posted on Feb 14

Why SRE Investment Gets Undervalued (And How to Fix It)

Why You Should Care

If you've ever heard these questions from management:

"The system is working fine, why do we need more SRE budget?"
"Can't we just respond to incidents when they happen?"
"Why not just hire more developers to speed up development?"

You're not alone. SRE and Platform Engineering investments are often undervalued, and there's a structural reason for it.

This article explores why this happens and provides three concrete frameworks to justify infrastructure investment to non-technical executives.

The Police Department Analogy

Here's an interesting parallel: SRE work resembles the work of police, fire departments, and disaster response teams.

Think about it:

They protect society using public funds
When nothing goes wrong, they train and prepare
But people often say "what are they even doing?" and cut their budgets
When problems occur, everyone asks "why didn't you prevent this?"

The value of prevention is invisible.

Ship a new feature → "We contributed to revenue!" (highly visible)
System runs 24/7 without issues → "That's expected" (taken for granted)
Prevented a major outage → Never happened, so nobody notices

The same goes for police and fire departments. Low crime rates and the absence of fires are the result of their work, yet from the outside it looks like "they're doing nothing."

The Negative Spiral

Here's what's scary: this structure creates a negative spiral.

Budget cuts → Staff shortage → More incidents → "SRE is incompetent" → Further budget cuts

This happens with police too: "Crime is rising, what are they doing?" → Budget cuts → Less patrol → More crime...

The cycle continues like this:

Budget gets cut
Fewer people, fewer tools
Incidents increase
"Why is SRE failing?"
Even more budget cuts

The Chicken-and-Egg Problem

There's an even trickier issue: No budget until problems occur.

Pattern 1: Reactive funding

Major outage happens → Emergency budget approved → Fix implemented → System stable → "We're good now, right?" → Budget cut

Pattern 2: Prevention isn't valued

"Our database will hit limits in 6 months"
"But it's working now, do we really need this?"
(6 months later: outage)
"Why didn't you predict this?!"

This is exactly like earthquake-proofing budgets. Before the earthquake: "waste of money." After: "why didn't we do this?"

Platform Engineering: A New Approach

In the past 2-3 years, Platform Engineering has gained attention.

It's about building "self-service infrastructure" so developers can manage infrastructure themselves.

This emerged because of the gap between DevOps ideals and reality.

DevOps ideal (2010s)

"Developers do everything: build, deploy, operate!"
"You build it, you run it"

Reality

Operational burden concentrated on developers
Learning curve too steep (Kubernetes, Terraform, monitoring tools...)
Each team picks their own tools → chaos
Eventually, load concentrates on "the few who can operate"

→ "It was unrealistic to expect all developers to be infrastructure experts"

SRE vs Platform Engineering

Aspect	SRE	Platform Engineering
Primary Goal	Protect service reliability	Improve developer productivity
For Whom?	End users (customers)	Internal developers
Key Activities	Incident response, SLO management	Self-service platform, tooling

Using the police analogy:

Traditional SRE: Patrol cars responding to crimes
Platform Engineering: Install streetlights, cameras, empower residents to protect themselves

In other words, SRE is shifting from "protector" to "enabler".

But Budget Issues Remain

Here's what I realized:

Changing the approach doesn't solve the "how much is enough?" problem.

Security camera example:

10 cameras → "Does this even work?"
100 cameras → "Do we really need that many?"
1,000 cameras → "Isn't this overkill?"

Same with Platform Engineering:

Build CI/CD pipeline → "Too much effort?"
Developer portal → "How much does this license cost?"
Monitoring for all services → "Do small services need this?"

Moreover, Platform Engineering's value is harder to prove than SRE's incident response. You're proving "losses that didn't happen" rather than "losses that did happen."

Three Frameworks to Justify Investment

So how do you explain the need for investment?

1. Engineer Ratio Approach

Rule of thumb: 1 SRE per 10-20 developers

50 developers → 3-5 SREs
100 developers → 5-10 SREs

The right ratio varies with service scale and complexity.

Falling below this ratio increases the risk of entering a negative spiral.
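Here's a minimal sketch of the ratio rule in Python, in case you want to plug in your own headcount. The function name and the choice to round up to whole people are assumptions for illustration; the rule of thumb itself doesn't say how to handle fractions.

```python
import math


def sre_headcount_range(developers, devs_per_sre_low=10, devs_per_sre_high=20):
    """Return (min_sres, max_sres) using the 1 SRE per 10-20 developers rule of thumb.

    Rounding up to whole people is an assumption, not part of the rule.
    """
    if developers < 0:
        raise ValueError("developers must be non-negative")
    # One SRE per 20 developers gives the lower bound,
    # one SRE per 10 developers gives the upper bound.
    min_sres = math.ceil(developers / devs_per_sre_high)
    max_sres = math.ceil(developers / devs_per_sre_low)
    return min_sres, max_sres


print(sre_headcount_range(50))   # (3, 5)
print(sre_headcount_range(100))  # (5, 10)
```

Treat the result as a starting point for discussion, then adjust for service scale and complexity.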

2. IT Budget Percentage Approach

Rule of thumb: 10-20% of IT budget for operations (including SRE)

Annual IT budget $1M → $100K-$200K

Treat this as a rough rule of thumb rather than a formal industry standard.
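The percentage rule above is simple enough to do in your head, but a small sketch makes it easy to reuse in a spreadsheet-style script. The function name and parameter names are hypothetical, and the 10-20% bounds are just the rule of thumb from this section.

```python
def operations_budget_range(annual_it_budget, low_percent=10, high_percent=20):
    """Return (low, high) operations budget as a share of the annual IT budget."""
    if annual_it_budget < 0:
        raise ValueError("annual_it_budget must be non-negative")
    low = annual_it_budget * low_percent / 100
    high = annual_it_budget * high_percent / 100
    return low, high


low, high = operations_budget_range(1_000_000)
print(f"${low:,.0f} to ${high:,.0f}")  # $100,000 to $200,000
```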

3. Downtime Cost Calculation

Formula:

Calculate revenue lost per hour of downtime
Define acceptable annual downtime (e.g., 99.9% = 8.76 hours)
Calculate potential annual loss
Invest 10-20% of that amount

Example:

Hourly downtime loss: $50K
Annual acceptable downtime: 8.76 hours
Potential loss: $438K
Investment: roughly $44K-$88K (10-20% of $438K)

If investment < expected loss, it's a rational investment.
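The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, a year is taken as 8,760 hours (no leap years), and the loss per hour is assumed to be constant across the year.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored (assumption)


def allowed_downtime_hours(availability_target):
    """Convert an availability target such as 0.999 into allowed downtime hours per year."""
    return (1 - availability_target) * HOURS_PER_YEAR


def downtime_investment(hourly_loss, availability_target, low_share=0.10, high_share=0.20):
    """Return (hours, potential_loss, low_investment, high_investment)."""
    hours = allowed_downtime_hours(availability_target)
    # Assumes every hour of downtime costs the same amount.
    potential_loss = hourly_loss * hours
    return hours, potential_loss, potential_loss * low_share, potential_loss * high_share


hours, loss, low, high = downtime_investment(50_000, 0.999)
print(f"{hours:.2f} h/year, ${loss:,.0f} potential loss, invest ${low:,.0f} to ${high:,.0f}")
# 8.76 h/year, $438,000 potential loss, invest $43,800 to $87,600
```

Swap in your own hourly loss and availability target to see how quickly the numbers move: going from 99.9% to 99% multiplies the allowed downtime by ten.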

Making It Clear for Non-Technical Executives

If your CTO or CEO has an engineering background, they'll understand. But when they don't, it gets tough.

You might not even get time to explain everything. And even if you do, they might not fully grasp it.

That's why we need to articulate the necessity of SRE and Platform Engineering at a level that non-engineers can understand.

What helps is something you can hand over and say "Read this first": a primer that builds foundational understanding.

Executive Guide Available

I created "SRE & Platform Engineering Guide for Executives" with this in mind.

The guide covers:

Why digital services are "cities," not "buildings"
What SRE is (police/fire department analogy)
What Platform Engineering is (roads/utilities analogy)
Why investment is necessary (visualizing "invisible losses")
How much to invest (three frameworks)
Common misconceptions
Decision-making checklist
Next actions

All written to be understandable by non-engineers in 10 minutes.

The complete guide is available in the original article.

Use it as a resource for conversations with leadership.

Conclusion: Infrastructure is Investment, Not Cost

SRE and Platform Engineering investment is like fire insurance.

Companies pay hundreds of thousands of dollars annually for fire insurance. Nobody says "it's wasteful because no fire happened."

Similarly:

Expected loss: $3M/year (outage risk)
Investment: $500K (SRE)
If investment < expected loss, it's rational

But many companies only invest in SRE after experiencing a major outage.

Police and fire departments get budgets before major incidents happen because they're recognized as "social infrastructure".

SRE should be recognized as "digital infrastructure" too.

Honestly, there's no absolute answer to "how much is enough?" It becomes a matter of organizational values and priorities.

But at least we can provide "materials for thinking."

How much does your organization invest in SRE and Platform Engineering?

For more on this investment framework and the complete executive guide, check out the original article.

https://tielec.blog/en/tech/sre/why-sre-investment-undervalued
