The Economics of Reliability: When to Invest, When to Accept Risk

#sre #devops #reliability #strategy

Reliability is not a virtue. It's an investment. Too little and you lose customers. Too much and you can't afford to ship. The question is: where's the right balance?

The error budget framing

The SRE book covers this well. Pick an SLO (say 99.9% uptime). That's 43 minutes of budget per month. If you're at 99.95%, you have budget to spend on risky things. If you're at 99.85%, you need to stop shipping risk.

This works. But it doesn't answer 'what SLO should I pick?' Let me give you a framework.

The three questions

1. What do users expect? A consumer banking app needs 4 nines or more. A developer tool can get away with 3. A beta product can live with 99%. Ask users (or watch churn numbers).

2. What does an outage cost? Dollars of lost revenue + dollars of customer churn + hours of engineering time. For a checkout-heavy product, an hour of downtime might cost $500k. For a B2B internal tool, $5k.

3. What does the next 9 cost? Going from 99% to 99.9% might cost $50k of engineering work. Going from 99.9% to 99.99% often costs $500k or more. Each 9 is 10x harder.

The math

Invest in reliability up to the point where the next $1 invested saves less than $1 of outage cost over the amortization period.

If moving from 99% to 99.9% costs $50k and would save $200k over a year in reduced outage damage, invest. Easy call.

If moving from 99.9% to 99.99% costs $500k and saves $100k, don't invest. Accept the risk, and spend the engineering time on something with better ROI.

The hidden cost

Over-investing in reliability has a hidden cost: team velocity. Teams that chase 99.999% uptime spend so much on tests, canaries, staging environments, and approval gates that they ship slowly. Competitors with 99.5% reliability but 5x your velocity will win the market.

Reliability that kills velocity is bad reliability. Measure both.

The political reality

The hardest part isn't the math. It's defending 'we're not going to fix this' when a VP demands reliability improvements. You need explicit agreement, in writing, on the target SLOs — so 'we're not fixing this' is 'we agreed on 99.9% and we're at 99.92%, which is within budget.'

The pragmatic answer

Most teams I've worked with should pick:

99.9% for production-critical services
99% for internal tools
Best-effort for dev environments

Then measure, spend the error budget, and stop arguing about it. The math is usually clearer than the politics.

Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com

DEV Community