Your cloud provider promises 99.9% uptime and you nod along like that's basically perfect. I did too, for years. Then I actually ran the numbers.
## The Math Nobody Does
99.9% uptime means your system can be completely dead for 8 hours and 46 minutes per year, an entire workday, and you're still "meeting SLA." That's not a rounding error. That's lunch, two meetings, and a coffee break's worth of your service being a 404 page.
Here's the full breakdown:
| Nines | Uptime % | Downtime/Year | Downtime/Month |
|---|---|---|---|
| Two | 99% | 3.65 days | 7.2 hours |
| Three | 99.9% | 8h 46m | 43.2 minutes |
| Four | 99.99% | 52.6 minutes | 4.32 minutes |
| Five | 99.999% | 5.26 minutes | 25.9 seconds |
That jump from three nines to four isn't a 0.09% improvement. It's 10x less downtime. And every additional nine after that? Another 10x reduction. The percentages make it look incremental. The reality is exponential.
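If you want to sanity-check the table, the whole thing is a few lines of unit conversion. Here's a minimal Python sketch, assuming a 365.25-day year and a 30-day month (which is why the monthly column comes out to round-looking numbers like 43.2):

```python
# Sketch: convert an availability target into allowed downtime.
# Assumes a 365.25-day year and a 30-day month, matching the table above.
MINUTES_PER_YEAR = 365.25 * 24 * 60   # 525,960
MINUTES_PER_MONTH = 30 * 24 * 60      # 43,200

def allowed_downtime(availability: float) -> tuple[float, float]:
    """Minutes of downtime per (year, month) that still meet the target."""
    unavailability = 1.0 - availability
    return unavailability * MINUTES_PER_YEAR, unavailability * MINUTES_PER_MONTH

for target in (0.99, 0.999, 0.9999, 0.99999):
    per_year, per_month = allowed_downtime(target)
    print(f"{target:.3%} -> {per_year:8.1f} min/year, {per_month:7.2f} min/month")
```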
## Each Nine Roughly Doubles the Bill
Going from 99.9% to 99.99% doesn't mean spending 0.09% more on infrastructure. It means redundant databases, multi-region failover, automated health checks, load balancers that actually work, and on-call engineers who get paged at 3 AM on a Sunday.
I've seen teams burn through $200K/month in AWS costs chasing a fourth nine they didn't need. Their product was an internal dashboard that 40 people used during business hours. Nobody was checking it at 2 AM. Nobody cared if it took 30 seconds to recover from a blip.
Meanwhile, the engineering team was maintaining a Rube Goldberg machine of health checks, circuit breakers, and multi-AZ deployments — all to prevent downtime that wouldn't have mattered.
## The Real-World Price Tag
Downtime costs averaged $14,056 per minute in 2024 across industries. Amazon's one-hour outage cost an estimated $34 million. The 2025 AWS US-EAST-1 incident ran up a tab estimated at $75 million per hour for affected businesses.
But here's what those scary numbers obscure: the cost of downtime depends entirely on what's down. A payment processing system going offline during Black Friday is a five-alarm fire. Your team's internal wiki going down for 20 minutes on a Tuesday? Nobody notices.
## The Composite Availability Trap
This one catches people off guard. If your app depends on three services — say a database, a cache layer, and an auth provider — each running at 99.9%, your composite availability isn't 99.9%. It's roughly 99.7%.
The math: 0.999 × 0.999 × 0.999 ≈ 0.997. That triples your expected downtime. Add more dependencies and it gets worse. I've worked on systems with 15+ microservices in the critical path, and the theoretical composite availability was genuinely depressing.
This is why distributed systems are hard. Every network hop, every external API call, every managed service is another multiplier dragging your real availability down.
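If you want to play with the multiplication yourself, here's a rough sketch. It assumes the pessimistic baseline: every dependency is in series on the critical path, so a request fails if any one of them is down.

```python
# Sketch: composite availability of services wired in series.
# If any dependency on the critical path is down, the request fails,
# so the individual availabilities multiply.
from math import prod

HOURS_PER_YEAR = 365.25 * 24

def composite(availabilities: list[float]) -> float:
    """Availability of a chain where every service is on the critical path."""
    return prod(availabilities)

chain = composite([0.999, 0.999, 0.999])   # database, cache, auth provider
print(f"{chain:.4%}")                      # ~99.7003%
print(f"{(1 - chain) * HOURS_PER_YEAR:.1f} hours of expected downtime per year")  # ~26.3

# Fifteen 99.9% services in the critical path: ~98.5%, roughly 5 days a year.
print(f"{composite([0.999] * 15):.4%}")
```

Redundancy flips the formula (for truly independent replicas, the *unavailabilities* multiply instead), which is exactly why people spend money on it.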
## So What Do You Actually Need?
- **Two nines (99%):** Fine for dev/staging environments, internal tools nobody relies on critically, and hobby projects.
- **Three nines (99.9%):** Covers most SaaS products, content sites, and non-financial APIs. This is where the cost-to-benefit ratio peaks for the majority of companies.
- **Four nines (99.99%):** E-commerce during peak traffic, healthcare systems, anything where minutes of downtime have direct revenue impact. Expect serious infrastructure investment.
- **Five nines (99.999%):** Financial trading systems, emergency services, telecom infrastructure. You need dedicated SRE teams, chaos engineering practices, and a budget that makes your CFO nervous. With 5.26 minutes of total annual downtime, you can't even do a slow database migration without eating your entire error budget.
## Error Budgets Changed How I Think About This
Google's SRE team popularized the idea of error budgets, and it flipped my perspective. Instead of "maximize uptime," the question becomes: "how much downtime can we spend?"
With a 99.9% monthly SLO, you've got a budget of 43.2 minutes. A 15-minute incident burns a third of it. That constraint forces honest conversations: is this feature launch worth the risk of eating 10 minutes of our budget? Should we slow down deployments this month because we already had an incident?
It turns reliability from a vague aspiration into a concrete resource you manage.
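Here's that framing as a toy calculation. The function names are mine, not anything from Google's SRE book, and it uses the same 30-day month as the 43.2-minute figure above:

```python
# Sketch: a monthly error budget and how incidents spend it.
# Assumes a 30-day month; the names are made up for illustration.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

def monthly_budget(slo: float) -> float:
    """Minutes of downtime a monthly SLO allows."""
    return (1.0 - slo) * MINUTES_PER_MONTH

def remaining(slo: float, incident_minutes: list[float]) -> float:
    """Budget left after the incidents logged so far this month."""
    return monthly_budget(slo) - sum(incident_minutes)

budget = monthly_budget(0.999)      # 43.2 minutes
left = remaining(0.999, [15.0])     # one 15-minute incident
print(f"budget {budget:.1f} min, {left:.1f} min left ({left / budget:.0%})")
# -> budget 43.2 min, 28.2 min left (65%)
```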
## Pick Your Number Honestly
Most teams I've worked with overestimate what they need. They put "five nines" on a slide deck because it sounds professional, then spend six months building infrastructure for a reliability target that's wildly out of proportion with their actual user expectations.
Start from the other direction. How long can your service actually be down before someone notices? Before it costs real money? Before users leave? That's your real SLA — not whatever number marketing put on the website.
The nines aren't lying exactly. But they're definitely not telling the whole truth.
I break down more engineering concepts like this on my YouTube channel. If uptime math keeps you up at night, you're in good company.