The $150 Million Outage
On July 8, 2015, the New York Stock Exchange halted trading for 3.5 hours due to a software glitch.
On the same day, United Airlines grounded flights due to a router issue.
Two separate incidents. Two industries. One shared lesson: when systems go down, the cost isn't just technical — it's financial, reputational, and sometimes irreversible.
The engineers who built those systems weren't incompetent. They just didn't design for failure explicitly enough.
That's what availability and reliability engineering is about. Not hoping your system stays up — but designing it to stay up, and planning for the moment it doesn't.
Availability vs Reliability: The Difference That Matters
These words get used interchangeably. They shouldn't.
Availability = Is the system reachable right now? It's a percentage: what fraction of time is the system operational?
Reliability = Does the system do what it's supposed to do, consistently? A system can be available (you can connect) but unreliable (it returns wrong data).
Example: A vending machine that's always powered on (available) but sometimes gives you Diet Coke when you pressed Sprite (unreliable). Available ≠ Reliable.
For system design, you need to engineer both — but they require different strategies.
Decoding the Nines
"We have 99.9% uptime" sounds impressive. Let's see what it actually means:
99.9% (three nines) is the baseline for most web applications. But roughly 8.8 hours of downtime per year is still painful if it hits all at once, say during Black Friday.
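99.99% (four nines) is the level major cloud providers such as AWS, Google Cloud, and Azure commit to for some of their core services, though the exact figure varies by service and configuration. Four nines means your entire year's downtime budget is about 52 minutes.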
99.999% (five nines) is the gold standard for telecom and financial systems. Less than 6 minutes of downtime per year. This is genuinely hard to achieve and expensive to maintain.
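The downtime figures for each level follow directly from the availability percentage. A minimal sketch of that arithmetic, assuming a 365-day year (the function name `downtime_minutes_per_year` is ours, purely for illustration):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap year


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


for pct in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% -> {minutes:.1f} minutes/year ({minutes / 60:.2f} hours)")
```

Each extra nine divides the budget by ten: about 8.76 hours at three nines, about 52.6 minutes at four, and about 5.3 minutes at five.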
The interviewer trap: If you say "we need five nines," a good interviewer will ask "what does that cost and why do you need it?" 99.999% for a blog is engineering overkill. 99.999% for a cardiac monitor is non-negotiable.
How You Actually Build High Availability
High availability (HA) isn't a setting you turn on. It's an architectural property you design in.
Strategy 1: Replication
The core insight: if there's only one of something, it's a single point of failure (SPOF).
Replication means running multiple identical copies of a component. If one dies, others take over.
Without replication:
[App] → [Database] ← Database dies = full outage
With replication:
[App] → [Primary DB]
↓ replicates
[Replica DB] ← Primary dies = replica takes over
This applies to databases, app servers, message queues, load balancers — everything critical.
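To see why replication pays off, it helps to put numbers on it. A rough sketch, assuming each copy fails independently of the others: the system is down only when every copy is down at the same time. Real failures are often correlated (a shared network, a bad deploy), so treat the result as an upper bound rather than a prediction.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant components, each with availability `single`.

    Assumes independent failures: the system fails only if all copies fail.
    """
    if copies < 1:
        raise ValueError("need at least one copy")
    return 1 - (1 - single) ** copies


for n in (1, 2, 3):
    print(f"{n} copies at 99% each -> {combined_availability(0.99, n):.6f}")
```

Under that assumption, two copies at 99% each give roughly 99.99%, which is why removing a SPOF is usually the cheapest nine you can buy.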
Strategy 2: Multi-AZ Deployment
A single data center is itself a SPOF. Power failure, network issue, flood, fire — any of these can take down your entire system.
Availability Zones (AZs) are physically separate locations within the same region, each made up of one or more data centers, connected by low-latency links.
AWS Region: us-east-1
├── AZ: us-east-1a [App servers + DB Primary]
├── AZ: us-east-1b [App servers + DB Replica]
└── AZ: us-east-1c [App servers + DB Replica]
If an entire AZ goes down, a load balancer with health checks routes traffic to the remaining zones automatically.
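Routing around a failed zone can be sketched in a few lines. This is a toy model, not how any cloud load balancer is implemented: the `ZoneRouter` class and its methods are hypothetical, and health is set by hand rather than by real health checks.

```python
from itertools import cycle


class ZoneRouter:
    """Round-robin over zones, skipping any zone marked unhealthy."""

    def __init__(self, zones):
        self.healthy = {zone: True for zone in zones}
        self._order = cycle(zones)
        self._count = len(zones)

    def mark(self, zone, healthy):
        # In a real system this would be driven by periodic health checks.
        self.healthy[zone] = healthy

    def next_zone(self):
        # Look at each zone at most once per call.
        for _ in range(self._count):
            zone = next(self._order)
            if self.healthy[zone]:
                return zone
        raise RuntimeError("no healthy zone available")


router = ZoneRouter(["us-east-1a", "us-east-1b", "us-east-1c"])
router.mark("us-east-1a", False)  # simulate losing an entire AZ
print([router.next_zone() for _ in range(4)])
```

With `us-east-1a` marked down, every request lands on `us-east-1b` or `us-east-1c`, and callers never need to know a zone was lost.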
AWS RDS Multi-AZ: Your database runs in a primary AZ, with a synchronous standby replica in another AZ. Failover happens automatically, typically within one to two minutes. You don't change your connection string, because the database endpoint is repointed to the standby.
Strategy 3: Active-Passive vs Active-Active
Active-Passive: One server handles all traffic (active). Another server stands by (passive), ready to take over if the active one fails.
[Load Balancer]
|
[Server A — Active] ←── all traffic
[Server B — Passive] ←── health checks only, standby
Simpler to set up
Passive server is wasted capacity during normal operation
Failover takes a few seconds (some requests may fail during switchover)
Active-Active: All servers handle traffic simultaneously. If one fails, the others absorb its load.
[Load Balancer]
/ \
[Server A] [Server B] ←── both handle real traffic
No wasted capacity
Near-seamless failover (only requests in flight on the failed server are affected)
More complex (sessions, state, consistency across servers)
This is what most large-scale systems use
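One practical consequence of "the others absorb its load": active-active only fails over cleanly if the survivors have headroom. A back-of-the-envelope sketch, assuming load is spread evenly across servers and using made-up numbers (the function name is ours):

```python
def utilization_after_failure(servers: int, utilization: float, failed: int = 1) -> float:
    """Per-server utilization once `failed` servers drop out of an evenly loaded pool."""
    survivors = servers - failed
    if survivors < 1:
        raise ValueError("no servers left to absorb the load")
    return utilization * servers / survivors


for pool_size in (2, 4):
    after = utilization_after_failure(pool_size, 0.60)
    status = "overloaded" if after > 1.0 else "ok"
    print(f"{pool_size} servers at 60% -> {after:.0%} each after one failure ({status})")
```

Two servers at 60% each cannot survive losing one (the survivor would need 120%), while four servers at 60% land at 80% each. Active-active removes idle standby capacity, but it does not remove the need for spare capacity.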
SLI, SLO, SLA — The Reliability Language of Production
These three acronyms describe how reliability is measured, targeted, and promised. Every senior engineer must be fluent in them.
SLI — Service Level Indicator
A measurement of your system's behavior. A specific metric.
Examples:
Request success rate: successful_requests / total_requests
Latency: % of requests completing in < 200ms
Availability: uptime / total_time
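The first two SLIs can be computed straight from request logs. A minimal sketch, assuming each log entry carries an HTTP status and a latency, and assuming that any status below 500 counts as a success (that cutoff is our choice here; your service may define success differently):

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int         # HTTP status code
    latency_ms: float   # time to complete the request


def success_rate(requests) -> float:
    """successful_requests / total_requests, treating status < 500 as success."""
    if not requests:
        raise ValueError("no requests to measure")
    return sum(1 for r in requests if r.status < 500) / len(requests)


def fast_fraction(requests, threshold_ms: float = 200) -> float:
    """Fraction of requests completing in under `threshold_ms`."""
    if not requests:
        raise ValueError("no requests to measure")
    return sum(1 for r in requests if r.latency_ms < threshold_ms) / len(requests)


sample = [Request(200, 120), Request(200, 180), Request(500, 950), Request(200, 240)]
print(f"success rate: {success_rate(sample):.2%}")
print(f"under 200ms:  {fast_fraction(sample):.2%}")
```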
SLO — Service Level Objective
Your internal target for an SLI. What you're aiming for.
Examples:
"99.9% of requests will succeed"
"p99 latency will be under 300ms"
"Availability will be 99.95%"
SLOs are engineering promises to yourselves. They drive architecture decisions.
SLA — Service Level Agreement
A contractual promise to customers, usually with financial penalties for breach.
Examples:
AWS S3 SLA: 99.9% monthly uptime. If breached → service credits.
Your SLA is typically weaker than your SLO — you always have a buffer.
Reality: [SLI] what you measure
Target: [SLO] what you aim for ← internal, drives engineering
Promise: [SLA] what you guarantee ← external, has consequences
Rule of thumb: SLO = 99.9%, SLA = 99.5%. The gap is your safety margin. (The error budget is measured against the SLO, not the SLA.)
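The buffer between SLO and SLA gives three possible states for any measured SLI. A small sketch of that classification, with illustrative thresholds and a hypothetical `classify` helper:

```python
def classify(measured_pct: float, slo_pct: float, sla_pct: float) -> str:
    """Place a measured SLI relative to the internal target and the external promise."""
    if sla_pct > slo_pct:
        raise ValueError("the SLA should not be stricter than the SLO")
    if measured_pct >= slo_pct:
        return "meeting SLO"
    if measured_pct >= sla_pct:
        return "SLO missed, SLA intact"
    return "SLA breached"


for measured in (99.95, 99.7, 99.2):
    print(measured, "->", classify(measured, slo_pct=99.9, sla_pct=99.5))
```

The middle state is the point of the buffer: engineering gets an internal alarm before customers are owed anything.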
Error Budgets: The Most Powerful Reliability Concept
Here's an insight that changes how teams think about reliability:
If your SLO is 99.9% availability, you have a 0.1% error budget.
In a 30-day month, that's 43.2 minutes of allowed downtime.
You can spend that budget however you choose:
A planned deployment that takes the system down for 10 minutes
An incident that causes 15 minutes of partial outage
An experiment in production that causes 5 minutes of errors
When the budget is exhausted — no more risky deployments until next month. The team focuses entirely on reliability.
When the budget is healthy — the team can move fast, deploy frequently, run experiments.
This is Google SRE's core insight: Reliability isn't about perfection. It's about managing your unreliability budget deliberately. A product that is 100% reliable is a product that is being developed too conservatively.
The error budget creates alignment between product (wants to ship fast) and SRE (wants stability). They're both managing the same budget — for different reasons.
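The budget arithmetic is simple enough to sketch. This assumes a 30-day window (43,200 minutes) and reuses the three example expenses from this section; the names and the "freeze when exhausted" rule are illustrative, not a standard API.

```python
def error_budget_minutes(slo_pct: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over the window."""
    return (1 - slo_pct / 100) * window_days * 24 * 60


budget = error_budget_minutes(99.9)
spent = 10 + 15 + 5          # planned deploy + incident + experiment, in minutes
remaining = budget - spent
freeze_risky_deploys = remaining <= 0

print(f"budget:    {budget:.1f} min")
print(f"spent:     {spent} min")
print(f"remaining: {remaining:.1f} min")
print(f"freeze risky deploys: {freeze_risky_deploys}")
```

With 30 of the 43.2 minutes spent, the team still has about 13 minutes left, so shipping continues. One more 15-minute incident would flip the flag.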
Chaos Engineering: Breaking Things on Purpose
In 2010, Netflix introduced the world to a radical idea: deliberately break your own production system to find weaknesses before real outages do.
They built Chaos Monkey — a tool that randomly terminates EC2 instances in production during business hours.
The philosophy: If you know your system will fail (and it will), you should be the one causing failures, not random chance.
What chaos engineering teaches:
You discover failures in controlled conditions (daytime, engineers on-call) rather than at 3am.
Teams build systems that expect failure, not systems that assume everything works.
You verify your redundancy actually works — not just that it's theoretically in place.
Netflix's Simian Army goes further:
Chaos Monkey — kills random instances
Chaos Gorilla — takes down an entire availability zone
Chaos Kong — simulates an entire region failure
Latency Monkey — introduces artificial delays to test timeout handling
Google's SRE teams run regular "disaster recovery tests" that simulate full region failures, data corruption scenarios, and cascading dependency failures. These tests have found critical bugs before real incidents did.
For your system designs: Always ask "what happens when X dies?" for every component. If you can't answer it, your design isn't complete.
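The core loop of a chaos experiment is small: pick a victim, remove it, and check that the system still meets its minimum. The sketch below is a simulation over an in-memory set of instance names, not Netflix's actual tool; `run_chaos_experiment` and the instance names are hypothetical.

```python
import random


def run_chaos_experiment(instances, min_healthy, rng):
    """Terminate one random instance and report whether the fleet still meets its minimum.

    Returns (victim, survived), where `survived` is True if enough instances remain.
    """
    victim = rng.choice(sorted(instances))  # sorted for a reproducible order
    survivors = instances - {victim}
    return victim, len(survivors) >= min_healthy


rng = random.Random(42)  # seeded so the experiment can be replayed
fleet = {"web-1", "web-2", "web-3"}
victim, survived = run_chaos_experiment(fleet, min_healthy=2, rng=rng)
print(f"terminated {victim}; service still healthy: {survived}")
```

If `survived` ever comes back `False`, you have found a missing replica in a controlled test instead of during an outage.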
Cascading Failures: The Silent Killer
A single server failing is manageable. A cascading failure — where one failure triggers others — can bring down an entire platform.
Classic cascade pattern:
Service A is slow → Service B waits → B's thread pool fills up
→ B starts rejecting requests → Service C (calling B) starts failing
→ C's users see errors → C retries aggressively → makes A even slower
→ entire system collapses
Variations of this pattern have hit Amazon, Twitter, and most other operators of large distributed systems.
How to prevent cascading failures:
Circuit breakers — stop calling a failing service, return cached/default response
Timeouts — never wait forever; fail fast
Bulkheads — isolate failures with separate thread pools per dependency
Rate limiting — prevent retry storms from amplifying failures
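Of these four, the circuit breaker is the easiest to show in code. Below is a minimal, single-threaded sketch rather than a production implementation: the class, its parameters, and the thresholds are all illustrative, and the `clock` argument exists only so the behavior can be tested without sleeping.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency and serve a fallback until it may have recovered."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()  # open: fail fast, do not touch the dependency
            # Otherwise half-open: the timeout elapsed, so allow one trial call.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def flaky():
    raise ConnectionError("service A is down")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=30.0)
for _ in range(3):
    print(breaker.call(flaky, fallback=lambda: "cached response"))
```

After the second failure the breaker opens, so the third call returns the cached response without calling `flaky` at all. That is what breaks the retry loop in the cascade described above.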
We'll go deep on all of these on Day 6 (Fault Tolerance Patterns).
Interview Scenario: "What's the Difference Between HA and Fault Tolerance?"
This is a subtle but important distinction:
High Availability (HA): Minimizes downtime. The system may briefly fail but recovers quickly. Acceptable: a few seconds of unavailability during failover.
Fault Tolerance: The system continues operating without any interruption even when components fail. Zero downtime. Much more complex and expensive to achieve.
Analogy:
A car with a spare tire = High Availability (you stop, change tire, continue)
An airplane with redundant engines = Fault Tolerant (engine fails, you don't notice)
For system design interviews:
Most web applications need HA, not full fault tolerance
Financial transaction systems, medical devices, and aviation need fault tolerance
The cost difference is significant — always justify your choice
Key Takeaways
Availability is measured in "nines" — each nine is harder and more expensive to achieve.
SLI measures reality. SLO is your target. SLA is your promise. The gaps between them are intentional.
Error budgets align speed of development with reliability. Spend them wisely.
HA strategies (replication, multi-AZ, active-active) eliminate single points of failure.
Chaos engineering isn't recklessness — it's controlled failure to build confidence.
Every system will fail. The question is whether you designed for it.