When people talk about a “Service Level Agreement” (SLA), they rarely mean a single, clear metric. The term is often vague and open to interpretation. You’ll hear statements like “Our service guarantees 99.9% uptime”, but what does that really mean in practice? After firefighting many incidents, I have some thoughts on it.
Let's put together an overview of the most common types of SLAs, with some practical examples and what they mean for engineers.
What is an SLA?
An SLA is a promise from a service provider to its customers about quality and reliability.
The most common metric and the one that usually takes the spotlight is availability, often expressed as a percentage of uptime.
Here’s a reference table showing the allowed downtime for various levels of availability, assuming 24/7 operation:
Availability % | Downtime per year | Downtime per month | Downtime per day |
---|---|---|---|
90% ("one nine") | 36.53 days | 73.05 hours | 2.40 hours |
99% ("two nines") | 3.65 days | 7.31 hours | 14.40 minutes |
99.9% ("three nines") | 8.77 hours | 43.83 minutes | 1.44 minutes |
99.99% ("four nines") | 52.60 minutes | 4.38 minutes | 8.64 seconds |
99.999% ("five nines") | 5.26 minutes | 26.30 seconds | 864 milliseconds |
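The figures in the table follow from simple arithmetic. Here is a minimal sketch that reproduces them, assuming an average year of 365.25 days and a month of one twelfth of that; the function name `allowed_downtime_seconds` and the `PERIODS` mapping are illustrative, not part of any standard tool.

```python
# Average period lengths in seconds, assuming 24/7 operation
# and a 365.25-day year (assumption; matches the table above).
SECONDS_PER_DAY = 24 * 60 * 60
PERIODS = {
    "year": 365.25 * SECONDS_PER_DAY,
    "month": 365.25 / 12 * SECONDS_PER_DAY,
    "day": SECONDS_PER_DAY,
}


def allowed_downtime_seconds(availability_pct: float, period: str) -> float:
    """Return the downtime budget, in seconds, for one period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100")
    return PERIODS[period] * (1 - availability_pct / 100)


for pct in (90, 99, 99.9, 99.99, 99.999):
    budgets = {p: round(allowed_downtime_seconds(pct, p), 2) for p in PERIODS}
    print(f"{pct}% -> {budgets}")
```

Dividing the results by 60 or 3600 gives the minutes and hours shown in the table.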
Alternative SLA Measurements
Uptime is the easiest number to quote, but it doesn’t account for when and how (badly) a system fails. Services tend to fail under peak load, and five minutes of downtime during that window is usually far more damaging than an hour of scheduled maintenance at midnight. Even worse, a service might appear to be working (responding with `200 OK` on `/health`) while performing terribly for users.
Common peak-load failure patterns include:
- Increased latency - depending on the business domain, doing things slowly can be worse than not doing them at all.
- Timeouts - if a system has a chain of dependencies, one slow component can snowball into increased load on all downstream services.
- Partial outages - only certain users or actions fail, which escapes detection but breaks clients' workflows.
Most of these failures wouldn't necessarily even count against the typical availability metric, but from a single user's perspective, they can have the same impact as downtime.
There are other ways to measure service quality that can sometimes give a more accurate picture of the clients' experience:
- Yield: percentage of successful transactions, which reflects performance more accurately for systems with variable usage. Related interesting read: Readings on Yield.
- Latency thresholds: percentage of requests served within an acceptable latency range (e.g., 95% under 200 ms).
- Suboptimal result rate: how often the system serves degraded or approximate responses to save resources. Useful for performance/accuracy trade-offs.
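The three measurements above can be computed from per-request records. The following is a rough sketch, not a reference implementation: the `Request` record, its field names, and the 200 ms default are assumptions made for illustration, and a real system would pull these values from its own logs or metrics pipeline.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical per-request record; field names are illustrative.
    ok: bool                # succeeded from the user's point of view
    latency_ms: float       # time taken to serve the request
    degraded: bool = False  # a fallback or approximate result was served


def percentage(part: int, total: int) -> float:
    # Returns 0.0 for an empty window; pick whatever convention suits you.
    return 100.0 * part / total if total else 0.0


def quality_metrics(requests: list[Request], latency_limit_ms: float = 200.0) -> dict[str, float]:
    total = len(requests)
    return {
        # Yield: share of requests that succeeded.
        "yield_pct": percentage(sum(r.ok for r in requests), total),
        # Latency threshold: share of all requests served under the limit.
        "within_latency_pct": percentage(
            sum(r.latency_ms < latency_limit_ms for r in requests), total
        ),
        # Suboptimal result rate: share of degraded responses.
        "suboptimal_pct": percentage(sum(r.degraded for r in requests), total),
    }


sample = [
    Request(ok=True, latency_ms=120),
    Request(ok=True, latency_ms=180, degraded=True),
    Request(ok=True, latency_ms=450),
    Request(ok=False, latency_ms=30),
]
metrics = quality_metrics(sample)
print(metrics)
```

Note that in this sample the service never went "down", yet a quarter of the requests failed and another was too slow, which an uptime percentage alone would not show.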
Impact of a single monthly incident on 4-nines availability
“Four nines” (99.99%) availability is often a target for high-reliability systems.
This gives a downtime budget of just ~52 minutes per year, or roughly 4.4 minutes per month (call it five, to be generous).
Consider a scenario with a single incident per month:
- Incident occurs
- Monitoring detects and raises an alert, ~30s
- Engineer receives and answers the page, ~3 minutes
- Engineer logs in and begins investigating, ~1 minute
At this point, about 4.5 minutes have passed and you’ve already burned through your monthly downtime budget.
Even if you round the budget up to five minutes, good luck diagnosing, fixing, and recovering in the 30 seconds that remain.
To make 4-nines uptime even remotely achievable, you need early detection, self-healing mechanisms, and automated recovery long before humans get involved.
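The incident timeline above can be checked against the budget with a few lines of arithmetic. This is a small sketch under the same assumptions as the reference table (a month of 365.25 / 12 days); the step durations are the rough estimates from the scenario, not measured values.

```python
# Monthly downtime budget for 99.99% availability, in seconds.
# Assumes an average month of 365.25 / 12 days, as in the reference table.
MONTHLY_BUDGET_S = 365.25 / 12 * 24 * 3600 * (1 - 0.9999)

# Rough estimates taken from the scenario above.
timeline_s = {
    "detect and alert": 30,
    "answer the page": 3 * 60,
    "log in and start investigating": 60,
}

elapsed = 0
for step, duration in timeline_s.items():
    elapsed += duration
    remaining = MONTHLY_BUDGET_S - elapsed
    print(f"{step:<32} elapsed={elapsed:>4}s  remaining={remaining:7.1f}s")
```

With the exact figure of about 263 seconds, the budget is already slightly overspent before any diagnosis starts; only the generous five-minute rounding leaves half a minute to spare.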
Conclusions
- Traditional uptime metrics are useful, but they don't tell the whole story. True reliability means being transparent, measuring the right things, and understanding the real impact of failures on users.
- SLAs are a foundational part of service reliability, and they require careful thought and design. Fixing problems in the design phase, well before they have a chance to surface, takes time, but it protects your reputation.
- High availability isn’t achieved through heroics! It’s engineered through prevention, visibility, and automation.
"An SLA is not just a number. It's a commitment to quality and trust."