Thoughts on SLA

#sre #webdev #agile

When people talk about a “Service Level Agreement” (SLA), they rarely mean a single, clear metric. The term is often vague and open to interpretation. You’ll hear statements like “Our service guarantees 99.9% uptime”, but what does that really mean in practice? After firefighting many incidents, I have some thoughts on it.

Let's go through the most common types of SLAs, with some practical examples and what they mean for engineers.

What is an SLA?

An SLA is a promise from a service provider to its customers about quality and reliability.
The most common metric, and the one that usually takes the spotlight, is availability, often expressed as a percentage of uptime.

Here’s a reference table showing the allowed downtime for various levels of availability, assuming 24/7 operation:

Availability %	Downtime per year	Downtime per month	Downtime per day
90% ("one nine")	36.53 days	73.05 hours	2.40 hours
99% ("two nines")	3.65 days	7.31 hours	14.40 minutes
99.9% ("three nines")	8.77 hours	43.83 minutes	1.44 minutes
99.99% ("four nines")	52.60 minutes	4.38 minutes	8.64 seconds
99.999% ("five nines")	5.26 minutes	26.30 seconds	864 milliseconds

Wikipedia - High Availability
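
The figures in the table come from simple arithmetic, which is easy to check. Below is a minimal sketch, assuming an average year of 365.25 days and a month defined as one twelfth of a year (these assumptions reproduce the table values). The function name `downtime_budget_seconds` is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365.25  # assumption: average year length, matches the table


def downtime_budget_seconds(availability_pct: float) -> dict[str, float]:
    """Return the allowed downtime in seconds per year, month, and day."""
    unavailable = 1 - availability_pct / 100
    per_day = unavailable * SECONDS_PER_DAY
    per_year = per_day * DAYS_PER_YEAR
    # Assumption: a month is one twelfth of a year.
    return {"year": per_year, "month": per_year / 12, "day": per_day}


if __name__ == "__main__":
    for pct in (90, 99, 99.9, 99.99, 99.999):
        budget = downtime_budget_seconds(pct)
        print(
            f"{pct}%: {budget['year'] / 60:.2f} min/year, "
            f"{budget['month'] / 60:.2f} min/month, "
            f"{budget['day']:.2f} s/day"
        )
```

Working in seconds and converting only when printing keeps the units consistent across all three columns.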

Alternative SLA Measurements

Uptime is the easiest number to quote, but it doesn’t account for when and how (badly) a system fails. Most services fail under peak load, and five minutes of downtime during that window is usually far more damaging than an hour of scheduled maintenance at midnight. Even worse, a service might appear to be working (responding with 200 OK on /health) while performing terribly for users.

Common peak-load failure patterns include:

Increased latency - depending on the business domain, doing things slowly can be worse than not doing them at all.
Timeouts - if a system has a chain of dependencies, one slow component can snowball into increased load on all downstream services.
Partial outages - only certain users or actions fail, escaping detection but breaking clients' workflows.

Most of these failures wouldn't necessarily even count against the typical availability metric, but from a single user's perspective they can have the same impact as downtime.

There are other ways to measure service quality that can sometimes give a more accurate picture of the clients' experience:

Yield: percentage of successful transactions out of all attempted. Because it is weighted by traffic rather than by time, it reflects performance more accurately for systems with variable usage. Related interesting read: Readings on Yield.
Latency thresholds: percentage of requests served within an acceptable latency range (e.g., 95% under 200 ms).
Suboptimal result rate: how often the system serves degraded or approximate responses to save resources. Useful for performance/accuracy trade-offs.
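
The three measurements above can all be derived from the same request log. Here is a minimal sketch, assuming each request record carries a status code, a latency, and a flag for degraded responses. The `Request` type, the function names, the "below 500 means success" rule, and the 200 ms threshold are assumptions for illustration, not a standard definition.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int             # HTTP status code returned to the client
    latency_ms: float       # time taken to serve the request
    degraded: bool = False  # True if a fallback or approximate result was served


def _pct(count: int, total: int) -> float:
    if total == 0:
        raise ValueError("no requests in the measurement window")
    return 100 * count / total


def is_success(r: Request) -> bool:
    # Assumption: any status below 500 counts as successfully served.
    return r.status < 500


def yield_pct(requests: list[Request]) -> float:
    """Percentage of requests that were served successfully."""
    return _pct(sum(is_success(r) for r in requests), len(requests))


def within_latency_pct(requests: list[Request], threshold_ms: float = 200) -> float:
    """Percentage of requests served successfully within the latency threshold."""
    fast = sum(is_success(r) and r.latency_ms <= threshold_ms for r in requests)
    return _pct(fast, len(requests))


def suboptimal_pct(requests: list[Request]) -> float:
    """Percentage of requests answered with a degraded or approximate result."""
    return _pct(sum(r.degraded for r in requests), len(requests))


if __name__ == "__main__":
    window = [
        Request(200, 120),
        Request(200, 180),
        Request(200, 950),                # served, but far too slowly
        Request(503, 30),                 # failed fast
        Request(200, 90, degraded=True),  # served from a fallback
    ]
    print(f"yield:         {yield_pct(window):.1f}%")
    print(f"under 200 ms:  {within_latency_pct(window):.1f}%")
    print(f"suboptimal:    {suboptimal_pct(window):.1f}%")
```

Note that a service producing this window could still report 100% uptime, since it answered every request, while one in five requests failed and another was unusably slow.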

Impact of a single monthly incident on 4-nines availability

“Four nines” (99.99%) availability is often a target for high-reliability systems.
This gives a downtime budget of just ~52 minutes per year, or between four and five minutes per month (4.38 minutes, per the table above).

Consider a scenario with a single incident per month:

Incident occurs
Monitoring detects and raises an alert, ~30s
Engineer receives and answers the page, ~3 minutes
Engineer logs in and begins investigating, ~1 minute

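At this point, about four and a half minutes have passed, and you’ve already burned through your monthly downtime budget.
Against a rounded five-minute budget, good luck diagnosing, fixing, and recovering in the 30 seconds that remain; against the exact 4.38 minutes, you are already over.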
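
The scenario above can be replayed as a budget burn-down. This is a minimal sketch, assuming the unrounded monthly budget for 99.99% (about 263 seconds, based on a 365.25-day year divided by 12) and the step durations listed above. The names `monthly_budget_s`, `burn_down`, and `INCIDENT` are made up for illustration.

```python
SECONDS_PER_MONTH = 365.25 * 24 * 3600 / 12  # assumption: average month length


def monthly_budget_s(availability_pct: float) -> float:
    """Allowed downtime per month, in seconds, for a given availability target."""
    return (1 - availability_pct / 100) * SECONDS_PER_MONTH


def burn_down(steps, budget_s):
    """Return (step, elapsed_s, remaining_s) after each step of the response."""
    rows, elapsed = [], 0.0
    for name, duration_s in steps:
        elapsed += duration_s
        rows.append((name, elapsed, budget_s - elapsed))
    return rows


# Step durations taken from the scenario in the text.
INCIDENT = [
    ("monitoring detects and raises an alert", 30),
    ("engineer receives and answers the page", 180),
    ("engineer logs in and begins investigating", 60),
]

if __name__ == "__main__":
    budget = monthly_budget_s(99.99)
    print(f"monthly budget: {budget:.1f} s")
    for name, elapsed, remaining in burn_down(INCIDENT, budget):
        print(f"{name:45s} elapsed {elapsed:4.0f} s, remaining {remaining:6.1f} s")
```

With the exact budget, the remaining time goes slightly negative at the last step, before any diagnosis has started.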

To make 4-nines uptime even remotely achievable, you need early detection, self-healing mechanisms, and automated recovery long before humans get involved.

Conclusions

Traditional uptime metrics are useful, but they don't tell the whole story. True reliability means being transparent, measuring the right things, and understanding the real impact of failures on users.
SLAs are a foundational part of service reliability, and they require careful thought and design. Fixing problems in the design phase, well before they have a chance to surface, takes time, but it protects your reputation.
High availability isn’t achieved through heroics! It’s engineered through prevention, visibility, and automation.