Kenta Takeuchi

Posted on Mar 15 • Originally published at bmf-tech.com

SLI, SLO, and SLA Explained: A Practical Guide for Engineers

#sli #sla #slo


About SLI, SLO, and SLA

This post summarizes what I have learned about SLIs, SLOs, and SLAs.

What are SLO, SLI, and SLA?

SLI, SLO, and SLA are, respectively, the indicators, objectives, and agreements related to service levels. A service level is a measure of the service provided over a given period, expressed in a defined way.

SLI (Service Level Indicator)
- A metric used to measure the service level
- e.g. availability, latency, error rate, throughput
SLO (Service Level Objective)
- A target value, quantitative or qualitative, set for a service level
- Take external dependencies into account
  - e.g. communication with external services, the SLOs of managed services
SLA (Service Level Agreement)
- An agreement or guarantee about service levels between the provider and its users
- SLA targets should be looser than the corresponding SLOs, so that the internal objective is breached before the agreement is
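
As a minimal sketch of how these three terms relate, the snippet below computes an availability SLI from request counts and compares it with an SLO and a looser SLA target. The function names, the request counts, and the target values are all hypothetical and chosen only for illustration.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Return the percentage of requests that were served successfully."""
    if total_requests == 0:
        # Assumption: a window with no traffic counts as fully available.
        return 100.0
    return 100.0 * good_requests / total_requests


def meets_target(sli_percent: float, target_percent: float) -> bool:
    """Return True when the measured SLI is at or above the target."""
    return sli_percent >= target_percent


# Hypothetical measurements and targets for one measurement window.
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
slo_target = 99.9   # internal objective
sla_target = 99.5   # looser value agreed with users

print(f"SLI: {sli:.3f}%")
print("SLO met:", meets_target(sli, slo_target))
print("SLA met:", meets_target(sli, sla_target))
```

Because the SLA target is looser than the SLO, a drop in the SLI trips the SLO check first, which leaves room to react before the agreement itself is at risk.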

How to Set SLI and SLO

The best practices proposed by New Relic are easy to apply and effective.

newrelic.com - Best Practices for Setting SLOs and SLIs for Modern Complex Systems

The article introduces a method for formulating SLIs and SLOs that includes defining the system boundaries, defining the functions provided at each boundary, defining what availability means for each function, and defining SLIs that measure that availability.

When you first start operating SLIs and SLOs, it is recommended to begin with simple, loose targets.
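
One way to picture the boundary, function, availability, and SLI hierarchy is as nested records. The sketch below is only an illustration of that structure; the class names, the "web frontend" boundary, and the "login" function are hypothetical and do not come from the New Relic article.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sli:
    name: str
    description: str


@dataclass
class Function:
    name: str
    availability_definition: str          # what "available" means for this function
    slis: List[Sli] = field(default_factory=list)


@dataclass
class Boundary:
    name: str
    functions: List[Function] = field(default_factory=list)


# Hypothetical example: one boundary exposing one function with one SLI.
web_frontend = Boundary(
    name="web frontend",
    functions=[
        Function(
            name="login",
            availability_definition="login requests return a non-5xx response",
            slis=[Sli("login_success_ratio", "non-5xx responses / all login requests")],
        )
    ],
)

for function in web_frontend.functions:
    for sli in function.slis:
        print(f"{web_frontend.name} / {function.name}: {sli.name}")
```

Writing the hierarchy down like this makes the granularity visible: each extra function adds at least one more SLI to operate.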

cf. sre.google - Chapter 4 - Service Level Objectives

When I formulated SLIs and SLOs at work, I followed this New Relic practice but adjusted the functional units so that they did not become too fine-grained.

If the functional units are too detailed from the start, they become difficult to operate, so I think it is better to adjust the granularity as needed during operation.

Tips

Notes on terms related to SLIs and SLOs.

Difference Between Reliability and Availability

Reliability
- The degree of fault tolerance inherent in a system
Availability
- The degree to which a system can continue to operate

List of Uptime and Downtime, Availability Calculation

Availability	Annual Downtime	Monthly Downtime
99.0%	87.6 hours	7.3 hours
99.5%	43.8 hours	3.65 hours
99.9%	8.76 hours	43.8 minutes
99.95%	4.38 hours	21.9 minutes
99.99%	52.56 minutes	4.38 minutes
99.999%	5.256 minutes	26.28 seconds
99.9999%	31.536 seconds	2.628 seconds

The downtime figures assume a 365-day year (8,760 hours) and a month of one twelfth of a year (730 hours).
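
The allowed downtime for an availability target is the length of the period multiplied by (100% minus the target). The sketch below shows that calculation, assuming a 365-day year and a month taken as one twelfth of a year; the function and constant names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24               # 8760 hours, assuming a non-leap year
HOURS_PER_MONTH = HOURS_PER_YEAR / 12   # 730 hours, an averaged month


def allowed_downtime_hours(availability_percent: float, period_hours: float) -> float:
    """Return the downtime, in hours, permitted by an availability target."""
    return period_hours * (100.0 - availability_percent) / 100.0


for target in (99.0, 99.9, 99.99, 99.999):
    yearly = allowed_downtime_hours(target, HOURS_PER_YEAR)
    monthly = allowed_downtime_hours(target, HOURS_PER_MONTH)
    # Printed in minutes so that small and large targets are easy to compare.
    print(f"{target}%: {yearly * 60:.2f} min/year, {monthly * 60:.2f} min/month")
```

Each additional nine divides the allowed downtime by ten, which is why the units shift from hours to minutes to seconds as the target rises.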

What is an Error Budget?

An error budget is the amount of unreliability a service is permitted to have, calculated as 100% minus the SLO.
ex. SLO 99.99% → Error Budget 0.01%
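
As a rough sketch, an error budget can be turned into a count of failures that are still permitted in a measurement window. The request-based formulation, the function names, and the numbers below are assumptions made for illustration; a time-based budget would work the same way with minutes instead of requests.

```python
def error_budget_fraction(slo_percent: float) -> float:
    """Return the error budget as a fraction, i.e. (100% - SLO) / 100."""
    return (100.0 - slo_percent) / 100.0


def remaining_budget(slo_percent: float, total_requests: int, failed_requests: int) -> float:
    """Return how many more failed requests the window can absorb.

    A negative result means the budget is already exhausted.
    """
    allowed_failures = total_requests * error_budget_fraction(slo_percent)
    return allowed_failures - failed_requests


# Hypothetical window: 1,000,000 requests, 400 failures, SLO of 99.9%.
left = remaining_budget(slo_percent=99.9, total_requests=1_000_000, failed_requests=400)
print(f"Remaining error budget: about {left:.0f} failed requests")
```

Tracking the remaining budget rather than the raw error rate gives a direct signal for deciding whether a risky release can still go ahead in the current window.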

Impressions

Making service levels measurable lets you observe whether the consumers of a service (people or other systems) are being served satisfactorily. It also gives service providers an indicator for deciding whether the service level needs to be improved.
