This article was originally published on bmf-tech.com.
About SLI, SLO, and SLA
This post summarizes various findings about SLI, SLO, and SLA.
What are SLO, SLI, and SLA?
SLI, SLO, and SLA are, respectively, indicators, objectives, and agreements related to service levels. A service level is a measure of the quality of service delivered over a given period, expressed as a specific metric.
- SLI (Service Level Indicator)
  - A metric used to measure the service level
  - e.g. availability, latency, error rate, throughput
- SLO (Service Level Objective)
  - A quantitative or qualitative target value for the service level
  - Take external dependencies into account
    - e.g. communication with external services, the SLOs of managed services
- SLA (Service Level Agreement)
  - An agreement or guarantee about the service level between the provider and its users
  - Its targets should be looser than the SLOs, so that the internal objective is breached before the contractual one
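The relationship between the three terms can be sketched in a few lines of Python. This is only an illustration: the function names, the 99.9% / 99.5% targets, and the event counts are assumptions, and the dependency calculation assumes every dependency is required and fails independently.

```python
from math import prod


def availability_sli(good_events: int, total_events: int) -> float:
    """Return an availability SLI as the percentage of good events."""
    if total_events == 0:
        return 100.0  # assumption: no traffic counts as fully available
    return 100.0 * good_events / total_events


def dependency_ceiling(availabilities_percent: list) -> float:
    """Upper bound on availability when all dependencies are required in series."""
    return 100.0 * prod(a / 100.0 for a in availabilities_percent)


SLO_PERCENT = 99.9  # internal objective (hypothetical value)
SLA_PERCENT = 99.5  # looser, customer-facing commitment (hypothetical value)

sli = availability_sli(good_events=99_950, total_events=100_000)
print(f"SLI {sli:.2f}% meets SLO: {sli >= SLO_PERCENT}, meets SLA: {sli >= SLA_PERCENT}")

# Two required dependencies at 99.9% each cap the service at roughly 99.8%,
# so a 99.9% SLO would not be attainable in that setup.
print(f"dependency ceiling: {dependency_ceiling([99.9, 99.9]):.3f}%")
```

Keeping the SLA below the SLO leaves a margin: the team sees the SLO miss first and can react before the contractual commitment is at risk.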
How to Set SLI and SLO
The best practices proposed by New Relic are easy to implement and effective.
newrelic.com - Best Practices for Setting SLOs and SLIs for Modern Complex Systems
It introduces a method for formulating SLIs and SLOs in four steps: define the system boundaries, define the functions exposed at each boundary, define what availability means for each function, and define the SLIs that measure that availability.
When you first start operating SLIs and SLOs, it is recommended to begin with simple, loose targets.
cf. sre.google - Chapter 4 - Service Level Objectives
When I formulated SLIs and SLOs at work, I followed this New Relic practice but kept the functional units coarse so that they did not become too detailed.
If the functional units are too fine-grained from the start, they become difficult to operate, so I think it is better to adjust the granularity as needed during operation.
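One way to record the outcome of those steps (boundary, function, availability definition, SLI) is a small data structure. The sketch below is an assumption about how this could look, not something taken from the New Relic article; every class name, field, and value is hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Capability:
    """One function exposed at a system boundary (hypothetical model)."""
    name: str
    availability_definition: str  # what "available" means for this function
    sli: str                      # how that availability is measured
    slo_percent: float            # deliberately loose starting target


@dataclass
class Boundary:
    """A system boundary, such as a public API."""
    name: str
    capabilities: list[Capability] = field(default_factory=list)


# Illustrative values only. A single coarse capability keeps the
# granularity manageable at the start.
checkout_api = Boundary(
    name="checkout API",
    capabilities=[
        Capability(
            name="place order",
            availability_definition="requests answered without a 5xx within 1 s",
            sli="good requests / total requests",
            slo_percent=99.0,
        ),
    ],
)


def summarize(boundary: Boundary) -> list[str]:
    """Return one human-readable line per capability."""
    return [
        f"{boundary.name} / {c.name}: {c.sli} >= {c.slo_percent}%"
        for c in boundary.capabilities
    ]


print("\n".join(summarize(checkout_api)))
```

Splitting a capability into finer ones later only means adding entries to the list, which fits the idea of adjusting granularity during operation.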
Tips
Notes on terms related to SLIs and SLOs.
Difference Between Reliability and Availability
- Reliability
  - How resistant the system inherently is to failure
- Availability
  - The degree to which the system can keep operating
List of Uptime and Downtime, Availability Calculation
| Availability | Annual Downtime | Monthly Downtime |
|---|---|---|
| 99.0% | 87.6 hours | 7.3 hours |
| 99.5% | 43.8 hours | 3.65 hours |
| 99.9% | 8.76 hours | 43.8 minutes |
| 99.95% | 4.38 hours | 21.9 minutes |
| 99.99% | 52.56 minutes | 4.38 minutes |
| 99.999% | 5.256 minutes | 26.28 seconds |
| 99.9999% | 31.536 seconds | 2.628 seconds |
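Downtime figures of this kind follow from `downtime = period × (1 − availability)`. A minimal sketch of that calculation, assuming a 365-day year and an average month of one twelfth of a year (the function and constant names are illustrative):

```python
HOURS_PER_YEAR = 365 * 24              # 8760 hours, ignoring leap years
HOURS_PER_MONTH = HOURS_PER_YEAR / 12  # 730 hours, an average month


def allowed_downtime_hours(availability_percent: float, period_hours: float) -> float:
    """Return the downtime (in hours) permitted by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return period_hours * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99):
    yearly = allowed_downtime_hours(target, HOURS_PER_YEAR)
    monthly = allowed_downtime_hours(target, HOURS_PER_MONTH)
    print(f"{target}%: {yearly * 60:.2f} min/year, {monthly * 60:.2f} min/month")
```

Each additional nine divides the allowed downtime by ten, which is why moving from 99.9% to 99.99% takes the yearly allowance from hours to minutes.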
What is an Error Budget?
An error budget is the amount of unreliability that the SLO permits over its measurement period, calculated as 100% minus the SLO.
e.g. SLO 99.99% → error budget of 0.01%
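As a rough sketch of how an error budget might be tracked for a request-based SLO, the snippet below computes the budget and how much of it is left. The function names and the request counts are assumptions made for illustration.

```python
def error_budget_percent(slo_percent: float) -> float:
    """Return the error budget as a percentage: 100% minus the SLO."""
    return 100.0 - slo_percent


def remaining_budget_ratio(slo_percent: float, total_events: int, bad_events: int) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means nothing has been spent, 0.0 means the budget is exhausted,
    and a negative value means the SLO has been violated.
    """
    allowed_bad = total_events * error_budget_percent(slo_percent) / 100.0
    if allowed_bad <= 0:
        raise ValueError("no error budget: SLO is 100% or there were no events")
    return 1.0 - bad_events / allowed_bad


# Hypothetical month: 1,000,000 requests against a 99.99% SLO allows
# about 100 failed requests; 25 failures leave about 75% of the budget.
ratio = remaining_budget_ratio(slo_percent=99.99, total_events=1_000_000, bad_events=25)
print(f"remaining error budget: {ratio:.0%}")
```

A team could use the remaining ratio as a signal, for example slowing down risky releases as it approaches zero.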
Impressions
Making service levels measurable lets you observe whether the consumers of a service (people or other systems) are actually being served satisfactorily. It also gives service providers an indicator for deciding whether the service level needs to be improved.
References
- newrelic.com - What are SLOs, SLIs, SLAs?
- newrelic.com - New Relic Hands-on: Basics of SLI/SLO Design
- cloud.google.com - Thoughts on SLO, SLI, SLA: Lessons Learned by CRE
- cloud.google.com - SRE Fundamentals (2021 Edition): Comparison of SLI, SLA, SLO
- cloud.google.com - SLOs, SLIs, SLAs, oh my - CRE life lessons
- cloud.google.com - How to Face Availability: Lessons Learned by CRE
- engineering.mercari.com - 2018/12/25 SLI/SLO in Mercari's Web Microservices
- sre.google - sre-book
- qiita.com - Thoughts on Formulating SLI/SLO
- qiita.com - Learning About SRE - Error Budget Edition
- bongineer.net - Difference Between Reliability and Availability
- mathwords.net - How Much Downtime for 99%, 99.9% Availability
- wnkhs.net - Availability Calculation and Assumptions (with Representative Numbers)