SRE Fundamentals: Defining SLOs, SLIs, and Error Budgets That Actually Work

#sre #devops #monitoring

Introduction

Site Reliability Engineering (SRE) has transformed how organizations think about system reliability. Central to this framework are Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets.

This guide will walk you through defining SLOs, SLIs, and Error Budgets that actually drive meaningful improvements.

Understanding the Reliability Hierarchy

SLAs (Service Level Agreements): External contracts with customers specifying consequences for failures.

SLOs (Service Level Objectives): Internal targets your team commits to, set stricter than the SLA so that you have a margin before a contractual breach.

SLIs (Service Level Indicators): The actual measurements determining whether you're meeting SLOs.

Error Budgets: How much unreliability you can tolerate while meeting your SLO.
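To make the relationship between these terms concrete, here is a minimal Python sketch that classifies a measured SLI against an internal SLO and an external SLA. The function name `reliability_status` and the example targets are illustrative assumptions, not part of any standard library or tool.

```python
def reliability_status(sli: float, slo: float, sla: float) -> str:
    """Classify a measured SLI against an internal SLO and an external SLA.

    All three values are fractions, e.g. 0.999 for 99.9%.
    """
    if slo < sla:
        raise ValueError("the SLO should be at least as strict as the SLA")
    if sli >= slo:
        return "meeting SLO"
    if sli >= sla:
        return "SLO missed, SLA intact"
    return "SLA breached"


# Hypothetical targets: 99.9% promised to customers, 99.95% aimed for internally.
print(reliability_status(sli=0.9993, slo=0.9995, sla=0.999))
```

The middle state is the reason the SLO sits above the SLA: the team gets a warning before customers are owed anything.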

Defining Meaningful SLIs

The Four Golden Signals

Latency: How long requests take
Traffic: Request volume
Errors: Rate of failed requests
Saturation: How "full" your service is

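In PromQL, each SLI can be written as a ratio of good events to total events:

# Availability SLI: share of requests answered with a 2xx status over the last 5 minutes.
# This treats 3xx and 4xx responses as failures; loosen the matcher if they should not count.
sum(rate(http_requests_total{status=~"2.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Latency SLI: share of requests served within 200 ms.
# Requires a histogram bucket with the boundary le="0.2".
sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))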
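Both queries share the same shape: good events divided by total events. The Python sketch below shows that ratio on raw counts, which can help when sanity-checking a dashboard by hand. The helper `ratio_sli` and the request counts are made up for illustration, and treating a window with no traffic as 100% is an assumption you may want to change.

```python
def ratio_sli(good_events: int, total_events: int) -> float:
    """Return the fraction of events that were good."""
    if total_events == 0:
        return 1.0  # assumption: no traffic means nothing failed
    return good_events / total_events


# Hypothetical counts for one measurement window.
total_requests = 10_000
successful_requests = 9_990  # requests counted as successful
fast_requests = 9_500        # requests that finished within 200 ms

availability = ratio_sli(successful_requests, total_requests)
latency = ratio_sli(fast_requests, total_requests)

print(f"availability SLI: {availability:.2%}")
print(f"latency SLI: {latency:.2%}")
```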

Setting Realistic SLOs

Each additional "nine" cuts your error budget by a factor of ten. For a 30-day month:

Availability	Monthly Downtime (30 days)
99%	7.2 hours
99.9%	43.2 minutes
99.99%	4.32 minutes
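The downtime allowed by an availability target is the window length multiplied by (1 - target). A small Python sketch of that arithmetic, assuming a 30-day window by default; the function name `allowed_downtime` is hypothetical.

```python
from datetime import timedelta


def allowed_downtime(slo: float, window: timedelta = timedelta(days=30)) -> timedelta:
    """Return the downtime an availability target permits over a window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return window * (1.0 - slo)


for target in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime(target).total_seconds() / 60
    print(f"{target:.2%}: {minutes:.2f} minutes")
```

Passing a different `window`, for example `timedelta(days=7)`, gives the budget for a weekly objective instead.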

Error Budgets: The Key to Balance

Error Budget = 1 - SLO

For a 99.9% SLO over 30 days:
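Error Budget = 0.1% of 43,200 minutes = 43.2 minutes of downtime

While budget remains, the team can spend it on releases and experiments; once it is used up, reliability work takes priority.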
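The same budget can also be tracked in requests rather than minutes. The sketch below is one possible way to compute how much budget is left from request counts; the function `error_budget_remaining` and the traffic numbers are illustrative assumptions.

```python
def error_budget_remaining(slo: float, good_events: int, total_events: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if total_events == 0:
        return 1.0  # nothing served, nothing spent
    allowed_failures = (1.0 - slo) * total_events
    actual_failures = total_events - good_events
    return 1.0 - actual_failures / allowed_failures


# Hypothetical month so far: 1,000,000 requests, 400 of them failed.
remaining = error_budget_remaining(slo=0.999, good_events=999_600, total_events=1_000_000)
print(f"error budget remaining: {remaining:.0%}")
```

A negative result means the SLO has already been missed for the window, which is a reasonable trigger for pausing risky changes.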

Conclusion

SLOs, SLIs, and Error Budgets aren't just metrics; they're a framework for making better decisions about reliability. The goal is appropriate reliability: enough to keep users happy while maintaining development velocity.