DEV Community

Devang Goyal

Posted on • Originally published at clouddevang.github.io

SLOs, SLIs, and Error Budgets: A Practical Guide for SREs

Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets form the foundation of Site Reliability Engineering. Yet many teams struggle to implement them effectively. This guide shares practical lessons from implementing SLO-based reliability practices in production financial systems.

Understanding the SRE Reliability Stack

Before diving into implementation, let's clarify the hierarchy:

  • SLI (Service Level Indicator): A quantitative measure of service behavior (e.g., "99.2% of requests completed in under 200ms")
  • SLO (Service Level Objective): The target value for an SLI (e.g., "99.9% of requests should complete in under 200ms")
  • SLA (Service Level Agreement): A contract with consequences for missing SLOs (e.g., "If we miss 99.9%, customers get credits")
  • Error Budget: The allowed failure rate (e.g., "0.1% of requests can fail per month")
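The relationship between these four pieces can be shown with a few lines of arithmetic. This is a minimal sketch with made-up request counts; the variable names are illustrative, not from any particular tool:

```python
# Hypothetical numbers showing how the pieces relate: the SLI is measured,
# compared against the SLO target, and the gap to the target is the error budget.
total_requests = 1_000_000
successful_requests = 999_200

sli = successful_requests / total_requests        # measured indicator: 99.92%
slo = 0.999                                       # target: 99.9%
error_budget = 1 - slo                            # allowed failure rate: 0.1%

failed_fraction = 1 - sli
budget_consumed = failed_fraction / error_budget  # fraction of the budget used

print(f"SLI: {sli:.4%}, budget consumed: {budget_consumed:.0%}")
```

With these numbers the service is meeting its SLO, but has already consumed 80% of the month's budget.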

Choosing the Right SLIs

The most common mistake teams make is tracking too many SLIs. Start with these four golden signals:

1. Availability

availability = successful_requests / total_requests

For an API, this might be: "Percentage of HTTP requests returning 2xx or expected 4xx status codes."
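A sketch of that definition in code, assuming you have a list of response status codes; which 4xx codes count as "expected" is a per-service choice, and the set below is only an example:

```python
# Expected client errors that should NOT count against availability
# (e.g. 404 on a lookup API, 429 from intentional rate limiting).
EXPECTED_4XX = {404, 429}

def availability(status_codes):
    """Fraction of requests returning 2xx/3xx or an expected 4xx."""
    good = sum(1 for s in status_codes if s < 400 or s in EXPECTED_4XX)
    return good / len(status_codes)

codes = [200] * 96 + [404, 429, 500, 503]
print(f"{availability(codes):.1%}")  # 98 good out of 100 -> 98.0%
```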

2. Latency

latency_sli = requests_under_threshold / total_requests

Track at multiple percentiles: p50 for typical experience, p99 for tail latency. For financial systems, we use p99.9.
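The latency SLI and the percentiles can be computed from raw duration samples. This sketch uses a simple nearest-rank percentile and synthetic exponential latencies; in production you would read these from your metrics backend's histograms instead:

```python
import random

def latency_sli(durations_ms, threshold_ms):
    """Fraction of requests completing under the threshold."""
    return sum(1 for d in durations_ms if d < threshold_ms) / len(durations_ms)

def percentile(durations_ms, p):
    """Simple nearest-rank percentile approximation."""
    ordered = sorted(durations_ms)
    k = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[k]

random.seed(1)
samples = [random.expovariate(1 / 50) for _ in range(10_000)]  # synthetic latencies, mean ~50ms
print(f"SLI(<200ms): {latency_sli(samples, 200):.2%}, "
      f"p50: {percentile(samples, 50):.0f}ms, p99: {percentile(samples, 99):.0f}ms")
```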

3. Throughput

throughput = successful_requests_per_second

Critical for batch processing systems and data pipelines.

4. Error Rate

error_rate = failed_requests / total_requests

Distinguish between client errors (4xx) and server errors (5xx)—only count 5xx against your error budget.
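That distinction is easy to get wrong in ad-hoc queries, so it helps to encode it once. A minimal sketch, counting only 5xx responses against the budget:

```python
def error_rate(status_codes):
    """Server-error rate: only 5xx responses count against the error budget."""
    server_errors = sum(1 for s in status_codes if 500 <= s < 600)
    return server_errors / len(status_codes)

codes = [200] * 995 + [404, 429, 500, 502, 503]
print(f"server error rate: {error_rate(codes):.2%}")  # 3 of 1000 -> 0.30%
```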

Setting Realistic SLOs

Here's a framework I use for setting SLOs:

Step 1: Measure Current Performance

Don't guess. Run your system for 2-4 weeks and measure actual performance:

-- Example query for availability over 30 days
SELECT
  COUNT(CASE WHEN status_code < 500 THEN 1 END) * 100.0 / COUNT(*) as availability
FROM request_logs
WHERE timestamp > NOW() - INTERVAL '30 days';

Step 2: Understand User Expectations

Interview stakeholders:

  • What latency do users notice?
  • How much downtime is acceptable?
  • What's the business impact of degradation?

Step 3: Set Achievable Targets

If your current availability is 99.5%, don't set an SLO of 99.99%. Start with 99.7% and improve incrementally.

Pro tip: Your SLO should be slightly below your actual performance. This gives you room to experiment and deploy without constant alerts.

Implementing Error Budgets

Error budgets are the game-changer. They answer: "How much unreliability can we tolerate?"

Calculating Error Budget

For a 99.9% availability SLO over 30 days:

Error Budget = (1 - 0.999) × 30 days × 24 hours × 60 minutes
             = 0.001 × 43,200 minutes
             = 43.2 minutes of downtime allowed
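The same calculation as a small helper, useful for comparing candidate SLO targets side by side:

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime (in minutes) for an availability SLO over a window."""
    return (1 - slo) * window_days * 24 * 60

for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%} SLO -> {error_budget_minutes(slo):.1f} min / 30 days")
```

Each extra nine cuts the budget tenfold: 99.9% allows 43.2 minutes a month, 99.99% only 4.3.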

Error Budget Policy

Here's the policy we implemented:

Budget Remaining | Action
> 50%            | Normal development velocity
25-50%           | Increased review rigor, limit risky changes
10-25%           | Feature freeze, focus on reliability
< 10%            | All hands on reliability, no new features
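Encoding the policy table makes it queryable by tooling, for example a deploy gate that checks the remaining budget before a release. A sketch with the thresholds from the table above; the return strings are illustrative:

```python
def budget_policy(budget_remaining):
    """Map remaining error budget (fraction, 0.0-1.0) to the policy action."""
    if budget_remaining > 0.50:
        return "normal velocity"
    if budget_remaining > 0.25:
        return "increased review rigor"
    if budget_remaining > 0.10:
        return "feature freeze"
    return "all hands on reliability"

print(budget_policy(0.60), "|", budget_policy(0.05))
```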

Burn Rate Alerts

Instead of alerting on instantaneous errors, alert on burn rate—how fast you're consuming your error budget:

# Prometheus alert for fast burn rate
- alert: HighErrorBudgetBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      / sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)  # 14.4x burn rate = 30-day budget exhausted in ~2 days
  for: 5m
  labels:
    severity: critical
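The 14.4 multiplier is easy to sanity-check: a burn rate of B consumes a 30-day budget in 30 / B days, so the alert threshold is B times the per-request budget (1 - SLO). A quick check, assuming the 30-day window and 99.9% SLO used above:

```python
def days_to_exhaustion(burn_rate, window_days=30):
    """Days until the error budget is gone at a constant burn rate."""
    return window_days / burn_rate

slo = 0.999
threshold = 14.4 * (1 - slo)  # the value compared against the error ratio in the alert
print(f"threshold {threshold:.4f}, budget gone in {days_to_exhaustion(14.4):.1f} days")
```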

Real-World Implementation: A Case Study

At BitFlyer, we implemented SLOs for our trading API:

Initial State

  • No formal SLOs
  • Alerts on arbitrary thresholds
  • Constant alert fatigue
  • No clear prioritization

Implementation Steps

Week 1-2: Instrumentation
We added OpenTelemetry instrumentation to capture:

  • Request duration histograms
  • Status code counters
  • Dependency latencies

Week 3-4: Baseline Measurement
Measured actual performance:

  • Availability: 99.89%
  • P99 latency: 180ms
  • Error rate: 0.08%

Week 5-6: SLO Definition
Set initial SLOs:

  • Availability SLO: 99.9% (gives 43 min/month budget)
  • Latency SLO: 99% of requests < 200ms
  • Error rate SLO: < 0.1% server errors

Week 7-8: Alerting Migration
Replaced 47 arbitrary alerts with 6 SLO-based alerts:

  • 2 availability burn rate alerts (fast/slow)
  • 2 latency burn rate alerts (fast/slow)
  • 2 error rate burn rate alerts (fast/slow)

Results After 3 Months

  • Alert volume reduced by 73%
  • MTTR improved by 45%
  • Engineering velocity increased (fewer interruptions)
  • Clear prioritization framework for incidents

Common Pitfalls to Avoid

1. SLO Perfection Syndrome

Don't aim for 100% availability. It's:

  • Practically unachievable (networks, hardware, and dependencies all fail)
  • Prohibitively expensive
  • An innovation killer (no budget left for change)

For most systems, moving from 99.9% to 99.99% costs roughly an order of magnitude more in engineering effort and infrastructure.

2. Too Many SLOs

Start with 3-5 SLOs per service. More creates confusion and alert fatigue.

3. Ignoring Dependencies

Your service's SLO is bounded by your dependencies' SLOs. If your database has 99.9% availability, you cannot achieve 99.99% for your API.
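This bound follows from multiplying availabilities: a service calling independent dependencies in series is at best as available as the product of their availabilities. A quick illustration:

```python
def composite_availability(*availabilities):
    """Best-case availability of a serial chain of independent dependencies."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

# An API that is itself 99.99% reliable, but sits on a 99.9% database:
print(f"{composite_availability(0.9999, 0.999):.4%}")  # below 99.9%
```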

4. Set and Forget

Review SLOs quarterly:

  • Are they still relevant?
  • Are they too tight (constant alerts) or too loose (not protecting users)?
  • Has the business context changed?

Tooling Recommendations

For implementing SLOs, consider:

  1. Metrics Collection: Prometheus, Datadog, or Azure Monitor
  2. SLO Tracking: Sloth, Google SLO Generator, or Datadog SLO
  3. Error Budget Visualization: Grafana dashboards, custom Datadog dashboards
  4. Alerting: PagerDuty, Opsgenie integrated with burn rate alerts

Conclusion

SLOs, SLIs, and error budgets aren't just metrics—they're a cultural shift toward data-driven reliability decisions. Start simple:

  1. Instrument your critical paths
  2. Measure for 2-4 weeks
  3. Set conservative SLOs
  4. Implement burn rate alerting
  5. Create an error budget policy
  6. Review and iterate quarterly

The goal isn't perfect reliability—it's appropriate reliability that balances user happiness with engineering velocity.


Have questions about implementing SLOs? Connect with me on LinkedIn or reach out via the contact form.
