Reliability engineering used to be the exclusive domain of Site Reliability Engineers and infrastructure teams. But as backend developers take on more ownership of the services they build, from deployment to on-call, understanding Service Level Objectives (SLOs), Service Level Agreements (SLAs), and error budgets has become an essential skill, not an optional one.
These concepts are not bureaucratic paperwork. They are the engineering framework that answers some of the most important questions your team faces: How reliable does this service actually need to be? How do we know when we're spending too much engineering effort on reliability versus new features? And when something breaks, how do we decide whether to drop everything and fix it or continue shipping?
This guide breaks down SLOs, SLAs, and error budgets from first principles, and shows you how to define, implement, and operationalize them in a real backend system.
The Reliability Vocabulary: SLI, SLO, and SLA
These three terms are closely related but play distinct roles. Understanding the difference is the starting point for everything else.
Service Level Indicator (SLI)
An SLI is a quantitative measurement of a specific aspect of your service's behavior. It is the raw signal, the metric you actually measure.
Common SLIs for backend services:
| SLI Type | Example Measurement |
|----------|-------------------|
| Availability | Percentage of HTTP requests returning non-5xx responses |
| Latency | Percentage of requests completing in under 300ms |
| Error Rate | Percentage of requests resulting in application errors |
| Throughput | Number of successful requests processed per second |
| Freshness | Percentage of data reads returning results updated within 60 seconds |
SLIs must be measurable, meaningful to the user, and directly observable from your infrastructure. "The service feels fast" is not an SLI. "The percentage of requests that complete in under 200ms" is.
Service Level Objective (SLO)
An SLO is a target for your SLI, the threshold your service commits to meeting over a defined time window.
SLO = SLI + Target + Time Window
Example: 99.5% of HTTP requests return a non-5xx response measured over a rolling 28-day window
SLOs are internal commitments. They exist within your engineering organization to guide prioritization and define what "good enough" looks like. They are not customer-facing contracts.
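To make the "SLI + Target + Time Window" structure concrete, here is a minimal sketch that models an SLO as a small data type. The class name `SLO`, its fields, and the choice to treat a window with no traffic as compliant are illustrative assumptions, not part of any standard library or tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """Hypothetical model of an SLO: an SLI, a target, and a time window."""
    sli_name: str     # which indicator is being measured
    target: float     # required fraction of good events, e.g. 0.995
    window_days: int  # length of the rolling window in days

    def is_met(self, good_events: int, total_events: int) -> bool:
        """Return True when the measured SLI meets or exceeds the target."""
        if total_events == 0:
            # Assumption: a window with no traffic counts as compliant.
            return True
        return good_events / total_events >= self.target


# The example from the text: 99.5% non-5xx responses over a rolling 28 days.
availability_slo = SLO(sli_name="http_availability", target=0.995, window_days=28)

print(availability_slo.is_met(good_events=9_960, total_events=10_000))
print(availability_slo.is_met(good_events=9_900, total_events=10_000))
```

Keeping the target and window together with the indicator name makes it harder to quote a percentage without saying what it is measured over.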
Service Level Agreement (SLA)
An SLA is a formal, contractual commitment made to external customers or stakeholders, typically with financial consequences for violation (refunds, service credits, contract termination).
SLA = SLO with commercial consequences
Example: We guarantee 99.9% uptime per calendar month. If we fall below 99.9%, customers receive a 25% service credit.
The critical rule: your SLA must always be weaker (a lower target) than your SLO. If your SLO is 99.5% availability and your SLA promises 99.9%, you have no buffer: you will violate your customer contract well before you breach your internal target. Set your SLA conservatively below your SLO to create an operational safety margin.
Choosing the Right SLIs
The most common mistake in SLO design is measuring the wrong things. SLIs should reflect user experience, not system internals.
User-facing SLIs (what users actually experience):
- Request success rate (non-error responses)
- Request latency at the p95 or p99 percentile
- Data freshness for read-heavy services
- End-to-end transaction completion rate
Avoid these as primary SLIs:
- CPU utilization
- Memory usage
- Disk I/O
These are useful for capacity planning and root cause analysis, but they don't directly measure user experience. High CPU doesn't always mean users are suffering. A high error rate always does.
For most backend HTTP services, start with two SLIs:
SLI 1 (Availability): Proportion of successful requests = (requests with status < 500) / (total requests)
SLI 2 (Latency): Proportion of requests meeting the latency target = (requests completing in < 300ms) / (total requests)
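The two ratios above can be sketched in a few lines of Python. This is only an illustration that assumes you have per-request records available in memory; the `Request` type and both function names are hypothetical, and in practice these ratios would normally come from your metrics backend.

```python
from typing import Iterable, NamedTuple


class Request(NamedTuple):
    status: int         # HTTP status code returned to the client
    duration_ms: float  # time taken to serve the request


def availability_sli(requests: Iterable[Request]) -> float:
    """Proportion of requests with a status below 500."""
    reqs = list(requests)
    if not reqs:
        return 1.0  # assumption: no traffic means nothing failed
    return sum(1 for r in reqs if r.status < 500) / len(reqs)


def latency_sli(requests: Iterable[Request], threshold_ms: float = 300.0) -> float:
    """Proportion of all requests that completed under the latency threshold."""
    reqs = list(requests)
    if not reqs:
        return 1.0  # assumption: no traffic means nothing was slow
    return sum(1 for r in reqs if r.duration_ms < threshold_ms) / len(reqs)


sample = [
    Request(200, 120.0),
    Request(200, 450.0),
    Request(200, 310.0),
    Request(404, 80.0),   # a client error still counts as "available"
    Request(503, 30.0),   # a fast failure still counts as "fast"
]

print(f"availability: {availability_sli(sample):.2f}")
print(f"latency:      {latency_sli(sample):.2f}")
```

Note that, as defined here, a fast 5xx response counts toward the latency SLI. Some teams prefer to measure latency over successful requests only; that is a design choice to make explicitly.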
Setting Realistic SLO Targets
Setting targets too high creates unnecessary toil. Setting them too low fails your users. The right target is derived from your historical data, not aspirational thinking.
The Four-Step SLO Target Process
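Step 1 - Measure your current performance. Query your metrics backend for your chosen SLI over the last 90 days. This is your baseline. The query below computes availability for a single 28-day window; change the range to 90d to cover the full baseline period.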
# Current 28-day availability SLI
(
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))
) * 100
Step 2 - Identify your natural performance floor. Look at your worst performing week in the last 90 days. Your SLO target should be below your average performance but above your worst week's performance.
Step 3 - Apply the reliability tax. Subtract a buffer for planned maintenance, deployments, and anticipated incidents. A service running at 99.7% average availability might set an SLO of 99.5% to account for this variance.
Step 4 - Validate against user expectations. Does the target actually reflect an acceptable user experience? A 99.0% availability SLO allows roughly 7 hours of downtime per 30-day month. That may be acceptable for an internal admin tool but not for a payment API.
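When validating a target, it helps to translate the percentage into time. The sketch below assumes all unavailability is concentrated into full outages and uses a 30-day month by default; the function name `allowed_downtime_hours` is illustrative.

```python
def allowed_downtime_hours(slo_target: float, window_days: int = 30) -> float:
    """Hours of total outage an availability target permits in the window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be a fraction in (0, 1]")
    return (1.0 - slo_target) * window_days * 24


# Compare a few candidate targets side by side.
for target in (0.99, 0.995, 0.999):
    hours = allowed_downtime_hours(target)
    print(f"{target:.1%} -> {hours:.2f} hours per 30 days")
```

Seeing the number in hours often makes the conversation with product owners easier than arguing about decimal places.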
Error Budgets: Turning Reliability Into a Product Decision
An error budget is the complement of your SLO: it defines how much unreliability your service is allowed to have within a given window.
Error Budget = 1 - SLO Target
99.5% availability SLO over 28 days:
Error Budget = 0.5% of requests
= 0.005 × total requests in 28 days
The error budget is what transforms SLOs from a compliance metric into an engineering tool. It answers the question: how much reliability risk can we afford to take right now?
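As a worked example of the formula above, the following sketch converts an SLO target into an allowance of failed requests and reports how much of it has been used. The traffic figures are invented for illustration, and the function names are hypothetical.

```python
def error_budget_requests(slo_target: float, total_requests: int) -> float:
    """Number of requests that may fail in the window without breaching the SLO."""
    return (1.0 - slo_target) * total_requests


def budget_consumed(failed: int, total: int, slo_target: float) -> float:
    """Fraction of the error budget used so far (1.0 means fully spent)."""
    if slo_target >= 1.0:
        raise ValueError("a 100% target leaves no error budget")
    if total == 0:
        return 0.0  # assumption: no traffic consumes no budget
    return (failed / total) / (1.0 - slo_target)


# Illustrative numbers: 10 million requests in 28 days, 15,000 of them failed.
total, failed, target = 10_000_000, 15_000, 0.995

print(f"allowed failures: {error_budget_requests(target, total):,.0f}")
print(f"budget consumed:  {budget_consumed(failed, total, target):.0%}")
```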
How Error Budgets Drive Engineering Decisions
When you have plenty of error budget remaining, you can afford to:
- Ship risky features and new deployments
- Run load tests and chaos experiments in production
- Prioritize feature velocity over reliability work
When your error budget is nearly exhausted, the calculus changes:
- Slow down or halt risky deployments
- Prioritize reliability and bug fixes over new features
- Conduct a postmortem to understand what consumed the budget
- Implement additional safeguards before resuming normal velocity
This is the core value of error budgets: they make reliability a shared engineering responsibility, not just an SRE concern, because they directly constrain the team's ability to ship features.
Implementing SLO Tracking in Grafana and Prometheus
Step 1 - Define Recording Rules
Recording rules pre-compute SLI metrics at rule evaluation time rather than at query time, making SLO dashboards fast and burn rate calculations efficient:
# prometheus/rules/slo-rules.yaml
groups:
  - name: slo_recording_rules
    rules:
      # Availability SLI: 5-minute window
      - record: slo:http_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status!~"5..",service="orders-service"}[5m]))
          /
          sum(rate(http_requests_total{service="orders-service"}[5m]))
      # Availability SLI: 1-hour window
      - record: slo:http_availability:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{status!~"5..",service="orders-service"}[1h]))
          /
          sum(rate(http_requests_total{service="orders-service"}[1h]))
      # Latency SLI: proportion of requests under 300ms
      - record: slo:http_latency_300ms:ratio_rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.3",service="orders-service"}[5m]))
          /
          sum(rate(http_request_duration_seconds_count{service="orders-service"}[5m]))
Step 2 - Calculate Error Budget Consumption
Track consumed error budget as a fraction of the total allowed:
# Consumed fraction of the availability error budget (28-day window)
# SLO target: 99.5% → error budget: 0.5%
(
sum(rate(http_requests_total{status=~"5..",service="orders-service"}[28d]))
/
sum(rate(http_requests_total{service="orders-service"}[28d]))
)
/
0.005 # error budget = 1 - 0.995
A value of 0.3 means 30% of the error budget has been consumed. A value above 1.0 means the budget is exhausted and the SLO is being violated.
Step 3 - Burn Rate Alerts
Burn rate alerting is more sophisticated than simple threshold alerting. It measures how fast you're consuming your error budget, so you can catch both sudden spikes and slow burns before the budget is exhausted.
# Fast burn rate alert: consuming budget 14.4x faster than sustainable
# If sustained, will exhaust the 28-day budget in about 47 hours
# The recording rule aggregates away all labels, so no service selector is used here
(
1 - slo:http_availability:ratio_rate1h
)
/
(1 - 0.995) # error budget for a 99.5% SLO
> 14.4
The burn rate multiplier of 14.4 means the service is failing at 14.4 times the sustainable rate for a 99.5% SLO. At this rate, the entire 28-day error budget would be consumed in roughly 47 hours (672 hours / 14.4), warranting an immediate page.
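The arithmetic behind burn rate is simple enough to check by hand. The sketch below assumes a 28-day SLO window and a constant error ratio; the three function names are illustrative, not part of Prometheus or any SLO tool.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return error_ratio / (1.0 - slo_target)


def hours_to_exhaustion(rate: float, window_days: int = 28) -> float:
    """Hours until the whole budget is gone if the burn rate is sustained."""
    return window_days * 24 / rate


def budget_fraction_consumed(rate: float, alert_window_hours: float,
                             window_days: int = 28) -> float:
    """Share of the total budget spent during one alert window at this rate."""
    return rate * alert_window_hours / (window_days * 24)


# A 7.2% error ratio against a 99.5% SLO burns budget 14.4x too fast.
rate = burn_rate(error_ratio=0.072, slo_target=0.995)

print(f"burn rate:            {rate:.1f}x")
print(f"hours to exhaustion:  {hours_to_exhaustion(rate):.1f}")
print(f"budget spent in 1h:   {budget_fraction_consumed(rate, 1):.1%}")
```

A burn rate of exactly 1.0 spends the budget evenly across the window, which is why thresholds are expressed as multiples of it.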
Pair fast burn rate alerts (short windows, high multipliers) with slow burn rate alerts (long windows, lower multipliers) for complete coverage:
| Alert | Window | Burn Rate | Budget Consumed (28-day window) | Severity |
|---|---|---|---|---|
| Fast burn | 1h | > 14.4x | ~2% in 1h | Critical |
| Fast burn | 6h | > 6x | ~5% in 6h | High |
| Slow burn | 3d | > 3x | ~32% in 3d | Medium |
| Slow burn | 3d | > 1x | ~11% in 3d | Low |
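One way to reason about how these tiers combine is to evaluate them from most to least severe and report the first that fires. The sketch below hard-codes the window and threshold pairs from the table; the names `ALERT_TIERS` and `highest_severity`, and the idea of passing burn rates in as a dictionary keyed by window, are assumptions made for illustration.

```python
from typing import Dict, Optional

# (window label, burn-rate threshold, severity), ordered most to least severe.
ALERT_TIERS = [
    ("1h", 14.4, "Critical"),
    ("6h", 6.0, "High"),
    ("3d", 3.0, "Medium"),
    ("3d", 1.0, "Low"),
]


def highest_severity(burn_rates: Dict[str, float]) -> Optional[str]:
    """Return the most severe tier whose threshold is exceeded, if any.

    burn_rates maps a window label ("1h", "6h", "3d") to the burn rate
    measured over that window. Missing windows are treated as not firing.
    """
    for window, threshold, severity in ALERT_TIERS:
        if burn_rates.get(window, 0.0) > threshold:
            return severity
    return None


print(highest_severity({"1h": 20.0, "6h": 8.0, "3d": 2.0}))
print(highest_severity({"1h": 1.0, "6h": 7.0, "3d": 2.0}))
print(highest_severity({"1h": 0.5, "6h": 0.5, "3d": 0.5}))
```

In a real deployment each tier would be its own alerting rule, with routing by severity handled by your alert manager rather than application code.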
The Error Budget Policy
The most important artifact in your SLO implementation is the error budget policy, a written document that defines what happens when budget thresholds are crossed. Without it, the error budget is just a number.
A minimal error budget policy covers:
Error Budget Policy - orders-service
SLO: 99.5% availability over a rolling 28-day window
Error Budget: 0.5% (approximately 201.6 minutes, or about 3.4 hours, of equivalent downtime)
When budget consumption reaches 25%:
- Review recent deployments for contributing factors
- Ensure on-call runbooks are up to date
When budget consumption reaches 50%:
- Engineering lead notified
- Reliability review scheduled within 1 week
- No new risky feature deployments without explicit approval
When budget consumption reaches 75%:
- Deployment freeze on non-critical changes
- Engineering team shifts focus to reliability improvements
- Daily sync between engineering lead and on-call
When budget is exhausted (100%):
- Full deployment freeze
- Incident postmortem required before resuming feature work
- SLO target reviewed, may need adjustment
The policy transforms the error budget from a metric into a decision-making framework that the entire team (engineering, product, and leadership) can align around.
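If you want the policy to show up in tooling (a dashboard annotation or a deploy gate, for instance), the thresholds can be encoded directly. This is a minimal sketch: the tier summaries paraphrase the example policy above, and `POLICY_TIERS` and `policy_action` are hypothetical names.

```python
# (consumption threshold, summary of the required response), highest first.
POLICY_TIERS = [
    (1.00, "Full deployment freeze; postmortem required before feature work"),
    (0.75, "Freeze non-critical changes; team focuses on reliability"),
    (0.50, "Notify engineering lead; risky deployments need approval"),
    (0.25, "Review recent deployments; check on-call runbooks"),
]


def policy_action(consumed: float) -> str:
    """Map a consumed-budget fraction (0.3 means 30%) to the policy response."""
    if consumed < 0:
        raise ValueError("consumed fraction cannot be negative")
    for threshold, action in POLICY_TIERS:
        if consumed >= threshold:
            return action
    return "Normal operations"


for consumed in (0.10, 0.30, 0.60, 0.80, 1.20):
    print(f"{consumed:.0%} consumed -> {policy_action(consumed)}")
```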
Common SLO Pitfalls to Avoid
Too many SLOs. Start with one or two SLIs per service. Ten SLOs per service are impossible to operationalize and dilute focus. Add more only when you have proven the process works.
SLOs without ownership. An SLO with no named owner and no budget policy is decoration. Every SLO needs a team or individual responsible for it.
Ignoring dependencies. If your service calls five downstream services, your availability is bounded by theirs. Account for dependency reliability in your SLO targets, or you'll constantly violate your SLO due to factors outside your control.
Chasing 100% reliability. The marginal cost of reliability increases exponentially as you approach 100%. Going from 99.9% to 99.99% availability is dramatically more expensive than going from 99.0% to 99.9%. Always ask: does the user experience actually require this level of reliability?
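Returning to the dependency pitfall, a rough upper bound on your own availability can be estimated by multiplying the availabilities of everything on the request path. The sketch below assumes every dependency is a hard dependency (a request fails if any one of them fails) and that failures are independent; real systems with retries, caching, or graceful degradation will behave differently. The function name is illustrative.

```python
from math import prod
from typing import Sequence


def serial_availability(own: float, dependencies: Sequence[float]) -> float:
    """Best-case availability when every dependency must succeed.

    Assumes independent failures and no fallbacks or retries.
    """
    return own * prod(dependencies)


# Five downstream services at 99.9% each, with a service that never fails itself.
ceiling = serial_availability(own=1.0, dependencies=[0.999] * 5)
print(f"availability ceiling: {ceiling:.4%}")
```

Under these assumptions, five dependencies at 99.9% already leave very little room under a 99.5% target before your own service contributes any failures.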
Conclusion
SLOs, SLAs, and error budgets are not compliance bureaucracy. They are an engineering framework for making deliberate trade-offs between reliability and velocity. When implemented correctly, they give your team a shared, quantitative language for discussing risk, a principled basis for prioritization decisions, and a feedback loop that continuously improves the reliability of the systems you build.
Start simple: pick two SLIs for your most critical service, set targets based on historical data, implement error budget tracking in Grafana, and write a one-page budget policy. Run it for a quarter, observe how it changes your team's conversations, and expand from there.
Reliability is not a feature you ship once. It is an ongoing engineering practice, and error budgets are how you keep score.
Already using Google's SRE Workbook methodology or OpenSLO for cross-platform SLO definitions? Drop your approach in the comments.