ThankGod Chibugwum Obobo

Posted on May 31 • Originally published at actocodes.hashnode.dev

SLOs, SLAs, and Error Budgets: A Backend Developer's Guide to Reliability Engineering

#servicelevelobjectives #sitereliabilityengineering #errorbudget #softwareengineering

Reliability engineering used to be the exclusive domain of Site Reliability Engineers and infrastructure teams. But as backend developers take on more ownership of the services they build, from deployment to on-call, understanding Service Level Objectives (SLOs), Service Level Agreements (SLAs), and error budgets has become an essential skill, not an optional one.

These concepts are not bureaucratic paperwork. They are the engineering framework that answers some of the most important questions your team faces: How reliable does this service actually need to be? How do we know when we're spending too much engineering effort on reliability versus new features? And when something breaks, how do we decide whether to drop everything and fix it or continue shipping?

This guide breaks down SLOs, SLAs, and error budgets from first principles, and shows you how to define, implement, and operationalize them in a real backend system.

The Reliability Vocabulary: SLI, SLO, and SLA

These three terms are closely related but play distinct roles. Understanding the difference is the starting point for everything else.

Service Level Indicator (SLI)

An SLI is a quantitative measurement of a specific aspect of your service's behavior. It is the raw signal, the metric you actually measure.

Common SLIs for backend services:
| SLI Type | Example Measurement |
|----------|-------------------|
| Availability | Percentage of HTTP requests returning non-5xx responses |
| Latency | Percentage of requests completing in under 300ms |
| Error Rate | Percentage of requests resulting in application errors |
| Throughput | Number of successful requests processed per second |
| Freshness | Percentage of data reads returning results updated within 60 seconds |

SLIs must be measurable, meaningful to the user, and directly observable from your infrastructure. "The service feels fast" is not an SLI. "95% of requests complete in under 200ms" is.

Service Level Objective (SLO)

An SLO is a target for your SLI: the threshold your service commits to meeting over a defined time window.

SLO = SLI + Target + Time Window

Example: 99.5% of HTTP requests return a non-5xx response measured over a rolling 28-day window

SLOs are internal commitments. They exist within your engineering organization to guide prioritization and define what "good enough" looks like. They are not customer-facing contracts.
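As an illustration, the three parts of an SLO can be captured in a small data structure. This is a minimal Python sketch: the `SLO` class, its field names, and the `is_met` helper are hypothetical and not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """An SLO as three parts: SLI + target + time window."""
    sli: str          # human-readable description of what is measured
    target: float     # required proportion of good events, e.g. 0.995
    window_days: int  # length of the rolling evaluation window

    def is_met(self, good_events: int, total_events: int) -> bool:
        """True when the measured SLI meets or exceeds the target."""
        if total_events == 0:
            return True  # assumption: no traffic counts as compliant
        return good_events / total_events >= self.target


availability_slo = SLO(
    sli="HTTP requests returning a non-5xx response",
    target=0.995,
    window_days=28,
)
```

Calling `availability_slo.is_met(good_events, total_events)` with the counts for the last 28 days then answers the only question the SLO asks: did the measured proportion stay at or above the target?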

Service Level Agreement (SLA)

An SLA is a formal, contractual commitment made to external customers or stakeholders, typically with financial consequences for violation (refunds, service credits, contract termination).

SLA = SLO with commercial consequences

Example: We guarantee 99.9% uptime per calendar month. If we fall below 99.9%, customers receive a 25% service credit.

The critical rule: your SLA must always be looser than your SLO. If your SLO is 99.5% availability and your SLA promises 99.9%, you have no buffer: you can be meeting your internal target and still be violating your customer contract. Set your SLA conservatively below your SLO (for example, a 99.9% internal SLO backing a 99.5% SLA) to create an operational safety margin, so that breaching the internal target is an early warning rather than a contract violation.

Choosing the Right SLIs

The most common mistake in SLO design is measuring the wrong things. SLIs should reflect user experience, not system internals.

User-facing SLIs (what users actually experience):

Request success rate (non-error responses)
Request latency at the p95 or p99 percentile
Data freshness for read-heavy services
End-to-end transaction completion rate

Avoid these as primary SLIs:

CPU utilization
Memory usage
Disk I/O

These are useful for capacity planning and root cause analysis, but they don't directly measure user experience. High CPU doesn't always mean users are suffering. High error rate always does.

For most backend HTTP services, start with two SLIs:

SLI 1 (Availability): Proportion of successful requests = (requests with status < 500) / (total requests)

SLI 2 (Latency): Proportion of requests meeting the latency target = (requests completing in < 300ms) / (total requests)
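Both ratios are simple to compute from raw request records. The following Python sketch assumes a hypothetical `Request` record with a status code and a duration; in practice these numbers usually come from your metrics backend rather than from in-process lists.

```python
from typing import NamedTuple


class Request(NamedTuple):
    status: int         # HTTP status code
    duration_ms: float  # time taken to complete the request


def availability_sli(requests: list[Request]) -> float:
    """Proportion of requests with status < 500."""
    if not requests:
        raise ValueError("cannot compute an SLI over zero requests")
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float = 300.0) -> float:
    """Proportion of requests completing in under threshold_ms."""
    if not requests:
        raise ValueError("cannot compute an SLI over zero requests")
    fast = sum(1 for r in requests if r.duration_ms < threshold_ms)
    return fast / len(requests)


# Hypothetical sample: one slow request and one server error out of four.
sample = [
    Request(200, 120.0),
    Request(200, 450.0),
    Request(404, 80.0),
    Request(503, 30.0),
]
```

Note that the 404 counts as a success for availability: client errors are not the service failing, which is why the formula uses status < 500.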

Setting Realistic SLO Targets

Setting targets too high creates unnecessary toil. Setting them too low fails your users. The right target is derived from your historical data, not aspirational thinking.

The Four-Step SLO Target Process

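Step 1 - Measure your current performance. Query your metrics backend for your chosen SLI over the last 90 days. This is your baseline. The PromQL query below computes the availability SLI for the most recent 28-day window; evaluate it at earlier points in time to cover the full 90 days.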

# Current 28-day availability SLI
(
  sum(rate(http_requests_total{status!~"5.."}[28d]))
  /
  sum(rate(http_requests_total[28d]))
) * 100

Step 2 - Identify your natural performance floor. Look at your worst performing week in the last 90 days. Your SLO target should be below your average performance but above your worst week's performance.

Step 3 - Apply the reliability tax. Subtract a buffer for planned maintenance, deployments, and anticipated incidents. A service running at 99.7% average availability might set an SLO of 99.5% to account for this variance.

Step 4 - Validate against user expectations. Does the target actually reflect an acceptable user experience? A 99.0% availability SLO allows roughly 7 hours of downtime per month: acceptable for an internal admin tool, unacceptable for a payment API.
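The arithmetic behind Steps 2 and 4 can be sketched in a few lines of Python. The function names and the weekly figures below are hypothetical and for illustration only; substitute the numbers you measured in Step 1.

```python
from statistics import mean


def target_range(weekly_availability: list[float]) -> tuple[float, float]:
    """Return (worst week, average): Step 2 places the target between them."""
    return min(weekly_availability), mean(weekly_availability)


def allowed_downtime_hours(slo_target: float, window_days: int = 30) -> float:
    """Hours of full outage a target tolerates over the window (Step 4)."""
    return (1.0 - slo_target) * window_days * 24


# Hypothetical weekly availability figures, for illustration only.
weekly = [0.9985, 0.9972, 0.9951, 0.9979]
worst, average = target_range(weekly)
print(f"pick a target between {worst:.2%} and {average:.2%}")
print(f"99.0% allows {allowed_downtime_hours(0.99):.1f} h of downtime per 30 days")
```

Running the same downtime calculation for each candidate target makes the Step 4 conversation concrete: the question becomes whether users can tolerate that many hours of outage, not whether a percentage looks impressive.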

Error Budgets: Turning Reliability Into a Product Decision

An error budget is the complement of your SLO: it defines how much unreliability your service is allowed to have within a given window.

Error Budget = 1 - SLO Target

99.5% availability SLO over 28 days:
Error Budget = 0.5% of requests
= 0.005 × total requests in 28 days
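To make the formula concrete, here is a small Python sketch that converts a target and a request count into a budget of failed requests. The function names are hypothetical, and the traffic figure in the usage note is an assumed example.

```python
import math


def error_budget_requests(slo_target: float, total_requests: int) -> int:
    """Failed requests the SLO tolerates in the window: (1 - target) * total."""
    # Round before flooring so floating-point noise cannot lose a request.
    return math.floor(round((1.0 - slo_target) * total_requests, 6))


def budget_remaining_requests(slo_target: float, total: int, failed: int) -> int:
    """Failures still available before the SLO is breached (negative if overspent)."""
    return error_budget_requests(slo_target, total) - failed
```

For a service handling an assumed 10 million requests in 28 days, `error_budget_requests(0.995, 10_000_000)` gives a budget of 50,000 failed requests; after 12,000 failures, 38,000 remain.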

The error budget is what transforms SLOs from a compliance metric into an engineering tool. It answers the question: how much reliability risk can we afford to take right now?

How Error Budgets Drive Engineering Decisions

When you have plenty of error budget remaining, you can afford to:

Ship risky features and new deployments
Run load tests and chaos experiments in production
Prioritize feature velocity over reliability work

When your error budget is nearly exhausted, the calculus changes:

Slow down or halt risky deployments
Prioritize reliability and bug fixes over new features
Conduct a postmortem to understand what consumed the budget
Implement additional safeguards before resuming normal velocity

This is the core value of error budgets: they make reliability a shared engineering responsibility, not just an SRE concern, because they directly constrain the team's ability to ship features.

Implementing SLO Tracking in Grafana and Prometheus

Step 1 - Define Recording Rules

Recording rules pre-compute SLI metrics at rule evaluation time rather than at query time, making SLO dashboards fast and burn rate calculations efficient:

# prometheus/rules/slo-rules.yaml
groups:
  - name: slo_recording_rules
    rules:
      # Availability SLI (5-minute window)
      # "sum by (service)" keeps the service label on the recorded series,
      # so later queries can select on it.
      - record: slo:http_availability:ratio_rate5m
        expr: |
          sum by (service) (rate(http_requests_total{status!~"5..",service="orders-service"}[5m]))
          /
          sum by (service) (rate(http_requests_total{service="orders-service"}[5m]))

      # Availability SLI (1-hour window)
      - record: slo:http_availability:ratio_rate1h
        expr: |
          sum by (service) (rate(http_requests_total{status!~"5..",service="orders-service"}[1h]))
          /
          sum by (service) (rate(http_requests_total{service="orders-service"}[1h]))

      # Latency SLI: proportion of requests under 300ms
      # Requires the histogram to define a bucket boundary at 0.3 seconds.
      - record: slo:http_latency_300ms:ratio_rate5m
        expr: |
          sum by (service) (rate(http_request_duration_seconds_bucket{le="0.3",service="orders-service"}[5m]))
          /
          sum by (service) (rate(http_request_duration_seconds_count{service="orders-service"}[5m]))

Step 2 - Calculate Error Budget Consumption

Track consumed error budget as a fraction of the total allowed:

# Consumed availability error budget (28-day window)
# SLO target: 99.5% → error budget: 0.5%
(
  sum(rate(http_requests_total{status=~"5..",service="orders-service"}[28d]))
  /
  sum(rate(http_requests_total{service="orders-service"}[28d]))
)
/
0.005  # error budget = 1 - 0.995

A value of 0.3 means 30% of the error budget has been consumed. A value above 1.0 means the budget is exhausted and the SLO has been violated for the window.
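The same calculation in plain Python, as a sketch for sanity-checking dashboard numbers by hand. The function name is hypothetical and the request counts in the usage note are made up.

```python
def budget_consumed(failed: int, total: int, slo_target: float) -> float:
    """Fraction of the error budget used: observed error ratio / allowed error ratio."""
    if total == 0:
        return 0.0  # assumption: no traffic consumes no budget
    error_ratio = failed / total
    return error_ratio / (1.0 - slo_target)
```

With 15 failures out of 10,000 requests against a 99.5% target, the error ratio is 0.15% against an allowed 0.5%, so about 30% of the budget is gone. With 60 failures the ratio exceeds the allowance and the result is above 1.0.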

Step 3 - Burn Rate Alerts

Burn rate alerting is more sophisticated than simple threshold alerting: it measures how fast you're consuming your error budget, so you can catch both sudden spikes and slow burns before the budget is exhausted.

# Fast burn rate alert: consuming budget 14.4x faster than sustainable
# If sustained, will exhaust the 28-day budget in about 47 hours
(
  1 - slo:http_availability:ratio_rate1h{service="orders-service"}
)
/
(1 - 0.995)  # SLO target
> 14.4

The burn rate multiplier of 14.4 means the service is failing at 14.4 times the sustainable rate for a 99.5% SLO. At this rate, the entire 28-day error budget would be consumed in approximately 47 hours (672 hours divided by 14.4), warranting an immediate page.
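The relationship between burn rate and time to exhaustion is easy to check by hand. A minimal Python sketch, with hypothetical function names:

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return error_ratio / (1.0 - slo_target)


def hours_to_exhaustion(rate: float, window_days: int = 28) -> float:
    """Hours until the whole window's budget is gone if this rate is sustained."""
    if rate <= 0:
        return float("inf")  # no errors, so the budget is never exhausted
    return window_days * 24 / rate
```

A burn rate of exactly 1.0 spends the budget in exactly one SLO window (672 hours for 28 days). For a 99.5% SLO, a 7.2% error ratio corresponds to a burn rate of 14.4, which leaves under two days.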

Pair fast burn rate alerts (short windows, high multipliers) with slow burn rate alerts (long windows, lower multipliers) for complete coverage:

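| Alert | Window | Burn Rate | Budget Consumed | Severity |
|-------|--------|-----------|-----------------|----------|
| Fast burn | 1h | > 14.4x | ~2% in 1h | Critical |
| Fast burn | 6h | > 6x | ~5% in 6h | High |
| Slow burn | 1d | > 3x | ~10% in 1d | Medium |
| Slow burn | 3d | > 1x | ~10% in 3d (budget at risk) | Low |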
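The "Budget Consumed" column follows from a single formula: burn rate multiplied by the alert window, divided by the SLO window. The Python sketch below uses a hypothetical function name and assumes the 28-day window used throughout this guide.

```python
def budget_consumed_in_window(
    burn_rate: float,
    alert_window_hours: float,
    slo_window_days: int = 28,
) -> float:
    """Fraction of the total error budget spent if burn_rate holds for the alert window."""
    return burn_rate * alert_window_hours / (slo_window_days * 24)


# The two fast-burn rows: 14.4x for 1 hour and 6x for 6 hours.
for rate, hours in [(14.4, 1), (6.0, 6)]:
    share = budget_consumed_in_window(rate, hours)
    print(f"{rate}x for {hours}h spends {share:.1%} of the budget")
```

Use it to check any multiplier and window pair before turning it into an alert: a threshold that only fires after most of the budget is gone is too late to act on.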

The Error Budget Policy

The most important artifact in your SLO implementation is the error budget policy: a written document that defines what happens when budget thresholds are crossed. Without it, the error budget is just a number.

A minimal error budget policy covers:

Error Budget Policy - orders-service

SLO: 99.5% availability over a rolling 28-day window
Error Budget: 0.5% (approximately 201.6 minutes, or about 3.4 hours, of equivalent full downtime per 28 days)

When budget consumption reaches 25%:

Review recent deployments for contributing factors
Ensure on-call runbooks are up to date

When budget consumption reaches 50%:

Engineering lead notified
Reliability review scheduled within 1 week
No new risky feature deployments without explicit approval

When budget consumption reaches 75%:

Deployment freeze on non-critical changes
Engineering team shifts focus to reliability improvements
Daily sync between engineering lead and on-call

When budget is exhausted (100%):

Full deployment freeze
Incident postmortem required before resuming feature work
SLO target reviewed, may need adjustment

The policy transforms the error budget from a metric into a decision-making framework that the entire team (engineering, product, and leadership) can align around.
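A policy like this can also be encoded so that dashboards or deploy tooling report the current stage automatically. The following Python sketch is one possible encoding of the thresholds above; the `POLICY` table and `policy_action` function are hypothetical names, not an existing tool.

```python
# (threshold, action) pairs, highest threshold first, mirroring the policy above.
POLICY = [
    (1.00, "Full deployment freeze; postmortem required before resuming feature work"),
    (0.75, "Deployment freeze on non-critical changes; focus on reliability improvements"),
    (0.50, "Notify engineering lead; schedule reliability review within 1 week"),
    (0.25, "Review recent deployments; ensure on-call runbooks are up to date"),
]


def policy_action(consumed: float) -> str:
    """Return the action for the highest threshold that consumption has reached."""
    for threshold, action in POLICY:
        if consumed >= threshold:
            return action
    return "Normal operations"
```

Feeding this function the consumed-budget fraction from your metrics backend gives a single, unambiguous answer to "what are we supposed to be doing right now?"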

Common SLO Pitfalls to Avoid

Too many SLOs. Start with one or two SLIs per service. Ten SLOs per service are impossible to operationalize and dilute focus. Add more only when you have proven the process works.

SLOs without ownership. An SLO with no named owner and no budget policy is decoration. Every SLO needs a team or individual responsible for it.

Ignoring dependencies. If your service calls five downstream services on the request path, your availability is bounded by theirs: with hard, independent dependencies it is at most the product of their availabilities. Account for dependency reliability in your SLO targets, or you'll constantly violate your SLO due to factors outside your control.

Chasing 100% reliability. The marginal cost of reliability increases exponentially as you approach 100%. Going from 99.9% to 99.99% availability is dramatically more expensive than going from 99.0% to 99.9%. Always ask: does the user experience actually require this level of reliability?
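To put a number on the dependency pitfall above, here is a rough Python sketch. It assumes every dependency is required on every request and that failures are independent; both are simplifications, and the function name is hypothetical.

```python
from math import prod


def serial_availability(own: float, dependencies: list[float]) -> float:
    """Upper bound on availability when every dependency is required per request.

    Assumes independent failures and no retries, caching, or fallbacks.
    """
    return own * prod(dependencies)


# Hypothetical figures: a service at 99.95% calling five dependencies at 99.9% each.
composite = serial_availability(0.9995, [0.999] * 5)
print(f"best-case composite availability: {composite:.3%}")
```

Under these assumptions the composite lands below 99.5%, so a 99.5% SLO would be unreachable even if your own code almost never fails. Retries, caching, and graceful degradation are the usual ways to loosen that bound.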

Conclusion

SLOs, SLAs, and error budgets are not compliance bureaucracy. They are an engineering framework for making deliberate trade-offs between reliability and velocity. When implemented correctly, they give your team a shared, quantitative language for discussing risk, a principled basis for prioritization decisions, and a feedback loop that continuously improves the reliability of the systems you build.

Start simple: pick two SLIs for your most critical service, set targets based on historical data, implement error budget tracking in Grafana, and write a one-page budget policy. Run it for a quarter, observe how it changes your team's conversations, and expand from there.

Reliability is not a feature you ship once. It is an ongoing engineering practice, and error budgets are how you keep score.

Already using Google's SRE Workbook methodology or OpenSLO for cross-platform SLO definitions? Drop your approach in the comments.
