DevHelm

Originally published at devhelm.io

SLO vs SLA vs SLI: What Each One Means and How to Set Them

Most SLO guides start with the same three-paragraph definitional exercise — SLI is the indicator, SLO is the objective, SLA is the agreement — and then stop. You leave knowing the vocabulary but not how to use it. You can't answer the questions that actually matter: which metric should I measure, what target is realistic for my service, and what happens when I miss it?

This guide starts with the definitions because you need a shared vocabulary, but it spends most of its time on the decisions behind each one: choosing the right SLI for your service, setting an SLO that's strict enough to matter but loose enough to survive, computing and spending an error budget, and knowing when (and when not) to turn an SLO into an SLA.

The three letters, disambiguated

SLI — Service Level Indicator. A quantitative measurement of one dimension of your service's behavior. Latency, availability, throughput, error rate, ticket resolution time. An SLI is always a number with units, derived from real telemetry. "Our API is fast" is not an SLI. "The 95th percentile of API response latency, measured at the load balancer over a 5-minute window" is.

SLO — Service Level Objective. A target you set on an SLI. "p95 latency < 500ms, measured over a rolling 30-day window" is an SLO. It's an internal commitment — your team agrees that the service should meet this bar, and when it doesn't, you treat that as an incident or at least an engineering priority. An SLO is a tool for your team, not a legal document.

SLA — Service Level Agreement. An SLO that's been written into a contract with a customer, usually with financial consequences for missing it. If your SLO says "99.9% availability" and you publish that as an SLA, a customer who experiences more than 43 minutes of downtime in a month has grounds for a credit. SLAs are legal; SLOs are operational. Most internal services should have SLOs and should not have SLAs.

The relationship is directional: you measure an SLI, set an SLO against it, and optionally externalize that SLO as an SLA. Every SLA implies an SLO, but not every SLO should become an SLA.
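To make the SLI definition concrete, here is a minimal sketch of computing a p95 latency over one window. The sample values and the function name are hypothetical, and the nearest-rank method is just one common way to define a percentile; your metrics backend may interpolate differently.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers pct percent of the samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical response latencies (ms) for one 5-minute window at the load balancer.
window_ms = [120, 95, 110, 480, 130, 105, 99, 101, 640, 115,
             108, 97, 125, 140, 102, 118, 111, 93, 135, 122]

p95_ms = percentile_nearest_rank(window_ms, 95)
print(f"p95 latency: {p95_ms} ms")
```

The result is a number with units, tied to a measurement point and a window, which is what separates an SLI from "our API is fast."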

Choosing the right SLI

The hardest step is the first one: picking what to measure. A service with three SLIs that capture what users actually experience is more useful than one with fifteen SLIs that capture what the infrastructure is doing.

The Google SRE Workbook recommends starting from user journeys:

| User journey | SLI category | Example SLI |
| --- | --- | --- |
| "The page loads" | Availability | Proportion of HTTP requests returning non-5xx, measured at the edge |
| "The page loads quickly" | Latency | p95 of response time, measured at the load balancer |
| "My data is processed" | Freshness | Age of the most recent successful pipeline run, measured in minutes |
| "My report is accurate" | Correctness | Proportion of API responses returning the expected result (requires a canary or known-answer test) |

Two rules of thumb:

  1. Measure at the boundary your user sees, not inside your stack. If you measure latency at the application layer and your CDN adds 200ms, you're lying to yourself. Measure at the edge, or at the load balancer if that is the outermost point you control.
  2. Fewer SLIs, more confidence. Start with availability + latency for any request-serving system. Add freshness only if you run a pipeline. Add correctness only if you have a way to verify it. Three SLIs that are trustworthy beat ten that nobody looks at.

A common mistake: using CPU utilization or memory pressure as SLIs. Those are infrastructure signals, not user-facing indicators. A machine running at 95% CPU but serving all requests under 200ms is fine. A machine running at 30% CPU but dropping 5% of connections is not. SLIs are about the user's experience, not the server's.
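As a rough illustration of the availability SLI described above (proportion of requests returning non-5xx), the sketch below counts status codes from a hypothetical edge log. The log contents and function name are assumptions; in practice the counts would come from your load balancer or CDN metrics.

```python
def availability_sli(status_codes):
    """Percentage of requests that did not return a 5xx status."""
    total = len(status_codes)
    if total == 0:
        raise ValueError("no requests in the window")
    # 4xx responses count as "good" here: the service answered correctly,
    # even if the client asked for something invalid.
    good = sum(1 for code in status_codes if not 500 <= code <= 599)
    return 100 * good / total


# Hypothetical window: 10,000 requests, 5 of which failed with a 503.
edge_log = [200] * 9_980 + [404] * 15 + [503] * 5

sli_pct = availability_sli(edge_log)
print(f"availability: {sli_pct:.2f}%")
```

Note that nothing in this calculation looks at CPU or memory; only what the user received counts.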

Setting a realistic SLO

An SLO has three parts: the SLI, the target, and the measurement window.

Example: "99.9% of HTTP requests return a non-5xx response, measured over a rolling 30-day window."
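One way to keep the three parts together is a small record type. This is a sketch, not a standard schema: the class name, field names, and the choice to treat an empty window as compliant are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    sli_name: str      # what is measured
    target_pct: float  # the target, as a percentage of good events
    window_days: int   # the rolling measurement window

    def is_met(self, good_events: int, total_events: int) -> bool:
        """Check the target against event counts gathered over the window."""
        if total_events == 0:
            return True  # assumption: no traffic means nothing was violated
        return 100 * good_events / total_events >= self.target_pct


availability_slo = SLO("non-5xx HTTP responses", target_pct=99.9, window_days=30)

# Hypothetical 30-day counts: 1,000,000 requests, 800 of them 5xx.
print(availability_slo.is_met(good_events=999_200, total_events=1_000_000))
```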

The target is the part teams argue about. Here's a way to pick it that doesn't require a week of meetings.

Step 1: Measure your current SLI for 30 days. Don't set a target yet — just observe. If your service has been running 99.95% availability without anyone trying, setting 99.9% is reasonable. Setting 99.99% is aspirational. Setting 99% is embarrassing.

Step 2: Set the target slightly below your current baseline. If you've been running at 99.95%, set your SLO at 99.9%. This gives you room to breathe. The point of an SLO is not to describe your best day — it's to define the minimum acceptable. If you set it at your best day, every normal fluctuation is a "violation."

Step 3: Convert the target to an error budget. This is where SLOs get useful. A 30-day window contains 43,200 minutes, so:

| SLO target | Error budget | Allowed downtime per 30 days |
| --- | --- | --- |
| 99.9% | 0.1% | 43.2 minutes |
| 99.95% | 0.05% | 21.6 minutes |
| 99.99% | 0.01% | 4.3 minutes |
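The arithmetic behind the table is short enough to sketch. The function name is hypothetical, and the calculation assumes the budget is spent as full downtime rather than partial degradation.

```python
def error_budget_minutes(target_pct, window_days=30):
    """Minutes of full downtime an availability target allows over the window."""
    window_minutes = window_days * 24 * 60  # 30 days -> 43,200 minutes
    return window_minutes * (100 - target_pct) / 100


for target in (99.9, 99.95, 99.99):
    print(f"{target}% -> {error_budget_minutes(target):.1f} minutes")
```

Changing `window_days` shows how the same target becomes stricter or looser in absolute minutes on a 7-day or 90-day window.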

Those numbers are the entire content of most "what should my SLO be?" debates. A 99.99% SLO on a 30-day window gives you 4.3 minutes of total downtime. If your MTTR is 25 minutes per incident, you can afford zero incidents. That's either an aspirational commitment backed by redundant infrastructure, or it's a lie. Be honest about which one.
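A quick sanity check for any proposed target is to divide the budget by your typical recovery time. The sketch below assumes every incident is a full outage lasting exactly one MTTR, which is a simplification; the function name is hypothetical.

```python
import math


def incidents_affordable(target_pct, mttr_minutes, window_days=30):
    """Whole incidents of length mttr_minutes that fit inside the error budget."""
    budget = window_days * 24 * 60 * (100 - target_pct) / 100
    # Round first so float noise (e.g. 43.19999...) does not drop an incident.
    return math.floor(round(budget, 6) / mttr_minutes)


for target in (99.9, 99.95, 99.99):
    count = incidents_affordable(target, mttr_minutes=25)
    print(f"{target}% with 25-minute MTTR: {count} incident(s) per 30 days")
```

If the answer is zero, the target only holds when nothing ever fails, which is the situation the paragraph above warns about.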

The error budget: what it is and how to spend it

The error budget is the gap between 100% and your SLO target. If your SLO is 99.9% availability over 30 days, your error budget is 43.2 minutes. That budget is not "waste allowance" — it's a resource you can spend deliberately.

Useful ways to spend error budget:

  • Deploy a risky change. If you have 30 minutes left in the budget and the deploy might cause 5 minutes of degradation, that's a calculated risk. If you have 2 minutes left, hold the deploy until the window rolls.
  • Run a chaos experiment. Kill a database replica, fail over a region, inject latency on a dependency. Each experiment consumes budget. If you can't afford to run experiments, your SLO is probably too tight.
  • Let a known low-severity issue ride. A p99 latency blip at 3 AM that affects 0.01% of requests is consuming budget, but if the alternative is waking someone up, spending budget is the right call.
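The deploy decision in the first bullet can be written down as a simple comparison. This is an illustrative sketch: the function name and the optional `reserve_min` safety margin are assumptions, and the expected cost of a deploy is always an estimate.

```python
def can_deploy(budget_remaining_min, expected_cost_min, reserve_min=0.0):
    """True if the deploy's estimated cost fits in the remaining budget,
    leaving at least reserve_min minutes untouched."""
    return budget_remaining_min - expected_cost_min >= reserve_min


# 30 minutes left, deploy might cost 5: a calculated risk.
print(can_deploy(budget_remaining_min=30, expected_cost_min=5))
# 2 minutes left, same deploy: hold until the window rolls.
print(can_deploy(budget_remaining_min=2, expected_cost_min=5))
```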

The error budget policy is the written agreement about what happens when the budget runs out. Typical policies:

  • Budget exhausted -> feature freeze. All engineering effort goes to reliability until the budget recovers. This is the Google model and it works if leadership actually enforces it.
  • Budget below 50% -> deploy gate. Deploys require explicit approval from the on-call engineer. This slows shipping but prevents the "one more deploy" cascade that burns the remaining budget.
  • Budget healthy -> ship freely. This is the reward for investing in reliability. A team with a full error budget has earned the right to move fast.
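The three policy tiers above map naturally onto a small function. The thresholds mirror the bullets (exhausted, below 50%, otherwise healthy); the function name and return strings are hypothetical, and a real policy would be agreed with leadership rather than hard-coded.

```python
def budget_policy(budget_total_min, budget_spent_min):
    """Map error budget consumption to one of the policy actions listed above."""
    if budget_total_min <= 0:
        raise ValueError("total budget must be positive")
    remaining = budget_total_min - budget_spent_min
    if remaining <= 0:
        return "feature freeze"
    if remaining / budget_total_min < 0.5:
        return "deploy gate"
    return "ship freely"


# A 99.9% SLO over 30 days has a 43.2-minute budget.
for spent in (5.0, 30.0, 43.2):
    print(f"{spent} min spent -> {budget_policy(43.2, spent)}")
```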

The key insight: error budgets turn reliability from a vague mandate ("be more reliable") into a quantitative tradeoff ("we have 20 minutes left this month — is this deploy worth 5 of them?"). Teams that track error budgets make better decisions than teams that track uptime, because uptime has no built-in notion of "how much risk can we take."

When an SLO becomes an SLA

Most internal SLOs should stay internal. An SLA adds legal weight, customer expectations, and credit obligations. Promote an SLO to an SLA only when all three conditions hold:

  1. You've hit the SLO consistently for 3+ months. If you haven't proven you can meet it internally, you definitely can't promise it externally.
  2. You have a remediation path for breaches. What credits do you issue? How are they calculated? Who approves them? If you can't answer these, you don't have an SLA — you have a marketing claim.
  3. The SLA target is looser than your internal SLO. Your SLA should be 99.9% if your SLO is 99.95%. The gap is your operational buffer. If the SLA and SLO are the same number, every SLO breach is also a contract breach, and your team will either burn out or game the measurement.

A public status page (like the ones DevHelm hosts at /status/github) is a middle ground between internal SLOs and contractual SLAs: it shows real uptime data and builds trust through transparency, without attaching legal obligations.

How DevHelm gives you the data for SLOs

DevHelm doesn't have a first-class SLO resource that you configure with a target and measure against a budget — that's a feature we're building, not one we ship today. What it does give you is the raw material SLOs are made of.

Monitor uptime data. Every monitor computes availability as a weighted daily percentage: (86400 - major_seconds - partial_seconds * 0.3) / 86400 * 100, where 86400 is the number of seconds in a day. Major outages count fully against uptime; partial degradations count at 30%. The status page, the dashboard, and the API all use this formula, so the three stay in sync. If your SLI is availability, the monitor's uptime history is the measurement.
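For readers who want to reproduce the number offline, here is the formula transcribed into Python. This is a sketch of the arithmetic as stated in the text, not DevHelm's actual implementation; the function and constant names are hypothetical.

```python
SECONDS_PER_DAY = 86_400
PARTIAL_WEIGHT = 0.3  # partial degradations count at 30%


def daily_uptime_pct(major_seconds, partial_seconds):
    """Weighted daily uptime percentage, per the formula quoted above."""
    downtime = major_seconds + partial_seconds * PARTIAL_WEIGHT
    return (SECONDS_PER_DAY - downtime) / SECONDS_PER_DAY * 100


# Hypothetical day: 10 minutes of major outage, 30 minutes of partial degradation.
print(f"{daily_uptime_pct(major_seconds=600, partial_seconds=1800):.2f}%")
```

Because partial degradation is discounted, a day with 30 minutes of partial trouble costs the same as 9 minutes of full outage.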

Status page uptime bars. The public status page at /status/<service> renders daily uptime per component with a "tracking since" date. An internal team or a customer can see exactly when the service was degraded and for how long — the same data that would feed an error budget computation.

Alert channels for SLO-boundary signals. If your SLI is latency and your monitor checks every 30 seconds, you can set a monitor threshold at the SLO boundary (e.g. p95 > 500ms) and route the alert through DevHelm's notification policies. That's not burn-rate alerting in the formal sense (you'd want a multi-window approach per the Google SRE Workbook), but it surfaces threshold crossings as they happen, so you don't first learn about a blown SLO at the end of the month.
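For contrast, here is a rough sketch of what multi-window burn-rate alerting computes. It is not a DevHelm feature or API. Burn rate is the observed bad-event ratio divided by the ratio the SLO allows; the 14.4 threshold over a long and a short window follows the commonly cited Google SRE Workbook recommendation for a 30-day SLO, and the function names and inputs are assumptions. For a latency SLI, a "bad event" would be a request slower than the threshold.

```python
def burn_rate(bad_ratio, slo_target_pct):
    """How fast the budget is being spent: 1.0 means exactly on budget."""
    allowed_ratio = (100 - slo_target_pct) / 100
    if allowed_ratio <= 0:
        raise ValueError("a 100% target leaves no error budget to burn")
    return bad_ratio / allowed_ratio


def should_page(long_window_bad_ratio, short_window_bad_ratio,
                slo_target_pct, threshold=14.4):
    """Page only if both windows burn fast: the long window (e.g. 1 hour)
    shows the problem is significant, the short one (e.g. 5 minutes)
    shows it is still happening."""
    return (burn_rate(long_window_bad_ratio, slo_target_pct) >= threshold
            and burn_rate(short_window_bad_ratio, slo_target_pct) >= threshold)


# Hypothetical: 2% of requests bad over the last hour and the last 5 minutes.
print(should_page(0.02, 0.02, slo_target_pct=99.9))
# Same hour, but the last 5 minutes have recovered: no page.
print(should_page(0.02, 0.0005, slo_target_pct=99.9))
```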

What we'd tell you honestly: if you need formal error budgets with automated freeze policies, you need a dedicated SLO tool (Nobl9, Sloth, or a Prometheus recording rule setup). DevHelm gives you the uptime data and the alerting layer; the budget math is yours today, ours tomorrow.

Where to start

If you've never set an SLO, start with one. Pick your most important user-facing service, measure its availability SLI for at least two weeks (the full 30 days from Step 1 is better), then set the SLO about 0.05 percentage points below the observed baseline. Compute the error budget in minutes. Write it on a whiteboard. The first time someone asks "can we deploy this risky change?" and the answer is "we have 18 minutes of budget left, so let's wait until Monday," the SLO has paid for itself.

If your incidents tend to be dependency-driven (AWS degrades, your CDN edge has a regional issue), your SLO's biggest enemy is something outside your stack. The two cheapest investments in protecting your error budget are a runbook for each known dependency failure mode and a vendor status feed that tells you a dependency has degraded before your own monitors notice.

Spin up a free account at app.devhelm.io and wire your first monitor in 60 seconds. The uptime data starts accumulating immediately — you'll have your first 30-day SLI baseline before next month's planning meeting.


