Choosing Your First SLI: A Decision Framework for New SRE Teams

#sre #slo #reliability #devops

New SRE teams ask the same question: what should we measure first? The temptation is to track everything: CPU, memory, latency, error rate, queue depth, cache hit ratio. Don't.

You need one good SLI before you need ten mediocre ones.

What a first SLI should be

The first SLI you adopt should:

Map directly to user experience. Not "CPU below 80%." Something more like "checkout requests succeed within 2 seconds." The closer to what users actually feel, the better.
Be measurable from outside the system. If you can only measure it from inside (custom instrumentation in your code), you're going to fight that data forever. Start with HTTP status codes and request latency from a load balancer or CDN log.
Have an obvious failure mode. "What does it mean when this metric drops?" If you can't answer in one sentence, pick a different metric.
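
As a minimal sketch of that kind of outside-in measurement, the Python below computes "checkout requests succeed within 2 seconds" from load balancer log lines. The log format (path, status, and duration in seconds, separated by spaces), the /checkout prefix, and the names parse_line and checkout_sli are all assumptions for illustration. A real load balancer or CDN log will need its own parser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    path: str
    status: int
    duration_s: float


def parse_line(line: str) -> Record:
    # Hypothetical log format: "<path> <status> <duration_seconds>"
    path, status, duration = line.split()
    return Record(path, int(status), float(duration))


def checkout_sli(records, route_prefix="/checkout", max_latency_s=2.0):
    """Fraction of requests on the route that succeeded within the budget."""
    scoped = [r for r in records if r.path.startswith(route_prefix)]
    if not scoped:
        return None  # no traffic on the route, so the SLI is undefined
    good = sum(
        1
        for r in scoped
        if 200 <= r.status < 400 and r.duration_s <= max_latency_s
    )
    return good / len(scoped)


# Made-up sample data, not real traffic.
LOG = """\
/checkout 200 0.41
/checkout 200 2.75
/checkout 502 0.08
/checkout 302 0.30
/search 200 0.12
"""

records = [parse_line(line) for line in LOG.splitlines()]
sli = checkout_sli(records)
print(sli)
```

In this sample, two of the four checkout requests count as good: one is too slow and one returns a 5xx. The /search request is ignored because it is outside the scoped route.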

The four candidates I'd consider first

For a typical web service:

Availability of the critical path. Successful HTTP responses (2xx, 3xx) divided by total requests, scoped to the routes that matter most. Not the whole API. Pick checkout, sign-in, search, whatever generates revenue or retention.
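Latency at the 95th or 99th percentile. Median latency lies during incidents: it stays flat as long as at least half of your requests still get through quickly. Tail latency moves as soon as a small fraction of requests slows down, so it tells you when things break.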
Job completion rate. If your system has background jobs (emails, reports, async exports), what fraction completes successfully within the time the user expects?
End-to-end synthetic checks. A canary that simulates a real user every minute. If it fails, you know something user-visible broke, even if internal metrics look fine.
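
To see why the median hides an incident that tail latency exposes, here is a small sketch using Python's standard statistics module (statistics.quantiles needs Python 3.8 or later). The latency values and the 30% incident share are made up for illustration, and latency_summary is a hypothetical helper name. Your monitoring tool may compute percentiles with a slightly different method.

```python
import statistics


def latency_summary(latencies_s):
    """Return the median plus the 95th and 99th percentile, in seconds."""
    # n=100 yields 99 cut points; index 94 is p95 and index 98 is p99.
    cuts = statistics.quantiles(latencies_s, n=100)
    return {
        "p50": statistics.median(latencies_s),
        "p95": cuts[94],
        "p99": cuts[98],
    }


# Made-up data: 1000 requests at 0.25 s each on a healthy day.
healthy = [0.25] * 1000
# Made-up incident: 30% of requests take 8 s, the rest are unaffected.
incident = [0.25] * 700 + [8.0] * 300

healthy_summary = latency_summary(healthy)
incident_summary = latency_summary(incident)
print("healthy: ", healthy_summary)
print("incident:", incident_summary)
```

With this data the median is 0.25 seconds on both days, while p95 and p99 jump from 0.25 to 8 seconds during the incident.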

Pick one. Just one. Get it stable for a month before adding a second.

What to skip until you're ready

Multi-burn-rate alerts. Powerful but complex. Most teams should start with a single threshold and learn what "normal" looks like before layering on math.
SLOs across multiple regions or services. Aggregating SLIs across regions is a science of its own. Get one region right first.
Custom percentile math. Use whatever percentile your monitoring tool ships with. The difference between p99 calculated three different ways is rarely the gap between "ok" and "broken."
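
For contrast with multi-burn-rate alerting, a single-threshold check can be as small as the sketch below. The 0.99 threshold, the min_requests guard, and the name should_alert are assumptions for illustration, not recommendations. Choose numbers after you have watched what "normal" looks like for your service.

```python
def should_alert(good: int, total: int,
                 threshold: float = 0.99,
                 min_requests: int = 100) -> bool:
    """Return True when the SLI for one evaluation window is below threshold.

    good and total are request counts for the window. min_requests is an
    assumed guard so a handful of failures in a quiet window does not page.
    """
    if total < min_requests:
        return False  # too little traffic to judge this window
    return good / total < threshold


# Made-up window counts.
print(should_alert(good=985, total=1000))  # SLI is 0.985
print(should_alert(good=995, total=1000))  # SLI is 0.995
print(should_alert(good=5, total=10))      # below the traffic guard
```

The one-sentence failure mode stays obvious here: the alert fires when fewer than 99% of requests in the window were good.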

The trap to avoid

Teams over-engineer SLIs at the start because the literature is full of advanced patterns. Resist this. Pick a simple SLI, write down what "good" looks like, watch it for a quarter, then decide what's missing. The boring SLIs catch 80% of real problems for 10% of the effort.