Why SLOs Break in Microservices
An SLO that works for a monolith often collapses when you distribute the same logic across 30 services. The math of availability is unforgiving.
If your service depends on 5 others, each at 99.9%, your realistic ceiling is 0.999^5 ≈ 99.5%. If you also promised 99.9%, that 0.4% gap eats your entire error budget before your own code even runs.
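The serial-dependency math above can be sketched in a few lines. This is a minimal illustration, not a library API; the function name is hypothetical:

```python
# Hypothetical sketch: availability ceiling of a service that makes
# serial calls to N dependencies, each of which must succeed.
def compound_availability(dep_availabilities):
    """Multiply dependency availabilities to get the realistic ceiling."""
    ceiling = 1.0
    for a in dep_availabilities:
        ceiling *= a
    return ceiling

# Five dependencies at 99.9% each:
print(f"{compound_availability([0.999] * 5):.4%}")  # 99.5010%
```

The same formula applies to fan-out: if all downstream calls must succeed for the user request to succeed, multiply them all.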
The Three Mistakes Teams Make
1. Copying the same SLO to every service
The same 99.9% target means very different things on a payment service and a batch analytics service. Breaching one ruins revenue. Breaching the other ruins dashboards.
2. Measuring uptime instead of user experience
GET /health returning 200 is not an SLO. Users don't call /health. They check out, log in, view pages. Measure those.
3. Ignoring fan-out
If a user request fans out to 8 downstream calls, and one of them has a 99% SLO, your user-facing reliability is capped at 99% no matter how good your code is.
A Practical SLO Framework for Microservices
```yaml
user_journey_slos:
  critical_path_checkout:
    target: 99.95%
    window: 30d
    metric: "successful_checkouts / total_checkouts"
    error_budget: 21.6 minutes / month
  user_login:
    target: 99.9%
    window: 30d
    error_budget: 43.2 minutes / month
  background_analytics:
    target: 99.0%
    window: 30d
    error_budget: 7.2 hours / month
```
Notice: we define SLOs on user journeys, not services. This is the biggest mental shift.
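The error-budget figures in that config fall straight out of the target and the window. A quick sketch (the function name is illustrative):

```python
# Derive an error budget in minutes from an SLO target and a window.
def error_budget_minutes(target, window_days=30):
    """Minutes of allowed failure for a given SLO target and window."""
    window_minutes = window_days * 24 * 60  # 43,200 for a 30-day window
    return (1.0 - target) * window_minutes

print(round(error_budget_minutes(0.9995), 1))      # 21.6 (minutes)
print(round(error_budget_minutes(0.999), 1))       # 43.2 (minutes)
print(round(error_budget_minutes(0.99) / 60, 1))   # 7.2 (hours)
```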
Composition Rules
When Service A depends on B and C:
- A's SLO must account for the combined availability of B and C
- If B is 99.9% and C is 99.5%, A's realistic SLO is ~99.4%
- Build a dependency graph and calculate compound availability
We use a simple rule: no service promises an SLO tighter than the weakest service it depends on.
The Implementation Pattern
Every service exports three Prometheus metrics:
```
slo_requests_total{service,journey,status}
slo_budget_remaining{service,journey,window}
slo_burn_rate{service,journey,window}
```
From these three, you can compute:
- Current SLO compliance
- Budget remaining (in minutes)
- Burn rate (how fast you're consuming budget)
Alert on burn rate, not on individual requests. A 2% error rate for 30 seconds is a blip. A sustained 2% error rate over 10 minutes is an incident.
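As a sketch of what that looks like in practice, here is a hypothetical Prometheus alerting rule. It assumes the `slo_requests_total` metric above uses `status="error"` for failures, and that the checkout journey's 99.95% target gives an allowed error fraction of 0.0005; the 14.4x threshold and 1h window follow the common multiwindow burn-rate convention:

```yaml
groups:
  - name: slo-burn-rate
    rules:
      - alert: CheckoutFastBurn
        # Burn rate = observed error rate / allowed error rate.
        # A 14.4x burn over 1h consumes ~2% of a 30-day budget per hour.
        expr: |
          (
            sum(rate(slo_requests_total{journey="critical_path_checkout",status="error"}[1h]))
            /
            sum(rate(slo_requests_total{journey="critical_path_checkout"}[1h]))
          ) / 0.0005 > 14.4
        for: 10m
        labels:
          severity: page
```

The `for: 10m` clause is what separates a 30-second blip from a sustained incident.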
Budget Policies That Actually Stick
The trick isn't defining SLOs. It's enforcing them. We use a 4-level policy, keyed on how much of the error budget has been consumed:

```yaml
budget_exhausted: freeze non-critical deploys, notify product
budget_50_pct: feature freeze, focus on reliability    # 50%+ consumed
budget_25_pct: normal operations, monitor carefully    # 25%+ consumed
budget_healthy: ship new features, experiment
```
When the budget is exhausted, product can't ship new features until reliability is restored. This alignment between eng and product is what makes SLOs stick.
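A policy like this is easy to wire into a deploy gate. A minimal sketch, assuming the 4 levels above and a budget-consumed fraction as input (the function name is hypothetical):

```python
# Hypothetical deploy gate: map error-budget consumption (0.0-1.0)
# to the 4-level policy described above.
def deploy_policy(budget_consumed):
    if budget_consumed >= 1.0:
        return "budget_exhausted"  # freeze non-critical deploys
    if budget_consumed >= 0.5:
        return "budget_50_pct"     # feature freeze
    if budget_consumed >= 0.25:
        return "budget_25_pct"     # normal ops, monitor carefully
    return "budget_healthy"        # ship freely

print(deploy_policy(0.1))  # budget_healthy
print(deploy_policy(0.6))  # budget_50_pct
```

In practice this check would read `slo_budget_remaining` from Prometheus and run in CI before each deploy.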
Common Anti-Patterns
- SLOs nobody looks at: if you don't page on budget burn rate, they're dead
- SLOs that never fail: if you never breach budget, your targets are too loose
- SLOs that always fail: if you're always in the red, your targets are unrealistic
- SLOs without product buy-in: engineering-only SLOs get ignored during feature pressure
Final Thoughts
SLOs are a negotiation tool between engineering and product. Without them, every outage becomes a fight. With them, you have a shared contract about what "good enough" means.
Start with one critical journey. Measure it for 30 days. Set a realistic SLO. Enforce budget policies. Then add more journeys.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com