Why SLOs Break in Microservices
An SLO that works for a monolith often collapses when you distribute the same logic across 30 services. The math of availability is unforgiving.
If your service depends on 5 others, each at 99.9%, your realistic ceiling is 0.999^5 ≈ 99.5%. If you also promised 99.9%, that 0.4% gap eats your entire error budget before your own code even runs.
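The serial-dependency math above can be sketched in a few lines. This is a minimal illustration, not a library API; the function name is hypothetical:

```python
# Hypothetical sketch: availability ceiling of a service that makes
# serial calls to N dependencies, each of which must succeed.
def compound_availability(dep_availabilities):
    """Multiply dependency availabilities to get the realistic ceiling."""
    ceiling = 1.0
    for a in dep_availabilities:
        ceiling *= a
    return ceiling

# Five dependencies at 99.9% each:
print(f"{compound_availability([0.999] * 5):.4%}")  # 99.5010%
```

The same formula applies to fan-out: if all downstream calls must succeed for the user request to succeed, multiply them all.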
The Three Mistakes Teams Make
1. Copying the same SLO to every service
The same 99.9% target means very different things on a payment service and a batch analytics service. Breaching one ruins revenue. Breaching the other ruins dashboards.
2. Measuring uptime instead of user experience
GET /health returning 200 is not an SLO. Users don't call /health. They check out, log in, view pages. Measure those.
3. Ignoring fan-out
If a user request fans out to 8 downstream calls, and one of them has a 99% SLO, your user-facing reliability is capped at 99% no matter how good your code is.
A Practical SLO Framework for Microservices
```yaml
user_journey_slos:
  critical_path_checkout:
    target: 99.95%
    window: 30d
    metric: "successful_checkouts / total_checkouts"
    error_budget: 21.6 minutes / month
  user_login:
    target: 99.9%
    window: 30d
    error_budget: 43.2 minutes / month
  background_analytics:
    target: 99.0%
    window: 30d
    error_budget: 7.2 hours / month
```
Notice: we define SLOs on user journeys, not services. This is the biggest mental shift.
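The error-budget figures in that config fall straight out of the target and the window. A quick sketch (the function name is illustrative):

```python
# Derive an error budget in minutes from an SLO target and a window.
def error_budget_minutes(target, window_days=30):
    """Minutes of allowed failure for a given SLO target and window."""
    window_minutes = window_days * 24 * 60  # 43,200 for a 30-day window
    return (1.0 - target) * window_minutes

print(round(error_budget_minutes(0.9995), 1))      # 21.6 (minutes)
print(round(error_budget_minutes(0.999), 1))       # 43.2 (minutes)
print(round(error_budget_minutes(0.99) / 60, 1))   # 7.2 (hours)
```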
Composition Rules
When Service A depends on B and C:
- A's SLO must account for the combined availability of B and C
- If B is 99.9% and C is 99.5%, A's realistic SLO is ~99.4%
- Build a dependency graph and calculate compound availability
We use a simple rule: no service promises an SLO tighter than the weakest service it depends on.
The Implementation Pattern
Every service exports three Prometheus metrics:
```
slo_requests_total{service,journey,status}
slo_budget_remaining{service,journey,window}
slo_burn_rate{service,journey,window}
```
From these three, you can compute:
- Current SLO compliance
- Budget remaining (in minutes)
- Burn rate (how fast you're consuming budget)
Alert on burn rate, not on individual requests. A 2% error rate for 30 seconds is a blip. A sustained 2% error rate over 10 minutes is an incident.
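As a sketch of what that looks like in practice, here is a hypothetical Prometheus alerting rule. It assumes the `slo_requests_total` metric above uses `status="error"` for failures, and that the checkout journey's 99.95% target gives an allowed error fraction of 0.0005; the 14.4x threshold and 1h window follow the common multiwindow burn-rate convention:

```yaml
groups:
  - name: slo-burn-rate
    rules:
      - alert: CheckoutFastBurn
        # Burn rate = observed error rate / allowed error rate.
        # A 14.4x burn over 1h consumes ~2% of a 30-day budget per hour.
        expr: |
          (
            sum(rate(slo_requests_total{journey="critical_path_checkout",status="error"}[1h]))
            /
            sum(rate(slo_requests_total{journey="critical_path_checkout"}[1h]))
          ) / 0.0005 > 14.4
        for: 10m
        labels:
          severity: page
```

The `for: 10m` clause is what separates a 30-second blip from a sustained incident.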
Budget Policies That Actually Stick
The trick isn't defining SLOs. It's enforcing them. We use a 4-level policy, keyed on how much of the error budget has been consumed:

```yaml
budget_exhausted: freeze non-critical deploys, notify product
budget_50_pct: feature freeze, focus on reliability    # 50%+ consumed
budget_25_pct: normal operations, monitor carefully    # 25%+ consumed
budget_healthy: ship new features, experiment
```
When the budget is exhausted, product can't ship new features until reliability is restored. This alignment between eng and product is what makes SLOs stick.
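A policy like this is easy to wire into a deploy gate. A minimal sketch, assuming the 4 levels above and a budget-consumed fraction as input (the function name is hypothetical):

```python
# Hypothetical deploy gate: map error-budget consumption (0.0-1.0)
# to the 4-level policy described above.
def deploy_policy(budget_consumed):
    if budget_consumed >= 1.0:
        return "budget_exhausted"  # freeze non-critical deploys
    if budget_consumed >= 0.5:
        return "budget_50_pct"     # feature freeze
    if budget_consumed >= 0.25:
        return "budget_25_pct"     # normal ops, monitor carefully
    return "budget_healthy"        # ship freely

print(deploy_policy(0.1))  # budget_healthy
print(deploy_policy(0.6))  # budget_50_pct
```

In practice this check would read `slo_budget_remaining` from Prometheus and run in CI before each deploy.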
Common Anti-Patterns
- SLOs nobody looks at: if you don't page on budget burn rate, they're dead
- SLOs that never fail: if you never breach budget, your targets are too loose
- SLOs that always fail: if you're always in the red, your targets are unrealistic
- SLOs without product buy-in: engineering-only SLOs get ignored during feature pressure
Final Thoughts
SLOs are a negotiation tool between engineering and product. Without them, every outage becomes a fight. With them, you have a shared contract about what "good enough" means.
Start with one critical journey. Measure it for 30 days. Set a realistic SLO. Enforce budget policies. Then add more journeys.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com