The SLO Translation Problem
You define an SLO: 99.95% availability with p99 latency under 200ms. Engineering loves it. Product managers glaze over.
The problem isn't the SLO. It's how we communicate it.
Speaking Product Language
Translate technical SLOs into business impact:
Technical SLO                   Product translation
─────────────────────────────   ─────────────────────────────────────────────
99.95% availability             "22 minutes of downtime per month, max"
p99 latency < 200ms             "99% of users get a response in under 0.2s"
99.9% error-free transactions   "For every 1000 purchases, at most 1 fails"
Suddenly, the product manager can make informed tradeoffs.
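The availability row of that translation is pure arithmetic. A minimal helper (the function name is mine, not from any library) makes the conversion repeatable, assuming a 30-day month:

```python
def translate_availability(target_pct: float, days: int = 30) -> dict:
    """Turn an availability target into the downtime budget PMs reason about."""
    total_minutes = days * 24 * 60
    allowed = (1 - target_pct / 100) * total_minutes
    return {
        "target": f"{target_pct}%",
        "max_downtime_minutes_per_month": round(allowed, 1),
    }

print(translate_availability(99.95))
# {'target': '99.95%', 'max_downtime_minutes_per_month': 21.6}
```

21.6 minutes rounds to the "22 minutes" in the table above.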
The SLO Negotiation Framework
SLOs should be negotiated between engineering and product. Here's my framework:
Step 1: Measure Current Performance
def current_performance(service, window_days=30):
    # Average success ratio over the window, computed as a Prometheus
    # subquery at 1h resolution. query_prometheus returns a scalar fraction.
    metrics = query_prometheus(f'''
        avg_over_time(
            (1 - rate(http_errors_total{{service="{service}"}}[5m])
                / rate(http_requests_total{{service="{service}"}}[5m]))
            [{window_days}d:1h]
        )
    ''')
    return {
        'availability': f"{metrics * 100:.3f}%",
        'monthly_downtime_minutes': round((1 - metrics) * 30 * 24 * 60, 1)
    }

# Example output:
# {'availability': '99.847%', 'monthly_downtime_minutes': 66.1}
Step 2: Present the Cost-Reliability Tradeoff
Reliability Level  | Monthly Downtime | Eng Investment | Feature Impact
───────────────────┼──────────────────┼────────────────┼───────────────
99.5% (current)    | 3.6 hours        | Baseline       | None
99.9% (good)       | 43 minutes       | +1 SRE         | -10% velocity
99.95% (great)     | 22 minutes       | +2 SREs        | -20% velocity
99.99% (amazing)   | 4.3 minutes      | +4 SREs        | -40% velocity
This makes the cost explicit. Most product teams choose 99.9-99.95%.
Step 3: Define SLIs That Map to User Journeys
Don't define SLOs per service. Define them per user journey:
slo_definitions:
  - name: "Checkout Success"
    description: "Users can complete a purchase"
    sli: |
      successful_checkouts / total_checkout_attempts
    target: 99.9%
    window: 30 days
    owner: payments-team
    product_owner: "@sarah"
  - name: "Search Responsiveness"
    description: "Search results appear quickly"
    sli: |
      search_requests{latency < 500ms} / total_search_requests
    target: 99.5%
    window: 30 days
    owner: search-team
    product_owner: "@mike"
  - name: "Login Reliability"
    description: "Users can log into their accounts"
    sli: |
      successful_logins / total_login_attempts
    target: 99.99%  # Higher because login blocks everything
    window: 30 days
    owner: identity-team
    product_owner: "@lisa"
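Each journey SLO implies an error budget: the failures you are allowed before breaching. A minimal sketch of that computation (the event counts are illustrative; in practice they come from your metrics store, not the YAML):

```python
def error_budget(target_pct: float, total_events: int) -> dict:
    """Failures a journey can absorb in the window before breaching its SLO."""
    allowed_failures = total_events * (1 - target_pct / 100)
    return {
        "allowed_failures": round(allowed_failures),
        "budget_pct": round(100 - target_pct, 3),
    }

# Checkout Success: 99.9% target over 2,000,000 checkout attempts
print(error_budget(99.9, 2_000_000))
# {'allowed_failures': 2000, 'budget_pct': 0.1}
```

Framed this way, "Login Reliability at 99.99%" reads as "one failed login per 10,000 attempts", which is the sentence you bring to the product owner.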
Step 4: The Monthly SLO Review
We run a 30-minute monthly meeting with engineering leads AND product managers:
Agenda:
1. SLO status dashboard review (5 min)
- Which SLOs are healthy? (green)
- Which are at risk? (yellow)
- Which were breached? (red)
2. Budget impact (10 min)
- Error budget consumed per SLO
- Projected budget at current burn rate
- Feature freeze triggers
3. Tradeoff decisions (15 min)
- Feature X requires relaxing SLO Y — approve?
- Incident Z consumed 40% of budget — invest in fix?
- New service launching — what SLO target?
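The "feature freeze trigger" in the agenda can be as simple as comparing actual burn to a linear baseline. A sketch, assuming a simple policy of freezing when you burn budget at more than twice the planned pace (the 2x threshold is an illustrative assumption, not a standard):

```python
def feature_freeze(budget_consumed_pct: float, days_elapsed: int,
                   window_days: int = 30) -> bool:
    """Freeze feature work when budget burn runs well ahead of a linear plan."""
    expected_pct = days_elapsed / window_days * 100  # linear burn baseline
    return budget_consumed_pct > expected_pct * 2    # burning 2x faster than plan

print(feature_freeze(40.0, days_elapsed=5))   # 40% consumed in 5 days: freeze
print(feature_freeze(40.0, days_elapsed=20))  # 40% in 20 days: on pace
```

Encoding the trigger removes the hardest part of the meeting: arguing about whether "we're burning fast" is actually true.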
The Dashboard That Changed Everything
We built a single-page SLO dashboard with three views:
- Executive view: Traffic lights per user journey. Green/Yellow/Red.
- Product view: Error budget remaining + projected depletion date.
- Engineering view: Burn rate charts + contributing incidents.
Same data, different lens. Everyone gets what they need.
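The three views can share one computation and render different fields from it. A sketch of that shared status function; the traffic-light thresholds (25% and 5% budget remaining) and the field names are my assumptions, not from any specific tool:

```python
from datetime import date, timedelta

def slo_status(budget_total: float, budget_consumed: float,
               days_elapsed: int, today: date) -> dict:
    """One status record feeding the executive, product, and engineering views."""
    remaining = budget_total - budget_consumed
    remaining_pct = remaining / budget_total * 100
    # Executive view: traffic light (thresholds are illustrative)
    light = "green" if remaining_pct > 25 else "yellow" if remaining_pct > 5 else "red"
    # Product view: projected depletion date at the current burn rate
    burn_per_day = budget_consumed / days_elapsed if days_elapsed else 0.0
    depletion = (today + timedelta(days=remaining / burn_per_day)
                 if burn_per_day > 0 else None)
    return {
        "light": light,
        "remaining_pct": round(remaining_pct, 1),
        "projected_depletion": depletion,
        "burn_per_day": round(burn_per_day, 2),  # engineering view input
    }

# 99.9% SLO: 43.2-minute monthly budget, 30 minutes already consumed by day 10
print(slo_status(budget_total=43.2, budget_consumed=30.0,
                 days_elapsed=10, today=date(2024, 6, 10)))
```

One record, three renderings: the executive reads `light`, the PM reads `projected_depletion`, and engineering drills into the burn rate.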
Key Insight
SLOs are a communication tool first, a technical tool second. If only engineers understand your SLOs, they're not working.
If you want SLOs that automatically track, alert, and report in plain language, check out what we're building at Nova AI Ops.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com