SLI/SLO Framework
Stop arguing about reliability in the abstract. This framework gives you everything needed to define Service Level Indicators, set Service Level Objectives, calculate error budgets, configure multi-window alerting, and produce executive dashboards that translate uptime into business impact. From Prometheus recording rules to Grafana panels to the Python scripts that tie them together, this is a complete SLI/SLO implementation you can deploy in a day.
Key Features
- SLI definition templates — Pre-built indicators for availability, latency, throughput, and correctness with Prometheus queries
- SLO specification schema — YAML-based SLO definitions with targets, measurement windows, and stakeholder metadata
- Error budget calculator — Python script that computes remaining error budget, burn rate, and projected exhaustion date
- Multi-window burn rate alerts — Prometheus alerting rules implementing Google's recommended 5m/1h/6h/3d burn rate windows
- Grafana dashboard JSON — Import-ready dashboards for SLO status and error budget remaining
- Executive reporting script — Generates weekly/monthly SLO compliance reports in Markdown
- SLO negotiation guide — Framework for negotiating targets between product and engineering
- Error budget policy template — Defines consequences when budget is exhausted (feature freeze, reliability sprint)
Quick Start
unzip sli-slo-framework.zip && cd sli-slo-framework/

# Define your first SLO
cp templates/config.yaml my_slos.yaml
# Edit my_slos.yaml with your service details

# Calculate current error budget
python3 src/sli_slo_framework/core.py budget \
  --config my_slos.yaml \
  --prometheus-url https://prometheus.example.com

# Generate executive report
python3 src/sli_slo_framework/core.py report \
  --config my_slos.yaml --window 30 --output reports/monthly.md
Architecture / How It Works
- Define — Write SLO specs in YAML: SLIs, targets, and measurement windows
- Measure — Prometheus recording rules continuously compute SLI values
- Alert — Multi-window burn rate alerts fire when budget is consumed too fast
- Report — Dashboards and scripts produce status for engineers and executives
Usage Examples
SLO Definition
# my_slos.yaml
slos:
  - name: api-gateway-availability
    service: api-gateway
    sli:
      type: availability
      good_events: 'http_requests_total{service="api-gateway",status!~"5.."}'
      total_events: 'http_requests_total{service="api-gateway"}'
    objective: 99.9
    window: 30d
  - name: api-gateway-latency
    service: api-gateway
    sli:
      type: latency
      good_events: 'http_request_duration_seconds_bucket{service="api-gateway",le="0.3"}'
      # Use the histogram's own _count series so both sides come from the same metric
      total_events: 'http_request_duration_seconds_count{service="api-gateway"}'
    objective: 99.0
    window: 30d
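Before handing a spec like the one above to any tooling, it can help to sanity-check it. The following is a minimal sketch, not the framework's own validator: `window_seconds` and `validate_slo` are hypothetical helper names, the required keys are inferred from the example above, and the spec is passed as an already-parsed dictionary so that no YAML library is needed.

```python
import re

# Hypothetical helpers; the framework's real schema validation may differ.
_WINDOW_RE = re.compile(r"^(\d+)([dhm])$")
_SECONDS = {"d": 86400, "h": 3600, "m": 60}


def window_seconds(window: str) -> int:
    """Convert a window string such as '30d' or '6h' into seconds."""
    match = _WINDOW_RE.match(window)
    if not match:
        raise ValueError(f"unsupported window: {window!r}")
    return int(match.group(1)) * _SECONDS[match.group(2)]


def validate_slo(slo: dict) -> list[str]:
    """Return a list of problems found in one SLO entry (empty if none)."""
    problems = []
    # Keys assumed required, based on the example spec.
    for key in ("name", "service", "sli", "objective", "window"):
        if key not in slo:
            problems.append(f"missing key: {key}")
    objective = slo.get("objective")
    if isinstance(objective, (int, float)) and not 0 < objective < 100:
        problems.append("objective must be a percentage between 0 and 100")
    sli = slo.get("sli", {})
    for key in ("good_events", "total_events"):
        if key not in sli:
            problems.append(f"sli is missing {key}")
    if "window" in slo:
        try:
            window_seconds(slo["window"])
        except ValueError as exc:
            problems.append(str(exc))
    return problems
```

An objective of exactly 100 is rejected on purpose: it leaves no error budget, which makes every burn-rate calculation divide by zero.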
Prometheus Recording Rules for SLIs
groups:
  - name: sli_api_gateway
    interval: 1m
    rules:
      - record: sli:availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{service="api-gateway",status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="api-gateway"}[5m]))
      # Latency SLI: requests under 300ms divided by the histogram's own _count series
      - record: sli:latency:ratio_rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{service="api-gateway",le="0.3"}[5m]))
          /
          sum(rate(http_request_duration_seconds_count{service="api-gateway"}[5m]))
      # Requires a sli:availability:ratio_rate30d rule, defined like the 5m rule with a [30d] range
      - record: slo:error_budget_remaining:ratio
        expr: |
          1 - ((1 - sli:availability:ratio_rate30d) / (1 - 0.999))
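The arithmetic behind the `slo:error_budget_remaining:ratio` rule is easy to check by hand. The sketch below mirrors that expression in Python; the function name is illustrative and not part of the framework.

```python
def error_budget_remaining(sli: float, objective: float) -> float:
    """Fraction of the error budget left: 1 - (1 - sli) / (1 - objective).

    Both arguments are ratios (0.999, not 99.9).
    """
    if not 0 < objective < 1:
        raise ValueError("objective must be a ratio strictly between 0 and 1")
    return 1 - (1 - sli) / (1 - objective)


# A 99.95% measured SLI against a 99.9% objective has used half the budget.
half_left = error_budget_remaining(0.9995, 0.999)

# A 99.8% measured SLI has spent twice the budget, so the result is negative.
overspent = error_budget_remaining(0.998, 0.999)
```

Note that the result reaches exactly 1.0 (100%) only when the SLI is perfect. It can exceed 1.0 only if the SLI ratio itself is above 1.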
Multi-Window Burn Rate Alerts
groups:
  - name: slo_burn_rate_alerts
    rules:
      - alert: SLOHighBurnRate_5m  # Fast: 14.4x rate, 5m/1h windows
        expr: |
          (1 - sli:availability:ratio_rate5m) > (14.4 * (1 - 0.999))
          and
          (1 - sli:availability:ratio_rate1h) > (14.4 * (1 - 0.999))
        for: 2m
        labels:
          severity: critical
      - alert: SLOHighBurnRate_6h  # Slow: 3x rate, 6h/3d windows
        expr: |
          (1 - sli:availability:ratio_rate6h) > (3 * (1 - 0.999))
          and
          (1 - sli:availability:ratio_rate3d) > (3 * (1 - 0.999))
        for: 15m
        labels:
          severity: warning
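To see where the factors in these alerts come from, it helps to convert a burn rate into the share of the 30-day budget it consumes over the long window. The sketch below is illustrative only (the function names are not part of the framework) and assumes a 30-day SLO window.

```python
def burn_rate(sli: float, objective: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    return (1 - sli) / (1 - objective)


def budget_consumed(factor: float, window_hours: float,
                    slo_window_hours: float = 30 * 24) -> float:
    """Share of the total budget spent if `factor` is sustained for the window."""
    return factor * window_hours / slo_window_hours


def should_alert(short_sli: float, long_sli: float,
                 objective: float, factor: float) -> bool:
    """Both windows must exceed the factor, mirroring the `and` in the rules."""
    return (burn_rate(short_sli, objective) > factor
            and burn_rate(long_sli, objective) > factor)


fast_share = budget_consumed(14.4, 1)       # 14.4x for 1h
slow_share = budget_consumed(3.0, 3 * 24)   # 3x for 3d
```

Under these assumptions, 14.4x sustained for one hour spends 2% of the monthly budget, and 3x sustained for three days spends 30%. Requiring both windows to breach keeps a short spike from paging once the error rate has already recovered.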
Error Budget Calculator
from sli_slo_framework.core import ErrorBudgetCalculator
calc = ErrorBudgetCalculator(prometheus_url="https://prometheus.example.com")
budget = calc.compute(
sli_query='sli:availability:ratio_rate30d{service="api-gateway"}',
objective=0.999, window_days=30,
)
print(f"Remaining: {budget.remaining_pct:.1%} | Burn: {budget.burn_rate:.1f}x")
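The projected exhaustion date mentioned in the feature list follows directly from the remaining budget and the burn rate. The sketch below is not the framework's `ErrorBudgetCalculator`; it is a standalone illustration with a hypothetical function name, and it assumes the current burn rate stays constant.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def projected_exhaustion(remaining: float, burn_rate: float,
                         window_days: int, now: datetime) -> Optional[datetime]:
    """Estimate when the budget hits zero if the burn rate does not change.

    remaining: fraction of budget left (0.5 means 50%).
    burn_rate: multiple of the sustainable rate (1.0 spends the whole
               budget in exactly `window_days`).
    """
    if remaining <= 0:
        return now       # already exhausted
    if burn_rate <= 0:
        return None      # nothing is being burned; no projection possible
    return now + timedelta(days=remaining * window_days / burn_rate)


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
# Half the budget left at a 2x burn rate lasts 0.5 * 30 / 2 = 7.5 days.
eta = projected_exhaustion(0.5, 2.0, 30, now)
```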
Configuration
# config.example.yaml
framework:
  prometheus_url: https://prometheus.example.com
  grafana_url: https://grafana.example.com
defaults:
  window: 30d
  burn_rate_windows:
    fast: { short: 5m, long: 1h, factor: 14.4 }
    slow: { short: 6h, long: 3d, factor: 3.0 }
reporting:
  schedule: weekly
  format: markdown
  output_dir: reports/
error_budget_policy:
  green: "Normal velocity (>50% budget)"
  yellow: "Prioritize reliability (25-50%)"
  red: "Feature freeze (0-25%)"
  exhausted: "Emergency reliability sprint"
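The `error_budget_policy` tiers can be read as a simple threshold lookup. The sketch below shows one way to map a remaining-budget fraction to a tier. The function name is hypothetical, and the boundary handling is an assumption, since the config does not say which tier owns a boundary: here exactly 50% and exactly 25% count as yellow, and anything at or below zero counts as exhausted.

```python
def policy_tier(remaining: float) -> str:
    """Map remaining budget (fraction, 0.5 means 50%) to a policy tier."""
    if remaining <= 0:
        return "exhausted"   # emergency reliability sprint
    if remaining < 0.25:
        return "red"         # feature freeze
    if remaining <= 0.50:
        return "yellow"      # prioritize reliability
    return "green"           # normal velocity


tiers = [policy_tier(r) for r in (0.8, 0.5, 0.3, 0.1, 0.0, -0.2)]
```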
Best Practices
- Start with availability and latency; together these two SLIs cover most user-facing reliability concerns
- Set objectives based on user tolerance, not system capability
- Alert on burn rate, not raw SLI — a brief dip is fine if you're within budget
- Report to leadership monthly — translate uptime into "hours of downtime remaining"
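For the leadership report, the remaining budget can be restated as time. The sketch below uses a hypothetical function name and makes a simplifying assumption: the budget is treated as full-outage time under steady traffic. For a request-based SLI the real figure depends on traffic volume.

```python
def remaining_downtime_minutes(objective: float, window_days: int,
                               remaining: float) -> float:
    """Minutes of full outage the remaining budget would still absorb.

    objective: target as a ratio (0.999 for 99.9%).
    remaining: fraction of the budget left; negative values clamp to zero.
    """
    total_minutes = (1 - objective) * window_days * 24 * 60
    return total_minutes * max(remaining, 0.0)


# A 99.9% objective over 30 days allows about 43.2 minutes in total.
total = remaining_downtime_minutes(0.999, 30, 1.0)
# With half the budget spent, about 21.6 minutes remain.
left = remaining_downtime_minutes(0.999, 30, 0.5)
```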
Troubleshooting
Error budget shows > 100%
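A remaining budget above 100% means the SLI ratio is above 1, so good events outnumber total events; merely exceeding the target leaves the budget between 0% and 100%. Check that the good and total queries use the same metric and label selectors, and verify the recording rule window matches your SLO window.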
Burn rate alerts never fire
Check that recording rules for all windows (5m, 1h, 6h, 3d) are deployed. Run promtool check rules to validate.
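Because every alert window needs its own recording rule, generating the rules from one list of windows avoids leaving one out. The sketch below is a standalone illustration, not a script shipped with the framework: the function name is hypothetical, and the metric and rule names are copied from the examples in this post.

```python
# Windows used by the burn rate alerts, plus the 30d SLO window.
WINDOWS = ["5m", "1h", "6h", "3d", "30d"]


def availability_rules(service: str, windows=WINDOWS) -> str:
    """Render a Prometheus rule group with one availability SLI per window."""
    lines = [
        "groups:",
        f"  - name: sli_{service.replace('-', '_')}",
        "    interval: 1m",
        "    rules:",
    ]
    for w in windows:
        lines += [
            f"      - record: sli:availability:ratio_rate{w}",
            "        expr: |",
            f'          sum(rate(http_requests_total{{service="{service}",status!~"5.."}}[{w}]))',
            "          /",
            f'          sum(rate(http_requests_total{{service="{service}"}}[{w}]))',
        ]
    return "\n".join(lines) + "\n"


rules_yaml = availability_rules("api-gateway")
```

Write the result to a file and run `promtool check rules` against it before deploying. Long ranges such as `[30d]` over raw counters can be expensive to evaluate, so treat this as a starting point rather than a tuned setup.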
Grafana shows "No Data"
A 30-day SLI needs 30 days of data after rule deployment. Use 7d windows initially.
This is 1 of 7 resources in the SRE Platform Pro toolkit. Get the complete [SLI/SLO Framework] with all files, templates, and documentation for $39.
Or grab the entire SRE Platform Pro bundle (7 products) for $89 — save 30%.