Thesius Code

Originally published at datanest-stores.pages.dev

SLI/SLO Framework

Stop arguing about reliability in the abstract. This framework gives you what you need to define Service Level Indicators, set Service Level Objectives, calculate error budgets, configure multi-window alerting, and produce executive dashboards that translate uptime into business impact. From Prometheus recording rules to Grafana panels to the Python scripts that tie them together, this is a complete SLI/SLO implementation you can deploy in a day.

Key Features

  • SLI definition templates — Pre-built indicators for availability, latency, throughput, and correctness with Prometheus queries
  • SLO specification schema — YAML-based SLO definitions with targets, measurement windows, and stakeholder metadata
  • Error budget calculator — Python script that computes remaining error budget, burn rate, and projected exhaustion date
  • Multi-window burn rate alerts — Prometheus alerting rules that pair short and long windows (5m/1h and 6h/3d), modeled on the multiwindow, multi-burn-rate approach described in Google's SRE Workbook
  • Grafana dashboard JSON — Import-ready dashboards for SLO status and error budget remaining
  • Executive reporting script — Generates weekly/monthly SLO compliance reports in Markdown
  • SLO negotiation guide — Framework for negotiating targets between product and engineering
  • Error budget policy template — Defines consequences when budget is exhausted (feature freeze, reliability sprint)

Quick Start

unzip sli-slo-framework.zip && cd sli-slo-framework/

# Define your first SLO
cp templates/config.yaml my_slos.yaml
# Edit my_slos.yaml with your service details

# Calculate current error budget
python3 src/sli_slo_framework/core.py budget \
  --config my_slos.yaml \
  --prometheus-url https://prometheus.example.com

# Generate executive report
python3 src/sli_slo_framework/core.py report \
  --config my_slos.yaml --window 30 --output reports/monthly.md

Architecture / How It Works

  1. Define — Write SLO specs in YAML: SLIs, targets, and measurement windows
  2. Measure — Prometheus recording rules continuously compute SLI values
  3. Alert — Multi-window burn rate alerts fire when budget is consumed too fast
  4. Report — Dashboards and scripts produce status for engineers and executives

Usage Examples

SLO Definition

# my_slos.yaml
slos:
  - name: api-gateway-availability
    service: api-gateway
    sli:
      type: availability
      good_events: 'http_requests_total{service="api-gateway",status!~"5.."}'
      total_events: 'http_requests_total{service="api-gateway"}'
    objective: 99.9
    window: 30d

  - name: api-gateway-latency
    service: api-gateway
    sli:
      type: latency
      good_events: 'http_request_duration_seconds_bucket{service="api-gateway",le="0.3"}'
      # Use the histogram's own _count series so both sides count the same requests
      total_events: 'http_request_duration_seconds_count{service="api-gateway"}'
    objective: 99.0
    window: 30d
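
To make the relationship between a spec and its query concrete, here is a minimal Python sketch that turns one SLO entry into the good/total ratio expression and its error budget fraction. The `SloSpec` class and both helper functions are hypothetical names invented for this illustration, not part of the kit; the spec is written as a literal rather than parsed from YAML to keep the example dependency-free.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloSpec:
    """Hypothetical in-memory form of one entry under `slos:`."""
    name: str
    good_events: str   # PromQL selector for good events
    total_events: str  # PromQL selector for all events
    objective: float   # percent, e.g. 99.9
    window: str        # e.g. "30d"


def ratio_expr(spec: SloSpec, rate_window: str = "5m") -> str:
    """Build the good/total ratio expression for a given rate window."""
    return (
        f"sum(rate({spec.good_events}[{rate_window}]))"
        f" / sum(rate({spec.total_events}[{rate_window}]))"
    )


def error_budget_fraction(spec: SloSpec) -> float:
    """Allowed failure fraction: a 99.9 objective leaves roughly 0.001."""
    return 1 - spec.objective / 100


availability = SloSpec(
    name="api-gateway-availability",
    good_events='http_requests_total{service="api-gateway",status!~"5.."}',
    total_events='http_requests_total{service="api-gateway"}',
    objective=99.9,
    window="30d",
)

print(ratio_expr(availability))
print(f"{error_budget_fraction(availability):.4f}")
```

Passing a different `rate_window` (for example `"1h"`) yields the expression for another burn rate window from the same spec.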

Prometheus Recording Rules for SLIs

groups:
  - name: sli_api_gateway
    interval: 1m
    rules:
      - record: sli:availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{service="api-gateway",status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="api-gateway"}[5m]))
      - record: sli:latency:ratio_rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{service="api-gateway",le="0.3"}[5m]))
          /
          sum(rate(http_request_duration_seconds_count{service="api-gateway"}[5m]))
      # Assumes sli:availability:ratio_rate30d is recorded by a rule like the 5m one, over a 30d window
      - record: slo:error_budget_remaining:ratio
        expr: |
          1 - ((1 - sli:availability:ratio_rate30d) / (1 - 0.999))
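
The error budget expression in the last rule is easier to sanity-check with plain numbers. The sketch below reproduces the same formula in Python; `error_budget_remaining` is a hypothetical helper written for this illustration, and both arguments are ratios (0.999 rather than 99.9).

```python
def error_budget_remaining(sli: float, objective: float) -> float:
    """Fraction of the error budget left: 1.0 is untouched, 0.0 is spent.

    Mirrors the recording rule: 1 - ((1 - sli) / (1 - objective)).
    """
    if not 0 <= objective < 1:
        raise ValueError("objective must be a ratio in [0, 1)")
    return 1 - (1 - sli) / (1 - objective)


# 99.95% measured against a 99.9% target: half the budget is spent.
print(f"{error_budget_remaining(0.9995, 0.999):.2f}")

# 99.8% measured against the same target: the budget is overspent.
print(f"{error_budget_remaining(0.998, 0.999):.2f}")
```

A negative result means the service has consumed more than its budget for the window, and a result can never exceed 1.0 while the SLI itself stays at or below 1.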

Multi-Window Burn Rate Alerts

groups:
  - name: slo_burn_rate_alerts
    rules:
      - alert: SLOHighBurnRate_5m         # Fast: 14.4x rate, 5m/1h windows
        expr: |
          (1 - sli:availability:ratio_rate5m) > (14.4 * (1 - 0.999))
          and
          (1 - sli:availability:ratio_rate1h) > (14.4 * (1 - 0.999))
        for: 2m
        labels:
          severity: critical
      - alert: SLOHighBurnRate_6h         # Slow: 3x rate, 6h/3d windows
        expr: |
          (1 - sli:availability:ratio_rate6h) > (3 * (1 - 0.999))
          and
          (1 - sli:availability:ratio_rate3d) > (3 * (1 - 0.999))
        for: 15m
        labels:
          severity: warning
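
The logic behind these two rules can be sketched in a few lines of Python. This is an illustration of the multi-window idea, not code from the kit: the function names are hypothetical, and the 30-day SLO window is taken from the examples in this article.

```python
def burn_rate(error_ratio: float, objective: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / (1 - objective)


def should_alert(short_sli: float, long_sli: float,
                 objective: float, factor: float) -> bool:
    """Fire only when BOTH windows burn faster than the factor.

    The long window shows the problem is significant; the short window
    shows it is still happening, so the alert resets soon after recovery.
    """
    return (burn_rate(1 - short_sli, objective) > factor
            and burn_rate(1 - long_sli, objective) > factor)


def budget_consumed(factor: float, hours: float, slo_window_days: int = 30) -> float:
    """Fraction of the whole budget burned if `factor` holds for `hours`."""
    return factor * hours / (slo_window_days * 24)


# Sustained outage: both the 5m and 1h windows are far over budget.
print(should_alert(0.98, 0.985, objective=0.999, factor=14.4))

# Brief spike: the 5m window is bad, but the 1h window is still healthy.
print(should_alert(0.98, 0.9995, objective=0.999, factor=14.4))

# Where 14.4 comes from: that rate for one hour burns 2% of a 30-day budget.
print(f"{budget_consumed(14.4, hours=1):.1%}")
```

Requiring both windows is what keeps a short blip from paging anyone: the second call returns False because the hourly error ratio is still below the threshold.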

Error Budget Calculator

from sli_slo_framework.core import ErrorBudgetCalculator

calc = ErrorBudgetCalculator(prometheus_url="https://prometheus.example.com")
budget = calc.compute(
    sli_query='sli:availability:ratio_rate30d{service="api-gateway"}',
    objective=0.999, window_days=30,
)
print(f"Remaining: {budget.remaining_pct:.1%} | Burn: {budget.burn_rate:.1f}x")
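
The calculator is also described as projecting an exhaustion date. How it does so is not shown here, so the following is only a sketch of one simple way to do it, under the assumption that the current burn rate stays constant; `projected_exhaustion` is a hypothetical function, not the kit's API. It also ignores that a rolling window regains budget as old errors age out.

```python
from datetime import datetime, timedelta
from typing import Optional


def projected_exhaustion(remaining: float, burn_rate: float,
                         window_days: int, now: datetime) -> Optional[datetime]:
    """Estimate when the budget reaches zero if the burn rate holds.

    At burn rate 1.0 a full budget lasts exactly `window_days`, so a
    fraction `remaining` lasts remaining * window_days / burn_rate days.
    """
    if remaining <= 0:
        return now   # already exhausted
    if burn_rate <= 0:
        return None  # nothing is burning, so there is no projected date
    return now + timedelta(days=remaining * window_days / burn_rate)


# Half the budget left, burning at 3x, 30-day window: five days to go.
print(projected_exhaustion(0.5, 3.0, 30, now=datetime(2024, 1, 1)))
```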

Configuration

# config.example.yaml
framework:
  prometheus_url: https://prometheus.example.com
  grafana_url: https://grafana.example.com

defaults:
  window: 30d
  burn_rate_windows:
    fast: { short: 5m, long: 1h, factor: 14.4 }
    slow: { short: 6h, long: 3d, factor: 3.0 }

reporting:
  schedule: weekly
  format: markdown
  output_dir: reports/

error_budget_policy:
  green: "Normal velocity (>50% budget)"
  yellow: "Prioritize reliability (25-50%)"
  red: "Feature freeze (0-25%)"
  exhausted: "Emergency reliability sprint"
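
The policy tiers map directly onto the remaining budget. A minimal sketch of that mapping follows; `policy_state` is a hypothetical helper, and the handling of the boundaries is an assumption, since the config lists 25% under both yellow and red. Here exactly 50% and exactly 25% are both treated as yellow.

```python
def policy_state(remaining: float) -> str:
    """Map remaining budget (fraction, 1.0 = full) to a policy tier."""
    if remaining > 0.50:
        return "green"      # normal velocity
    if remaining >= 0.25:
        return "yellow"     # prioritize reliability (boundary choice is an assumption)
    if remaining > 0:
        return "red"        # feature freeze
    return "exhausted"      # emergency reliability sprint


for value in (0.8, 0.5, 0.1, -0.2):
    print(f"{value:+.0%} -> {policy_state(value)}")
```

Whichever way the boundaries are drawn, writing them down once in code or in the policy document avoids a debate at the moment the budget crosses a line.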

Best Practices

  • Start with availability and latency — these two SLIs cover most of what users experience as reliability
  • Set objectives based on user tolerance, not system capability
  • Alert on burn rate, not raw SLI — a brief dip is fine if you're within budget
  • Report to leadership monthly — translate uptime into "hours of downtime remaining"
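
Translating an objective into downtime is simple arithmetic, sketched below. Both function names are hypothetical, and the calculation assumes the budget is spent entirely as full outages, which is the framing most leadership reports use.

```python
def allowed_downtime_minutes(objective_pct: float, window_days: int) -> float:
    """Total downtime the SLO permits over the window, in minutes."""
    return window_days * 24 * 60 * (100 - objective_pct) / 100


def remaining_downtime_minutes(objective_pct: float, window_days: int,
                               budget_remaining: float) -> float:
    """Downtime still available, given the remaining budget fraction."""
    total = allowed_downtime_minutes(objective_pct, window_days)
    return total * max(budget_remaining, 0.0)


# 99.9% over 30 days allows about 43 minutes of full downtime.
print(f"{allowed_downtime_minutes(99.9, 30):.1f} min")

# With half the budget left, about 22 minutes remain.
print(f"{remaining_downtime_minutes(99.9, 30, 0.5):.1f} min")
```

At 99.9% the whole monthly budget is well under an hour, so minutes are usually the more honest unit for the report; hours only become natural at targets around 99% and below.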

Troubleshooting

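Error budget shows > 100%
With the recording rule shown above, remaining budget can only exceed 100% when the SLI ratio is above 1, meaning good events outnumber total events. Check that the numerator and denominator come from the same metric and label set, then verify the recording rule window matches your SLO window.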

Burn rate alerts never fire
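Check that recording rules for all windows (5m, 1h, 6h, 3d) are deployed; an alert that references a series nobody records evaluates to nothing and stays silent. Run promtool check rules against each rule file to validate its syntax.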
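
One way to spot the gap before deploying is to compare the series your alerts reference against the series your rules record. The sketch below does this with a regular expression over the expressions as strings. It is a rough illustration with hypothetical names: it relies on recorded series using colon-separated names, and it would also pick up colons inside label values such as host:port.

```python
import re

# Recorded series follow the level:metric:operations convention, so a
# name containing colons is treated as a reference to a recording rule.
RECORDED_NAME = re.compile(r"\b[a-zA-Z_][a-zA-Z0-9_]*(?::[a-zA-Z0-9_]+)+\b")


def missing_recordings(alert_exprs, recorded):
    """Return referenced recorded-series names that no rule produces."""
    referenced = set()
    for expr in alert_exprs:
        referenced.update(RECORDED_NAME.findall(expr))
    return sorted(referenced - set(recorded))


alert_exprs = [
    "(1 - sli:availability:ratio_rate5m) > (14.4 * (1 - 0.999)) and "
    "(1 - sli:availability:ratio_rate1h) > (14.4 * (1 - 0.999))",
    "(1 - sli:availability:ratio_rate6h) > (3 * (1 - 0.999)) and "
    "(1 - sli:availability:ratio_rate3d) > (3 * (1 - 0.999))",
]
recorded = {"sli:availability:ratio_rate5m", "sli:latency:ratio_rate5m"}

for name in missing_recordings(alert_exprs, recorded):
    print("not recorded:", name)
```

With only the 5m rules deployed, the 1h, 6h, and 3d series are reported as missing, which is exactly the situation in which neither alert can fire.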

Grafana shows "No Data"
A 30-day SLI needs 30 days of data after rule deployment. Use 7d windows initially.


This is 1 of 7 resources in the SRE Platform Pro toolkit. Get the complete SLI/SLO Framework with all files, templates, and documentation for $39.

Get the Full Kit →

Or grab the entire SRE Platform Pro bundle (7 products) for $89 — save 30%.

Get the Complete Bundle →

