DatanestDigital

Posted on Mar 23 • Originally published at datanest-stores.pages.dev

SLI/SLO Framework

#kubernetes #devops #monitoring #sre

Stop arguing about reliability in the abstract. This framework gives you everything needed to define Service Level Indicators, set Service Level Objectives, calculate error budgets, configure multi-window alerting, and build executive dashboards that translate uptime into business impact. From Prometheus recording rules to Grafana panels to the Python scripts that tie them together, this is a complete SLI/SLO implementation you can deploy in a day.

Key Features

SLI definition templates — Pre-built indicators for availability, latency, throughput, and correctness with Prometheus queries
SLO specification schema — YAML-based SLO definitions with targets, measurement windows, and stakeholder metadata
Error budget calculator — Python script that computes remaining error budget, burn rate, and projected exhaustion date
Multi-window burn rate alerts — Prometheus alerting rules implementing Google's recommended 5m/1h/6h/3d burn rate windows
Grafana dashboard JSON — Import-ready dashboards for SLO status and error budget remaining
Executive reporting script — Generates weekly/monthly SLO compliance reports in Markdown
SLO negotiation guide — Framework for negotiating targets between product and engineering
Error budget policy template — Defines consequences when budget is exhausted (feature freeze, reliability sprint)

Quick Start

unzip sli-slo-framework.zip && cd sli-slo-framework/

# Define your first SLO
cp templates/config.yaml my_slos.yaml
# Edit my_slos.yaml with your service details

# Calculate current error budget
python3 src/sli_slo_framework/core.py budget \
  --config my_slos.yaml \
  --prometheus-url https://prometheus.example.com

# Generate executive report
python3 src/sli_slo_framework/core.py report \
  --config my_slos.yaml --window 30 --output reports/monthly.md

Architecture / How It Works

Define — Write SLO specs in YAML: SLIs, targets, and measurement windows
Measure — Prometheus recording rules continuously compute SLI values
Alert — Multi-window burn rate alerts fire when budget is consumed too fast
Report — Dashboards and scripts produce status for engineers and executives

Usage Examples

SLO Definition

# my_slos.yaml
slos:
  - name: api-gateway-availability
    service: api-gateway
    sli:
      type: availability
      good_events: 'http_requests_total{service="api-gateway",status!~"5.."}'
      total_events: 'http_requests_total{service="api-gateway"}'
    objective: 99.9
    window: 30d

  - name: api-gateway-latency
    service: api-gateway
    sli:
      type: latency
      good_events: 'http_request_duration_seconds_bucket{service="api-gateway",le="0.3"}'
      # Count requests from the same histogram so numerator and denominator match
      total_events: 'http_request_duration_seconds_count{service="api-gateway"}'
    objective: 99.0
    window: 30d
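
As a rough illustration of how a spec like the one above might be consumed, the sketch below mirrors one entry as a plain Python dict, validates it, and derives the error budget fraction it implies. The `SLOSpec` class and the `parse_window` and `load_spec` helpers are hypothetical names for this example, not the framework's API.

```python
from dataclasses import dataclass
from datetime import timedelta

# Hypothetical mapping from window suffix to timedelta keyword.
_UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def parse_window(window: str) -> timedelta:
    """Convert a window string such as "30d" or "6h" into a timedelta."""
    if len(window) < 2 or window[-1] not in _UNITS or not window[:-1].isdigit():
        raise ValueError(f"unsupported window: {window!r}")
    return timedelta(**{_UNITS[window[-1]]: int(window[:-1])})


@dataclass(frozen=True)
class SLOSpec:
    name: str
    objective: float   # percentage, e.g. 99.9, as written in the YAML
    window: timedelta

    @property
    def error_budget(self) -> float:
        """Fraction of events allowed to be bad, e.g. 0.001 for 99.9%."""
        return 1 - self.objective / 100


def load_spec(raw: dict) -> SLOSpec:
    """Validate one SLO entry and turn it into an SLOSpec."""
    objective = float(raw["objective"])
    if not 0 < objective < 100:
        raise ValueError("objective must be strictly between 0 and 100")
    return SLOSpec(raw["name"], objective, parse_window(raw["window"]))


spec = load_spec(
    {"name": "api-gateway-availability", "objective": 99.9, "window": "30d"}
)
print(spec.name, spec.window.days, f"{spec.error_budget:.4f}")
```

Rejecting an objective of exactly 100 is deliberate in this sketch: a 100% target leaves no error budget at all, which makes every later budget calculation divide by zero.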

Prometheus Recording Rules for SLIs

groups:
  - name: sli_api_gateway
    interval: 1m
    rules:
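      # Aggregate by service so the recorded series keep a service label to select on
      - record: sli:availability:ratio_rate5m
        expr: |
          sum by (service) (rate(http_requests_total{service="api-gateway",status!~"5.."}[5m]))
          /
          sum by (service) (rate(http_requests_total{service="api-gateway"}[5m]))
      # Use the histogram's own _count series as the latency denominator
      - record: sli:latency:ratio_rate5m
        expr: |
          sum by (service) (rate(http_request_duration_seconds_bucket{service="api-gateway",le="0.3"}[5m]))
          /
          sum by (service) (rate(http_request_duration_seconds_count{service="api-gateway"}[5m]))
      # 30d availability ratio; the 1h, 6h, and 3d rules used by the alerts follow the same pattern
      - record: sli:availability:ratio_rate30d
        expr: |
          sum by (service) (rate(http_requests_total{service="api-gateway",status!~"5.."}[30d]))
          /
          sum by (service) (rate(http_requests_total{service="api-gateway"}[30d]))
      # Fraction of the 30d error budget still unspent for a 99.9% objective
      - record: slo:error_budget_remaining:ratio
        expr: |
          1 - ((1 - sli:availability:ratio_rate30d) / (1 - 0.999))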
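
To make the last rule's arithmetic concrete, here is a minimal Python sketch of the same formula, `1 - (1 - SLI) / (1 - objective)`. The function name `error_budget_remaining` is an assumption made for this illustration.

```python
def error_budget_remaining(sli_ratio: float, objective: float) -> float:
    """Mirror of the recording rule: 1 - (1 - SLI) / (1 - objective).

    Both arguments are ratios in [0, 1], e.g. 0.9995 and 0.999.
    """
    if not 0 < objective < 1:
        raise ValueError("objective must be a ratio strictly between 0 and 1")
    return 1 - (1 - sli_ratio) / (1 - objective)


# Walk from a perfect SLI down to one that overspends the budget.
for sli in (1.0, 0.9995, 0.999, 0.998):
    remaining = error_budget_remaining(sli, 0.999)
    print(f"SLI {sli:.4f} -> remaining {remaining:+.0%}")
```

Against a 99.9% objective, an SLI of 0.9995 leaves half the budget, 0.999 leaves none, and 0.998 has spent the budget twice over, which shows up as a negative value.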

Multi-Window Burn Rate Alerts

groups:
  - name: slo_burn_rate_alerts
    rules:
      - alert: SLOHighBurnRate_5m         # Fast: 14.4x rate, 5m/1h windows
        expr: |
          (1 - sli:availability:ratio_rate5m) > (14.4 * (1 - 0.999))
          and
          (1 - sli:availability:ratio_rate1h) > (14.4 * (1 - 0.999))
        for: 2m
        labels:
          severity: critical
      - alert: SLOHighBurnRate_6h         # Slow: 3x rate, 6h/3d windows
        expr: |
          (1 - sli:availability:ratio_rate6h) > (3 * (1 - 0.999))
          and
          (1 - sli:availability:ratio_rate3d) > (3 * (1 - 0.999))
        for: 15m
        labels:
          severity: warning
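
The alert expressions above can be read as a two-window check: an alert fires only when both the short and the long window burn faster than the chosen factor. The sketch below restates that logic in Python and shows how much of a 30-day budget each factor consumes over its long window. The function names are hypothetical, and the 30-day window is taken from the SLO examples in this post.

```python
SLO_WINDOW_HOURS = 30 * 24  # 30d SLO window, as in the SLO definitions


def should_alert(short_err: float, long_err: float,
                 objective: float, factor: float) -> bool:
    """True only when both windows exceed factor x the budgeted error ratio."""
    threshold = factor * (1 - objective)
    return short_err > threshold and long_err > threshold


def budget_spent(factor: float, long_window_hours: float) -> float:
    """Fraction of the 30d budget consumed if the burn lasts the long window."""
    return factor * long_window_hours / SLO_WINDOW_HOURS


# Sustained problem: the 5m and 1h error ratios are both above 14.4x budget.
print(should_alert(short_err=0.02, long_err=0.016, objective=0.999, factor=14.4))
# Brief spike: the 5m window is hot but the 1h window is still healthy.
print(should_alert(short_err=0.02, long_err=0.001, objective=0.999, factor=14.4))

print(f"fast: {budget_spent(14.4, 1):.0%} of budget in 1h")
print(f"slow: {budget_spent(3.0, 72):.0%} of budget in 3d")
```

Requiring the short window as well as the long one means the alert stops firing soon after the problem is fixed, instead of lingering until the long window drains.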

Error Budget Calculator

from sli_slo_framework.core import ErrorBudgetCalculator

calc = ErrorBudgetCalculator(prometheus_url="https://prometheus.example.com")
budget = calc.compute(
    sli_query='sli:availability:ratio_rate30d{service="api-gateway"}',
    objective=0.999, window_days=30,
)
print(f"Remaining: {budget.remaining_pct:.1%} | Burn: {budget.burn_rate:.1f}x")
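
The calculator's internals are not shown here, so the following is only a sketch of one way a projected exhaustion date could be derived from the two numbers printed above. It assumes the burn rate stays constant and that `remaining` is a fraction between 0 and 1; `days_until_exhaustion` and `exhaustion_date` are hypothetical helpers, not part of `sli_slo_framework`.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def days_until_exhaustion(remaining: float, burn_rate: float,
                          window_days: int) -> Optional[float]:
    """Days left before the budget reaches zero at a constant burn rate.

    At a burn rate of 1.0 the full budget lasts exactly window_days, so a
    fraction `remaining` lasts remaining * window_days / burn_rate.
    """
    if remaining <= 0:
        return 0.0   # budget is already exhausted
    if burn_rate <= 0:
        return None  # nothing is being spent, so there is no exhaustion date
    return remaining * window_days / burn_rate


def exhaustion_date(remaining: float, burn_rate: float, window_days: int,
                    now: datetime) -> Optional[datetime]:
    """Timestamp at which the budget would run out, or None if it never does."""
    days = days_until_exhaustion(remaining, burn_rate, window_days)
    return None if days is None else now + timedelta(days=days)


# Fixed timestamp so the example is reproducible.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(exhaustion_date(remaining=0.4, burn_rate=2.0, window_days=30, now=now))
```

With 40% of a 30-day budget left and a 2x burn rate, the projection lands about six days out.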

Configuration

# config.example.yaml
framework:
  prometheus_url: https://prometheus.example.com
  grafana_url: https://grafana.example.com

defaults:
  window: 30d
  burn_rate_windows:
    fast: { short: 5m, long: 1h, factor: 14.4 }
    slow: { short: 6h, long: 3d, factor: 3.0 }

reporting:
  schedule: weekly
  format: markdown
  output_dir: reports/

error_budget_policy:
  green: "Normal velocity (>50% budget)"
  yellow: "Prioritize reliability (25-50%)"
  red: "Feature freeze (0-25%)"
  exhausted: "Emergency reliability sprint"
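
The `error_budget_policy` tiers lend themselves to a small lookup. The sketch below maps a remaining-budget fraction to a tier. The config leaves the boundaries ambiguous, so two assumptions are made here: exactly 25% counts as yellow, and anything at or below 0% counts as exhausted. `policy_tier` is a hypothetical helper.

```python
# Tier descriptions copied from the error_budget_policy section of the config.
POLICY = {
    "green": "Normal velocity (>50% budget)",
    "yellow": "Prioritize reliability (25-50%)",
    "red": "Feature freeze (0-25%)",
    "exhausted": "Emergency reliability sprint",
}


def policy_tier(remaining: float) -> str:
    """Map remaining budget (fraction, may be negative) to a policy tier.

    Assumptions: 0.25 is treated as yellow, and <= 0 as exhausted.
    """
    if remaining <= 0:
        return "exhausted"
    if remaining < 0.25:
        return "red"
    if remaining <= 0.50:
        return "yellow"
    return "green"


for value in (0.80, 0.50, 0.10, -0.30):
    tier = policy_tier(value)
    print(f"{value:+.0%} -> {tier}: {POLICY[tier]}")
```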

Best Practices

Start with availability and latency: these two SLIs cover most user-facing reliability concerns
Set objectives based on user tolerance, not system capability
Alert on burn rate, not raw SLI — a brief dip is fine if you're within budget
Report to leadership monthly — translate uptime into "hours of downtime remaining"
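
For the leadership translation in the last point, one possible approach is to convert the remaining budget fraction into the amount of full outage it still covers. The sketch below does that in minutes, since a 99.9% objective leaves well under an hour per 30 days. Both function names are hypothetical.

```python
def downtime_remaining_minutes(objective: float, window_days: int,
                               remaining: float) -> float:
    """Minutes of full outage that the remaining budget still covers.

    objective is a ratio (0.999), remaining is a fraction of the budget.
    """
    total_minutes = window_days * 24 * 60
    return max(remaining, 0.0) * (1 - objective) * total_minutes


def leadership_line(service: str, objective: float, window_days: int,
                    remaining: float) -> str:
    """One-sentence summary suitable for a monthly report."""
    minutes = downtime_remaining_minutes(objective, window_days, remaining)
    return (
        f"{service}: {remaining:.0%} of error budget left, about "
        f"{minutes:.0f} minutes of full downtime in this {window_days}-day window"
    )


print(leadership_line("api-gateway", 0.999, 30, 0.5))
```

A 99.9% objective over 30 days allows 43.2 minutes of full outage in total, so half the budget corresponds to roughly 22 minutes.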

Troubleshooting

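Error budget shows > 100%
Remaining budget tops out at 100% when the SLI is exactly 1, so a larger value means the recorded SLI ratio is above 1. Check that the good-events and total-events queries use matching selectors, and verify the recording rule window matches your SLO window.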

Burn rate alerts never fire
Check that recording rules for all windows (5m, 1h, 6h, 3d) are deployed. Run promtool check rules to validate.

Grafana shows "No Data"
A 30-day SLI needs 30 days of data after rule deployment. Use 7d windows initially.

This is 1 of 7 resources in the SRE Platform Pro toolkit. Get the complete SLI/SLO Framework with all files, templates, and documentation for $39.

Or grab the entire SRE Platform Pro bundle (7 products) for $89 and save 30%.
