Error Budgets in Practice: A No-BS Guide

#sre #slo #reliability #devops

Everyone Talks About SLOs, Nobody Talks About Error Budgets

Every SRE conference has talks about SLOs. Set a target! 99.9%! Three nines! Standing ovation.

But the actually useful part — what you DO when you're burning your error budget — rarely gets discussed.

What's an Error Budget, Really?

If your SLO is 99.9% availability, your error budget is 0.1%. That's:

Monthly error budget at 99.9% (30-day month):
  43.2 minutes of downtime

Or in request terms (1M requests/day):
  ~1,000 failed requests per day
  ~30,000 failed requests per month
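If you want to rerun those numbers for your own SLO, the arithmetic fits in a few lines. This is a minimal sketch, not a standard API: the function name `error_budget` and its parameters are made up for illustration, and it assumes a 30-day month and flat traffic of 1M requests/day, matching the figures above.

```python
def error_budget(slo_target, period_days=30, requests_per_day=1_000_000):
    """Translate an SLO target into allowed downtime and allowed failed requests.

    Assumes a fixed-length period and the same request volume every day.
    """
    budget_fraction = 1 - slo_target  # e.g., 0.001 for 99.9%
    return {
        "downtime_minutes": round(budget_fraction * period_days * 24 * 60, 1),
        "failed_requests_per_day": round(budget_fraction * requests_per_day),
        "failed_requests_per_period": round(
            budget_fraction * requests_per_day * period_days
        ),
    }


print(error_budget(0.999))
# {'downtime_minutes': 43.2, 'failed_requests_per_day': 1000, 'failed_requests_per_period': 30000}
```

Swap in your own target and traffic to see how quickly an extra nine shrinks the budget.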

The error budget isn't a target to hit. It's fuel you spend on shipping features.

The Error Budget Policy

This is the document that actually matters. Ours looks like this:

error_budget_policy:
  budget_remaining:
    above_50_percent:
      deploy_frequency: "unlimited"
      feature_vs_reliability: "80/20"
      review_cadence: "monthly"
    25_to_50_percent:
      deploy_frequency: "2x daily max"
      feature_vs_reliability: "50/50"
      review_cadence: "weekly"
    10_to_25_percent:
      deploy_frequency: "1x daily, with canary"
      feature_vs_reliability: "20/80"
      review_cadence: "daily standup"
    below_10_percent:
      deploy_frequency: "emergency only"
      feature_vs_reliability: "0/100"
      review_cadence: "war room until recovered"
    exhausted:
      action: "feature freeze until budget replenishes"
      escalation: "VP Engineering notified"

Making It Real: The Dashboard

We built a simple burn rate dashboard:

def calculate_burn_rate(slo_target, window_hours, error_count, total_count):
    """Calculate how fast we're burning error budget.

    The budget here is sized to the traffic seen in the window, so
    budget_consumed_pct is the share of that window's budget already spent.
    """
    if total_count <= 0:
        raise ValueError("total_count must be positive")

    error_rate = error_count / total_count
    budget = 1 - slo_target  # e.g., 0.001 for 99.9%
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")

    # Burn rate: how many times faster than allowed
    # burn_rate of 1.0 = exactly on budget
    # burn_rate of 2.0 = burning 2x too fast
    burn_rate = error_rate / budget

    # Time until budget exhausted at current rate (0 if already overspent)
    budget_remaining = max(budget - error_rate, 0.0)
    hours_left = (budget_remaining / error_rate) * window_hours if error_rate > 0 else float('inf')

    return {
        'burn_rate': round(burn_rate, 2),
        'hours_until_exhausted': round(hours_left, 1),
        'budget_consumed_pct': round(burn_rate * 100, 1)
    }

# Example
result = calculate_burn_rate(
    slo_target=0.999,
    window_hours=24,
    error_count=500,
    total_count=1_000_000
)
print(result)
# {'burn_rate': 0.5, 'hours_until_exhausted': 24.0, 'budget_consumed_pct': 50.0}
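One caveat: the function above measures everything against the observation window. If your SLO is defined over a month, you usually also want to know what a burn rate means for the monthly budget. Here is a small sketch of that conversion. Both helper names are hypothetical, and it assumes a 30-day SLO period with roughly even traffic.

```python
def budget_spent_in_window(burn_rate, window_hours, slo_period_hours=30 * 24):
    """Percent of the whole period's budget spent during one window.

    Assumes traffic is spread evenly across the SLO period.
    """
    return burn_rate * (window_hours / slo_period_hours) * 100


def hours_to_exhaust_full_budget(burn_rate, slo_period_hours=30 * 24):
    """Hours a full budget would last if this burn rate were sustained."""
    if burn_rate <= 0:
        return float("inf")
    return slo_period_hours / burn_rate


# A day at burn rate 0.5 uses under 2% of a 30-day budget
print(round(budget_spent_in_window(0.5, 24), 2))  # 1.67
print(hours_to_exhaust_full_budget(0.5))          # 1440.0 (60 days)
```

So a burn rate of 0.5 is comfortable: sustained for the whole month, it spends half the budget. The same math shows why a short spike at a high burn rate deserves a page even when the daily average looks fine.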

The Hardest Part: Enforcing the Freeze

When you tell a product manager "no more features this month because we burned our error budget," expect pushback. Here's how to handle it:

Make it data-driven: Show the burn rate chart. Numbers don't argue.
Connect to money: "Our SLO breach cost us $X in SLA credits last quarter."
Make it a team agreement: The error budget policy should be signed off by engineering AND product leadership.
Automate the gates: CI/CD should block non-critical deploys automatically.

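# Example: GitLab CI gate
# rules:if has no numeric operators such as "<", so match 0 to 9.99 with a regex.
# Assumes ERROR_BUDGET_REMAINING is a CI/CD variable holding a percentage (0-100),
# set before the pipeline is created (rules are evaluated at creation time).
deploy_production:
  script:
    - echo "run your deploy command here"  # placeholder
  rules:
    - if: '$ERROR_BUDGET_REMAINING =~ /^[0-9](\.[0-9]+)?$/'
      when: manual  # Require manual approval
      allow_failure: false
    - when: on_success  # Auto-deploy when budget healthy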
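If your CI system can't express the check natively, a small script run as the first step of the deploy job can do the same thing. This is a sketch under assumptions: it reads the same `ERROR_BUDGET_REMAINING` variable as a percentage, while `EMERGENCY_DEPLOY` and the function names are hypothetical and not part of any CI product.

```python
import os
import sys


def deploy_allowed(budget_remaining_pct, is_emergency=False, threshold_pct=10.0):
    """Return True when an automatic deploy may proceed."""
    if is_emergency:
        return True  # the policy still allows emergency deploys below 10%
    return budget_remaining_pct >= threshold_pct


def gate_exit_code(env=None):
    """Return 0 to allow the deploy, 1 to block it. Fails closed on bad input."""
    env = os.environ if env is None else env
    raw = env.get("ERROR_BUDGET_REMAINING")
    try:
        remaining = float(raw)
    except (TypeError, ValueError):
        print("ERROR_BUDGET_REMAINING missing or not a number; blocking deploy",
              file=sys.stderr)
        return 1
    # EMERGENCY_DEPLOY is a hypothetical override flag, not a built-in variable
    emergency = env.get("EMERGENCY_DEPLOY") == "true"
    if deploy_allowed(remaining, emergency):
        print(f"Deploy allowed: {remaining:.1f}% of error budget remaining")
        return 0
    print(f"Deploy blocked: only {remaining:.1f}% of error budget remaining")
    return 1
```

A CI step would call `sys.exit(gate_exit_code())` so a non-zero exit stops the job. Failing closed when the variable is missing is a deliberate choice here: a broken metrics pipeline should not quietly turn the gate off.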

The Cultural Shift

The real value of error budgets isn't technical. They reframe reliability from "SRE's job" to "everyone's job." When developers see that their buggy deploy consumed 30% of the monthly error budget, they start writing better tests.

It took us about two quarters to make this cultural shift, but now our product teams actively ask about error budget status before planning sprints.

If you're struggling to operationalize SLOs and error budgets, check out what we're building at Nova AI Ops.

Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
