Everyone Talks About SLOs, Nobody Talks About Error Budgets
Every SRE conference has talks about SLOs. Set a target! 99.9%! Three nines! Standing ovation.
But the actually useful part — what you DO when you're burning your error budget — rarely gets discussed.
What's an Error Budget, Really?
If your SLO is 99.9% availability, your error budget is 0.1%. That's:
Monthly error budget at 99.9% (30-day month):
  43.2 minutes of downtime
Or in request terms (1M requests/day):
  ~1,000 failed requests per day
  ~30,000 failed requests per month
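The arithmetic behind those numbers is simple enough to script. Here is a minimal sketch; the function name `error_budget` and its parameters are illustrative, and it assumes a 30-day month and steady traffic of 1M requests per day, as in the figures above.

```python
def error_budget(slo_target: float, period_days: int = 30,
                 requests_per_day: int = 1_000_000) -> dict:
    """Translate an SLO target into allowed downtime and allowed failed requests."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    budget_fraction = 1 - slo_target           # 0.001 for a 99.9% SLO
    period_minutes = period_days * 24 * 60     # 43,200 minutes in 30 days
    return {
        "downtime_minutes": round(budget_fraction * period_minutes, 1),
        "failed_requests_per_day": round(budget_fraction * requests_per_day),
        "failed_requests_per_period": round(
            budget_fraction * requests_per_day * period_days
        ),
    }


print(error_budget(0.999))
# {'downtime_minutes': 43.2, 'failed_requests_per_day': 1000, 'failed_requests_per_period': 30000}
```

Running it with a different target (say 0.9999) shows how quickly the budget shrinks with each extra nine.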
The error budget isn't a target to hit. It's fuel you spend on shipping features.
The Error Budget Policy
This is the document that actually matters. Ours looks like this:
error_budget_policy:
  budget_remaining:
    above_50_percent:
      deploy_frequency: "unlimited"
      feature_vs_reliability: "80/20"
      review_cadence: "monthly"
    25_to_50_percent:
      deploy_frequency: "2x daily max"
      feature_vs_reliability: "50/50"
      review_cadence: "weekly"
    10_to_25_percent:
      deploy_frequency: "1x daily, with canary"
      feature_vs_reliability: "20/80"
      review_cadence: "daily standup"
    below_10_percent:
      deploy_frequency: "emergency only"
      feature_vs_reliability: "0/100"
      review_cadence: "war room until recovered"
    exhausted:
      action: "feature freeze until budget replenishes"
      escalation: "VP Engineering notified"
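To apply the policy mechanically, something has to map the remaining budget to a tier. A minimal sketch of that lookup follows. The function name `policy_tier` is hypothetical, and the boundary handling is an assumption, since the policy does not say which tier exactly 50%, 25%, or 10% belongs to: here exactly 50% and 25% land in `25_to_50_percent`, and exactly 10% lands in `10_to_25_percent`.

```python
def policy_tier(budget_remaining_pct: float) -> str:
    """Map remaining error budget (percent of the period's budget) to a tier name.

    Tier names mirror the keys in the policy; boundary handling is an assumption.
    """
    if budget_remaining_pct <= 0:
        return "exhausted"
    if budget_remaining_pct < 10:
        return "below_10_percent"
    if budget_remaining_pct < 25:
        return "10_to_25_percent"
    if budget_remaining_pct <= 50:
        return "25_to_50_percent"
    return "above_50_percent"


for pct in (80, 40, 12, 3, 0):
    print(pct, policy_tier(pct))
# 80 above_50_percent
# 40 25_to_50_percent
# 12 10_to_25_percent
# 3 below_10_percent
# 0 exhausted
```

Whatever boundaries you choose, write them into the policy itself so nobody argues about them in the middle of an incident.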
Making It Real: The Dashboard
We built a simple burn rate dashboard. The calculation behind it:
def calculate_burn_rate(slo_target, window_hours, error_count, total_count):
    """Calculate how fast we're burning error budget.

    Simplification: the budget is sized from the traffic seen in the window
    (budget * total_count allowed failures), and the projection assumes
    errors keep arriving at the same rate.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    error_rate = error_count / total_count
    budget = 1 - slo_target  # e.g., 0.001 for 99.9%
    # Burn rate: how many times faster than allowed
    # burn_rate of 1.0 = exactly on budget
    # burn_rate of 2.0 = burning 2x too fast
    burn_rate = error_rate / budget
    # Time until budget exhausted at current rate
    # (0 if it is already spent, infinite if there are no errors)
    budget_remaining = max(budget - error_rate, 0.0)
    hours_left = (budget_remaining / error_rate) * window_hours if error_rate > 0 else float('inf')
    return {
        'burn_rate': round(burn_rate, 2),
        'hours_until_exhausted': round(hours_left, 1),
        'budget_consumed_pct': round(burn_rate * 100, 1)
    }

# Example
result = calculate_burn_rate(
    slo_target=0.999,
    window_hours=24,
    error_count=500,
    total_count=1_000_000
)
print(result)
# {'burn_rate': 0.5, 'hours_until_exhausted': 24.0, 'budget_consumed_pct': 50.0}
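A burn rate measured over a short window says little about the monthly budget until it is scaled by the length of the SLO period. The sketch below does that scaling. It is an illustration only: `project_budget`, `elapsed_hours`, and `period_hours` are hypothetical names, the 30-day period is an assumption, and it assumes traffic and burn rate stay roughly constant.

```python
def project_budget(burn_rate: float, elapsed_hours: float,
                   period_hours: float = 30 * 24) -> dict:
    """Project error budget usage over a full SLO period from a constant burn rate."""
    if burn_rate < 0 or elapsed_hours < 0 or period_hours <= 0:
        raise ValueError("burn_rate and elapsed_hours must be >= 0, period_hours > 0")
    # At burn rate 1.0 the budget lasts exactly one period, so every hour
    # spends burn_rate / period_hours of the period's budget.
    consumed = min(burn_rate * elapsed_hours / period_hours, 1.0)
    if burn_rate == 0:
        hours_left = float("inf")
    else:
        hours_left = (1.0 - consumed) * period_hours / burn_rate
    return {
        "budget_consumed_pct": round(consumed * 100, 1),
        "hours_until_exhausted": round(hours_left, 1),
    }


# A burn rate of 0.5 sustained for 24 hours of a 30-day period
print(project_budget(burn_rate=0.5, elapsed_hours=24))
# {'budget_consumed_pct': 1.7, 'hours_until_exhausted': 1416.0}
```

Seen this way, a day at burn rate 0.5 costs under 2% of the monthly budget, while a burn rate of 10 sustained for the same day would cost a third of it.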
The Hardest Part: Enforcing the Freeze
When you tell a product manager "no more features this month because we burned our error budget," expect pushback. Here's how to handle it:
- Make it data-driven: Show the burn rate chart. Numbers don't argue.
- Connect to money: "Our SLO breach cost us $X in SLA credits last quarter."
- Make it a team agreement: The error budget policy should be signed off by engineering AND product leadership.
- Automate the gates: CI/CD should block non-critical deploys automatically.
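# Example: GitLab CI gate
deploy_production:
  script:
    - ./deploy.sh  # placeholder: your deploy command
  rules:
    # rules:if has no numeric comparison operators, so match values
    # below 10 (a single digit, optionally with decimals) by regex
    - if: '$ERROR_BUDGET_REMAINING =~ /^[0-9](\.[0-9]+)?$/'
      when: manual  # Require manual approval
      allow_failure: false
    - when: on_success  # Auto-deploy when budget healthy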
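The same gate logic can live in a small script that your pipeline calls before deploying. This is a sketch under assumptions, not our production gate: `deploy_decision`, `is_critical`, and the "blocked" outcome for non-critical deploys are illustrative choices, and the 10% threshold mirrors the rule above.

```python
def deploy_decision(budget_remaining_pct: float, is_critical: bool = False,
                    threshold_pct: float = 10.0) -> str:
    """Decide how a deploy should proceed given the remaining error budget.

    Returns "auto" when the budget is healthy, "manual" when a critical
    deploy needs human approval, and "blocked" for non-critical deploys
    while the budget is low.
    """
    if budget_remaining_pct >= threshold_pct:
        return "auto"
    # Budget is low: only critical fixes may proceed, and only with approval.
    return "manual" if is_critical else "blocked"


print(deploy_decision(40))                   # auto
print(deploy_decision(5))                    # blocked
print(deploy_decision(5, is_critical=True))  # manual
```

Keeping the decision in one function means the dashboard, the CI gate, and the sprint planning report all agree on what "budget is low" means.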
The Cultural Shift
The real value of error budgets isn't technical. They reframe reliability from "SRE's job" to "everyone's job." When developers see that their buggy deploy consumed 30% of the monthly error budget, they start writing better tests.
It took us about two quarters to make this cultural shift, but now our product teams actively ask about error budget status before planning sprints.
If you're struggling to operationalize SLOs and error budgets, check out what we're building at Nova AI Ops.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com