SRE Principles: Why 100% Uptime is the Wrong Goal
Google created SRE because traditional ops doesn't scale. If every new service needs more operators, headcount grows with the service count, and that is unsustainable. SREs are software engineers who treat operations as a code problem.
The two rules that make it work:
Rule 1: The 50% toil cap. No more than half an SRE's time on manual ops. The rest goes to automation.
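Rule 2: Error budgets. Pick an SLO. The gap between it and 100% is your error budget: the amount of failure you are allowed. Spend it on velocity, meaning releases and changes that carry risk. That is why 100% is the wrong goal: it leaves no budget to ship anything.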
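To make the budget arithmetic concrete, here is a minimal sketch in Go. The function names `errorBudgetMinutes` and `budgetUsedPct` and the 30-day window are assumptions for illustration, not part of any standard tooling, and it treats all unavailability as full downtime.

```go
package main

import "fmt"

// errorBudgetMinutes returns how many minutes of full downtime an SLO
// allows over a window of windowDays days. (Hypothetical helper.)
func errorBudgetMinutes(sloPct float64, windowDays int) float64 {
	windowMinutes := float64(windowDays) * 24 * 60
	return windowMinutes * (100 - sloPct) / 100
}

// budgetUsedPct returns the share of the error budget consumed, in percent,
// given the availability measured over the same window.
// It assumes sloPct < 100; a 100% SLO has no budget to divide by.
func budgetUsedPct(sloPct, availPct float64) float64 {
	budget := 100 - sloPct
	return (100 - availPct) / budget * 100
}

func main() {
	fmt.Printf("99.95%% SLO over 30 days: %.1f minutes of budget\n",
		errorBudgetMinutes(99.95, 30))
	fmt.Printf("budget used at 99.96%% availability: %.0f%%\n",
		budgetUsedPct(99.95, 99.96))
}
```

A 99.95% SLO over 30 days works out to 21.6 minutes of allowed downtime. Measured availability of 99.96% means 0.04% of a 0.05% budget is gone, or 80% used.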
Here's a quick deployment gate based on error budget:
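package main

import "fmt"

// BudgetStatus is a snapshot of one service's error budget.
type BudgetStatus struct {
	Service       string
	SLOTarget     float64 // target availability, in percent
	CurrentAvail  float64 // measured availability, in percent
	BudgetUsedPct float64 // share of the error budget consumed, in percent
}

// DeployPolicy maps the remaining error budget to a deploy decision.
func (b BudgetStatus) DeployPolicy() string {
	remaining := 100.0 - b.BudgetUsedPct
	switch {
	case remaining > 75:
		return "SHIP IT -- budget healthy"
	case remaining > 50:
		return "PROCEED WITH REVIEW -- budget moderate"
	case remaining > 25:
		return "RELIABILITY ONLY -- budget low"
	default:
		return "FROZEN -- budget critical"
	}
}

func main() {
	status := BudgetStatus{
		Service:       "auth-service",
		SLOTarget:     99.95,
		CurrentAvail:  99.96, // 0.04% unavailable out of a 0.05% budget = 80% used
		BudgetUsedPct: 80.0,
	}
	// 20% of the budget remains, so this prints:
	// auth-service: FROZEN -- budget critical
	fmt.Printf("%s: %s\n", status.Service, status.DeployPolicy())
}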
Wire this into CI and reliability becomes a first-class deploy constraint.
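One way this could look in a pipeline, as a rough sketch: a small Go program that runs as a CI step and fails the job when the budget is frozen. The `BUDGET_USED_PCT` environment variable and the `exitCodeFor` helper are hypothetical. In practice the number would come from your monitoring system, and the 25% threshold simply mirrors the FROZEN case in `DeployPolicy`.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// exitCodeFor returns 0 when deploys may proceed and 1 when they are frozen.
// The 25% threshold matches the default (FROZEN) branch of DeployPolicy.
func exitCodeFor(budgetUsedPct float64) int {
	remaining := 100.0 - budgetUsedPct
	if remaining > 25 {
		return 0
	}
	return 1
}

func main() {
	// BUDGET_USED_PCT is an assumed variable set by an earlier pipeline step.
	raw := os.Getenv("BUDGET_USED_PCT")
	used, err := strconv.ParseFloat(raw, 64)
	if err != nil {
		// Fail closed: if the budget is unknown, do not deploy.
		fmt.Fprintf(os.Stderr, "cannot parse BUDGET_USED_PCT=%q: %v\n", raw, err)
		os.Exit(2)
	}

	code := exitCodeFor(used)
	fmt.Printf("budget used: %.1f%%, gate exit code: %d\n", used, code)
	os.Exit(code)
}
```

Most CI systems treat a non-zero exit code as a failed step, so running this before the deploy step is enough to block the release. Failing closed on a missing or malformed value is a design choice: an unknown budget is treated as no budget.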
Tomorrow: SLIs, SLOs, and SLAs -- the measurement layer.