SRE Principles: Why 100% Uptime is the Wrong Goal

#sre #devops #production #reliability

Google created SRE because traditional ops doesn't scale: if every new service needs more people to run it, headcount grows with the service count, and that's unsustainable. SREs are software engineers who treat operations as a code problem.

The two rules that make it work:

Rule 1: The 50% toil cap. No more than half an SRE's time on manual ops. The rest goes to automation.

Rule 2: Error budgets. Pick an SLO. The gap between it and 100% is your error budget: the amount of unreliability you're allowed. Spend it on velocity, and slow down when it runs low.

Here's a quick deployment gate in Go, driven by how much error budget is left:

package main

import "fmt"

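// BudgetStatus is a snapshot of one service's error budget.
type BudgetStatus struct {
    Service       string
    SLOTarget     float64 // target availability, in percent
    CurrentAvail  float64 // measured availability over the SLO window, in percent
    BudgetUsedPct float64 // share of the error budget already consumed, in percent
}

// DeployPolicy maps the remaining error budget (in percent) to a deployment rule.
func (b BudgetStatus) DeployPolicy() string {
    remaining := 100.0 - b.BudgetUsedPct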
    switch {
    case remaining > 75:
        return "SHIP IT -- budget healthy"
    case remaining > 50:
        return "PROCEED WITH REVIEW -- budget moderate"
    case remaining > 25:
        return "RELIABILITY ONLY -- budget low"
    default:
        return "FROZEN -- budget critical"
    }
}

func main() {
    status := BudgetStatus{
        Service:       "auth-service",
        SLOTarget:     99.95,
        CurrentAvail:  99.96, // 0.04% unavailability against a 0.05% budget = 80% used
        BudgetUsedPct: 80.0,
    }
    fmt.Printf("%s: %s\n", status.Service, status.DeployPolicy())
}