Your 99.99% Uptime SLO Is Probably a Lie — Here's How to Fix It

#aiops #sre #devops #aws

I've been in enough postmortems to know the pattern.

The executive slide says 99.99% uptime. The on-call engineer who
lived through the last three months knows it's closer to 99.7%.
Neither number is wrong exactly. They're measuring completely
different things — and that gap is where trust goes to die.

This is not a theoretical problem. Let me show you the math,
the code that catches it, and the cultural shift that fixes it.

The math nobody does in public

Four nines sounds impressive until you look at what it actually allows:

99.99% uptime over 365 days:
  Total minutes in a year:   525,600
  Allowed downtime (0.01%):  52.6 minutes per year
  Per month:                 4.4 minutes
  Per week:                  1.0 minute
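
If you want to rerun that arithmetic for other targets, here is a minimal sketch. The function name `allowed_downtime_minutes` is only an illustration, and the per-month and per-week figures assume a 365-day year split into 12 and 52 equal parts.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def allowed_downtime_minutes(target_pct: float,
                             window_minutes: float = MINUTES_PER_YEAR) -> float:
    """Downtime budget in minutes for an availability target over a window."""
    return window_minutes * (1 - target_pct / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    per_year = allowed_downtime_minutes(target)
    print(
        f"{target}%: {per_year:8.1f} min/year, "
        f"{per_year / 12:6.1f} min/month, "
        f"{per_year / 52:5.1f} min/week"
    )
```

Passing `window_minutes=30 * 24 * 60` gives the budget for a 30-day rolling window instead of a calendar year.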

Now count what your team actually experienced last year:

- The "elevated error rates" incident that lasted 40 minutes
  but got logged as "partial degradation, not full outage"
- The auth provider hiccup that dropped regional logins for 22 minutes
  but was excluded because "that's a third-party dependency"
- The Friday night patch that broke a non-critical service
  that everyone actually depends on: 90 minutes, excluded as
  "planned maintenance window"

That's 40 + 22 + 90 = 152 minutes. Three incidents have spent 2.9× your
entire annual budget, and your dashboard still shows four nines.

This is not fraud exactly. It's a measurement convention that happens
to always favor the measurer.
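
The tally is easy to reproduce. This is a rough sketch using the three incident durations from the list above; the dictionary keys are just labels I made up for readability.

```python
MINUTES_PER_YEAR = 365 * 24 * 60
annual_budget = MINUTES_PER_YEAR * (1 - 99.99 / 100)  # about 52.6 minutes

# Minutes of user-facing impact, keyed by the reason each one was excluded.
excluded_incidents = {
    "elevated error rates (logged as partial degradation)": 40,
    "auth provider hiccup (third-party dependency)": 22,
    "Friday night patch (planned maintenance window)": 90,
}

actual_downtime = sum(excluded_incidents.values())
budget_multiple = actual_downtime / annual_budget
actual_availability = (1 - actual_downtime / MINUTES_PER_YEAR) * 100

print(f"Downtime users experienced:  {actual_downtime} min")
print(f"Multiple of annual budget:   {budget_multiple:.1f}x")
print(f"Availability, no exclusions: {actual_availability:.3f}%")
```

With nothing excluded, these three incidents alone put the year at roughly 99.971%, not 99.99%.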

The three ways SLOs get gamed (usually unintentionally)

Excluding third-party dependencies

"The CDN was down, not us" is technically true and practically useless
to your customer who couldn't load your application.

User-experienced availability includes everything in the request path.
If your SLO excludes your auth provider, your DNS, and your CDN,
you are measuring something your users have never experienced.
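
To see why the request path matters, here is a small sketch of serial availability. It assumes every component must be up for a request to succeed and that failures are independent, which is a simplification. The component names and percentages are hypothetical.

```python
import math


def request_path_availability(component_pct: dict[str, float]) -> float:
    """Availability (%) of a request that needs every component to be up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return math.prod(pct / 100 for pct in component_pct.values()) * 100


# Hypothetical figures, for illustration only.
path = {
    "dns": 99.99,
    "cdn": 99.95,
    "auth_provider": 99.9,
    "our_service": 99.99,
}

end_to_end = request_path_availability(path)
print(f"What we report (our_service): {path['our_service']}%")
print(f"What users get (whole path):  {end_to_end:.3f}%")
```

The end-to-end figure can never be higher than the weakest component in the path, which is why excluding dependencies always flatters the number.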

Averaging across regions

"We maintained 99.99% globally" while US-East was down for 35 minutes
is one of the most common forms of SLO theater in distributed systems.

A customer in us-east-1 experienced a complete outage.
Your global aggregate made it disappear.
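
A quick sketch of how the aggregate hides the outage. It assumes a 30-day window and ten equally weighted regions, which is my simplification; real traffic is rarely split evenly, and the `region-N` names are placeholders.

```python
WINDOW_MINUTES = 30 * 24 * 60  # 30-day window


def availability_pct(downtime_minutes: float,
                     window_minutes: float = WINDOW_MINUTES) -> float:
    return (1 - downtime_minutes / window_minutes) * 100


# Hypothetical fleet: nine healthy regions plus one 35-minute outage.
downtime_by_region = {f"region-{n}": 0.0 for n in range(1, 10)}
downtime_by_region["us-east-1"] = 35.0

per_region = {r: availability_pct(m) for r, m in downtime_by_region.items()}
global_average = sum(per_region.values()) / len(per_region)
worst_region, worst_pct = min(per_region.items(), key=lambda kv: kv[1])

print(f"Global average: {global_average:.4f}%")
print(f"Worst region:   {worst_region} at {worst_pct:.4f}%")
```

Reporting the worst region next to the global figure keeps the outage visible instead of averaging it away.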

The planned maintenance carve-out

This one is the most honest-sounding and the most problematic.

If you need a maintenance window to deploy safely,
your deployment process is part of your reliability problem.
Excluding it from your SLO means you never fix it.
You just keep scheduling windows.

What honest SLO tracking looks like in code

Here's the implementation I use. It measures what users experience,
not what the infrastructure team finds convenient to measure.

# honest_slo.py
#
# MIT License - Ajay Devineni (github.com/Ajay150313)

from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional
import json

class ExclusionPolicy(Enum):
    """
    Controls whether incidents can be excluded from SLO calculation.

    STRICT:      Nothing excluded. User experience only.
    STANDARD:    Excludes pre-announced maintenance with customer notification.
    PERMISSIVE:  Excludes third-party, maintenance, and regional partial impact.
                 (This is how most teams get to four nines on paper.)
    """
    STRICT = "strict"
    STANDARD = "standard"
    PERMISSIVE = "permissive"

@dataclass
class Incident:
    """
    A single reliability incident, recorded at the moment it is detected.

    The 'excluded_reason' field is the audit trail.
    Every exclusion must be justified in writing at the time it happens,
    not retroactively cleaned up before the monthly review.
    """
    incident_id: str
    started_at: datetime             # timezone-aware (UTC)
    resolved_at: Optional[datetime]
    severity: str                    # P1, P2, P3
    impact_regions: list[str]        # actual affected regions, not "global"
    impact_scope: str                # "full_outage" | "partial_degradation" | "elevated_errors"
    root_cause_category: str         # "internal" | "third_party" | "infrastructure"
    was_planned: bool = False
    customer_notified_before: bool = False  # for planned maintenance
    excluded_reason: Optional[str] = None   # must be filled at time of exclusion
    excluded_by: Optional[str] = None       # who approved the exclusion

    @property
    def duration_minutes(self) -> float:
        # Open incidents keep accruing downtime until they are resolved.
        if self.resolved_at is None:
            return (datetime.now(timezone.utc) - self.started_at).total_seconds() / 60
        return (self.resolved_at - self.started_at).total_seconds() / 60

    @property
    def is_open(self) -> bool:
        return self.resolved_at is None

@dataclass
class SLOCalculator:
    """
    Calculates SLO compliance from a list of incidents.

    The key design decision: the exclusion policy is set at calculator
    construction and applied consistently. You cannot change the policy
    after the fact to make your numbers look better.

    Usage:
        calc = SLOCalculator(
            target_pct=99.9,
            window_days=30,
            policy=ExclusionPolicy.STANDARD,
        )
        result = calc.calculate(incidents)
    """
    target_pct: float           # e.g. 99.9
    window_days: int            # rolling window
    policy: ExclusionPolicy = ExclusionPolicy.STANDARD
    service_regions: list[str] = field(default_factory=lambda: ["global"])

    @property
    def window_minutes(self) -> float:
        return self.window_days * 24 * 60

    @property
    def error_budget_minutes(self) -> float:
        return self.window_minutes * (1 - self.target_pct / 100)

    def _should_exclude(self, incident: Incident) -> tuple[bool, str]:
        """
        Returns (should_exclude, reason).

        The reason is logged regardless, so you can audit what
        would have been excluded under a stricter policy.
        """
        if incident.excluded_reason:
            # Explicit manual exclusion: always honored if policy allows it
            if self.policy == ExclusionPolicy.STRICT:
                return False, "strict policy: no exclusions allowed"
            return True, incident.excluded_reason

        if self.policy == ExclusionPolicy.STRICT:
            return False, ""

        # STANDARD: only exclude pre-announced maintenance
        # with documented customer notification
        if (self.policy == ExclusionPolicy.STANDARD
                and incident.was_planned
                and incident.customer_notified_before):
            return True, "pre-announced maintenance with customer notification"

        if self.policy == ExclusionPolicy.PERMISSIVE:
            if incident.root_cause_category == "third_party":
                return True, "third-party dependency (permissive policy)"
            if incident.was_planned:
                return True, "planned maintenance (permissive policy)"
            # Partial regional impact excluded under permissive
            if (incident.impact_scope == "partial_degradation"
                    and len(incident.impact_regions) < len(self.service_regions)):
                return True, "partial regional impact (permissive policy)"

        return False, ""

    def calculate(self, incidents: list[Incident]) -> dict:
        """
        Returns a complete SLO report including:
        - Compliance under the configured policy
        - What it would be under STRICT policy (no exclusions)
        - The gap between them (the "honesty gap")
        - Full audit trail of all exclusions
        """
        cutoff = datetime.now(timezone.utc) - timedelta(days=self.window_days)
        window_incidents = [
            i for i in incidents
            if i.started_at >= cutoff
        ]

        included_minutes = 0.0
        excluded_minutes = 0.0
        exclusion_log = []
        strict_downtime = 0.0

        for incident in window_incidents:
            duration = incident.duration_minutes
            strict_downtime += duration  # always count for strict calculation

            should_exclude, reason = self._should_exclude(incident)
            if should_exclude:
                excluded_minutes += duration
                exclusion_log.append({
                    "incident_id": incident.incident_id,
                    "duration_minutes": round(duration, 1),
                    "reason": reason,
                    "excluded_by": incident.excluded_by,
                })
            else:
                included_minutes += duration

        # Compliance under configured policy
        downtime_ratio = included_minutes / self.window_minutes
        achieved_pct = (1 - downtime_ratio) * 100
        budget_consumed_pct = (included_minutes / self.error_budget_minutes) * 100

        # What it would look like with zero exclusions (honest number)
        strict_ratio = strict_downtime / self.window_minutes
        strict_pct = (1 - strict_ratio) * 100
        honesty_gap = achieved_pct - strict_pct

        # Budget status
        budget_remaining = max(0.0, self.error_budget_minutes - included_minutes)
        is_breached = included_minutes > self.error_budget_minutes

        return {
            "slo_target_pct": self.target_pct,
            "window_days": self.window_days,
            "policy": self.policy.value,
            "achieved_pct": round(achieved_pct, 4),
            "strict_achieved_pct": round(strict_pct, 4),
            "honesty_gap_pct": round(honesty_gap, 4),
            "error_budget_minutes_total": round(self.error_budget_minutes, 1),
            "error_budget_minutes_consumed": round(included_minutes, 1),
            "error_budget_minutes_remaining": round(budget_remaining, 1),
            "error_budget_consumed_pct": round(budget_consumed_pct, 1),
            "is_breached": is_breached,
            "total_incidents": len(window_incidents),
            "excluded_incidents": len(exclusion_log),
            "excluded_minutes": round(excluded_minutes, 1),
            "exclusion_log": exclusion_log,
        }

@dataclass
class ErrorBudgetBurnTracker:
    """
    Tracks error budget consumption rate over time.

    The key insight: it's not just how much budget you've used,
    it's how fast you're burning it. A team that burns 80% of their
    budget in the first week of the month has a very different problem
    than a team that burns it steadily.

    Burn rate > 1.0 means you will exhaust the budget before
    the window closes at the current rate.
    """
    slo_target_pct: float
    window_days: int

    @property
    def _budget_minutes(self) -> float:
        return self.window_days * 24 * 60 * (1 - self.slo_target_pct / 100)

    def current_burn_rate(
        self,
        consumed_minutes: float,
        elapsed_days: float,
    ) -> float:
        """
        Burn rate of 1.0 = consuming budget at exactly the sustainable pace.
        Burn rate of 3.0 = exhausts the full budget in 1/3 of the window.
        Burn rate of 14.4 = exhausts 30-day budget in ~50 hours (page-worthy).
        """
        if elapsed_days <= 0:
            return 0.0
        elapsed_minutes = elapsed_days * 24 * 60
        actual_rate = consumed_minutes / elapsed_minutes
        sustainable_rate = self._budget_minutes / (self.window_days * 24 * 60)
        if sustainable_rate <= 0:
            return 0.0
        return actual_rate / sustainable_rate

    def time_to_exhaustion_hours(
        self,
        remaining_budget_minutes: float,
        current_burn_rate: float,
    ) -> Optional[float]:
        """
        At the current burn rate, how many hours until the budget is gone?
        Returns None if burn rate <= 1.0 (on track or ahead of pace)
        or if no budget remains.
        """
        if current_burn_rate <= 1.0 or remaining_budget_minutes <= 0:
            return None
        sustainable_minutes_per_hour = (
            self._budget_minutes / (self.window_days * 24)
        )
        actual_minutes_per_hour = sustainable_minutes_per_hour * current_burn_rate
        if actual_minutes_per_hour <= 0:
            return None
        return remaining_budget_minutes / actual_minutes_per_hour

def format_report(report: dict) -> str:
    """Build a human-readable SLO compliance report."""
    breach_str = "🔴 BREACHED" if report["is_breached"] else "🟢 Within budget"
    gap_str = (
        f"⚠️ +{report['honesty_gap_pct']:.3f}% vs strict calculation"
        if report["honesty_gap_pct"] > 0.001
        else "✅ No exclusions applied"
    )

    lines = [
        f"\n{'═'*56}",
        f"  SLO COMPLIANCE REPORT ({report['window_days']}-day window)",
        f"{'═'*56}",
        f"  Policy:              {report['policy'].upper()}",
        f"  Target:              {report['slo_target_pct']}%",
        f"  Achieved:            {report['achieved_pct']}%  {breach_str}",
        f"  Strict (no excl.):   {report['strict_achieved_pct']}%  {gap_str}",
        f"{'─'*56}",
        f"  Error budget total:  {report['error_budget_minutes_total']} min",
        f"  Consumed:            {report['error_budget_minutes_consumed']} min  "
        f"({report['error_budget_consumed_pct']}%)",
        f"  Remaining:           {report['error_budget_minutes_remaining']} min",
        f"{'─'*56}",
        f"  Incidents (window):  {report['total_incidents']}",
        f"  Excluded:            {report['excluded_incidents']} "
        f"({report['excluded_minutes']} min)",
    ]

    if report["exclusion_log"]:
        lines.append("\n  Exclusion audit trail:")
        for excl in report["exclusion_log"]:
            lines.append(
                f"    • {excl['incident_id']}: {excl['duration_minutes']} min - "
                f"{excl['reason']}"
            )
    lines.append(f"{'═'*56}\n")
    return "\n".join(lines)

# ── Example usage ─────────────────────────────────────────────────────────────

if __name__ == "__main__":
    now = datetime.now(timezone.utc)

    incidents = [
        Incident(
            incident_id="INC-2026-001",
            started_at=now - timedelta(days=25, hours=2),
            resolved_at=now - timedelta(days=25, hours=1, minutes=20),
            severity="P1",
            impact_regions=["us-east-1"],
            impact_scope="full_outage",
            root_cause_category="internal",
        ),
        Incident(
            incident_id="INC-2026-002",
            started_at=now - timedelta(days=18),
            resolved_at=now - timedelta(days=17, hours=23, minutes=37),
            severity="P2",
            impact_regions=["us-east-1", "eu-west-1"],
            impact_scope="elevated_errors",
            root_cause_category="third_party",
            excluded_reason="Auth provider outage, not our infrastructure",
            excluded_by="oncall-lead@company.com",
        ),
        Incident(
            incident_id="INC-2026-003",
            started_at=now - timedelta(days=5, hours=22),
            resolved_at=now - timedelta(days=5, hours=20, minutes=30),
            severity="P2",
            impact_regions=["us-east-1"],
            impact_scope="partial_degradation",
            root_cause_category="internal",
            was_planned=True,
            customer_notified_before=True,
        ),
    ]

    print("\n--- PERMISSIVE (how most teams report) ---")
    calc_permissive = SLOCalculator(
        target_pct=99.9,
        window_days=30,
        policy=ExclusionPolicy.PERMISSIVE,
        service_regions=["us-east-1", "eu-west-1"],
    )
    print(format_report(calc_permissive.calculate(incidents)))

    print("--- STANDARD (honest, defensible) ---")
    calc_standard = SLOCalculator(
        target_pct=99.9,
        window_days=30,
        policy=ExclusionPolicy.STANDARD,
        service_regions=["us-east-1", "eu-west-1"],
    )
    print(format_report(calc_standard.calculate(incidents)))

    print("--- STRICT (pure user experience) ---")
    calc_strict = SLOCalculator(
        target_pct=99.9,
        window_days=30,
        policy=ExclusionPolicy.STRICT,
        service_regions=["us-east-1", "eu-west-1"],
    )
    print(format_report(calc_strict.calculate(incidents)))

    # Burn rate check
    tracker = ErrorBudgetBurnTracker(slo_target_pct=99.9, window_days=30)
    burn = tracker.current_burn_rate(consumed_minutes=85.0, elapsed_days=15)
    hours_left = tracker.time_to_exhaustion_hours(
        remaining_budget_minutes=max(0, 43.2 - 85.0), current_burn_rate=burn
    )
    print(f"  Current burn rate: {burn:.2f}×")
    if hours_left is not None:
        print(f"  Budget exhaustion: {hours_left:.1f} hours at current rate")
    else:
        print("  Budget already exhausted (consumed 85.0 of 43.2 min budget)")
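
The burn-rate figures in that last check can be sanity-checked without the classes. The sketch below is a standalone restatement of the same arithmetic, not part of honest_slo.py; the function names `burn_rate` and `hours_to_exhaust_full_budget` are mine.

```python
def burn_rate(consumed_minutes: float, elapsed_days: float,
              target_pct: float, window_days: int) -> float:
    """Fraction of budget used divided by fraction of window elapsed."""
    budget_minutes = window_days * 24 * 60 * (1 - target_pct / 100)
    budget_fraction_used = consumed_minutes / budget_minutes
    window_fraction_elapsed = elapsed_days / window_days
    return budget_fraction_used / window_fraction_elapsed


def hours_to_exhaust_full_budget(rate: float, window_days: int) -> float:
    """Hours a constant burn rate takes to use an entire window's budget."""
    return window_days * 24 / rate


# 85 minutes consumed after 15 days of a 30-day, 99.9% SLO (43.2 min budget)
rate = burn_rate(consumed_minutes=85.0, elapsed_days=15,
                 target_pct=99.9, window_days=30)
print(f"Burn rate: {rate:.2f}x")
print(f"14.4x exhausts a 30-day budget in "
      f"{hours_to_exhaust_full_budget(14.4, 30):.0f} hours")
```

A rate near 3.9 means the budget was spent almost four times faster than the sustainable pace. Because 85 minutes already exceeds the 43.2-minute budget, the example falls into the "already exhausted" branch.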