🎬 The Most Important Number in Your Career
What does 99.9% availability actually mean?
It means your service can be down for 43.8 minutes per month. That's it. That's your entire budget for bad deployments, infrastructure failures, cloud outages, and "oh no, I pushed to main instead of my branch."
Now let me tell you what 99.99% means: 4.38 minutes per month.
That's not even enough time to wake up, open your laptop, and figure out what's happening.
Welcome to SRE — where we stop pretending "it works on my machine" is acceptable and start treating reliability as an engineering discipline.
🏗️ SRE vs DevOps: What's the Difference?
DevOps = A culture of collaboration
"Dev and Ops should work together!"
SRE = An implementation of DevOps with engineering rigor
"Here's exactly HOW they work together, with math."
Google's famous quote:
"SRE is what happens when you ask a software engineer
to design an operations team."
The key SRE principles:
- Embrace risk — perfection is impossible AND wasteful
- SLOs define reliability targets — not vibes, not feelings, numbers
- Error budgets balance features and reliability — spend wisely
- Reduce toil through automation — if you do it twice, automate it
- Simplicity is a prerequisite for reliability — complex = fragile
📊 The SLO Framework: SLI → SLO → Error Budget
SLIs: What You Measure
SLI (Service Level Indicator) = a number that measures service quality.
Good SLIs:
✅ "What proportion of HTTP requests return non-5xx?" (Availability)
✅ "What proportion of requests complete in < 200ms?" (Latency)
✅ "What proportion of payments process correctly?" (Correctness)
Bad SLIs:
❌ "CPU usage" (users don't care about your CPU)
❌ "Server uptime" (server can be up but broken)
❌ "Number of deployments" (irrelevant to user experience)
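Most good SLIs boil down to a ratio of good events to total events. A minimal sketch of the availability SLI, assuming you already export request counts by status class (the variable names are illustrative):

```python
def availability_sli(total_requests: int, server_errors_5xx: int) -> float:
    """Proportion of requests that did NOT fail with a 5xx response."""
    if total_requests == 0:
        return 1.0  # no traffic: conventionally counts as meeting the SLI
    return (total_requests - server_errors_5xx) / total_requests

# e.g. 1,000,000 requests this window, 800 of them returned 5xx
print(availability_sli(1_000_000, 800))  # → 0.9992
```

The same good-events / total-events shape works for the latency and correctness SLIs too — only the definition of "good" changes.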
SLOs: What You Promise (To Yourself)
SLO (Service Level Objective) = target value for your SLI.
Example SLOs for a Payment Service:
SLO 1: Availability
"99.95% of HTTP requests return non-5xx responses"
Measured over: 30-day rolling window
Error budget: 21.9 minutes of downtime per month
SLO 2: Latency
"99% of requests complete in under 500ms"
Measured over: 30-day rolling window
Error budget: 1% of requests can be slow
SLO 3: Correctness
"99.99% of payments process correctly"
Measured over: 30-day rolling window
Error budget: 1 in 10,000 payments can have issues
The Math That Changes Everything
| SLO | Error Budget | Downtime/month | Downtime/year |
|---------|--------------|----------------|---------------|
| 99% | 1% | 7.3 hours | 3.65 days |
| 99.5% | 0.5% | 3.65 hours | 1.83 days |
| 99.9% | 0.1% | 43.8 minutes | 8.76 hours |
| 99.95% | 0.05% | 21.9 minutes | 4.38 hours |
| 99.99% | 0.01% | 4.38 minutes | 52.6 minutes |
| 99.999% | 0.001% | 26.3 seconds | 5.26 minutes |
Notice the jump from 99.9% to 99.99%: you go from 43 minutes to 4 minutes per month. That ONE extra nine costs exponentially more engineering effort, redundancy, and money.
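The table uses an average month of 30.44 days (365.25 / 12). A quick sketch to reproduce the numbers for any SLO:

```python
AVG_MONTH_MINUTES = 365.25 / 12 * 24 * 60  # ≈ 43,830 minutes

def error_budget_minutes(slo: float) -> float:
    """Allowed downtime per (average) month for a given SLO."""
    return AVG_MONTH_MINUTES * (1 - slo)

for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.3%}: {error_budget_minutes(slo):.1f} min/month")
```

Running this recovers the table: roughly 438 minutes for two nines, 43.8 for three, 4.4 for four.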
💡 Principal Insight: The right SLO is NOT "as high as possible." It's "as high as the business needs." Most internal services are fine at 99.5%. Customer-facing APIs need 99.9-99.95%. Payment systems might need 99.99%. Going higher than needed wastes engineering time that could build features.
💰 Error Budgets: Your Reliability Currency
The error budget is the most powerful concept in SRE. It converts reliability from a subjective argument into an objective, data-driven policy.
Your Error Budget Is Like a Bank Account
─────────────────────────────────────────
Starting balance (SLO = 99.9%): 43.8 minutes/month
March 1: Balance = 43.8 min 🟢 Full speed ahead!
March 5: Bad deploy → 15 min outage
Balance = 28.8 min 🟢 Still good, keep shipping
March 12: Cloud network blip → 5 min errors
Balance = 23.8 min 🟡 Getting cautious...
March 18: Another bad deploy → 12 min outage
Balance = 11.8 min 🟠 SLOW DOWN. Reliability work only.
March 25: Config error → 15 min outage
Balance = -3.2 min 🔴 BUDGET EXHAUSTED.
Feature freeze.
All hands on reliability.
The Error Budget Policy
This is the document that makes error budgets actionable:
Budget > 50% remaining:
→ Ship features at full speed
→ Experiment freely
→ Take calculated risks with deployments
Budget 20-50% remaining:
→ Slow down on risky changes
→ Extra testing for deployments
→ Prioritize reliability improvements
Budget < 20% remaining:
→ Only critical fixes and reliability work
→ Additional review for all changes
→ Engineering time shifts to resilience
Budget EXHAUSTED:
→ FULL FREEZE on feature deployments
→ Only reliability fixes allowed
→ Executive-level review
→ Stays frozen until budget recovers
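The policy tiers above are mechanical enough to encode, which is exactly the point: the decision comes from a number, not a negotiation. A sketch, assuming you can query the fraction of budget remaining (thresholds match the policy above):

```python
def deployment_policy(budget_remaining_fraction: float) -> str:
    """Map remaining error budget to the action tier from the policy."""
    if budget_remaining_fraction <= 0:
        return "FREEZE: reliability fixes only, executive review"
    if budget_remaining_fraction < 0.20:
        return "RESTRICTED: critical fixes and reliability work only"
    if budget_remaining_fraction < 0.50:
        return "CAUTION: slow down risky changes, extra testing"
    return "NORMAL: ship features at full speed"

# e.g. 11.8 of 43.8 minutes left, as in the March scenario above
print(deployment_policy(11.8 / 43.8))
```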
🚨 Real-World Disaster #1: The Team That Didn't Have an Error Budget
Without SLOs/Error Budgets:
Product Manager: "We need to ship Feature X by Friday."
Engineering Manager: "But the service had 3 outages this month..."
Product Manager: "Users are asking for Feature X!"
Engineering Manager: "But reliability..."
Product Manager: "FEATURES!"
Engineering Manager: "OK..."
[deploys Friday, causes outage #4]
With SLOs/Error Budgets:
Product Manager: "We need to ship Feature X by Friday."
SRE Dashboard: "Error budget remaining: 8% (3.5 minutes)"
Engineering Manager: "Our error budget is nearly exhausted.
Per our error budget policy, we're in freeze mode.
Feature X ships when the budget recovers next month."
Product Manager: "...fine. What can we do to recover faster?"
Engineering Manager: "Great question! Let's fix the root causes."
The error budget takes the emotion out of the conversation. It's not "engineering being difficult" — it's math. You can't argue with math. (Well, you can, but you'll lose.)
🚨 Incident Management: When Things Go Wrong
The Incident Lifecycle
Detection → "Houston, we have a problem"
└── Automated alert (ideal) or customer report (bad)
Triage (< 5 min) → "How bad is it?"
└── Acknowledge alert, assess impact, assign severity
Mobilize → "Assemble the team"
└── Incident Commander, Comms lead, War room
Investigate → "What's happening and how do we stop it?"
└── Parallel investigation threads
└── Focus on MITIGATION first, root cause later
Resolve → "It's fixed"
└── Service restored, monitoring confirms, customers notified
Review → "What did we learn?"
└── Blameless postmortem within 48 hours
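The "automated alert" in the Detection step is usually an SLO burn-rate alert: how fast you're consuming error budget relative to the rate that would exactly exhaust it at the window's end. A minimal sketch of the math (the example numbers are illustrative):

```python
def burn_rate(observed_error_rate: float, slo: float) -> float:
    """1.0 = budget gone exactly at window end; 5.0 = burning five times too fast."""
    tolerable_error_rate = 1 - slo  # e.g. 0.001 for a 99.9% SLO
    return observed_error_rate / tolerable_error_rate

# 0.5% errors against a 99.9% SLO burns budget 5x too fast
print(round(burn_rate(0.005, 0.999), 2))

# Hours until a 30-day budget is fully spent at that rate
print(round(30 * 24 / burn_rate(0.005, 0.999), 1))
```

Alerting on burn rate rather than raw error rate is why the alert in the postmortem below fired within three minutes of the bad deploy.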
The Severity Playbook
P1 - CRITICAL: Complete service outage, data loss risk
→ Page immediately (day or night)
→ Incident Commander within 5 minutes
→ Status page updated within 10 minutes
→ Business stakeholders notified within 15 minutes
→ Updates every 15 minutes until resolved
P2 - HIGH: Major feature degraded, significant user impact
→ Page during business hours
→ Incident Commander within 15 minutes
→ Updates every 30 minutes
P3 - MEDIUM: Minor feature impact, workaround available
→ Ticket created, fix within business hours
→ No page, no war room
P4 - LOW: Cosmetic issues, minor inconvenience
→ Ticket created, fix when convenient
The Blameless Postmortem (This Is How You Actually Get Better)
THE MOST IMPORTANT RULE: Blameless. Not blame-less than usual. Truly blameless.
❌ "Dave deployed without testing"
✅ "The deployment process allowed changes without test results"
❌ "Operations team was too slow to respond"
✅ "The runbook didn't cover this scenario, extending response time"
❌ "The developer introduced a bug"
✅ "The test suite didn't cover this edge case"
Real Postmortem Example
# Incident Review: Payment Processing Outage
## March 18, 2026
### Summary
- Severity: P1
- Duration: 47 minutes (14:03 - 14:50 UTC)
- Impact: 15% of payment transactions failed
- Detection: SLO burn rate alert (automated)
- Resolution: Rolled back deployment v2.3.1
### Timeline
14:00 Deployment v2.3.1 started (routine release)
14:03 Error rate SLO alert fires (burning 5x normal rate)
14:05 On-call acknowledges, opens incident channel
14:10 Correlates: error spike started with deployment
14:12 Decision: roll back immediately
14:18 Rollback to v2.3.0 complete
14:25 Error rate returning to baseline
14:50 Confirmed fully resolved, incident closed
### Root Cause
Database migration in v2.3.1 added an index on the 'payments'
table (142M rows). During the migration, the table was locked
for write operations under load. Queries queued, connections
exhausted, cascading failure.
### Why It Wasn't Caught
1. Migration tested in staging (10K rows — completed in 0.3s)
2. Production had 142M rows (migration ran for ~20 minutes)
3. No load testing for database migrations exists
4. Deployment happened during peak hours (14:00 UTC)
### Action Items
| # | Action | Owner | Due |
|---|-------------------------------------------|----------|------------|
| 1 | Add load test for DB migrations (prod-like data) | @alice | April 1 |
| 2 | Enforce deployment windows (off-peak only) | @platform | March 25 |
| 3 | Enable canary deployments for payment svc | @bob | March 25 |
| 4 | Create online migration playbook (no locks)| @carol | April 15 |
🐒 Chaos Engineering: Breaking Things on Purpose
"The best way to have confidence in your systems is to regularly try to break them."
The Chaos Engineering Process
1. Define steady state
→ "Normal error rate is < 0.1%, p99 latency < 500ms"
2. Form a hypothesis
→ "If a database replica fails, traffic fails over
to secondary within 60 seconds with < 1% error increase"
3. Run the experiment
→ Kill the primary database connection
→ Watch what happens
4. Observe & learn
→ Did the system behave as expected?
→ "Failover took 4 minutes, not 60 seconds. Connections
weren't being pooled. Found the bug!"
5. Fix what you found
→ Fix the connection pooling issue
→ Re-run the experiment to verify
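Steps 1 and 4 amount to an automated steady-state check that runs before and after injection and aborts the experiment the moment the system leaves steady state. A sketch, assuming you can fetch the two metrics from your monitoring system (`inject_fault`, `fetch_metrics`, and `abort` are hypothetical hooks you'd wire to your own tooling):

```python
# Steady state from step 1: error rate < 0.1%, p99 latency < 500 ms
STEADY_STATE = {"error_rate": 0.001, "p99_latency_ms": 500}

def within_steady_state(metrics: dict) -> bool:
    """True if every observed metric is below its steady-state threshold."""
    return all(metrics[name] < limit for name, limit in STEADY_STATE.items())

def run_experiment(inject_fault, fetch_metrics, abort):
    """Run one chaos experiment, aborting on any steady-state violation."""
    if not within_steady_state(fetch_metrics()):
        return "SKIPPED: system not in steady state before injection"
    inject_fault()
    if not within_steady_state(fetch_metrics()):
        abort()  # stop the fault injection, roll back
        return "ABORTED: steady state violated"
    return "PASSED: hypothesis held"
```

The "SKIPPED" branch matters: if the system is already unhealthy, injecting more failure teaches you nothing and risks a real incident.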
Chaos Maturity Levels
Level 1: Game Days (Start here!)
"Let's all get together quarterly to break stuff in staging"
→ Manual experiments
→ Team-building and learning
→ Find obvious gaps in runbooks
Level 2: Automated Experiments
"Chaos Mesh injects pod failures every night in staging"
→ Scheduled chaos experiments
→ Automated steady-state verification
→ Results in dashboards
Level 3: Continuous Chaos in Production
"Random pods die in production every day and nobody notices"
→ Netflix's Chaos Monkey level
→ Real confidence in system resilience
→ Only for teams with strong observability + fast rollback
🚨 Real-World Disaster #2: The Chaos Experiment That Went Too Far
The Plan: "Let's test what happens when we lose an Availability Zone in staging."
What Actually Happened: The engineer accidentally targeted the production cluster instead of staging. One-third of production nodes became unreachable. The remaining nodes didn't have enough capacity to handle the full load. Pods went into Pending state. Auto-scaling kicked in but took 8 minutes to provision new nodes. 8 minutes of degraded service for all customers.
The Lesson:
Chaos Engineering Safety Rules:
1. ✅ Define abort conditions BEFORE the experiment
2. ✅ Start small (1 pod, not 1 AZ)
3. ✅ Start in non-production
4. ✅ Double-check the target cluster (use context colors in terminal!)
5. ✅ Have someone else review the experiment config
6. ✅ Set blast radius limits
```bash
# kubeconfig context helper (color-code your terminal!)
if [[ $(kubectl config current-context) == *"prod"* ]]; then
  export PS1="\[\e[31m\]🔴 PROD\[\e[0m\] \w $ "
else
  export PS1="\[\e[32m\]🟢 dev\[\e[0m\] \w $ "
fi
```
🏥 Disaster Recovery: The Plan You Hope You Never Need
RPO and RTO Explained (Simply)
RPO (Recovery Point Objective) = How much data can you lose?
"If the database is restored from backup, how old is that backup?"
RPO = 0: No data loss (synchronous replication)
RPO = 1 hour: You might lose up to 1 hour of data
RPO = 24 hours: Daily backups, worst case lose a full day
RTO (Recovery Time Objective) = How quickly must you recover?
"How long can the service be down?"
RTO = 0: Instant failover (active-active)
RTO = 1 hour: Warm standby, automated failover
RTO = 24 hours: Cold standby, manual restoration
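A concrete way to use these numbers: check your actual backup cadence against the RPO target, because worst-case data loss is the full interval between backups. A trivial sketch with illustrative values:

```python
def meets_rpo(backup_interval_hours: float, rpo_hours: float) -> bool:
    """Worst-case data loss equals the full backup interval."""
    return backup_interval_hours <= rpo_hours

print(meets_rpo(backup_interval_hours=24, rpo_hours=1))    # daily backups, 1h RPO → False
print(meets_rpo(backup_interval_hours=0.25, rpo_hours=1))  # 15-min backups, 1h RPO → True
```

The RTO check is the same shape, except the only honest input is a *measured* recovery time from a real drill — which is the point of the disaster story below.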
DR Strategies Ranked
From most to least cost and complexity:
Active-Active (Multi-Region) 💰💰💰💰💰
Both regions serve traffic. Instant failover.
RPO: 0, RTO: ~0
Use for: Payment processing, critical APIs
Active-Passive (Hot Standby) 💰💰💰
Standby region ready, switch on failure.
RPO: minutes, RTO: < 1 hour
Use for: Main customer-facing services
Warm Standby 💰💰
Minimal infrastructure in DR region.
RPO: hours, RTO: < 4 hours
Use for: Internal tools, non-critical services
Backup/Restore 💰
Backups only, rebuild from scratch.
RPO: hours-days, RTO: hours-days
Use for: Dev environments, archival data
🚨 Real-World Disaster #3: The DR Plan That Was Never Tested
What Happened: Company had a "disaster recovery plan" in a SharePoint document written 2 years ago. When an Azure region experienced a significant outage, they pulled out the DR plan. It referenced:
- A resource group that had been deleted
- A script written for az CLI v2.38 (they were running v2.56)
- A recovery process that assumed manual steps from an employee who left the company
- DNS records that had been changed 6 months ago
Recovery took 14 hours instead of the documented 2 hours.
The Fix: Test your DR plan regularly.
DR Testing Cadence:
Monthly: Table-top exercise (walk through the plan)
Quarterly: Partial failover test (one service)
Annually: Full DR drill (simulate complete region failure)
After every test:
→ Update the runbook with findings
→ Fix any automation that broke
→ Time the recovery and compare to RTO
📉 Toil Reduction: Automate the Boring Stuff
Toil = manual, repetitive operational work that scales with the size of the system and provides no lasting value.
Toil examples:
🔄 Manually restarting pods that OOMKill
🔄 Manually scaling nodes before expected traffic
🔄 Manually rotating secrets every 90 days
🔄 Manually approving deployments by looking at a dashboard
🔄 Manually creating namespaces for new services
Not toil (even if boring):
📝 Writing postmortems (creates lasting value)
🏗️ Building automation (one-time effort)
📊 Reviewing SLO dashboards (decision-making)
The Toil Budget Rule
Google's SRE book recommends: No more than 50% of an SRE's time should be toil. If it's higher, you're not doing engineering — you're doing operations with a fancier title.
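The 50% cap is easy to track if you tag toil (manual restarts, manual scaling, ticket churn) separately in your time tracking. A trivial sketch:

```python
def toil_fraction(toil_hours: float, total_hours: float) -> float:
    """Fraction of working time spent on toil; keep it under 0.5."""
    return toil_hours / total_hours

# 22 hours of toil in a 40-hour week blows the 50% budget
assert toil_fraction(22, 40) > 0.5
```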
🎯 Key Takeaways
- SLOs are contracts with yourself — pick the right number, not the highest number
- Error budgets turn reliability debates into math — you can't argue with math
- Blameless postmortems are how organizations learn (blame makes people hide problems)
- Chaos engineering starts small — game days before automated chaos in production
- Test your DR plan or it's not a plan, it's a wish
- Toil above 50% means you're doing ops, not engineering
🔥 Homework
- Pick your most important service. Write an SLO for it (availability + latency). Calculate the error budget.
- Look at your on-call incidents from last month. How many were repeat issues? Those are automation opportunities.
- When was the last time your DR plan was tested? If "never" or "I don't know" — schedule one.
Next up in the series: **From 10x Developer to 10x Multiplier: Surviving the Lead/Principal Glow-Up** — where we decode the mindset shift from writing code to enabling organizations.
💬 What's the best (or worst) postmortem you've ever participated in? Did it lead to real change? Share below — I want to hear the stories that made organizations better. 📝