🎬 The Most Important Number in Your Career
What does 99.9% availability actually mean?
It means your service can be down for 43.8 minutes per month. That's it. That's your entire budget for bad deployments, infrastructure failures, cloud outages, and "oh no, I pushed to main instead of my branch."
Now let me tell you what 99.99% means: 4.38 minutes per month.
That's not even enough time to wake up, open your laptop, and figure out what's happening.
Welcome to SRE — where we stop pretending "it works on my machine" is acceptable and start treating reliability as an engineering discipline.
🏗️ SRE vs DevOps: What's the Difference?
DevOps = A culture of collaboration
"Dev and Ops should work together!"
SRE = An implementation of DevOps with engineering rigor
"Here's exactly HOW they work together, with math."
Google's famous quote:
"SRE is what happens when you ask a software engineer
to design an operations team."
The key SRE principles:
- Embrace risk — perfection is impossible AND wasteful
- SLOs define reliability targets — not vibes, not feelings, numbers
- Error budgets balance features and reliability — spend wisely
- Reduce toil through automation — if you do it twice, automate it
- Simplicity is a prerequisite for reliability — complex = fragile
📊 The SLO Framework: SLI → SLO → Error Budget
SLIs: What You Measure
SLI (Service Level Indicator) = a number that measures service quality.
Good SLIs:
✅ "What proportion of HTTP requests return non-5xx?" (Availability)
✅ "What proportion of requests complete in < 200ms?" (Latency)
✅ "What proportion of payments process correctly?" (Correctness)
Bad SLIs:
❌ "CPU usage" (users don't care about your CPU)
❌ "Server uptime" (server can be up but broken)
❌ "Number of deployments" (irrelevant to user experience)
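Most good SLIs boil down to a ratio of good events to total events. A minimal sketch of the availability SLI, assuming you already export request counts by status class (the variable names are illustrative):

```python
def availability_sli(total_requests: int, server_errors_5xx: int) -> float:
    """Proportion of requests that did NOT fail with a 5xx response."""
    if total_requests == 0:
        return 1.0  # no traffic: conventionally counts as meeting the SLI
    return (total_requests - server_errors_5xx) / total_requests

# e.g. 1,000,000 requests this window, 800 of them returned 5xx
print(availability_sli(1_000_000, 800))  # → 0.9992
```

The same good-events / total-events shape works for the latency and correctness SLIs too — only the definition of "good" changes.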
SLOs: What You Promise (To Yourself)
SLO (Service Level Objective) = target value for your SLI.
Example SLOs for a Payment Service:
SLO 1: Availability
"99.95% of HTTP requests return non-5xx responses"
Measured over: 30-day rolling window
Error budget: 21.9 minutes of downtime per month
SLO 2: Latency
"99% of requests complete in under 500ms"
Measured over: 30-day rolling window
Error budget: 1% of requests can be slow
SLO 3: Correctness
"99.99% of payments process correctly"
Measured over: 30-day rolling window
Error budget: 1 in 10,000 payments can have issues
The Math That Changes Everything
| SLO | Error Budget | Downtime/month | Downtime/year |
|---------|--------------|----------------|---------------|
| 99% | 1% | 7.3 hours | 3.65 days |
| 99.5% | 0.5% | 3.65 hours | 1.83 days |
| 99.9% | 0.1% | 43.8 minutes | 8.76 hours |
| 99.95% | 0.05% | 21.9 minutes | 4.38 hours |
| 99.99% | 0.01% | 4.38 minutes | 52.6 minutes |
| 99.999% | 0.001% | 26.3 seconds | 5.26 minutes |
Notice the jump from 99.9% to 99.99%: you go from 43 minutes to 4 minutes per month. That ONE extra nine costs exponentially more engineering effort, redundancy, and money.
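The table uses an average month of 30.44 days (365.25 / 12). A quick sketch to reproduce the numbers for any SLO:

```python
AVG_MONTH_MINUTES = 365.25 / 12 * 24 * 60  # ≈ 43,830 minutes

def error_budget_minutes(slo: float) -> float:
    """Allowed downtime per (average) month for a given SLO."""
    return AVG_MONTH_MINUTES * (1 - slo)

for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.3%}: {error_budget_minutes(slo):.1f} min/month")
```

Running this recovers the table: roughly 438 minutes for two nines, 43.8 for three, 4.4 for four.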
💡 Principal Insight: The right SLO is NOT "as high as possible." It's "as high as the business needs." Most internal services are fine at 99.5%. Customer-facing APIs need 99.9-99.95%. Payment systems might need 99.99%. Going higher than needed wastes engineering time that could build features.
💰 Error Budgets: Your Reliability Currency
The error budget is the most powerful concept in SRE. It converts reliability from a subjective argument into an objective, data-driven policy.
Your Error Budget Is Like a Bank Account
─────────────────────────────────────────
Starting balance (SLO = 99.9%): 43.8 minutes/month
March 1: Balance = 43.8 min 🟢 Full speed ahead!
March 5: Bad deploy → 15 min outage
Balance = 28.8 min 🟢 Still good, keep shipping
March 12: Cloud network blip → 5 min errors
Balance = 23.8 min 🟡 Getting cautious...
March 18: Another bad deploy → 12 min outage
Balance = 11.8 min 🟠 SLOW DOWN. Reliability work only.
March 25: Config error → 15 min outage
Balance = -3.2 min 🔴 BUDGET EXHAUSTED.
Feature freeze.
All hands on reliability.
The Error Budget Policy
This is the document that makes error budgets actionable:
Budget > 50% remaining:
→ Ship features at full speed
→ Experiment freely
→ Take calculated risks with deployments
Budget 20-50% remaining:
→ Slow down on risky changes
→ Extra testing for deployments
→ Prioritize reliability improvements
Budget < 20% remaining:
→ Only critical fixes and reliability work
→ Additional review for all changes
→ Engineering time shifts to resilience
Budget EXHAUSTED:
→ FULL FREEZE on feature deployments
→ Only reliability fixes allowed
→ Executive-level review
→ Stays frozen until budget recovers
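The policy tiers above are mechanical enough to encode, which is exactly the point: the decision comes from a number, not a negotiation. A sketch, assuming you can query the fraction of budget remaining (thresholds match the policy above):

```python
def deployment_policy(budget_remaining_fraction: float) -> str:
    """Map remaining error budget to the action tier from the policy."""
    if budget_remaining_fraction <= 0:
        return "FREEZE: reliability fixes only, executive review"
    if budget_remaining_fraction < 0.20:
        return "RESTRICTED: critical fixes and reliability work only"
    if budget_remaining_fraction < 0.50:
        return "CAUTION: slow down risky changes, extra testing"
    return "NORMAL: ship features at full speed"

# e.g. 11.8 of 43.8 minutes left, as in the March scenario above
print(deployment_policy(11.8 / 43.8))
```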
🚨 Real-World Disaster #1: The Team That Didn't Have an Error Budget
Without SLOs/Error Budgets:
Product Manager: "We need to ship Feature X by Friday."
Engineering Manager: "But the service had 3 outages this month..."
Product Manager: "Users are asking for Feature X!"
Engineering Manager: "But reliability..."
Product Manager: "FEATURES!"
Engineering Manager: "OK..."
[deploys Friday, causes outage #4]
With SLOs/Error Budgets:
Product Manager: "We need to ship Feature X by Friday."
SRE Dashboard: "Error budget remaining: 8% (3.5 minutes)"
Engineering Manager: "Our error budget is nearly exhausted.
Per our error budget policy, we're in freeze mode.
Feature X ships when the budget recovers next month."
Product Manager: "...fine. What can we do to recover faster?"
Engineering Manager: "Great question! Let's fix the root causes."
The error budget takes the emotion out of the conversation. It's not "engineering being difficult" — it's math. You can't argue with math. (Well, you can, but you'll lose.)
🚨 Incident Management: When Things Go Wrong
The Incident Lifecycle
Detection → "Houston, we have a problem"
└── Automated alert (ideal) or customer report (bad)
Triage (< 5 min) → "How bad is it?"
└── Acknowledge alert, assess impact, assign severity
Mobilize → "Assemble the team"
└── Incident Commander, Comms lead, War room
Investigate → "What's happening and how do we stop it?"
└── Parallel investigation threads
└── Focus on MITIGATION first, root cause later
Resolve → "It's fixed"
└── Service restored, monitoring confirms, customers notified
Review → "What did we learn?"
└── Blameless postmortem within 48 hours
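The "automated alert" in the Detection step is usually an SLO burn-rate alert: how fast you're consuming error budget relative to the rate that would exactly exhaust it at the window's end. A minimal sketch of the math (the example numbers are illustrative):

```python
def burn_rate(observed_error_rate: float, slo: float) -> float:
    """1.0 = budget gone exactly at window end; 5.0 = burning five times too fast."""
    tolerable_error_rate = 1 - slo  # e.g. 0.001 for a 99.9% SLO
    return observed_error_rate / tolerable_error_rate

# 0.5% errors against a 99.9% SLO burns budget 5x too fast
print(round(burn_rate(0.005, 0.999), 2))

# Hours until a 30-day budget is fully spent at that rate
print(round(30 * 24 / burn_rate(0.005, 0.999), 1))
```

Alerting on burn rate rather than raw error rate is why the alert in the postmortem below fired within three minutes of the bad deploy.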
The Severity Playbook
P1 - CRITICAL: Complete service outage, data loss risk
→ Page immediately (day or night)
→ Incident Commander within 5 minutes
→ Status page updated within 10 minutes
→ Business stakeholders notified within 15 minutes
→ Updates every 15 minutes until resolved
P2 - HIGH: Major feature degraded, significant user impact
→ Page during business hours
→ Incident Commander within 15 minutes
→ Updates every 30 minutes
P3 - MEDIUM: Minor feature impact, workaround available
→ Ticket created, fix within business hours
→ No page, no war room
P4 - LOW: Cosmetic issues, minor inconvenience
→ Ticket created, fix when convenient
The Blameless Postmortem (This Is How You Actually Get Better)
THE MOST IMPORTANT RULE: Blameless. Not blame-less than usual. Truly blameless.
❌ "Dave deployed without testing"
✅ "The deployment process allowed changes without test results"
❌ "Operations team was too slow to respond"
✅ "The runbook didn't cover this scenario, extending response time"
❌ "The developer introduced a bug"
✅ "The test suite didn't cover this edge case"
Real Postmortem Example
# Incident Review: Payment Processing Outage
## March 18, 2026
### Summary
- Severity: P1
- Duration: 47 minutes (14:03 - 14:50 UTC)
- Impact: 15% of payment transactions failed
- Detection: SLO burn rate alert (automated)
- Resolution: Rolled back deployment v2.3.1
### Timeline
14:00 Deployment v2.3.1 started (routine release)
14:03 Error rate SLO alert fires (burning 5x normal rate)
14:05 On-call acknowledges, opens incident channel
14:10 Correlates: error spike started with deployment
14:12 Decision: roll back immediately
14:18 Rollback to v2.3.0 complete
14:25 Error rate returning to baseline
14:50 Confirmed fully resolved, incident closed
### Root Cause
Database migration in v2.3.1 added an index on the 'payments'
table (142M rows). During the migration, the table was locked
for write operations under load. Queries queued, connections
exhausted, cascading failure.
### Why It Wasn't Caught
1. Migration tested in staging (10K rows — completed in 0.3s)
2. Production had 142M rows (migration ran for ~20 minutes)
3. No load testing for database migrations exists
4. Deployment happened during peak hours (14:00 UTC)
### Action Items
| # | Action | Owner | Due |
|---|-------------------------------------------|----------|------------|
| 1 | Add load test for DB migrations (prod-like data) | @alice | April 1 |
| 2 | Enforce deployment windows (off-peak only) | @platform | March 25 |
| 3 | Enable canary deployments for payment svc | @bob | March 25 |
| 4 | Create online migration playbook (no locks)| @carol | April 15 |
🐒 Chaos Engineering: Breaking Things on Purpose
"The best way to have confidence in your systems is to regularly try to break them."
The Chaos Engineering Process
1. Define steady state
→ "Normal error rate is < 0.1%, p99 latency < 500ms"
2. Form a hypothesis
→ "If a database replica fails, traffic fails over
to secondary within 60 seconds with < 1% error increase"
3. Run the experiment
→ Kill the primary database connection
→ Watch what happens
4. Observe & learn
→ Did the system behave as expected?
→ "Failover took 4 minutes, not 60 seconds. Connections
weren't being pooled. Found the bug!"
5. Fix what you found
→ Fix the connection pooling issue
→ Re-run the experiment to verify
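Steps 1 and 4 amount to an automated steady-state check that runs before and after injection and aborts the experiment the moment the system leaves steady state. A sketch, assuming you can fetch the two metrics from your monitoring system (`inject_fault`, `fetch_metrics`, and `abort` are hypothetical hooks you'd wire to your own tooling):

```python
# Steady state from step 1: error rate < 0.1%, p99 latency < 500 ms
STEADY_STATE = {"error_rate": 0.001, "p99_latency_ms": 500}

def within_steady_state(metrics: dict) -> bool:
    """True if every observed metric is below its steady-state threshold."""
    return all(metrics[name] < limit for name, limit in STEADY_STATE.items())

def run_experiment(inject_fault, fetch_metrics, abort):
    """Run one chaos experiment, aborting on any steady-state violation."""
    if not within_steady_state(fetch_metrics()):
        return "SKIPPED: system not in steady state before injection"
    inject_fault()
    if not within_steady_state(fetch_metrics()):
        abort()  # stop the fault injection, roll back
        return "ABORTED: steady state violated"
    return "PASSED: hypothesis held"
```

The "SKIPPED" branch matters: if the system is already unhealthy, injecting more failure teaches you nothing and risks a real incident.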
Chaos Maturity Levels
Level 1: Game Days (Start here!)
"Let's all get together quarterly to break stuff in staging"
→ Manual experiments
→ Team-building and learning
→ Find obvious gaps in runbooks
Level 2: Automated Experiments
"Chaos Mesh injects pod failures every night in staging"
→ Scheduled chaos experiments
→ Automated steady-state verification
→ Results in dashboards
Level 3: Continuous Chaos in Production
"Random pods die in production every day and nobody notices"
→ Netflix's Chaos Monkey level
→ Real confidence in system resilience
→ Only for teams with strong observability + fast rollback
🚨 Real-World Disaster #2: The Chaos Experiment That Went Too Far
The Plan: "Let's test what happens when we lose an Availability Zone in staging."
What Actually Happened: The engineer accidentally targeted the production cluster instead of staging. One-third of production nodes became unreachable. The remaining nodes didn't have enough capacity to handle the full load. Pods went into Pending state. Auto-scaling kicked in but took 8 minutes to provision new nodes. 8 minutes of degraded service for all customers.
The Lesson:
Chaos Engineering Safety Rules:
1. ✅ Define abort conditions BEFORE the experiment
2. ✅ Start small (1 pod, not 1 AZ)
3. ✅ Start in non-production
4. ✅ Double-check the target cluster (use context colors in terminal!)
5. ✅ Have someone else review the experiment config
6. ✅ Set blast radius limits
```bash
# kubeconfig context helper (color-code your terminal!)
if [[ $(kubectl config current-context) == *"prod"* ]]; then
  export PS1="\[\e[31m\]🔴 PROD\[\e[0m\] \w $ "
else
  export PS1="\[\e[32m\]🟢 dev\[\e[0m\] \w $ "
fi
```
🏥 Disaster Recovery: The Plan You Hope You Never Need
RPO and RTO Explained (Simply)
RPO (Recovery Point Objective) = How much data can you lose?
"If the database is restored from backup, how old is that backup?"
RPO = 0: No data loss (synchronous replication)
RPO = 1 hour: You might lose up to 1 hour of data
RPO = 24 hours: Daily backups, worst case lose a full day
RTO (Recovery Time Objective) = How quickly must you recover?
"How long can the service be down?"
RTO = 0: Instant failover (active-active)
RTO = 1 hour: Warm standby, automated failover
RTO = 24 hours: Cold standby, manual restoration
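A concrete way to use these numbers: check your actual backup cadence against the RPO target, because worst-case data loss is the full interval between backups. A trivial sketch with illustrative values:

```python
def meets_rpo(backup_interval_hours: float, rpo_hours: float) -> bool:
    """Worst-case data loss equals the full backup interval."""
    return backup_interval_hours <= rpo_hours

print(meets_rpo(backup_interval_hours=24, rpo_hours=1))    # daily backups, 1h RPO → False
print(meets_rpo(backup_interval_hours=0.25, rpo_hours=1))  # 15-min backups, 1h RPO → True
```

The RTO check is the same shape, except the only honest input is a *measured* recovery time from a real drill — which is the point of the disaster story below.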
DR Strategies Ranked
From most to least cost and complexity:
Active-Active (Multi-Region) 💰💰💰💰💰
Both regions serve traffic. Instant failover.
RPO: 0, RTO: ~0
Use for: Payment processing, critical APIs
Active-Passive (Hot Standby) 💰💰💰
Standby region ready, switch on failure.
RPO: minutes, RTO: < 1 hour
Use for: Main customer-facing services
Warm Standby 💰💰
Minimal infrastructure in DR region.
RPO: hours, RTO: < 4 hours
Use for: Internal tools, non-critical services
Backup/Restore 💰
Backups only, rebuild from scratch.
RPO: hours-days, RTO: hours-days
Use for: Dev environments, archival data
🚨 Real-World Disaster #3: The DR Plan That Was Never Tested
What Happened: Company had a "disaster recovery plan" in a SharePoint document written 2 years ago. When an Azure region experienced a significant outage, they pulled out the DR plan. It referenced:
- A resource group that had been deleted
- A script written for az CLI v2.38 (they were running v2.56)
- A recovery process that assumed manual steps from an employee who left the company
- DNS records that had been changed 6 months ago
Recovery took 14 hours instead of the documented 2 hours.
The Fix: Test your DR plan regularly.
DR Testing Cadence:
Monthly: Table-top exercise (walk through the plan)
Quarterly: Partial failover test (one service)
Annually: Full DR drill (simulate complete region failure)
After every test:
→ Update the runbook with findings
→ Fix any automation that broke
→ Time the recovery and compare to RTO
📉 Toil Reduction: Automate the Boring Stuff
Toil = manual, repetitive operational work that scales with the size of the system and provides no lasting value.
Toil examples:
🔄 Manually restarting pods that OOMKill
🔄 Manually scaling nodes before expected traffic
🔄 Manually rotating secrets every 90 days
🔄 Manually approving deployments by looking at a dashboard
🔄 Manually creating namespaces for new services
Not toil (even if boring):
📝 Writing postmortems (creates lasting value)
🏗️ Building automation (one-time effort)
📊 Reviewing SLO dashboards (decision-making)
The Toil Budget Rule
Google's SRE book recommends: No more than 50% of an SRE's time should be toil. If it's higher, you're not doing engineering — you're doing operations with a fancier title.
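The 50% cap is easy to track if you tag toil (manual restarts, manual scaling, ticket churn) separately in your time tracking. A trivial sketch:

```python
def toil_fraction(toil_hours: float, total_hours: float) -> float:
    """Fraction of working time spent on toil; keep it under 0.5."""
    return toil_hours / total_hours

# 22 hours of toil in a 40-hour week blows the 50% budget
assert toil_fraction(22, 40) > 0.5
```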
🎯 Key Takeaways
- SLOs are contracts with yourself — pick the right number, not the highest number
- Error budgets turn reliability debates into math — you can't argue with math
- Blameless postmortems are how organizations learn (blame makes people hide problems)
- Chaos engineering starts small — game days before automated chaos in production
- Test your DR plan or it's not a plan, it's a wish
- Toil above 50% means you're doing ops, not engineering
🔥 Homework
- Pick your most important service. Write an SLO for it (availability + latency). Calculate the error budget.
- Look at your on-call incidents from last month. How many were repeat issues? Those are automation opportunities.
- When was the last time your DR plan was tested? If "never" or "I don't know" — schedule one.
Next up in the series: **From 10x Developer to 10x Multiplier: Surviving the Lead/Principal Glow-Up** — where we decode the mindset shift from writing code to enabling organizations.
💬 What's the best (or worst) postmortem you've ever participated in? Did it lead to real change? Share below — I want to hear the stories that made organizations better. 📝