Our checkout service went down at 2 AM on a Friday. By the time we got it back, we'd lost six hours of sales during a promotional weekend. The technical cause was database connection pool exhaustion. The real cause was that we had no runbook for this failure mode, no clear escalation path, and monitoring that didn't alert until users were already seeing errors.
The postmortem was brutal. But it was also the turning point. We stopped treating incidents as fires to extinguish and started treating reliability as an engineering discipline.
This article covers what we learned: how to define SLOs that actually matter, write runbooks that get used, run incidents without chaos, and conduct postmortems that prevent recurrence.
## SLOs, SLIs, and Error Budgets: Getting Them Right

Most teams either have no SLOs or have fake ones — numbers picked out of thin air that don't connect to user experience or engineering decisions.
Good SLOs change how you prioritize work. If your error budget is healthy, ship features. If it's burning, focus on reliability.
### SLIs: What You Actually Measure
A Service Level Indicator (SLI) reflects user experience, not server health.
Bad SLIs:
- CPU utilization
- Memory usage
- Number of pods running
Good SLIs:
- Request success rate (non-5xx / total)
- Request latency at p99
- Data freshness
Availability SLI:

```promql
# Share of requests that did not return a 5xx
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
```

Latency SLI:

```promql
# Share of requests served within 200ms
sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
```
### SLOs: Your Target

| SLO | Allowed Downtime (30-Day Month) | Allowed Errors (per 1M Requests) |
|---|---|---|
| 99% | 7.2 hours | 10,000 |
| 99.5% | 3.6 hours | 5,000 |
| 99.9% | 43.2 minutes | 1,000 |
| 99.95% | 21.6 minutes | 500 |
| 99.99% | 4.3 minutes | 100 |
Our approach:
- Checkout / Auth: 99.95%, p99 < 200ms
- Internal APIs: 99.9%, p99 < 500ms
- Batch systems: 99.5%
Start lower. Tighten later.
### SLOs Based on User Journeys

Users care about journeys, not services.

```yaml
journeys:
  - name: checkout
    slo:
      availability: 99.95%
      latency_p99: 3s
    components:
      - api-gateway
      - auth-service
      - cart-service
      - inventory-service
      - payments-service
      - order-service
    measurement:
      endpoint: /api/v1/checkout/health
      rum_event: checkout_completed
```
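As a complement to the RUM event, a scheduled synthetic probe can exercise the same journey health endpoint and record an SLI sample. This is a sketch, not code from our system: the command name, the `slo.checkout_health_url` config key, and logging as the metrics sink are assumptions; the HTTP client and console command APIs are standard Laravel.

```php
<?php

namespace App\Console\Commands;

use Illuminate\Console\Command;
use Illuminate\Http\Client\ConnectionException;
use Illuminate\Support\Facades\Http;
use Illuminate\Support\Facades\Log;

class ProbeCheckoutJourney extends Command
{
    protected $signature = 'slo:probe-checkout';
    protected $description = 'Synthetic availability/latency check for the checkout journey';

    public function handle(): int
    {
        $start = microtime(true);

        try {
            // Same endpoint the journey definition above points at (hypothetical config key)
            $response = Http::timeout(5)->get(config('slo.checkout_health_url'));
            $latency  = microtime(true) - $start;
            $ok = $response->successful() && $latency <= 3.0;
        } catch (ConnectionException) {
            $latency = microtime(true) - $start;
            $ok = false;
        }

        // Emit a structured event; a metrics pipeline would turn these into SLI samples.
        Log::info('slo.checkout_probe', ['ok' => $ok, 'latency_s' => $latency]);

        return $ok ? self::SUCCESS : self::FAILURE;
    }
}
```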
### Error Budgets: Making SLOs Actionable

Your error budget is the unreliability the SLO permits: at 99.95%, 0.05% of requests (about 21.6 minutes per month) can fail before the objective is broken. The budget feeds a policy that dictates what happens as it burns down:

```yaml
thresholds:
  - budget_remaining: 50%
    actions:
      - notify: slack
  - budget_remaining: 25%
    actions:
      - freeze: non_critical_deployments
  - budget_remaining: 10%
    actions:
      - freeze: all_deployments
      - meeting: reliability_review
  - budget_remaining: 0%
    actions:
      - focus: reliability_only
```
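To make those thresholds concrete, here is a minimal sketch of the arithmetic behind `budget_remaining`; the function name and example numbers are illustrative, not from our codebase.

```php
<?php

/**
 * Fraction of the error budget still unspent for the current window.
 * Budget = (1 - SLO target) * total requests, i.e. the failures you are allowed.
 */
function errorBudgetRemaining(float $sloTarget, int $totalRequests, int $failedRequests): float
{
    $allowedFailures = (1 - $sloTarget) * $totalRequests;

    if ($allowedFailures <= 0) {
        return 0.0;
    }

    return max(0.0, 1 - $failedRequests / $allowedFailures);
}

// Example: 99.95% target, 10M requests this month, 3,000 of them failed.
// The budget is 5,000 failures; 3,000 are spent, so 40% remains,
// which is already past the 50% "notify" threshold in the policy above.
$remaining = errorBudgetRemaining(0.9995, 10_000_000, 3_000); // 0.4
```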
### Implementing SLO Monitoring

Each SLO is defined declaratively: an objective plus an error query and a total query, templated over the evaluation window.

```yaml
slos:
  - name: requests-availability
    objective: 99.95
    sli:
      events:
        error_query: sum(rate(http_requests_total{status=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total[{{.window}}]))
```
## Writing Runbooks That Actually Get Used
Good runbooks are:
- Scannable
- Actionable
- Tested
### Runbook Structure

```markdown
# Runbook: Payments API High Error Rate

## Detection
Alert: PaymentsAPIHighErrorRate

## Step 1: Check provider status
curl -s https://status.stripe.com/api/v2/summary.json

## Step 2: Check recent deploys
kubectl rollout history deployment/payments-api

## Step 3: Roll back if needed
kubectl rollout undo deployment/payments-api

## Step 4: Check DB pool
curl http://payments-api/debug/metrics | grep db_pool
```
### Testing Runbooks (PHP / Laravel Example)

```php
<?php

use Tests\TestCase;

// Illustrative test class; the simulate/fetch/scale helpers are project-specific test utilities.
class RunbookTest extends TestCase
{
    public function testDatabaseConnectionExhaustionRunbook(): void
    {
        // Simulate connection exhaustion
        $this->simulateDbPoolExhaustion();

        // Verify the alert condition is reached
        $metrics = $this->fetchMetrics('/debug/metrics');
        $this->assertLessThan(5, $metrics['db_pool_available']);

        // Apply the mitigation from the runbook
        $this->scaleServiceReplicas(10);

        // Verify recovery
        $this->assertTrue($this->serviceRecovered());
    }
}
```
## Incident Response: A Structured Approach

### Severity Levels

- SEV1: Complete outage or data loss
- SEV2: Major degradation
- SEV3: Minor impact
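One way to make that ladder machine-readable, sketched with illustrative names (the enum and the paging rule are not from our tooling), so automation like the IncidentBot below can branch on severity:

```php
<?php

// Illustrative only: encode the severity ladder so incident tooling can branch on it.
enum Severity: string
{
    case Sev1 = 'SEV1'; // complete outage or data loss
    case Sev2 = 'SEV2'; // major degradation
    case Sev3 = 'SEV3'; // minor impact

    // Assumed rule: SEV1/SEV2 page the on-call immediately, SEV3 becomes a ticket.
    public function pagesOnCall(): bool
    {
        return $this !== self::Sev3;
    }
}
```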
### Incident Roles
- Incident Commander — coordinates
- Tech Lead — debugs
- Comms Lead — communicates
### Incident Channel Template

```text
🔴 INCIDENT: Checkout Errors
Severity: SEV2
Impact: Success rate 82%

Roles:
  IC: @alice
  Tech: @bob
  Comms: @carol

Timeline:
  14:32 Alert fired
  14:40 Stripe returning 503s
  14:45 Circuit breaker engaged
  15:15 Resolved
```
### Incident Automation (PHP)

```php
<?php

class IncidentBot
{
    public function declareIncident(array $data): Incident
    {
        $incident = Incident::create([
            'title'    => $data['title'],
            'severity' => $data['severity'],
            'status'   => 'investigating',
        ]);

        // Open the coordination channel and page the on-call
        $this->createSlackChannel($incident);
        $this->notifyPagerDuty($incident);

        return $incident;
    }

    public function resolveIncident(Incident $incident): void
    {
        $incident->update(['status' => 'resolved']);

        // Every resolved incident gets a postmortem scheduled
        $this->schedulePostmortem($incident);
    }
}
```
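A possible way to hook the bot into chat, sketched under assumptions: the route path, payload fields, and `App\Services\IncidentBot` namespace are hypothetical; only Laravel's routing and container injection are standard APIs.

```php
<?php

// routes/api.php (hypothetical): declare an incident from a Slack slash command.
use App\Services\IncidentBot;          // assumed namespace for the class above
use Illuminate\Http\Request;
use Illuminate\Support\Facades\Route;

Route::post('/slack/incident', function (Request $request, IncidentBot $bot) {
    $incident = $bot->declareIncident([
        'title'    => $request->input('title'),
        'severity' => $request->input('severity', 'SEV3'),
    ]);

    return response()->json(['incident_id' => $incident->id]);
});
```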
## Blameless Postmortems
Don’t ask: Who caused this?
Ask: What allowed this?
### Postmortem Template

```markdown
## Summary
Checkout degraded for 43 minutes.

## Root Cause
Circuit breaker threshold too high.

## Action Items
| Action | Owner | Deadline |
|--------|-------|----------|
| Lower threshold | @bob | Jan 22 |
| Add alert | @alice | Jan 23 |
```
### Tracking Action Items (PHP)

```php
<?php

class ActionItemTracker
{
    public function weeklyDigest(): void
    {
        // Group overdue items by owner and remind each owner
        $overdue = ActionItem::overdue()->get()->groupBy('owner');

        foreach ($overdue as $owner => $items) {
            $this->notifyOwner($owner, $items);
        }
    }
}
```
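The digest only helps if it actually runs; a sketch of wiring it into Laravel's console scheduler, where the Monday 09:00 slot and the `App\Support\ActionItemTracker` namespace are arbitrary choices for the example.

```php
<?php

namespace App\Console;

use App\Support\ActionItemTracker;   // assumed namespace for the tracker above
use Illuminate\Console\Scheduling\Schedule;
use Illuminate\Foundation\Console\Kernel as ConsoleKernel;

class Kernel extends ConsoleKernel
{
    protected function schedule(Schedule $schedule): void
    {
        // Remind owners about overdue postmortem action items every Monday morning.
        $schedule->call(fn () => app(ActionItemTracker::class)->weeklyDigest())
                 ->weeklyOn(1, '09:00');
    }
}
```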
## Measuring Reliability Improvement

| Metric | Before | After |
|---|---|---|
| MTTR | 4 hours | 35 minutes |
| Repeat incidents | 4 per quarter | 1 per quarter |
| Error budget remaining | 12% | 58% |
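MTTR here is just the mean time from incident start to resolution. A sketch of computing it from the `Incident` model used earlier, assuming hypothetical `started_at` and `resolved_at` timestamp attributes:

```php
<?php

use App\Models\Incident;   // assumed namespace for the model used by IncidentBot

// Mean time to recovery, in minutes, across all resolved incidents.
$mttrMinutes = Incident::whereNotNull('resolved_at')
    ->get()
    ->avg(fn ($incident) =>
        ($incident->resolved_at->getTimestamp() - $incident->started_at->getTimestamp()) / 60
    );
```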
## Conclusion
Reliability is a discipline:
- SLOs tell you when things are wrong
- Runbooks help you fix them
- Incident roles prevent chaos
- Postmortems prevent recurrence
Key takeaways:
- Define SLOs around journeys
- Use error budgets to guide decisions
- Write actionable runbooks
- Test runbooks regularly
- Keep incidents structured
- Keep postmortems blameless
- Track action items relentlessly