Our checkout service went down at 2 AM on a Friday. By the time we got it back, we'd lost six hours of sales during a promotional weekend. The technical cause was database connection pool exhaustion. The real cause was that we had no runbook for this failure mode, no clear escalation path, and monitoring that didn't alert until users were already seeing errors.
The postmortem was brutal. But it was also the turning point. We stopped treating incidents as fires to extinguish and started treating reliability as an engineering discipline.
This article covers what we learned: how to define SLOs that actually matter, write runbooks that get used, run incidents without chaos, and conduct postmortems that prevent recurrence.
## SLOs, SLIs, and Error Budgets: Getting Them Right

Most teams either have no SLOs or have fake ones — numbers picked out of thin air that don't connect to user experience or engineering decisions.
Good SLOs change how you prioritize work. If your error budget is healthy, ship features. If it's burning, focus on reliability.
### SLIs: What You Actually Measure
A Service Level Indicator (SLI) reflects user experience, not server health.
Bad SLIs:
- CPU utilization
- Memory usage
- Number of pods running
Good SLIs:
- Request success rate (non-5xx / total)
- Request latency at p99
- Data freshness
Availability SLI:

```promql
# Share of requests that did not return a 5xx
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
```

Latency SLI:

```promql
# Share of requests served within 200ms
sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
```
### SLOs: Your Target

| SLO | Allowed Downtime (30-Day Month) | Allowed Errors (per 1M Requests) |
|---|---|---|
| 99% | 7.2 hours | 10,000 |
| 99.5% | 3.6 hours | 5,000 |
| 99.9% | 43.2 minutes | 1,000 |
| 99.95% | 21.6 minutes | 500 |
| 99.99% | 4.3 minutes | 100 |
Our approach:
- Checkout / Auth: 99.95%, p99 < 200ms
- Internal APIs: 99.9%, p99 < 500ms
- Batch systems: 99.5%
Start lower. Tighten later.
### SLOs Based on User Journeys

Users care about journeys, not services.

```yaml
journeys:
  - name: checkout
    slo:
      availability: 99.95%
      latency_p99: 3s
    components:
      - api-gateway
      - auth-service
      - cart-service
      - inventory-service
      - payments-service
      - order-service
    measurement:
      endpoint: /api/v1/checkout/health
      rum_event: checkout_completed
```
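As a complement to the RUM event, a scheduled synthetic probe can exercise the same journey health endpoint and record an SLI sample. This is a sketch, not code from our system: the command name, the `slo.checkout_health_url` config key, and logging as the metrics sink are assumptions; the HTTP client and console command APIs are standard Laravel.

```php
<?php

namespace App\Console\Commands;

use Illuminate\Console\Command;
use Illuminate\Http\Client\ConnectionException;
use Illuminate\Support\Facades\Http;
use Illuminate\Support\Facades\Log;

class ProbeCheckoutJourney extends Command
{
    protected $signature = 'slo:probe-checkout';
    protected $description = 'Synthetic availability/latency check for the checkout journey';

    public function handle(): int
    {
        $start = microtime(true);

        try {
            // Same endpoint the journey definition above points at (hypothetical config key)
            $response = Http::timeout(5)->get(config('slo.checkout_health_url'));
            $latency  = microtime(true) - $start;
            $ok = $response->successful() && $latency <= 3.0;
        } catch (ConnectionException) {
            $latency = microtime(true) - $start;
            $ok = false;
        }

        // Emit a structured event; a metrics pipeline would turn these into SLI samples.
        Log::info('slo.checkout_probe', ['ok' => $ok, 'latency_s' => $latency]);

        return $ok ? self::SUCCESS : self::FAILURE;
    }
}
```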
### Error Budgets: Making SLOs Actionable

Your error budget is the unreliability the SLO permits: at 99.95%, 0.05% of requests (about 21.6 minutes per month) can fail before the objective is broken. The budget feeds a policy that dictates what happens as it burns down:

```yaml
thresholds:
  - budget_remaining: 50%
    actions:
      - notify: slack
  - budget_remaining: 25%
    actions:
      - freeze: non_critical_deployments
  - budget_remaining: 10%
    actions:
      - freeze: all_deployments
      - meeting: reliability_review
  - budget_remaining: 0%
    actions:
      - focus: reliability_only
```
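To make those thresholds concrete, here is a minimal sketch of the arithmetic behind `budget_remaining`; the function name and example numbers are illustrative, not from our codebase.

```php
<?php

/**
 * Fraction of the error budget still unspent for the current window.
 * Budget = (1 - SLO target) * total requests, i.e. the failures you are allowed.
 */
function errorBudgetRemaining(float $sloTarget, int $totalRequests, int $failedRequests): float
{
    $allowedFailures = (1 - $sloTarget) * $totalRequests;

    if ($allowedFailures <= 0) {
        return 0.0;
    }

    return max(0.0, 1 - $failedRequests / $allowedFailures);
}

// Example: 99.95% target, 10M requests this month, 3,000 of them failed.
// The budget is 5,000 failures; 3,000 are spent, so 40% remains,
// which is already past the 50% "notify" threshold in the policy above.
$remaining = errorBudgetRemaining(0.9995, 10_000_000, 3_000); // 0.4
```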
### Implementing SLO Monitoring

Each SLO is defined declaratively: an objective plus an error query and a total query, templated over the evaluation window.

```yaml
slos:
  - name: requests-availability
    objective: 99.95
    sli:
      events:
        error_query: sum(rate(http_requests_total{status=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total[{{.window}}]))
```
## Writing Runbooks That Actually Get Used
Good runbooks are:
- Scannable
- Actionable
- Tested
### Runbook Structure

```markdown
# Runbook: Payments API High Error Rate

## Detection
Alert: PaymentsAPIHighErrorRate

## Step 1: Check provider status
curl -s https://status.stripe.com/api/v2/summary.json

## Step 2: Check recent deploys
kubectl rollout history deployment/payments-api

## Step 3: Roll back if needed
kubectl rollout undo deployment/payments-api

## Step 4: Check DB pool
curl http://payments-api/debug/metrics | grep db_pool
```
### Testing Runbooks (PHP / Laravel Example)

```php
<?php

use Tests\TestCase;

// Illustrative test class; the simulate/fetch/scale helpers are project-specific test utilities.
class RunbookTest extends TestCase
{
    public function testDatabaseConnectionExhaustionRunbook(): void
    {
        // Simulate connection exhaustion
        $this->simulateDbPoolExhaustion();

        // Verify the alert condition is reached
        $metrics = $this->fetchMetrics('/debug/metrics');
        $this->assertLessThan(5, $metrics['db_pool_available']);

        // Apply the mitigation from the runbook
        $this->scaleServiceReplicas(10);

        // Verify recovery
        $this->assertTrue($this->serviceRecovered());
    }
}
```
## Incident Response: A Structured Approach

### Severity Levels

- SEV1: Complete outage or data loss
- SEV2: Major degradation
- SEV3: Minor impact
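One way to make that ladder machine-readable, sketched with illustrative names (the enum and the paging rule are not from our tooling), so automation like the IncidentBot below can branch on severity:

```php
<?php

// Illustrative only: encode the severity ladder so incident tooling can branch on it.
enum Severity: string
{
    case Sev1 = 'SEV1'; // complete outage or data loss
    case Sev2 = 'SEV2'; // major degradation
    case Sev3 = 'SEV3'; // minor impact

    // Assumed rule: SEV1/SEV2 page the on-call immediately, SEV3 becomes a ticket.
    public function pagesOnCall(): bool
    {
        return $this !== self::Sev3;
    }
}
```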
### Incident Roles
- Incident Commander — coordinates
- Tech Lead — debugs
- Comms Lead — communicates
### Incident Channel Template

```text
🔴 INCIDENT: Checkout Errors
Severity: SEV2
Impact: Success rate 82%

Roles:
  IC: @alice
  Tech: @bob
  Comms: @carol

Timeline:
  14:32 Alert fired
  14:40 Stripe returning 503s
  14:45 Circuit breaker engaged
  15:15 Resolved
```
### Incident Automation (PHP)

```php
<?php

class IncidentBot
{
    public function declareIncident(array $data): Incident
    {
        $incident = Incident::create([
            'title'    => $data['title'],
            'severity' => $data['severity'],
            'status'   => 'investigating',
        ]);

        // Open the coordination channel and page the on-call
        $this->createSlackChannel($incident);
        $this->notifyPagerDuty($incident);

        return $incident;
    }

    public function resolveIncident(Incident $incident): void
    {
        $incident->update(['status' => 'resolved']);

        // Every resolved incident gets a postmortem scheduled
        $this->schedulePostmortem($incident);
    }
}
```
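A possible way to hook the bot into chat, sketched under assumptions: the route path, payload fields, and `App\Services\IncidentBot` namespace are hypothetical; only Laravel's routing and container injection are standard APIs.

```php
<?php

// routes/api.php (hypothetical): declare an incident from a Slack slash command.
use App\Services\IncidentBot;          // assumed namespace for the class above
use Illuminate\Http\Request;
use Illuminate\Support\Facades\Route;

Route::post('/slack/incident', function (Request $request, IncidentBot $bot) {
    $incident = $bot->declareIncident([
        'title'    => $request->input('title'),
        'severity' => $request->input('severity', 'SEV3'),
    ]);

    return response()->json(['incident_id' => $incident->id]);
});
```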
## Blameless Postmortems
Don’t ask: Who caused this?
Ask: What allowed this?
### Postmortem Template

```markdown
## Summary
Checkout degraded for 43 minutes.

## Root Cause
Circuit breaker threshold too high.

## Action Items
| Action | Owner | Deadline |
|--------|-------|----------|
| Lower threshold | @bob | Jan 22 |
| Add alert | @alice | Jan 23 |
```
### Tracking Action Items (PHP)

```php
<?php

class ActionItemTracker
{
    public function weeklyDigest(): void
    {
        // Group overdue items by owner and remind each owner
        $overdue = ActionItem::overdue()->get()->groupBy('owner');

        foreach ($overdue as $owner => $items) {
            $this->notifyOwner($owner, $items);
        }
    }
}
```
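The digest only helps if it actually runs; a sketch of wiring it into Laravel's console scheduler, where the Monday 09:00 slot and the `App\Support\ActionItemTracker` namespace are arbitrary choices for the example.

```php
<?php

namespace App\Console;

use App\Support\ActionItemTracker;   // assumed namespace for the tracker above
use Illuminate\Console\Scheduling\Schedule;
use Illuminate\Foundation\Console\Kernel as ConsoleKernel;

class Kernel extends ConsoleKernel
{
    protected function schedule(Schedule $schedule): void
    {
        // Remind owners about overdue postmortem action items every Monday morning.
        $schedule->call(fn () => app(ActionItemTracker::class)->weeklyDigest())
                 ->weeklyOn(1, '09:00');
    }
}
```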
## Measuring Reliability Improvement

| Metric | Before | After |
|---|---|---|
| MTTR | 4 hours | 35 minutes |
| Repeat incidents | 4 per quarter | 1 per quarter |
| Error budget remaining | 12% | 58% |
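MTTR here is just the mean time from incident start to resolution. A sketch of computing it from the `Incident` model used earlier, assuming hypothetical `started_at` and `resolved_at` timestamp attributes:

```php
<?php

use App\Models\Incident;   // assumed namespace for the model used by IncidentBot

// Mean time to recovery, in minutes, across all resolved incidents.
$mttrMinutes = Incident::whereNotNull('resolved_at')
    ->get()
    ->avg(fn ($incident) =>
        ($incident->resolved_at->getTimestamp() - $incident->started_at->getTimestamp()) / 60
    );
```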
## Conclusion
Reliability is a discipline:
- SLOs tell you when things are wrong
- Runbooks help you fix them
- Incident roles prevent chaos
- Postmortems prevent recurrence
Key takeaways:
- Define SLOs around journeys
- Use error budgets to guide decisions
- Write actionable runbooks
- Test runbooks regularly
- Keep incidents structured
- Keep postmortems blameless
- Track action items relentlessly