
Sepehr Mohseni


Incident response & blameless post-mortems: writing better runbooks and SLO/SLI definitions

Our checkout service went down at 2 AM on a Friday. By the time we got it back up, we'd lost six hours of sales during a promotional weekend. The technical cause was database connection pool exhaustion. The real cause was that we had no runbook for this failure mode, no clear escalation path, and monitoring that didn't alert until users were already seeing errors.

The postmortem was brutal. But it was also the turning point. We stopped treating incidents as fires to extinguish and started treating reliability as an engineering discipline.

This article covers what we learned: how to define SLOs that actually matter, write runbooks that get used, run incidents without chaos, and conduct postmortems that prevent recurrence.


SLOs, SLIs, and Error Budgets: Getting Them Right

Most teams either have no SLOs or have fake ones: numbers picked out of the air that don't connect to user experience or engineering decisions.

Good SLOs change how you prioritize work. If your error budget is healthy, ship features. If it's burning, focus on reliability.

SLIs: What You Actually Measure

A Service Level Indicator (SLI) reflects user experience, not server health.

Bad SLIs:

  • CPU utilization
  • Memory usage
  • Number of pods running

Good SLIs:

  • Request success rate (non-5xx / total)
  • Request latency at p99
  • Data freshness

Availability SLI:

# Share of requests that did not return a 5xx over the last 5 minutes
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

Latency SLI (the share of requests served within the 200ms threshold; holding this above 99% keeps p99 under 200ms):

# Share of requests completing within 200ms over the last 5 minutes
sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))

SLOs: Your Target

| SLO | Allowed Downtime/Month | Allowed Errors (per 1M requests) |
|--------|--------------|--------|
| 99% | 7.2 hours | 10,000 |
| 99.5% | 3.6 hours | 5,000 |
| 99.9% | 43.2 minutes | 1,000 |
| 99.95% | 21.6 minutes | 500 |
| 99.99% | 4.3 minutes | 100 |

(Downtime figures assume a 30-day month.)
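
The arithmetic behind the table is worth internalizing. A minimal sketch in PHP, assuming a 30-day month:

// For a given SLO, how much downtime and how many failed requests
// fit inside the error budget (the math behind the table above).
function errorBudget(float $sloPercent, int $requests = 1_000_000): array
{
    $budget = 1 - $sloPercent / 100;   // e.g. 99.9% -> 0.001
    $monthMinutes = 30 * 24 * 60;      // 43,200 minutes in a 30-day month

    return [
        'downtime_minutes' => round($budget * $monthMinutes, 1),
        'allowed_errors'   => (int) round($budget * $requests),
    ];
}

print_r(errorBudget(99.9)); // downtime_minutes: 43.2, allowed_errors: 1000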

Our approach:

  • Checkout / Auth: 99.95%, p99 < 200ms
  • Internal APIs: 99.9%, p99 < 500ms
  • Batch systems: 99.5%

Start lower. Tighten later.


SLOs Based on User Journeys

Users care about journeys ("can I check out?"), not individual services.

journeys:
  - name: checkout
    slo:
      availability: 99.95%
      latency_p99: 3s
    components:
      - api-gateway
      - auth-service
      - cart-service
      - inventory-service
      - payments-service
      - order-service
    measurement:
      endpoint: /api/v1/checkout/health
      rum_event: checkout_completed
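One consequence of journey-level SLOs: serial dependencies multiply. A rough sanity check, ignoring retries, fallbacks, and correlated failures:

// A journey through six components in series is at most as available
// as the product of the components' availabilities (simplified model:
// independent failures, no retries or fallbacks).
$components = array_fill(0, 6, 0.9999); // six services at 99.99% each

printf("%.4f%%\n", array_product($components) * 100); // ~99.9400%

Under this model, six components at 99.99% each barely miss a 99.95% journey target, which is why component SLOs have to be tighter than the journey SLO.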

Error Budgets: Making SLOs Actionable

An error budget is 1 minus the SLO: at 99.95%, you can spend 0.05% of requests on failures before breaching it. We codified what happens as the budget drains:

thresholds:
  - budget_remaining: 50%
    actions:
      - notify: slack
  - budget_remaining: 25%
    actions:
      - freeze: non_critical_deployments
  - budget_remaining: 10%
    actions:
      - freeze: all_deployments
      - meeting: reliability_review
  - budget_remaining: 0%
    actions:
      - focus: reliability_only
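Enforcement can be a scheduled job rather than a human judgment call. A sketch of the threshold walk, where the remaining-budget input would come from your SLO monitoring:

// Map remaining error budget to the most severe action it has crossed.
// Return values mirror the policy above; the wiring is hypothetical.
function applyBudgetPolicy(float $remaining): string
{
    return match (true) {
        $remaining <= 0.00 => 'focus: reliability_only',
        $remaining <= 0.10 => 'freeze: all_deployments + reliability_review',
        $remaining <= 0.25 => 'freeze: non_critical_deployments',
        $remaining <= 0.50 => 'notify: slack',
        default            => 'ship features',
    };
}

echo applyBudgetPolicy(0.22); // freeze: non_critical_deployments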

Implementing SLO Monitoring

We generate Prometheus recording and alerting rules from a declarative spec (the example below resembles the format used by the open-source Sloth tool):

slos:
  - name: requests-availability
    objective: 99.95
    sli:
      events:
        error_query: sum(rate(http_requests_total{status=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total[{{.window}}]))
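The other number to derive from this is burn rate: how fast you are spending budget relative to plan. At a burn rate of 1, the budget lasts exactly the SLO window; a quick sketch of the arithmetic:

// Burn rate = observed error rate / error rate the SLO allows.
// At 14.4x, a 30-day budget is gone in about two days
// (a commonly used paging threshold).
function burnRate(float $errorRate, float $sloPercent): float
{
    return $errorRate / (1 - $sloPercent / 100);
}

echo burnRate(0.0072, 99.95); // ~14.4: page someone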

Writing Runbooks That Actually Get Used

Good runbooks are:

  • Scannable
  • Actionable
  • Tested

Runbook Structure

# Runbook: Payments API High Error Rate

## Detection
Alert: PaymentsAPIHighErrorRate

## Step 1: Check provider status
curl -s https://status.stripe.com/api/v2/summary.json

## Step 2: Check recent deploys
kubectl rollout history deployment/payments-api

## Step 3: Rollback if needed
kubectl rollout undo deployment/payments-api

## Step 4: Check DB pool
curl http://payments-api/debug/metrics | grep db_pool

Testing Runbooks (PHP / Laravel Example)

class DbPoolExhaustionRunbookTest extends TestCase
{
    public function testDatabaseConnectionExhaustionRunbook(): void
    {
        // Simulate connection exhaustion (test helper saturates the pool)
        $this->simulateDbPoolExhaustion();

        // Verify the alert condition the runbook starts from
        $metrics = $this->fetchMetrics('/debug/metrics');
        $this->assertLessThan(5, $metrics['db_pool_available']);

        // Apply the mitigation step documented in the runbook
        $this->scaleServiceReplicas(10);

        // Verify the service recovers once capacity is added
        $this->assertTrue($this->serviceRecovered());
    }
}

Incident Response: A Structured Approach

Severity Levels

SEV1: Complete outage or data loss
SEV2: Major degradation
SEV3: Minor impact

Incident Roles

  • Incident Commander — coordinates
  • Tech Lead — debugs
  • Comms Lead — communicates

Incident Channel Template

🔴 INCIDENT: Checkout Errors

Severity: SEV2
Impact: Success rate 82%

Roles:
IC: @alice
Tech: @bob
Comms: @carol

Timeline:
14:32 Alert fired
14:40 Stripe returning 503s
14:45 Circuit breaker engaged
15:15 Resolved

Incident Automation (PHP)

class IncidentBot
{
    public function declareIncident(array $data): Incident
    {
        // Incident is an Eloquent model; every incident starts as "investigating"
        $incident = Incident::create([
            'title' => $data['title'],
            'severity' => $data['severity'],
            'status' => 'investigating',
        ]);

        // Open the dedicated channel and page on-call in one step,
        // so nobody has to remember either during the adrenaline spike
        $this->createSlackChannel($incident);
        $this->notifyPagerDuty($incident);

        return $incident;
    }

    public function resolveIncident(Incident $incident): void
    {
        $incident->update(['status' => 'resolved']);

        // Resolution automatically puts a postmortem on the calendar
        $this->schedulePostmortem($incident);
    }
}
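A hypothetical invocation, e.g. from a Slack slash-command handler:

$bot = new IncidentBot();

$incident = $bot->declareIncident([
    'title'    => 'Checkout success rate below SLO',
    'severity' => 'SEV2',
]);

// ...mitigate...

$bot->resolveIncident($incident); // flips status and queues the postmortem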

Blameless Postmortems

Don’t ask: Who caused this?
Ask: What allowed this?

Postmortem Template

## Summary
Checkout degraded for 43 minutes.

## Root Cause
Circuit breaker threshold too high.

## Action Items
| Action | Owner | Deadline |
|--------|-------|----------|
| Lower threshold | @bob | Jan 22 |
| Add alert | @alice | Jan 23 |

Tracking Action Items (PHP)

class ActionItemTracker
{
    public function weeklyDigest(): void
    {
        $overdue = ActionItem::overdue()->get()->groupBy('owner');

        foreach ($overdue as $owner => $items) {
            $this->notifyOwner($owner, $items);
        }
    }
}
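One scheduler entry keeps the digest running; a sketch assuming Laravel's task scheduler:

// In app/Console/Kernel.php: nag owners about overdue action items
// every Monday at 09:00.
protected function schedule(Schedule $schedule): void
{
    $schedule->call(fn () => app(ActionItemTracker::class)->weeklyDigest())
        ->weeklyOn(1, '09:00');
}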

Measuring Reliability Improvement

| Metric | Before | After |
|--------|--------|-------|
| MTTR | 4 hours | 35 minutes |
| Repeat incidents | 4 per quarter | 1 per quarter |
| Error budget remaining | 12% | 58% |

Conclusion

Reliability is a discipline:

  • SLOs tell you when things are wrong
  • Runbooks help you fix them
  • Incident roles prevent chaos
  • Postmortems prevent recurrence

Key takeaways:

  • Define SLOs around journeys
  • Use error budgets to guide decisions
  • Write actionable runbooks
  • Test runbooks regularly
  • Keep incidents structured
  • Keep postmortems blameless
  • Track action items relentlessly
