DEV Community

Thesius Code
Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Postmortem Framework

Postmortem Framework

Incidents are inevitable — learning from them is optional. This framework provides blameless postmortem templates, structured root cause analysis methods (5 Whys, Fishbone, Fault Tree), action item tracking with ownership and deadlines, and trend analysis dashboards that surface systemic patterns across incidents. Stop repeating the same failures and start building a genuine learning culture.

Key Features

  • Blameless postmortem template — Structured Markdown template covering timeline, impact, root cause, action items, and lessons learned
  • Root cause analysis toolkit — Guided worksheets for 5 Whys, Ishikawa (Fishbone), and Fault Tree Analysis methods
  • Action item tracker — YAML-based tracking with assignee, priority, deadline, and completion status
  • Trend analysis queries — Prometheus and Grafana configs to track incident frequency, MTTR, and category distributions
  • Severity classification guide — Decision tree for consistently classifying SEV1 through SEV4
  • Facilitator's guide — Step-by-step instructions for running effective postmortem meetings
  • Executive summary generator — Python script that produces stakeholder-friendly summaries from postmortems

Quick Start

unzip postmortem-framework.zip && cd postmortem-framework/

# Create a new postmortem from template
python3 src/postmortem_framework/core.py create \
  --incident-id INC-2026-0142 \
  --title "Payment Service Latency Spike" \
  --severity SEV2 --output postmortems/

# Analyze trends across all postmortems
python3 src/postmortem_framework/utils.py trends \
  --input-dir postmortems/ --window 90
Enter fullscreen mode Exit fullscreen mode

Architecture / How It Works

CAPTURE → ANALYZE → TRACK → LEARN
Timeline   5 Whys /    Action items   Trend analysis
builder    Fishbone    with owners    across all
+ template analysis    + deadlines    postmortems
Enter fullscreen mode Exit fullscreen mode
  1. Capture — Within 24 hours of resolution, create the postmortem doc with timeline events.
  2. Analyze — Use structured RCA methods to identify contributing factors.
  3. Track — Action items get assigned owners and deadlines. Weekly review ensures follow-through.
  4. Learn — Quarterly trend analysis surfaces systemic issues across incidents.

Usage Examples

Postmortem Template (YAML)

# postmortems/INC-2026-0142.yaml
incident:
  id: INC-2026-0142
  title: "Payment Service Latency Spike"
  severity: SEV2
  date: 2026-03-15
  duration_minutes: 47
  services_affected: [payment-service, checkout-ui]

  impact: |
    Payment processing latency increased from p99=180ms to p99=4200ms
    for 47 minutes. ~12% of checkout attempts timed out.

timeline:
  - time: "14:23 UTC"
    event: "Deploy payment-service v2.14.3 to production"
  - time: "14:31 UTC"
    event: "Alert: payment_latency_p99 > 1s"
  - time: "14:45 UTC"
    event: "Root cause: missing DB index on new query path"
  - time: "14:52 UTC"
    event: "Rollback to v2.14.2 initiated"
  - time: "15:10 UTC"
    event: "Latency normal, incident resolved"

root_cause:
  method: five_whys
  analysis:
    - why: "Payment latency spiked to 4.2s"
      because: "New query path did full table scan on orders"
    - why: "Full table scan"
      because: "Missing composite index on (user_id, created_at)"
    - why: "Missing index not caught"
      because: "No query plan review in deployment checklist"

action_items:
  - id: AI-001
    action: "Add composite index on orders(user_id, created_at)"
    owner: alice@example.com
    priority: P1
    deadline: 2026-03-17
    status: completed
  - id: AI-002
    action: "Add EXPLAIN ANALYZE step to deployment checklist"
    owner: bob@example.com
    priority: P2
    deadline: 2026-03-22
    status: in_progress
Enter fullscreen mode Exit fullscreen mode

Incident Trend Prometheus Rules

groups:
  - name: incident_trends
    rules:
      - record: incidents:total:count_30d
        expr: count(incident_resolved_timestamp > (time() - 30*24*3600))
      - record: incidents:mttr:avg_30d
        expr: avg(incident_resolved_timestamp - incident_created_timestamp) / 60
Enter fullscreen mode Exit fullscreen mode

Trend Analysis

from postmortem_framework.utils import TrendAnalyzer

analyzer = TrendAnalyzer(postmortem_dir="postmortems/")
report = analyzer.analyze(window_days=90)
print(f"Incidents (90d): {report.total_incidents} | MTTR: {report.avg_mttr_minutes:.0f}m")
print(f"Top service: {report.top_service}")
print(f"Action completion: {report.action_completion_rate:.0%}")
Enter fullscreen mode Exit fullscreen mode

Configuration

# config.example.yaml
postmortem:
  template: templates/postmortem.yaml
  output_dir: postmortems/
  require_rca: true
  require_action_items: true
  max_days_to_complete: 5

severity_classification:
  sev1:
    criteria: "Revenue-impacting, >10% users affected, or data loss"
    postmortem_required: true
  sev2:
    criteria: "Degraded service, 1-10% users affected"
    postmortem_required: true
  sev3:
    criteria: "Minor impact, workaround available"
    postmortem_required: optional

action_tracking:
  review_cadence: weekly
  stale_threshold_days: 14
Enter fullscreen mode Exit fullscreen mode

Best Practices

  • File within 3 business days — memories fade fast
  • Blameless means blameless — focus on systems, never individuals
  • Require at least one preventive action item — preventing recurrence is the goal
  • Track action item completion — if items go unfinished, the process is theater
  • Review trends quarterly — individual postmortems fix incidents, trends fix systems

Troubleshooting

Timeline builder shows no deploy events
Ensure your CI/CD writes to the expected deploy log path. Check deploy_log_path in config.

Trend analysis returns empty results
Verify postmortem YAML files have incident.date in ISO 8601 format. Run with --verbose to see parse warnings.

Action item reminders not sending
Check that notify_owners: true is set and SMTP is configured. The reminder runs as a cron job — verify with crontab -l.


This is 1 of 7 resources in the SRE Platform Pro toolkit. Get the complete [Postmortem Framework] with all files, templates, and documentation for $19.

Get the Full Kit →

Or grab the entire SRE Platform Pro bundle (7 products) for $89 — save 30%.

Get the Complete Bundle →


Related Articles

Top comments (0)