Postmortem Framework
Incidents are inevitable — learning from them is optional. This framework provides blameless postmortem templates, structured root cause analysis methods (5 Whys, Fishbone, Fault Tree), action item tracking with ownership and deadlines, and trend analysis dashboards that surface systemic patterns across incidents. Stop repeating the same failures and start building a genuine learning culture.
Key Features
- Blameless postmortem template — Structured Markdown template covering timeline, impact, root cause, action items, and lessons learned
- Root cause analysis toolkit — Guided worksheets for 5 Whys, Ishikawa (Fishbone), and Fault Tree Analysis methods
- Action item tracker — YAML-based tracking with assignee, priority, deadline, and completion status
- Trend analysis queries — Prometheus and Grafana configs to track incident frequency, MTTR, and category distributions
- Severity classification guide — Decision tree for consistently classifying SEV1 through SEV4
- Facilitator's guide — Step-by-step instructions for running effective postmortem meetings
- Executive summary generator — Python script that produces stakeholder-friendly summaries from postmortems
Quick Start
unzip postmortem-framework.zip && cd postmortem-framework/

# Create a new postmortem from template
python3 src/postmortem_framework/core.py create \
  --incident-id INC-2026-0142 \
  --title "Payment Service Latency Spike" \
  --severity SEV2 --output postmortems/

# Analyze trends across all postmortems
python3 src/postmortem_framework/utils.py trends \
  --input-dir postmortems/ --window 90
Architecture / How It Works
CAPTURE        →  ANALYZE      →  TRACK          →  LEARN
Timeline          5 Whys /        Action items      Trend analysis
builder           Fishbone        with owners       across all
+ template        analysis        + deadlines       postmortems
- Capture — Within 24 hours of resolution, create the postmortem doc with timeline events.
- Analyze — Use structured RCA methods to identify contributing factors.
- Track — Action items get assigned owners and deadlines. Weekly review ensures follow-through.
- Learn — Quarterly trend analysis surfaces systemic issues across incidents.
Usage Examples
Postmortem Template (YAML)
# postmortems/INC-2026-0142.yaml
incident:
  id: INC-2026-0142
  title: "Payment Service Latency Spike"
  severity: SEV2
  date: 2026-03-15
  duration_minutes: 47
  services_affected: [payment-service, checkout-ui]
impact: |
  Payment processing latency increased from p99=180ms to p99=4200ms
  for 47 minutes. ~12% of checkout attempts timed out.
timeline:
  - time: "14:23 UTC"
    event: "Deploy payment-service v2.14.3 to production"
  - time: "14:31 UTC"
    event: "Alert: payment_latency_p99 > 1s"
  - time: "14:45 UTC"
    event: "Root cause: missing DB index on new query path"
  - time: "14:52 UTC"
    event: "Rollback to v2.14.2 initiated"
  - time: "15:10 UTC"
    event: "Latency normal, incident resolved"
root_cause:
  method: five_whys
  analysis:
    - why: "Payment latency spiked to 4.2s"
      because: "New query path did full table scan on orders"
    - why: "Full table scan"
      because: "Missing composite index on (user_id, created_at)"
    - why: "Missing index not caught"
      because: "No query plan review in deployment checklist"
action_items:
  - id: AI-001
    action: "Add composite index on orders(user_id, created_at)"
    owner: alice@example.com
    priority: P1
    deadline: 2026-03-17
    status: completed
  - id: AI-002
    action: "Add EXPLAIN ANALYZE step to deployment checklist"
    owner: bob@example.com
    priority: P2
    deadline: 2026-03-22
    status: in_progress
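As a quick consistency check on a postmortem like the one above, the timeline can be used to recompute the incident duration and the time to detection. This is a minimal sketch, not part of the framework: the helper names are hypothetical, the entries are plain dicts rather than parsed YAML, and it assumes every event falls on the same day.

```python
from datetime import datetime

# Timeline entries copied from the example postmortem.
timeline = [
    {"time": "14:23 UTC", "event": "Deploy payment-service v2.14.3 to production"},
    {"time": "14:31 UTC", "event": "Alert: payment_latency_p99 > 1s"},
    {"time": "14:45 UTC", "event": "Root cause: missing DB index on new query path"},
    {"time": "14:52 UTC", "event": "Rollback to v2.14.2 initiated"},
    {"time": "15:10 UTC", "event": "Latency normal, incident resolved"},
]


def parse_time(value: str) -> datetime:
    """Parse an 'HH:MM UTC' stamp (assumes all events fall on one day)."""
    return datetime.strptime(value, "%H:%M UTC")


def timeline_duration_minutes(entries) -> int:
    """Minutes between the earliest and latest timeline entries."""
    times = [parse_time(e["time"]) for e in entries]
    return int((max(times) - min(times)).total_seconds() // 60)


def minutes_to_detect(entries) -> int:
    """Minutes from the first entry to the first event starting with 'Alert'."""
    start = parse_time(entries[0]["time"])
    alert = next(
        parse_time(e["time"]) for e in entries if e["event"].startswith("Alert")
    )
    return int((alert - start).total_seconds() // 60)


duration = timeline_duration_minutes(timeline)   # compare with duration_minutes
detection = minutes_to_detect(timeline)
```

For this example the recomputed duration matches the `duration_minutes: 47` field, and the alert fired 8 minutes after the deploy.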
Incident Trend Prometheus Rules
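groups:
  - name: incident_trends
    rules:
      # Number of incidents resolved in the last 30 days
      - record: incidents:total:count_30d
        expr: count(incident_resolved_timestamp > (time() - 30*24*3600))
      # Mean time to resolve, in minutes, restricted to the same 30-day window
      - record: incidents:mttr:avg_30d
        expr: avg((incident_resolved_timestamp - incident_created_timestamp) and (incident_resolved_timestamp > (time() - 30*24*3600))) / 60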
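To make the intent of the two recording rules concrete, here is the same arithmetic in Python: count the incidents resolved in the last 30 days and average their time to resolution in minutes. It is a sketch under assumptions: `incident_stats_30d` is a hypothetical helper, the input is a list of `(created_ts, resolved_ts)` Unix timestamps, and the 30-day cutoff is applied to both the count and the average.

```python
import time

DAY_SECONDS = 24 * 3600


def incident_stats_30d(incidents, now=None):
    """Return (count, average MTTR in minutes) for incidents resolved in the last 30 days.

    `incidents` is a list of (created_ts, resolved_ts) Unix timestamps.
    Returns (0, None) when nothing was resolved inside the window.
    """
    now = time.time() if now is None else now
    cutoff = now - 30 * DAY_SECONDS
    recent = [(created, resolved) for created, resolved in incidents if resolved > cutoff]
    if not recent:
        return 0, None
    avg_minutes = sum(resolved - created for created, resolved in recent) / len(recent) / 60
    return len(recent), avg_minutes


# Made-up data: two incidents inside the window, one 40 days old.
now = 100 * DAY_SECONDS
incidents = [
    (now - 10 * DAY_SECONDS - 47 * 60, now - 10 * DAY_SECONDS),
    (now - 5 * DAY_SECONDS - 13 * 60, now - 5 * DAY_SECONDS),
    (now - 40 * DAY_SECONDS - 10 * 60, now - 40 * DAY_SECONDS),
]
count, mttr = incident_stats_30d(incidents, now=now)
```

Passing `now` explicitly keeps the calculation reproducible, which is handy when checking dashboard numbers against a known set of incidents.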
Trend Analysis
from postmortem_framework.utils import TrendAnalyzer
analyzer = TrendAnalyzer(postmortem_dir="postmortems/")
report = analyzer.analyze(window_days=90)
print(f"Incidents (90d): {report.total_incidents} | MTTR: {report.avg_mttr_minutes:.0f}m")
print(f"Top service: {report.top_service}")
print(f"Action completion: {report.action_completion_rate:.0%}")
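The report fields printed above boil down to a few aggregations over the postmortem files. The sketch below shows one plausible way to compute them; it is not the framework's `TrendAnalyzer`. `summarize` is a hypothetical function, it takes already-loaded dicts shaped like the YAML example, it uses `duration_minutes` as a stand-in for time to resolution, and it skips the date-window filtering for brevity.

```python
from collections import Counter


def summarize(postmortems):
    """Aggregate a list of postmortem dicts shaped like the YAML example."""
    services = Counter()
    durations = []
    total_actions = 0
    completed_actions = 0
    for pm in postmortems:
        incident = pm["incident"]
        services.update(incident.get("services_affected", []))
        durations.append(incident["duration_minutes"])
        for item in pm.get("action_items", []):
            total_actions += 1
            if item["status"] == "completed":
                completed_actions += 1
    return {
        "total_incidents": len(postmortems),
        "avg_duration_minutes": sum(durations) / len(durations) if durations else None,
        "top_service": services.most_common(1)[0][0] if services else None,
        "action_completion_rate": (
            completed_actions / total_actions if total_actions else None
        ),
    }


# The first record mirrors the example postmortem; the second is invented.
sample = [
    {
        "incident": {
            "duration_minutes": 47,
            "services_affected": ["payment-service", "checkout-ui"],
        },
        "action_items": [{"status": "completed"}, {"status": "in_progress"}],
    },
    {
        "incident": {"duration_minutes": 13, "services_affected": ["payment-service"]},
        "action_items": [{"status": "completed"}],
    },
]
report = summarize(sample)
```

Returning `None` for empty inputs avoids a division by zero and makes an empty window distinguishable from a genuine 0% completion rate.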
Configuration
# config.example.yaml
postmortem:
  template: templates/postmortem.yaml
  output_dir: postmortems/
  require_rca: true
  require_action_items: true
  max_days_to_complete: 5
severity_classification:
  sev1:
    criteria: "Revenue-impacting, >10% users affected, or data loss"
    postmortem_required: true
  sev2:
    criteria: "Degraded service, 1-10% users affected"
    postmortem_required: true
  sev3:
    criteria: "Minor impact, workaround available"
    postmortem_required: optional
action_tracking:
  review_cadence: weekly
  stale_threshold_days: 14
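The `criteria` strings in this config are prose, so a team that wants consistent classification may prefer to encode them as an explicit decision function. The following is one possible reading of the three criteria, not the framework's classifier: the function name and its parameters are hypothetical, and cases the example config does not cover are returned as `UNCLASSIFIED` rather than guessed.

```python
def classify_severity(pct_users_affected, revenue_impacting=False,
                      data_loss=False, workaround_available=False):
    """Map incident facts to a severity label using the example criteria.

    SEV1: revenue-impacting, more than 10% of users affected, or data loss.
    SEV2: degraded service with 1-10% of users affected.
    SEV3: minor impact with a workaround available.
    """
    if revenue_impacting or data_loss or pct_users_affected > 10:
        return "SEV1"
    if pct_users_affected >= 1:
        return "SEV2"
    if workaround_available:
        return "SEV3"
    # The example config gives no criteria for this case; leave it to a human.
    return "UNCLASSIFIED"


# postmortem_required values copied from the example config.
POSTMORTEM_REQUIRED = {"SEV1": True, "SEV2": True, "SEV3": "optional"}

severity = classify_severity(pct_users_affected=5)
needs_postmortem = POSTMORTEM_REQUIRED.get(severity)
```

Checking the most severe conditions first means an incident that satisfies several criteria always gets the highest applicable level.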
Best Practices
- File the completed postmortem within 3 business days (start the draft and timeline within 24 hours of resolution), because memories fade fast
- Blameless means blameless — focus on systems, never individuals
- Require at least one preventive action item — preventing recurrence is the goal
- Track action item completion — if items go unfinished, the process is theater
- Review trends quarterly — individual postmortems fix incidents, trends fix systems
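Tracking completion can start with something as small as a weekly check for open items past their deadline. The sketch below illustrates the idea and is not the framework's tracker: `overdue_items` is a hypothetical helper, and deadlines are given as ISO 8601 strings (a YAML loader may hand back `date` objects instead, in which case the parsing step can be dropped).

```python
from datetime import date

# Action items copied from the example postmortem, trimmed to the fields used here.
action_items = [
    {"id": "AI-001", "deadline": "2026-03-17", "status": "completed"},
    {"id": "AI-002", "deadline": "2026-03-22", "status": "in_progress"},
]


def overdue_items(items, today):
    """Return IDs of items that are not completed and whose deadline is before `today`."""
    return [
        item["id"]
        for item in items
        if item["status"] != "completed"
        and date.fromisoformat(item["deadline"]) < today
    ]


overdue = overdue_items(action_items, today=date(2026, 3, 25))
```

With the review date set to 2026-03-25, only AI-002 is reported; AI-001 is past its deadline but already completed.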
Troubleshooting
Timeline builder shows no deploy events
Ensure your CI/CD writes to the expected deploy log path. Check deploy_log_path in config.
Trend analysis returns empty results
Verify postmortem YAML files have incident.date in ISO 8601 format. Run with --verbose to see parse warnings.
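To find the offending files before rerunning the analysis, the date strings can be checked with the standard library. This is a minimal sketch: `invalid_dates` is a hypothetical helper that takes the raw `incident.date` values as strings and returns the ones that do not parse as ISO 8601 calendar dates.

```python
from datetime import date


def invalid_dates(values):
    """Return the strings that date.fromisoformat() rejects."""
    bad = []
    for value in values:
        try:
            date.fromisoformat(value)
        except ValueError:
            bad.append(value)
    return bad


# One valid ISO 8601 date and two common non-ISO spellings.
bad = invalid_dates(["2026-03-15", "03/15/2026", "March 15, 2026"])
```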
Action item reminders not sending
Check that notify_owners: true is set and SMTP is configured. The reminder runs as a cron job — verify with crontab -l.
This is 1 of 7 resources in the SRE Platform Pro toolkit. Get the complete Postmortem Framework with all files, templates, and documentation for $19.
Or grab the entire SRE Platform Pro bundle (7 products) for $89 — save 30%.