# MTTR Is a Lagging Indicator
Everyone tracks Mean Time to Resolve (MTTR). Few understand what actually drives it.

MTTR isn't one metric; it's the sum of four phases:
```
MTTR = MTTD + MTTA + MTTI + MTTF
```

- **MTTD**: Mean Time to Detect (monitoring fired)
- **MTTA**: Mean Time to Acknowledge (human engaged)
- **MTTI**: Mean Time to Identify (root cause found)
- **MTTF**: Mean Time to Fix (resolution deployed; not the traditional "Mean Time to Failure")
Optimizing MTTR means optimizing each phase independently.
## Lever 1: Faster Detection (MTTD)
Most teams' MTTD is dominated by alert delay and check intervals.
```yaml
# Before: 5-minute check interval + 10-minute alerting threshold
# MTTD: up to 15 minutes
# After: 1-minute check interval + 3-minute threshold
# MTTD: up to 4 minutes
check_interval: 60s   # was 300s
alert_threshold: 3m   # was 10m

# But: use a multi-window approach to avoid false positives
alert_rules:
  - name: high_error_rate
    rules:
      # Fast burn: fires quickly for severe issues
      - expr: error_rate > 10% for 2m
        severity: critical
      # Slow burn: fires eventually for gradual issues
      - expr: error_rate > 2% for 15m
        severity: warning
```
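Why "up to 4 minutes"? In the worst case a failure begins just after a check runs and must then persist for the full alerting threshold, so detection delay is bounded by interval plus threshold. A quick sanity check of the numbers above:

```python
def worst_case_mttd_seconds(check_interval_s: int, alert_threshold_s: int) -> int:
    """Upper bound on detection delay: the failure can start right after
    a check completes, then must persist for the whole alert threshold."""
    return check_interval_s + alert_threshold_s

before = worst_case_mttd_seconds(300, 600)  # 900s = 15 min
after = worst_case_mttd_seconds(60, 180)    # 240s = 4 min
```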
## Lever 2: Faster Acknowledgment (MTTA)
The gap between alert and human engagement is often 5-15 minutes.
Fix: multi-channel notification with automatic escalation:

- **0 min**: push notification + Slack
- **3 min**: phone call (if not acknowledged)
- **6 min**: escalate to the secondary on-call
- **10 min**: escalate to the engineering manager
Our MTTA dropped from 8 minutes to 2 minutes.
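An escalation ladder like this is simple enough to encode as data. A minimal sketch in Python; the step timings mirror the list above, but the action names are made up:

```python
# (minutes after alert, action) pairs; action names are hypothetical.
ESCALATION_STEPS = [
    (0, "push_and_slack"),
    (3, "phone_primary_oncall"),
    (6, "page_secondary_oncall"),
    (10, "page_engineering_manager"),
]

def actions_due(minutes_since_alert: float, acknowledged: bool) -> list[str]:
    """Return every escalation action that should have fired by now.
    Acknowledging the alert stops the ladder."""
    if acknowledged:
        return []
    return [action for t, action in ESCALATION_STEPS if minutes_since_alert >= t]
```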
## Lever 3: Better Context in Alerts
The fastest way to reduce MTTI is giving the responder context upfront:
```yaml
# Bad alert
alert: HighLatency
message: "API latency is high"
```

```yaml
# Good alert
alert: HighLatency
message: |
  API p99 latency: 2.3s (threshold: 500ms)
  Started: 5 minutes ago
  Recent deploy: v2.4.1 by @jane (12 min ago)
  Related alerts: DB connection pool at 90%
  Runbook: https://wiki.internal/runbooks/api-latency
  Dashboard: https://grafana.internal/d/api-golden-signals
```
This single change reduced our MTTI by 40%.
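None of that context needs to be typed by hand. A sketch of the enrichment step, assuming hypothetical helpers have already fetched recent deploys and related alerts; every name here is illustrative, not from our stack:

```python
def enrich_alert(alert: dict, recent_deploys: list[dict], related_alerts: list[str]) -> str:
    """Render an alert message with the context a responder needs up front."""
    lines = [
        f"{alert['name']}: {alert['value']} (threshold: {alert['threshold']})",
        f"Started: {alert['started_ago']} ago",
    ]
    if recent_deploys:  # the most recent deploy is always the top suspect
        d = recent_deploys[0]
        lines.append(f"Recent deploy: {d['version']} by {d['author']} ({d['age']} ago)")
    lines += [f"Related alert: {r}" for r in related_alerts]
    lines.append(f"Runbook: {alert['runbook']}")
    return "\n".join(lines)
```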
## Lever 4: Runbook Links in Every Alert
Every alert should link to a runbook. No exceptions.
```yaml
# We enforce this with a CI check
alert_lint_rules:
  - every_alert_must_have:
      - runbook_url
      - dashboard_url
      - severity_label
      - team_label
```
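The check itself is only a few lines. A sketch assuming alert definitions are parsed into dicts; the field names mirror the rule file above, and `lint_alerts` is a made-up name:

```python
REQUIRED_FIELDS = ("runbook_url", "dashboard_url", "severity_label", "team_label")

def lint_alerts(alerts: list[dict]) -> list[str]:
    """Return one error message per missing field; an empty list means CI passes."""
    errors = []
    for alert in alerts:
        for field in REQUIRED_FIELDS:
            if not alert.get(field):
                errors.append(f"{alert.get('name', '<unnamed>')}: missing {field}")
    return errors
```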
## Lever 5: Pre-Built Investigation Queries
Don't make the on-call engineer write queries at 3am. Pre-build them:
```sql
-- Runbook: API Latency High (quick investigation)

-- 1. Check which endpoints are slow
SELECT endpoint, p99_latency_ms
FROM metrics
WHERE service = 'api' AND timestamp > NOW() - INTERVAL '30 min'
ORDER BY p99_latency_ms DESC
LIMIT 10;

-- 2. Check for recent deploys
SELECT service, version, deployed_by, deployed_at
FROM deployments
WHERE deployed_at > NOW() - INTERVAL '2 hours'
ORDER BY deployed_at DESC;

-- 3. Check dependency health
SELECT dependency, avg_latency_ms, error_rate
FROM dependency_health
WHERE service = 'api' AND timestamp > NOW() - INTERVAL '30 min';
```
## Lever 6: Rollback Speed
The fastest fix is often a rollback. Optimize for it:
```shell
# Target: rollback in < 2 minutes

# Before: manual rollback (~8 minutes)
git revert HEAD
git push
# ...then wait for CI/CD (5+ minutes)

# After: one-command rollback (~90 seconds)
rollback api-service --to previous
# Pre-built image, skip CI, just deploy
```
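What makes the one-command version fast is that it never rebuilds: each deploy records its image tag, so rolling back is just redeploying the previous tag. A sketch under that assumption, where `deploy_image` stands in for whatever your platform's deploy call is:

```python
def rollback(service: str, deploy_history: list[str], deploy_image) -> str:
    """Redeploy the previous image tag. deploy_history is newest-first,
    e.g. ["v2.4.1", "v2.4.0", ...]; no build step, so this takes seconds."""
    if len(deploy_history) < 2:
        raise RuntimeError(f"{service}: no previous version to roll back to")
    previous = deploy_history[1]
    deploy_image(service, previous)  # pre-built image: deploy only, skip CI
    return previous
```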
## Lever 7: Post-Incident Automation
Every incident should produce:
- Automatic timeline from alert and Slack data
- Auto-generated post-mortem template
- Suggested action items based on root cause category
```python
def generate_postmortem(incident_id):
    alerts = get_alerts_in_window(incident_id)
    slack_messages = get_incident_channel_messages(incident_id)
    deploys = get_deploys_in_window(incident_id)
    return {
        'timeline': build_timeline(alerts, slack_messages, deploys),
        'impact': calculate_impact(incident_id),
        'suggested_category': classify_root_cause(alerts),
        'similar_past_incidents': find_similar(incident_id),
        'template': fill_postmortem_template(incident_id),
    }
```
## Our Results
| Phase | Before | After | Improvement |
|---|---|---|---|
| MTTD | 12 min | 3 min | 75% |
| MTTA | 8 min | 2 min | 75% |
| MTTI | 25 min | 8 min | 68% |
| MTTF | 18 min | 6 min | 67% |
| Total MTTR | 63 min | 19 min | 70% |
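Because MTTR is just the sum of the four phases, the table is easy to sanity-check:

```python
before = {"MTTD": 12, "MTTA": 8, "MTTI": 25, "MTTF": 18}  # minutes
after = {"MTTD": 3, "MTTA": 2, "MTTI": 8, "MTTF": 6}

mttr_before = sum(before.values())  # 63 minutes
mttr_after = sum(after.values())    # 19 minutes
improvement_pct = round(100 * (1 - mttr_after / mttr_before))  # 70
```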
If you want to optimize every phase of your incident response with AI, check out what we're building at Nova AI Ops.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com