DEV Community

Samson Tanimawo
MTTR Optimization: The 7 Levers That Actually Move the Needle

MTTR Is a Lagging Indicator

Everyone tracks Mean Time to Resolve. Few understand what actually drives it.

MTTR isn't one metric — it's four:

MTTR = MTTD + MTTA + MTTI + MTTF

MTTD: Mean Time to Detect      (monitoring fired)
MTTA: Mean Time to Acknowledge (human engaged)
MTTI: Mean Time to Identify    (root cause found)
MTTF: Mean Time to Fix         (resolution deployed)

Optimizing MTTR means optimizing each phase independently. (A note on naming: MTTF here is Mean Time to Fix, not the reliability-engineering Mean Time to Failure.)
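One way to make that decomposition concrete is to track the four phase durations per incident and average them separately, so you can see which phase dominates. A minimal sketch (the class and function names are mine, and any timings you feed in would come from your own incident data):

```python
from dataclasses import dataclass

@dataclass
class IncidentPhases:
    """Per-incident phase durations, in minutes."""
    mttd: float  # detect: monitoring fired
    mtta: float  # acknowledge: human engaged
    mtti: float  # identify: root cause found
    mttf: float  # fix: resolution deployed

    @property
    def mttr(self) -> float:
        """Total resolution time is the sum of the four phases."""
        return self.mttd + self.mtta + self.mtti + self.mttf

def mean_phase_times(incidents: list[IncidentPhases]) -> dict[str, float]:
    """Average each phase independently to see which one dominates MTTR."""
    n = len(incidents)
    return {
        "MTTD": sum(i.mttd for i in incidents) / n,
        "MTTA": sum(i.mtta for i in incidents) / n,
        "MTTI": sum(i.mtti for i in incidents) / n,
        "MTTF": sum(i.mttf for i in incidents) / n,
    }
```

Averaging per phase rather than per incident is the whole point: a 70-minute MTTR hides whether you are slow to detect or slow to fix.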

Lever 1: Faster Detection (MTTD)

Most teams' MTTD is dominated by alert delay and check intervals.

# Before: 5-minute check interval + 10-minute alerting threshold
# MTTD: up to 15 minutes

# After: 1-minute check interval + 3-minute threshold
# MTTD: up to 4 minutes

check_interval: 60s  # Was 300s
alert_threshold: 3m   # Was 10m

# But! Use multi-window approach to avoid false positives:
alert_rules:
  - name: high_error_rate
    rules:
      # Fast burn: fires quickly for severe issues
      - expr: error_rate > 10% for 2m
        severity: critical
      # Slow burn: fires eventually for gradual issues  
      - expr: error_rate > 2% for 15m
        severity: warning
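The fast-burn/slow-burn pair above can be read as a tiny evaluation loop over (window, threshold, severity) rules. A sketch, assuming a hypothetical `error_rate_over(minutes)` function that returns the trailing error rate as a fraction (in practice this would be a query against your metrics backend):

```python
def evaluate_alerts(error_rate_over) -> list[dict]:
    """Multi-window evaluation: a fast window catches severe spikes
    quickly, a slow window catches gradual degradation without
    flapping on transient noise.

    `error_rate_over(minutes)` returns the average error rate
    (0.0-1.0) over the trailing window of that length.
    """
    rules = [
        # (window_minutes, threshold, severity)
        (2, 0.10, "critical"),   # fast burn: severe issues
        (15, 0.02, "warning"),   # slow burn: gradual issues
    ]
    fired = []
    for window, threshold, severity in rules:
        rate = error_rate_over(window)
        if rate > threshold:
            fired.append({"severity": severity, "window_min": window, "rate": rate})
    return fired
```

A sustained 5% error rate trips only the slow-burn warning, while a 20% spike pages immediately as critical.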

Lever 2: Faster Acknowledgment (MTTA)

The gap between alert and human engagement is often 5-15 minutes.

Fix: Multi-channel escalation with auto-escalation:

0 min:  Push notification + Slack
3 min:  Phone call (if not acknowledged)
6 min:  Escalate to secondary on-call
10 min: Escalate to engineering manager

Our MTTA dropped from 8 minutes to 2 minutes.
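That escalation ladder is just a sorted list of (delay, action) pairs, and stopping on acknowledgment is what keeps it from paging everyone. A minimal sketch (the channel names are illustrative, not a real paging API):

```python
ESCALATION_POLICY = [
    # (minutes since alert fired, action to take)
    (0, "push_and_slack"),
    (3, "phone_primary"),
    (6, "page_secondary"),
    (10, "page_manager"),
]

def next_escalation(minutes_elapsed: float, acknowledged: bool) -> list[str]:
    """Return every action due by now; stop escalating once acknowledged."""
    if acknowledged:
        return []
    return [action for delay, action in ESCALATION_POLICY
            if minutes_elapsed >= delay]
```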

Lever 3: Better Context in Alerts

The fastest way to reduce MTTI is giving the responder context upfront:

# Bad alert
alert: HighLatency
message: "API latency is high"

# Good alert
alert: HighLatency  
message: |
  API p99 latency: 2.3s (threshold: 500ms)
  Started: 5 minutes ago
  Recent deploy: v2.4.1 by @jane (12 min ago)
  Related alerts: DB connection pool at 90%
  Runbook: https://wiki.internal/runbooks/api-latency
  Dashboard: https://grafana.internal/d/api-golden-signals

This single change reduced our MTTI by 40%.
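Mechanically, enrichment means joining the alert with deploy-tracker and alert-store lookups before it reaches the pager. A sketch, with hypothetical `alert` and `lookups` dict shapes (your own alerting pipeline defines the real ones):

```python
def enrich_alert(alert: dict, lookups: dict) -> str:
    """Render an alert with the context a responder needs up front.

    `lookups` supplies recent-deploy and related-alert data; in
    practice these come from your deploy tracker and alert store
    (the keys used here are illustrative).
    """
    lines = [
        f"{alert['name']}: {alert['value']} (threshold: {alert['threshold']})",
        f"Started: {alert['started_ago']} ago",
    ]
    # Only include context sections that actually have data.
    if deploy := lookups.get("recent_deploy"):
        lines.append(f"Recent deploy: {deploy}")
    if related := lookups.get("related_alerts"):
        lines.append(f"Related alerts: {', '.join(related)}")
    lines.append(f"Runbook: {alert['runbook_url']}")
    return "\n".join(lines)
```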

Lever 4: Runbook Links in Every Alert

Every alert should link to a runbook. No exceptions.

# We enforce this with a CI check
alert_lint_rules:
  - every_alert_must_have:
      - runbook_url
      - dashboard_url
      - severity_label
      - team_label
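A lint check like this is only a few lines in whatever language your CI runs. A Python sketch (the field names mirror the rules above; `severity` and `team` stand in for the label checks):

```python
REQUIRED_FIELDS = ["runbook_url", "dashboard_url", "severity", "team"]

def lint_alerts(alerts: list[dict]) -> list[str]:
    """Return one error per missing required field.

    An empty return value means the CI check passes; any errors
    should fail the build so alerts without runbooks never ship.
    """
    errors = []
    for alert in alerts:
        for field in REQUIRED_FIELDS:
            if not alert.get(field):
                errors.append(f"{alert.get('name', '<unnamed>')}: missing {field}")
    return errors
```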

Lever 5: Pre-Built Investigation Queries

Don't make the on-call engineer write queries at 3am. Pre-build them:

# Runbook: API Latency High
## Quick Investigation

# 1. Check which endpoints are slow
SELECT endpoint, p99_latency_ms
FROM metrics
WHERE service = 'api' AND timestamp > NOW() - INTERVAL '30 min'
ORDER BY p99_latency_ms DESC
LIMIT 10;

# 2. Check for recent deploys
SELECT service, version, deployed_by, deployed_at
FROM deployments
WHERE deployed_at > NOW() - INTERVAL '2 hours'
ORDER BY deployed_at DESC;

# 3. Check dependency health
SELECT dependency, avg_latency_ms, error_rate
FROM dependency_health
WHERE service = 'api' AND timestamp > NOW() - INTERVAL '30 min';

Lever 6: Rollback Speed

The fastest fix is often a rollback. Optimize for it:

# Target: rollback in < 2 minutes

# Before: Manual rollback (8 minutes)
git revert HEAD
git push
# Wait for CI/CD (5+ minutes)

# After: One-command rollback (90 seconds)
rollback api-service --to previous
# Pre-built image, skip CI, just deploy

Lever 7: Post-Incident Automation

Every incident should produce:

  • Automatic timeline from alert and Slack data
  • Auto-generated post-mortem template
  • Suggested action items based on root cause category

def generate_postmortem(incident_id):
    alerts = get_alerts_in_window(incident_id)
    slack_messages = get_incident_channel_messages(incident_id)
    deploys = get_deploys_in_window(incident_id)

    return {
        'timeline': build_timeline(alerts, slack_messages, deploys),
        'impact': calculate_impact(incident_id),
        'suggested_category': classify_root_cause(alerts),
        'similar_past_incidents': find_similar(incident_id),
        'template': fill_postmortem_template(incident_id)
    }

Our Results

| Phase | Before | After | Improvement |
| --- | --- | --- | --- |
| MTTD | 12 min | 3 min | 75% |
| MTTA | 8 min | 2 min | 75% |
| MTTI | 25 min | 8 min | 68% |
| MTTF | 18 min | 6 min | 67% |
| **Total MTTR** | **63 min** | **19 min** | **70%** |
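As a sanity check, the totals follow directly from the phase decomposition at the top of the post:

```python
# Phase means in minutes, from the results table above.
before = {"MTTD": 12, "MTTA": 8, "MTTI": 25, "MTTF": 18}
after = {"MTTD": 3, "MTTA": 2, "MTTI": 8, "MTTF": 6}

# Total MTTR is the sum of the four phases.
total_before = sum(before.values())  # 63 min
total_after = sum(after.values())    # 19 min
improvement = 100 * (total_before - total_after) / total_before  # ~70%
```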

If you want to optimize every phase of your incident response with AI, check out what we're building at Nova AI Ops.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
