DevHelm

Posted on • Originally published at devhelm.io

MTTA, MTTR, MTBF, MTTF — The Four Incident Metrics, Compared

Four acronyms show up in every incident management conversation: MTTA, MTTR, MTBF, and MTTF. They get jumbled together in slide decks, confused in retro discussions, and mixed up in job interviews. They measure four different things, from four different timestamps, with four different improvement levers.

This guide puts all four side by side, traces them through a single incident timeline, and answers the question that matters: which one should you track, and when?

One incident, four metrics

The cleanest way to understand the four metrics is to walk through one incident and label where each measurement starts and stops.

| Time | Event |
| --- | --- |
| 14:00:00 | Service starts returning 500 errors (failure begins) |
| 14:02:30 | Monitor fires an alert (detection) |
| 14:05:00 | On-call engineer acknowledges the page (acknowledgment) |
| 14:42:00 | Service restored, incident resolved (recovery) |
| 12 days later | Next failure occurs |

From these timestamps (MTTD, mean time to detect, is included as a fifth row because detection is the first leg of MTTR):

| Metric | Measures | Start | End | This incident |
| --- | --- | --- | --- | --- |
| MTTD | Time to detect | Failure begins (14:00) | Alert fires (14:02:30) | 2.5 min |
| MTTA | Time to acknowledge | Alert fires (14:02:30) | Engineer acks (14:05) | 2.5 min |
| MTTR | Time to recovery | Failure begins (14:00) | Recovery (14:42) | 42 min |
| MTTF | Time to next failure | Recovery (14:42) | Next failure start | ~12 days |
| MTBF | Full cycle | This failure start | Next failure start | ~12 days + 42 min |

MTTD and MTTA are the early-warning metrics — they tell you how fast you detected and responded. MTTR is the incident-duration metric — it measures total impact time. MTTF and MTBF are the reliability metrics — they measure how often failures happen.
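The arithmetic behind the table can be reproduced in a few lines. This is a minimal sketch: the calendar date is arbitrary (the timeline only gives clock times), and the variable names are illustrative rather than taken from any tool.

```python
from datetime import datetime, timedelta

# Clock times come from the timeline above; the date itself is an arbitrary assumption.
failure_start = datetime(2024, 1, 1, 14, 0, 0)
alert_fired = datetime(2024, 1, 1, 14, 2, 30)
acknowledged = datetime(2024, 1, 1, 14, 5, 0)
recovered = datetime(2024, 1, 1, 14, 42, 0)
# Following the table, the 12 days are counted from recovery to the next failure.
next_failure_start = recovered + timedelta(days=12)


def minutes(delta):
    """Convert a timedelta to minutes."""
    return delta.total_seconds() / 60


time_to_detect = minutes(alert_fired - failure_start)             # 2.5
time_to_acknowledge = minutes(acknowledged - alert_fired)         # 2.5
time_to_recovery = minutes(recovered - failure_start)             # 42.0
time_to_failure = minutes(next_failure_start - recovered)         # 12 days, in minutes
time_between_failures = minutes(next_failure_start - failure_start)

# The full cycle is uptime plus downtime for this incident.
assert time_between_failures == time_to_failure + time_to_recovery
```

These are single-incident values. Each metric becomes a "mean" only once the same interval is averaged over many incidents.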

MTTA — Mean Time To Acknowledge

MTTA measures the gap between an alert firing and a human confirming they're working on it. It's a measure of your on-call process, not your technical system.

What it captures: pager responsiveness, on-call discipline, alert routing effectiveness.

What it misses: everything after acknowledgment. A team with a 30-second MTTA and a 4-hour resolution time has a fast paging system and a slow debugging process.

Improvement levers: better alert routing (fewer false positives means alerts get trusted and acknowledged faster), escalation policies that page a backup after N minutes, on-call overlap during shift handoffs.
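The "page a backup after N minutes" lever can be sketched as a small lookup. This is an illustration only: `who_to_page`, the responder names, and the 5-minute default are hypothetical, and real paging tools configure escalation in their own way.

```python
def who_to_page(minutes_since_alert, escalation_chain, step_minutes=5):
    """Return the responder who should currently hold the page.

    The page moves one step down the chain for every `step_minutes`
    that pass without an acknowledgment, and stays with the last
    responder once the chain is exhausted.
    """
    if not escalation_chain:
        raise ValueError("escalation_chain must not be empty")
    if minutes_since_alert < 0 or step_minutes <= 0:
        raise ValueError("minutes_since_alert must be >= 0 and step_minutes > 0")
    step = int(minutes_since_alert // step_minutes)
    return escalation_chain[min(step, len(escalation_chain) - 1)]


# Hypothetical chain for illustration.
chain = ["primary", "backup", "engineering-manager"]
first = who_to_page(2.5, chain)    # still within the first 5 minutes
second = who_to_page(5, chain)     # primary did not ack in time
last = who_to_page(30, chain)      # chain exhausted, stays on the last entry
```

A shorter `step_minutes` caps the worst-case MTTA more tightly, at the cost of paging backups more often.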

DevHelm tracks confirmedAt (multi-region incident confirmation) but does not yet record a separate human acknowledgment timestamp — the acknowledgment step lives in your PagerDuty, Opsgenie, or Slack integration today.

MTTR — Mean Time To Recovery

MTTR is the most widely tracked incident metric. It measures the total elapsed time from failure start to service recovery. This is the metric your SLO error budget cares about: every minute in MTTR is a minute of downtime consumed.

The deep dive is in the MTTR full form guide, but the key point for comparison: MTTR includes detection time, acknowledgment time, diagnosis, and fix. It's an end-to-end metric, which makes it the most useful for external stakeholders but the hardest to improve because the bottleneck could be anywhere in the chain.

What it captures: total customer-facing impact time.

What it misses: failure frequency. A service with a 5-minute MTTR that fails ten times a month has a fundamentally different problem than one with a 5-minute MTTR that fails once a year.

Improvement levers: faster detection (monitoring with short check intervals), better runbooks (reduce diagnosis time), automated remediation, multi-region failover.

MTBF — Mean Time Between Failures

MTBF measures the average time from the start of one failure to the start of the next. It's the reliability metric: high MTBF means the system is stable; low MTBF means it breaks often.

The deep dive is in the MTBF full form guide. The key point here: MTBF = MTTF + MTTR. It spans the entire failure cycle.
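The identity can be checked on a small incident history. The sketch below uses made-up sample data and averages over complete failure cycles only, so the last incident's recovery time is left out because its following uptime is not yet known.

```python
from datetime import datetime
from statistics import mean

# (failure start, recovery) pairs in chronological order; made-up sample data.
incidents = [
    (datetime(2024, 3, 1, 14, 0), datetime(2024, 3, 1, 14, 42)),
    (datetime(2024, 3, 13, 14, 42), datetime(2024, 3, 13, 15, 0)),
    (datetime(2024, 3, 20, 9, 0), datetime(2024, 3, 20, 9, 30)),
]


def hours(delta):
    """Convert a timedelta to hours."""
    return delta.total_seconds() / 3600


# A cycle pairs each incident with the one that follows it.
cycles = list(zip(incidents, incidents[1:]))

mtbf = mean(hours(nxt[0] - cur[0]) for cur, nxt in cycles)  # start to next start
mttf = mean(hours(nxt[0] - cur[1]) for cur, nxt in cycles)  # recovery to next start
mttr = mean(hours(cur[1] - cur[0]) for cur, nxt in cycles)  # start to recovery

print(f"MTBF={mtbf:.1f}h  MTTF={mttf:.1f}h  MTTR={mttr:.1f}h")
```

With this data the MTTR is half an hour against an MTTF of 225 hours, so MTBF and MTTF differ by well under one percent.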

What it captures: system stability, failure frequency, whether your reliability investments are working.

What it misses: failure severity. A service with 100 sev4 flaps per month has a terrible MTBF but no real reliability problem. Filter by severity level (sev1+sev2 only) to keep MTBF meaningful.

Improvement levers: root cause elimination, dependency isolation (circuit breakers, fallbacks), better testing, capacity planning.

MTTF — Mean Time To Failure

MTTF measures the operating time between recovery and the next failure — "how long does the system run before it breaks again?"

In hardware reliability, MTTF is for non-repairable components (light bulbs, hard drives) while MTBF is for repairable systems. In software, everything is repairable, so MTTF is the uptime component of MTBF: MTTF = MTBF - MTTR.

What it captures: the same thing as MTBF minus the recovery time. For services with low MTTR (minutes), MTTF and MTBF are nearly identical.

What it misses: how long recovery took, since downtime is excluded by definition. What it does expose is fix quality: if you "fix" an incident by restarting a pod and the root cause is still there, MTTF will be short because the failure recurs quickly. MTTF rewards durable fixes.

When MTTF matters more than MTBF: when your MTTR is highly variable. If some incidents take 5 minutes and others take 5 hours, MTBF averages the downtime in, masking the variance. MTTF isolates the operating-time question from the recovery-time question.
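To see how a mean hides that kind of spread, here is a small sketch with made-up recovery times: four 5-minute incidents and one 5-hour incident.

```python
from statistics import mean, median

# Made-up recovery times in minutes: four quick fixes and one 5-hour outage.
recovery_minutes = [5, 5, 5, 5, 300]

mean_recovery = mean(recovery_minutes)      # pulled up by the single long outage
median_recovery = median(recovery_minutes)  # the typical incident
worst_recovery = max(recovery_minutes)      # the incident customers remember

print(f"mean={mean_recovery} min, median={median_recovery} min, worst={worst_recovery} min")
```

The mean of 64 minutes describes none of the five incidents, which is one reason to report the median or the worst case next to any mean-based metric when recovery times vary this much.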

When you need which one

Not every team needs all four metrics. Here's the decision framework:

| Question you're asking | Metric | Why |
| --- | --- | --- |
| "How fast do we respond to alerts?" | MTTA | Measures on-call process health |
| "How long are our customers affected?" | MTTR | Measures total incident duration |
| "How often do things break?" | MTBF | Measures failure frequency |
| "Are our fixes durable?" | MTTF | Isolates operating time from recovery |
| "Is our monitoring fast enough?" | MTTD | Measures detection lag |

Start with MTTR. Every team should track it, because it maps directly to customer impact and error budgets. The Google SRE Workbook centers its SLO framework on availability, and availability falls as cumulative downtime (the sum of every incident's recovery time) rises.
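The link between recovery time and availability can be made concrete with the standard steady-state formula, availability = MTTF / (MTTF + MTTR). The sketch below plugs in the single incident from the timeline above (12 days of uptime, 42 minutes of downtime); the function name is illustrative.

```python
def availability(mttf_minutes, mttr_minutes):
    """Steady-state availability: the fraction of each failure cycle spent up."""
    return mttf_minutes / (mttf_minutes + mttr_minutes)


mttf_minutes = 12 * 24 * 60  # 12 days of operation before the next failure
mttr_minutes = 42            # 42 minutes from failure start to recovery

cycle_availability = availability(mttf_minutes, mttr_minutes)
print(f"{cycle_availability:.4%}")
```

On these numbers the service lands between 99.75% and 99.76%. Halving MTTR or doubling MTTF moves the figure by the same amount, which is why both metrics feed the same error budget.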

Add MTBF when MTTR is stable but incidents are too frequent. If your MTTR is 15 minutes but you're having incidents three times a week, the problem isn't response speed — it's system stability. MTBF makes that visible.

Add MTTA when you suspect paging is the bottleneck. If incidents take 45 minutes to resolve but 20 of those minutes are "waiting for someone to respond," MTTA makes the on-call gap visible.

Track MTTF when you suspect fixes aren't durable. If the same incident recurs within days of being "resolved," the time to failure after those fixes will be conspicuously short, even when the overall mean still looks acceptable because it averages in the stable periods between recurrence clusters. Look at the individual intervals as well as the mean.

Common pitfalls

Averaging across severity levels. Ten 1-minute sev4 flaps and one 2-hour sev1 outage average out to an "MTTR of about 12 minutes" that hides the 2-hour sev1. Always segment metrics by severity level.
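A minimal sketch of that segmentation follows. The article does not say how long each flap lasts, so the 1-minute flap duration is an assumption chosen for illustration.

```python
from collections import defaultdict
from statistics import mean

# (severity, duration in minutes). The 1-minute flap duration is an assumption.
incidents = [("sev4", 1)] * 10 + [("sev1", 120)]

# Blended figure: every incident in one bucket.
blended_mttr = mean(duration for _, duration in incidents)

# Segmented figure: one MTTR per severity level.
durations_by_severity = defaultdict(list)
for severity, duration in incidents:
    durations_by_severity[severity].append(duration)

mttr_by_severity = {
    severity: mean(durations)
    for severity, durations in durations_by_severity.items()
}

print(f"blended: {blended_mttr:.1f} min")
print(f"by severity: {mttr_by_severity}")
```

With these assumed durations the blended number is roughly 12 minutes, while the segmented view reports 1 minute for sev4 and 120 minutes for sev1.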

Counting self-healing as incidents. If your system auto-recovers in 30 seconds, is that a "failure" for MTBF purposes? Most teams exclude incidents that resolve within the confirmation window (e.g., DevHelm's multi-region confirmation requires failures across at least 2 regions before opening an incident). If you don't exclude auto-recoveries, MTBF becomes noise.
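One way to apply that exclusion is a minimum-duration filter ahead of the MTBF calculation. The sketch below is an assumption-laden illustration: the 60-second threshold and the sample events are made up, and this is a stand-in for, not a description of, how any particular product confirms incidents.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical threshold: anything shorter is treated as a self-healing blip.
MIN_DURATION = timedelta(seconds=60)

# (start, end) pairs; made-up sample data with two short blips.
events = [
    (datetime(2024, 3, 1, 10, 0, 0), datetime(2024, 3, 1, 10, 30, 0)),
    (datetime(2024, 3, 3, 10, 0, 0), datetime(2024, 3, 3, 10, 0, 30)),
    (datetime(2024, 3, 5, 10, 0, 0), datetime(2024, 3, 5, 10, 0, 20)),
    (datetime(2024, 3, 11, 10, 0, 0), datetime(2024, 3, 11, 10, 45, 0)),
]


def mean_hours_between_starts(items):
    """MTBF in hours, measured from each failure start to the next."""
    starts = sorted(start for start, _ in items)
    if len(starts) < 2:
        raise ValueError("need at least two events to measure a gap")
    gaps = [(b - a).total_seconds() / 3600 for a, b in zip(starts, starts[1:])]
    return mean(gaps)


real_incidents = [(s, e) for s, e in events if e - s >= MIN_DURATION]

raw_mtbf = mean_hours_between_starts(events)               # blips included
filtered_mtbf = mean_hours_between_starts(real_incidents)  # blips excluded
```

On this data the raw MTBF is 80 hours and the filtered MTBF is 240 hours, so the two blips alone make the service look three times less reliable.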

Comparing MTBF across services. A payment service and a notification service have fundamentally different blast radii. Comparing their MTBF is like comparing a car engine's MTBF to a light switch's. Track each service independently.

Ignoring partial recoveries. An incident where the service is "up but slow" for 2 hours, then fully recovered, has a different MTTR depending on whether you measure to partial recovery or full recovery. Define your measurement convention and stick to it.

Where to start

If you're tracking nothing, start with MTTR. Pull the last 90 days of sev1+sev2 incidents, compute the average duration from start to resolution, and write that number down. Next month, compute it again. The trend matters more than the absolute number.
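That first measurement might look like the sketch below. The tuple layout, the severity labels, and the sample incidents are all assumptions for illustration; substitute whatever your incident tracker exports.

```python
from datetime import datetime, timedelta
from statistics import mean


def rolling_mttr_minutes(incidents, now, window_days=90, severities=("sev1", "sev2")):
    """Mean minutes from start to resolution for matching incidents in the window.

    Returns None when no incident qualifies, so "no data" is never read as zero.
    """
    cutoff = now - timedelta(days=window_days)
    durations = [
        (resolved - started).total_seconds() / 60
        for severity, started, resolved in incidents
        if severity in severities and started >= cutoff and resolved <= now
    ]
    return mean(durations) if durations else None


# (severity, started, resolved); made-up sample data.
incidents = [
    ("sev1", datetime(2024, 6, 10, 10, 0), datetime(2024, 6, 10, 11, 0)),   # counted, 60 min
    ("sev2", datetime(2024, 5, 2, 9, 0), datetime(2024, 5, 2, 9, 30)),      # counted, 30 min
    ("sev4", datetime(2024, 6, 20, 8, 0), datetime(2024, 6, 20, 8, 5)),     # wrong severity
    ("sev1", datetime(2024, 1, 15, 12, 0), datetime(2024, 1, 15, 15, 20)),  # outside window
]

mttr_now = rolling_mttr_minutes(incidents, now=datetime(2024, 6, 30))
print(f"90-day sev1+sev2 MTTR: {mttr_now} min")
```

Running the same function with next month's `now` gives the second data point, and the difference between the two is the trend.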

Once MTTR is stable, add MTBF. Together they tell you whether you're dealing with a fragile-but-fast-recovering system (invest in prevention) or a stable-but-slow-recovering system (invest in runbooks and detection speed). That diagnostic drives your reliability roadmap more than any single metric could.

Set up monitoring at app.devhelm.io — every incident records the timestamps you need for both metrics. The 30-day rolling MTTR is already in your dashboard; MTBF is a script away from the incident API.


