DevHelm

Posted on May 27 • Edited on Jun 2 • Originally published at devhelm.io

MTTR Full Form: Meaning, Formula, and How to Reduce It

#guides #reliability

Ask three SREs what MTTR stands for and you'll get three answers. Mean Time To Recovery. Mean Time To Repair. Mean Time To Respond. Sometimes Resolve. They are not the same metric. They measure different parts of an incident, they imply different ownership boundaries, and conflating them is the single most common reason an engineering team's "MTTR is improving" slide is meaningless. This guide pins down the MTTR full form, walks through the formula with real numbers, distinguishes MTTR from MTBF, MTTF, MTTA, and MTTD, and ends with the six changes that actually reduce the number — including one that most monitoring tools cannot do at all.

What MTTR stands for

The letters MTTR cover four metrics with the same acronym and very different definitions:

Acronym	Full form	What it measures	Typical owner
MTTR	Mean Time To Recovery	From the moment the service degrades to the moment it is fully back to normal, averaged across incidents	SRE / on-call
MTTR	Mean Time To Repair	From the moment an engineer starts working on a fix to the moment the fix is deployed	Engineering
MTTR	Mean Time To Respond	From alert to first human acknowledgement	On-call rotation
MTTR	Mean Time To Resolve	From incident creation to the incident being closed in the tracker (includes paperwork)	Incident commander

The version you almost always want, and the one used in the SRE literature and the Google SRE Workbook, is Mean Time To Recovery: the customer-visible downtime metric. Respond and Repair are sub-stages inside it; Resolve runs past it, because the tracker ticket usually closes after service is already restored. When you see a public number like "MTTR is 8 minutes" without further qualification, assume Recovery. When a vendor pitches you a 30-second MTTR, ask which one they mean. It is almost certainly Mean Time To Respond, which is the easiest to game.

In this guide, MTTR means Mean Time To Recovery unless explicitly noted.

The MTTR formula

The formula is simple: MTTR = total downtime / number of incidents.

A worked example. In May 2026 a hypothetical SaaS had four customer-impacting incidents:

Incident	Detected	Recovered	Duration
Stripe webhook backlog	02:14	02:51	37 min
Postgres failover	09:22	09:34	12 min
OpenAI 429 spike	14:01	14:48	47 min
CDN cert expiry	18:30	18:34	4 min
Total			100 min

MTTR = 100 min / 4 incidents = 25 minutes.
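The same arithmetic can be reproduced in a few lines of Python. This is a minimal sketch, not a tracking tool: the `incidents` list and the helper names are illustrative, and the times are the ones from the table, assumed to fall on a single day.

```python
from datetime import datetime

# (name, detected, recovered) copied from the worked example above.
incidents = [
    ("Stripe webhook backlog", "02:14", "02:51"),
    ("Postgres failover", "09:22", "09:34"),
    ("OpenAI 429 spike", "14:01", "14:48"),
    ("CDN cert expiry", "18:30", "18:34"),
]


def duration_minutes(start: str, end: str) -> float:
    """Minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60


def mttr_minutes(rows) -> float:
    """MTTR = total downtime / number of incidents."""
    durations = [duration_minutes(start, end) for _, start, end in rows]
    return sum(durations) / len(durations)


print(mttr_minutes(incidents))  # 25.0
```

In a real tracker you would store full timestamps with dates and time zones; the HH:MM strings here only mirror the table.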

Two warnings before you start tracking this.

Warning 1: averages hide the long tail. A team with one 4-hour incident and ten 5-minute incidents has an MTTR of about 26 minutes (290 minutes / 11 incidents), nearly identical to a team with eleven 25-minute incidents. The first team has a tail problem; the second has a baseline problem. Always look at MTTR alongside the p95 incident duration and the count of incidents over 1 hour.
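To see how the average hides the tail, the two teams can be compared side by side. A rough sketch: the `p95` helper uses the nearest-rank method, which is only one of several percentile conventions and is an assumption here, as are the function names.

```python
import math

team_a = [240] + [5] * 10   # one 4-hour incident, ten 5-minute incidents
team_b = [25] * 11          # eleven 25-minute incidents


def p95(durations):
    """95th percentile by the nearest-rank method."""
    ordered = sorted(durations)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def summarize(durations):
    """Mean duration, p95 duration, and count of incidents over one hour."""
    return {
        "mttr": round(sum(durations) / len(durations), 1),
        "p95": p95(durations),
        "over_1h": sum(1 for d in durations if d > 60),
    }


print(summarize(team_a))  # {'mttr': 26.4, 'p95': 240, 'over_1h': 1}
print(summarize(team_b))  # {'mttr': 25.0, 'p95': 25, 'over_1h': 0}
```

The two averages land within a minute and a half of each other, while the p95 and the over-one-hour count make the difference between the teams obvious.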

Warning 2: the start time is contested. Customer-visible downtime starts when the first customer experiences a problem, not when your alert fires. If your detection has a 2-minute lag and your MTTR is 8 minutes from alert, the customer-visible MTTR is 10 minutes. Most teams quietly start the clock at alert time because it makes the number smaller. Don't.

MTTR vs MTBF, MTTF, MTTA, and MTTD

The reliability metric family is large and the acronyms overlap enough that even seasoned SREs slip up. Here's the canonical lineup:

Acronym	Full form	What it measures	Typical good value (SaaS)
MTBF	Mean Time Between Failures	Average uptime between two consecutive incidents	Weeks to months
MTTF	Mean Time To Failure	Average uptime of a component before it fails (used for non-repairable parts)	Years (hardware), N/A for services
MTTD	Mean Time To Detect	From the moment a problem starts to the moment it's detected	< 1 minute
MTTA	Mean Time To Acknowledge	From alert fired to first human acknowledgement	< 5 minutes
MTTR	Mean Time To Recovery	From problem start to service restored	< 30 minutes

MTTD and MTTA are not separate from MTTR (Recovery); they are its first two stages, so adding all three together would double-count them. The shorthand is MTTD + MTTA + the time spent on repair = MTTR (Recovery). If you split it that way, you can see which sub-stage is dragging the number, and they almost always need different fixes.
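The split into detection, acknowledgement, and repair can be sketched as follows. The `Incident` fields and the sample numbers are made up for illustration, and the sketch assumes that "detected" and "alert fired" are the same moment, which may not hold in your tooling.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # All values are minutes since the problem started (illustrative numbers).
    detected: float      # alert fired
    acknowledged: float  # first human acknowledgement
    recovered: float     # service restored


def stage_means(incidents):
    """Average each stage separately, plus the end-to-end recovery time."""
    n = len(incidents)
    mttd = sum(i.detected for i in incidents) / n
    mtta = sum(i.acknowledged - i.detected for i in incidents) / n
    repair = sum(i.recovered - i.acknowledged for i in incidents) / n
    mttr = sum(i.recovered for i in incidents) / n
    return {"mttd": mttd, "mtta": mtta, "repair": repair, "mttr": mttr}


sample = [Incident(2, 6, 30), Incident(1, 3, 12), Incident(3, 12, 48)]
stages = stage_means(sample)
print(stages)  # {'mttd': 2.0, 'mtta': 5.0, 'repair': 23.0, 'mttr': 30.0}
```

In this made-up sample, repair accounts for 23 of the 30 minutes, so faster detection alone would barely move the number.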

For an internal dashboard, track all five and let each have its own threshold. The teams that ship the fastest improvements treat MTTD as a detection problem, MTTA as an alerting and routing problem, and MTTR-the-repair as an engineering practice problem.

What's a good MTTR?

There is no industry-wide answer. The right MTTR depends on what your service does, how many customers it has, and what you're willing to pay to move the number. A few benchmarks:

DORA elite performers (State of DevOps Report, 2023): MTTR under 1 hour. Low performers: more than 1 week.
Google SRE book: target depends on the service's error budget. A 99.9% SLO over a 30-day window allows about 43 minutes of downtime (0.1% of 43,200 minutes); your MTTR per incident must fit inside what's left of that budget.
Atlassian incident management benchmark (source): teams at the median run an MTTR around 4 hours. Top quartile is under 30 minutes.
Banking and trading platforms: regulators sometimes mandate MTTR thresholds tied to capital reserves. 5 minutes is not unusual for top-tier financial services.
Internal-only B2B SaaS with no real-time SLA: 1-4 hours is acceptable.

Use these as anchors, not as targets. The right MTTR for your service comes from your SLO and your customer impact model, not from a benchmark.
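The error-budget arithmetic behind the 99.9% example can be checked with a short sketch. The function names are hypothetical, and the model is deliberately crude: it treats every incident as lasting exactly the average.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)


def incidents_that_fit(slo: float, mttr_minutes: float, window_days: int = 30) -> int:
    """How many incidents of average length `mttr_minutes` the budget absorbs."""
    return int(error_budget_minutes(slo, window_days) // mttr_minutes)


print(round(error_budget_minutes(0.999), 1))  # 43.2
print(incidents_that_fit(0.999, 25))          # 1
```

With a 25-minute MTTR, a 99.9% SLO leaves room for a single incident per 30-day window, which is why the target has to come from your SLO and not from a benchmark.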

How to actually reduce MTTR

Six changes consistently move the number. They're listed in order of how quickly they pay off for a small team that's just starting to take this seriously.

1. Write runbooks for the top five recurring incidents

Most incidents repeat. If you've fought the same Postgres failover behavior twice this quarter, you'll fight it again next quarter. Pick the five most common incident types in your tracker and write a runbook for each. The runbook needs five sections: trigger, symptoms, diagnosis steps, mitigation, verification. Aim for 200-400 words per runbook. Store them where on-call can find them in 30 seconds at 3 AM — not in a wiki you have to log into, not in a Notion folder buried three clicks deep. We keep ours in the same git repository as the service code under runbooks/, linked directly from every alert.

2. Tune alert routing to skip people who can't act

If a Stripe webhook backlog wakes up the whole team but only one person knows how to clear it, every other pager-recipient adds noise without adding hands. Route severity-2 alerts to a primary; promote to a wider group only if the primary doesn't acknowledge in 5 minutes. The MTTA drops, the MTTR drops, the team stops resenting on-call.
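The routing rule is small enough to write down. This is a sketch of the decision only, assuming a hypothetical `recipients` function and made-up names; a real paging tool would express the same thing in its own escalation policy.

```python
def recipients(minutes_since_alert: float, acknowledged: bool,
               primary: str, wider_group: list[str],
               escalate_after: float = 5.0) -> list[str]:
    """Who should be paged right now for a severity-2 alert."""
    if acknowledged:
        return []                     # someone owns it; stop paging
    if minutes_since_alert < escalate_after:
        return [primary]              # only the person who can act
    return [primary, *wider_group]    # primary is silent: widen the page


print(recipients(2, False, "alice", ["bob", "carol"]))  # ['alice']
print(recipients(6, False, "alice", ["bob", "carol"]))  # ['alice', 'bob', 'carol']
```

The 5-minute default mirrors the threshold above; tune it against your own MTTA target.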

3. Make rollback the cheap option

If your rollback path takes 30 minutes and your forward-fix path takes 20 minutes, every incident becomes "let's try to fix it forward." Forward-fixes under pressure produce more incidents. Cut your rollback path to under 5 minutes and the first response to most incidents becomes "roll back, then debug calmly." Every CI/CD platform supports this — most teams just don't drill it.

4. Run blameless post-mortems and actually follow the action items

A post-mortem that produces an action-items list which nobody owns is theatre. Assign each action item to a single named person with a due date, track them in your issue tracker, and report on closed-vs-open in your weekly engineering review. The action items from last quarter's incidents are the cheapest way to reduce next quarter's MTTR — they're already pre-prioritized by the fact that the incident hurt enough to investigate.

5. Detect dependency failures before your monitors do

This is the change most monitoring tools cannot make for you — and it's one of the key differences when comparing monitoring platforms. A substantial share of customer-visible SaaS incidents — many teams report something near a third of them — are caused by an upstream dependency degradation: Stripe slows down, OpenAI rate-limits, your CDN edge has a regional issue, your database provider has a partial outage. When that happens, your own monitors fire — but you spend the first 15 minutes deciding whose problem it is. Status pages that watch your vendors and surface their incidents alongside your own check failures collapse that diagnostic window. We'll come back to this in the next section.
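The "whose problem is it?" check can be approximated by hand. The sketch below is a hypothetical heuristic, not a description of any product's behavior: it assumes you keep a set of dependencies per monitor and have a feed of vendor incident timestamps, and it flags vendors that posted an incident within a window on either side of the alert.

```python
from datetime import datetime, timedelta


def likely_vendor_causes(alert_time, monitor_dependencies, vendor_incidents,
                         window=timedelta(minutes=30)):
    """Vendors this monitor depends on that posted an incident near the alert."""
    return sorted({
        vendor
        for vendor, posted_at in vendor_incidents
        if vendor in monitor_dependencies
        and abs(alert_time - posted_at) <= window
    })


alert = datetime(2026, 5, 12, 14, 3)
feed = [
    ("openai", datetime(2026, 5, 12, 14, 1)),  # posted two minutes before the alert
    ("stripe", datetime(2026, 5, 11, 9, 0)),   # unrelated incident the day before
]
print(likely_vendor_causes(alert, {"openai", "stripe"}, feed))  # ['openai']
```

The window looks both ways because vendors often post their status update after your own monitors have already fired.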

6. Schedule game days

A team that has never practiced an incident response will respond slowly. A team that runs a chaos exercise once a quarter — kill the database primary, simulate a CDN outage, page someone who isn't on-call — recovers from real incidents noticeably faster. The investment is one engineering day per quarter. The payback is measured in customer hours.

How DevHelm reduces MTTR

DevHelm is built around one specific MTTR problem: incidents where the root cause is a vendor your service depends on, not your service itself. When Stripe degrades, your checkout monitor fires. Your billing-webhook monitor fires. Your subscription-sync monitor fires. Three pages. Three Slack pings. Twenty minutes of "is anyone seeing the Stripe dashboard?" before someone confirms that Stripe themselves posted a status update.

DevHelm shortens that pattern in two concrete ways today.

1. Vendor status pages, watched continuously. We aggregate 100+ vendor status pages — GitHub, AWS, Cloudflare, Datadog, and every major dependency. Each page has its own incident feed; you can subscribe each one to the same Slack channel your own monitors page to, so an AWS degradation lands in your incident channel within seconds of AWS posting it, alongside your monitor alerts. That collapses the diagnostic loop: instead of fifteen minutes of "is it the dependency or is it us?", the answer is in the same channel.

2. Resource groups, for collapsing self-noise. When several of your monitors share a single failure mode (e.g. all of them call Stripe), you put them in a resource group with a single notification policy. One incident, not three. The recovery clock keeps ticking until Stripe themselves recover, but your on-call isn't paged three times for one root cause. The YAML for the on-call-friendly version looks like:

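# devhelm.yml
# Monitors that share one failure mode: all of them call Stripe.
resourceGroups:
  - name: stripe-fleet
    monitors:
      - checkout-page-uptime
      - billing-webhook-uptime
      - subscription-sync-uptime
# A single notification policy matched to those same monitors.
notificationPolicies:
  - name: stripe-fleet-sev1
    matchRules:
      - type: monitor_id_in
        monitorNames:
          - checkout-page-uptime
          - billing-webhook-uptime
          - subscription-sync-uptime
    escalation:
      steps:
        # Only escalation step: notify the on-call Slack channel.
        - channels: [oncall-slack]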

The piece DevHelm doesn't yet do automatically is the cross-correlation step — when Stripe goes down, your in-flight monitor alerts don't get auto-marked as "probably caused by that vendor" without you wiring up the resource group + Slack subscription yourself. That auto-correlation is on the roadmap. Until it ships, the manual wire-up is the work — but the work pays back at every dependency incident, which is a meaningful slice of total incidents for most modern SaaS teams.

The other action items in the previous section — runbooks, alert routing, rollback discipline, post-mortem follow-through, game days — DevHelm can't do for you. But it can give you back the time you currently spend correlating vendor outages by hand. If you're troubleshooting a specific failure mode right now, the DNS resolution guide and the vendor status feeds (GitHub, AWS, Cloudflare, and 100+ more) are useful starting points.

If your last incident was a vendor problem and your team spent the first 20 minutes figuring out whose problem it was, that diagnostic loop is the cheapest fix target. Spin up a free account at app.devhelm.io and connect your first vendor in 60 seconds — no credit card.
