DevHelm

Posted on Jun 2 • Originally published at devhelm.io

MTBF Full Form: Mean Time Between Failures — Meaning, Formula, and When It Matters

#guides #reliability

Most reliability conversations start with uptime percentage and stop there. "We're at 99.95% availability" feels like enough, until you realize that a service with 99.95% availability could be down for about 22 minutes once a month, or for about 2 seconds every hour. Both hit the same uptime number. MTBF tells you which pattern you actually have.

What MTBF stands for

MTBF, short for Mean Time Between Failures, measures the average elapsed time from the start of one failure to the start of the next. It's a frequency metric: high MTBF means failures are rare; low MTBF means they're frequent. A service with an MTBF of 720 hours (30 days) averages one failure per month. A service with an MTBF of 24 hours averages one failure per day. Both might have the same uptime percentage if the frequent failures resolve quickly, but the operational burden is completely different.

The term originates in hardware reliability engineering — MIL-HDBK-217, published by the US Department of Defense in 1961, defined MTBF for electronic components. In software, we borrow the concept but adapt it: a "failure" is an incident that crosses a severity threshold (typically sev1 or sev2), not a hardware component burning out.

The formula

For a repairable system (which every software service is), MTBF equals the total time in the measurement window divided by the number of failures during that window. Splitting the window into uptime and downtime gives the two companion metrics:

Metric	Formula
MTBF	Total time in window / Number of failures
MTTR	Total downtime / Number of failures
MTTF	Total uptime / Number of failures (= MTBF - MTTR)

The three metrics are related: MTBF = MTTF + MTTR. MTTF is the average time the system runs between a recovery and the next failure; MTTR is the average time it takes to recover. Together they span the full cycle from one failure to the next. When recovery is fast relative to the gap between failures, MTBF and MTTF are nearly identical.

Worked example. A payment processing service runs for 30 days (720 hours). During that window, it experiences 3 incidents:

Incident	Started	Resolved	Downtime
#1	Day 4, 14:00	Day 4, 14:45	45 min
#2	Day 12, 03:20	Day 12, 04:10	50 min
#3	Day 25, 09:00	Day 25, 09:30	30 min

Total downtime: 125 minutes (2.08 hours). Total uptime: 720 - 2.08 = 717.92 hours.

Metric	Calculation	Result
MTBF	720 / 3	240 hours (10 days)
MTTR	2.08 / 3	0.69 hours (41.7 minutes)
MTTF	717.92 / 3	239.3 hours
Uptime %	717.92 / 720	99.71%

The uptime number (99.71%) tells stakeholders the service was available most of the time. The MTBF (10 days) tells the engineering team that failures happen roughly every week and a half — often enough to warrant investment in prevention, not just faster recovery.
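The arithmetic in the worked example can be reproduced in a few lines of Python. This is a minimal sketch: the function and key names are illustrative, not part of any library. It reports both the uptime-per-failure and the window-per-failure averages, which differ by exactly one MTTR.

```python
from datetime import timedelta


def hours(delta):
    return delta.total_seconds() / 3600


def reliability_summary(window, downtimes):
    """Summarize one measurement window; all durations are timedelta objects."""
    failures = len(downtimes)
    total_down = sum(downtimes, timedelta())
    total_up = window - total_down
    return {
        "uptime_hours": hours(total_up),
        "mttr_minutes": hours(total_down) * 60 / failures,
        "uptime_per_failure_hours": hours(total_up) / failures,
        "window_per_failure_hours": hours(window) / failures,
        "uptime_percent": 100 * total_up / window,
    }


# Downtime per incident, copied from the worked example's incident table.
summary = reliability_summary(
    window=timedelta(days=30),
    downtimes=[timedelta(minutes=45), timedelta(minutes=50), timedelta(minutes=30)],
)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The function assumes at least one failure in the window; with zero failures the averages are undefined and the division would fail.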

MTBF vs MTTR vs MTTF — side by side

Teams often confuse these three metrics. Here's the clean distinction:

Metric	Full form	Measures	Improves by
MTBF	Mean Time Between Failures	How often failures occur	Preventing failures (better testing, redundancy, dependency isolation)
MTTR	Mean Time To Recovery	How fast you recover	Faster detection, better runbooks, automation
MTTF	Mean Time To Failure	How long the system runs before failing	Same as MTBF — it's the operating-time component

Two less common but useful metrics:

Metric	Full form	Measures
MTTD	Mean Time To Detect	Lag between failure start and the first alert firing
MTTA	Mean Time To Acknowledge	Lag between alert firing and a human acknowledging it

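The full incident timeline runs: failure occurs -> MTTD -> alert fires -> MTTA -> responder acknowledges -> responder works the issue -> recovery. MTTR covers the whole span from failure to recovery, so it includes both MTTD and MTTA. MTBF spans the full cycle from one failure to the next.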
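The timeline turns into numbers once each incident carries four timestamps. Here is a minimal Python sketch, assuming a hypothetical IncidentTimeline record (the field names are not taken from any particular tool) and measuring MTTR from failure start to recovery, in line with the Total downtime / Number of failures formula. The alert and acknowledgement times in the sample data are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class IncidentTimeline:
    # Hypothetical record; field names are assumptions for this sketch.
    started: datetime       # failure begins
    alerted: datetime       # first alert fires
    acknowledged: datetime  # a human acknowledges the alert
    resolved: datetime      # service recovers


def mean_minutes(deltas):
    return mean(d.total_seconds() for d in deltas) / 60


def timeline_metrics(incidents):
    return {
        "mttd_min": mean_minutes(i.alerted - i.started for i in incidents),
        "mtta_min": mean_minutes(i.acknowledged - i.alerted for i in incidents),
        "mttr_min": mean_minutes(i.resolved - i.started for i in incidents),
    }


# Illustrative timestamps only.
timelines = [
    IncidentTimeline(
        started=datetime(2024, 6, 4, 14, 0),
        alerted=datetime(2024, 6, 4, 14, 3),
        acknowledged=datetime(2024, 6, 4, 14, 8),
        resolved=datetime(2024, 6, 4, 14, 45),
    ),
    IncidentTimeline(
        started=datetime(2024, 6, 12, 3, 20),
        alerted=datetime(2024, 6, 12, 3, 21),
        acknowledged=datetime(2024, 6, 12, 3, 30),
        resolved=datetime(2024, 6, 12, 4, 10),
    ),
]
metrics = timeline_metrics(timelines)
print(metrics)
```

Keeping the three averages separate shows where recovery time goes: a long MTTD points at monitoring gaps, a long MTTA at paging or on-call coverage.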

How to measure MTBF in practice

MTBF requires two inputs: a time window and a count of failures. Both are harder to pin down than they sound.

Defining "failure." Not every alert is a failure. A monitor that flaps for 30 seconds and self-recovers is an anomaly, not an incident. Most teams count only incidents above a severity threshold — sev1 and sev2 — when computing MTBF. If you include sev3 and sev4, your MTBF drops dramatically but the number stops being useful because it mixes service-impacting failures with noise.

Choosing the window. A 7-day window is too noisy — one bad week skews the number. A 365-day window smooths out seasonality but hides recent trends. The sweet spot for most teams is a rolling 30-day or 90-day window, reported weekly. If your MTBF is trending down over 4 consecutive weeks, something systemic is degrading — even if no single week looks alarming.

Handling planned maintenance. Exclude planned maintenance windows from the failure count and from operating time. A team that takes its service down for 2 hours every Sunday for database maintenance should not count those windows as "failures." If you include them, MTBF becomes a meaningless number that punishes disciplined operations.

Per-service, not fleet-wide. A fleet-wide MTBF that averages your payment service (fails once a quarter) with your notification service (fails weekly) tells you nothing actionable. Compute MTBF per service or per component. The payment team needs to know their MTBF, not the company average.
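The four rules above can be sketched in one function. This is an illustration under stated assumptions, not a reference implementation: the Incident record and its field names are hypothetical. Operating time is the reporting window minus planned maintenance; incident downtime is left in, so the result covers the full failure-to-failure cycle (MTTF + MTTR).

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record shape; adapt the fields to your incident tracker.
    service: str
    severity: int          # 1 = sev1 ... 4 = sev4
    started: datetime
    resolved: datetime
    planned: bool = False  # True for scheduled maintenance windows


def mtbf_per_service(incidents, window_start, window_end, max_severity=2):
    """Return MTBF in hours per service; services with no counted failures are omitted."""
    maintenance = defaultdict(timedelta)
    failures = defaultdict(int)
    for inc in incidents:
        if not (window_start <= inc.started < window_end):
            continue  # outside the reporting window
        if inc.planned:
            # Planned work is removed from operating time and never counted as a failure.
            maintenance[inc.service] += min(inc.resolved, window_end) - inc.started
        elif inc.severity <= max_severity:
            failures[inc.service] += 1
        # Remaining incidents (sev3/sev4 by default) are treated as noise.
    window = window_end - window_start
    return {
        service: (window - maintenance[service]).total_seconds() / 3600 / count
        for service, count in failures.items()
    }


start = datetime(2024, 6, 1)
end = start + timedelta(days=30)
history = [
    Incident("payments", 1, start + timedelta(days=4), start + timedelta(days=4, minutes=45)),
    Incident("payments", 3, start + timedelta(days=9), start + timedelta(days=9, minutes=5)),
    Incident("payments", 4, start + timedelta(days=7), start + timedelta(days=7, hours=2), planned=True),
    Incident("notifications", 2, start + timedelta(days=2), start + timedelta(days=2, minutes=10)),
    Incident("notifications", 2, start + timedelta(days=16), start + timedelta(days=16, minutes=10)),
]
result = mtbf_per_service(history, start, end)
print(result)
```

To get the rolling 30-day or 90-day view, call the function once per week with a shifted window and compare the results over time.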

What MTBF tells you that uptime percentage does not

Uptime percentage compresses all failure information into a single number. Two services can have identical uptime (99.9%) with completely different failure patterns:

Service	Uptime	Failures/month	Avg downtime per failure	MTBF
Service A	99.9%	1	43 minutes	720 hours
Service B	99.9%	10	4.3 minutes	72 hours

Service A is stable but slow to recover. Service B is fragile but recovers fast. The SLO error budget treats them identically — both consume the same 43 minutes of downtime per month. But the operational strategies are opposite: Service A needs faster MTTR (better runbooks, automation). Service B needs higher MTBF (better testing, dependency isolation, circuit breakers).

If you only track uptime, you prescribe the same medicine to both patients. MTBF and MTTR together give you the diagnosis.
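The figures in the table follow from the availability and the failure count alone. Here is a small sketch, assuming a 30-day (720-hour) month and taking MTBF as month length divided by failure count; the function name is illustrative.

```python
def failure_profile(availability, failures_per_month, month_hours=720):
    """Split one month's downtime budget evenly across a number of failures."""
    downtime_min = (1 - availability) * month_hours * 60
    return {
        "downtime_budget_min": downtime_min,
        "avg_downtime_min": downtime_min / failures_per_month,
        "mtbf_hours": month_hours / failures_per_month,
    }


service_a = failure_profile(0.999, failures_per_month=1)
service_b = failure_profile(0.999, failures_per_month=10)
print(service_a)
print(service_b)
```

Both calls return the same downtime budget of roughly 43 minutes. Only the average downtime per failure and the MTBF differ, which is exactly the information the uptime percentage hides.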

When MTBF is misleading

MTBF is an average, and averages lie when the underlying distribution is skewed.

Rare catastrophic failures. A service that runs perfectly for 364 days and then suffers a 12-hour outage on day 365 has an MTBF of 8,736 hours. That number sounds excellent — until the one failure costs $2M in lost revenue. MTBF doesn't capture tail risk. For services where the cost of a single failure is extreme, pair MTBF with a worst-case downtime metric (longest incident duration over the window).

Heterogeneous failure modes. If your service fails due to three completely different root causes (DNS resolution, database connection pool exhaustion, and a memory leak), averaging them into a single MTBF obscures the fact that one cause dominates. Compute MTBF per root cause category when possible.

Early-life systems. A new service has no meaningful MTBF. You need at least 3-5 failure cycles to compute a statistically useful average. Reporting MTBF after one incident in two weeks is technically correct ("MTBF = 336 hours") but practically useless — the confidence interval is enormous.
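The three caveats above translate into simple guards around the average. Here is a rough sketch, assuming each incident has been tagged with a root cause category (the labels and the tuple shape are hypothetical) and using three failures as the minimum sample, the lower end of the 3-5 range mentioned above.

```python
from collections import Counter
from datetime import timedelta

MIN_FAILURES = 3  # lower bound of the 3-5 failure cycles suggested in the text


def mtbf_report(window, incidents):
    """incidents: list of (root_cause, downtime) tuples for one service."""
    window_hours = window.total_seconds() / 3600
    causes = Counter(cause for cause, _ in incidents)
    longest = max((down for _, down in incidents), default=timedelta())

    def mtbf_or_none(count):
        # Refuse to report an average built on too few failures.
        return window_hours / count if count >= MIN_FAILURES else None

    return {
        "longest_incident_hours": longest.total_seconds() / 3600,
        "overall_mtbf_hours": mtbf_or_none(len(incidents)),
        "mtbf_hours_by_cause": {cause: mtbf_or_none(n) for cause, n in causes.items()},
    }


# Illustrative data: four short DNS incidents and one long memory-leak outage.
sample = [("dns", timedelta(minutes=10))] * 4 + [("memory_leak", timedelta(hours=6))]
report = mtbf_report(timedelta(days=90), sample)
print(report)
```

Returning None instead of a number for under-sampled categories keeps a single early incident from showing up on a dashboard as a confident-looking MTBF.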

How DevHelm gives you the raw data

DevHelm does not expose MTBF as a dashboard metric today — that's on the roadmap. What it does give you is the incident history that MTBF is computed from.

Every incident records a created_at timestamp (when the incident opened), a resolved_at timestamp (when it closed), and a severity (DOWN or DEGRADED). The dashboard's incident summary already computes MTTR over a rolling 30-day window from these timestamps. MTBF is the complementary calculation: take the resolved_at of incident N and the created_at of incident N+1, average those gaps, and you have MTTF — then add MTTR to get MTBF.

If you need the number today, the API returns the full incident list for any monitor at GET /api/v1/incidents, filtered by severity and date range. A script that walks the list and computes MTBF per monitor is straightforward — and when we ship native MTBF in the analytics view, the underlying calculation will use the same timestamps.
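A script along those lines might look like the sketch below. It works on an already-fetched list of incidents for one monitor and makes no HTTP call. The response shape is an assumption: the text above only establishes that incidents carry created_at, resolved_at, and a severity of DOWN or DEGRADED, so the dictionary keys, the ISO 8601 timestamp format, and the helper names here are illustrative.

```python
from datetime import datetime
from statistics import mean


def parse_ts(value):
    # Assumes ISO 8601 strings; a trailing "Z" is rewritten for older Python versions.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def mtbf_from_incidents(incidents, severities=("DOWN",)):
    """Return MTTF, MTTR, and MTBF in hours, or None with fewer than two incidents."""
    closed = sorted(
        (
            i for i in incidents
            if i.get("resolved_at") and i.get("severity") in severities
        ),
        key=lambda i: parse_ts(i["created_at"]),
    )
    if len(closed) < 2:
        return None  # at least one gap between incidents is needed
    repairs = [parse_ts(i["resolved_at"]) - parse_ts(i["created_at"]) for i in closed]
    gaps = [
        parse_ts(nxt["created_at"]) - parse_ts(prev["resolved_at"])
        for prev, nxt in zip(closed, closed[1:])
    ]
    mttf = mean(g.total_seconds() for g in gaps) / 3600
    mttr = mean(r.total_seconds() for r in repairs) / 3600
    return {"mttf_hours": mttf, "mttr_hours": mttr, "mtbf_hours": mttf + mttr}


# Illustrative payload, shaped like the assumed API response.
payload = [
    {"severity": "DOWN", "created_at": "2024-06-04T14:00:00Z", "resolved_at": "2024-06-04T14:45:00Z"},
    {"severity": "DOWN", "created_at": "2024-06-12T03:20:00Z", "resolved_at": "2024-06-12T04:10:00Z"},
    {"severity": "DEGRADED", "created_at": "2024-06-18T10:00:00Z", "resolved_at": "2024-06-18T10:05:00Z"},
    {"severity": "DOWN", "created_at": "2024-06-25T09:00:00Z", "resolved_at": "2024-06-25T09:30:00Z"},
]
stats = mtbf_from_incidents(payload)
print(stats)
```

Because this version averages the gaps between consecutive incidents, it ignores the time before the first incident and after the last one, so its result will differ slightly from a window-based calculation over the same period.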

Where to start

If you're tracking uptime but not MTBF, pick your highest-traffic service, pull its sev1+sev2 incident history for the last 90 days, and compute MTBF with the formula above. Compare it to your MTTR. If MTBF is low and MTTR is low, you have a fragile-but-fast-recovering service — invest in prevention. If MTBF is high and MTTR is high, you have a stable-but-slow-recovering service — invest in runbooks and detection speed.

Set up monitoring for the service at app.devhelm.io and let the incident history accumulate. After 30 days you'll have a first window of data. Once it contains at least three failure cycles you can compute your first real MTBF, and the trend line from that point forward is worth more than any snapshot.
