Alexander Velikiy

Originally published at greatcto.systems

The MTTR -94% claim, with receipts

Earlier posts cite a "median MTTR drop of 94.1% across 47 paired P0 incidents." This post is the receipts. The full methodology is also in docs/benchmarks/MTTR.md; this post explains why the number is what it is, and walks through the four cases the headline number does not capture.

What got measured

Setup:

  • 12 production repositories (mix of fintech, voice-AI, clinical, dev-tools).
  • P0 incident defined as: user-facing, paged a human, took ≥15 minutes to resolve.
  • Window: a rolling 6 months, split into pre-treatment and post-treatment periods.
  • Treatment: the project installed GreatCTO and started persisting (pattern_hash, detection_order_that_worked, rationale) after each P0 resolved.
  • Outcome: time from page to root-cause identified.

I measured detection time, not full resolution time. Resolution depends on rollout speed, blast radius, customer comms — too many confounds. Detection time is the part where memory could conceivably help, and it is the part where humans burn the most calendar hours on recurring bugs.

The number

47 paired incidents. "Paired" means: same shape (same pattern_hash) seen at least twice across the 6-month window, once before persistence, once after.
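
As a sketch of the pairing rule (illustrative types and names, not the benchmark code, which is unpublished):

// Pairing rule sketch. The Incident shape here is an assumption.
interface Incident {
  patternHash: string;      // sha256 over the normalized signature, see below
  detectionMinutes: number; // page -> root cause identified
  persisted: boolean;       // true if memory persistence was live at the time
}

// A "pair" is a pattern_hash seen at least once before persistence and
// at least once after, within the 6-month window.
function pairIncidents(incidents: Incident[]): Map<string, { pre: number[]; post: number[] }> {
  const byHash = new Map<string, { pre: number[]; post: number[] }>();
  for (const inc of incidents) {
    const entry = byHash.get(inc.patternHash) ?? { pre: [], post: [] };
    (inc.persisted ? entry.post : entry.pre).push(inc.detectionMinutes);
    byHash.set(inc.patternHash, entry);
  }
  // Keep only hashes observed on both sides of the treatment boundary.
  return new Map([...byHash].filter(([, e]) => e.pre.length > 0 && e.post.length > 0));
}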

Stat                    Pre        Post      Delta
Median detection time   178 min    11 min    −94.1%
Mean detection time     224 min    17 min    −92.6%
90th percentile         412 min    41 min    −90.0%
Worst case (post)       n/a        89 min    n/a
Best case (post)        n/a        4 min     n/a

The distribution is skewed by a couple of near-100% cases (postgres pool exhaustion, and a connection-string typo that the agent matched to a prior incident's commit diff and flagged in under 5 minutes). I report the median because it is less misleading than the mean for skewed distributions. The 90th percentile is probably the number you should care about: it is the "still ~10× faster on the bad cases" claim.
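
For reference on how those order statistics fall out, a minimal sketch (illustrative, not the benchmark code; deltaPct is my name for the Delta column):

// Nearest-rank percentile; good enough at n = 47.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.max(0, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[idx];
}

const median = (values: number[]) => percentile(values, 50);

// Delta column: (post - pre) / pre, reported as a percentage.
const deltaPct = (pre: number, post: number) => ((post - pre) / pre) * 100;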

How the mechanism works

The agent stores, for each resolved incident:

pattern_hash:   sha256(normalized_log_signature + topology_hint)
detection_order: ["check_pg_pool_size", "check_connections", "check_query_count"]
rationale:      "connection_refused logs + pool > 80% utilization → pool exhaustion, not network"

On a new incident, the agent's Step 0 is to hash the current incident's signature and look it up in ~/.great_cto/incident_memory.jsonl. On a pattern hit, it tries the prior detection_order first. If that identifies the root cause, it logs a "memory hit"; if it does not, it falls back to systematic exploration.
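
A minimal sketch of that Step 0, assuming one JSON record per line in the memory file; the field names mirror the record format above, and the function and parameter names are mine. The signature is assumed to be pre-normalized:

// Step 0 sketch: hash the incident signature, scan the memory file for a hit.
import { readFileSync } from "node:fs";
import { createHash } from "node:crypto";

interface MemoryRecord {
  pattern_hash: string;
  detection_order: string[];
  rationale: string;
}

function lookupMemory(normalizedSignature: string, topologyHint: string, path: string): MemoryRecord | null {
  const hash = createHash("sha256").update(normalizedSignature + topologyHint).digest("hex");
  for (const line of readFileSync(path, "utf8").split("\n").filter(Boolean)) {
    const record = JSON.parse(line) as MemoryRecord;
    if (record.pattern_hash === hash) return record; // memory hit: try its detection_order first
  }
  return null; // no hit: fall back to systematic exploration
}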

There is no inference. The agent is not "smarter" — it is just skipping hypothesis exploration time because someone (you, last time) already paid for that exploration.

⚠ The 4 honest misses

Memory-based detection is not magic. Four of the 47 cases had pattern matches that pointed in the wrong direction and burned 10-30 minutes before the agent gave up and fell back to systematic exploration.

Miss #1. Pattern matched on log signature "OOMKilled in worker pool." Prior detection order was "check worker memory limits." Reality: this time, the OOM was a memory leak in a different worker that pushed the wrong worker over its limit. Agent spent 18 minutes confirming the wrong worker's limits before noticing the leak. Total detection time: 34 minutes vs ~80 minutes baseline. Net positive but ugly.

Miss #2. Pattern matched "5xx spike from API gateway." Prior cause was upstream DB lag. Reality: this time it was a misconfigured rate-limiter that started rejecting requests after a deploy. Agent ran "check DB lag" for 12 minutes before pivoting. 28 minutes total vs ~140 baseline. Still a win, but called a "miss" because the prior path was wrong.

Miss #3. Pattern matched "auth failures after deploy." Prior cause was OAuth client secret rotation. Reality: a clock skew on one node caused JWT signature validation to fail. Agent's prior detection order led it through token store inspection first. 41 minutes total vs ~200 baseline.

Miss #4. Worst case. Pattern matched "DNS resolution failures." Prior detection order was "check Route 53 health checks." Reality: a third-party CDN had an outage. The agent's path was completely wrong, did not give up early enough, and a human had to manually override at minute 22. 89 minutes total vs ~150 baseline. Win on absolute time, but I would not call this a "memory worked" case.

If I report the 47 cases as "94.1% median drop," I owe the audience the 4 cases where the mechanism worked badly. They are 8.5% of the sample. The remaining 91.5% of cases saw memory either help significantly (74%) or be irrelevant (no pattern hit, fell straight to systematic exploration — 17%).

How to replicate in your own repo

Three steps, no GreatCTO required:

  1. Persist incident memory. After each P0 resolves, write (pattern_hash, detection_order, rationale) to a markdown file in your repo. Plain text. Git-trackable. (Sketch after this list.)
  2. At incident start, ask your agent to read that file before doing anything else. Even Claude Code with no plugins will use the file if you point it at one.
  3. Track detection time. Page-to-RC-identified, in minutes. Spreadsheet is fine.
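
Step 1 as code, if you want it automated rather than a manual habit. The file name and helper are hypothetical; the fields are the ones from this post:

// Append one resolved-incident record to a git-tracked markdown file.
// "INCIDENT_MEMORY.md" is an arbitrary choice; any plain-text file works.
import { appendFileSync } from "node:fs";

function persistIncident(patternHash: string, detectionOrder: string[], rationale: string): void {
  const entry = [
    `## ${new Date().toISOString()} ${patternHash.slice(0, 12)}`,
    `- detection_order: ${detectionOrder.join(" -> ")}`,
    `- rationale: ${rationale}`,
    "",
  ].join("\n");
  appendFileSync("INCIDENT_MEMORY.md", entry);
}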

Run for one quarter. If you see a consistent reduction in detection time on recurring patterns, you have your own version of this mechanism. If you do not see a reduction, your incidents are too unique or your pattern hash is too coarse.

The hash I use is sha256(top_3_log_lines_normalized + topology_hint) where topology_hint is the service name. This gets ~70% recall on similar incidents and very few false hits. You can tune for your domain.
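
A sketch of that hash. The normalization regexes here are toy stand-ins (strip timestamps, hex ids, raw numbers) that you would tune for your own logs:

// pattern_hash = sha256(top_3_log_lines_normalized + topology_hint)
import { createHash } from "node:crypto";

// Toy normalization so two occurrences of the same failure hash identically.
function normalize(line: string): string {
  return line
    .replace(/\d{4}-\d{2}-\d{2}T[\d:.]+Z?/g, "<ts>") // timestamps
    .replace(/\b[0-9a-f]{8,}\b/g, "<id>")            // hex ids, request ids
    .replace(/\d+/g, "<n>")                          // remaining raw numbers
    .trim();
}

function patternHash(logLines: string[], topologyHint: string): string {
  const signature = logLines.slice(0, 3).map(normalize).join("\n");
  return createHash("sha256").update(signature + topologyHint).digest("hex");
}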

What I will not do

Some readers ask for the raw data (anonymized incidents). I will not publish it — even anonymized, customers can be re-identified from incident shapes and timing. I will share the synthetic test cases in tests/incident_memory.test.mjs and the aggregate statistics in docs/benchmarks/MTTR.md. That is enough to verify the mechanism without leaking client incident data.

What this is not

Not an RCT. Observational. Twelve repos is small. The selection bias is real — the repos that adopted GreatCTO early were also the ones with the best L3 culture. A worse team might see 30% drop instead of 94%.

The number I would defend to your board: on recurring incident patterns, memory-driven detection compresses detection time by 5-10× median, with a long tail of near-zero-improvement cases. That is more honest than "94%." But "94%" is what shows up in the data.


About: I build GreatCTO — a multi-agent SDLC plugin for Claude Code. MIT, runs locally. Memory layer source is in packages/cli/src/memory.ts. The full benchmark methodology is at docs/benchmarks/MTTR.md.
