Ian Parent

Posted on Mar 23 • Edited on Jul 7 • Originally published at iris-eval.com

Eval Coverage: The Metric Your AI Agents Are Missing

#mcp #aiagents #observability #opensource

Every serious codebase measures test coverage. CI pipelines enforce minimums. Pull requests get rejected when coverage drops. The industry spent two decades making this a standard practice.

For AI agents, the equivalent metric isn't an established practice yet. It should be. Call it eval coverage: the percentage of agent executions that receive an evaluation.

The Current State: Nearly Zero

The numbers are stark. From LangChain's State of Agent Engineering survey (1,340 respondents, late 2025):

Only 52% of organizations run offline evaluations on test sets
Only 37% run online evals on real production traffic
89% have infrastructure observability — but observability tells you if the call completed, not if the answer was good
Only a small minority of teams evaluate 90%+ of their production agent executions

The majority of companies building AI agents in production are running at effectively 0% eval coverage on live traffic. They are paying the eval tax on every unscored execution. They're shipping code without tests — except the code is non-deterministic, the failures are silent, and the consequences are user-facing.

Why Agent Eval Coverage Is Different from Test Coverage

In traditional software, test coverage measures what percentage of code paths your test suite exercises. Tools like Istanbul and Coverage.py make this measurable. The industry settled on 80-85% as the pragmatic target — high enough to catch most regressions, not so exhaustive that tests cost more than the code they protect.

For AI agents, coverage is structurally different. It's not about code paths — it's about executions. An agent can have 100% code test coverage — every function tested — and still produce garbage outputs in production, because the behavior lives in the model's probability distribution, not in deterministic code.

This means coverage must be measured at the output level: what percentage of actual agent outputs were evaluated for quality, safety, and cost?
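
One way to make output-level coverage concrete is to record, for each execution, which dimensions were actually scored, then report coverage per dimension. The following is a minimal sketch, not an Iris API: the `ExecutionRecord` schema, its field names, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExecutionRecord:
    """One production agent execution and the scores it received (hypothetical schema)."""
    execution_id: str
    quality_score: Optional[float] = None  # None means "never evaluated"
    safety_score: Optional[float] = None
    cost_score: Optional[float] = None


def coverage_by_dimension(records: list[ExecutionRecord]) -> dict[str, float]:
    """Return the percentage of executions that were scored on each dimension."""
    dimensions = ("quality_score", "safety_score", "cost_score")
    if not records:
        return {name: 0.0 for name in dimensions}
    return {
        # A score of 0.0 still counts as evaluated; only None means unscored.
        name: 100.0 * sum(getattr(r, name) is not None for r in records) / len(records)
        for name in dimensions
    }


# Made-up sample: four executions with uneven evaluation.
records = [
    ExecutionRecord("run-1", quality_score=0.9, safety_score=1.0, cost_score=0.8),
    ExecutionRecord("run-2", safety_score=1.0),
    ExecutionRecord("run-3"),
    ExecutionRecord("run-4", quality_score=0.4, safety_score=0.0),
]
print(coverage_by_dimension(records))
```

In this sample, safety is scored on three of four runs but cost on only one, so a single headline number would hide where the blind spots are.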

Why 100% Eval Coverage Matters

In software, 80% test coverage is considered good. An uncovered branch might be dead code that never runs. But with agent outputs, there is no dead code. Every call is a real user interaction with real consequences.

Spot-checking 25% of runs is not "mostly covered." It means 75% of executions go unscored, and with them roughly 75% of your production failures. The failure that leaks PII, the hallucination that sends a customer wrong data, the $40 API call that should have been $0.12: these live in the long tail, and they're the ones that generate lawsuits, churn, and trust destruction.

The Coverage Spectrum

Level	What It Means	What You Miss
0%	No eval, ever	Everything. Flying blind.
25%	Spot checks, manual review	75% of failures invisible
50%	Sampling — eval 1-in-2 calls	Half your production failures
80%	What software considers "good"	20% blind spots — still risky for agents
100%	Every execution evaluated inline	Full visibility. Drift detectable from day one.
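
The "What You Miss" column can be checked with simple arithmetic. The sketch below assumes executions are picked for evaluation uniformly at random and independently of whether they fail. The workload (100 failures, one failure mode that occurs only 3 times) is a made-up example, and the function names are illustrative.

```python
def expected_unseen_failures(total_failures: int, coverage: float) -> float:
    """Expected number of failures that land in unevaluated executions.

    Assumes uniform random sampling, independent of whether a run fails.
    """
    return total_failures * (1.0 - coverage)


def chance_never_seen(occurrences: int, coverage: float) -> float:
    """Probability that every occurrence of one failure mode escapes evaluation."""
    return (1.0 - coverage) ** occurrences


# Hypothetical workload: 10,000 executions at a 1% failure rate gives 100 failures.
failures = 100
for coverage in (0.0, 0.25, 0.5, 0.8, 1.0):
    unseen = expected_unseen_failures(failures, coverage)
    # A rare failure mode that shows up only 3 times in the whole period.
    missed = chance_never_seen(3, coverage)
    print(f"{coverage:>4.0%} coverage: ~{unseen:.0f} failures unseen, "
          f"{missed:.1%} chance a 3-occurrence failure is never scored")
```

Under these assumptions, at 25% coverage a failure mode that occurs three times has about a 42% chance (0.75 cubed) of never being scored at all. This is why the rare long-tail failures are the ones sampling tends to miss.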

The Test Coverage History Parallel

The journey from "tests are optional" to "shipping without tests is unprofessional" took about 15 years:

1994: Kent Beck published SUnit — the first test framework formalization
1999: Extreme Programming codified TDD as a core practice
2003: Kent Beck's "Test-Driven Development: By Example" gave the practice its reference text
2005-2010: CI/CD adoption made test gates structural, not optional
2010+: Not having tests became a professional red flag
Today: 80%+ coverage is expected in any serious codebase

A joint IBM and Microsoft study found that teams using TDD cut pre-release defect density by 40-90% compared with similar projects that did not use it.

Where are we with agent eval? Somewhere around 1999. The practice exists. A few leading teams use it. The tooling is emerging. The industry standard hasn't formed yet.

History is about to rhyme. The discipline that accelerates adoption is Eval-Driven Development — writing eval rules before prompts, the same way TDD writes tests before code.
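
As a rough illustration of the rules-first workflow, the sketch below writes the eval rules before any prompt exists, then checks two outputs against them. The refund scenario, the rule names, and both output strings are hypothetical stand-ins for real model outputs from successive prompt drafts.

```python
from typing import Callable

# Step 1: write the eval rules first. Each rule returns True when the output passes.
EvalRule = Callable[[str], bool]

RULES: dict[str, EvalRule] = {
    "mentions_refund_window": lambda out: "30 days" in out,
    "no_unhedged_promise": lambda out: "guaranteed" not in out.lower(),
    "stays_short": lambda out: len(out.split()) <= 60,
}


def failing_rules(output: str) -> list[str]:
    """Return the name of every rule the output fails."""
    return [name for name, rule in RULES.items() if not rule(output)]


# Step 2: only now iterate on the prompt. These strings stand in for the
# outputs of two successive prompt drafts.
draft_1_output = "Refunds are guaranteed, just ask!"
draft_2_output = "You can request a refund within 30 days of purchase."

print(failing_rules(draft_1_output))  # red: the first draft fails two rules
print(failing_rules(draft_2_output))  # green: the revised draft passes all rules
```

The loop mirrors red-green in TDD: the rules fail first, and the prompt is revised until they pass.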

How to Get to 100%

Most teams run at 0% eval coverage because adding per-call evaluation is manual, fragile, and easy to forget, the same reason test coverage was low before CI made it structural. Cost need not be the obstacle: as we show in How to Evaluate Agent Output Without Calling Another LLM, heuristic rules make per-call evaluation fast and free enough to run on every execution.
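
To give a sense of what heuristic, no-LLM evaluation can look like, here is a minimal sketch using only the Python standard library. It is not Iris's actual rule set: the patterns, the `evaluate_output` function, and the $0.50 budget are assumptions chosen for illustration, and the regexes are far too simple for production PII detection.

```python
import re
from dataclasses import dataclass

# Deliberately simple patterns for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


@dataclass
class EvalResult:
    passed: bool
    failures: list[str]


def evaluate_output(text: str, cost_usd: float, max_cost_usd: float = 0.50) -> EvalResult:
    """Score one agent output with cheap deterministic checks (no model call)."""
    failures = []
    if not text.strip():
        failures.append("empty_output")
    if EMAIL_RE.search(text) or SSN_RE.search(text):
        failures.append("possible_pii")
    if cost_usd > max_cost_usd:
        failures.append("cost_over_budget")
    return EvalResult(passed=not failures, failures=failures)


print(evaluate_output("Your order ships Tuesday.", cost_usd=0.12))
print(evaluate_output("Contact jane@example.com for details.", cost_usd=40.00))
```

Checks like these are regex matches and comparisons, so running them on every execution adds negligible latency and no API cost.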

The path to 100% follows the same pattern:

Make it structural, not discretionary. If evaluation requires developers to add per-call instrumentation, coverage will always be incomplete. If evaluation is built into the protocol layer — the communication channel every agent already uses — coverage is automatic.
Measure it. You can't improve what you don't measure. Track your eval coverage as a metric: (evaluated executions / total executions) × 100.
Alert on drops. When eval coverage drops below 100%, something is misconfigured. Treat it like test coverage in CI: a floor that should never regress.
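
The three steps above can be sketched in a few lines. This is an in-process stand-in, not a protocol-layer integration and not an Iris API: `CoverageTracker`, `with_eval`, and `coverage_alert` are hypothetical names, and the agent and evaluator are toy functions.

```python
import functools
from typing import Callable, Optional


class CoverageTracker:
    """Counts executions and evaluations so coverage can be reported as a metric."""

    def __init__(self) -> None:
        self.total = 0
        self.evaluated = 0

    def coverage_pct(self) -> float:
        # (evaluated executions / total executions) x 100
        # With no executions yet, report 100% so an idle service does not alert.
        return 100.0 * self.evaluated / self.total if self.total else 100.0


def with_eval(tracker: CoverageTracker, evaluator: Callable[[str], None]):
    """Wrap an agent callable so evaluation runs on every call by default."""
    def decorator(agent_fn: Callable[..., str]) -> Callable[..., str]:
        @functools.wraps(agent_fn)
        def wrapper(*args, **kwargs) -> str:
            output = agent_fn(*args, **kwargs)
            tracker.total += 1
            try:
                evaluator(output)
                tracker.evaluated += 1
            except Exception:
                pass  # evaluator crashed: the execution stays counted as unevaluated
            return output
        return wrapper
    return decorator


def coverage_alert(tracker: CoverageTracker, minimum_pct: float = 100.0) -> Optional[str]:
    """Return an alert message when coverage falls below the floor, else None."""
    pct = tracker.coverage_pct()
    if pct < minimum_pct:
        return f"eval coverage dropped to {pct:.1f}% (minimum {minimum_pct:.1f}%)"
    return None


tracker = CoverageTracker()


def flaky_evaluator(output: str) -> None:
    # Simulates a misconfigured evaluator that cannot handle longer outputs.
    if len(output) > 20:
        raise ValueError("evaluator cannot handle this output")


@with_eval(tracker, flaky_evaluator)
def agent(prompt: str) -> str:
    return f"echo: {prompt}"  # stand-in for a real agent call


agent("hi")
agent("a much longer prompt than before")
print(tracker.coverage_pct())
print(coverage_alert(tracker))
```

The design point is that the call site never opts in to evaluation. A gap in coverage can only come from the evaluator failing, which is exactly the misconfiguration the alert is meant to surface.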

The Iris Approach

Iris enables high eval coverage by integrating at the MCP protocol layer. Agents call Iris eval tools inline — the same way they call any other MCP tool — keeping evaluation within the agent's own workflow rather than requiring a separate instrumentation pass.

The architectural advantage: when eval is an MCP tool the agent can invoke on any output, adding coverage doesn't require per-call instrumentation in your application code. You configure Iris once, and the agent has access to eval on every execution.

This is why the coverage framing matters: protocol-native eval makes high coverage a matter of agent configuration, not developer discipline. The same way CI pipelines made test coverage structural, MCP-native eval makes agent eval coverage structural.

For the complete picture, see our Agent Eval: The Definitive Guide.

Iris is the agent eval standard for MCP. Add it to your MCP config and start scoring agent outputs inline. Try it: iris-eval.com/playground

Top comments (1)

Armorer Labs • Jun 13

I like eval coverage as a metric, but I would tie it to runtime evidence, not just prompt coverage. For agents, coverage should include tools, action classes, memory paths, approval paths, retries, and failure modes.

The surprising gaps usually live in side-effect paths: partial failures, stale context, blocked actions, and resume behavior. Disclosure: I'm building Armorer/Armorer Guard, so I'm biased toward evals that include operational receipts.