Self-Calibrating Eval: The End of Manual Threshold Tuning

#mcp #aiagents #observability #opensource

You set a cost threshold at $0.50 per agent call. On day one, 12% of outputs exceed it — the expensive outliers, the runaway loops, the calls that need investigation. Reasonable.

Three months later, that same threshold is flagging 47% of outputs. Nothing in your code changed. Your eval rules are identical. But your model provider raised API prices, or a minor model update shifted token usage patterns, or your agent started handling a different distribution of user queries. The threshold that once caught outliers is now crying wolf on nearly half your traffic.

Is the agent getting worse? Or is the threshold miscalibrated?

This is the static threshold problem. And it's the reason most eval systems degrade from useful to noisy within months.

The Threshold Decay Curve

Every hardcoded threshold has an expiration date. The environment around your agent is constantly shifting:

Model provider changes. Upstream providers update pricing, model weights, and decoding parameters without announcement. A Stanford/Berkeley study (Chen et al., 2023) found that GPT-4's rate of directly executable code generations dropped from 52% to 10% in just three months — with no changelog, no API version bump. If your quality thresholds were calibrated to March outputs, they were wrong by June.

Input distribution shifts. Your users don't send the same queries month over month. Seasonal patterns, feature launches, and user growth all change the distribution of inputs your agent handles. A cost threshold calibrated on developer queries breaks when your agent starts handling customer support.

Pricing changes. Token costs are not static. When Anthropic, OpenAI, or Google adjust pricing — sometimes mid-quarter — every cost threshold in your eval system is instantly stale. Your $0.50 threshold might have been the 95th percentile at launch. After a price increase, it could be the 60th percentile. Same dollar figure, completely different meaning.

The result is eval drift manifesting not in the agent itself, but in the eval system that's supposed to catch it. Your quality gate is decaying alongside the thing it's measuring. The LangChain State of Agent Engineering survey (1,340 respondents, late 2025) found only 37% of teams run online evals on production traffic — and the tooling ecosystem offers little support for revisiting those configurations after deployment.

Threshold Drift vs. Actual Quality Drift

This is the core diagnostic problem: when your failure rate spikes, you need to distinguish between two fundamentally different situations.

Actual quality drift: The agent is producing worse outputs. Model weights changed. A prompt regression slipped through. The failure rate increase reflects real degradation that demands investigation.

Threshold drift: The agent's outputs are the same quality — or even better — but the environment shifted and the threshold no longer represents what it used to. The failure rate increase is noise from a miscalibrated instrument.

If you can't tell the difference, you either ignore real quality problems (because you've been trained to distrust the alerts) or you waste engineering hours investigating phantom failures. Both are expensive. Both are forms of the eval tax.

The Self-Calibrating Eval Pattern

Self-calibrating eval is the pattern where the eval system monitors its own scoring distributions and recommends threshold adjustments when it detects anomalies.

The mechanism has four steps:

1. Monitor scoring distributions. Track not just pass/fail rates, but the full distribution of scores over time. A rolling window of quality scores, cost figures, and safety metrics — bucketed by day or week — reveals the shape of normal operation.

2. Detect distribution shifts. When the scoring distribution changes shape — the mean shifts, the variance widens, the failure rate departs from its historical baseline — flag it. The anomaly isn't that individual outputs failed. The anomaly is that the pattern of failures changed.

3. Recommend adjustments. This is where self-calibrating eval diverges from simple alerting. Instead of just saying "failure rate increased," the system says: "Your cost threshold of $0.50 is now at the 60th percentile of outputs, up from the 95th percentile at calibration. The median cost per call increased from $0.18 to $0.34, consistent with the API pricing change on March 1. Recommended adjustment: $0.72 to restore 95th percentile targeting."

4. Human approves. The system recommends. A human decides. Always.

This is not auto-tuning. Auto-adjusting thresholds without human approval is dangerous — it can mask genuine quality degradation by silently loosening standards. Self-calibrating eval provides the diagnosis and the recommendation. The human provides the judgment.

The Eval Advisor

The diagnostic layer that powers self-calibrating eval is what I'm calling the eval advisor — a component that doesn't just say FAIL but explains WHY the failure happened and WHAT to do about it.

Today, most eval systems are binary gates. Output crosses a threshold? Fail. Output stays below? Pass. No context. No diagnosis. No actionable guidance.

An eval advisor adds three capabilities:

Attribution: This output failed the cost threshold because token usage was 3.2x the historical median, driven by a retry loop in the tool-calling chain.
Trend context: This is the 14th cost failure in the last hour, up from a baseline of 2 per hour. The pattern started at 2:14 PM, coinciding with the model endpoint switching from gpt-4-0125 to gpt-4-0314.
Recommendation: Adjust cost threshold from $0.50 to $0.68 to account for the new model's token consumption pattern, or investigate the retry loop that's inflating costs.

The difference between "eval as a gate" and "eval as a co-pilot" is the difference between a check engine light and a mechanic who tells you what's wrong. Both tell you something failed. Only one helps you fix it.

The Adaptive Cruise Control Analogy

Self-calibrating eval is to agent quality what adaptive cruise control is to driving.

Standard cruise control holds a fixed speed. Hit a hill, the engine strains. Traffic slows ahead, you're closing the gap dangerously. The setting was right when you set it. The road changed.

Adaptive cruise control monitors the environment — distance to the car ahead, road conditions, incline — and adjusts speed continuously. But you set the target following distance. You can override at any time. You're still driving.

Self-calibrating eval works the same way. The system monitors the scoring environment and adjusts its recommendations. But you set the quality bar. You approve every change. The eval system helps you maintain your standards in a shifting environment — it doesn't decide what those standards should be.

Where This Is Going

Iris currently detects eval drift through scoring patterns — every eval result is persisted with a timestamp, and the dashboard surfaces trends over time. When your scores trend downward over the past 7 days, you can see it. The scoring distribution data that makes self-calibrating eval possible is already being collected.

We're building toward eval advisor capabilities — the diagnostic layer that turns "your cost failure rate spiked" into "here's why, and here's what to adjust." This is what we're working on next. The pattern described in this post is the design target.

The broader principle: an eval system that can't explain its own judgments is just a more sophisticated alert. The industry needs eval infrastructure that participates in the diagnostic process — that helps teams maintain quality standards as the environment shifts under them, rather than silently becoming noise.

If you're running agent eval today with static thresholds, start tracking your scoring distributions over time. When the failure rate changes, ask: is the agent getting worse, or is the threshold stale? That question — and the infrastructure to answer it — is the difference between eval as a gate and eval as a co-pilot.

For the complete picture, see our Agent Eval: The Definitive Guide.

Iris is the agent eval standard for MCP. Start scoring agent outputs inline and see how your eval distributions trend over time. Try it: iris-eval.com/playground