# We reduced AI agent failure rate from 36% to 0% — here's the data
Autonomous AI agents pick tools blindly. Without trust signals, 36% of tool selections fail — dead endpoints, untrusted code, abandoned projects. We built a preflight trust check and benchmarked it: 100 iterations, 50 real agents, Welch's t-test. The result: 0% failure rate, p < 0.00000001.
## The Problem
When an autonomous agent needs tools — say, for "Bitcoin market analysis" — it selects from a registry. Most registries have no trust signals. The agent picks randomly. Some tools are maintained, some are abandoned, some don't exist anymore.
We wanted to quantify: how bad is blind selection, and can a single API call fix it?
## Methodology
- Pool: 50 agents from the Nerq index (15 high-trust, 15 medium-trust, 10 low-trust, 10 dead/not-found)
- Task: Select 5 tools per iteration
- N: 100 iterations per scenario
- Statistical test: Welch's two-sample t-test (unequal variances), significance threshold p < 0.05
Scenario A (Without Nerq): Randomly select 5 tools from the pool. Call the KYA endpoint for each. Tools with trust < 40 or not found = failure.
Scenario B (With Nerq): Call /v1/preflight on all 50 candidates. Filter to PROCEED recommendations. Sort by trust descending. Pick top 5.
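The two scenarios can be sketched in a few lines. This is a toy simulation, not the published benchmark: the pool composition mirrors the counts above, but the trust values and names are made up, so the toy failure rate will not reproduce the reported 35.6%.

```python
import random

# Hypothetical candidate pool mimicking the benchmark's mix:
# 15 high-trust, 15 medium-trust, 10 low-trust, 10 dead/not-found.
pool = (
    [{"name": f"high-{i}", "trust": 90, "found": True} for i in range(15)]
    + [{"name": f"med-{i}", "trust": 65, "found": True} for i in range(15)]
    + [{"name": f"low-{i}", "trust": 30, "found": True} for i in range(10)]
    + [{"name": f"dead-{i}", "trust": 0, "found": False} for i in range(10)]
)

def is_failure(tool):
    # Per the benchmark rules: trust < 40 or not found counts as a failure.
    return (not tool["found"]) or tool["trust"] < 40

def scenario_a(pool, k=5):
    """Without Nerq: pick k tools at random, check each afterwards."""
    picks = random.sample(pool, k)
    return sum(is_failure(t) for t in picks) / k  # per-iteration failure rate

def scenario_b(pool, k=5):
    """With Nerq: screen every candidate first, drop failures,
    sort by trust descending, take the top k."""
    ok = [t for t in pool if not is_failure(t)]
    ok.sort(key=lambda t: t["trust"], reverse=True)
    return sum(is_failure(t) for t in ok[:k]) / k  # always 0.0 by construction

rates_a = [scenario_a(pool) for _ in range(100)]
rates_b = [scenario_b(pool) for _ in range(100)]
```

Screening before selection makes the failure rate zero by construction, which is why Scenario B's standard deviation in the results below is 0.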
## Results (N=100 iterations)
| Metric | Without Nerq | With Nerq | Delta |
|---|---|---|---|
| Failure rate (mean +/- SD) | 35.6 +/- 19.8% | 0.0 +/- 0.0% | -35.6% |
| Failure rate 95% CI | [31.7, 39.5]% | [0.0, 0.0]% | |
| Trust score (mean +/- SD) | 68.6 +/- 9.5 | 92.2 +/- 0.0 | +23.6 |
| Trust score 95% CI | [66.8, 70.5] | [92.2, 92.2] | |
| Avg API time | 0.221s | 0.363s | +0.142s |
## Statistical Significance
| Test | Failure Rate | Trust Score |
|---|---|---|
| t-statistic | 17.968 | -24.750 |
| p-value | < 0.00000001 | < 0.00000001 |
| Significant at 95% | Yes | Yes |
Both differences are statistically significant well beyond the 95% threshold (p < 0.00000001).
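The t-statistics can be sanity-checked from the summary statistics alone. A minimal pure-Python version of Welch's formula, using the table's rounded means and SDs (so the results land close to, but not exactly on, the reported values):

```python
import math

def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's two-sample t-statistic from summary statistics
    (unequal variances, no pooled estimate)."""
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return (m1 - m2) / se

# Failure rate (percent), without vs with Nerq, N=100 each:
t_fail = welch_t(35.6, 19.8, 100, 0.0, 0.0, 100)   # ~17.98 vs reported 17.968
# Trust score, without vs with Nerq:
t_trust = welch_t(68.6, 9.5, 100, 92.2, 0.0, 100)  # ~-24.84 vs reported -24.750
```

Note that because the "With Nerq" group has zero variance, Welch's test degenerates into a one-sample t-test against a constant; the small gaps from the reported values come from the rounding of the summary statistics.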
## The Trade-off
Nerq makes 50 API calls (screening all candidates) vs 5 (random pick), adding 142ms of overhead. That's the cost of checking trust before committing. For an autonomous agent running without human oversight, 142ms to avoid executing untrusted code is not a trade-off — it's a requirement.
## How It Works
One API call:

```
GET /v1/preflight?target=SWE-agent
```

Returns:

```json
{
  "recommendation": "PROCEED",
  "target_trust": 92.5,
  "grade": "A+"
}
```
The agent filters to PROCEED, sorts by trust, and picks the top N. No failed calls. No dead endpoints. No untrusted code.
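The agent-side selection logic is a one-liner each for filter, sort, and slice. A sketch using mocked preflight responses (the field names follow the example response above; the `target` key naming each tool is an assumption):

```python
def pick_tools(preflight_results, n=5):
    """Keep PROCEED recommendations, sort by trust descending,
    and return the top-n tool names."""
    ok = [r for r in preflight_results if r["recommendation"] == "PROCEED"]
    ok.sort(key=lambda r: r["target_trust"], reverse=True)
    return [r["target"] for r in ok[:n]]

# Mocked /v1/preflight responses for three candidates.
results = [
    {"target": "SWE-agent", "recommendation": "PROCEED", "target_trust": 92.5},
    {"target": "abandoned-scraper", "recommendation": "AVOID", "target_trust": 12.0},
    {"target": "btc-indicators", "recommendation": "PROCEED", "target_trust": 88.1},
]

print(pick_tools(results, n=2))  # -> ['SWE-agent', 'btc-indicators']
```

If fewer than n candidates survive the filter, the agent simply gets a shorter list rather than falling back to untrusted tools.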
## Try It
- Live demo: nerq.ai/discover
- API docs: nerq.ai/nerq/docs
- Full benchmark report: nerq.ai/report/benchmark
Reproduce:

```shell
python -m agentindex.nerq_benchmark_test
```
The Nerq index covers 204K+ agents and tools with independent trust scores. Free tier: 60 requests/hour.
Data generated from the live Nerq API on 2026-03-10. All agents in the benchmark pool are real entries from the index.
Originally published on nerq.ai