DEV Community

Anders

Posted on • Originally published at nerq.ai

We reduced AI agent failure rate from 36% to 0% — here's the data

Autonomous AI agents pick tools blindly. Without trust signals, 36% of tool selections fail — dead endpoints, untrusted code, abandoned projects. We built a preflight trust check and benchmarked it: 100 iterations, 50 real agents, Welch's t-test. The result: 0% failure rate, p < 0.00000001.

The Problem

When an autonomous agent needs tools — say, for "Bitcoin market analysis" — it selects from a registry. Most registries have no trust signals. The agent picks randomly. Some tools are maintained, some are abandoned, some don't exist anymore.

We wanted to quantify: how bad is blind selection, and can a single API call fix it?

Methodology

  • Pool: 50 agents from the Nerq index (15 high-trust, 15 medium-trust, 10 low-trust, 10 dead/not-found)
  • Task: Select 5 tools per iteration
  • N: 100 iterations per scenario
  • Statistical test: Welch's two-sample t-test (unequal variances), significance threshold p < 0.05
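Welch's t-statistic can be computed in pure Python without any dependencies. This is a sketch of the test used above; the sample values are illustrative, not the benchmark data:

```python
import math

def welch_t(a, b):
    """Welch's two-sample t-statistic for samples with unequal variances."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # unbiased sample variance
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

# Illustrative per-iteration failure rates (fractions, not the real data)
without = [0.4, 0.2, 0.6, 0.4, 0.2]
with_check = [0.0, 0.0, 0.0, 0.0, 0.0]
print(round(welch_t(without, with_check), 3))  # 4.811
```

The p-value then comes from the t-distribution with the Welch–Satterthwaite degrees of freedom; at N=100 per scenario, a t-statistic near 18 drives it far below any conventional threshold.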

Scenario A (Without Nerq): Randomly select 5 tools from the pool, then call the KYA endpoint for each. Any tool with trust < 40, or not found at all, counts as a failure.

Scenario B (With Nerq): Call /v1/preflight on all 50 candidates. Filter to PROCEED recommendations. Sort by trust descending. Pick top 5.
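The two scenarios can be simulated end to end. The `TRUST` table below is a hypothetical stand-in for the live KYA/preflight lookups (the real benchmark hit the Nerq API), with trust values chosen to mirror the pool composition:

```python
import random

# Hypothetical stand-in for live trust lookups: 15 high, 15 medium,
# 10 low, 10 dead/not-found (None), mirroring the benchmark pool.
TRUST = {f"tool-{i}": t for i, t in enumerate(
    [90] * 15 + [65] * 15 + [30] * 10 + [None] * 10)}

def scenario_a(pool, k=5):
    """Blind selection: random pick, then check each tool after the fact."""
    picks = random.sample(pool, k)
    fails = sum(1 for p in picks if TRUST[p] is None or TRUST[p] < 40)
    return fails / k  # failure rate for this iteration

def scenario_b(pool, k=5):
    """Preflight selection: screen everything, keep trusted, take top k."""
    ok = sorted((p for p in pool if TRUST[p] is not None and TRUST[p] >= 40),
                key=lambda p: TRUST[p], reverse=True)
    picks = ok[:k]
    fails = sum(1 for p in picks if TRUST[p] is None or TRUST[p] < 40)
    return fails / k

pool = list(TRUST)
print(scenario_b(pool))  # 0.0 — failures are screened out by construction
```

Scenario B's 0% failure rate follows directly from the filter: anything that would count as a failure never survives screening.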

Results (N=100 iterations)

| Metric | Without Nerq | With Nerq | Delta |
| --- | --- | --- | --- |
| Failure rate (mean ± SD) | 35.6 ± 19.8% | 0.0 ± 0.0% | -35.6% |
| Failure rate 95% CI | [31.7, 39.5]% | [0.0, 0.0]% | |
| Trust score (mean ± SD) | 68.6 ± 9.5 | 92.2 ± 0.0 | +23.6 |
| Trust score 95% CI | [66.8, 70.5] | [92.2, 92.2] | |
| Avg API time | 0.221 s | 0.363 s | +0.142 s |

Statistical Significance

| Statistic | Failure Rate | Trust Score |
| --- | --- | --- |
| t-statistic | 17.968 | -24.750 |
| p-value | < 0.00000001 | < 0.00000001 |
| Significant at 95% | Yes | Yes |

Both metrics are statistically significant at the 95% confidence level.

The Trade-off

Nerq makes 50 API calls (screening all candidates) vs 5 (random pick), adding 142ms of overhead. That's the cost of checking trust before committing. For an autonomous agent running without human oversight, 142ms to avoid executing untrusted code is not a trade-off — it's a requirement.

How It Works

One API call:

GET /v1/preflight?target=SWE-agent

Returns:

{
  "recommendation": "PROCEED",
  "target_trust": 92.5,
  "grade": "A+"
}

The agent filters to PROCEED, sorts by trust, and picks the top N. No failed calls. No dead endpoints. No untrusted code.
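The filter-sort-pick loop might look like the sketch below. The endpoint path and response fields match the example above, but the client code, base URL, and injectable `check` parameter are our assumptions, not an official SDK:

```python
import json
from urllib.request import urlopen

# Base URL is an assumption; the /v1/preflight path matches the article.
NERQ = "https://api.nerq.ai"

def preflight(target):
    """GET /v1/preflight for one candidate; returns the parsed JSON."""
    with urlopen(f"{NERQ}/v1/preflight?target={target}") as resp:
        return json.load(resp)

def select_tools(candidates, check=preflight, n=5):
    """Screen every candidate, keep PROCEED, sort by trust, take the top n."""
    results = [(c, check(c)) for c in candidates]
    ok = [(c, r["target_trust"]) for c, r in results
          if r.get("recommendation") == "PROCEED"]
    ok.sort(key=lambda item: item[1], reverse=True)
    return [c for c, _ in ok[:n]]
```

Injecting `check` keeps the selection logic testable without network access; in production the default `preflight` makes one HTTP call per candidate, which is where the 50-call screening cost comes from.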

Try It

The Nerq index covers 204K+ agents and tools with independent trust scores. Free tier: 60 requests/hour.


Data generated from the live Nerq API on 2026-03-10. All agents in the benchmark pool are real entries from the index.



Top comments (1)

klement Gunndu

The trust index benchmark is an interesting framing — but how does Nerq handle the cold-start problem where you don't have enough historical runs to build a reliable failure signature for a new task type?