DEV Community

Anders

Posted on • Originally published at nerq.ai

We reduced AI agent failure rate from 36% to 0% — here's the data

Autonomous AI agents pick tools blindly. Without trust signals, 36% of tool selections fail — dead endpoints, untrusted code, abandoned projects. We built a preflight trust check and benchmarked it: 100 iterations, 50 real agents, Welch's t-test. The result: 0% failure rate, p < 0.00000001.

The Problem

When an autonomous agent needs tools — say, for "Bitcoin market analysis" — it selects from a registry. Most registries have no trust signals. The agent picks randomly. Some tools are maintained, some are abandoned, some don't exist anymore.

We wanted to quantify: how bad is blind selection, and can a single API call fix it?

Methodology

  • Pool: 50 agents from the Nerq index (15 high-trust, 15 medium-trust, 10 low-trust, 10 dead/not-found)
  • Task: Select 5 tools per iteration
  • N: 100 iterations per scenario
  • Statistical test: Welch's two-sample t-test (unequal variances), significance threshold p < 0.05
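Welch's t-statistic can be computed in pure Python without any dependencies. This is a sketch of the test used above; the sample values are illustrative, not the benchmark data:

```python
import math

def welch_t(a, b):
    """Welch's two-sample t-statistic for samples with unequal variances."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # unbiased sample variance
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

# Illustrative per-iteration failure rates (fractions, not the real data)
without = [0.4, 0.2, 0.6, 0.4, 0.2]
with_check = [0.0, 0.0, 0.0, 0.0, 0.0]
print(round(welch_t(without, with_check), 3))  # 4.811
```

The p-value then comes from the t-distribution with the Welch–Satterthwaite degrees of freedom; at N=100 per scenario, a t-statistic near 18 drives it far below any conventional threshold.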

Scenario A (Without Nerq): Randomly select 5 tools from the pool, then call the KYA endpoint for each. Any tool with trust < 40, or not found at all, counts as a failure.

Scenario B (With Nerq): Call /v1/preflight on all 50 candidates. Filter to PROCEED recommendations. Sort by trust descending. Pick top 5.
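The two scenarios can be simulated end to end. The `TRUST` table below is a hypothetical stand-in for the live KYA/preflight lookups (the real benchmark hit the Nerq API), with trust values chosen to mirror the pool composition:

```python
import random

# Hypothetical stand-in for live trust lookups: 15 high, 15 medium,
# 10 low, 10 dead/not-found (None), mirroring the benchmark pool.
TRUST = {f"tool-{i}": t for i, t in enumerate(
    [90] * 15 + [65] * 15 + [30] * 10 + [None] * 10)}

def scenario_a(pool, k=5):
    """Blind selection: random pick, then check each tool after the fact."""
    picks = random.sample(pool, k)
    fails = sum(1 for p in picks if TRUST[p] is None or TRUST[p] < 40)
    return fails / k  # failure rate for this iteration

def scenario_b(pool, k=5):
    """Preflight selection: screen everything, keep trusted, take top k."""
    ok = sorted((p for p in pool if TRUST[p] is not None and TRUST[p] >= 40),
                key=lambda p: TRUST[p], reverse=True)
    picks = ok[:k]
    fails = sum(1 for p in picks if TRUST[p] is None or TRUST[p] < 40)
    return fails / k

pool = list(TRUST)
print(scenario_b(pool))  # 0.0 — failures are screened out by construction
```

Scenario B's 0% failure rate follows directly from the filter: anything that would count as a failure never survives screening.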

Results (N=100 iterations)

| Metric | Without Nerq | With Nerq | Delta |
| --- | --- | --- | --- |
| Failure rate (mean ± SD) | 35.6 ± 19.8% | 0.0 ± 0.0% | -35.6% |
| Failure rate 95% CI | [31.7, 39.5]% | [0.0, 0.0]% | |
| Trust score (mean ± SD) | 68.6 ± 9.5 | 92.2 ± 0.0 | +23.6 |
| Trust score 95% CI | [66.8, 70.5] | [92.2, 92.2] | |
| Avg API time | 0.221 s | 0.363 s | +0.142 s |

Statistical Significance

| Statistic | Failure Rate | Trust Score |
| --- | --- | --- |
| t-statistic | 17.968 | -24.750 |
| p-value | < 0.00000001 | < 0.00000001 |
| Significant at 95% | Yes | Yes |

Both metrics are statistically significant at the 95% confidence level.

The Trade-off

Nerq makes 50 API calls (screening all candidates) vs 5 (random pick), adding 142ms of overhead. That's the cost of checking trust before committing. For an autonomous agent running without human oversight, 142ms to avoid executing untrusted code is not a trade-off — it's a requirement.

How It Works

One API call:

GET /v1/preflight?target=SWE-agent

Returns:

{
  "recommendation": "PROCEED",
  "target_trust": 92.5,
  "grade": "A+"
}

The agent filters to PROCEED, sorts by trust, and picks the top N. No failed calls. No dead endpoints. No untrusted code.
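The filter-sort-pick loop might look like the sketch below. The endpoint path and response fields match the example above, but the client code, base URL, and injectable `check` parameter are our assumptions, not an official SDK:

```python
import json
from urllib.request import urlopen

# Base URL is an assumption; the /v1/preflight path matches the article.
NERQ = "https://api.nerq.ai"

def preflight(target):
    """GET /v1/preflight for one candidate; returns the parsed JSON."""
    with urlopen(f"{NERQ}/v1/preflight?target={target}") as resp:
        return json.load(resp)

def select_tools(candidates, check=preflight, n=5):
    """Screen every candidate, keep PROCEED, sort by trust, take the top n."""
    results = [(c, check(c)) for c in candidates]
    ok = [(c, r["target_trust"]) for c, r in results
          if r.get("recommendation") == "PROCEED"]
    ok.sort(key=lambda item: item[1], reverse=True)
    return [c for c, _ in ok[:n]]
```

Injecting `check` keeps the selection logic testable without network access; in production the default `preflight` makes one HTTP call per candidate, which is where the 50-call screening cost comes from.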

Try It

The Nerq index covers 204K+ agents and tools with independent trust scores. Free tier: 60 requests/hour.


Data generated from the live Nerq API on 2026-03-10. All agents in the benchmark pool are real entries from the index.



Top comments (1)

klement Gunndu

The trust index benchmark is an interesting framing — but how does Nerq handle the cold-start problem where you don't have enough historical runs to build a reliable failure signature for a new task type?