We Recalculated 974K Agents — Here's What Actually Happened

#ai #agents #trust #datatransparency

Last week, we wrote about rewriting the scoring engine. The short version: 84.6% of agents were crammed into the 2.0-2.9 band because 98% of dimension scores were defaults. So we tore it down and rebuilt it.

Now the recalculation is done. Here are the numbers.

Before → After

Score Range	Old Engine	New Engine	Target
2.0 - 2.9	84.6%	65.6%	55-65%
3.0+	~15.4%	~34.4%	35-45%

The 2.0-2.9 band shrank from 84.6% to 65.6%, a drop of 19 percentage points. That is still 0.6 points above the top of our 55-65% target, but it's a meaningful shift. Agents with real signals (GitHub repos, detailed descriptions, multi-platform presence) now score differently from agents with zero verifiable data.

What Worked

Missing data no longer inflates scores. Dimensions without real data don't get a 2.5 and don't participate in the calculation. Weight is redistributed to dimensions that have actual evidence. The result: agents with more data get more differentiated scores. Agents with less data get honest scores, not padded ones.
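A minimal sketch of that redistribution, assuming the score is a plain weighted average. The dimension names and weights below are hypothetical (this post doesn't list the real ones), and `None` stands for a dimension with no real data.

```python
from typing import Optional

# Hypothetical dimensions and weights, for illustration only.
WEIGHTS = {"identity": 0.3, "activity": 0.3, "reputation": 0.2, "consistency": 0.2}


def weighted_score(dims: dict, weights: dict = WEIGHTS) -> Optional[float]:
    """Average only the dimensions that have real data.

    Dividing by the sum of the remaining weights is what redistributes
    the weight of missing dimensions to the ones with evidence.
    """
    real = {name: value for name, value in dims.items() if value is not None}
    total_weight = sum(weights[name] for name in real)
    if total_weight == 0:
        return None  # nothing to score: report "no data" rather than a default
    return sum(weights[name] * value for name, value in real.items()) / total_weight
```

With these made-up weights, an agent with 4.0 and 2.0 on two dimensions and nothing else scores 3.0. Padding the two missing dimensions with 2.5 would have pulled the same agent down to 2.8.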

Three validated signals beat eight hypothetical ones. After distribution validation, only three metadata signals actually differentiate agents in our current dataset: bio length, source sites, and platform type. We disabled the rest. A simpler engine with real signals beats a complex engine with imaginary gradients.
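The post doesn't spell out how the distribution validation was done, so here is one way such a check could look: keep a signal only if no single value dominates it. The 0.9 cutoff, the function names, and the `has_license` signal in the test data are assumptions for illustration, not the engine's actual rule.

```python
from collections import Counter


def differentiates(values, max_dominant_share=0.9):
    """True if the most common value covers at most max_dominant_share of agents."""
    if not values:
        return False
    top_count = Counter(values).most_common(1)[0][1]
    return top_count / len(values) <= max_dominant_share


def validated_signals(table, max_dominant_share=0.9):
    """Return the names of signals whose values actually vary across agents.

    `table` maps a signal name to its per-agent values. Continuous signals
    such as bio length would need bucketing before this check.
    """
    return [
        name
        for name, values in table.items()
        if differentiates(values, max_dominant_share)
    ]
```

A signal where every agent has the same value fails the check and would be disabled, which is the "imaginary gradient" case described above.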

Platform-level calibration works, with guardrails. The 250K erc8004 agents clustered around 2.19 with a standard deviation of 0.06. GitHub agents clustered at a single value. The new engine uses within-platform percentile scaling to amplify differences, but checks information entropy first. If the spread is likely noise, it skips calibration and labels the platform "insufficient differentiation."
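The guardrail can be sketched roughly as follows. This is an illustration under assumptions, not the production code: the 1-bit entropy threshold, the rounding to two decimals, and the 2.0-4.0 output range are all made up for the example.

```python
import math
from bisect import bisect_left, bisect_right
from collections import Counter


def shannon_entropy(scores, decimals=2):
    """Entropy in bits of the scores after rounding to `decimals` places."""
    counts = Counter(round(s, decimals) for s in scores)
    n = len(scores)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def calibrate_platform(scores, min_entropy_bits=1.0, low=2.0, high=4.0):
    """Scale scores by within-platform percentile, unless entropy is too low.

    Returns (scores, label). When the platform has too little variety,
    the raw scores come back unchanged with the label
    "insufficient differentiation".
    """
    if len(scores) < 2 or shannon_entropy(scores) < min_entropy_bits:
        return list(scores), "insufficient differentiation"

    ordered = sorted(scores)
    n = len(scores)
    calibrated = []
    for s in scores:
        # Mid-rank percentile, so tied scores share one percentile.
        below = bisect_left(ordered, s)
        at_or_below = bisect_right(ordered, s)
        pct = (below + at_or_below) / (2 * n)
        calibrated.append(low + pct * (high - low))
    return calibrated, "calibrated"
```

A platform where every agent has the same score has zero entropy, so it is skipped rather than stretched into differences that aren't there.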

What Didn't Work (Yet)

Consistency is still empty. Almost all 974K agents entered the database in the same batch. updated_at = created_at. No time series, no activity span. The consistency dimension is estimated for nearly everyone, so it doesn't participate in scores. "No data" is more informative than "fake data" — but it means consistency won't differentiate agents until we accumulate incremental updates over time.
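In code terms, the check that leaves consistency empty might look like this. The function name is hypothetical; only the `created_at` and `updated_at` fields come from the description above.

```python
from datetime import datetime, timedelta
from typing import Optional


def activity_span(created_at: datetime, updated_at: datetime) -> Optional[timedelta]:
    """Return the observed activity span, or None when there is no time series.

    An agent whose updated_at equals its created_at has never been seen
    changing, so there is nothing to measure consistency against.
    """
    span = updated_at - created_at
    return span if span > timedelta(0) else None
```

A `None` here would mark the consistency dimension as estimated, which keeps it out of the score until a later update produces a real span.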

The "verified" confidence label has zero differentiation. Our v3.1 confidence system labels agents by how many dimensions have real (non-estimated) data, and "verified" means 5 real dimensions. The problem: the threshold for "real" data is too low right now, so almost any agent with a bio and a source URL qualifies. We're not fixing this immediately. It's a known limitation, not a hidden one.
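A rough sketch of the labeling rule, assuming "5 real dimensions" means at least five. The dimension names in the usage example and the "not verified" fallback label are placeholders; the post only names the "verified" label.

```python
def confidence_label(has_real_data: dict, verified_min: int = 5) -> str:
    """Label an agent by how many dimensions have real (non-estimated) data.

    `has_real_data` maps a dimension name to True when the dimension is
    backed by real data and False when it is estimated.
    """
    real_count = sum(1 for is_real in has_real_data.values() if is_real)
    # Assumption: "verified" requires at least verified_min real dimensions.
    return "verified" if real_count >= verified_min else "not verified"
```

The weakness described above sits upstream of this function: if a bio and a source URL are enough to flip most dimensions to `True`, nearly every agent clears the count.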

The 2.0-2.9 band is still slightly above target. We aimed for 55-65% and landed at 65.6%. The gap exists because, even with validated signals, most agents simply don't have much real data. Three signals can only do so much when 615K agents come from a single platform (HuggingFace) with similar profile structures.

The Honest Scorecard

Metric	Status
Score distribution improved	✅ 84.6% → 65.6%
Agents with 3.5+ scores exist	✅
Consistency dimension functional	❌ Pending incremental data
Confidence labels meaningful	⚠️ Partially — verified threshold too low
Target band hit	⚠️ 65.6% vs 55-65% target

What This Means for Your Badge

If you already have an AgentRisk badge, your score may have changed — up or down. This isn't algorithm manipulation. It's us removing the padding and showing what we actually know.

If your agent has a GitHub repo, a detailed bio, or is listed on multiple platforms, your score likely went up. If your agent had a 2.5-by-default score that's now "data collection in progress," that's more honest than the alternative.

Check your score at agentrisk.app. Claim your agent to embed the badge in your README.

What's Next

The engine rewrite was about admitting what we don't know. The next phase is about expanding what we do know.

Incremental data collection is running. As agents get updated, consistency scores will emerge naturally.
New data sources are being evaluated. More platforms mean more cross-referencing signals.
Confidence label refinement will tighten the "verified" threshold as real data accumulates.

The 65.6% number isn't the end. It's the starting point after we stopped lying to ourselves.

Search for your agent at agentrisk.app · Full scoring methodology at agentrisk.app/methodology · Badge verification at agentrisk.app/verify