TL;DR: Our first scoring engine assigned default scores to missing data. With 98% of dimensions estimated, 84.6% of 974,000 agents ended up in the same narrow band, 2.0 to 2.9. We rewrote the engine with four changes. Here's what we found and what we changed.
84.6% of 974,000 AI agents scored between 2.0 and 2.9. Zero scored above 4.0. That's not scoring — that's a system failure.
We dug into the code and the data. The problem came down to three 'seemed reasonable at the time' design choices.
Three Bugs We Shipped As Features
| Problem | Why It's a Trap |
|---|---|
| 'Missing data → default score' | 98% of dimensions were estimated, but each got a 2.5 and counted toward the total. Result: everyone pulled to the middle. No data is not neutral data. No data means we don't know. |
| 'Hash offset for differentiation' | We used agent ID hashes to add random offsets around 2.5. Looks like differentiation — but ask 'why is this one 2.3 and that one 2.7?' and the answer is 'different hash.' That's not assessment. That's noise injection. |
| 'Metadata gradient assumptions' | We designed elaborate tiers: bio length 4 tiers, source_sites 4 tiers, agent age 5 tiers... After distribution validation, most signals had zero discriminative power in our current dataset. |
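To make the "noise injection" point concrete, here is a minimal sketch of what a hash-offset score looks like. The function name, the choice of SHA-256, and the ±0.5 spread are assumptions for illustration; the post does not describe the original implementation in that detail.

```python
import hashlib

def hash_offset_score(agent_id: str, base: float = 2.5, spread: float = 0.5) -> float:
    """Hypothetical reconstruction: a 'score' derived from nothing but the agent ID."""
    digest = hashlib.sha256(agent_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a number in [0, 1].
    unit = int.from_bytes(digest[:4], "big") / 0xFFFFFFFF
    # Shift to [-spread, +spread] around the default score.
    return round(base + (unit - 0.5) * 2 * spread, 1)

if __name__ == "__main__":
    for agent_id in ("agent-a", "agent-b", "agent-c"):
        print(agent_id, hash_offset_score(agent_id))
```

The output is stable per agent and looks varied, but the only input is the ID, so the spread carries no information about the agent itself.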
Four Changes
Missing data doesn't participate. Dimensions without real data don't get 2.5 and don't count toward the score. Their weight is redistributed to the dimensions that do have data. A score built on fewer dimensions covers less, but it is honest: it says 'here's everything I know,' not 'I think it's probably 2.5.'
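The difference between the two behaviours can be sketched as follows. The dimension names other than consistency and all of the weights are hypothetical; the post does not list the engine's real dimensions or weights.

```python
from typing import Dict, Optional

# Hypothetical weights, for illustration only.
WEIGHTS: Dict[str, float] = {"security": 0.4, "consistency": 0.3, "transparency": 0.3}

def score_old(dims: Dict[str, Optional[float]], default: float = 2.5) -> float:
    """Old behaviour: a missing dimension is filled with the default and still counted."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = dims.get(name)
        total += weight * (default if value is None else value)
    return total

def score_new(dims: Dict[str, Optional[float]]) -> Optional[float]:
    """New behaviour: only dimensions with real data count; their weights are renormalised."""
    real = {n: v for n, v in dims.items() if n in WEIGHTS and v is not None}
    if not real:
        return None  # nothing measured, so no score at all
    weight_sum = sum(WEIGHTS[n] for n in real)
    return sum(WEIGHTS[n] * v for n, v in real.items()) / weight_sum

if __name__ == "__main__":
    agent = {"security": 4.5, "consistency": None, "transparency": None}
    print(score_old(agent))  # pulled toward 2.5 by the two defaults
    print(score_new(agent))  # reflects only the measured dimension
```

With one strong measured dimension and two missing ones, the old formula drags the agent toward the middle, while the renormalised version reports only what was measured.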
Only validated metadata differentiation. After distribution validation, only three signals work: bio length, source_sites, and platform type. The rest — activity span, category, same-category alternatives — have zero discriminative power in our current data. Disabled for now, to be re-enabled as incremental data accumulates.
Platform-level calibration with entropy guardrails. 250K agents on erc8004 chains clustered around 2.19 (stddev 0.06). GitHub agents all scored 1.64 (zero variance). The new engine uses within-platform percentile scaling to amplify differences, but it checks information entropy first. If entropy is too low, the remaining variance is likely noise, so we skip calibration and label the platform 'insufficient differentiation.'
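A minimal sketch of that guardrail, assuming Shannon entropy over bucketed scores and a simple rank-based percentile. The bin width, the 1-bit threshold, and the function names are assumptions, and how percentiles map back onto the score scale is not specified in the post, so the sketch stops at percentiles.

```python
import bisect
import math
from collections import Counter
from typing import List, Optional, Sequence, Tuple

def entropy_bits(scores: Sequence[float], bin_width: float = 0.1) -> float:
    """Shannon entropy (in bits) of the scores after bucketing into fixed-width bins."""
    bins = Counter(math.floor(s / bin_width) for s in scores)
    n = len(scores)
    return sum(-(c / n) * math.log2(c / n) for c in bins.values())

def calibrate(
    scores: Sequence[float], min_entropy_bits: float = 1.0
) -> Tuple[Optional[List[float]], str]:
    """Return within-platform percentiles in [0, 1], or skip when entropy is too low."""
    if len(scores) < 2 or entropy_bits(scores) < min_entropy_bits:
        return None, "insufficient differentiation"
    ranked = sorted(scores)
    top = len(ranked) - 1
    # Percentile = share of the platform's agents ranked below this score (ties share the lower rank).
    percentiles = [bisect.bisect_left(ranked, s) / top for s in scores]
    return percentiles, "calibrated"

if __name__ == "__main__":
    print(calibrate([1.64] * 5))            # zero variance: calibration skipped
    print(calibrate([1.0, 2.0, 3.0, 4.0]))  # spread out: percentiles returned
```

Checking entropy before scaling matters because percentile scaling will stretch any distribution, including one whose spread is pure noise.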
Confidence labels. Agents with zero real dimensions don't get a fake score. Badges can still be generated, but display 'Data Collection in Progress' — encouraging developers to add information, rather than pretending we've already evaluated them.
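As a rough sketch of how a badge could carry that label: the `Badge` type, the `make_badge` function, and the wording of the non-empty label are hypothetical; only the 'Data Collection in Progress' text comes from the description above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Badge:
    score: Optional[float]  # None means "no score shown"
    label: str

def make_badge(score: Optional[float], real_dimensions: int) -> Badge:
    """Build a badge that never shows a number when nothing was actually measured."""
    if real_dimensions == 0 or score is None:
        return Badge(score=None, label="Data Collection in Progress")
    # Hypothetical label wording for agents that do have measured dimensions.
    return Badge(score=round(score, 1), label=f"Based on {real_dimensions} measured dimension(s)")

if __name__ == "__main__":
    print(make_badge(None, 0))
    print(make_badge(3.74, 2))
```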
The Most Painful Discovery: Consistency
The consistency dimension was the worst case: across 974K agents, almost every value was estimated.
The reason is simple: initial batch import. All agents entered the database at the same time. updated_at = created_at. No historical time series, no 'activity span.' Our original 'agent age' scoring — the longer an agent has been active, the more consistent — collided with the reality that every agent was 'active' for the same 0 days.
New engine: consistency is only calculated when time series data exists. Otherwise, it's marked estimated and excluded from the total. In the short term, most agents will have an empty consistency dimension. But 'no data' is more informative than 'fake data.'
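The "only when time series data exists" rule can be sketched as a guard on the timestamps. The function name and the optional `event_times` argument are assumptions; `created_at` and `updated_at` are the fields mentioned above, and the sketch stops at the activity span because the post does not say how the span becomes a consistency score.

```python
from datetime import datetime
from typing import Iterable, Optional

def activity_span_days(
    created_at: datetime,
    updated_at: datetime,
    event_times: Iterable[datetime] = (),
) -> Optional[float]:
    """Return the observed activity span in days, or None when there is no real history."""
    stamps = sorted({created_at, updated_at, *event_times})
    if len(stamps) < 2:
        # Batch-imported agent: updated_at == created_at and no other observations.
        return None
    return (stamps[-1] - stamps[0]).total_seconds() / 86400

if __name__ == "__main__":
    imported = datetime(2025, 1, 1)
    print(activity_span_days(imported, imported))               # None: mark estimated, exclude
    print(activity_span_days(imported, datetime(2025, 1, 11)))  # 10.0 days of real history
```

Returning `None` rather than 0 keeps "active for zero days" distinct from "we have no history," which is the distinction the original engine lost.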
Data Doesn't Lie — It Just Tells You When You're Overcomplicating Things
While rewriting the engine, we ran a distribution validation to check whether our designed metadata signals could actually differentiate agents. Results:
| Signal | Status | Notes |
|---|---|---|
| Bio length | ✅ | 3 effective tiers, 30%+ hit rate each after threshold adjustment |
| Source sites | ✅ | Binary: null 28% / present 72% |
| Platform type | ✅ | Naturally spread — largest differentiation signal |
| Others (activity span, category, alternatives) | ❌ | Disabled — zero discriminative power in current dataset |
The takeaway: data doesn't lie — it just tells you when you're overcomplicating things. Good. At least now we know which signals are real and which aren't.
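One simple way to run this kind of check is to bucket each signal into its tiers and see how many tiers actually hold a meaningful share of agents. This is a sketch under assumptions: the function names, the 5% minimum share, and the two-tier requirement are illustrative and not taken from the engine.

```python
from collections import Counter
from typing import Callable, Dict, Hashable, Sequence

def tier_shares(values: Sequence, tier_of: Callable[[object], Hashable]) -> Dict[Hashable, float]:
    """Share of agents that fall into each tier of a signal."""
    if not values:
        return {}
    counts = Counter(tier_of(v) for v in values)
    return {tier: c / len(values) for tier, c in counts.items()}

def is_discriminative(shares: Dict[Hashable, float], min_share: float = 0.05, min_tiers: int = 2) -> bool:
    """A signal separates agents only if at least min_tiers tiers are meaningfully populated."""
    return sum(1 for s in shares.values() if s >= min_share) >= min_tiers

if __name__ == "__main__":
    # Source sites: 28% null, 72% present (the split reported in the table above).
    sites = [None] * 28 + ["example-site"] * 72
    print(is_discriminative(tier_shares(sites, lambda v: "null" if v is None else "present")))
    # Activity span: every agent has the same 0 days, so there is a single tier.
    print(is_discriminative(tier_shares([0] * 100, lambda days: "0 days")))
```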
What Happens After the New Engine Goes Live?
| Score Range | Before | Target After |
|---|---|---|
| 2.0 - 2.9 | 84.6% | 55-65% |
| 3.0 - 3.4 | ~15% | 25-35% |
| 3.5+ | ~0.4% | 8-15% |
| 4.0+ | 0% | Real agents exist |
We're not manufacturing high scores. We're letting agents with real signals surface. Open-source agents on GitHub, multi-platform agents, agents with performance assessments — the ones drowned out by 2.5 defaults — will gain the differentiation they deserve.
The lesson from 84.6% clustering: admit what you don't know before pretending you do.
Steps are queued: database backup → rewrite scoring_engine → erc8004 pilot validation → batch recalculation of 974K agents → frontend confidence labels → badge color rules → deploy. ~5 hours total, executing this week. Rollback time: 5 minutes.
AgentRisk's mission: trust infrastructure for the age of AI agents.
Later this week, when recalculation finishes, search for your agent on agentrisk.app. If your agent's score changed — up or down — it's not algorithm manipulation. It's us admitting what we didn't know.
Full scoring methodology: agentrisk.app/methodology