How We Index 2M+ AI Agents Across Platforms
2026-06-09 · 6 min read
When we started AgentRisk, the first question wasn't "how do we score agents?" — it was "where are all the agents?"
AI agents don't live in one place. They're scattered across HuggingFace, Coze, GPTs stores, on-chain protocols, npm packages, and dozens of smaller platforms. No single registry exists. No unified API. No common schema.
So we built a collection pipeline that now indexes 2.1 million agents across 28+ platforms — and we learned a few things along the way.
The Problem: Fragmentation at Scale
Here's what the agent ecosystem looks like from the outside:
| Platform | Type | Approx. Agents | Access |
|---|---|---|---|
| HuggingFace Spaces | Web apps | 2,000,000+ | Open API |
| GPTs Store | Custom GPTs | 700,000+ | Third-party indexes |
| Coze | Bot marketplace | 100,000+ | Official API |
| On-chain (Olas, Virtuals, ERC-8004) | Smart contracts | ~10,000 | Subgraph / RPC |
| npm / PyPI | Agent packages | ~8,000 | Registry API |
| Long tail (Agentic.ai, Poe, Dify, ...) | Mixed | 100,000+ | Various |
Each platform has its own API schema, rate limits, authentication model, and data quality characteristics. Some have great APIs. Others require creative approaches. A few actively resist automated access.
Our pipeline handles all of them through a unified architecture.
The Architecture
┌─────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Source    │     │  Collector   │     │  Validator   │     │   Scoring    │
│  Discovery  │────▶│    Layer     │────▶│   & Dedup    │────▶│    Engine    │
└─────────────┘     └──────────────┘     └──────────────┘     └──────────────┘
- Platform          - Platform-          - canonical_id       - 6-dimension
  registry            specific             generation           framework
- RSS/webhook         adapters           - Cross-platform     - Ed25519
- On-chain          - Rate-limiting        deduplication        signing
  event logs        - Error recovery     - Schema             - Hash chain
- Community         - Incremental          normalization        anchoring
  submissions         scanning           - Placeholder
                                           detection
Let's break down each layer.
Layer 1: Source Discovery
How do we know what to index? Three approaches:
Platform registries: Most platforms have some form of directory — HuggingFace's /api/spaces, Coze's bot store, npm's registry. We maintain a prioritized source list ranked by three factors: API openness, agent volume, and daily growth rate.
On-chain events: Blockchain-based agent protocols emit events when new agents are registered. For example, Olas's Gnosis deployment uses a service registry contract, which we watch via a GraphQL subgraph:
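# Simplified: watching on-chain agent registration
import os

import requests

SUBGRAPH_URL = os.environ["SUBGRAPH_URL"]  # subgraph endpoint, supplied via the environment

QUERY = """
{
  services(first: 1000, orderBy: id, orderDirection: desc) {
    id
    owner
    agentId
    registeredAt: createdTimestamp
  }
}
"""

response = requests.post(SUBGRAPH_URL, json={"query": QUERY}, timeout=30)
response.raise_for_status()
services = response.json()["data"]["services"]  # ordered by id, descending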
Incremental polling: For platforms without webhooks, we poll their "recently created" endpoints at regular intervals. HuggingFace's API makes this easy — sort by createdAt, limit 100, and you get the latest entries.
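Here's a minimal sketch of the incremental step, assuming each listing entry carries an `id` and an ISO-8601 `createdAt` string. The helper names `parse_ts` and `select_new` are hypothetical and only illustrate the idea: keep a cursor, and return only entries newer than it.

```python
from __future__ import annotations

from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an ISO-8601 timestamp; accept a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def select_new(entries: list[dict], cursor: str | None) -> tuple[list[dict], str | None]:
    """Return the entries created after `cursor`, plus the advanced cursor.

    `entries` is one page from a "recently created" endpoint.
    `cursor` is the createdAt value of the newest entry seen so far, or None
    on the very first run.
    """
    if cursor is None:
        fresh = list(entries)
    else:
        seen_until = parse_ts(cursor)
        fresh = [e for e in entries if parse_ts(e["createdAt"]) > seen_until]
    if not fresh:
        return [], cursor  # nothing new: keep the old cursor
    newest = max(fresh, key=lambda e: parse_ts(e["createdAt"]))
    return fresh, newest["createdAt"]
```

One caveat with this approach: if more new entries arrive between polls than fit in a single page, the poller has to keep paging until it reaches the cursor, otherwise entries are silently missed.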
Layer 2: Platform-Specific Collectors
Each platform gets its own adapter. The interface is the same; the internals differ wildly.
Here's the pattern:
from __future__ import annotations

import time
from dataclasses import dataclass, field

import requests


@dataclass
class NormalizedRecord:
    """Unified record (illustrative: only the fields used below are shown)."""
    platform: str
    source_id: str
    display_name: str
    tags: list[str] = field(default_factory=list)
    sdk: str | None = None
    is_private: bool = False
    created_at: str | None = None


class BaseCollector:
    """Every collector implements this interface."""

    def discover(self) -> list[str]:
        """Return a list of agent IDs to collect."""
        ...

    def fetch_one(self, agent_id: str) -> NormalizedRecord | None:
        """Fetch a single agent's data. Return None on failure."""
        ...

    def normalize(self, raw: dict) -> NormalizedRecord:
        """Map platform-specific fields to our unified schema."""
        ...


class HuggingFaceCollector(BaseCollector):
    RATE_LIMIT = 0.5  # seconds between requests

    def discover(self):
        # HF has a clean API for incremental discovery
        resp = requests.get(
            "https://huggingface.co/api/spaces",
            params={"sort": "createdAt", "direction": -1, "limit": 100},
            timeout=30,
        )
        resp.raise_for_status()
        return [s["id"] for s in resp.json()]

    def fetch_one(self, agent_id):
        time.sleep(self.RATE_LIMIT)  # space out consecutive requests
        resp = requests.get(f"https://huggingface.co/api/spaces/{agent_id}", timeout=30)
        if resp.status_code != 200:
            return None
        return self.normalize(resp.json())

    def normalize(self, raw):
        card = raw.get("cardData") or {}  # cardData may be missing or null
        return NormalizedRecord(
            platform="huggingface",
            source_id=raw["id"],
            display_name=card.get("title", raw["id"]),
            tags=raw.get("tags", []),
            sdk=raw.get("sdk"),
            is_private=raw.get("private", False),
            created_at=raw.get("createdAt"),
        )
On-chain collectors look different. For Virtuals Protocol on Base, we scan ERC-20 Transfer events to discover new agent token contracts:
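# Simplified: discovering agents via token transfers
# RPC_URL, VIRTUAL_TOKEN, last_block and current_block are defined elsewhere
TRANSFER_TOPIC = "0xddf252ad..."  # Transfer(address,address,uint256), truncated here
resp = requests.post(RPC_URL, json={
    "jsonrpc": "2.0",  # required by the JSON-RPC 2.0 envelope
    "id": 1,
    "method": "eth_getLogs",
    "params": [{
        "fromBlock": hex(last_block),
        "toBlock": hex(current_block),
        "address": VIRTUAL_TOKEN,
        "topics": [TRANSFER_TOPIC],
    }],
}, timeout=30)
logs = resp.json()["result"]
# Extract new contract addresses from transfer logs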
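Because RPC endpoints limit the block range a single eth_getLogs call may cover, a scan like this one is normally split into smaller windows. The following is a minimal sketch, assuming inclusive block ranges. `block_chunks` and `log_filter` are hypothetical helper names, and the 10,000 default mirrors the chunk size we use.

```python
from __future__ import annotations

from typing import Iterator


def block_chunks(start: int, end: int, size: int = 10_000) -> Iterator[tuple[int, int]]:
    """Yield inclusive (from_block, to_block) windows covering start..end."""
    if size <= 0:
        raise ValueError("size must be positive")
    current = start
    while current <= end:
        upper = min(current + size - 1, end)
        yield current, upper
        current = upper + 1


def log_filter(from_block: int, to_block: int, address: str, topic: str) -> dict:
    """Build the eth_getLogs filter object for one window."""
    return {
        "fromBlock": hex(from_block),
        "toBlock": hex(to_block),
        "address": address,
        "topics": [topic],
    }
```

Each window then becomes one eth_getLogs request, and the upper bound of the last successful window is a natural value to record as progress.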
The key design principle: collectors are stateless and resumable. Nothing important lives in the process itself. Instead, we persist the last successfully processed block number, page offset, or timestamp, so a collector that crashes mid-run picks up where it left off.
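To make the resumability idea concrete, here's a minimal sketch that stores progress in a JSON file. The `Checkpoint` class, the `run` helper, and the file-based storage are all assumptions for illustration; any durable store would serve the same purpose.

```python
import json
import os
import tempfile
from pathlib import Path


class Checkpoint:
    """Persist a collector's last processed position to a JSON file."""

    def __init__(self, path):
        self.path = Path(path)

    def load(self, default=None):
        """Return the saved position, or `default` if nothing was saved yet."""
        try:
            return json.loads(self.path.read_text())["position"]
        except FileNotFoundError:
            return default

    def save(self, position) -> None:
        """Write atomically: temp file first, then rename over the old one."""
        fd, tmp = tempfile.mkstemp(dir=str(self.path.parent))
        with os.fdopen(fd, "w") as fh:
            json.dump({"position": position}, fh)
        os.replace(tmp, self.path)


def run(items, checkpoint, process):
    """Process (position, payload) pairs, skipping positions already done."""
    last = checkpoint.load(default=-1)
    for position, payload in items:
        if position <= last:
            continue  # handled in a previous run
        process(payload)
        checkpoint.save(position)  # advance only after success
```

Saving after each successful item means a crash costs at most one item of repeated work, which is why `process` should be safe to run twice on the same payload.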
Layer 3: Validation & Deduplication
This is where it gets interesting — and where most naive pipelines break.
canonical_id generation: The same agent might appear on multiple platforms under different names. We generate a canonical_id that cross-references agents across platforms. (We'll cover this system in detail in our next post.)
Placeholder detection: A significant share of registry entries are placeholders, meaning accounts that registered but never deployed anything. We flag these based on multiple signals, including empty descriptions, missing activity timestamps, and default profile data. Our current placeholder rate (what we call the "water rate") is 0.038%, meaning 99.96% of indexed agents have real, verifiable data.
Schema normalization: Every platform has different field names for the same concept. HuggingFace calls it sdk, Coze calls it bot_type, on-chain agents have service_type. We map everything to a unified schema before storage.
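The normalization and flagging steps can be sketched roughly as follows. The platform field names (`sdk`, `bot_type`, `service_type`) come from the text above; everything else is an assumption for illustration, including the unified key `agent_type`, the record fields `description`, `last_activity_at`, and `display_name`, and the two-signal threshold.

```python
# Platform field that carries the "kind of agent" concept on each platform.
TYPE_FIELD = {
    "huggingface": "sdk",
    "coze": "bot_type",
    "onchain": "service_type",
}


def normalize_type(platform: str, raw: dict) -> dict:
    """Copy the platform-specific type field into one unified key.

    `agent_type` is a hypothetical name for the unified field.
    """
    field_name = TYPE_FIELD[platform]
    return {"platform": platform, "agent_type": raw.get(field_name)}


def placeholder_signals(record: dict) -> list[str]:
    """Return the names of the placeholder signals that fire for a record.

    The record fields checked here are assumptions, not a real schema.
    """
    signals = []
    if not (record.get("description") or "").strip():
        signals.append("empty_description")
    if not record.get("last_activity_at"):
        signals.append("no_activity_timestamp")
    # A display name that is missing or equal to the raw ID counts as default.
    if record.get("display_name") in (None, "", record.get("source_id")):
        signals.append("default_profile")
    return signals


def is_placeholder(record: dict, min_signals: int = 2) -> bool:
    """Flag a record only when several signals agree (assumed threshold)."""
    return len(placeholder_signals(record)) >= min_signals
```

Requiring more than one signal is a deliberate choice in this sketch: a real agent can easily have an empty description, but an empty description combined with no activity and a default name is much harder to explain.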
Layer 4: Scoring Engine
Once validated and deduplicated, agents enter our six-dimension scoring framework: Authenticity, Consistency, Transparency, Commitment, Choice, and Presence.
The scoring engine is a separate system — and a topic for a future post. But the key insight is that collection quality directly determines scoring quality. Garbage in, garbage out applies doubly to trust scoring.
What We Learned
1. Rate limits are generous — until they're not. Most platforms allow reasonable automated access. But if you're polling every 30 seconds from a single IP, you'll get throttled. We use 0.5-2 second delays between requests and exponential backoff on errors.
2. On-chain data is the cleanest — and the hardest. Blockchain data is immutable and well-structured, but RPC endpoints have block range limits on eth_getLogs. We scan in chunks of 10,000 blocks.
3. Placeholder detection matters more than collection speed. It's tempting to chase volume. But 2 million agents where 40% are placeholders is worse than 1 million where 0.04% are. We'd rather index fewer agents with higher confidence.
4. Incremental > full scan. Our collectors run in incremental mode 99% of the time — only fetching what's changed since the last run. Full scans are reserved for schema migrations and bug recovery.
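For the first lesson, here's a minimal sketch of request pacing and exponential backoff. The function names are hypothetical, and the jitter and the 60-second cap are our additions for illustration rather than values stated above.

```python
import random
import time


def polite_pause(low: float = 0.5, high: float = 2.0, sleep=time.sleep) -> None:
    """Wait a random 0.5 to 2 seconds between ordinary requests."""
    sleep(random.uniform(low, high))


def call_with_backoff(fn, retries: int = 4, base: float = 0.5,
                      cap: float = 60.0, sleep=time.sleep):
    """Call fn(); on failure wait base * 2**attempt seconds and retry.

    Makes at most retries + 1 calls and re-raises the final error.
    """
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:  # narrow this to network errors in real code
            if attempt == retries:
                raise
            delay = min(cap, base * 2 ** attempt)
            # Jitter keeps many workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

Passing `sleep` as a parameter keeps the helper easy to test: a test can collect the requested delays in a list instead of actually waiting.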
By The Numbers
| Metric | Value |
|---|---|
| Total agents indexed | 2,163,677 |
| Platforms covered | 28+ |
| Water rate (placeholders) | 0.038% |
| Daily new agents | ~1,159 |
| Timeline events tracked | 9,546,093 |
| Hash chain entries | Continuous, no gaps |
What's Next
In our next post, we'll dive into the canonical_id system — how we identify the same agent across HuggingFace, GitHub, on-chain contracts, and marketplace listings. Cross-platform identity is the hardest problem in agent indexing, and we think we have a workable solution.
AgentRisk indexes and scores AI agents for trust and transparency. Check your agent at agentrisk.app or explore our methodology at agentrisk.app/methodology.