Agent-Risk

Posted on Jun 9

How We Index 2M+ AI Agents Across Platforms

#ai #agents #datatransparency #buildinpublic

2026-06-09 · 6 min read

When we started AgentRisk, the first question wasn't "how do we score agents?" — it was "where are all the agents?"

AI agents don't live in one place. They're scattered across HuggingFace, Coze, GPTs stores, on-chain protocols, npm packages, and dozens of smaller platforms. No single registry exists. No unified API. No common schema.

So we built a collection pipeline that now indexes 2.1 million agents across 28+ platforms — and we learned a few things along the way.

The Problem: Fragmentation at Scale

Here's what the agent ecosystem looks like from the outside:

Platform	Type	Approx. Agents	Access
HuggingFace Spaces	Web apps	2,000,000+	Open API
GPTs Store	ChatGPT plugins	700,000+	Third-party indexes
Coze	Bot marketplace	100,000+	Official API
On-chain (Olas, Virtuals, ERC-8004)	Smart contracts	~10,000	Subgraph / RPC
npm / PyPI	Agent packages	~8,000	Registry API
Long tail (Agentic.ai, Poe, Dify, ...)	Mixed	100,000+	Various

Each platform has its own API schema, rate limits, authentication model, and data quality characteristics. Some have great APIs. Others require creative approaches. A few actively resist automated access.

Our pipeline handles all of them through a unified architecture.

The Architecture

┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Source      │     │  Collector   │     │  Validator   │     │  Scoring     │
│  Discovery   │────▶│  Layer       │────▶│  & Dedup     │────▶│  Engine      │
└──────────────┘     └──────────────┘     └──────────────┘     └──────────────┘

Source Discovery:   platform registries, RSS/webhooks, on-chain event logs, community submissions
Collector Layer:    platform-specific adapters, rate limiting, error recovery, incremental scanning
Validator & Dedup:  canonical_id generation, cross-platform deduplication, schema normalization, water marking (placeholder) detection
Scoring Engine:     6-dimension framework, Ed25519 signing, hash chain anchoring

Let's break down each layer.

Layer 1: Source Discovery

How do we know what to index? Three approaches:

Platform registries: Most platforms have some form of directory — HuggingFace's /api/spaces, Coze's bot store, npm's registry. We maintain a prioritized source list ranked by three factors: API openness, agent volume, and daily growth rate.
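A minimal sketch of how a source list could be ranked on those three factors. The `Source` fields, the weights, and the log scaling are assumptions made for illustration; the post does not describe the actual formula.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    name: str
    api_openness: float  # hypothetical 0.0-1.0 rating of how open the API is
    agent_count: int     # approximate number of agents on the platform
    daily_growth: int    # approximate new agents per day


def priority(src: Source, w_open: float = 0.5, w_volume: float = 0.3,
             w_growth: float = 0.2) -> float:
    """Weighted score. Log scaling keeps very large platforms from dominating."""
    volume = math.log10(src.agent_count + 1)
    growth = math.log10(src.daily_growth + 1)
    return w_open * src.api_openness + w_volume * volume + w_growth * growth


def rank_sources(sources: list[Source]) -> list[Source]:
    """Return sources ordered from highest to lowest priority."""
    return sorted(sources, key=priority, reverse=True)
```

With weights like these, an open platform with a large catalog sorts ahead of a small, closed one, which matches the ordering intent described above.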

On-chain events: Blockchain-based agent protocols emit events when new agents are registered. For example, Olas's Gnosis deployment uses a service registry contract — we watch it via GraphQL subgraph:

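# Simplified: watching on-chain agent registration
import requests

# SUBGRAPH_URL is the subgraph's GraphQL endpoint, configured elsewhere.
QUERY = """
{
  services(first: 1000, orderBy: id, orderDirection: desc) {
    id
    owner
    agentId
    registeredAt: createdTimestamp
  }
}
"""
response = requests.post(SUBGRAPH_URL, json={"query": QUERY}, timeout=30)
response.raise_for_status()
services = response.json()["data"]["services"]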

Incremental polling: For platforms without webhooks, we poll their "recently created" endpoints at regular intervals. HuggingFace's API makes this easy — sort by createdAt, limit 100, and you get the latest entries.
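One way to turn a "recently created" listing into an incremental feed is to keep a cursor holding the newest timestamp seen so far. The sketch below assumes the listing is newest-first and that `createdAt` is an ISO-8601 UTC string in a uniform format, so plain string comparison follows time order. `poll_new` and `fetch_latest` are hypothetical names, not part of any platform SDK.

```python
from typing import Callable, Iterable, Optional


def poll_new(fetch_latest: Callable[[], Iterable[dict]],
             last_seen: Optional[str]) -> tuple[list[dict], Optional[str]]:
    """Return entries created after `last_seen`, plus the updated cursor."""
    fresh = []
    for entry in fetch_latest():
        created = entry["createdAt"]
        if last_seen is not None and created <= last_seen:
            break  # everything from here on was seen in an earlier poll
        fresh.append(entry)
    # The first fresh entry is the newest one, so it becomes the new cursor.
    new_cursor = fresh[0]["createdAt"] if fresh else last_seen
    return fresh, new_cursor
```

A single page of 100 can miss entries if more than 100 appear between polls, so the polling interval has to be short enough for the platform's growth rate, or the caller has to page further back until it reaches the cursor.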

Layer 2: Platform-Specific Collectors

Each platform gets its own adapter. The interface is the same; the internals differ wildly.

Here's the pattern:

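import time
from dataclasses import dataclass, field

import requests


@dataclass
class NormalizedRecord:
    """Unified schema. The post does not show this class; the fields here
    are inferred from what normalize() fills in below."""
    platform: str
    source_id: str
    display_name: str
    tags: list[str] = field(default_factory=list)
    sdk: str | None = None
    is_private: bool = False
    created_at: str | None = None


class BaseCollector:
    """Every collector implements this interface."""

    def discover(self) -> list[str]:
        """Return a list of agent IDs to collect."""
        raise NotImplementedError

    def fetch_one(self, agent_id: str) -> NormalizedRecord | None:
        """Fetch a single agent's data. Return None on failure."""
        raise NotImplementedError

    def normalize(self, raw: dict) -> NormalizedRecord:
        """Map platform-specific fields to our unified schema."""
        raise NotImplementedError


class HuggingFaceCollector(BaseCollector):
    RATE_LIMIT = 0.5  # seconds between requests
    API_URL = "https://huggingface.co/api/spaces"

    def discover(self):
        # HF has a clean API for incremental discovery
        resp = requests.get(
            self.API_URL,
            params={"sort": "createdAt", "direction": -1, "limit": 100},
            timeout=30,
        )
        resp.raise_for_status()
        return [s["id"] for s in resp.json()]

    def fetch_one(self, agent_id):
        time.sleep(self.RATE_LIMIT)  # space out requests to respect rate limits
        resp = requests.get(f"{self.API_URL}/{agent_id}", timeout=30)
        if resp.status_code != 200:
            return None
        return self.normalize(resp.json())

    def normalize(self, raw):
        card = raw.get("cardData") or {}  # cardData may be missing or null
        return NormalizedRecord(
            platform="huggingface",
            source_id=raw["id"],
            display_name=card.get("title", raw["id"]),
            tags=raw.get("tags", []),
            sdk=raw.get("sdk"),
            is_private=raw.get("private", False),
            created_at=raw.get("createdAt"),
        )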

On-chain collectors look different. For Virtuals Protocol on Base, we scan ERC-20 Transfer events to discover new agent token contracts:

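# Simplified: discovering agents via token transfers
import requests

# RPC_URL, VIRTUAL_TOKEN, last_block and current_block are defined elsewhere.
TRANSFER_TOPIC = "0xddf252ad..."  # Transfer(address,address,uint256), truncated here
resp = requests.post(RPC_URL, json={
    "jsonrpc": "2.0",  # required by JSON-RPC 2.0
    "id": 1,
    "method": "eth_getLogs",
    "params": [{
        "fromBlock": hex(last_block),
        "toBlock": hex(current_block),
        "address": VIRTUAL_TOKEN,
        "topics": [TRANSFER_TOPIC],
    }],
}, timeout=30)
resp.raise_for_status()
logs = resp.json()["result"]
# Extract new contract addresses from transfer logs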

The key design principle: collectors are stateless and resumable. Progress lives outside the collector process: we persist the last successfully processed block number, page offset, or timestamp, so a collector that crashes mid-run picks up where it left off.
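A rough sketch of the resumable pattern, assuming the progress marker is kept in a small JSON file. The `Checkpoint` class and `run` function are hypothetical; a real pipeline might store the marker in a database instead.

```python
import json
import os
import tempfile
from pathlib import Path


class Checkpoint:
    """Persists a collector's progress marker in a small JSON file."""

    def __init__(self, path):
        self.path = Path(path)

    def load(self, default=None):
        """Return the saved marker, or `default` if nothing was saved yet."""
        if not self.path.exists():
            return default
        return json.loads(self.path.read_text())["cursor"]

    def save(self, cursor):
        """Write to a temp file, then rename, so a crash cannot leave a torn file."""
        fd, tmp = tempfile.mkstemp(dir=self.path.parent)
        with os.fdopen(fd, "w") as fh:
            json.dump({"cursor": cursor}, fh)
        os.replace(tmp, self.path)


def run(items, process, checkpoint):
    """Process (cursor, payload) pairs in order, skipping already-finished ones."""
    last = checkpoint.load()
    for cursor, payload in items:
        if last is not None and cursor <= last:
            continue  # handled in an earlier run
        process(payload)
        checkpoint.save(cursor)  # advance only after the item succeeded
```

Because the marker advances only after `process` returns, a crash can cause the in-flight item to be processed again on the next run, so `process` should be safe to repeat.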

Layer 3: Validation & Deduplication

This is where it gets interesting — and where most naive pipelines break.

canonical_id generation: The same agent might appear on multiple platforms under different names. We generate a canonical_id that cross-references agents across platforms. (We'll cover this system in detail in our next post.)

Water marking (placeholder) detection: A significant portion of the entries in agent registries are placeholders, meaning accounts that registered but never deployed anything. We flag these based on multiple signals: empty descriptions, no activity timestamps, default profile data. Our current water rate, the share of indexed agents flagged as placeholders, is 0.038%, meaning 99.96% of indexed agents have real, verifiable data.
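As an illustration only, the three signals named above could be combined like this. The field names (`description`, `last_activity_at`, `display_name`), the `DEFAULT_NAMES` set, and the two-signal threshold are all assumptions; the post does not specify how signals are weighted.

```python
# Hypothetical default names a platform might assign to untouched profiles.
DEFAULT_NAMES = {None, "", "Untitled", "New Bot"}


def placeholder_signals(record: dict) -> list[str]:
    """Return the names of the placeholder signals that fire for a record."""
    signals = []
    if not (record.get("description") or "").strip():
        signals.append("empty_description")
    if not record.get("last_activity_at"):
        signals.append("no_activity_timestamp")
    if record.get("display_name") in DEFAULT_NAMES:
        signals.append("default_profile")
    return signals


def is_placeholder(record: dict, min_signals: int = 2) -> bool:
    """Flag a record only when several signals agree, to limit false positives."""
    return len(placeholder_signals(record)) >= min_signals


def water_rate(records: list[dict]) -> float:
    """Percentage of records flagged as placeholders."""
    if not records:
        return 0.0
    flagged = sum(is_placeholder(r) for r in records)
    return 100.0 * flagged / len(records)
```

Requiring more than one signal is a design choice: a real agent with a blank description should not be discarded on that basis alone.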

Schema normalization: Every platform has different field names for the same concept. HuggingFace calls it sdk, Coze calls it bot_type, on-chain agents have service_type. We map everything to a unified schema before storage.
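The field mapping described above can be expressed as a lookup table. In this sketch the platform keys and the function name `unified_agent_type` are assumptions; only the three source field names come from the text.

```python
from typing import Optional

# Which raw field carries the "kind of agent" concept on each platform.
AGENT_TYPE_FIELD = {
    "huggingface": "sdk",
    "coze": "bot_type",
    "onchain": "service_type",
}


def unified_agent_type(platform: str, raw: dict) -> Optional[str]:
    """Read the platform-specific field and return it under one unified name."""
    try:
        source_field = AGENT_TYPE_FIELD[platform]
    except KeyError:
        # Fail loudly: an unmapped platform is a pipeline bug, not missing data.
        raise ValueError(f"no field mapping for platform {platform!r}") from None
    return raw.get(source_field)
```

Keeping the mapping in data rather than in per-collector code means a new platform is one table entry, and an unmapped platform raises an error instead of silently producing empty values.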

Layer 4: Scoring Engine

Once validated and deduplicated, agents enter our six-dimension scoring framework: Authenticity, Consistency, Transparency, Commitment, Choice, and Presence.

The scoring engine is a separate system — and a topic for a future post. But the key insight is that collection quality directly determines scoring quality. Garbage in, garbage out applies doubly to trust scoring.

What We Learned

1. Rate limits are generous — until they're not. Most platforms allow reasonable automated access. But if you're polling every 30 seconds from a single IP, you'll get throttled. We use 0.5-2 second delays between requests and exponential backoff on errors.

2. On-chain data is the cleanest — and the hardest. Blockchain data is immutable and well-structured, but RPC endpoints have block range limits on eth_getLogs. We scan in chunks of 10,000 blocks.

3. Placeholder detection matters more than collection speed. It's tempting to chase volume. But 2 million agents where 40% are placeholders is worse than 1 million where 0.04% are. We'd rather index fewer agents with higher confidence.

4. Incremental > full scan. Our collectors run in incremental mode 99% of the time — only fetching what's changed since the last run. Full scans are reserved for schema migrations and bug recovery.
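Lessons 1 and 2 can be sketched together: split a block range into 10,000-block chunks, and wrap each request in exponential backoff. The function names, the retry count, the 60-second cap, and the jitter factor are assumptions; the 0.5-second base simply borrows the low end of the delay range mentioned above.

```python
import random
import time


def block_chunks(start: int, end: int, size: int = 10_000):
    """Yield inclusive (from_block, to_block) ranges covering start..end."""
    lo = start
    while lo <= end:
        hi = min(lo + size - 1, end)
        yield lo, hi
        lo = hi + 1


def with_retries(call, sleep=time.sleep, retries: int = 5,
                 base: float = 0.5, cap: float = 60.0):
    """Run call(); on failure wait base * 2**attempt (capped, jittered) and retry.

    Catching Exception keeps the sketch short. Real code would catch only
    the transport and throttling errors it knows how to recover from.
    """
    for attempt in range(retries + 1):
        try:
            return call()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            delay = min(cap, base * 2 ** attempt)
            # Jitter spreads retries out so parallel workers do not sync up.
            sleep(delay * (0.5 + 0.5 * random.random()))
```

A scan loop would then call something like `with_retries(lambda: fetch_logs(lo, hi))` for each `(lo, hi)` from `block_chunks(last_block + 1, current_block)`, where `fetch_logs` is a hypothetical wrapper around the `eth_getLogs` request.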

By The Numbers

Metric	Value
Total agents indexed	2,163,677
Platforms covered	28+
Water rate (placeholders)	0.038%
Daily new agents	~1,159
Timeline events tracked	9,546,093
Hash chain entries	Continuous, no gaps

What's Next

In our next post, we'll dive into the canonical_id system — how we identify the same agent across HuggingFace, GitHub, on-chain contracts, and marketplace listings. Cross-platform identity is the hardest problem in agent indexing, and we think we have a workable solution.

AgentRisk indexes and scores AI agents for trust and transparency. Check your agent at agentrisk.app or explore our methodology at agentrisk.app/methodology.
