184 MCP installs in the 72 hours after publishing agentoracle-mcp. More importantly: 93.9% of the refuted claims we caught were flagged by the adversarial layer, a signal GPT-4o alone could not produce.
This post is about the verification layer under that number: the benchmark methodology, the architecture that produces the adversarial signal, and why claim verification deserves its own layer in the agent stack.
## The Benchmark
We ran AgentOracle head-to-head against GPT-4o on 200 claims from the FEVER dataset — the peer-reviewed fact verification benchmark used in dozens of published papers.
Stratified sample: 67 SUPPORTS, 67 REFUTES, 66 NOT ENOUGH INFO. Random seed 42, fully reproducible. Every claim 10+ words. GPT-4o baseline via OpenRouter, single-word answer prompt, temperature 0.
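For readers reproducing the split, the stratified draw can be sketched like this. A minimal sketch assuming the public FEVER JSONL format with `claim` and `label` fields; the file path and helper name are illustrative, not the benchmark repo's actual script:

```python
import json
import random

def stratified_sample(fever_path: str, per_label: dict, seed: int = 42) -> list:
    """Draw a label-stratified sample from a FEVER-style JSONL file."""
    buckets = {"SUPPORTS": [], "REFUTES": [], "NOT ENOUGH INFO": []}
    with open(fever_path) as f:
        for line in f:
            row = json.loads(line)
            # Mirror the benchmark's filter: claims of 10+ words only.
            if row["label"] in buckets and len(row["claim"].split()) >= 10:
                buckets[row["label"]].append(row)
    rng = random.Random(seed)  # fixed seed, so the draw is reproducible
    sample = []
    for label, n in per_label.items():
        sample.extend(rng.sample(buckets[label], n))
    rng.shuffle(sample)
    return sample

# The post's split: 67 + 67 + 66 = 200 claims.
# claims = stratified_sample("fever_dev.jsonl",
#                            {"SUPPORTS": 67, "REFUTES": 67, "NOT ENOUGH INFO": 66})
```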
| System | Accuracy | Response time |
|---|---|---|
| AgentOracle (`/evaluate`) | 58.4% (115/197 valid) | multi-source (slower) |
| GPT-4o (closed-source frontier) | 57.5% (115/200) | single call (fast) |
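The GPT-4o baseline row is a single chat call per claim. A hedged sketch of what such a baseline looks like, assuming OpenRouter's OpenAI-compatible chat completions endpoint; the exact prompt wording is our reading of "single-word answer prompt," not the repository's script:

```python
import json
import os
import urllib.request

def build_prompt(claim: str) -> str:
    """Single-word-answer prompt; the exact wording is an assumption."""
    return ("Answer with exactly one word: SUPPORTS, REFUTES, or "
            f"NOT ENOUGH INFO.\n\nClaim: {claim}")

def gpt4o_verdict(claim: str) -> str:
    """One OpenRouter chat call at temperature 0, as in the baseline row."""
    req = urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=json.dumps({
            "model": "openai/gpt-4o",
            "temperature": 0,
            "messages": [{"role": "user", "content": build_prompt(claim)}],
        }).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"].strip().upper()
```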
A statistical tie on accuracy. Within measurement noise on a 200-claim sample. That's not the headline.
GPT-5 shipped this week. Accuracy benchmarks will keep moving. The adversarial architecture doesn't.
Full methodology, raw results, and reproducibility scripts: github.com/TKCollective/agentoracle-fever-benchmark
## The Real Finding: 94% Adversarial Contribution
AgentOracle runs 4 verification sources in parallel: Sonar, Sonar Pro, Adversarial challenge, and Gemma 4. The adversarial source is the differentiator — it's deliberately prompted to argue against each claim, surfacing counter-evidence instead of affirming.
Of the REFUTES claims that AgentOracle correctly identified, 93.9% were flagged by the adversarial layer specifically.
That's the unique signal. GPT-4o alone can't replicate it. Single-model verification confirms what's there; adversarial challenge surfaces what's missing. In agent pipelines where the cost of acting on a hallucination is a wrong action, not a wrong answer, that asymmetry matters more than accuracy parity.
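Concretely, the adversarial-contribution number is just the flagged share of correctly refuted claims. A small sketch; the result schema (`gold`, `predicted`, `adversarial_flagged`) is illustrative, not the benchmark's actual field names:

```python
def adversarial_contribution(results: list) -> float:
    """Share of correctly refuted claims flagged by the adversarial source."""
    correct_refutes = [r for r in results
                       if r["gold"] == "REFUTES" and r["predicted"] == "REFUTES"]
    if not correct_refutes:
        return 0.0
    flagged = sum(1 for r in correct_refutes if r["adversarial_flagged"])
    return flagged / len(correct_refutes)
```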
## The Architecture
```
claim in
  ↓
decompose (Gemma) → atomic claims
  ↓
parallel fan-out:
  ├─ Sonar: "is this true?"
  ├─ Sonar Pro: "is this true with extended reasoning?"
  ├─ Adversarial: "argue why this is false"
  └─ Gemma 4: "verify + calibrate"
  ↓
consensus + confidence calibration
  ↓
per-claim verdict: SUPPORTED / REFUTED / UNVERIFIABLE
  + evidence string
  + confidence 0.00–1.00
  + correction (if refuted)
```
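The fan-out step maps naturally onto `asyncio.gather`. A minimal sketch: `query_source` is a stand-in stub (the real Sonar/Gemma clients are not public API), and the prompts paraphrase the diagram:

```python
import asyncio

# Stand-in for a real model call; it just echoes the routing so the
# fan-out shape is visible without a network dependency.
async def query_source(name: str, prompt: str, claim: str) -> dict:
    await asyncio.sleep(0)  # yield to the loop, simulating network I/O
    return {"source": name, "prompt": prompt, "claim": claim}

async def evaluate(claim: str) -> dict:
    """Fan one claim out to all four sources in parallel."""
    sources = {
        "sonar": "Is this claim true?",
        "sonar_pro": "Is this claim true? Use extended reasoning.",
        "adversarial": "Argue why this claim is false; cite counter-evidence.",
        "gemma": "Verify this claim and calibrate your confidence.",
    }
    # asyncio.gather preserves argument order, so results line up with keys.
    answers = await asyncio.gather(
        *(query_source(name, prompt, claim) for name, prompt in sources.items())
    )
    return dict(zip(sources, answers))
```

The consensus step would then combine the four answers; the sketch stops at the fan-out because that is the part the diagram describes.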
The adversarial source is not a safety filter. It's a research task: find the best counter-argument, evidence included. Even when a claim is ultimately supported, the adversarial output becomes input to the calibration step — which is why AgentOracle's confidence scores are meaningful, not noise.
Confidence calibration on the 200-claim benchmark:
- Average confidence on correct predictions: 0.61
- Average confidence on incorrect predictions: 0.55
That 6-point gap sounds small but is exactly what you want: the system is more certain when it's right, less certain when it's wrong. Agents branching on confidence thresholds get useful signal, not just theater.
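The gap itself is a one-line computation over per-claim records; the `(confidence, is_correct)` pair format here is illustrative:

```python
def calibration_gap(records: list) -> float:
    """Average confidence on correct minus average confidence on incorrect.

    A positive gap means the system is more confident when it is right,
    which is the signal agents branching on thresholds actually need.
    """
    correct = [c for c, ok in records if ok]
    wrong = [c for c, ok in records if not ok]
    return sum(correct) / len(correct) - sum(wrong) / len(wrong)

# With the benchmark's averages (0.61 correct vs 0.55 incorrect),
# the gap works out to roughly 0.06.
```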
## 184 Installs Decomposed
The install curve so far:
- Day 1 (launch): tutorial + MCP publish → organic npm discovery → 168 installs in 24h
- Day 2: +16
Nobody is running a campaign. There's no paid distribution. The installs are developers finding agentoracle-mcp through:
- MCP server directories — Glama auto-indexed us on publish
- x402 discovery layers — Decixa verified our endpoints and classified us under Analyze → Verification / Data Enrichment
- Framework ecosystems — `langchain-agentoracle` and `crewai-agentoracle` on PyPI, found by search
- Content entry points — the LangChain tutorial post gets indexed by Google and Dev.to's own recommendation engine
None of those are push channels. They're pull channels that compound. One tutorial gets found over and over. One MCP directory listing surfaces to every new developer exploring MCP.
## Why This Compounds
The v2.1.0 release of agentoracle-mcp (shipped this week) adds a resolve tool that calls Decixa's multi-provider discovery API. An agent asking "find me a verification endpoint for analyze + verify a factual claim" gets an answer that's not hardcoded to AgentOracle. It's the best-matching x402 endpoint across the ecosystem, ranked by latency, price, and tag match.
Today that returns AgentOracle first because we're the only pre-action truth oracle in that category on Decixa. As more providers list, the resolve() tool keeps working — it routes by intent, not by URL.
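Ranking by latency, price, and tag match can be as simple as a sort key. A sketch of the idea; the endpoint schema below (`tags`, `latency_ms`, `price_usd`) is an assumption for illustration, not Decixa's actual response format:

```python
def rank_endpoints(endpoints: list, wanted_tags: list) -> list:
    """Order x402-style endpoints by tag overlap, then latency, then price."""
    def score(ep: dict) -> tuple:
        overlap = len(set(ep["tags"]) & set(wanted_tags))
        # More tag matches sort first; among ties, lower latency and
        # lower price win. Negating overlap makes one ascending sort work.
        return (-overlap, ep["latency_ms"], ep["price_usd"])
    return sorted(endpoints, key=score)
```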
The bet is this: in an agent economy with x402 payments, the distribution channel isn't paid ads or SEO. It's shared discovery infrastructure that every agent uses to find services. Ship the service, instrument the discovery properly, and installs compound without campaigns.
We'll see how the install curve develops over the next few weeks. The bet is that showing up in the right directories, and letting the directories do the rest, produces a baseline that doesn't require acquisition spend.
## Try It

Playground (no wallet, no signup): agentoracle.co

MCP server, plug into Claude Desktop, Cursor, or Windsurf:

```bash
npx agentoracle-mcp
```

Python SDKs:

```bash
pip install langchain-agentoracle
pip install crewai-agentoracle
```

JavaScript:

```bash
npm install agentoracle-verify
```
Benchmark + reproducibility: github.com/TKCollective/agentoracle-fever-benchmark
The benchmark is 200 claims. The architecture is 4 sources with adversarial challenge. The distribution is shared discovery infrastructure that compounds without campaigns. Three simple facts, none of which requires a marketing team. That's the model we're betting on.