DEV Community: AgentOracle

How to add claim verification to your AI content approval workflow

AgentOracle — Mon, 11 May 2026 22:31:01 +0000

The 90-second version

One wrong claim slips through. It reaches the client's customers. Maybe it's a stat your AI invented. Maybe it's a competitor comparison that's almost right, but not quite. Maybe it's a regulatory line your team has said a hundred times the right way, and this once it came out the wrong way.

Now you're explaining it to your client's legal team. Or worse, to a regulator. Or worse than that, to a journalist who screenshots and shares.

This is the quiet fear in every content operations team using AI in 2026. And the current solution — a human fact-checker reviewing every piece — doesn't scale past a handful of campaigns per week. The fact-checker becomes the bottleneck. Errors slip through anyway, because nobody can review 200 pieces in an afternoon.

There is a better answer, and it doesn't require rebuilding your stack. It's a verification step that sits between your AI draft and your final approval, and it returns a tamper-evident receipt your legal team can audit.

That's AgentOracle.

What it actually does in your workflow
Picture your existing approval flow:

AI draft → human review → legal sign-off → publish

AgentOracle adds one step:

AI draft → AgentOracle verification → human review → legal sign-off → publish

The verification step takes any factual claim from the draft and runs it against four independent sources in parallel. It returns three things:

A verdict — act, verify, reject, or abstain

A confidence score — a precise number between 0 and 1

A cryptographic receipt — a signed proof of what was checked, when, and against what sources

Your team uses the verdict to make the publish/hold decision. Anything below your confidence threshold goes to a human. Anything above publishes with the receipt attached to the campaign record.
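As a rough sketch of that publish/hold rule, the routing can be a few lines of Python. The `Verification` class, the `route` and `route_draft` names, and the 0.80 default threshold are assumptions for illustration, not part of the AgentOracle API; the verdict values are the four listed above.

```python
from dataclasses import dataclass


@dataclass
class Verification:
    verdict: str       # "act", "verify", "reject", or "abstain"
    confidence: float  # between 0 and 1
    receipt: str       # opaque signed receipt string


def route(v: Verification, threshold: float = 0.80) -> str:
    """Return 'publish' or 'human_review' for a single verified claim."""
    if v.verdict == "act" and v.confidence >= threshold:
        return "publish"
    return "human_review"


def route_draft(results, threshold: float = 0.80) -> dict:
    """A draft publishes only when every claim in it clears the gate."""
    receipts = [v.receipt for v in results]
    # An empty result list means nothing was verified, so hold the draft.
    if results and all(route(v, threshold) == "publish" for v in results):
        return {"decision": "publish", "receipts": receipts}
    return {"decision": "human_review", "receipts": receipts}


draft_results = [
    Verification("act", 0.93, "receipt-1"),
    Verification("verify", 0.61, "receipt-2"),
]
decision = route_draft(draft_results)
```

Receipts are kept in both branches so the campaign record shows what was checked even when a human makes the final call.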

That receipt is the part that changes everything for your legal team.

What a "cryptographic receipt" means in plain English
A receipt is a small block of text that looks like random characters. It is signed with a private key that only AgentOracle holds, and anyone can use the matching public key to verify the signature.

If anyone — your legal team, a client's auditor, a regulator, a journalist — wants to confirm that you actually verified a claim before publishing, they take the receipt, fetch our public key, and run a single verification. If the receipt matches, it's authentic. If a single character was altered, the verification fails closed. There is no ambiguity, no "trust the vendor" dependency, no missing log files.

You don't need to understand the cryptography to benefit from it. You just need to know that:

Receipts are tamper-evident

Receipts are third-party verifiable without trusting AgentOracle

Receipts are portable — they work even if AgentOracle disappears tomorrow

This is what your compliance officer has been quietly wishing for since AI content tooling went mainstream. It's the audit trail that holds up when someone asks "prove you checked this."
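For readers who do want a peek under the hood, here is a minimal sketch of what "looks like random characters" hides. It assumes the receipt uses the JWS compact form (three base64url segments joined by dots); the header fields, payload fields, and placeholder signature below are made up for illustration and are not taken from the AgentOracle receipt spec. The sketch only decodes the structure. Checking the signature itself needs a JOSE library and the public key.

```python
import base64
import json


def b64url_encode(raw: bytes) -> str:
    """Base64url without padding, as used by JWS compact serialization."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def b64url_decode(segment: str) -> bytes:
    """Restore the stripped padding, then decode."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def split_receipt(token: str):
    """Split a compact JWS into header, payload, signing input, signature."""
    header_b64, payload_b64, signature_b64 = token.split(".")
    header = json.loads(b64url_decode(header_b64))
    payload = json.loads(b64url_decode(payload_b64))
    # The signature covers exactly these bytes: header and payload segments.
    signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
    return header, payload, signing_input, b64url_decode(signature_b64)


# Hypothetical receipt contents, with a placeholder instead of a real signature.
header = {"alg": "EdDSA", "kid": "example-key"}
payload = {"claim": "ChatGPT launched on November 30, 2022", "verdict": "act"}
token = ".".join([
    b64url_encode(json.dumps(header).encode()),
    b64url_encode(json.dumps(payload).encode()),
    b64url_encode(b"\x00" * 64),
])
decoded_header, decoded_payload, signing_input, signature = split_receipt(token)

# Tamper with one field: the signed bytes change, so the old signature
# can no longer match them.
tampered = dict(payload, verdict="reject")
parts = token.split(".")
tampered_token = ".".join([parts[0], b64url_encode(json.dumps(tampered).encode()), parts[2]])
_, _, tampered_input, _ = split_receipt(tampered_token)
```

Because the signature is computed over the exact header and payload bytes, any edit to either segment produces a different signing input, which is what makes the receipt tamper-evident.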

What it replaces

Instead of a human fact-check team costing $50K–200K per FTE per year:
Claim verification at $0.02–$0.10 per claim, returned in seconds.

Instead of screenshot evidence in a Google Doc:
A cryptographic receipt anyone can verify independently.

Instead of email chains saying "I checked it on [site]":
A signed JWS with structured source data attached to every claim.

Instead of manual EU AI Act Article 26 record-keeping:
Automatic, tamper-evident, replayable audit trail built in.

Instead of trusting the vendor's claims about their checking:
Verify the receipt yourself. No trust required.

You're not adding another tool to your stack. You're replacing two or three.

Why this is not fact-checking software
If you've evaluated tools like Originality.ai, Logically, or NewsGuard — AgentOracle is a different category. Those tools answer different questions:

Originality.ai scores whether content looks AI-generated and runs basic plagiarism checks. Useful for detection. Doesn't verify whether specific claims are true.

Logically runs human-powered misinformation review for governments and brands. Slow turnaround. No cryptographic proof.

NewsGuard rates the credibility of sources (this domain is reliable, that one isn't). Doesn't tell you anything about a specific claim inside a piece.

None of them return a tamper-evident receipt your legal team can hand to a regulator and say "here is the proof we verified this claim before we published it." That's the gap AgentOracle fills.

We're a different layer. You can run all of them together if you want.

What others have said
A contributor to Mastercard's Verifiable Intent RFC independently verified our receipt format end-to-end last month. Tested both the Node and Python verifiers. Tamper test failed closed. His exact quote: "Strong work. The calibration.provisional field is the right discipline."

This week, a Coinbase engineer publicly engaged on our x402 implementation on the canonical x402 issue thread (issue #2207 on x402-foundation/x402, May 7, 2026), diagnosed it, and tagged us directly.

These are the kinds of independent technical signals that don't typically come from vendor marketing departments. They come from people stress-testing the implementation against the spec.

This week, AgentOracle was indexed in Coinbase Bazaar discovery. You can verify this yourself with one curl:

curl 'https://api.cdp.coinbase.com/platform/v2/x402/discovery/merchant?payTo=0xdF90200B0031051BbF7a66BB9387d2Ecf599e109'

That returns our resource manifest, schema, example output, and 30-day usage stats — served by Coinbase, not us. If you'd rather see raw on-chain proof, our most recent settlement on Base mainnet: 0x01e37297…2b79cd5.

The Pilot Offer

If you have 3 to 5 representative claims your team is about to publish — send them to joe@agentoracle.co. We'll come back same-day with real receipts run against YOUR content, plus a pilot scope sized to your volume. No commitment, no procurement step, no scheduling a call. You see the product working on your stuff before anything else happens.

If those receipts justify continuing, here's the pilot:

Thirty days. $2,500. We do the integration work.

Specifically:

Up to 50,000 claim verifications during the pilot

Custom dashboard with audit log export

Async Slack or email support

One integration call to plug AgentOracle into your existing approval workflow (we do the technical work; your team does not need an engineer)

A 30-day evaluation report from us at the end summarizing what we caught, what we missed, and what your team should do next

Money-back if you tell us by day 7 the receipts aren't usable. Keeps both sides honest. Almost never gets requested, but takes the procurement risk to zero.

If after 30 days your team thinks the receipts justify continuing, we move to a monthly tier sized to your volume. If not, you keep every receipt you generated and you owe nothing more. We do not pull data we don't need. Your content stays your content.

No annual contracts. No procurement gymnastics. No per-seat counting. Just a signed audit trail your legal team has been asking for.

Receipt spec public at github.com/TKCollective/agentoracle-receipt-spec. Public JWKS at agentoracle.co/.well-known/jwks.json. Independently reproducible AVeriTeC + FEVER benchmark shipping May 14, 2026.

Stop Your RAG Pipeline From Hallucinating: A 15-Line Fix

AgentOracle — Fri, 01 May 2026 21:57:03 +0000

Your RAG pipeline retrieves real documents — and still hallucinates. Here's the retrieve → generate → verify pattern that catches it before your agent acts, with working Python code you can run right now.

Your RAG pipeline retrieves three real documents. The LLM reads them. It generates a response that cites those exact sources. Everything looks clean.

And it's still wrong about 8–15% of the time.

If you've deployed RAG to production, you already know this. The answer looks grounded in the retrieved chunks, but a closer read reveals the model invented a date, swapped a name, overstated a number, or fused two unrelated facts into a single plausible-sounding sentence. The citations point to real documents. The statement the citations supposedly support was not actually in those documents.

This is the hardest class of hallucination to catch. It doesn't look like a hallucination. It looks like a correct answer.

This tutorial shows you how to add a verification step to your RAG pipeline in about 15 lines of Python. The verifier runs independently of your retrieval stack and your generation model. It reads the final output, extracts individual claims, checks each one across four independent sources, and returns a verdict before your agent acts.

Why RAG Hallucinations Are Different

Classic LLM hallucination: the model is asked a question it doesn't know the answer to, so it invents one.

RAG hallucination: the model has correct context in its window, and still produces a statement that isn't supported by that context. The three failure modes I see most in production:

Fabrication under citation. The response cites source [2], but the claim it attributes to source [2] isn't actually there. The citation exists; the grounding doesn't.
Fact fusion. Two unrelated facts from two different retrieved chunks get combined into a single sentence. Each half is correct. The combined sentence is false.
Confident extrapolation. The model extrapolates from what the documents say to a related claim the documents don't support, and delivers it with the same confidence as the verified parts.

All three survive retrieval-quality metrics. They survive BLEU, ROUGE, and BERTScore. They survive your "faithfulness" eval if it runs off the same LLM that generated the answer.

The only reliable catch is a second, independent verification pass — different model, different evidence source, different prompt — that reads the final output and scores each claim against the open web.

The Retrieve → Generate → Verify Pattern

Standard RAG is two stages:

query → retrieve → generate → return

Add one stage:

query → retrieve → generate → verify → return

The verify stage decomposes the generated response into individual atomic claims, checks each one, and returns a per-claim verdict plus an overall act / verify / reject recommendation. Your application decides what to do with a reject: surface the bad claims to a user, regenerate with tighter constraints, fall back to a safer response, or abort.

Install

For the simple programmatic case (the bulk of this tutorial), the only dependency is requests:

pip install requests

For full LangChain tool integration:

pip install langchain-agentoracle

No API keys. No configuration. The free /preview endpoint gives you 10 verifications per hour to test with; the production /evaluate endpoint is $0.01 per call via x402 on Base.

A Minimal RAG Pipeline That Hallucinates

First, let's build a RAG pipeline that's deliberately vulnerable. We'll use a tiny in-memory corpus about OpenAI so the hallucinations are easy to spot:

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage

# Three real documents — our "retrieved context"
corpus = {
    "doc_1": """OpenAI was founded in December 2015 as a non-profit
    research organization. Co-founders included Sam Altman, Elon Musk,
    Ilya Sutskever, and Greg Brockman, among others.""",
    "doc_2": """ChatGPT was released by OpenAI on November 30, 2022.
    It reached 100 million monthly active users by January 2023, making
    it the fastest-growing consumer application in history at the time.""",
    "doc_3": """OpenAI has received major investments from Microsoft,
    including a multi-year, multi-billion dollar commitment announced
    in January 2023."""
}

def retrieve(query):
    # Toy retriever — in production, use your vector DB
    return [corpus["doc_1"], corpus["doc_2"], corpus["doc_3"]]

def generate(query, docs):
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.3)
    context = "\n\n".join(docs)
    response = llm.invoke([
        SystemMessage(content=f"Answer from this context:\n{context}"),
        HumanMessage(content=query)
    ])
    return response.content

answer = generate(
    "Who founded OpenAI, when was ChatGPT released, and how fast did it grow?",
    retrieve("OpenAI founding and ChatGPT growth")
)
print(answer)

Run this a few times. On some runs you'll get a clean answer. On others you'll get a response that invents a co-founder not in the documents, or claims ChatGPT reached one billion users in two months, or attributes the wrong investment figure to Microsoft. Same retrieval, same prompt — different hallucination profile per run.

This is the exact scenario the verification layer is built for.

Add Verification in 15 Lines

import requests

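def verify(text):
    # POST the draft to the verifier and return the parsed evaluation
    r = requests.post(
        "https://agentoracle.co/evaluate",
        json={"content": text},
        timeout=30,
    )
    r.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error body
    return r.json()["evaluation"]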

def retrieve_generate_verify(query):
    docs = retrieve(query)
    draft = generate(query, docs)

    verdict = verify(draft)

    refuted = [c["claim"] for c in verdict["claims"] if c["verdict"] == "refuted"]
    unverifiable = [c["claim"] for c in verdict["claims"] if c["verdict"] == "unverifiable"]

    if refuted:
        return {"answer": None, "reason": "refuted", "claims": refuted}
    if verdict["recommendation"] == "reject" and unverifiable:
        return {"answer": None, "reason": "unverifiable", "claims": unverifiable}
    return {"answer": draft, "confidence": verdict["overall_confidence"]}

result = retrieve_generate_verify(
    "Who founded OpenAI and how fast did ChatGPT grow?"
)
print(result)

That's the whole integration. Before your agent acts on draft, verify(draft) extracts the atomic claims, checks each across four independent verification sources, and returns a structured verdict.

LangChain users: if you want the verifier as a tool callable from an agent loop instead of a function call, use from langchain_agentoracle import AgentOracleEvaluateTool — it returns formatted text suitable for LLM consumption. The plain HTTP call above is what you want when you need the JSON for application logic (gating, branching, repair).

What a Real Verification Run Looks Like

Here's actual output from feeding AgentOracle a deliberately-hallucinated RAG response. The input text was:

"OpenAI was founded in 2015 by Sam Altman, Elon Musk, and Mark Zuckerberg. The company released ChatGPT in 2022, which reached 1 billion users within 2 months."

Four of those facts are true. Two are hallucinated: Mark Zuckerberg was never an OpenAI co-founder, and ChatGPT reached 100 million users in two months, not one billion.

The verifier response (trimmed for readability):

{
  "recommendation": "reject",
  "overall_confidence": 0.47,
  "total_claims": 6,
  "verified_claims": 4,
  "refuted_claims": 2,
  "claims": [
    {
      "claim": "OpenAI was founded in 2015",
      "verdict": "supported",
      "confidence": 0.83
    },
    {
      "claim": "OpenAI was founded by Sam Altman",
      "verdict": "supported",
      "confidence": 1.0
    },
    {
      "claim": "OpenAI was founded by Elon Musk",
      "verdict": "supported",
      "confidence": 1.0
    },
    {
      "claim": "OpenAI was founded by Mark Zuckerberg",
      "verdict": "refuted",
      "confidence": 0.75,
      "evidence": "No search results mention Mark Zuckerberg as a founder; founders listed include Sam Altman, Elon Musk, Ilya Sutskever, Greg Brockman."
    },
    {
      "claim": "OpenAI released ChatGPT in 2022",
      "verdict": "supported",
      "confidence": 0.95
    },
    {
      "claim": "ChatGPT reached 1 billion users within 2 months",
      "verdict": "refuted",
      "confidence": 0.48,
      "evidence": "ChatGPT reached 100 million users in 2 months (Jan 2023), not 1 billion. 1 billion milestone was later."
    }
  ]
}

Two hallucinations caught. Four true claims confirmed. One reject recommendation that short-circuits the downstream agent action.
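To make the response shape concrete, here is a small sketch that tallies the per-claim verdicts from the trimmed response above and pulls out the claims to block. The `summarize` helper is a hypothetical name, and the dictionary is just the sample output re-typed with the evidence strings left out.

```python
from collections import Counter

# The trimmed sample response from above, without the evidence strings.
evaluation = {
    "recommendation": "reject",
    "overall_confidence": 0.47,
    "total_claims": 6,
    "verified_claims": 4,
    "refuted_claims": 2,
    "claims": [
        {"claim": "OpenAI was founded in 2015", "verdict": "supported", "confidence": 0.83},
        {"claim": "OpenAI was founded by Sam Altman", "verdict": "supported", "confidence": 1.0},
        {"claim": "OpenAI was founded by Elon Musk", "verdict": "supported", "confidence": 1.0},
        {"claim": "OpenAI was founded by Mark Zuckerberg", "verdict": "refuted", "confidence": 0.75},
        {"claim": "OpenAI released ChatGPT in 2022", "verdict": "supported", "confidence": 0.95},
        {"claim": "ChatGPT reached 1 billion users within 2 months", "verdict": "refuted", "confidence": 0.48},
    ],
}


def summarize(evaluation: dict) -> dict:
    """Tally per-claim verdicts and list the claims that must be blocked."""
    counts = Counter(c["verdict"] for c in evaluation["claims"])
    blocked = [c["claim"] for c in evaluation["claims"] if c["verdict"] == "refuted"]
    return {
        "counts": dict(counts),
        "blocked": blocked,
        # Only act when the top-level recommendation says so and nothing is refuted.
        "safe_to_act": evaluation["recommendation"] == "act" and not blocked,
    }


summary = summarize(evaluation)
```

The tallies line up with the top-level `verified_claims` and `refuted_claims` counters, which is a cheap sanity check to run before trusting a response.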

Notice what the verifier does not do: it doesn't grade the answer against the retrieved documents. RAG-specific evals that do that miss fabrication-under-citation and fact-fusion every time. Instead, the verifier treats the generated claim as a free-standing statement and checks it against the open web through four independent sources. The retrieved documents are only as good as the next step of your pipeline, and the next step is the LLM — which already had them and still hallucinated.

When To Use Each Recommendation

The verifier returns one of three top-level recommendations, plus one of three per-claim verdicts for every extracted claim.

Top-level recommendation:

| Recommendation | Rough confidence band | What your agent should do |
|---|---|---|
| `act` | ≥ 0.80 | Proceed. Claims are well-supported across sources. |
| `verify` | 0.50 – 0.80 | Soft-pass. Log the claims that dragged confidence down. Consider human-in-the-loop for high-stakes actions. |
| `reject` | < 0.50, OR any refuted claim | Do not act on the response as-is. |

Per-claim verdicts:

| Verdict | Meaning | Recommended action |
|---|---|---|
| `supported` | Multiple sources confirm the claim. | Trust. |
| `refuted` | Evidence directly contradicts the claim. | Always block — this is a hallucination. |
| `unverifiable` | Couldn't find supporting or contradicting evidence. | Treat as soft-flag, not hard fail. Often means the claim is too specific, too recent, or too obscure for the open web. Not the same as "false." |

A common production mistake is treating unverifiable the same as refuted. Don't. A draft can get a reject recommendation purely on low overall confidence from several unverifiable claims even when nothing is actually wrong. Check for refuted claims separately before deciding what to do, either through verdict["refuted_claims"] or by filtering verdict["claims"] for the refuted verdict, as the code above does.
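The bands in the recommendation table can be written down as a tiny function, which is handy if you want to apply your own thresholds client-side. This is a sketch of the rule as the table states it, not the service's actual scoring code; `recommend` and `reject_reason` are hypothetical helper names.

```python
def recommend(overall_confidence: float, refuted_count: int,
              act_threshold: float = 0.80, reject_threshold: float = 0.50) -> str:
    """Map confidence and refuted count to act / verify / reject."""
    # Any refuted claim forces a reject, regardless of overall confidence.
    if refuted_count > 0 or overall_confidence < reject_threshold:
        return "reject"
    if overall_confidence >= act_threshold:
        return "act"
    return "verify"


def reject_reason(evaluation: dict):
    """Tell a hard reject (refuted claim) apart from a soft one (low confidence)."""
    if evaluation["recommendation"] != "reject":
        return None
    verdicts = {c["verdict"] for c in evaluation["claims"]}
    return "refuted" if "refuted" in verdicts else "low_confidence"


# A draft rejected only because several claims could not be verified.
soft_reject = {
    "recommendation": "reject",
    "claims": [
        {"claim": "A", "verdict": "unverifiable"},
        {"claim": "B", "verdict": "unverifiable"},
    ],
}
reason = reject_reason(soft_reject)
```

Separating the two reject reasons lets you send low-confidence drafts to a human while blocking refuted ones outright.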

Handling The Three RAG Failure Modes

The three failure modes from the start of this post — fabrication-under-citation, fact-fusion, confident-extrapolation — all get caught by the same pattern. Here's why:

Fabrication under citation. The verifier decomposes the response into atomic claims and checks each one against the open web. The cited source is irrelevant to the verifier; what matters is whether the claim itself is supported. If the response says "source [2] reports 47% revenue growth" and source [2] actually reports 4.7%, the 47% claim gets refuted independently of the citation.

Fact fusion. Each atomic claim gets verified independently. If the response fuses "Apple's Q4 revenue was $120B" (true) with "announced on March 3" (true for a different product) into "Apple's $120B Q4 revenue was announced on March 3" (false), the fused claim gets checked as-is and refuted.

Confident extrapolation. The verifier doesn't care how confident the generation model sounded. It cares what the open web says. An extrapolation that looks authoritative in context but is unsupported by any independent source returns unverifiable or refuted.

Upgrading: Per-Claim Regeneration

Once you have verdict["claims"], you can do more than reject the whole response. You can surgically regenerate only the failed claims:

def verify_and_repair(query):
    docs = retrieve(query)
    draft = generate(query, docs)
    verdict = verify(draft)

    refuted = [c["claim"] for c in verdict["claims"] if c["verdict"] == "refuted"]
    if not refuted:
        return draft

    # Re-generate with explicit "do not include" list
    repair_prompt = (
        f"Answer the following using ONLY the retrieved context. "
        f"Do not include these claims that were refuted: {refuted}\n\n"
        f"Original query: {query}"
    )
    repaired = generate(repair_prompt, docs)
    return repaired

This is the pattern I see most in production RAG pipelines. Soft reject → named failure list → targeted regeneration. You get the speed benefits of auto-generation with the safety of verification, and the user never sees the hallucinated version.
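One gap in the repair step is that the regenerated draft is itself unverified. A bounded loop that re-verifies each attempt closes it. The sketch below takes `generate` and `verify` as injected callables and uses stand-in stubs so it runs offline; the stubs, the `repair_loop` name, and the three-attempt cap are assumptions for illustration.

```python
def repair_loop(query, generate, verify, max_attempts=3):
    """Regenerate until no claim is refuted, or give up after max_attempts."""
    banned = []
    for attempt in range(1, max_attempts + 1):
        prompt = query
        if banned:
            prompt = f"{query}\n\nDo not include these refuted claims: {banned}"
        draft = generate(prompt)
        verdict = verify(draft)
        refuted = [c["claim"] for c in verdict["claims"] if c["verdict"] == "refuted"]
        if not refuted:
            return {"answer": draft, "attempts": attempt}
        # Remember every refuted claim so later prompts exclude all of them.
        banned.extend(c for c in refuted if c not in banned)
    return {"answer": None, "attempts": max_attempts, "refuted": banned}


# Stand-in stubs: the first draft hallucinates, the repaired one does not.
def fake_generate(prompt):
    if "Do not include" in prompt:
        return "ChatGPT reached 100 million users in two months."
    return "ChatGPT reached 1 billion users in two months."


def fake_verify(text):
    verdict = "refuted" if "1 billion" in text else "supported"
    return {"claims": [{"claim": text, "verdict": verdict}]}


result = repair_loop("How fast did ChatGPT grow?", fake_generate, fake_verify)
```

The attempt cap matters: each retry costs another generation plus another verification call, and a model that keeps producing the same refuted claim should fail closed rather than loop.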

Production Notes

A few things I've learned from running this in real pipelines:

Latency. /evaluate typically returns in 3–6 seconds for a short paragraph with 3–6 claims. If your RAG pipeline runs hot and that's too slow, add verification only to high-stakes agent actions (writes, transactions, external messages) — not to every chat turn.
Cost. The free tier (10/hour) is fine for development. For production, /evaluate is pay-per-query over x402 on Base at $0.01 per call. An agent making 100 verifications/hour costs ~$1/hour. Typically cheaper than the LLM call that generated the response you're verifying.
Thresholds. Default is 0.80 for act. Bump to 0.90 for regulated workflows (medical, legal, financial) where a 10% false-positive on true claims is cheaper than a 1% false-negative on hallucinations.
Failure modes. Sometimes /evaluate returns unverifiable instead of supported / refuted. That usually means the claim is too specific, too recent, or too obscure for the open web. Treat unverifiable the same as verify — soft-flag, don't hard-fail. The code in this tutorial separates refuted from unverifiable on purpose.
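The latency and cost notes boil down to two small decisions: which actions get verified, and what that costs per hour. A sketch, assuming hypothetical action labels and the $0.01 per-call price quoted above:

```python
# Action types treated as high-stakes; the labels are illustrative only.
HIGH_STAKES = {"write", "transaction", "external_message"}


def should_verify(action_type: str) -> bool:
    """Verify only actions with side effects, not every chat turn."""
    return action_type in HIGH_STAKES


def hourly_cost(verifications_per_hour: int, price_per_call: float = 0.01) -> float:
    """Estimated spend in USD per hour at a flat per-call price."""
    return round(verifications_per_hour * price_per_call, 2)


cost = hourly_cost(100)
```

With 100 verifications an hour this works out to one dollar an hour, which matches the estimate in the cost note.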

The Full Minimal Example

For easy copy-paste, here's the complete working example in one block:

import requests
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage

corpus = [
    "OpenAI was founded in December 2015 as a non-profit research organization.",
    "ChatGPT was released by OpenAI on November 30, 2022 and reached 100 million users by January 2023.",
]

def verify(text):
    return requests.post(
        "https://agentoracle.co/evaluate",
        json={"content": text},
        timeout=30,
    ).json()["evaluation"]

def rag_with_verification(query):
    # Retrieve
    docs = corpus

    # Generate
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.3)
    draft = llm.invoke([
        SystemMessage(content=f"Answer only from this context:\n{chr(10).join(docs)}"),
        HumanMessage(content=query),
    ]).content

    # Verify
    verdict = verify(draft)
    refuted = [c["claim"] for c in verdict["claims"] if c["verdict"] == "refuted"]
    if refuted:
        return f"REJECTED — hallucinated claims: {refuted}"
    return draft

print(rag_with_verification("When was OpenAI founded and how fast did ChatGPT grow?"))

Run it. Break it on purpose by loosening the temperature or narrowing the corpus. Watch what the verifier catches.

Getting Started

Playground (no setup): agentoracle.co

Packages:

pip install langchain-agentoracle — PyPI
pip install crewai-agentoracle — PyPI
npx agentoracle-mcp — npm (Claude Desktop, Cursor, Windsurf)

Source: GitHub

Verifiable receipts spec: github.com/TKCollective/agentoracle-receipt-spec — every /evaluate response commits to a JWS-signed receipt format you can verify offline against the public JWKS. See the /examples directory for verifying examples in Node and Python.

RAG was supposed to solve hallucinations. It solved some — then introduced a harder class. The fix is the same fix it's always been: a verification step that runs on the output, independent of whatever pipeline produced it.

Fifteen lines of Python. Free tier to try. The code above works as-is.

184 MCP installs and a 93.9% adversarial signal GPT-4o can't replicate

AgentOracle — Fri, 24 Apr 2026 14:29:29 +0000

184 MCP installs in 72 hours after publishing agentoracle-mcp. More importantly, 93.9% of the refutations AgentOracle got right were flagged by its adversarial layer, a signal GPT-4o alone can't replicate.

This post is about the verification layer under that number: the benchmark methodology, the architecture that produces the adversarial signal, and why claim verification deserves its own layer in the agent stack.

The Benchmark

We ran AgentOracle head-to-head against GPT-4o on 200 claims from the FEVER dataset — the peer-reviewed fact verification benchmark used in dozens of published papers.

Stratified sample: 67 SUPPORTS, 67 REFUTES, 66 NOT ENOUGH INFO. Random seed 42, fully reproducible. Every claim 10+ words. GPT-4o baseline via OpenRouter, single-word answer prompt, temperature 0.
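A seeded stratified draw like the one described can be sketched in a few lines. This is not the benchmark's actual script (that lives in the repository linked below); the synthetic rows and the `stratified_sample` helper are assumptions that only illustrate the 67/67/66 split, the 10-word minimum, and the fixed seed.

```python
import random
from collections import Counter

LABELS = ["SUPPORTS", "REFUTES", "NOT ENOUGH INFO"]
QUOTAS = {"SUPPORTS": 67, "REFUTES": 67, "NOT ENOUGH INFO": 66}

# Synthetic stand-in rows; each claim is 12 words long.
rows = [
    {"label": label,
     "claim": f"synthetic claim number {i} padded out to more than ten words total"}
    for label in LABELS
    for i in range(100)
]


def stratified_sample(rows, quotas, seed=42, min_words=10):
    """Draw a fixed number of claims per label, reproducibly."""
    rng = random.Random(seed)  # a local RNG keeps the draw independent of global state
    sample = []
    for label, quota in quotas.items():
        pool = [r for r in rows
                if r["label"] == label and len(r["claim"].split()) >= min_words]
        sample.extend(rng.sample(pool, quota))
    return sample


sample = stratified_sample(rows, QUOTAS)
label_counts = Counter(r["label"] for r in sample)
```

Fixing the seed means anyone rerunning the draw against the same input rows gets the same 200 claims in the same order.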

| System | Accuracy | Response time |
|---|---|---|
| AgentOracle (`/evaluate`) | 58.4% (115/197 valid) | multi-source (slower) |
| GPT-4o (closed-source frontier) | 57.5% (115/200) | single call (fast) |

A statistical tie on accuracy. Within measurement noise on a 200-claim sample. That's not the headline.
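The "within measurement noise" claim is easy to check from the counts in the table. The sketch below uses a normal-approximation confidence interval and a pooled two-proportion z statistic. Both systems saw largely the same claims, so treating them as independent samples is a simplification.

```python
import math


def proportion_ci(correct: int, total: int, z: float = 1.96):
    """Accuracy with a normal-approximation 95% confidence interval."""
    p = correct / total
    se = math.sqrt(p * (1 - p) / total)
    return p, p - z * se, p + z * se


def two_proportion_z(c1: int, n1: int, c2: int, n2: int) -> float:
    """Pooled z statistic for the difference between two accuracies."""
    p1, p2 = c1 / n1, c2 / n2
    pooled = (c1 + c2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


oracle_acc, oracle_lo, oracle_hi = proportion_ci(115, 197)
gpt_acc, gpt_lo, gpt_hi = proportion_ci(115, 200)
z_score = two_proportion_z(115, 197, 115, 200)
```

With roughly 200 claims each interval spans about seven points either side of the estimate, and the z statistic comes out around 0.18, far below the 1.96 needed for significance at the 5% level.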

GPT-5 shipped this week. Accuracy benchmarks will keep moving. The adversarial architecture doesn't.

Full methodology, raw results, and reproducibility scripts: github.com/TKCollective/agentoracle-fever-benchmark

The Real Finding: 94% Adversarial Contribution

AgentOracle runs 4 verification sources in parallel: Sonar, Sonar Pro, Adversarial challenge, and Gemma 4. The adversarial source is the differentiator — it's deliberately prompted to argue against each claim, surfacing counter-evidence instead of affirming.

Of the REFUTES claims that AgentOracle correctly identified, 93.9% were flagged by the adversarial layer specifically.

That's the unique signal. GPT-4o alone can't replicate it. Single-model verification confirms what's there; adversarial challenge surfaces what's missing. In agent pipelines where the cost of acting on a hallucination is a wrong action, not a wrong answer, that asymmetry matters more than accuracy parity.

The Architecture

claim in
    ↓
decompose (Gemma) → atomic claims
    ↓
parallel fan-out:
    ├─ Sonar: "is this true?"
    ├─ Sonar Pro: "is this true with extended reasoning?"
    ├─ Adversarial: "argue why this is false"
    └─ Gemma 4: "verify + calibrate"
    ↓
consensus + confidence calibration
    ↓
per-claim verdict: SUPPORTED / REFUTED / UNVERIFIABLE
+ evidence string
+ confidence 0.00–1.00
+ correction (if refuted)

The adversarial source is not a safety filter. It's a research task: find the best counter-argument, evidence included. Even when a claim is ultimately supported, the adversarial output becomes input to the calibration step — which is why AgentOracle's confidence scores are meaningful, not noise.
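The fan-out and consensus steps in the diagram can be sketched with a thread pool. Everything below is illustrative: the four source functions are stubs that return fixed answers, and the majority-vote rule with agreement-weighted confidence is an invented stand-in, since the post does not describe the real calibration step.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def fan_out(claim: str, sources: dict) -> dict:
    """Query every source in parallel and collect (verdict, confidence) pairs."""
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = {name: pool.submit(fn, claim) for name, fn in sources.items()}
        return {name: future.result() for name, future in futures.items()}


def consensus(votes: dict):
    """Majority verdict; confidence is scaled by how many sources agreed."""
    tally = Counter(verdict for verdict, _ in votes.values())
    top, top_count = tally.most_common(1)[0]
    if top_count <= len(votes) / 2:
        # No strict majority: do not pretend to know.
        return "UNVERIFIABLE", 0.0
    agreeing = mean(conf for verdict, conf in votes.values() if verdict == top)
    return top, round(agreeing * top_count / len(votes), 2)


# Stub sources standing in for the four real ones.
sources = {
    "sonar": lambda claim: ("REFUTED", 0.8),
    "sonar_pro": lambda claim: ("REFUTED", 0.9),
    "adversarial": lambda claim: ("REFUTED", 0.7),
    "gemma": lambda claim: ("SUPPORTED", 0.6),
}
votes = fan_out("ChatGPT reached 1 billion users within 2 months", sources)
verdict, confidence = consensus(votes)
```

Scaling confidence by the share of agreeing sources is one simple way to make a 3-of-4 verdict score lower than a unanimous one.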

Confidence calibration on the 200-claim benchmark:

Average confidence on correct predictions: 0.61
Average confidence on incorrect predictions: 0.55

That 6-point gap sounds small but is exactly what you want: the system is more certain when it's right, less certain when it's wrong. Agents branching on confidence thresholds get useful signal, not just theater.
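The gap itself is a one-line computation once each prediction is paired with whether it was correct. A sketch with made-up numbers (the benchmark's raw results are in the linked repository); `calibration_gap` is a hypothetical helper name.

```python
from statistics import mean


def calibration_gap(results) -> float:
    """Mean confidence on correct predictions minus mean on incorrect ones."""
    right = [conf for conf, correct in results if correct]
    wrong = [conf for conf, correct in results if not correct]
    return mean(right) - mean(wrong)


# Illustrative (confidence, was_correct) pairs, not benchmark data.
toy_results = [(0.7, True), (0.5, True), (0.6, False), (0.4, False)]
gap = calibration_gap(toy_results)
```

A positive gap means confidence carries some signal; a gap near zero or below would mean thresholds on confidence are not worth branching on.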

184 Installs Decomposed

The install curve so far:

Day 1 (launch): tutorial + MCP publish → organic npm discovery → 168 installs in 24h
Day 2: +16

Nobody is running a campaign. There's no paid distribution. The installs are developers finding agentoracle-mcp through:

MCP server directories — Glama auto-indexed us on publish
x402 discovery layers — Decixa verified our endpoints and classified us under Analyze → Verification / Data Enrichment
Framework ecosystems — langchain-agentoracle and crewai-agentoracle on PyPI, found by search
Content entry points — the LangChain tutorial post gets indexed by Google and Dev.to's own recommendation engine

None of those are push channels. They're pull channels that compound. One tutorial gets found over and over. One MCP directory listing surfaces to every new developer exploring MCP.

Why This Compounds

The v2.1.0 release of agentoracle-mcp (shipped this week) adds a resolve tool that calls Decixa's multi-provider discovery API. An agent asking "find me a verification endpoint for analyze + verify a factual claim" gets an answer that's not hardcoded to AgentOracle. It's the best-matching x402 endpoint across the ecosystem, ranked by latency, price, and tag match.

Today that returns AgentOracle first because we're the only pre-action truth oracle in that category on Decixa. As more providers list, the resolve() tool keeps working — it routes by intent, not by URL.
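Routing by intent rather than by URL amounts to ranking candidate endpoints on the criteria named above. The sketch below is a guess at the general shape, not Decixa's actual algorithm: the provider records are invented, and the priority order (tag match first, then price, then latency) is an assumption.

```python
def rank_endpoints(endpoints, wanted_tags):
    """Sort endpoints by tag overlap (most first), then price, then latency."""
    def sort_key(endpoint):
        tag_match = len(wanted_tags & set(endpoint["tags"]))
        return (-tag_match, endpoint["price_usd"], endpoint["latency_ms"])
    return sorted(endpoints, key=sort_key)


# Hypothetical registry entries.
endpoints = [
    {"name": "provider-a", "tags": ["analyze", "verify"], "price_usd": 0.01, "latency_ms": 4000},
    {"name": "provider-b", "tags": ["analyze"], "price_usd": 0.005, "latency_ms": 900},
    {"name": "provider-c", "tags": ["analyze", "verify"], "price_usd": 0.02, "latency_ms": 1500},
]
ranked = rank_endpoints(endpoints, {"analyze", "verify"})
ranked_names = [endpoint["name"] for endpoint in ranked]
```

Here the cheapest provider ranks last because it matches only one of the two requested tags, which is the behavior you want from intent-based routing.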

The bet is this: in an agent economy with x402 payments, the distribution channel isn't paid ads or SEO. It's shared discovery infrastructure that every agent uses to find services. Ship the service, instrument the discovery properly, and installs compound without campaigns.

We'll see how the install curve develops over the next few weeks. The bet is that showing up in the right directories, and letting the directories do the rest, produces a baseline that doesn't require acquisition spend.

Try It

Playground (no wallet, no signup): agentoracle.co

MCP server — plug into Claude Desktop, Cursor, Windsurf:

npx agentoracle-mcp

Python SDKs:

pip install langchain-agentoracle
pip install crewai-agentoracle

JavaScript:

npm install agentoracle-verify

Benchmark + reproducibility: github.com/TKCollective/agentoracle-fever-benchmark

The benchmark is 200 claims. The architecture is 4 sources with adversarial challenge. The distribution is shared discovery infrastructure that compounds without campaigns. Three simple facts, and none of them requires a marketing team. That's the model we're betting on.

3 Agent Integration Patterns for Claim Verification (LangChain + CrewAI + MCP)

AgentOracle — Thu, 23 Apr 2026 14:38:24 +0000

Your agent generates a claim. Then what?

In most agent pipelines: nothing. The claim flows straight into the next action — a tool call, a database write, a message sent. If the claim is wrong, the action is wrong, and the first person to notice is usually the user.

There are three patterns that fix this. Each one adds a verification step between generation and action — pre-action claim verification — and each one fits a different stage of agent maturity.

All three patterns below use AgentOracle (free to try, no wallet, no API keys). The code works as-is. Copy it, run it.

Pattern 1: Verify-Then-Act Gate (simplest)

Your agent has exactly one claim it's about to act on. You want a hard pass/fail before anything happens.

pip install langchain-agentoracle

from langchain_agentoracle import AgentOracleVerifyGateTool

gate = AgentOracleVerifyGateTool()

def verify_then_act(claim: str, action_fn):
    """Gate an action behind a single claim verification."""
    result = gate.run(claim)

    # Gate returns PASS/FAIL + confidence. Parse from formatted output.
    if "Recommendation: ACT" in result:
        return action_fn()

    print(f"Action blocked — verification failed:\n{result}")
    return None


# Example: agent thinks a contract exists and wants to call it
claim = "Contract 0xabc...123 is a valid USDC contract on Base mainnet"
verify_then_act(claim, lambda: call_contract(...))

When to use this:

Single atomic claim ("X is true, therefore do Y")
Binary decisions (proceed or halt)
Free — /verify-gate has no cost

When it's not enough:

Your agent generates paragraph-length output with multiple claims
You need evidence for the verdict, not just pass/fail
You want per-claim granularity (accept 3 of 4 claims, flag the 4th)

Pattern 2: Decompose-and-Score (most versatile)

Your agent outputs a paragraph. Some claims are factual, some might be hallucinated, and you don't want to throw out the whole output if only one sentence is wrong.

The /evaluate endpoint decomposes text into atomic claims, scores each independently, and returns per-claim verdicts. You can then keep the good claims, correct the bad ones, or flag them for human review.

import re

from langchain_agentoracle import AgentOracleEvaluateTool

evaluator = AgentOracleEvaluateTool()

def audit_agent_output(text: str):
    """Decompose text into claims, verify each, return structured audit."""
    raw = evaluator.run(text)

    # The tool returns a formatted string. Extract per-claim verdicts.
    claims = []
    for block in re.findall(r"\[(SUPPORTED|REFUTED|UNVERIFIABLE)\] \((\d\.\d+)\) (.+?)(?=\n\n|\Z)", raw, re.DOTALL):
        verdict, confidence, body = block
        lines = body.strip().split("\n")
        claim_text = lines[0].strip()
        evidence = ""
        for line in lines[1:]:
            if line.strip().startswith("Evidence:"):
                evidence = line.split("Evidence:", 1)[1].strip()
                break
        claims.append({
            "claim": claim_text,
            "verdict": verdict,
            "confidence": float(confidence),
            "evidence": evidence,
        })
    return claims


# Example: your agent produced this summary
agent_summary = """
OpenAI released GPT-4 in March 2023.
Bitcoin was created by Elon Musk in 2009.
Python was created by Guido van Rossum in 1991.
"""

audit = audit_agent_output(agent_summary)
for c in audit:
    print(f"{c['verdict']:14} ({c['confidence']:.2f}) {c['claim']}")
    if c['verdict'] == 'REFUTED':
        print(f"                        ↳ {c['evidence'][:120]}")

Sample output:

SUPPORTED      (1.00) OpenAI released GPT-4 in March 2023
REFUTED        (0.83) Bitcoin was created by Elon Musk in 2009
                        ↳ Bitcoin's creator is the pseudonymous Satoshi Nakamoto, not Elon Musk.
SUPPORTED      (1.00) Python was created by Guido van Rossum in 1991

What you can do with this:

# Keep supported, flag refuted, escalate everything else
safe_claims = [c for c in audit if c["verdict"] == "SUPPORTED" and c["confidence"] >= 0.8]
refuted     = [c for c in audit if c["verdict"] == "REFUTED"]
# Unverifiable claims and supported claims below 0.8 go to a human,
# so every claim lands in exactly one bucket.
need_human  = [c for c in audit if c not in safe_claims and c not in refuted]

if refuted:
    # Regenerate with the corrections inline, or just flag
    log_hallucination(refuted)
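
If you choose to regenerate rather than just log, one option is to feed the refuted claims and their evidence back to the writer model. The sketch below only builds the prompt text. `build_correction_prompt` is a hypothetical helper, not part of any AgentOracle package, and it assumes the claim dicts produced by `audit_agent_output` above.

```python
def build_correction_prompt(original_text: str, refuted: list[dict]) -> str:
    """Assemble a regeneration prompt listing each refuted claim and its evidence."""
    if not refuted:
        # Nothing to fix: hand the text back unchanged.
        return original_text
    lines = [
        "Rewrite the text below, fixing the claims listed as refuted.",
        "",
        "Text:",
        original_text.strip(),
        "",
        "Refuted claims:",
    ]
    for c in refuted:
        evidence = c.get("evidence") or "no evidence returned"
        lines.append(f"- {c['claim']} (evidence: {evidence})")
    return "\n".join(lines)
```

Pass the returned string to whatever model produced the original draft, then run the new draft through the same audit before accepting it.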

When to use this:

Multi-claim agent output (summaries, research, plans)
You need evidence, not just a verdict
You want to selectively keep/reject claims

Pattern 3: Multi-Agent Supervisor (most advanced)

Now you have a CrewAI crew with a researcher agent and a writer agent. The writer is about to publish. You want a supervisor agent that:

Discovers the right verification provider (via Decixa's multi-provider registry, falling back to local)
Calls that provider to audit the writer's draft
Only passes the draft through if verification clears a threshold

This is where AgentOracle's MCP server shines. It exposes both the resolve tool (discovery) and the verification tools from a single MCP server that any MCP-compatible agent can call.

# No install — runs via npx
npx agentoracle-mcp

Hook it into Claude Desktop or Cursor or any MCP-compatible runtime. Then in your agent framework:

from crewai import Agent, Task, Crew
from crewai_agentoracle import AgentOracleEvaluateTool

# Writer agent — generates content, might hallucinate
writer = Agent(
    role="Technical Writer",
    goal="Draft a factual summary of recent AI regulation news",
    backstory="Experienced technical writer. Optimizes for readability.",
)

# Supervisor agent — audits the writer's output using AgentOracle
supervisor = Agent(
    role="Fact-Check Supervisor",
    goal="Catch hallucinations before the writer's draft ships",
    backstory="Pedantic editor. Refuses to pass content with unverified claims.",
    tools=[AgentOracleEvaluateTool()],
)

draft = Task(
    description="Write 3 sentences about recent AI regulation news",
    agent=writer,
    expected_output="A 3-sentence factual summary",
)

review = Task(
    description=(
        "Evaluate the writer's draft using the AgentOracle tool. "
        "If any claim is REFUTED, return 'BLOCKED: <reason>'. "
        "If overall confidence is below 0.7, return 'NEEDS_HUMAN_REVIEW'. "
        "Otherwise return 'APPROVED' plus the cleaned draft."
    ),
    agent=supervisor,
    expected_output="APPROVED | BLOCKED | NEEDS_HUMAN_REVIEW + reasoning",
    context=[draft],
)

Crew(agents=[writer, supervisor], tasks=[draft, review]).kickoff()

Why this pattern matters:

Separation of concerns: one agent writes, one agent verifies
The supervisor can be a smaller, cheaper model — it just needs to call the tool and apply logic
Works in CrewAI, AutoGen, LangGraph — any framework that supports agent-as-tool-user
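
The review task above asks the supervisor model to apply three rules in natural language. If you would rather not trust an LLM with that final decision, the same rules fit in a few lines of plain Python. This is a sketch under assumptions: `review_decision` is a hypothetical helper, and it expects claim dicts with `claim` and `verdict` keys plus an overall confidence you have already parsed from the tool output.

```python
def review_decision(claims: list[dict], overall_confidence: float,
                    min_confidence: float = 0.7) -> str:
    """Apply the supervisor's rules deterministically.

    Any REFUTED claim blocks the draft, low overall confidence escalates
    to a human, and everything else is approved.
    """
    refuted = [c["claim"] for c in claims if c["verdict"] == "REFUTED"]
    if refuted:
        return "BLOCKED: refuted claim(s): " + "; ".join(refuted)
    if overall_confidence < min_confidence:
        return "NEEDS_HUMAN_REVIEW"
    return "APPROVED"
```

The supervisor agent still does the tool call and the rewrite. The function only makes the pass/hold decision reproducible, which helps when you need to explain later why a draft shipped.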

The discovery angle: If you want the supervisor to choose verification providers dynamically (not hardcode AgentOracle), use the resolve tool (v2.1.0 of agentoracle-mcp, via Decixa):

resolve(
  capability="analyze",
  intent="verify a factual claim before acting"
)

Returns the best-matching x402 verification endpoint across the ecosystem, ranked by latency, price, and tag match. AgentOracle is the only pre-action truth oracle currently classified under "Analyze → Verification" on Decixa, so it'll come back first today. As more providers list, your supervisor automatically picks the best one for each query.
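
To make "ranked by latency, price, and tag match" concrete, here is an illustrative local ranking with a fallback. It is not Decixa's actual algorithm. The `Provider` fields, the order of the sort keys, and the fallback behavior are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    """Hypothetical registry entry; field names are illustrative."""
    name: str
    latency_ms: float
    price_usd: float
    tags: frozenset


def rank_providers(candidates, wanted_tags, fallback):
    """Order providers by tag overlap, then price, then latency.

    Returns [fallback] when no candidate shares a tag with the request.
    """
    wanted = set(wanted_tags)
    matching = [p for p in candidates if wanted & p.tags]
    if not matching:
        return [fallback]
    return sorted(
        matching,
        key=lambda p: (-len(wanted & p.tags), p.price_usd, p.latency_ms),
    )
```

A supervisor would take the first entry of the returned list and call it, keeping a local default as `fallback` for the case where discovery returns nothing useful.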

Which Pattern To Pick

Your agent setup	Pattern
Single binary decision	1. Verify-then-act gate (free, `/verify-gate`)
Paragraph output, need per-claim scoring	2. Decompose-and-score (free during beta, `/evaluate`)
Multi-agent pipeline, supervisor pattern	3. Multi-agent supervisor (CrewAI + MCP)

All three work together. Start with Pattern 1 while you're prototyping. Graduate to Pattern 2 when your agent produces paragraph-length output with several claims. Move to Pattern 3 when you have a real pipeline with distinct agent roles.

Getting Started

Playground (no setup): agentoracle.co

Packages:

pip install langchain-agentoracle — PyPI
pip install crewai-agentoracle — PyPI
npx agentoracle-mcp — npm (Claude Desktop, Cursor, Windsurf)
npm install agentoracle-verify — npm

Source: GitHub

Benchmark: We ran AgentOracle head-to-head against GPT-4o on 200 peer-reviewed FEVER claims. Results + methodology.

Hallucinations aren't a bug to patch. They're a property of large language models that doesn't go away with bigger training runs or better prompts. The only reliable fix is to add a verification step your agent can't bypass.

These three patterns are what that step looks like in production code.

24 hours of organic discovery: what we learned from our first external users

AgentOracle — Wed, 22 Apr 2026 04:14:56 +0000

Yesterday we published a tutorial. No list. No paid promotion. No cold outreach.
By Tuesday morning, five visitors (three developers and two autonomous agents) had found AgentOracle and run real evaluations.
Who showed up:

Starlink — Albuquerque, NM
Comcast — Rockville, MD
Charter/Spectrum — Missoula, MT
Azure cloud agent — Des Moines, IA
Azure cloud agent — Chicago, IL

Same path every time: tutorial → playground → /evaluate
The Azure IPs are the most interesting signal. Those aren't humans clicking a tutorial — those are autonomous agents running on cloud infrastructure that found the playground and ran evaluations on their own. That's exactly the use case we built for.
MCP server shipped tonight. Your agent can find it the same way they did.

npx agentoracle-mcp
Or hit the playground directly: agentoracle.co

How to Add Claim Verification to Your LangChain Agent in 5 Minutes

AgentOracle — Mon, 20 Apr 2026 16:01:18 +0000

Your LangChain agent is wrong about 10% of the time. Not occasionally — consistently, confidently, and silently.

The problem isn't the model. It's that your agent has no way to know when it's wrong. It receives information, formats a response, and acts. No second opinion. No fact-check. No circuit breaker.

This tutorial shows you how to add a verification layer in 5 minutes that catches hallucinations before your agent acts on them.

The Problem

LLM hallucination rates in 2026 range from 3% to 20% depending on the task. On a summarization benchmark, GPT-4 looks great. On open-ended factual questions — the kind your agent asks constantly — it's a different story.

The deeper problem: reasoning models hallucinate more on factual tasks, not less. The more a model "thinks through" an answer, the more likely it is to fill gaps with plausible-sounding fiction.

In a simple chatbot, a hallucination is embarrassing. In an autonomous agent pipeline, it's a wrong action. A refunded order, a bad recommendation, a compliance violation, a message sent to the wrong person.

The standard fix is human review. But human review defeats the purpose of an autonomous agent.

The real fix is a verification layer that runs before your agent acts — independently of the model that generated the claim.

Install

pip install langchain-agentoracle

That's it. No API keys. No configuration. The free tier gives you 20 preview verifications per hour to test with.
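
If you run the free tier from a test suite or a loop, it is easy to burn through 20 calls without noticing. A small client-side budget can stop you before the server does. This is a sketch, not part of the package: `HourlyBudget` is a hypothetical class, and it assumes a sliding one-hour window, which may not match how the service actually counts.

```python
import time
from collections import deque


class HourlyBudget:
    """Client-side sliding-window counter (hypothetical helper)."""

    def __init__(self, max_calls=20, window_s=3600.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_s = window_s
        self.clock = clock          # injectable so tests can fake time
        self._stamps = deque()      # timestamps of calls inside the window

    def try_acquire(self) -> bool:
        """Record a call and return True, or return False if over budget."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_s:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_calls:
            return False
        self._stamps.append(now)
        return True
```

Call `try_acquire()` before each preview request and skip or queue the request when it returns False.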

Quick Start: Verify Before Your Agent Acts

The simplest integration — verify a piece of text and get per-claim verdicts:

from langchain_agentoracle import AgentOracleEvaluateTool

verifier = AgentOracleEvaluateTool()

# Your agent just generated this text — is it true?
agent_output = """
OpenAI released GPT-4 in March 2023.
Bitcoin was created by Elon Musk.
The Python programming language was created by Guido van Rossum.
"""

result = verifier.run(agent_output)
print(result)

Here's what comes back:

EVALUATION RESULT
Overall confidence: 0.61
Recommendation: ACT
Claims found: 3 | Supported: 2 | Refuted: 1 | Unverifiable: 0
Sources used: sonar, sonar-pro, adversarial, gemma-4

CLAIMS:
  ✓ [SUPPORTED] (1.00) OpenAI released GPT-4 in March 2023
    Evidence: Widely documented historical fact; GPT-4 was announced
    and released on March 14, 2023.

  ✗ [REFUTED] (0.83) Bitcoin was created by Elon Musk
    Evidence: Bitcoin's creator is the pseudonymous Satoshi Nakamoto.
    Correction: Bitcoin was created by Satoshi Nakamoto, not Elon Musk.

  ✓ [SUPPORTED] (1.00) Python was created by Guido van Rossum
    Evidence: Confirmed in official Python documentation and
    Van Rossum's own statements.

Three claims went in. Two came back supported with evidence. One came back refuted with a correction. Your agent now knows claim #2 is wrong before it acts on it.
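
Since the tool returns a formatted string, you will probably want the header numbers as data. The sketch below pulls them out with regular expressions. `parse_summary` is a hypothetical helper, and it assumes the header keeps the exact labels shown in the sample output above.

```python
import re


def parse_summary(result: str) -> dict:
    """Extract overall confidence, recommendation, and claim counts."""
    summary = {}
    conf = re.search(r"Overall confidence:\s*([0-9]*\.?[0-9]+)", result)
    if conf:
        summary["overall_confidence"] = float(conf.group(1))
    rec = re.search(r"Recommendation:\s*(\w+)", result)
    if rec:
        summary["recommendation"] = rec.group(1).upper()
    # Counts come from the "Claims found: ..." line; labels are case-sensitive,
    # so the "[SUPPORTED]" markers on the claim lines are not matched.
    for key in ("Supported", "Refuted", "Unverifiable"):
        m = re.search(rf"{key}:\s*(\d+)", result)
        if m:
            summary[key.lower()] = int(m.group(1))
    return summary
```

Note that in the sample above the recommendation is ACT even though one claim was refuted, so a cautious pipeline may want to check `summary["refuted"]` as well as the recommendation.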

Add It to Your Agent's Toolbelt

Want your agent to verify claims on its own? Add the tools directly:

from langchain_agentoracle import get_agentoracle_tools

# Returns all 6 AgentOracle tools ready for your agent
tools = get_agentoracle_tools()

# Or pick specific ones:
from langchain_agentoracle import (
    AgentOracleEvaluateTool,    # Per-claim verification ($0.01)
    AgentOracleVerifyGateTool,  # Quick pass/fail gate (free)
    AgentOraclePreviewTool,     # Research preview (free, 20/hr)
)

The tools follow LangChain's BaseTool interface, so they plug into any agent:

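# initialize_agent is LangChain's legacy agent constructor and is deprecated in
# newer releases; pin an older langchain version if this import fails.
from langchain.agents import initialize_agent, AgentType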
from langchain_openai import ChatOpenAI
from langchain_agentoracle import AgentOracleEvaluateTool, AgentOraclePreviewTool

llm = ChatOpenAI(model="gpt-4")

tools = [
    AgentOracleEvaluateTool(),
    AgentOraclePreviewTool(),
]

agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.OPENAI_FUNCTIONS,
    verbose=True,
)

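# The agent can now verify claims before acting
# (invoke replaces the deprecated run method on recent LangChain versions)
response = agent.invoke({"input": "Check if this is true: Tesla's market cap exceeded $2 trillion in 2024"})
print(response["output"])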

The Verify-Then-Act Pattern

The most useful pattern: gate your agent's actions on verification confidence.

from langchain_agentoracle import AgentOracleEvaluateTool
import re

verifier = AgentOracleEvaluateTool()

def verify_then_act(text, confidence_threshold=0.8):
    """Only act if verification confidence meets the threshold."""
    result = verifier.run(text)

    # The tool returns a formatted string; pull out the overall confidence.
    match = re.search(r"Overall confidence:\s*(\d+(?:\.\d+)?)", result)
    if match is None:
        # No confidence line found: treat the text as unverified and hold.
        return False

    confidence = float(match.group(1))
    if confidence >= confidence_threshold:
        print(f"✅ VERIFIED ({confidence}): safe to act")
        return True

    print(f"⚠️ LOW CONFIDENCE ({confidence}): hold for review")
    return False

# In your agent pipeline:
claim = "The Federal Reserve raised interest rates in March 2024"
if verify_then_act(claim):
    # proceed with the action
    pass
else:
    # flag for human review or use a fallback
    pass
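
One thing the pattern above leaves open is what happens when the verifier itself is unreachable or returns something unexpected. If the point is a step your agent can't bypass, the safe default is to fail closed. Here is a minimal sketch of that idea. `make_gate` is a hypothetical helper that wraps any callable returning the formatted result string, so you can pass `verifier.run` in production and a stub in tests.

```python
import re
from typing import Callable


def make_gate(run_verifier: Callable[[str], str],
              threshold: float = 0.8) -> Callable[[str], bool]:
    """Build a check that passes only on a parsed confidence >= threshold."""

    def check(text: str) -> bool:
        try:
            result = run_verifier(text)
        except Exception:
            return False  # verifier unreachable or crashed: fail closed
        match = re.search(r"Overall confidence:\s*(\d+(?:\.\d+)?)", result)
        if match is None:
            return False  # unexpected output format: fail closed
        return float(match.group(1)) >= threshold

    return check
```

Usage would look like `gate = make_gate(verifier.run)` followed by `if gate(claim): ...`. Every failure mode lands on the "hold" branch, never on the "act" branch.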

Free Quick Check: The Verify Gate

Don't need per-claim breakdowns? The verify gate gives you a fast pass/fail:

from langchain_agentoracle import AgentOracleVerifyGateTool

gate = AgentOracleVerifyGateTool()

# Quick binary check — free, no payment needed
result = gate.run("The speed of light is approximately 300,000 km per second")
print(result)
# VERIFY GATE: FAIL
# Confidence: 1.00
# Recommendation: ACT
# (Here "FAIL" means the gate found no issues, so the content is safe to act on; key off the Recommendation line.)
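
Because the status word is easy to misread, it can help to parse the gate output once and branch on the recommendation only. The sketch below assumes the three-line format shown above, without the leading comment markers. `GateResult` and `parse_gate_output` are hypothetical names, not part of the package.

```python
import re
from dataclasses import dataclass


@dataclass
class GateResult:
    status: str           # raw "VERIFY GATE:" value, kept for logging only
    confidence: float
    recommendation: str

    @property
    def should_act(self) -> bool:
        # Branch on the recommendation, not on the status word.
        return self.recommendation == "ACT"


def parse_gate_output(raw: str) -> GateResult:
    """Parse the formatted gate string; raise ValueError if a line is missing."""

    def field(label: str) -> str:
        m = re.search(rf"^{re.escape(label)}:\s*(.+)$", raw, re.MULTILINE)
        if m is None:
            raise ValueError(f"missing '{label}' line in gate output")
        return m.group(1).strip()

    return GateResult(
        status=field("VERIFY GATE"),
        confidence=float(field("Confidence")),
        recommendation=field("Recommendation").upper(),
    )
```

With this in place, `parse_gate_output(gate.run(claim)).should_act` gives you a plain boolean, and a malformed response raises instead of silently passing.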

Why AgentOracle

Most hallucination detection tools are built for humans — dashboards, observability platforms, monitoring UIs. They tell you what went wrong after the fact.

AgentOracle is built for agents. It sits in the pipeline, takes any text, runs it through 4 independent verification sources in parallel, and returns a machine-readable verdict before your agent acts.

No dashboards. No subscriptions. No API keys to configure. Your agent calls /evaluate, gets ACT / VERIFY / REJECT with a confidence score and evidence, and decides what to do next.

What's under the hood:

4 independent sources: Sonar, Sonar Pro, Adversarial challenge, and Gemma 4
Per-claim decomposition — complex text gets broken into individual verifiable claims
Confidence calibration across sources
Evidence and corrections for every verdict
1,900+ claim fingerprints in the database and growing daily
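
To picture what "independent sources in parallel" means mechanically, here is a toy fan-out. It is not AgentOracle's implementation. The checkers are stand-in callables you supply, and averaging their scores is an assumption made for the example, not the service's calibration method.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# A checker takes a claim and returns a support score in [0, 1].
Checker = Callable[[str], float]


def check_in_parallel(claim: str, checkers: dict[str, Checker]) -> dict:
    """Run every checker on the claim concurrently and average the scores.

    Expects at least one checker; an empty dict raises ValueError.
    """
    with ThreadPoolExecutor(max_workers=len(checkers)) as pool:
        futures = {name: pool.submit(fn, claim) for name, fn in checkers.items()}
        scores = {name: f.result() for name, f in futures.items()}
    mean = sum(scores.values()) / len(scores)
    return {"scores": scores, "confidence": round(mean, 2)}
```

The useful property is that total latency tracks the slowest checker rather than the sum of all of them, which is what makes a multi-source check practical inside an agent loop.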

Try It Now

Playground — no setup, no payment: agentoracle.co
Paste any text and see per-claim verdicts in under 15 seconds.

Packages:

pip install langchain-agentoracle — PyPI
pip install crewai-agentoracle — PyPI
npm install agentoracle-verify — npm

Source: GitHub

Hallucinations aren't going away. The models are getting better, but "better" still means wrong 3-10% of the time on the tasks your agents actually run.

A verification layer doesn't replace a good model. It catches the cases where even a good model is confidently wrong — which is exactly when you need it most.