Wauldo
Your RAG pipeline doesn't tell you when it's wrong. Here's how to fix that.

Here's something that bugged me for a while: every RAG framework tells you what the LLM said. None of them tell you if it was true.

You get confidence: 0.92 from the retriever. Cool. That means the retrieval was good. It says nothing about whether the LLM hallucinated on top of perfectly retrieved documents.

The LLM can retrieve the right chunk, read "14 days", and confidently write "60 days". Retrieval confidence: high. Answer accuracy: zero.

What if every answer came with a trust score?

Not retrieval confidence. Not perplexity. A score that compares the actual claims in the answer against the actual text in the sources.

from wauldo import HttpClient

client = HttpClient(base_url="https://api.wauldo.com", api_key="YOUR_KEY")

result = client.guard(
    text="The free trial lasts 60 days.",
    source_context="Free trial period: 14 days. No extensions.",
)

print(result.verdict)       # "rejected"
print(result.confidence)    # 0.0
print(result.is_blocked)    # True
print(result.claims[0].reason)  # "numerical_mismatch"

The trust score is a number between 0 and 1. It's not a probability — it's a factual verification score based on claim-by-claim comparison.
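Wauldo's internals aren't published, but claim-by-claim comparison can be pictured as: split the answer into sentence-level claims, score each one against the source text, and surface the weak ones. A toy sketch — the split rule, scorer, and 0.8 threshold are all my assumptions, not the actual algorithm:

```python
import re

def check_claims(answer: str, source: str) -> list[dict]:
    """Split the answer into sentence-level claims and score each against the source."""
    source_tokens = set(re.findall(r"\w+", source.lower()))
    results = []
    for claim in re.split(r"(?<=[.!?])\s+", answer):
        if not claim.strip():
            continue
        tokens = set(re.findall(r"\w+", claim.lower()))
        # Placeholder scorer: fraction of the claim's tokens found in the source.
        score = len(tokens & source_tokens) / len(tokens) if tokens else 0.0
        results.append({"claim": claim.strip(), "score": score,
                        "supported": score >= 0.8})  # threshold is a guess
    return results
```

The point of the per-claim structure: one answer can contain three verified claims and one fabricated one, and an aggregate score alone would hide which is which.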

What it catches

Numerical mismatches — "60 days" vs "14 days" in the source:

r = client.guard("Price is $99/month", "Pricing: $49/month for Pro plan")
# verdict: "rejected", reason: "numerical_mismatch"
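The numerical-mismatch case is easy to approximate yourself: pull the numbers out of the claim and check whether each one actually appears in the source. A rough sketch of the idea (not Wauldo's implementation):

```python
import re

def number_mismatch(claim: str, source: str) -> bool:
    # Extract numeric tokens (ints and decimals) from both texts.
    nums = lambda text: set(re.findall(r"\d+(?:\.\d+)?", text))
    # A mismatch: the claim asserts a number the source never states.
    return bool(nums(claim) - nums(source))
```

This is deliberately strict in one direction: extra numbers in the source are fine, but a number in the answer with no source counterpart is a red flag.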

Correct claims — when the answer matches:

r = client.guard("Paris is the capital of France", "Paris is the capital of France.")
# verdict: "verified", confidence: 1.0

Partial evidence — when the source doesn't fully support the claim:

r = client.guard(
    "The API supports JSON and XML formats",
    "All requests must use JSON format."
)
# verdict: "weak", action: "review"
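The three verdicts suggest a simple thresholding policy over the trust score. The cutoffs below are illustrative guesses, not documented Wauldo values:

```python
def verdict_for(score: float) -> tuple[str, str]:
    # Map a 0-1 trust score to a (verdict, recommended action) pair.
    if score >= 0.9:
        return ("verified", "pass")
    if score >= 0.5:
        return ("weak", "review")   # partial evidence: route to a human
    return ("rejected", "block")
```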

Plugging it into your existing code

Whatever you're using — LangChain, LlamaIndex, Haystack, raw OpenAI — the pattern is the same:

# Step 1: generate answer (your existing code)
answer = your_pipeline.run(question)

# Step 2: verify (3 lines)
check = client.guard(text=answer, source_context=retrieved_docs)
if check.is_blocked:
    answer = "I couldn't verify this answer against the sources."

That's it. No framework migration. No retraining. No prompt engineering.

Three modes, pick your tradeoff

| Mode | Speed | What it does |
| --- | --- | --- |
| lexical | <1ms | Token overlap matching |
| hybrid | ~50ms | Token + semantic embeddings |
| semantic | ~500ms | Full embedding comparison |

Default is lexical. For most production use cases, <1ms verification on every response is the right tradeoff.
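The <1ms figure for lexical mode is plausible because token overlap needs no model calls — it's pure set arithmetic. A minimal Jaccard-style overlap check (my sketch, not the shipped scorer):

```python
import re

def lexical_overlap(answer: str, source: str) -> float:
    # Jaccard similarity over lowercase word tokens: no I/O, no model, just set math.
    a = set(re.findall(r"\w+", answer.lower()))
    b = set(re.findall(r"\w+", source.lower()))
    return len(a & b) / len(a | b) if a | b else 0.0
```

The obvious tradeoff: token overlap misses paraphrases ("two weeks" vs "14 days"), which is what the slower embedding-based modes exist to catch.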

Try it right now

No signup needed — paste any text + source in the interactive tool and see the trust score live.

With code — install and test locally with the mock (no API key needed):

from wauldo import MockHttpClient

mock = MockHttpClient()

# Contradiction → rejected
print(mock.guard("60 days", "14 days").verdict)  # "rejected"

# Match → verified
print(mock.guard("14 days", "14 days").verdict)  # "verified"

SDKs: pip install wauldo · npm install wauldo · cargo add wauldo · API docs (Postman)

Free tier: 300 requests/month — get a key


I'm building this because I got tired of shipping RAG pipelines that work on demos and break on real data. If you've solved this differently, I'd genuinely like to hear how.
