Your RAG isn't broken. It's just lying quietly.
Retrieval works. The LLM sounds confident. Your users get an answer.
But somewhere in that response, a claim contradicts the source document it was supposed to be grounded in. No error thrown. No flag raised. Just a confident, wrong answer, delivered at scale.
This is the hallucination problem that doesn't get talked about enough. Not the obvious failures. The subtle ones.
We've seen it across enterprise RAG deployments: legal tools, internal knowledge bases, customer-facing assistants. The retrieval pipeline performs. The LLM performs. And still, trust erodes the moment a user catches one bad answer.
We're open-sourcing LongTracer, our answer to this problem.
LongTracer sits at the output layer of any RAG pipeline and verifies every claim in an LLM response against your source documents. It uses a hybrid STS + NLI approach: semantic textual similarity (STS) first finds the most relevant source sentence for each claim, then natural language inference (NLI) classifies whether that source actually supports, contradicts, or is neutral to what the LLM said.
The result: a trust score, a verdict, and a clear list of exactly which claims hallucinated and why.
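To make the STS + NLI pattern concrete, here is a minimal sketch built on off-the-shelf sentence-transformers models. This is not LongTracer's internal code: the model names, the label order, and the toy trust score are assumptions chosen purely for illustration.

```python
# Illustrative sketch of per-claim verification via STS retrieval + NLI classification.
# NOT LongTracer's implementation -- models and scoring here are assumptions.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

sts_model = SentenceTransformer("all-MiniLM-L6-v2")            # assumed bi-encoder for STS
nli_model = CrossEncoder("cross-encoder/nli-deberta-v3-base")  # assumed NLI cross-encoder
NLI_LABELS = ["contradiction", "entailment", "neutral"]        # check the model card for label order

def verify_claims(claims, source_sentences):
    """For each claim: find the closest source sentence (STS),
    then classify it as supported, contradicted, or neutral (NLI)."""
    claim_emb = sts_model.encode(claims, convert_to_tensor=True)
    source_emb = sts_model.encode(source_sentences, convert_to_tensor=True)

    results = []
    for i, claim in enumerate(claims):
        # Step 1 (STS): most semantically similar source sentence for this claim
        sims = util.cos_sim(claim_emb[i], source_emb)[0]
        evidence = source_sentences[int(sims.argmax())]

        # Step 2 (NLI): does that source sentence entail the claim?
        scores = nli_model.predict([(evidence, claim)])[0]
        verdict = NLI_LABELS[int(scores.argmax())]

        results.append({"claim": claim, "evidence": evidence, "verdict": verdict})

    # Toy trust score: fraction of claims the source actually entails
    trust = sum(r["verdict"] == "entailment" for r in results) / max(len(results), 1)
    return trust, results
```

A real system also needs claim extraction (splitting the LLM response into atomic claims) and sentence-splitting of the retrieved chunks; this sketch assumes both have already been done.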
No LLM calls. No vector store required. No new infrastructure. It works with LangChain, LlamaIndex, Haystack, LangGraph, or any pipeline that gives you a response and source chunks.
MIT licensed. Built from real implementation experience.
If you're running RAG in production, your users deserve answers you can actually stand behind.
Try:
pip install longtracer