Muhammad Muzammil

Posted on Jun 4 • Originally published at Medium

LongTracer: Open-Source RAG Hallucination Detection Without LLM-as-a-Judge

#ai #rag #llm #langchain

Stop paying to evaluate your LLM outputs. Stop tolerating non-deterministic quality gates. LongTracer is the MIT-licensed Python library that catches RAG hallucinations at inference time — no API calls, no cloud dependency, no per-verification cost.

The Hallucination Problem Is Now a Production Engineering Problem
Retrieval-Augmented Generation (RAG) has become the dominant architecture for enterprise AI in 2025–2026. Legal research tools, medical Q&A systems, financial advisory bots, and customer-support agents all run the same core loop: retrieve context from a knowledge base, pass it to an LLM, return the response.

The failure mode is well-documented: hallucination — the LLM generating confident, plausible-sounding output that directly contradicts the very source documents it was given.

A legal assistant that cites a case that doesn’t exist.
A medical chatbot that states the wrong drug dosage.
A customer-support agent that invents a return policy.

These are not edge cases. They are the daily operational reality for any team running RAG at scale.

The engineering community has largely accepted the reframing: hallucination is not a model bug you patch once. It is a systems engineering discipline you manage continuously. That shift has spawned an entire category of LLM observability tooling — and the market is now crowded.

This article does two things: gives you an honest map of the observability landscape as it stands today, and makes the technical case for LongTracer — a focused, open-source Python library built by EnDevSols that takes a fundamentally different approach to the problem.

The 2025–2026 LLM Observability Landscape: An Honest Map
Before evaluating any specific tool, it helps to understand what the market actually offers. As of mid-2026, the major players fall into the two groups covered below.

General-Purpose Trace Platforms
Langfuse (MIT-licensed, self-hostable) has become the default open-source choice for teams that need prompt management, session tracing, and evaluation harnesses. Its breadth is its strength — it integrates with LangChain, LlamaIndex, and custom pipelines, supports prompt versioning, and has a human annotation queue. Its fundamental limitation in the RAG verification space: it is an observability tool. It tells you what happened. It does not automatically verify whether the response was grounded in the retrieved documents.

Arize Phoenix brings a mature MLOps heritage. Built natively on OpenTelemetry, it excels at embedding drift detection, retrieval quality metrics, and evaluation pipelines. Teams with a traditional ML background will find the paradigm familiar. Like Langfuse, it is primarily a tracing and post-hoc evaluation platform.

LangSmith is the native observability layer for LangChain/LangGraph. It is tightly integrated and excellent for graph visualization and annotation, but it creates significant vendor lock-in and is less useful for teams using other frameworks.

Real-Time Guardrail Platforms
Galileo differentiates through proprietary SLM models purpose-built for real-time evaluation. Its Luna-2 models are widely regarded as state-of-the-art for blocking harmful or hallucinated outputs before they reach users. The tradeoff: enterprise-only pricing, cloud-only deployment, and LLM-calls-to-evaluate-LLM-calls — compounding both cost and latency.

Helicone takes an entirely different approach — acting as a transparent proxy between your application and LLM providers. The “one-line” setup is its headline feature. It excels at cost tracking and caching but is not a semantic verification system in any meaningful sense.

The Gap These Tools Leave
Every solution above falls into one of two categories:

Passive and post-hoc — observes and reports after the fact but does not verify claim-level grounding at inference time.
Expensive and locked — real-time guardrails require enterprise contracts, cloud connectivity, and LLM calls to evaluate LLM calls.

This is precisely the gap LongTracer is designed to fill.

What Is LongTracer?
LongTracer is an open-source Python SDK (MIT license, available on PyPI) built by EnDevSols for one specific job: verify that every claim in an LLM response is actually supported by the source documents used to generate it.

It achieves this using a hybrid STS + NLI pipeline — two lightweight encoder models that run entirely locally, with no external API calls, no internet dependency, and no per-verification cost.

As of v0.2.0 (released May 18, 2026), it ships with a complete observability suite: a built-in web dashboard, OpenTelemetry export for Grafana/Datadog/Jaeger, active alerting via Slack/Discord/webhooks, and a production-grade REST API server. But the core mission has never changed:

“RAG hallucination detection, multi-project tracing, and pluggable backends — all batteries included.”

Install it in one command:

pip install longtracer

Supports Python 3.10, 3.11, and 3.12.

How LongTracer Works: The STS + NLI Pipeline
This is LongTracer’s core technical differentiator. Understanding the architecture is essential to understanding why it solves problems that other tools don’t.

The Problem with LLM-as-a-Judge
Most RAG evaluation approaches use an LLM-as-a-judge strategy: send the original response and the source context to a capable model (GPT-4o, Claude, Gemini) and ask it to score faithfulness. This approach is intuitive but introduces three serious production problems:

Latency: An additional LLM call adds 1–5 seconds per inference.
Cost: At scale, paying for an evaluation call on every response becomes substantial.
Non-determinism: The same inputs can produce different scores on consecutive runs, making CI/CD integration unreliable. You cannot write a test that will not flake.

LongTracer’s design decision is direct: replace the LLM judge with a deterministic two-stage encoder pipeline.

Stage 1 — Claim Splitting
The LLM response is broken into individual atomic claims using a regex-based sentence splitter tuned for LLM output patterns. Key behaviors:

Decimal numbers (98.6°F) are not split at their period
Standard abbreviations (Dr., Inc., e.g.) are handled correctly
Meta-statements — honest uncertainty phrases like “the documents do not contain…” — are detected and never flagged as hallucinations, even if no source explicitly supports them
Hallucination-signaling phrases — statements like “based on my general knowledge…” — are flagged regardless of downstream NLI score, because they explicitly indicate the model is drawing on training data rather than the retrieved context
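The splitting rules above can be sketched in a few lines of Python. This is an illustration only, not LongTracer's actual implementation: the abbreviation list, both phrase patterns, and the names split_claims and classify_claim are assumptions made for this example.

```python
import re

# Hypothetical abbreviation list; the library's real list is not shown in this article.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "inc", "ltd", "e.g", "i.e", "etc", "vs"}

# Illustrative phrase patterns, not LongTracer's actual ones.
META_PATTERN = re.compile(r"\bthe documents? (do|does) not (contain|mention)\b", re.I)
OUTSIDE_KNOWLEDGE_PATTERN = re.compile(r"\bbased on my (general|own) knowledge\b", re.I)

# A candidate boundary is ., ! or ? followed by whitespace.
# "98.6" never matches because its period is followed by a digit.
BOUNDARY = re.compile(r"[.!?]\s+")


def split_claims(text: str) -> list[str]:
    """Split text into sentences, skipping periods that end an abbreviation."""
    claims, start = [], 0
    for match in BOUNDARY.finditer(text):
        words = text[start:match.start()].split()
        last_word = words[-1].lower() if words else ""
        if text[match.start()] == "." and last_word in ABBREVIATIONS:
            continue  # "Dr. Smith" stays in one claim
        claims.append(text[start:match.start() + 1].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        claims.append(tail)
    return claims


def classify_claim(claim: str) -> str:
    """Return 'meta' (never flagged), 'flagged' (always flagged), or 'verify'."""
    if META_PATTERN.search(claim):
        return "meta"
    if OUTSIDE_KNOWLEDGE_PATTERN.search(claim):
        return "flagged"
    return "verify"
```

Only claims classified as "verify" would need to go through the STS and NLI stages; the other two categories are decided by the phrase match alone.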

Stage 2A — STS Evidence Selection (< 10ms per claim)
For each atomic claim, the bi-encoder all-MiniLM-L6-v2 computes cosine similarity between the claim embedding and every sentence in the provided source documents. The highest-scoring sentence is selected as the candidate evidence.

Gating logic: If the best similarity score is below 0.25, the NLI stage is skipped entirely. There is no value in running a cross-encoder on a claim that has no plausible source match — this saves compute and avoids false positives on topics genuinely absent from the retrieved context.
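Here is a minimal sketch of the evidence selection and gating step, using plain Python lists in place of real embeddings. In the actual pipeline the vectors would come from the all-MiniLM-L6-v2 bi-encoder; the function names cosine and select_evidence are hypothetical, and only the 0.25 threshold is taken from the description above.

```python
import math

STS_GATE = 0.25  # similarity threshold quoted in this article


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_evidence(claim_vec, sentence_vecs):
    """Return (best_index, best_score, run_nli) for one claim."""
    if not sentence_vecs:
        return None, 0.0, False
    scores = [cosine(claim_vec, vec) for vec in sentence_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    # Below the gate there is no plausible source match, so NLI is skipped.
    return best, scores[best], scores[best] >= STS_GATE
```

Because the gate is a single comparison on a score that is already computed, skipping the cross-encoder costs nothing extra for claims that have no matching source sentence.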

Stage 2B — NLI Verification (~150ms per claim)

The cross-encoder nli-deberta-v3-xsmall takes the (claim, best_source_sentence) pair and outputs three probabilities:

| Label | Meaning | Action |
| --- | --- | --- |
| entailment | Source text supports the claim | ✅ Claim passes |
| neutral | Source neither confirms nor contradicts | ⚠️ Claim is unverified |
| contradiction | Source directly contradicts the claim | ❌ Hallucination flagged |

A claim is flagged as a hallucination when contradiction_score > 0.5.

Trust Score

trust_score = supported_claims / total_claims

A score of 1.0 means every claim in the response is supported by retrieved documents. A score of 0.0 means none are.
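To make the labeling rule and the trust score concrete, here is a small sketch. The contradiction threshold of 0.5 and the formula come from this article. Two things are assumptions: a claim counts as supported when entailment is the highest of the three probabilities, and an empty response scores 1.0. The names NliScores, label_claim, and trust_score are hypothetical.

```python
from dataclasses import dataclass

CONTRADICTION_THRESHOLD = 0.5  # from the article


@dataclass
class NliScores:
    entailment: float
    neutral: float
    contradiction: float


def label_claim(scores: NliScores) -> str:
    """Map the three NLI probabilities to a per-claim label."""
    if scores.contradiction > CONTRADICTION_THRESHOLD:
        return "hallucination"
    top = max(scores.entailment, scores.neutral, scores.contradiction)
    # Assumption: entailment must be the top label for a claim to pass.
    if top == scores.entailment:
        return "supported"
    # Neutral on top, or contradiction on top but not above the threshold.
    return "unverified"


def trust_score(labels: list[str]) -> float:
    """supported_claims / total_claims."""
    if not labels:
        return 1.0  # assumption: nothing to verify counts as fully trusted
    return sum(label == "supported" for label in labels) / len(labels)
```

Note that under this reading an unverified claim lowers the trust score without being counted as a hallucination.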

The SLM Fallback for Numeric and Temporal Claims (v0.1.4+)
Standard NLI models are known to underperform on fine-grained numeric and date comparisons: telling “330 meters” from “303 meters” is a numeric distinction that NLI encoders were not optimized for. LongTracer v0.1.4 addressed this with an optional SLM fallback verifier using Qwen2.5-1.5B-Instruct-GGUF. The fallback is invoked automatically only when NLI confidence is low and the claim contains numeric or temporal content, so the vast majority of real-world claims stay on the baseline path at roughly 150ms each.
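The fallback condition can be pictured as a two-part check. The sketch below is a guess at the shape of that gate, not the library's code: the 0.6 confidence cutoff, the regular expression, and the name needs_slm_fallback are all assumptions, since the article does not state how "low confidence" or "numeric or temporal content" are detected.

```python
import re

LOW_CONFIDENCE = 0.6  # hypothetical cutoff; the article does not give one

# Digits cover quantities and years; month and weekday names cover written dates.
# "may" is left out because it would also match the modal verb.
NUMERIC_OR_TEMPORAL = re.compile(
    r"\d"
    r"|\b(january|february|march|april|june|july|august|"
    r"september|october|november|december)\b"
    r"|\b(monday|tuesday|wednesday|thursday|friday|saturday|sunday)\b",
    re.IGNORECASE,
)


def needs_slm_fallback(claim: str, nli_confidence: float) -> bool:
    """True only when NLI is unsure AND the claim has numeric or date content."""
    return nli_confidence < LOW_CONFIDENCE and bool(NUMERIC_OR_TEMPORAL.search(claim))
```

Both conditions must hold, which is what keeps the slower model off the common path.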

The One-Liner API

Zero configuration. No account. No API key.

from longtracer import check
result = check(
    "The Eiffel Tower is 330 meters tall and located in Berlin.",
    ["The Eiffel Tower is a wrought-iron lattice tower in Paris, France. It is 330 metres tall."]
)
print(result.verdict)              # "FAIL"
print(result.trust_score)          # 0.5
print(result.hallucination_count)  # 1  ("Berlin" contradicts "Paris, France")

Or from the terminal, with no Python code at all:

longtracer check "The Eiffel Tower is in Berlin." "The Eiffel Tower is in Paris."
# ✗ FAIL  trust=0.50  hallucinations=1

Framework Integrations: Every Major RAG Stack
One of LongTracer’s most practical competitive advantages is the breadth of its native adapters. As of v0.2.0, it supports seven major frameworks with minimal integration code.

LangChain

from longtracer import LongTracer, instrument_langchain
LongTracer.init(verbose=True)
instrument_langchain(your_chain)
# Every chain.invoke() now auto-verifies responses against retrieved context

LlamaIndex

from longtracer import LongTracer, instrument_llamaindex
LongTracer.init(verbose=True)
instrument_llamaindex(your_query_engine)

LangGraph Agents

from longtracer import instrument_langgraph
# graph is your compiled LangGraph agent
handler = instrument_langgraph(graph)
result = graph.invoke(
    {"messages": [("user", "What is the refund policy?")]},
    config={"callbacks": [handler]}
)

The LangGraph adapter accumulates sources across multi-step tool calls and runs verification once at agent completion, not after every intermediate step. This means the final answer — not intermediate reasoning — is what gets verified, avoiding noisy per-step false positives.

Haystack v2

from longtracer.adapters.haystack_handler import LongTracerVerifier
pipeline.add_component("verifier", LongTracerVerifier())
pipeline.connect("generator.replies", "verifier.response")
pipeline.connect("retriever.documents", "verifier.documents")

OpenAI Assistants API

from longtracer import instrument_openai_assistant
instrument_openai_assistant(client)
# Automatically verifies assistant responses against file_search citations

CrewAI

from longtracer import instrument_crewai
instrument_crewai(crew)
# Wraps kickoff() to verify each task output against its context sources

AutoGen (≥ 0.4)

from longtracer import instrument_autogen
instrument_autogen(agent)

Direct API — Any Framework
For custom pipelines or frameworks not yet listed, the CitationVerifier accepts plain strings with no dependencies on vector stores, LLMs, or external services:

from longtracer.guard.verifier import CitationVerifier
verifier = CitationVerifier()
result = verifier.verify_parallel(
    response="LLM said this...",
    sources=["chunk 1 text", "chunk 2 text"],
    source_metadata=[{"source": "doc.pdf", "page": 1}]
)

Multi-Project Tracing

Production teams rarely run a single RAG application. LongTracer’s multi-project architecture allows you to trace multiple applications — a customer chatbot, an internal search API, a document Q&A service — under a single backend while keeping traces tagged and independently filterable:

from longtracer import LongTracer
LongTracer.init(project_name="chatbot-prod", backend="sqlite")
chatbot = LongTracer.get_tracer("chatbot-prod")
search  = LongTracer.get_tracer("search-api")
chatbot.start_root(inputs={"query": "What is your cancellation policy?"})

Each project’s traces are independently browsable via the CLI and the web dashboard.

Pluggable Storage Backends
LongTracer stores verification traces in configurable backends suited to every deployment scenario:

| Backend | Install | Best For |
| --- | --- | --- |
| SQLite | Built-in (default) | Local development, single-server |
| Memory | Built-in | Testing, ephemeral runs |
| MongoDB | pip install "longtracer[mongo]" | Production, distributed |
| PostgreSQL | pip install "longtracer[postgres]" | Production, relational |
| Redis | pip install "longtracer[redis]" | High-throughput, ephemeral |

Configuration is a single block in pyproject.toml:

[tool.longtracer]
project = "my-rag-app"
backend = "sqlite"
threshold = 0.5
verbose = true

Or via environment variables, following the configuration priority chain:

Code arguments → Environment variables → pyproject.toml → Built-in defaults
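A priority chain like this usually comes down to a few ordered lookups. The sketch below shows the idea under stated assumptions: the LONGTRACER_<KEY> naming scheme is inferred from the two variables shown in this article, the default values other than the SQLite backend are placeholders, and resolve_setting is a hypothetical helper rather than part of the library.

```python
import os

# Illustrative defaults; only "sqlite" as the default backend is stated in the article.
DEFAULTS = {"project": "default", "backend": "sqlite", "threshold": 0.5, "verbose": False}


def resolve_setting(key, code_args, pyproject, environ=None):
    """Resolve one setting: code args, then env vars, then pyproject.toml, then defaults."""
    environ = os.environ if environ is None else environ
    if code_args.get(key) is not None:
        return code_args[key]
    env_name = f"LONGTRACER_{key.upper()}"  # hypothetical naming scheme
    if env_name in environ:
        return environ[env_name]  # note: environment values arrive as strings
    if key in pyproject:
        return pyproject[key]
    return DEFAULTS[key]
```

Here pyproject stands for the already-parsed [tool.longtracer] table, so the sketch does not depend on a particular TOML parser.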

v0.2.0: The Observability and Analytics Suite
The most significant release in LongTracer’s history shipped on May 18, 2026. Version 0.2.0 transforms LongTracer from a standalone guardrail library into a full observability platform.

Built-In Web Dashboard
longtracer serve

Open http://localhost:8000/dashboard

Browse all verified traces across every project, view hallucination rates over time, and drill into individual trace spans. The dashboard is authenticated via HTTP-only cookies with timing-safe digest comparison — production-grade security out of the box, no configuration required.

Aggregated Metrics API
Two new endpoints provide programmatic access to verification metrics:

GET /api/v1/metrics/summary — total traces, average trust score, total hallucinations across all projects
GET /api/v1/metrics/timeseries — trend data for dashboarding or alerting integrations

OpenTelemetry Export

pip install "longtracer[otel]"

LongTracer emits standard OTLP spans (longtracer.verify) with the following attributes:

longtracer.trust_score
longtracer.hallucination_count
longtracer.verdict
longtracer.project

These are fully compatible with Jaeger, Grafana Tempo, Datadog, Honeycomb, and any OTLP-compliant backend. A pre-configured Grafana Dashboard Template (grafana/longtracer.json) is included in the repository for instant visualization.

Critically: if OTel packages are not installed, the integration fails gracefully as a zero-overhead no-op. No crashes, no warnings, no behavior change in production.

Active Alerting System
LongTracer’s alerting runs in a background daemon thread — it never blocks the verification pipeline. When a trust score drops below a configured threshold, notifications are dispatched to:

Slack
Discord
Email
Custom Webhooks — HMAC-SHA256 signed, Stripe-style, with 5 retries and exponential backoff

Configuration is done through environment variables, for example a threshold plus a Slack webhook URL:

LONGTRACER_ALERT_THRESHOLD=0.7
LONGTRACER_SLACK_WEBHOOK_URL=https://hooks.slack.com/...

The webhook implementation uses dead-letter logging after maximum retries, ensuring no silent alert failures.
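For readers who have not built signed webhooks before, here is a rough sketch of the three pieces named above: HMAC-SHA256 signing over a timestamped payload (the Stripe convention), an exponential backoff schedule, and a delivery loop that gives up after the last retry. The function names, the one-second base delay, and the exact message layout are assumptions; LongTracer's actual header names and timings are not shown in this article.

```python
import hashlib
import hmac
import json
import time

MAX_RETRIES = 5  # retry count quoted in the article


def sign_payload(secret: bytes, payload: dict, timestamp: int) -> str:
    """HMAC-SHA256 over "<timestamp>.<json body>", hex-encoded."""
    body = json.dumps(payload, separators=(",", ":"), sort_keys=True)
    message = f"{timestamp}.{body}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, payload: dict, timestamp: int, signature: str) -> bool:
    """Receiver side: recompute and compare in constant time."""
    expected = sign_payload(secret, payload, timestamp)
    return hmac.compare_digest(expected, signature)


def backoff_delays(base: float = 1.0, retries: int = MAX_RETRIES) -> list[float]:
    """Delays that double each time: base, 2*base, 4*base, ..."""
    return [base * 2 ** attempt for attempt in range(retries)]


def deliver(send, payload, delays, sleep=time.sleep) -> bool:
    """Try once, then retry after each delay; False means every attempt failed."""
    if send(payload):
        return True
    for delay in delays:
        sleep(delay)
        if send(payload):
            return True
    return False  # a caller would write the payload to a dead-letter log here
```

Including the timestamp in the signed message lets the receiver reject replayed requests as well as tampered ones.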

The CLI: Full Observability Without Writing Code
The longtracer CLI provides complete trace access from the terminal:


longtracer view                        # List recent traces
longtracer view --last                 # View most recent trace
longtracer view --id <trace_id>        # View specific trace
longtracer view --project chatbot-prod # Filter by project
longtracer view --export <trace_id>    # Export trace to JSON
longtracer view --html <trace_id>      # Export to self-contained HTML report

The HTML export is particularly useful for cross-functional teams. It is a zero-dependency, self-contained single file with:

Color-coded per-claim verdict rows
Side-by-side diff of the LLM claim versus the best matching source evidence
A summary stats bar showing pass/fail/hallucination breakdown
Click-to-expand claim detail with STS score, entailment score, and contradiction score

Send an HTML trace file to a product manager, QA engineer, or non-technical stakeholder and they can immediately see exactly which claims were hallucinated and which source sentence was evaluated against each one.

The REST API Server Mode
For polyglot environments or microservice architectures, LongTracer can operate as a standalone HTTP verification service:

longtracer serve

This starts a FastAPI-based server with:

POST /api/v1/verify — verify a single response
POST /api/v1/verify/batch — bulk verification in a single call
GET /api/v1/health — health check (no authentication required)
GET /api/v1/traces — list recent traces
GET /api/v1/traces/{trace_id} — retrieve a specific trace

Security features included by default:

API key authentication via x-api-key header (LangSmith-standard) with Authorization: Bearer fallback
Timing-safe key comparison via secrets.compare_digest
CORS middleware with configurable origins
Token bucket rate limiter (60 req/min per IP, configurable)
Pydantic input validation with max-length and max-items constraints
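Two of these features are easy to picture in code. Below is a generic token bucket sized for 60 requests per minute, plus a timing-safe key check built on secrets.compare_digest. It is a sketch of the techniques, not the server's actual code; the class and function names are hypothetical, and a real per-IP limiter would keep one bucket per client address.

```python
import secrets
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `refill_per_second` tokens per second."""

    def __init__(self, capacity=60, refill_per_second=1.0, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        elapsed = now - self.updated
        # Add the tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def key_is_valid(presented: str, expected: str) -> bool:
    """Compare API keys without leaking how many leading characters matched."""
    return secrets.compare_digest(presented.encode(), expected.encode())
```

The injectable clock is there so the limiter can be tested without sleeping.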

Why Determinism Matters for CI/CD Integration

One of the most practically important properties of LongTracer is determinism. Because verification uses fixed encoder weights rather than a generative LLM, the same inputs always produce the same output on the same hardware.

This is a prerequisite for integrating hallucination detection into CI/CD pipelines. Teams can write regression tests that assert specific trust scores — and those tests will not flake due to model stochasticity:


# In your test suite
# generate_response() and get_retrieved_chunks() stand in for your own pipeline functions.
from longtracer import check


def test_rag_response_is_grounded():
    result = check(
        response=generate_response("What is the refund policy?"),
        sources=get_retrieved_chunks("What is the refund policy?")
    )
    assert result.trust_score >= 0.85, (
        f"RAG response grounding degraded: {result.trust_score:.2f}"
    )
    assert result.hallucination_count == 0

This kind of deterministic quality gate is simply not possible with LLM-as-a-judge tools, where the same prompt can score 0.9 on one run and 0.7 on the next.

Async Support and Batch Processing
Modern Python applications run on asyncio. LongTracer supports fully async verification:

result = await verifier.verify_parallel_async(response, sources)

For bulk evaluation workloads (running evaluations over a dataset of historical traces, or benchmarking a new retrieval configuration), the batch API parallelizes claim verification internally using ThreadPoolExecutor:

from longtracer import check_batch
results = check_batch([
    {"response": "P is NP.", "sources": ["It is not known if P equals NP."]},
    {"response": "Water boils at 100°C.", "sources": ["Water boils at 100°C at standard atmospheric pressure."]}
])
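If you want the same fan-out in your own evaluation script, the underlying pattern looks roughly like this. The helper run_batch and its verify argument are hypothetical; verify can be any callable that takes a response and its sources, for example a thin wrapper around check.

```python
from concurrent.futures import ThreadPoolExecutor


def run_batch(items, verify, max_workers=4):
    """Verify items in parallel threads; results keep the order of `items`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(
            pool.map(lambda item: verify(item["response"], item["sources"]), items)
        )
```

Executor.map returns results in input order even when the workers finish out of order, so each result lines up with the item that produced it.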

Who Should Use LongTracer?
LongTracer is the right choice if:

You are building a RAG application and need to know, at inference time, whether the LLM’s response is grounded in the retrieved documents
You want hallucination detection without paying for additional LLM API calls on every inference
You need CI/CD-compatible, deterministic quality gates for your RAG pipeline
You are using LangChain, LlamaIndex, LangGraph, Haystack, CrewAI, AutoGen, or the OpenAI Assistants API
You want a fully self-hosted, data-private solution with no external dependencies
You need to monitor multiple RAG projects under a single backend

LongTracer is not the primary choice if:

Your primary need is LLM cost tracking and response caching (Helicone is optimized for this)
You need enterprise-grade real-time safety guardrails with SLA guarantees and dedicated support (Galileo is the leader here)
You are deeply invested in the LangChain ecosystem and need native graph visualization and annotation queues (LangSmith serves this niche well)

Getting Started in Under 5 Minutes


# 1. Install
pip install longtracer

# 2. Run your first verification - no config required
python -c "
from longtracer import check
result = check(
    'The Eiffel Tower is located in Berlin.',
    ['The Eiffel Tower is located in Paris, France.']
)
print(f'Verdict: {result.verdict}')
print(f'Trust Score: {result.trust_score}')
print(f'Hallucinations: {result.hallucination_count}')
"

# 3. Or use the CLI
longtracer check "The Eiffel Tower is in Berlin." "The Eiffel Tower is in Paris."

# 4. Start the dashboard
pip install "longtracer[server]"

longtracer serve

# Visit http://localhost:8000/dashboard

Conclusion
The LLM observability market is mature and well-funded. Most tools in the space are still solving the wrong problem — they tell you that something went wrong after the fact, or they use an LLM to evaluate an LLM, adding cost and non-determinism to an already uncertain pipeline.

LongTracer takes a fundamentally different bet: that a carefully engineered two-stage encoder pipeline — STS for evidence selection, NLI for semantic verification — can catch the majority of real-world RAG hallucinations with near-zero latency, zero external API cost, and complete determinism.

That bet has held up in practice. Since its initial release in April 2025, LongTracer has shipped adapters for seven major frameworks, a production-grade REST API server, a complete observability suite with OTel integration, and a web dashboard — all while maintaining its core constraint: no vector store dependency, no LLM dependency, just strings in and verification out.

For teams that have accepted hallucination as an inevitable tax on AI-powered applications, LongTracer offers a different path: treat every LLM response as innocent until proven grounded.

Resources
GitHub: github.com/ENDEVSOLS/LongTracer
Documentation: endevsols.github.io/LongTracer
PyPI: pypi.org/project/longtracer
Quick Start: endevsols.github.io/LongTracer/getting-started/quickstart
EnDevSols Open-Source Projects: CHANGELOG.md

Top comments (1)

Tae Kim • Jun 5

The single best-sentence selection in 2A was the main weakness when I built something similar. Claims that compose facts across multiple source sentences pick up one wrong evidence sentence with high cosine, and NLI then entails it confidently. I switched to top-k candidate sentences with an entailment-over-union aggregate before the false-entailment rate came down.