Miroslav Šotek
I built an open-source real-time LLM hallucination guardrail — here are the benchmarks

What is Director-Class AI?

An open-source Python library that guards LLM output in real time. It watches tokens as they stream and halts generation the moment it detects a hallucination.

It uses NLI (Natural Language Inference, via DeBERTa/FactCG) to score each claim against source documents, with optional RAG-based knowledge grounding.
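The stream-and-halt loop can be sketched roughly like this. This is a minimal illustration of the idea, not the library's actual internals: `score_claim` is a hypothetical stand-in for the NLI scorer, and the sentence-boundary detection is deliberately naive.

```python
from typing import Callable, Iterable, Iterator

def guarded_stream(
    tokens: Iterable[str],
    score_claim: Callable[[str], float],  # hypothetical NLI scorer: 1.0 = entailed, 0.0 = contradicted
    threshold: float = 0.5,
) -> Iterator[str]:
    """Yield tokens as they stream; stop once a completed sentence scores below threshold."""
    buffer = ""
    for token in tokens:
        buffer += token
        yield token
        # When a sentence completes, score it as a claim against the source.
        if token.rstrip().endswith((".", "!", "?")):
            if score_claim(buffer.strip()) < threshold:
                return  # halt generation: likely hallucination
            buffer = ""

# Toy scorer: accepts sentences that mention the source entity, rejects the rest.
scorer = lambda claim: 1.0 if "Paris" in claim else 0.0
tokens = ["Paris ", "is ", "the ", "capital. ", "Berlin ", "is ", "too. ", "More."]
out = "".join(guarded_stream(tokens, scorer))
# The stream halts after the failing sentence; "More." is never generated.
```

A real implementation would scale this with batched NLI inference and proper claim segmentation, but the control flow (buffer, score at sentence boundary, halt on low score) is the core pattern.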

```bash
pip install director-ai
```

Two-line integration:

```python
import openai
from director_ai import guard

client = guard(openai.OpenAI())  # wraps any OpenAI/Anthropic client
```

Benchmarks (measured, not aspirational)

| Metric | Value | Conditions |
|---|---|---|
| Balanced accuracy | 75.8% | FactCG on LLM-AggreFact (29,320 samples) |
| GPU latency | 14.6 ms/pair | GTX 1060, ONNX, batch=16 |
| L40S latency | 0.5 ms/pair | FP16, batch=32 |
| E2E catch rate | 90.7% | Hybrid mode, 600 HaluEval traces |
| Rust BM25 speedup | 10.2x | over pure Python implementation |
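For context on the headline metric: balanced accuracy is the mean of per-class recall (sensitivity and specificity), which makes it a fairer number than plain accuracy on a class-imbalanced benchmark like LLM-AggreFact. The figures below are purely illustrative, not the benchmark's actual confusion matrix:

```python
def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of sensitivity (recall on positives) and specificity (recall on negatives)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

# Illustrative numbers only: 80% of hallucinations caught, 70% of clean outputs passed.
print(balanced_accuracy(tp=80, fn=20, tn=70, fp=30))  # 0.75
```

A classifier that simply labels everything "hallucination" would score 100% plain accuracy on a skewed split but only 50% balanced accuracy, which is why 75.8% here is a meaningful margin over chance.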

Framework Integrations

LangChain, LlamaIndex, LangGraph, CrewAI, Haystack, DSPy, Semantic Kernel, and SDK Guard (wraps OpenAI/Anthropic/Bedrock/Gemini/Cohere clients).

Honest Limitations

  • NLI-only scoring needs KB grounding for domain use (medical false-positive rate is 100% without a KB)
  • ONNX on CPU is slow (383 ms/pair); GPU recommended
  • Long documents need >=16 GB VRAM
  • Summarisation accuracy is weakest (AggreFact-CNN: 68.8%)
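The KB-grounding caveat above boils down to this: without retrieved evidence, an NLI model has nothing to entail a domain claim against, so every claim looks unsupported. A minimal sketch of the gating idea, using hypothetical helpers and a toy term-overlap retriever (a real system would use BM25 or dense retrieval), not the library's API:

```python
def ground_then_score(claim, kb, nli_score, min_overlap=2):
    """Only run NLI on claims with retrieved evidence; abstain otherwise."""
    claim_terms = set(claim.lower().split())
    # Toy retriever: keep passages sharing enough terms with the claim.
    evidence = [p for p in kb if len(claim_terms & set(p.lower().split())) >= min_overlap]
    if not evidence:
        return None  # abstain: ungrounded NLI scores would be meaningless here
    return max(nli_score(claim, p) for p in evidence)

kb = ["Aspirin inhibits platelet aggregation.", "Ibuprofen is an NSAID."]
toy_nli = lambda claim, premise: 1.0 if claim.split()[0] in premise else 0.0

print(ground_then_score("Aspirin inhibits clotting.", kb, toy_nli))  # grounded, so scored
print(ground_then_score("Quantum flux cures colds.", kb, toy_nli))   # no evidence: None
```

Abstaining (returning `None`) instead of guessing is what drives the false-positive rate down in unfamiliar domains: the guard only makes a call when it has something to check the claim against.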

Quality

  • 3,545 tests, 91% coverage
  • Sigstore-signed releases, SLSA provenance
  • OpenSSF Best Practices: 100%
  • 19 CI/security health badges

Links

AGPL-3.0 with commercial licensing available.

Would love feedback from anyone working on LLM reliability, RAG pipelines, or AI safety!
