Miroslav Šotek

Posted on Jun 8 • Edited on Jun 22

System Architecture: Deterministic Claim-Level Halting for LLM Hallucinations using Rust and Dual-Entropy Scoring

#llminfrastructure #opensource #aisafety #rust

The current standard for LLM hallucination detection is a structural liability. In production enterprise environments, evaluating an output for factual coherence only after the entire payload has been generated and transmitted comes too late: by then the bad output has already reached downstream systems. In strictly regulated domains, such as financial analytics, compliance frameworks, or clinical data parsing, a single fabricated integer or hallucinated citation can invalidate the entire downstream pipeline.

Auditing the error post-hoc is a failure state. The system must be able to sever generation in real time.

To resolve this, I architected Director-AI (currently at stable release v3.14). It functions as a drop-in middleware circuit breaker that executes deterministic, claim-level streaming halts using a dual-entropy scoring engine, powered by Rust-accelerated compute paths.

The Flaw in Post-Hoc Verification

Most "AI Safety" wrappers operate as parallel or sequential API calls. They wait for the primary LLM to complete its generation, pass the output to an evaluation model, and return a pass/fail boolean. This introduces three critical bottlenecks:

Massive Latency Overhead: Doubling the time-to-first-token (TTFT) and total generation time.

Compute Waste: Processing thousands of tokens in a sequence that was already corrupted at token 15.

Data Exposure: Allowing unverified, potentially non-compliant data to exist in memory or enter logging pipelines.

The Director-AI Solution: Real-Time Interception

Director-AI sits between the client and the LLM provider as an asynchronous streaming proxy. As tokens are generated, they are buffered in micro-batches and evaluated against a dual-entropy scoring algorithm before being forwarded to the client.
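The buffering loop described above can be sketched roughly as follows. This is a minimal illustration, not Director-AI's actual implementation: `gate_stream`, `HaltError`, the `score` callback, and the default batch size and threshold are all hypothetical names and values.

```python
from typing import Callable, Iterable, Iterator, List


class HaltError(RuntimeError):
    """Raised when a buffered micro-batch fails the check (hypothetical name)."""


def gate_stream(
    tokens: Iterable[str],
    score: Callable[[str], float],
    batch_size: int = 4,
    threshold: float = 0.5,
) -> Iterator[str]:
    """Yield tokens only after the text generated so far has been scored.

    `score` stands in for the real scoring engine: it maps the text so far to
    a risk value in [0, 1]. Tokens are held in a micro-batch and released only
    if the score stays at or below `threshold`.
    """
    released: List[str] = []
    batch: List[str] = []
    for token in tokens:
        batch.append(token)
        if len(batch) < batch_size:
            continue
        if score("".join(released + batch)) > threshold:
            raise HaltError("halted before forwarding the current micro-batch")
        released.extend(batch)
        yield from batch
        batch = []
    # Score and flush whatever is left when the upstream stream ends.
    if batch:
        if score("".join(released + batch)) > threshold:
            raise HaltError("halted before forwarding the final micro-batch")
        yield from batch
```

Because the check runs before the batch is yielded, a failing batch never reaches the client. The cost is that each token waits until its batch is full and scored.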

Dual-Entropy Scoring Mechanism

The core evaluation logic relies on two distinct axes of verification, combined to calculate a total system entropy state:

NLI (Natural Language Inference) Contradiction Detection: Evaluates if the current token sequence logically contradicts the established premise or prompt constraints using the 0.4B FactCG-DeBERTa model.

RAG (Retrieval-Augmented Generation) Fact-Checking: Cross-references the emerging semantic claim against a validated vector-database context.

If the combined entropy exceeds the configured safety threshold, Director-AI immediately terminates the TCP connection to the LLM and sends a standard exception to the client.
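The post does not state how the two signals are combined, so the sketch below assumes the simplest option: a weighted sum compared against a threshold. The `Scores` type, the weight, and the threshold value are illustrative assumptions, not Director-AI's documented formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    nli_contradiction: float  # assumed: P(contradiction) from the NLI model, in [0, 1]
    rag_divergence: float     # assumed: 1 - support from retrieved context, in [0, 1]


def combined_score(s: Scores, nli_weight: float = 0.5) -> float:
    """Weighted sum of the two signals (an assumption, not the documented formula)."""
    if not 0.0 <= nli_weight <= 1.0:
        raise ValueError("nli_weight must be in [0, 1]")
    return nli_weight * s.nli_contradiction + (1.0 - nli_weight) * s.rag_divergence


def should_halt(s: Scores, threshold: float = 0.6, nli_weight: float = 0.5) -> bool:
    """Halt when the combined score exceeds the safety threshold."""
    return combined_score(s, nli_weight) > threshold
```

With equal weights, `Scores(0.9, 0.8)` combines to 0.85 and halts, while `Scores(0.1, 0.2)` combines to 0.15 and passes.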

Rust-Accelerated Compute Paths

To execute this evaluation without degrading the user experience, the middleware overhead must remain negligible.

Director-AI shifts intensive operations away from the Python network layer. The v3.14 architecture implements 12 core compute paths natively written in Rust, delivering a 9.4× geometric mean speedup over equivalent Python code blocks. By avoiding garbage-collection pauses and parallelizing tensor evaluations directly on the incoming token stream, it enforces real-time validation. The codebase is backed by more than 4,310 passing tests, guaranteeing strict memory safety and predictable execution under peak API loads.
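A geometric mean speedup is the n-th root of the product of the per-path speedups, which keeps one very fast path from dominating the summary the way it would in an arithmetic mean. The per-path figures below are made-up placeholders chosen for easy arithmetic, not Director-AI measurements.

```python
import math
from statistics import geometric_mean

# Placeholder per-path speedups (Python time / Rust time); NOT measured values.
speedups = [2.0, 8.0, 32.0]

geo = geometric_mean(speedups)         # cube root of 2 * 8 * 32 = 512
arith = sum(speedups) / len(speedups)  # arithmetic mean, pulled up by the 32x path

# Equivalent definition: exponentiate the mean of the logs.
geo_from_logs = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
```

For these placeholder values the geometric mean is 8, while the arithmetic mean is 14.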

Cryptographic Auditability for Zero-Tolerance Domains

For banking infrastructure and financial analytics systems, simply halting an invalid output is not enough; the security event must be auditable without violating global data privacy laws.

Director-AI is explicitly architected to align with the EU AI Act, GDPR, and Swiss revDSG frameworks. It achieves this through a zero-knowledge audit pipeline.

Instead of writing plaintext queries to log files, the AuditLogger processes all telemetry into structured JSONL files, utilising one-way SHA-256 query hashing.

Operational Impact: System administrators can mathematically prove that a specific guardrail was active and triggered at a precise Unix timestamp, without ever storing, exposing, or caching the user's proprietary, high-sensitivity prompt data.
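A minimal sketch of hashed audit logging along these lines, using only the Python standard library. The function names and record fields are illustrative assumptions and do not reflect the actual `AuditLogger` schema.

```python
import hashlib
import json
import time
from typing import Optional


def audit_record(query: str, guardrail: str, halted: bool, ts: Optional[float] = None) -> str:
    """Build one JSONL line that stores a SHA-256 digest instead of the query text.

    Field names are illustrative assumptions, not AuditLogger's schema.
    """
    record = {
        "ts": time.time() if ts is None else ts,  # Unix timestamp of the event
        "guardrail": guardrail,
        "halted": halted,
        "query_sha256": hashlib.sha256(query.encode("utf-8")).hexdigest(),
    }
    return json.dumps(record, sort_keys=True)


def append_audit(path: str, line: str) -> None:
    """Append a record to a JSONL file, one JSON object per line."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")
```

One design consideration: an unkeyed SHA-256 of a short or guessable query can be matched by hashing candidate inputs, so a keyed hash such as HMAC-SHA-256 may be worth considering where queries are low-entropy.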

Deployment Mechanics

Director-AI operates as a framework-agnostic drop-in. It does not require modifying frontend applications or fine-tuning underlying models. You route your existing OpenAI, Anthropic, or local LLM API base URLs through the Director-AI port, and the middleware handles the token interception autonomously.

The system is released under a dual-licensing model (Apache-2.0 AND BUSL-1.1, v3.14→v3.16; open-core, with proprietary enterprise extensions).

Examine the open-core capability manifest, architecture diagrams, and deployment instructions on GitHub:

https://github.com/anulum/director-ai

Top comments (2)

mote • Jun 10

The micro-batch buffering before scoring is the critical design choice here. Every token you buffer before checking adds latency — at 21+ TPS on CPU (like Mixtral), a 3-token buffer costs ~140ms before the user sees anything. Have you benchmarked the latency impact of different buffer sizes?

The dual-entropy approach (NLI + RAG) is solid, but I'm curious about the NLI model choice. FactCG-DeBERTa at 0.4B is lightweight for real-time scoring, but NLI-based contradiction detection has known blind spots: it struggles with implicit contradictions (things the model "knows" but didn't state explicitly). A smaller entailment classifier might miss subtle errors.

On the Rust side — 12 hot paths with 9.4x speedup over Python is impressive. Are you using SIMD for the entropy calculations, or is the gain mainly from avoiding Python's GC overhead in the streaming loop?

Miroslav Šotek • Jun 22 • Edited

Great questions — let me take them in order, with a bit of nuance where the post oversimplified.

1 - Buffering / granularity. One correction worth making: the production halt is claim-level, not strictly per-token. Tokens stream through; the proxy scores the buffered text on a fixed cadence (default every 8 tokens, configurable), and the halt fires on a completed claim that contradicts the governed premise — not on each token. So you're not paying an NLI forward pass per token, and the tunable isn't "buffer N before first byte"; it's how early a contradicting claim is caught vs. scoring overhead. There's a separate token-level span detector for when you want sub-claim resolution. I haven't published a buffer-size sweep — fair ask, I'll add one.

2 - NLI choice / blind spots. You're right that NLI misses implicit, world-knowledge contradictions — I won't pretend otherwise. Two design points that address it:

The halt model isn't the 0.4B FactCG. FactCG is the grounding model and it's 2-class (supported / not-supported), with no contradiction class. The halt uses a 3-class NLI (DeBERTa-v3-large MNLI/FEVER/ANLI) for raw P(contradiction), and it halts only on contradiction — never on "neutral / unsupported." A correct-but-unstated claim is neutral, so it doesn't false-halt; that's the exact failure mode a 2-class grounding model has on streaming text.

For implicit contradictions NLI can't see at inference time, the answer is RAG grounding (put the fact in the premise) plus optional escalation to a stronger judge on low-confidence scores. NLI is one signal, not the whole gate.
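The two design points above amount to a three-way decision. A rough sketch of that policy follows; the `Action` enum, the `decide` function, and both thresholds are hypothetical, chosen only to illustrate "halt on contradiction, pass on neutral, escalate when unsure".

```python
from enum import Enum
from typing import Dict


class Action(Enum):
    PASS = "pass"
    HALT = "halt"
    ESCALATE = "escalate"


def decide(probs: Dict[str, float], halt_at: float = 0.8, unsure_below: float = 0.5) -> Action:
    """Map 3-class NLI probabilities to an action.

    `probs` holds P(entailment), P(neutral), P(contradiction). Thresholds are
    illustrative assumptions, not Director-AI defaults.
    """
    if probs["contradiction"] >= halt_at:
        return Action.HALT  # only an explicit contradiction halts
    if max(probs.values()) < unsure_below:
        return Action.ESCALATE  # no class is confident: defer to a stronger judge
    return Action.PASS  # entailment or neutral passes through
```

A claim scored as mostly neutral passes, which is the behaviour described for correct-but-unstated claims. A near-uniform distribution is routed to the stronger judge instead of being halted.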

3 - Rust gains — SIMD or GC? Neither, honestly. No SIMD in the current kernel — the gains come from moving the hot loops (BM25, the divergence/entropy reductions, softmax, aggregation) into native Rust over PyO3 and out of the Python per-call/interpreter overhead in the streaming loop, plus lock-based concurrency (parking_lot) instead of fighting the GIL. Small correction to the framing: CPython is refcount + cyclic GC, so what we avoid is interpreter/dispatch + GIL contention, not Java-style GC pauses. Per-path numbers (Python-vs-Rust medians + checksum parity) are in BENCHMARKS.md; the geomean has moved since the v3.14 figure in the post — I owe that an update.

KR, Miroslav