DEV Community

Cover image for System Architecture: Deterministic Token-Level Halting for LLM Hallucinations using Rust and Dual-Entropy Scoring
Miroslav Šotek
Miroslav Šotek

Posted on

System Architecture: Deterministic Token-Level Halting for LLM Hallucinations using Rust and Dual-Entropy Scoring

The current standard for LLM hallucination detection is a structural liability. In production enterprise environments, evaluating an output for factual coherence after the entire payload has been generated and transmitted is mathematically and operationally insufficient. In strictly regulated domains—such as financial analytics, compliance frameworks, or clinical data parsing—a single fabricated integer or hallucinated citation invalidates the entire downstream pipeline.

Auditing the error post-hoc is a failure state. The system must possess the capability to sever the generation in real-time.

To resolve this, I architected Director-AI (currently at stable release v3.14). It functions as a drop-in middleware circuit breaker that executes deterministic, token-level streaming halts using a dual-entropy scoring engine, powered by Rust-accelerated compute paths.

The Flaw in Post-Hoc Verification

Most "AI Safety" wrappers operate as parallel or sequential API calls. They wait for the primary LLM to complete its generation, pass the output to an evaluation model, and return a pass/fail boolean. This introduces three critical bottlenecks:

Massive Latency Overhead: Doubling the time-to-first-token (TTFT) and total generation time.

Compute Waste: Processing thousands of tokens in a sequence that was already corrupted at token 15.

Data Exposure: Allowing unverified, potentially non-compliant data to exist in memory or enter logging pipelines.

The Director-AI Solution: Real-Time Interception

Director-AI sits between the client and the LLM provider as an asynchronous streaming proxy. As tokens are generated, they are buffered in micro-batches and evaluated against a dual-entropy scoring algorithm before being forwarded to the client.

Dual-Entropy Scoring Mechanism

The core evaluation logic relies on two distinct axes of verification, combined to calculate a total system entropy state:

NLI (Natural Language Inference) Contradiction Detection: Evaluates if the current token sequence logically contradicts the established premise or prompt constraints using the 0.4B FactCG-DeBERTa model.

RAG (Retrieval-Augmented Generation) Fact-Checking: Cross-references the emerging semantic claim against a validated vector-database context.

If the combined entropy exceeds the defined safety threshold, Director-AI immediately terminates the TCP connection to the LLM and injects a standard exception to the client.

Rust-Accelerated Compute Paths

To execute this evaluation without degrading the user experience, the middleware overhead must remain negligible.

Director-AI shifts intensive operations away from the Python network layer. The v3.14 architecture implements 12 core compute paths natively written in Rust, delivering a 9.4× geometric mean speedup over equivalent Python code blocks. By avoiding garbage-collection pauses and parallelizing tensor evaluations directly on the incoming token stream, it enforces real-time validation. The codebase is backed by over 4,310+ passing tests, guaranteeing strict memory safety and predictable execution under peak API loads.

Cryptographic Auditability for Zero-Tolerance Domains

For banking infrastructure and financial analytics systems, simply halting an invalid output is not enough; the security event must be auditable without violating global data privacy laws.

Director-AI is explicitly architected to align with the EU AI Act, GDPR, and Swiss revDSG frameworks. It achieves this through a zero-knowledge audit pipeline.

Instead of writing plaintext queries to log files, the AuditLogger processes all telemetry into structured JSONL files, utilising one-way SHA-256 query hashing.

Operational Impact: System administrators can mathematically prove that a specific guardrail was active and triggered at a precise Unix timestamp, without ever storing, exposing, or caching the user's proprietary, high-sensitivity prompt data.

Deployment Mechanics

Director-AI operates as a framework-agnostic drop-in. It does not require modifying frontend applications or fine-tuning underlying models. You route your existing OpenAI, Anthropic, or local LLM API base URLs through the Director-AI port, and the middleware handles the token interception autonomously.

The system is deployed under a dual-licensing model (AGPL v3 open-core, with proprietary enterprise extensions).

Examine the open-core capability manifest, architecture diagrams, and deployment instructions on GitHub:

https://github.com/anulum/director-ai

Top comments (0)