The "Silent Failure" in Production
We talk a lot about hallucinations, but we rarely talk about how we catch them. The industry standard right now is "LLM-as-a-Judge": asking one LLM to verify whether another LLM's answer is correct.
When I was building a RAG pipeline for a critical use case, I ran into a dangerous flaw in this approach: probabilistic models cannot perform deterministic verification.
If an LLM makes a math error or writes unsafe SQL, it’s often because it fundamentally misunderstands the logic in that context. Asking the same model (or a similar one) to "double-check" often leads to the same error, just with more confidence.
I realized I couldn't ship "vibes" to production. I needed proofs.
The Shift: From Probabilistic to Deterministic
I started asking a simple question: Why are we using AI to check math, when we have Python?
We already have tools that have been giving exact, reproducible answers for decades:
- SymPy for Calculus/Math.
- Z3 Theorem Prover for Logic.
- AST Parsers for Code Security.
The problem wasn't the tools; it was the lack of a protocol to connect them to LLM outputs easily.
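To make that concrete, here is the kind of answer those tools give on their own, with no model in the loop. A minimal sketch (QWED isn't involved yet; this is plain SymPy and Python's ast module):

from sympy import sqrt, integrate, Symbol
import ast

# SymPy evaluates symbolically: the answer is exact on every run,
# with no sampling temperature in the loop.
x = Symbol("x")
print(sqrt(81))           # 9
print(integrate(2*x, x))  # x**2

# Python's ast module exposes code structure, so a dangerous call can be
# found by walking the tree instead of asking a model for its opinion.
tree = ast.parse("import os; os.system('rm -rf /')")
for node in ast.walk(tree):
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
        print("Suspicious call:", node.func.attr)  # system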
Introducing QWED: An Infrastructure for Truth
I decided to build QWED not as another "AI tool," but as a Verification Protocol. It treats the LLM as an "Untrusted Translator"—it can translate natural language into code/logic, but it is never trusted to evaluate the result.
Here is the difference in architecture:
❌ The Old Way (LLM-as-a-Judge):
User Query -> LLM Answer -> LLM Judge (vibes-based check) -> User
Result: ~80% reliable, unpredictable latency.
✅ The Zero-Trust Way (QWED):
User Query -> LLM -> Deterministic Engine (Math/Code/Logic) -> Proof/Fail -> User
Result: 100% mathematically proven.
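The "Proof/Fail" step is just a hard function boundary: the model's text goes in, and only a solver verdict comes out. A minimal sketch of the pattern, assuming the LLM hands back an equation string (the `sympify` parsing here is my stand-in, not necessarily how QWED parses):

from sympy import sympify, simplify

def verify_equation(llm_claim: str) -> bool:
    # The LLM is an untrusted translator: its text is parsed into symbols,
    # and only SymPy's verdict leaves this function.
    lhs, rhs = llm_claim.split("==")
    return simplify(sympify(lhs) - sympify(rhs)) == 0

print(verify_equation("diff(sin(x), x) == cos(x)"))  # True: proven symbolically
print(verify_equation("sqrt(81) == 8"))              # False: blocked

In production you would also sandbox the parsing step itself, since the expression string comes from an untrusted source.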
Here is how the Math Engine catches a subtle hallucination that usually slips past LLM judges:
from qwed_sdk import QWEDClient

client = QWEDClient()

# Scenario: the LLM claims sqrt(81) is 8 (a common token-prediction error)
llm_output = "sqrt(81) == 8"

# QWED uses SymPy to evaluate the expression mathematically
result = client.verify_math(llm_output)

if not result["verified"]:
    print(f"Hallucination Blocked: {result['explanation']}")
    # Output: "sqrt(81) evaluates to 9, which is not equal to 8."
I also built engines for:
- SQL Security: Using AST to detect injection patterns before execution.
- Logic Puzzles: Using Z3 to check Boolean satisfiability (a sketch follows this list).
- Data Integrity: Using Pandas to verify tabular claims.
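For the logic engine, the Z3 trick is worth spelling out: to prove a claim, you ask the solver whether its negation is satisfiable. If no counterexample exists, the claim holds for every assignment. A minimal sketch with the z3-solver package (not the QWED SDK itself):

from z3 import Bools, Solver, Implies, And, Not, unsat

a, b = Bools("a b")
# LLM claim: "if a implies b, and a holds, then b holds" (modus ponens)
claim = Implies(And(Implies(a, b), a), b)

s = Solver()
s.add(Not(claim))        # search for a counterexample
if s.check() == unsat:   # none exists: the claim is proven
    print("Proven for every assignment")
else:
    print("Refuted, counterexample:", s.model())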
Why Open Source?
I believe "Verification" should be a standard infrastructure layer, like SSL for security. It shouldn't be a black-box API hidden behind a paywall.
We released QWED under Apache 2.0. You can audit the code, run it locally (air-gapped), or inspect the solvers yourself.
If you are tired of debugging "vibes" and want to build a pipeline based on proofs, check out the repo. I’d love to hear your feedback on the architecture.
🌟 Repo: QWED-AI/qwed-verification