The "Silent Failure" in Production
We talk a lot about hallucinations, but we rarely talk about how we catch them. The industry standard right now is "LLM-as-a-Judge": asking one LLM to verify whether another LLM's answer is correct.
When I was building a RAG pipeline for a critical use case, I ran into a dangerous flaw in this approach: probabilistic models cannot perform deterministic verification.
If an LLM makes a math error or writes unsafe SQL, it’s often because it fundamentally misunderstands the logic in that context. Asking the same model (or a similar one) to "double-check" often leads to the same error, just with more confidence.
I realized I couldn't ship "vibes" to production. I needed proofs.
The Shift: From Probabilistic to Deterministic
I started asking a simple question: Why are we using AI to check math, when we have Python?
We already have tools that have been giving exact, reproducible answers for decades:
- SymPy for Calculus/Math.
- Z3 Theorem Prover for Logic.
- AST Parsers for Code Security.
The problem wasn't the tools; it was the lack of a protocol to connect them to LLM outputs easily.
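To make that concrete, here is the kind of answer those tools give on their own, with no model in the loop. A minimal sketch (QWED isn't involved yet; this is plain SymPy and Python's ast module):

from sympy import sqrt, integrate, Symbol
import ast

# SymPy evaluates symbolically: the answer is exact on every run,
# with no sampling temperature in the loop.
x = Symbol("x")
print(sqrt(81))           # 9
print(integrate(2*x, x))  # x**2

# Python's ast module exposes code structure, so a dangerous call can be
# found by walking the tree instead of asking a model for its opinion.
tree = ast.parse("import os; os.system('rm -rf /')")
for node in ast.walk(tree):
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
        print("Suspicious call:", node.func.attr)  # system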
Introducing QWED: An Infrastructure for Truth
I decided to build QWED not as another "AI tool," but as a Verification Protocol. It treats the LLM as an "Untrusted Translator"—it can translate natural language into code/logic, but it is never trusted to evaluate the result.
Here is the difference in architecture:
❌ The Old Way (LLM-as-a-Judge):
User Query -> LLM Answer -> LLM Judge (vibes-based check) -> User
Result: ~80% reliable, unpredictable latency.
✅ The Zero-Trust Way (QWED):
User Query -> LLM -> Deterministic Engine (Math/Code/Logic) -> Proof/Fail -> User
Result: 100% mathematically proven.
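The "Proof/Fail" step is just a hard function boundary: the model's text goes in, and only a solver verdict comes out. A minimal sketch of the pattern, assuming the LLM hands back an equation string (the `sympify` parsing here is my stand-in, not necessarily how QWED parses):

from sympy import sympify, simplify

def verify_equation(llm_claim: str) -> bool:
    # The LLM is an untrusted translator: its text is parsed into symbols,
    # and only SymPy's verdict leaves this function.
    lhs, rhs = llm_claim.split("==")
    return simplify(sympify(lhs) - sympify(rhs)) == 0

print(verify_equation("diff(sin(x), x) == cos(x)"))  # True: proven symbolically
print(verify_equation("sqrt(81) == 8"))              # False: blocked

In production you would also sandbox the parsing step itself, since the expression string comes from an untrusted source.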
Here is how the Math Engine catches a subtle hallucination that usually slips past LLM judges:
from qwed_sdk import QWEDClient

client = QWEDClient()

# Scenario: the LLM claims sqrt(81) is 8 (a common token-prediction error)
llm_output = "sqrt(81) == 8"

# QWED uses SymPy to evaluate the expression mathematically
result = client.verify_math(llm_output)

if not result["verified"]:
    print(f"Hallucination Blocked: {result['explanation']}")
    # Output: "sqrt(81) evaluates to 9, which is not equal to 8."
I also built engines for:
- SQL Security: Using AST to detect injection patterns before execution.
- Logic Puzzles: Using Z3 to check Boolean satisfiability (a sketch follows this list).
- Data Integrity: Using Pandas to verify tabular claims.
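For the logic engine, the Z3 trick is worth spelling out: to prove a claim, you ask the solver whether its negation is satisfiable. If no counterexample exists, the claim holds for every assignment. A minimal sketch with the z3-solver package (not the QWED SDK itself):

from z3 import Bools, Solver, Implies, And, Not, unsat

a, b = Bools("a b")
# LLM claim: "if a implies b, and a holds, then b holds" (modus ponens)
claim = Implies(And(Implies(a, b), a), b)

s = Solver()
s.add(Not(claim))        # search for a counterexample
if s.check() == unsat:   # none exists: the claim is proven
    print("Proven for every assignment")
else:
    print("Refuted, counterexample:", s.model())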
Why Open Source?
I believe "Verification" should be a standard infrastructure layer, like SSL for security. It shouldn't be a black-box API hidden behind a paywall.
We released QWED under Apache 2.0. You can audit the code, run it locally (air-gapped), or inspect the solvers yourself.
If you are tired of debugging "vibes" and want to build a pipeline based on proofs, check out the repo. I’d love to hear your feedback on the architecture.
🌟 Repo: QWED-AI/qwed-verification