🚫 Zero Mental Math: An Anti-Hallucination Architecture for LLM-Driven Analysis
A six-layer system for achieving 100% accurate numerical reporting from Large Language Models
🎯 The problem
I built an MCP server that extracts data from my MT5 terminals on a VPS. Basically it's a load of financial data reports: trades, averages, technical indicators, etc.
Once I had it all built out, I realized my LLM would randomly hallucinate details. For example, it would report a 16th trade when there had only been 15 trades that day.
When it comes to financial reporting, I realize there is probably a lot of prior work on this topic, so I grabbed ideas from some of the latest research on RAG and threw something together.
I wrote tests that check the accuracy of my embedding results across 10 repeated runs, and each MCP tool scores 100% accuracy on end-to-end integration tests.
I had the AI summarize it. If anyone is curious about the exact code, maybe I can open-source a repeatable process, but I'm hoping this article gives you everything you need.
(incoming AI-generated content)
📋 Abstract
Large Language Models (LLMs) are fundamentally pattern matchers, not calculators. When asked to analyze data, they generate "plausible-looking" numbers based on statistical patterns in training data—not deterministic computation. This is catastrophic for domains requiring precision, such as trading analysis, financial reporting, or medical diagnostics.
This document describes the Zero Mental Math Architecture, a multi-layered system that achieves accurate numerical reporting by shifting all computation to deterministic Python code and reducing the LLM to a "citation copy machine."
⚠️ The Core Problem
🤖 LLMs Hallucinate Numbers
Given raw trading data, an LLM will confidently state:
"Your win rate is approximately 70%"
...without performing any calculation. The model pattern-matched to a "reasonable-sounding" percentage. The actual win rate might be 65.52%, but the LLM has no mechanism to know this.
🧠 Why This Happens
LLMs predict the next token based on learned probability distributions. When they encounter a context suggesting a percentage is needed, they sample from the distribution of "percentages that appeared in similar contexts during training." This is fundamentally different from computation.
Research backing: Work on arithmetic capabilities in transformers (Nogueira et al., 2021) demonstrated that LLMs reliably fail at multi-digit arithmetic. The error rate increases with operand size and operation complexity. This isn't a bug to be fixed; it's an architectural limitation of attention-based sequence models.
🏗️ Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│ ZERO MENTAL MATH ARCHITECTURE │
├─────────────────────────────────────────────────────────────────┤
│ LAYER 1: Fat MCP Server (Pre-Calculation) │
│ └── Shift ALL computation to deterministic Python │
├─────────────────────────────────────────────────────────────────┤
│ LAYER 2: Accuracy Reports (Provenance Tracking) │
│ └── Pre-formatted citations with integrity checksums │
├─────────────────────────────────────────────────────────────────┤
│ LAYER 3: Response Formatter (Constrained Generation) │
│ └── Template-based output with zero degrees of freedom │
├─────────────────────────────────────────────────────────────────┤
│ LAYER 4: RAG Context (Semantic Grounding) │
│ └── Retrieval-augmented generation for entity resolution │
├─────────────────────────────────────────────────────────────────┤
│ LAYER 5: LLM Validation (Adversarial Verification) │
│ └── Second LLM fact-checks against source data │
├─────────────────────────────────────────────────────────────────┤
│ LAYER 6: Auto-Retry (Iterative Refinement) │
│ └── Automatic correction loop with convergence guarantees │
└─────────────────────────────────────────────────────────────────┘
🔧 Layer 1: Fat MCP Server (Pre-Calculation)
📊 What It Does
The MCP (Model Context Protocol) server performs ALL numerical calculations before returning data to the LLM. The LLM never sees raw data that would require arithmetic.
# ❌ BAD: Raw data requires LLM to calculate
get_mt5_history_deals() → [deal1, deal2, deal3, ...]
# LLM must: count deals, group by position, sum P&L, calculate ratios
# ✅ GOOD: Pre-calculated metrics
get_mt5_position_history() → {
"summary": {
"total_positions": 29, # Server counted
"win_rate": 65.52, # Server calculated: (19/29)*100
"profit_factor": 2.34, # Server calculated: sum(wins)/abs(sum(losses))
"expectancy": 42.57 # Server calculated: total_pl/total_positions
}
}
✅ Why This Works
Principle: Tool-Augmented LLMs
The insight from Meta's "Toolformer" (Schick et al., 2023) and the broader ReAct paradigm (Yao et al., 2022) is that LLMs should delegate to external tools for tasks they perform poorly. Arithmetic is the canonical example.
Principle: Separation of Concerns
Asking an LLM to calculate percentages is like asking a poet to do accounting. Language models are trained on text prediction, not numerical computation. By moving calculation to Python—a language designed for computation—we use each system for its strengths.
Principle: Determinism Over Stochasticity
Python's 19/29*100 = 65.517... is deterministic. Running it 1000 times yields identical results. An LLM's "calculation" is stochastic—it samples from a probability distribution, introducing variance even at temperature 0 (due to floating-point non-determinism in GPU operations).
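To make this concrete, here is a minimal sketch of the kind of server-side calculation Layer 1 performs. The function and field names are hypothetical, not the author's exact implementation; the point is that every metric the LLM will later cite is produced by plain, deterministic Python before the LLM ever sees it.

```python
# Sketch of Layer-1 pre-calculation (hypothetical field names; the real
# server pulls closed positions from MT5 before doing this).
def summarize_positions(positions: list[dict]) -> dict:
    """Compute every metric the LLM will later cite, in plain Python."""
    wins = [p["profit"] for p in positions if p["profit"] > 0]
    losses = [p["profit"] for p in positions if p["profit"] <= 0]
    total = len(positions)
    total_pl = sum(p["profit"] for p in positions)
    return {
        "total_positions": total,
        "total_wins": len(wins),
        "total_losses": len(losses),
        # Guard divisions so the tool never has to "guess" a value.
        "win_rate": round(len(wins) / total * 100, 2) if total else 0.0,
        "profit_factor": round(sum(wins) / abs(sum(losses)), 2) if sum(losses) else None,
        "expectancy": round(total_pl / total, 2) if total else 0.0,
        "total_pl": round(total_pl, 2),
    }
```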
Research Foundation
- Toolformer (Schick et al., 2023): LLMs can learn to call APIs for tasks like calculation
- Program-Aided Language Models (Gao et al., 2022): Offloading computation to code interpreters
- Chain-of-Thought Prompting (Wei et al., 2022): Even with step-by-step reasoning, LLMs still make arithmetic errors
📝 Layer 2: Accuracy Reports (Provenance Tracking)
🎯 What It Does
Every tool response includes an _accuracy_report field containing:
- Pre-formatted citations — Complete sentences ready for copy-paste
- CRC32 checksum — Integrity fingerprint of all metric values (CRC32 is fast and non-cryptographic)
- Confidence score — Data quality assessment
{
"summary": { "win_rate": 65.52, "profit_factor": 2.34 },
"_accuracy_report": {
"checksum": "A7B3C2D1",
"checksum_input": "29|19|10|65.52|1234.56|85.25|-42.15|2.34|42.57",
"confidence": {
"score": "high",
"reason": "9/9 metrics populated, 29 positions analyzed"
},
"metrics": [
{
"path": "summary.win_rate",
"value": 65.52,
"citation": "Win rate: 65.52% [Source: get_mt5_position_history.summary.win_rate]"
}
],
"instructions": {
"checksum_required": true,
"format": "End analysis with: [Verified: A7B3C2D1]"
}
}
}
✅ Why This Works
Principle: The LLM as Copy Machine
The critical insight is that LLMs are excellent at copying text verbatim. By providing the exact citation string, we reduce the LLM's job from "interpret this number and write about it" to "copy this string into your response." The former invites hallucination; the latter is mechanical.
Principle: Verifiable Provenance
Every number in the output has a traceable source. This enables:
- Automated verification: Scripts can check that reported values match source data
- Human auditing: Readers can follow citations to verify claims
- Debugging: When errors occur, the citation trail identifies the failure point
Principle: Checksums as Commitment Devices
The CRC32 checksum serves multiple purposes:
- Tamper detection: If any metric changes, the checksum changes
- Verification anchor: The [Verified: A7B3C2D1] tag at the end of the output confirms the LLM used the correct source data
- Debugging aid: The checksum_input field shows the exact values used, enabling manual verification
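For illustration, here is a minimal sketch of how such a fingerprint could be produced, assuming the metric ordering shown in checksum_input above. Python's zlib.crc32 stands in for the CRC32 implementation; the helper name is hypothetical.

```python
import zlib

def build_checksum(values: list) -> tuple[str, str]:
    """Join ordered metric values and fingerprint them with CRC32."""
    checksum_input = "|".join(str(v) for v in values)  # e.g. "29|19|10|65.52|..."
    checksum = format(zlib.crc32(checksum_input.encode("utf-8")), "08X")
    return checksum, checksum_input

checksum, raw = build_checksum([29, 19, 10, 65.52, 1234.56, 85.25, -42.15, 2.34, 42.57])
# The server embeds `checksum` in _accuracy_report and instructs the LLM to
# end its analysis with "[Verified: <checksum>]".
```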
Research Foundation
- Attribution in RAG Systems (Liu et al., 2023): Citation improves factual accuracy
- Self-Consistency Checking (Wang et al., 2022): Multiple verification signals improve reliability
- Data Provenance in ML Pipelines: Standard practice in MLOps for reproducibility
📄 Layer 3: Response Formatter (Constrained Generation)
🎯 What It Does
Templates define the exact structure of outputs, with placeholder slots for citations:
TEMPLATE = """## Performance Analysis (Confidence: {confidence.score})
### Overview
{citation:summary.total_positions}
{citation:summary.win_rate}
{citation:summary.profit_factor}
[Verified: {checksum}]"""
The formatter replaces {citation:summary.win_rate} with the exact citation string from Layer 2:
Win rate: 65.52% [Source: get_mt5_position_history.summary.win_rate]
✅ Why This Works
Principle: Reducing Degrees of Freedom
Hallucination occurs when LLMs have too much freedom. Consider:
| Approach | Degrees of Freedom | Hallucination Risk |
|---|---|---|
| "Analyze this data" | Unlimited | Very High |
| "Report the win rate" | High (format, precision, context) | High |
| "Copy this citation: Win rate: 65.52%" | Near Zero | Near Zero |
Templates eliminate structural decisions. The LLM doesn't choose what to report, in what order, with what formatting—the template specifies everything.
Principle: Slot-Filling vs. Generation
This follows the "skeleton-then-fill" paradigm from structured NLG (Natural Language Generation). The template is the skeleton; citations are the fill. The LLM's role is purely mechanical substitution.
Critical Implementation Rule:
class ResponseFormatter:
"""
Critical Rule: NEVER calculates numbers. Only uses citations from
_accuracy_report.metrics provided by the server.
"""
The formatter is explicitly prohibited from performing any computation. It can only copy existing citations.
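Here is a minimal sketch of that slot-filling behavior, assuming the _accuracy_report structure from Layer 2. The class shape is illustrative, not the author's exact code.

```python
import re

class ResponseFormatter:
    """Fills template slots with pre-built citations. Never calculates."""

    def __init__(self, accuracy_report: dict):
        # Index citations by metric path, e.g. "summary.win_rate".
        self._citations = {m["path"]: m["citation"] for m in accuracy_report["metrics"]}
        self._checksum = accuracy_report["checksum"]
        self._confidence = accuracy_report["confidence"]["score"]

    def render(self, template: str) -> str:
        text = template.replace("{checksum}", self._checksum)
        text = text.replace("{confidence.score}", self._confidence)
        # Replace each {citation:<path>} slot with the exact citation string.
        return re.sub(r"\{citation:([\w.]+)\}",
                      lambda m: self._citations[m.group(1)], text)
```

Rendering the TEMPLATE above with this formatter reproduces the Layer 3 output; every number arrives as a copied citation string.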
Research Foundation
- Constrained Decoding (Hokamp & Liu, 2017): Forcing outputs to satisfy constraints
- Template-Based NLG (Reiter & Dale, 1997): Classical approach to reliable text generation
- Structured Output Forcing: JSON mode, function calling schemas
🗄️ Layer 4: RAG Context (Semantic Grounding)
🎯 What It Does
A ChromaDB knowledge base stores static facts:
- Strategy mappings (magic numbers → strategy names)
- Trading rules and constraints
- Domain-specific terminology
Before generating responses, the system retrieves relevant context:
# Query: "What strategy uses magic 106?"
# Returns: ["Magic number 106 is Goldfish Scalper trading XAUUSD"]
This context is injected into both the formatter and the validator.
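A minimal sketch of such a knowledge base using ChromaDB's in-memory client follows; the collection name and stored documents are illustrative.

```python
import chromadb

# Session-scoped knowledge base holding static, known-good facts.
client = chromadb.Client()  # in-memory; discarded when the session ends
facts = client.create_collection("static_rules")
facts.add(
    ids=["magic_106"],
    documents=["Magic number 106 is Goldfish Scalper trading XAUUSD"],
)

# Retrieval before generation and validation:
hits = facts.query(query_texts=["What strategy uses magic 106?"], n_results=1)
rag_context = hits["documents"][0]  # injected into both formatter and validator
```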
✅ Why This Works
Principle: Not All Hallucinations Are Numerical
An LLM might correctly report "Win rate: 65.52%" but incorrectly attribute it to "Dark Dione strategy" when it's actually "Goldfish Scalper." This is a semantic hallucination—the number is right, but the entity relationship is wrong.
RAG grounds the LLM in factual knowledge about entities, preventing semantic errors.
Principle: Ephemeral Session Scope
kb = KnowledgeBase(ephemeral=True) # Resets each MCP session
kb.load_static_rules() # Loads known-good facts
The knowledge base is session-scoped to prevent stale data accumulation. Static rules (which don't change) are loaded fresh; dynamic trading statistics are always fetched live from MT5.
Principle: Context for Both Generator and Validator
The same RAG context is passed to:
- Formatter: To ground response generation
- Validator: To prevent false-positive hallucination flags
If the response says "Goldfish Scalper (Magic 106)" and the validator's context confirms this mapping, it won't incorrectly flag it as a hallucination.
Research Foundation
- RAG (Lewis et al., 2020): The foundational retrieval-augmented generation paper
- REALM (Guu et al., 2020): Retrieval-enhanced pre-training
- In-Context Learning (Brown et al., 2020): GPT-3's ability to use context examples
- Grounding in Dialogue Systems (Roller et al., 2020): Connecting responses to knowledge
✅ Layer 5: LLM Validation (Adversarial Verification)
🎯 What It Does
A second LLM (Novita AI) validates the drafted response against source data before delivery to the user:
validation_result = validate_with_llm(
response_text=draft, # What the LLM wants to say
source_data=mcp_response, # Ground truth from server
context=rag_context # Knowledge base facts
)
The validator checks four rules:
- Zero Mental Math: All numbers match source exactly
- Anti-Aggregation: Raw values shown before averages
- Citation Requirement: Every number has a [Source: ...] tag
- Checksum Verification: Response ends with the correct [Verified: XXXX]
✅ Why This Works
Principle: Verification is Easier Than Generation
This is a fundamental asymmetry in computational complexity. Consider:
- Generation: "Analyze this data and write a report" (open-ended, creative)
- Verification: "Does '65.52%' match the source value '65.52'?" (closed, deterministic)
The validator has a much simpler task: pattern matching and comparison. This makes it far less prone to hallucination than the generator.
Principle: Adversarial Checking
This draws from:
- Constitutional AI (Anthropic, 2022): Using AI to critique and improve AI outputs
- Debate (Irving et al., 2018): Having models argue to expose weaknesses
- Red-teaming: Standard security practice of adversarial testing
The validator is explicitly instructed to be strict:
Be strict - any deviation from source is a hallucination
Principle: Structured Error Output
The validator returns structured JSON with specific issue categorization:
{
"hallucinations_found": true,
"issues": [{
"claim": "Win Rate: approximately 70%",
"problem": "Source shows 65.52%, not 'approximately 70%'",
"severity": "critical",
"correct_value": "Win rate: 65.52% [Source: ...]",
"rule_violated": "Zero Mental Math"
}]
}
This enables automated correction in Layer 6.
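A minimal sketch of what such a validator call might look like is below. call_llm is a hypothetical stand-in for whatever chat-completion client is in use (Novita AI in the author's setup), and the prompt wording is illustrative.

```python
import json

def validate_with_llm(response_text: str, source_data: dict, context: list[str]) -> dict:
    """Ask a second LLM to fact-check the draft against the source JSON."""
    prompt = (
        "You are a strict fact-checker. Compare the RESPONSE to the SOURCE.\n"
        "Rules: Zero Mental Math, Anti-Aggregation, Citation Requirement, "
        "Checksum Verification. Be strict - any deviation from source is a hallucination.\n"
        f"KNOWN FACTS: {context}\n"
        f"SOURCE: {json.dumps(source_data)}\n"
        f"RESPONSE: {response_text}\n"
        'Reply only with JSON: {"hallucinations_found": bool, "issues": [...]}'
    )
    raw = call_llm(prompt)   # hypothetical LLM client call
    return json.loads(raw)   # structured result drives Layer 6
```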
Research Foundation
- Constitutional AI (Bai et al., 2022): AI systems that critique themselves
- Self-Consistency (Wang et al., 2022): Sampling multiple times and checking agreement
- Fact Verification (Thorne et al., 2018): FEVER dataset and verification systems
- LLM-as-Judge (Zheng et al., 2023): Using LLMs to evaluate LLM outputs
🔄 Layer 6: Auto-Retry (Iterative Refinement)
🎯 What It Does
When validation fails, the system automatically:
- Parses the validation errors
- Applies corrections to the draft
- Re-validates
- Repeats up to N times (default: 3)
for attempt in range(1, max_retries + 1):
validation = validate_with_llm(narrative, source_data, context)
if not validation["hallucinations_found"]:
# Success! Return validated response
return {"analysis": narrative, "_validation_meta": {"validated": True}}
# Failed - apply corrections and retry
narrative = corrector.apply_corrections(narrative, validation["issues"])
✅ Why This Works
Principle: Iterative Refinement
Self-refinement is a well-established technique for improving LLM outputs. The key insight is that correction is easier than generation—given specific feedback ("this number is wrong, it should be X"), the fix is mechanical.
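A minimal sketch of that mechanical correction step, assuming the validator's issue format from Layer 5 (the class name is hypothetical):

```python
class Corrector:
    """Applies validator feedback as deterministic string replacements."""

    def apply_corrections(self, narrative: str, issues: list[dict]) -> str:
        for issue in issues:
            claim = issue.get("claim")
            correct = issue.get("correct_value")
            # Only fixable issues carry both the offending claim and its replacement.
            if claim and correct and claim in narrative:
                narrative = narrative.replace(claim, correct)
        return narrative
```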
Principle: Bounded Retry with Graceful Degradation
The system doesn't retry forever:
- Fixable issues (wrong numbers): Auto-correct and retry
- Unfixable issues (structural problems): Fail immediately with diagnostics
- Max retries exceeded: Return error with last attempt for debugging
if not can_fix:
return {
"success": False,
"error": "Validation failed with unfixable issues",
"validation_issues": issues,
"unfixable_reasons": reasons
}
Principle: Convergence Guarantees
Because corrections are deterministic (replace X with Y) and the validator is consistent, the system converges. If the corrector properly applies all fixes, the next validation will pass. The retry loop guards against transient failures, not fundamental incompatibility.
Research Foundation
- Self-Refine (Madaan et al., 2023): Iterative refinement with self-feedback
- Reflexion (Shinn et al., 2023): Verbal reinforcement learning through self-reflection
- Error Correction in Communication (Shannon, 1948): Fundamental information theory
🚨 Special Rule: Anti-Aggregation
⚠️ The Problem
Aggregation hides critical information. Consider:
❌ WRONG (Aggregation Hallucination):
Current lot sizes:
- EURUSD: 0.06 lots average (range: 0.05-0.08)
This hides that the lot size recently jumped from 0.05 → 0.08 (a 60% increase), a critical signal that risk management changed.
✅ The Solution
✅ CORRECT (Raw Data First):
Current lot sizes [Source: positions list]:
- EURUSD last 5: [0.05, 0.05, 0.05, 0.08, 0.07]
→ CURRENT: 0.07 lots
→ TREND: Scaled up 60% on Nov 24 (0.05 → 0.08)
→ Average: 0.06 lots (for reference only)
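A minimal sketch of how a server-side helper could emit this raw-first structure (field names are hypothetical):

```python
def lot_size_report(lots: list[float]) -> dict:
    """Raw values first; the average is attached for reference only."""
    last_five = lots[-5:]                  # e.g. [0.05, 0.05, 0.05, 0.08, 0.07]
    step_changes = [round((b - a) / a * 100, 1)
                    for a, b in zip(last_five, last_five[1:])]
    return {
        "last_5_lots": last_five,          # raw values shown before any aggregate
        "current": last_five[-1],
        "step_changes_pct": step_changes,  # e.g. [0.0, 0.0, 60.0, -12.5]
        "average_for_reference_only": round(sum(last_five) / len(last_five), 2),
    }
```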
✅ Why This Works
Principle: Simpson's Paradox Awareness
Aggregates can reverse the apparent direction of relationships. A "stable average" can hide dramatic changes in underlying data. By requiring raw values first, we prevent this information loss.
Principle: Auditability
Scientific reporting standards require showing raw data. If you only report "average 0.06," readers cannot detect:
- Outliers that skew the average
- Trends (increasing/decreasing)
- Distribution shape (uniform vs. bimodal)
Principle: Transparency Over Convenience
It's easier to report a single number. But the Anti-Aggregation Rule prioritizes transparency over convenience. The small cognitive cost of reading 5 raw values prevents potentially catastrophic misunderstandings.
🎯 Accuracy Indicators in Final Output
✅ What Indicates Accuracy
| Indicator | Location | Example | Why It Matters |
|---|---|---|---|
| Citation tags | After every number | [Source: tool.path] | Traceable provenance |
| Checksum | End of response | [Verified: A7B3C2D1] | Data integrity proof |
| Confidence score | Header | (Confidence: high) | Data quality signal |
| Validation metadata | Response field | "validated": true | System verification passed |
| No approximations | Absence | Never: "~", "about" | Zero Mental Math compliance |
| Raw values before aggregates | Data sections | last 5: [...] | Anti-Aggregation compliance |
🚩 Red Flags (Hallucination Indicators)
| Red Flag | Example | Rule Violated |
|---|---|---|
| Missing citation | Win rate: 65.52% | Citation Requirement |
| Approximation words | approximately 70% | Zero Mental Math |
| Missing checksum | No [Verified: XXXX] | Checksum Requirement |
| Rounded numbers | 70% vs 65.52% | Zero Mental Math |
| Averages without raw data | Average: 0.06 alone | Anti-Aggregation |
| Confidence missing | No confidence score | Incomplete output |
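Most of these red flags are cheap to detect programmatically. Here is a minimal sketch of a lexical red-flag scan over a drafted response; the patterns are illustrative, not exhaustive.

```python
import re

def scan_red_flags(text: str) -> list[str]:
    """Cheap lexical checks that catch the most common rule violations."""
    flags = []
    if re.search(r"\b(approximately|about|around|roughly)\b|~\d", text, re.IGNORECASE):
        flags.append("Approximation words (Zero Mental Math)")
    if not re.search(r"\[Verified: [0-9A-F]{8}\]", text):
        flags.append("Missing checksum (Checksum Requirement)")
    # Percentages should carry a [Source: ...] tag on the same line.
    for line in text.splitlines():
        if re.search(r"\d+(\.\d+)?%", line) and "[Source:" not in line:
            flags.append(f"Uncited percentage: {line.strip()}")
    return flags
```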
📖 Complete Example: End-to-End Pipeline
Step 1: User Request
"Analyze my trading performance for November"
Step 2: MCP Server Response (Layers 1-2)
{
"success": true,
"summary": {
"total_positions": 29,
"total_wins": 19,
"total_losses": 10,
"win_rate": 65.52,
"profit_factor": 2.34,
"total_pl": 1234.56
},
"_accuracy_report": {
"checksum": "A7B3C2D1",
"confidence": {"score": "high", "reason": "9/9 metrics, 29 positions"},
"metrics": [
{
"path": "summary.win_rate",
"value": 65.52,
"citation": "Win rate: 65.52% [Source: get_mt5_position_history.summary.win_rate]"
},
{
"path": "summary.profit_factor",
"value": 2.34,
"citation": "Profit factor: 2.34 [Source: get_mt5_position_history.summary.profit_factor]"
}
]
}
}
Step 3: Template Formatting (Layer 3)
## Performance Analysis (Confidence: high)
### Overview
- Total positions: 29 [Source: get_mt5_position_history.summary.total_positions]
- Win rate: 65.52% [Source: get_mt5_position_history.summary.win_rate]
- Profit factor: 2.34 [Source: get_mt5_position_history.summary.profit_factor]
### Financial Performance
- Total P&L: $1234.56 [Source: get_mt5_position_history.summary.total_pl]
[Verified: A7B3C2D1]
Step 4: RAG Context Retrieval (Layer 4)
Retrieved: "Magic number 106 is Goldfish Scalper trading XAUUSD"
Step 5: LLM Validation (Layer 5)
{
"hallucinations_found": false,
"checksum_valid": true,
"summary": "All claims verified against source data"
}
Step 6: Final Response
{
"success": true,
"analysis": "## Performance Analysis (Confidence: high)\n\n...\n\n[Verified: A7B3C2D1]",
"_validation_meta": {
"validated": true,
"attempts": 1,
"model": "novita-default",
"rag_context_used": true
}
}
📊 Summary: Why This Architecture Works
| Layer | Technique | What It Prevents | Key Insight |
|---|---|---|---|
| 1 | Server-side calculation | LLM arithmetic errors | Use the right tool for the job |
| 2 | Pre-formatted citations | LLM paraphrasing numbers | Reduce LLM to copy machine |
| 3 | Template-based output | Structural hallucination | Minimize degrees of freedom |
| 4 | RAG context grounding | Semantic hallucination | Ground entities in facts |
| 5 | Second LLM validation | Subtle errors slipping through | Verification is easier than generation |
| 6 | Auto-retry with correction | Transient failures | Iterative refinement converges |
💡 The Meta-Principle
Trust flows from deterministic systems to stochastic ones, never the reverse.
Python calculates → Server stores → Citations copy → Templates structure → Validator checks
At no point does an LLM "decide" a number. The LLM's role is purely mechanical: copying citations into template slots. This is the fundamental insight that makes 100% accuracy achievable.
📚 References
- Schick, T., et al. (2023). "Toolformer: Language Models Can Teach Themselves to Use Tools." arXiv:2302.04761
- Yao, S., et al. (2022). "ReAct: Synergizing Reasoning and Acting in Language Models." arXiv:2210.03629
- Lewis, P., et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS 2020
- Wang, X., et al. (2022). "Self-Consistency Improves Chain of Thought Reasoning in Language Models." arXiv:2203.11171
- Bai, Y., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." arXiv:2212.08073
- Madaan, A., et al. (2023). "Self-Refine: Iterative Refinement with Self-Feedback." arXiv:2303.17651
- Gao, L., et al. (2022). "PAL: Program-Aided Language Models." arXiv:2211.10435
- Nogueira, R., et al. (2021). "Investigating the Limitations of Transformers with Simple Arithmetic Tasks." arXiv:2102.13019