chowderhead

Trust the Server, Not the LLM: A Deterministic Approach to LLM Accuracy

🚫 Zero Mental Math: An Anti-Hallucination Architecture for LLM-Driven Analysis

A six-layer system for achieving 100% accurate numerical reporting from Large Language Models

🎯 The problem

I built an MCP server that extracts data from my MT5 terminals on a VPS. Basically it's a load of financial reports: trades, averages, technical indicators, etc.

Once I built it all out, I realized that my LLM would randomly hallucinate details. For example, it would report a 16th trade when there had only been 15 trades that day.

When it comes to financial reporting, I realize there is probably a lot of prior work on this topic, so I grabbed some ideas from recent RAG research and threw something together.

I wrote tests that actually check the accuracy of my embedding results across 10 repeated runs, and each MCP tool scores 100% on end-to-end integration tests.

I had the AI summarize it. If anyone is curious about the exact code, maybe I can open source a repeatable process, but I'm hoping this article gives you everything you need.

(incoming AI-generated content)


📋 Abstract

Large Language Models (LLMs) are fundamentally pattern matchers, not calculators. When asked to analyze data, they generate "plausible-looking" numbers based on statistical patterns in training data—not deterministic computation. This is catastrophic for domains requiring precision, such as trading analysis, financial reporting, or medical diagnostics.

This document describes the Zero Mental Math Architecture, a multi-layered system that achieves accurate numerical reporting by shifting all computation to deterministic Python code and reducing the LLM to a "citation copy machine."


⚠️ The Core Problem

🤖 LLMs Hallucinate Numbers

Given raw trading data, an LLM will confidently state:

"Your win rate is approximately 70%"

...without performing any calculation. The model pattern-matched to a "reasonable-sounding" percentage. The actual win rate might be 65.52%, but the LLM has no mechanism to know this.

🧠 Why This Happens

LLMs predict the next token based on learned probability distributions. When they encounter a context suggesting a percentage is needed, they sample from the distribution of "percentages that appeared in similar contexts during training." This is fundamentally different from computation.

Research backing: work on the arithmetic capabilities of transformers (Nogueira et al., 2021) demonstrated that LLMs reliably fail at multi-digit arithmetic. The error rate increases with operand size and operation complexity. This isn't a bug to be fixed; it's an architectural limitation of attention-based sequence models.


🏗️ Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                    ZERO MENTAL MATH ARCHITECTURE                │
├─────────────────────────────────────────────────────────────────┤
│  LAYER 1: Fat MCP Server (Pre-Calculation)                      │
│  └── Shift ALL computation to deterministic Python              │
├─────────────────────────────────────────────────────────────────┤
│  LAYER 2: Accuracy Reports (Provenance Tracking)                │
│  └── Pre-formatted citations with integrity checksums           │
├─────────────────────────────────────────────────────────────────┤
│  LAYER 3: Response Formatter (Constrained Generation)           │
│  └── Template-based output with zero degrees of freedom         │
├─────────────────────────────────────────────────────────────────┤
│  LAYER 4: RAG Context (Semantic Grounding)                      │
│  └── Retrieval-augmented generation for entity resolution       │
├─────────────────────────────────────────────────────────────────┤
│  LAYER 5: LLM Validation (Adversarial Verification)             │
│  └── Second LLM fact-checks against source data                 │
├─────────────────────────────────────────────────────────────────┤
│  LAYER 6: Auto-Retry (Iterative Refinement)                     │
│  └── Automatic correction loop with convergence guarantees      │
└─────────────────────────────────────────────────────────────────┘

🔧 Layer 1: Fat MCP Server (Pre-Calculation)

📊 What It Does

The MCP (Model Context Protocol) server performs ALL numerical calculations before returning data to the LLM. The LLM never sees raw data that would require arithmetic.

# ❌ BAD: Raw data requires LLM to calculate
get_mt5_history_deals() → [deal1, deal2, deal3, ...]
# LLM must: count deals, group by position, sum P&L, calculate ratios

# ✅ GOOD: Pre-calculated metrics
get_mt5_position_history() → {
    "summary": {
        "total_positions": 29,      # Server counted
        "win_rate": 65.52,          # Server calculated: (19/29)*100
        "profit_factor": 2.34,      # Server calculated: sum(wins)/abs(sum(losses))
        "expectancy": 42.57         # Server calculated: total_pl/total_positions
    }
}

✅ Why This Works

Principle: Tool-Augmented LLMs

The insight from Meta's "Toolformer" (Schick et al., 2023) and the broader ReAct paradigm (Yao et al., 2022) is that LLMs should delegate to external tools for tasks they perform poorly. Arithmetic is the canonical example.

Principle: Separation of Concerns

Asking an LLM to calculate percentages is like asking a poet to do accounting. Language models are trained on text prediction, not numerical computation. By moving calculation to Python—a language designed for computation—we use each system for its strengths.

Principle: Determinism Over Stochasticity

Python's 19/29*100 = 65.517... is deterministic. Running it 1000 times yields identical results. An LLM's "calculation" is stochastic—it samples from a probability distribution, introducing variance even at temperature 0 (due to floating-point non-determinism in GPU operations).
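
For illustration, here is a minimal sketch of the kind of server-side pre-calculation Layer 1 describes. The position dictionary shape (a "profit" field) and the helper name are assumptions for this example, not the author's exact MCP server code; the formulas mirror the ones shown in the JSON above.

# Illustrative Layer-1 pre-calculation: every metric the LLM will report is
# computed here, deterministically. The "profit" field is an assumption for
# this sketch, not the exact MCP server schema.
def summarize_positions(positions: list[dict]) -> dict:
    wins = [p["profit"] for p in positions if p["profit"] > 0]
    losses = [p["profit"] for p in positions if p["profit"] <= 0]
    total = len(positions)
    total_pl = sum(p["profit"] for p in positions)
    return {
        "total_positions": total,
        "total_wins": len(wins),
        "total_losses": len(losses),
        "win_rate": round(len(wins) / total * 100, 2) if total else 0.0,
        "profit_factor": round(sum(wins) / abs(sum(losses)), 2) if sum(losses) else None,
        "expectancy": round(total_pl / total, 2) if total else 0.0,
        "total_pl": round(total_pl, 2),
    }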

Research Foundation

  • Toolformer (Schick et al., 2023): LLMs can learn to call APIs for tasks like calculation
  • Program-Aided Language Models (Gao et al., 2022): Offloading computation to code interpreters
  • Chain-of-Thought Arithmetic Failures (Wei et al., 2022): Even with step-by-step reasoning, LLMs make arithmetic errors

📝 Layer 2: Accuracy Reports (Provenance Tracking)

🎯 What It Does

Every tool response includes an _accuracy_report field containing:

  1. Pre-formatted citations — Complete sentences ready for copy-paste
  2. CRC32 checksum — Integrity fingerprint of all metric values (CRC32 is not cryptographic, but it is enough to detect copy errors)
  3. Confidence score — Data quality assessment
{
    "summary": { "win_rate": 65.52, "profit_factor": 2.34 },
    "_accuracy_report": {
        "checksum": "A7B3C2D1",
        "checksum_input": "29|19|10|65.52|1234.56|85.25|-42.15|2.34|42.57",
        "confidence": {
            "score": "high",
            "reason": "9/9 metrics populated, 29 positions analyzed"
        },
        "metrics": [
            {
                "path": "summary.win_rate",
                "value": 65.52,
                "citation": "Win rate: 65.52% [Source: get_mt5_position_history.summary.win_rate]"
            }
        ],
        "instructions": {
            "checksum_required": true,
            "format": "End analysis with: [Verified: A7B3C2D1]"
        }
    }
}

✅ Why This Works

Principle: The LLM as Copy Machine

The critical insight is that LLMs are excellent at copying text verbatim. By providing the exact citation string, we reduce the LLM's job from "interpret this number and write about it" to "copy this string into your response." The former invites hallucination; the latter is mechanical.

Principle: Verifiable Provenance

Every number in the output has a traceable source. This enables:

  • Automated verification: Scripts can check that reported values match source data
  • Human auditing: Readers can follow citations to verify claims
  • Debugging: When errors occur, the citation trail identifies the failure point

Principle: Checksums as Commitment Devices

The CRC32 checksum serves multiple purposes:

  1. Tamper detection: If any metric changes, the checksum changes
  2. Verification anchor: The [Verified: A7B3C2D1] at the end of output confirms the LLM used the correct source data
  3. Debugging aid: The checksum_input field shows the exact values used, enabling manual verification
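
For illustration, a minimal sketch of how such a report could be assembled with Python's zlib.crc32. The helper name, metric ordering, and citation wording are assumptions for this example, not the author's exact implementation.

import zlib

def build_accuracy_report(tool_name: str, summary: dict) -> dict:
    metrics, values = [], []
    for key, value in summary.items():
        values.append(str(value))
        metrics.append({
            "path": f"summary.{key}",
            "value": value,
            "citation": f"{key.replace('_', ' ').capitalize()}: {value} "
                        f"[Source: {tool_name}.summary.{key}]",
        })
    checksum_input = "|".join(values)                              # e.g. "29|19|10|65.52|..."
    checksum = format(zlib.crc32(checksum_input.encode()), "08X")  # CRC32: integrity check, not cryptographic
    return {
        "checksum": checksum,
        "checksum_input": checksum_input,
        "metrics": metrics,
        "instructions": {
            "checksum_required": True,
            "format": f"End analysis with: [Verified: {checksum}]",
        },
    }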

Research Foundation

  • Attribution in RAG Systems (Liu et al., 2023): Citation improves factual accuracy
  • Self-Consistency Checking (Wang et al., 2022): Multiple verification signals improve reliability
  • Data Provenance in ML Pipelines: Standard practice in MLOps for reproducibility

📄 Layer 3: Response Formatter (Constrained Generation)

🎯 What It Does

Templates define the exact structure of outputs, with placeholder slots for citations:

TEMPLATE = """## Performance Analysis (Confidence: {confidence.score})

### Overview
{citation:summary.total_positions}
{citation:summary.win_rate}
{citation:summary.profit_factor}

[Verified: {checksum}]"""

The formatter replaces {citation:summary.win_rate} with the exact citation string from Layer 2:

Win rate: 65.52% [Source: get_mt5_position_history.summary.win_rate]

✅ Why This Works

Principle: Reducing Degrees of Freedom

Hallucination occurs when LLMs have too much freedom. Consider:

| Approach | Degrees of Freedom | Hallucination Risk |
|---|---|---|
| "Analyze this data" | Unlimited | Very High |
| "Report the win rate" | High (format, precision, context) | High |
| "Copy this citation: Win rate: 65.52%" | Near Zero | Near Zero |

Templates eliminate structural decisions. The LLM doesn't choose what to report, in what order, with what formatting—the template specifies everything.

Principle: Slot-Filling vs. Generation

This follows the "skeleton-then-fill" paradigm from structured NLG (Natural Language Generation). The template is the skeleton; citations are the fill. The LLM's role is purely mechanical substitution.

Critical Implementation Rule:

class ResponseFormatter:
    """
    Critical Rule: NEVER calculates numbers. Only uses citations from
    _accuracy_report.metrics provided by the server.
    """

The formatter is explicitly prohibited from performing any computation. It can only copy existing citations.
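
A hedged sketch of what that mechanical substitution can look like: the fill method and regex-based slot syntax are illustrative assumptions built around the template shown above, not the author's exact formatter.

import re

class ResponseFormatter:
    """Never calculates numbers. Only copies citations from _accuracy_report."""

    def __init__(self, accuracy_report: dict):
        self.citations = {m["path"]: m["citation"] for m in accuracy_report["metrics"]}
        self.checksum = accuracy_report["checksum"]

    def fill(self, template: str, confidence: str) -> str:
        # Pure string substitution: each {citation:<path>} slot becomes the
        # exact citation string produced by the server in Layer 2.
        text = re.sub(r"\{citation:([\w.]+)\}",
                      lambda m: self.citations[m.group(1)],
                      template)
        return (text.replace("{confidence.score}", confidence)
                    .replace("{checksum}", self.checksum))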

Research Foundation

  • Constrained Decoding (Hokamp & Liu, 2017): Forcing outputs to satisfy constraints
  • Template-Based NLG (Reiter & Dale, 1997): Classical approach to reliable text generation
  • Structured Output Forcing: JSON mode, function calling schemas

🗄️ Layer 4: RAG Context (Semantic Grounding)

🎯 What It Does

A ChromaDB knowledge base stores static facts:

  • Strategy mappings (magic numbers → strategy names)
  • Trading rules and constraints
  • Domain-specific terminology

Before generating responses, the system retrieves relevant context:

# Query: "What strategy uses magic 106?"
# Returns: ["Magic number 106 is Goldfish Scalper trading XAUUSD"]

This context is injected into both the formatter and the validator.

✅ Why This Works

Principle: Not All Hallucinations Are Numerical

An LLM might correctly report "Win rate: 65.52%" but incorrectly attribute it to "Dark Dione strategy" when it's actually "Goldfish Scalper." This is a semantic hallucination—the number is right, but the entity relationship is wrong.

RAG grounds the LLM in factual knowledge about entities, preventing semantic errors.

Principle: Ephemeral Session Scope

kb = KnowledgeBase(ephemeral=True)  # Resets each MCP session
kb.load_static_rules()              # Loads known-good facts

The knowledge base is session-scoped to prevent stale data accumulation. Static rules (which don't change) are loaded fresh; dynamic trading statistics are always fetched live from MT5.
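
For illustration, a minimal sketch of what such a session-scoped knowledge base could sit on top of, using ChromaDB's in-memory client. The collection name and documents here are assumptions, and the ChromaDB API may differ slightly by version.

import chromadb

client = chromadb.EphemeralClient()                # in-memory; discarded when the session ends
rules = client.create_collection("trading_rules")  # hypothetical collection name

rules.add(
    ids=["magic_106"],
    documents=["Magic number 106 is Goldfish Scalper trading XAUUSD"],
)

hits = rules.query(query_texts=["What strategy uses magic 106?"], n_results=1)
print(hits["documents"][0])   # ['Magic number 106 is Goldfish Scalper trading XAUUSD']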

Principle: Context for Both Generator and Validator

The same RAG context is passed to:

  1. Formatter: To ground response generation
  2. Validator: To prevent false-positive hallucination flags

If the response says "Goldfish Scalper (Magic 106)" and the validator's context confirms this mapping, it won't incorrectly flag it as a hallucination.

Research Foundation

  • RAG (Lewis et al., 2020): The foundational retrieval-augmented generation paper
  • REALM (Guu et al., 2020): Retrieval-enhanced pre-training
  • In-Context Learning (Brown et al., 2020): GPT-3's ability to use context examples
  • Grounding in Dialogue Systems (Roller et al., 2020): Connecting responses to knowledge

✅ Layer 5: LLM Validation (Adversarial Verification)

🎯 What It Does

A second LLM (Novita AI) validates the drafted response against source data before delivery to the user:

validation_result = validate_with_llm(
    response_text=draft,      # What the LLM wants to say
    source_data=mcp_response, # Ground truth from server
    context=rag_context       # Knowledge base facts
)

The validator checks four rules:

  1. Zero Mental Math: All numbers match source exactly
  2. Anti-Aggregation: Raw values shown before averages
  3. Citation Requirement: Every number has [Source: ...]
  4. Checksum Verification: Response ends with correct [Verified: XXXX]
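
For illustration, here is a hedged sketch of how a validator call like this could be framed. The prompt wording and the call_llm helper are assumptions made for this example; the article does not publish the exact prompt or client code.

import json

def validate_with_llm(response_text: str, source_data: dict, context: list[str]) -> dict:
    context_block = "\n".join(context)
    prompt = (
        "You are a strict fact-checker. Compare the DRAFT against SOURCE.\n"
        "Rules: (1) every number must match SOURCE exactly; (2) raw values must\n"
        "appear before averages; (3) every number needs a [Source: ...] tag;\n"
        "(4) the draft must end with the correct [Verified: <checksum>] line.\n"
        "Be strict - any deviation from source is a hallucination.\n"
        'Respond with JSON: {"hallucinations_found": bool, "issues": [...]}\n\n'
        f"SOURCE:\n{json.dumps(source_data, indent=2)}\n\n"
        f"CONTEXT:\n{context_block}\n\n"
        f"DRAFT:\n{response_text}\n"
    )
    # call_llm is a hypothetical chat-completion wrapper (e.g. an OpenAI-compatible
    # client pointed at Novita AI) that is expected to return strict JSON.
    return json.loads(call_llm(prompt))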

✅ Why This Works

Principle: Verification is Easier Than Generation

This is a fundamental asymmetry in computational complexity. Consider:

  • Generation: "Analyze this data and write a report" (open-ended, creative)
  • Verification: "Does '65.52%' match the source value '65.52'?" (closed, deterministic)

The validator has a much simpler task: pattern matching and comparison. This makes it far less prone to hallucination than the generator.
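
To make that asymmetry concrete, here is a small illustration (not a component of the pipeline above, just a sketch) of how mechanical the comparison side of the job is in plain Python.

import re

def deterministic_checks(draft: str, accuracy_report: dict) -> list[str]:
    issues = []
    allowed = {str(m["value"]) for m in accuracy_report["metrics"]}

    # Every number in the draft must appear verbatim among the source metrics.
    # (A real check would also whitelist structural numbers like dates or list lengths.)
    for number in re.findall(r"\d+(?:\.\d+)?", draft):
        if number not in allowed:
            issues.append(f"Number '{number}' does not appear in source metrics")

    # The draft must end with the exact checksum the server committed to.
    if f"[Verified: {accuracy_report['checksum']}]" not in draft:
        issues.append("Missing or incorrect [Verified: ...] checksum line")

    return issues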

Principle: Adversarial Checking

This draws from:

  • Constitutional AI (Anthropic, 2022): Using AI to critique and improve AI outputs
  • Debate (Irving et al., 2018): Having models argue to expose weaknesses
  • Red-teaming: Standard security practice of adversarial testing

The validator is explicitly instructed to be strict:

Be strict - any deviation from source is a hallucination

Principle: Structured Error Output

The validator returns structured JSON with specific issue categorization:

{
    "hallucinations_found": true,
    "issues": [{
        "claim": "Win Rate: approximately 70%",
        "problem": "Source shows 65.52%, not 'approximately 70%'",
        "severity": "critical",
        "correct_value": "Win rate: 65.52% [Source: ...]",
        "rule_violated": "Zero Mental Math"
    }]
}

This enables automated correction in Layer 6.

Research Foundation

  • Constitutional AI (Bai et al., 2022): AI systems that critique themselves
  • Self-Consistency (Wang et al., 2022): Sampling multiple times and checking agreement
  • Fact Verification (Thorne et al., 2018): FEVER dataset and verification systems
  • LLM-as-Judge (Zheng et al., 2023): Using LLMs to evaluate LLM outputs

🔄 Layer 6: Auto-Retry (Iterative Refinement)

🎯 What It Does

When validation fails, the system automatically:

  1. Parses the validation errors
  2. Applies corrections to the draft
  3. Re-validates
  4. Repeats up to N times (default: 3)
for attempt in range(1, max_retries + 1):
    validation = validate_with_llm(narrative, source_data, context)

    if not validation["hallucinations_found"]:
        # Success! Return validated response
        return {"analysis": narrative, "_validation_meta": {"validated": True}}

    # Failed - apply corrections and retry
    narrative = corrector.apply_corrections(narrative, validation["issues"])

✅ Why This Works

Principle: Iterative Refinement

Self-refinement is a well-established technique for improving LLM outputs. The key insight is that correction is easier than generation—given specific feedback ("this number is wrong, it should be X"), the fix is mechanical.
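
A minimal sketch of what that mechanical correction step could look like, keyed off the claim and correct_value fields in the validator JSON shown above; the standalone function name is illustrative.

def apply_corrections(narrative: str, issues: list[dict]) -> str:
    # Each issue carries the exact offending claim and the exact replacement,
    # so the fix is string surgery rather than regeneration.
    for issue in issues:
        claim = issue.get("claim")
        correct = issue.get("correct_value")
        if claim and correct and claim in narrative:
            narrative = narrative.replace(claim, correct)
    return narrative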

Principle: Bounded Retry with Graceful Degradation

The system doesn't retry forever:

  • Fixable issues (wrong numbers): Auto-correct and retry
  • Unfixable issues (structural problems): Fail immediately with diagnostics
  • Max retries exceeded: Return error with last attempt for debugging
if not can_fix:
    return {
        "success": False,
        "error": "Validation failed with unfixable issues",
        "validation_issues": issues,
        "unfixable_reasons": reasons
    }

Principle: Convergence Guarantees

Because corrections are deterministic (replace X with Y) and the validator is consistent, the system converges. If the corrector properly applies all fixes, the next validation will pass. The retry loop guards against transient failures, not fundamental incompatibility.

Research Foundation

  • Self-Refine (Madaan et al., 2023): Iterative refinement with self-feedback
  • Reflexion (Shinn et al., 2023): Verbal reinforcement learning through self-reflection
  • Error Correction in Communication (Shannon, 1948): Fundamental information theory

🚨 Special Rule: Anti-Aggregation

⚠️ The Problem

Aggregation hides critical information. Consider:

❌ WRONG (Aggregation Hallucination):
Current lot sizes:
- EURUSD: 0.06 lots average (range: 0.05-0.08)

This hides the fact that the lot size recently scaled up 60% (0.05 → 0.08), a critical signal that risk management changed.

✅ The Solution

✅ CORRECT (Raw Data First):
Current lot sizes [Source: positions list]:
- EURUSD last 5: [0.05, 0.05, 0.05, 0.08, 0.07]
  → CURRENT: 0.07 lots
  → TREND: Scaled up 60% on Nov 24 (0.05 → 0.08)
  → Average: 0.06 lots (for reference only)

✅ Why This Works

Principle: Simpson's Paradox Awareness

Aggregates can reverse the apparent direction of relationships. A "stable average" can hide dramatic changes in underlying data. By requiring raw values first, we prevent this information loss.
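
A tiny, self-contained illustration using the lot sizes from the example above:

# The same average can describe flat risk or escalating risk.
flat       = [0.06, 0.06, 0.06, 0.06, 0.06]
escalating = [0.05, 0.05, 0.05, 0.08, 0.07]

assert round(sum(flat) / len(flat), 2) == round(sum(escalating) / len(escalating), 2) == 0.06
# Both report "0.06 lots average"; only the raw list reveals the 60% scale-up.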

Principle: Auditability

Scientific reporting standards require showing raw data. If you only report "average 0.06," readers cannot detect:

  • Outliers that skew the average
  • Trends (increasing/decreasing)
  • Distribution shape (uniform vs. bimodal)

Principle: Transparency Over Convenience

It's easier to report a single number. But the Anti-Aggregation Rule prioritizes transparency over convenience. The small cognitive cost of reading 5 raw values prevents potentially catastrophic misunderstandings.


🎯 Accuracy Indicators in Final Output

✅ What Indicates Accuracy

| Indicator | Location | Example | Why It Matters |
|---|---|---|---|
| Citation tags | After every number | [Source: tool.path] | Traceable provenance |
| Checksum | End of response | [Verified: A7B3C2D1] | Data integrity proof |
| Confidence score | Header | (Confidence: high) | Data quality signal |
| Validation metadata | Response field | "validated": true | System verification passed |
| No approximations | Absence | Never: "~", "about" | Zero Mental Math compliance |
| Raw values before aggregates | Data sections | last 5: [...] | Anti-Aggregation compliance |

🚩 Red Flags (Hallucination Indicators)

| Red Flag | Example | Rule Violated |
|---|---|---|
| Missing citation | Win rate: 65.52% | Citation Requirement |
| Approximation words | approximately 70% | Zero Mental Math |
| Missing checksum | No [Verified: XXXX] | Checksum Requirement |
| Rounded numbers | 70% vs 65.52% | Zero Mental Math |
| Averages without raw data | Average: 0.06 alone | Anti-Aggregation |
| Confidence missing | No confidence score | Incomplete output |

📖 Complete Example: End-to-End Pipeline

Step 1: User Request

"Analyze my trading performance for November"

Step 2: MCP Server Response (Layers 1-2)

{
    "success": true,
    "summary": {
        "total_positions": 29,
        "total_wins": 19,
        "total_losses": 10,
        "win_rate": 65.52,
        "profit_factor": 2.34,
        "total_pl": 1234.56
    },
    "_accuracy_report": {
        "checksum": "A7B3C2D1",
        "confidence": {"score": "high", "reason": "9/9 metrics, 29 positions"},
        "metrics": [
            {
                "path": "summary.win_rate",
                "value": 65.52,
                "citation": "Win rate: 65.52% [Source: get_mt5_position_history.summary.win_rate]"
            },
            {
                "path": "summary.profit_factor",
                "value": 2.34,
                "citation": "Profit factor: 2.34 [Source: get_mt5_position_history.summary.profit_factor]"
            }
        ]
    }
}

Step 3: Template Formatting (Layer 3)

## Performance Analysis (Confidence: high)

### Overview
- Total positions: 29 [Source: get_mt5_position_history.summary.total_positions]
- Win rate: 65.52% [Source: get_mt5_position_history.summary.win_rate]
- Profit factor: 2.34 [Source: get_mt5_position_history.summary.profit_factor]

### Financial Performance
- Total P&L: $1234.56 [Source: get_mt5_position_history.summary.total_pl]

[Verified: A7B3C2D1]

Step 4: RAG Context Retrieval (Layer 4)

Retrieved: "Magic number 106 is Goldfish Scalper trading XAUUSD"

Step 5: LLM Validation (Layer 5)

{
    "hallucinations_found": false,
    "checksum_valid": true,
    "summary": "All claims verified against source data"
}

Step 6: Final Response

{
    "success": true,
    "analysis": "## Performance Analysis (Confidence: high)\n\n...\n\n[Verified: A7B3C2D1]",
    "_validation_meta": {
        "validated": true,
        "attempts": 1,
        "model": "novita-default",
        "rag_context_used": true
    }
}
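
Putting it together, here is a hedged sketch of how the six layers could be chained into a single call. The function names mirror the snippets earlier in this article and are illustrative rather than the author's released code; in particular, kb.query is an assumed method on the KnowledgeBase wrapper.

def analyze_performance(max_retries: int = 3) -> dict:
    data = get_mt5_position_history()                         # Layers 1-2: metrics + _accuracy_report
    report = data["_accuracy_report"]
    context = kb.query("strategy mappings and trading rules")  # Layer 4: RAG grounding (assumed method)
    confidence = report.get("confidence", {}).get("score", "unknown")
    narrative = ResponseFormatter(report).fill(TEMPLATE, confidence)  # Layer 3: slot-filling

    for attempt in range(1, max_retries + 1):                 # Layers 5-6: validate, correct, retry
        validation = validate_with_llm(narrative, data, context)
        if not validation["hallucinations_found"]:
            return {"success": True, "analysis": narrative,
                    "_validation_meta": {"validated": True, "attempts": attempt}}
        narrative = apply_corrections(narrative, validation["issues"])

    return {"success": False, "error": "Validation failed after max retries",
            "last_attempt": narrative}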

📊 Summary: Why This Architecture Works

| Layer | Technique | What It Prevents | Key Insight |
|---|---|---|---|
| 1 | Server-side calculation | LLM arithmetic errors | Use the right tool for the job |
| 2 | Pre-formatted citations | LLM paraphrasing numbers | Reduce LLM to copy machine |
| 3 | Template-based output | Structural hallucination | Minimize degrees of freedom |
| 4 | RAG context grounding | Semantic hallucination | Ground entities in facts |
| 5 | Second LLM validation | Subtle errors slipping through | Verification < Generation |
| 6 | Auto-retry with correction | Transient failures | Iterative refinement converges |

💡 The Meta-Principle

Trust flows from deterministic systems to stochastic ones, never the reverse.

Python calculates → Server stores → Citations copy → Templates structure → Validator checks

At no point does an LLM "decide" a number. The LLM's role is purely mechanical: copying citations into template slots. This is the fundamental insight that makes 100% accuracy achievable.


📚 References

  1. Schick, T., et al. (2023). "Toolformer: Language Models Can Teach Themselves to Use Tools." arXiv:2302.04761
  2. Yao, S., et al. (2022). "ReAct: Synergizing Reasoning and Acting in Language Models." arXiv:2210.03629
  3. Lewis, P., et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS 2020
  4. Wang, X., et al. (2022). "Self-Consistency Improves Chain of Thought Reasoning in Language Models." arXiv:2203.11171
  5. Bai, Y., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." arXiv:2212.08073
  6. Madaan, A., et al. (2023). "Self-Refine: Iterative Refinement with Self-Feedback." arXiv:2303.17651
  7. Gao, L., et al. (2022). "PAL: Program-Aided Language Models." arXiv:2211.10435
  8. Nogueira, R., et al. (2021). "Investigating the Limitations of Transformers with Simple Arithmetic Tasks." arXiv:2102.13019
