chowderhead

Trust the Server, Not the LLM: A Deterministic Approach to LLM Accuracy

🚫 Zero Mental Math: An Anti-Hallucination Architecture for LLM-Driven Analysis

A six-layer system for achieving 100% accurate numerical reporting from Large Language Models

🎯 The problem

I built an MCP server that extracts data from my MT5 terminals on a VPS. Basically it's a load of financial reports: trades, averages, technical indicators, etc.

Once I built it all out, I realized that my LLM would randomly hallucinate details. For example, it would report a 16th trade when there had only been 15 trades that day.

When it comes to financial reporting, I realize there is probably a lot of prior work on this topic, so I grabbed some ideas from recent RAG research and threw something together.

I wrote tests that actually check the accuracy of my embedding results across 10 repeated runs, and each MCP tool scores 100% on end-to-end integration tests.

I had the AI summarize it. If anyone is curious about the exact code, maybe I can open source a repeatable process, but I'm hoping this article gives you everything you need.

(incoming AI-generated content)


📋 Abstract

Large Language Models (LLMs) are fundamentally pattern matchers, not calculators. When asked to analyze data, they generate "plausible-looking" numbers based on statistical patterns in training data—not deterministic computation. This is catastrophic for domains requiring precision, such as trading analysis, financial reporting, or medical diagnostics.

This document describes the Zero Mental Math Architecture, a multi-layered system that achieves accurate numerical reporting by shifting all computation to deterministic Python code and reducing the LLM to a "citation copy machine."


⚠️ The Core Problem

🤖 LLMs Hallucinate Numbers

Given raw trading data, an LLM will confidently state:

"Your win rate is approximately 70%"

...without performing any calculation. The model pattern-matched to a "reasonable-sounding" percentage. The actual win rate might be 65.52%, but the LLM has no mechanism to know this.

🧠 Why This Happens

LLMs predict the next token based on learned probability distributions. When they encounter a context suggesting a percentage is needed, they sample from the distribution of "percentages that appeared in similar contexts during training." This is fundamentally different from computation.

Research backing: work on the arithmetic capabilities of transformers (Nogueira et al., 2021) demonstrated that LLMs reliably fail at multi-digit arithmetic. The error rate increases with operand size and operation complexity. This isn't a bug to be fixed; it's an architectural limitation of attention-based sequence models.


🏗️ Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                    ZERO MENTAL MATH ARCHITECTURE                │
├─────────────────────────────────────────────────────────────────┤
│  LAYER 1: Fat MCP Server (Pre-Calculation)                      │
│  └── Shift ALL computation to deterministic Python              │
├─────────────────────────────────────────────────────────────────┤
│  LAYER 2: Accuracy Reports (Provenance Tracking)                │
│  └── Pre-formatted citations with integrity checksums           │
├─────────────────────────────────────────────────────────────────┤
│  LAYER 3: Response Formatter (Constrained Generation)           │
│  └── Template-based output with zero degrees of freedom         │
├─────────────────────────────────────────────────────────────────┤
│  LAYER 4: RAG Context (Semantic Grounding)                      │
│  └── Retrieval-augmented generation for entity resolution       │
├─────────────────────────────────────────────────────────────────┤
│  LAYER 5: LLM Validation (Adversarial Verification)             │
│  └── Second LLM fact-checks against source data                 │
├─────────────────────────────────────────────────────────────────┤
│  LAYER 6: Auto-Retry (Iterative Refinement)                     │
│  └── Automatic correction loop with convergence guarantees      │
└─────────────────────────────────────────────────────────────────┘

🔧 Layer 1: Fat MCP Server (Pre-Calculation)

📊 What It Does

The MCP (Model Context Protocol) server performs ALL numerical calculations before returning data to the LLM. The LLM never sees raw data that would require arithmetic.

# ❌ BAD: Raw data requires LLM to calculate
get_mt5_history_deals() → [deal1, deal2, deal3, ...]
# LLM must: count deals, group by position, sum P&L, calculate ratios

# ✅ GOOD: Pre-calculated metrics
get_mt5_position_history() → {
    "summary": {
        "total_positions": 29,      # Server counted
        "win_rate": 65.52,          # Server calculated: (19/29)*100
        "profit_factor": 2.34,      # Server calculated: sum(wins)/abs(sum(losses))
        "expectancy": 42.57         # Server calculated: total_pl/total_positions
    }
}

✅ Why This Works

Principle: Tool-Augmented LLMs

The insight from Meta's "Toolformer" (Schick et al., 2023) and the broader ReAct paradigm (Yao et al., 2022) is that LLMs should delegate to external tools for tasks they perform poorly. Arithmetic is the canonical example.

Principle: Separation of Concerns

Asking an LLM to calculate percentages is like asking a poet to do accounting. Language models are trained on text prediction, not numerical computation. By moving calculation to Python—a language designed for computation—we use each system for its strengths.

Principle: Determinism Over Stochasticity

Python's 19/29*100 = 65.517... is deterministic. Running it 1000 times yields identical results. An LLM's "calculation" is stochastic—it samples from a probability distribution, introducing variance even at temperature 0 (due to floating-point non-determinism in GPU operations).
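
For illustration, here is a minimal sketch of the kind of server-side pre-calculation Layer 1 describes. The position dictionary shape (a "profit" field) and the helper name are assumptions for this example, not the author's exact MCP server code; the formulas mirror the ones shown in the JSON above.

# Illustrative Layer-1 pre-calculation: every metric the LLM will report is
# computed here, deterministically. The "profit" field is an assumption for
# this sketch, not the exact MCP server schema.
def summarize_positions(positions: list[dict]) -> dict:
    wins = [p["profit"] for p in positions if p["profit"] > 0]
    losses = [p["profit"] for p in positions if p["profit"] <= 0]
    total = len(positions)
    total_pl = sum(p["profit"] for p in positions)
    return {
        "total_positions": total,
        "total_wins": len(wins),
        "total_losses": len(losses),
        "win_rate": round(len(wins) / total * 100, 2) if total else 0.0,
        "profit_factor": round(sum(wins) / abs(sum(losses)), 2) if sum(losses) else None,
        "expectancy": round(total_pl / total, 2) if total else 0.0,
        "total_pl": round(total_pl, 2),
    }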

Research Foundation

  • Toolformer (Schick et al., 2023): LLMs can learn to call APIs for tasks like calculation
  • Program-Aided Language Models (Gao et al., 2022): Offloading computation to code interpreters
  • Chain-of-Thought Arithmetic Failures (Wei et al., 2022): Even with step-by-step reasoning, LLMs make arithmetic errors

📝 Layer 2: Accuracy Reports (Provenance Tracking)

🎯 What It Does

Every tool response includes an _accuracy_report field containing:

  1. Pre-formatted citations — Complete sentences ready for copy-paste
  2. CRC32 checksum — Integrity fingerprint of all metric values (CRC32 is not cryptographic, but it is enough to detect copy errors)
  3. Confidence score — Data quality assessment
{
    "summary": { "win_rate": 65.52, "profit_factor": 2.34 },
    "_accuracy_report": {
        "checksum": "A7B3C2D1",
        "checksum_input": "29|19|10|65.52|1234.56|85.25|-42.15|2.34|42.57",
        "confidence": {
            "score": "high",
            "reason": "9/9 metrics populated, 29 positions analyzed"
        },
        "metrics": [
            {
                "path": "summary.win_rate",
                "value": 65.52,
                "citation": "Win rate: 65.52% [Source: get_mt5_position_history.summary.win_rate]"
            }
        ],
        "instructions": {
            "checksum_required": true,
            "format": "End analysis with: [Verified: A7B3C2D1]"
        }
    }
}

✅ Why This Works

Principle: The LLM as Copy Machine

The critical insight is that LLMs are excellent at copying text verbatim. By providing the exact citation string, we reduce the LLM's job from "interpret this number and write about it" to "copy this string into your response." The former invites hallucination; the latter is mechanical.

Principle: Verifiable Provenance

Every number in the output has a traceable source. This enables:

  • Automated verification: Scripts can check that reported values match source data
  • Human auditing: Readers can follow citations to verify claims
  • Debugging: When errors occur, the citation trail identifies the failure point

Principle: Checksums as Commitment Devices

The CRC32 checksum serves multiple purposes:

  1. Tamper detection: If any metric changes, the checksum changes
  2. Verification anchor: The [Verified: A7B3C2D1] at the end of output confirms the LLM used the correct source data
  3. Debugging aid: The checksum_input field shows the exact values used, enabling manual verification
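
For illustration, a minimal sketch of how such a report could be assembled with Python's zlib.crc32. The helper name, metric ordering, and citation wording are assumptions for this example, not the author's exact implementation.

import zlib

def build_accuracy_report(tool_name: str, summary: dict) -> dict:
    metrics, values = [], []
    for key, value in summary.items():
        values.append(str(value))
        metrics.append({
            "path": f"summary.{key}",
            "value": value,
            "citation": f"{key.replace('_', ' ').capitalize()}: {value} "
                        f"[Source: {tool_name}.summary.{key}]",
        })
    checksum_input = "|".join(values)                              # e.g. "29|19|10|65.52|..."
    checksum = format(zlib.crc32(checksum_input.encode()), "08X")  # CRC32: integrity check, not cryptographic
    return {
        "checksum": checksum,
        "checksum_input": checksum_input,
        "metrics": metrics,
        "instructions": {
            "checksum_required": True,
            "format": f"End analysis with: [Verified: {checksum}]",
        },
    }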

Research Foundation

  • Attribution in RAG Systems (Liu et al., 2023): Citation improves factual accuracy
  • Self-Consistency Checking (Wang et al., 2022): Multiple verification signals improve reliability
  • Data Provenance in ML Pipelines: Standard practice in MLOps for reproducibility

📄 Layer 3: Response Formatter (Constrained Generation)

🎯 What It Does

Templates define the exact structure of outputs, with placeholder slots for citations:

TEMPLATE = """## Performance Analysis (Confidence: {confidence.score})

### Overview
{citation:summary.total_positions}
{citation:summary.win_rate}
{citation:summary.profit_factor}

[Verified: {checksum}]"""

The formatter replaces {citation:summary.win_rate} with the exact citation string from Layer 2:

Win rate: 65.52% [Source: get_mt5_position_history.summary.win_rate]

✅ Why This Works

Principle: Reducing Degrees of Freedom

Hallucination occurs when LLMs have too much freedom. Consider:

| Approach | Degrees of Freedom | Hallucination Risk |
|---|---|---|
| "Analyze this data" | Unlimited | Very High |
| "Report the win rate" | High (format, precision, context) | High |
| "Copy this citation: Win rate: 65.52%" | Near Zero | Near Zero |

Templates eliminate structural decisions. The LLM doesn't choose what to report, in what order, with what formatting—the template specifies everything.

Principle: Slot-Filling vs. Generation

This follows the "skeleton-then-fill" paradigm from structured NLG (Natural Language Generation). The template is the skeleton; citations are the fill. The LLM's role is purely mechanical substitution.

Critical Implementation Rule:

class ResponseFormatter:
    """
    Critical Rule: NEVER calculates numbers. Only uses citations from
    _accuracy_report.metrics provided by the server.
    """

The formatter is explicitly prohibited from performing any computation. It can only copy existing citations.
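
A hedged sketch of what that mechanical substitution can look like: the fill method and regex-based slot syntax are illustrative assumptions built around the template shown above, not the author's exact formatter.

import re

class ResponseFormatter:
    """Never calculates numbers. Only copies citations from _accuracy_report."""

    def __init__(self, accuracy_report: dict):
        self.citations = {m["path"]: m["citation"] for m in accuracy_report["metrics"]}
        self.checksum = accuracy_report["checksum"]

    def fill(self, template: str, confidence: str) -> str:
        # Pure string substitution: each {citation:<path>} slot becomes the
        # exact citation string produced by the server in Layer 2.
        text = re.sub(r"\{citation:([\w.]+)\}",
                      lambda m: self.citations[m.group(1)],
                      template)
        return (text.replace("{confidence.score}", confidence)
                    .replace("{checksum}", self.checksum))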

Research Foundation

  • Constrained Decoding (Hokamp & Liu, 2017): Forcing outputs to satisfy constraints
  • Template-Based NLG (Reiter & Dale, 1997): Classical approach to reliable text generation
  • Structured Output Forcing: JSON mode, function calling schemas

🗄️ Layer 4: RAG Context (Semantic Grounding)

🎯 What It Does

A ChromaDB knowledge base stores static facts:

  • Strategy mappings (magic numbers → strategy names)
  • Trading rules and constraints
  • Domain-specific terminology

Before generating responses, the system retrieves relevant context:

# Query: "What strategy uses magic 106?"
# Returns: ["Magic number 106 is Goldfish Scalper trading XAUUSD"]

This context is injected into both the formatter and the validator.

✅ Why This Works

Principle: Not All Hallucinations Are Numerical

An LLM might correctly report "Win rate: 65.52%" but incorrectly attribute it to "Dark Dione strategy" when it's actually "Goldfish Scalper." This is a semantic hallucination—the number is right, but the entity relationship is wrong.

RAG grounds the LLM in factual knowledge about entities, preventing semantic errors.

Principle: Ephemeral Session Scope

kb = KnowledgeBase(ephemeral=True)  # Resets each MCP session
kb.load_static_rules()              # Loads known-good facts

The knowledge base is session-scoped to prevent stale data accumulation. Static rules (which don't change) are loaded fresh; dynamic trading statistics are always fetched live from MT5.
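
For illustration, a minimal sketch of what such a session-scoped knowledge base could sit on top of, using ChromaDB's in-memory client. The collection name and documents here are assumptions, and the ChromaDB API may differ slightly by version.

import chromadb

client = chromadb.EphemeralClient()                # in-memory; discarded when the session ends
rules = client.create_collection("trading_rules")  # hypothetical collection name

rules.add(
    ids=["magic_106"],
    documents=["Magic number 106 is Goldfish Scalper trading XAUUSD"],
)

hits = rules.query(query_texts=["What strategy uses magic 106?"], n_results=1)
print(hits["documents"][0])   # ['Magic number 106 is Goldfish Scalper trading XAUUSD']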

Principle: Context for Both Generator and Validator

The same RAG context is passed to:

  1. Formatter: To ground response generation
  2. Validator: To prevent false-positive hallucination flags

If the response says "Goldfish Scalper (Magic 106)" and the validator's context confirms this mapping, it won't incorrectly flag it as a hallucination.

Research Foundation

  • RAG (Lewis et al., 2020): The foundational retrieval-augmented generation paper
  • REALM (Guu et al., 2020): Retrieval-enhanced pre-training
  • In-Context Learning (Brown et al., 2020): GPT-3's ability to use context examples
  • Grounding in Dialogue Systems (Roller et al., 2020): Connecting responses to knowledge

✅ Layer 5: LLM Validation (Adversarial Verification)

🎯 What It Does

A second LLM (Novita AI) validates the drafted response against source data before delivery to the user:

validation_result = validate_with_llm(
    response_text=draft,      # What the LLM wants to say
    source_data=mcp_response, # Ground truth from server
    context=rag_context       # Knowledge base facts
)

The validator checks four rules:

  1. Zero Mental Math: All numbers match source exactly
  2. Anti-Aggregation: Raw values shown before averages
  3. Citation Requirement: Every number has [Source: ...]
  4. Checksum Verification: Response ends with correct [Verified: XXXX]
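
For illustration, here is a hedged sketch of how a validator call like this could be framed. The prompt wording and the call_llm helper are assumptions made for this example; the article does not publish the exact prompt or client code.

import json

def validate_with_llm(response_text: str, source_data: dict, context: list[str]) -> dict:
    context_block = "\n".join(context)
    prompt = (
        "You are a strict fact-checker. Compare the DRAFT against SOURCE.\n"
        "Rules: (1) every number must match SOURCE exactly; (2) raw values must\n"
        "appear before averages; (3) every number needs a [Source: ...] tag;\n"
        "(4) the draft must end with the correct [Verified: <checksum>] line.\n"
        "Be strict - any deviation from source is a hallucination.\n"
        'Respond with JSON: {"hallucinations_found": bool, "issues": [...]}\n\n'
        f"SOURCE:\n{json.dumps(source_data, indent=2)}\n\n"
        f"CONTEXT:\n{context_block}\n\n"
        f"DRAFT:\n{response_text}\n"
    )
    # call_llm is a hypothetical chat-completion wrapper (e.g. an OpenAI-compatible
    # client pointed at Novita AI) that is expected to return strict JSON.
    return json.loads(call_llm(prompt))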

✅ Why This Works

Principle: Verification is Easier Than Generation

This is a fundamental asymmetry in computational complexity. Consider:

  • Generation: "Analyze this data and write a report" (open-ended, creative)
  • Verification: "Does '65.52%' match the source value '65.52'?" (closed, deterministic)

The validator has a much simpler task: pattern matching and comparison. This makes it far less prone to hallucination than the generator.
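
To make that asymmetry concrete, here is a small illustration (not a component of the pipeline above, just a sketch) of how mechanical the comparison side of the job is in plain Python.

import re

def deterministic_checks(draft: str, accuracy_report: dict) -> list[str]:
    issues = []
    allowed = {str(m["value"]) for m in accuracy_report["metrics"]}

    # Every number in the draft must appear verbatim among the source metrics.
    # (A real check would also whitelist structural numbers like dates or list lengths.)
    for number in re.findall(r"\d+(?:\.\d+)?", draft):
        if number not in allowed:
            issues.append(f"Number '{number}' does not appear in source metrics")

    # The draft must end with the exact checksum the server committed to.
    if f"[Verified: {accuracy_report['checksum']}]" not in draft:
        issues.append("Missing or incorrect [Verified: ...] checksum line")

    return issues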

Principle: Adversarial Checking

This draws from:

  • Constitutional AI (Anthropic, 2022): Using AI to critique and improve AI outputs
  • Debate (Irving et al., 2018): Having models argue to expose weaknesses
  • Red-teaming: Standard security practice of adversarial testing

The validator is explicitly instructed to be strict:

Be strict - any deviation from source is a hallucination

Principle: Structured Error Output

The validator returns structured JSON with specific issue categorization:

{
    "hallucinations_found": true,
    "issues": [{
        "claim": "Win Rate: approximately 70%",
        "problem": "Source shows 65.52%, not 'approximately 70%'",
        "severity": "critical",
        "correct_value": "Win rate: 65.52% [Source: ...]",
        "rule_violated": "Zero Mental Math"
    }]
}

This enables automated correction in Layer 6.

Research Foundation

  • Constitutional AI (Bai et al., 2022): AI systems that critique themselves
  • Self-Consistency (Wang et al., 2022): Sampling multiple times and checking agreement
  • Fact Verification (Thorne et al., 2018): FEVER dataset and verification systems
  • LLM-as-Judge (Zheng et al., 2023): Using LLMs to evaluate LLM outputs

🔄 Layer 6: Auto-Retry (Iterative Refinement)

🎯 What It Does

When validation fails, the system automatically:

  1. Parses the validation errors
  2. Applies corrections to the draft
  3. Re-validates
  4. Repeats up to N times (default: 3)
for attempt in range(1, max_retries + 1):
    validation = validate_with_llm(narrative, source_data, context)

    if not validation["hallucinations_found"]:
        # Success! Return validated response
        return {"analysis": narrative, "_validation_meta": {"validated": True}}

    # Failed - apply corrections and retry
    narrative = corrector.apply_corrections(narrative, validation["issues"])

✅ Why This Works

Principle: Iterative Refinement

Self-refinement is a well-established technique for improving LLM outputs. The key insight is that correction is easier than generation—given specific feedback ("this number is wrong, it should be X"), the fix is mechanical.
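
A minimal sketch of what that mechanical correction step could look like, keyed off the claim and correct_value fields in the validator JSON shown above; the standalone function name is illustrative.

def apply_corrections(narrative: str, issues: list[dict]) -> str:
    # Each issue carries the exact offending claim and the exact replacement,
    # so the fix is string surgery rather than regeneration.
    for issue in issues:
        claim = issue.get("claim")
        correct = issue.get("correct_value")
        if claim and correct and claim in narrative:
            narrative = narrative.replace(claim, correct)
    return narrative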

Principle: Bounded Retry with Graceful Degradation

The system doesn't retry forever:

  • Fixable issues (wrong numbers): Auto-correct and retry
  • Unfixable issues (structural problems): Fail immediately with diagnostics
  • Max retries exceeded: Return error with last attempt for debugging
if not can_fix:
    return {
        "success": False,
        "error": "Validation failed with unfixable issues",
        "validation_issues": issues,
        "unfixable_reasons": reasons
    }

Principle: Convergence Guarantees

Because corrections are deterministic (replace X with Y) and the validator is consistent, the system converges. If the corrector properly applies all fixes, the next validation will pass. The retry loop guards against transient failures, not fundamental incompatibility.

Research Foundation

  • Self-Refine (Madaan et al., 2023): Iterative refinement with self-feedback
  • Reflexion (Shinn et al., 2023): Verbal reinforcement learning through self-reflection
  • Error Correction in Communication (Shannon, 1948): Fundamental information theory

🚨 Special Rule: Anti-Aggregation

⚠️ The Problem

Aggregation hides critical information. Consider:

❌ WRONG (Aggregation Hallucination):
Current lot sizes:
- EURUSD: 0.06 lots average (range: 0.05-0.08)

This hides the fact that the lot size recently scaled up 60% (0.05 → 0.08), a critical signal that risk management changed.

✅ The Solution

✅ CORRECT (Raw Data First):
Current lot sizes [Source: positions list]:
- EURUSD last 5: [0.05, 0.05, 0.05, 0.08, 0.07]
  → CURRENT: 0.07 lots
  → TREND: Scaled up 60% on Nov 24 (0.05 → 0.08)
  → Average: 0.06 lots (for reference only)

✅ Why This Works

Principle: Simpson's Paradox Awareness

Aggregates can reverse the apparent direction of relationships. A "stable average" can hide dramatic changes in underlying data. By requiring raw values first, we prevent this information loss.
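
A tiny, self-contained illustration using the lot sizes from the example above:

# The same average can describe flat risk or escalating risk.
flat       = [0.06, 0.06, 0.06, 0.06, 0.06]
escalating = [0.05, 0.05, 0.05, 0.08, 0.07]

assert round(sum(flat) / len(flat), 2) == round(sum(escalating) / len(escalating), 2) == 0.06
# Both report "0.06 lots average"; only the raw list reveals the 60% scale-up.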

Principle: Auditability

Scientific reporting standards require showing raw data. If you only report "average 0.06," readers cannot detect:

  • Outliers that skew the average
  • Trends (increasing/decreasing)
  • Distribution shape (uniform vs. bimodal)

Principle: Transparency Over Convenience

It's easier to report a single number. But the Anti-Aggregation Rule prioritizes transparency over convenience. The small cognitive cost of reading 5 raw values prevents potentially catastrophic misunderstandings.


🎯 Accuracy Indicators in Final Output

✅ What Indicates Accuracy

| Indicator | Location | Example | Why It Matters |
|---|---|---|---|
| Citation tags | After every number | [Source: tool.path] | Traceable provenance |
| Checksum | End of response | [Verified: A7B3C2D1] | Data integrity proof |
| Confidence score | Header | (Confidence: high) | Data quality signal |
| Validation metadata | Response field | "validated": true | System verification passed |
| No approximations | Absence | Never: "~", "about" | Zero Mental Math compliance |
| Raw values before aggregates | Data sections | last 5: [...] | Anti-Aggregation compliance |

🚩 Red Flags (Hallucination Indicators)

| Red Flag | Example | Rule Violated |
|---|---|---|
| Missing citation | Win rate: 65.52% | Citation Requirement |
| Approximation words | approximately 70% | Zero Mental Math |
| Missing checksum | No [Verified: XXXX] | Checksum Requirement |
| Rounded numbers | 70% vs 65.52% | Zero Mental Math |
| Averages without raw data | Average: 0.06 alone | Anti-Aggregation |
| Confidence missing | No confidence score | Incomplete output |

📖 Complete Example: End-to-End Pipeline

Step 1: User Request

"Analyze my trading performance for November"

Step 2: MCP Server Response (Layers 1-2)

{
    "success": true,
    "summary": {
        "total_positions": 29,
        "total_wins": 19,
        "total_losses": 10,
        "win_rate": 65.52,
        "profit_factor": 2.34,
        "total_pl": 1234.56
    },
    "_accuracy_report": {
        "checksum": "A7B3C2D1",
        "confidence": {"score": "high", "reason": "9/9 metrics, 29 positions"},
        "metrics": [
            {
                "path": "summary.win_rate",
                "value": 65.52,
                "citation": "Win rate: 65.52% [Source: get_mt5_position_history.summary.win_rate]"
            },
            {
                "path": "summary.profit_factor",
                "value": 2.34,
                "citation": "Profit factor: 2.34 [Source: get_mt5_position_history.summary.profit_factor]"
            }
        ]
    }
}

Step 3: Template Formatting (Layer 3)

## Performance Analysis (Confidence: high)

### Overview
- Total positions: 29 [Source: get_mt5_position_history.summary.total_positions]
- Win rate: 65.52% [Source: get_mt5_position_history.summary.win_rate]
- Profit factor: 2.34 [Source: get_mt5_position_history.summary.profit_factor]

### Financial Performance
- Total P&L: $1234.56 [Source: get_mt5_position_history.summary.total_pl]

[Verified: A7B3C2D1]

Step 4: RAG Context Retrieval (Layer 4)

Retrieved: "Magic number 106 is Goldfish Scalper trading XAUUSD"

Step 5: LLM Validation (Layer 5)

{
    "hallucinations_found": false,
    "checksum_valid": true,
    "summary": "All claims verified against source data"
}

Step 6: Final Response

{
    "success": true,
    "analysis": "## Performance Analysis (Confidence: high)\n\n...\n\n[Verified: A7B3C2D1]",
    "_validation_meta": {
        "validated": true,
        "attempts": 1,
        "model": "novita-default",
        "rag_context_used": true
    }
}
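
Putting it together, here is a hedged sketch of how the six layers could be chained into a single call. The function names mirror the snippets earlier in this article and are illustrative rather than the author's released code; in particular, kb.query is an assumed method on the KnowledgeBase wrapper.

def analyze_performance(max_retries: int = 3) -> dict:
    data = get_mt5_position_history()                         # Layers 1-2: metrics + _accuracy_report
    report = data["_accuracy_report"]
    context = kb.query("strategy mappings and trading rules")  # Layer 4: RAG grounding (assumed method)
    confidence = report.get("confidence", {}).get("score", "unknown")
    narrative = ResponseFormatter(report).fill(TEMPLATE, confidence)  # Layer 3: slot-filling

    for attempt in range(1, max_retries + 1):                 # Layers 5-6: validate, correct, retry
        validation = validate_with_llm(narrative, data, context)
        if not validation["hallucinations_found"]:
            return {"success": True, "analysis": narrative,
                    "_validation_meta": {"validated": True, "attempts": attempt}}
        narrative = apply_corrections(narrative, validation["issues"])

    return {"success": False, "error": "Validation failed after max retries",
            "last_attempt": narrative}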

📊 Summary: Why This Architecture Works

| Layer | Technique | What It Prevents | Key Insight |
|---|---|---|---|
| 1 | Server-side calculation | LLM arithmetic errors | Use the right tool for the job |
| 2 | Pre-formatted citations | LLM paraphrasing numbers | Reduce LLM to copy machine |
| 3 | Template-based output | Structural hallucination | Minimize degrees of freedom |
| 4 | RAG context grounding | Semantic hallucination | Ground entities in facts |
| 5 | Second LLM validation | Subtle errors slipping through | Verification < Generation |
| 6 | Auto-retry with correction | Transient failures | Iterative refinement converges |

💡 The Meta-Principle

Trust flows from deterministic systems to stochastic ones, never the reverse.

Python calculates → Server stores → Citations copy → Templates structure → Validator checks

At no point does an LLM "decide" a number. The LLM's role is purely mechanical: copying citations into template slots. This is the fundamental insight that makes 100% accuracy achievable.


📚 References

  1. Schick, T., et al. (2023). "Toolformer: Language Models Can Teach Themselves to Use Tools." arXiv:2302.04761
  2. Yao, S., et al. (2022). "ReAct: Synergizing Reasoning and Acting in Language Models." arXiv:2210.03629
  3. Lewis, P., et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS 2020
  4. Wang, X., et al. (2022). "Self-Consistency Improves Chain of Thought Reasoning in Language Models." arXiv:2203.11171
  5. Bai, Y., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." arXiv:2212.08073
  6. Madaan, A., et al. (2023). "Self-Refine: Iterative Refinement with Self-Feedback." arXiv:2303.17651
  7. Gao, L., et al. (2022). "PAL: Program-Aided Language Models." arXiv:2211.10435
  8. Nogueira, R., et al. (2021). "Investigating the Limitations of Transformers with Simple Arithmetic Tasks." arXiv:2102.13019
