LLM hallucinations are not a research curiosity. They are a production problem. The 2026 Stanford AI Index reported hallucination rates ranging from 22% to 94% across 26 leading LLMs, and even the best-performing model on Vectara's leaderboard (Gemini-2.0-Flash-001) still hallucinates 0.7% of the time on grounded summarization tasks. In a booking system that handles thousands of transactions daily, even a 1% hallucination rate means dozens of bookings with wrong prices, wrong dates, or wrong policies every single day.
This article walks through the validation pipeline we built that brought our production hallucination rate from approximately 4% (out of the box) to under 1%, and the architectural decisions that mattered most.
What "hallucination" means in a booking context
Before we get into the solution, it helps to be precise about what we are fixing. In a booking system, hallucinations break down into four categories:
Numerical hallucinations. The LLM states a price, a duration, a distance, or a count that does not match the source data. Example: the API returns $189/night, but the LLM tells the user "$179 per night." This is the most common and most damaging category.
Temporal hallucinations. Wrong times, wrong dates, wrong durations. Example: the API returns "departure 22:30 local time" and the LLM says "departure at 10:30 PM" (correct in 12-hour format) or "departure at 2:30 AM" (incorrect). The first is fine. The second is a hallucination.
Attribute hallucinations. The LLM invents amenities, policies, or features that are not in the source data. Example: claiming a hotel has a rooftop pool when the API returned no pool data, or stating a flight allows free cabin baggage when the fare class actually charges.
Citation hallucinations. The LLM references reviews, ratings, or sources that do not exist. Example: "According to recent guest reviews, the hotel scored 9.2 for cleanliness" when no review API was queried.
These categories matter because each requires a different detection strategy. A single validation pipeline that treats them uniformly will miss most of the failures.
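In code, we tag every detected violation with its category so the rates reported later in this article can be tracked separately. A minimal sketch (the ViolationCategory and Violation names are illustrative, not our exact production types):
from dataclasses import dataclass
from enum import Enum

class ViolationCategory(Enum):
    NUMERICAL = 'numerical'
    TEMPORAL = 'temporal'
    ATTRIBUTE = 'attribute'
    CITATION = 'citation'

@dataclass
class Violation:
    category: ViolationCategory
    claim: str              # the offending text from the LLM response
    expected: str | None    # the canonical value, if one exists

# Every check in the pipeline emits violation records, and hallucination
# rates are counted per category rather than as a single number.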
The architecture: separating reasoning from facts
The core principle: the LLM never generates facts. It only formats them.
              ┌─────────────────────┐
User query    │ LLM (reasoning)     │
───────────▶  │ Intent + Tool       │
              │ selection only      │
              └──────────┬──────────┘
                         │
                         ▼
              ┌─────────────────────┐
              │ Tool execution      │
              │ (real APIs)         │
              │ ───────────────     │
              │ • search_hotels     │
              │ • get_pricing       │
              │ • check_avail       │
              └──────────┬──────────┘
                         │
                         ▼
              ┌─────────────────────┐
              │ Structured data     │
              │ (canonical facts)   │
              └──────────┬──────────┘
                         │
                         ▼
              ┌─────────────────────┐
              │ LLM (formatter)     │
              │ Generates prose     │
              │ from facts only     │
              └──────────┬──────────┘
                         │
                         ▼
              ┌─────────────────────┐
              │ Validation pipeline │
              │ Checks LLM output   │
              │ against facts       │
              └──────────┬──────────┘
                         │
                         ▼
                    User reply
The LLM has two roles, executed in two separate calls:
- Reasoning call. Given the user's message, decide which tools to invoke and with what parameters. Output: a list of tool calls. No facts generated yet.
- Formatting call. Given the tool results (structured JSON), generate a natural language response. The system prompt explicitly forbids introducing any numerical, temporal, or factual claim not present in the tool output.
This separation alone reduced our hallucination rate from ~4% to ~2%. The remaining 2% came from the LLM still occasionally introducing claims during the formatting step. That is what the validation pipeline catches.
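A minimal sketch of the two-call flow. Here call_llm, REASONING_PROMPT, FORMATTING_PROMPT, TOOL_REGISTRY, and validate_or_fallback are illustrative stand-ins for your own LLM client, prompts, tool layer, and the validation gate described below, not our exact production code:
import json

async def answer(user_message: str) -> str:
    # Call 1: reasoning. The model only decides which tools to run and with
    # what parameters; it produces no user-facing facts.
    plan = json.loads(await call_llm(
        model='gpt-4o',
        prompt=REASONING_PROMPT.format(user_message=user_message),
        response_format={'type': 'json_object'},
        temperature=0,
    ))

    # Execute the chosen tools against real APIs. Their structured output
    # becomes the canonical fact set for everything that follows.
    facts = {}
    for call in plan['tool_calls']:
        facts[call['name']] = await TOOL_REGISTRY[call['name']](**call['args'])

    # Call 2: formatting. The model turns the structured facts into prose and
    # is explicitly forbidden from introducing claims that are not in `facts`.
    draft = await call_llm(
        model='gpt-4o',
        prompt=FORMATTING_PROMPT.format(facts_json=json.dumps(facts)),
    )

    # The validation pipeline (checks 1-3 below) gates the draft before it
    # reaches the user.
    return await validate_or_fallback(draft, facts)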
The validation pipeline
The pipeline runs every LLM-formatted response through three checks before sending it to the user.
Check 1: Numerical claim extraction and verification
Every number in the LLM's response must trace back to the structured tool output. We extract numbers using regex (with currency, time, and unit awareness), then verify each one against the canonical fact set.
import re
from decimal import Decimal

# Patterns for different numerical claim types
PRICE_PATTERN = re.compile(
    r'\$\s?(\d{1,3}(?:,\d{3})*(?:\.\d{1,2})?)|(\d+(?:\.\d{1,2})?)\s?(?:USD|dollars?)',
    re.IGNORECASE
)
TIME_PATTERN = re.compile(
    r'\b(\d{1,2}):(\d{2})\s?(AM|PM|am|pm)?\b'
)
PERCENT_PATTERN = re.compile(r'(\d+(?:\.\d+)?)\s?%')

def extract_numerical_claims(text: str) -> dict:
    """Extract all numerical claims from LLM-generated text."""
    claims = {
        'prices': [],
        'times': [],
        'percentages': [],
    }
    for match in PRICE_PATTERN.finditer(text):
        value = match.group(1) or match.group(2)
        claims['prices'].append(Decimal(value.replace(',', '')))
    for match in TIME_PATTERN.finditer(text):
        hour, minute, meridiem = match.groups()
        claims['times'].append({
            'raw': match.group(0),
            'hour': int(hour),
            'minute': int(minute),
            'meridiem': meridiem,
        })
    for match in PERCENT_PATTERN.finditer(text):
        claims['percentages'].append(Decimal(match.group(1)))
    return claims

def is_time_in_canonical(time_claim: dict, valid_times: list) -> bool:
    """Check a claimed time against canonical times stored as 'HH:MM' (24-hour).

    Minimal helper shown for completeness; adapt the comparison to however
    your canonical times are actually stored.
    """
    hour, minute = time_claim['hour'], time_claim['minute']
    meridiem = (time_claim['meridiem'] or '').lower()
    if meridiem == 'pm' and hour != 12:
        hour += 12
    elif meridiem == 'am' and hour == 12:
        hour = 0
    candidates = {f'{hour:02d}:{minute:02d}'}
    if not meridiem and 1 <= hour <= 12:
        # No AM/PM given: accept either 12-hour reading
        candidates.add(f'{(hour % 12) + 12:02d}:{minute:02d}')
    return any(t in candidates for t in valid_times)

def validate_claims(claims: dict, canonical_facts: dict) -> list:
    """Return list of unsupported claims."""
    violations = []
    # Build lookup sets from canonical facts
    valid_prices = {Decimal(str(p)) for p in canonical_facts.get('prices', [])}
    valid_times = canonical_facts.get('times', [])
    for price in claims['prices']:
        if price not in valid_prices:
            # Allow small rounding (±$0.01)
            if not any(abs(price - vp) <= Decimal('0.01') for vp in valid_prices):
                violations.append(('price', price, valid_prices))
    for time_claim in claims['times']:
        if not is_time_in_canonical(time_claim, valid_times):
            violations.append(('time', time_claim, valid_times))
    return violations
If any claim fails verification, the response is rejected and the LLM is asked to regenerate with stricter instructions. After two failed attempts, the system falls back to a templated response built directly from the structured data.
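A sketch of that gate, shown here with just the numerical check from above (checks 2 and 3 below feed the same loop); STRICT_FORMATTING_PROMPT and render_template are illustrative names, not our exact production code:
import json

MAX_RETRIES = 2  # regeneration attempts before falling back to a template

async def validate_or_fallback(draft: str, facts: dict) -> str:
    """Reject unsupported claims, regenerate with stricter instructions, then fall back."""
    for attempt in range(MAX_RETRIES + 1):
        violations = validate_claims(extract_numerical_claims(draft), facts)
        if not violations:
            return draft
        if attempt < MAX_RETRIES:
            # Regenerate, spelling out exactly which claims were unsupported
            draft = await call_llm(
                model='gpt-4o',
                prompt=STRICT_FORMATTING_PROMPT.format(
                    facts_json=json.dumps(facts),
                    violations=str(violations),
                ),
                temperature=0,
            )
    # Two failed regenerations: build the reply directly from structured data
    return render_template(facts)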
Check 2: Attribute grounding via embedding similarity
Numerical extraction does not catch attribute hallucinations ("the hotel has a rooftop pool"). For these, we use a different approach: embedding-based grounding.
For every property the LLM mentions in its response, we have a canonical attribute list from the API. We compute embeddings for both the LLM's claims (extracted as noun phrases) and the canonical attributes, then check that every claim has a high-similarity match in the canonical set.
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer('all-MiniLM-L6-v2')

def extract_noun_phrases(text: str) -> list:
    """Extract attribute claims as noun phrases using a lightweight NLP pipeline."""
    # Simplified: loads spaCy on every call; cache the pipeline at module
    # level in production
    import spacy
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
    return [chunk.text for chunk in doc.noun_chunks]

def validate_attributes(
    llm_text: str,
    canonical_attributes: list[str],
    threshold: float = 0.7
) -> list:
    """Flag attributes claimed in LLM text that don't match canonical data."""
    claimed_phrases = extract_noun_phrases(llm_text)
    if not claimed_phrases or not canonical_attributes:
        return []
    claim_embeddings = embedder.encode(claimed_phrases)
    canonical_embeddings = embedder.encode(canonical_attributes)
    violations = []
    for phrase, claim_emb in zip(claimed_phrases, claim_embeddings):
        # Cosine similarity between this claim and every canonical attribute
        similarities = np.dot(canonical_embeddings, claim_emb) / (
            np.linalg.norm(canonical_embeddings, axis=1) *
            np.linalg.norm(claim_emb)
        )
        if similarities.max() < threshold:
            violations.append({
                'claim': phrase,
                'best_match_score': float(similarities.max()),
                'best_match': canonical_attributes[similarities.argmax()],
            })
    return violations
This catches statements like "the hotel offers a fitness center and spa" when the API only returned "fitness center" (no spa). The threshold of 0.7 was tuned empirically. Lower thresholds let too many attribute hallucinations through. Higher thresholds caused false positives on legitimate paraphrases.
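A quick usage sketch with made-up data (the attribute list and candidate sentence are purely illustrative):
canonical = ['fitness center', 'free wifi', '24-hour front desk']
candidate = "This property has a rooftop pool and a fitness center."

# Any noun phrase whose best similarity against the canonical attributes
# falls below the 0.7 threshold comes back as a violation.
for v in validate_attributes(candidate, canonical):
    print(f"unsupported: {v['claim']!r} "
          f"(closest: {v['best_match']!r}, score {v['best_match_score']:.2f})")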
Check 3: Self-consistency via second LLM pass
For complex multi-sentence responses, regex and embeddings miss subtle errors. The third check uses a second LLM call (with a smaller, cheaper model) configured as a fact-checker.
FACT_CHECK_PROMPT = """You are a strict fact-checker. Given:
1. A SOURCE DATA object (the only ground truth)
2. A CANDIDATE RESPONSE (what we want to send to a user)
Your job: identify any claim in the CANDIDATE that is NOT supported by the SOURCE.
Output JSON only:
{
"supported": true/false,
"violations": [
{"claim": "exact text from candidate", "reason": "why not supported"}
]
}
Do not flag stylistic differences or paraphrasing. Only flag factual claims that contradict or are not present in SOURCE.
SOURCE DATA:
{source_json}
CANDIDATE RESPONSE:
{response_text}
"""
async def fact_check(response_text: str, source_data: dict) -> dict:
result = await call_llm(
model='gpt-4o-mini', # cheaper model for fact-checking
prompt=FACT_CHECK_PROMPT.format(
source_json=json.dumps(source_data, indent=2),
response_text=response_text,
),
response_format={'type': 'json_object'},
temperature=0, # deterministic fact-checking
)
return json.loads(result)
This catches hallucinations that are linguistically subtle. Example: the source data says "free cancellation until 24 hours before check-in" and the LLM writes "you can cancel free of charge anytime before your stay." The numerical extractor sees nothing wrong (no numbers conflict). The embedding check thinks it is similar enough. But the fact-checker catches that "anytime" contradicts "until 24 hours before."
The fact-checker call adds 200 to 400 ms of latency. We run it concurrently with the other two checks and block the user-facing response only when it flags violations.
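Putting the three checks together, the orchestration looks roughly like this (a sketch; run_checks is an illustrative name, and the earlier validate_or_fallback gate would call it in place of the single numerical check):
import asyncio

async def run_checks(draft: str, facts: dict) -> list:
    """Run all three checks and collect violations; only check 3 is an LLM call."""
    # Start the slow fact-checker first so it overlaps with the local checks
    fact_check_task = asyncio.create_task(fact_check(draft, facts))

    violations = []
    violations += validate_claims(extract_numerical_claims(draft), facts)
    violations += validate_attributes(draft, facts.get('attributes', []))

    verdict = await fact_check_task
    if not verdict.get('supported', True):
        violations += verdict.get('violations', [])
    return violations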
Results from production
After 60 days of running this pipeline in production:
- Numerical hallucinations: dropped from ~2.4% to 0.3%
- Temporal hallucinations: dropped from ~0.8% to 0.1%
- Attribute hallucinations: dropped from ~1.2% to 0.4%
- Citation hallucinations: dropped to 0% (we eliminated review/rating citations from the formatter prompt entirely)
Combined hallucination rate: under 1% across all conversation turns. The 0.8% to 0.9% that remains is dominated by edge cases in attribute paraphrasing (e.g., the LLM saying "ocean view" when the source said "sea view"). These are technically inaccurate but rarely cause booking errors.
Things that did not work
A few approaches we tried and abandoned:
- Higher temperature with self-correction prompts. Asking the LLM "are you sure?" or "double-check your facts" before responding reduced hallucinations slightly but was inconsistent. Google's December 2024 research found that asking "are you hallucinating right now?" reduces hallucination rates by 17%, but the effect diminishes after 5-7 interactions. We confirmed this empirically.
- Fine-tuning on our domain data. Improved phrasing and tone but did not reliably reduce hallucinations. The model still invented prices and times confidently. Fine-tuning addresses style, not factual grounding.
- Constraining output to JSON-only. Useful for tool calls but unsuitable for user-facing responses. Travelers want natural language, not structured data.
- Reasoning models for everything. Paradoxically, OpenAI's o3 reasoning model hallucinated 33% of the time on PersonQA, double the rate of its predecessor o1, and the smaller o4-mini performed even worse at 48%. Reasoning models excel at analysis but introduce more hallucinations on factual tasks. We use reasoning mode only for intent parsing, never for response generation.
Practical takeaways
If you are building any LLM-powered system where factual accuracy matters (booking, customer service, financial advisory, healthcare assistance), these are the architectural rules that mattered most for us:
- Two LLM calls, not one. Reasoning and formatting are different jobs. Mixing them in a single call lets the LLM hallucinate facts while reasoning.
- Structured ground truth as the only fact source. If a claim is not in the structured tool output, it cannot be in the response. Period.
- Validation as a hard gate. Logging hallucinations is not enough. Block them from reaching the user, even at the cost of latency and occasional templated responses.
- Layer your detection. No single check (regex, embeddings, fact-checker) catches everything. Layered checks with different signal types catch different failure modes.
- Measure rates by category. "Hallucination rate" as a single number is not actionable. Numerical, temporal, attribute, and citation hallucinations have different costs and different solutions.
The fundamental insight is that hallucination is not a model problem. It is a system design problem. The Stanford AI Index 2026 calls for engineers to "treat hallucination rates as design inputs rather than bugs to be ignored". That has been our experience exactly. Pick a baseline model with reasonable performance, then build the pipeline around it that enforces grounding.
This pattern was developed across several production builds at Adamo Software, including AI travel assistants and AI chatbots for travel booking where booking accuracy is non-negotiable.