Lana Meleshkina
Why your LLM product hallucinates the one thing it shouldn't, and the architectural pattern that fixes it

A woman forwards a conversation with her boyfriend to my AI bot. The model detects danger signals (emotional abuse, isolation tactics) and responds with a crisis hotline number. Caring. Responsible. One problem: it's a children's hotline. The model hallucinated a crisis contact for an adult in distress.

The prompt says "DO NOT invent contact information." Doesn't matter. The model's drive to be helpful is stronger than any instruction. This is not a prompting problem. This is an architecture problem.

The single-pass trap

The typical LLM product architecture: user input goes into the model, model output goes to the user. If you need the model to both analyze input and present the result in a specific voice, tone, or format, both jobs go into one prompt.

This is where things break.

Analysis demands precision and structure. Voice demands freedom and empathy. These are conflicting objectives competing for the same token budget. The result: the model weaves hallucinations into convincing prose, and you cannot programmatically tell which parts are real.

First instinct: throw a stronger model at it. Reasoning models, higher token budgets. I tested this on my own cases. Expensive models hallucinate less often, but the cost adds up fast. Cheaper reasoning models still hallucinate when the prompt demands complex behavior. On top of that, on complex cases with multiple participants and layered subtext, even the best models confuse characters, misattribute intent, and fill in gaps that don't exist. The problem isn't model capability. The problem is asking one completion to do two fundamentally different jobs.

This generalizes far beyond chatbots. Any LLM product where the model simultaneously decides what to say and how to say it, while touching data the user will trust, has this vulnerability. Prices, legal references, medical dosages, addresses, phone numbers. If the model can invent it and the user has no reason to doubt it, you're sitting on a powder keg.

The pattern: Triage-and-Voice

The fix is architectural: split the LLM work into two passes, with your backend as the decision layer in between.

Pass 1 — Triage. A model performs structured analysis. No voice, no personality, no user-facing text. Pure signal extraction into a machine-readable format (JSON). What is happening? Who are the participants? What are their goals? What are the risks? What category does this fall into?
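To make the "machine-readable format" concrete, here is a minimal sketch of what a triage contract might look like. The field names (`situation`, `risk_flags`, `crisis`, and so on) are illustrative assumptions, not the actual schema from the product:

```python
from dataclasses import dataclass
import json

# Illustrative triage schema -- field names are assumptions for this
# sketch, not the author's actual contract. Pass 1 emits only
# machine-readable fields like these: no voice, no user-facing prose.
@dataclass
class TriageResult:
    situation: str      # one-line factual summary of what is happening
    participants: list  # who appears in the conversation
    risk_flags: list    # e.g. ["emotional_abuse", "isolation"]
    crisis: bool        # the backend gate keys off this flag
    category: str       # routing category for business rules

def parse_triage(raw_json: str) -> TriageResult:
    """Validate the model's JSON before any routing decision runs."""
    data = json.loads(raw_json)
    return TriageResult(
        situation=data["situation"],
        participants=list(data["participants"]),
        risk_flags=list(data.get("risk_flags", [])),
        crisis=bool(data.get("crisis", False)),
        category=data.get("category", "general"),
    )
```

Parsing into a typed structure (instead of passing raw model text forward) is what makes the next step possible: code can branch on `crisis` or `risk_flags`, which it could never do reliably on prose.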

Backend gate. Your code inspects the triage output and makes routing decisions. Crisis detected → inject verified data, switch to a crisis-specific prompt. Business rule triggered → modify what the next pass receives. This is where reliability is born: in deterministic code after probabilistic generation.
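A gate like this can be a plain function. Everything below is a hedged sketch: the contact string, prompt names, and dict shape are placeholders, and real crisis contacts would come from your database, not a module-level constant:

```python
# Hypothetical backend gate: deterministic routing between the two
# passes. VERIFIED_CONTACTS and the prompt names are placeholders;
# in production the contact would be loaded from a verified database.
VERIFIED_CONTACTS = {
    "adult_crisis": "Adult crisis line (verified, from DB)",
}

def gate(triage: dict) -> dict:
    """Inspect Pass 1 output and build the context Pass 2 receives."""
    if triage.get("crisis"):
        return {
            "prompt": "crisis_voice_prompt",  # crisis-specific system prompt
            "verified_contact": VERIFIED_CONTACTS["adult_crisis"],
            "triage": triage,
        }
    return {
        "prompt": "default_voice_prompt",
        "verified_contact": None,
        "triage": triage,
    }
```

Because this is ordinary code, it is unit-testable: you can assert that a crisis flag always routes to the crisis prompt, something no amount of prompt wording can guarantee.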

Pass 2 — Voice. A model takes the triage output, filtered and enriched by your backend, and produces the user-facing response in the appropriate voice, tone, and format. It does not analyze. It does not decide. It articulates.

[Diagram: Triage-and-Voice]

The model never touches critical data directly. In Pass 1, it sets a flag. The backend acts on the flag: it injects verified contacts from the database and switches to a crisis-specific prompt. The model in Pass 2 doesn't invent a hotline number; its job is to convey the seriousness of the situation and point the user to the contact the backend provided. Dozens of crisis-case runs on DeepSeek V3.2: zero hallucinated contact data.
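One way to enforce "point to the contact the backend provided" is to assemble the Pass 2 prompt in code, so the verified contact enters the prompt verbatim and only when the gate supplied it. This is a sketch under assumed names (`ctx` matching the gate output above is an assumption of this example):

```python
# Hypothetical Pass 2 prompt assembly. The ctx dict shape and the
# instruction wording are illustrative, not the product's real prompts.
def build_voice_prompt(ctx: dict, persona: str) -> str:
    """Compose the Voice prompt; the model articulates, it does not decide."""
    parts = [
        f"You are {persona}. Rephrase the analysis below for the user.",
        "Do not analyze. Do not add facts.",
        f"Analysis: {ctx['triage']}",
    ]
    if ctx.get("verified_contact"):
        parts.append(
            "Include this contact VERBATIM; it is the only contact "
            f"you may mention: {ctx['verified_contact']}"
        )
    return "\n".join(parts)
```

The key design choice: the contact string is interpolated by code, so you can also check the final model output for it programmatically, and retry or fall back if it was dropped.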

Why this works: three properties

Separation of concerns. The analyst model doesn't spend tokens on style. The voice model doesn't spend tokens on analysis. Each pass does one thing. Quality goes up on both axes. I measured this across 40 evaluation cases.

Cacheable analysis. My product is a Telegram bot that analyzes conversations through the lenses of different AI personas. The user wants to see their situation from multiple angles. Triage runs once. On a follow-up request, only the Voice pass reruns. Cheaper, faster, and the user gets a second perspective in seconds instead of waiting for full reanalysis. A single pass with a character voice used to take 50–90 seconds. With the split: first response 30–45 seconds, subsequent ones 15–20.
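The caching itself can be as simple as keying the triage result on a hash of the conversation. A minimal sketch, assuming `run_triage` is whatever function wraps your Pass 1 LLM call (an in-memory dict stands in for a real cache like Redis):

```python
import hashlib

# In-memory stand-in for a real cache (Redis, DB row, etc.).
_triage_cache: dict = {}

def cached_triage(conversation: str, run_triage) -> dict:
    """Run Pass 1 once per conversation; later personas reuse the result."""
    key = hashlib.sha256(conversation.encode("utf-8")).hexdigest()
    if key not in _triage_cache:
        _triage_cache[key] = run_triage(conversation)  # expensive LLM call
    return _triage_cache[key]
```

A follow-up persona request then calls only the Voice pass against the cached triage, which is where the 50–90 seconds collapsing to 15–20 comes from.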

Backend-gated safety. The critical insight: between the two passes sits deterministic code. Not another prompt. Not a guardrail model. Code that you control, that you can test, that behaves the same way every time. Crisis contacts come from your database. The model's job is to know when. Your backend's job is to know what.

Where Triage-and-Voice applies

This is not just a chatbot pattern. It's an LLM product architecture pattern. It applies whenever:

  • The model must analyze input AND produce user-facing output
  • The output contains data the user will trust
  • You need to handle edge cases (crisis, legal, financial) differently from the happy path
  • You want to vary presentation without re-running analysis
  • You need a deterministic checkpoint between "what the model thinks" and "what the user sees"

The principle: the model flags, the code decides, the model speaks. Every time you catch yourself writing "DO NOT hallucinate [critical data]" in a prompt, that's your signal to split the processing.
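The whole principle fits in a few lines of orchestration. In this hypothetical pipeline, `triage_llm` and `voice_llm` stand in for your two LLM calls; the contact string is a placeholder for a database lookup:

```python
# Illustrative pipeline: the model flags, the code decides, the model
# speaks. triage_llm and voice_llm are stand-ins for the two LLM calls.
def respond(user_input: str, triage_llm, voice_llm) -> str:
    flags = triage_llm(user_input)            # Pass 1: model flags
    if flags.get("crisis"):                   # gate: code decides
        flags["contact"] = "verified contact from DB"
    return voice_llm(flags)                   # Pass 2: model speaks
```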

The uncomfortable truth about naive LLM product architecture

If your LLM product makes a single call where the model both reasons about the input and produces the final output, you are running analysis and presentation as one atomic operation with no checkpoint in between.

You cannot validate what the model "decided" before the user sees it. You cannot cache the analytical work separately from the presentation. You cannot route edge cases to different handling logic. You cannot inject verified data at the right moment.

You are trusting a probabilistic system to be right about the one thing that matters most.

Triage-and-Voice is not about adding complexity. It's about adding a point of control between what the model thinks and what the user receives. One additional LLM call. One backend gate. The difference between "the model said it" and "we verified it."


I derived the Triage-and-Voice pattern from real incidents while building "Between the Lines" (three Telegram bots for AI conversation analysis).
