Akshay Kumar BM

Why Your RAG Chatbot Looks Great in Week 1 and Hallucinates by Month 2

💡 Week 1 demo → "this is amazing."

Month 2 production → "why is it hallucinating?"

I've seen this pattern more times than I can count. The team builds a RAG chatbot. It works beautifully on the 20 questions they tested it with. They ship it. Real users show up with real questions. The cracks appear.

The model is almost never the problem.

The system around it is.

After shipping 10+ production RAG systems — handling everything from 100 queries a day to 1,000+ a week at 95% accuracy — I've narrowed it down to four things that consistently separate the ones that keep working from the ones that quietly get abandoned.

Here's what I follow on every build now.


🔍 Why RAG Fails in Production (It's Not the Model)

Most teams build RAG like this:

→ Grab the documents
→ Chunk and embed them
→ Add a retriever
→ Write a prompt
→ Ship it

It works in the demo because the demo is a controlled environment. You know the questions. You've tested on a small, clean set of docs. The model looks smart.

Production breaks this in three ways.

Real users ask questions the team never anticipated. The retriever pulls the wrong chunks. The model doesn't know it retrieved the wrong chunks. It answers anyway.

The knowledge base is messier than it looked. Confluence pages that conflict with each other. Slack threads that reference decisions never written down. Google Docs with five versions. The model retrieves all of it and synthesizes a "confident average" — which is a hallucination with citations.

There's no feedback loop. Nobody knows which answers are wrong until a user complains, by which point the damage is done.

Research confirms this isn't a fringe problem: 40–60% of RAG implementations never make it to production at all. Of the ones that do, naive RAG — embed, retrieve top-k, prompt — consistently plateaus at 70–80% retrieval precision without additional structure around it.

The fix isn't a better model. It's building the system properly.


🧠 Rule 1: Evals Before Prompts

This is the highest-leverage thing I do on every RAG project, and it's the thing most teams skip.

Before writing a single prompt, build a test set of 30–40 real questions with expected answers. Run every prompt change — every chunk size tweak, every retriever adjustment — against the full set.

Without this, you're optimizing blind. You fix one question and break three others without knowing it.

# eval_set.py — structure for a minimal RAG eval set

eval_set = [
    {
        "question": "What is our refund policy for enterprise customers?",
        "expected_answer": "Enterprise customers get a 30-day money-back guarantee.",
        "source_doc": "enterprise-policy-v3.md",
        "category": "policy"
    },
    {
        "question": "Who do I contact for billing issues?",
        "expected_answer": "billing@company.com or the #billing Slack channel.",
        "source_doc": "support-contacts.md",
        "category": "contacts"
    },
    # ... 30-40 total, covering every major topic area
]

def run_eval(rag_chain, eval_set):
    results = []
    for item in eval_set:
        response = rag_chain.invoke(item["question"])
        results.append({
            "question": item["question"],
            "expected": item["expected_answer"],
            "actual": response["answer"],
            "sources_retrieved": response["source_documents"],
        })
    return results

📌 Pro tip: Pull the questions from real user queries where possible. If you have a Slack channel where employees ask the kind of questions the bot will answer, mine it for the first 30 questions. Real questions are always harder than invented ones.

The eval set also tells you when you're done. "Good enough" stops being subjective — either it passes 90%+ of cases or it doesn't ship.
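To make that bar concrete, here's a minimal sketch of scoring the run_eval() output against the 90% gate. The substring check is a deliberate stand-in, not how I'd grade nuanced answers (an LLM judge or RAGAS does that better); it's just enough to show the ship/no-ship decision.

# score_eval.py: turn run_eval() results into a ship/no-ship gate

def passes(expected, actual):
    # Stand-in judge: case-insensitive substring match.
    # Swap in an LLM judge or RAGAS metrics for real grading.
    return expected.strip().lower() in actual.strip().lower()

def pass_rate(results, ship_bar=0.90):
    passed = sum(1 for r in results if passes(r["expected"], r["actual"]))
    rate = passed / len(results)
    print(f"{passed}/{len(results)} passed ({rate:.0%}), ship bar is {ship_bar:.0%}")
    return rate >= ship_bar

# results = run_eval(rag_chain, eval_set)
# if not pass_rate(results):
#     raise SystemExit("Eval gate failed: don't ship this change")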

Worth noting: in 2026, 60% of RAG deployments include systematic evaluation from day one, up from under 30% in early 2025. The teams that skipped it are the ones rebuilding their systems now.


🛠 Rule 2: One Source of Truth

Most RAG failures I've traced start here — not in the model.

The knowledge is scattered. Confluence has the official policy. Someone updated it in a Google Doc but didn't update Confluence. There's a Slack thread where a manager made a call that contradicts both. The RAG system indexes all three.

The model retrieves two conflicting chunks. It can't tell which is authoritative. It synthesizes a confident-sounding answer that blends both — and matches neither.

No retriever in the world solves this. It's a data problem, not a model problem.

Step 1 of every RAG project I touch is not building the bot. It's forcing a canonical source:

→ Pick one system as ground truth per knowledge domain (Confluence for policy, Notion for specs, etc.)
→ Archive or redirect any conflicting copies — if it's not in the canonical source, it doesn't exist for the bot
→ Set an update SLA: if a policy changes, the canonical page gets updated within 24 hours

graph TD
    A[User Query] --> B[Retriever]
    B --> C{Single canonical source?}
    C -->|Yes| D[Clean, consistent chunks]
    C -->|No| E[Conflicting chunks from 3 systems]
    D --> F[Faithful, accurate answer]
    E --> G[Confident hallucination]

This sounds like busywork. It's actually the most important infrastructure decision in the project.
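Here's roughly what enforcing "if it's not in the canonical source, it doesn't exist" looks like at ingestion time. This is a sketch, not a drop-in: the domain and source_system metadata fields, and the mapping itself, are assumptions you'd adapt to your own loaders.

# canonical_filter.py: only index documents from the canonical system for their domain

CANONICAL_SOURCES = {
    "policy": "confluence",   # illustrative mapping, pick one system per domain
    "specs": "notion",
}

def keep_canonical(docs):
    kept = []
    for doc in docs:
        domain = doc.metadata.get("domain")          # assumes your loaders tag these fields
        system = doc.metadata.get("source_system")
        if CANONICAL_SOURCES.get(domain) == system:
            kept.append(doc)
        # Everything else is skipped before chunking and embedding,
        # so the bot never sees the stale Google Doc or the Slack thread
    return kept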

Research backs it up: 60% of enterprise RAG projects fail not because of poor retrieval, but because they can't maintain data freshness at scale. One study found that when given unvetted baseline data, models fabricated responses for 52% of out-of-scope questions — not because the model was bad, but because it had no way to know what it didn't know.


🎯 Rule 3: Fallback Over Hallucination

A system that says "I don't know" is more valuable than one that guesses wrong.

This sounds obvious. It's almost never implemented.

The failure mode: the retriever pulls chunks that are loosely related but not actually relevant. The model generates a plausible-sounding answer from them. The user reads it. They have no way to know it's wrong. They act on it.

Trust is built one answer at a time. It's destroyed the same way.

Here's the pattern I use — a confidence-based router that sends uncertain answers to a human instead of letting the model guess:

# confidence_router.py

CONFIDENCE_THRESHOLD = 0.75

def route_with_fallback(query, rag_chain, retriever):
    docs = retriever.get_relevant_documents(query)

    # No docs or low retrieval confidence → route to human.
    # Assumes your retriever attaches a similarity score to doc metadata;
    # adjust the key and threshold for your vector store.
    top_score = docs[0].metadata.get("score", 0) if docs else 0

    if not docs or top_score < CONFIDENCE_THRESHOLD:
        return {
            "answer": (
                "I'm not confident I have accurate information on this. "
                "Let me tag someone who can help — @support-team"
            ),
            "routed_to_human": True,
            "query": query,
        }

    response = rag_chain.invoke(query)
    return {
        "answer": response["answer"],
        "routed_to_human": False,
        "sources": response["source_documents"],
    }

📌 Pro tip: Log every query that gets routed to a human. These are your highest-value training questions. Add them to the eval set with the correct answer documented. Over time, the system learns its own blind spots.
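A sketch of what that logging can look like. The JSONL path and field names here are placeholders; point it at whatever store your review loop already uses.

# review_queue.py: capture human-routed queries so they can become eval cases

import json
from datetime import datetime

REVIEW_QUEUE = "routed_queries.jsonl"   # placeholder path

def queue_for_review(query):
    record = {
        "timestamp": datetime.utcnow().isoformat(),
        "query": query,
        "correct_answer": None,   # filled in during the weekly review
    }
    with open(REVIEW_QUEUE, "a") as f:
        f.write(json.dumps(record) + "\n")

# Call this inside route_with_fallback() on the routed_to_human branch.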

The fallback message matters too. "I don't know — here's who does" is not a failure. Users who get honest fallbacks trust the system more than users who get confidently wrong answers twice.


💡 Rule 4: Observability from Day One

The eval set is your pre-deployment safety net. Observability is your production safety net.

Log every conversation. Not just the query and the answer — log the retrieved chunks, the retrieval scores, the prompt constructed, and the final response.

# logger.py — minimal production RAG logging

import json
from datetime import datetime

def log_rag_event(query, retrieved_docs, response, routed_to_human, metadata=None):
    # Avoid a mutable default argument; pass the constructed prompt in metadata
    # if you want it logged alongside the rest of the event
    metadata = metadata or {}
    event = {
        "timestamp": datetime.utcnow().isoformat(),
        "query": query,
        "retrieved_chunks": [
            {
                "content": doc.page_content[:200],
                "score": doc.metadata.get("score"),
                "source": doc.metadata.get("source"),
            }
            for doc in retrieved_docs
        ],
        "response": response,
        "routed_to_human": routed_to_human,
        **metadata,
    }
    # Write to your logging backend — CloudWatch, Datadog, a flat file
    print(json.dumps(event))
    return event

Then build the review loop:

→ Flag every answer that gets a thumbs-down or triggers a human escalation
→ Review flagged answers weekly (30 minutes, not hours)
→ Add wrong answers to the eval set with the correct answer documented
→ Re-run evals before the next deployment

This loop is what makes the system improve over time instead of silently degrade.
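The promotion step can be as small as this. It assumes reviewers fill in correct_answer on the queued records during the weekly pass and that the eval set lives in a JSON file; both are assumptions, and a shared spreadsheet works just as well.

# weekly_review.py: promote reviewed production misses into the eval set

import json

def promote_reviewed(review_path="routed_queries.jsonl", eval_path="eval_set.json"):
    with open(review_path) as f:
        reviewed = [json.loads(line) for line in f]

    with open(eval_path) as f:
        eval_set = json.load(f)

    for item in reviewed:
        if item.get("correct_answer"):   # only promote cases a human has verified
            eval_set.append({
                "question": item["query"],
                "expected_answer": item["correct_answer"],
                "source_doc": None,       # fill in once the canonical page is fixed
                "category": "production-miss",
            })

    with open(eval_path, "w") as f:
        json.dump(eval_set, f, indent=2)

# Re-run the evals against the grown set before the next deployment.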

The eval set starts at 30–40 questions. Six months in, it's 150 questions covering every edge case real users found. The system that was 80% accurate in week 1 is genuinely 95% accurate by month 6 — not because the model changed, but because the system around it got tighter.

Tools I've used for this: LangSmith for tracing and full conversation logging, RAGAS for automated eval scoring, and a simple shared spreadsheet for the human review loop. You don't need an elaborate stack. You need the habit.
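If you want automated scoring instead of the substring stand-in from Rule 1, RAGAS covers the common faithfulness and relevancy metrics. The exact API has shifted between RAGAS versions, so treat this as a shape sketch of the older 0.1-style interface rather than copy-paste code.

# ragas_eval.py: rough shape of automated scoring with RAGAS (0.1-style API, version-dependent)

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# One row per eval question: the generated answer plus the retrieved chunk texts
data = Dataset.from_dict({
    "question": ["What is our refund policy for enterprise customers?"],
    "answer": ["Enterprise customers get a 30-day money-back guarantee."],
    "contexts": [["Enterprise plans include a 30-day money-back guarantee..."]],
})

# Needs an LLM behind it (an OpenAI key by default in older versions)
scores = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(scores)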


Key Takeaways

✅ Write 30–40 eval questions before a single prompt — this is the line between a demo and a system you can maintain
✅ Fix the data before building the bot — no retriever fixes conflicting or fragmented knowledge sources
✅ A system that says "I don't know" is more valuable than one that confidently hallucinates
✅ Log every conversation and review the flagged ones weekly: the feedback loop is what keeps accuracy climbing after launch


The pattern I keep seeing: teams spend time optimizing the part that's already good (the model) and skip the parts that actually break in production (evals, data quality, fallbacks, observability). The boring infrastructure is what makes the intelligent part work.

If you're building or inheriting a RAG system that performs well in testing but keeps surfacing wrong answers in production — these four things are almost always where it breaks.

👋 Let's Connect

If you found this useful or want to talk through a similar problem:

🔗 LinkedIn | 💻 Portfolio | 📧 akshaykumarbedre.bm@gmail.com
