My Support Bot Kept Making Stuff Up — Here's How I Fixed It

#webdev #python #ai #tutorial

I recently spent a month building a customer support bot. In theory, it was simple: take a user’s question, pass it to an LLM with some context, and return a helpful answer. In practice, my bot was a confident liar. It’d tell users that a product had a feature we never shipped, or quote a pricing tier that didn’t exist. No single mistake looked disastrous on its own, but for a support tool, every hallucination was a trust bomb.

At first I thought, “More prompt engineering.” I added system messages like “You are a friendly support agent. Only answer from the provided context. If unsure, say you don’t know.” The bot just got better at sounding uncertain while still making stuff up. LLMs are incredible at generating plausible text—they’ll happily fabricate a whole return policy if it completes the sentence nicely.

What I Tried That Didn’t Work

I went through a few dead ends:

Simple FAQ lookup – Keyword matching against a list of Q&A pairs. Works for obvious questions (“What’s your return policy?”) but fails on rephrasings or multi-part questions. Also, maintaining the list became a nightmare as we added new products.
LLM-only with full knowledge base – I dumped all our help articles into the system prompt. The bot could answer anything but also hallucinated more often because the context was bloated and confusing.
Temperature tuning – Dropping temperature to 0 just made the bot sound robotic; it still generated false facts with the same frequency.

None of these solved the core problem: the LLM doesn’t know what it knows. It’s a text generator, not a retrieval engine.

What Eventually Worked

I switched to a retrieval-augmented generation (RAG) approach. The key change: instead of asking the LLM to remember our knowledge base, I pre‑retrieve the most relevant snippets using embedding similarity, then inject them only when they’re relevant enough. If nothing passes a high similarity threshold, the bot falls back to a safe “I don’t know” response.

Here’s the simplified flow:

Offline: Chunk all support docs into small pieces (200–500 characters), compute embeddings for each, store them in a vector index (FAISS).
Online: When a user asks a question, compute its embedding and search the index for the top 3 most similar chunks.
Similarity gate: If the top chunk’s similarity score is above, say, 0.75, pass those chunks as context to the LLM. Otherwise, skip the LLM and reply with “I’m sorry, I couldn’t find an answer to that. Let me connect you to a human.”

This small change eliminated most hallucinations. The LLM only sees high-confidence context, and the gate prevents it from improvising when no good context exists.

Code: A Minimal RAG Implementation

Below is a simplified Python version. You’d run the chunking once and store the index, then call answer_query per user message. I used sentence-transformers for embeddings and faiss for search.

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Load embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# --- Offline Chunking & Indexing ---
def build_index(chunks: list[str]):
    """Create FAISS index from text chunks."""
    embeddings = model.encode(chunks, show_progress_bar=True)
    dimension = embeddings.shape[1]
    index = faiss.IndexFlatL2(dimension)
    index.add(np.array(embeddings))
    return index, chunks

# --- Online Query ---
def retrieve_chunks(query: str, index, chunks, top_k=3):
    """Return a list of (chunk, similarity) pairs for the top-k matches, best first."""
    query_vec = model.encode([query])
    distances, indices = index.search(np.array(query_vec), top_k)
    # IndexFlatL2 returns squared L2 distances (lower distance = more similar).
    # Here we use a simple inversion to get a crude similarity in (0, 1];
    # in practice tune thresholds carefully.
    # FAISS pads results with index -1 when the index holds fewer than top_k
    # vectors, so skip those slots instead of indexing chunks[-1].
    return [
        (chunks[i], 1 / (1 + d))
        for d, i in zip(distances[0], indices[0])
        if i != -1
    ]

def answer_query(query: str, index, chunks, llm_call):
    """Generate answer with similarity gate."""
    # Fetch the top 3 chunks; the gate below only inspects the best one
    results = retrieve_chunks(query, index, chunks, top_k=3)
    best_chunk, best_sim = results[0]
    threshold = 0.75  # Tune based on your data

    if best_sim >= threshold:
        context = "\n---\n".join([c for c, _ in results])
        prompt = f"""You are a support agent. Answer the user question using ONLY the context below. If the answer isn't clear, say "I'm not sure".

Context:
{context}

Question: {query}
"""
        response = llm_call(prompt)
        return response
    else:
        return "I'm sorry, I couldn't find a reliable answer. Let me transfer you to our support team."

Usage example (mock LLM call):

def mock_llm(prompt):
    # In reality you'd call OpenAI, Anthropic, etc.
    return "Based on the provided context, the answer is XYZ."

# Pre-build index once
all_chunks = ["Return policy: 30 days from purchase...", "Warranty: 1 year limited..."]
index, chunk_list = build_index(all_chunks)

answer = answer_query("What's your return period?", index, chunk_list, mock_llm)
print(answer)
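
The usage example above assumes all_chunks already exists. As a rough sketch of the offline chunking step, here is one way to split a document into overlapping character windows. The function name chunk_text is hypothetical, and the defaults (300 characters, 50-character overlap) are simply the values mentioned in this post; a real splitter would probably also respect sentence or word boundaries, which this one does not.

```python
def chunk_text(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into character windows of `size` that overlap by `overlap`.

    Hypothetical helper for illustration only; it cuts mid-word if needed.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + size].strip()
        if piece:
            chunks.append(piece)
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Made-up document text, repeated to get something long enough to split
doc = "Return policy: items can be returned within 30 days of purchase. " * 10
all_chunks = chunk_text(doc)
print(len(all_chunks), "chunks")
```

The resulting list can be passed straight to build_index. The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, as long as it is shorter than the overlap.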

Lessons Learned & Trade-offs

Threshold tuning is critical. Too low, you still get hallucinations. Too high, you miss valid queries and frustrate users (fallback triggers too often). I ran a batch of 200 historical support tickets to find the threshold that maximized good answers while keeping false positives under 5%.
Chunk quality matters. I originally split by paragraphs, but the embedding lost context. Switching to overlapping chunks of ~300 characters with a 50-character overlap improved retrieval.
Vector search is not magic. The model I used (all-MiniLM-L6-v2) is fast but sometimes fails on domain-specific jargon. A tuned model (e.g., fine-tuned on your own support docs) would help, but that’s a bigger investment.
Fallback is essential. Even with a good retrieval pipeline, there will be edge cases. The fallback to a human agent saved our bot from becoming a liability. Also, I added a “Did this answer help?” button to capture feedback for improving the system.
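
To make the threshold-tuning point concrete, here is a minimal sketch of a threshold sweep. It assumes each historical ticket has already been reduced to a pair: the best similarity score retrieval produced, and a manual label saying whether the bot's answer was correct. That data format, the function name pick_threshold, and the definition of "false positive rate" (wrong answers divided by answered tickets) are all assumptions for illustration, and the scores below are made up.

```python
def pick_threshold(scored, candidates, max_fp_rate=0.05):
    """Pick the candidate threshold that yields the most correct answers
    while keeping the false positive rate at or below max_fp_rate.

    scored: list of (best_similarity, answer_was_correct) pairs.
    Returns (threshold, good_answers, fp_rate), or None if nothing qualifies.
    """
    best = None
    for t in candidates:
        # Tickets the gate would let through at this threshold
        answered = [ok for sim, ok in scored if sim >= t]
        if not answered:
            continue
        good = sum(answered)
        fp_rate = (len(answered) - good) / len(answered)
        if fp_rate <= max_fp_rate and (best is None or good > best[1]):
            best = (t, good, fp_rate)
    return best


# Toy labeled data (invented); a real run would use a few hundred tickets
scored = [
    (0.92, True), (0.88, True), (0.81, True), (0.79, False),
    (0.77, True), (0.70, False), (0.66, True), (0.55, False),
]
candidates = [0.5, 0.6, 0.7, 0.75, 0.8, 0.85, 0.9]
print(pick_threshold(scored, candidates))
```

With only eight toy tickets the 5% limit is very strict; on a real batch the same loop shows the trade-off between answering more questions and letting more wrong answers through.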

What I’d Do Differently Next Time

I’d start with a simpler hybrid search (BM25 + embedding) before committing to pure vector search. BM25 catches exact phrase matches that embeddings can miss, especially for product names or version numbers. Also, I’d set up automated evaluations from day one—running every bot update against a test set of known good Q&A pairs to catch regressions.
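
For readers curious what hybrid search might look like, below is a small self-contained sketch. It implements BM25 scoring in plain Python and blends it with embedding similarities via min-max normalization and a weighted sum. The blending scheme, the alpha weight, the tokenizer, and the sample documents and similarity numbers are all assumptions chosen for illustration; this is not code from the bot described above, and a production setup would more likely use an existing BM25 library.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase alphanumeric tokens; keeps version strings like "v2.1" whole
    return re.findall(r"[a-z0-9]+(?:\.[0-9]+)*", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every doc against the query with BM25 (k1 and b are common defaults)."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    # Document frequency: in how many docs each term appears
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def min_max(values):
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_scores(bm25, embed_sims, alpha=0.5):
    """Weighted sum of normalized scores; alpha is the weight on BM25."""
    lexical, semantic = min_max(bm25), min_max(embed_sims)
    return [alpha * x + (1 - alpha) * y for x, y in zip(lexical, semantic)]


docs = [
    "Widget Pro v2.1 supports offline sync.",
    "Widget Pro v2.0 requires an internet connection.",
    "Returns are accepted within 30 days.",
]
query = "Does v2.1 work offline?"
embed_sims = [0.62, 0.64, 0.20]  # invented stand-ins for embedding similarities

combined = hybrid_scores(bm25_scores(query, docs), embed_sims)
best = max(range(len(docs)), key=lambda i: combined[i])
print(docs[best])
```

In this toy case the invented embedding scores slightly prefer the v2.0 document, while the exact match on "v2.1" lets BM25 pull the right one to the top once the two signals are blended.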

The Tool I Used for Testing (Not Required)

For what it’s worth, I tested the conversation flow using a platform called InterWestInfo to simulate user interactions and log fallback events. It helped me iterate quickly, but the core approach (embedding retrieval with a similarity gate) is completely tool-agnostic.

How do you handle hallucinations in your AI features? I’m still experimenting with better chunking strategies and plan to try out hybrid search soon. What’s your setup look like?

Top comments (1)

xulingfeng • Jun 7

I love the honest progression here — 'More prompt engineering' being the first instinct we all reach for, and then realizing it just makes the bot better at sounding uncertain while still lying. That's the part nobody puts in the blog post. The fix always involves actual structure (retrieval, grounding, validation) rather than more system messages.