Why my first RAG system hallucinated (and how I fixed it)

#tutorial #ai #webdev #python

It started innocently enough. I needed a way to let my team ask questions about our sprawling internal documentation—hundreds of pages of API references, onboarding guides, and compliance rules. ChatGPT was impressive, but it had no clue about our private data. The obvious answer: Retrieval-Augmented Generation (RAG).

I’d read the hype: embed your docs, shove them into a vector database, slap an LLM on top, and boom, instant Q&A bot. Sounds simple. My first attempt was anything but.

The naive approach that almost worked

I grabbed text-embedding-ada-002, split my documents into 512-token chunks, inserted them into Pinecone, and wired up a simple LangChain chain with GPT-3.5-turbo. Here’s the monster I created:

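from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.chains import RetrievalQA

# Assumes the Pinecone client is already initialized and doc_chunks holds the split Documents
embeddings = OpenAIEmbeddings()
vectorstore = Pinecone.from_documents(doc_chunks, embeddings, index_name="my-docs")

# The plain OpenAI() wrapper is for completion models; gpt-3.5-turbo needs the chat wrapper
llm = ChatOpenAI(model_name="gpt-3.5-turbo")
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=vectorstore.as_retriever())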

On paper, it worked. Ask "How do I reset the admin password?", get back a coherent answer. But the cracks appeared fast.

The three cracks

First, hallucinations. Not the creative kind—the dangerous kind. Someone asked "What’s the default timeout?" and the bot confidently said 30 seconds when the actual value was 5 minutes. The retrieved chunk had the number “30” but in a completely different context.

Second, fragmentation. Long procedures like “deploying a production instance” were split across multiple chunks. The bot would pick one chunk that mentioned “deploy” but miss the step about environment variables.

Third, relevance. Queries about “rate limits” would retrieve a chunk containing the words “rate” and “limit” but from a pricing table, not the technical spec.

I tried tweaking chunk sizes, playing with overlap, and switching models. Nothing fixed the core problem: my chunks were dumb slices of text without any awareness of their siblings.

What eventually worked: parent-child chunking and hybrid search

After reading papers and scouring GitHub issues, I settled on a two-part approach that actually made the bot reliable:

Parent-child chunking – keep a hierarchy. Small “child” chunks (e.g. 256 tokens) are embedded and searched, but the LLM receives the surrounding “parent” chunk (e.g. the whole section) for context.
Hybrid search – combine dense vector similarity with sparse keyword matching (BM25) so terms like “timeout” actually find the right document.
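
To make the "sparse keyword matching" half concrete, here is a minimal sketch of plain Okapi BM25 scoring in pure Python. It is not what Weaviate or LangChain run internally; the whitespace tokenizer, the parameter defaults (k1=1.5, b=0.75), and the two sample sentences are all assumptions for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in docs against query with plain Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Length normalization: long documents need more matches to score well
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical sample chunks: one technical, one from a pricing page
docs = [
    "the default request timeout is 5 minutes",
    "the starter plan costs 30 dollars per month",
]
scores = bm25_scores("default timeout", docs)
```

The pricing sentence shares no query terms, so it scores zero no matter how close its embedding happens to land. That is the property the keyword half contributes.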

Here’s the core of my implementation (using LangChain and Weaviate, but the pattern isn’t tied to either):

import uuid

from langchain.text_splitter import RecursiveCharacterTextSplitter

# First, split documents into larger parent chunks.
# Note: chunk_size counts characters by default, not tokens.
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
parent_docs = parent_splitter.split_documents(raw_docs)

# Then split each parent into smaller child chunks
child_splitter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=50)
child_docs = []
for parent in parent_docs:
    # split_documents does not assign ids, so give each parent a unique one
    parent_id = str(uuid.uuid4())
    parent.metadata["id"] = parent_id
    children = child_splitter.split_documents([parent])
    for child in children:
        child.metadata["parent_id"] = parent_id
    child_docs.extend(children)

Now I embed the child docs but store each with a reference to its parent. During retrieval, I fetch the top-k child chunks, then pass their parent chunks to the LLM. That way the LLM sees a full section, not a torn snippet.
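
The child-to-parent hop can be sketched roughly like this. It is a simplified stand-in, not my exact code: the hits are plain dicts rather than LangChain Documents, and `parents_for_hits` and `parents_by_id` are hypothetical names for a helper and an id-to-text lookup you would build at indexing time.

```python
def parents_for_hits(child_hits, parents_by_id):
    """Map ranked child hits to their parent chunks, deduplicated, best first."""
    seen = set()
    parents = []
    for child in child_hits:
        parent_id = child["metadata"]["parent_id"]
        if parent_id in seen:
            # Several children often point at the same section; send it once
            continue
        seen.add(parent_id)
        parents.append(parents_by_id[parent_id])
    return parents


# Hypothetical data: two parent sections and three ranked child hits
parents_by_id = {
    "p1": "Section: Deployment Timeouts ...",
    "p2": "Section: Pricing ...",
}
child_hits = [
    {"text": "default timeout", "metadata": {"parent_id": "p1"}},
    {"text": "set the env vars", "metadata": {"parent_id": "p1"}},
    {"text": "starter plan", "metadata": {"parent_id": "p2"}},
]
context = parents_for_hits(child_hits, parents_by_id)
```

Deduplicating while keeping the retrieval order matters here: without it, the same section can fill the whole context window.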

For hybrid search, I used the hybrid parameter in Weaviate, but you can also do it manually with an ensemble retriever:

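from langchain.retrievers import EnsembleRetriever
from langchain.retrievers.bm25 import BM25Retriever

# BM25Retriever needs the rank_bm25 package installed
bm25_retriever = BM25Retriever.from_documents(child_docs, k=3)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# The weights lean slightly toward the dense retriever; tune them against your own queries
ensemble = EnsembleRetriever(retrievers=[bm25_retriever, vector_retriever], weights=[0.4, 0.6])

# Then use this retriever in your QA chain (llm is whichever LangChain model you configured).
# Note: as written, this passes child chunks to the LLM; map them to their parents first for the full setup.
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=ensemble)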

This hybrid approach dramatically cut hallucinations. The keyword component ensured exact terms like “admin password” matched the correct documents, while the vector component understood semantic queries.
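
For intuition on how two rankings become one: as far as I can tell, the ensemble retriever fuses results with a weighted form of Reciprocal Rank Fusion (RRF). The sketch below is my own pure-Python version of that idea, not the library's code. The constant c=60 is the conventional RRF default and an assumption here, and the document ids are made up.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several ranked lists of document ids into a single ranking."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each retriever contributes more for documents it ranks higher
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical rankings from the keyword and vector retrievers
bm25_ranking = ["timeouts", "pricing", "onboarding"]
vector_ranking = ["deploy-guide", "timeouts", "pricing"]
fused = weighted_rrf([bm25_ranking, vector_ranking], weights=[0.4, 0.6])
```

A document that both retrievers rank reasonably well ("timeouts") beats one that only a single retriever ranks first ("deploy-guide"), which is the behavior I wanted.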

Putting it all together

I also added a simple reranking step using Cohere’s rerank endpoint to make sure only the most relevant parents reached the LLM. The final pipeline looked like:

1. User asks a question.
2. Hybrid search returns child chunks (vectors + BM25).
3. Collect unique parent chunks from those children.
4. Rerank parents against the question, keep top 3.
5. Feed those 3 parents + question to GPT-4 (yes, I upgraded from 3.5).
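
The steps above can be sketched as one function with each stage passed in as a callable. This is a shape, not my production code: `answer_question`, its parameter names, and the fake stages below are all hypothetical, and in practice the callables would wrap the hybrid retriever, the Cohere rerank call, and the GPT-4 call.

```python
from typing import Callable, Dict, List


def answer_question(
    question: str,
    search_children: Callable[[str], List[dict]],
    parents_by_id: Dict[str, str],
    rerank: Callable[[str, List[str]], List[str]],
    generate: Callable[[str, List[str]], str],
    top_n: int = 3,
) -> str:
    # Hybrid search over the small child chunks
    children = search_children(question)
    # Collect unique parents, preserving retrieval order
    parent_ids = list(dict.fromkeys(c["metadata"]["parent_id"] for c in children))
    parents = [parents_by_id[pid] for pid in parent_ids]
    # Rerank parents against the question and keep the best few
    best = rerank(question, parents)[:top_n]
    # Hand the surviving parents plus the question to the LLM
    return generate(question, best)


# Fake stages so the sketch runs without any external service
def fake_search(question):
    return [
        {"metadata": {"parent_id": "p1"}},
        {"metadata": {"parent_id": "p2"}},
        {"metadata": {"parent_id": "p1"}},
    ]


def fake_rerank(question, parents):
    # Toy reranker: order by word overlap with the question
    words = set(question.lower().split())
    return sorted(parents, key=lambda p: len(words & set(p.lower().split())), reverse=True)


def fake_generate(question, parents):
    return f"{len(parents)} parent(s); first: {parents[0]}"


parents = {
    "p1": "pricing table for all plans",
    "p2": "deployment timeout defaults to 5 minutes",
}
result = answer_question(
    "what is the default deployment timeout",
    fake_search, parents, fake_rerank, fake_generate,
)
```

Injecting the stages this way also makes it easy to swap the reranker out, or skip it, without touching the rest of the pipeline.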

The difference was night and day. A query like “What’s the default deployment timeout?” now pulled the section titled “Deployment Timeouts” instead of some random pricing table.

Incidentally, I did find a similar approach in a service at ai.interwestinfo.com that claimed to do this automatically, but I wanted full control over chunking and retrieval. If you’re building a quick prototype, such tools can save time, but for production you’ll want to own each layer.

Trade-offs and when you should NOT do this

Parent-child chunking roughly doubles the text you store, since every passage lives in both a parent and a child, and it adds indexing complexity. Hybrid search adds latency because each query runs two retrievers. If your documents are already small and well-structured (e.g. a one-page FAQ), a flat vector search might suffice. Also, if your users only ask factual questions with exact matches, a keyword search alone could be cheaper and faster.

But for messy, long-form internal docs? This approach saved my bot from the dumpster fire of hallucinations.

What I’d do differently next time

Start with a proper evaluation set. I spent days chasing ghosts because I couldn’t objectively measure improvement.
Use a cheaper model for reranking (or skip it if the ensemble is enough).
Monitor retrieval quality in production—add logging for retrieved chunks vs. actual answers.
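
For the evaluation set, even something this small would have saved me days. It is a minimal sketch of a hit-rate-at-k check, assuming you hand-label which parent section should answer each question. `hit_rate_at_k`, the `expected_parent_id` field, and the fake retriever are hypothetical names, not part of any library.

```python
def hit_rate_at_k(eval_set, retrieve, k=3):
    """Fraction of questions whose expected parent appears in the top k results."""
    hits = 0
    for example in eval_set:
        retrieved_ids = retrieve(example["question"])[:k]
        if example["expected_parent_id"] in retrieved_ids:
            hits += 1
    return hits / len(eval_set)


# Hypothetical labeled questions
eval_set = [
    {"question": "What's the default timeout?", "expected_parent_id": "timeouts"},
    {"question": "What are the rate limits?", "expected_parent_id": "rate-limits"},
]

# Fake retriever returning ranked parent ids; swap in the real pipeline here
fake_results = {
    "What's the default timeout?": ["timeouts", "pricing", "onboarding"],
    "What are the rate limits?": ["pricing", "onboarding", "timeouts"],
}
score = hit_rate_at_k(eval_set, lambda q: fake_results[q], k=3)
```

Rerunning this after every chunking or retrieval change turns "it feels better" into a number you can compare.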

This whole journey taught me that RAG is not a plug-and-play solution. It’s a system design problem. The embedding model is just a small piece; how you slice, retrieve, and combine context matters a lot more.

Have you had a similar “oh no, my bot lied to me” moment? What does your chunking strategy look like?