So you followed a tutorial, spun up a vector database, embedded some documents, and asked your shiny new RAG system a question. The answer? Completely wrong. Or worse — confidently wrong with citations that don't support the claim.
I've been there. Twice in production. Let me walk you through the problems that actually bite you when building RAG systems, and the fixes that got my retrieval quality from "embarrassing demo" to "genuinely useful."
The core problem: retrieval is harder than it looks
Most RAG tutorials make it seem simple: chunk documents, embed them, do a similarity search, stuff the results into a prompt. That pipeline works great for toy demos with 50 documents about the same topic.
It falls apart the moment you have real data. Different document formats, varying levels of detail, ambiguous queries, and chunks that lost their context during splitting — these are the things that actually kill your system.
The root cause is almost always the same: your retrieval step returns irrelevant chunks, and no amount of prompt engineering can fix bad context.
Failure #1: Naive chunking destroys meaning
This was my first painful lesson. I was splitting documents at a fixed character count with some overlap, like every tutorial suggests:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# The "default" approach everyone starts with
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
chunks = splitter.split_documents(docs)
```
The problem? A 500-character chunk from the middle of a technical doc often has zero context about what it's describing. You get a paragraph about "configuring the timeout parameter" with no indication of which service or component it belongs to.
The fix: semantic chunking + metadata enrichment
Two changes made a massive difference. First, switch to a chunking strategy that respects document structure:
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Splits based on semantic similarity between sentences
# instead of arbitrary character counts
semantic_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=85
)
chunks = semantic_splitter.split_documents(docs)
```
Second — and this is the one people skip — prepend context to each chunk:
```python
def enrich_chunk(chunk, parent_doc):
    """Add document-level context so the chunk stands on its own."""
    prefix = f"Document: {parent_doc.title}\n"
    prefix += f"Section: {chunk.metadata.get('section', 'Unknown')}\n"
    prefix += f"Topic: {parent_doc.category}\n\n"
    chunk.page_content = prefix + chunk.page_content
    return chunk
```
This is sometimes called "contextual retrieval." The idea is simple: every chunk should be understandable in isolation. After implementing this, my retrieval precision jumped noticeably — chunks about "configuring the timeout" now carried metadata saying which service they belonged to.
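To make the enrichment concrete, here's a minimal sketch of what an enriched chunk ends up looking like. The document titles, fields, and `SimpleNamespace` stand-ins are invented for illustration; in a real pipeline these would be your loader's document objects:

```python
from types import SimpleNamespace

# Hypothetical stand-ins for a parent document and one of its chunks
parent = SimpleNamespace(title="Billing Service Runbook", category="payments")
chunk = SimpleNamespace(
    metadata={"section": "Retries"},
    page_content="Set the timeout parameter to at least 30 seconds.",
)

# Prepend the document-level context, as enrich_chunk does
prefix = (
    f"Document: {parent.title}\n"
    f"Section: {chunk.metadata.get('section', 'Unknown')}\n"
    f"Topic: {parent.category}\n\n"
)
chunk.page_content = prefix + chunk.page_content
print(chunk.page_content)
```

A query about payment timeouts now has three extra lines of signal to match against, even though the original paragraph never mentioned payments at all.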
Failure #2: Embedding similarity isn't semantic similarity
This one is subtle. A user asks "how do I handle errors in the payment flow?" and the top result is a chunk about "error handling best practices" from a completely unrelated section. The words match. The meaning doesn't.
Cosine similarity on embeddings captures broad topical similarity pretty well, but it struggles to distinguish domain-specific intent: two chunks about "error handling" can look nearly identical in embedding space even when they describe completely different systems.
The fix: hybrid search
Pure vector search has blind spots. Pure keyword search (BM25) has different blind spots. Combine them:
```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma

# Vector search catches semantic matches
vector_retriever = Chroma.from_documents(
    chunks, embedding_function
).as_retriever(search_kwargs={"k": 10})

# BM25 catches exact keyword matches the embeddings miss
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 10

# Weighted Reciprocal Rank Fusion merges both result sets
hybrid_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.6, 0.4]  # tune these for your domain
)
The weights matter. For technical documentation, I found bumping BM25 weight to 0.4-0.5 helped because users often search for exact function names or error codes. For more conversational knowledge bases, lean heavier on vector search.
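If the fusion step feels like a black box, here's a rough sketch of what weighted Reciprocal Rank Fusion does under the hood. This is a simplified stand-in, not LangChain's actual implementation; the rank constant of 60 is a common default, and the doc IDs are invented:

```python
def weighted_rrf(ranked_lists, weights, c=60):
    """Each list contributes weight / (rank + c) for every doc it ranks.

    Docs that appear high in multiple lists accumulate the most score.
    """
    scores = {}
    for docs, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(docs):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]  # ranked by vector search
bm25_hits = ["doc_c", "doc_d", "doc_a"]    # ranked by BM25
fused = weighted_rrf([vector_hits, bm25_hits], weights=[0.6, 0.4])
```

Note that `doc_a` wins here because it appears in both lists: showing up everywhere beats ranking first in only one.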
Failure #3: Stuffing too many chunks into the prompt
More context is better, right? Wrong. I learned this the hard way when I increased k from 4 to 15 and watched answer quality decrease.
The LLM gets overwhelmed by irrelevant chunks and starts hallucinating connections between unrelated pieces of context. It's the "lost in the middle" problem — models pay more attention to the beginning and end of their context window.
The fix: reranking before generation
Retrieve broadly, then rerank aggressively:
```python
from sentence_transformers import CrossEncoder

# Cross-encoders are slower but much more accurate
# than bi-encoder similarity for reranking
reranker = CrossEncoder("cross-encoder/ms-marco-MiniL M-L-6-v2".replace(" ", ""))

def rerank_chunks(query, chunks, top_k=4):
    pairs = [(query, chunk.page_content) for chunk in chunks]
    scores = reranker.predict(pairs)
    # Sort by reranker score and keep only the best; the explicit key
    # avoids comparing Document objects when scores tie
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]

# Retrieve up to 20 candidates, rerank to the top 4
candidates = hybrid_retriever.get_relevant_documents(query)
final_chunks = rerank_chunks(query, candidates, top_k=4)
```
This pattern — over-retrieve then rerank — consistently outperformed just retrieving fewer chunks. The cross-encoder sees the query and chunk together, so it catches relevance signals that embedding similarity misses.
Failure #4: No evaluation means flying blind
Probably my biggest mistake was going weeks without any systematic way to measure retrieval quality. I was vibes-checking results by trying random queries. Don't do this.
The fix: build an eval set early
You don't need anything fancy. A spreadsheet with 30-50 query/expected-answer pairs gets you surprisingly far:
```python
def evaluate_retrieval(retriever, eval_set, k=4):
    """Basic retrieval evaluation: does the right chunk show up?"""
    hits = 0
    for item in eval_set:
        results = retriever.get_relevant_documents(item["query"])
        retrieved_texts = [r.page_content for r in results[:k]]
        # Check if expected source doc appears in results
        if any(item["expected_source"] in text for text in retrieved_texts):
            hits += 1
    recall = hits / len(eval_set)
    print(f"Recall@{k}: {recall:.2%}")
    return recall
```
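The eval set itself is just a list of dicts with the two keys the harness reads. The queries and expected snippets below are invented examples; yours should come from real user questions:

```python
# Each item pairs a realistic user query with a text snippet that must
# appear in a retrieved chunk for the query to count as a hit.
eval_set = [
    {
        "query": "how do I handle errors in the payment flow?",
        "expected_source": "payment-service error handling",
    },
    {
        "query": "what is the default request timeout?",
        "expected_source": "timeout parameter",
    },
]
```

Keeping it this simple means anyone on the team can add cases in a spreadsheet and export them, no eval framework required.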
Every time I changed chunking, embeddings, or retrieval strategy, I ran this eval. It caught regressions I would have missed with manual testing and gave me confidence when deploying changes.
Prevention: the checklist I wish I'd had
After building a few of these systems, here's what I'd do differently from day one:
- Start with your hardest queries. Grab 10 real questions from users or stakeholders before writing any code. If your retrieval can't handle these, no amount of prompt tuning will save you.
- Log everything. Store the query, retrieved chunks, reranker scores, and the final answer. When something goes wrong in production (it will), you need to know which stage failed.
- Chunk size is not one-size-fits-all. Technical docs with code blocks need larger chunks. FAQ-style content works better with smaller, self-contained chunks. Test multiple strategies per document type.
- Don't skip metadata filtering. If your documents have natural categories (product area, doc type, date), use metadata filters to narrow the search space before vector similarity kicks in. This is cheap and surprisingly effective.
- Embedding models matter less than you think. I spent days benchmarking embedding models and got maybe a 2-3% improvement. I spent one afternoon implementing hybrid search + reranking and got a 15%+ improvement. Focus on retrieval architecture first.
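On the metadata-filtering point: narrowing by a metadata field before any similarity scoring shrinks the candidate set cheaply. Most vector stores support this natively (Chroma, for example, accepts a `filter` in `search_kwargs`), but the principle fits in a few lines of plain Python. The field names and documents here are invented:

```python
# Toy corpus: each doc carries metadata alongside its text
docs = [
    {"text": "Configuring the timeout parameter", "doc_type": "runbook", "area": "payments"},
    {"text": "Q3 marketing plan", "doc_type": "memo", "area": "growth"},
    {"text": "Payment retry backoff settings", "doc_type": "runbook", "area": "payments"},
]

def prefilter(docs, **wanted):
    """Keep only docs whose metadata matches every requested field."""
    return [d for d in docs if all(d.get(k) == v for k, v in wanted.items())]

# Vector similarity now only runs over the payments runbooks
candidates = prefilter(docs, doc_type="runbook", area="payments")
```

A query about payment timeouts can no longer be distracted by the marketing memo, no matter how its embedding happens to land.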
The uncomfortable truth
Building a RAG system that works on a demo is a weekend project. Building one that works reliably in production is an ongoing engineering effort. The retrieval pipeline needs as much attention as any other critical system — monitoring, evaluation, iteration.
The good news is that the fixes above aren't complicated. Semantic chunking, hybrid search, reranking, and a basic eval harness will get you past most of the painful failure modes. Start there, measure everything, and iterate based on real queries from real users.
The LLM is almost never the bottleneck. The retrieval is.