Most RAG Problems Are Retrieval Problems. Here Are 10 Fixes That Worked for Me

#ai #machinelearning #rag #llm

The first few times a RAG system gave me a bad answer, I did what I think everyone does: I went and fiddled with the prompt. Made it stricter. Added an "only answer from the context" line. It barely moved the needle.

What finally fixed things was looking one step earlier. Nine times out of ten the model wasn't the problem at all — the right passage just never showed up in the context window, so there was nothing to ground the answer on. You can't prompt your way out of missing evidence.

So here's what I now reach for when retrieval is the weak link, roughly in the order I'd try them.

The short version

Fix	Example	Why it helps
Get chunking right first	`RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)`	Chunk size sets the ceiling on everything downstream. Too big and the answer drowns in noise; too small and it loses the surrounding context.
Add some overlap	`chunk_overlap=200`	Carries a bit of the previous chunk's tail forward, so an answer that straddles a boundary doesn't get sliced in half.
Contextual chunking	`chunk = f"CONTEXT: {llm(doc, chunk)}\n{chunk}"`	Sticks a short, model-written "here's where this fits" note on each chunk before you embed it. Makes otherwise-vague chunks findable.
Hybrid search (BM25 + dense)	`collection.query.hybrid(query, alpha=0.5)`	Vectors are good at meaning, BM25 is good at literal strings — error codes, product names, that one weird acronym. Use both.
Reranking	`co.rerank(model="rerank-v3.5", query=q, documents=texts, top_n=10)`	A cross-encoder reads each query/doc pair properly and reorders them. Cheap next to the LLM call, big jump in precision.
Parent-document retriever	`ParentDocumentRetriever(child_splitter=small)`	Search over tiny chunks so you hit the right spot, then hand the model the bigger surrounding chunk so it has room to reason.
Rewrite the query	`MultiQueryRetriever.from_llm(retriever, llm)`	People don't phrase questions the way docs are written. Generate a few variants, a fake answer to search with (HyDE), or a broader version of the question.
Filter on metadata	`where={"source": "handbook"}`	Cut the candidate set down by source, date, section, whatever — before the vector search runs. Faster and more accurate.
Fuse with RRF	`RRF(d) = Σ 1/(k + rank_i(d))`, `k=60`	Merges several ranked lists without needing their scores to line up.
De-dupe / MMR	`vectorstore.as_retriever(search_type="mmr")`	Stops you from handing the model three chunks that all say the same thing.
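
The last row is the one that's easiest to wave at and hardest to picture, so here's a minimal sketch of Maximal Marginal Relevance in plain Python. The vectors are made up for illustration, and `lambda_mult=0.5` is just a starting point I'd assume, not a recommendation from any particular library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Pick k doc indices, trading relevance to the query against
    similarity to documents that were already picked."""
    remaining = list(range(len(doc_vecs)))
    selected = []
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data (hypothetical): docs 0 and 1 are near-duplicates, doc 2 is different.
query = [1.0, 0.2]
docs = [[0.99, 0.01], [1.0, 0.0], [0.5, 0.8]]
picked = mmr(query, docs, k=2)
print(picked)  # -> [0, 2]: the near-duplicate (doc 1) is skipped
```

Plain top-2 by similarity would have returned docs 0 and 1 here, which is exactly the "same thing twice" problem.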

The three I'd reach for first

Hybrid search, so you stop losing exact terms

This one bit me directly. We had a support bot that could happily explain what a connection error was but couldn't find the doc for ERR_CONN_REFUSED specifically, because dense embeddings smear that exact token into "something about connections." BM25 finds it instantly. The fix is to run both retrievers and merge the two ranked lists with Reciprocal Rank Fusion — no score calibration needed, which is the nice part:

dense_hits  = vector_store.search(query, k=20)
sparse_hits = bm25.search(query, k=20)

def rrf(rank, k=60):
    return 1 / (k + rank)

scores = {}
for hits in (dense_hits, sparse_hits):
    for rank, d in enumerate(hits, start=1):
        scores[d.id] = scores.get(d.id, 0.0) + rrf(rank)

top_ids = sorted(scores, key=scores.get, reverse=True)[:10]   # doc IDs, best first
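
If you want to see the fusion do its thing without a vector store in the loop, here's a self-contained toy run. The doc IDs and both rankings are invented for illustration; the only thing carried over from above is the RRF formula with `k=60`.

```python
def rrf_fuse(ranked_lists, k=60, top_n=10):
    """Merge several ranked lists of doc IDs with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical rankings: dense search likes the general overview,
# BM25 puts the exact error code first.
dense = ["conn-overview", "err-conn-refused", "tcp-basics", "timeouts"]
sparse = ["err-conn-refused", "err-conn-reset", "conn-overview"]

fused = rrf_fuse([dense, sparse], top_n=3)
print(fused)  # -> ['err-conn-refused', 'conn-overview', 'err-conn-reset']
```

The exact-match doc wins because it ranks near the top of both lists, even though neither retriever's raw scores were ever compared.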

Reranking, so the best chunk is actually on top

First-pass retrieval optimizes for recall, which means it casts a wide net and the genuinely best passage often sits at rank 7, not rank 1. A reranker takes that shortlist and scores each query/passage pair more carefully. It costs a bit more than the initial search, but it's pocket change compared to the generation call, so there's not much reason to skip it. Pull a wide shortlist, rerank, keep the top few:

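from sentence_transformers import CrossEncoder  # assumption: any pairwise scorer with .predict() works

cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # one common choice, swap in your own
candidates = retriever.invoke(query)          # grab 20–50, recall first
scores     = cross_encoder.predict([(query, d.page_content) for d in candidates])
top_k      = [d for _, d in sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)][:5]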

Contextual chunking, so a chunk knows what it's about

Picture a chunk that just says "the limit is 5,000 requests per minute." Great — for which API? Which plan? On its own it's nearly unretrievable. The trick is to have an LLM write one line of context for each chunk and prepend it before embedding:

context = llm(f"Briefly situate this chunk within the document:\n{doc}\n\nCHUNK:\n{chunk}")
embed(f"{context}\n\n{chunk}")   # index the contextualized version

Yes, it costs an extra LLM call per chunk at index time. You pay it once, and it's the single change that moved our numbers the most.
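
To make the "pay it once" part concrete, here's a rough sketch that caches the generated context by content hash, so re-indexing unchanged chunks doesn't trigger new LLM calls. The cache is my own assumption about how you might do it, not something the technique requires; `fake_llm` and the sample text are hypothetical stand-ins so the snippet runs offline.

```python
import hashlib

def contextualize_chunks(doc, chunks, llm, cache):
    """Prepend a short model-written context note to each chunk.
    Notes are cached by a hash of (doc, chunk), so unchanged
    chunks are never sent to the model twice."""
    out = []
    for chunk in chunks:
        key = hashlib.sha256(f"{doc}\x00{chunk}".encode("utf-8")).hexdigest()
        if key not in cache:
            prompt = (
                "Briefly situate this chunk within the document:\n"
                f"{doc}\n\nCHUNK:\n{chunk}"
            )
            cache[key] = llm(prompt).strip()
        out.append(f"{cache[key]}\n\n{chunk}")
    return out

# Hypothetical stand-in for a real model call; it records each prompt it sees.
calls = []
def fake_llm(prompt):
    calls.append(prompt)
    return "From the (made-up) Billing API docs, Pro plan rate limits."

doc = "Billing API reference. Pro plan. The limit is 5,000 requests per minute. Bursts are not allowed."
chunks = ["The limit is 5,000 requests per minute.", "Bursts are not allowed."]

cache = {}  # in practice this would live on disk or in a key-value store
first = contextualize_chunks(doc, chunks, fake_llm, cache)
second = contextualize_chunks(doc, chunks, fake_llm, cache)  # served from the cache
print(len(calls))  # -> 2: one call per chunk, none on the second pass
```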

Things I'd tell my past self not to do

Don't start with the prompt. If the evidence isn't in the context window, the prompt is irrelevant. Log what retrieval actually returned before you change anything else — half the time the bug is obvious the moment you look.
Don't set chunk size once and forget it. It's the biggest lever you've got. Try 256, 512, 1024 on your own data and measure; the "right" default depends entirely on your docs.
Don't go dense-only. Pure vector search will keep missing codes, IDs, and rare names. Bolting on BM25 is the cheapest real win after chunking.
Don't skip reranking and just stuff 20 chunks in. More context isn't better context — it's slower, pricier, and the model gets distracted. Retrieve wide, rerank, send five.
Don't crank top_k to paper over bad ranking. Past a handful of good chunks you're mostly adding noise. Fix the ordering, then trim.
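
For the "try a few chunk sizes and measure" advice, here's roughly the harness I have in mind. Treat it as a sketch under assumptions: sizes are counted in characters for simplicity, `toy_retrieve` is a word-overlap stand-in for your real retriever, and the corpus and eval questions are invented. The metric is a plain hit rate: did any of the top-k chunks contain the expected answer string?

```python
def chunk_text(text, chunk_size, overlap):
    """Split text into character windows that share `overlap` chars with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("need 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

def toy_retrieve(query, chunks, k):
    """Stand-in retriever: rank chunks by how many lowercase words they share with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(chunks, key=lambda c: len(q_words & set(c.lower().split())), reverse=True)
    return ranked[:k]

def hit_rate(text, eval_set, chunk_size, overlap, k=3, retrieve=toy_retrieve):
    """Fraction of (question, answer) pairs whose answer appears in a top-k chunk."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    chunks = chunk_text(text, chunk_size, overlap)
    hits = sum(
        any(answer in chunk for chunk in retrieve(question, chunks, k))
        for question, answer in eval_set
    )
    return hits / len(eval_set)

def sweep_chunk_sizes(text, eval_set, sizes=(256, 512, 1024), k=3, retrieve=toy_retrieve):
    """Hit rate per chunk size, with overlap fixed at 20% of the size."""
    return {size: hit_rate(text, eval_set, size, size // 5, k, retrieve) for size in sizes}

# Invented corpus and eval set, far too small to mean anything on their own.
corpus = (
    "Authentication uses API keys sent in the Authorization header. "
    "The Pro plan allows 5,000 requests per minute. "
    "Error ERR_CONN_REFUSED means the server rejected the TCP connection. "
)
eval_set = [
    ("How many requests per minute on the Pro plan?", "5,000 requests"),
    ("What does ERR_CONN_REFUSED mean?", "rejected the TCP connection"),
]

results = sweep_chunk_sizes(corpus, eval_set, sizes=(40, 80, 160))
for size, rate in results.items():
    print(size, rate)
```

Swap `toy_retrieve` for a function that wraps your actual retriever and feed it a few dozen real questions; the numbers only start to mean something at that point.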

Wrapping up

Retrieval is where most of your RAG quality lives or dies, and the nice thing is these stack — cleaner chunks make the reranker's job easier, hybrid search feeds it better candidates, and so on. If you want the full version of this (embeddings, vector DBs, the query-rewriting tricks, agentic patterns, evaluation — way more than fits here), I keep a sorted reference over at CheatGrid's RAG cheat sheet. It's free and there's no signup.