<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: nagi</title>
    <description>The latest articles on DEV Community by nagi (@akikisai).</description>
    <link>https://dev.to/akikisai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3879615%2Fb1f30439-1e61-46ea-81a6-3b5aaf09ca6f.jpg</url>
      <title>DEV Community: nagi</title>
      <link>https://dev.to/akikisai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/akikisai"/>
    <language>en</language>
    <item>
      <title>There Are Too Many RAG Optimization Techniques, So I Organized Them — and the Big Picture Finally Made Sense</title>
      <dc:creator>nagi</dc:creator>
      <pubDate>Wed, 15 Apr 2026 06:32:03 +0000</pubDate>
      <link>https://dev.to/akikisai/there-are-too-many-rag-optimization-techniques-so-i-organized-them-and-the-big-picture-finally-eba</link>
      <guid>https://dev.to/akikisai/there-are-too-many-rag-optimization-techniques-so-i-organized-them-and-the-big-picture-finally-eba</guid>
      <description>&lt;h2&gt;
  
  
  Why I Learned About RAG
&lt;/h2&gt;

&lt;p&gt;I came across a technique called RAG (Retrieval-Augmented Generation) and was intrigued by how search systems could use this kind of approach. It made me want to build a search app using RAG myself. So I went through a fairly detailed course that covered everything from the basics to advanced techniques. It was a lot of material.&lt;/p&gt;

&lt;p&gt;Before starting, my understanding was roughly this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You turn documents into vectors, search for similar ones when a question comes in, and pass them to an LLM.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That was it.&lt;/p&gt;

&lt;p&gt;Not wrong, but as I went through the course material, optimization techniques started piling up one after another. Multi Query, RAG-Fusion, HyDE, Decomposition, Step Back, RAPTOR, ColBERT, CRAG, Self-RAG... close to ten just by name.&lt;/p&gt;

&lt;p&gt;Honestly, partway through I couldn't keep track of what made each one different.&lt;/p&gt;

&lt;p&gt;So I decided to write it all up. As an engineer who eventually wants to implement these, I wanted to organize what each technique does in my own words.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Basics of RAG: Four Stages
&lt;/h2&gt;

&lt;p&gt;The RAG pipeline itself is straightforward.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4wconir90hd6fg1ymwyi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4wconir90hd6fg1ymwyi.png" alt="Data Preparation to" width="800" height="71"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data Preparation&lt;/strong&gt;: Gather documents (PDFs, web pages, databases, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Index Construction&lt;/strong&gt;: Split into chunks → embed → store in a vector database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval&lt;/strong&gt;: Vectorize the user's question and run a similarity search&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generation&lt;/strong&gt;: Pass the search results + question to an LLM to generate an answer&lt;/li&gt;
&lt;/ol&gt;
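
&lt;p&gt;The four stages can be sketched end to end in a few dozen lines. This is a deliberately toy version: the &lt;code&gt;embed&lt;/code&gt; function is a stand-in letter-frequency hack (a real system would call an embedding model), and the LLM is passed in as a plain function.&lt;/p&gt;

```python
import math

def embed(text):
    # Toy embedding: a 26-dim letter-frequency vector. A real system
    # would call an embedding model (OpenAI, sentence-transformers, ...).
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha() and ch.isascii():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def answer(question, chunks, llm):
    # Stage 2: index construction (embed every chunk)
    index = [(chunk, embed(chunk)) for chunk in chunks]
    # Stage 3: retrieval (embed the question, take the closest chunk)
    q_vec = embed(question)
    best_chunk, _ = max(index, key=lambda pair: cosine(q_vec, pair[1]))
    # Stage 4: generation (context + question go to the LLM)
    return llm(f"Answer using this context:\n{best_chunk}\n\nQuestion: {question}")
```

&lt;p&gt;Swap the stubs for a real embedding model and a real LLM client and you have the minimal pipeline that everything below builds on.&lt;/p&gt;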

&lt;p&gt;So far, nothing surprising. The chunking step (splitting documents into smaller blocks) has several strategies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fixed-length&lt;/strong&gt;: Mechanically split at, say, every 300 characters. Simplest to implement, but you risk cutting mid-sentence and losing meaning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sentence-based&lt;/strong&gt;: Split at natural boundaries like periods or newlines. Preserves meaningful units better, but chunk sizes can vary a lot&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sliding window&lt;/strong&gt;: Split with some overlap between adjacent chunks. Information near boundaries is less likely to be lost, so this is the most commonly used approach in practice&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The recommended chunk size is around 200–500 tokens, with 10–20% overlap. How you split can change retrieval accuracy significantly — it's a subtle but important step.&lt;/p&gt;
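
&lt;p&gt;A sliding-window splitter fits in a few lines. Here's a minimal character-based sketch (production code would usually count tokens rather than characters, e.g. with a tokenizer):&lt;/p&gt;

```python
def sliding_window_chunks(text, chunk_size=300, overlap=45):
    # Advance by chunk_size minus overlap, so adjacent chunks share
    # `overlap` characters (45/300 = 15%, inside the 10-20% guideline).
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
    return chunks
```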

&lt;h3&gt;
  
  
  Why Vectorize?
&lt;/h3&gt;

&lt;p&gt;Splitting text into chunks isn't enough on its own, because machines can't understand the chunks' "meaning" as-is. So each chunk gets converted into a numerical vector (something like &lt;code&gt;[0.12, 0.34, 0.56, ...]&lt;/code&gt;). This is called embedding.&lt;/p&gt;

&lt;p&gt;What makes embedding useful is that even if the wording differs — like "cold medicine" vs. "flu medication" — vectors for semantically similar text end up close together. Instead of exact keyword matching, you can search by "closeness in meaning." The metric used to measure this closeness is cosine similarity.&lt;/p&gt;
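
&lt;p&gt;Cosine similarity itself is just the dot product of the two vectors divided by the product of their lengths. A toy example with made-up 3-dimensional embeddings (real ones have hundreds of dimensions):&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical embeddings, invented for illustration:
cold_medicine  = [0.8, 0.1, 0.3]
flu_medication = [0.7, 0.2, 0.4]
tax_return     = [0.1, 0.9, 0.1]

print(cosine_similarity(cold_medicine, flu_medication))  # close to 1
print(cosine_similarity(cold_medicine, tax_return))      # much lower
```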

&lt;p&gt;These vectors are stored and searched in a vector database (Pinecone, Milvus, FAISS, etc.). A regular relational database is great at exact lookups like "find the record with ID=123," but it can't do "find the most similar vector." Vector databases use approximate nearest-neighbor algorithms like HNSW and IVF to find similar items among millions of vectors in milliseconds.&lt;/p&gt;

&lt;p&gt;That's the foundation. The challenge starts here.&lt;/p&gt;

&lt;h2&gt;
  
  
  Too Many Optimization Techniques
&lt;/h2&gt;

&lt;p&gt;From the middle of the course, techniques for improving retrieval accuracy started coming in fast. This was the most confusing part.&lt;/p&gt;

&lt;p&gt;But after organizing them, I realized everything falls into two categories:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Make the query smarter before searching"&lt;/strong&gt; or &lt;strong&gt;"Filter the results smarter after searching."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Just framing it that way made things much clearer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Query-Side Techniques: Don't Just Search With the Raw Question
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Technique&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;th&gt;One-liner&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Multi Query&lt;/td&gt;
&lt;td&gt;Rephrase a single question multiple ways and search in parallel&lt;/td&gt;
&lt;td&gt;Prevents misses from wording variations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG-Fusion&lt;/td&gt;
&lt;td&gt;Multi Query + merge results using RRF scores&lt;/td&gt;
&lt;td&gt;Rank-based fusion, like a voting system&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Decomposition&lt;/td&gt;
&lt;td&gt;Break a complex question into sub-questions and search individually&lt;/td&gt;
&lt;td&gt;Works well for "compare A and B" type queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Step Back&lt;/td&gt;
&lt;td&gt;Abstract a specific question one level up before searching&lt;/td&gt;
&lt;td&gt;"X in Jan 2024" → "X trends over time"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HyDE&lt;/td&gt;
&lt;td&gt;Have an LLM write a hypothetical answer, then search using that answer&lt;/td&gt;
&lt;td&gt;Closes the vector distance gap between questions and documents&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The common idea is this: &lt;strong&gt;a user's question is usually not optimized for search.&lt;/strong&gt; So you use an LLM to transform it into a more search-friendly form before running the query.&lt;/p&gt;

&lt;p&gt;The most interesting one for me was &lt;strong&gt;HyDE&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Questions and answers, even on the same topic, have very different writing styles. Questions are short; answers are long and explanatory — so they end up far apart in vector space. HyDE has the LLM write a "hypothetical answer," which produces a vector much closer to the real answer. The idea is: even if the hypothetical answer is wrong, its "shape" helps the search find the right documents. That was a real shift in perspective.&lt;/p&gt;
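
&lt;p&gt;A sketch makes it clear how little HyDE actually changes: only what gets embedded. The &lt;code&gt;llm&lt;/code&gt;, &lt;code&gt;embed&lt;/code&gt;, and &lt;code&gt;vector_search&lt;/code&gt; arguments here are placeholders, not any particular library's API:&lt;/p&gt;

```python
def hyde_search(question, llm, embed, vector_search, top_k=5):
    # 1. Have the LLM write a hypothetical answer. It may be factually
    #    wrong; only its style and vocabulary matter.
    hypothetical = llm(
        f"Write a short passage that answers this question:\n{question}"
    )
    # 2. Embed the hypothetical answer INSTEAD of the question, so the
    #    query vector lives in "answer space", near real documents.
    query_vector = embed(hypothetical)
    # 3. Ordinary similarity search from there.
    return vector_search(query_vector, top_k=top_k)
```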

&lt;p&gt;The &lt;strong&gt;RRF (Reciprocal Rank Fusion)&lt;/strong&gt; in &lt;strong&gt;RAG-Fusion&lt;/strong&gt; was also eye-opening. It merges multiple search result lists, but instead of a simple majority vote, it sums the reciprocal of each document's rank across lists. Documents that appear frequently &lt;em&gt;and&lt;/em&gt; rank highly win. Think of it like asking several friends for restaurant recommendations and combining their rankings into a final list.&lt;/p&gt;
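
&lt;p&gt;RRF is only a few lines. Each document's fused score is the sum of 1 / (k + rank) over every result list it appears in, where k is a smoothing constant (60 in the original paper):&lt;/p&gt;

```python
def reciprocal_rank_fusion(result_lists, k=60):
    # result_lists: one ranked list of document IDs per query variant.
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

&lt;p&gt;Because only ranks are used, scores from different retrievers never need to be normalized against each other, which is also why RRF works as the merger in hybrid search.&lt;/p&gt;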

&lt;p&gt;One caveat with HyDE: if the LLM doesn't know much about the domain, the hypothetical answer will be garbage, and so will the search. It also adds an extra LLM call (&lt;code&gt;+200–500ms&lt;/code&gt; latency), so you need to be selective about when to use it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Result-Side Techniques: Filter What Comes Back
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Technique&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;th&gt;One-liner&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Re-ranking&lt;/td&gt;
&lt;td&gt;Broad retrieval (Top-50) → precise model re-scores (Top-5)&lt;/td&gt;
&lt;td&gt;Same idea as rough filtering → precise scoring in recommendation systems&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Routing&lt;/td&gt;
&lt;td&gt;Route questions to different data sources by type&lt;/td&gt;
&lt;td&gt;Switch between vector DB / SQL / web search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hybrid Search&lt;/td&gt;
&lt;td&gt;Vector search + BM25 keyword search&lt;/td&gt;
&lt;td&gt;Catches proper nouns and numbers that vectors miss&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Re-ranking&lt;/strong&gt; seemed like the best bang for the buck. The approach is simple: first, pull ~50 results with vector search, then use a precise model like CrossEncoder or Cohere Rerank to narrow it down to 5. According to the course material, Cohere Rerank alone improved NDCG@10 by 36%, at a cost of $0.002 per query. High impact for low effort — it's often recommended as the first optimization to add.&lt;/p&gt;
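
&lt;p&gt;The two-stage shape is straightforward to sketch. &lt;code&gt;cross_encoder_score&lt;/code&gt; below is a placeholder for a real scorer such as a CrossEncoder model or a rerank API:&lt;/p&gt;

```python
def rerank(question, candidates, cross_encoder_score, top_k=5):
    # Stage 1 happened upstream: `candidates` are the ~50 documents a
    # cheap vector search pulled back.
    # Stage 2: score each (question, document) pair with a precise but
    # slower model, then keep only the best few.
    scored = sorted(
        candidates,
        key=lambda doc: cross_encoder_score(question, doc),
        reverse=True,
    )
    return scored[:top_k]
```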

&lt;p&gt;&lt;strong&gt;Hybrid search&lt;/strong&gt; compensates for vector search's weaknesses using keyword search (BM25). Vector search is great at finding "semantically similar" content, but it struggles with exact matches for things like "October 15, 2023" or "Model A-7B." By combining BM25 keyword matching and merging results with RRF, you get the best of both worlds. This is considered a best practice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Routing&lt;/strong&gt; directs questions to different data sources depending on their type. At first I thought, "Can't we just use the vector DB for everything?" But for a query like "Top 10 sales in 2023," SQL gives a far more accurate result. For questions about the latest news, a web search makes more sense. The implementation uses an LLM to classify the question and route it to the right source; around 5–20 routing categories is considered a workable range.&lt;/p&gt;
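
&lt;p&gt;A minimal router is just classify-then-dispatch. The &lt;code&gt;classify&lt;/code&gt; argument stands in for an LLM call prompted to return exactly one label:&lt;/p&gt;

```python
def route(question, classify, handlers, default="vector_db"):
    # `classify` is typically an LLM prompted to answer with exactly one
    # label, e.g. "vector_db", "sql", or "web_search".
    label = classify(question)
    handler = handlers.get(label, handlers[default])
    return handler(question)
```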

&lt;h2&gt;
  
  
  The Most Interesting Takeaway: Using LLMs as "Tools" Throughout the Pipeline
&lt;/h2&gt;

&lt;p&gt;This is where my understanding changed the most.&lt;/p&gt;

&lt;p&gt;Within a RAG pipeline, the LLM isn't just "the thing that generates the final answer." It plays many different roles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An LLM that rephrases queries (Multi Query)&lt;/li&gt;
&lt;li&gt;An LLM that decomposes questions (Decomposition)&lt;/li&gt;
&lt;li&gt;An LLM that generates hypothetical answers (HyDE)&lt;/li&gt;
&lt;li&gt;An LLM that evaluates retrieval relevance (CRAG)&lt;/li&gt;
&lt;li&gt;An LLM that self-checks whether the generation is accurate (Self-RAG)&lt;/li&gt;
&lt;li&gt;An LLM that creates document summaries (Multi-representation)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Multiple LLMs with different roles appear within a single pipeline. And depending on the role, you might use a "small model (fast, cheap)" for some tasks and a "large model (high accuracy)" for others.&lt;/p&gt;

&lt;p&gt;Going from "LLM = all-purpose answer machine" to "LLM = a component in the pipeline" was the biggest mental shift for me. From a system design perspective as an engineer, this framing feels much more natural. I think this way of thinking can apply to other AI-related designs as well.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advanced Research-Level Techniques (Brief Overview)
&lt;/h2&gt;

&lt;p&gt;The latter part of the course covered more research-oriented techniques. I may not use them immediately, but I want to at least know the keywords, so here are my notes.&lt;/p&gt;

&lt;h3&gt;
  
  
  RAPTOR: Recursively builds a summary tree from documents
&lt;/h3&gt;

&lt;p&gt;Chunk the documents → soft-cluster with GMM → summarize each cluster with an LLM → re-cluster the summaries → repeat to build a hierarchical tree.&lt;/p&gt;

&lt;p&gt;Its strength is handling both specific questions like "Who attended the 1956 Dartmouth Conference?" and abstract ones like "What were the key stages of AI development?" For retrieval, "flattening" (searching all levels at once) is recommended.&lt;/p&gt;

&lt;p&gt;The trade-off: construction cost is ~7x higher than standard approaches, and storage increases by ~80%. It shines when dealing with large volumes of long documents.&lt;/p&gt;

&lt;h3&gt;
  
  
  ColBERT: Multi-vector search at the token level
&lt;/h3&gt;

&lt;p&gt;Instead of the traditional one-document-one-vector approach, ColBERT assigns an individual vector to each token. The MaxSim algorithm sums the maximum similarity between each query token and each document token to produce a score.&lt;/p&gt;
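
&lt;p&gt;MaxSim is compact to write down: for each query token vector, take its best match among the document's token vectors, then sum those maxima (using dot product as the similarity here):&lt;/p&gt;

```python
def maxsim_score(query_vecs, doc_vecs):
    # For every query token, find the single most similar document
    # token, then sum those per-token maxima into one document score.
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)
```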

&lt;p&gt;Accuracy improves by 17–20%, but storage goes up 12x and memory 24x. ColBERT v2 introduced residual compression, reducing index size to about 1/6. At scale, a practical approach is to first do a rough filter with standard vector search, then use ColBERT for precise scoring.&lt;/p&gt;

&lt;h3&gt;
  
  
  CRAG: Classifies retrieval results as Correct / Ambiguous / Incorrect and handles each case
&lt;/h3&gt;

&lt;p&gt;An LLM evaluates the retrieval results → if correct, use them as-is; if incorrect, fall back to web search; if ambiguous, reorganize and supplement the knowledge.&lt;/p&gt;

&lt;p&gt;It reduced hallucination rates from 15% to 4% in a medical domain. Best suited for high-risk domains (healthcare, legal, finance). Latency increases by ~140%, so it's a trade-off with real-time requirements.&lt;/p&gt;
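
&lt;p&gt;The interesting part of CRAG is the control flow, which is easy to sketch. &lt;code&gt;grade&lt;/code&gt;, &lt;code&gt;web_search&lt;/code&gt;, and &lt;code&gt;refine&lt;/code&gt; below are placeholders for an LLM evaluator and the two fallback paths:&lt;/p&gt;

```python
def corrective_retrieve(question, documents, grade, web_search, refine):
    # grade() is an LLM prompted to label the retrieved documents as
    # "correct", "ambiguous", or "incorrect" for this question.
    label = grade(question, documents)
    if label == "correct":
        return documents                 # use retrieval as-is
    if label == "incorrect":
        return web_search(question)      # discard and fall back
    # "ambiguous": keep what we have, but clean it up and supplement it
    return refine(documents) + web_search(question)
```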

&lt;h3&gt;
  
  
  Self-RAG: Self-evaluates during generation using reflection tokens
&lt;/h3&gt;

&lt;p&gt;The model uses special tokens (Reflection Tokens) to self-judge: "Is retrieval needed?" "Is this document relevant?" "Is the generated content grounded in facts?" It achieves even higher accuracy than CRAG.&lt;/p&gt;

&lt;p&gt;However, training costs are $50K–$100K, and currently only 7B and 13B models are available. Inference latency is the slowest of all approaches. This feels like a heavily research-oriented technique for now.&lt;/p&gt;

&lt;p&gt;All of these are interesting, and the approaches are well-reasoned. But for small-scale development, they seem like clear overkill. The right move is probably to keep them in your back pocket and pull them out when retrieval accuracy becomes a bottleneck.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Actually Use as an Engineer
&lt;/h2&gt;

&lt;p&gt;Based on what I learned, here's what I'd want to adopt as I build things out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Basic RAG&lt;/strong&gt; (chunking + embedding + vector DB + LLM generation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid search&lt;/strong&gt; (vector + BM25) → prevents misses on proper nouns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Re-ranking&lt;/strong&gt; → best cost-to-accuracy improvement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Routing&lt;/strong&gt; → when supporting multiple data sources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On the other hand, I'm choosing not to use RAPTOR, ColBERT, or Self-RAG for now. I just can't see a clear case where I'd need to go that far yet. But there may come a time when they're needed, so understanding &lt;em&gt;why&lt;/em&gt; they'd be needed feels worth keeping in mind.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Became Clear After Organizing All of This
&lt;/h2&gt;

&lt;p&gt;RAG isn't as simple as "search + generate." There are optimization opportunities at every stage, and the sheer number of techniques can feel overwhelming.&lt;/p&gt;

&lt;p&gt;But after organizing everything, it really comes down to two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Make the query smarter before searching&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Filter the results smarter after searching&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And the "making it smarter" part is where LLMs are used as tools.&lt;/p&gt;

&lt;p&gt;Rather than memorizing technique names, understanding this structure seems more useful in the long run. When a new technique comes along, you can quickly classify it: "This is a query-side improvement" or "This is result-side filtering."&lt;/p&gt;

&lt;p&gt;I'd like to dig deeper into the advanced techniques when I have time. CRAG's "evaluate → fallback" pattern in particular feels like an idea that could apply well beyond RAG, to other system designs too.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>llm</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
