The Same Question, Completely Different Results
Vector retrieval has a fragility that's easy to overlook: rephrase the same question, and the results can change dramatically.
"How does the BGE model perform on Chinese text?" and "Which embedding is recommended for Chinese?" are semantically near-identical — but their embedding vectors sit at different positions in high-dimensional space, often returning different document sets entirely.
This is a structural property of Bi-Encoders: query and document are each encoded without knowing the other exists, making the result sensitive to subtle phrasing differences.
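A minimal sketch of the effect, assuming the langchain-huggingface package and the BAAI/bge-small-zh-v1.5 model (both illustrative choices, not necessarily this article's exact setup): embed the two phrasings and measure how far apart they land.

```python
# Illustrative sketch: two paraphrases of the same question embed to
# noticeably different points, so they probe different neighborhoods of the index.
import numpy as np
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-zh-v1.5")  # assumed model

q1 = "How does the BGE model perform on Chinese text?"
q2 = "Which embedding is recommended for Chinese?"

v1 = np.array(embeddings.embed_query(q1))
v2 = np.array(embeddings.embed_query(q2))
cosine = float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
print(f"cosine similarity between the two phrasings: {cosine:.3f}")
# Any gap from 1.0 means the two queries search from different positions,
# so their top-K result sets can diverge.
```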
Previous articles optimized the document side — better chunking strategies help documents get found. This article works on the query side: transform the question itself before it touches the vector index, so retrieval is more stable and more complete.
Three strategies:
- Multi-Query: Generate multiple phrasings, retrieve from each, merge results
- HyDE: Generate a hypothetical answer first, then retrieve using that answer
- Query Decomposition: Break a complex question into sub-questions and retrieve each independently
Multi-Query: Multiple Angles, Wider Recall
Core Idea
A single query maps to a single point in the vector space. That point might happen to be far from some relevant documents. Search from multiple directions and you cover a larger area.
original query → LLM rewrite → [phrasing 1, phrasing 2, phrasing 3]
↓
retrieve each, merge, deduplicate
↓
return top-K results
Implementation
from langchain_classic.retrievers import MultiQueryRetriever
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

# llm, vectorstore, base_retriever and TOP_K are assumed to be initialized
# earlier in the script; dedup_docs is a small helper (a sketch follows below).

MULTI_QUERY_PROMPT = ChatPromptTemplate.from_messages([
    ("system", "You are a question-rewriting assistant."),
    ("human",
     "Rewrite the following question into 3 different phrasings "
     "from different angles, to retrieve more relevant documents "
     "from a vector database.\n"
     "Output one question per line. No numbering, no explanation.\n\n"
     "Original question: {question}"),
])

# Option A: LangChain's built-in wrapper
retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    llm=llm,
)

# Option B: Manual — full control over prompt and merge logic
multi_query_chain = MULTI_QUERY_PROMPT | llm | StrOutputParser()

def multi_query_retrieve(question: str) -> list:
    variants_text = multi_query_chain.invoke({"question": question})
    variants = [q.strip() for q in variants_text.strip().split("\n") if q.strip()]
    all_docs = base_retriever.invoke(question)  # always include the original query
    for variant in variants:
        all_docs.extend(base_retriever.invoke(variant))
    return dedup_docs(all_docs)[:TOP_K]
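The `dedup_docs` helper is part of the repo rather than the snippet above; a minimal sketch that drops duplicates by page content while preserving retrieval order might look like this:

```python
# Assumed helper (not shown in the article's snippet): deduplicate documents
# by page_content, keeping the first occurrence of each.
def dedup_docs(docs):
    seen, unique = set(), []
    for doc in docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique
```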
Best for: High-vocabulary-variance scenarios, where users describe the same concept in many ways. Cost: 3 extra retrieval calls per query (plus one LLM rewrite call).
HyDE: Search with a Fake Answer
Core Idea
HyDE (Hypothetical Document Embeddings), proposed in 2022, is built on a key observation:
A question's embedding and its answer's embedding live in different semantic spaces.
The vector index stores documents (answer space). But retrieval uses a query (question space). These two distributions don't perfectly overlap, even for semantically related content.
HyDE's fix: have the LLM generate a hypothetical answer first, then embed that instead of the question. The hypothetical answer lives in the same semantic space as real documents — it's closer to what you're looking for.
query → LLM generates hypothetical answer (~100 words) → embed the answer
↓
vector search (find nearest docs)
↓
return real documents to LLM
The hypothetical answer doesn't need to be correct — it just needs to be semantically in the right neighborhood so the embedding lands near the relevant documents.
Implementation
HYDE_PROMPT = ChatPromptTemplate.from_messages([
    ("system", "You are a technical knowledge assistant."),
    ("human",
     "Write a hypothetical answer to the following question in about 100 words. "
     "This answer will be used for vector retrieval, not shown to the user. "
     "It doesn't need to be perfectly accurate — just semantically close to "
     "what a real answer would look like.\n\n"
     "Question: {question}"),
])

hyde_chain = HYDE_PROMPT | llm | StrOutputParser()
hypothetical_answer = hyde_chain.invoke({"question": question})

# Embed the hypothetical answer, not the original question
hyp_embedding = embeddings.embed_query(hypothetical_answer)
results = vectorstore.similarity_search_by_vector(hyp_embedding, k=TOP_K)
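A common variation worth noting (my own addition, not part of the repo's code) is to blend the question embedding with the hypothetical-answer embedding, so a badly hallucinated answer cannot pull retrieval completely off course. Using the names from the snippet above:

```python
import numpy as np

# Illustrative variation: average the question and hypothetical-answer
# embeddings before searching. The 50/50 weighting is arbitrary, not a tuned value.
q_vec = np.array(embeddings.embed_query(question))
h_vec = np.array(embeddings.embed_query(hypothetical_answer))
blended = ((q_vec + h_vec) / 2.0).tolist()

results = vectorstore.similarity_search_by_vector(blended, k=TOP_K)
```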
Best for: Scenarios where question phrasing and document language diverge — user asks conversationally, document is written in technical language. Cost: one extra LLM call plus one extra embedding computation.
Query Decomposition: Break It Down
Core Idea
Some questions are inherently multi-hop:
"For a RAG system targeting Chinese, what embedding model and vector database should I choose?"
This contains two independent sub-questions:
- Which embedding model is recommended for Chinese text?
- Which vector database is best for enterprise use?
One retrieval pass trying to answer both at once will usually do justice to neither.
Query Decomposition: decompose first, retrieve each part separately, then give the LLM all the results together.
complex question → LLM decomposes → [sub-question 1, sub-question 2, ...]
↓
retrieve each independently, merge
↓
all docs passed to LLM for unified answer
Implementation
DECOMPOSE_PROMPT = ChatPromptTemplate.from_messages([
    ("system", "You are a question analysis assistant."),
    ("human",
     "Break down the following complex question into 2-3 simple sub-questions, "
     "each of which can be answered by an independent retrieval.\n"
     "Output one sub-question per line. No numbering, no explanation.\n\n"
     "Original question: {question}"),
])

decompose_chain = DECOMPOSE_PROMPT | llm | StrOutputParser()

def decompose_and_retrieve(question: str) -> list:
    sub_questions_text = decompose_chain.invoke({"question": question})
    sub_questions = [q.strip() for q in sub_questions_text.strip().split("\n") if q.strip()]
    all_docs = []
    for sub_q in sub_questions:
        all_docs.extend(base_retriever.invoke(sub_q))
    return dedup_docs(all_docs)[:TOP_K]
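The merged sub-question documents still need to be synthesized into one answer, as the flow diagram above shows. A minimal sketch of that last step (prompt wording and variable names are my own, not the repo's):

```python
# Feed the merged sub-question documents plus the ORIGINAL question to the
# LLM for a single unified answer.
ANSWER_PROMPT = ChatPromptTemplate.from_messages([
    ("system", "Answer the question using only the provided context."),
    ("human", "Context:\n{context}\n\nQuestion: {question}"),
])

docs = decompose_and_retrieve(question)
context = "\n\n".join(doc.page_content for doc in docs)
answer = (ANSWER_PROMPT | llm | StrOutputParser()).invoke(
    {"context": context, "question": question}
)
```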
Best for: Questions spanning multiple concepts or requiring synthesis across topics. Cost: one LLM decomposition call plus 2–3 retrieval calls.
Experimental Results
RAGAS Metrics Comparison (Four Query Optimization Strategies); ◀ marks the best value per metric.

| Metric | Naive | Multi-Query | HyDE | Decomposed |
|---|---|---|---|---|
| context_recall | 0.625 | 0.625 | 0.750 | 0.875 ◀ |
| context_precision | 0.583 | 0.583 | 0.726 ◀ | 0.590 |
| faithfulness | 0.833 | 0.883 | 0.946 ◀ | 0.911 |
| answer_relevancy | 0.406 | 0.412 | 0.377 | 0.474 ◀ |
Reading the numbers:
Multi-Query (context_recall = 0.625, same as Naive)
Surprising: rewriting queries didn't improve recall at all on this knowledge base. The reason: with only 8 documents, the three rewritten variants retrieve essentially the same documents as the original query, so merging and deduplication adds no increment. Multi-Query's value shows up at scale, when different phrasings genuinely reach different regions of the vector space.
HyDE (context_recall = 0.750, +0.125; context_precision = 0.726, +0.143)
Both metrics improve, and faithfulness reaches 0.946 — the highest of all four strategies. The hypothetical answer's embedding genuinely lands closer to the document space, finding more relevant documents and ranking them better. A clean win on retrieval quality.
Query Decomposition (context_recall = 0.875, +0.250)
The largest recall improvement. Breaking questions into sub-questions and retrieving each separately surfaces documents that a single query misses. The resulting document pool is more comprehensive, and faithfulness rises to 0.911 as a knock-on effect.
The Core Difference Between the Three
| | Multi-Query | HyDE | Query Decomposition |
|---|---|---|---|
| Problem solved | Phrasing sensitivity, single perspective | Query-answer semantic space mismatch | Multi-hop or multi-concept questions |
| What changes | How the question is asked (multiple phrasings) | What is used to search (answer instead of question) | What is asked (whole → parts) |
| Extra LLM calls | 1 (rewrite) | 1 (hypothetical answer) | 1 (decompose) |
| Extra retrieval calls | 3 | 0 | 2–3 |
| Top metric in this experiment | — | context_precision, faithfulness | context_recall, answer_relevancy |
| Best scenario | Large knowledge bases, conversational queries | Technical docs, large question-answer style gap | Multi-concept questions, synthesis tasks |
The key distinction is the axis of transformation:
- Multi-Query varies how to ask — same intent, different words
- HyDE varies what to search with — question becomes answer
- Query Decomposition varies what to ask — one question becomes many
Strategies Can Stack
These three are not mutually exclusive. You can combine them based on the scenario:
# Example: HyDE + Multi-Query stacked
def hybrid_retrieve(question, hyde_answer, multi_query_variants):
    # 1. Embed the hypothetical answer and retrieve with it
    hyp_embedding = embeddings.embed_query(hyde_answer)
    all_docs = vectorstore.similarity_search_by_vector(hyp_embedding, k=4)
    # 2. Also retrieve with the rewritten variants
    for variant in multi_query_variants:
        all_docs.extend(base_retriever.invoke(variant))
    return dedup_docs(all_docs)[:TOP_K]
Stacking widens the recall net further, but API call count grows proportionally. In production, adaptive selection based on query type tends to outperform blanket stacking — reserve the heavier strategies for queries that actually need them.
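As a sketch of what adaptive selection could look like (the routing heuristics below are illustrative assumptions, not rules from the article), reusing the chains and helpers defined above:

```python
def route_query(question: str):
    # Cheap path: short, keyword-like queries go straight to the base retriever.
    if len(question.split()) <= 4:
        return base_retriever.invoke(question)
    # Rough heuristic for multi-concept questions: send them to decomposition.
    if " and " in question or "," in question:
        return decompose_and_retrieve(question)
    # Everything else goes through HyDE.
    hypothetical_answer = hyde_chain.invoke({"question": question})
    hyp_embedding = embeddings.embed_query(hypothetical_answer)
    return vectorstore.similarity_search_by_vector(hyp_embedding, k=TOP_K)
```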
Full Code
Complete code is open-sourced at:
https://github.com/chendongqi/llm-in-action/tree/main/13-query-optimization
Key file:

- `query_optimization.py` — Full four-strategy comparison experiment
How to run:
git clone https://github.com/chendongqi/llm-in-action
cd llm-in-action/13-query-optimization
cp .env.example .env
pip install -r requirements.txt
python query_optimization.py
Summary
This article benchmarked three query optimization strategies against a naive baseline:
- Multi-Query — Helps when vocabulary is varied and the knowledge base is large; on small knowledge bases the redundancy means no gain
- HyDE — Bridges the question-answer semantic gap by searching with a generated answer; best improvement in ranking quality (context_precision +0.143, faithfulness +0.113)
- Query Decomposition — Handles multi-hop questions by splitting and retrieving independently; strongest improvement in recall (context_recall +0.250)
All three optimizations happen before the query touches the vector index — no changes to chunking, no changes to embeddings, no reindexing required. They are among the cheapest wins available in a RAG system.