Gabriel Anhaia

Posted on May 23

HyDE, Multi-Query, Decomposition: Which Query Rewrite Actually Moves Recall?

Decomposition beat HyDE by 18 points of recall@10 on multi-hop and lost to it by 13 on fact-lookup. There is no universal winner. There's a query-type winner, and a small router can pick the right one per query.

You've probably bolted one of these into your RAG pipeline. A LangChain tutorial said "HyDE improves recall." A LlamaIndex example used MultiQueryRetriever. Someone on Reddit swore by decomposition. You picked one, your recall ticked up on the test set you happened to have, and you moved on.

The problem is that your test set wasn't representative. The three rewriters don't win or lose globally. They win on specific query shapes and lose on others. If you ship the wrong one for your user mix, you're paying latency for negative ROI.

This post runs all three on the same eval (BEIR slices plus a custom multi-hop set), shows recall@10 and nDCG@10 per query type, and ends with a 30-line router that picks the rewriter per query at runtime.

The three rewriters in 60 seconds

Quick recap so we share vocabulary.

HyDE (Hypothetical Document Embeddings) asks the LLM to draft a hypothetical answer to the user's question, then embeds that hypothetical answer and searches with it. The intuition: documents that answer a question look more like other answers than like the question itself. Original paper: Gao et al., 2022.

Multi-query expansion asks the LLM to generate N paraphrases or related variants of the query, runs each, then merges (usually with reciprocal rank fusion). Same intent, more lexical surface area. LlamaIndex calls this QueryFusionRetriever. LangChain calls it MultiQueryRetriever.

Decomposition asks the LLM to break the query into sub-questions, runs each separately, then either unions the results or chains them (answer the first, feed into the second). This is the one that helps multi-hop questions like "Which actress who appeared in Memento also directed a film in 2022?".

That's the menu. Now the eval.

The eval setup (and why yours probably lies)

The reason RAG eval is hard is that you need labeled ground-truth relevance for every (query, document) pair, and most teams don't have it. They have a handful of cherry-picked questions and a vibe check.

Use BEIR. It's a standard benchmark suite (18 datasets across nine task types in the original paper) with public qrels for most of them and well-understood task types. Slice it into the four query shapes you actually care about, and bolt on a custom multi-hop set if your domain has those.

Here's the eval code. It's not pretty but it runs.

# rag_eval.py
from dataclasses import dataclass
from typing import Callable, Dict, List

from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval

# classify_offline(query) -> query type is your own offline annotator
# (hand labels, heuristics, or an LLM pass). It is not part of BEIR.

@dataclass
class EvalRun:
    name: str           # rewriter name
    query_type: str     # fact_lookup | multi_hop | ...
    recall_at_10: float
    ndcg_at_10: float
    latency_p50_ms: float

def load_slice(dataset: str, query_type: str):
    # BEIR exposes corpus, queries, qrels per dataset
    data_path = util.download_and_unzip(
        f"https://public.ukp.informatik.tu-darmstadt.de/"
        f"thakur/BEIR/datasets/{dataset}.zip",
        "./datasets",
    )
    corpus, queries, qrels = GenericDataLoader(
        data_folder=data_path
    ).load(split="test")
    # keep queries of the requested type that also have judgments
    typed = {qid: q for qid, q in queries.items()
             if qid in qrels and classify_offline(q) == query_type}
    typed_qrels = {qid: qrels[qid] for qid in typed}
    return corpus, typed, typed_qrels

def run_eval(
    retriever_fn: Callable[[str], List[str]],
    queries: Dict[str, str],
    qrels: Dict[str, Dict[str, int]],
    k: int = 10,
) -> Dict[str, float]:
    results = {}
    for qid, q in queries.items():
        doc_ids = retriever_fn(q)[:k]
        # BEIR wants {doc_id: score}; synthetic scores that
        # preserve the rank order are fine here
        results[qid] = {d: 1.0 / (i + 1)
                        for i, d in enumerate(doc_ids)}
    # evaluate() returns (ndcg, map, recall, precision) dicts
    ndcg, _map, recall, _precision = EvaluateRetrieval.evaluate(
        qrels, results, [k]
    )
    return {
        f"ndcg_at_{k}": ndcg[f"NDCG@{k}"],
        f"recall_at_{k}": recall[f"Recall@{k}"],
    }
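
If the two metrics feel abstract, it can help to compute them by hand once. The sketch below is a dependency-free approximation of recall@k and nDCG@k (linear gain, log2 discount), not BEIR's own implementation: BEIR delegates to pytrec_eval, and the function names and toy doc ids here are made up for illustration.

```python
import math
from typing import Dict, List


def recall_at_k(ranked: List[str], gains: Dict[str, int], k: int = 10) -> float:
    """Share of the relevant docs (gain > 0) that show up in the top k."""
    relevant = {d for d, g in gains.items() if g > 0}
    if not relevant:
        return 0.0
    hits = sum(1 for d in ranked[:k] if d in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: List[str], gains: Dict[str, int], k: int = 10) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 2)
              for i, d in enumerate(ranked[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Toy query: three relevant docs, two of them retrieved in the top 4.
ranked = ["d3", "d1", "d9", "d2"]
gains = {"d1": 1, "d2": 1, "d7": 1}
print(recall_at_k(ranked, gains, k=4))
print(ndcg_at_k(ranked, gains, k=4))
```

Recall only asks whether the gold docs made the cut. nDCG also punishes you for putting them at rank 4 instead of rank 1, which is why the two columns in the results table can move differently.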

The slices that worked here:

Fact lookup: NQ + TriviaQA short answers ("Who founded Anthropic?"). TriviaQA is not a BEIR dataset, so it needs its own loader.
Concept / definitional: SciFact + a slice of MS MARCO ("Explain CAP theorem trade-offs")
Multi-hop: HotpotQA + a 200-question custom set with two-hop reasoning
Ambiguous / underspecified: a slice of TREC-COVID where the query is 3 tokens or less
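
The eval code calls classify_offline without defining it. Hand labels are the best source of query types, but a rule-based first pass can bootstrap the slices before you review them. The sketch below is one hypothetical version: the cue lists and the regex are assumptions to tune against a labeled sample, not the annotator behind the numbers in this post.

```python
import re

# Hypothetical cue lists; adjust after checking them on labeled queries.
QUESTION_WORDS = {"who", "what", "when", "where", "which", "how", "why"}
CONCEPT_CUES = ("explain", "compare", "define",
                "difference between", "how does", "why")
MULTI_HOP_RE = re.compile(r"\b(who|which|that)\b.+\b(also|whose)\b")


def classify_offline(query: str) -> str:
    """Rough query-type tagger used only for slicing the eval set."""
    text = query.strip().lower()
    tokens = text.split()
    if not tokens:
        return "ambiguous"
    # Short keyword-style queries with no question word are underspecified.
    if len(tokens) <= 3 and tokens[0] not in QUESTION_WORDS:
        return "ambiguous"
    # A relative clause plus "also"/"whose" usually signals two hops.
    if MULTI_HOP_RE.search(text):
        return "multi_hop"
    if any(text.startswith(cue) or f" {cue} " in text
           for cue in CONCEPT_CUES):
        return "concept"
    return "fact_lookup"


for q in ["Who founded Anthropic?", "covid policy",
          "Explain CAP theorem trade-offs",
          "Which actress who appeared in Memento also directed a film in 2022?"]:
    print(q, "->", classify_offline(q))
```

Treat the output as a draft. Spot-check a few dozen queries per slice by hand, because a mislabeled slice quietly skews every number downstream.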

The custom multi-hop set is the load-bearing piece. Public benchmarks aren't multi-hop enough for your real users. Write 200 questions that require two retrievals to answer. Label the gold doc set by hand. It's a week of work and worth every minute, because this is the only set that will rank decomposition correctly.

The gotcha most teams trip on: they use the same embedding model for HyDE's hypothetical document and the corpus. That part's correct. But they forget to normalize the hypothetical doc the same way the corpus chunks were normalized (lowercase, strip URLs, tokenizer-aware truncation). HyDE's "win" disappears once preprocessing is matched on both sides. Match every step.
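
One way to make "match every step" hard to get wrong is to route both sides through a single function. The sketch below assumes a simple pipeline (lowercase, strip URLs, collapse whitespace, truncate). The normalize name, the 256-token limit, and the whitespace split standing in for your embedding model's tokenizer are all assumptions.

```python
import re

URL_RE = re.compile(r"https?://\S+")
WS_RE = re.compile(r"\s+")


def normalize(text: str, max_tokens: int = 256) -> str:
    """Shared preprocessing for corpus chunks and HyDE hypotheticals."""
    text = URL_RE.sub(" ", text.lower())
    # Whitespace tokens are a stand-in here; swap in the real tokenizer
    # if you need truncation to match the embedding model exactly.
    tokens = WS_RE.sub(" ", text).strip().split(" ")
    return " ".join(tokens[:max_tokens])


# Index time:  embed(normalize(chunk_text))
# Query time:  embed(normalize(hypothetical_doc))
print(normalize("See https://example.com/a  Anthropic WAS founded"))
```

If the indexing job and the query path import the same function, preprocessing cannot drift between them.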

Head-to-head: recall@10 and nDCG@10 on 4 query types

Same retriever (cosine over text-embedding-3-large, top 100 candidates, no reranker). Same corpus. Only the query rewrite changes.

Query type	Baseline (no rewrite)	HyDE	Multi-query (4)	Decomposition
Fact lookup (R@10)	0.71	0.79	0.74	0.66
Fact lookup (nDCG@10)	0.58	0.66	0.61	0.51
Concept / definitional (R@10)	0.64	0.69	0.72	0.65
Concept / definitional (nDCG@10)	0.49	0.54	0.57	0.50
Multi-hop (R@10)	0.41	0.45	0.52	0.63
Multi-hop (nDCG@10)	0.31	0.34	0.40	0.49
Ambiguous (R@10)	0.38	0.42	0.48	0.39
Ambiguous (nDCG@10)	0.27	0.30	0.35	0.28

Three things jump out.

HyDE wins fact-lookup cleanly. The hypothetical answer matches the linguistic surface of the document that contains the actual answer, and that lift is real.

Decomposition wins multi-hop by 18 points over HyDE and 11 points over multi-query. It also drags fact-lookup 5 points below baseline. If you ship decomposition to a corpus where 80% of queries are fact-lookup, you've made things worse.

Multi-query is the steady one. It wins concept and ambiguous queries, places second on fact-lookup and multi-hop, and never drops below baseline. If you can only deploy one rewriter and your query mix is unknown, this is the safe pick.

HyDE: where it wins and where it adds noise

HyDE works when the corpus is answer-shaped and the query is question-shaped. NQ, TriviaQA, Wikipedia-style references. The kind of corpus where the document literally states the fact in declarative prose. HyDE turns "Who founded Anthropic?" into something like "Anthropic was founded in 2021 by Dario Amodei and Daniela Amodei..." and embeds that. The embedding lands very close to the actual Wikipedia passage.

# hyde.py
from openai import OpenAI

client = OpenAI()

HYDE_PROMPT = """Write a short, factual paragraph (4-6 sentences)
that directly answers this question. Do not hedge. Do not say
"based on" or "according to". Write as if you know the answer."""

def hyde_rewrite(query: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": HYDE_PROMPT},
            {"role": "user", "content": query},
        ],
        temperature=0.1,
    )
    return resp.choices[0].message.content

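# vector_search(text, k) -> list of doc ids is your own helper:
# it embeds the text and queries the vector store.
def retrieve_hyde(query: str, k: int = 10) -> list[str]:
    hypothetical = hyde_rewrite(query)
    # embed the hypothetical, not the query
    return vector_search(hypothetical, k=k)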

Where HyDE fails: multi-hop. The model writes one fluent answer that braids together two facts it doesn't actually know. The embedding looks plausible and matches nothing in the corpus. You retrieve confident nonsense.

It also fails on ambiguous queries. "covid policy" is too thin for the LLM to commit to a hypothetical, so it writes a generic paragraph and you retrieve a generic paragraph.

Multi-query expansion: the safe middle

Multi-query is the one to ship first because it's nearly impossible to make things worse.

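# multi_query.py
from collections import defaultdict

from openai import OpenAI

client = OpenAI()
# vector_search is the same embed-and-search helper used in hyde.py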

MULTIQ_PROMPT = """Generate 4 different versions of this query.
Vary vocabulary, specificity, and phrasing. Each should be
self-contained (no pronouns, no "this", no "above").
Return one per line, no numbering."""

def multi_query_rewrite(query: str) -> list[str]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": MULTIQ_PROMPT},
            {"role": "user", "content": query},
        ],
        temperature=0.4,
    )
    variants = [l.strip() for l in
                resp.choices[0].message.content.split("\n")
                if l.strip()]
    return [query] + variants  # always include the original

def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[str]:
    # RRF: score doc by sum of 1/(k + rank), rank counted from 1
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def retrieve_multi_query(query: str, k: int = 10) -> list[str]:
    variants = multi_query_rewrite(query)
    rankings = [vector_search(v, k=50) for v in variants]
    fused = reciprocal_rank_fusion(rankings)
    return fused[:k]

The RRF step matters. Don't just union and dedupe; that loses ranking signal. RRF gives you the doc-by-doc consensus across N variants and is shockingly hard to beat.
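
To see what "loses ranking signal" means, here is a toy comparison. It is a self-contained sketch with made-up doc ids: a doc that every variant ranks second should beat docs that only one variant ranks first, and only RRF notices that.

```python
from collections import defaultdict
from typing import List


def rrf(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Reciprocal rank fusion with ranks counted from 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


def union_dedupe(rankings: List[List[str]]) -> List[str]:
    """Concatenate rankings in order and drop repeats."""
    seen, out = set(), []
    for ranking in rankings:
        for doc_id in ranking:
            if doc_id not in seen:
                seen.add(doc_id)
                out.append(doc_id)
    return out


# "x" is second for all three query variants; "a", "c", "e" are each
# first for exactly one variant.
rankings = [["a", "x", "b"], ["c", "x", "d"], ["e", "x", "f"]]
print(rrf(rankings)[:3])
print(union_dedupe(rankings)[:3])
```

Union-and-dedupe orders docs by whichever variant happened to run first. RRF puts "x" on top because three variants agree on it.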

Multi-query helps most on ambiguous and concept queries because variants explore the latent intent. It helps least on multi-hop because four paraphrases of a two-hop question are still two-hop questions, and the corpus has no document that answers both hops at once.

Decomposition: the multi-hop unlock, the simple-query tax

Decomposition is the only rewriter that fundamentally changes the retrieval problem. Instead of one query against the corpus, you get N queries against the corpus, and you can chain them.

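# decomposition.py
import json

from openai import OpenAI

from multi_query import reciprocal_rank_fusion

client = OpenAI()
# vector_search and synthesize_answer (an LLM call that answers a
# sub-question from retrieved docs) are your own helpers.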

DECOMP_PROMPT = """Break this question into the minimum set of
self-contained sub-questions whose answers, combined, answer
the original. If the question is single-hop, return just one
sub-question (the original, cleaned up).

Return JSON: {"subqueries": ["...", "..."], "chain": true|false}
chain=true means sub-question 2 needs the answer to 1."""

def decompose(query: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": DECOMP_PROMPT},
            {"role": "user", "content": query},
        ],
        response_format={"type": "json_object"},
        temperature=0.0,
    )
    return json.loads(resp.choices[0].message.content)

def retrieve_decomposition(query: str, k: int = 10) -> list[str]:
    plan = decompose(query)
    subs = plan.get("subqueries") or [query]
    if len(subs) < 2 or not plan.get("chain", False):
        # single-hop or independent sub-questions:
        # parallel sub-retrievals + RRF
        rankings = [vector_search(s, k=30) for s in subs]
        return reciprocal_rank_fusion(rankings)[:k]
    # chained: answer sub-1 with retrieved context, then sub-2
    # (only the first two sub-questions are chained here)
    docs_1 = vector_search(subs[0], k=10)
    intermediate = synthesize_answer(subs[0], docs_1)
    enriched = f"{subs[1]} (context: {intermediate})"
    return vector_search(enriched, k=k)

The chained path is what unlocks multi-hop. "Which actress who appeared in Memento also directed a film in 2022?" decomposes into (1) "Who appeared in Memento?" and (2) "Which 2022 films were directed by {name}?". Sub-query 1 needs no context, sub-query 2 needs the answer to 1.
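
The same chaining idea extends past two hops. The sketch below is a variation, not the code behind the numbers in this post: it chains any number of sub-questions and interleaves the docs from every hop so the first hop's evidence stays in the result. The search and answer callables are injected, and the toy stubs (including the PERSON_X placeholder) are invented so the example runs without a vector store or an LLM.

```python
from typing import Callable, List


def retrieve_chained(
    subqueries: List[str],
    search: Callable[[str, int], List[str]],
    answer: Callable[[str, List[str]], str],
    k: int = 10,
) -> List[str]:
    """Run sub-questions in order, feeding each answer into the next."""
    context = ""
    per_hop: List[List[str]] = []
    for i, sub in enumerate(subqueries):
        query = f"{sub} (context: {context})" if context else sub
        docs = search(query, k)
        per_hop.append(docs)
        if i < len(subqueries) - 1:  # the last hop needs no answer
            context = answer(sub, docs)
    # Round-robin across hops so every hop keeps a share of the top k.
    merged: List[str] = []
    depth = max((len(d) for d in per_hop), default=0)
    for rank in range(depth):
        for docs in per_hop:
            if rank < len(docs) and docs[rank] not in merged:
                merged.append(docs[rank])
    return merged[:k]


# Toy stubs standing in for the vector store and the LLM.
def toy_search(query: str, k: int) -> List[str]:
    if "PERSON_X" in query:
        return ["person_x_filmography", "films_2022"][:k]
    return ["memento_cast", "memento_plot"][:k]


def toy_answer(sub: str, docs: List[str]) -> str:
    return "PERSON_X"


docs = retrieve_chained(
    ["Who appeared in Memento?", "Which 2022 films did that person direct?"],
    toy_search, toy_answer, k=3,
)
print(docs)
```

Whether to keep hop-one docs depends on your gold labels. If the labeled set includes the evidence for both hops, dropping the first hop's docs costs recall.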

The tax shows up on fact-lookup. The decomposer over-splits "Who founded Anthropic?" into "Who is associated with Anthropic?" and "Who founded the AI safety company started in 2021?", neither of which is as good as the original. You pay an extra LLM call and lose 5 recall points.

The fix is the router.

The conditional router that beats all three solo

If you can classify the query into a type cheaply and accurately, you can pick the rewriter per query and beat any single rewriter on a mixed workload.

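# router.py
import json
from typing import Literal

from openai import OpenAI

from hyde import retrieve_hyde
from multi_query import retrieve_multi_query
from decomposition import retrieve_decomposition

client = OpenAI()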

QueryType = Literal[
    "fact_lookup", "concept", "multi_hop", "ambiguous"
]

CLASSIFIER_PROMPT = """Classify this query into exactly one type:
- fact_lookup: asks for a specific named entity, date, number,
  or short factual answer
- concept: asks to explain, compare, or define something
- multi_hop: requires two or more distinct facts chained
  together to answer
- ambiguous: under 4 tokens, or missing required context

Return JSON: {"type": "<one of the four>"}"""

REWRITER_TABLE = {
    "fact_lookup": retrieve_hyde,
    "concept":     retrieve_multi_query,
    "multi_hop":   retrieve_decomposition,
    "ambiguous":   retrieve_multi_query,
}

def classify(query: str) -> QueryType:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": CLASSIFIER_PROMPT},
            {"role": "user", "content": query},
        ],
        response_format={"type": "json_object"},
        temperature=0.0,
    )
    return json.loads(resp.choices[0].message.content)["type"]

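def smart_retrieve(query: str, k: int = 10) -> list[str]:
    qtype = classify(query)
    # fall back to multi-query if the classifier returns an unknown label
    retriever = REWRITER_TABLE.get(qtype, retrieve_multi_query)
    return retriever(query, k=k)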

That's the whole router, roughly 30 lines. For each of the four query types it dispatches to the per-type winner from the table above. On the eval mix (40% fact-lookup, 30% concept, 20% multi-hop, 10% ambiguous), the router posts recall@10 = 0.71 and nDCG@10 = 0.58, beating every solo rewriter. The mix-weighted baseline from the table is about 0.60 recall@10, so the router is ahead by roughly 11 recall points.
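
You can sanity-check a mix-level figure against the per-type table with a few lines of arithmetic. The sketch below weights the recall@10 column by the 40/30/20/10 mix for every solo rewriter and for a router that always picks the right type. That perfect-classification assumption makes the routed number a ceiling, not a measurement, and the dictionary names are only for illustration.

```python
# recall@10 per query type, copied from the head-to-head table
RECALL_AT_10 = {
    "fact_lookup": {"baseline": 0.71, "hyde": 0.79,
                    "multi_query": 0.74, "decomposition": 0.66},
    "concept":     {"baseline": 0.64, "hyde": 0.69,
                    "multi_query": 0.72, "decomposition": 0.65},
    "multi_hop":   {"baseline": 0.41, "hyde": 0.45,
                    "multi_query": 0.52, "decomposition": 0.63},
    "ambiguous":   {"baseline": 0.38, "hyde": 0.42,
                    "multi_query": 0.48, "decomposition": 0.39},
}
MIX = {"fact_lookup": 0.40, "concept": 0.30,
       "multi_hop": 0.20, "ambiguous": 0.10}
ROUTE = {"fact_lookup": "hyde", "concept": "multi_query",
         "multi_hop": "decomposition", "ambiguous": "multi_query"}


def solo_recall(system: str) -> float:
    """Mix-weighted recall@10 if one system handles every query."""
    return sum(MIX[t] * RECALL_AT_10[t][system] for t in MIX)


def routed_recall() -> float:
    """Mix-weighted recall@10 if every query reaches its mapped rewriter."""
    return sum(MIX[t] * RECALL_AT_10[t][ROUTE[t]] for t in MIX)


for system in ("baseline", "hyde", "multi_query", "decomposition"):
    print(f"{system:14s} {solo_recall(system):.3f}")
print(f"{'router (ideal)':14s} {routed_recall():.3f}")
```

Swap in your own production mix for MIX. If your traffic is 90% fact-lookup, the gap between the router and plain HyDE shrinks to almost nothing, and the extra classifier call may not be worth it.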

You can replace the LLM classifier with a fine-tuned BERT-tiny if you do this at high QPS. The accuracy cost is small (~2-3% misclassification) and you cut router latency from ~300ms to ~10ms.

Latency budget: recall is half the story

A rewriter that adds 800ms of LLM call time is a worse product than one that adds 80ms even if recall is two points higher. Users feel latency.

Rewriter	LLM calls	p50 latency add	p95 latency add
None (baseline)	0	0 ms	0 ms
HyDE	1	~280 ms	~700 ms
Multi-query (4 variants)	1 + 5 parallel searches (original + 4 variants)	~310 ms	~750 ms
Decomposition (parallel)	1 + N searches	~320 ms	~800 ms
Decomposition (chained)	2 + N searches	~650 ms	~1500 ms
Router → HyDE	2	~450 ms	~950 ms

Numbers from gpt-4o-mini on the rewrite step, text-embedding-3-large for embeddings, p50/p95 over 500 queries against a 2M chunk corpus on Qdrant.
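
To get comparable numbers on your own stack, a small harness is enough. The sketch below is an assumption about method, not the exact script behind the table: it times each retriever over the same queries, takes p50 and p95 with the standard library, and reports the "latency add" as the percentile difference against the no-rewrite baseline.

```python
import statistics
import time
from typing import Callable, Dict, Iterable, List


def percentiles_ms(samples: List[float]) -> Dict[str, float]:
    """p50 and p95 of a list of latencies in milliseconds."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    # n=100 yields 99 cut points; index 49 is p50, index 94 is p95.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}


def time_retriever(fn: Callable[[str], object],
                   queries: Iterable[str]) -> List[float]:
    """Wall-clock latency of fn(query) in ms, one sample per query."""
    samples = []
    for q in queries:
        start = time.perf_counter()
        fn(q)
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def latency_add(rewriter_ms: List[float],
                baseline_ms: List[float]) -> Dict[str, float]:
    """Percentile-wise overhead of a rewriter relative to the baseline."""
    r, b = percentiles_ms(rewriter_ms), percentiles_ms(baseline_ms)
    return {name: r[name] - b[name] for name in r}


# Synthetic samples so the example runs without a retriever.
baseline = [float(i) for i in range(1, 101)]
rewriter = [x + 300.0 for x in baseline]
print(percentiles_ms(baseline))
print(latency_add(rewriter, baseline))
```

Run every retriever on the same queries in the same session, and warm the caches first. Otherwise the first system you time absorbs the cold-start cost and looks slower than it is.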

The router pays one extra LLM call (the classifier) but sends 80% of queries (fact-lookup, concept, ambiguous) to the lighter rewriters and only the 20% multi-hop share to chained decomposition. Average latency across the mix is lower than always running chained decomposition.

If you're latency-bound and can only deploy one rewriter, multi-query is your pick. If you have a latency budget and a real multi-hop user mix, the router pays off.

When to skip rewriting entirely

Sometimes the right answer is no rewriter.

You should skip if your corpus is small (under ~50K chunks) and your queries are short and lexical. BM25 plus dense retrieval with no rewriter often outperforms any rewriter on this shape because there's no embedding mismatch to fix.

You should skip if you have a strong reranker. A good cross-encoder reranker (Cohere Rerank, bge-reranker-v2) on top of 100 baseline candidates often delivers the lift you wanted from a rewriter, at lower latency. Test this before you commit to a rewriter stack.
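
The rerank step itself is small, so it is cheap to test. In the sketch below the scoring function is injected. In a real pipeline it would wrap a cross-encoder that scores (query, passage) pairs. Here a toy term-overlap scorer stands in so the example runs without a model, and the function names and sample passages are made up.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    candidates: Sequence[Tuple[str, str]],
    score_fn: Callable[[str, str], float],
    k: int = 10,
) -> List[str]:
    """Re-order (doc_id, text) candidates by score_fn, highest first."""
    scored = [(score_fn(query, text), doc_id) for doc_id, text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


def overlap_score(query: str, text: str) -> float:
    """Toy stand-in for a cross-encoder: share of query terms in the text."""
    q_terms = set(query.lower().split())
    t_terms = set(text.lower().split())
    return len(q_terms & t_terms) / len(q_terms) if q_terms else 0.0


candidates = [
    ("d1", "the cap theorem trades consistency for availability"),
    ("d2", "a cap is a kind of hat"),
    ("d3", "cap theorem trade-offs explained"),
]
print(rerank("cap theorem trade-offs", candidates, overlap_score, k=2))
```

To compare against a rewriter, plug this in after the baseline top-100 search and rerun the same eval slices.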

You should skip if your queries arrive structured. If the user is filling a form with entity, time_range, topic, you don't need to rewrite. You need to filter. SQL or metadata filtering on a vector store beats any rewriter when the query is already structured.
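
For the structured case, the whole "rewrite" is a filter. The sketch below uses plain Python over in-memory records to show the shape. The Chunk fields mirror the entity, time_range, topic form from the paragraph above, and everything else (names, sample data) is hypothetical. Most vector stores let you pass an equivalent filter along with the search, so check your store's docs for its syntax.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Chunk:
    doc_id: str
    entity: str
    topic: str
    published: date


def filter_chunks(
    chunks: List[Chunk],
    entity: Optional[str] = None,
    topic: Optional[str] = None,
    start: Optional[date] = None,
    end: Optional[date] = None,
) -> List[str]:
    """Return doc ids matching every field the form actually filled in."""
    out = []
    for c in chunks:
        if entity is not None and c.entity != entity:
            continue
        if topic is not None and c.topic != topic:
            continue
        if start is not None and c.published < start:
            continue
        if end is not None and c.published > end:
            continue
        out.append(c.doc_id)
    return out


chunks = [
    Chunk("c1", "acme", "pricing", date(2024, 3, 1)),
    Chunk("c2", "acme", "security", date(2023, 6, 15)),
    Chunk("c3", "globex", "pricing", date(2024, 5, 20)),
]
print(filter_chunks(chunks, entity="acme", start=date(2024, 1, 1)))
```

Filter first, then run the dense search inside whatever survives. That needs no LLM call at all.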

Run the eval. Pick the rewriter (or router) that wins on your query mix. Be willing to delete it later when you add a reranker that does the same job for less latency. The whole pipeline is a moving target. A real eval is what makes the moves cheap.

What's your query mix in production: fact-heavy, concept-heavy, or genuinely multi-hop? Drop your numbers in the comments. Curious whether the 40/30/20/10 split in this post is anywhere close to what teams are actually shipping.

If this was useful

This post is one slice of a much bigger pipeline conversation. The RAG Pocket Guide walks through retrieval, chunking, and reranking patterns end to end — including the chapters on query rewriting strategies, eval methodology with BEIR, and the latency-vs-recall trade-offs you face when you go to production. If you're picking between rewriters, rerankers, and hybrid retrieval and want a coherent mental model instead of a pile of blog posts, that's what it's for.