This article was originally published on AI Study Room. For the full version with working code examples and related articles, visit the original post.
# RAG Retrieval Optimization: Hybrid Search, Re-Ranking, Query Transformation
## Introduction
Retrieval quality is the single biggest factor in RAG system performance. Even the best LLM cannot produce accurate answers from irrelevant context. This article covers three optimization layers: hybrid search that combines embedding similarity with keyword matching, re-ranking that refines initial results, and query transformation that bridges the gap between user questions and searchable terms.
## Hybrid Search
Pure vector search excels at semantic similarity but misses exact keyword matches. Pure keyword search finds exact terms but misses conceptually related content. Hybrid search combines both:
```python
from qdrant_client import QdrantClient
from qdrant_client.models import SearchRequest

client = QdrantClient(host="localhost", port=6333)

def hybrid_search(
    query: str, collection: str = "documents", limit: int = 10
) -> list[dict]:
    # embedding_model (dense) and sparse_encoder (BM25-style) are assumed
    # to be initialized elsewhere
    dense_vector = embedding_model.encode(query).tolist()
    sparse_vector = sparse_encoder.encode(query)
    results = client.search_batch(
        collection_name=collection,
        requests=[
            # Dense search; over-fetch so fusion has candidates to work with
            SearchRequest(vector=dense_vector, limit=limit * 2, with_payload=True),
            # Sparse search; in practice this requires a named sparse vector
            # (models.NamedSparseVector) configured on the collection
            SearchRequest(vector=sparse_vector, limit=limit * 2, with_payload=True),
        ],
    )
    # Fusion: Reciprocal Rank Fusion
    return rrf_fusion(results[0], results[1], k=60)
```
### Reciprocal Rank Fusion
RRF combines ranked lists from multiple retrieval methods:
```python
def rrf_fusion(
    dense_results: list, sparse_results: list, k: int = 60, limit: int = 10
) -> list[dict]:
    scores: dict = {}
    # Each list contributes 1 / (k + rank) per document, with ranks starting at 1
    for rank, result in enumerate(dense_results):
        scores[result.id] = scores.get(result.id, 0) + 1 / (k + rank + 1)
    for rank, result in enumerate(sparse_results):
        scores[result.id] = scores.get(result.id, 0) + 1 / (k + rank + 1)
    reranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [{"id": id_, "score": score} for id_, score in reranked[:limit]]
RRF is simple, effective, and requires no training. The constant k (typically 60) prevents a single high rank from dominating.
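As a quick sanity check, the same RRF arithmetic can be run on two toy ranked lists of document ids (standalone, no vector store needed). A document ranked 2nd and 1st edges out one ranked 1st and 3rd, since consistent placement across both lists wins:

```python
def rrf_scores(dense_ids: list[str], sparse_ids: list[str], k: int = 60) -> list[str]:
    # Same scoring as rrf_fusion above, applied to plain id lists
    scores: dict[str, float] = {}
    for ids in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ids):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# "b" (ranked 2nd and 1st) beats "a" (ranked 1st and 3rd)
print(rrf_scores(["a", "b", "c"], ["b", "c", "a"]))  # ['b', 'a', 'c']
```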
## Cross-Encoder Re-Ranking
After initial retrieval, a cross-encoder model re-scores candidates with higher accuracy:
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query: str, top_k: int = 50, rerank_top: int = 5) -> list[dict]:
    # First stage: fast bi-encoder retrieval casts a wide net
    candidates = hybrid_search(query, limit=top_k)
    # Second stage: cross-encoder re-ranking; assumes each candidate dict
    # carries its chunk text (e.g. copied from the stored payload)
    pairs = [(query, cand["text"]) for cand in candidates]
    scores = reranker.predict(pairs)
    for cand, score in zip(candidates, scores):
        cand["rerank_score"] = float(score)
    candidates.sort(key=lambda x: x["rerank_score"], reverse=True)
    return candidates[:rerank_top]
```
Cross-encoders are 10-100x slower than bi-encoders but significantly more accurate. The two-stage pattern (wide bi-encoder retrieval, narrow cross-encoder re-ranking) balances speed and quality.
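The wide-then-narrow pattern itself is model-agnostic. Here is a dependency-free sketch of the same pipeline, with shared-word count standing in for the bi-encoder and Jaccard similarity standing in for the cross-encoder (toy scorers for illustration, not real models):

```python
def two_stage(query: str, docs: list[str], top_k: int = 50, rerank_top: int = 3) -> list[str]:
    q = set(query.lower().split())
    # Stage 1 (wide): cheap shared-word count stands in for the bi-encoder
    candidates = sorted(
        docs, key=lambda d: len(q & set(d.lower().split())), reverse=True
    )[:top_k]

    # Stage 2 (narrow): costlier Jaccard similarity stands in for the cross-encoder
    def jaccard(d: str) -> float:
        t = set(d.lower().split())
        return len(q & t) / len(q | t) if q | t else 0.0

    return sorted(candidates, key=jaccard, reverse=True)[:rerank_top]

docs = [
    "hybrid search combines dense and sparse retrieval",
    "bananas are yellow",
    "search quality matters",
]
print(two_stage("hybrid search", docs, rerank_top=2))
```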
## Query Transformation
User queries are rarely optimal for retrieval. Transform them before searching:
```python
def transform_query(user_query: str, technique: str = "expansion") -> str:
    if technique == "expansion":
        return expand_query(user_query)
    elif technique == "decomposition":
        return decompose_query(user_query)
    elif technique == "hypothetical":
        return hyde_query(user_query)
    return user_query  # unknown technique: fall back to the original query

def expand_query(query: str) -> str:
    """Generate search-friendly expansions of the original query."""
    # call_llm is an assumed helper that returns the model's text response
    expansions = call_llm(f"""
    Generate 3 alternative phrasings of this query for better search retrieval.
    Keep the core meaning but vary terminology.
    Original: {query}
    """)
    return f"{query}\n{expansions}"

def hyde_query(query: str) -> str:
    """Hypothetical Document Embeddings (HyDE): generate a hypothetical ideal
    document, then embed that document for retrieval instead of the raw query."""
    return call_llm(f"Write a short passage that perfectly answers: {query}")
```
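Why the HyDE trick works: a hypothetical answer tends to share vocabulary (and, with real embeddings, semantics) with stored answer passages, while the raw question often shares neither. A toy bag-of-words cosine makes the effect visible (`bow_cos` is a stand-in for real embedding similarity, not part of any library):

```python
import math
from collections import Counter

def bow_cos(a: str, b: str) -> float:
    # Toy bag-of-words cosine similarity, standing in for embedding similarity
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

doc = "restart the router by holding the reset button for ten seconds"
query = "how do I fix my internet"
hypothetical = "you can fix your internet by pressing the reset button on the router"

# The hypothetical answer overlaps the stored passage; the raw question does not
print(bow_cos(query, doc) < bow_cos(hypothetical, doc))  # True
```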
### Query Decomposition
Complex questions should be split into sub-queries, each searched independently:
```python
def decompose_and_retrieve(question: str) -> list[dict]:
    sub_queries = call_llm(f"""
    Break this question into 2-4 independent sub-questions, one per line:
    {question}
    """)
    # Retrieve for each sub-question, merging and de-duplicating by document id
    merged: dict = {}
    for sub_query in sub_queries.strip().splitlines():
        for result in retrieve_and_rerank(sub_query.strip()):
            merged.setdefault(result["id"], result)
    return list(merged.values())
```