This article was originally published on AI Study Room. For the full version with working code examples and related articles, visit the original post.
# RAG Retrieval Optimization: Hybrid Search, Re-Ranking, Query Transformation
## Introduction
Retrieval quality is the single biggest factor in RAG system performance. Even the best LLM cannot produce accurate answers from irrelevant context. This article covers three optimization layers: hybrid search that combines embedding similarity with keyword matching, re-ranking that refines initial results, and query transformation that bridges the gap between user questions and searchable terms.
## Hybrid Search
Pure vector search excels at semantic similarity but misses exact keyword matches. Pure keyword search finds exact terms but misses conceptually related content. Hybrid search combines both:
```python
from qdrant_client import QdrantClient
from qdrant_client.models import SearchRequest

client = QdrantClient(host="localhost", port=6333)

def hybrid_search(
    query: str, collection: str = "documents", limit: int = 10
) -> list[dict]:
    # embedding_model (dense) and sparse_encoder (BM25-style) are assumed
    # to be initialized elsewhere
    dense_vector = embedding_model.encode(query).tolist()
    sparse_vector = sparse_encoder.encode(query)
    results = client.search_batch(
        collection_name=collection,
        requests=[
            # Dense search; over-fetch so fusion has candidates to work with
            SearchRequest(vector=dense_vector, limit=limit * 2, with_payload=True),
            # Sparse search; in practice this requires a named sparse vector
            # (models.NamedSparseVector) configured on the collection
            SearchRequest(vector=sparse_vector, limit=limit * 2, with_payload=True),
        ],
    )
    # Fusion: Reciprocal Rank Fusion
    return rrf_fusion(results[0], results[1], k=60)
```
### Reciprocal Rank Fusion
RRF combines ranked lists from multiple retrieval methods:
```python
def rrf_fusion(
    dense_results: list, sparse_results: list, k: int = 60, limit: int = 10
) -> list[dict]:
    scores: dict = {}
    # Each list contributes 1 / (k + rank) per document, with ranks starting at 1
    for rank, result in enumerate(dense_results):
        scores[result.id] = scores.get(result.id, 0) + 1 / (k + rank + 1)
    for rank, result in enumerate(sparse_results):
        scores[result.id] = scores.get(result.id, 0) + 1 / (k + rank + 1)
    reranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [{"id": id_, "score": score} for id_, score in reranked[:limit]]
RRF is simple, effective, and requires no training. The constant k (typically 60) prevents a single high rank from dominating.
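As a quick sanity check, the same RRF arithmetic can be run on two toy ranked lists of document ids (standalone, no vector store needed). A document ranked 2nd and 1st edges out one ranked 1st and 3rd, since consistent placement across both lists wins:

```python
def rrf_scores(dense_ids: list[str], sparse_ids: list[str], k: int = 60) -> list[str]:
    # Same scoring as rrf_fusion above, applied to plain id lists
    scores: dict[str, float] = {}
    for ids in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ids):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# "b" (ranked 2nd and 1st) beats "a" (ranked 1st and 3rd)
print(rrf_scores(["a", "b", "c"], ["b", "c", "a"]))  # ['b', 'a', 'c']
```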
## Cross-Encoder Re-Ranking
After initial retrieval, a cross-encoder model re-scores candidates with higher accuracy:
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query: str, top_k: int = 50, rerank_top: int = 5) -> list[dict]:
    # First stage: fast bi-encoder retrieval casts a wide net
    candidates = hybrid_search(query, limit=top_k)
    # Second stage: cross-encoder re-ranking; assumes each candidate dict
    # carries its chunk text (e.g. copied from the stored payload)
    pairs = [(query, cand["text"]) for cand in candidates]
    scores = reranker.predict(pairs)
    for cand, score in zip(candidates, scores):
        cand["rerank_score"] = float(score)
    candidates.sort(key=lambda x: x["rerank_score"], reverse=True)
    return candidates[:rerank_top]
```
Cross-encoders are 10-100x slower than bi-encoders but significantly more accurate. The two-stage pattern (wide bi-encoder retrieval, narrow cross-encoder re-ranking) balances speed and quality.
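The wide-then-narrow pattern itself is model-agnostic. Here is a dependency-free sketch of the same pipeline, with shared-word count standing in for the bi-encoder and Jaccard similarity standing in for the cross-encoder (toy scorers for illustration, not real models):

```python
def two_stage(query: str, docs: list[str], top_k: int = 50, rerank_top: int = 3) -> list[str]:
    q = set(query.lower().split())
    # Stage 1 (wide): cheap shared-word count stands in for the bi-encoder
    candidates = sorted(
        docs, key=lambda d: len(q & set(d.lower().split())), reverse=True
    )[:top_k]

    # Stage 2 (narrow): costlier Jaccard similarity stands in for the cross-encoder
    def jaccard(d: str) -> float:
        t = set(d.lower().split())
        return len(q & t) / len(q | t) if q | t else 0.0

    return sorted(candidates, key=jaccard, reverse=True)[:rerank_top]

docs = [
    "hybrid search combines dense and sparse retrieval",
    "bananas are yellow",
    "search quality matters",
]
print(two_stage("hybrid search", docs, rerank_top=2))
```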
## Query Transformation
User queries are rarely optimal for retrieval. Transform them before searching:
```python
def transform_query(user_query: str, technique: str = "expansion") -> str:
    if technique == "expansion":
        return expand_query(user_query)
    elif technique == "decomposition":
        return decompose_query(user_query)
    elif technique == "hypothetical":
        return hyde_query(user_query)
    return user_query  # unknown technique: fall back to the original query

def expand_query(query: str) -> str:
    """Generate search-friendly expansions of the original query."""
    # call_llm is an assumed helper that returns the model's text response
    expansions = call_llm(f"""
    Generate 3 alternative phrasings of this query for better search retrieval.
    Keep the core meaning but vary terminology.
    Original: {query}
    """)
    return f"{query}\n{expansions}"

def hyde_query(query: str) -> str:
    """Hypothetical Document Embeddings (HyDE): generate a hypothetical ideal
    document, then embed that document for retrieval instead of the raw query."""
    return call_llm(f"Write a short passage that perfectly answers: {query}")
```
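Why the HyDE trick works: a hypothetical answer tends to share vocabulary (and, with real embeddings, semantics) with stored answer passages, while the raw question often shares neither. A toy bag-of-words cosine makes the effect visible (`bow_cos` is a stand-in for real embedding similarity, not part of any library):

```python
import math
from collections import Counter

def bow_cos(a: str, b: str) -> float:
    # Toy bag-of-words cosine similarity, standing in for embedding similarity
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

doc = "restart the router by holding the reset button for ten seconds"
query = "how do I fix my internet"
hypothetical = "you can fix your internet by pressing the reset button on the router"

# The hypothetical answer overlaps the stored passage; the raw question does not
print(bow_cos(query, doc) < bow_cos(hypothetical, doc))  # True
```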
### Query Decomposition
Complex questions should be split into sub-queries, each searched independently:
```python
def decompose_and_retrieve(question: str) -> list[dict]:
    sub_queries = call_llm(f"""
    Break this question into 2-4 independent sub-questions, one per line:
    {question}
    """)
    # Retrieve for each sub-question, merging and de-duplicating by document id
    merged: dict = {}
    for sub_query in sub_queries.strip().splitlines():
        for result in retrieve_and_rerank(sub_query.strip()):
            merged.setdefault(result["id"], result)
    return list(merged.values())
```