Most RAG tutorials show you the happy path: embed your documents, store them in a vector database, do a similarity search, done. Ship it.
Then you put it in front of real users and discover that semantic search alone misses obvious keyword matches, exact product names get hallucinated into approximate nonsense, and your retrieval quality sits at maybe 60%.
This post covers how we fixed it — using nothing but PostgreSQL. pgvector for semantic search, tsvector for full-text search, and Reciprocal Rank Fusion to merge the results. No Pinecone. No Elasticsearch. Just Postgres.
TL;DR — Pure vector search got us ~62% retrieval precision. Adding full-text search + RRF fusion bumped it to ~84%, with near-perfect exact-match queries. Zero additional infrastructure — it's all just SQL.
Why Not Just Use Vector Search?
Vector (semantic) search is great at understanding meaning. If a user asks "how do I reduce cloud costs?" it will find documents about "infrastructure cost optimization" even if those exact words don't appear.
But it has blind spots. Here's how the two approaches compare:
| Query type | Vector search | Full-text search |
|---|---|---|
| "infrastructure cost optimization" | Finds it even if phrased differently | Misses unless exact words match |
| "PCI-DSS v4" (exact standard) | Returns generic "security compliance" docs | Nails it |
| "ECS" (acronym) | Matches "container services" broadly | Finds exact AWS ECS references |
| "SQS DLQ config" (precise query) | Diluted by semantically similar noise | Direct hit |
| "how do I handle errors gracefully?" | Understands intent, finds relevant patterns | Misses if docs use different wording |
Neither approach is sufficient on its own. The answer is to use both.
Architecture Overview
Both searches run in parallel against the same PostgreSQL database. RRF merges the ranked lists into a single result set.
Setting Up the Database
pgvector for Semantic Search
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE document_chunks (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    document_id UUID NOT NULL REFERENCES documents(id),
    content     TEXT NOT NULL,
    -- text-embedding-3-large requested at 1536 dims via the API's
    -- `dimensions` parameter: pgvector can only index vectors with up
    -- to 2,000 dimensions, so the model's native 3,072 won't fit an
    -- HNSW index (unless you switch to halfvec)
    embedding   vector(1536),
    metadata    JSONB DEFAULT '{}',
    created_at  TIMESTAMPTZ DEFAULT now()
);

-- HNSW index for approximate nearest neighbor search
CREATE INDEX idx_chunks_embedding ON document_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 128);
Why HNSW over IVFFlat? HNSW gives better recall at query time out of the box. IVFFlat is faster to build, but it requires choosing the number of lists upfront and tuning the probe count (`ivfflat.probes`) per query. For production RAG, where retrieval quality matters more than index build time, HNSW is the better default.
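For reference, here is the tuning surface of each index type as session-level settings — a sketch using pgvector's documented knobs (the `lists` value shown is just an example):

```sql
-- HNSW: a single query-time knob; raising it trades latency for recall
SET hnsw.ef_search = 100;   -- default is 40

-- IVFFlat: lists is fixed at index build time...
-- CREATE INDEX ... USING ivfflat (embedding vector_cosine_ops) WITH (lists = 1000);
-- ...and probes must then be tuned per session or per query
SET ivfflat.probes = 10;    -- default is 1; too low silently hurts recall
```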
tsvector for Full-Text Search
ALTER TABLE document_chunks
    ADD COLUMN search_vector tsvector
    GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;

CREATE INDEX idx_chunks_search ON document_chunks USING gin(search_vector);
Using a GENERATED ALWAYS column means the search vector stays in sync with content automatically — no triggers, no application logic to maintain.
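You also get Postgres's stemming and stop-word handling for free. For example, with the built-in english configuration:

```sql
SELECT to_tsvector('english', 'The quick brown foxes jumped');
-- 'brown':3 'fox':4 'jump':5 'quick':2
```

"The" is dropped as a stop word, and "foxes"/"jumped" are stemmed, which is why a query for "fox" still matches.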
The Search Queries
Semantic Search
import uuid

from sqlalchemy import func, select
from sqlalchemy.ext.asyncio import AsyncSession

# Document, DocumentChunk (ORM models) and SearchResult (a result
# dataclass) are defined elsewhere in the application.

async def semantic_search(
    session: AsyncSession,
    query_embedding: list[float],
    workspace_id: uuid.UUID,
    limit: int = 20,
) -> list[SearchResult]:
    """Find chunks by vector similarity using cosine distance."""
    stmt = (
        select(
            DocumentChunk.id,
            DocumentChunk.content,
            DocumentChunk.document_id,
            # cosine_distance emits pgvector's <=> operator; convert
            # distance to similarity so that higher = better
            (1 - DocumentChunk.embedding.cosine_distance(query_embedding)).label(
                "similarity"
            ),
        )
        .join(Document, DocumentChunk.document_id == Document.id)
        .where(Document.workspace_id == workspace_id)
        .order_by(DocumentChunk.embedding.cosine_distance(query_embedding))
        .limit(limit)
    )
    results = await session.execute(stmt)
    return [
        SearchResult(
            chunk_id=row.id,
            content=row.content,
            document_id=row.document_id,
            score=float(row.similarity),
        )
        for row in results
    ]
Full-Text Search
async def fulltext_search(
    session: AsyncSession,
    query: str,
    workspace_id: uuid.UUID,
    limit: int = 20,
) -> list[SearchResult]:
    """Find chunks by full-text relevance using ts_rank."""
    ts_query = func.plainto_tsquery("english", query)
    relevance = func.ts_rank(DocumentChunk.search_vector, ts_query)
    stmt = (
        select(
            DocumentChunk.id,
            DocumentChunk.content,
            DocumentChunk.document_id,
            relevance.label("relevance"),
        )
        .join(Document, DocumentChunk.document_id == Document.id)
        .where(
            Document.workspace_id == workspace_id,
            # @@ is Postgres's "tsvector matches tsquery" operator
            DocumentChunk.search_vector.op("@@")(ts_query),
        )
        .order_by(relevance.desc())
        .limit(limit)
    )
    results = await session.execute(stmt)
    return [
        SearchResult(
            chunk_id=row.id,
            content=row.content,
            document_id=row.document_id,
            score=float(row.relevance),
        )
        for row in results
    ]
Reciprocal Rank Fusion
Here's where the magic happens. RRF merges two ranked lists without needing to normalize their scores — which is important because cosine similarity scores and ts_rank scores are on completely different scales.
The formula is simple:
RRF_score(d) = Σ 1 / (k + rank_i(d))
Where k is a constant (typically 60) and rank_i(d) is the rank of document d in the i-th result list.
Why not just normalize and average the scores? Cosine similarity and ts_rank have fundamentally different distributions. Normalizing them into a common range introduces arbitrary decisions about scaling. RRF side-steps this entirely by only caring about rank position, not raw scores. A document ranked #1 in both lists will always beat a document ranked #5 in one and #1 in another — regardless of what the underlying scores looked like.
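To make the rank-only behavior concrete, here is the formula applied by hand to two small made-up ranked lists (the chunk ids and rankings are invented for illustration):

```python
# RRF by hand on two toy ranked lists of chunk ids, with k = 60.
k = 60
semantic = ["A", "B", "C"]   # ranks 1, 2, 3 from vector search
fulltext = ["A", "D", "B"]   # ranks 1, 2, 3 from full-text search

scores: dict[str, float] = {}
for ranked in (semantic, fulltext):
    for rank, doc in enumerate(ranked, start=1):
        # each list contributes 1 / (k + rank) for every doc it contains
        scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)

merged = sorted(scores, key=scores.get, reverse=True)
print(merged)  # → ['A', 'B', 'D', 'C']
```

"A", ranked first in both lists, wins; "B" appears in both lists and so outranks "D" and "C", which each appear in only one.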
def reciprocal_rank_fusion(
    result_lists: list[list[SearchResult]],
    k: int = 60,
) -> list[SearchResult]:
    """
    Merge multiple ranked result lists using Reciprocal Rank Fusion.

    The k parameter controls how much we trust top-ranked results.
    Higher k = more equal weighting. Default 60 is from the original
    Cormack et al. (2009) paper.
    """
    rrf_scores: dict[uuid.UUID, float] = {}
    results_by_id: dict[uuid.UUID, SearchResult] = {}

    for result_list in result_lists:
        for rank, result in enumerate(result_list, start=1):
            rrf_scores[result.chunk_id] = rrf_scores.get(
                result.chunk_id, 0.0
            ) + 1.0 / (k + rank)
            results_by_id[result.chunk_id] = result

    sorted_ids = sorted(rrf_scores, key=rrf_scores.get, reverse=True)
    return [
        SearchResult(
            chunk_id=chunk_id,
            content=results_by_id[chunk_id].content,
            document_id=results_by_id[chunk_id].document_id,
            score=rrf_scores[chunk_id],
        )
        for chunk_id in sorted_ids
    ]
Putting It Together
import asyncio

from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker

async def hybrid_search(
    session_factory: async_sessionmaker[AsyncSession],
    query: str,
    query_embedding: list[float],
    workspace_id: uuid.UUID,
    limit: int = 10,
) -> list[SearchResult]:
    """
    Run semantic and full-text search in parallel,
    then merge results with RRF.

    A single AsyncSession can't execute two statements concurrently,
    so each search gets its own session from the factory.
    """
    async with session_factory() as sem_session, session_factory() as ft_session:
        semantic_results, fulltext_results = await asyncio.gather(
            semantic_search(sem_session, query_embedding, workspace_id, limit=20),
            fulltext_search(ft_session, query, workspace_id, limit=20),
        )

    fused = reciprocal_rank_fusion([semantic_results, fulltext_results])
    return fused[:limit]
Both searches run concurrently with asyncio.gather. We fetch 20 candidates from each, fuse them, and return the top 10. Over-fetching from each source gives RRF more signal to work with.
Why Not a Dedicated Vector Database?
We considered Pinecone, Weaviate, and Qdrant. We stuck with PostgreSQL + pgvector for several reasons:
| Concern | Dedicated vector DB | PostgreSQL + pgvector |
|---|---|---|
| Infrastructure | Separate service to monitor, back up, patch, pay for | Already running Aurora PostgreSQL |
| Consistency | Eventual consistency between stores | Single transaction — deletes are atomic |
| Full-text search | Need a third system (Elasticsearch, etc.) | tsvector is built in |
| Compliance (SOC2) | Every new service adds audit scope | One set of encryption, access controls, audit trails |
| Performance at our scale | Faster at 10M+ vectors | <50ms HNSW search at ~500K chunks |
When would we switch? If we hit tens of millions of vectors and needed sub-10ms p99 latency, a dedicated solution would make sense. At ~500K chunks, pgvector is more than fast enough.
Production Lessons
Here's what we learned running this in production:
1. Over-fetch, then trim.
Pulling 20 results from each source and fusing to 10 consistently outperformed pulling 10 from each. More candidates = better fusion signal.
2. Full-text search needs query preprocessing.
plainto_tsquery handles most cases, but for advanced users we added websearch_to_tsquery support, which understands quoted phrases and boolean operators.
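A sketch of that routing — the helper name and heuristics here are invented for illustration, not from our codebase; it just decides which Postgres parser function to call:

```python
# Hypothetical router between Postgres's two tsquery parser functions.
# websearch_to_tsquery understands "quoted phrases", OR, and -exclusion;
# plainto_tsquery treats the input as plain words ANDed together.

def choose_tsquery_parser(query: str) -> str:
    """Return the name of the Postgres function to parse this query with."""
    advanced_markers = ('"', " OR ", " -")
    if any(m in query for m in advanced_markers) or query.startswith("-"):
        return "websearch_to_tsquery"
    return "plainto_tsquery"
```

In the search function, you would then build the tsquery with the chosen function (e.g. via `getattr(func, choose_tsquery_parser(query))("english", query)` in SQLAlchemy).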
3. Monitor both sides independently.
We track semantic search latency and full-text search latency separately. If pgvector's HNSW index needs a rebuild, we see it in the semantic search p99 without it being masked by fast full-text results.
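One lightweight way to get per-side timings — an illustrative in-process sketch; real code would emit these to a metrics backend (StatsD, Prometheus, etc.) rather than a dict, and the fake search coroutines stand in for the real ones:

```python
import asyncio
import time
from collections import defaultdict

# Hypothetical in-process recorder keyed by search type.
latencies_ms: dict[str, list[float]] = defaultdict(list)

async def timed(name: str, coro):
    """Await a search coroutine and record its latency under its own name."""
    start = time.perf_counter()
    try:
        return await coro
    finally:
        latencies_ms[name].append((time.perf_counter() - start) * 1000)

async def main():
    # Stand-ins for the real semantic_search / fulltext_search calls.
    async def fake_semantic():
        await asyncio.sleep(0.02)
        return ["chunk-1"]

    async def fake_fulltext():
        await asyncio.sleep(0.005)
        return ["chunk-2"]

    return await asyncio.gather(
        timed("semantic", fake_semantic()),
        timed("fulltext", fake_fulltext()),
    )

asyncio.run(main())
```

Because each side is recorded under its own key, a slow HNSW index shows up in the semantic p99 even while full-text stays fast.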
4. Embedding model upgrades require re-indexing.
When we considered switching embedding models, we realized every chunk needs re-embedding. We built this into our pipeline as a background job with progress tracking. Plan for this from day one.
5. RRF's k=60 is a reasonable default, but test it.
We ran evaluations with k values from 10 to 100. For our dataset, 60 was optimal. Your mileage may vary — the important thing is to have an evaluation set so you can actually measure.
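The evaluation loop is simple enough to sketch. Everything below is made up for illustration — the eval set stands in for human-labeled (query, relevant-chunks) pairs, and precision@N stands in for whatever metric your labels support:

```python
# Toy sweep of RRF's k parameter over a tiny labeled eval set.

def rrf(lists: list[list[str]], k: int) -> list[str]:
    """Minimal RRF over lists of chunk ids."""
    scores: dict[str, float] = {}
    for ranked in lists:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def precision_at(n: int, ranked: list[str], relevant: set[str]) -> float:
    """Fraction of the top n results that are labeled relevant."""
    return sum(1 for doc in ranked[:n] if doc in relevant) / n

eval_set = [
    # (semantic results, full-text results, relevant chunk ids)
    (["A", "B", "C", "D"], ["E", "A", "F", "B"], {"A", "B"}),
    (["G", "H", "I", "J"], ["H", "K", "G", "L"], {"G", "H"}),
]

for k in (10, 30, 60, 100):
    p = sum(
        precision_at(2, rrf([sem, ft], k), rel) for sem, ft, rel in eval_set
    ) / len(eval_set)
    print(f"k={k:>3}  precision@2={p:.2f}")
```

With a real eval set of a few hundred labeled queries, this sweep takes seconds and removes the guesswork from choosing k.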
Results
After switching from pure semantic search to hybrid search with RRF:
| Metric | Before (vector only) | After (hybrid + RRF) |
|---|---|---|
| Retrieval precision | ~62% | ~84% |
| Exact-match queries (product names, error codes) | Unreliable | Near-perfect |
| Additional infrastructure | — | None. It's all just SQL. |
Hybrid search isn't novel research — the pattern has been around for years in traditional information retrieval. But in the RAG gold rush, most tutorials skip it entirely in favor of pure vector search. If your RAG pipeline is struggling with retrieval quality, this is likely the highest-ROI improvement you can make.
[Code examples in this post are illustrative patterns based on production experience, not copy-paste from a proprietary codebase.]
