- Book: RAG Pocket Guide: Retrieval, Chunking, and Reranking Patterns for Production
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
A user typed ERR_BLOCKED_BY_CLIENT into your support search and got nothing. Zero hits. The chunk with that exact string was sitting in your chunks table, embedded by text-embedding-3-small, indexed under HNSW, and the cosine search ranked it 47th behind a pile of "browser error" prose. The retriever picked top-5. The error code never made it to the LLM.
This is the failure mode every team meets in the second month of running a vector-only RAG. Embeddings smear identifiers, version strings, SKUs, and stack traces into the surrounding semantic neighborhood. They are good at meaning and bad at tokens. The fix is to stop choosing. Run lexical and semantic retrieval in parallel, fuse the rankings, and let each method cover the other's blind spots. You can do it in roughly 100 lines of Python on top of Postgres, with no new infrastructure.
The schema: one table, two indexes
A single chunks table holds the document text, a tsvector for BM25-style full-text search, and a vector(1536) column for OpenAI embeddings. Both columns get their own index.
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE chunks (
    id BIGSERIAL PRIMARY KEY,
    doc_id TEXT NOT NULL,
    content TEXT NOT NULL,
    ts TSVECTOR
        GENERATED ALWAYS AS
        (to_tsvector('english', content)) STORED,
    embedding VECTOR(1536)
);
The ts column is a stored generated column, so Postgres recomputes it whenever content changes. No trigger to maintain, no application code to remember. The embedding column you fill yourself when you ingest a chunk.
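A quick check in psql, assuming a row with id = 1 already exists:

UPDATE chunks SET content = content || ' ERR_BLOCKED_BY_CLIENT' WHERE id = 1;
SELECT ts FROM chunks WHERE id = 1;  -- the new lexemes are already in the tsvector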
Now the two indexes. GIN over the tsvector for fast lexical search, HNSW over the vector for approximate nearest neighbors.
CREATE INDEX chunks_ts_idx
    ON chunks USING GIN (ts);

CREATE INDEX chunks_vec_idx
    ON chunks USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
m = 16 and ef_construction = 64 are pgvector's defaults, and they hold up well on 1,536-dimensional embeddings; bump ef_construction to 200 if your ingest runs offline and you want better recall at query time. The pgvector docs cover the trade-offs in detail (pgvector README).
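ef_construction only affects the build. The query-time knob is hnsw.ef_search, which defaults to 40; raising it per session trades a little latency for recall on the HNSW scan:

SET hnsw.ef_search = 100;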
Ingest: embed, insert, done
Connect with psycopg, register the pgvector adapter, and the rest is two SQL statements.
import os
import psycopg
import numpy as np
from openai import OpenAI
from pgvector.psycopg import register_vector
oai = OpenAI()
DSN = os.environ["PG_DSN"]

def embed(text: str) -> np.ndarray:
    # one chunk in, one 1,536-dim float32 vector out
    r = oai.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return np.array(r.data[0].embedding, dtype=np.float32)
text-embedding-3-small returns 1,536-dim vectors and costs about a fifth of -large. Use -large if your eval numbers say so; do not pay for it by default.
The insert path is unremarkable. Register the vector type once, then pass numpy arrays straight through.
def ingest(doc_id: str, chunks: list[str]) -> None:
    with psycopg.connect(DSN) as conn:
        register_vector(conn)
        with conn.cursor() as cur:
            for c in chunks:
                cur.execute(
                    "INSERT INTO chunks (doc_id, content, embedding)"
                    " VALUES (%s, %s, %s)",
                    (doc_id, c, embed(c)),
                )
        conn.commit()
register_vector teaches psycopg to serialize numpy arrays into pgvector's wire format (pgvector-python README). Without it you end up casting strings, which is slow and ugly.
The two retrievals, in parallel
Now the interesting half. Two queries, same connection, run back to back. The first ranks by ts_rank_cd, Postgres's BM25-flavored cover-density score over the tsvector. The second ranks by cosine distance against the query embedding.
BM25_SQL = """
SELECT id,
       ts_rank_cd(ts, plainto_tsquery('english', %s)) AS s
FROM chunks
WHERE ts @@ plainto_tsquery('english', %s)
ORDER BY s DESC
LIMIT %s
"""

VEC_SQL = """
SELECT id, 1 - (embedding <=> %s) AS s
FROM chunks
ORDER BY embedding <=> %s
LIMIT %s
"""
Two notes. plainto_tsquery parses raw user text into a tsquery without forcing them to know Postgres operators, which is much friendlier than to_tsquery for free-form input. And <=> is pgvector's cosine distance operator; 1 - distance gives you a similarity score in [0, 1] if you want one for logging, though we don't actually need it for fusion.
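You can watch plainto_tsquery work on free-form input directly in psql:

SELECT plainto_tsquery('english', 'browser error codes');
-- 'browser' & 'error' & 'code'   (terms stemmed and ANDed; stop words dropped)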
Run both queries and pull two ranked id lists.
def retrieve(query: str, k: int = 50) -> tuple[list[int], list[int]]:
    qvec = embed(query)
    with psycopg.connect(DSN) as conn:
        register_vector(conn)
        with conn.cursor() as cur:
            cur.execute(BM25_SQL, (query, query, k))
            bm25_ids = [r[0] for r in cur.fetchall()]
            cur.execute(VEC_SQL, (qvec, qvec, k))
            vec_ids = [r[0] for r in cur.fetchall()]
    return bm25_ids, vec_ids
We pull k = 50 from each side even though the final answer is top-10. You want a deeper candidate pool than you ship; RRF will discard the long tail.
If you care about p50 latency, run the two queries on two connections from a pool and await both; Postgres executes them concurrently. For a tutorial-sized service the sequential path is fine, since both queries are index-backed and fast.
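If you do reach for the pooled variant, here is a minimal sketch. It assumes psycopg_pool and pgvector's async adapter (register_vector_async), and it reuses DSN, embed, and the two SQL strings from above; treat it as a shape, not a drop-in.

import asyncio
from psycopg_pool import AsyncConnectionPool
from pgvector.psycopg import register_vector_async

pool = AsyncConnectionPool(DSN, min_size=2, open=False)

async def _fetch_ids(sql: str, params: tuple) -> list[int]:
    async with pool.connection() as conn:
        await register_vector_async(conn)
        cur = await conn.execute(sql, params)
        return [row[0] for row in await cur.fetchall()]

async def retrieve_parallel(query: str, k: int = 50) -> tuple[list[int], list[int]]:
    qvec = embed(query)  # still a blocking HTTP call; fine for a sketch
    await pool.open()    # no-op if the pool is already open
    bm25_ids, vec_ids = await asyncio.gather(
        _fetch_ids(BM25_SQL, (query, query, k)),
        _fetch_ids(VEC_SQL, (qvec, qvec, k)),
    )
    return bm25_ids, vec_ids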
RRF: the merge that does not need tuning
Reciprocal Rank Fusion is the boring algorithm that beats every clever weighted-sum scheme on benchmarks the field actually agrees on (Cormack, Clarke, Buettcher, 2009). For each retriever, score every document as 1 / (k + rank), sum across retrievers, sort.
def rrf(rankings: list[list[int]], k: int = 60) -> list[int]:
    scores: dict[int, float] = {}
    for ranked in rankings:
        # enumerate is 0-based, so the top hit contributes 1 / (k + 1)
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] = (
                scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
            )
    return sorted(scores, key=scores.get, reverse=True)
The k = 60 constant is from the original paper. It dampens the head: rank 1 contributes 1/61, rank 2 contributes 1/62, and the gap is small. That is the point. RRF refuses to trust either retriever's absolute scores, because BM25's ts_rank_cd and cosine similarity live on different scales and there is no honest way to combine them. Ranks, you can combine.
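A toy trace makes the damping concrete. Document 7 ranks first in one list and third in the other, so it collects 1/61 + 1/63 ≈ 0.0323; document 4's single first-place hit is worth 1/61 ≈ 0.0164, and the document both retrievers liked wins:

print(rrf([[7, 2, 9], [4, 5, 7]]))
# [7, 4, 2, 5, 9]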
Plug RRF on top of the parallel retrievers and you have hybrid search.
def hybrid_search(query: str, top_k: int = 10) -> list[int]:
    bm25_ids, vec_ids = retrieve(query, k=50)
    fused = rrf([bm25_ids, vec_ids])
    return fused[:top_k]
Fetch the actual content with one more SELECT id, content FROM chunks WHERE id = ANY(%s) and you are done. That whole pipeline (schema, ingest, retrieve, fuse) fits under 100 lines if you are not generous with whitespace.
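A minimal version of that last fetch; the only gotcha is that ANY() returns rows in arbitrary order, so you reapply the fused ranking yourself:

def fetch_content(ids: list[int]) -> list[str]:
    with psycopg.connect(DSN) as conn:
        rows = conn.execute(
            "SELECT id, content FROM chunks WHERE id = ANY(%s)",
            (ids,),
        ).fetchall()
    by_id = {row[0]: row[1] for row in rows}
    return [by_id[i] for i in ids if i in by_id]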
What this catches that vector search alone does not
Three patterns repeat across teams that switch a vector-only RAG to hybrid:
- Identifier and code-token queries. Error codes, SKU numbers, function names, version strings. Embeddings squash these into a neighborhood of "things that look like identifiers." BM25 nails them because the token either appears or it doesn't.
- Rare-word queries. A user types tombstoned consumer offset and the embedding pulls back ten chunks about Kafka consumer groups in general. BM25 sees that tombstoned is rare, weights it accordingly, and surfaces the one chunk that uses the word.
- Negation and entity disambiguation. Vector search is famously bad at "the API that does not require auth" or "Stripe checkout, not Stripe billing." BM25 still treats these as bags of words, but the lexical signal is enough to surface the right document among the top 50, where RRF picks it up.
Vector search keeps winning the cases BM25 always lost: paraphrases, synonyms, cross-language queries when you use a multilingual embedding model, and the long tail of "the user described the symptom but not the term." That is the whole point of fusing them. Each method covers the other's blind spot, RRF does not require you to commit to a weight, and you ship one Postgres database instead of two retrieval stacks.
What to tune, what to leave alone
Leave RRF's k = 60 alone. The paper's number holds up across corpora and the gains from sweeping it are smaller than the noise in your eval set. Leave HNSW's m = 16 alone unless your recall@10 plateaus below what your queries need; bumping m blows up index size faster than it helps recall.
Tune the candidate pool. k = 50 per retriever is a safe default; on small corpora k = 20 is often enough, on large ones push to k = 100 and let RRF cut. Tune the embedding model only after you have run the eval; do not assume -large wins. And tune your tsvector configuration. If your corpus is multilingual, 'english' is the wrong dictionary and you will silently lose recall on every non-English chunk.
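One stopgap, if you cannot run per-language configurations yet, is the 'simple' config: no stemming, no stop words, so it never silently drops non-English tokens, at the cost of losing English stemming:

ts TSVECTOR
    GENERATED ALWAYS AS
    (to_tsvector('simple', content)) STORED,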
The shape of this pipeline is small on purpose. One table. Two indexes. Two queries. One fusion function. Most of the production weight is in the embedding pipeline and the eval rig, not the retriever. When a teammate asks why hybrid search is not its own service, point at this file.
If this was useful
The RAG Pocket Guide walks through hybrid search end-to-end: when BM25 wins, when reranking pays for itself on top of RRF, and the eval rig that tells you whether any of it actually moved your numbers. If you are running RAG on Postgres and want the production patterns without the marketing detour, it is for you.
