- Book: Observability for LLM Applications — paperback and hardcover on Amazon · Ebook from Apr 22
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
A user types SKU-47291 into your support bot. Your vector database returns the three most semantically related product descriptions. None of them are SKU-47291. The bot confidently tells the user that the closest match (SKU-47280) is probably what they meant. The user reports a bug. You are the bug.
You shipped a search engine that cannot search. You shipped a retriever that fails the only query a human ever typed with real intent: the one where they already know the exact string they want.
This is the RAG failure mode nobody warned you about when you picked Pinecone, Weaviate, Qdrant, or pgvector. Embeddings are not an upgrade to lexical search. They are a different tool. If you use one without the other, you will eat this class of bug in production forever.
## Where pure vector search fails, predictably
Embedding models collapse text into a 768- or 1536-dimensional vector. Tokens with similar meaning end up near each other in that space. That is the entire superpower — and the entire failure mode.
Five query shapes break it. You can generate the failures on your own data in an afternoon:
1. Rare identifiers. SKU-47291, CVE-2024-3094, KB-0007745, error code E_PERMISSION_DENIED_42. The embedding model has never seen this exact string. It tokenizes into subword fragments, averages them into a vector, and lands somewhere in the middle of a cluster of other ID-shaped strings. A user typing SKU-47291 gets back SKU-47280, SKU-47305, SKU-47291-RED — in some order that has nothing to do with the actual identifier.
2. Proper nouns. User searches for Dvorak keyboard. Embedding returns QWERTY keyboard, ergonomic keyboard, mechanical keyboard. All semantically close. None of them are Dvorak. The signal that the user cared about the word "Dvorak" specifically is lost in the averaging.
3. Acronyms. A user searches RCE in JPA. The embedding model knows RCE vaguely, knows JPA vaguely, returns results about Java ORM vulnerabilities in general — and misses the specific CVE about RCE in JPA that a BM25 index would surface in position one.
4. Negation and exact phrases. *"How do I cancel **without** a fee?"* The embedding model does not reliably encode negation. It returns cancellation policies that include fees. The one page that has the literal phrase *no cancellation fee* ranks below three pages that are semantically about cancellations-and-fees.
5. Numeric codes and version strings. Postgres 15.4, Python 3.11.7, node v20.9.0. Sub-version granularity is below the embedding's resolution. It will give you Python 3.11 answers when the user needed the 3.11.7 changelog.
These are not edge cases. They are the modal query on every technical product — every e-commerce catalog, every documentation site, every customer support bot. If you measure on a real query log, these five shapes together are often more than 30% of retrieval traffic.
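A small harness makes the afternoon exercise concrete. The retriever interface here is an assumption; wrap whatever your stack actually exposes:

```python
# Spot-check exact-match recall: for each (query, expected_doc_id) pair,
# ask the retriever for its top-k and record whether the expected doc appears.
# `retriever` is any callable you wrap around your own stack (illustrative).
from typing import Callable


def exact_match_misses(
    cases: list[tuple[str, str]],                # (query, expected doc_id)
    retriever: Callable[[str, int], list[str]],  # (query, k) -> ranked doc_ids
    k: int = 5,
) -> list[str]:
    """Return the queries whose expected document is absent from the top-k."""
    return [q for q, expected in cases if expected not in retriever(q, k)]
```

Run it once against your dense-only retriever with a few dozen identifier queries and you have the failure list before lunch.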
## Why lexical search solves this in one line
BM25 is decades old — the modern formula shipped with Okapi in the mid-1990s, built on 1970s probabilistic retrieval. It runs inside OpenSearch and Elasticsearch; Postgres `ts_rank` is a simpler lexical cousin. It knows nothing about meaning. It counts exact term matches, weights them by rarity (inverse document frequency), and normalizes for document length.
```python
# BM25 with rank-bm25, Python. This is the entire algorithm you need
# for exact-match queries.
from rank_bm25 import BM25Okapi

corpus = [doc.split() for doc in documents]
bm25 = BM25Okapi(corpus)

query = "SKU-47291".split()
scores = bm25.get_scores(query)                    # one score per document
top_docs = bm25.get_top_n(query, documents, n=5)   # ranked original docs
```
The rarer the term, the higher it scores. SKU-47291 appears in exactly one document in your catalog. BM25 ranks that document first, every time, with no ambiguity. An embedding model cannot do this because it cannot represent "this exact token appeared, and it is rare." That information is lost the moment the tokenizer splits it.
Pure vector search was sold as the successor to keyword search. It is not. It is complementary.
## What hybrid search actually is
Hybrid search means you run two retrievers in parallel, BM25 and dense vectors, and fuse their ranked lists into a single list. Then you optionally rerank the top k with a cross-encoder.
The fusion step is where teams usually get it wrong. You cannot add the scores directly. BM25 scores are unbounded floats that typically land in the 0–50 range on real corpora. Cosine similarity on normalized embeddings lives in 0–1. Adding them raw means BM25 always wins.
Two techniques that work:
Reciprocal Rank Fusion (RRF). Ignore the raw scores. Use only the rank each retriever assigned. For each document, score = Σ 1/(k + rank_i) across retrievers, where k is a dampening constant (60 is the folklore default from the original Cormack et al. paper). Simple, stable across retrievers with incompatible score scales, needs no tuning. Elastic, Qdrant, and Weaviate all ship RRF as a primitive now.
Weighted score fusion with normalization. Normalize both score streams to 0–1 per query (min-max over the top-k returned), then weight and sum. Slightly more tunable than RRF, slightly more fragile because a single outlier score skews the normalization. Use it when you have per-query-type labels to tune the weights on.
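RRF is small enough to write out in full. A minimal sketch of the rank-only formula above (doc ids in, fused ranking out):

```python
def reciprocal_rank_fusion(
    ranked_lists: list[list[str]],  # one list per retriever, best doc first
    k: int = 60,                    # dampening constant from Cormack et al.
) -> list[str]:
    """Fuse rankings using positions only: score = sum of 1/(k + rank)."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks are used, the incompatible score scales of the two retrievers never touch each other.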
Here is the helper I reach for when a vendor does not ship native hybrid. It runs the two retrievers, normalizes, fuses, and returns a single ranked list:
```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Hit:
    doc_id: str
    score: float


def min_max_normalize(hits: list[Hit]) -> list[Hit]:
    if not hits:
        return []
    lo = min(h.score for h in hits)
    hi = max(h.score for h in hits)
    if hi - lo < 1e-9:
        return [Hit(h.doc_id, 1.0) for h in hits]
    return [Hit(h.doc_id, (h.score - lo) / (hi - lo)) for h in hits]


def hybrid_search(
    query: str,
    lexical_retriever: Callable[[str, int], list[Hit]],
    vector_retriever: Callable[[str, int], list[Hit]],
    top_k: int = 50,
    alpha: float = 0.5,
) -> list[Hit]:
    """Weighted fusion of lexical + vector results.

    alpha=0.0 → pure lexical. alpha=1.0 → pure vector.
    Start at 0.5, tune per query class.
    """
    lex = min_max_normalize(lexical_retriever(query, top_k))
    vec = min_max_normalize(vector_retriever(query, top_k))

    fused: dict[str, float] = {}
    for h in lex:
        fused[h.doc_id] = (1 - alpha) * h.score
    for h in vec:
        fused[h.doc_id] = fused.get(h.doc_id, 0.0) + alpha * h.score

    ranked = sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
    return [Hit(doc_id, score) for doc_id, score in ranked]
```
Four things to notice:
- The retrievers are injected, not coupled. `lexical_retriever` can be a BM25 index, OpenSearch, Postgres `ts_rank`, anything that returns `(doc_id, score)` pairs.
- Normalization is per-query. A query that returns weak scores from both retrievers still gets a fair fusion.
- `alpha` is a single tunable knob. Measure on your real query log, per query class — `alpha=0.2` for identifier-heavy traffic, `alpha=0.7` for conversational traffic.
- Fusion alone is not the whole pipeline. You feed the top 50 fused results into a cross-encoder reranker (`cross-encoder/ms-marco-MiniLM-L-6-v2`, `BAAI/bge-reranker-v2-m3`, Cohere Rerank 3) and take the top 5 for the LLM context. The reranker sees both the query and the passage together and is much better at relevance than either retriever alone.
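That reranking stage can keep the same injected-callable shape as the fusion helper, so you can swap rerankers without touching the pipeline. A sketch, where `score_pairs` stands in for whatever your cross-encoder exposes (for sentence-transformers' `CrossEncoder`, that would be its `predict` method; the interface here is an assumption):

```python
from typing import Callable


def rerank(
    query: str,
    doc_ids: list[str],
    get_text: Callable[[str], str],  # doc_id -> passage text
    score_pairs: Callable[[list[tuple[str, str]]], list[float]],  # cross-encoder
    final_k: int = 5,
) -> list[str]:
    """Score (query, passage) pairs jointly and keep the best final_k."""
    pairs = [(query, get_text(d)) for d in doc_ids]
    scores = score_pairs(pairs)
    ranked = sorted(zip(doc_ids, scores), key=lambda x: x[1], reverse=True)
    return [d for d, _ in ranked[:final_k]]
```

The key property: the scorer sees query and passage together, so it can do the relevance judgment neither retriever can.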
## What your vendor actually ships
The honest version of the vendor landscape as of April 2026:
Pinecone. Ships a sparse-dense hybrid index with managed sparse embeddings (pinecone-sparse-english-v0) — not pure BM25, but learned sparse. Fusion is done in the serverless architecture with an explicit alpha parameter. Works. Cost per query is higher than a pure dense index. The sparse side is opinionated; if your corpus is code or multilingual, measure before trusting it.
Weaviate. Native hybrid search with either RRF or relativeScoreFusion. BM25 runs over the inverted index shard-local, then fuses with HNSW vector search. The ergonomics are the best of the dedicated vector DBs — a single hybrid query parameter, an alpha knob, and it works.
Qdrant. Added native hybrid search and fusion via its Query API. You define a prefetch that runs a sparse and a dense retrieval, then a fusion step (rrf or dbsf). You control the sparse encoder — BM25, SPLADE, whatever you wire up. More configuration than Weaviate, more control as a result.
pgvector + tsvector. The sleeper pick. Postgres has had full-text search since the 2000s. pgvector adds HNSW dense search. You run two queries, fuse in SQL with a CTE and a rank fusion, and you have hybrid search on infrastructure you already operate. ParadeDB and newer extensions have made BM25-in-Postgres serious in the last year. The Supabase hybrid search recipe is battle-tested. If you already have Postgres, stop shopping.
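A sketch of that CTE fusion, using RRF over `ts_rank` and the pgvector `<=>` distance operator. The schema `docs(id, body, tsv tsvector, embedding vector(1536))` and the parameter names are illustrative assumptions, not a drop-in query:

```python
# One statement, two retrievers, RRF at the end. Bind %(query)s to the raw
# query text and %(qvec)s to its embedding (e.g. via psycopg named params).
HYBRID_SQL = """
WITH lexical AS (
    SELECT id,
           row_number() OVER (ORDER BY ts_rank(tsv, q) DESC) AS r
    FROM docs, plainto_tsquery('english', %(query)s) AS q
    WHERE tsv @@ q
    ORDER BY ts_rank(tsv, q) DESC
    LIMIT 50
),
semantic AS (
    SELECT id,
           row_number() OVER (ORDER BY embedding <=> %(qvec)s) AS r
    FROM docs
    ORDER BY embedding <=> %(qvec)s
    LIMIT 50
)
SELECT id,
       COALESCE(1.0 / (60 + l.r), 0) + COALESCE(1.0 / (60 + s.r), 0) AS score
FROM lexical l
FULL OUTER JOIN semantic s USING (id)
ORDER BY score DESC
LIMIT 10;
"""
```

The `FULL OUTER JOIN` is the important part: a document found by only one retriever still participates, it just gets a single RRF term.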
OpenSearch / Elasticsearch. If you already run it for logs, you are one plugin away. OpenSearch ships neural search and an explicit hybrid query that fuses BM25 with k-NN. Elasticsearch has an equivalent with its `rrf` retriever. If you wanted a vector DB mostly to say you have one, and you already operate OpenSearch, you may already have the right tool.
No vendor is cheating you by omitting hybrid — the space corrected fast during 2024–2025 and every serious player has a hybrid story now. The failure mode is teams adopting pure vector search from a 2022 tutorial and never upgrading.
## The measurement you are probably not doing
The reason this bug survives in production: teams measure retrieval quality on a benchmark that looks like conversational questions, not on their real query log. BEIR, MTEB, and the standard academic sets are dominated by well-formed natural-language questions. On those benchmarks, dense retrieval beats BM25 by meaningful margins.
On a real query log (a support console, a product catalog, an internal wiki), the distribution is bimodal. Half the queries are conversational. Half are lookups: product codes, error codes, internal tool names, proper nouns, version strings.
The fix costs nothing:
- Export the last 30 days of queries from your app.
- Label 200 by hand: lookup vs conversational.
- Measure top-1 and top-5 recall, per class, for dense-only vs BM25-only vs hybrid (`alpha=0.5`).
- Plot the deltas. The lookup class will move twenty to forty points when you add BM25 back in.
That plot is the business case for the migration. You do not need to win the argument on theory — the numbers will do it.
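The per-class measurement from the list above is about a dozen lines, assuming you have labeled cases as `(query, expected_doc_id, query_class)` tuples (an illustrative harness, not a framework):

```python
from collections import defaultdict
from typing import Callable


def recall_by_class(
    cases: list[tuple[str, str, str]],           # (query, expected_id, class)
    retriever: Callable[[str, int], list[str]],  # (query, k) -> ranked doc_ids
    k: int = 5,
) -> dict[str, float]:
    """Top-k recall per query class, e.g. 'lookup' vs 'conversational'."""
    hits: dict[str, list[int]] = defaultdict(list)
    for query, expected, cls in cases:
        hits[cls].append(int(expected in retriever(query, k)))
    return {cls: sum(v) / len(v) for cls, v in hits.items()}
```

Run it three times with the same cases — dense-only, BM25-only, hybrid — and the table writes itself.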
## What to do Monday morning
If your RAG system is dense-only today, you have a short list:
- Add a BM25 or sparse retriever alongside your dense index. On Postgres, `tsvector`. On OpenSearch, you already have it. On Pinecone/Weaviate/Qdrant, turn on their hybrid mode.
- Fuse with RRF first. It is the boringly correct default. Tune to weighted fusion only when you have labeled data and a reason.
- Add a cross-encoder reranker on the top 30–50. This alone often delivers a bigger quality jump than the fusion step, and it runs in 30–80ms for reasonable k.
- Log your retriever's top-k alongside the query in every trace. When a user complains about a bad answer, you want to see the ranked list that produced it — not just the LLM output. OpenTelemetry's GenAI semconv has the span shape for this.
- Build the query-class breakdown above. Keep it as a dashboard. Regressions in the lookup class are the first sign that a model upgrade or embedding change broke your retrieval.
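A minimal version of that retrieval logging, before you adopt a full OpenTelemetry span shape (field names here are illustrative, not the GenAI semconv attributes):

```python
# Log the ranked list next to the query, so a bad answer can be traced
# back to the retrieval that produced it.
import json
import time


def log_retrieval(query: str, hits: list[tuple[str, float]], trace_id: str) -> str:
    """Serialize one retrieval event as a JSON line."""
    record = {
        "ts": time.time(),
        "trace_id": trace_id,
        "query": query,
        "top_k": [{"doc_id": d, "score": round(s, 4)} for d, s in hits],
    }
    return json.dumps(record)
```

Even this crude version answers the question that matters during an incident: what did the retriever actually return?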
Your vector database is a retriever. It is a good one for one class of query. It is not a search engine, and it will not become one by increasing the embedding dimension. The fix is decades old and costs you a fusion function.
## If this was useful
Retrieval is one failure class in a long list. Observability for LLM Applications covers the rest — drift detection, judge-meta-evaluation, tool-call traces, cost attribution, the twelve other things your APM cannot see. Chapter 7 covers RAG specifically, including hybrid retrieval traces and how to alert on retrieval regressions per query class.
- Book: Observability for LLM Applications — paperback and hardcover now; ebook April 22.
- Hermes IDE: hermes-ide.com — the IDE for developers shipping with Claude Code and other AI tools.
- Me: xgabriel.com · github.com/gabrielanhaia.
