How dense embedding retrieval displaced BM25 as the first stage of modern AI search, what the mechanism actually does, and why exact-match SEO tactics quietly stopped working.
There is a page I audited last year that ranks well: it gets cited, gets quoted, gets used as a source by AI assistants for a phrase nobody types. The literal string appears nowhere in the document. The document is about the topic, plainly and accurately, in clear prose. The query is a paraphrase. Twenty years of SEO heuristics would predict this page does not match. The retrieval stack thinks it matches better than half the pages that do contain the literal phrase. The inverse also happens: a page that uses a query's exact terms three times in the title and twice in the H1 is not getting cited at all, because the embedding model thinks the page is about something different from what the user asked. Same query class, two outcomes, and the difference is mechanical. The retrieval stack changed underneath, and most of the SEO heuristics the industry still teaches are heuristics about a stack that is now the second-stage filter, not the first.
I built my mental model the slow way. I read the BEIR benchmark paper end to end, then DPR, then ColBERT, then HNSW, and then sat with a public embedding model and a corpus of my own, running similarity computations against synonym pairs, paraphrase pairs, and adversarial pairs until the behaviour stopped surprising me. After that I started watching what happened to AI citations when pages were rewritten in different ways: exact-match tightened, paraphrases added, exact-match stripped while the semantics were preserved. The pattern that fell out is not subtle, and it overturns several pieces of SEO advice that are still being repeated as if they were neutral facts.
This post is the field report. It covers the shift from sparse to dense first-stage retrieval, what an embedding model actually represents about a page and a query, why approximate nearest neighbour search is the workhorse of the recall step, why dense-only retrieval fails in specific, predictable ways and why hybrid retrieval is the production answer, and what all of that means for content design. It is technical because the mechanism is technical. The shortcuts the SEO industry has been selling are shortcuts to the wrong stack.
The Two Decades of BM25
For roughly twenty years, the dominant first-stage retrieval algorithm on the open web and inside almost every search engine, on-site search, and Lucene/Elasticsearch deployment was BM25, formalised by Robertson and Zaragoza in their 2009 retrospective "The Probabilistic Relevance Framework: BM25 and Beyond." BM25 is a sparse, lexical, term-frequency-based scorer. It builds an inverted index of terms to documents. At query time it scores documents by how often the query terms appear, weighted by inverse document frequency, with saturation and length normalisation parameters bolted on. The mathematics is closed-form, the index is small, and the recall is reasonable for queries whose terms overlap exactly with the document.
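For concreteness, here is the scorer in miniature. This is a toy sketch of the Okapi BM25 formula as described above, not a production implementation (Lucene and Elasticsearch handle tokenisation, caching, and edge cases properly):

```python
# Toy Okapi BM25: score a document against a query, given a tokenised corpus.
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N        # average document length
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        n_q = sum(1 for d in corpus if term in d)            # document frequency
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)    # smoothed IDF
        f = tf[term]                                          # term frequency
        # saturation (k1) and length normalisation (b) in the denominator
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

corpus = [doc.lower().split() for doc in [
    "how to fix a slow website",
    "improving page load performance",
]]
print(bm25_score("slow website".split(), corpus[0], corpus))  # positive: terms overlap
print(bm25_score("slow website".split(), corpus[1], corpus))  # 0.0: no lexical overlap
```

The second query-document pair is about the same question, and BM25 scores it zero. That zero is the entire story of the next section.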
BM25 has properties the SEO industry built an entire grammar around. It rewards keyword presence. It is sensitive to keyword frequency up to a saturation point. It penalises long documents to prevent stuffing. It cannot match a paraphrase. It cannot infer that "vehicle" and "automobile" are the same concept. It cannot tell that "how to fix a slow website" and "improving page load performance" are about the same question. The keyword-research industry, on-page-optimisation playbooks, exact-match domain folklore, the H1-must-contain-the-target-keyword reflex: all of that grammar is downstream of how BM25 scores documents. When the retrieval stack scores on lexical overlap, the rational thing for authors is to engineer lexical overlap. So they did, for two decades.
The thing that changed, quietly enough that most SEO commentary missed it, is that BM25 stopped being the only thing, and, on a growing share of the queries that matter for AI search, stopped being the dominant thing at the recall step.
The Dense Retrieval Era
Dense retrieval was not a single moment. It was a slow accumulation of papers that each made the dense approach better, cheaper, or more general. The two reference points worth knowing by name are DPR (Karpukhin et al., 2020) and ColBERT (Khattab and Zaharia, 2020). DPR demonstrated that a dual-encoder, where query and passage are each encoded independently into a dense vector and scored by inner product, could outperform BM25 on open-domain question answering by a substantial margin. ColBERT pushed the thinking further by keeping per-token embeddings and computing a late-interaction score, improving fine-grained matching while remaining tractable.
The third reference point, and the one that brought rigour to the comparison, is the BEIR benchmark (Thakur et al., 2021). BEIR took eighteen heterogeneous IR datasets, ran the major sparse and dense retrievers across all of them in zero-shot mode, and published the comparison. The headline result was less tidy than the dense-retrieval marketing wanted: dense models trained on one domain did not always transfer to another, and BM25 remained surprisingly hard to beat on certain tasks. The honest reading of BEIR is that neither sparse nor dense is a universal winner alone, and hybrid systems combining both tend to dominate.
That honest reading is the one production search systems implement. It is also the one most SEO advice ignores.
What an Embedding Model Sees in a Page
The mechanism is worth tracing. An embedding model takes a sequence of tokens (your page's text, broken into sub-word tokens by the model's tokeniser) and runs them through a stack of transformer layers. Each token attends to every other token (or a windowed subset). The output is a sequence of contextualised token embeddings: each token carries information about the words that surround it. The model pools that sequence into a single vector: often the embedding of a [CLS] token, sometimes mean-pooling, sometimes a learned head. The result is a fixed-size vector, typically between 384 and 3072 dimensions depending on the model.
What that vector represents is meaning, not surface text. Two paragraphs saying the same thing in different words produce vectors close in the embedding space. A paragraph about "the impact of caching on web performance" and a paragraph about "how stale responses speed up rendering" sit near each other even though they share almost no tokens. This is what dense retrieval does that BM25 never could. It is also why content that is "well-written about the topic" can outrank content that is "engineered for the keyword": the model is not counting tokens; it is comparing meaning.
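You can verify the paraphrase behaviour yourself with any public embedding model. A minimal sketch, assuming the sentence-transformers library and the open BAAI/bge-small-en-v1.5 model (any embedding model shows the same shape; the example strings are the ones from the paragraph above):

```python
# Probe the "meaning, not surface text" claim: paraphrase pair vs unrelated pair.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

a = "the impact of caching on web performance"
b = "how stale responses speed up rendering"   # paraphrase: near-zero token overlap
c = "a recipe for sourdough bread"             # unrelated control

# normalize_embeddings=True makes the dot product a cosine similarity
vecs = model.encode([a, b, c], normalize_embeddings=True)
print("paraphrase pair:", np.dot(vecs[0], vecs[1]))  # expect: noticeably higher
print("unrelated pair: ", np.dot(vecs[0], vecs[2]))  # expect: noticeably lower
```

The absolute numbers are model-specific and not meaningful on their own; the gap between the two pairs is the signal.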
The flip side is that the embedding model is a learned model, not a dictionary. It has a training distribution. Concepts well represented in training are mapped cleanly. Concepts that were absent or rare are mapped sloppily. Specific identifiers (SKUs, model numbers, error codes, brand names that look like generic words) frequently sit in regions of the embedding space with very low resolution. That is one of the dense-retrieval failure modes, and we will come back to it.
What an Embedding Model Sees in a Query
The query goes through the same model. The user's words or the query the LLM has rewritten on the user's behalf get tokenised, embedded, and pooled into a vector in the same space as the documents. The retrieval step is a nearest-neighbour search: which document vectors are closest to the query vector by cosine similarity or inner product?
The query embedding does several things a BM25 query cannot. It handles paraphrase: "fastest way to deploy a Next.js app" lands near documents about "Next.js deployment latency," even though "fastest" is missing from one and "latency" from the other. It handles synonym disjunction softly: a query about "vehicles" partially matches documents about "cars" without a configured dictionary. It handles intent inference up to a point: a question lands closer to documents that answer it than to documents that ask similar questions, because the model has learned the difference from training data.
What it does not do is handle exact identifiers well. A query for the SKU BTX-449-G2 returns high similarity only if the model tokenised it the same way for document and query, and embeddings of rare tokens are noisy. A query for the precise string error E_INVALID_REDIRECT may end up near generic documents about redirect errors and miss the document that contains the exact string verbatim, because the model treats the rare code as low-information. That is why hybrid retrieval exists.
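The identifier weakness is easy to probe. A sketch using the same setup as before; the SKU, documents, and queries are invented for illustration, and the interesting output is whether the SKU query actually prefers the SKU document on your model:

```python
# Probe the rare-identifier failure mode: does the exact-SKU query find the SKU doc?
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

docs = [
    "Spec sheet for the BTX-449-G2 power supply, 450W, 80 Plus Gold.",  # contains SKU
    "A general guide to choosing a power supply for your build.",        # generic
]
queries = [
    "BTX-449-G2",                          # exact identifier: dense-retrieval worst case
    "which power supply should I choose",  # paraphrase: dense-retrieval best case
]

D = model.encode(docs, normalize_embeddings=True)
Q = model.encode(queries, normalize_embeddings=True)
print(Q @ D.T)  # rows = queries, cols = docs; check which doc each query prefers
```

On many models the SKU query's margin over the generic document is small or inverted, while BM25 would match the exact string trivially. That asymmetry is the case for the hybrid stack described next.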
Before we get there, there is a piece between the user's input and the embedding step that most operators forget about.
The Invisible Query Rewrite
When a model produces a search-grounded answer, the query that hits the retrieval stack is rarely the user's literal text. The model rewrites the question into one or more search queries: sometimes expanding into sub-queries, sometimes paraphrasing, sometimes filling in implicit context from the conversation. ChatGPT search, Perplexity, Gemini grounded mode, Claude with the web search tool, and Bing Chat all do some form of query rewriting before retrieval. The stack downstream sees the rewritten query, not the user's words.
This matters for content design. Optimising for the literal user query is a fool's errand, because you never see the literal query; you see the query the model decided to send, already normalised and paraphrased. What you can optimise for is the cluster of paraphrases the model is likely to produce around a given intent. This is why writing "the same answer phrased multiple ways within one page" tends to win over "the same keyword repeated multiple times within one page": the paraphrased pages match more of the rewrite distribution, which is what actually hits the index.
Approximate Nearest Neighbour at Scale
In principle the recall step is just nearest-neighbour search. In practice, exact nearest-neighbour search over hundreds of millions of vectors is infeasible at AI-search latencies. The production answer is approximate nearest neighbour, or ANN, and the dominant open-source algorithm is HNSW (Hierarchical Navigable Small World graphs), described by Malkov and Yashunin in 2016.
HNSW is a graph-based index. The intuition is worth holding clearly because it explains why ANN is "good enough" for the first stage.
```
HNSW conceptual structure (top layer is sparse, bottom layer is full)

Layer 2 (sparse, long edges):       o ----------- o ----------- o
                                     \           /             /
Layer 1 (denser, medium edges):     o --- o --- o --- o --- o
                                     \   /  \         \    /
Layer 0 (full, short local edges):  o-o-o-o-o-o-o-o-o-o-o-o
                                    ^
                                    query enters at top,
                                    greedy descent narrows
                                    neighbourhood at each layer
```
A query enters at the top layer, which has few nodes connected by long edges. The algorithm greedily walks toward the query's nearest neighbour, drops down to the next layer using the current best node as the entry point, and repeats. By the time the search reaches the bottom layer, which contains every vector, the candidate region is already narrowed to a small neighbourhood, and the bottom-layer search only explores a few hundred nodes instead of the full corpus. The result is sub-linear search time with high recall, configurable through parameters that trade off recall against latency.
Faiss, the open-source library from Meta, implements HNSW alongside several other ANN structures, including IVF (inverted file with coarse quantisation) and product quantisation. Pinecone, Weaviate, Qdrant, Milvus, pgvector, Vespa: every production vector database is a variation on these ideas. HNSW dominates the discussion because it has consistently strong recall on high-dimensional vectors with reasonable memory overhead.
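If you want to feel the recall-versus-latency knobs directly, hnswlib (the standalone library from the HNSW paper's authors) is the smallest dependency to try. A sketch with illustrative parameter values, using random vectors as stand-ins for real embeddings:

```python
# Build a small HNSW index and query it; M and ef are the recall/latency knobs.
import hnswlib
import numpy as np

dim, n = 384, 10_000
data = np.random.rand(n, dim).astype(np.float32)  # stand-in for real embeddings

index = hnswlib.Index(space="cosine", dim=dim)
# M = edges per node, ef_construction = build-time beam width (illustrative values)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(data, np.arange(n))

index.set_ef(64)  # query-time beam width: higher = better recall, more latency
labels, distances = index.knn_query(data[:1], k=10)
print(labels, distances)  # approximate top-10 neighbours of the first vector
```

Raising ef and M pushes recall toward exact search at the cost of latency and memory; production systems tune these against a ground-truth sample.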
The catch, and it is the catch that hybrid retrieval was invented to address, is that ANN is approximate. The recall step returns the top-k by approximate similarity, not the true top-k. For most queries the top few results are stable. For queries with rare terms or out-of-distribution embeddings, the approximate index can miss the document that lexical search would have found trivially. Combined with the embedding model's own weaknesses on rare and exact terms, the dense-only path has predictable failure modes.
Where Dense Alone Loses
There is a class of queries where pure dense retrieval is reliably worse than BM25.
Queries with exact identifiers (product SKUs, model numbers, error codes, version strings, ISBNs, regulatory references) are dense retrieval's worst case. The embedding model has typically not seen BTX-449-G2 enough during training to give it a meaningful position in vector space. BM25 treats it as a token and finds the document instantly.
Queries with brand names that overlap common words (Apple, Square, Notion, Linear, Vector) are a related case. The embedding model maps "Apple" closer to "fruit," "company," and "computer" by some learned blend. The query "Apple support phone number" sits in a region where consumer-electronics documents and grocery-aisle documents coexist. BM25 does not care about meaning and scores by literal token overlap.
Queries about domains under-represented in training (niche legal corpora, regional regulatory texts, deeply specialised technical fields) also tend to favour BM25, because the embedding model's resolution in those regions of the space is poor.
Queries with negation and quantifiers ("papers that do not use BERT," "websites without a privacy policy") are hard for embedding models, which struggle to invert meaning. BM25 with explicit operators handles these better than naive dense retrieval, although in practice the LLM usually rewrites the query into something the dense retriever can handle.
This is the empirical content of the BEIR result. Across eighteen datasets, no single retriever wins everywhere, and the cases where dense loses are not random: they cluster around the failure modes above.
Hybrid Retrieval Is the Production Answer
Production AI search systems do not pick sparse or dense. They run both, fuse the results, and let the rerank stage clean it up.
The two common fusion approaches are Reciprocal Rank Fusion (a simple, training-free recipe that sums the reciprocal of each document's rank in each list) and learned combiners (models trained to score documents using both BM25 and dense scores as features). Vespa, Weaviate, Elasticsearch's hybrid search, Qdrant's BM25 + dense pipelines, and OpenSearch's neural-sparse hybrid all implement variations of these patterns. The rerank step that follows (a heavier cross-encoder that re-scores the top candidates) is its own conversation, and I am keeping it deliberately brief here. The point for retrieval is that the rerank cleans up the noise the recall step admitted, and the recall step is now hybrid rather than purely lexical.
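RRF is small enough to show in full. A sketch of the standard recipe; the k=60 constant is the value from Cormack et al.'s original formulation, and the document IDs are made up:

```python
# Reciprocal Rank Fusion: each list contributes 1/(k + rank) per document.
def rrf(result_lists, k=60):
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # fused ranking: highest combined score first
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["doc3", "doc1", "doc7"]   # sparse leg's top-k
dense_top = ["doc1", "doc9", "doc3"]  # dense leg's top-k
print(rrf([bm25_top, dense_top]))     # doc1 and doc3 rise: both legs agree on them
```

The appeal is that RRF needs no score calibration between the two legs; it only consumes ranks, which is why it survives as the default in so many hybrid pipelines.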
Here is the comparison that matters, framed as the characteristics of each path:
| Property | Sparse (BM25) | Dense (embedding-based) | Hybrid (sparse + dense) |
|---|---|---|---|
| Matches exact terms | Yes, by construction | Weakly, via tokenisation | Yes (sparse rescues this) |
| Matches paraphrases | No | Yes | Yes (dense provides this) |
| Handles synonyms | Only with explicit dictionary | Yes, learned | Yes |
| Handles rare identifiers | Yes | Weakly | Yes (sparse rescues this) |
| Handles negation | Yes, with operators | Poorly | Partial |
| Robust to OOD vocabulary | Yes | Poorly | Yes (sparse rescues this) |
| Recall vs latency at scale | Inverted index, sub-linear | ANN graph, sub-linear | Run both, fuse |
| Index size | Small (token postings) | Large (vector per chunk) | Sum of both |
| Cold-start on new content | Immediate (just index tokens) | Requires embedding compute | Both |
That table is the operational summary of two decades of BM25 plus six years of dense-retrieval-at-scale. It explains why the production answer is hybrid, and why neither extreme of the SEO debate ("keywords are dead" versus "keywords are all that matter") is correct. They are both signals. The retrieval stack uses both. Content that wins in AI search is content that survives both filters.
What Still Matters from the Lexical Era
The dense retriever does not erase the lexical signal; it adds a second signal next to it. Everything BM25 ever rewarded still partially matters, but the marginal return on stuffing the same term thirty times has collapsed. What survives:
Exact entity names. Brand names, product names, person names, location names: these are what hybrid retrieval rescues from dense-only failure. If your brand is Acme Software, that exact string needs to appear at least once on the page in plain text where the indexer can find it, somewhere unambiguous, with the surrounding paraphrases the embedding model can latch onto.
Exact identifiers. SKUs, error codes, version strings, model numbers. Same story. Once on the page in the canonical form is what you need.
Structured data. Schema.org JSON-LD remains load-bearing because it gives the indexing pipeline a clean entity graph that does not depend on parsing prose.
Brand spellings and variations. If users search for both e-mail and email, or Wi-Fi and WiFi, both forms benefit from being present somewhere on the site. Embedding models are mostly, though not perfectly, robust here, and the BM25 leg is exact-only.
What is no longer worth doing (and was probably never worth doing as much as the SEO playbooks insisted) is keyword-density manipulation, exact-phrase repetition, and synonym dictionaries pasted into footers. The marginal return from these tactics in a hybrid stack is approximately zero, and in some cases negative, because the embedding pooling step degrades under repetition.
Designing Content for Both Filters
The practical content rule is short and unromantic: write the answer once in the canonical phrasing, then write the paraphrases around it, then make sure the structure is parseable.
The mechanism for each clause is real. Canonical phrasing gives BM25 the exact-match signal it needs. Paraphrases widen the region of embedding space the page covers, so the page lands close to a wider distribution of query rewrites. Parseable structure (short paragraphs, one thought per chunk, headings that match the prose, schema where appropriate) feeds the chunker and the structured-data layer downstream.
The thing the SEO industry got wrong, and is still getting wrong, is the assumption that you must choose between exact-match and semantic richness. The hybrid stack does not force a choice. It rewards both, scored by different paths and fused. Pages that try to win on exact-match alone fail the dense filter on paraphrases. Pages that try to win on semantic richness alone fail the sparse filter on exact identifiers and brand names. Pages that do both (which is what good prose has always been) match more of the query distribution.
How to Verify You Are Winning at the Embedding Layer
This is the part of the post where I tell you to stop guessing and start measuring, because the measurement is cheap and the alternative is folklore.
Pick a public embedding model: text-embedding-3-small from OpenAI, voyage-3 from Voyage, or a BGE model from BAAI (free). Pick a corpus of your own pages. Embed each page. Take a list of queries you believe should match those pages (literal phrasings, paraphrases, adversarial cases), embed those, and compute cosine similarity between every query and every page.
What you are looking for is not absolute numbers; embedding similarities are model-specific and not directly comparable across models. You are looking for ranks and gaps. For a query that should match page A, does page A come first? If it is buried under three tangentially related pages, your content is failing the dense filter, and the failure is diagnosable. Often the fix is a missing paraphrase, a buried answer the pooling step is averaging away, or a structure where the topic shifts halfway through and the pooled vector lands between two centroids.
Run the same exercise with BM25; most search libraries (Elasticsearch, OpenSearch, Vespa, Tantivy, Whoosh) implement it, and a standalone scorer is a few lines (a sketch of the full two-path audit follows below). Compare. The cases where the same query ranks the page differently between the two paths are the cases where hybrid retrieval will cover or expose your content. That comparison is the thing the SEO industry pretends not to need to do, because doing it makes the folklore harder to sell.
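Here is the two-path audit sketched end to end, assuming the rank_bm25 package for the sparse leg and sentence-transformers for the dense leg; the pages and queries are placeholders for your own corpus and query set:

```python
# Two-path audit: rank the same pages against the same queries with both
# a dense model and BM25, then look for disagreements between the orderings.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

pages = ["first page text goes here", "second page text goes here"]   # your corpus
queries = ["a literal phrasing", "a paraphrase of the same intent"]   # your query set

# Dense leg: cosine-similarity ranks.
# Note: some models (e.g. BGE) recommend a query-side prefix for retrieval;
# check the model card before trusting the ranks.
model = SentenceTransformer("BAAI/bge-small-en-v1.5")
P = model.encode(pages, normalize_embeddings=True)
Q = model.encode(queries, normalize_embeddings=True)
dense_ranks = np.argsort(-(Q @ P.T), axis=1)   # per query: best page index first

# Sparse leg: BM25 ranks over naive whitespace tokens.
bm25 = BM25Okapi([p.lower().split() for p in pages])
sparse_ranks = [np.argsort(-np.array(bm25.get_scores(q.lower().split())))
                for q in queries]

for i, q in enumerate(queries):
    print(f"{q!r}  dense: {dense_ranks[i]}  sparse: {sparse_ranks[i]}")
    # disagreements between the two orderings are the hybrid-sensitive cases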
I run this against my own content periodically, against competitors' content, and against the queries I expect AI assistants to rewrite mine into. It is the cheapest piece of due diligence in modern content engineering and consistently produces actionable findings.
A Note on Google
Whenever the dense-retrieval story comes up, someone asks "did Google switch to vectors?" The honest answer is that Google's retrieval stack is hybrid, partially private, and has been neural-augmented since well before the LLM era: RankBrain (2015) and BERT integration (2019) are the named layers, but those are not the entire stack, and the company has not published a definitive "we switched from BM25 to vectors on date X" statement because the truth is more complicated. The on-the-record position is hybrid: lexical features plus learned ranking plus several layers of neural processing in concert. AI Overviews and Gemini's grounded mode add their own retrieval and synthesis on top. Treating Google's stack as either "still BM25 underneath" or "all vectors now" mis-frames it. It is layered, hybrid, mostly private. The operational stance: assume both filters are present, design content that survives both, and do not bet against either signal.
The Synthesis
The recall step in modern AI search is dense where it used to be lexical, but the production stack is hybrid, and that is the framing the SEO industry has not absorbed. Embedding models match meaning. BM25 matches tokens. Both fire. The pages cited by AI assistants are the pages that survive both filters, not the pages that game one.
The single sentence: retrieval is no longer keyword match; it is hybrid recall where the dense signal handles paraphrase and intent and the sparse signal rescues exact identifiers, and content design that ignores either filter loses on the queries the other one would have caught.
If you only have time to internalise three things, in order:
- The first stage is hybrid, not lexical. Dense retrieval handles paraphrase, intent, and synonyms. Sparse retrieval handles exact identifiers, brand names, and rare terms. Both fire on every query in production stacks. Content that engineers for one and ignores the other loses on the queries the other one would have caught.
- The user's literal query is not the query that hits the retrieval stack. LLM rewrites paraphrase, expand, and normalise the query before retrieval. Optimising for the literal user phrasing is optimising for a string the index never sees. Optimising for the cluster of paraphrases around an intent is what moves the needle.
- Measure your content with a public embedding model. It costs almost nothing. Compute similarity between your pages and the queries you expect. Cases where a topically correct page ranks low in cosine similarity are cases where your content is failing the dense filter, and the failure is usually diagnosable. The SEO industry mostly does not do this, which is why so much advice is still keyword-stack folklore.
The page that ranks for the phrase nobody types is not magic. It is a page whose embedding sits close to the embedding of the query the user actually asked, in a space the model learned from a corpus closed before either of you wrote anything. The page that wins the exact-match phrase but does not get cited is the inverse: the lexical filter passed it, the dense filter dropped it, and the rerank step never saw it. Both outcomes are mechanical, both are addressable, and the content design that addresses both is what wins the hybrid retrieval stack which is the stack that decides what AI assistants see.
The retrieval-stack synthesis here is my own reading of the primary literature: Robertson and Zaragoza, The Probabilistic Relevance Framework: BM25 and Beyond (Foundations and Trends in Information Retrieval, 2009); Karpukhin et al., Dense Passage Retrieval for Open-Domain Question Answering (arXiv:2004.04906, 2020); Khattab and Zaharia, ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT (arXiv:2004.12832, 2020); Thakur et al., BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models (arXiv:2104.08663, 2021); Malkov and Yashunin, Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs (arXiv:1603.09320, 2016); and the Faiss library source and documentation, combined with observable behaviour from running my own embedding-similarity computations against my own corpus and watching what happened to AI citations after content rewrites. Where I have written "in my testing" or "the pattern I observe," that is exactly what I mean. The directional claims about exact-match SEO no longer paying are mechanistic (embedding similarity is computable on any public model, and the audit is reproducible), but I am not making quantitative promises, and the magnitude of any individual rewrite varies by domain, model, and query distribution. Provider behaviour is moving; verify against current docs and current model behaviour before shipping a strategy.