Great point on the overlap being underestimated — it's the difference between "almost found" and "actually found" in production.
Your patent search use case is a perfect example of where BM25 shines. Exact phrase matching on domain-specific terminology like "solid-state battery electrolyte" is something embeddings genuinely struggle with — semantic similarity doesn't help when the user needs that exact compound noun.
For the hybrid approach you're considering: LangChain has an EnsembleRetriever that combines BM25 and vector retrievers using Reciprocal Rank Fusion. You set weights per retriever — something like 0.6 BM25 / 0.4 vector would keep exact matches dominant while letting embeddings surface conceptually related patents the keyword search misses.
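The weighted fusion EnsembleRetriever performs can be sketched in plain Python. This is not LangChain's internal code, just a minimal Reciprocal Rank Fusion with per-retriever weights; the patent IDs and hit lists are made up for illustration:

```python
def weighted_rrf(ranked_lists, weights, k=60):
    """Reciprocal Rank Fusion: each retriever contributes
    weight / (k + rank) for every document it returns."""
    scores = {}
    for docs, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(docs, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["US-101", "US-202", "US-303"]  # hypothetical patent IDs
vector_hits = ["US-202", "US-404", "US-101"]
merged = weighted_rrf([bm25_hits, vector_hits], weights=[0.6, 0.4])
```

Note how a document that appears in both lists ("US-202" here) accumulates score from each retriever, which is what lets a mid-ranked vector hit overtake a top BM25 hit when both retrievers agree on it.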
At 3.5M documents the practical bottleneck is usually the reranking step. A cross-encoder reranker (like cross-encoder/ms-marco-MiniLM-L-6-v2) on the top-k merged results adds latency but measurably improves precision — worth testing on your corpus to see if the accuracy gain justifies the cost.
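The reranking step only needs to touch the head of the merged list, which is what keeps it affordable at corpus scale. A sketch of that pattern, with a toy token-overlap scorer standing in for a real cross-encoder (in practice you'd call something like sentence-transformers' ms-marco model there; everything below is illustrative):

```python
def rerank_top_k(query, candidates, score_fn, k=50):
    """Re-score only the top-k merged results with a (slow) scorer,
    leaving the long tail in its original fused order."""
    head, tail = candidates[:k], candidates[k:]
    rescored = sorted(head, key=lambda doc: score_fn(query, doc), reverse=True)
    return rescored + tail

def overlap_score(query, doc):
    # Stand-in for a cross-encoder relevance score: plain token overlap.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

docs = ["solid-state battery electrolyte layer",
        "lithium battery casing design",
        "solid-state electrolyte deposition method"]
ranked = rerank_top_k("solid-state battery electrolyte", docs, overlap_score, k=3)
```

The design choice is that `k` caps the number of expensive scorer calls per query, so latency stays bounded no matter how many documents fusion returns.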
Would be curious how FTS5 performs on queries where the user describes a concept rather than using the exact patent terminology.
Patent lawyer turned AI engineer. Processed 4M patents with a local LLM on an RTX 5090. Building PatentLLM, an AI-powered patent search engine. Also ranked #1 on Floodgate (shogi AI). Writes about local LLMs.
Thanks for the EnsembleRetriever tip — bookmarking that for later.
For now, I solved the concept query problem a different way. I pre-built a keyword index from the actual patent corpus — a controlled vocabulary extracted from the data itself. When a user queries in natural language, the LLM selects search terms only from this existing index, not from its own imagination. So "self-driving car obstacle detection" gets mapped to terms like "autonomous driving," "obstacle detection," "lidar" that are guaranteed to exist in the database.
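The filtering step described above can be sketched in a few lines. The vocabulary set and the LLM's proposed expansion below are made-up examples, not the actual index:

```python
# Controlled vocabulary extracted from the corpus (illustrative subset).
VOCAB = {"autonomous driving", "obstacle detection", "lidar",
         "lane keeping", "solid-state battery"}

def constrain_terms(candidate_terms, vocab=VOCAB):
    """Keep only terms the LLM proposed that actually exist in the index;
    anything outside the vocabulary is a potential hallucination and is dropped."""
    return [t for t in candidate_terms if t.lower() in vocab]

# Suppose the LLM expanded "self-driving car obstacle detection" into:
proposed = ["autonomous driving", "obstacle detection",
            "self-driving perception stack"]  # last one isn't in the index
terms = constrain_terms(proposed)
```

Every surviving term is guaranteed to hit at least one document, which is exactly the precision property the approach relies on.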
The LLM can't hallucinate search terms that don't match anything, so precision went way up. It's working well enough for now, but at 3.5M docs I'm sure there are edge cases where embedding reranking on top would help. Might try that 0.6/0.4 split you mentioned as a next step.
The controlled vocabulary approach is smart — constraining the LLM's search term selection to terms that actually exist in your corpus eliminates the hallucinated-query problem at the source. That's fundamentally more reliable than trying to catch bad queries downstream.
The 0.6/0.4 BM25/vector split would complement this well. Your keyword index handles precision (exact patent terminology), and embeddings would catch the conceptual gaps — cases where an inventor describes a mechanism differently than prior art but means the same thing. At 3.5M documents, even a small improvement in recall at that scale translates to real patent coverage gains.
One thing worth testing: you could use the embedding reranker selectively — only on queries where the keyword index returns fewer than N results. That way you're not adding latency to queries that already have strong exact matches, and you only pay the embedding cost when BM25 alone isn't enough.
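That selective-reranking gate is a small piece of control flow. A sketch, with hypothetical stand-ins for the real BM25 retriever and embedding reranker:

```python
def search(query, bm25_search, rerank, min_hits=20):
    """Run BM25 first; only pay the embedding/reranking cost when the
    keyword index comes back thin (fewer than min_hits results)."""
    hits = bm25_search(query)
    if len(hits) >= min_hits:
        return hits               # strong exact matches: skip reranking
    return rerank(query, hits)    # sparse results: let embeddings help

# Hypothetical stand-ins for the real retriever and reranker:
fake_bm25 = lambda q: [f"doc-{i}" for i in range(3)]  # thin result set
fake_rerank = lambda q, hits: list(reversed(hits))
out = search("novel electrolyte concept", fake_bm25, fake_rerank, min_hits=20)
```

Tuning `min_hits` against a query log would tell you what fraction of traffic ever triggers the expensive path.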
The controlled vocabulary approach is clever — constraining the LLM's term selection to your actual corpus index eliminates hallucinated search terms at the source. That's essentially a retrieval-side guardrail, which pairs well with the synthesis-side constraints in the article. At 3.5M docs, the hybrid could work well: let your vocabulary index handle precision (exact patent terminology), then use embedding reranking to catch semantic near-misses that BM25 would drop. The 0.6/0.4 split is a starting point — with patent data you might want to skew heavier toward BM25 (0.7/0.3) since exact phrasing matters more in legal/technical domains.