Great point on the overlap being underestimated — it's the difference between "almost found" and "actually found" in production.
Your patent search use case is a perfect example of where BM25 shines. Exact phrase matching on domain-specific terminology like "solid-state battery electrolyte" is something embeddings genuinely struggle with — semantic similarity doesn't help when the user needs that exact compound noun.
For the hybrid approach you're considering: LangChain has an EnsembleRetriever that combines BM25 and vector retrievers using Reciprocal Rank Fusion. You set weights per retriever — something like 0.6 BM25 / 0.4 vector would keep exact matches dominant while letting embeddings surface conceptually related patents the keyword search misses.
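The weighted fusion EnsembleRetriever performs can be sketched in plain Python. This is not LangChain's internal code, just a minimal Reciprocal Rank Fusion with per-retriever weights; the patent IDs and hit lists are made up for illustration:

```python
def weighted_rrf(ranked_lists, weights, k=60):
    """Reciprocal Rank Fusion: each retriever contributes
    weight / (k + rank) for every document it returns."""
    scores = {}
    for docs, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(docs, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["US-101", "US-202", "US-303"]  # hypothetical patent IDs
vector_hits = ["US-202", "US-404", "US-101"]
merged = weighted_rrf([bm25_hits, vector_hits], weights=[0.6, 0.4])
```

Note how a document that appears in both lists ("US-202" here) accumulates score from each retriever, which is what lets a mid-ranked vector hit overtake a top BM25 hit when both retrievers agree on it.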
At 3.5M documents the practical bottleneck is usually the reranking step. A cross-encoder reranker (like cross-encoder/ms-marco-MiniLM-L-6-v2) on the top-k merged results adds latency but measurably improves precision — worth testing on your corpus to see if the accuracy gain justifies the cost.
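The reranking step only needs to touch the head of the merged list, which is what keeps it affordable at corpus scale. A sketch of that pattern, with a toy token-overlap scorer standing in for a real cross-encoder (in practice you'd call something like sentence-transformers' ms-marco model there; everything below is illustrative):

```python
def rerank_top_k(query, candidates, score_fn, k=50):
    """Re-score only the top-k merged results with a (slow) scorer,
    leaving the long tail in its original fused order."""
    head, tail = candidates[:k], candidates[k:]
    rescored = sorted(head, key=lambda doc: score_fn(query, doc), reverse=True)
    return rescored + tail

def overlap_score(query, doc):
    # Stand-in for a cross-encoder relevance score: plain token overlap.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

docs = ["solid-state battery electrolyte layer",
        "lithium battery casing design",
        "solid-state electrolyte deposition method"]
ranked = rerank_top_k("solid-state battery electrolyte", docs, overlap_score, k=3)
```

The design choice is that `k` caps the number of expensive scorer calls per query, so latency stays bounded no matter how many documents fusion returns.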
Would be curious how FTS5 performs on queries where the user describes a concept rather than using the exact patent terminology.
Patent lawyer turned AI engineer. Processed 4M patents with a local LLM on an RTX 5090. Building PatentLLM, an AI-powered patent search engine. Also ranked #1 on Floodgate (shogi AI). Writes about local LLMs.
Thanks for the EnsembleRetriever tip — bookmarking that for later.
For now, I solved the concept query problem a different way. I pre-built a keyword index from the actual patent corpus — a controlled vocabulary extracted from the data itself. When a user queries in natural language, the LLM selects search terms only from this existing index, not from its own imagination. So "self-driving car obstacle detection" gets mapped to terms like "autonomous driving," "obstacle detection," "lidar" that are guaranteed to exist in the database.
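The filtering step described above can be sketched in a few lines. The vocabulary set and the LLM's proposed expansion below are made-up examples, not the actual index:

```python
# Controlled vocabulary extracted from the corpus (illustrative subset).
VOCAB = {"autonomous driving", "obstacle detection", "lidar",
         "lane keeping", "solid-state battery"}

def constrain_terms(candidate_terms, vocab=VOCAB):
    """Keep only terms the LLM proposed that actually exist in the index;
    anything outside the vocabulary is a potential hallucination and is dropped."""
    return [t for t in candidate_terms if t.lower() in vocab]

# Suppose the LLM expanded "self-driving car obstacle detection" into:
proposed = ["autonomous driving", "obstacle detection",
            "self-driving perception stack"]  # last one isn't in the index
terms = constrain_terms(proposed)
```

Every surviving term is guaranteed to hit at least one document, which is exactly the precision property the approach relies on.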
The LLM can't hallucinate search terms that don't match anything, so precision went way up. It's working well enough for now, but at 3.5M docs I'm sure there are edge cases where embedding reranking on top would help. Might try that 0.6/0.4 split you mentioned as a next step.
The controlled vocabulary approach is smart — constraining the LLM's search term selection to terms that actually exist in your corpus eliminates the hallucinated-query problem at the source. That's fundamentally more reliable than trying to catch bad queries downstream.
The 0.6/0.4 BM25/vector split would complement this well. Your keyword index handles precision (exact patent terminology), and embeddings would catch the conceptual gaps — cases where an inventor describes a mechanism differently than prior art but means the same thing. At 3.5M documents, even a small improvement in recall at that scale translates to real patent coverage gains.
One thing worth testing: you could use the embedding reranker selectively — only on queries where the keyword index returns fewer than N results. That way you're not adding latency to queries that already have strong exact matches, and you only pay the embedding cost when BM25 alone isn't enough.
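That selective-reranking gate is a small piece of control flow. A sketch, with hypothetical stand-ins for the real BM25 retriever and embedding reranker:

```python
def search(query, bm25_search, rerank, min_hits=20):
    """Run BM25 first; only pay the embedding/reranking cost when the
    keyword index comes back thin (fewer than min_hits results)."""
    hits = bm25_search(query)
    if len(hits) >= min_hits:
        return hits               # strong exact matches: skip reranking
    return rerank(query, hits)    # sparse results: let embeddings help

# Hypothetical stand-ins for the real retriever and reranker:
fake_bm25 = lambda q: [f"doc-{i}" for i in range(3)]  # thin result set
fake_rerank = lambda q, hits: list(reversed(hits))
out = search("novel electrolyte concept", fake_bm25, fake_rerank, min_hits=20)
```

Tuning `min_hits` against a query log would tell you what fraction of traffic ever triggers the expensive path.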
The controlled vocabulary approach is clever — constraining the LLM's term selection to your actual corpus index eliminates hallucinated search terms at the source. That's essentially a retrieval-side guardrail, which pairs well with the synthesis-side constraints in the article. At 3.5M docs, the hybrid could work well: let your vocabulary index handle precision (exact patent terminology), then use embedding reranking to catch semantic near-misses that BM25 would drop. The 0.6/0.4 split is a starting point — with patent data you might want to skew heavier toward BM25 (0.7/0.3) since exact phrasing matters more in legal/technical domains.