Most RAG tutorials teach you to stuff documents into a vector store and call it a day. Then your users ask a question and get back completely wrong...
Great writeup on the chunking strategy — the overlap point is often underestimated.
I took a different approach for my use case (patent search across 3.5M documents). Instead of vector search, I went with SQLite FTS5 + BM25 because patent attorneys need exact phrase matching — "solid-state battery electrolyte" has to match those exact words, not semantically similar docs.
That said, I'm considering a hybrid: FTS5 for initial retrieval, then reranking with embeddings. Curious if anyone here has tried combining BM25 with vector similarity in a single pipeline?
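For anyone curious what the FTS5 + BM25 setup looks like, here's a minimal sketch using Python's stdlib `sqlite3`. The table name, columns, and documents are illustrative, not the real patent schema:

```python
import sqlite3

# In-memory DB for illustration; a real 3.5M-doc index would live on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE patents USING fts5(title, abstract)")
conn.executemany(
    "INSERT INTO patents VALUES (?, ?)",
    [
        ("Solid-state battery electrolyte",
         "A solid-state battery electrolyte comprising a sulfide glass."),
        ("Battery chemistry overview",
         "Liquid electrolytes for lithium-ion cells."),
    ],
)

# Exact phrase match: the phrase goes in double quotes inside MATCH.
# bm25() returns a rank where lower (more negative) is better.
rows = conn.execute(
    """SELECT title, bm25(patents) AS rank
       FROM patents
       WHERE patents MATCH '"solid-state battery electrolyte"'
       ORDER BY rank"""
).fetchall()
print(rows)
```

Only the first document matches: FTS5 has no stemming by default, so the semantically related "electrolytes" doc is (correctly, for this use case) excluded.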
Great point on the overlap being underestimated — it's the difference between "almost found" and "actually found" in production.
Your patent search use case is a perfect example of where BM25 shines. Exact phrase matching on domain-specific terminology like "solid-state battery electrolyte" is something embeddings genuinely struggle with — semantic similarity doesn't help when the user needs that exact compound noun.
For the hybrid approach you're considering: LangChain has an `EnsembleRetriever` that combines BM25 and vector retrievers using Reciprocal Rank Fusion. You set weights per retriever — something like 0.6 BM25 / 0.4 vector would keep exact matches dominant while letting embeddings surface conceptually related patents the keyword search misses.

At 3.5M documents the practical bottleneck is usually the reranking step. A cross-encoder reranker (like `cross-encoder/ms-marco-MiniLM-L-6-v2`) on the top-k merged results adds latency but measurably improves precision — worth testing on your corpus to see if the accuracy gain justifies the cost.

Would be curious how FTS5 performs on queries where the user describes a concept rather than using the exact patent terminology.
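The weighted fusion step is simple enough to sketch in plain Python without the LangChain dependency. The doc IDs and result lists below are made up for illustration; `k=60` is the conventional RRF smoothing constant:

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, k=60):
    """Weighted Reciprocal Rank Fusion: each list contributes
    weight / (k + rank) for every document it returned."""
    scores = defaultdict(float)
    for docs, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(docs, start=1):
            scores[doc_id] += weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["US-111", "US-222", "US-333"]    # exact-phrase matches
vector_hits = ["US-444", "US-111", "US-555"]  # semantic neighbours
fused = weighted_rrf([bm25_hits, vector_hits], weights=[0.6, 0.4])
print(fused[0])  # US-111: ranked by both lists, so it wins
```

A document that appears in both lists accumulates score from each, which is why agreement between BM25 and embeddings floats a result to the top even when neither ranked it first.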
Thanks for the EnsembleRetriever tip — bookmarking that for later.
For now, I solved the concept query problem a different way. I pre-built a keyword index from the actual patent corpus — a controlled vocabulary extracted from the data itself. When a user queries in natural language, the LLM selects search terms only from this existing index, not from its own imagination. So "self-driving car obstacle detection" gets mapped to terms like "autonomous driving," "obstacle detection," "lidar" that are guaranteed to exist in the database.
The LLM can't hallucinate search terms that don't match anything, so precision went way up. It's working well enough for now, but at 3.5M docs I'm sure there are edge cases where embedding reranking on top would help. Might try that 0.6/0.4 split you mentioned as a next step.
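The filtering step is trivial to sketch. The vocabulary below is a hypothetical stand-in for the corpus-extracted index; in the real pipeline the LLM proposes candidate terms and anything outside the index is dropped before the query runs:

```python
# Hypothetical controlled vocabulary extracted from the patent corpus.
VOCAB = {"autonomous driving", "obstacle detection", "lidar"}

def constrain_terms(candidate_terms):
    """Keep only terms that exist in the corpus-derived index, so every
    search term submitted to FTS5 is guaranteed to match something."""
    return [t for t in candidate_terms if t.lower() in VOCAB]

# Invented terms from the LLM are silently discarded.
proposed = ["autonomous driving", "self-piloting vehicle", "lidar"]
filtered = constrain_terms(proposed)
print(filtered)  # ['autonomous driving', 'lidar']
```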
The controlled vocabulary approach is smart — constraining the LLM's search term selection to terms that actually exist in your corpus eliminates the hallucinated-query problem at the source. That's fundamentally more reliable than trying to catch bad queries downstream.
The 0.6/0.4 BM25/vector split would complement this well. Your keyword index handles precision (exact patent terminology), and embeddings would catch the conceptual gaps — cases where an inventor describes a mechanism differently than prior art but means the same thing. At 3.5M documents, even a small improvement in recall at that scale translates to real patent coverage gains.
One thing worth testing: you could use the embedding reranker selectively — only on queries where the keyword index returns fewer than N results. That way you're not adding latency to queries that already have strong exact matches, and you only pay the embedding cost when BM25 alone isn't enough.
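The selective-reranking gate might look like this, with stub functions standing in for the real FTS5 search and embedding reranker, and a made-up threshold for "fewer than N results":

```python
MIN_KEYWORD_HITS = 10  # hypothetical N; tune against your latency budget

def search(query, keyword_search, rerank):
    """Only pay the embedding cost when BM25 alone looks thin."""
    hits = keyword_search(query)
    if len(hits) >= MIN_KEYWORD_HITS:
        return hits              # strong exact matches: skip the reranker
    return rerank(query, hits)   # sparse results: let embeddings help

# Stubs: a query with plenty of exact matches bypasses reranking entirely.
results = search("lidar",
                 lambda q: [f"doc{i}" for i in range(12)],
                 lambda q, h: h[::-1])
print(len(results))  # 12, reranker never ran
```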
The controlled vocabulary approach is clever — constraining the LLM's term selection to your actual corpus index eliminates hallucinated search terms at the source. That's essentially a retrieval-side guardrail, which pairs well with the synthesis-side constraints in the article. At 3.5M docs, the hybrid could work well: let your vocabulary index handle precision (exact patent terminology), then use embedding reranking to catch semantic near-misses that BM25 would drop. The 0.6/0.4 split is a starting point — with patent data you might want to skew heavier toward BM25 (0.7/0.3) since exact phrasing matters more in legal/technical domains.
@soytuber The hybrid approach you are describing (BM25 for initial retrieval, embeddings for reranking) is well-established and works particularly well for domain-specific corpora like patents. The key advantage: BM25 gives you exact lexical matching that vector search misses (your electrolyte example is perfect — embeddings might match documents about battery chemistry that never mention your exact compound). For the reranking step, cross-encoder models like ms-marco-MiniLM tend to outperform bi-encoder similarity for precision-critical tasks. The pattern is: BM25 retrieves top-100 candidates (fast, exact), then a cross-encoder reranks the top-20 by semantic relevance. At 3.5M documents, the FTS5 index keeps your first-stage latency low while the reranker handles the semantic nuance.
@nyrok The training alignment asymmetry is the mechanism that makes negative constraints so reliable. Models are trained extensively on "follow instructions" and "refuse harmful requests" -- negative constraints map directly onto that refusal architecture. When you write "do not reason beyond the retrieved passages," the model treats it like a safety boundary, not a style preference.
That also explains why stacking multiple negative constraints compounds rather than conflicts. Each one activates a separate refusal pathway: "do not infer beyond context" + "do not combine claims from different chunks without stating so" + "do not fill gaps with general knowledge" -- each narrows the generation space independently. Positive instructions don't stack the same way because they all compete for the same "try to comply" mechanism.
@nyrok The statement-level vs paragraph-level grounding distinction is the exact failure mode I see most in production RAG. The model "summarizes" across passages and parametric memory fills gaps between statements without any explicit decision to do so. Source attribution per claim forces each sentence to be individually grounded — if it can't cite a passage, the sentence doesn't survive.
The XML block separation point from Anthropic's docs is practical and measurable. When constraints are inline with instructions, the model treats them as soft preferences. In a dedicated block, they function closer to system-level directives. Moving RAG constraints into typed XML blocks produces a measurable drop in unsupported claims.
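A rough sketch of what that separation looks like in practice. The tag names and constraint wording here are illustrative, not a prescribed schema; the point is that the negative constraints occupy their own typed block rather than being interleaved with the task text:

```python
# Grounding constraints live in a dedicated tagged block, separate from
# both the retrieved passages and the question.
CONSTRAINTS = """<constraints>
- Do not reason beyond the retrieved passages.
- Do not combine claims from different chunks without stating so.
- Do not fill gaps with general knowledge.
</constraints>"""

def build_prompt(passages, question):
    docs = "\n".join(
        f'<passage id="{i}">{p}</passage>' for i, p in enumerate(passages)
    )
    return (f"<documents>\n{docs}\n</documents>\n\n"
            f"{CONSTRAINTS}\n\n<question>{question}</question>")

prompt = build_prompt(["Chunk A text.", "Chunk B text."],
                      "What does chunk A say?")
```

Giving each passage an `id` also sets up the statement-level attribution discussed above: the model can cite `passage 0` per claim instead of gesturing at "the context".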
@nyrok The statement-level vs paragraph-level grounding distinction you're drawing is the exact differentiator. When you enforce "cite which passage supports each claim," the model has to decompose its answer into individually verifiable units — any claim without a backing chunk either gets dropped or flagged as unsupported.
Source attribution plus negative constraints together eliminate the two main failure modes: attribution catches unbacked claims at the statement level, while "do not infer beyond passages" prevents the model from filling gaps with parametric knowledge between statements. Without both, the model finds ways to blend retrieved and memorized content in ways that are nearly impossible to detect downstream.
@nyrok The distinction you draw between behavioral guardrails and vague instructions is the core insight. "Do not infer beyond retrieved passages" creates a hard boundary the model treats as inviolable, while "only use context" reads as aspirational guidance it can comply with loosely.
Statement-level grounding through source attribution was the biggest quality gain in our RAG pipelines too — it catches exactly the failure mode where parametric memory blends in during paragraph-level synthesis.
The XML block separation point is key. When negative constraints live in their own tagged section, they survive the attention mechanism much better than inline instructions that get diluted by surrounding content. Good reference on the Anthropic docs — worth reading for anyone building production RAG.