Building RAG that doesn't hallucinate

#ai #rag #genai #llama

Every RAG tutorial promises the same thing: hook a vector database up to an LLM, and suddenly your model is "grounded" and "won't hallucinate anymore." Then you actually build one, point it at real research papers, and watch it confidently cite a claim that isn't anywhere in the source document. RAG doesn't eliminate hallucination by default — it just gives the model more rope to hang itself with, dressed up as "context." Fixing that, for PaperMind, came down to two unglamorous things: chunking well, and refusing to hide the model's uncertainty from the user.

The retrieval pipeline

PaperMind's job is to let someone ask questions against a corpus of research papers and get answers grounded in the actual text — not in whatever LLaMA 3.1 happens to remember from pretraining. The pipeline behind that is a fairly standard RAG shape on the surface: documents get chunked, embedded, and stored in Pinecone; a query gets embedded the same way; the most relevant chunks get retrieved and stuffed into the prompt; LLaMA 3.1, served through Groq, generates the answer from that context.
The standard shape is also where most RAG systems quietly fail, and it's worth being specific about where.
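To make that shape concrete, here is a minimal, self-contained sketch of the retrieve-then-prompt flow. It is not PaperMind's code: `embed` is a toy bag-of-words stand-in for a real embedding model, the in-memory list stands in for Pinecone, and `retrieve` and `build_prompt` are hypothetical helper names. The actual call to LLaMA 3.1 through Groq is left out.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: List[str], top_k: int = 2) -> List[Tuple[float, str]]:
    """Score every chunk against the query and keep the top_k best."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def build_prompt(question: str, retrieved: List[Tuple[float, str]]) -> str:
    """Put the retrieved chunks into the prompt ahead of the question."""
    context = "\n\n".join(
        f"[chunk {i + 1}] {text}" for i, (_, text) in enumerate(retrieved)
    )
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say \"I don't have enough information.\"\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

Everything that follows is about what goes wrong inside `retrieve` and in the chunks it is given, not in the final generation call.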

Why naive chunking breaks things

The default move in most RAG walkthroughs is fixed-size chunking: split every document into, say, 500-token blocks and move on. For research papers, this actively works against retrieval quality. A 500-token window will frequently cut a sentence in half, separate a claim from the citation that supports it, or split a table from the caption that explains it. When that broken chunk gets retrieved and handed to the LLM as "context," the model has to answer from a fragment that's missing exactly the information that would have made the answer correct, and it will often fill the gap with something plausible-sounding instead of saying "I don't have enough information."
That's the actual mechanism behind a lot of RAG hallucination. It's not that the model is "ignoring" the context — it's that the context it was handed was already broken before it ever reached the prompt.
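A tiny illustration of that failure, using whitespace-separated words as a rough stand-in for tokens and a made-up two-sentence passage (the passage, the `[12]` citation marker, and the function name `fixed_size_chunks` are all invented for this example):

```python
def fixed_size_chunks(text: str, size: int) -> list:
    """Split into blocks of `size` words, ignoring sentence boundaries."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


# Invented passage: a claim, its citation marker, then a qualifying sentence.
passage = (
    "Dropout improves generalization on small datasets [12]. "
    "The effect shrinks as the training set grows."
)

chunks = fixed_size_chunks(passage, size=5)
for chunk in chunks:
    print(repr(chunk))
```

The first chunk ends at "small", so the claim loses both its object and its citation, and the second chunk starts mid-sentence with the citation attached to nothing. Either fragment can still score well against a question about dropout.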
The fix in PaperMind is semantic chunking: instead of splitting on a fixed token count, chunks are formed around semantically coherent units — keeping a claim together with its supporting sentences, keeping a section's argument intact rather than slicing it at an arbitrary boundary. This is more expensive to compute than fixed-size splitting and it's not a solved problem — there's no chunking strategy that's perfect for every paper structure — but it consistently produces retrieved context that actually contains complete thoughts, which matters more for answer quality than almost any other knob in the pipeline.
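One common way to approximate this, shown here as a sketch under stated assumptions rather than as PaperMind's actual method, is to split on sentence boundaries and start a new chunk only when the next sentence stops resembling the chunk built so far. The `embed` function is again a toy bag-of-words stand-in for a real embedding model, and the `threshold` value of 0.2 is arbitrary.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(text: str, threshold: float = 0.2) -> list:
    """Group whole sentences; break when similarity to the current chunk drops."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], []
    for sentence in sentences:
        drifted = current and cosine(embed(" ".join(current)), embed(sentence)) < threshold
        if drifted:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Whatever the similarity measure, the property that matters is visible in the loop: a chunk boundary can only fall between sentences, never inside one. A real implementation would likely also cap chunk length and treat tables, captions, and citations as units that stay together.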

Pinecone, and the boring part that actually matters

The vector store itself (Pinecone, in this case) is the least interesting part of the system to talk about and one of the most important to get right operationally. The embeddings need to come from a model whose notion of "similarity" matches what counts as relevant for research-paper Q&A; abstract semantic similarity isn't quite the same thing as "this chunk would help answer this specific question." Tuning the retrieval, meaning how many chunks to pull back and where to set the score threshold below which a chunk is treated as probably irrelevant, turned out to matter more for final answer quality than swapping the LLM ever did.
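Those two knobs can be sketched as a small filter over scored matches. The input here is a plain list of dicts with `id` and `score` keys, chosen to resemble what a vector store query returns; the function name `select_chunks` and the default values for `top_k` and `min_score` are placeholders, not PaperMind's tuned settings.

```python
def select_chunks(matches: list, top_k: int = 5, min_score: float = 0.75) -> list:
    """Keep at most top_k matches, and drop any that score below min_score."""
    ranked = sorted(matches, key=lambda m: m["score"], reverse=True)
    return [m for m in ranked[:top_k] if m["score"] >= min_score]


# Illustrative scores only.
matches = [
    {"id": "paper-a#4", "score": 0.91},
    {"id": "paper-b#2", "score": 0.42},
    {"id": "paper-a#5", "score": 0.80},
]

kept = select_chunks(matches, top_k=2, min_score=0.75)
print([m["id"] for m in kept])
```

An empty result is useful information: it is the case where the system should say it does not have enough material, rather than passing the least-bad chunk to the model.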

The part most RAG demos skip: chunk-score transparency

This is the piece I think actually made PaperMind trustworthy rather than just functional: surfacing the retrieval scores to the user instead of hiding them behind the final generated answer.
Every RAG system already computes a similarity score for each retrieved chunk — that's how it decides what to retrieve in the first place. Almost no RAG demo shows that number to the user. The answer just appears, fully formed, with the same tone of confidence whether the underlying retrieval was a strong match or a desperate scrape of the least-bad chunk available.
PaperMind surfaces the chunk scores alongside the answer, so a user can see not just "here's the answer" but "here's the answer, and here's how confident the retrieval step actually was in the material it found." When the top retrieved chunk has a low similarity score, that's a signal worth seeing: it usually means the answer is more synthesis-from-weak-evidence than direct citation, and a user who can see that score knows to double-check before treating the answer as settled. This is a small UI decision with an outsized effect on trust: it turns the system from "is this thing lying to me" into "I can see exactly how grounded this particular answer is."
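A minimal sketch of what that can look like on the output side, assuming each retrieved chunk is a dict with an `id` and a `score`. The function name `render_answer`, the layout, and the `weak_below` cutoff of 0.5 are all hypothetical; the real UI and the right cutoff depend on the embedding model and the corpus.

```python
def render_answer(answer: str, matches: list, weak_below: float = 0.5) -> str:
    """Append per-chunk retrieval scores, and flag answers built on weak matches."""
    lines = [answer, "", "Sources (retrieval similarity):"]
    for m in sorted(matches, key=lambda m: m["score"], reverse=True):
        lines.append(f"  {m['score']:.2f}  {m['id']}")
    top = max((m["score"] for m in matches), default=0.0)
    if top < weak_below:
        lines.append("Note: weak retrieval match; verify against the paper.")
    return "\n".join(lines)
```

The scores already exist by the time the answer is generated, so the only cost is passing them through to the response instead of discarding them.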

What I'd tell someone building their first RAG system

If I had to compress this into the two things that actually matter, beyond getting embeddings and an LLM call working: chunk like the structure of your documents actually matters, because it does, and never let the final answer hide how confident the retrieval step was. The LLM generating fluent, confident-sounding text is the easy part — it's good at that regardless of whether the underlying evidence supports it. The hard part, and the part that actually determines whether your RAG system is trustworthy in production, is making sure the retrieval step is honest about what it found, and making sure that honesty doesn't get lost between the vector store and the chat bubble the user reads.

DEV Community
