How to stop your RAG assistant from hallucinating (a practical guide)

#ai #llm #rag #tutorial

A RAG (retrieval-augmented generation) assistant is supposed to answer from your documents, not from the model's imagination. But teams keep shipping bots that confidently invent prices, policies, and features that don't exist. The hallucination usually isn't the model "being creative" — it's a retrieval and prompting problem you can fix. Here's the checklist I actually use.

Hallucination is usually a retrieval failure, not a model failure

If the right chunk never makes it into the context, the model fills the gap by guessing. So before blaming the LLM, ask: did retrieval even return the correct passage? Most "the AI lied" bugs are really "we fed it the wrong context" bugs. Fix retrieval first; it's where the biggest wins are.

Chunk for meaning, not for byte count

Splitting documents every N characters cuts sentences in half and scatters one answer across chunks.

Split on semantic boundaries — headings, sections, list items — not arbitrary lengths.
Keep chunks focused: one idea per chunk retrieves far more accurately than a wall of mixed topics.
Add a small overlap between adjacent chunks (a sentence or two, for example) so context isn't lost at the seams.
Attach metadata (source, title, date) to every chunk — you'll need it for filtering and citations.
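
The chunking advice above can be sketched roughly as follows. This is a minimal illustration, not a production splitter: it assumes the documents are Markdown-style text with `#` headings, and the names `Chunk`, `chunk_by_heading`, `base_metadata`, and `overlap_sentences` are hypothetical, made up for this example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_by_heading(doc: str, base_metadata: dict, overlap_sentences: int = 1) -> list[Chunk]:
    """Split a Markdown-style document into one chunk per heading section."""
    # Zero-width split in front of every heading line ("# ", "## ", ...).
    sections = [s.strip() for s in re.split(r"(?m)^(?=#{1,6} )", doc) if s.strip()]
    chunks: list[Chunk] = []
    tail = ""  # closing sentence(s) of the previous section, reused as overlap
    for section in sections:
        first_line, _, body = section.partition("\n")
        if first_line.startswith("#"):
            title = first_line.lstrip("#").strip()
        else:
            # Text before the first heading has no title.
            title, body = "", section
        text = f"{tail}\n{section}" if tail else section
        # Every chunk carries the shared metadata (source, date, ...) plus its own title.
        chunks.append(Chunk(text=text, metadata={**base_metadata, "title": title}))
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", body.strip()) if s]
        tail = " ".join(sentences[-overlap_sentences:]) if overlap_sentences > 0 else ""
    return chunks
```

Calling `chunk_by_heading(text, {"source": "pricing.md", "date": "..."})` yields one chunk per section, each prefixed with the last sentence of the section before it. The sentence splitter here is deliberately naive; real documents with abbreviations or tables would likely need something sturdier.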

Retrieve better than plain vector search

Pure embedding search often misses exact terms (product codes, names, numbers), because embeddings capture overall meaning rather than exact strings.

Use hybrid search: combine semantic (vector) with keyword (BM25) so both meaning and exact matches are covered.
Add a reranker to reorder the top candidates before they hit the prompt; it often lifts answer quality, but confirm the gain on your own test questions.
Tune how many chunks you pass. Too few starves the model; too many buries the answer in noise.
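
One common way to combine the two result lists is reciprocal rank fusion (RRF). The post doesn't prescribe a fusion method, so treat this as one option among several. The sketch assumes you already have a ranked list of chunk ids from a vector search and another from a BM25 search; the function name and the default `k = 60` are illustrative choices, not something taken from a specific library.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids into a single ranking.

    Each chunk earns 1 / (k + rank) from every list it appears in, so chunks
    that rank well in both semantic and keyword search float to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties keep first-seen order.
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


# Hypothetical outputs of a vector search and a BM25 search for the same query.
vector_hits = ["a", "b", "c"]
keyword_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
top_chunks = fused[:3]  # the "how many chunks to pass" knob to tune
```

RRF only needs ranks, not scores, which sidesteps the problem that cosine similarities and BM25 scores live on different scales. A reranker, if you add one, would then reorder `fused` before the cut.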

Make "I don't know" a first-class answer

This one instruction heads off many of the most embarrassing hallucinations: tell the model to answer only from the provided context, and to say it doesn't know when the context doesn't contain the answer.

If the answer is not in the provided context, say you don't have that information — do not guess.

A bot that admits a gap is trustworthy. A bot that confidently makes something up costs you a customer.
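
Here is a rough sketch of how that instruction might be wired into a prompt builder. The exact wording, the fixed refusal string, and the chunk dictionary shape (`source` and `text` keys) are assumptions for illustration; adapt them to whatever your model and retriever actually use.

```python
REFUSAL = "I don't have that information."


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a prompt that restricts the model to the retrieved context."""
    # Number each chunk and show its source so the model can refer back to it.
    context = "\n\n".join(
        f"[{i}] ({chunk['source']}) {chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n"
        f"If the answer is not in the provided context, reply exactly: {REFUSAL}\n"
        "Do not guess.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


def is_refusal(answer: str) -> bool:
    """Detect the fixed refusal so the app can log it or route it to a human."""
    return answer.strip() == REFUSAL
```

Using one fixed refusal string makes "I don't know" easy to detect downstream, so you can count how often it happens and see which questions your documents fail to cover.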

Demand citations and ground every claim

Have the model cite the source chunk for each statement. Two benefits: users can verify, and you can automatically flag answers where the cited text doesn't actually support the claim. If it can't cite, it shouldn't assert.
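
The automatic flagging can start with something very simple: check that every quoted passage really appears in the source it cites. The sketch below assumes the model returns claims as dictionaries with `source_id` and `quote` keys, which is a hypothetical output format, and it normalizes casing and whitespace before comparing. Note the limit: it confirms the quote exists in the source, not that the quote supports the claim.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Unicode-normalize, case-fold, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return re.sub(r"\s+", " ", text).strip()


def quote_is_supported(quote: str, source_text: str) -> bool:
    """True only if the whole quote appears contiguously in the source."""
    needle = normalize(quote)
    return bool(needle) and needle in normalize(source_text)


def filter_claims(claims: list[dict], sources: dict[str, str]) -> tuple[list[dict], list[dict]]:
    """Split claims into (kept, rejected) based on a substring check."""
    kept: list[dict] = []
    rejected: list[dict] = []
    for claim in claims:
        # An unknown source id yields empty text, so the claim is rejected.
        source_text = sources.get(claim["source_id"], "")
        if quote_is_supported(claim["quote"], source_text):
            kept.append(claim)
        else:
            rejected.append(claim)
    return kept, rejected
```

Because the match must be contiguous, a quote stitched together from separate sentences with "..." fails the check. Whether to drop only the rejected claims or discard the whole answer is a policy decision for your application.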

Evaluate, don't vibe-check

You can't improve what you don't measure.

Build a small test set of real questions with known correct answers.
Track groundedness (is the answer supported by retrieved context?) and retrieval hit rate (did the right chunk show up?).
Re-run it on every change to chunking, embeddings, or prompts so you catch regressions before users do.
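
A minimal harness for these two metrics could look like the sketch below. It assumes each test question is labeled with the id of the chunk that contains its answer, and that your retriever and your groundedness checker are plain callables you pass in. `EvalCase`, `retrieval_hit_rate`, and `groundedness_rate` are hypothetical names invented for this example.

```python
from collections.abc import Callable
from dataclasses import dataclass


@dataclass
class EvalCase:
    question: str
    expected_chunk_id: str  # the chunk known to contain the correct answer


def retrieval_hit_rate(
    cases: list[EvalCase],
    retrieve: Callable[[str], list[str]],
    top_k: int = 5,
) -> float:
    """Fraction of questions whose expected chunk appears in the top_k results."""
    if not cases:
        return 0.0
    hits = sum(
        1 for case in cases
        if case.expected_chunk_id in retrieve(case.question)[:top_k]
    )
    return hits / len(cases)


def groundedness_rate(
    results: list[tuple[str, str]],
    is_supported: Callable[[str, str], bool],
) -> float:
    """Fraction of (answer, context) pairs that the supplied checker accepts."""
    if not results:
        return 0.0
    return sum(1 for answer, context in results if is_supported(answer, context)) / len(results)
```

The `is_supported` checker is deliberately left pluggable: it could be a quote-matching rule, a human label, or a judge model, depending on how strict you need to be. Running both functions in CI on every change gives you two numbers to compare against the previous run.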

A reliable default stack

Clean, semantically chunked docs with metadata → hybrid retrieval + reranker → a strict "answer only from context, cite sources, admit unknowns" prompt → an eval set guarding the whole thing. Boring, measurable, and it stops the confident nonsense.

Done right, a RAG assistant becomes the thing customers trust for accurate answers 24/7 instead of a liability. As a freelancer, I build RAG assistants and AI automation that stay grounded in real data; you can see examples at vengstudio.online. Questions about retrieval or grounding? Drop them in the comments.

Top comments (2)

Tae Kim • May 31

On "demand citations": in my extraction pipeline the model still fabricated the quotes themselves, stitching unrelated sentences with "...". What finally fixed it was a deterministic substring check at commit time, where the quote either appears verbatim in the source or the whole extraction is rejected. Citing isn't grounding until you verify the citation.

pante5ter • May 31

Exactly — that's the gap between "cited" and "grounded." A model will happily emit a citation that points at a real source while the quoted text is a Frankenstein of stitched fragments. The deterministic substring check at commit time is the right enforcement: verbatim match or reject the extraction. I've also had luck rejecting at the span level — keep the supported sentences, drop the unsupported ones — so one bad quote doesn't nuke an otherwise good answer. How are you handling near-verbatim drift (whitespace/punctuation/casing) in the substring check — normalizing both sides first?