Plain RAG retrieves for every query — even "what's 17×23?" that needs no documents. Self-RAG makes the model decide WHEN to retrieve, grade the docs it gets, and grade its own answer — looping if it falls short.
🪞 Interactive demo: https://dev48v.infy.uk/prompt/day11-self-rag.html
The reflection tokens
Self-RAG trains the model to emit little self-assessments alongside its output, and you branch on them:
1. Retrieve? — does this even need external facts?
if (decide(q) === "NO_RETRIEVE") return llm(q); // skip the search entirely
Math/reasoning → skip. Private facts, recent events → retrieve. Adaptive, not blanket.
2. IsRelevant? — grade each retrieved doc, drop the off-topic ones:
const useful = docs.filter(d => isRelevant(q, d) === "yes");
3. IsSupported? — after generating, check the answer against the docs:
const supported = grade(answer, docs); // FULLY | PARTIAL | NO
This catches hallucinations — confident claims with no source.
4. Loop — if not fully supported or not useful, retrieve more / rewrite / regenerate instead of shipping it.
Why it matters
The result is a RAG pipeline that adapts retrieval to the question and refuses to hand you a confident, unsupported answer. Play with the demo — pick a question and watch the reflection tokens decide.
Top comments (0)