The Hidden Assumption in Traditional RAG
Traditional RAG pipelines never question one assumption: every question needs retrieval.
User asks "How do you evaluate a RAG system?" — retrieve.
User asks "What is 1 + 1?" — also retrieve.
User asks "Write a Python function to find the GCD" — retrieve again.
The last two questions need no external knowledge whatsoever. Forcing retrieval wastes resources and, worse, risks injecting irrelevant documents into the context and confusing the LLM.
Self-RAG, proposed by Asai et al. in 2023, solves this with a "reflection" mechanism: the model outputs special reflection tokens throughout generation to decide when to retrieve, whether retrieved content is relevant, and whether the final answer is grounded in documents.
Self-RAG's Four Reflection Tokens
In the original paper, Self-RAG trains a model capable of emitting four special tokens:
| Token | Meaning | Possible values |
|---|---|---|
| [Retrieve] | Should we retrieve? | yes / no / continue |
| [IsRel] | Is the retrieved doc relevant? | relevant / irrelevant |
| [IsSup] | Is the generated content supported by docs? | fully supported / partially supported / no support |
| [IsUse] | Is this response useful to the user? | 1–5 |
These tokens run through the entire generation process, letting the model adapt at each stage rather than blindly "always retrieve, always use."
In practice, you don't need to train a specialized model with these tokens. Using a standard LLM with carefully designed prompts to simulate each reflection node already produces strong results.
Implementing Self-RAG with LangGraph
Overall Flow
User question
    ↓
[decide] Does this need retrieval?
    ├─ yes → [retrieve] vector search top-4
    │             ↓
    │        [filter] score each doc for relevance, drop irrelevant ones
    │             ↓
    │        [rag_generate] generate answer from relevant docs
    │             ↓
    └─ no  → [direct_generate] generate answer directly
                  ↓
[support_check] is the answer grounded in documents?
    ↓
Final answer
State Design
LangGraph's core concept is State — the object that flows between nodes:
from typing import TypedDict

from langchain_core.documents import Document

class SelfRAGState(TypedDict):
    question: str
    need_retrieve: str               # "yes" | "no"
    retrieved_docs: list[Document]
    relevant_docs: list[Document]
    answer: str
    support_verdict: str             # "supported" | "unsupported"
    token_count: int
    path: list[str]                  # execution trace for analysis
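For intuition, after a run on a domain question the state might end up looking roughly like this (all values are illustrative):

# Illustrative final state for one domain question (values invented):
{
    "question": "How do you evaluate a RAG system?",
    "need_retrieve": "yes",
    "retrieved_docs": [...],   # 4 Documents from the vector store
    "relevant_docs": [...],    # the subset the filter node kept
    "answer": "A RAG system is usually evaluated along retrieval and generation quality ...",
    "support_verdict": "supported",
    "token_count": 1732,
    "path": ["decide", "retrieve", "filter", "rag_generate", "support_check"],
}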
Key Node Implementations
Node 1: Retrieve Decision (decide)
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

RETRIEVE_DECISION_PROMPT = ChatPromptTemplate.from_messages([
    ("system",
     "You are a routing decision component for a RAG system. Decide whether "
     "the following question requires retrieving from an external knowledge base.\n\n"
     "Retrieve: specific technical details, parameters, recommendations, factual content\n"
     "Don't retrieve: common sense, math, logic, small talk, greetings\n\n"
     "Output only yes or no."),
    ("human", "Question: {question}"),
])

def make_decide_node(llm):
    chain = RETRIEVE_DECISION_PROMPT | llm | StrOutputParser()

    def decide(state):
        result = chain.invoke({"question": state["question"]})
        verdict = "yes" if "yes" in result.lower() else "no"
        return {**state, "need_retrieve": verdict}

    return decide
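The retrieve node that sits between decide and filter is a thin wrapper around the retriever and isn't listed among the key nodes. Assuming retriever is a standard LangChain retriever already configured for top-4 (for example vectorstore.as_retriever(search_kwargs={"k": 4})), a minimal sketch could look like this:

# Minimal sketch of the retrieve node (assumption: `retriever` follows the
# standard LangChain Runnable interface, invoke(query) -> list[Document]).
def make_retrieve_node(retriever):
    def retrieve(state):
        docs = retriever.invoke(state["question"])  # vector search, top-4
        return {**state, "retrieved_docs": docs}
    return retrieve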
Node 2: Relevance Filter (filter)
RELEVANCE_PROMPT = ChatPromptTemplate.from_messages([
    ("system", "Decide whether the following document is relevant to the question "
               "and would help answer it.\n"
               "Output only relevant or irrelevant."),
    ("human", "Question: {question}\n\nDocument: {document}"),
])

def make_filter_node(llm):
    chain = RELEVANCE_PROMPT | llm | StrOutputParser()

    def filter_docs(state):
        relevant = []
        for doc in state["retrieved_docs"]:
            result = chain.invoke({
                "question": state["question"],
                "document": doc.page_content[:300],
            })
            # "irrelevant" contains "relevant", so check for both explicitly
            if "relevant" in result.lower() and "irrelevant" not in result.lower():
                relevant.append(doc)
        # Fallback: if everything gets filtered, keep original results
        return {**state, "relevant_docs": relevant or state["retrieved_docs"]}

    return filter_docs
Conditional routing after decide
from typing import Literal

def route_after_decide(state) -> Literal["retrieve", "direct_generate"]:
    return "retrieve" if state["need_retrieve"] == "yes" else "direct_generate"

graph.add_conditional_edges(
    "decide",
    route_after_decide,
    {"retrieve": "retrieve", "direct_generate": "direct_generate"},
)
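Two node factories referenced in the graph below, make_rag_generate_node and make_direct_generate_node, are ordinary prompt-plus-LLM chains and are omitted here. The support-evaluation node (make_support_node) follows the same pattern as the filter node; a hedged sketch, with prompt wording that is illustrative rather than taken from the repo:

SUPPORT_PROMPT = ChatPromptTemplate.from_messages([
    ("system", "Decide whether the answer is grounded in the provided documents.\n"
               "Output only supported or unsupported."),
    ("human", "Documents: {context}\n\nAnswer: {answer}"),
])

def make_support_node(llm):
    chain = SUPPORT_PROMPT | llm | StrOutputParser()

    def support_check(state):
        docs = state.get("relevant_docs") or []
        if not docs:
            # Direct answers have no documents to check against; this sketch
            # simply marks them unsupported (the repo may handle this differently).
            return {**state, "support_verdict": "unsupported"}
        context = "\n\n".join(d.page_content for d in docs)
        result = chain.invoke({"context": context, "answer": state["answer"]}).lower()
        verdict = "unsupported" if "unsupported" in result else "supported"
        return {**state, "support_verdict": verdict}

    return support_check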
Full Graph Construction
from langgraph.graph import END, StateGraph

graph = StateGraph(SelfRAGState)
graph.add_node("decide", make_decide_node(llm))
graph.add_node("retrieve", make_retrieve_node(retriever))
graph.add_node("filter", make_filter_node(llm))
graph.add_node("rag_generate", make_rag_generate_node(llm))
graph.add_node("direct_generate", make_direct_generate_node(llm))
graph.add_node("support_check", make_support_node(llm))

graph.set_entry_point("decide")
graph.add_conditional_edges(
    "decide",
    route_after_decide,
    {"retrieve": "retrieve", "direct_generate": "direct_generate"},
)
graph.add_edge("retrieve", "filter")
graph.add_edge("filter", "rag_generate")
graph.add_edge("rag_generate", "support_check")
graph.add_edge("direct_generate", "support_check")
graph.add_edge("support_check", END)

self_rag_app = graph.compile()
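Once compiled, the graph is invoked with an initial state that contains just the question; everything else is filled in by the nodes along the way. For example (the expected field values here are illustrative):

# Run one question through the compiled graph.
result = self_rag_app.invoke({"question": "How do you evaluate a RAG system?"})

print(result["need_retrieve"])    # "yes" for a domain question like this one
print(result["support_verdict"])  # "supported" / "unsupported"
print(result["answer"])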
Experiment Design
The test set contains two types of questions:
- 8 RAG-domain questions (embedding model selection, vector databases, chunking strategies) that require knowledge base retrieval
- 3 no-retrieval questions: "What is 1 + 1?", "What's the weather like?", "Write a Python GCD function"
The second category is the key test — can Self-RAG correctly identify and skip retrieval for them?
Experimental Results
Routing Accuracy
Self-RAG routing decisions:
[✓ Retrieve] What is RAG and what problem does it solve?
[✓ Retrieve] Which vector database should I use for enterprise apps?
[✓ Retrieve] Which embedding model is recommended for Chinese text?
[✓ Retrieve] What chunk size is recommended for document splitting?
[✓ Retrieve] How do you evaluate a RAG system?
[✓ Retrieve] What are the advantages of hybrid search over pure vector search?
[✓ Retrieve] What role does Reranking play in RAG?
[✓ Retrieve] What problem does Parent-Child chunking solve?
[✗ Skip] What is 1 + 1?
[✗ Skip] What's the weather like today?
[✗ Skip] Write a Python function to find the GCD
Summary: retrieve=8, skip=3
11/11 correct. All 8 domain questions triggered retrieval; all 3 non-retrieval questions went directly to generation.
Relevance Filter in Action
Execution paths (Self-RAG):
Q1: decide→yes → retrieve → filter(4/4) → rag_generate → support→supported
Q2: decide→yes → retrieve → filter(2/4) → rag_generate → support→supported
Q3: decide→yes → retrieve → filter(2/4) → rag_generate → support→unsupported
Q4: decide→yes → retrieve → filter(1/4) → rag_generate → support→supported
Q5: decide→yes → retrieve → filter(2/4) → rag_generate → support→supported
Q6: decide→yes → retrieve → filter(4/4) → rag_generate → support→unsupported
Q7: decide→yes → retrieve → filter(1/4) → rag_generate → support→supported
Q8: decide→yes → retrieve → filter(2/4) → rag_generate → support→supported
filter(1/4) means 3 out of 4 retrieved documents were judged irrelevant and discarded. This is where the context_precision gain comes from — the LLM receives a cleaner, tighter context.
support→unsupported on Q3 and Q6 signals that the LLM's answer went beyond what the documents actually said. A full Self-RAG implementation would re-generate at this point. Our simplified version outputs the answer as-is and flags it.
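A sketch of what that re-generation loop could look like in LangGraph, replacing the plain support_check → END edge with a conditional one (retry_count is a hypothetical field you would add to SelfRAGState and increment inside rag_generate, to cap the loop):

# Hypothetical extension, not part of the simplified version above:
# route unsupported answers back to rag_generate at most once.
def route_after_support(state) -> Literal["rag_generate", "end"]:
    if state["support_verdict"] == "unsupported" and state.get("retry_count", 0) < 1:
        return "rag_generate"
    return "end"

graph.add_conditional_edges(
    "support_check",
    route_after_support,
    {"rag_generate": "rag_generate", "end": END},
)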
RAGAS Metrics
RAGAS Metrics Comparison (8 domain questions):

| Metric | Always Retrieve | Self-RAG | Delta |
|---|---|---|---|
| context_recall | 0.625 | 0.625 | +0.000 |
| context_precision | 0.583 | 0.688 | **+0.104** |
| faithfulness | 0.845 | 0.866 | +0.021 |
| answer_relevancy | 0.404 | 0.401 | -0.003 |
- context_precision +0.104: Direct contribution from the filter node. Always-retrieve passes all 4 retrieved docs to the LLM regardless of quality; Self-RAG keeps only the docs judged relevant, so a larger share of the context actually pertains to the question.
- context_recall unchanged: The filter didn't drop relevant documents (the fallback logic held).
- faithfulness +0.021: Cleaner context, slightly fewer hallucinations.
Token Cost: The Counterintuitive Finding
Token cost comparison (all 11 questions, estimated):
Always retrieve: ~6,600 tokens
Self-RAG: ~16,050 tokens
Self-RAG uses: ~2.4x more tokens
Self-RAG actually costs more — this is the most surprising result, and it deserves a careful explanation.
Each reflection node requires an extra LLM call:
| Node | Always Retrieve | Self-RAG |
|---|---|---|
| Retrieve decision (decide) | — | ✓ once per question |
| Relevance filter (filter) | — | ✓ once per doc (up to 4×) |
| Generate | ✓ | ✓ |
| Support evaluation | — | ✓ once per question |
Even though 3 questions skipped retrieval entirely, the overhead from decide + filter + support_check on the remaining 8 questions more than cancels that out. In this experiment, 73% of questions still needed retrieval — not enough skipped to offset the reflection overhead.
When does Self-RAG actually save tokens? When the fraction of "no retrieval needed" questions is high enough (typically >50%), and the cost of retrieval+generation far exceeds the cost of the decision node. In typical RAG deployments — where users are mostly asking questions about the knowledge base — this condition rarely holds.
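To make that break-even condition concrete, here is a back-of-envelope model; every number in it is an illustrative assumption, not a measurement from the experiment:

# Rough break-even check: Self-RAG saves tokens only when the expected savings
# from skipped retrievals exceed the per-question reflection overhead.
def self_rag_saves_tokens(skip_fraction,
                          rag_answer_cost=800,     # retrieval + grounded generation
                          direct_answer_cost=200,  # answer without retrieval
                          reflection_cost=350):    # decide + filter + support calls
    saved_per_skipped_question = rag_answer_cost - direct_answer_cost
    return skip_fraction * saved_per_skipped_question > reflection_cost

print(self_rag_saves_tokens(3 / 11))  # False: this experiment's question mix
print(self_rag_saves_tokens(0.7))     # True: only when most questions skip retrieval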
Self-RAG's Real Value
The experiment clarifies where Self-RAG's value actually lies:
The value is quality improvement, not token savings.
- The filter node improves context_precision by +0.104 — a larger jump than most retrieval optimization strategies
- The routing decision prevents irrelevant documents from poisoning responses to non-domain questions
- The support evaluation creates observability — you can tell which answers are grounded and which might need a fallback
Self-RAG is a good fit when:
- You have a mixed-intent system: some questions need knowledge base retrieval, others are general-purpose (calculations, creative tasks, small talk)
- Answer quality is the priority and you can absorb the extra token cost and latency
- You need observability: knowing whether each answer is document-supported matters
Self-RAG is a poor fit when:
- Nearly all questions are knowledge-base questions (very little routing benefit)
- Token budget is extremely tight
- You need minimal latency — each additional node adds a round-trip
Full Code
Complete code is open-sourced at:
https://github.com/chendongqi/llm-in-action/tree/main/14-self-rag
Key file:
- self_rag.py: Full Self-RAG implementation with LangGraph
How to run:
git clone https://github.com/chendongqi/llm-in-action
cd llm-in-action/14-self-rag
cp .env.example .env
pip install -r requirements.txt
python self_rag.py
Summary
This article implemented a simplified Self-RAG system with LangGraph. Key findings:
- Routing: 11/11 correct — LLM reliably distinguishes "needs retrieval" domain questions from "no retrieval needed" general questions
- Quality gain is real — the filter node cuts context noise, lifting context_precision by +0.104
- Token cost rises 2.4x — reflection nodes (decide + filter + support) outweigh the savings from skipping 3 retrievals. This is the honest engineering tradeoff.
Self-RAG upgrades "blind execution" into "conscious decision-making." The cost is added complexity and higher token usage. The payoff is more precise context, better-grounded answers, and a system that knows when it doesn't need to retrieve. In mixed-intent or conversational RAG systems, that tradeoff is often worth making.