RAG Series (14): Self-RAG — Let the Model Decide Whether to Retrieve

The Hidden Assumption in Traditional RAG

Traditional RAG pipelines never question one assumption: every question needs retrieval.

User asks "How do you evaluate a RAG system?" — retrieve.
User asks "What is 1 + 1?" — also retrieve.
User asks "Write a Python function to find the GCD" — retrieve again.

The last two questions need no external knowledge whatsoever. Forcing retrieval wastes resources and, worse, can inject irrelevant documents into the context, confusing the LLM.

Self-RAG, proposed by Asai et al. in 2023, solves this with a "reflection" mechanism: the model outputs special reflection tokens throughout generation to decide when to retrieve, whether retrieved content is relevant, and whether the final answer is grounded in documents.


Self-RAG's Four Reflection Tokens

In the original paper, Self-RAG trains a model capable of emitting four special tokens:

  Token        Meaning                                        Possible values
  ──────────────────────────────────────────────────────────────────────────────────────────────
  [Retrieve]   Should we retrieve?                            yes / no / continue
  [IsRel]      Is the retrieved doc relevant?                 relevant / irrelevant
  [IsSup]      Is the generated content supported by docs?    fully supported / partially supported / no support
  [IsUse]      Is this response useful to the user?           1–5 (usefulness score)

These tokens run through the entire generation process, letting the model adapt at each stage rather than blindly "always retrieve, always use."

In practice, you don't need to train a specialized model with these tokens. Using a standard LLM with carefully designed prompts to simulate each reflection node already produces strong results.


Implementing Self-RAG with LangGraph

Overall Flow

User question
    ↓
[decide] Does this need retrieval?
    ├─ yes → [retrieve] vector search top-4
    │              ↓
    │          [filter] score each doc for relevance, drop irrelevant ones
    │              ↓
    │          [rag_generate] generate answer from relevant docs
    │              ↓
    └─ no  → [direct_generate] generate answer directly
                  ↓
             [support_check] is the answer grounded in documents?
                  ↓
              Final answer

State Design

LangGraph's core concept is State — the object that flows between nodes:

from typing import Literal, TypedDict

from langchain_core.documents import Document


class SelfRAGState(TypedDict):
    question: str
    need_retrieve: str           # "yes" | "no"
    retrieved_docs: list[Document]
    relevant_docs: list[Document]
    answer: str
    support_verdict: str         # "supported" | "unsupported"
    token_count: int
    path: list[str]              # execution trace for analysis

Key Node Implementations

Node 1: Retrieve Decision (decide)

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

RETRIEVE_DECISION_PROMPT = ChatPromptTemplate.from_messages([
    ("system",
     "You are a routing decision component for a RAG system. Decide whether "
     "the following question requires retrieving from an external knowledge base.\n\n"
     "Retrieve: specific technical details, parameters, recommendations, factual content\n"
     "Don't retrieve: common sense, math, logic, small talk, greetings\n\n"
     "Output only yes or no."),
    ("human", "Question: {question}"),
])

def make_decide_node(llm):
    chain = RETRIEVE_DECISION_PROMPT | llm | StrOutputParser()
    def decide(state):
        result = chain.invoke({"question": state["question"]})
        verdict = "yes" if "yes" in result.lower() else "no"
        return {**state, "need_retrieve": verdict}
    return decide
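
The retrieve node itself isn't shown in the excerpts. A minimal sketch, assuming a standard LangChain retriever (for example, one built from a vector store via vectorstore.as_retriever(search_kwargs={"k": 4})). It simply runs the top-4 search and stores the results in state:

def make_retrieve_node(retriever):
    def retrieve(state):
        # Any LangChain retriever works here; .invoke() runs the similarity search.
        docs = retriever.invoke(state["question"])
        return {**state, "retrieved_docs": docs}
    return retrieve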

Node 2: Relevance Filter (filter)

RELEVANCE_PROMPT = ChatPromptTemplate.from_messages([
    ("system", "Decide whether the following document is relevant to the question "
               "and would help answer it.\n"
               "Output only relevant or irrelevant."),
    ("human", "Question: {question}\n\nDocument: {document}"),
])

def make_filter_node(llm):
    chain = RELEVANCE_PROMPT | llm | StrOutputParser()
    def filter_docs(state):
        relevant = []
        for doc in state["retrieved_docs"]:
            result = chain.invoke({
                "question": state["question"],
                "document": doc.page_content[:300],
            })
            if "relevant" in result.lower() and "irrelevant" not in result.lower():
                relevant.append(doc)
        # Fallback: if everything gets filtered, keep original results
        return {**state, "relevant_docs": relevant or state["retrieved_docs"]}
    return filter_docs

Conditional routing after decide

def route_after_decide(state) -> Literal["retrieve", "direct_generate"]:
    return "retrieve" if state["need_retrieve"] == "yes" else "direct_generate"

graph.add_conditional_edges(
    "decide",
    route_after_decide,
    {"retrieve": "retrieve", "direct_generate": "direct_generate"},
)

Full Graph Construction

from langgraph.graph import END, StateGraph

graph = StateGraph(SelfRAGState)

graph.add_node("decide",          make_decide_node(llm))
graph.add_node("retrieve",        make_retrieve_node(retriever))
graph.add_node("filter",          make_filter_node(llm))
graph.add_node("rag_generate",    make_rag_generate_node(llm))
graph.add_node("direct_generate", make_direct_generate_node(llm))
graph.add_node("support_check",   make_support_node(llm))

graph.set_entry_point("decide")
graph.add_conditional_edges(
    "decide",
    route_after_decide,
    {"retrieve": "retrieve", "direct_generate": "direct_generate"},
)
graph.add_edge("retrieve",        "filter")
graph.add_edge("filter",          "rag_generate")
graph.add_edge("rag_generate",    "support_check")
graph.add_edge("direct_generate", "support_check")
graph.add_edge("support_check",   END)

self_rag_app = graph.compile()
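
The generation node factories referenced above are omitted from the excerpts. A minimal sketch under the same assumptions (prompt-based reflection, no fine-tuned tokens); the prompt wording here is illustrative, not taken from the repository:

RAG_ANSWER_PROMPT = ChatPromptTemplate.from_messages([
    ("system", "Answer the question using only the provided context. "
               "If the context is insufficient, say so explicitly."),
    ("human", "Context:\n{context}\n\nQuestion: {question}"),
])

def make_rag_generate_node(llm):
    chain = RAG_ANSWER_PROMPT | llm | StrOutputParser()
    def rag_generate(state):
        # Only documents that survived the relevance filter reach the prompt.
        context = "\n\n".join(doc.page_content for doc in state["relevant_docs"])
        answer = chain.invoke({"context": context, "question": state["question"]})
        return {**state, "answer": answer}
    return rag_generate

def make_direct_generate_node(llm):
    def direct_generate(state):
        # No retrieval: answer from the model's parametric knowledge alone.
        answer = llm.invoke(state["question"]).content
        return {**state, "answer": answer}
    return direct_generate

Once compiled, the graph is invoked like any LangGraph app (assuming llm and retriever are already configured; the initial state values are illustrative):

result = self_rag_app.invoke({"question": "What role does Reranking play in RAG?", "path": []})
print(result["need_retrieve"], result["support_verdict"])
print(result["answer"])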

Experiment Design

The test set contains two types of questions:

  • 8 RAG-domain questions: Embedding model selection, vector databases, chunking strategies — require knowledge base retrieval
  • 3 no-retrieval questions: What is 1+1?, What's the weather like?, Write a Python GCD function

The second category is the key test — can Self-RAG correctly identify and skip retrieval for them?


Experimental Results

Routing Accuracy

Self-RAG routing decisions:
  [✓ Retrieve] What is RAG and what problem does it solve?
  [✓ Retrieve] Which vector database should I use for enterprise apps?
  [✓ Retrieve] Which embedding model is recommended for Chinese text?
  [✓ Retrieve] What chunk size is recommended for document splitting?
  [✓ Retrieve] How do you evaluate a RAG system?
  [✓ Retrieve] What are the advantages of hybrid search over pure vector search?
  [✓ Retrieve] What role does Reranking play in RAG?
  [✓ Retrieve] What problem does Parent-Child chunking solve?
  [✗ Skip]     What is 1 + 1?
  [✗ Skip]     What's the weather like today?
  [✗ Skip]     Write a Python function to find the GCD

Summary: retrieve=8, skip=3

11/11 correct. All 8 domain questions triggered retrieval; all 3 non-retrieval questions went directly to generation.

Relevance Filter in Action

Execution paths (Self-RAG):
  Q1: decide→yes → retrieve → filter(4/4) → rag_generate → support→supported
  Q2: decide→yes → retrieve → filter(2/4) → rag_generate → support→supported
  Q3: decide→yes → retrieve → filter(2/4) → rag_generate → support→unsupported
  Q4: decide→yes → retrieve → filter(1/4) → rag_generate → support→supported
  Q5: decide→yes → retrieve → filter(2/4) → rag_generate → support→supported
  Q6: decide→yes → retrieve → filter(4/4) → rag_generate → support→unsupported
  Q7: decide→yes → retrieve → filter(1/4) → rag_generate → support→supported
  Q8: decide→yes → retrieve → filter(2/4) → rag_generate → support→supported

filter(1/4) means 3 out of 4 retrieved documents were judged irrelevant and discarded. This is where the context_precision gain comes from — the LLM receives a cleaner, tighter context.

support→unsupported on Q3 and Q6 signals that the LLM's answer went beyond what the documents actually said. A full Self-RAG implementation would re-generate at this point. Our simplified version outputs the answer as-is and flags it.
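
For completeness, the support check can itself be a single grading prompt, and turning the "unsupported" flag into a retry is one extra conditional edge. A sketch under the same prompt-based assumptions (the prompt wording and the bounded-retry routing are illustrative, not part of the simplified version above):

SUPPORT_PROMPT = ChatPromptTemplate.from_messages([
    ("system", "Decide whether the answer is grounded in the provided documents.\n"
               "Output only supported or unsupported."),
    ("human", "Documents:\n{context}\n\nAnswer: {answer}"),
])

def make_support_node(llm):
    chain = SUPPORT_PROMPT | llm | StrOutputParser()
    def support_check(state):
        docs = state.get("relevant_docs") or []
        if not docs:
            # Direct-generation path: nothing to ground against, so skip the check.
            return {**state, "support_verdict": "supported"}
        context = "\n\n".join(doc.page_content for doc in docs)
        result = chain.invoke({"context": context, "answer": state["answer"]})
        verdict = "unsupported" if "unsupported" in result.lower() else "supported"
        return {**state, "support_verdict": verdict}
    return support_check

# A fuller implementation could route unsupported answers back to rag_generate,
# bounded by a retry counter (a hypothetical extra state field, e.g. "retries"),
# instead of always ending after support_check:
#
#   graph.add_conditional_edges(
#       "support_check",
#       lambda s: "rag_generate"
#                 if s["support_verdict"] == "unsupported" and s.get("retries", 0) < 1
#                 else END,
#       {"rag_generate": "rag_generate", END: END},
#   )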

RAGAS Metrics

======================================================================
  RAGAS Metrics Comparison (8 domain questions)
======================================================================

  Metric               Always Retrieve    Self-RAG     Delta
  ──────────────────────────────────────────────────────────
  context_recall            0.625          0.625     →+0.000
  context_precision         0.583          0.688     ↑+0.104  ◀
  faithfulness              0.845          0.866     ↑+0.021
  answer_relevancy          0.404          0.401     →-0.003
======================================================================
  • context_precision +0.104: Direct contribution from the filter node. Always-retrieve passes all 4 docs to the LLM regardless of quality; Self-RAG keeps only the ones judged relevant, so a larger share of the context that reaches the LLM (and the evaluator) is actually relevant.
  • context_recall unchanged: The filter didn't drop relevant documents (the fallback logic held).
  • faithfulness +0.021: Cleaner context, slightly fewer hallucinations.
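
For context, the comparison above follows the standard RAGAS evaluation flow. A minimal sketch, assuming each pipeline's outputs have already been collected into parallel lists (the variable names are illustrative, and the expected column names can vary slightly between ragas versions):

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# One record per domain question; build one dataset per pipeline
# (always-retrieve vs. Self-RAG) and compare the resulting scores.
dataset = Dataset.from_dict({
    "question": questions,          # list[str]
    "answer": answers,              # list[str], generated answers
    "contexts": contexts,           # list[list[str]], docs passed to the LLM
    "ground_truth": ground_truths,  # list[str], reference answers
})

scores = evaluate(
    dataset,
    metrics=[context_recall, context_precision, faithfulness, answer_relevancy],
)
print(scores)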

Token Cost: The Counterintuitive Finding

Token cost comparison (all 11 questions, estimated):
  Always retrieve: ~6,600 tokens
  Self-RAG:       ~16,050 tokens
  Self-RAG uses:   ~2.4x more tokens

Self-RAG actually costs more — this is the most surprising result, and it deserves a careful explanation.

Each reflection node requires an extra LLM call:

  Node                          Always Retrieve         Self-RAG
  ──────────────────────────────────────────────────────────────────────────────
  Retrieve decision (decide)    –                       ✓ once per question
  Relevance filter (filter)     –                       ✓ once per doc (up to 4×)
  Generate                      ✓ once per question     ✓ once per question
  Support evaluation            –                       ✓ once per question

Even though 3 questions skipped retrieval entirely, the overhead from decide + filter + support_check on the remaining 8 questions more than cancels that out. In this experiment, 73% of questions still needed retrieval — not enough skipped to offset the reflection overhead.

When does Self-RAG actually save tokens? When the fraction of "no retrieval needed" questions is high enough (typically >50%), and the cost of retrieval+generation far exceeds the cost of the decision node. In typical RAG deployments — where users are mostly asking questions about the knowledge base — this condition rarely holds.
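
A quick back-of-envelope check makes that threshold concrete. A sketch with purely illustrative per-question token figures (the numbers are assumptions, not measurements from this experiment):

def break_even_skip_rate(rag_cost, direct_cost, overhead_retrieved, overhead_skipped):
    """Skip rate at which Self-RAG's expected tokens equal always-retrieve.

    always_retrieve = rag_cost
    self_rag        = (1 - s) * (rag_cost + overhead_retrieved)
                      + s * (direct_cost + overhead_skipped)
    Setting the two equal and solving for s gives the expression below.
    """
    return overhead_retrieved / (rag_cost + overhead_retrieved - direct_cost - overhead_skipped)

# Heavy RAG prompts, comparatively cheap reflection calls: break-even well below 50%.
print(break_even_skip_rate(rag_cost=3000, direct_cost=200,
                           overhead_retrieved=900, overhead_skipped=250))  # ~0.26

# Lighter RAG prompts with the same reflection overhead: break-even above 60%.
print(break_even_skip_rate(rag_cost=1000, direct_cost=200,
                           overhead_retrieved=900, overhead_skipped=250))  # ~0.62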


Self-RAG's Real Value

The experiment clarifies where Self-RAG's value actually lies:

The value is quality improvement, not token savings.

  • The filter node improves context_precision by +0.104 — a larger jump than most retrieval optimization strategies
  • The routing decision prevents irrelevant documents from poisoning responses to non-domain questions
  • The support evaluation creates observability — you can tell which answers are grounded and which might need a fallback

Self-RAG is a good fit when:

  • You have a mixed-intent system: some questions need knowledge base retrieval, others are general-purpose (calculations, creative tasks, small talk)
  • Answer quality is the priority and you can absorb the extra token cost and latency
  • You need observability: knowing whether each answer is document-supported matters

Self-RAG is a poor fit when:

  • Nearly all questions are knowledge-base questions (very little routing benefit)
  • Token budget is extremely tight
  • You need minimal latency — each additional node adds a round-trip

Full Code

Complete code is open-sourced at:

https://github.com/chendongqi/llm-in-action/tree/main/14-self-rag

Key file:

  • self_rag.py — Full Self-RAG implementation with LangGraph

How to run:

git clone https://github.com/chendongqi/llm-in-action
cd llm-in-action/14-self-rag
cp .env.example .env
pip install -r requirements.txt
python self_rag.py

Summary

This article implemented a simplified Self-RAG system with LangGraph. Key findings:

  1. Routing: 11/11 correct — the LLM reliably distinguishes "needs retrieval" domain questions from "no retrieval needed" general questions
  2. Quality gain is real — the filter node cuts context noise, lifting context_precision by +0.104
  3. Token cost rises 2.4x — reflection nodes (decide + filter + support) outweigh the savings from skipping 3 retrievals. This is the honest engineering tradeoff.

Self-RAG upgrades "blind execution" into "conscious decision-making." The cost is added complexity and higher token usage. The payoff is more precise context, better-grounded answers, and a system that knows when it doesn't need to retrieve. In mixed-intent or conversational RAG systems, that tradeoff is often worth making.


References

  • Asai, A., Wu, Z., Wang, Y., Sil, A., & Hajishirzi, H. (2023). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. arXiv:2310.11511.
