RAG Series (14): Self-RAG — Let the Model Decide Whether to Retrieve

The Hidden Assumption in Traditional RAG

Traditional RAG pipelines never question one assumption: every question needs retrieval.

User asks "How do you evaluate a RAG system?" — retrieve.
User asks "What is 1 + 1?" — also retrieve.
User asks "Write a Python function to find the GCD" — retrieve again.

The last two questions need no external knowledge whatsoever. Forcing retrieval wastes resources and, worse, can inject irrelevant documents into the context, confusing the LLM.

Self-RAG, proposed by Asai et al. in 2023, solves this with a "reflection" mechanism: the model outputs special reflection tokens throughout generation to decide when to retrieve, whether retrieved content is relevant, and whether the final answer is grounded in documents.


Self-RAG's Four Reflection Tokens

In the original paper, Self-RAG trains a model capable of emitting four special tokens:

  Token        Meaning                                        Possible values
  ──────────────────────────────────────────────────────────────────────────────────────────────
  [Retrieve]   Should we retrieve?                            yes / no / continue
  [IsRel]      Is the retrieved doc relevant?                 relevant / irrelevant
  [IsSup]      Is the generated content supported by docs?    fully supported / partially supported / no support
  [IsUse]      Is this response useful to the user?           1–5 (usefulness score)

These tokens run through the entire generation process, letting the model adapt at each stage rather than blindly "always retrieve, always use."

In practice, you don't need to train a specialized model with these tokens. Using a standard LLM with carefully designed prompts to simulate each reflection node already produces strong results.


Implementing Self-RAG with LangGraph

Overall Flow

User question
    ↓
[decide] Does this need retrieval?
    ├─ yes → [retrieve] vector search top-4
    │              ↓
    │          [filter] score each doc for relevance, drop irrelevant ones
    │              ↓
    │          [rag_generate] generate answer from relevant docs
    │              ↓
    └─ no  → [direct_generate] generate answer directly
                  ↓
             [support_check] is the answer grounded in documents?
                  ↓
              Final answer

State Design

LangGraph's core concept is State — the object that flows between nodes:

from typing import Literal, TypedDict

from langchain_core.documents import Document


class SelfRAGState(TypedDict):
    question: str
    need_retrieve: str           # "yes" | "no"
    retrieved_docs: list[Document]
    relevant_docs: list[Document]
    answer: str
    support_verdict: str         # "supported" | "unsupported"
    token_count: int
    path: list[str]              # execution trace for analysis

Key Node Implementations

Node 1: Retrieve Decision (decide)

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

RETRIEVE_DECISION_PROMPT = ChatPromptTemplate.from_messages([
    ("system",
     "You are a routing decision component for a RAG system. Decide whether "
     "the following question requires retrieving from an external knowledge base.\n\n"
     "Retrieve: specific technical details, parameters, recommendations, factual content\n"
     "Don't retrieve: common sense, math, logic, small talk, greetings\n\n"
     "Output only yes or no."),
    ("human", "Question: {question}"),
])

def make_decide_node(llm):
    chain = RETRIEVE_DECISION_PROMPT | llm | StrOutputParser()
    def decide(state):
        result = chain.invoke({"question": state["question"]})
        verdict = "yes" if "yes" in result.lower() else "no"
        return {**state, "need_retrieve": verdict}
    return decide
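
The retrieve node itself isn't shown in the excerpts. A minimal sketch, assuming a standard LangChain retriever (for example, one built from a vector store via vectorstore.as_retriever(search_kwargs={"k": 4})). It simply runs the top-4 search and stores the results in state:

def make_retrieve_node(retriever):
    def retrieve(state):
        # Any LangChain retriever works here; .invoke() runs the similarity search.
        docs = retriever.invoke(state["question"])
        return {**state, "retrieved_docs": docs}
    return retrieve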

Node 2: Relevance Filter (filter)

RELEVANCE_PROMPT = ChatPromptTemplate.from_messages([
    ("system", "Decide whether the following document is relevant to the question "
               "and would help answer it.\n"
               "Output only relevant or irrelevant."),
    ("human", "Question: {question}\n\nDocument: {document}"),
])

def make_filter_node(llm):
    chain = RELEVANCE_PROMPT | llm | StrOutputParser()
    def filter_docs(state):
        relevant = []
        for doc in state["retrieved_docs"]:
            result = chain.invoke({
                "question": state["question"],
                "document": doc.page_content[:300],
            })
            if "relevant" in result.lower() and "irrelevant" not in result.lower():
                relevant.append(doc)
        # Fallback: if everything gets filtered, keep original results
        return {**state, "relevant_docs": relevant or state["retrieved_docs"]}
    return filter_docs

Conditional routing after decide

def route_after_decide(state) -> Literal["retrieve", "direct_generate"]:
    return "retrieve" if state["need_retrieve"] == "yes" else "direct_generate"

graph.add_conditional_edges(
    "decide",
    route_after_decide,
    {"retrieve": "retrieve", "direct_generate": "direct_generate"},
)

Full Graph Construction

from langgraph.graph import END, StateGraph

graph = StateGraph(SelfRAGState)

graph.add_node("decide",          make_decide_node(llm))
graph.add_node("retrieve",        make_retrieve_node(retriever))
graph.add_node("filter",          make_filter_node(llm))
graph.add_node("rag_generate",    make_rag_generate_node(llm))
graph.add_node("direct_generate", make_direct_generate_node(llm))
graph.add_node("support_check",   make_support_node(llm))

graph.set_entry_point("decide")
graph.add_conditional_edges(
    "decide",
    route_after_decide,
    {"retrieve": "retrieve", "direct_generate": "direct_generate"},
)
graph.add_edge("retrieve",        "filter")
graph.add_edge("filter",          "rag_generate")
graph.add_edge("rag_generate",    "support_check")
graph.add_edge("direct_generate", "support_check")
graph.add_edge("support_check",   END)

self_rag_app = graph.compile()
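
The generation node factories referenced above are omitted from the excerpts. A minimal sketch under the same assumptions (prompt-based reflection, no fine-tuned tokens); the prompt wording here is illustrative, not taken from the repository:

RAG_ANSWER_PROMPT = ChatPromptTemplate.from_messages([
    ("system", "Answer the question using only the provided context. "
               "If the context is insufficient, say so explicitly."),
    ("human", "Context:\n{context}\n\nQuestion: {question}"),
])

def make_rag_generate_node(llm):
    chain = RAG_ANSWER_PROMPT | llm | StrOutputParser()
    def rag_generate(state):
        # Only documents that survived the relevance filter reach the prompt.
        context = "\n\n".join(doc.page_content for doc in state["relevant_docs"])
        answer = chain.invoke({"context": context, "question": state["question"]})
        return {**state, "answer": answer}
    return rag_generate

def make_direct_generate_node(llm):
    def direct_generate(state):
        # No retrieval: answer from the model's parametric knowledge alone.
        answer = llm.invoke(state["question"]).content
        return {**state, "answer": answer}
    return direct_generate

Once compiled, the graph is invoked like any LangGraph app (assuming llm and retriever are already configured; the initial state values are illustrative):

result = self_rag_app.invoke({"question": "What role does Reranking play in RAG?", "path": []})
print(result["need_retrieve"], result["support_verdict"])
print(result["answer"])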

Experiment Design

The test set contains two types of questions:

  • 8 RAG-domain questions: Embedding model selection, vector databases, chunking strategies — require knowledge base retrieval
  • 3 no-retrieval questions: What is 1+1?, What's the weather like?, Write a Python GCD function

The second category is the key test — can Self-RAG correctly identify and skip retrieval for them?


Experimental Results

Routing Accuracy

Self-RAG routing decisions:
  [✓ Retrieve] What is RAG and what problem does it solve?
  [✓ Retrieve] Which vector database should I use for enterprise apps?
  [✓ Retrieve] Which embedding model is recommended for Chinese text?
  [✓ Retrieve] What chunk size is recommended for document splitting?
  [✓ Retrieve] How do you evaluate a RAG system?
  [✓ Retrieve] What are the advantages of hybrid search over pure vector search?
  [✓ Retrieve] What role does Reranking play in RAG?
  [✓ Retrieve] What problem does Parent-Child chunking solve?
  [✗ Skip]     What is 1 + 1?
  [✗ Skip]     What's the weather like today?
  [✗ Skip]     Write a Python function to find the GCD

Summary: retrieve=8, skip=3

11/11 correct. All 8 domain questions triggered retrieval; all 3 non-retrieval questions went directly to generation.

Relevance Filter in Action

Execution paths (Self-RAG):
  Q1: decide→yes → retrieve → filter(4/4) → rag_generate → support→supported
  Q2: decide→yes → retrieve → filter(2/4) → rag_generate → support→supported
  Q3: decide→yes → retrieve → filter(2/4) → rag_generate → support→unsupported
  Q4: decide→yes → retrieve → filter(1/4) → rag_generate → support→supported
  Q5: decide→yes → retrieve → filter(2/4) → rag_generate → support→supported
  Q6: decide→yes → retrieve → filter(4/4) → rag_generate → support→unsupported
  Q7: decide→yes → retrieve → filter(1/4) → rag_generate → support→supported
  Q8: decide→yes → retrieve → filter(2/4) → rag_generate → support→supported

filter(1/4) means 3 out of 4 retrieved documents were judged irrelevant and discarded. This is where the context_precision gain comes from — the LLM receives a cleaner, tighter context.

support→unsupported on Q3 and Q6 signals that the LLM's answer went beyond what the documents actually said. A full Self-RAG implementation would re-generate at this point. Our simplified version outputs the answer as-is and flags it.
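
For completeness, the support check can itself be a single grading prompt, and turning the "unsupported" flag into a retry is one extra conditional edge. A sketch under the same prompt-based assumptions (the prompt wording and the bounded-retry routing are illustrative, not part of the simplified version above):

SUPPORT_PROMPT = ChatPromptTemplate.from_messages([
    ("system", "Decide whether the answer is grounded in the provided documents.\n"
               "Output only supported or unsupported."),
    ("human", "Documents:\n{context}\n\nAnswer: {answer}"),
])

def make_support_node(llm):
    chain = SUPPORT_PROMPT | llm | StrOutputParser()
    def support_check(state):
        docs = state.get("relevant_docs") or []
        if not docs:
            # Direct-generation path: nothing to ground against, so skip the check.
            return {**state, "support_verdict": "supported"}
        context = "\n\n".join(doc.page_content for doc in docs)
        result = chain.invoke({"context": context, "answer": state["answer"]})
        verdict = "unsupported" if "unsupported" in result.lower() else "supported"
        return {**state, "support_verdict": verdict}
    return support_check

# A fuller implementation could route unsupported answers back to rag_generate,
# bounded by a retry counter (a hypothetical extra state field, e.g. "retries"),
# instead of always ending after support_check:
#
#   graph.add_conditional_edges(
#       "support_check",
#       lambda s: "rag_generate"
#                 if s["support_verdict"] == "unsupported" and s.get("retries", 0) < 1
#                 else END,
#       {"rag_generate": "rag_generate", END: END},
#   )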

RAGAS Metrics

======================================================================
  RAGAS Metrics Comparison (8 domain questions)
======================================================================

  Metric               Always Retrieve    Self-RAG     Delta
  ──────────────────────────────────────────────────────────
  context_recall            0.625          0.625     →+0.000
  context_precision         0.583          0.688     ↑+0.104  ◀
  faithfulness              0.845          0.866     ↑+0.021
  answer_relevancy          0.404          0.401     →-0.003
======================================================================
  • context_precision +0.104: Direct contribution from the filter node. Always-retrieve passes all 4 docs to the LLM regardless of quality; Self-RAG keeps only the ones judged relevant, so a larger share of the context that reaches the LLM (and the evaluator) is actually relevant.
  • context_recall unchanged: The filter didn't drop relevant documents (the fallback logic held).
  • faithfulness +0.021: Cleaner context, slightly fewer hallucinations.
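
For context, the comparison above follows the standard RAGAS evaluation flow. A minimal sketch, assuming each pipeline's outputs have already been collected into parallel lists (the variable names are illustrative, and the expected column names can vary slightly between ragas versions):

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# One record per domain question; build one dataset per pipeline
# (always-retrieve vs. Self-RAG) and compare the resulting scores.
dataset = Dataset.from_dict({
    "question": questions,          # list[str]
    "answer": answers,              # list[str], generated answers
    "contexts": contexts,           # list[list[str]], docs passed to the LLM
    "ground_truth": ground_truths,  # list[str], reference answers
})

scores = evaluate(
    dataset,
    metrics=[context_recall, context_precision, faithfulness, answer_relevancy],
)
print(scores)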

Token Cost: The Counterintuitive Finding

Token cost comparison (all 11 questions, estimated):
  Always retrieve: ~6,600 tokens
  Self-RAG:       ~16,050 tokens
  Self-RAG uses:   ~2.4x more tokens

Self-RAG actually costs more — this is the most surprising result, and it deserves a careful explanation.

Each reflection node requires an extra LLM call:

  Node                          Always Retrieve         Self-RAG
  ──────────────────────────────────────────────────────────────────────────────
  Retrieve decision (decide)    –                       ✓ once per question
  Relevance filter (filter)     –                       ✓ once per doc (up to 4×)
  Generate                      ✓ once per question     ✓ once per question
  Support evaluation            –                       ✓ once per question

Even though 3 questions skipped retrieval entirely, the overhead from decide + filter + support_check on the remaining 8 questions more than cancels that out. In this experiment, 73% of questions still needed retrieval — not enough skipped to offset the reflection overhead.

When does Self-RAG actually save tokens? When the fraction of "no retrieval needed" questions is high enough (typically >50%), and the cost of retrieval+generation far exceeds the cost of the decision node. In typical RAG deployments — where users are mostly asking questions about the knowledge base — this condition rarely holds.
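
A quick back-of-envelope check makes that threshold concrete. A sketch with purely illustrative per-question token figures (the numbers are assumptions, not measurements from this experiment):

def break_even_skip_rate(rag_cost, direct_cost, overhead_retrieved, overhead_skipped):
    """Skip rate at which Self-RAG's expected tokens equal always-retrieve.

    always_retrieve = rag_cost
    self_rag        = (1 - s) * (rag_cost + overhead_retrieved)
                      + s * (direct_cost + overhead_skipped)
    Setting the two equal and solving for s gives the expression below.
    """
    return overhead_retrieved / (rag_cost + overhead_retrieved - direct_cost - overhead_skipped)

# Heavy RAG prompts, comparatively cheap reflection calls: break-even well below 50%.
print(break_even_skip_rate(rag_cost=3000, direct_cost=200,
                           overhead_retrieved=900, overhead_skipped=250))  # ~0.26

# Lighter RAG prompts with the same reflection overhead: break-even above 60%.
print(break_even_skip_rate(rag_cost=1000, direct_cost=200,
                           overhead_retrieved=900, overhead_skipped=250))  # ~0.62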


Self-RAG's Real Value

The experiment clarifies where Self-RAG's value actually lies:

The value is quality improvement, not token savings.

  • The filter node improves context_precision by +0.104 — a larger jump than most retrieval optimization strategies
  • The routing decision prevents irrelevant documents from poisoning responses to non-domain questions
  • The support evaluation creates observability — you can tell which answers are grounded and which might need a fallback

Self-RAG is a good fit when:

  • You have a mixed-intent system: some questions need knowledge base retrieval, others are general-purpose (calculations, creative tasks, small talk)
  • Answer quality is the priority and you can absorb the extra token cost and latency
  • You need observability: knowing whether each answer is document-supported matters

Self-RAG is a poor fit when:

  • Nearly all questions are knowledge-base questions (very little routing benefit)
  • Token budget is extremely tight
  • You need minimal latency — each additional node adds a round-trip

Full Code

Complete code is open-sourced at:

https://github.com/chendongqi/llm-in-action/tree/main/14-self-rag

Key file:

  • self_rag.py — Full Self-RAG implementation with LangGraph

How to run:

git clone https://github.com/chendongqi/llm-in-action
cd llm-in-action/14-self-rag
cp .env.example .env
pip install -r requirements.txt
python self_rag.py

Summary

This article implemented a simplified Self-RAG system with LangGraph. Key findings:

  1. Routing: 11/11 correct — the LLM reliably distinguishes "needs retrieval" domain questions from "no retrieval needed" general questions
  2. Quality gain is real — the filter node cuts context noise, lifting context_precision by +0.104
  3. Token cost rises 2.4x — reflection nodes (decide + filter + support) outweigh the savings from skipping 3 retrievals. This is the honest engineering tradeoff.

Self-RAG upgrades "blind execution" into "conscious decision-making." The cost is added complexity and higher token usage. The payoff is more precise context, better-grounded answers, and a system that knows when it doesn't need to retrieve. In mixed-intent or conversational RAG systems, that tradeoff is often worth making.


References

  • Asai, A., Wu, Z., Wang, Y., Sil, A., & Hajishirzi, H. (2023). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. arXiv:2310.11511.
