WonderLab
RAG Series (17): Agentic RAG — Giving the Agent Control Over Retrieval

The Silent Failure of Pipeline RAG

Every article in this series has been trying to answer the same question: how do we make retrieval better? Better chunking, reranking, query rewriting, CRAG's web fallback, Graph RAG's relationship traversal.

But one thing has stayed constant throughout: whatever retrieval returns, it gets passed to the LLM.

Pipeline RAG is a linear, fixed sequence:

question → vector search → top-4 docs → LLM generate
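In code, that fixed sequence looks roughly like this — a minimal sketch with the retrieval and LLM calls stubbed out, not the real implementation. The point is the shape: three hardcoded steps and no quality check anywhere.

```python
# Minimal sketch of Pipeline RAG. vector_search and the LLM call are stubs;
# note that nothing ever asks whether the retrieved docs are relevant.
def vector_search(question: str, k: int = 4) -> list[str]:
    corpus = ["doc about RAGAS", "doc about rerankers",
              "doc about chunking", "doc about CRAG"]
    return corpus[:k]  # always returns top-k, relevant or not

def pipeline_rag(question: str) -> str:
    docs = vector_search(question)       # step 1: always vector search
    context = "\n".join(docs)            # step 2: always stuff everything
    return f"LLM answer given {len(docs)} docs"  # step 3: always generate

print(pipeline_rag("What are the latest RAG papers of 2025?"))
```

Whatever the question, the same three steps run in the same order — which is exactly the problem.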

At no point does the system ask: "Is what we just retrieved actually sufficient to answer this question?"

The result: when the knowledge base has nothing relevant, the LLM receives four unrelated documents and quietly produces either a hallucinated answer or a deflecting "I cannot answer based on the provided context." The system doesn't know it failed — it just silently delivered a wrong answer.

This is Pipeline RAG's silent failure mode.


What Agentic RAG Changes

Agentic RAG adds one thing: agency.

Three concrete changes:

1. Retrieval is a tool, not a hardcoded step

Vector search, graph traversal, web search — the agent doesn't run them all at once, nor does it always reach for the same one. It picks the appropriate tool for the kind of question being asked. Factual questions go to vector search. Relational questions go to graph traversal. Time-sensitive questions go to web search. General knowledge questions skip retrieval entirely.

This is how a human researcher actually decides how to look things up.

2. Retrieval is followed by reflection

After executing retrieval, the agent doesn't immediately generate an answer. It evaluates: how well does this retrieved context actually cover the question? If the score is below a threshold, retrieval is considered to have failed.

3. Failure can be corrected

Low quality → switch strategy and retry. Vector search didn't find anything useful? Try graph traversal. Graph traversal struck out? Try web search. The retry loop has a hard cap (2 attempts in this implementation) to prevent infinite cycles.

Together, these three changes transform a "fixed pipeline" into a "feedback loop with decisions."


The LangGraph Architecture

question
    ↓
[classify]   → analyze question type, pick initial strategy
               factual      → vector
               relational   → graph
               time-sensitive → web
               general knowledge → direct
    ↓
[retrieve]   → run the chosen strategy (three separate nodes)
    ↓
[evaluate]   → score context quality (0.0–1.0), threshold = 0.6
    ↓
 ≥0.6 ──yes──→ [generate] → final answer
    │
    no  (attempts < 2)
    ↓
[re_route]   → pick next untried strategy: vector → graph → web
    ↓
[retrieve] again...
    ↓
(after 2 attempts, generate regardless)

The direct_generate path bypasses all retrieval nodes and goes straight to END.
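The same control flow can be sketched without the framework. Here is a minimal, framework-free version of the loop — node internals are replaced with injectable stubs, so this is an illustration of the routing logic, not the real LangGraph wiring:

```python
# Framework-free sketch of the classify → retrieve → evaluate →
# (re_route | generate) loop described above. Stubs stand in for the nodes.
QUALITY_THRESHOLD = 0.6
MAX_ATTEMPTS = 2

def run_agentic_rag(question, classify, retrieve, evaluate, generate):
    path = []
    strategy = classify(question)
    path.append(f"classify→{strategy}")
    if strategy == "direct":                        # bypass retrieval entirely
        return generate(question, None), path
    tried = []
    while True:
        tried.append(strategy)
        docs = retrieve(strategy, question)
        path.append(f"{strategy}_retrieve")
        score = evaluate(question, docs)
        path.append(f"evaluate={score:.2f}")
        if score >= QUALITY_THRESHOLD or len(tried) >= MAX_ATTEMPTS:
            return generate(question, docs), path
        # low quality with retries left: next untried strategy in priority order
        strategy = next(s for s in ["vector", "graph", "web"] if s not in tried)
        path.append(f"re_route→{strategy}")

# Demo with stubs: vector retrieval scores low, graph scores high,
# so the loop re-routes once and then generates from the graph context.
answer, path = run_agentic_rag(
    "How are bge-large-zh-v1.5 and bge-reranker-v2-m3 related?",
    classify=lambda q: "vector",
    retrieve=lambda s, q: [f"{s} docs"],
    evaluate=lambda q, docs: 0.2 if docs == ["vector docs"] else 0.9,
    generate=lambda q, docs: f"answer from {docs}",
)
print(path)
```

The `path` list that falls out of this loop is the same execution trace the real implementation stores in its state, covered next.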


Key Node Implementations

State: The Execution Trace Is Your Debugger

class AgenticRAGState(TypedDict):
    question:         str
    strategy:         str           # "vector" | "graph" | "web" | "direct"
    tried_strategies: list[str]     # strategies already attempted, prevents repeats
    retrieved_docs:   list[Document]
    quality_score:    float         # score from evaluate node, 0.0–1.0
    answer:           str
    path:             list[str]     # execution trace: ["classify→graph", "graph_retrieve", ...]

The path field is the most useful debugging tool in the entire system. After a run, you can see exactly which path each question took, where re-routing triggered, and what quality scores looked like. This is far more informative than the final RAGAS metrics alone.
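For example, a few lines of post-run analysis over collected traces tell you the re-route rate directly (the trace data below is illustrative, not actual run output):

```python
# Hypothetical traces from three questions; in the real system each trace
# comes from the `path` field of a run's final state.
paths = [
    ["classify→vector", "vector_retrieve", "evaluate=0.40",
     "re_route→graph", "graph_retrieve", "evaluate=0.80"],
    ["classify→direct"],
    ["classify→vector", "vector_retrieve", "evaluate=0.75"],
]
rerouted = sum(any(step.startswith("re_route") for step in p) for p in paths)
print(f"re-routed: {rerouted}/{len(paths)}")
```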

The classify Node

CLASSIFY_PROMPT = ChatPromptTemplate.from_messages([
    ("system",
     "Decide which retrieval strategy best fits the question. Output only the strategy name:\n\n"
     "vector  - factual question requiring knowledge base (definitions, parameters, comparisons)\n"
     "graph   - relational question about connections between entities (who made what, how X relates to Y)\n"
     "web     - requires up-to-date information (latest versions, recent papers, current events)\n"
     "direct  - no retrieval needed (common sense, math, translation, code syntax)"),
    ("human", "Question: {question}"),
])

def classify_node(state):
    raw = classify_chain.invoke({"question": state["question"]}).strip().lower()
    strategy = "vector"  # safe default
    for s in ["vector", "graph", "web", "direct"]:
        if s in raw:
            strategy = s
            break
    return {
        **state,
        "strategy": strategy,
        "tried_strategies": [strategy],
        "path": [f"classify→{strategy}"],
    }

The evaluate Node: The Heart of the Architecture

QUALITY_PROMPT = ChatPromptTemplate.from_messages([
    ("system",
     "Rate how well the retrieved context covers the question.\n"
     "Output only a number between 0.0 and 1.0, no explanation:\n"
     "1.0 = complete coverage, can answer directly\n"
     "0.5 = partial, usable but incomplete\n"
     "0.0 = completely unrelated, cannot answer"),
    ("human", "Question: {question}\n\nContext: {context}"),
])

def evaluate_node(state):
    context = "\n\n".join(d.page_content[:300] for d in state["retrieved_docs"])
    raw = quality_chain.invoke({
        "question": state["question"],
        "context": context,
    })
    try:
        score = max(0.0, min(1.0, float(raw.strip())))
    except ValueError:
        score = 0.5  # parse failure → neutral
    return {**state, "quality_score": score}

The Routing Logic

QUALITY_THRESHOLD = 0.6
MAX_ATTEMPTS      = 2

def route_after_evaluate(state) -> str:
    score    = state["quality_score"]
    attempts = len(state["tried_strategies"])
    # Good enough, or we've exhausted retries — generate
    if score >= QUALITY_THRESHOLD or attempts >= MAX_ATTEMPTS:
        return "generate"
    return "re_route"

def re_route_node(state):
    tried = set(state["tried_strategies"])
    # Try strategies in priority order, skip already-tried ones
    for s in ["vector", "graph", "web"]:
        if s not in tried:
            return {
                **state,
                "strategy": s,
                "tried_strategies": state["tried_strategies"] + [s],  # preserve attempt order
                "path": state["path"] + [f"re_route→{s}"],
            }
    return {**state, "strategy": "vector"}  # last resort

Experimental Results

Routing Behavior

8 test questions, designed to exercise all four question types:

Initial strategy distribution:
  vector:  4 questions  (factual: RAGAS metrics, vector DB use cases)
  graph:   2 questions  (relational: BAAI's two models, Self-RAG/CRAG/Graph RAG comparison)
  direct:  2 questions  (general: translate to English, Python list average)
  web:     0 questions  ← worth discussing

Re-routing triggered: 4 / 6 retrieval questions (67%)

What the agent got right:

The two relational questions ("which organization do bge-large-zh-v1.5 and bge-reranker-v2-m3 both come from, and what RAG stage does each serve?" and "what problem do Self-RAG, CRAG, and Graph RAG each solve?") were correctly routed to graph. The knowledge graph built from the document corpus was traversed to directly find the entity connections.

The two general knowledge questions ("translate 'retrieval-augmented generation' to English" and "how do you calculate a list average in Python?") were correctly routed to direct — no retrieval wasted, no unnecessary LLM calls to the knowledge base.

An honest routing miss:

"What are the latest RAG papers published in 2025?" was classified as vector by GLM-4-flash rather than web.

This is a prompt engineering gap in the classify node, not a framework design flaw. The word "papers" is strongly associated with knowledge-base content, and the LLM read the question as "find me information about papers" rather than "find me current information." Adding a rule — "questions containing temporal markers like 'latest', 'recent', or 'this year' prefer web" — would correct this class of misclassification.
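One cheap way to implement that rule is a deterministic pre-check before the LLM classifier ever runs. The function and marker list below are an assumption for illustration, not part of the original classify node:

```python
# Hypothetical pre-check (not in the original implementation): questions with
# temporal wording are routed to web before the LLM classifier is consulted.
TEMPORAL_MARKERS = ("latest", "recent", "newest", "this year", "current")

def classify_with_temporal_override(question, llm_classify):
    if any(m in question.lower() for m in TEMPORAL_MARKERS):
        return "web"                  # temporal marker → prefer fresh results
    return llm_classify(question)     # otherwise defer to the LLM as before

print(classify_with_temporal_override(
    "What are the latest RAG papers published in 2025?",
    llm_classify=lambda q: "vector"))
```

A rule-based override like this is also free: it saves the classify LLM call entirely for the questions it catches.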

4 re-routes: the evaluate node is doing real work

4 out of 6 retrieval questions triggered re-routing (67%). This means the quality evaluator isn't just going through the motions — it's genuinely rejecting context it judges insufficient and pushing the agent to try a different approach. The system is correcting itself in flight.

RAGAS Metrics

======================================================================
  RAGAS Metrics Comparison (Always-Vector vs Agentic RAG)
======================================================================

  Metric               Always-Vector    Agentic RAG       Delta
  ──────────────────────────────────────────────────────────────
  context_recall            0.611          0.611        →+0.000
  context_precision         0.639          0.681        ↑+0.042  ◀
  faithfulness              0.625          0.625        →+0.000
  answer_relevancy          0.431          0.433        →+0.002
======================================================================

context_precision +0.042; everything else essentially unchanged.


Why the RAGAS Improvement Is Small (And Why That's Not the Point)

This deserves a careful explanation, because the small numbers are easy to misread as "Agentic RAG doesn't work."

RAGAS measures final answer quality, not process robustness.

Our test knowledge base covers these 6 retrieval questions reasonably well. Even when evaluate scores a retrieval as insufficient and the agent switches strategies, the final answer quality doesn't jump dramatically — because the information was already there. When the knowledge base is comprehensive, Agentic and Pipeline RAG produce similar-quality answers.

The real value shows up when the knowledge base falls short:

  Scenario                              Pipeline RAG                                            Agentic RAG
  ──────────────────────────────────────────────────────────────────────────────────────────────────────────
  KB has the answer                     ✅ answers correctly                                    ✅ answers correctly
  KB has no relevant content            ❌ generates from irrelevant docs (hallucination risk)  ✅ switches to web search, or acknowledges the gap
  Question needs relational reasoning   ⚠️ retrieves by similarity, may miss connections        ✅ routes to graph traversal
  Question needs no retrieval           ⚠️ wastes a retrieval call                              ✅ skips directly to generation

RAGAS in this experiment only tests row one. The value in rows two, three, and four doesn't show up in the metrics — but it's the reason you'd choose Agentic RAG in a real deployment.

This is a recurring theme throughout this series: every optimization has a target scenario. Numbers without scenario context are incomplete.


Pipeline RAG vs Agentic RAG

  Dimension            Pipeline RAG                               Agentic RAG
  ────────────────────────────────────────────────────────────────────────────────────────────
  Flow                 Fixed linear sequence                      Dynamic feedback loop
  Retrieval strategy   Fixed (usually vector)                     Dynamic, chosen per question type
  Result evaluation    None                                       Quality scoring
  Failure handling     Generate anyway                            Switch strategy and retry
  Direct generation    Not supported                              Supported for general knowledge
  Extra LLM calls      0                                          classify + evaluate (+ re-route)
  Best fit             Comprehensive KB, uniform question types   Mixed intents, coverage gaps

The cost is real. Every retrieval question adds at least two extra LLM calls (classify + evaluate), and more if re-routing triggers; even the direct path pays for classify. If your question types are uniform and your knowledge base is comprehensive, Pipeline RAG's cost advantage is genuine.
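Concretely, for this implementation (classify and evaluate are LLM calls; re_route itself is pure Python, so it adds no call of its own) the per-question overhead on top of the final generate call works out to:

```python
# Back-of-envelope count of extra LLM calls per question, on top of generate.
def extra_llm_calls(strategy: str, rerouted: bool) -> int:
    if strategy == "direct":
        return 1                  # classify only, then straight to generate
    calls = 2                     # classify + evaluate on the first retrieval
    if rerouted:
        calls += 1                # one more evaluate after the retried retrieval
    return calls

print(extra_llm_calls("vector", rerouted=True))   # worst case with MAX_ATTEMPTS=2
print(extra_llm_calls("direct", rerouted=False))  # cheapest path
```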


Full Code

Complete code is open-sourced at:

https://github.com/chendongqi/llm-in-action/tree/main/17-agentic-rag

Key file:

  • agentic_rag.py — full implementation: graph build, LangGraph agent, RAGAS evaluation

How to run:

git clone https://github.com/chendongqi/llm-in-action
cd 17-agentic-rag
cp .env.example .env
pip install -r requirements.txt
python agentic_rag.py

Summary

This article implemented Agentic RAG. Key findings:

  1. Retrieval as a tool, not a fixed step — this is the essential difference from Pipeline RAG. Tools can be selected, evaluated, and swapped. Steps cannot.
  2. Routing accuracy was solid — relational questions correctly went to graph traversal, general knowledge questions correctly skipped retrieval. The classification is genuinely useful.
  3. 4 of 6 retrieval questions triggered re-routing — the evaluate node is doing substantive quality control, not just adding latency to rubber-stamp retrieval results.
  4. RAGAS +0.042, but the metric isn't the story — the improvement is small because our knowledge base already covers the test questions. The real value is robustness: what happens when coverage fails. Pipeline RAG silently generates a bad answer. Agentic RAG switches strategies and at least tries to find something better.

Looking back at the arc of this series: Self-RAG answered "should we retrieve?", CRAG answered "is what we retrieved good enough?", Graph RAG answered "how do we handle relational questions?" — Agentic RAG combines all three into a unified decision loop. The system doesn't just execute a fixed plan; it actively routes, evaluates, and corrects its own retrieval behavior. That's the shift from pipeline thinking to agent thinking.

