The Silent Failure of Pipeline RAG
Every article in this series has been trying to answer the same question: how do we make retrieval better? Better chunking, reranking, query rewriting, CRAG's web fallback, Graph RAG's relationship traversal.
But one thing has stayed constant throughout: whatever retrieval returns, it gets passed to the LLM.
Pipeline RAG is a linear, fixed sequence:
question → vector search → top-4 docs → LLM generate
At no point does the system ask: "Is what we just retrieved actually sufficient to answer this question?"
The result: when the knowledge base has nothing relevant, the LLM receives four unrelated documents and quietly produces either a hallucinated answer or a deflecting "I cannot answer based on the provided context." The system doesn't know it failed — it just silently delivered a wrong answer.
This is Pipeline RAG's silent failure mode.
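For concreteness, here is roughly what that fixed sequence looks like as a LangChain chain. This is an illustrative sketch, not the repo's code; `retriever` and `llm` stand in for whatever vector store retriever and chat model you already use.

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough


def build_pipeline_rag(retriever, llm):
    prompt = ChatPromptTemplate.from_template(
        "Answer using only this context:\n{context}\n\nQuestion: {question}"
    )
    # Whatever the retriever returns flows straight into the prompt; nothing ever
    # asks whether that context can actually answer the question.
    return (
        {
            "context": retriever | (lambda docs: "\n\n".join(d.page_content for d in docs)),
            "question": RunnablePassthrough(),
        }
        | prompt
        | llm
        | StrOutputParser()
    )

# usage: build_pipeline_rag(retriever, llm).invoke("What is CRAG?")
```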
What Agentic RAG Changes
Agentic RAG adds one thing: agency.
Three concrete changes:
1. Retrieval is a tool, not a hardcoded step
Vector search, graph traversal, web search: the agent doesn't run all of them at once, nor does it always reach for the same one. It picks the appropriate tool based on the kind of question being asked. Factual questions go to vector search, relational questions to graph traversal, time-sensitive questions to web search, and general knowledge questions skip retrieval entirely.
This is how a human researcher actually decides how to look things up.
2. Retrieval is followed by reflection
After executing retrieval, the agent doesn't immediately generate an answer. It evaluates: how well does this retrieved context actually cover the question? If the score is below a threshold, retrieval is considered to have failed.
3. Failure can be corrected
Low quality → switch strategy and retry. Vector search didn't find anything useful? Try graph traversal. Graph traversal struck out? Try web search. The retry loop has a hard cap (2 attempts in this implementation) to prevent infinite cycles.
Together, these three changes transform a "fixed pipeline" into a "feedback loop with decisions."
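Before looking at the LangGraph version, the same three changes can be sketched as a plain control loop. This is a conceptual outline only; `classify`, `retrieve`, `evaluate`, `next_untried`, and `generate` are hypothetical helpers, and the real node implementations follow below.

```python
# Conceptual sketch of the decision loop (not the LangGraph implementation)
def agentic_answer(question: str, max_attempts: int = 2) -> str:
    strategy = classify(question)              # 1. retrieval is a tool to choose
    if strategy == "direct":
        return generate(question, context=[])  # general knowledge: skip retrieval

    tried, docs = [], []
    while len(tried) < max_attempts:
        tried.append(strategy)
        docs = retrieve(strategy, question)    # run the chosen tool
        if evaluate(question, docs) >= 0.6:    # 2. reflect on what came back
            break
        strategy = next_untried(tried)         # 3. failure: switch strategy, retry
    return generate(question, context=docs)
```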
The LangGraph Architecture
```
question
  ↓
[classify] → analyze question type, pick initial strategy
    factual           → vector
    relational        → graph
    time-sensitive    → web
    general knowledge → direct
  ↓
[retrieve] → run the chosen strategy (three separate nodes)
  ↓
[evaluate] → score context quality (0.0–1.0), threshold = 0.6
  ↓
score ≥ 0.6 ──yes──→ [generate] → final answer
  │
  no (attempts < 2)
  ↓
[re_route] → pick next untried strategy: vector → graph → web
  ↓
[retrieve] again ...
  ↓
(after 2 attempts, generate regardless)
```
The direct_generate path bypasses all retrieval nodes and goes straight to END.
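As a sketch of how this diagram could be wired up in LangGraph (node names follow the diagram; the retrieval and generation node functions, and the exact wiring in the repo, are assumptions here):

```python
from langgraph.graph import END, StateGraph

workflow = StateGraph(AgenticRAGState)

# Nodes: one per box in the diagram (retrieval/generation node functions assumed)
workflow.add_node("classify", classify_node)
workflow.add_node("vector_retrieve", vector_retrieve_node)
workflow.add_node("graph_retrieve", graph_retrieve_node)
workflow.add_node("web_retrieve", web_retrieve_node)
workflow.add_node("evaluate", evaluate_node)
workflow.add_node("re_route", re_route_node)
workflow.add_node("generate", generate_node)
workflow.add_node("direct_generate", direct_generate_node)

workflow.set_entry_point("classify")

# classify and re_route both hand off to whichever node matches state["strategy"]
pick_by_strategy = {
    "vector": "vector_retrieve",
    "graph": "graph_retrieve",
    "web": "web_retrieve",
    "direct": "direct_generate",
}
workflow.add_conditional_edges("classify", lambda s: s["strategy"], pick_by_strategy)
workflow.add_conditional_edges("re_route", lambda s: s["strategy"], pick_by_strategy)

# Every retrieval node feeds the quality evaluator
for node in ("vector_retrieve", "graph_retrieve", "web_retrieve"):
    workflow.add_edge(node, "evaluate")

# evaluate either accepts (generate) or rejects (re_route), per route_after_evaluate below
workflow.add_conditional_edges("evaluate", route_after_evaluate,
                               {"generate": "generate", "re_route": "re_route"})

workflow.add_edge("generate", END)
workflow.add_edge("direct_generate", END)

app = workflow.compile()
```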
Key Node Implementations
State: The Execution Trace Is Your Debugger
```python
from typing import TypedDict

from langchain_core.documents import Document


class AgenticRAGState(TypedDict):
    question: str
    strategy: str                # "vector" | "graph" | "web" | "direct"
    tried_strategies: list[str]  # strategies already attempted, prevents repeats
    retrieved_docs: list[Document]
    quality_score: float         # score from the evaluate node, 0.0–1.0
    answer: str
    path: list[str]              # execution trace: ["classify→graph", "graph_retrieve", ...]
```
The path field is the most useful debugging tool in the entire system. After a run, you can see exactly which path each question took, where re-routing triggered, and what quality scores looked like. This is far more informative than the final RAGAS metrics alone.
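For example, assuming the compiled graph is called `app` as in the wiring sketch above, a run's trace can be read straight off the final state:

```python
result = app.invoke({
    "question": "What problem do Self-RAG, CRAG, and Graph RAG each solve?",
    "strategy": "", "tried_strategies": [], "retrieved_docs": [],
    "quality_score": 0.0, "answer": "", "path": [],
})
print(result["path"])           # which nodes ran, and where re-routing kicked in
print(result["quality_score"])  # the last quality score before generation
print(result["answer"])
```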
The classify Node
```python
from langchain_core.prompts import ChatPromptTemplate

CLASSIFY_PROMPT = ChatPromptTemplate.from_messages([
    ("system",
     "Decide which retrieval strategy best fits the question. Output only the strategy name:\n\n"
     "vector - factual question requiring knowledge base (definitions, parameters, comparisons)\n"
     "graph - relational question about connections between entities (who made what, how X relates to Y)\n"
     "web - requires up-to-date information (latest versions, recent papers, current events)\n"
     "direct - no retrieval needed (common sense, math, translation, code syntax)"),
    ("human", "Question: {question}"),
])


def classify_node(state):
    raw = classify_chain.invoke({"question": state["question"]}).strip().lower()
    strategy = "vector"  # safe default if the LLM returns something unexpected
    for s in ["vector", "graph", "web", "direct"]:
        if s in raw:
            strategy = s
            break
    return {
        **state,
        "strategy": strategy,
        "tried_strategies": [strategy],
        "path": [f"classify→{strategy}"],
    }
```
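The `classify_chain` used above isn't shown in this excerpt. A minimal sketch of how it might be assembled, assuming GLM-4-flash is reachable through an OpenAI-compatible endpoint configured via environment variables:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

# Assumed setup: model name and endpoint mirror the article's stack, not verified against the repo
llm = ChatOpenAI(model="glm-4-flash", temperature=0)
classify_chain = CLASSIFY_PROMPT | llm | StrOutputParser()
```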
The evaluate Node: The Heart of the Architecture
```python
QUALITY_PROMPT = ChatPromptTemplate.from_messages([
    ("system",
     "Rate how well the retrieved context covers the question.\n"
     "Output only a number between 0.0 and 1.0, no explanation:\n"
     "1.0 = complete coverage, can answer directly\n"
     "0.5 = partial, usable but incomplete\n"
     "0.0 = completely unrelated, cannot answer"),
    ("human", "Question: {question}\n\nContext: {context}"),
])


def evaluate_node(state):
    # Join the retrieved docs, capping each at 300 characters before scoring
    context = "\n\n".join(d.page_content[:300] for d in state["retrieved_docs"])
    raw = quality_chain.invoke({
        "question": state["question"],
        "context": context,
    })
    try:
        score = max(0.0, min(1.0, float(raw.strip())))
    except ValueError:
        score = 0.5  # parse failure → neutral
    return {**state, "quality_score": score}
```
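`quality_chain` is presumably built the same way as `classify_chain`; the try/except matters because the model won't always return a bare number:

```python
quality_chain = QUALITY_PROMPT | llm | StrOutputParser()

# Anything that doesn't parse as a float degrades to the neutral 0.5 instead of crashing the graph
for raw in ["0.8", "Score: 0.8", "high coverage"]:
    try:
        score = max(0.0, min(1.0, float(raw.strip())))
    except ValueError:
        score = 0.5
    print(f"{raw!r} -> {score}")  # 0.8, then 0.5, then 0.5
```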
The Routing Logic
```python
QUALITY_THRESHOLD = 0.6
MAX_ATTEMPTS = 2


def route_after_evaluate(state) -> str:
    score = state["quality_score"]
    attempts = len(state["tried_strategies"])
    # Good enough, or we've exhausted retries — generate
    if score >= QUALITY_THRESHOLD or attempts >= MAX_ATTEMPTS:
        return "generate"
    return "re_route"


def re_route_node(state):
    tried = set(state["tried_strategies"])
    # Try strategies in priority order, skip already-tried ones
    for s in ["vector", "graph", "web"]:
        if s not in tried:
            return {
                **state,
                "strategy": s,
                "tried_strategies": list(tried) + [s],
                "path": state["path"] + [f"re_route→{s}"],
            }
    return {**state, "strategy": "vector"}  # last resort
```
Experimental Results
Routing Behavior
8 test questions, designed to exercise all four question types:
Initial strategy distribution:

- vector: 4 questions (factual: RAGAS metrics, vector DB use cases)
- graph: 2 questions (relational: BAAI's two models, Self-RAG/CRAG/Graph RAG comparison)
- direct: 2 questions (general: translate to English, Python list average)
- web: 0 questions ← worth discussing

Re-routing triggered: 4 / 6 retrieval questions (67%)
What the agent got right:
The two relational questions ("which organization do bge-large-zh-v1.5 and bge-reranker-v2-m3 both come from, and what RAG stage does each serve?" and "what problem do Self-RAG, CRAG, and Graph RAG each solve?") were correctly routed to graph. The knowledge graph built from the document corpus was traversed to directly find the entity connections.
The two general knowledge questions ("translate 'retrieval-augmented generation' to English" and "how do you calculate a list average in Python?") were correctly routed to direct — no retrieval wasted, no unnecessary LLM calls to the knowledge base.
An honest routing miss:
"What are the latest RAG papers published in 2025?" was classified as vector by GLM-4-flash rather than web.
This is a prompt engineering gap in the classify node, not a framework design flaw. The word "papers" is strongly associated with knowledge-base content, and the LLM read the question as "find me information about papers" rather than "find me current information." Adding a rule — "questions containing temporal markers like 'latest', 'recent', or 'this year' prefer web" — would correct this class of misclassification.
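Concretely, that fix could be as small as appending one tie-breaker rule to the classify system prompt shown earlier (an untested sketch):

```python
CLASSIFY_PROMPT = ChatPromptTemplate.from_messages([
    ("system",
     "Decide which retrieval strategy best fits the question. Output only the strategy name:\n\n"
     "vector - factual question requiring knowledge base (definitions, parameters, comparisons)\n"
     "graph - relational question about connections between entities (who made what, how X relates to Y)\n"
     "web - requires up-to-date information (latest versions, recent papers, current events)\n"
     "direct - no retrieval needed (common sense, math, translation, code syntax)\n\n"
     # Added rule: temporal markers override the "papers = knowledge base" association
     "If the question contains temporal markers such as 'latest', 'recent', 'this year', "
     "or an explicit recent year, prefer web even if it otherwise looks factual."),
    ("human", "Question: {question}"),
])
```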
4 re-routes: the evaluate node is doing real work
4 out of 6 retrieval questions triggered re-routing (67%). This means the quality evaluator isn't just going through the motions — it's genuinely rejecting context it judges insufficient and pushing the agent to try a different approach. The system is correcting itself in flight.
RAGAS Metrics
RAGAS Metrics Comparison (Always-Vector vs Agentic RAG)

| Metric | Always-Vector | Agentic RAG | Delta |
|---|---|---|---|
| context_recall | 0.611 | 0.611 | +0.000 |
| context_precision | 0.639 | 0.681 | +0.042 ◀ |
| faithfulness | 0.625 | 0.625 | +0.000 |
| answer_relevancy | 0.431 | 0.433 | +0.002 |
context_precision +0.042; everything else essentially unchanged.
Why the RAGAS Improvement Is Small (And Why That's Not the Point)
This deserves a careful explanation, because the small numbers are easy to misread as "Agentic RAG doesn't work."
RAGAS measures final answer quality, not process robustness.
Our test knowledge base covers these 6 retrieval questions reasonably well. Even when evaluate scores a retrieval as insufficient and the agent switches strategies, the final answer quality doesn't jump dramatically — because the information was already there. When the knowledge base is comprehensive, Agentic and Pipeline RAG produce similar-quality answers.
The real value shows up when the knowledge base falls short:
| Scenario | Pipeline RAG | Agentic RAG |
|---|---|---|
| KB has the answer | ✅ answers correctly | ✅ answers correctly |
| KB has no relevant content | ❌ generates from irrelevant docs (hallucination risk) | ✅ switches to web search, or acknowledges the gap |
| Question needs relational reasoning | ⚠️ retrieves by similarity, may miss connections | ✅ routes to graph traversal |
| Question needs no retrieval | ⚠️ wastes a retrieval call | ✅ skips directly to generation |
RAGAS in this experiment only tests row one. The value in rows two, three, and four doesn't show up in the metrics — but it's the reason you'd choose Agentic RAG in a real deployment.
This is a recurring theme throughout this series: every optimization has a target scenario. Numbers without scenario context are incomplete.
Pipeline RAG vs Agentic RAG
| Dimension | Pipeline RAG | Agentic RAG |
|---|---|---|
| Flow | Fixed linear sequence | Dynamic feedback loop |
| Retrieval strategy | Fixed (usually vector) | Dynamic, chosen per question type |
| Result evaluation | None | Quality scoring |
| Failure handling | Generate anyway | Switch strategy and retry |
| Direct generation | Not supported | Supported for general knowledge |
| Extra LLM calls | 0 | classify + evaluate (+ re-route) |
| Best fit | Comprehensive KB, uniform question types | Mixed intents, coverage gaps |
The cost is real. Every question adds at least 2 extra LLM calls (classify + evaluate), more if re-routing triggers. If your question types are uniform and your knowledge base is comprehensive, Pipeline RAG's cost advantage is genuine.
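Using this run's own numbers (8 questions, 2 routed direct, 6 through retrieval, 4 re-routes), a rough back-of-envelope count of chat-model calls, ignoring embedding calls and any LLM use inside the retrieval tools themselves:

```python
questions, direct, retrieval_qs, reroutes = 8, 2, 6, 4

pipeline_calls = questions       # one generate call per question
agentic_calls = (
    questions       # classify
    + retrieval_qs  # first evaluate (direct-path questions skip it)
    + reroutes      # one extra evaluate per re-route
    + questions     # generate
)
print(pipeline_calls, agentic_calls)  # 8 vs 26: roughly 3x the chat-model calls on this workload
```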
Full Code
Complete code is open-sourced at:
https://github.com/chendongqi/llm-in-action/tree/main/17-agentic-rag
Key file:

- `agentic_rag.py` — full implementation: graph build, LangGraph agent, RAGAS evaluation
How to run:
```bash
git clone https://github.com/chendongqi/llm-in-action
cd llm-in-action/17-agentic-rag
cp .env.example .env
pip install -r requirements.txt
python agentic_rag.py
```
Summary
This article implemented Agentic RAG. Key findings:
- Retrieval as a tool, not a fixed step — this is the essential difference from Pipeline RAG. Tools can be selected, evaluated, and swapped. Steps cannot.
- Routing accuracy was solid — relational questions correctly went to graph traversal, general knowledge questions correctly skipped retrieval. The classification is genuinely useful.
- 4 of 6 retrieval questions triggered re-routing — the evaluate node is doing substantive quality control, not just adding latency to rubber-stamp retrieval results.
- RAGAS +0.042, but the metric isn't the story — the improvement is small because our knowledge base already covers the test questions. The real value is robustness: what happens when coverage fails. Pipeline RAG silently generates a bad answer. Agentic RAG switches strategies and at least tries to find something better.
Looking back at the arc of this series: Self-RAG answered "should we retrieve?", CRAG answered "is what we retrieved good enough?", Graph RAG answered "how do we handle relational questions?" — Agentic RAG combines all three into a unified decision loop. The system doesn't just execute a fixed plan; it actively routes, evaluates, and corrects its own retrieval behavior. That's the shift from pipeline thinking to agent thinking.