RAG Meets Agent — It's More Than "Giving the LLM a Search Box"
Most people encounter RAG in this form: user asks a question → retrieve from a knowledge base → stuff the results into the prompt → LLM generates an answer.
That's Pipeline RAG. It works. But it has a fundamental problem — it doesn't think.
Pipeline RAG runs retrieval for every question, regardless of whether it's "How much does WonderBot cost?" (genuinely needs the KB) or "How do you average a Python list?" (the LLM already knows). It's like a worker with one tool: no matter what the job is, first make a trip to the warehouse.
Agentic RAG solves this: let the Agent decide when to retrieve, what to retrieve, and whether the result is good enough.
This article focuses on three core capabilities:
- Retrieval decision: Does this question need a KB lookup at all?
- Multi-KB routing: It does need retrieval — but from which knowledge base?
- Quality gating + fallback: Got results — are they good enough? If not, what next?
Pipeline RAG vs Agentic RAG: The Architectural Difference
Pipeline RAG (always retrieves):
User question
↓
Vector retrieval (regardless of question type)
↓
Inject into prompt
↓
LLM generates
Agentic RAG (intelligent decisions):
User question
↓
[Decision node] Does this need retrieval?
├─ No → LLM answers directly (common knowledge / math / general coding)
└─ Yes → Which knowledge base?
├─ product_kb (features / pricing)
├─ ops_kb (deployment / monitoring)
└─ faq_kb (accounts / refunds / invoices)
↓
Is retrieval quality sufficient?
├─ Yes → LLM generates
└─ No → rewrite query → retry (max 2×) → LLM generates
The core difference: the LLM is the control center, not a downstream text generator.
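The decision tree above can be sketched in plain Python, assuming each step is supplied as a callable. The function name `agentic_answer` and every parameter name are illustrative placeholders rather than the article's demo code; in practice each callable would wrap an LLM call or a retriever.

```python
from typing import Callable, List


def agentic_answer(
    question: str,
    needs_retrieval: Callable[[str], bool],
    pick_kb: Callable[[str], str],
    retrieve: Callable[[str, str], List[str]],
    is_good_enough: Callable[[str, List[str]], bool],
    rewrite: Callable[[str], str],
    generate: Callable[[str, List[str]], str],
    max_retries: int = 2,
) -> dict:
    """Walk the decision tree: decide, route, retrieve, gate, then generate."""
    # Decision node: skip the knowledge base entirely when it is not needed.
    if not needs_retrieval(question):
        return {"answer": generate(question, []), "kb": None, "retries": 0}

    # Routing node: choose one knowledge base for this question.
    kb = pick_kb(question)
    query, retries = question, 0
    docs = retrieve(kb, query)

    # Quality gate: rewrite and retry until good enough or the cap is hit.
    while not is_good_enough(question, docs) and retries < max_retries:
        query = rewrite(query)
        docs = retrieve(kb, query)
        retries += 1

    return {"answer": generate(question, docs), "kb": kb, "retries": retries}
```

Passing the steps in as callables keeps the control flow testable with stubs before any model or vector store is wired in.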
Demo 1: Pipeline RAG vs Agentic RAG
Five test questions: three genuinely need the knowledge base, two don't (general knowledge, arithmetic).
Pipeline RAG
def pipeline_rag(question: str) -> dict:
    """Pipeline RAG: retrieve → inject → generate, always."""
    docs = unified_retriever.invoke(question)
    context = "\n".join(d.page_content for d in docs)
    answer = _ask(
        f"Answer based on the following reference material.\nReference: {context}",
        question,
    )
    return {"answer": answer, "retrieved": True, "docs": len(docs)}
Agentic RAG
def agentic_rag(question: str) -> dict:
    """Agentic RAG: decide first, then (optionally) retrieve."""
    decision = _ask(
        "Decide whether the following question requires a knowledge base lookup.\n"
        "Needs retrieval: product pricing/features, operations procedures, service policies\n"
        "Skip retrieval: general knowledge, arithmetic, standard programming syntax\n"
        "Answer only yes or no",
        f"Question: {question}",
    ).strip().lower()
    if "yes" not in decision:
        answer = _ask("You are a knowledgeable assistant. Answer directly.", question)
        return {"answer": answer, "retrieved": False, "docs": 0}
    else:
        docs = unified_retriever.invoke(question)
        context = "\n".join(d.page_content for d in docs)
        answer = _ask(f"Answer based on the following reference.\nReference: {context}", question)
        return {"answer": answer, "retrieved": True, "docs": len(docs)}
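The substring check `"yes" not in decision` treats any reply that merely contains the letters "yes" as a yes. A slightly stricter parser is sketched below; `parse_yes_no` and its `default` parameter are hypothetical names, and falling back to retrieval on an unparseable reply is an assumption, not something the demo does.

```python
import re


def parse_yes_no(raw: str, default: bool = True) -> bool:
    """Map a free-form LLM reply to a retrieval decision.

    Only the first word is inspected (leading punctuation is ignored), so
    "Yes." and "no, this is arithmetic" both parse, while a word that merely
    contains "yes" does not. Anything else falls back to `default`.
    """
    match = re.match(r"\W*(yes|no)\b", raw, flags=re.IGNORECASE)
    if match is None:
        return default
    return match.group(1).lower() == "yes"
```

With this helper, the branch in `agentic_rag` could become `if not parse_yes_no(decision):`. Defaulting to retrieval trades a wasted lookup for a lower risk of answering a KB question from memory.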
Measured Results
Question Type | Pipeline Retrieval | Agentic Retrieval | Question
─────────────────────────────────────────────────────────────────
Product feat. | ✓ (3 docs) | ✓ (3 docs) | WonderBot Basic plan — monthly API calls?
Ops | ✓ (3 docs) | ✓ (3 docs) | Minimum memory to deploy WonderBot?
User service | ✓ (3 docs) | ✓ (3 docs) | Can I get a refund after 30 days?
General know. | ✓ (3 docs) | ✗ skipped | How to average a Python list?
Arithmetic | ✓ (3 docs) | ✗ skipped | What is 1024 divided by 32?
Pipeline RAG retrieved for all five questions, including "What is 1024 divided by 32?", where KB content offers zero value. Agentic RAG correctly skipped retrieval for the two questions the model can answer on its own (the Python list average and the arithmetic).
Not every question is worth a warehouse trip.
Demo 2: Multi-Knowledge-Base Routing
Real enterprise deployments typically maintain multiple knowledge bases: product docs, ops manuals, user FAQ. Different questions belong in different KBs.
Three Knowledge Bases
PRODUCT_DOCS = [
    Document(page_content="WonderBot Pro pricing: Basic ¥99/mo, Pro ¥299/mo, Enterprise custom."),
    Document(page_content="API limits: Basic 10K calls/mo, Pro 100K/mo; overage billed at ¥0.01/call."),
    Document(page_content="Supported LLMs: GPT-4, Claude 3, Gemini Pro, GLM-4 — switchable in the console."),
    Document(page_content="Data security: stored on China-region servers, Level-3 security certified."),
]
OPS_DOCS = [
    Document(page_content="Deployment: Docker 20+, ≥8GB RAM, ≥4 CPU cores, recommended: docker-compose."),
    Document(page_content="Troubleshooting: service down → docker ps; API timeout → check LLM connectivity."),
    Document(page_content="Backup: auto daily at 2am, 30-day retention, restore via restore.sh."),
    Document(page_content="Alerts: CPU >80% for 5min, memory >90%, API error rate >5% → WeChat Work webhook."),
]
FAQ_DOCS = [
    Document(page_content="Password reset: 'Forgot password' → enter email → check reset link → set new password."),
    Document(page_content="Refund policy: full refund within 7 days; prorated within 30 days; none after 30."),
    Document(page_content="Invoice: request in Billing Center → fill company info → e-invoice in 3-5 business days."),
    Document(page_content="API Key: create/revoke in Developer Settings; max 5 keys per account."),
]
LangGraph Routing
def route_node(state: RoutingState) -> RoutingState:
    """Step 1: LLM selects the target knowledge base"""
    decision = _ask(
        "Based on the question, pick the right knowledge base (output name only):\n"
        "product - product features, pricing, tech specs, supported models\n"
        "ops - deployment, operations, troubleshooting, monitoring, backups\n"
        "faq - accounts, passwords, refunds, invoices, API Keys",
        f"Question: {state['question']}",
    ).strip().lower()
    ...
The graph topology is minimal:
route → retrieve → generate
route_node's output determines which retriever retrieve_node uses.
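How the route name selects a retriever is not shown in the article, so the following is only a sketch of one way to do it. The `route` state key, the `normalize_route` helper, the `retrievers` mapping, and the `"faq"` fallback are all assumptions made for illustration.

```python
import re
from typing import Callable, Dict, List

# Matches a KB name that is not embedded in a longer word:
# "ops_kb" and "Ops." match, "drops" and "products" do not.
ROUTE_PATTERN = re.compile(r"(?<![a-z])(product|ops|faq)(?![a-z])")


def normalize_route(raw: str, default: str = "faq") -> str:
    """Reduce free-form router output to a known KB name, else `default`."""
    match = ROUTE_PATTERN.search(raw.lower())
    return match.group(1) if match else default


def retrieve_node(
    state: dict,
    retrievers: Dict[str, Callable[[str], List[str]]],
) -> dict:
    """Look up the retriever chosen by the router and fill in the context."""
    route = normalize_route(state["route"])
    docs = retrievers[route](state["question"])
    return {**state, "route": route, "context": "\n".join(docs)}
```

Normalizing before the lookup means a chatty router reply such as "Ops." still resolves to a valid key instead of raising a `KeyError`.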
Measured Routing Accuracy
Six questions (two per knowledge base), real execution:
Expected KB | Actual Route | Match | Question
─────────────────────────────────────────────────────────────────────
→ product | product | ✓ | Pro plan monthly price? Which LLMs are supported?
→ product | ops | ✗ | Where is data stored? What security certification?
→ ops | ops | ✓ | How do I troubleshoot API timeouts?
→ ops | ops | ✓ | What alert fires when CPU hits 80%?
→ faq | faq | ✓ | I bought 15 days ago — how much of a refund do I get?
→ faq | ops | ✗ | How do I get a VAT invoice for my company?
Routing accuracy: 4/6 = 67%
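The figure can be reproduced from the table with a few lines of Python. The `results` list simply transcribes the expected and actual routes above; the variable names are illustrative.

```python
from collections import Counter

# (expected KB, actual route) for the six test questions, in table order.
results = [
    ("product", "product"),
    ("product", "ops"),
    ("ops", "ops"),
    ("ops", "ops"),
    ("faq", "faq"),
    ("faq", "ops"),
]

correct = sum(expected == actual for expected, actual in results)
accuracy = correct / len(results)

# Count each (expected, actual) pair that went wrong.
misroutes = Counter((e, a) for e, a in results if e != a)

print(f"Routing accuracy: {correct}/{len(results)} = {accuracy:.0%}")
print(dict(misroutes))
```

Counting the error pairs shows that both misroutes landed in `ops`, which hints that the `ops` description in the prompt is attracting questions it should not.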
Two misroutes worth examining:
- "Where is data stored / security certification" → routed to ops (should be product): the LLM associated "data storage" with infrastructure/operations instead of product capabilities
- "VAT invoice for my company" → routed to ops (should be faq): "company" in the question triggered an ops association
67% accuracy with a single routing prompt is typical baseline performance — useful, but not production-ready for high-stakes routing. Common improvements:
# Improvement: add boundary examples to the routing prompt
route_prompt = """
Determine the correct knowledge base:
product: product pricing / features / supported models / data security certification
ops: service deployment / troubleshooting / monitoring / backup procedures
faq: accounts / passwords / refunds / invoices / API Keys / billing
Examples:
"which models are supported" → product
"invoice" → faq ← billing is always faq, even for companies
"data storage security" → product ← security certs are product features
Question: {question}
"""
Full Example Answer
Question: "How do I troubleshoot API timeouts?" → routed to ops, retrieved and generated:
Routed to: ops_kb
Answer: For API timeout troubleshooting, follow these steps:
1. Check LLM service connectivity to ensure the network connection is healthy.
2. Verify Docker container status using `docker ps` to confirm services are running.
3. If the cause is memory overflow, increase the Docker memory limit.
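The KB match was correct. Steps 1 and 2 map directly onto the troubleshooting entry in the ops documents; step 3 (memory overflow) does not appear in the OPS_DOCS entries shown above, so it likely comes from the model's general knowledge rather than the KB.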
Demo 3: Quality Gating + Query Rewriting Fallback
When retrieval quality is poor, blindly generating from low-quality context produces bad answers. A better approach: rewrite the query and try again.
Core Flow
retrieve → evaluate_quality
├─ score ≥ 0.6 → generate
└─ score < 0.6 and retries < 2 → rewrite_query → retrieve (loop back)
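The `evaluate_quality` step is not shown in the article. Assuming it asks an LLM to grade relevance and reply with a number, the reply still has to be turned into a float that can be compared against the threshold. The helper below is a hypothetical sketch of that parsing step only; `parse_quality_score` and its `default` are not from the demo code.

```python
import re


def parse_quality_score(raw: str, default: float = 0.0) -> float:
    """Extract the first number from a grader reply and clamp it to [0, 1].

    Replies with no number fall back to `default`, which is treated as
    "low quality" so that the gate errs toward a rewrite.
    """
    match = re.search(r"-?(?:\d+(?:\.\d*)?|\.\d+)", raw)
    if match is None:
        return default
    return min(1.0, max(0.0, float(match.group())))
```

Clamping guards against a grader that answers on a 1 to 10 scale or returns a negative value, either of which would otherwise slip past a plain `float()` call.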
LangGraph Implementation
from typing import TypedDict

QUALITY_THRESHOLD = 0.6
MAX_RETRIES = 2


class QualityGateState(TypedDict):
    question: str
    rewritten_q: str  # current query (starts as original question)
    context: str
    quality_score: float
    answer: str
    attempts: int
    path: list
def qg_rewrite_node(state: QualityGateState) -> QualityGateState:
    """Rewrite the vague query into something more specific"""
    rewritten = _ask(
        "Rewrite the following vague question as a more specific search query, "
        "keeping the original intent but adding relevant keywords. Output only the rewritten query.",
        state["question"],
    ).strip()
    return {**state, "rewritten_q": rewritten, "attempts": state["attempts"] + 1}
def should_rewrite(state: QualityGateState) -> str:
    if state["quality_score"] >= QUALITY_THRESHOLD:
        return "generate"  # quality is sufficient
    if state["attempts"] >= MAX_RETRIES:
        return "generate"  # retry limit reached, so fall back to generate
    return "rewrite"  # quality too low, so rewrite and retry
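To see how the gate behaves without a graph runtime, the loop can be driven by hand. The sketch below is a plain-Python stand-in, not the LangGraph wiring: `run_quality_gate` and its callable parameters are hypothetical, and `should_rewrite` is restated with plain arguments so the block runs on its own.

```python
from typing import Callable

QUALITY_THRESHOLD = 0.6
MAX_RETRIES = 2


def should_rewrite(quality_score: float, attempts: int) -> str:
    """Same branching as the state-based version, on plain values."""
    if quality_score >= QUALITY_THRESHOLD:
        return "generate"
    if attempts >= MAX_RETRIES:
        return "generate"
    return "rewrite"


def run_quality_gate(
    question: str,
    retrieve: Callable[[str], str],
    evaluate: Callable[[str, str], float],
    rewrite: Callable[[str], str],
) -> dict:
    """Retrieve, score, and rewrite until the gate says generate."""
    query, attempts, path = question, 0, []
    while True:
        context = retrieve(query)
        path.append("retrieve")
        score = evaluate(question, context)
        path.append(f"eval({score:.2f})")
        if should_rewrite(score, attempts) == "generate":
            path.append("generate")
            return {"query": query, "context": context,
                    "quality_score": score, "attempts": attempts, "path": path}
        # Like qg_rewrite_node, the rewrite always starts from the original question.
        query = rewrite(question)
        attempts += 1
        path.append("rewrite")
```

Feeding it scripted scores of 0.5, 0.0, 0.0 yields the path retrieve, eval(0.50), rewrite, retrieve, eval(0.00), rewrite, retrieve, eval(0.00), generate, with `attempts` ending at 2.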
Measured Results
Three extremely vague questions:
Original Question | Retries | Final Quality | Execution Path
─────────────────────────────────────────────────────────────────────
"how much does it cost" | 2 | 0.00 | retrieve → eval(0.50) → rewrite → retrieve → eval(0.00) → rewrite → retrieve → eval(0.00) → generate
"something went wrong" | 2 | 0.50 | retrieve → eval(0.50) → rewrite → retrieve → eval(0.50) → rewrite → retrieve → eval(0.50) → generate
"money stuff" | 2 | 0.50 | retrieve → eval(0.50) → rewrite → retrieve → eval(0.50) → rewrite → retrieve → eval(0.50) → generate
Detailed trace for "how much does it cost":
Original query: "how much does it cost"
↓ retrieve → pulled backup / deployment / refund docs (unrelated)
↓ evaluate → quality score 0.50 (LLM sees slight relevance)
↓ rewrite → "product price range query" (too generic, lost context)
↓ retrieve → quality drops further
↓ evaluate → quality score 0.00
↓ rewrite → "product price range query" (no improvement)
↓ generate → fallback answer
Final answer: "Based on the provided reference material, pricing information is not
included. If you need pricing details, please contact the service
provider or visit their official website."
This result teaches an important lesson: query rewriting can't fix a fundamentally underspecified question. "How much does it cost" never named a product, so the rewrite "product price range query" had no product context to add and only made the query more generic, making retrieval worse, not better.
The deeper fix is to add a clarification step when quality remains persistently low:
# Better approach: ask the user to clarify instead of looping
if state["attempts"] >= MAX_RETRIES and state["quality_score"] < 0.3:
    return "clarify"  # new node: ask "Which product/service are you asking about?"
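For the clarify branch to be reachable, it has to be checked before the plain retry-limit fallback in `should_rewrite`. A minimal sketch of the combined function follows; the name `next_step` and the constant `CLARIFY_THRESHOLD` are illustrative, with 0.3 taken from the snippet above.

```python
QUALITY_THRESHOLD = 0.6
CLARIFY_THRESHOLD = 0.3  # below this after all retries, ask the user
MAX_RETRIES = 2


def next_step(quality_score: float, attempts: int) -> str:
    """Return the next node: generate, rewrite, or clarify."""
    if quality_score >= QUALITY_THRESHOLD:
        return "generate"  # quality is sufficient
    if attempts >= MAX_RETRIES:
        # Retries exhausted: clarify only when the context is nearly useless,
        # otherwise fall back to generating from what was retrieved.
        return "clarify" if quality_score < CLARIFY_THRESHOLD else "generate"
    return "rewrite"  # quality too low and retries remain
```

Applied to the three measured questions, only "how much does it cost" (final score 0.00) would reach `clarify`; the other two end at 0.50 and would still fall back to `generate`, so the threshold is worth tuning against logged scores.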
This is the real challenge in Agentic RAG — low retrieval quality isn't always a retrieval strategy problem. Sometimes the question itself is missing information.
Agentic RAG Design Checklist
Key decision points when building an Agentic RAG system:
Retrieval Decision Layer
- [ ] Define which question types need retrieval (domain-specific vs. general knowledge)
- [ ] Include specific boundary examples in the decision prompt to reduce ambiguity
- [ ] Define explicit skip_retrieval categories: pure math, coding syntax, general facts
Knowledge Base Routing Layer
- [ ] Write clear descriptions for each KB (type + typical questions + boundary cases)
- [ ] If routing accuracy < 80%, add few-shot examples or use a dedicated classifier
- [ ] Support cross-KB retrieval when questions span multiple domains
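Cross-KB retrieval is not implemented in the demos, so the following is only one possible shape for it: the router returns several KB names and the results are merged with duplicates removed. `retrieve_across`, `per_kb`, and the `retrievers` mapping are hypothetical names.

```python
from typing import Callable, Dict, List, Tuple


def retrieve_across(
    question: str,
    routes: List[str],
    retrievers: Dict[str, Callable[[str], List[str]]],
    per_kb: int = 2,
) -> List[Tuple[str, str]]:
    """Query each routed KB and merge results, keeping the source KB name."""
    seen = set()
    merged: List[Tuple[str, str]] = []
    for route in routes:
        retriever = retrievers.get(route)
        if retriever is None:
            continue  # unknown route name: skip rather than fail
        for text in retriever(question)[:per_kb]:
            if text not in seen:  # drop passages already returned by another KB
                seen.add(text)
                merged.append((route, text))
    return merged
```

Keeping the source KB alongside each passage makes it possible to log which knowledge base actually contributed to an answer.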
Quality Gating Layer
- [ ] Set a sensible threshold (0.6 is a reasonable starting point)
- [ ] Cap max retries (2 is usually enough — diminishing returns after that)
- [ ] Log every rewrite and quality score to drive future improvements
- [ ] When quality stays persistently low, escalate to user clarification rather than hallucinating
Production Concerns
- [ ] Add domain-specific examples to routing prompts for problematic edge cases
- [ ] Consider hybrid retrieval (vector + BM25) to improve baseline quality
- [ ] Track which questions skip retrieval and which trigger rewrites — use the data to iterate
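For the hybrid retrieval item, one common way to combine a vector ranking with a BM25 ranking is reciprocal rank fusion (RRF). The article does not say which fusion method to use, so this is an assumed choice; the function name, the document ids, and `k = 60` (a conventional default) are illustrative.

```python
from collections import defaultdict
from typing import List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores 1 / (k + rank) per list it appears in, with ranks
    starting at 1. Ties are broken alphabetically so output is deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings from a vector retriever and a BM25 retriever.
vector_ranking = ["pricing", "limits", "security"]
bm25_ranking = ["pricing", "backup", "limits"]
fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
```

RRF only needs rank positions, so it avoids having to normalize cosine similarities against BM25 scores, which live on different scales.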
Summary
Five key conclusions:
- Pipeline RAG's problem isn't retrieval — it's the lack of judgment: running retrieval on every question wastes resources and introduces irrelevant content that confuses the LLM
- Agentic RAG's essence is LLM-as-scheduler: retrieval, routing, and evaluation are all LLM decisions, not hardwired pipeline steps
- Multi-KB routing accuracy with a plain prompt is limited: 67% (4/6) with a single routing prompt is a typical baseline. Production needs few-shot examples or a dedicated model
- Quality gating + query rewriting is not a silver bullet: extremely vague questions may produce worse rewrites — the real fix is asking the user
- LangGraph makes Agentic RAG easy to extend: adding a new KB is one new node + updated routing prompt, no structural changes needed
Next up: Context Engineering — token budget management, dynamic context assembly, and how to make every token count in a 128K context window.
Full demo code for this series: agent-06-agentic-rag