Your RAG demo worked perfectly. Then real users arrived and it started giving confidently wrong answers.
This is the most common production AI failure in 2026. And it's not a chunking problem or an embedding problem. It's an architectural one.
## TL;DR
- Standard RAG is a one-shot pipeline with no decision point between retrieval and generation
- When retrieval is weak, the LLM hallucinates confidently using bad context
- Agentic RAG adds a control loop: retrieve → evaluate → retry or proceed
- The evaluation step is the entire value add — use a cheap fast model for it
- 2–4x token cost vs single-pass — worth it when wrong answers have real consequences
## What standard RAG actually does

```
User query
    ↓
Embed → search vector DB → retrieve top-K chunks
    ↓
Inject chunks into LLM context
    ↓
Generate answer
    ↓
Return to user (no checkpoint, no second chance)
```
Works fine for simple, direct questions. It breaks silently on ambiguous, multi-hop, or cross-source queries: the LLM has no way to signal "my context was bad," so it just generates something plausible-sounding and wrong.
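The failure mode is easy to see in code. A minimal sketch of the one-shot pipeline (`vector_search` and `llm_generate` are placeholder stubs, not a real API):

```python
# Stubs stand in for your vector DB and model; replace with real calls.
def vector_search(query: str, top_k: int) -> list:
    return [{"text": f"chunk about {query}", "score": 0.8}][:top_k]

def llm_generate(prompt: str) -> str:
    return "plausible answer (right or wrong, nothing ever checks)"

def standard_rag(query: str, top_k: int = 3) -> str:
    """One-shot RAG: retrieve once, generate once, no decision point."""
    chunks = vector_search(query, top_k)
    context = "\n\n".join(c["text"] for c in chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    # The answer goes straight back to the user; weak retrieval is never caught.
    return llm_generate(prompt)
```

Nothing between retrieval and generation ever asks whether the chunks were any good.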
## The agentic RAG pattern

```
User query
    ↓
Agent decides which source to query
    ↓
Retrieve chunks
    ↓
┌─── DECISION POINT ──────────────────┐
│ Evaluate: is this sufficient?       │
│  → SUFFICIENT: generate answer      │
│  → RETRY: rewrite query, try again  │
│  → ESCALATE: cannot answer reliably │
└─────────────────────────────────────┘
    ↓
Generate grounded answer with citations
```
The decision point between retrieval and generation is the entire architectural difference. Something now asks "was this retrieval good enough?" before the LLM generates.
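Stripped of any framework, that control loop is just a few branches. A runnable sketch with all three calls stubbed out (the names, the 0.7 score threshold, and the query-rewrite convention are illustrative):

```python
# Stub implementations so the control flow runs; swap in real retrieval,
# evaluation, and generation calls in production.
def retrieve(q: str) -> list:
    score = 0.9 if q.endswith("(rewritten)") else 0.4
    return [{"text": f"chunk for {q}", "score": score}]

def evaluate(original_query: str, chunks: list) -> dict:
    if max(c["score"] for c in chunks) >= 0.7:
        return {"verdict": "SUFFICIENT"}
    return {"verdict": "RETRY",
            "suggested_query": original_query + " (rewritten)"}

def generate(query: str, chunks: list) -> str:
    return f"grounded answer using {len(chunks)} chunk(s)"

MAX_RETRIES = 3

def agentic_rag(query: str) -> str:
    q = query
    for _ in range(MAX_RETRIES):
        chunks = retrieve(q)
        verdict = evaluate(query, chunks)
        if verdict["verdict"] == "SUFFICIENT":
            return generate(query, chunks)      # generate only after passing the gate
        if verdict["verdict"] == "ESCALATE":
            return "ESCALATED: cannot answer reliably"
        q = verdict["suggested_query"]          # RETRY with a rewritten query
    return "ESCALATED: retry budget exhausted"
```

The rest of this post is the same loop, except the LLM itself drives the branching via tool calls.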
## Complete implementation on AWS Bedrock
```python
import boto3
import json

client = boto3.client("bedrock-runtime", region_name="us-east-1")

retrieval_tools = [
    {
        "toolSpec": {
            "name": "search_knowledge_base",
            "description": """Search the primary knowledge base for relevant information.
Use this first for any factual question.
Returns chunks with relevance scores.""",
            "inputSchema": {
                "json": {
                    "type": "object",
                    "properties": {
                        "query": {"type": "string"},
                        "max_results": {"type": "integer"}
                    },
                    "required": ["query"]
                }
            }
        }
    },
    {
        "toolSpec": {
            "name": "evaluate_retrieval_quality",
            "description": """Evaluate whether retrieved chunks are sufficient to answer the question.
Use after every retrieval. Returns SUFFICIENT, RETRY, or ESCALATE.""",
            "inputSchema": {
                "json": {
                    "type": "object",
                    "properties": {
                        "original_query": {"type": "string"},
                        "retrieved_chunks": {"type": "string"}
                    },
                    "required": ["original_query", "retrieved_chunks"]
                }
            }
        }
    }
]

def search_knowledge_base(query: str, max_results: int = 3) -> dict:
    # Replace with your actual vector DB call (Pinecone, pgvector, Bedrock KB)
    return {
        "chunks": [
            {"text": f"Retrieved chunk for: {query}", "score": 0.87},
            {"text": f"Second chunk for: {query}", "score": 0.72}
        ]
    }

def evaluate_retrieval_quality(original_query: str, retrieved_chunks: str) -> dict:
    """
    Use a cheap fast model to evaluate — save the expensive model for generation.
    This is the decision point that makes the loop work.
    """
    eval_prompt = f"""Evaluate if the retrieved content is sufficient to answer the question.
Question: {original_query}
Retrieved: {retrieved_chunks}
Respond with exactly: VERDICT|reasoning|suggested_query_if_retry
VERDICT must be: SUFFICIENT, RETRY, or ESCALATE"""
    response = client.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # cheap model for evaluation
        messages=[{"role": "user", "content": [{"text": eval_prompt}]}]
    )
    result = response["output"]["message"]["content"][0]["text"]
    parts = result.split("|")
    return {
        "verdict": parts[0].strip(),
        "reasoning": parts[1].strip() if len(parts) > 1 else "",
        "suggested_query": parts[2].strip() if len(parts) > 2 else ""
    }

def tool_router(tool_name: str, tool_input: dict) -> str:
    if tool_name == "search_knowledge_base":
        return json.dumps(search_knowledge_base(
            tool_input["query"],
            tool_input.get("max_results", 3)
        ))
    elif tool_name == "evaluate_retrieval_quality":
        return json.dumps(evaluate_retrieval_quality(
            tool_input["original_query"],
            tool_input["retrieved_chunks"]
        ))
    return json.dumps({"error": f"Unknown tool: {tool_name}"})

def run_agentic_rag(user_query: str) -> str:
    system = """You are a precise Q&A agent.
Process:
1. Search knowledge base
2. Evaluate retrieval quality — ALWAYS do this before generating
3. RETRY if evaluation says so (max 3 retries)
4. ESCALATE if cannot find sufficient information
5. Generate grounded answer with citations only if SUFFICIENT
Never generate before evaluating. Cite sources in your answer."""
    messages = [{"role": "user", "content": [{"text": user_query}]}]
    for _ in range(8):  # safety cap
        response = client.converse(
            modelId="anthropic.claude-3-sonnet-20240229-v1:0",
            system=[{"text": system}],
            messages=messages,
            toolConfig={"tools": retrieval_tools}
        )
        stop_reason = response["stopReason"]
        output = response["output"]["message"]
        messages.append(output)
        if stop_reason == "end_turn":
            return output["content"][0]["text"]
        if stop_reason == "tool_use":
            tool_results = []
            for block in output["content"]:
                if "toolUse" not in block:
                    continue
                tool = block["toolUse"]
                result = tool_router(tool["name"], tool["input"])
                tool_results.append({
                    "toolResult": {
                        "toolUseId": tool["toolUseId"],
                        "content": [{"text": result}]
                    }
                })
            messages.append({"role": "user", "content": tool_results})
    return "Max iterations — could not generate reliable answer"

print(run_agentic_rag(
    "How does our refund policy work for enterprise customers upgrading mid-cycle?"
))
```
## Three things that make this work

### The evaluation tool is the entire value add

`evaluate_retrieval_quality` uses Haiku (fast, cheap) to judge relevance before the main model generates. That decision point is where the quality improvement comes from. Don't use your expensive model for this step.
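One hardening detail: small models occasionally break the `VERDICT|reasoning|suggested_query` format, so it pays to normalize the verdict and default to RETRY instead of crashing or trusting garbage. A sketch (a defensive parser, not part of the code above):

```python
VALID_VERDICTS = {"SUFFICIENT", "RETRY", "ESCALATE"}

def parse_verdict(raw: str) -> dict:
    """Parse 'VERDICT|reasoning|suggested_query', tolerating sloppy output."""
    parts = [p.strip() for p in raw.split("|")]
    verdict = parts[0].upper() if parts else ""
    if verdict not in VALID_VERDICTS:
        verdict = "RETRY"  # safe default: one more retrieval beats a bad answer
    return {
        "verdict": verdict,
        "reasoning": parts[1] if len(parts) > 1 else "",
        "suggested_query": parts[2] if len(parts) > 2 else "",
    }
```

Defaulting to RETRY means a malformed evaluation costs you one extra cheap call instead of an unchecked answer.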
### Tool descriptions are your routing logic

Write them as instructions ("Use this first for any factual question"), not labels ("knowledge base search"). The agent routes entirely based on these descriptions.
### The system prompt is your quality contract

"Never generate before evaluating" is the single most important instruction. Without it, the agent sometimes skips evaluation and hallucinates from bad context.
## When to use each pattern
| Pattern | Use when | Token cost |
|---|---|---|
| Standard RAG | Simple single-hop Q&A, latency-sensitive | 1x |
| Agentic RAG | Ambiguous queries, multi-hop, quality matters | 2–4x |
Agentic RAG costs more per query. Worth it when wrong answers have real consequences. Overkill for simple doc lookups.
## Production tips

**Log every iteration.** Which queries needed retries, and why, reveals exactly where your knowledge base has gaps.
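That logging can be as simple as one JSON line per loop iteration (the JSONL file and field names here are just one reasonable choice):

```python
import json
import time

def log_iteration(query: str, attempt: int, verdict: str, reasoning: str,
                  log_path: str = "rag_audit.jsonl") -> dict:
    """Append one JSON line per loop iteration; grep the file for RETRY and
    ESCALATE verdicts to find knowledge-base gaps. Returns the record."""
    record = {
        "ts": time.time(),
        "query": query,
        "attempt": attempt,
        "verdict": verdict,
        "reasoning": reasoning,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Queries that repeatedly hit RETRY on the same topic are your ingestion backlog, ranked for free.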
**Cap at 5–6 iterations.** Queries needing more are usually unanswerable from your KB and should escalate instead of burning tokens.
**Latency is now a distribution.** Some queries resolve in one pass (fast); some need three (slower). Monitor p95, not the average.
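Computing p95 from your logged latencies takes a few lines with no external dependencies (nearest-rank method shown; your metrics stack likely does this for you):

```python
import math

def p95(latencies_ms: list) -> float:
    """Nearest-rank 95th percentile of a list of latency samples."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    s = sorted(latencies_ms)
    idx = max(0, math.ceil(0.95 * len(s)) - 1)
    return s[idx]
```

A healthy agentic RAG deployment often shows a tight median with a long retry-driven tail, which an average would hide completely.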
For hands-on labs building a full RAG system with Bedrock Knowledge Bases in a real AWS sandbox — Cloud Edventures CCA-001 track, 22 labs, no AWS account needed.
Search: Cloud Edventures CCA-001
Where does your RAG system break down most in production? Drop a comment.