DEV Community

Aj

Originally published at cloudedventures.com

Why your RAG system fails in production — and the agentic loop fix

Your RAG demo worked perfectly. Then real users arrived and it started giving confidently wrong answers.

This is the most common production AI failure in 2026. And it's not a chunking problem or an embedding problem. It's an architectural one.

TL;DR

  • Standard RAG is a one-shot pipeline with no decision point between retrieval and generation
  • When retrieval is weak, the LLM hallucinates confidently using bad context
  • Agentic RAG adds a control loop: retrieve → evaluate → retry or proceed
  • The evaluation step is the entire value add — use a cheap fast model for it
  • 2–4x token cost vs single-pass — worth it when wrong answers have real consequences

What standard RAG actually does

User query
    ↓
Embed → search vector DB → retrieve top-K chunks
    ↓
Inject chunks into LLM context
    ↓
Generate answer
    ↓
Return to user (no checkpoint, no second chance)

Works fine for simple direct questions. Breaks silently on ambiguous, multi-hop, or cross-source queries. The LLM has no way to signal "my context was bad" — it just generates something plausible-sounding and wrong.
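A stripped-down version of that pipeline makes the problem concrete. The stubs below are hypothetical stand-ins for the real embedding model, vector DB, and LLM calls; the point is structural: nothing anywhere checks retrieval quality before generation.

```python
def embed(text: str) -> list[float]:
    # Stand-in for an embedding model call
    return [float(len(text))]

def vector_search(vector: list[float], top_k: int = 3) -> list[str]:
    # Stand-in for a vector DB query (Pinecone, pgvector, Bedrock KB)
    return [f"chunk-{i}" for i in range(top_k)]

def generate(prompt: str) -> str:
    # Stand-in for the LLM call
    return f"Answer based on: {prompt}"

def one_shot_rag(query: str) -> str:
    chunks = vector_search(embed(query))
    prompt = "Context:\n" + "\n".join(chunks) + f"\n\nQuestion: {query}"
    # Straight to generation: weak retrieval goes in unnoticed
    return generate(prompt)
```

Whatever `vector_search` returns, relevant or not, flows directly into the prompt. That is the silent failure.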


The agentic RAG pattern

User query
    ↓
Agent decides which source to query
    ↓
Retrieve chunks
    ↓
┌─── DECISION POINT ───────────────────────┐
│  Evaluate: is this sufficient?           │
│  → SUFFICIENT: generate answer           │
│  → RETRY: rewrite query, search again    │
│  → ESCALATE: cannot answer reliably      │
└──────────────────────────────────────────┘
    ↓
Generate grounded answer with citations

The decision point between retrieval and generation is the entire architectural difference. Something now asks "was this retrieval good enough?" before the LLM generates.


Complete implementation on AWS Bedrock

import boto3
import json

client = boto3.client("bedrock-runtime", region_name="us-east-1")

retrieval_tools = [
    {
        "toolSpec": {
            "name": "search_knowledge_base",
            "description": """Search the primary knowledge base for relevant information.
            Use this first for any factual question.
            Returns chunks with relevance scores.""",
            "inputSchema": {
                "json": {
                    "type": "object",
                    "properties": {
                        "query": {"type": "string"},
                        "max_results": {"type": "integer"}
                    },
                    "required": ["query"]
                }
            }
        }
    },
    {
        "toolSpec": {
            "name": "evaluate_retrieval_quality",
            "description": """Evaluate whether retrieved chunks are sufficient to answer the question.
            Use after every retrieval. Returns SUFFICIENT, RETRY, or ESCALATE.""",
            "inputSchema": {
                "json": {
                    "type": "object",
                    "properties": {
                        "original_query": {"type": "string"},
                        "retrieved_chunks": {"type": "string"}
                    },
                    "required": ["original_query", "retrieved_chunks"]
                }
            }
        }
    }
]

def search_knowledge_base(query: str, max_results: int = 3) -> dict:
    # Replace with your actual vector DB call (Pinecone, pgvector, Bedrock KB)
    return {
        "chunks": [
            {"text": f"Retrieved chunk for: {query}", "score": 0.87},
            {"text": f"Second chunk for: {query}", "score": 0.72}
        ]
    }

def evaluate_retrieval_quality(original_query: str, retrieved_chunks: str) -> dict:
    """
    Use a cheap fast model to evaluate — save expensive model for generation.
    This is the decision point that makes the loop work.
    """
    eval_prompt = f"""Evaluate if the retrieved content is sufficient to answer the question.

Question: {original_query}
Retrieved: {retrieved_chunks}

Respond with exactly: VERDICT|reasoning|suggested_query_if_retry
VERDICT must be: SUFFICIENT, RETRY, or ESCALATE"""

    response = client.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # Cheap model for evaluation
        messages=[{"role": "user", "content": [{"text": eval_prompt}]}]
    )
    result = response["output"]["message"]["content"][0]["text"]
    parts = result.split("|")
    return {
        "verdict": parts[0].strip(),
        "reasoning": parts[1].strip() if len(parts) > 1 else "",
        "suggested_query": parts[2].strip() if len(parts) > 2 else ""
    }

def tool_router(tool_name: str, tool_input: dict) -> str:
    if tool_name == "search_knowledge_base":
        return json.dumps(search_knowledge_base(
            tool_input["query"],
            tool_input.get("max_results", 3)
        ))
    elif tool_name == "evaluate_retrieval_quality":
        return json.dumps(evaluate_retrieval_quality(
            tool_input["original_query"],
            tool_input["retrieved_chunks"]
        ))
    return json.dumps({"error": f"Unknown tool: {tool_name}"})

def run_agentic_rag(user_query: str) -> str:
    system = """You are a precise Q&A agent.

Process:
1. Search knowledge base
2. Evaluate retrieval quality — ALWAYS do this before generating
3. RETRY if evaluation says so (max 3 retries)
4. ESCALATE if cannot find sufficient information
5. Generate grounded answer with citations only if SUFFICIENT

Never generate before evaluating. Cite sources in your answer."""

    messages = [{"role": "user", "content": [{"text": user_query}]}]

    for _ in range(8):  # Safety cap
        response = client.converse(
            modelId="anthropic.claude-3-sonnet-20240229-v1:0",
            system=[{"text": system}],
            messages=messages,
            toolConfig={"tools": retrieval_tools}
        )

        stop_reason = response["stopReason"]
        output = response["output"]["message"]
        messages.append(output)

        if stop_reason == "end_turn":
            return output["content"][0]["text"]

        if stop_reason == "tool_use":
            tool_results = []
            for block in output["content"]:
                if "toolUse" not in block:
                    continue
                tool = block["toolUse"]
                result = tool_router(tool["name"], tool["input"])
                tool_results.append({
                    "toolResult": {
                        "toolUseId": tool["toolUseId"],
                        "content": [{"text": result}]
                    }
                })
            messages.append({"role": "user", "content": tool_results})

    return "Max iterations — could not generate reliable answer"

print(run_agentic_rag(
    "How does our refund policy work for enterprise customers upgrading mid-cycle?"
))

Three things that make this work

The evaluation tool is the entire value add

evaluate_retrieval_quality uses Haiku (fast, cheap) to judge relevance before the main model generates. That decision point is where quality improvement comes from. Don't use your expensive model for this step.
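One caveat worth planning for: small models do not always follow the `VERDICT|reasoning|suggested_query` format exactly. A defensive parser (a sketch, names are my own) that fails closed to ESCALATE keeps malformed evaluator output from being silently treated as SUFFICIENT:

```python
VALID_VERDICTS = {"SUFFICIENT", "RETRY", "ESCALATE"}

def parse_verdict(raw: str) -> dict:
    # Fail closed: anything that doesn't parse becomes ESCALATE,
    # never an accidental SUFFICIENT
    parts = [p.strip() for p in raw.split("|")]
    verdict = parts[0].upper() if parts and parts[0] else ""
    if verdict not in VALID_VERDICTS:
        return {"verdict": "ESCALATE",
                "reasoning": f"Unparseable evaluation output: {raw!r}",
                "suggested_query": ""}
    return {"verdict": verdict,
            "reasoning": parts[1] if len(parts) > 1 else "",
            "suggested_query": parts[2] if len(parts) > 2 else ""}
```

Dropping this into `evaluate_retrieval_quality` in place of the bare `result.split("|")` makes the decision point robust to a chatty evaluator.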

Tool descriptions are your routing logic

Write them as instructions: "Use this first for any factual question" — not labels like "knowledge base search". The agent routes entirely based on these descriptions.

The system prompt is your quality contract

"Never generate before evaluating" is the single most important instruction. Without it the agent sometimes skips evaluation and hallucinates from bad context.


When to use each pattern

| Pattern | Use when | Token cost |
| --- | --- | --- |
| Standard RAG | Simple single-hop Q&A, latency-sensitive | 1x |
| Agentic RAG | Ambiguous queries, multi-hop, quality matters | 2–4x |

Agentic RAG costs more per query. Worth it when wrong answers have real consequences. Overkill for simple doc lookups.
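A rough cost sketch shows where the multiplier comes from. The prices and token counts below are illustrative assumptions, not current Bedrock quotes; the takeaway is that the cheap evaluator adds only about 2% per pass, and the 2–4x comes from repeated generation passes:

```python
# Illustrative per-1K-token prices (assumptions, check current pricing)
SONNET_PRICE = 0.003   # generation model
HAIKU_PRICE = 0.00025  # evaluation model

GEN_TOKENS = 2000   # context + answer per generation pass
EVAL_TOKENS = 500   # question + chunks per evaluation pass

gen_cost = GEN_TOKENS * SONNET_PRICE / 1000    # one generation pass
eval_cost = EVAL_TOKENS * HAIKU_PRICE / 1000   # one evaluation pass

# The evaluator is nearly free; retries are what multiply cost
overhead = eval_cost / gen_cost
one_pass = gen_cost + eval_cost
three_pass = 3 * (gen_cost + eval_cost)  # worst case: two retries

print(f"eval overhead per pass: {overhead:.1%}")
print(f"worst-case multiplier: {three_pass / one_pass:.0f}x")
```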


Production tips

Log every iteration. Knowing which queries needed retries, and why, reveals exactly where your knowledge base has gaps.
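A minimal sketch of what that logging could look like, one JSON record per loop iteration (field names are my own, adapt to your logging stack):

```python
import json
import logging

logger = logging.getLogger("agentic_rag")

def iteration_record(query: str, iteration: int,
                     verdict: str, reasoning: str) -> str:
    # One structured record per loop pass; aggregate offline to find
    # queries that consistently hit RETRY (= knowledge base gaps)
    return json.dumps({
        "query": query,
        "iteration": iteration,
        "verdict": verdict,
        "reasoning": reasoning,
    })

def log_iteration(query: str, iteration: int,
                  verdict: str, reasoning: str) -> None:
    logger.info(iteration_record(query, iteration, verdict, reasoning))
```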

Cap at 5–6 iterations. Queries needing more are usually unanswerable from your KB and should escalate instead of burning tokens.

Latency is now a distribution. Some queries resolve in one pass (fast). Some need three (slower). Monitor p95, not average.
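A stdlib-only illustration of why the mean misleads here (the latency numbers are made up to show the shape: most queries resolve in one pass, a tail needs retries):

```python
import statistics

# Hypothetical per-query latencies: 70% resolve in one pass,
# 20% need one retry, 10% need two retries
latencies = [1.2] * 70 + [2.5] * 20 + [4.8] * 10  # seconds

mean = statistics.mean(latencies)
p95 = statistics.quantiles(latencies, n=100)[94]  # 95th percentile

# The mean hides the retry tail; p95 exposes it
print(f"mean={mean:.2f}s  p95={p95:.2f}s")
```

An SLO set on the mean would look healthy while one query in ten takes four times longer.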


For hands-on labs building a full RAG system with Bedrock Knowledge Bases in a real AWS sandbox — Cloud Edventures CCA-001 track, 22 labs, no AWS account needed.

Search: Cloud Edventures CCA-001

Where does your RAG system break down most in production? Drop a comment.
