Standard RAG retrieves once and hopes for the best. Agentic RAG retrieves, reflects, decides it was wrong, and tries again — without being told to.
Single-pass RAG has a fundamental flaw: it commits to its first retrieval attempt and generates forward regardless. It has no mechanism to check whether the retrieved chunks actually contain the answer. This works for simple factual queries. It breaks on multi-hop questions, ambiguous intent, and analytical queries requiring sequenced lookups.
The Architecture
An agentic RAG system treats retrieval as a tool available to a reasoning loop. The LLM decides what to retrieve, evaluates what came back, and determines when to stop.
The key component: a reflection agent sits between retrieval and generation. It evaluates the quality and sufficiency of accumulated context and either terminates the loop or sends it back with a refined query.
Three patterns in increasing complexity:
- Iterative Query Refinement — single tool, query rewritten per pass
- Multi-Tool Orchestration — agent selects between keyword, semantic, hybrid, and filtered search
- Hierarchical Decomposition — planner splits multi-hop queries into dependent sub-queries
Routing: The Most Important Decision
Sending every query through the agentic path is the most common mistake. Agentic retrieval adds 2-8s latency and 4-12x cost. Simple factual queries (60-75% of typical traffic) get no quality improvement from it.
Use a hybrid router: deterministic rules first (regex patterns, length heuristics, keyword signals), LLM classification only for ambiguous cases. Use Haiku for routing — it's a classification task, not a reasoning task.
Reflection Agent: Deciding When to Stop
The reflection agent's judgment quality determines the entire system's utility. Calibrate it against real queries:
- Iteration 1: 65-75% of queries should terminate (simple queries succeeding on first pass)
- Iteration 2: 15-20% (needed one refinement)
- Iteration 3: 5-10% (multi-hop or genuinely ambiguous)
- Iteration 4+: <5% (forced termination — investigate these)
If significant traffic hits max iterations, either routing is broken or your corpus has coverage gaps.
Failure Isolation and Loop Bounding
Without explicit bounding, misbehaving loops drive latency and cost to unacceptable levels. Non-negotiable limits:
- max_iterations: 4 — never exceed
- timeout: 12s — wall-clock for entire loop
- min_new_chunks_per_iteration: 1 — if retrieval returns nothing new, break immediately
- context token budget — stop accepting chunks beyond the budget
On timeout or max iterations: generate with accumulated context + caveat, never return a 500 error.
Cost Reality
Single-pass RAG: ~$0.003/request
Agentic (2 iter): ~$0.006/request (2x)
Agentic (4 iter): ~$0.010/request (3-4x)
If 25% of traffic goes agentic at 2.5x cost → 37% total increase (acceptable). If 75% goes agentic → costs triple (likely unacceptable). The router controls your bill.
The Key Insight
An agentic system with no observability is not an improvement over single-pass — it's a more expensive pipeline that's harder to debug. The loop delivers quality improvement only when it is instrumented, bounded, and its behavior is understood at the query level.
Agency without accountability is just unpredictability.
Read the Full Article
This is a summary of my deep dive into agentic RAG architecture. The full article covers the complete system with production implementations:
👉 Designing Self-Correcting Retrieval Loops for Production — Full Article
The full article includes:
- Full agentic RAG architecture diagram (router → planner → loop → generation)
- Query planner implementation with multi-hop decomposition (Python/Anthropic)
- Iterative retrieval loop with async timeout and dedup
- Reflection agent prompt and calibration patterns
- Multi-tool orchestration with Claude tool-use API
- Hybrid router (rules-first + LLM fallback)
- Loop bounding with five hard limits
- Graceful degradation with context caveats
- Per-request cost model (single-pass vs 2-iter vs 4-iter)
- Latency budget breakdown and streaming response pattern
- Structured loop telemetry with structlog
- Alerting metrics for agentic systems
- Production deployment checklist
Top comments (0)