Letting AI Control RAG Search Improved Accuracy by 79%
Most RAG (Retrieval-Augmented Generation) search pipelines are built like this:
Query → vector search → Top-K retrieval → dump everything into LLM
This fixed pipeline is the root cause of RAG's accuracy ceiling.
A February 2026 arXiv paper (arXiv:2602.03442) proposed A-RAG (Agentic RAG), which replaces the fixed search pipeline with an AI agent. Result: multi-hop QA accuracy improved by as much as 79% relative (50.2% → 89.7% on 2WikiMultiHopQA), while retrieved tokens dropped by half.
Higher accuracy with less retrieval. Here's how this counter-intuitive result works.
Three Limits of Fixed-Pipeline RAG
Limit 1: Weak on Multi-Hop Questions
Question: "Where did the person who invented X attend university?"
Required searches:
Round 1: "Who invented X" → identify the person
Round 2: "That person's university" → get the answer
Fixed pipeline:
One vector search for "inventor of X + university"
→ No chunk directly contains the answer
→ Retrieves many low-relevance chunks
→ LLM guesses → inaccurate
Multi-hop questions make up a substantial share of real queries. Fixed pipelines are structurally weak against questions that can't be answered in one search.
Limit 2: Fixed Retrieval Granularity
With Top-K=5:
Simple question → retrieves 5 chunks → token waste
Complex question → retrieves 5 chunks → insufficient information
Required granularity varies per question:
"What's GPT-4's parameter count?" → 1 chunk is enough
"How does GPT-4 vs Claude 3.5 differ on long context?" → 10 chunks needed
Limit 3: Fixed Search Strategy
Vector search only:
Retrieves semantically similar chunks
→ Weak on exact matches (part numbers, proper nouns)
Keyword search only:
Retrieves exact/partial matches
→ Weak on synonyms and paraphrases
Hybrid search (fixed ratio):
70% vector + 30% keyword (or similar fixed weights)
→ Can't dynamically adjust based on question type
A-RAG Architecture: Let the Agent Search
A-RAG's core insight: replace the fixed search flow with agent decision-making.
Traditional RAG:
Query → [fixed pipeline] → chunks → LLM → answer
Search method, granularity, and count all pre-determined
A-RAG:
Query → [agent] → answer
Agent autonomously decides:
- Which search tool to use
- How many times to search
- What granularity to retrieve at
- When to stop searching
Three Search Interfaces
A-RAG gives the agent three tools:
class ARAGTools:
def keyword_search(self, query: str) -> list[str]:
"""Keyword-based search
Use case: proper nouns, part numbers, exact terms"""
pass
def semantic_search(self, query: str) -> list[str]:
"""Vector similarity search
Use case: conceptual similarity, paraphrase handling"""
pass
def chunk_read(self, doc_id: str, chunk_range: str) -> str:
"""Deep read of specific chunks
Use case: drilling into search results, getting surrounding context"""
pass
The agent freely combines these tools based on the question.
Multi-Hop Question Example
Question: "What is the current affiliation of the first author
of the paper that proposed Transformers?"
Agent behavior:
Step 1: keyword_search("Transformer paper original authors")
→ "Attention Is All You Need", Vaswani et al., 2017
Step 2: semantic_search("Ashish Vaswani current affiliation 2026")
→ Retrieves 3 chunks
Step 3: chunk_read(doc_id="result_2", range="full")
→ Deep reads detailed info
Step 4: Generate answer → "Essential AI (startup founded 2023)"
Fixed pipeline would:
vector_search("first author current affiliation Transformer paper")
→ Unlikely to get direct answer in one search
→ Mostly retrieves content about "Attention Is All You Need"
→ Risk of answering with 2017-era Google affiliation
Benchmark Results: A-RAG by the Numbers
Key results from the paper (Table 1).
GPT-4o-mini Backend
| Benchmark | Naive RAG | A-RAG | Improvement |
|---|---|---|---|
| MuSiQue | 38.6% | 46.1% | +19% |
| HotpotQA | 74.5% | 77.1% | +3.5% |
| 2WikiMultiHopQA | 42.6% | 60.2% | +41% |
GPT-5-mini Backend
| Benchmark | Naive RAG | A-RAG | Improvement |
|---|---|---|---|
| MuSiQue | 52.8% | 74.1% | +40% |
| HotpotQA | 81.2% | 94.5% | +16% |
| 2WikiMultiHopQA | 50.2% | 89.7% | +79% |
Pattern Analysis
patterns = {
"multi_hop_improvement": {
"2Wiki": "+41% (4o-mini) / +79% (5-mini)",
"MuSiQue": "+19% (4o-mini) / +40% (5-mini)",
"insight": "Bigger gains on multi-hop questions"
},
"model_scaling": {
"4o_mini_avg": "+21%",
"5_mini_avg": "+45%",
"insight": "Stronger models benefit more from A-RAG"
},
"graphrag_comparison": {
"HotpotQA_graphrag_4o_mini": "33.2%",
"HotpotQA_graphrag_5_mini": "82.5%",
"HotpotQA_naive_rag": "74.5% / 81.2%",
"insight": "GraphRAG is extremely model-dependent. Collapses with weak models"
}
}
Three key takeaways:
- Dominant improvement on multi-hop: +79% on 2WikiMultiHopQA. A-RAG is strongest where fixed pipelines are weakest
- Scales with model capability: GPT-5-mini gains more than GPT-4o-mini. Agent search quality depends on model intelligence
- GraphRAG is extremely model-dependent: With GPT-4o-mini, HotpotQA drops to 33.2% (less than half of Naive RAG). But with GPT-5-mini, GraphRAG hits 82.5%, beating Naive RAG's 81.2%. Weak-model GraphRAG is dangerous
Token Efficiency
HotpotQA (GPT-5-mini):
Naive RAG: 5,358 tokens retrieved → 81.2% accuracy
A-RAG: 2,737 tokens retrieved → 94.5% accuracy
Retrieved tokens: -49%
Accuracy: +16%
Less retrieval, higher accuracy. The agent selectively retrieves only what's needed, reducing noise and improving LLM answer quality. This directly impacts API costs.
Can Agentic RAG Run on Local LLMs?
The paper uses GPT-4o-mini and GPT-5-mini. What about local models?
Structural Challenges
agent_requirements = {
"tool_use": "Function calling capability",
"planning": "Multi-step planning",
"reflection": "Evaluating search results, deciding next action",
"context_management": "Maintaining and integrating retrieved info",
}
local_llm_capability = {
"Qwen2.5-32B Q4_K_M": {
"tool_use": "Supported (ChatML format)",
"planning": "Moderate (simple 2-3 steps)",
"speed": "~10 t/s (ngl=24)",
"verdict": "Simple Agentic RAG works, complex multi-hop is difficult"
},
"Qwen3.5-9B Q4_K_M": {
"tool_use": "Supported",
"planning": "Moderate",
"speed": "~33 t/s",
"verdict": "Fast but knowledge-limited. Search judgment quality may drop"
}
}
Minimal Implementation
# Local LLM + ChromaDB minimal Agentic RAG
import chromadb
import subprocess, json
class LocalAgenticRAG:
def __init__(self, db_path, model_path):
self.chroma = chromadb.PersistentClient(path=db_path)
self.collection = self.chroma.get_collection("papers")
self.model = model_path
def keyword_search(self, query, k=5):
results = self.collection.query(
query_texts=[query], n_results=k,
where_document={"$contains": query.split()[0]}
)
return results["documents"][0]
def semantic_search(self, query, k=5):
results = self.collection.query(
query_texts=[query], n_results=k
)
return results["documents"][0]
def agent_query(self, question):
context = []
for step in range(3):
ctx_str = json.dumps(context, ensure_ascii=False)[:2000]
prompt = f"""Question: {question}
Tools: keyword_search, semantic_search, chunk_read
Retrieved: {ctx_str}
Decide: TOOL:name(args) or ANSWER:your answer"""
response = self._llm_call(prompt)
if response.startswith("ANSWER:"):
return response[7:]
elif response.startswith("TOOL:"):
context.append(self._execute_tool(response[5:]))
return self._llm_call(
f"Based on: {json.dumps(context, ensure_ascii=False)}\nAnswer: {question}"
)
def _execute_tool(self, tool_call):
if tool_call.startswith("keyword_search"):
q = tool_call.split("(", 1)[1].rstrip(")")
return str(self.keyword_search(q.strip("'\"")))
elif tool_call.startswith("semantic_search"):
q = tool_call.split("(", 1)[1].rstrip(")")
return str(self.semantic_search(q.strip("'\"")))
elif tool_call.startswith("chunk_read"):
doc_id = tool_call.split("(", 1)[1].rstrip(")")
return self.chunk_read(doc_id.strip("'\""))
return ""
This works, but don't expect the paper's +79%. Local LLM tool_use capability is the limiting factor.
Realistic Expectations
expected_improvement = {
    "32B_model": {
        "multi_hop": "+15-25% (roughly 1/3 of paper results)",
        "single_hop": "+3-5%",
    },
    "9B_model": {
        "multi_hop": "+5-10%",
        "single_hop": "+1-3%",
    },
    "recommendation": "32B+ needed for meaningful Agentic RAG benefits",
}
Before You Try Agentic RAG
A-RAG is compelling but not universally necessary; the checklists and routing sketch below can help you decide.
Agentic RAG makes sense when:
✓ Multi-hop questions are frequent (research, investigation)
✓ Large knowledge base (1000+ chunks)
✓ Variable question complexity
✓ Accuracy is top priority (medical, legal)
Naive RAG is sufficient when:
✓ Single-hop questions dominate (FAQ, manual lookup)
✓ Small knowledge base (< 100 chunks)
✓ Uniform question patterns
✓ Latency is top priority
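A minimal sketch of that split in practice, routing on a crude multi-hop heuristic (the cue list is illustrative; a classifier or the LLM itself would do this better):
MULTI_HOP_CUES = ("author of", "inventor of", "compared to",
                  "difference between", " vs ")

def route(question: str) -> str:
    # Send likely multi-hop questions to the agent, the rest
    # through the cheap fixed pipeline
    q = question.lower()
    if any(cue in q for cue in MULTI_HOP_CUES):
        return "agentic_rag"
    return "naive_rag"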
Cost Structure
cost_comparison = {
    "naive_rag": {
        "llm_calls": 1,
        "search_calls": 1,
        "avg_tokens": 5358,
        "latency": "1-2s (API) / 5-10s (local 32B)",
    },
    "agentic_rag": {
        "llm_calls": "2-4",
        "search_calls": "2-5",
        "avg_tokens": 2737,
        "latency": "3-8s (API) / 15-40s (local 32B)",
    },
}
# Price of +79% accuracy: 3-4x latency
# ROI: Worth considering if multi-hop queries exceed 30% of traffic
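One way to estimate the token savings from routing only multi-hop traffic to A-RAG, using the retrieved-token figures above (illustrative; counts retrieval tokens only and ignores A-RAG's extra per-call prompt overhead):
def blended_tokens(multi_hop_share: float) -> float:
    # Route multi-hop traffic to A-RAG, everything else to Naive RAG
    return multi_hop_share * 2737 + (1 - multi_hop_share) * 5358

print(blended_tokens(0.3))  # ~4572 retrieved tokens/query vs 5358 all-naive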
The Next Evolution of RAG Is Agentic
The summary:
- Fixed-pipeline RAG is limited by search design: Multi-hop weakness, fixed granularity, fixed strategy
- A-RAG lets the agent decide how to search: Three tools (keyword/semantic/chunk_read) selected autonomously
- +79% improvement on multi-hop: Biggest gains where fixed pipelines are weakest
- Less retrieval, higher accuracy: -49% tokens, +16% accuracy
- Stronger models benefit more: Local LLMs see smaller gains
RAG is evolving from human-designed search optimization to model-driven search decisions. Fixed pipelines are stable and predictable but can't adapt to question diversity. Agents are less predictable but adaptive.
If your RAG system uses a fixed pipeline, measure your multi-hop question rate first. If it exceeds 30%, Agentic RAG is worth investigating.
References
- "A-RAG: Scaling Agentic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces" (2026) arXiv:2602.03442
- "Retrieval-Augmented Generation: A Comprehensive Survey" (2025) arXiv:2506.00054
- "Ragas: Automated Evaluation of Retrieval Augmented Generation" (2023) arXiv:2309.15217