plasmon

Posted on Apr 6

Letting AI Control RAG Search Improved Accuracy by 79%

#ai #llm #rag #machinelearning

Most RAG (Retrieval-Augmented Generation) search pipelines are built like this:

Query → vector search → Top-K retrieval → dump everything into LLM

This fixed pipeline is the root cause limiting RAG accuracy.

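A February 2026 arXiv paper (arXiv:2602.03442) proposed A-RAG (Agentic RAG), which replaces the fixed search pipeline with an AI agent. The headline result: with a GPT-5-mini backend, accuracy on the multi-hop benchmark 2WikiMultiHopQA rose from 50.2% to 89.7%, a 79% relative improvement (+39.5 points). On HotpotQA, retrieved tokens dropped by about half at the same time.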

Higher accuracy with less retrieval. Here's how this counter-intuitive result works.

Three Limits of Fixed-Pipeline RAG

Limit 1: Weak on Multi-Hop Questions

Question: "Where did the person who invented X attend university?"

Required searches:
  Round 1: "Who invented X" → identify the person
  Round 2: "That person's university" → get the answer

Fixed pipeline:
  One vector search for "inventor of X + university"
  → No chunk directly contains the answer
  → Retrieves many low-relevance chunks
  → LLM guesses → inaccurate

Multi-hop questions make up a substantial share of real queries. Fixed pipelines are structurally weak against questions that can't be answered in one search.
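The two-round search above can be sketched with a toy in-memory corpus. Everything here is hypothetical: the CORPUS entries, the names in them, and the word-overlap search helper standing in for a real retriever. It only illustrates that the second query cannot be written until the first one has returned a name.

```python
import re

# Hypothetical three-chunk corpus; names and facts are made up for illustration.
CORPUS = [
    "The widget was invented by Ada Example in 1990.",
    "Ada Example studied physics at Sample University.",
    "Widgets are commonly used in factories.",
]

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def search(query: str) -> str:
    """Return the chunk sharing the most words with the query (toy retriever)."""
    q = tokens(query)
    return max(CORPUS, key=lambda chunk: len(q & tokens(chunk)))

# Fixed pipeline: one search for the whole question.
one_shot = search("where did the inventor of the widget attend university")

# Round 1: identify the person.
hop1 = search("who invented the widget")
person = re.search(r"invented by ([A-Z]\w+ [A-Z]\w+)", hop1).group(1)

# Round 2: this query depends on the name found in round 1.
hop2 = search(f"{person} studied at which university")

print("one-shot:", one_shot)
print("two-hop: ", hop2)
```

In this toy setup the single search lands on the chunk about the invention, which does not mention a university, while the second hop reaches the chunk that does.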

Limit 2: Fixed Retrieval Granularity

With Top-K=5:
  Simple question → retrieves 5 chunks → token waste
  Complex question → retrieves 5 chunks → insufficient information

Required granularity varies per question:
  "What's GPT-4's parameter count?" → 1 chunk is enough
  "How do GPT-4 and Claude 3.5 differ on long context?" → 10 chunks needed
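For contrast with a fixed Top-K, here is a minimal sketch of one simple alternative: keeping every chunk above a similarity threshold. This is not how A-RAG handles granularity (there the agent decides); the select_chunks helper, the threshold of 0.5, and the scores are all assumptions made for illustration.

```python
def select_chunks(scored_chunks, min_score=0.5, max_chunks=10):
    """Keep chunks scoring at least min_score, best first, capped at max_chunks."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    return [text for text, score in ranked if score >= min_score][:max_chunks]

# Hypothetical similarity scores for a simple and a complex query.
simple_query = [("param count chunk", 0.91), ("release notes", 0.32), ("pricing", 0.20)]
complex_query = [(f"comparison chunk {i}", 0.60 + 0.03 * i) for i in range(8)]

print(len(select_chunks(simple_query)))   # only one chunk clears the threshold
print(len(select_chunks(complex_query)))  # all eight chunks clear it
```

A static threshold still has to be tuned by hand, which is the kind of pre-determined choice the agentic approach tries to remove.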

Limit 3: Fixed Search Strategy

Vector search only:
  Retrieves semantically similar chunks
  → Weak on exact matches (part numbers, proper nouns)

Keyword search only:
  Retrieves exact/partial matches
  → Weak on synonyms and paraphrases

Hybrid search (fixed ratio):
  70% vector + 30% keyword (or similar fixed weights)
  → Can't dynamically adjust based on question type
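A small sketch of why fixed hybrid weights can misrank results. The hybrid_score function is a plain linear blend, and the candidate chunks and their scores are hypothetical values chosen for a part-number lookup.

```python
def hybrid_score(vector_score, keyword_score, vector_weight=0.7):
    """Linear blend of two scores in [0, 1]; the keyword weight is 1 - vector_weight."""
    return vector_weight * vector_score + (1.0 - vector_weight) * keyword_score

# Hypothetical scores for the query "AX-100 torque spec" (a part-number lookup).
candidates = {
    "exact part-number match": {"vector": 0.20, "keyword": 1.00},
    "generic torque overview": {"vector": 0.70, "keyword": 0.00},
}

def best(vector_weight):
    """Name of the top-ranked candidate under the given vector weight."""
    return max(
        candidates,
        key=lambda name: hybrid_score(
            candidates[name]["vector"], candidates[name]["keyword"], vector_weight
        ),
    )

print(best(0.7))  # fixed 70/30 weights
print(best(0.3))  # keyword-heavy weights
```

With these numbers the 70/30 blend ranks the generic overview first, while a keyword-heavy blend surfaces the exact match. No single fixed ratio is right for both kinds of question.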

A-RAG Architecture: Let the Agent Search

A-RAG's core insight: replace the fixed search flow with agent decision-making.

Traditional RAG:
  Query → [fixed pipeline] → chunks → LLM → answer
  Search method, granularity, and count all pre-determined

A-RAG:
  Query → [agent] → answer
  Agent autonomously decides:
    - Which search tool to use
    - How many times to search
    - What granularity to retrieve at
    - When to stop searching

Three Search Interfaces

A-RAG gives the agent three tools:

class ARAGTools:
    def keyword_search(self, query: str) -> list[str]:
        """Keyword-based search
        Use case: proper nouns, part numbers, exact terms"""
        pass

    def semantic_search(self, query: str) -> list[str]:
        """Vector similarity search
        Use case: conceptual similarity, paraphrase handling"""
        pass

    def chunk_read(self, doc_id: str, chunk_range: str) -> str:
        """Deep read of specific chunks
        Use case: drilling into search results, getting surrounding context"""
        pass

The agent freely combines these tools based on the question.
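To make the three interfaces concrete, here is a toy in-memory version. It is a sketch under stated assumptions, not the paper's implementation: ToyTools is a hypothetical class, word overlap stands in for embedding similarity, search results are (doc_id, chunk_index) pairs, and chunk_range is assumed to be a "start-end" string.

```python
import re

class ToyTools:
    """In-memory stand-ins for the three interfaces (not the paper's code)."""

    def __init__(self, docs: dict[str, list[str]]):
        self.docs = docs  # doc_id -> ordered list of chunks

    def _all_chunks(self):
        for doc_id, chunks in self.docs.items():
            for index, text in enumerate(chunks):
                yield doc_id, index, text

    def keyword_search(self, query: str) -> list[tuple[str, int]]:
        """Case-insensitive exact substring match."""
        needle = query.lower()
        return [(d, i) for d, i, t in self._all_chunks() if needle in t.lower()]

    def semantic_search(self, query: str, k: int = 2) -> list[tuple[str, int]]:
        """Word overlap as a crude stand-in for embedding similarity."""
        q = set(re.findall(r"\w+", query.lower()))
        scored = [
            (len(q & set(re.findall(r"\w+", t.lower()))), d, i)
            for d, i, t in self._all_chunks()
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [(d, i) for score, d, i in scored[:k] if score > 0]

    def chunk_read(self, doc_id: str, chunk_range: str) -> str:
        """Return chunks start..end (inclusive) given a 'start-end' string."""
        start, end = (int(part) for part in chunk_range.split("-"))
        return " ".join(self.docs[doc_id][start:end + 1])

tools = ToyTools({
    "manual": [
        "Part AX-100 is a valve.",
        "It must be replaced every two years.",
        "Contact support for returns.",
    ]
})
hits = tools.keyword_search("AX-100")
context = tools.chunk_read("manual", "0-1")
print(hits, context)
```

The usage at the bottom mirrors the intended flow: a search call locates a chunk, then chunk_read pulls in its neighbors for surrounding context.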

Multi-Hop Question Example

Question: "What is the current affiliation of the first author
           of the paper that proposed Transformers?"

Agent behavior:
  Step 1: keyword_search("Transformer paper original authors")
    → "Attention Is All You Need", Vaswani et al., 2017
  Step 2: semantic_search("Ashish Vaswani current affiliation 2026")
    → Retrieves 3 chunks
  Step 3: chunk_read(doc_id="result_2", chunk_range="full")
    → Deep reads detailed info
  Step 4: Generate answer → "Essential AI (startup founded 2023)"

Fixed pipeline would:
  vector_search("first author current affiliation Transformer paper")
  → Unlikely to get direct answer in one search
  → Mostly retrieves content about "Attention Is All You Need"
  → Risk of answering with 2017-era Google affiliation

Benchmark Results: A-RAG by the Numbers

Key results from the paper (Table 1).

GPT-4o-mini Backend

Benchmark	Naive RAG	A-RAG	Relative improvement
MuSiQue	38.6%	46.1%	+19%
HotpotQA	74.5%	77.1%	+3.5%
2WikiMultiHopQA	42.6%	60.2%	+41%

GPT-5-mini Backend

Benchmark	Naive RAG	A-RAG	Relative improvement
MuSiQue	52.8%	74.1%	+40%
HotpotQA	81.2%	94.5%	+16%
2WikiMultiHopQA	50.2%	89.7%	+79%
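The improvement figures in these tables are relative gains, (A-RAG - Naive) / Naive, not percentage-point differences. A short check using only the accuracy values from the two tables (the relative_gain helper is just for illustration):

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement in percent: (improved - baseline) / baseline * 100."""
    return (improved - baseline) / baseline * 100.0

# (Naive RAG accuracy, A-RAG accuracy) taken from the tables above.
results = {
    "GPT-4o-mini": {
        "MuSiQue": (38.6, 46.1),
        "HotpotQA": (74.5, 77.1),
        "2WikiMultiHopQA": (42.6, 60.2),
    },
    "GPT-5-mini": {
        "MuSiQue": (52.8, 74.1),
        "HotpotQA": (81.2, 94.5),
        "2WikiMultiHopQA": (50.2, 89.7),
    },
}

for backend, rows in results.items():
    for bench, (naive, arag) in rows.items():
        print(
            f"{backend:12} {bench:16} "
            f"+{relative_gain(naive, arag):.1f}% relative, "
            f"+{arag - naive:.1f} points"
        )
```

Printing both forms makes the distinction visible: the +79% on 2WikiMultiHopQA corresponds to a gain of 39.5 percentage points.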

Pattern Analysis

patterns = {
    "multi_hop_improvement": {
        "2Wiki": "+41% (4o-mini) / +79% (5-mini)",
        "MuSiQue": "+19% (4o-mini) / +40% (5-mini)",
        "insight": "Bigger gains on multi-hop questions"
    },
    "model_scaling": {
        "4o_mini_avg": "+21%",
        "5_mini_avg": "+45%",
        "insight": "Stronger models benefit more from A-RAG"
    },
    "graphrag_comparison": {
        "HotpotQA_graphrag_4o_mini": "33.2%",
        "HotpotQA_graphrag_5_mini": "82.5%",
        "HotpotQA_naive_rag": "74.5% / 81.2%",
        "insight": "GraphRAG is extremely model-dependent. Collapses with weak models"
    }
}

Three key takeaways:

1. Dominant improvement on multi-hop: +79% on 2WikiMultiHopQA. A-RAG is strongest where fixed pipelines are weakest.
2. Scales with model capability: GPT-5-mini gains more than GPT-4o-mini. Agent search quality depends on model intelligence.
3. GraphRAG is extremely model-dependent: With GPT-4o-mini, HotpotQA drops to 33.2% (less than half of Naive RAG). But with GPT-5-mini, GraphRAG hits 82.5%, beating Naive RAG's 81.2%. Weak-model GraphRAG is dangerous.

Token Efficiency

HotpotQA (GPT-5-mini):
  Naive RAG: 5,358 tokens retrieved → 81.2% accuracy
  A-RAG:     2,737 tokens retrieved → 94.5% accuracy

Retrieved tokens: -49%
Accuracy: +16%

Less retrieval, higher accuracy. The agent selectively retrieves only what's needed, reducing noise and improving LLM answer quality. This directly impacts API costs.
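The arithmetic behind the token figure, plus a rough cost illustration. The token counts come from the numbers above; the price per million input tokens is a hypothetical placeholder, not a real quote, and the cost covers retrieved context only, not the prompts of the extra agent calls.

```python
NAIVE_TOKENS, ARAG_TOKENS = 5358, 2737  # HotpotQA, GPT-5-mini (figures above)

# Relative change in retrieved tokens, in percent.
token_change_pct = (ARAG_TOKENS - NAIVE_TOKENS) / NAIVE_TOKENS * 100

# Assumption: an illustrative input price, not taken from any provider.
PRICE_PER_MILLION_INPUT_TOKENS = 0.25  # USD, hypothetical

def retrieved_context_cost(tokens_per_query: int, queries: int = 1000) -> float:
    """Cost in USD of the retrieved context alone for the given number of queries."""
    return tokens_per_query * queries * PRICE_PER_MILLION_INPUT_TOKENS / 1_000_000

print(f"retrieved tokens: {token_change_pct:.1f}%")
print(f"naive: ${retrieved_context_cost(NAIVE_TOKENS):.2f} per 1000 queries")
print(f"a-rag: ${retrieved_context_cost(ARAG_TOKENS):.2f} per 1000 queries")
```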

Can Agentic RAG Run on Local LLMs?

The paper uses GPT-4o-mini and GPT-5-mini. What about local models?

Structural Challenges

agent_requirements = {
    "tool_use": "Function calling capability",
    "planning": "Multi-step planning",
    "reflection": "Evaluating search results, deciding next action",
    "context_management": "Maintaining and integrating retrieved info",
}

local_llm_capability = {
    "Qwen2.5-32B Q4_K_M": {
        "tool_use": "Supported (ChatML format)",
        "planning": "Moderate (simple 2-3 steps)",
        "speed": "~10 t/s (ngl=24)",
        "verdict": "Simple Agentic RAG works, complex multi-hop is difficult"
    },
    "Qwen3.5-9B Q4_K_M": {
        "tool_use": "Supported",
        "planning": "Moderate",
        "speed": "~33 t/s",
        "verdict": "Fast but knowledge-limited. Search judgment quality may drop"
    }
}

Minimal Implementation

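# Local LLM + ChromaDB minimal Agentic RAG
import json
import re

import chromadb


class LocalAgenticRAG:
    def __init__(self, db_path, model_path):
        self.chroma = chromadb.PersistentClient(path=db_path)
        self.collection = self.chroma.get_collection("papers")
        self.model = model_path

    def _search(self, query, k, **filters):
        # Return IDs alongside text so the agent can pass an ID to chunk_read
        results = self.collection.query(query_texts=[query], n_results=k, **filters)
        return [
            {"id": i, "text": d}
            for i, d in zip(results["ids"][0], results["documents"][0])
        ]

    def keyword_search(self, query, k=5):
        # Rough keyword filter: keep only chunks containing the first query word
        words = query.split()
        if not words:
            return []
        return self._search(query, k, where_document={"$contains": words[0]})

    def semantic_search(self, query, k=5):
        return self._search(query, k)

    def chunk_read(self, doc_id):
        # Full text of one chunk, looked up by its ChromaDB ID
        docs = self.collection.get(ids=[doc_id])["documents"]
        return docs[0] if docs else ""

    def _llm_call(self, prompt):
        # Not shown in the original post: connect this to your local runtime
        # (for example llama.cpp) and return the generated text as a string.
        raise NotImplementedError("wire _llm_call to a local LLM runtime")

    def agent_query(self, question, max_steps=3):
        context = []
        for _ in range(max_steps):
            ctx_str = json.dumps(context, ensure_ascii=False)[:2000]
            prompt = f"""Question: {question}
Tools: keyword_search, semantic_search, chunk_read
Retrieved: {ctx_str}
Decide: TOOL:name(args) or ANSWER:your answer"""

            response = self._llm_call(prompt).strip()
            if response.startswith("ANSWER:"):
                return response[len("ANSWER:"):].strip()
            if response.startswith("TOOL:"):
                context.append(self._execute_tool(response[len("TOOL:"):]))

        # Step budget exhausted: answer from whatever was retrieved
        return self._llm_call(
            f"Based on: {json.dumps(context, ensure_ascii=False)}\nAnswer: {question}"
        )

    def _execute_tool(self, tool_call):
        # Parse "name(arg)" and dispatch; unknown or malformed calls return ""
        match = re.match(r"\s*(\w+)\((.*)\)\s*$", tool_call, re.DOTALL)
        if not match:
            return ""
        name, arg = match.group(1), match.group(2).strip().strip("'\"")
        if name == "keyword_search":
            return json.dumps(self.keyword_search(arg), ensure_ascii=False)
        if name == "semantic_search":
            return json.dumps(self.semantic_search(arg), ensure_ascii=False)
        if name == "chunk_read":
            return self.chunk_read(arg)
        return ""

Once _llm_call is connected to a local runtime, this works, but don't expect the paper's +79%. Local LLM tool_use capability is the limiting factor.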
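Because the loop depends on the model sticking to the TOOL:name(args) / ANSWER:text convention, it may help to parse replies defensively and treat anything else as invalid instead of silently burning a step. A minimal sketch, where parse_reply and its return shapes are hypothetical helpers and not part of any library:

```python
import re

TOOL_PATTERN = re.compile(r"^TOOL:\s*(\w+)\((.*)\)\s*$", re.DOTALL)

def parse_reply(reply: str):
    """Classify a reply as ('answer', text), ('tool', name, arg) or ('invalid', raw)."""
    text = reply.strip()
    if text.startswith("ANSWER:"):
        return ("answer", text[len("ANSWER:"):].strip())
    match = TOOL_PATTERN.match(text)
    if match:
        name, arg = match.groups()
        # Drop surrounding whitespace and quotes from the single argument
        return ("tool", name, arg.strip().strip("'\""))
    return ("invalid", text)

print(parse_reply(' TOOL:keyword_search("Vaswani")\n'))
print(parse_reply("ANSWER: Essential AI"))
print(parse_reply("I think we should search first."))
```

An invalid result can then trigger a retry with a reminder of the format, which is one way to compensate for weaker tool-use ability in small models.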

Realistic Expectations

expected_improvement = {
    "32B_model": {
        "multi_hop": "+15-25% (roughly 1/3 of paper results)",
        "single_hop": "+3-5%",
    },
    "9B_model": {
        "multi_hop": "+5-10%",
        "single_hop": "+1-3%",
    },
    "recommendation": "32B+ needed for meaningful Agentic RAG benefits"
}

Before You Try Agentic RAG

A-RAG is compelling but not universally necessary.

Agentic RAG makes sense when:
  ✓ Multi-hop questions are frequent (research, investigation)
  ✓ Large knowledge base (1000+ chunks)
  ✓ Variable question complexity
  ✓ Accuracy is top priority (medical, legal)

Naive RAG is sufficient when:
  ✓ Single-hop questions dominate (FAQ, manual lookup)
  ✓ Small knowledge base (< 100 chunks)
  ✓ Uniform question patterns
  ✓ Latency is top priority

Cost Structure

cost_comparison = {
    "naive_rag": {
        "llm_calls": 1,
        "search_calls": 1,
        "avg_tokens": 5358,
        "latency": "1-2s (API) / 5-10s (local 32B)",
    },
    "agentic_rag": {
        "llm_calls": "2-4",
        "search_calls": "2-5",
        "avg_tokens": 2737,
        "latency": "3-8s (API) / 15-40s (local 32B)",
    },
}
# Price of the accuracy gain (up to +79% relative on multi-hop): 3-4x latency
# ROI: Worth considering if multi-hop queries exceed 30% of traffic
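One way to reason about the latency trade-off is to estimate average latency when only part of the traffic takes the agentic path. This is a sketch under assumptions: the defaults are the midpoints of the API latency ranges above (1.5 s and 5.5 s), and a router that sends only multi-hop queries to the agent is hypothetical.

```python
def blended_latency(agentic_share: float, naive_s: float = 1.5, agentic_s: float = 5.5) -> float:
    """Mean latency in seconds when agentic_share of queries take the agentic path."""
    if not 0.0 <= agentic_share <= 1.0:
        raise ValueError("agentic_share must be between 0 and 1")
    return (1.0 - agentic_share) * naive_s + agentic_share * agentic_s

for share in (0.1, 0.3, 0.5):
    print(f"{share:.0%} agentic -> {blended_latency(share):.2f}s average")
```

Under these assumptions, routing 30% of queries to the agent gives an average of about 2.7 s, compared with 1.5 s for a purely naive pipeline.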

The Next Evolution of RAG Is Agentic

The summary:

1. Fixed-pipeline RAG is limited by search design: Multi-hop weakness, fixed granularity, fixed strategy.
2. A-RAG lets the agent decide how to search: Three tools (keyword/semantic/chunk_read) selected autonomously.
3. +79% improvement on multi-hop: Biggest gains where fixed pipelines are weakest.
4. Less retrieval, higher accuracy: -49% tokens, +16% accuracy.
5. Stronger models benefit more: Local LLMs see smaller gains.

RAG is evolving from human-designed search optimization to model-driven search decisions. Fixed pipelines are stable and predictable but can't adapt to question diversity. Agents are less predictable but adaptive.

If your RAG system uses a fixed pipeline, measure your multi-hop question rate first. If it exceeds 30%, Agentic RAG is worth investigating.

References

"A-RAG: Scaling Agentic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces" (2026) arXiv:2602.03442
"Retrieval-Augmented Generation: A Comprehensive Survey" (2025) arXiv:2506.00054
"Ragas: Automated Evaluation of Retrieval Augmented Generation" (2023) arXiv:2309.15217
