Yaohua Chen

Posted on May 26

The Representation Problem: Why RAG vs. Agentic Search Is the Wrong Debate

#rag #ai #llm #architecture

The industry has been asking the wrong question.

When Boris Cherny, the creator and head of Claude Code, revealed on the Latent Space podcast that Anthropic's flagship coding agent had abandoned RAG entirely and switched to what he called "Agentic Search," the discourse fractured predictably. One camp declared RAG dead. Another pushed back, arguing that RAG remained perfectly valid for most applications. Critics of the new approach pointed out that agentic search burns far more tokens and takes longer to respond than a simple vector lookup. Defenders countered that a RAG index starts going stale as soon as the underlying data changes, making retrieval unreliable in dynamic environments. Both sides were making real points, but about different problems, with different data, in different contexts. The debate hardened into a false binary: old-school retrieval versus new-school agents, as if every system had to pick a side.

But that framing misses the more interesting thing that's actually happening. The field isn't splitting into two camps. It's splitting into five. And the reason isn't that one retrieval method is better than another — it's that different data types have fundamentally different natural representations, and we're finally building systems that respect that.

A live codebase is not a document corpus. A financial report is not a bag of text chunks. A personal knowledge base built over months is not a static FAQ. When you force all of these through the same pipeline (chunk, embed, store in a vector database, retrieve by cosine similarity), you're discarding the most structurally useful information before you ever run a query.

This is the representation problem. And solving it is what's driving the fragmentation.

What Started the Conversation

In May 2025, Boris Cherny went on the Latent Space podcast and publicly explained why Claude Code had dropped RAG from its architecture. This mattered not because of the specific tool choice, but because of who made it. Claude Code is widely considered one of the most capable coding agents available, and the team behind it ran the experiment seriously before abandoning the approach.

Cherny's explanation was precise. Claude Code originally used the standard pattern: vector-index the codebase, retrieve semantically relevant snippets when the user asks something, stitch them into a prompt. In practice, three things broke it.

First, the intelligence ceiling. RAG retrieval finds things that are similar to the query vector — but code tasks often require reasoning about things that aren't semantically adjacent to the query at all. A bug in an authentication function might trace through three layers of indirection to a configuration file that shares almost no vocabulary with the original error message. Vector similarity doesn't model causality or call chains.

Second, freshness. A codebase at an active company changes constantly — dozens of commits per day in many teams. An index built this morning may not reflect the function signature changed this afternoon. Stale context in a coding agent doesn't just produce wrong answers; it can introduce bugs that look plausible.

Third, security. Indexing an entire codebase into a queryable vector store creates a detailed, structured map of your most sensitive business logic. If that store is compromised, the attacker has more than source files — they have an organized index of what everything does and how it connects.

The response wasn't to patch the index. It was to remove it.

Five Paradigms, Not Two

The Claude Code pivot opened a broader question: if not RAG, then what? And the answer turned out to depend entirely on what kind of data you're working with. Today there are at least five distinct retrieval paradigms in active production use, each matching a different data type and a different structure of query.

Paradigm 1: Vector RAG

This is the baseline most engineers know. Documents are chunked, embedded into high-dimensional vectors, and stored in a vector database. At query time, the query is embedded and the nearest vectors are retrieved by cosine similarity.
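A minimal sketch of the chunk, embed, retrieve loop described above. The `embed` function is a bag-of-words counter standing in for a real embedding model, and a plain Python list stands in for the vector database. Both are assumptions made for illustration, as are the sample text and the 12-word chunk size.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk(text: str, size: int = 12) -> list[str]:
    """Split text into fixed-size word windows (real systems usually count tokens)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks whose vectors are closest to the query vector."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


docs = (
    "To reset your password open the account settings page and choose reset password. "
    "Invoices are emailed on the first day of each month to the billing contact."
)
chunks = chunk(docs)
top = retrieve("how do I reset my password", chunks)
print(top[0])
```

Note that the fixed-size window cuts the first sentence in the middle, so the retrieved chunk ends before the sentence does. That is the chunk-boundary loss in miniature.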

Tools in this category include the standard vector databases — Pinecone, Weaviate, Qdrant, Chroma — paired with embedding models from OpenAI, Cohere, or open-source alternatives.

Where it works well: Unstructured text corpora with relatively stable content and straightforward information needs. FAQ systems, customer support knowledge bases, broad documentation search, semantic search over news archives. When a user asks "how do I reset my password?" and the answer is somewhere in a support wiki, cosine similarity over chunked text is a perfectly adequate tool.

Where it breaks down: Anywhere structure matters. A 200-page financial report has section headings, tables, cross-references between numbered items, and hierarchical organization that conveys meaning. When you chunk that into 512-token segments and embed them, you're turning a structured argument into a bag of fragments. The retrieval system no longer knows that table 3B is referenced by the footnote on page 47 — it just knows that both contain numbers and the word "revenue."

Vector RAG also struggles with anything that changes faster than the index update cycle, and with queries that require multi-hop reasoning rather than single-fact lookup.

Paradigm 2: Agentic Search

This is what Claude Code switched to. Instead of querying a pre-built index, the model uses tools — specifically Glob for pattern-matching file paths and Grep for full-text search within files — to explore the codebase in real time through a ReAct loop (Reason, Act, Observe, repeat).

The concrete mechanics look like this: the agent receives a task, forms a hypothesis about where relevant code might live, executes a targeted search, reads the result, updates its understanding, and searches again if needed. This is exactly how an experienced developer actually debugs — not by querying a semantic index of their codebase, but by reasoning about the problem and looking in the right places.
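The loop above can be sketched in a few lines. This is not Claude Code's implementation: `glob_tool` and `grep_tool` are hypothetical stand-ins for the Glob and Grep tools, and the "reasoning" step is a hard-coded heuristic (follow an assignment to the name on its right-hand side) where a real agent would ask the model what to search for next.

```python
import re
import tempfile
from pathlib import Path


def glob_tool(root: Path, pattern: str) -> list[Path]:
    """Pattern-match file paths under root."""
    return sorted(p for p in root.rglob(pattern) if p.is_file())


def grep_tool(paths: list[Path], regex: str) -> list[tuple[Path, int, str]]:
    """Full-text search: return (path, line number, line) for every match."""
    rx = re.compile(regex)
    hits = []
    for path in paths:
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if rx.search(line):
                hits.append((path, lineno, line.strip()))
    return hits


def trace_definition(root: Path, symbol: str, max_steps: int = 5) -> list[tuple[str, int, str]]:
    """Follow a symbol through assignments until a literal value is reached."""
    trail, target = [], symbol
    for _ in range(max_steps):
        # Act: search the live files, no prebuilt index involved.
        hits = grep_tool(glob_tool(root, "*.py"), rf"\b{re.escape(target)}\b\s*=")
        if not hits:
            break
        path, lineno, line = hits[0]
        trail.append((path.name, lineno, line))
        # Observe and reason: if the right-hand side is another name, search for it next.
        rhs = line.split("=", 1)[1].strip()
        match = re.fullmatch(r"(?:[A-Za-z_]\w*\.)*([A-Za-z_]\w*)", rhs)
        if not match or match.group(1) == target:
            break
        target = match.group(1)
    return trail


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "auth.py").write_text("timeout = settings.SESSION_TIMEOUT\n")
    (root / "settings.py").write_text("SESSION_TIMEOUT = 30\n")
    trail = trace_definition(root, "timeout")

for step in trail:
    print(step)
```

The second hop lands in a file that shares no vocabulary with the starting symbol beyond the name the first hop uncovered, which is the kind of indirection a similarity lookup tends to miss.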

Cline (formerly Claude Dev) uses the same approach. The agent has no index to go stale, no attack surface from an external vector store, and no ceiling imposed by the quality of the embedding model.

Where it works well: Live codebases that change constantly. Tasks requiring multi-hop reasoning — finding the definition of a symbol, tracing its callers, identifying side effects in adjacent modules. Any scenario where the intelligence of the traversal matters more than the speed of the lookup.

The real tradeoff: Cherny described it as trading time for intelligence. Agentic search consumes more tokens per query and takes longer than a vector lookup. For a coding agent where the user expects multi-second response times anyway, this is an acceptable trade. For a high-throughput retrieval system serving thousands of queries per second, it isn't.

Paradigm 3: Graph and AST Indexing

Between "no index" and "vector index" is a third approach: index the structure of the code rather than its semantic content.

Two tools represent this paradigm well.

codebase-memory-mcp (by DeusData) uses tree-sitter to parse source files into Abstract Syntax Trees across 155 programming languages, then builds a persistent knowledge graph where functions, classes, and modules are nodes and call relationships, inheritance, and imports are typed edges. Queries traverse this graph rather than computing cosine similarity over embeddings. The reported performance numbers are striking: the Linux kernel, at 28 million lines of code, indexes in approximately three minutes, and query latency is sub-millisecond. Because queries hit the graph directly instead of sending file contents to the LLM, the project reports roughly 99% fewer tokens than naive file-by-file analysis.

Understand-Anything takes the same underlying idea — turning a codebase into a knowledge graph — but optimizes for a different audience. Where codebase-memory-mcp is built for AI agents querying code programmatically during active development, Understand-Anything is built for humans trying to understand a codebase they didn't write. It produces an interactive visual dashboard you can pan, zoom, and search in a browser. It generates guided learning tours through the architecture, explains how code maps to business processes, and can produce onboarding guides for new team members. It also goes beyond pure code: point it at an LLM-maintained markdown wiki — a format popularized by Andrej Karpathy where an AI agent incrementally builds and cross-references a personal knowledge base from plain text files — and it builds a navigable knowledge graph of your notes and research.

Aider, the open-source coding assistant, pioneered a similar approach earlier: AST parsing to build a repository map of files, classes, and functions, which is passed to the model as context rather than the full file contents.

Choosing between codebase-memory-mcp and Understand-Anything: They serve different primary use cases and are not mutually exclusive.

Use codebase-memory-mcp when the primary consumer is an AI agent during active coding work. It is faster, lighter (a single binary with no dependencies), and optimized for structural queries that an agent needs mid-task: call graphs, dead code detection, cross-service HTTP links, impact analysis before a refactor. It uses far fewer tokens, which matters when you're running hundreds of agent queries per day.
Use Understand-Anything when the primary consumer is a human trying to orient themselves in an unfamiliar codebase — onboarding, architecture review, or explaining a system to a non-technical stakeholder. Its visual dashboard and guided tours are designed for exploration and comprehension, not programmatic lookup.

In practice, a team might use both: codebase-memory-mcp powering the AI coding agent in the background, and Understand-Anything run once when a new engineer joins or when the team needs a shared map of the architecture.

Where it works well: Large, relatively stable codebases where structural relationships are the primary thing you're querying. "What functions call authenticate()?" is trivially answered by graph traversal. "What would break if I change this interface?" is a reachability query. These are questions that vector similarity simply cannot answer — not because embedding models are bad, but because the question is fundamentally about graph structure, not semantic proximity.
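Both example questions reduce to a few lines once the graph exists. The sketch below uses Python's built-in `ast` module as a stand-in for tree-sitter and only tracks direct calls to plain names. The sample source and the helper names (`build_call_graph`, `callers_of`, `impacted_by`) are hypothetical and are not codebase-memory-mcp's API.

```python
import ast
from collections import defaultdict

SOURCE = '''
def load_user(name):
    return {"name": name}

def authenticate(name, password):
    user = load_user(name)
    return bool(user) and bool(password)

def login(name, password):
    return authenticate(name, password)

def api_handler(request):
    return login(request["name"], request["password"])
'''


def build_call_graph(source: str) -> dict[str, set[str]]:
    """Map each top-level function to the names it calls directly."""
    graph: dict[str, set[str]] = defaultdict(set)
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            callees = graph[node.name]  # creates the node even if it calls nothing
            for sub in ast.walk(node):
                if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                    callees.add(sub.func.id)
    return dict(graph)


def callers_of(graph: dict[str, set[str]], target: str) -> set[str]:
    """Direct callers: one hop along the reversed call edges."""
    return {fn for fn, callees in graph.items() if target in callees}


def impacted_by(graph: dict[str, set[str]], target: str) -> set[str]:
    """Transitive callers: everything that can reach `target` in the call graph."""
    seen: set[str] = set()
    frontier = [target]
    while frontier:
        for caller in callers_of(graph, frontier.pop()):
            if caller not in seen:
                seen.add(caller)
                frontier.append(caller)
    return seen


graph = build_call_graph(SOURCE)
print(callers_of(graph, "authenticate"))
print(impacted_by(graph, "authenticate"))
```

Neither query looks at the text of a function body at answer time, so the cost is a graph traversal rather than file contents sent to a model.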

Where it breaks down: Very recently changed code that hasn't been re-indexed yet (though incremental indexing mitigates this), and highly dynamic codebases where the structure itself is in flux. The graph also doesn't model runtime behavior — only static structure.

Paradigm 4: Reasoning-Based Tree-Indexed RAG

This paradigm addresses the structural limitation of vector chunking for long, hierarchical documents — and the results are striking enough that they reframe the basic assumptions of the field.

PageIndex, developed by VectifyAI, takes a fundamentally different approach to document retrieval. Instead of chunking and embedding, it builds a hierarchical "table of contents" tree from a document — capturing section structure, subsection relationships, and the logical organization the authors actually imposed on the content. At query time, an LLM reasons over this tree to navigate to the right section, rather than computing nearest-neighbor similarity over flat chunks.

No vectors. No chunking. No embedding model at all.

The benchmark result, as reported in the PageIndex documentation: 98.7% accuracy on FinanceBench, a dataset of questions over real-world SEC filings: 10-Ks, 10-Qs, 8-Ks, and earnings releases. Vector RAG baselines score roughly 30–50% on the same benchmark, which puts the gap at roughly 50 to 70 percentage points.

The underlying insight from the VectifyAI team is worth quoting directly: similarity does not equal relevance. When you ask "what was the effective tax rate in the Asia-Pacific segment in fiscal year 2023?", the answer is in a specific table in a specific section of a structured report. The text of that section may share almost no vocabulary with your query — it might use abbreviations, reference earlier definitions, and be embedded in a table format that embeddings handle poorly. The relevant passage is not semantically similar to the question; it's structurally located at the right position in the document.

Reasoning over a tree index finds the right section because the model can interpret the document's own organizational logic. Vector similarity finds the chunk that looks most like the question, which is often not the same thing.
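A rough sketch of tree navigation, under heavy assumptions. The `Section` type and the sample report are invented for illustration and are not PageIndex's data model. The `choose` callback is where an LLM would reason over section titles and summaries; the default `word_overlap` chooser is only a placeholder so the example runs without a model.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Section:
    title: str
    summary: str
    children: list["Section"] = field(default_factory=list)


def word_overlap(question: str, children: list[Section]) -> Section:
    """Placeholder for the LLM step: pick the child sharing the most words with the question."""
    q = set(question.lower().split())
    return max(children, key=lambda c: len(q & set(f"{c.title} {c.summary}".lower().split())))


def navigate(
    root: Section,
    question: str,
    choose: Callable[[str, list[Section]], Section] = word_overlap,
) -> list[str]:
    """Walk from the root to a leaf, choosing one child per level."""
    path, node = [root.title], root
    while node.children:
        node = choose(question, node.children)
        path.append(node.title)
    return path


report = Section("Annual Report", "full filing", [
    Section("Business Overview", "products markets and strategy"),
    Section("Financial Statements", "income statement balance sheet and income tax notes", [
        Section("Income Statement", "revenue expenses and net income"),
        Section("Note 9: Income Taxes", "effective tax rate by segment"),
    ]),
])

path = navigate(report, "what was the effective tax rate in the asia-pacific segment")
print(" > ".join(path))
```

The design point is that each decision sees only one level of the outline, so the work scales with the depth of the document rather than with the number of chunks.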

Where it works well: Long, structured documents where the organization itself carries meaning. Financial reports, legal filings, technical standards documents, academic papers with formal section structures, regulatory documents. Any domain where "find me the answer" requires navigating a document the way a human expert would — by understanding what the document is about and where it keeps different types of information.

Where it breaks down: Unstructured documents without meaningful hierarchy, and corpora of many short documents where there's no tree structure to exploit. It also requires the documents to have a stable logical organization — conversational content or informal writing doesn't have the structural regularity that makes tree indexing effective.

Paradigm 5: LLM-Maintained Wiki

This paradigm is less about retrieval and more about continuous knowledge compilation. The distinction matters: traditional retrieval systems find information that already exists in a raw corpus. This approach first transforms that corpus into a synthesized, structured artifact — then retrieves from that.

The core idea: instead of indexing raw documents for retrieval at query time, an LLM incrementally builds and maintains a structured wiki as new information arrives. When a new source is ingested, the LLM reads it, extracts entities and claims, and integrates them into existing wiki pages — updating summaries, flagging contradictions with prior content, adding cross-references, and strengthening the overall synthesis. The wiki is a persistent, compounding artifact. The work of understanding is done once and accumulated, not re-derived on every query.

The LLM-wiki pattern, the format popularized by Andrej Karpathy for personal knowledge bases, makes this concrete. The architecture has three layers: raw sources (immutable), the wiki (LLM-maintained markdown), and a schema document (CLAUDE.md or AGENTS.md) that tells the LLM how to maintain the wiki. Queries go against the wiki, not the raw sources. Good answers get filed back into the wiki as new pages, so exploration compounds the knowledge base just like ingested sources do.
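The ingest step can be sketched as follows. Everything here is hypothetical: the `Wiki` class, the `[[link]]` syntax, and the shape of `claims` are assumptions, and the extraction of claims from a raw source (the part an LLM would do, guided by the schema document) is skipped by passing the claims in directly.

```python
from dataclasses import dataclass, field


@dataclass
class Wiki:
    # Page title -> lines of markdown. Raw sources are never modified, only cited.
    pages: dict[str, list[str]] = field(default_factory=dict)

    def ingest(self, source_id: str, claims: dict[str, str]) -> None:
        """Integrate one source. `claims` maps an entity to a claim about it."""
        entities = sorted(claims)
        for entity, claim in claims.items():
            page = self.pages.setdefault(entity, [f"# {entity}"])
            page.append(f"- {claim} (source: {source_id})")
            # Cross-reference entities that appeared in the same source.
            for other in entities:
                link = f"See also: [[{other}]]"
                if other != entity and link not in page:
                    page.append(link)


wiki = Wiki()
wiki.ingest("paper-a.pdf", {
    "PageIndex": "Builds a table-of-contents tree.",
    "FinanceBench": "Benchmark over SEC filings.",
})
wiki.ingest("notes.md", {"PageIndex": "Uses no embeddings."})
print("\n".join(wiki.pages["PageIndex"]))
```

The second ingest appends to an existing page instead of creating a new record, which is the compounding behavior the pattern depends on.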

GBrain takes this pattern and productizes it. The wiki lives as plain markdown files in a git repository — readable by humans and agents alike — backed by an embedded database that requires no external server to set up. For larger deployments, it can switch to a full database with vector storage.

What makes GBrain's search particularly thoughtful is how it combines two complementary techniques. Keyword search is precise but literal: it finds documents containing the exact words you typed. Semantic search is fuzzy but conceptual: it finds documents about the same idea, even if they use completely different words. Neither alone is sufficient. A query like "unconventional thinking" won't match a document titled "The Bus Ticket Theory of Genius" through keyword search, but will through semantic search. Conversely, a query for an exact name or figure needs keyword search to find it reliably. GBrain runs both simultaneously and merges the ranked results — a technique called Reciprocal Rank Fusion — so you get the precision of one and the conceptual reach of the other.
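Reciprocal Rank Fusion itself is small enough to show in full. Each document scores 1 / (k + rank) in every list where it appears, and the scores are summed. The constant k = 60 is the value commonly used in the RRF literature; whether GBrain uses that value is an assumption, and the file names below are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each doc scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by name so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


keyword_hits = ["budget-2024.md", "tax-notes.md", "bus-ticket-theory.md"]
semantic_hits = ["bus-ticket-theory.md", "budget-2024.md", "genius-essay.md"]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

Only ranks are used, never the raw scores, so the keyword and semantic scorers do not need to be calibrated against each other. Documents that both searches rank well rise above documents only one search found.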

Before searching, GBrain also rephrases your query several different ways using a fast AI model, then searches on all the variations at once. This is similar to how a librarian might say "let me also check under 'fiscal policy' and 'government spending' if I don't find it under 'budget.'" The results are then scored by relevance and the best ones surface to the top.

It also connects to external data sources — email, calendar, voice calls, Twitter — so the knowledge base grows automatically from the user's real-world activity rather than requiring manual curation.

Where it works well: Personal knowledge bases, team intelligence systems, research projects that accumulate knowledge over weeks or months, any scenario where the value comes from synthesis across many sources rather than lookup of specific facts. The pattern's core advantage is compounding — the wiki gets richer and more useful the longer it runs, whereas a RAG system over raw documents delivers essentially the same quality on day one and day one hundred.

Where it breaks down: Real-time information needs — the wiki reflects what's been ingested, not what's happening now. It also requires ongoing LLM maintenance work, which costs tokens and introduces latency on every ingest cycle. And the wiki's quality depends on the quality of the LLM's summarization and integration — errors can compound as readily as insights.

How Understand-Anything and GBrain compare — and how to combine them: Earlier, Understand-Anything was mentioned for analyzing knowledge bases. This is where the two tools intersect, and the distinction is worth spelling out.

GBrain is about growing and querying a knowledge base. The LLM writes to it continuously, ingests new sources, updates existing pages, and the wiki gets richer over time. It is your day-to-day knowledge engine — the system you run constantly in the background.

Understand-Anything applied to a knowledge base is about understanding what you've already built. It reads the existing markdown files, extracts entities and implicit relationships between ideas, detects topic clusters, and builds an interactive visual map of the whole thing. It doesn't write anything — it reveals structure.

Since both tools work on the same plain markdown files, there's no conversion or migration — you point Understand-Anything at the same directory GBrain writes to:

GBrain grows the wiki — day after day, ingesting sources, synthesizing knowledge, cross-referencing ideas.
Understand-Anything maps the wiki — run periodically to see which topics are well-developed, which concepts are referenced but never explained (orphan nodes), and which clusters of ideas have formed that you hadn't consciously planned.

The combination is particularly powerful for long-running research. After weeks of GBrain ingesting papers, articles, and notes, running Understand-Anything gives you a bird's-eye view: here is the shape of what you know, here are the gaps, here are the unexpected connections between ideas you explored in different contexts. That map then informs what to read and ingest next — feeding back into GBrain. The two tools create a feedback loop between building knowledge and understanding what you've built.

Comparison at a Glance

Dimension	Vector RAG	Agentic Search	Graph/AST Indexing	Tree-Indexed RAG	LLM-Maintained Wiki
Index type	Vector embeddings	No index	Knowledge graph	Hierarchical tree	Compiled markdown wiki
Data freshness	Stale between rebuilds	Always fresh (real-time)	Near-fresh (incremental)	Static	Updated on ingest
Query latency	Low (milliseconds)	High (seconds, multi-round)	Very low (sub-ms graph query)	Medium (LLM tree traversal)	Low (search over compiled wiki)
Accuracy / reasoning quality	Moderate	High (reasoning-driven)	High for structural queries	Very high for structured docs	High for synthesized knowledge
Setup complexity	Medium	Low	Medium-high	Medium	Medium-high
Token cost per query	Low	High	Very low (~99% fewer than file-by-file)	Medium	Low
Best data type	Unstructured text corpora	Live codebases	Large stable codebases	Long structured documents	Accumulating personal/team knowledge

A Framework for Choosing

The right question isn't "which retrieval paradigm is best?" It's five questions about your specific data and use case: four that select a paradigm, and a fifth that selects tools within one.

1. How frequently does the data change?

If the data changes faster than you can rebuild an index — think an active codebase with dozens of daily commits — agentic search or graph indexing with incremental updates are your options. If the data is effectively static, the cost of an index is justified.

2. Does structure or hierarchy matter for your queries?

If the answers to your questions are located by navigating a document's or codebase's organizational structure — sections, call chains, inheritance hierarchies — then structure-preserving representations (graph indexing, tree indexing) will outperform flat vector embeddings. If your queries are truly about semantic similarity and the documents are genuinely unstructured, vector RAG is appropriate.

3. Are you doing point lookups or open-ended reasoning?

"What does function X return?" is a lookup. "Why is this test failing, and what's the root cause?" is open-ended reasoning that may require exploring paths you didn't anticipate at query time. Lookups favor indexed approaches. Reasoning tasks favor agentic approaches or graph traversal where the model can navigate to relevant context.

4. Is this a one-time corpus or an accumulating knowledge base?

A static document corpus — product documentation, a research paper collection, a regulatory filing archive — is indexed once and queried repeatedly. An accumulating knowledge base — ongoing research, a personal journal plus reading notes, a team's institutional memory — grows continuously and gains value from synthesis over time. The LLM-maintained wiki pattern is designed specifically for the second case; traditional RAG is designed for the first.

The answers map to paradigm selection as follows:

Changes constantly + reasoning-heavy + codebase → Agentic Search
Large stable codebase + structural queries → Graph/AST Indexing
Long structured documents + precise lookup → Reasoning-based Tree-Indexed RAG
Accumulating knowledge + synthesis over time → LLM-Maintained Wiki
Stable text corpus + semantic similarity queries → Vector RAG
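One possible encoding of this mapping as a function. The parameter names and the order of the checks are assumptions: the mapping above does not say which answer wins when several apply, so this sketch checks accumulation first, then code, then document structure.

```python
def choose_paradigm(
    *,
    is_codebase: bool,
    changes_constantly: bool,
    structure_matters: bool,
    accumulating: bool,
) -> str:
    """Return the paradigm suggested by the four answers (precedence is an assumption)."""
    if accumulating:
        return "LLM-Maintained Wiki"
    if is_codebase:
        return "Agentic Search" if changes_constantly else "Graph/AST Indexing"
    if structure_matters:
        return "Reasoning-based Tree-Indexed RAG"
    return "Vector RAG"


live_repo = choose_paradigm(
    is_codebase=True, changes_constantly=True, structure_matters=True, accumulating=False
)
sec_filings = choose_paradigm(
    is_codebase=False, changes_constantly=False, structure_matters=True, accumulating=False
)
support_wiki = choose_paradigm(
    is_codebase=False, changes_constantly=False, structure_matters=False, accumulating=False
)
print(live_repo, sec_filings, support_wiki, sep="\n")
```

A real system would rarely return a single answer for the whole product; the function is more useful applied per data source.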

5. Who is the consumer — an AI agent or a human?

This cuts across all the paradigms above and specifically matters when choosing within the Graph/AST Indexing and LLM-Maintained Wiki categories, where two tools often cover similar ground:

AI agent as primary consumer (querying during active work) → prefer codebase-memory-mcp for code, GBrain for knowledge. Both are optimized for programmatic access, low token cost, and fast lookup.
Human as primary consumer (exploring, onboarding, understanding) → layer in Understand-Anything on top. It visualizes the same underlying data as an interactive graph that humans can navigate, rather than an API that agents call.

Crucially, these pairings are additive, not alternatives. codebase-memory-mcp + Understand-Anything serve the same codebase from different angles. GBrain + Understand-Anything create a feedback loop: GBrain builds the knowledge base day by day; Understand-Anything periodically maps its structure, reveals gaps, and surfaces unexpected connections — informing what to add next.

They Coexist in Production

In a real production AI agent, you don't pick one paradigm and apply it everywhere. You compose multiple, matching each layer of the system to the data type it handles.

Consider a production coding agent. The stable architectural layer — the module boundaries, the major abstractions, the interfaces between components — is well-suited for graph indexing. That structure doesn't change often, queries about it are structural ("what implements interface X?"), and the 99% token reduction from codebase-memory-mcp compounds significantly across thousands of queries. But for files modified in the last hour, graph indexes may be stale. Agentic search against those specific files gives you fresh context without rebuilding the full graph. And for team-level institutional knowledge — why a particular architectural decision was made, what a deprecated module was replaced by, context about an external integration — an LLM-maintained wiki stores the synthesis that can't be recovered from the code alone.
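The freshness split in the coding-agent example can be sketched as a routing step. The `FileInfo` type, the `route` function, and the timestamps are hypothetical; the only rule encoded is the one described above, that files changed since the last graph build go to agentic search and everything else is answered from the graph.

```python
from dataclasses import dataclass


@dataclass
class FileInfo:
    path: str
    modified_at: float  # Unix timestamp, in seconds


def route(files: list[FileInfo], indexed_at: float) -> dict[str, list[str]]:
    """Split files by whether the graph index has already seen their latest change."""
    plan: dict[str, list[str]] = {"graph_index": [], "agentic_search": []}
    for f in files:
        key = "agentic_search" if f.modified_at > indexed_at else "graph_index"
        plan[key].append(f.path)
    return plan


INDEXED_AT = 1_700_000_000.0  # when the graph was last rebuilt (made-up value)
files = [
    FileInfo("core/interfaces.py", INDEXED_AT - 86_400),  # untouched since yesterday
    FileInfo("auth/session.py", INDEXED_AT + 1_800),      # edited after the build
]

plan = route(files, INDEXED_AT)
print(plan)
```

With incremental indexing the stale set stays small, so the expensive path handles only the handful of files that actually changed.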

A research agent might combine paradigms differently. Papers in the corpus are long, structured documents, so tree-indexed RAG handles precise retrieval from specific papers, where it can outperform chunk-based approaches by a wide margin (PageIndex's FinanceBench result is the reference point). The ongoing synthesis is a different job: what those papers mean collectively, how they relate to each other, what contradictions have been found, what hypotheses have been refined. That is the LLM-maintained wiki layer, compounding over weeks of research.

These combinations aren't exotic or theoretical. They're what you get when you take the representation problem seriously: each data type in your system gets the representation that preserves and exploits its structure, rather than everything getting forced through the same pipeline.

The Takeaway

The era of RAG as a default is over — not because RAG is dead, but because the field has developed better-matched tools and we now understand the tradeoffs clearly enough to choose deliberately.

Vector RAG solved a real problem: making large text corpora queryable at scale. For unstructured text corpora with stable content and semantic search needs, it remains a reasonable choice. But the mistake was treating it as a universal retrieval primitive — applicable to codebases, financial documents, personal knowledge bases, and everything else with equal fidelity. It isn't.

The honest accounting of the field in 2025 is this:

Codebases are graphs with temporal dynamics — use graph indexing for stable structure, agentic search for fresh context
Long structured documents are hierarchical arguments — use tree-indexed reasoning to navigate them, because similarity is not relevance
Accumulating knowledge is a synthesis problem — maintain a wiki that compounds, rather than re-deriving from raw sources on every query
Unstructured text corpora are genuinely suited for vector RAG — stop apologizing for using it where it actually works

The right question is always: what is the structure of my data, and which representation preserves and exploits that structure?

Everything else — which database, which embedding model, which retrieval framework — is downstream of that question. Get the representation right, and the rest of the system follows naturally. Get it wrong, and you're throwing away the most useful information in your data before you've asked a single question.

References

Tools and Projects

codebase-memory-mcp (DeusData) — github.com/DeusData/codebase-memory-mcp | deusdata.github.io/codebase-memory-mcp — High-performance code intelligence MCP server; AST-based knowledge graph across 155 languages.
Understand-Anything (Lum1104) — github.com/Lum1104/Understand-Anything | understand-anything.com — Interactive knowledge graph generation for codebases and knowledge bases; works with Claude Code, Codex, Cursor, Copilot, and Gemini CLI.
PageIndex (VectifyAI) — github.com/VectifyAI/PageIndex | pageindex.ai — Vectorless, reasoning-based RAG using hierarchical tree indexing; 98.7% accuracy on FinanceBench.
Aider (Aider-AI) — github.com/Aider-AI/aider | aider.chat — Open-source coding assistant; pioneered AST-based repository map indexing for passing structural code context to LLMs.
GBrain (garrytan) — github.com/garrytan/gbrain — Agent memory system with hybrid BM25/vector search (RRF fusion) and personal data integrations.
Claude Code (Anthropic) — Boris Cherny discussed abandoning RAG in favor of Agentic Search on the Latent Space podcast (May 2025) | YouTube. A later in-depth interview on The Pragmatic Engineer podcast (March 2026) covers Claude Code's full architectural evolution.

Research

"Codebase-Memory: Tree-Sitter-Based Knowledge Graphs for LLM Code Exploration via MCP" — arxiv.org/abs/2603.27277 — Research behind codebase-memory-mcp; evaluated across 31 real-world repositories: 83% answer quality, 10× fewer tokens, 2.1× fewer tool calls vs. file-by-file exploration.

Benchmarks

FinanceBench — Financial document Q&A benchmark for evaluating retrieval systems over real-world SEC filings and earnings reports. Referenced in PageIndex documentation for the 98.7% accuracy result.

Top comments (1)

Harjot Singh • May 31

The reframe is right and the RAG-is-dead discourse is a great example of how the industry argues about implementations when the real question is about representation. RAG vs agentic search isn't two philosophies, it's two points on a tradeoff: precomputed index (fast, cheap, goes stale, lossy at chunk boundaries) vs live traversal (fresh, expensive, slow, but follows the actual structure). Cherny abandoning RAG for Claude Code makes sense for that specific shape, code has hard structural edges (imports, call graphs, definitions) that a flat embedding chunking destroys, so agentic search over real structure beats similarity over mangled chunks. But that's a property of code, not a universal verdict, a support knowledge base with stable docs is still RAG's home turf. The representation framing is the unlock: the question isn't which retrieval wins, it's does your retrieval preserve the structure that matters for this domain. Choose representation first, retrieval follows. That match-the-method-to-the-data-shape thinking is how I approach retrieval in Moonshift. Where do you draw the line, is it data volatility or the strength of the structural relationships that pushes you toward agentic over indexed?