Harrison Guo

Posted on May 26 • Originally published at harrisonsec.com

Agent Retrieval Is a Cost Curve Problem: Why Claude Code Doesn't Use RAG

#ai #rag #claudecode #programming

There's a popular interview question making the rounds: "Why doesn't Claude Code use RAG to retrieve code? Why grep?"

The popular answer goes: chunking breaks code structure, vectors approximate when code demands exact, indexes go stale, cold-start is slow, retrieval is a black box. All five are real. None of them are the reason.

They're symptoms. The reason is older than RAG, older than LLMs, older than the term retrieval. It's a cost curve.

tl;dr: Index-based retrieval pays a high build cost plus a maintenance cost that grows at least linearly with churn, and faster once cross-file references force cascading re-indexing. LLM tool-loop retrieval pays nothing up front and a per-query cost that's roughly project-size-independent for queries an LLM actually issues. For most small-to-mid-size repos the crossover is never reached. The "Anthropic trusts the model" framing is romantic; the actual answer is colder: build cost is zero, maintenance cost is zero, and the per-query bill stays below what an index would cost to keep fresh, so the math says grep.

There's also a precision axis, which most engineers care about more than cost. Vector RAG is approximate by design — getUserById returns alongside getUserByEmail because they're semantically adjacent. Code usually wants exact, which grep gives you for free. Symbol-graph indexes (Sourcegraph, Kythe, LSP) are precision-first but haven't become the LLM companion either — covered below.

Audited against a publicly circulated build snapshot of Claude Code with file:line citations. The kicker: the Explore subagent's "use this when you'll need more than 3 queries" rule is gated behind a feature flag (tengu_amber_stoat) and being A/B-tested against a parallel architecture (Fork). The canonical answer is conditional. That is the answer that gets you the offer.

The Frame That Matters: Total Cost Over Time

Pick any retrieval system and you pay for three things, on different schedules:

Build cost — one-time work to assemble whatever structure makes lookup fast. For an index, this is chunking + embedding + insert. For tool-loops, this is zero.
Maintain cost — ongoing work to keep the structure honest as the underlying data changes. For an index, this is invalidation, reindex, drift reconciliation. For tool-loops, this is also zero — the "structure" is the live filesystem.
Per-query cost — work done when a question arrives. For an index, this is a vector search + a few reranks + an LLM call. For tool-loops, this is N LLM-tool round-trips, where N varies.

The temptation is to compare per-query cost: "vector search is one round-trip, tool-loop is six." That's why RAG looks dominant on a whiteboard. But you ship a system, not a whiteboard. The bill is:

total_cost = build_cost + maintain_cost × time + per_query_cost × queries

For a project that changes daily and gets queried hourly, the term that actually grows is maintain_cost × time. For index-based retrieval on a churning codebase, maintain cost grows at least linearly with churn — and can grow faster than teams expect, because cross-chunk and cross-file references force cascading re-embeddings and symbol-graph consistency checks. A naive incremental indexer is linear; a correct one tracking cross-file refactors is often worse than linear in the worst case. For tool-loops, maintain cost is identically zero, because the loop has no persistent structure.

The build/maintain term dominates anything you save on per-query cost, until your project is large enough that per-query cost itself becomes the bottleneck. For most small-to-mid-size repos, the crossover is never reached.
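To make the crossover concrete, here is a minimal sketch of the cost equation above. Every number is an illustrative assumption in made-up cost units, not a measurement, and the names (CostModel, totalCost, crossoverQueriesPerDay) are hypothetical. It also assumes the simplest case, where maintenance cost is linear in time.

```typescript
// All figures are made-up cost units for illustration, not measurements.
interface CostModel {
  buildCost: number;          // one-time cost to assemble the structure
  maintainCostPerDay: number; // ongoing cost to keep it honest
  perQueryCost: number;       // cost paid each time a question arrives
}

// total_cost = build_cost + maintain_cost × time + per_query_cost × queries
function totalCost(m: CostModel, days: number, queriesPerDay: number): number {
  return m.buildCost + m.maintainCostPerDay * days + m.perQueryCost * days * queriesPerDay;
}

// Smallest daily query volume at which the index is strictly cheaper than
// the tool-loop over `days`, or null if there is none up to `limit`.
function crossoverQueriesPerDay(
  index: CostModel,
  loop: CostModel,
  days: number,
  limit: number = 100_000,
): number | null {
  for (let q = 1; q <= limit; q++) {
    if (totalCost(index, days, q) < totalCost(loop, days, q)) return q;
  }
  return null;
}

// Assumed numbers: the index is cheap per query but costly to build and maintain.
const indexModel: CostModel = { buildCost: 500, maintainCostPerDay: 200, perQueryCost: 1 };
// Assumed numbers: the tool-loop has no structure to build or maintain.
const loopModel: CostModel = { buildCost: 0, maintainCostPerDay: 0, perQueryCost: 6 };

const crossover = crossoverQueriesPerDay(indexModel, loopModel, 100);
console.log(crossover); // 42 under these assumed numbers
```

Under these assumptions the index only wins above 42 queries per day over a 100-day window. A repo queried less often than that never crosses over, no matter how long it runs, because the daily maintenance bill keeps pace with the per-query savings.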

That's the whole argument. The rest of this post is evidence — what the cost-curve choice looks like in source code, and where Anthropic is hedging.

A teaser for the punchline a few sections down: the canonical "Claude Code spawns an Explore subagent for open-ended search" rule that most explainers quote is gated behind a feature flag (tengu_amber_stoat) and A/B-tested in production against a second architecture (Fork) that takes the opposite trade. Anthropic is hedging on the retrieval design itself. We'll come back to this in "The Subagent Twist Nobody Quotes Correctly."

The Popular Answer, Charitably

Before deflating it: the popular answer isn't wrong, it's just downstream. Briefly, with the steel-manned version:

Chunking breaks structure. A function split across chunks loses both halves of an if/else, and call-graph relationships fragment between chunks. AST-aware chunkers exist; they're better, not solved.
Vectors approximate. getUserById returns alongside getUserByEmail and getUserByName because they're semantically adjacent. Exact-symbol search beats this trivially. Important distinction: this is true of vector RAG specifically. Symbol-graph indexes — Sourcegraph, Kythe, Glean, LSP-backed code search — are a different category: they index by function/class/reference, not by chunked vector. They give exact answers and are what mega-monorepos at Google and Meta actually run for code search. When this post says "RAG," it means vector RAG. The cost-curve argument applies to vector RAG; symbol-graph has its own cost curve and lives higher up the scale spectrum. It has its own separate reason for not becoming the LLM companion — covered in the next section.
Indexes go stale. Every commit invalidates some subset of chunks. Incremental update has edge cases (renames, file moves, cross-file rename refactors). Full reindex is expensive enough to discourage frequent commits.
Cold start. Minutes-to-first-query is a non-starter for "open the tool, start working" UX.
Black-box recall. Top-K vector hits are not human-auditable. When the LLM returns a wrong answer, you can't tell whether retrieval failed or reasoning did.

All five are pains. Each has a counter — better chunkers, hybrid retrieval, incremental indexers, warm pools, attribution layers. The counters cost engineering. Some teams spend the engineering and ship working systems. Anthropic looked at the bill and decided not to.

Why? Because the baseline for code retrieval — grep over a clean filesystem with an LLM in the loop — already works well enough that adding an index is paying engineering cost to solve symptoms whose root cause is the index itself. Removing the index removes the pains. The remaining cost is per-query LLM round-trips, which the cost-curve frame says is acceptable below the crossover.

That's why grep. Everything else is engineering details on top of that decision.

So Why Hasn't Symbol-Graph Become the LLM-Companion Either?

If symbol-graph indexes are precision-first, language-aware, and battle-tested at FAANG scale, the natural question is: why didn't they become the default companion to LLM coding agents? Why grep, not LSP-over-MCP?

The answer is the same shape as the vector-RAG answer — high friction in places that don't show up on a feature comparison — but the specific frictions are different:

Build cost is high in a different way. Symbol-graph indexes need to compile (or semi-compile) the project to resolve symbols. For Rust, C++, large TypeScript or Java codebases this is minutes to tens of minutes per cold start. "Open Claude Code and start working" can't pay that toll.
Language-specific, not portable. LSP is one server per language. Tree-sitter coverage helps but isn't uniform. A grep-backed agent works on any text in any language with zero setup; a symbol-graph-backed agent inherits the project's language-server matrix.
API/format mismatch with how LLMs reason. LSP returns deeply nested JSON (locations, ranges, document hierarchies); grep returns file:line: content. The second is almost literally an LLM's native dialect; the first needs adapting. The translation tax is real.
Coverage is narrower than it looks. Symbol-graph models code-as-structure. It misses config files, comments, strings, generated code, markdown, env files, shell scripts, the README — all of which are first-class context for a real coding session. Grep covers anything that's text.
The win is for structure questions, not intent questions. "Where is getUserById defined?" — symbol-graph is exactly right. "How does the login flow work?" — back to grep + read. A real coding day has both kinds; building infrastructure that only solves one kind is paying a high fixed cost for half the answer.
The constraint reversed under it. Symbol-graph was designed for a world where the constraint was human attention bandwidth — give a developer one precise answer they can read. LLMs don't have that constraint; they can read 30 grep hits cheaply and reason across them. The bottleneck moved from "precision of retrieval" to "fluency of the model reading the retrieval." Symbol-graph is optimizing the part that's no longer expensive.

One-line summary: symbol-graph is precision tooling built for human IDEs. LLMs are not human IDEs. Their retrieval bottleneck is different — they prefer many cheap rounds over one expensive precise call. Installing a symbol-graph for an agent that can grep thirty times in a session is, roughly, hiring a second driver for someone who already drives.

This is also why the few existing LLM ↔ symbol-graph integrations (Cursor's @symbol references via LSP, Sourcegraph Cody, Codeium with LSP backend) are additive niceties in those products, not the retrieval backbone. The backbone is still the same as Claude Code's — grep over text.

The Three Primitives, Audited

Source paths below are from a publicly circulated but not officially released build snapshot of Claude Code that I have on disk for analysis purposes. APIs and exact line numbers will drift; the design choices below have been stable in the snapshot I reviewed, and match observed runtime behavior on current public Claude Code releases.

Grep — ripgrep, with structured output and a "don't shell out" enforcement clause

The Grep tool description, from src/tools/GrepTool/prompt.ts:7-16, is short and pointed:

A powerful search tool built on ripgrep

  Usage:
  - ALWAYS use Grep for search tasks. NEVER invoke `grep` or `rg` as a Bash command. The Grep tool has been optimized for correct permissions and access.
  - Supports full regex syntax (e.g., "log.*Error", "function\s+\w+")
  - Filter files with glob parameter (e.g., "*.js", "**/*.tsx") or type parameter (e.g., "js", "py", "rust")
  - Output modes: "content" shows matching lines, "files_with_matches" shows only file paths (default), "count" shows match counts
  - Use Agent tool for open-ended searches requiring multiple rounds
  - Pattern syntax: Uses ripgrep (not grep) - literal braces need escaping (use `interface\{\}` to find `interface{}` in Go code)
  - Multiline matching: By default patterns match within single lines only. For cross-line patterns like `struct \{[\s\S]*?field`, use `multiline: true`

Three things to read out of this:

The ALWAYS / NEVER is doing work. The model has Bash. It could shell out to rg or grep directly. The prompt forbids it. Why? Three reasons, in order of importance:
- Permission surface. Bash is a universal tool. Auditing what the model can do with Bash means auditing every shell command. Audited Grep means audited Grep, period.
- Output discipline. A Bash rg dumps raw text into the context window. Grep returns one of three structured modes — content, files_with_matches, count — letting the model pick the cheapest mode that answers the question. This is a token-budget decision, not a feature decision.
- Backend swap. Today the implementation is ripgrep. Tomorrow it might be bfs/ugrep (the source already has a branch for embedded-search tools — see hasEmbeddedSearchTools() in src/utils/embeddedTools.ts). A tool boundary makes the swap invisible to the model.
The default output mode is files_with_matches. Not content. The model has to opt in to seeing actual lines. This is a token-conservation default: most of the time, the model wants to know which files matched so it can narrow further; only when it's ready to read does it ask for the lines.
Multiline is opt-in. Default ripgrep is line-bounded — a deliberate restriction, because cross-line regex on a large tree is a perf cliff. The model can opt into multiline when it knows it needs to, paying the cost only then.

These are all small choices. Each one shaves a tail off the per-query cost. Cumulatively, they're why the per-query term in our cost equation stays low enough that the zero build and maintain terms decide the comparison.
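The three output modes are easy to picture with a toy version. The sketch below is not the Grep tool's implementation; it is a hypothetical in-memory search (the names FileMap and search are assumptions) that shows how the same match set yields very different amounts of output depending on the mode, with files_with_matches as the default.

```typescript
type OutputMode = "content" | "files_with_matches" | "count";

// Hypothetical stand-in for a file tree: path -> file text.
type FileMap = Record<string, string>;

function search(
  files: FileMap,
  pattern: RegExp,
  mode: OutputMode = "files_with_matches", // cheapest mode is the default
): string[] {
  // Drop the g/y flags so RegExp.test() carries no lastIndex state between lines.
  const re = new RegExp(pattern.source, pattern.flags.replace(/[gy]/g, ""));
  const out: string[] = [];
  for (const [path, text] of Object.entries(files)) {
    const hits: string[] = [];
    // Matching is per line, mirroring the single-line default described above.
    text.split("\n").forEach((line, i) => {
      if (re.test(line)) hits.push(`${path}:${i + 1}:${line}`);
    });
    if (hits.length === 0) continue;
    if (mode === "content") out.push(...hits);                  // full matching lines
    else if (mode === "count") out.push(`${path}:${hits.length}`); // one line per file
    else out.push(path);                                        // path only
  }
  return out;
}

const demoFiles: FileMap = {
  "src/auth.ts": "export function login() {}\nexport function logout() {}",
  "src/util.ts": "export const noop = () => {};",
};
```

Calling search(demoFiles, /login/) returns only the path src/auth.ts. The caller has to ask for "content" explicitly before any source lines enter the context.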

Glob — filename patterns with a recency heuristic and a hard cap

Glob is the "find files by name" primitive. Two design tells in its construction:

Results are sorted by mtime descending. Most-recently-modified file first. The heuristic is that in any given session, the files you've touched recently are the files you're about to touch again. This is the same logic IDEs use for the "Recent Files" list, and it's empirically right far more than it's wrong.
A hard cap of 100 results. Past that, output truncates. The model can tighten the pattern and re-call. The cap exists because an LLM that consumes 800 file paths because it asked for **/* is an LLM that's burned a meaningful slice of its context on noise.

Both are token-budget decisions disguised as ergonomic ones. The mtime sort means a small N typically covers the relevant set. The 100-file cap means a careless query degrades gracefully instead of catastrophically.
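The two behaviors described above fit in a few lines. This is a sketch under stated assumptions, not the Glob tool's code: FileEntry and rankAndCap are hypothetical names, and the input is a pre-collected list of paths with modification times rather than a real directory walk.

```typescript
interface FileEntry {
  path: string;
  mtimeMs: number; // modification time in milliseconds since the epoch
}

const MAX_RESULTS = 100; // the cap described above

function rankAndCap(
  entries: FileEntry[],
  cap: number = MAX_RESULTS,
): { paths: string[]; truncated: boolean } {
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...entries].sort((a, b) => b.mtimeMs - a.mtimeMs);
  return {
    paths: sorted.slice(0, cap).map((e) => e.path),
    truncated: sorted.length > cap, // tells the caller to tighten the pattern
  };
}
```

Returning a truncated flag alongside the paths is a design choice in this sketch: it gives the caller a cheap signal that re-querying with a narrower pattern is worthwhile.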

Read — bounded, fresh, stat-checked

src/tools/FileReadTool/prompt.ts is the most interesting of the three because of what it constrains:

- By default, it reads up to 2000 lines starting from the beginning of the file (...)
- When you already know which part of the file you need, only read that part. This can be important for larger files.

(MAX_LINES_TO_READ = 2000 at line 10; the "only read that part" guidance is OFFSET_INSTRUCTION_TARGETED at lines 20-21, swapped in dynamically based on context.)

A 2000-line default cap. offset and limit parameters for targeted reads. And — critically — every read calls stat on disk. No cache. No index. No staleness layer.

The implication is that Read is always live. The model that just edited a file and wants to see the result reads the file and gets the new bytes. The model that's iterating on a fix doesn't fight cache invalidation because there is no cache to invalidate. This is the "no maintain cost" term in the cost equation, made concrete: the live filesystem is the index, and the filesystem is always up to date with itself.

The OFFSET_INSTRUCTION_TARGETED template is worth noting separately. It's swapped in over the default when the prompting context suggests the model already knows what range it wants. It's a tiny prompt-engineering detail, but it's also a piece of evidence that the team thinks carefully about teaching the model to read selectively. The lesson the model is being taught, every prompt, is don't be greedy. That's exactly the discipline that keeps per-query cost from blowing up.
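A bounded read is a small amount of logic. The sketch below takes the file text as a string so it stays self-contained; the behavior described above reads from disk on every call, which this sketch does not attempt to reproduce. The function name readWindow and the 1-based offset are assumptions for illustration.

```typescript
const MAX_LINES_TO_READ = 2000; // default cap quoted above

// Returns at most `limit` lines starting at line `offset`.
// Assumption: `offset` is 1-based, like line numbers in grep output.
function readWindow(
  text: string,
  offset: number = 1,
  limit: number = MAX_LINES_TO_READ,
): string[] {
  const lines = text.split("\n");
  const start = Math.max(0, offset - 1);
  return lines.slice(start, start + limit);
}
```

With a grep hit at line 412, a call such as readWindow(text, 400, 40) pulls in only the surrounding region instead of the first 2000 lines.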

The composition

Walk through a real query: "Where's the login flow in this project?"

Glob `**/login.{ts,tsx,js}` returns up to 100 files, most-recently-modified first. Usually under ten matches; usually the right one is in the first three.
Grep passport|auth|login --glob '' — narrows to specific lines. Three output modes available; the model picks the cheapest one that disambiguates.
Read the file, with offset/limit targeted at the matched region — reads only what's needed.

Three primitives. One realistic query. Total cost: three round-trips, a few hundred tokens of output total. No index, no embedding, no rebuild. Filesystem unchanged.

Now add ten more iterations as the model investigates a bug. Each one is the same three primitives, in different combinations. The cost grows linearly with iterations; it doesn't grow super-linearly with the codebase. That's the curve.

The Loop, Sketched Honestly

You can find pseudo-code versions of Claude Code's main loop in essays and threads. They all look something like:

while (true) {
  const response = await callLLM(messages);
  if (!response.toolUses.length) break;
  for (const use of response.toolUses) {
    const result = await execute(use);
    messages.push(result);
  }
}

That sketch is conceptually right. It is also missing roughly 1,415 lines.
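For readers who want to run the Platonic version, here is a minimal self-contained sketch with a scripted stand-in for the model. Everything in it is hypothetical (the Message, ToolUse, and Model types, runLoop, scriptedModel), it is synchronous for simplicity where a real loop would be async and streaming, and it adds a turn limit that the sketch above does not have.

```typescript
type ToolUse = { id: string; name: string; input: string };
type ModelResponse = { text: string; toolUses: ToolUse[] };
type Message = { role: "user" | "assistant" | "tool"; content: string };

// Hypothetical stand-ins: a model is a function of the transcript,
// and a tool is a function from an input string to a result string.
type Model = (messages: Message[]) => ModelResponse;
type Tools = Record<string, (input: string) => string>;

function runLoop(model: Model, tools: Tools, prompt: string, maxTurns: number = 10): Message[] {
  const messages: Message[] = [{ role: "user", content: prompt }];
  for (let turn = 0; turn < maxTurns; turn++) {
    const response = model(messages);
    messages.push({ role: "assistant", content: response.text });
    if (response.toolUses.length === 0) break; // no tool calls: the model is done
    for (const use of response.toolUses) {
      const tool = tools[use.name];
      const result = tool ? tool(use.input) : `unknown tool: ${use.name}`;
      messages.push({ role: "tool", content: result });
    }
  }
  return messages;
}

// Scripted model: asks for one grep, then stops once it has seen a tool result.
const scriptedModel: Model = (messages) =>
  messages.some((m) => m.role === "tool")
    ? { text: "done", toolUses: [] }
    : { text: "searching", toolUses: [{ id: "t1", name: "grep", input: "login" }] };

const transcript = runLoop(
  scriptedModel,
  { grep: (q) => `src/auth.ts:12:${q}` },
  "Where's the login flow?",
);
```

The transcript ends up with four messages: the user prompt, the assistant's tool request, the tool result, and the final assistant reply.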

The real loop is at src/query.ts:307 (while (true) {) and closes at :1728 (} // while (true)). The body is 1,421 lines. The full file is 1,729. If you want a guided tour, the Claude Code Deep Dive Part 2 walks through it section by section. The summary for our purposes here is shorter:

The Platonic loop — call model, run tools, repeat — is six lines. The other 1,415 lines are doing one of four things:

Streaming tool execution. Tool calls don't wait for the full model response before they start running; they execute as the streaming tool_use blocks arrive. This is non-trivial because the model can emit a partial tool_use, retract it, or continue thinking before the rest of the call arrives.
Cache and compaction management. Microcompact maintains tool-result cache coherence by tool_use_id without inspecting content. The auto-compact path triggers when the message stream nears a context budget. Both interact with the prompt cache, which has its own coherence rules.
Failure recovery. A nested while (attemptWithFallback) at line 654 handles model fallback when the primary returns a recoverable error. There's a separate path for max_output_tokens recovery (when the model would emit a tool_use but truncate before the tool_result). Orphan tool_results are aggressively pruned so a retry doesn't carry stale IDs forward.
Thinking-block preservation. Reasoning content has to span the assistant trajectory — same turn, plus any subsequent tool_use/tool_result chain — because the model expects to see its own prior reasoning to continue coherently. The loop preserves these blocks across iterations.

The point is not "the loop is complicated." The point is: the cost of running an agent on tool-loops moves into the loop, where it's controllable. There's no second system — no index pipeline, no vector store, no embedding service — with its own failure modes, latency budgets, and consistency guarantees. The loop is the engineering target.

This is the same property that makes a single-process database easier to operate than a microservices fleet, for reasons that have nothing to do with the database. Concentrate complexity where you can attack it.

The Subagent Twist Nobody Quotes Correctly

This is the section where most explanations of Claude Code's retrieval get it wrong, including the article that prompted this post. The popular telling goes: "For open-ended exploration, Claude Code spawns an Explore subagent — Anthropic codifies the rule that if you need more than three queries, you should fork."

The rule exists. The popular telling has two things wrong about it:

The rule is not in AgentTool/prompt.ts. It's in src/constants/prompts.ts:378-379, injected into the system prompt by template:

For simple, directed codebase searches (e.g. for a specific file/class/function) use [searchTools] directly.

For broader codebase exploration and deep research, use the Agent tool with subagent_type=Explore. This is slower than using [searchTools] directly, so use this only when a simple, directed search proves to be insufficient or when your task will clearly require more than [EXPLORE_AGENT_MIN_QUERIES] queries.

EXPLORE_AGENT_MIN_QUERIES is defined in src/tools/AgentTool/built-in/exploreAgent.ts:59 as the integer 3. So the "more than 3 queries" threshold is literal — but it's a single named constant, easy to change, and the guidance is interpolated dynamically per system-prompt build.

The rule is conditional. The two lines above are only emitted when all three conditions in the guard hold:

// src/constants/prompts.ts:374-381
...(hasAgentTool &&
areExplorePlanAgentsEnabled() &&
!isForkSubagentEnabled()
  ? [
      `For simple, directed codebase searches ...`,
      `For broader codebase exploration ... subagent_type=Explore ...`,
    ]
  : []),

Beyond hasAgentTool, both areExplorePlanAgentsEnabled() and !isForkSubagentEnabled() have to be true. The first is itself feature-gated:

// src/tools/AgentTool/builtInAgents.ts:13-22
export function areExplorePlanAgentsEnabled(): boolean {
  if (feature('BUILTIN_EXPLORE_PLAN_AGENTS')) {
    // 3P default: true — Bedrock/Vertex keep agents enabled (matches pre-experiment
    // external behavior). A/B test treatment sets false to measure impact of removal.
    return getFeatureValue_CACHED_MAY_BE_STALE('tengu_amber_stoat', true)
  }
  return false
}

BUILTIN_EXPLORE_PLAN_AGENTS is a Bun build feature gate. tengu_amber_stoat is a GrowthBook key. (Aside: tengu_* is the Anthropic-internal naming prefix for Claude Code; any flag you see with that prefix in the source is a Claude Code feature toggle.) The comment in source is the giveaway: "A/B test treatment sets false to measure impact of removal." Anthropic is actively testing what happens when they remove Explore and Plan agents entirely for some fraction of users.

Read that again. The canonical "Anthropic uses Explore for exploration" claim is true for the default arm, and that arm is being measured against a no-Explore treatment. The interview answer that says "Anthropic always uses Explore" is wrong, or at least more confident than the source.

This is the half-credit point. Knowing that the rule exists is the first half. Knowing the rule is feature-gated and under measurement is the second half.
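The gating pattern itself is worth seeing in isolation. The sketch below mirrors the shape of the conditional spread quoted earlier, but it is not the source: the Gates interface, searchGuidance, and the shortened guidance strings are hypothetical, and the gate values are passed in as plain booleans instead of being read from build flags or a flag service.

```typescript
// Hypothetical container for the three conditions in the guard.
interface Gates {
  hasAgentTool: boolean;
  explorePlanAgentsEnabled: boolean;
  forkSubagentEnabled: boolean;
}

// Returns the guidance lines to splice into a system prompt, or none.
function searchGuidance(gates: Gates, minQueries: number): string[] {
  return [
    ...(gates.hasAgentTool && gates.explorePlanAgentsEnabled && !gates.forkSubagentEnabled
      ? [
          "For simple, directed codebase searches use the search tools directly.",
          `For broader exploration use the Explore subagent when the task will clearly require more than ${minQueries} queries.`,
        ]
      : []),
  ];
}

const defaultArm = searchGuidance(
  { hasAgentTool: true, explorePlanAgentsEnabled: true, forkSubagentEnabled: false },
  3,
);
const forkArm = searchGuidance(
  { hasAgentTool: true, explorePlanAgentsEnabled: true, forkSubagentEnabled: true },
  3,
);
```

In this sketch defaultArm has two lines and forkArm has none. Flipping a single gate removes the "more than 3 queries" rule from the prompt entirely, which is why a claim about the rule needs to say which arm it describes.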

The Economics of Explore, in Detail

If you read just the function of Explore, the design looks like a generic subagent: open-ended exploration, returns a summary, isolates from the main context. If you read its parameters, the economics jump out:

// src/tools/AgentTool/built-in/exploreAgent.ts:64-83
export const EXPLORE_AGENT: BuiltInAgentDefinition = {
  agentType: 'Explore',
  whenToUse: EXPLORE_WHEN_TO_USE,
  disallowedTools: [
    AGENT_TOOL_NAME,             // can't spawn nested subagents
    EXIT_PLAN_MODE_TOOL_NAME,
    FILE_EDIT_TOOL_NAME,         // read-only
    FILE_WRITE_TOOL_NAME,        // read-only
    NOTEBOOK_EDIT_TOOL_NAME,
  ],
  source: 'built-in',
  baseDir: 'built-in',
  // Ants get inherit to use the main agent's model; external users get haiku for speed
  // Note: For ants, getAgentModel() checks tengu_explore_agent GrowthBook flag at runtime
  model: process.env.USER_TYPE === 'ant' ? 'inherit' : 'haiku',
  // Explore is a fast read-only search agent — it doesn't need commit/PR/lint
  // rules from CLAUDE.md. The main agent has full context and interprets results.
  omitClaudeMd: true,
  getSystemPrompt: () => getExploreSystemPrompt(),
}

Three economic decisions encoded here:

Explore runs on Haiku for external users. Not the main reasoning model. Exploration is a cheap-tokens job — there's no creative reasoning happening, just iterate-and-filter — and Anthropic uses a fast, small, cheap model for it. The main agent gets the expensive model when it gets the summary back. This is the staffing analogue: junior associate does the deposition review, senior partner reads the brief.
Explore omits CLAUDE.md. The argument in the inline comment: "The main agent has full context and interprets results." CLAUDE.md is for project rules — commit conventions, PR style, lint guidance. Explore isn't doing any of those things. Loading CLAUDE.md into its prompt would be paying tokens for guidance it can't act on.
Explore can't spawn subagents. AGENT_TOOL_NAME is in disallowedTools. No recursion. This is a budget guarantee: an Explore call has bounded depth.

And the system prompt makes the read-only constraint impossible to misread:

=== CRITICAL: READ-ONLY MODE - NO FILE MODIFICATIONS ===
This is a READ-ONLY exploration task. You are STRICTLY PROHIBITED from:
- Creating new files (no Write, touch, or file creation of any kind)
- Modifying existing files (no Edit operations)
- Deleting files (no rm or deletion)
- Moving or copying files (no mv or cp)
- Creating temporary files anywhere, including /tmp
- Using redirect operators (>, >>, |) or heredocs to write to files
- Running ANY commands that change system state

(exploreAgent.ts:26-34. The full prompt also requires Explore to spawn parallel tool calls aggressively — "you must try to spawn multiple parallel tool calls for grepping and reading files" — which is yet another budget squeeze: turn N round-trips into one wall-clock round, lower latency, same token bill.)

Pull the threads together and Explore reads like a budgeted exploration worker: small model, no CLAUDE.md noise, no nested recursion, read-only, parallel-first. It is the most economically tuned piece of the retrieval system, and it's there because once you've decided the model drives retrieval, the next question is which model, and what it gets paid to think about.
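A deny-list like disallowedTools is typically applied by filtering the parent's tool set. The sketch below illustrates that idea under stated assumptions: AgentDefinition is a trimmed-down hypothetical type, resolveTools is a made-up helper, and the tool name strings are placeholders, since the actual values of constants such as AGENT_TOOL_NAME are not shown in the excerpt above.

```typescript
// Trimmed-down, hypothetical version of an agent definition.
interface AgentDefinition {
  agentType: string;
  disallowedTools: string[];
  model: string;
  omitClaudeMd: boolean;
}

// Everything the parent can use, minus what the agent is denied.
function resolveTools(allTools: string[], agent: AgentDefinition): string[] {
  const blocked = new Set(agent.disallowedTools);
  return allTools.filter((t) => !blocked.has(t));
}

// Placeholder tool names, assumed for illustration only.
const ALL_TOOLS = ["Agent", "Bash", "Edit", "Glob", "Grep", "Read", "Write"];

const exploreLike: AgentDefinition = {
  agentType: "Explore",
  disallowedTools: ["Agent", "Edit", "Write"], // no recursion, no file edits
  model: "haiku",
  omitClaudeMd: true,
};

const exploreTools = resolveTools(ALL_TOOLS, exploreLike);
```

With these placeholder names, exploreTools contains Bash, Glob, Grep, and Read. The budget guarantees fall out of the filter: the agent cannot spawn another agent because that tool is not in its list.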

The popular telling skips all of this. It says "Anthropic spawns Explore." The source says Anthropic spawns the cheapest possible agent that can do the work, strips its context to the minimum that lets it function, and measures whether spawning it at all beats just letting the main agent loop directly. That last clause is the A/B in the previous section.

Fork — The Architecture Under Test

Here's the other half of the experiment.

src/tools/AgentTool/forkSubagent.ts exports isForkSubagentEnabled(). When that returns true, AgentTool/prompt.ts rewrites the system prompt to introduce a new operation: fork.

The fork section, paraphrased from prompt.ts:80-96:

Fork yourself (omit subagent_type) when the intermediate tool output isn't worth keeping in your context. The criterion is qualitative — "will I need this output again" — not task size. Forks are cheap because they share your prompt cache. Don't set model on a fork — a different model can't reuse the parent's cache.

Don't peek. The tool result includes an output_file path — do not Read or tail it unless the user explicitly asks for a progress check. You get a completion notification; trust it.

Don't race. After launching, you know nothing about what the fork found. Never fabricate or predict fork results. The notification arrives as a user-role message in a later turn; it is never something you write yourself.

Read this against Explore and the trade becomes visible:

| | Explore | Fork |
| --- | --- | --- |
| Context isolation | Fresh subagent, separate context | Inherits parent context |
| Model | Haiku (cheap) | Same as parent (no swap allowed) |
| Prompt cache | New cache, no parent reuse | Reuses parent cache (huge savings on the system prompt and prior turns) |
| CLAUDE.md | Omitted | Inherited |
| Recursion | Disallowed | Allowed (forks can fork) |
| Failure model | Subagent returns summary or error | Notification arrives later as user message |
| Discipline required | Caller frames the query | Caller writes a directive prompt; trusts the notification |

Neither dominates. Explore wins when isolation matters more than cost: the exploration is noisy enough that you don't want any of the tool spew in your main context, and the work is mechanical enough that Haiku can handle it. Fork wins when the work needs the main model's depth and prompt-cache reuse makes that depth affordable, accepting that the fork is just a future-you with a directive.

Anthropic is shipping both, gated, and watching which one wins on internal metrics. The interview answer that says "Claude Code uses Explore for exploration" is a snapshot of one branch of the experiment. The next release cycle may settle it differently.

This is exactly the kind of fact you can't get from the popular tellings, because they're written from a single observation of behavior. Reading the source gives you the experimental design.

Version caveat. Flag names (BUILTIN_EXPLORE_PLAN_AGENTS, tengu_amber_stoat, isForkSubagentEnabled) and exact gate semantics reflect the snapshot I reviewed and will drift — Anthropic ships fast. The two specific architectures (Explore, Fork) may be renamed, consolidated, or replaced by the time you read this. What's stable is the pattern: Anthropic running multiple retrieval architectures in parallel, gated, measuring against each other. That pattern outlasts any specific flag name and is what the argument here actually rests on. Verify against the current release before depending on any specific flag.

When RAG Still Wins

The cost-curve thesis predicts when RAG crosses over and dominates. Three regimes:

1. Mega-monorepos where any kind of index is amortized across millions of queries

At Google or Meta or a top-tier hedge fund — codebases large enough that every per-query latency point costs real money across the org — there is an index. But, importantly, the index they actually run is almost always symbol-graph (Kythe, Glean, Sourcegraph) rather than vector RAG, for the precision reasons in the earlier section. The index is built once, maintained by a dedicated team, amortized across the engineering population's queries forever. Per-query cost goes to single-digit milliseconds; index drift is handled.

In this regime, tool-loops are paying the same per-query cost over and over across N engineers, and the math flips toward indexing. Below that scale, the staffing cost of "a dedicated team that owns the code index" is itself a hidden cost that wipes the savings — and the index of choice at smaller scales (per-project Sourcegraph, ctags) is still not vector RAG. Vector RAG specifically tends to win in the next two regimes, not this one.

2. Pure semantic queries where there's no symbol to grep for

"Find the code that handles user-deactivation edge cases when the account is also a billing admin" — there's no specific symbol. There's a conceptual region of the code that you're looking for. A vector search over function-doc embeddings might point you to the right cluster of functions faster than the model would grep.

Claude Code's answer to this is: spawn Explore, let it iterate. That works, but it isn't free. If your workload is dominated by semantic-cluster queries (auditing, security review, refactor planning), RAG starts to pencil out.

3. When LLM context is the scarce resource — cheap, short-context model regime

This is the framing nobody else gives. Cost-curves cut both ways. If your model has a 32k context window and costs $0.50/M tokens, doing six tool round-trips for retrieval is six rounds of context consumption. A one-shot RAG hit lets you spend that context budget on reasoning. RAG dominates when per-query token cost dominates per-query latency cost.

Anthropic optimized Claude Code for the regime where context is cheap and abundant (Opus 4.7 with a 1M-token window, in some configurations) and per-query latency is what users feel. In a different regime — say, on-device coding agents over a small open-source model — the cost-curve flips and RAG is the right tool. Same first principles, different numerical answer.
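The regime difference is simple arithmetic. In the sketch below, the 32k and 1M window sizes and the six round-trips come from the text above; the 2,000 tokens per round-trip is an assumption chosen only to make the numbers concrete, and contextShare is a hypothetical helper.

```typescript
// Fraction of the context window consumed by retrieval round-trips.
function contextShare(
  windowTokens: number,
  roundTrips: number,
  tokensPerRoundTrip: number,
): number {
  return (roundTrips * tokensPerRoundTrip) / windowTokens;
}

// Assumption: each tool round-trip adds about 2,000 tokens to the context.
const TOKENS_PER_ROUND_TRIP = 2_000;

const smallWindow = contextShare(32_000, 6, TOKENS_PER_ROUND_TRIP);    // 0.375
const largeWindow = contextShare(1_000_000, 6, TOKENS_PER_ROUND_TRIP); // about 0.012
```

Under that assumption, six round-trips eat 37.5% of a 32k window but about 1.2% of a 1M window. The same retrieval strategy is a rounding error in one regime and a third of the budget in the other.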

A Decision Framework

Match your project to the column. The cost-curve answers the question for you.

| Signal | Use grep + LLM tool-loop | Use RAG | Either / hybrid |
| --- | --- | --- | --- |
| Project size: under ~1M lines | ✓ | | |
| Project size: 1M–100M lines | ✓ | | ✓ |
| Project size: 100M+ lines (mega-monorepo) | | ✓ | |
| Codebase changes daily | ✓ | | |
| Codebase is mostly static (knowledge base, archived) | | ✓ | |
| Queries are exact-symbol ("find getUserById") | ✓ | | |
| Queries are conceptual ("how does auth work") | | ✓ | ✓ |
| Model context is cheap and large | ✓ | | |
| Model context is scarce/expensive | | ✓ | |
| You don't have a team to own a code index | ✓ | | |
| You have an index team and tooling already (Sourcegraph, Glean, ctags) | | ✓ | ✓ |

The crossover doesn't happen at any one threshold; it takes multiple signals pointing in the same direction to tip the cost curve. For most non-FAANG projects, the signals sit on the grep side.

The Companion Question

This post is one half of a two-part argument about Claude Code's retrieval and memory choices. The companion — "Agent Memory Is a Cache Coherence Problem" (publishing 2026-05-28) — makes the same kind of argument about cross-session memory: why Claude Code's built-in memory is hand-written Markdown instead of vector-recalled embeddings, even with the world hyping claude-mem (70k+ GitHub stars as of May 2026) as a drop-in upgrade.

Read together, the two pieces add up to a coherent design stance. Anthropic's bet, across both axes:

Fidelity over fuzz. Both within-session retrieval (grep) and cross-session memory (CLAUDE.md) are lossless and exact. Both refuse vector approximation as the default.
Cost curves over romance. Neither choice is justified by "we trust the model." Both are justified by the math: zero build cost + zero maintain cost beats nonlinear maintain cost for the workloads they target.
Experimentation in production. Both architectures have alternative branches under active flag gating. On the retrieval side: tengu_amber_stoat for Explore-vs-no-Explore, with Fork as a parallel architecture. On the memory side: tengu_coral_fern, tengu_herring_clock, tengu_passport_quail, tengu_slate_thimble, plus the build-time gates KAIROS, TEAMMEM, and EXTRACT_MEMORIES — all visible in src/memdir/. The pattern is the same: ship a default, leave the toggles in, keep measuring.

The shape of the design is a careful refusal to lock in. The cost curves favor the current choices today. They might not in a year. The system is built to flip.

That, finally, is the answer the interview question is fishing for. Why not RAG? Because the cost curves don't justify it for this workload, and Anthropic has the engineering culture to refuse the cargo-cult. Will it always be that way? No — and the feature flags in the source say so out loud.

Companion piece: Agent Memory Is a Cache Coherence Problem (publishing 2026-05-28)
Background: Consistency in Distributed Systems: Scenarios, Trade-offs, and What Actually Works
For a focused walk through the loop file: Claude Code Deep Dive Part 2: The 1,421-Line While Loop

Top comments (3)

Valentin Monteiro • May 30

Cost-curve is the right lens for this. The variable I'd put at the center is how often the corpus changes, not its size. grep wins on code because the filesystem is a zero-staleness index, you never pay to keep it in sync. The day a corpus stops moving and gets queried a lot (docs, policies, a frozen knowledge base) the amortization flips and a vector index starts earning its keep. Read/write ratio of the corpus, not line count, is what decides it.

