Hoyin kyoma
Why AI Coding Agents Waste 30% of Their Tokens — And How to Fix It

The Hidden Cost of Blind Agents

Every AI coding agent has the same workflow: receive a task, search the codebase, read files, write code. The problem is step 2. The agent doesn't know the codebase. It doesn't know the architecture. So it searches.

And searches. And searches.

We analyzed token usage across 500 SWE-bench Verified instances and found that agents spend approximately 30-40% of their tokens on exploration — reading files that turn out to be irrelevant, following import chains that lead nowhere, and backtracking from wrong approaches.

This isn't a model problem. GPT-5, Claude Opus, Gemini — they all do it. The issue is structural: the agent lacks a map of the codebase.

[Insert pie chart: "Token Breakdown — Agent Without Architectural Context"]


A Real Example: Django Bug #16379

Let's trace through a real bug to see this in action.

Bug: FileBasedCache crashes with FileNotFoundError when multiple processes access the cache simultaneously.

What a human developer does:

  1. Reads the issue — understands it's a race condition in the file cache backend
  2. Knows (from experience) that Django's cache backends inherit from BaseCache
  3. Opens django/core/cache/backends/filebased.py directly
  4. Checks the delete() and _cull() methods for file operations without proper locking
  5. Writes a fix: wrap the os.remove() call in a try/except for FileNotFoundError
  6. Done. ~5 minutes, ~3 files read.
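
The fix in step 5 is small. A minimal Python sketch of the idea (safe_remove is an illustrative helper, not Django's actual method):

import os

def safe_remove(path):
    """Delete a cache file, tolerating a concurrent delete by another process."""
    try:
        os.remove(path)
    except FileNotFoundError:
        # Another process removed the file between the existence check and
        # this call; the entry is gone either way, so swallow the race.
        pass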

What an AI agent does (without context):

  1. Reads the issue
  2. Searches for "FileNotFoundError" — finds 47 matches across the codebase
  3. Opens django/core/files/storage.py — wrong file
  4. Opens django/core/files/base.py — wrong file
  5. Searches for "FileBasedCache" — finds it
  6. Opens django/core/cache/backends/filebased.py — right file
  7. Reads the whole file but doesn't understand the inheritance from BaseCache
  8. Writes a fix that handles the error but doesn't respect the cache contract
  9. Test fails
  10. Opens django/core/cache/backends/base.py to understand the base class
  11. Opens django/core/cache/__init__.py to understand the cache framework
  12. Rewrites the fix
  13. Test passes. ~20 minutes, ~12 files read, ~4,000 tokens.

What an AI agent does (with XCE):

  1. Reads the issue
  2. Calls xce_get_context("FileBasedCache FileNotFoundError concurrent access")
  3. Gets back: the cache backend hierarchy, the file operations in filebased.py, the locking patterns, and the test infrastructure
  4. Understands the architecture immediately
  5. Writes the correct fix on the first attempt
  6. Test passes. ~3 minutes, ~3 files read, ~1,500 tokens.

[Insert flowchart: "File Access Pattern Comparison"]

The token savings compound across hundreds of tasks. On our 500-instance benchmark run, XCE reduced total token usage by approximately 20%.


Why Embeddings Aren't Enough

The obvious solution is "just use code search." Tools like Greptile, Sourcegraph Cody, and GitHub Copilot all offer some form of code search. Most use embedding-based retrieval: convert code to vectors, find the most similar vectors to the query.

This works for simple lookups. "Find the login function" → returns the login function. But it fails for architectural questions:

| Question | Embedding Search | Architectural Context |
|---|---|---|
| "Which module owns this logic?" | Returns similar code snippets | Returns the HLD module, its role, and its boundaries |
| "What depends on this function?" | Returns functions with similar names | Returns the call graph and downstream consumers |
| "If I change this, what breaks?" | Returns similar code (not dependent code) | Returns impact analysis with affected modules |
| "How does this fit in the architecture?" | Returns nearby code | Returns the HLD → LLD → code hierarchy |

The fundamental issue: embeddings measure text similarity, not structural relationships. Two functions can be textually similar but architecturally unrelated. Two functions can be textually different but tightly coupled through a call chain.
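
A toy Python illustration of that gap (the functions and strings are invented for the example):

from collections import Counter
import math

def cosine(a, b):
    """Bag-of-words cosine similarity: a crude stand-in for an embedding model."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

validate_user  = "check fields and raise ValidationError on bad input"
validate_order = "check fields and raise ValidationError on bad input"
charge_card    = "submit the amount to the payment gateway"

print(cosine(validate_user, validate_order))  # 1.0: textually identical
print(cosine(validate_order, charge_card))    # 0.0: textually unrelated

# The call graph tells the opposite story: charging depends on order
# validation, while the two look-alike validators never interact.
calls = {"charge_card": {"validate_order"}, "validate_user": set()}
print("validate_order" in calls["charge_card"])  # True: tightly coupled

An embedding index puts the two validators next to each other and the payment code far away; the dependency structure says the opposite.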

[Insert diagram: "Embedding Search vs. Architectural Context"]


The Architecture Gap Across Repositories

We measured XCE's effect on SWE-bench Verified resolve rates across five major open-source repositories. The results reveal a clear pattern:

| Repository | Architecture Type | Baseline | With XCE | Delta |
|---|---|---|---|---|
| sympy | Deep module dependencies | 45% | 62% | +17% |
| scikit-learn | Complex inheritance chains | 58% | 71% | +13% |
| matplotlib | Multi-backend rendering pipeline | 52% | 65% | +13% |
| django | Layered MVC + ORM + middleware | 62% | 74% | +12% |
| pytest | Plugin system (relatively flat) | 70% | 78% | +8% |

sympy (+17%): The largest improvement. Sympy has deep cross-module dependencies. A bug in sympy/core/expr.py might require understanding sympy/simplify/, sympy/printing/, sympy/polys/, and sympy/series/. Without a map, the agent gets lost in the dependency maze. With XCE, it knows which modules are structurally related before it starts exploring.

scikit-learn (+13%): Complex estimator inheritance. BaseEstimator → ClassifierMixin → LinearClassifierMixin → LogisticRegression. A bug in LogisticRegression.fit() might actually be in LinearClassifierMixin._fit() or even BaseEstimator.set_params(). The agent needs to understand the full inheritance chain to find the right place to fix it.
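
Python's MRO makes this concrete. Assuming scikit-learn is installed, you can inspect where a method in that chain actually lives:

# Assumes scikit-learn is installed.
from sklearn.linear_model import LogisticRegression

def defining_class(cls, name):
    """Walk the MRO and return the first class that actually defines `name`."""
    return next(c for c in cls.__mro__ if name in vars(c))

print([c.__name__ for c in LogisticRegression.__mro__])
print(defining_class(LogisticRegression, "predict"))  # defined on a mixin, not on LogisticRegression itself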

pytest (+8%): The smallest improvement. Pytest has a plugin system that's complex, but most bugs are localized to a single file or module. The agent doesn't need as much architectural context because the architecture is relatively flat.

[Insert scatter plot: "Architectural Complexity vs. XCE Improvement"]

The correlation is strong: the more architecturally complex the codebase, the more the agent benefits from having a structural map.

This has a practical implication: if your codebase is a simple CRUD app with flat architecture, XCE helps modestly. If your codebase is a complex system with deep module dependencies, layered abstractions, and cross-cutting concerns — XCE helps dramatically.
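
One crude way to gauge where your codebase sits on that spectrum is the cross-subpackage fan-out of its internal imports. A standard-library-only sketch (the metric and names here are ours, not part of XCE; relative imports are ignored):

import ast
import pathlib

def cross_module_fanout(root, package):
    """Average number of internal submodules/subpackages each module imports from.

    A rough coupling proxy: higher values mean fixes are more likely to
    span subpackages, which is where a structural map pays off most.
    """
    counts = []
    for path in pathlib.Path(root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue
        subpkgs = set()
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [a.name for a in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                names = [node.module]
            else:
                continue
            for name in names:
                parts = name.split(".")
                if parts[0] == package and len(parts) > 1:
                    subpkgs.add(parts[1])
        counts.append(len(subpkgs))
    return sum(counts) / len(counts) if counts else 0.0

# e.g. compare cross_module_fanout("sympy/sympy", "sympy")
#      against  cross_module_fanout("pytest/src/_pytest", "_pytest")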


How XCE Works

XCE uses the proprietary PRAT algorithm to build a structured codebase index that captures architectural relationships — not just code text. Unlike embedding-based search, PRAT understands structural connections between components at multiple levels of abstraction.

When an agent queries XCE, it gets back a structured response that includes: what module the code belongs to, what its role is in the system, what depends on it, and what it depends on. The agent doesn't just know where the code is — it knows why it exists and how it connects to the rest of the system.
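
XCE's exact schema isn't shown here, but conceptually the payload carries fields like these (names and values are illustrative, not the actual API):

# Illustrative shape only; not the real XCE response schema.
context = {
    "query": "FileBasedCache FileNotFoundError concurrent access",
    "module": "django.core.cache.backends.filebased",
    "role": "file-system cache backend, one of several BaseCache implementations",
    "depends_on": ["django.core.cache.backends.base"],
    "dependents": ["django.core.cache (backend loading)"],
    "related_files": [
        "django/core/cache/backends/filebased.py",
        "django/core/cache/backends/base.py",
        "tests/cache/tests.py",
    ],
}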

This is served via MCP, so any MCP-compatible agent gets architectural context through ordinary tool calls, with no changes to the agent itself.


Practical Setup

XCE runs as an MCP service. Setup takes two steps: index your repo, then point your agent at the server.

# Index your repo (one command)
npx xanther-cli init --api-key YOUR_KEY

This indexes the codebase and installs a git hook that auto-syncs after every commit. Then add to your agent's MCP config:

{
  "mcpServers": {
    "xanther-xce": {
      "url": "https://mcp.xanther.ai/sse",
      "headers": { "Authorization": "Bearer YOUR_KEY" }
    }
  }
}

Works with Claude Code, Kiro, Cursor, OpenCode, Windsurf — any MCP-compatible tool.

The agent gets five tools:

  • xce_get_context — Full architectural context for a problem statement
  • xce_search — Semantic search across the codebase
  • xce_architecture_context — Architecture around a specific file or symbol
  • xce_trace — Trace relationships from code to design artifacts
  • xce_impact_analysis — What breaks if you change specific files
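
Outside of an agent, you can also hit the tools directly. A sketch using the MCP Python SDK (the argument schema for xce_impact_analysis is our guess, not documented here):

# Assumes the `mcp` Python SDK (pip install mcp).
import asyncio

from mcp import ClientSession
from mcp.client.sse import sse_client

async def main():
    headers = {"Authorization": "Bearer YOUR_KEY"}
    async with sse_client("https://mcp.xanther.ai/sse", headers=headers) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()  # should surface the five xce_* tools
            result = await session.call_tool(
                "xce_impact_analysis",
                {"files": ["django/core/cache/backends/filebased.py"]},
            )
            print(result.content)

asyncio.run(main())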

The Bottom Line

AI coding agents are getting better every quarter. But the bottleneck isn't model capability — it's context quality. A cheap model with the right context outperforms an expensive model without it.

The numbers:

  • 78.2% on SWE-bench Verified with MiniMax M2.5 + XCE (beats every model on the official leaderboard)
  • 20% token reduction per task (fewer wrong turns, less exploration)
  • $0.22 per instance (16x cheaper than Claude Opus)

Context is cheaper than compute. And it compounds: better models + better context = better results than either alone.


Xanther is in open beta. Free tier: 3 repos, 100 queries.
