DEV Community

Nicola Alessi

I tracked every token my AI coding agent consumed for a week. 70% was waste.

Last week Anthropic announced tighter usage limits for Claude during peak hours. My timeline exploded with developers asking why they're hitting limits after 2-3 prompts.

I'm the developer behind vexp, a local context engine for AI coding agents. Before building it, I did something nobody seems to do: I actually measured what's happening under the hood.

The experiment

I tracked token consumption on FastAPI v0.115.0 — the real open-source framework, ~800 Python files. Not a toy project.

7 tasks (bug fixes, features, refactors, code understanding). 3 runs per task. 42 total executions. Claude Sonnet 4.6. Full isolation between runs.

What I found

On every single prompt, Claude Code did this:

  1. Glob pattern * — found all files
  2. Glob pattern **/*.{py,js,ts,...} — found code files
  3. Read file 1
  4. Read file 2
  5. Read file 3
  6. ...repeat 20+ times
  7. Finally start thinking about my actual question

Average per prompt:

  • 23 tool calls (Read/Grep/Glob)
  • ~180,000 tokens consumed
  • ~50,000 tokens actually relevant to the question
  • 70% waste rate
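A quick sanity check on those averages, using the rounded per-prompt figures above:

```python
# Rounded per-prompt averages from the benchmark above.
total_tokens = 180_000      # tokens consumed per prompt
relevant_tokens = 50_000    # tokens actually relevant to the question

wasted = total_tokens - relevant_tokens
waste_rate = wasted / total_tokens

print(f"{wasted:,} wasted tokens per prompt ({waste_rate:.0%} waste)")
# With these rounded inputs the rate comes out to ~72%; the text rounds to 70%.
```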

That 70% is why you're hitting usage limits. You're not asking too many questions. Your agent is reading too many files.

Why this happens

AI coding agents don't have a map of your codebase. They don't know which files are relevant to your question before they start reading. So they do what any new developer would do on their first day: read everything.

The difference is that a new developer reads the codebase once. Your AI agent reads it on every single prompt.

And it gets worse. As your session continues, context accumulates. By turn 15, each prompt re-processes your full conversation history plus all the codebase reads. Per-prompt cost grows with every turn, so the session's total cost grows quadratically, not linearly.
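A toy model makes the accumulation concrete. The numbers here are hypothetical, not from the benchmark; the assumption is simply that each turn adds a fixed amount of new content and the agent re-reads the full history on every prompt:

```python
# Toy model of context accumulation across a session.
# NEW_TOKENS_PER_TURN is a hypothetical figure for illustration only.
NEW_TOKENS_PER_TURN = 10_000  # codebase reads + conversation added each turn

def session_cost(turns: int) -> tuple[int, int]:
    """Return (input tokens on the last prompt, cumulative input tokens)."""
    cumulative = 0
    history = 0
    for _ in range(turns):
        history += NEW_TOKENS_PER_TURN
        cumulative += history  # each prompt re-reads everything so far
    return history, cumulative

last, total = session_cost(15)
print(f"turn 15 prompt: {last:,} input tokens; session total: {total:,}")
# Per-prompt cost grows linearly with turn count; the session total grows ~n^2/2.
```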

What actually helps

Free fixes (do these today):

  1. Scope your prompts. "Fix the auth error in src/auth/login.ts" triggers 3-5 file reads. "Fix the auth error" triggers 20+.

  2. Short sessions. Start a new session for each task. Don't do 15 things in one conversation.

  3. Use /compact before context bloats. Don't wait for auto-compaction at 167K tokens.

  4. Audit your MCPs. Every loaded MCP server adds token overhead on every prompt, even when you don't use it.

  5. Use /model opusplan: plan with Opus, implement with Sonnet.

These get you 20-30% savings. The structural fix gets you 58-74%.

What I built

The idea: instead of letting the agent explore your codebase file-by-file, pre-index the project and serve only the relevant code per query.

I built this as an MCP server called vexp. Rust binary, tree-sitter AST parsing, dependency graph, SQLite. Runs 100% locally. Your code never leaves your machine.
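The core idea fits in a few lines. To be clear, this is not vexp's implementation (that's a Rust binary with tree-sitter and a dependency graph); it's a minimal Python sketch of "index once, serve only relevant files", using the stdlib `ast` module as a stand-in parser and an in-memory dict as a stand-in for the SQLite store:

```python
import ast
from pathlib import Path

def build_index(root: str) -> dict[str, set[str]]:
    """Map each function/class name to the files that define it."""
    index: dict[str, set[str]] = {}
    for path in Path(root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files we can't parse
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                index.setdefault(node.name, set()).add(str(path))
    return index

def relevant_files(index: dict[str, set[str]], query: str) -> set[str]:
    """Return only the files defining symbols mentioned in the query."""
    words = {w.strip(".,()") for w in query.split()}
    return set().union(*[index.get(w, set()) for w in words])

# Usage: index once, then serve a handful of files per prompt instead of 20+.
# index = build_index("path/to/repo")
# relevant_files(index, "Fix the bug in create_user")
```

The payoff is in the second function: instead of the agent globbing and reading the whole tree on every prompt, it asks the index and gets back only the files that matter.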

Here's what changed on the FastAPI benchmark:

  Metric             Before   After    Change
  Tool calls/task    23       2.3      -90%
  Cost/task          $0.78    $0.33    -58%
  Output tokens      504      189      -63%
  Task duration      170s     132s     -22%

Total across 42 runs: $16.29 without vexp, $6.89 with.
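Those totals line up with the per-task table, assuming the 42 runs split into 21 with vexp and 21 without (7 tasks × 3 runs per condition, which is my reading of the setup):

```python
runs_per_condition = 42 // 2          # assumed: 21 runs with vexp, 21 without
total_without, total_with = 16.29, 6.89

per_task_without = total_without / runs_per_condition  # ~$0.78
per_task_with = total_with / runs_per_condition        # ~$0.33
savings = 1 - total_with / total_without               # ~58%

print(f"${per_task_without:.2f}/task -> ${per_task_with:.2f}/task ({savings:.0%} saved)")
# prints: $0.78/task -> $0.33/task (58% saved)
```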

The output token drop surprised me. Claude doesn't just read less — it generates less irrelevant output too. Focused input context leads to focused responses. I didn't design for that, but it makes sense: less noise in, less noise out.

The output quality didn't drop. It improved.

I also ran this on SWE-bench Verified — 100 real GitHub bugs, Claude Opus 4.5, same $3 budget per task:

  • 73% pass rate (highest in the lineup)
  • $0.67/task vs $1.98 average
  • 8 bugs only vexp solved

Same model. Same budget. The only variable was context quality.

What this means for the usage limits debate

Everyone's arguing about whether Anthropic should raise limits or lower prices. Both miss the point.

The real issue is architectural: AI coding agents don't know your codebase. They compensate by reading everything. You pay for that compensation with tokens — and now, with tighter session limits.

Cheaper tokens help. Higher limits help. But reducing what goes into the context window in the first place is the only fix that works regardless of what Anthropic does with pricing or limits.

Full benchmark data (open source, reproducible): https://vexp.dev/benchmark

FastAPI methodology: https://www.reddit.com/r/ClaudeCode/comments/1rjra2w/i_built_a_context_engine_that_works_with_claude/

Free tier available, no account needed. I'm curious what numbers you see on your own projects — especially on repos larger than FastAPI.
