Nicola Alessi

Posted on Feb 26

How I Cut My AI Coding Agent's Token Usage by 65% (Without Changing Models)

#ai #productivity #vscode #devtools

I've been using Claude Code on a 200-file TypeScript project. The model is great. The token bill was not.

The problem wasn't the model — it was what I was feeding it. Every session, the agent would read 30-40 files trying to orient itself before doing any actual work. Same files, same discoveries, same wasted tokens. Every single time.

After a lot of trial and error, I got my average input tokens per query from about 8,200 down to 2,100. Here's what worked, step by step.

Step 1: Write a real CLAUDE.md (not a vague one)

Most people write something like:

This is a TypeScript project using Express and React.
Please follow best practices.

This tells the agent almost nothing. It's going to read your whole codebase anyway.

What actually works is being specific about decisions, not descriptions:

## Auth
- Auth uses middleware in src/auth/middleware.ts
- JWT tokens, not sessions. Refresh token rotation in src/auth/refresh.ts
- DO NOT touch src/auth/legacy.ts — deprecated, will be removed Q2

## Database
- Prisma ORM, schema in prisma/schema.prisma
- All migrations must be backward-compatible
- Connection pooling handled by src/db/pool.ts, do not create new connections

## Conventions
- All API handlers in src/handlers/, one file per resource
- Error handling through src/lib/errors.ts, do not use try/catch in handlers
- Tests mirror src/ structure in tests/

The key: tell the agent what it would otherwise spend 10 minutes figuring out. Decisions, not descriptions. "We use Express" is useless. "Auth uses JWT with refresh rotation in this specific file" saves the agent from reading your entire auth directory.

Impact: about 20% token reduction. Significant, but not enough.

Step 2: Stop letting the agent grep your whole project

Here's what happens when you ask "how does authentication work in this project" without any context management:

1. The agent searches for "auth" across the codebase.
2. It gets 40+ hits across middleware, tests, configs, legacy code, and node_modules if you're unlucky.
3. It reads 15-20 files to piece together the picture.
4. It burns 8,000+ tokens before writing a single line of code.

The agent doesn't need 40 files. It needs the auth middleware, the two files it depends on, and the two files that depend on it. That's 5 files.

The question is: how do you give the agent the right 5 files instead of all 40?

This is where I stopped being able to solve it with prompting alone.

Step 3: Give the agent a dependency graph

I built a tool called vexp that pre-computes a dependency graph of your codebase at the AST level. Not grep, not text search — actual parsed relationships: who imports what, who calls what, what types flow where.

When the agent asks about authentication, instead of grep-matching "auth" across 40 files, it gets the relevant subgraph: the auth function, its dependencies, and its dependents, packed into a token budget you control.
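To make the idea concrete, here is a minimal sketch of subgraph selection under a token budget. It is not vexp's implementation: it assumes the import graph has already been extracted into a plain map, and the file names, token counts, and the `selectContext` and `dependentsOf` helpers are all hypothetical.

```typescript
// Hypothetical sketch: the names and numbers here are assumptions, not vexp's API.
type ImportGraph = Record<string, string[]>; // file -> files it imports

// Direct dependents: every file whose import list contains `target`.
function dependentsOf(graph: ImportGraph, target: string): string[] {
  return Object.keys(graph).filter(
    (file) => (graph[file] || []).indexOf(target) !== -1,
  );
}

// Take the target, then its direct dependencies, then its direct dependents,
// skipping any file that would push the total past the token budget.
function selectContext(
  graph: ImportGraph,
  tokenCounts: Record<string, number>,
  target: string,
  budget: number,
): string[] {
  const candidates = [target].concat(graph[target] || [], dependentsOf(graph, target));
  const selected: string[] = [];
  let used = 0;
  for (const file of candidates) {
    if (selected.indexOf(file) !== -1) continue; // already included
    const cost = tokenCounts[file] || 0;
    if (used + cost > budget) continue; // does not fit, try the next candidate
    selected.push(file);
    used += cost;
  }
  return selected;
}

const exampleGraph: ImportGraph = {
  "src/auth/middleware.ts": ["src/auth/refresh.ts", "src/lib/errors.ts"],
  "src/auth/refresh.ts": [],
  "src/lib/errors.ts": [],
  "src/handlers/users.ts": ["src/auth/middleware.ts"],
  "src/handlers/orders.ts": ["src/auth/middleware.ts"],
  "src/handlers/health.ts": [],
};
const exampleTokens: Record<string, number> = {
  "src/auth/middleware.ts": 600,
  "src/auth/refresh.ts": 400,
  "src/lib/errors.ts": 300,
  "src/handlers/users.ts": 500,
  "src/handlers/orders.ts": 500,
  "src/handlers/health.ts": 100,
};

console.log(selectContext(exampleGraph, exampleTokens, "src/auth/middleware.ts", 2000));
```

With a budget of 2,000 tokens, this example keeps the middleware, its two dependencies, and one dependent. The second dependent is dropped because it would exceed the budget, and the unrelated health handler is never considered. A real tool would also have to rank candidates and follow call and type relationships, which this sketch ignores.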

Before (grep approach):

Agent reads: 40 files, 8,247 tokens
Relevant files: 5
Wasted: about 80% of input tokens

After (dependency graph):

Agent reads: capsule with 5 relevant nodes, 2,140 tokens
Relevant files: 5
Wasted: near zero

Same information, 74% fewer tokens.
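The percentage follows directly from the two token counts above. As a quick check (the `reductionPercent` helper is just an illustration):

```typescript
// Percentage of input tokens saved when going from `before` to `after`.
function reductionPercent(before: number, after: number): number {
  return ((before - after) / before) * 100;
}

// Figures from the before/after comparison above.
console.log(reductionPercent(8247, 2140).toFixed(1)); // prints "74.1"
```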

Step 4: Solve the "amnesia" problem

Token reduction is half the problem. The other half: every new session starts from zero.

Monday the agent spends 20 minutes discovering that your payment module has a non-obvious dependency on a legacy Redis cache. Tuesday, new session, same 20 minutes. Wednesday, same again.

I tried every approach to make agents save their own notes:

"After completing a task, save your observations" — ignored 90% of the time
Detailed save instructions in CLAUDE.md — maybe 15% compliance
Making it a "required step" — agent writes "completed successfully, no issues" and moves on

From what I can tell, the models are optimized for completing the current task. A note that only benefits future sessions does nothing for the work in the current context window, so the incentive structure works against you.

What actually worked: passive observation. Instead of asking the agent to save things, watch what it does. Track which files it reads, what changes it makes at the AST level, and infer observations from its behavior. The agent that spent 20 minutes on your Redis dependency didn't save a note about it — but the tool call pattern and code changes tell you exactly what it learned.
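As a rough illustration of passive observation, the sketch below turns a session's tool-call log into candidate observations. The event shape, the file names, and the heuristic (record which files the agent read before each edit) are all assumptions made for this example, not vexp's actual data model.

```typescript
// Hypothetical sketch: the event shape and the heuristic are assumptions.
interface ToolEvent {
  tool: "read" | "edit";
  file: string;
}

interface Observation {
  editedFile: string;
  filesReadFirst: string[]; // other files the agent opened before this edit
}

// Walk the session's tool calls in order. Each edit yields one observation
// listing the distinct files the agent had already read at that point.
function inferObservations(events: ToolEvent[]): Observation[] {
  const readSoFar: string[] = [];
  const observations: Observation[] = [];
  for (const event of events) {
    if (event.tool === "read") {
      if (readSoFar.indexOf(event.file) === -1) readSoFar.push(event.file);
    } else {
      observations.push({
        editedFile: event.file,
        filesReadFirst: readSoFar.filter((file) => file !== event.file),
      });
    }
  }
  return observations;
}

const exampleSession: ToolEvent[] = [
  { tool: "read", file: "src/payments/charge.ts" },
  { tool: "read", file: "src/cache/legacy-redis.ts" },
  { tool: "edit", file: "src/payments/charge.ts" },
];

console.log(inferObservations(exampleSession));
```

Here the single edit to the payment file is recorded together with the legacy Redis file the agent read beforehand, which is the kind of non-obvious link worth surfacing in a later session.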

These observations are linked to nodes in the code graph. When the underlying code changes, the linked observations are automatically flagged as stale, so the agent isn't handed outdated context as if it were current.

Session 1: Agent discovers Redis dependency — observation saved passively
Session 2: Agent gets the observation immediately — skips the 20-minute rediscovery
Session 3: Someone refactors the Redis cache out — observation flagged stale — agent re-explores
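One simple way to get this staleness behavior is to store a fingerprint of the linked code alongside each observation and compare it later. The sketch below assumes that approach. The types, the helper names, and the toy hash are hypothetical and stand in for whatever the real tool uses.

```typescript
// Hypothetical sketch: names and the hashing scheme are assumptions.
interface LinkedObservation {
  note: string;
  nodeId: string; // graph node the note is attached to
  fingerprintAtSave: number; // hash of the node's source when the note was saved
}

// Simple 32-bit polynomial string hash; enough to detect edits in a sketch.
function fingerprint(source: string): number {
  let hash = 0;
  for (let i = 0; i < source.length; i++) {
    hash = (hash * 31 + source.charCodeAt(i)) >>> 0;
  }
  return hash;
}

function saveObservation(note: string, nodeId: string, source: string): LinkedObservation {
  return { note, nodeId, fingerprintAtSave: fingerprint(source) };
}

// An observation is stale when its node is gone or its source hashes differently.
function isStale(
  observation: LinkedObservation,
  currentSources: Record<string, string | undefined>,
): boolean {
  const source = currentSources[observation.nodeId];
  return source === undefined || fingerprint(source) !== observation.fingerprintAtSave;
}

const originalSource = "const cached = legacyRedis.get(key);";
const redisNote = saveObservation(
  "Payment module depends on the legacy Redis cache",
  "payments.chargeCard",
  originalSource,
);

// Session 2: code unchanged, so the note can be reused.
console.log(isStale(redisNote, { "payments.chargeCard": originalSource }));
// Session 3: the node was refactored away, so the note is flagged.
console.log(isStale(redisNote, {}));
```

Hashing the node's source rather than the whole file means unrelated edits elsewhere in the file would not invalidate the note, though a production tool would want a stronger hash than this one.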

The combined result

| Metric | Before | After |
| --- | --- | --- |
| Avg input tokens/query | 8,200 | 2,100 |
| Session start orientation | 5-10 min | under 30 sec |
| Repeated discoveries | Every session | Once |
| Token reduction | n/a | 65-74% |

On a practical level this means:

- If you're on Claude Max/Pro: 2-3x more work before hitting usage caps
- If you're on API: direct cost savings on input tokens
- On any plan: the agent starts working immediately instead of spending the first 10 minutes reading

Setup

vexp works as a VS Code extension or standalone CLI. It's an MCP server, so it works with any agent that speaks MCP: Claude Code, Cursor, Windsurf, Cline, Roo Code, Copilot, aider, Codex.

# VS Code: search for "vexp" in the extension marketplace

# CLI (for Claude Code and other terminal agents)
npm install -g vexp-cli

Free tier: 2,000 nodes, full memory tools, no account needed. Runs 100% locally: single Rust binary, SQLite, zero network calls.

Pro ($19/mo): multi-repo support, 50k nodes, priority updates.

What I'd do if I were starting today

1. Write a specific CLAUDE.md: decisions, not descriptions. 30 minutes, about 20% improvement.
2. Set up a dependency graph and stop letting the agent grep. This is where the real token savings are.
3. Let memory accumulate. Don't try to make the agent save notes; observe passively and let the context build itself over 3-4 sessions.

The first step is free and takes 30 minutes. The rest takes about 5 minutes to install.

I'm the developer behind vexp. Happy to answer questions about the architecture, MCP integration, or anything else in the comments.

Top comments (1)

Harjot Singh • May 31

"Without changing models" is the most underrated framing in this whole genre, because it proves the obvious move (swap to a cheaper model) is often NOT where the savings are - 65% came from feeding the same model less garbage. For coding agents specifically the usual wins are: stop dumping whole files when a function or a few lines would do, don't re-attach the entire repo context every turn, and trim the tool-output noise (a 2000-line test log compressed to the 5 lines that failed). Same intelligence, a fraction of the tokens.

This matches what I see exactly: context discipline beats model selection more often than people expect, because the cheapest token is the one you never send. That's the first lever I reach for in Moonshift (a multi-agent pipeline that ships a prompt to a deployed SaaS) - tight scoped context per agent is a big part of why a full build holds at ~$3 flat, before routing even enters the picture. Really practical writeup. Of the 65%, how much came from context/file scoping vs trimming tool output? For coding agents I'd guess file-context scoping was the dominant lever, but tool-output bloat is the sneaky one.