akshay sharma

Posted on Jun 10

I cut my coding agent's token usage 61% by giving it a code graph

#typescript #ai #mcp #devtools

My coding agent has a goldfish memory. Every session runs the same way: I ask, "Who calls parseToken?" and it opens seven files, reads forty kilobytes, and, half a minute later, tells me something it could have told me on day one if it remembered the shape of my code.

It never remembers. Every conversation starts from zero, so it greps, reads, burns tokens, and every so often invents a function name that was never there.

I got tired of paying for that, so I built GraphPilot. It's a local MCP server that indexes your TypeScript/JavaScript repo once and lets the agent query its structure instead of re-reading files. Here's what I found measuring it.

The actual problem

Coding agents reason well and remember nothing. The structural questions they ask all day ("where is this defined?", "who calls it?", "what breaks if I change it?") have exact answers sitting in the code, but the agent re-derives them from raw text every time, because nothing survives between sessions.

You pay for that three ways. The obvious one is tokens: re-reading files to answer a question you already answered is pure waste. Then there's accuracy: grep finds the string "save" without knowing which save you meant, so the agent guesses. And the one that actually worries me is refactor safety. "What does renaming this break?" is the question that matters most, and it is exactly the one a file-by-file agent is worst at.

GraphPilot fills the gap as persistent structural memory.

How it works

The CLI parses your repo with tree-sitter and builds a graph. Every function, class, method, interface, type, and enum becomes a node; every call becomes an edge. The graph gets written to ~/.graphpilot/, and an MCP server hands it to the agent over stdio.
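To make the node-and-edge model concrete, here is a minimal sketch of what such a graph could look like in memory. The type names, the "file#name" id scheme, and the sample symbols are assumptions made for illustration; the post does not describe GraphPilot's actual schema.

```typescript
// Hypothetical shapes for illustration; not GraphPilot's real schema.
type SymbolKind = "function" | "class" | "method" | "interface" | "type" | "enum";

interface SymbolNode {
  id: string; // unique key, assumed here to be "file#name"
  name: string;
  kind: SymbolKind;
  file: string;
  line: number;
}

interface CallEdge {
  from: string; // id of the calling symbol
  to: string; // id of the called symbol
}

interface CodeGraph {
  nodes: SymbolNode[];
  edges: CallEdge[];
}

// Build a reverse index (callee id -> caller ids) once,
// so "who calls X?" becomes a map lookup instead of a file scan.
function buildCallerIndex(graph: CodeGraph): Map<string, string[]> {
  const callers = new Map<string, string[]>();
  for (const { from, to } of graph.edges) {
    const list = callers.get(to) ?? [];
    list.push(from);
    callers.set(to, list);
  }
  return callers;
}

// Made-up sample data.
const sampleGraph: CodeGraph = {
  nodes: [
    { id: "src/auth.ts#parseToken", name: "parseToken", kind: "function", file: "src/auth.ts", line: 12 },
    { id: "src/auth.ts#authenticate", name: "authenticate", kind: "function", file: "src/auth.ts", line: 40 },
    { id: "src/session.ts#refresh", name: "refresh", kind: "function", file: "src/session.ts", line: 8 },
  ],
  edges: [
    { from: "src/auth.ts#authenticate", to: "src/auth.ts#parseToken" },
    { from: "src/session.ts#refresh", to: "src/auth.ts#parseToken" },
  ],
};

const callerIndex = buildCallerIndex(sampleGraph);
console.log(callerIndex.get("src/auth.ts#parseToken"));
```

The reverse index is the piece that pays off: it is computed once at index time and then answers every caller query without touching the source files again.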

Four tools:

gp_recall finds where a symbol is defined
gp_callers lists who calls a symbol (the reverse lookup)
gp_impact computes the blast radius of a change, depth-bounded
gp_index re-indexes after edits
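
A depth-bounded blast radius like the one gp_impact reports can be sketched as a breadth-first walk up the caller edges. This is one plausible way to compute it, not necessarily how GraphPilot does; the function name, the symbol names, and the index contents are all invented for the example.

```typescript
// Reverse call edges: callee -> direct callers. All names are made up.
const callersOf = new Map<string, string[]>([
  ["parseToken", ["authenticate", "refreshSession"]],
  ["authenticate", ["handleRequest"]],
  ["refreshSession", ["handleRequest"]],
  ["handleRequest", ["startServer"]],
]);

// Breadth-first walk up the caller graph, stopping after maxDepth hops.
// Returns each affected symbol with the hop count at which it was first reached.
function impact(
  symbol: string,
  maxDepth: number,
  index: Map<string, string[]>,
): Map<string, number> {
  const reached = new Map<string, number>();
  let frontier: string[] = [symbol];
  for (let depth = 1; depth <= maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const callee of frontier) {
      for (const caller of index.get(callee) ?? []) {
        // Skip the starting symbol (cycles) and anything already recorded.
        if (caller === symbol || reached.has(caller)) continue;
        reached.set(caller, depth);
        next.push(caller);
      }
    }
    frontier = next;
  }
  return reached;
}

console.log(impact("parseToken", 2, callersOf));
```

With a depth of 2, the walk reaches authenticate and refreshSession at one hop and handleRequest at two, and stops before startServer. The depth bound is what keeps the answer small on a heavily used utility function.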

Every response carries a file:line @ sha anchor, so the agent can quote its source and you can jump straight to the line. And when a name is ambiguous, say two files both export a symbol named save, the answer says so instead of quietly picking one and pretending it was sure.
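
Here is a rough sketch of a lookup that reports ambiguity instead of guessing, with each hit formatted as a file:line @ sha anchor. The result shape, the function names, and the placeholder sha are assumptions for illustration; the real gp_recall response format may differ.

```typescript
interface Definition {
  name: string;
  file: string;
  line: number;
}

// Hypothetical result shape: the caller can always tell a unique hit from an ambiguous one.
type RecallResult =
  | { status: "not_found" }
  | { status: "unique"; anchor: string }
  | { status: "ambiguous"; anchors: string[] };

// Format the "file:line @ sha" anchor.
function formatAnchor(def: Definition, sha: string): string {
  return `${def.file}:${def.line} @ ${sha}`;
}

function recall(name: string, defs: Definition[], sha: string): RecallResult {
  const matches = defs.filter((d) => d.name === name);
  const first = matches[0];
  if (first === undefined) return { status: "not_found" };
  if (matches.length === 1) return { status: "unique", anchor: formatAnchor(first, sha) };
  // More than one definition: return all of them rather than picking one.
  return { status: "ambiguous", anchors: matches.map((d) => formatAnchor(d, sha)) };
}

// Made-up definitions; "abc1234" stands in for a commit sha.
const definitions: Definition[] = [
  { name: "save", file: "src/user.ts", line: 22 },
  { name: "save", file: "src/order.ts", line: 57 },
  { name: "parseToken", file: "src/auth.ts", line: 12 },
];

console.log(recall("save", definitions, "abc1234"));
console.log(recall("parseToken", definitions, "abc1234"));
```

Modeling the result as a tagged union means the agent has to handle the ambiguous case explicitly, which is the point: it cannot silently treat two candidates as one.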

The numbers

I ran a real coding agent (claude-sonnet-4-5) against fastify, a Node.js framework of about 300 files, on 40 structural questions. First with nothing but file reads. Then with GraphPilot's four tools. Same model, same questions, same repo.

Metric	Without GraphPilot	With GraphPilot
Total tokens	2,796,760	1,088,276
API cost	$8.88	$3.68
Correct answers	33/40	37/40

That's 61% fewer tokens and about 59% lower API cost, and accuracy went up instead of being traded away: four more answers right. By question type, "who calls X?" dropped 82% and impact analysis dropped 73%.
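
The headline percentages follow directly from the table. A quick check of the arithmetic, using only the figures above (the helper name is mine):

```typescript
// Percentage reduction from a baseline value to a new value.
function percentReduction(before: number, after: number): number {
  return ((before - after) / before) * 100;
}

const tokenCut = percentReduction(2796760, 1088276);
const costCut = percentReduction(8.88, 3.68);
const accuracyBefore = (33 / 40) * 100;
const accuracyAfter = (37 / 40) * 100;

console.log(tokenCut.toFixed(1)); // 61.1
console.log(costCut.toFixed(1)); // 58.6
console.log(accuracyBefore, accuracyAfter); // 82.5 92.5
```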

Now the part most launch posts skip. It doesn't help everywhere. Flow-tracing questions like "trace a request through the middleware" come out roughly even, because the agent still has to read code to answer them. Plain dependency checks save about 7%. I publish those numbers too. A tool that claims to win at everything is hiding something.

Local-first, and I mean it

GraphPilot never makes a network call. Indexing is tree-sitter running on your machine, and your source never leaves it. There's no telemetry and no update check, and that part is enforced rather than promised: an ESLint rule bans http, fetch, and axios imports in the source, and CI fails any PR that tries to add one. The graph sits in ~/.graphpilot/ at mode 0600.
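
For readers who want the same guarantee in their own project, a ban like this can be approximated with ESLint's built-in no-restricted-imports and no-restricted-globals rules. The sketch below is an illustrative flat-config entry, not GraphPilot's actual rule set; the module list, file glob, and messages are assumptions.

```typescript
// Modules assumed to count as "network access" for this sketch.
const bannedModules = ["http", "https", "node:http", "node:https", "axios", "node-fetch"];

// One entry for an ESLint flat config array (eslint.config.js style).
const noNetworkConfig = {
  files: ["src/**/*.ts"],
  rules: {
    // Fail on any import of a banned module.
    "no-restricted-imports": [
      "error",
      {
        paths: bannedModules.map((name) => ({
          name,
          message: "Network access is not allowed in this codebase.",
        })),
      },
    ],
    // fetch is a global, not an import, so it needs its own rule.
    "no-restricted-globals": [
      "error",
      { name: "fetch", message: "Network access is not allowed in this codebase." },
    ],
  },
};

console.log(JSON.stringify(noNetworkConfig, null, 2));
```

Running the linter as a required CI step is what turns the rule from a convention into a gate.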

Try it

npm install -g @graphpilot-oss/graphpilot
graphpilot index ~/code/my-app

Then point your agent's MCP config at graphpilot mcp. It works with Claude Code, Cursor, Cline, Windsurf, and Continue.
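
Several MCP clients read a JSON file with an mcpServers map of stdio commands. Assuming your client uses that shape, the snippet below prints an entry you could paste in. The "graphpilot" key is an arbitrary label I chose, and the config file's name and location vary by client, so check your client's documentation before relying on it.

```typescript
// Assumed shape: the "mcpServers" map used by several MCP clients' JSON configs.
interface StdioServer {
  command: string;
  args: string[];
}

const mcpConfig: { mcpServers: Record<string, StdioServer> } = {
  mcpServers: {
    // "graphpilot" is an arbitrary label; the command and argument come from the post.
    graphpilot: { command: "graphpilot", args: ["mcp"] },
  },
};

// Print the JSON to paste into the client's MCP config file.
console.log(JSON.stringify(mcpConfig, null, 2));
```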

It's TypeScript and JavaScript only right now (tree-sitter-typescript handles TS, TSX, JSX, and JS in one grammar). Python is probably next if people ask for it.

It's Apache-2.0, on GitHub at graphpilot-oss/graphpilot. The benchmark is reproducible: the script and method are in the repo, so if you think I measured it wrong, you can check for yourself.

One thing I'd actually like to know: what structural questions do you ask your agent most, and which ones am I not measuring yet?

Top comments (3)

Max Quimby • Jun 11

The blast-radius tool is the part that'd actually change how I work — agents are fine at "where is this defined" (grep gets close enough), but genuinely dangerous at "what does renaming this break," and that's exactly the question where a file-by-file scan quietly misses a caller. The file:line @ sha anchor is underrated too; half the "invented a function that never existed" problem is the agent having no way to cite its source, so it confabulates one.

One thing I'd be curious about from your fastify run: how does the graph handle the edges tree-sitter can't see statically — dynamic dispatch, string-keyed handler registries, DI containers, re-exports? Those are usually where a structural index hands back a confidently incomplete caller list, which is arguably worse than grep because the agent now trusts it. Did you measure accuracy on gp_callers separately from the token savings, or mostly the token number?

Alex Shev • Jun 11

A code graph is much more useful than simply throwing more files into context. The agent needs to know relationships: who calls what, where data flows, which tests cover the path, and which files usually change together.

The token savings are nice, but the bigger win is precision. Better retrieval means fewer irrelevant files, fewer hallucinated connections, and less time spent re-discovering the same repo structure every session.

PURNIMAWADHWA • Jun 10

Great!