I cut my coding agent's token usage 61% by giving it a code graph

akshay sharma — Wed, 10 Jun 2026 15:27:31 +0000

My coding agent has a goldfish memory. Every session runs the same way: I ask, "Who calls parseToken?" and it opens seven files, reads forty kilobytes, and, half a minute later, tells me something it could have told me on day one if it remembered the shape of my code.

It never remembers. Every conversation starts from zero, so it greps, reads, burns tokens, and every so often invents a function name that was never there.

I got tired of paying for that, so I built GraphPilot. It's a local MCP server that indexes your TypeScript/JavaScript repo once and lets the agent query its structure instead of re-reading files. Here's what I found when I measured it.

The actual problem

Coding agents reason well and remember nothing. The structural questions they ask all day ("where is this defined?", "who calls it?", "what breaks if I change it?") have exact answers sitting in the code, but the agent re-derives them from raw text every time, because nothing survives between sessions.

You pay for that three ways. The obvious one is tokens: re-reading files to answer a question you already answered is pure waste. Then there's accuracy: grep finds the string "save" without knowing which save you meant, so the agent guesses. And the one that actually worries me is refactor safety. "What does renaming this break?" is the question that matters most, and it's exactly the one a file-by-file agent is worst at.

GraphPilot fills the gap as persistent structural memory.

How it works

The CLI parses your repo with tree-sitter and builds a graph. Every function, class, method, interface, type, and enum becomes a node; every call becomes an edge. The graph gets written to ~/.graphpilot/, and an MCP server hands it to the agent over stdio.
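To make the node-and-edge idea concrete, here is a minimal sketch of what such a graph could look like in memory. The type names, the id format, and the sample symbols are all hypothetical; GraphPilot's actual on-disk schema may look quite different.

```typescript
// Hypothetical shapes for illustration only, not GraphPilot's real schema.
type SymbolKind = "function" | "class" | "method" | "interface" | "type" | "enum";

interface SymbolNode {
  id: string; // unique key, here "<file>#<name>" (an assumed convention)
  name: string;
  kind: SymbolKind;
  file: string;
  line: number; // 1-based line of the definition
}

interface CallEdge {
  from: string; // id of the calling symbol
  to: string; // id of the called symbol
}

interface CodeGraph {
  nodes: Map<string, SymbolNode>;
  edges: CallEdge[];
}

function addNode(graph: CodeGraph, node: SymbolNode): void {
  graph.nodes.set(node.id, node);
}

function addCall(graph: CodeGraph, from: string, to: string): void {
  // Only record calls between symbols the index already knows about.
  if (graph.nodes.has(from) && graph.nodes.has(to)) {
    graph.edges.push({ from, to });
  }
}

const graph: CodeGraph = { nodes: new Map(), edges: [] };
addNode(graph, { id: "src/auth.ts#parseToken", name: "parseToken", kind: "function", file: "src/auth.ts", line: 12 });
addNode(graph, { id: "src/login.ts#login", name: "login", kind: "function", file: "src/login.ts", line: 30 });
addCall(graph, "src/login.ts#login", "src/auth.ts#parseToken");
addCall(graph, "src/login.ts#login", "src/missing.ts#unknown"); // ignored: target was never indexed

console.log(graph.nodes.size, graph.edges.length);
```

Keying nodes by file plus name, rather than by name alone, is what lets two symbols called save coexist in the same graph.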

Four tools:

gp_recall finds where a symbol is defined
gp_callers lists who calls a symbol (the reverse lookup)
gp_impact computes the blast radius of a change, depth-bounded
gp_index re-indexes after edits
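
The reverse lookup and the depth-bounded blast radius are both small graph operations. The sketch below shows one way to do them, assuming call edges are stored as simple from/to pairs; the function names (buildCallerIndex, callersOf, impactOf) are made up for this example and are not GraphPilot's internals.

```typescript
type Edge = { from: string; to: string };

// Build a reverse adjacency map: callee -> list of direct callers.
function buildCallerIndex(edges: Edge[]): Map<string, string[]> {
  const index = new Map<string, string[]>();
  for (const { from, to } of edges) {
    const callers = index.get(to) ?? [];
    callers.push(from);
    index.set(to, callers);
  }
  return index;
}

// Direct callers only (the gp_callers-style question).
function callersOf(index: Map<string, string[]>, symbol: string): string[] {
  return index.get(symbol) ?? [];
}

// Breadth-first walk up the caller edges, stopping after maxDepth hops
// (the gp_impact-style question). Returns each affected symbol with its distance.
function impactOf(index: Map<string, string[]>, symbol: string, maxDepth: number): Map<string, number> {
  const depthOf = new Map<string, number>();
  const seen = new Set<string>([symbol]);
  let frontier: string[] = [symbol];
  for (let depth = 1; depth <= maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const current of frontier) {
      for (const caller of callersOf(index, current)) {
        if (seen.has(caller)) continue; // guards against cycles and duplicates
        seen.add(caller);
        depthOf.set(caller, depth);
        next.push(caller);
      }
    }
    frontier = next;
  }
  return depthOf;
}

const sampleEdges: Edge[] = [
  { from: "login", to: "parseToken" },
  { from: "refresh", to: "parseToken" },
  { from: "handleRequest", to: "login" },
  { from: "main", to: "handleRequest" },
];
const callerIndex = buildCallerIndex(sampleEdges);
const directCallers = callersOf(callerIndex, "parseToken");
const impact = impactOf(callerIndex, "parseToken", 2);

console.log(directCallers, [...impact.entries()]);
```

With a depth bound of 2, main is left out of the result even though it reaches parseToken transitively, which keeps the answer small enough to fit comfortably in a prompt.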

Every response carries a file:line @ sha anchor, so the agent can quote its source and you can jump straight to the line. And when a name is ambiguous (say, two files both export save), the answer says so instead of quietly picking one and pretending it was sure.
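
As a rough illustration of the anchor format and the ambiguity handling, a lookup could return a tagged result like the one below. This is a sketch under assumptions: the result shape, the status names, and the idea that sha is a short commit hash are all guesses for the example, not the tool's documented response format.

```typescript
interface Definition {
  name: string;
  file: string;
  line: number;
}

// Hypothetical result shape: either one anchor, several candidates, or nothing.
type RecallResult =
  | { status: "found"; anchor: string }
  | { status: "ambiguous"; anchors: string[] }
  | { status: "not_found" };

// Formats "file:line @ sha"; sha is assumed here to be a short commit hash.
function formatAnchor(def: Definition, sha: string): string {
  return `${def.file}:${def.line} @ ${sha}`;
}

function recall(defs: Definition[], name: string, sha: string): RecallResult {
  const matches = defs.filter((d) => d.name === name);
  if (matches.length === 0) return { status: "not_found" };
  if (matches.length === 1) return { status: "found", anchor: formatAnchor(matches[0], sha) };
  // More than one definition: report every candidate instead of guessing.
  return { status: "ambiguous", anchors: matches.map((d) => formatAnchor(d, sha)) };
}

const defs: Definition[] = [
  { name: "parseToken", file: "src/auth.ts", line: 12 },
  { name: "save", file: "src/user.ts", line: 40 },
  { name: "save", file: "src/order.ts", line: 7 },
];

const unique = recall(defs, "parseToken", "a1b2c3d");
const ambiguous = recall(defs, "save", "a1b2c3d");
console.log(unique, ambiguous);
```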

The numbers

I ran a real coding agent (claude-sonnet-4-5) against fastify, a Node.js framework of about 300 files, on 40 structural questions. First with nothing but file reads. Then with GraphPilot's four tools. Same model, same questions, same repo.

Metric	Without GraphPilot	With GraphPilot
Total tokens	2,796,760	1,088,276
API cost	$8.88	$3.68
Correct answers	33/40	37/40

That's 61% fewer tokens, and it got four more answers right instead of trading accuracy for speed. By question type, "who calls X?" dropped 82% and impact analysis dropped 73%.
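
The headline percentage follows directly from the table. A quick check of the arithmetic, using only the figures reported above:

```typescript
// Percentage reduction from a "before" value to an "after" value.
function percentReduction(before: number, after: number): number {
  return ((before - after) / before) * 100;
}

const tokenCut = percentReduction(2796760, 1088276); // total tokens
const costCut = percentReduction(8.88, 3.68); // API cost in dollars
const accuracyBefore = (33 / 40) * 100;
const accuracyAfter = (37 / 40) * 100;

console.log(`tokens: -${tokenCut.toFixed(1)}%`);
console.log(`cost: -${costCut.toFixed(1)}%`);
console.log(`accuracy: ${accuracyBefore.toFixed(1)}% -> ${accuracyAfter.toFixed(1)}%`);
```

The token cut comes out at about 61.1% and the cost cut at about 58.6%, while accuracy moves from 82.5% to 92.5%.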

Now the part most launch posts skip. It doesn't help everywhere. Flow-tracing questions like "trace a request through the middleware" come out roughly even, because the agent still has to read code to answer them. Plain dependency checks save about 7%. I publish those numbers too. A tool that claims to win at everything is hiding something.

Local-first, and I mean it

GraphPilot never makes a network call. Indexing is tree-sitter running on your machine, and your source never leaves it. There's no telemetry and no update check, and that part is enforced rather than promised: an ESLint rule bans http, fetch, and axios imports in the source, and CI fails any PR that tries to add one. The graph sits in ~/.graphpilot/ at mode 0600.
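
For readers who want a similar guard in their own project, the sketch below shows one way to express such a ban with ESLint's built-in no-restricted-imports and no-restricted-globals rules in a flat-config array. It is a guess at the general shape, not the rule from the GraphPilot repo: the file glob and the exact module list are assumptions, and since fetch is a global rather than an import, it is covered here by a second rule.

```typescript
// Hypothetical flat-config entry; parser setup for TypeScript files is omitted.
const networkBanConfig = [
  {
    files: ["src/**/*.ts"], // assumed source layout
    rules: {
      // Reject imports of HTTP client modules (module list is an assumption).
      "no-restricted-imports": [
        "error",
        { paths: ["http", "https", "node:http", "node:https", "axios"] },
      ],
      // fetch is a global, so it needs the globals rule instead.
      "no-restricted-globals": ["error", "fetch"],
    },
  },
];

console.log(JSON.stringify(networkBanConfig, null, 2));
```

In a real project this array would be the default export of eslint.config.js, and running ESLint in CI would then fail any change that introduces one of the listed modules.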

Try it

npm install -g @graphpilot-oss/graphpilot
graphpilot index ~/code/my-app

Then point your agent's MCP config at graphpilot mcp. It works with Claude Code, Cursor, Cline, Windsurf, and Continue.
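
Several MCP clients read a JSON file with an mcpServers map, so the entry often looks roughly like the object printed below. Treat this as a sketch: the "graphpilot" key is just a label chosen for the example, and the file name and exact schema vary by client, so check your client's documentation before copying it.

```typescript
// Hypothetical MCP client entry; the server label and schema depend on your client.
const mcpConfig = {
  mcpServers: {
    graphpilot: {
      command: "graphpilot", // the CLI installed by the npm command above
      args: ["mcp"], // runs the MCP server subcommand over stdio
    },
  },
};

const mcpConfigJson = JSON.stringify(mcpConfig, null, 2);
console.log(mcpConfigJson);
```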

It's TypeScript and JavaScript only right now (the tree-sitter-typescript package handles TS, TSX, JSX, and JS). Python is probably next if people ask for it.

It's Apache-2.0, on GitHub at graphpilot-oss/graphpilot. The benchmark is reproducible: the script and method are in the repo, so if you think I measured it wrong, you can check for yourself.