Nikita Rybalchenko

Posted on Jun 28

Your coding agent is burning tokens grepping your repo. Here's a one-command fix.


graphlens-mcp gives Claude Code, Cursor, and compatible clients a typed graph of your code, so they ask "who calls create_order?" and get one small answer instead of reading half the codebase. Below: how the engine works, what a 936-run benchmark says about when it actually pays off, and the five-minute install.

I've been building graphlens in the open. The full story is three posts on Habr — the engine, a benchmark, the product — but this version stands on its own; I've folded in what matters. Links at the end.

The loop everyone knows

Picture a big project. A few hundred thousand lines, Python on the backend, TypeScript on the front, a legacy corner nobody wants to touch. You point a coding agent at it and ask something ordinary: "how does auth work here?" or "what breaks if I change this method's signature?"

The agent can't see the whole repo at once. So it does the only thing it can: grep a name, open a file, read it, follow an import, grep again. It reads a dozen files, every one of them lands in the context window, and every one gets re-billed on the next turn.

This isn't hypothetical overhead. Anthropic's own engineering blog notes that tool definitions and intermediate results can eat "50,000+ tokens before an agent reads a request" — the window fills before the agent has even started on your question.

A code graph attacks exactly this. Instead of "read the file and eyeball it," the agent asks a precise question — who calls create_order — and gets back a small structured answer: resolved edges, not a text search and a prayer.

That's the pitch. The rest of this post is whether it holds up, and what it took to make it usable.

What the engine actually does

The hard part of "code → graph" isn't drawing boxes. It's the edges. Most lightweight tools resolve references by name: they see a call to save() and draw an edge to everything named save. Fast, and wrong — a real codebase has a dozen saves.

graphlens, the engine under the MCP server, splits the work in two:

1. Tree-sitter parses each file into a concrete syntax tree: exact structure, precise 1-based span positions. Every use-site gets recorded as an occurrence with a role: call, read, write, annotation, base class.
2. A type-aware resolver, specific to the language, answers definition_at(file, line, col) for each occurrence. The resolved definition becomes a real edge to the real declaration.

The resolvers are the same machinery your IDE runs: ty (Astral's type checker, written in Rust) for Python, the TypeScript Compiler API for TS, gopls for Go, rust-analyzer for Rust. So a CALLS edge points at the actual function, HAS_TYPE at the actual class, INHERITS_FROM at the actual base class. It's the difference between "probably related" and "related." The engine knows the process_order in services.py is the one called from api.py, not the namesake in tests/.
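To make the occurrence-then-resolve split concrete, here is a minimal sketch. It is not graphlens's actual data model: the `Occurrence` and `Edge` classes, the `enclosing` field, and the role-to-edge mapping are assumptions for illustration, and the resolver is a stub dictionary standing in for a real type-aware backend.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Occurrence:
    file: str
    line: int
    col: int
    role: str       # "call", "read", "write", "annotation", "base_class"
    enclosing: str  # hypothetical: qualified name of the symbol containing the use-site


@dataclass(frozen=True)
class Edge:
    kind: str
    source: str
    target: str


# Assumed mapping; only the three edge kinds named in the post are covered.
ROLE_TO_EDGE = {
    "call": "CALLS",
    "annotation": "HAS_TYPE",
    "base_class": "INHERITS_FROM",
}

# A resolver answers "what is defined at this position?" or None if it cannot tell.
DefinitionAt = Callable[[str, int, int], Optional[str]]


def build_edges(occurrences: list[Occurrence], definition_at: DefinitionAt) -> list[Edge]:
    edges = []
    for occ in occurrences:
        kind = ROLE_TO_EDGE.get(occ.role)
        if kind is None:
            continue
        target = definition_at(occ.file, occ.line, occ.col)
        if target is None:
            # Unresolved: emit no edge instead of guessing by name.
            continue
        edges.append(Edge(kind, occ.enclosing, target))
    return edges


# Stub resolver: a lookup table in place of ty, gopls, and friends.
resolved = {("api.py", 12, 8): "services.process_order"}


def fake_definition_at(file: str, line: int, col: int) -> Optional[str]:
    return resolved.get((file, line, col))


occurrences = [
    Occurrence("api.py", 12, 8, "call", "api.create_order_endpoint"),
    Occurrence("api.py", 30, 4, "call", "api.helper"),  # resolver has no answer here
]
edges = build_edges(occurrences, fake_definition_at)
```

The second occurrence produces no edge at all. That is the point of resolving by position instead of by name: a missing answer stays missing and never turns into an edge to every function called `save`.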

That handles the first wall — name ambiguity. The second wall is that most code-intel tools are monolingual. They understand Python beautifully and go blind the moment a TypeScript frontend calls a FastAPI route. Real systems are polyglot; the tools around them usually aren't.

graphlens emits language-neutral BOUNDARY nodes for the interfaces a service exposes or consumes: HTTP routes, queue topics, gRPC methods. The boundary ID carries no project and no language, and HTTP paths get normalized so that a concrete /users/1, /users/{user_id} (FastAPI), /users/<int:id> (Flask), and /users/:id (Express) all collapse to the same key. A FastAPI route and a TypeScript fetch to that endpoint therefore produce the same boundary ID. Merge the two graphs, link them, and you get edges crossing the language border. That lets the agent answer "which frontend calls hit this endpoint?", a question a single-language tool can't even phrase.
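A rough sketch of what that path normalization could look like. The regular expressions, the `{}` placeholder, and the `http::METHOD::path` key format are my assumptions for illustration; the engine's real rules and key layout may differ.

```python
import re

# Each pattern matches one whole path segment that stands for a parameter.
_PARAM_SEGMENT = [
    re.compile(r"^\{[^{}]+\}$"),             # FastAPI style: {user_id}
    re.compile(r"^<(?:[^<>:]+:)?[^<>]+>$"),  # Flask style: <int:id> or <id>
    re.compile(r"^:[A-Za-z_]\w*$"),          # Express style: :id
    re.compile(r"^\d+$"),                    # concrete numeric value: 1
]


def normalize_path(path: str) -> str:
    """Replace every parameter-like segment with a neutral placeholder."""
    segments = [s for s in path.split("/") if s]
    normalized = [
        "{}" if any(p.match(seg) for p in _PARAM_SEGMENT) else seg
        for seg in segments
    ]
    return "/" + "/".join(normalized)


def boundary_key(method: str, path: str) -> str:
    """Hypothetical key: no project and no language, only method and path."""
    return f"http::{method.upper()}::{normalize_path(path)}"


keys = {
    boundary_key("get", "/users/1"),
    boundary_key("GET", "/users/{user_id}"),
    boundary_key("GET", "/users/<int:id>"),
    boundary_key("GET", "/users/:id"),
}
```

All four spellings land in a set of size one, which is what lets a Python route and a TypeScript call site meet at the same node after the graphs are merged.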

Two more choices matter for trust. IDs are deterministic: a node's ID is a SHA-256 of project::kind::qualified_name, so the same scan yields the same IDs on any machine, which is what makes diffing and incremental updates work. And the graph never lies about being incomplete: if a toolchain is missing or a file fails type-checking, the resolver records a status (ok / degraded / unavailable) instead of quietly handing back a half-resolved graph. In CI, anything but ok fails the build.
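Both ideas fit in a few lines. This is a sketch under assumptions: the post only says the ID is a SHA-256 of `project::kind::qualified_name`, so the UTF-8 encoding, the untruncated hex digest, and the `ci_gate` helper are mine, not the engine's.

```python
import hashlib


def node_id(project: str, kind: str, qualified_name: str) -> str:
    """Derive an ID from content only, so any machine computes the same value."""
    payload = f"{project}::{kind}::{qualified_name}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def ci_gate(statuses: list[str]) -> int:
    """Hypothetical CI check: exit code 0 only if every resolver reported ok."""
    return 0 if all(s == "ok" for s in statuses) else 1


first = node_id("shop", "function", "services.create_order")
second = node_id("shop", "function", "services.create_order")
other = node_id("shop", "function", "tests.create_order")
```

Because nothing in the ID depends on scan order, file paths on a particular machine, or a counter, two scans of the same code can be diffed node by node.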

Does it actually pay off? 936 runs

Here's the part most "my tool is faster" posts skip. In the first Habr post I claimed agents burn tokens on grep and put zero numbers behind it. So I built a benchmark to find out, and the answer surprised me.

The setup is one controlled variable. Same agent (Claude Code), same prompts, same tasks. The only thing that changes is which MCP server feeds the agent context. Four "hands": filesystem (grep + read), graphlens (the structural graph), serena (an LSP), and codegraph (a competing graph tool). Three models (Haiku, Sonnet, Opus), three seeds, 26 tasks on apache/superset (~400k lines, Python + TS). That's 936 runs.
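The run count is just the size of the full grid. A quick sketch of that matrix, where the seed values and task names are placeholders and not the harness's real identifiers:

```python
from itertools import product

hands = ["filesystem", "graphlens", "serena", "codegraph"]
models = ["haiku", "sonnet", "opus"]
seeds = [0, 1, 2]                                # placeholder seed values
tasks = [f"task_{i:02d}" for i in range(1, 27)]  # placeholder names for the 26 tasks

# One run per (hand, model, seed, task) combination.
runs = list(product(hands, models, seeds, tasks))
total_runs = len(runs)  # 4 * 3 * 3 * 26
```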

A few things I locked down so the numbers mean something. The built-in Claude Code tools (Read, Grep, Bash) are disabled; otherwise the agent ignores the MCP server and the test measures nothing. Reference answers are hand-verified against a fixed tag and, crucially, not generated by any tool under test. temperature=0 doesn't make these models deterministic, so I run three seeds and report the median, not the mean. A run that hits the turn ceiling without answering counts as accuracy 0: "the tool didn't finish in budget," not "no data."
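Here is how those two scoring rules interact, as a small sketch. The `Run` class is hypothetical, and I'm assuming the median is taken over the seeds of one (tool, model, task) cell; the harness may aggregate differently.

```python
from dataclasses import dataclass
from statistics import mean, median
from typing import Optional


@dataclass
class Run:
    # None means the run hit the turn ceiling without producing an answer.
    accuracy: Optional[float]


def score(run: Run) -> float:
    """An unfinished run counts as 0, not as missing data."""
    return 0.0 if run.accuracy is None else run.accuracy


def cell_score(runs: list[Run]) -> float:
    """Median across seeds, so one outlier seed does not move the result."""
    return median(score(r) for r in runs)


three_seeds = [Run(0.9), Run(None), Run(0.8)]
median_score = cell_score(three_seeds)
mean_score = mean(score(r) for r in three_seeds)
```

With one seed timing out, the median stays at 0.8 while the mean drops to about 0.57. Counting the timeout as 0 still punishes a tool that times out on most seeds, because then the median itself becomes 0.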

The headline finding: the ranking inverts depending on the task.

On simple point lookups — "where is class X defined," "what does it inherit from" — all four tools tie on accuracy. The only difference is price, a spread of roughly 3×, and graphlens sits unremarkable in the middle. If I'd measured only these, I'd have written "the graph isn't worth it, grep is fine." That would have been half the truth.

On the work that actually matters — blast-radius questions, finding every override, resolving an ambiguous name — the tools diverge hard:

Tool	Accuracy	Tokens	Tool calls	$/task
filesystem (grep)	0.71	12,596	27	$0.424
graphlens	0.84	748	1	$0.018
serena (LSP)	0.85	1,368	5	$0.065
codegraph	0.93	1,114	2	$0.036

grep collapses. Lowest accuracy, and it only reaches an answer in 83% of runs — the rest burn through the 50-turn ceiling. The runs that do finish cost 10–23× more and take 10–18× longer. Text search drowns in noise when the question is "every call to this" or "which of ten identically-named methods."

The same graphlens that looked dull on point lookups is now the cheapest ($0.018) and fastest option, answering in a single tool call instead of twenty-seven. That's roughly a 94% token cut against grep on these tasks. codegraph is the most accurate (0.93); serena holds its own.
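The percentage and the price ratio follow directly from the table rows for grep and graphlens. A short check, using only those published figures:

```python
# Hard-task figures for two rows of the table above.
hard_tasks = {
    "filesystem": {"tokens": 12_596, "tool_calls": 27, "usd": 0.424},
    "graphlens": {"tokens": 748, "tool_calls": 1, "usd": 0.018},
}

grep = hard_tasks["filesystem"]
graph = hard_tasks["graphlens"]

# Fraction of grep's tokens that graphlens avoids spending.
token_cut = 1 - graph["tokens"] / grep["tokens"]

# How many times more a grep run costs per task.
cost_ratio = grep["usd"] / graph["usd"]

print(f"token cut: {token_cut:.1%}")
print(f"cost ratio: {cost_ratio:.1f}x")
```

That works out to a token cut of about 94% and a per-task cost ratio of about 23.6x in graphlens's favor on these tasks.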

There's a second twist I didn't predict: the best tool depends on which model you run. graphlens returns token-heavy results — graph neighborhoods, reference lists. On a cheap model that verbosity is nearly free, so on Haiku graphlens is the cheapest of all four. On Opus, which prices those same tokens far higher, graphlens becomes the most expensive of the structural tools (still cheaper than grep). serena and codegraph return tight, pointed results and stay cheap on any model.

Which leads to the one takeaway I'd bet money on: a cheap model on a structural tool beats an expensive model on grep. codegraph + Haiku (~$0.023, ~0.99 accuracy) beats filesystem + Opus (~$0.087, 0.93) on every axis at once.

One of my predictions flat-out failed, and it's worth reporting. I'd planted cross-language tasks (a TS call resolving to a Python handler across the /api/v1/... boundary) as a stress test, sure the single-language tools would trip. They didn't — every hand, grep included, solved both. The agent steps across the boundary on its own, whatever feeds it context. A benchmark that only confirms what you hoped isn't a benchmark.

The honest fine print: one repo, one harness, 26 tasks (20 simple, 6 hard). The cost difference is statistically solid; the accuracy gap on hard tasks is a strong signal but not proven at n=6. cost_usd is an API-equivalent, not your subscription bill. This is a reproducible measurement on one case, not a universal ranking — and the whole harness plus raw data is open if you want to run it on your own code.

The gap nobody mentions: an engine isn't a product

So the engine works on the tasks where a graph should work. But there's a hole I glossed over in both Habr posts: the engine is not something you can hand to an agent as-is.

graphlens, by design, stops at producing the graph. It doesn't own a database, doesn't watch the filesystem, doesn't reindex itself, doesn't raise a long-running service. For an engine that's the right call — a small core is trivial to test, cache, and compose. But to actually wire it into an agent, someone has to write the layer on top: graph storage, invalidation (which files to reindex when one changes), a filesystem watcher, an MCP server with tools the agent can call, registration in each client's config format, and a navigation skill so the agent knows how to use any of it.

That layer is the work everyone ends up redoing by hand. I wrote it once and packaged it as graphlens-mcp: a thin runtime over the engine that owns the storage, the freshness model, and everything the agent sees.

One command

uv tool install graphlens-mcp        # or: pipx install graphlens-mcp
cd your-project && graphlens-mcp init

init detects the project's languages, runs a toolchain "doctor," indexes the code into a local graph, writes the MCP server into your agents' configs (it knows Claude Code, Cursor, Windsurf, VS Code/Copilot, Codex CLI, and writes idempotently without clobbering your other servers), and installs the navigation skill. The agent starts the server itself from that config — you never run serve by hand. Restart the agent and ask it something like "what breaks if I change the signature of create_order?"

Requirements: Python ≥ 3.13 (inherited from the engine). MIT licensed. Current version 0.1.2, and it's early — more on that below.

What the agent gets

Eight tools, each cut to a specific question about code:

Tool	What it answers
`search_symbols`	Full-text search over symbol names — the entry point
`get_node_info`	Source snippet + signature + docstring + position
`get_file_structure`	A file's symbol outline
`get_callees`	What a function calls (outgoing, to `max_depth`)
`get_callers`	Who calls a function — the core of impact analysis
`get_neighbors`	Nodes within N hops, any direction
`find_references`	Non-calls: type annotations, assignments
`get_cross_language_calls`	Links across service boundaries (HTTP/gRPC/queues)

Every response carries a graph-quality status (ok or degraded), so the agent never mistakes a partial answer for a complete one. Lists are capped and flagged truncated rather than silently cut.
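One way to picture a response that cannot be mistaken for a complete one. The field names (`status`, `items`, `truncated`) and the limit of 100 are assumptions for this sketch, not the server's actual response schema.

```python
from typing import Any


def capped_response(items: list[Any], limit: int, status: str = "ok") -> dict[str, Any]:
    """Cap the list and say so explicitly instead of cutting it silently."""
    return {
        "status": status,                  # graph quality: "ok" or "degraded"
        "items": items[:limit],
        "truncated": len(items) > limit,   # True only when something was dropped
    }


many_callers = [f"caller_{i}" for i in range(250)]
resp = capped_response(many_callers, limit=100)
small = capped_response(["only_caller"], limit=100, status="degraded")
```

An agent reading `resp` sees 100 items plus `truncated: True` and knows to narrow the query; reading `small`, it sees a complete list from a graph that is itself only partly resolved.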

The pattern the navigation skill teaches: start at search_symbols, fan out with get_callers / find_references, and only pull get_node_info for the spots that actually need the source — instead of reading every calling file end to end.
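The search-then-fan-out pattern can be sketched against a toy in-memory call graph. The two functions below borrow the tool names for readability only: their signatures, the `max_depth` parameter on the caller walk, and the return shapes are hypothetical stand-ins, not the MCP tools' real interface.

```python
from collections import deque

# Toy call graph: caller -> list of callees. Every name here is made up.
CALLS = {
    "api.create_order_endpoint": ["services.create_order"],
    "jobs.retry_orders": ["services.create_order"],
    "services.create_order": ["db.save"],
    "cli.main": ["jobs.retry_orders"],
}


def search_symbols(query: str) -> list[str]:
    """Entry point: find candidate symbols by substring match on the name."""
    names = set(CALLS) | {c for callees in CALLS.values() for c in callees}
    return sorted(n for n in names if query in n)


def get_callers(symbol: str, max_depth: int = 1) -> dict[str, int]:
    """Breadth-first walk up the call graph, returning {caller: distance}."""
    found: dict[str, int] = {}
    queue = deque([(symbol, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue
        for caller, callees in CALLS.items():
            if node in callees and caller not in found and caller != symbol:
                found[caller] = depth + 1
                queue.append((caller, depth + 1))
    return found


candidates = search_symbols("create_order")
impact = get_callers("services.create_order", max_depth=2)
```

Two small lookups produce the whole blast radius for `services.create_order`. Only after that would the agent fetch source for the handful of callers it actually needs to read.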

Why the graph doesn't go stale while you type

What separates a product from "the engine plus a script" is that the graph stays current on its own while you edit.

A filesystem watcher starts with the server. When a file changes on disk, the server reindexes the connected set — the changed file, the files that import it, and the files it imports — in one full pass, so cross-file edges rebuild correctly instead of half-breaking. Deleting a file purges its symbols and updates its importers. The case everyone forgets — edits made while the server was off — is handled by a one-shot reconcile on startup: scan the project, index what's new, drop what vanished, refresh what changed, then hand control to the watcher.
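Both freshness mechanisms reduce to set arithmetic. A minimal sketch, assuming an import map of file to imported files and a content hash per file; the server may well track changes differently (for example by modification time), so treat the inputs as illustrative.

```python
def connected_set(changed: str, imports: dict[str, set[str]]) -> set[str]:
    """The changed file, the files that import it, and the files it imports."""
    importers = {f for f, deps in imports.items() if changed in deps}
    imported = set(imports.get(changed, ()))
    return {changed} | importers | imported


def reconcile(
    indexed: dict[str, str], on_disk: dict[str, str]
) -> tuple[list[str], list[str], list[str]]:
    """Compare stored hashes with the disk: (new, vanished, changed) paths."""
    new = sorted(on_disk.keys() - indexed.keys())
    vanished = sorted(indexed.keys() - on_disk.keys())
    changed = sorted(
        p for p in on_disk.keys() & indexed.keys() if on_disk[p] != indexed[p]
    )
    return new, vanished, changed


imports = {
    "api.py": {"services.py"},
    "services.py": {"models.py"},
    "tests/test_api.py": {"api.py"},
}
to_reindex = connected_set("services.py", imports)

indexed = {"api.py": "a1", "services.py": "b1", "old.py": "c1"}
on_disk = {"api.py": "a1", "services.py": "b2", "new.py": "d1"}
new, vanished, changed = reconcile(indexed, on_disk)
```

Note that `tests/test_api.py` is not in `to_reindex`: it is two import hops from `services.py`. That one-hop limit is the trade-off behind the occasional full reindex after a refactor that ripples through many layers.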

The graph lives in .graphlens/graph.db (SQLite). It's a regenerable cache, safe to delete; reindex rebuilds it. Add .graphlens/ to your VCS ignore.

What it deliberately isn't

Status is early, and I'd rather say so than dress it up. The navigation core works; the rest is in progress.

The watcher reindexes the connected set of a change, not the whole project — a refactor that ripples through many layers of indirection may need a full reindex for a perfectly accurate graph. Cross-language COMMUNICATES_WITH edges rebuild on a full reindex and can erode on incremental edits. Languages other than Python need their toolchains present (Python works out of the box; ty ships with it); without them a language reports as degraded — structure parsed, calls and types not fully resolved — but init never blocks on it, and status tells you exactly what's missing.

And the one boundary that's structural, not a roadmap item: graphlens-mcp does not do embeddings or semantic "find me something like this" search. The graph is structural and type-aware, not a vector index. If you need "find code conceptually similar to rate limiting, whatever it's called," that's a vector tool's job. This answers the structural questions: who calls this, what it depends on, what breaks.

Try it, and tell me where it breaks

Install it if you run Claude Code, Cursor, or a compatible client on a big polyglot project and you're tired of watching the agent grep its way through the repo — especially if you do impact analysis before refactors, which is exactly the mode where the graph earns its keep. Skip it if your project is small (grep is instant on a few dozen files) or you mainly need semantic search.

Zero barrier to bail: everything's local, nothing leaves your machine, MIT, and it uninstalls in one command (graphlens-mcp remove --purge-db). Point it at your main project, confirm the MCP server is live in your agent, and compare the tool-call count on the same architectural question with the graph and without.

What I need most right now is independent runs on codebases that aren't superset. Issues, numbers from your projects, complaints about tool granularity — all welcome in the repo. The more measurements on different code, the closer this gets to an answer you can carry over, instead of "works on superset."

graphlens-mcp: github.com/Neko1313/graphlens-mcp
The engine: github.com/Neko1313/graphlens
The benchmark: github.com/Neko1313/agent-context-bench
