Harrison Guo

Posted on Jun 1 • Originally published at harrisonsec.com

I Tested CodeGraph on Hono. The Tool-Call Savings Reproduce — the Cost Savings Don't.

#ai #benchmark #devtools #typescript

Two weeks ago CodeGraph hit GitHub trending — tree-sitter + SQLite/FTS5 + MCP for Claude Code, 19k+ stars in a week. The team published a benchmark on 7 repos showing 35% cheaper, 57% fewer tokens, 46% faster, 71% fewer tool calls vs. baseline.

Those are big numbers. They're also numbers from a benchmark designed by the team that built the tool, on repos they chose. Designer bias is the #1 risk in any retrieval benchmark — when you pick the test repos and write the ground truth, you'll consciously or unconsciously favor your own tool's strengths.

So I ran an independent test on an 8th repo — Hono (TypeScript, ~280 source files, in neither CodeGraph's published 7-repo suite nor any other published benchmark I could find). 5 architectural questions covering different retrieval shapes, with a deliberate control case (Q5) where the tool should not win. Two conditions (baseline grep+Read+Glob+Explore vs. CodeGraph active), 4 repeats per question per condition. 40 runs on Claude Opus 4.8 — and, critically, every CodeGraph run was verified to have connected, and actual codegraph_* tool usage was recorded per run (more on why that sentence exists below).

The result splits in a way the single published headline number hides — and the split is the useful part.

tl;dr: On Hono, CodeGraph delivers a large, consistent reduction in tool calls (−55%, 14.0 → 6.3 avg) and a smaller latency win (−20%), so the direction of the published 7-repo result reproduces here. But cost is a wash: +6.8%, not the published −35%. On narrow-scope questions (route lookup, middleware trace) CodeGraph is actually 22-43% more expensive, because each structural lookup loads a big chunk of graph context that costs more in cached tokens than the grep round-trips it replaces. The cost win only appears on broad multi-file navigation (Q3 multi-runtime adapters: −29% cost, −80% tool calls, −53% latency). A second finding: baseline grep+Read has high variance. The agent occasionally spiraled to 47-52 tool calls on the broad questions, while CodeGraph never exceeded 16. Net at Hono's size: CodeGraph makes the agent take fewer steps and finish faster, but not for fewer dollars. Total cost of the 40 valid runs: ~$14 of Opus 4.8 calls. Raw per-run CSV and the 5 verbatim prompts are below.

What "tool calls down, cost flat" actually means

CodeGraph's published 7-repo suite (VS Code, Excalidraw, Django, Tokio, OkHttp, Gin, Alamofire) skews larger and more architecturally complex than Hono. Hono is ~280 TypeScript source files (362 files indexed by CodeGraph, including tests and configs), 16MB on disk — small enough that a thoughtful agent with grep + Read can finish most architectural questions in a handful of tool calls.

The interesting result is that the axes come apart. CodeGraph replaces several grep+Read round-trips with one or two structural lookups — so step count drops hard (-55%). But each codegraph_context / codegraph_explore call returns a sizeable chunk of graph context, which then rides along in the conversation cache and gets re-read every turn. At Hono's size, the dollar cost of carrying that cached payload roughly equals the dollar cost of the grep round-trips it replaced — so dollars stay flat (+7%) even as steps fall by more than half.
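The mechanism above can be made concrete with a toy model. The sketch below assumes every tool result is written to the prompt cache once and then re-read on each later turn. The per-token prices and payload sizes are made-up placeholders chosen for illustration; they are neither real API pricing nor measurements from these runs.

```python
# Toy model: hypothetical prices and payload sizes, for illustration only.
CACHE_WRITE_PRICE = 10.0  # hypothetical cost units per million tokens written
CACHE_READ_PRICE = 1.0    # hypothetical cost units per million tokens re-read


def session_cost(payload_tokens):
    """Cost of a run where each tool result is cached once, then re-read every later turn."""
    cached = 0
    total = 0.0
    for tokens in payload_tokens:
        total += cached * CACHE_READ_PRICE / 1e6   # re-read everything cached so far
        total += tokens * CACHE_WRITE_PRICE / 1e6  # write this turn's tool result
        cached += tokens
    return total


# 14 small grep/Read results vs. 6 calls, two of which return a large graph payload.
grep_path = [2_000] * 14
graph_path = [15_000, 15_000, 2_000, 2_000, 2_000, 2_000]

grep_cost = session_cost(grep_path)
graph_cost = session_cost(graph_path)
print(f"grep:  {len(grep_path)} calls, cost {grep_cost:.3f}")
print(f"graph: {len(graph_path)} calls, cost {graph_cost:.3f}")
```

With these placeholder numbers the graph path takes fewer than half the calls yet costs somewhat more, because the two large payloads are written early and carried through every remaining turn. The model ignores the system prompt and output tokens, so it only illustrates the shape of the trade-off.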

That's not a contradiction of the cost-curve thesis from the prior post in this mini-series — it's a sharper reading of it. Hono sits above the step-count crossover (the index already saves tool calls) but below the dollar crossover (it doesn't yet save money). On a much bigger repo, the grep path churns through far more files and the index pays back on dollars too. Hono just happens to land in the gap between the two crossovers.

A useful complementary benchmark answers three things the published one doesn't:

Cross-validation on a repo not chosen by the tool's team — do the published advantages generalize?
Within-repo variance across question types — does the win concentrate on certain question shapes? (It does — heavily.)
A control case where the tool shouldn't win — Q5 (text search) tests whether the agent correctly declines to use the structural engine when grep is the right tool.

Setup — install CodeGraph, ~10 minutes

# install (downloads a single binary, no Node/npm required)
curl -fsSL https://raw.githubusercontent.com/colbymchenry/codegraph/main/install.sh | sh

# clone the test repo + index it
git clone https://github.com/honojs/hono.git ~/tmp/hono
cd ~/tmp/hono
codegraph init -i

Index build time on Hono (362 files, 4,128 nodes, 8,225 edges): 1.7 seconds. On-disk index: 7.1 MB.

Per-condition setup for the two arms:

Baseline (control): a clean copy of Hono via rsync -a --exclude='.codegraph/' to a separate directory so Claude couldn't accidentally grep into the index. No MCP servers registered. Agent uses native Glob + Grep + Read + Explore + Task.
CodeGraph active: original Hono directory with .codegraph/ present, MCP server registered:

{"mcpServers": {"codegraph": {"command": "codegraph", "args": ["serve", "--mcp"]}}}

Both arms run claude --print --output-format stream-json --model opus so the model and the rest of the agent loop are identical; the only varying input is whether the CodeGraph MCP server is in the loop. Each run is a fresh session with no prior context.
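A rough sketch of how the two arms could be launched from a script. The `--print`, `--output-format stream-json`, `--model opus`, and `--strict-mcp-config` flags are the ones named in this post. The `--mcp-config` flag, the config filename, and the helper names are assumptions made for illustration, so check them against your Claude Code version.

```python
import json
from pathlib import Path

BASE_FLAGS = ["--print", "--output-format", "stream-json", "--model", "opus"]

# Same MCP server entry as the config shown above.
MCP_CONFIG = {"mcpServers": {"codegraph": {"command": "codegraph", "args": ["serve", "--mcp"]}}}


def write_mcp_config(path):
    """Write the CodeGraph MCP config to disk and return its path."""
    path = Path(path)
    path.write_text(json.dumps(MCP_CONFIG, indent=2))
    return path


def build_command(prompt, mcp_config_path=None):
    """Return the argv for one fresh headless run; only the MCP flags differ between arms."""
    argv = ["claude", *BASE_FLAGS]
    if mcp_config_path is not None:
        # Assumed flag: --mcp-config <file>. --strict-mcp-config is the flag named in this post.
        argv += ["--mcp-config", str(mcp_config_path), "--strict-mcp-config"]
    return argv + [prompt]


baseline_cmd = build_command("example prompt")
codegraph_cmd = build_command("example prompt", "codegraph-mcp.json")
```

Each argv can then be passed to `subprocess.run(cmd, cwd=arm_directory, capture_output=True, text=True)`, with `arm_directory` pointing at the clean copy for the baseline arm and at the indexed checkout for the CodeGraph arm.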

Verifying the tool actually ran (this is not optional)

A retrieval-tool benchmark is only valid if the tool is actually in the loop, and I learned that the hard way. My first pass at this benchmark silently ran with CodeGraph's MCP server never connected: the config was missing the --mcp flag, and when a server fails to complete its handshake in time, Claude Code proceeds without it rather than erroring out. Every "CodeGraph" run was really just grep+Read. The comparison was noise, and the numbers looked plausibly small, which is exactly how a broken benchmark slips through.

So for the data here, every run is instrumented:

--strict-mcp-config — only the server under test is loaded, with no contamination from other globally-registered MCP servers.
Pre-warmed daemon + MCP_TIMEOUT=30000 — CodeGraph's stdio server attaches to a warm daemon and finishes its handshake before the agent loop starts. (MCP connection is async; on a fast question the agent can otherwise finish before a cold server is ready.)
A per-run assertion that CodeGraph reached connected status, plus a record of whether the agent actually invoked a codegraph_* tool. Runs that didn't connect were discarded.
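The per-run assertion can be a few lines over the stream-json transcript. This sketch assumes the first `system` / `init` event carries an `mcp_servers` list with `name` and `status` fields, and that CodeGraph tool names contain `codegraph_`. Both are assumptions about the transcript shape, so confirm them against a real transcript before relying on the check.

```python
import json


def check_run(transcript_lines, server="codegraph"):
    """Return (connected, used_tool) for one stream-json transcript.

    Assumed event shapes (verify against your own transcripts):
      {"type": "system", "subtype": "init", "mcp_servers": [{"name": ..., "status": ...}]}
      {"type": "assistant", "message": {"content": [{"type": "tool_use", "name": ...}]}}
    """
    connected = False
    used_tool = False
    for line in transcript_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("type") == "system" and event.get("subtype") == "init":
            for srv in event.get("mcp_servers", []):
                if srv.get("name") == server and srv.get("status") == "connected":
                    connected = True
        elif event.get("type") == "assistant":
            for block in event.get("message", {}).get("content", []):
                if (isinstance(block, dict) and block.get("type") == "tool_use"
                        and "codegraph_" in block.get("name", "")):
                    used_tool = True
    return connected, used_tool


# Hand-written sample events, not real output from the benchmark.
sample = [
    json.dumps({"type": "system", "subtype": "init",
                "mcp_servers": [{"name": "codegraph", "status": "connected"}]}),
    json.dumps({"type": "assistant", "message": {"content": [
        {"type": "tool_use", "id": "t1", "name": "mcp__codegraph__codegraph_context", "input": {}}]}}),
]
connected, used_tool = check_run(sample)
assert connected, "discard this run: CodeGraph never connected"
```

Keeping `connected` and `used_tool` as separate flags matters for the control question: a run where the server connected but the agent chose grep is valid data, while a run where the server never connected is not.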

All 20 CodeGraph runs in this post connected. The agent invoked CodeGraph on Q1-Q4 (4/4 repeats each) and — correctly — chose grep on the Q5 control (0/4). Most vendor benchmarks never report this check. After watching mine fail it silently, I won't publish a retrieval benchmark without it.

The 5 questions

Full verbatim prompts in the Appendix. Brief overview:

| # | Question | What it tests | Hypothesis for CodeGraph |
| --- | --- | --- | --- |
| Q1 | Route resolution: `GET /users/:id` → handler | Route-aware extraction | Strong win |
| Q2 | Middleware chain trace through `app.use` | Dynamic dispatch tracing | Decisive win via structural lookup |
| Q3 | Multi-runtime adapter architecture | Cross-file abstraction | Mid-strong win |
| Q4 | Refactor impact: add mandatory `requestId` to `Context` | Impact propagation + completeness | Strong win (what `codegraph_impact` is built for) |
| Q5 | Text search: every literal `'Content-Type'` | Keyword search baseline | ~Parity; agent should decline the tool |

Q5 is the CONTROL — the tool should not win here, and whether the agent even reaches for it is itself a signal.

The data

Each row averaged over 4 repeats. Cost is Claude's own total_cost_usd (the API's authoritative figure, not my own multiplication); wall latency from request to final token; tool calls counted from unique tool_use blocks in the transcript; tokens are input + output (cache tokens tracked separately in the CSV).
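As a sketch of how the tool-call and cost columns can be pulled from one transcript: the function below counts distinct `tool_use` block ids and reads `total_cost_usd` from the final event. The event shapes (an `id` on each `tool_use` block, a closing `result` event holding the cost) are assumptions about the stream-json format and should be checked against a real transcript.

```python
import json


def run_metrics(transcript_lines):
    """Return (unique tool-call count, total_cost_usd) for one transcript.

    Assumes tool_use blocks carry an "id" and the final "result" event
    carries "total_cost_usd"; verify both against a real transcript.
    """
    tool_ids = set()
    cost = None
    for line in transcript_lines:
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("type") == "assistant":
            for block in event.get("message", {}).get("content", []):
                if isinstance(block, dict) and block.get("type") == "tool_use":
                    tool_ids.add(block["id"])  # a set drops any re-emitted block
        elif event.get("type") == "result":
            cost = event.get("total_cost_usd")
    return len(tool_ids), cost


# Hand-written sample events, not real benchmark output.
def _assistant(block_id, name):
    return json.dumps({"type": "assistant", "message": {"content": [
        {"type": "tool_use", "id": block_id, "name": name, "input": {}}]}})


sample_run = [
    _assistant("t1", "Grep"),
    _assistant("t1", "Grep"),  # same block seen twice, counted once
    _assistant("t2", "Read"),
    json.dumps({"type": "result", "total_cost_usd": 0.25}),
]
print(run_metrics(sample_run))
```

Counting ids rather than events is what "unique" means here: a block that shows up more than once in the stream still counts as one tool call.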

Cost / tokens

| Q | Baseline cost | CodeGraph cost | Δ cost | Baseline tokens | CodeGraph tokens | Δ tokens |
| --- | --- | --- | --- | --- | --- | --- |
| Q1 route | $0.321 | $0.393 | +22.5% ❌ | 10,115 | 6,045 | −40.2% |
| Q2 middleware | $0.212 | $0.303 | +43.4% ❌ | 7,233 | 5,649 | −21.9% |
| Q3 multi-runtime | $0.490 | $0.348 | −28.9% ✓✓ | 11,582 | 7,048 | −39.1% |
| Q4 refactor | $0.402 | $0.509 | +26.5% ❌ | 9,119 | 8,567 | −6.1% |
| Q5 text (ctrl) | $0.267 | $0.253 | −5.3% | 8,874 | 8,998 | +1.4% |
| Aggregate | $0.338 | $0.361 | +6.8% | 9,385 | 7,261 | −22.6% |

Tool calls / wall latency

| Q | Baseline calls | CodeGraph calls | Δ calls | Baseline latency | CodeGraph latency | Δ latency |
| --- | --- | --- | --- | --- | --- | --- |
| Q1 route | 7.8 | 6.8 | −12.9% | 49.8s | 51.2s | +2.8% |
| Q2 middleware | 5.0 | 4.0 | −20.0% | 41.2s | 43.1s | +4.5% |
| Q3 multi-runtime | 35.2 | 7.0 | −80.1% ✓✓ | 123.7s | 58.4s | −52.8% ✓✓ |
| Q4 refactor | 19.8 | 11.8 | −40.5% ✓ | 87.9s | 85.9s | −2.2% |
| Q5 text (ctrl) | 2.2 | 2.0 | −11.1% | 51.2s | 43.5s | −15.0% |
| Aggregate | 14.0 | 6.3 | −55.0% | 70.8s | 56.4s | −20.3% |

Two headline rows: −55% tool calls (real and consistent — CodeGraph used fewer tools on every single question) and +6.8% cost (CodeGraph is not cheaper on Hono). The latency win (−20%) is real but concentrated: almost all of it is Q3; on Q1/Q2/Q4 latency is within ±5%.
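The aggregate rows can be rechecked from the per-question rows. This sketch copies the per-question means from the two tables above and recomputes the aggregate deltas. Because every cell has the same number of repeats, the mean of the per-question means equals the mean over all runs.

```python
from statistics import mean

# Per-question means copied from the two tables above: (baseline, CodeGraph).
cost_usd = {"Q1": (0.321, 0.393), "Q2": (0.212, 0.303), "Q3": (0.490, 0.348),
            "Q4": (0.402, 0.509), "Q5": (0.267, 0.253)}
tool_calls = {"Q1": (7.8, 6.8), "Q2": (5.0, 4.0), "Q3": (35.2, 7.0),
              "Q4": (19.8, 11.8), "Q5": (2.2, 2.0)}


def aggregate_delta(rows):
    """Return (baseline mean, CodeGraph mean, percent change relative to baseline)."""
    base = mean(b for b, _ in rows.values())
    cg = mean(c for _, c in rows.values())
    return base, cg, 100.0 * (cg - base) / base


for name, rows in [("cost", cost_usd), ("tool calls", tool_calls)]:
    base, cg, delta = aggregate_delta(rows)
    print(f"{name}: {base:.3f} -> {cg:.3f} ({delta:+.1f}%)")
```

The recomputed deltas land within about a tenth of a percentage point of the published aggregate row. The small gap is expected, since the table rows are already rounded while the aggregate was computed from the raw per-run values.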

The variance story — CodeGraph bounds the worst case

Averages hide the most interesting result. Baseline tool-call counts, per repeat:

| Q | Baseline (4 repeats) | CodeGraph (4 repeats) |
| --- | --- | --- |
| Q3 multi-runtime | 14, 23, 52, 52 | 5, 6, 8, 9 |
| Q4 refactor | 9, 10, 13, 47 | 9, 10, 12, 16 |

On the broad questions, baseline grep+Read occasionally spiraled — the agent without an index wandered to 47-52 tool calls chasing files. Across all 40 runs, baseline ranged from 2 to 52 tool calls; CodeGraph ranged from 1 to 16. A large part of CodeGraph's value here isn't the mean — it's that it bounds the worst case. When the structural answer is one graph query away, the agent can't spiral. That's a reliability property, not just an efficiency one, and it doesn't show up in a single average.
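A short sketch that puts the worst case next to the average, using the per-repeat counts from the table above:

```python
# Per-repeat tool-call counts copied from the table above.
repeats = {
    ("Q3", "baseline"): [14, 23, 52, 52],
    ("Q3", "codegraph"): [5, 6, 8, 9],
    ("Q4", "baseline"): [9, 10, 13, 47],
    ("Q4", "codegraph"): [9, 10, 12, 16],
}


def spread(counts):
    """Return (min, mean, max) so the worst case sits next to the average."""
    return min(counts), sum(counts) / len(counts), max(counts)


for (question, arm), counts in repeats.items():
    lo, avg, hi = spread(counts)
    print(f"{question} {arm:9s} min={lo:2d} mean={avg:5.2f} max={hi:2d}")
```

Reporting the maximum alongside the mean is the point of the exercise: on Q4 the two arms have nearly identical minimums, and the difference lives almost entirely in the tail.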

A caveat on sample size: this is 4 repeats on one repo. Treat the magnitudes as indicative and the directions as robust: CodeGraph used fewer tool calls in every question and nearly every repeat, and the cost direction was consistent within cells (more expensive on Q1/Q2/Q4, cheaper only on Q3, with the Q5 control near parity). What I would not over-read is the exact percentages; a 9-to-47 baseline spread on Q4 means the per-question means carry real uncertainty, especially at n=4.
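One quick way to see how fragile a 4-repeat mean is: drop each repeat in turn and recompute. The sketch below does this for the Q4 baseline counts from the variance table. It is a simple sensitivity check, not a formal confidence interval.

```python
def leave_one_out_means(counts):
    """Mean of the sample with each repeat dropped in turn."""
    return [sum(counts[:i] + counts[i + 1:]) / (len(counts) - 1) for i in range(len(counts))]


q4_baseline = [9, 10, 13, 47]  # per-repeat tool calls from the variance table
full_mean = sum(q4_baseline) / len(q4_baseline)
loo = leave_one_out_means(q4_baseline)

print(f"full mean: {full_mean:.2f}")
print("leave-one-out means:", [round(m, 2) for m in loo])
```

Dropping the single 47-call run moves the Q4 baseline mean from 19.75 to roughly 10.7, which is below CodeGraph's Q4 mean of 11.75. The Q4 tool-call reduction therefore rests on one spiral, which is why the worst-case framing is more defensible than the percentage.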

Per-question narrative

Q1 (route resolution) — CodeGraph used, but more expensive. Both arms traced app.fetch → #dispatch → this.router.match() and read SmartRouter → RegExpRouter. CodeGraph used codegraph_context + codegraph_trace (2 calls/run) and cut tool calls 13% and tokens 40% — but cost rose 22.5% and latency was flat. The structural context it front-loaded was heavier than the 1-2 grep steps it saved. Hono's router is small enough (5-6 files for the full picture) that grep finds it directly.

Q2 (middleware chain trace) — used, 43% more expensive. CodeGraph landed the app.use → middleware array → compose() chain in 4 tool calls vs baseline's 5, but cost jumped 43%. Same mechanism as Q1, more pronounced: the call-chain context payload dominated a question baseline answered cheaply in 5 small steps. The clearest example of "fewer steps, more dollars."

Q3 (multi-runtime adapter) — the unambiguous win. Enumerating 6 adapter directories (Cloudflare Workers / Deno / Bun / Node / AWS Lambda / Vercel Edge) is exactly where one graph query beats many Glob+grep iterations. Baseline averaged 35 tool calls and 124s (and spiraled to 52 twice); CodeGraph: 7 calls, 58s, −29% cost. This is the question shape where structural retrieval pays back on every axis at once — and the only one where it saved money.

Q4 (refactor impact: add requestId to Context): tool calls down 40%, cost up. This was supposed to be CodeGraph's strongest case (codegraph_impact is built for blast-radius). It did cut tool calls 40% (and tamed baseline's 47-call spiral), but cost rose 26.5%: the impact-graph walk pulled wide context the agent didn't fully need at Hono's size. Completeness was comparable across arms (both identified the Context constructors in src/hono-base.ts, the Variables plumbing, and the per-method handler signatures). The result is fewer, more-bounded steps, but the propagation graph isn't wide enough here to pay back on dollars.

Q5 (text search, control) — the agent declined the tool, and that's the point. On a pure literal-'Content-Type' search, the agent never invoked CodeGraph in any of the 4 repeats — it reached straight for grep. Result: near-parity (−5% cost, −15% latency, both inside the noise). The old version of this post claimed an "FTS5 fallback" win here; that was an artifact of the broken first run. The truth is simpler and better: with CodeGraph connected and available, the agent correctly chose grep for a grep-shaped task. No over-engineering. That's the table-stakes behavior you actually want from a retrieval tool, and it's worth more than a fabricated 1-step saving.

Cross-validation with CodeGraph's published 7-repo benchmark

| Metric | Published (7 repos) | This test (Hono, n=4) | Reproduces? |
| --- | --- | --- | --- |
| Tool calls | −71% | −55% | ✓ Yes, same ballpark |
| Latency | −46% | −20% | ~ Directionally, ~half the magnitude |
| Tokens | −57% | −23% | ~ Directionally, smaller |
| Cost | −35% | +6.8% | ✗ No, opposite sign |

The tool-call reduction is the part that generalizes cleanly to a repo the team didn't pick. The cost reduction is the part that doesn't, and that's not an attack on CodeGraph; it's a statement about repo size and question shape. Their published suite skews large (VS Code is 30k+ files), but their own published table is non-monotonic in file count: Gin (~110 files) shows a 21% cost win, OkHttp (~645 files) shows ~2%, and Tokio (~790 files) shows 82%. Repo size matters, but it isn't a clean threshold; question shape matters at least as much. A single repo can't locate a universal crossover. What Hono shows is one clear data point: at ~280 files, the step-count win is already here, the dollar win isn't yet.

Decision matrix — install CodeGraph when

| Signal | Install CodeGraph | Skip (baseline grep+Read is enough) |
| --- | --- | --- |
| You care about fewer agent steps / lower latency | ✓ (−55% tool calls even on a small repo) | |
| You're optimizing dollar cost on a sub-~500-file repo | (may cost slightly more) | ✓ |
| Repo is large (low thousands of files+) | ✓ (dollar win should appear too) | |
| Workload is broad multi-file navigation / architecture | ✓ (this is where it wins on every axis) | |
| Workload is narrow single-symbol lookups | (fewer steps, but not cheaper) | (grep is fine) |
| Static-typed (TS / Rust / Go / Java / Swift / C#) | ✓ | |
| Dynamic-typed (Python / Ruby / untyped JS) | ⚠️ partial (tree-sitter misses runtime semantics) | |
| Workflow is text-search dominant | (no penalty: the agent declines the tool) | ✓ |
| Agent reliability matters (bounding worst-case exploration) | ✓ (caps the 50-tool-call spirals) | |
| You're not sure | install it; ~10 min, <2s to index a Hono-sized repo, uninstall is one command | |

Key call from the data: at Hono's scale the reason to install CodeGraph is fewer steps, lower latency, and bounded worst-case exploration — not a lower bill. If your decision rule is purely dollars-per-query on a small repo, baseline grep+Read is still competitive. If it's agent speed, predictability, or you're working in a larger codebase, the index earns its place.

What I'd want to test next

Larger TS repo head-to-head — same 5 questions on Prisma (~2,000 TS files) or TanStack Query (~600 files) to find where the dollar crossover actually is. Hypothesis: cost flips negative somewhere in the high hundreds to low thousands of files.
Dynamic-typed repo — same 5 questions on FastAPI or Django REST to see how much of the step-count win survives when tree-sitter can't resolve dynamic dispatch.
Long-session compounding — single-question runs miss the multi-turn agent context. Does the per-query step saving compound across a real session, or stay linear?

All of that is future work. None of it blocks the install-or-not decision the data above already answers.

One-line verdict

On TypeScript / Rust / Go projects, install CodeGraph if you want fewer agent steps, lower latency, and bounded worst-case exploration — those reproduce on an independent repo. Don't install it expecting a lower bill on a small codebase: at Hono's ~280-file scale it was ~7% more expensive, and in this benchmark a cost win appeared only on broad multi-file navigation (Q3) — the published −35% likely needs much larger repos.

For the architectural deep-dive on why this class of tool works and where the abstractions leak, see the companion Lab piece Agent Retrieval Above the Crossover: A First-Principles Read of CodeGraph (publishing 2026-06-08).

For the broader cost-curve framework this benchmark applies, see the prior Lab post: Agent Retrieval Is a Cost Curve Problem.

Appendix: Benchmark Questions

The 5 prompts used, verbatim. Each was sent to Claude Code in a fresh session (claude --print --model opus), 4 times per arm. No follow-up prompts within a run.

Q1 — Route resolution

When a request hits GET /users/:id in a Hono app, walk me through how Hono's routing finds and invokes the right handler. Where in the source does the URL → handler matching happen, and what data structure stores the route table?

Q2 — Middleware chain trace

Hono middleware is chained via app.use(middleware). When a request flows through several middlewares before hitting the handler, what's the actual call stack from the incoming request to the handler? Specifically — how does Hono ensure middleware runs in order, and how is c (context) + next passed between them?

Q3 — Cross-file abstraction navigation

Hono supports multiple runtime adapters (Cloudflare Workers, Deno, Bun, Node, AWS Lambda, Vercel Edge). How is this multi-runtime abstraction implemented? What's the shared interface, and where do the per-runtime adapters live? Show me the architecture.

Q4 — Refactor impact

Imagine I'm planning to add a mandatory new property requestId: string to Hono's Context class. What files and functions across the codebase would be affected? Give me the full blast radius — where Context is constructed, where it's typed in signatures, and where mandatory-property additions would break.

Q5 — Text search (control)

Find every place in the Hono codebase where the literal string 'Content-Type' (the exact HTTP header name, case-sensitive) appears. Include source code, tests, comments, and documentation.

Scoring & environment

Each question evaluated on cost (total_cost_usd), tokens (input+output, cache tracked separately), wall latency, unique tool-call count, and a manual correctness/completeness check (both arms agreed on the same answer in every Q). Every CodeGraph run was additionally checked for connected MCP status and actual codegraph_* tool invocation.

Environment: Hono @ commit 2cbeadda (2026-05-28) · CodeGraph 0.9.7 · Claude Code 2.1.159 · model claude-opus-4-8 (Opus 4.8) · macOS. Raw per-run data (cost, tokens, tool calls, latency, connection status) for all 40 runs: CSV.

Appendix: Why there's no third arm (knowing)

I intended to include knowing (Blackwell Systems) as a third arm. In headless batch mode its MCP server connected only intermittently — knowing advertises asynchronous tools/listChanged, which races Claude Code's MCP startup window, so on most runs the agent never saw knowing's tools and silently fell back to grep+Read.

Reporting task-cost numbers from runs where the tool wasn't actually in the loop is exactly the trap that invalidated my first attempt at this benchmark (see Verifying the tool actually ran), so I'm not publishing knowing figures. That's a limitation of my batch harness, not a verdict on knowing — a persistent / pre-warmed MCP host or an interactive session would likely fix it. If I get a reliable knowing setup, I'll benchmark it on its own terms and publish separately.

Lab companion (first-principles architectural read of CodeGraph and the class of tools it represents): *Agent Retrieval Above the Crossover* (publishing 2026-06-08)
Prior Lab post in the retrieval / memory mini-series: *Agent Retrieval Is a Cost Curve Problem*
Companion Lab post on cross-session memory: *Agent Memory Is a Cache Coherence Problem*
CodeGraph repo: https://github.com/colbymchenry/codegraph
