A tuned grep beat my MCP code-intelligence server on F1 by 9 points.
I'm publishing the result anyway. Here's why.
## Why this benchmark exists
I've spent the last six months building sverklo, a local-first MCP server that gives AI coding agents (Claude Code, Cursor, Windsurf) a real symbol graph instead of grep-based pattern matching. The product positioning has always been "stops the agent from hallucinating function names that don't exist in your codebase."
That positioning is hand-wavy without numbers. Six months in, I had no public benchmark; whatever story I had about retrieval quality, I was only telling it to myself.
So I built one: 60 hand-verified retrieval tasks across two real OSS codebases (expressjs/express and the sverklo repo itself), three baselines (naive grep, smart grep, sverklo), and metrics that measure both retrieval quality (F1, recall, precision) and the thing AI agents actually pay for (input tokens, tool calls, wall time).
Results live at sverklo.com/bench. Raw JSONL outputs are in the repo at benchmark/results/<timestamp>/. The harness runs in one npm command. Disagreements with my numbers are useful — file an issue with your machine spec.
## The headline
| baseline | F1 | tokens | tool calls |
|---|---|---|---|
| naive-grep | 0.35 | 15,814 | 7.6 |
| smart-grep (tuned) | 0.67 | 731 | 11.8 |
| sverklo | 0.58 | 255 | 1.0 |
A tuned grep beats sverklo on F1 by 9 points. That's not what I expected when I started building this. If you can write a clean ripgrep invocation with language filters and definition-shaped patterns, you get higher F1 than my hybrid retrieval stack returns.
What sverklo wins on:
- 62× fewer tokens than naive grep (255 vs 15,814)
- 2.9× fewer tokens than smart grep (255 vs 731)
- 1 tool call vs grep's 7-12 per task
- ~1ms wall time after a 3.7-second cold start (the index build)
## Why "tokens per correct answer" is the load-bearing metric
If you're standing at a terminal with rg, F1 is what matters. You read the matches. The agent isn't paying for them.
If you're an AI agent with a 200K-token context window, every token has an opportunity cost. Burning roughly 15,000 tokens on grep noise to find one function is about 14,750 tokens that could have gone into the actual change; the agent that gets the same answer in 255 tokens keeps that headroom for doing the work.
The metric that actually matters is tokens per correct answer: input tokens divided by recall. The bench reports this for both gated (F1 ≥ 0.8) and ungated runs. For sverklo on the gated subset, it's 203 tokens per correct answer. For naive grep, 3,557. For smart grep, 165 — smart grep is genuinely competitive on per-correct-answer cost when its F1 lands.
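The metric is simple enough to sketch in a few lines. This is an illustrative helper, not the bench harness code, and the numbers in the usage test below are hypothetical rather than figures from the bench:

```typescript
// "Tokens per correct answer" as described above:
// input tokens divided by recall. A baseline that spends few tokens
// but retrieves nothing scores infinitely badly, as it should.

interface RunStats {
  inputTokens: number; // total input tokens the baseline consumed
  recall: number;      // fraction of gold items retrieved, in [0, 1]
}

function tokensPerCorrectAnswer({ inputTokens, recall }: RunStats): number {
  if (recall === 0) return Infinity; // nothing correct retrieved
  return inputTokens / recall;
}
```

With hypothetical inputs, a run that spends 1,000 tokens at recall 0.5 costs 2,000 tokens per correct answer, which is why a high-token baseline can look fine on F1 and still lose on this axis.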
The mistake I almost made: optimising for F1. The thing AI coding agents actually need is the cheapest correct retrieval, not the highest-precision retrieval that takes 12 tool calls to assemble.
## Per-category: where each baseline shines
| Category | Best F1 | Best token economy |
|---|---|---|
| P1 — Definition lookup (n=20) | sverklo (0.75) | smart-grep (196 tok) |
| P2 — Reference finding (n=20) | smart-grep (0.81) | sverklo (157 tok) |
| P4 — File dependencies (n=10) | sverklo (0.86) | sverklo (74 tok) |
| P5 — Dead code (n=10) | smart-grep (0.55) | sverklo (579 tok, F1 = 0.02) |
The pattern: sverklo wins on the slices where structural retrieval (the symbol graph, the import graph) directly answers the question. Definition lookup (P1) and file dependencies (P4) are exactly that. Reference finding (P2) turns out to be a regex problem grep handles well, because the reference patterns in JS/TS are syntactically uniform enough that \bsymbol\b works most of the time.
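The word-boundary heuristic is worth seeing concretely. This is an illustrative sketch of what `rg '\bsymbol\b'` does, not how either baseline is implemented:

```typescript
// Find the 1-indexed lines that mention a symbol as a whole word.
// Word boundaries are what keep `render` from matching `rendered`.

function findReferences(source: string, symbol: string): number[] {
  // Escape regex metacharacters so arbitrary identifiers are safe.
  const escaped = symbol.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp(`\\b${escaped}\\b`);
  return source
    .split("\n")
    .flatMap((line, i) => (pattern.test(line) ? [i + 1] : []));
}

const src = [
  "import { render } from './view';",
  "const rendered = cache.get(key);", // 'rendered' is not 'render'
  "render(tree);",
].join("\n");
```

Here `findReferences(src, "render")` picks up lines 1 and 3 and skips the `rendered` on line 2, which is roughly why P2 is winnable with a regex in syntactically uniform JS/TS code.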
## Where sverklo fails: the P5 dead-code slice
P5 is the embarrassing one. F1 = 0.02. sverklo_refs looks at the static call graph. It doesn't see dynamic invocations (this[methodName]()), it doesn't see deserialization-driven calls (JSON.parse + eval patterns), and it doesn't see calls through ORM proxies that spell themselves with template-string method names.
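A minimal example of the first failure mode, with hypothetical names (`Router`, `handleGet`, `dispatch` are invented for illustration):

```typescript
// A call pattern a static call graph cannot see: `handleGet` is never
// referenced by name anywhere, so a static reference lookup would flag
// it as dead code even though it runs on every request.

class Router {
  handleGet(path: string): string {
    return `GET ${path}`;
  }

  dispatch(method: string, path: string): string {
    // Dynamic invocation: the method name is assembled at runtime,
    // so no `handleGet` identifier appears at the call site.
    const handler = `handle${method[0].toUpperCase()}${method.slice(1)}`;
    return (this as any)[handler](path);
  }
}
```

`new Router().dispatch("get", "/users")` reaches `handleGet` just fine; the static graph sees only `dispatch`.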
Smart-grep gets 0.55 on the same slice by aggressively reading whole files and matching loose patterns. The "loose" matters: it picks up a lot of false positives, but on dead-code detection a false positive is "this function is alive" — which is the safer error.
P5 is the next thing I'm fixing. The plan is to extend the reference graph with a runtime-trace mode (instrument the test suite, log actual call sites, merge into the static graph). I'll publish that as a new bench slice when it lands.
## Architecture: channelized RRF
The novel piece in sverklo's retrieval is channelized Reciprocal Rank Fusion. Most hybrid retrievers run RRF once over fts ∪ vector. Sverklo runs RRF per channel — FTS, vector, doc-section, path, symbol-name — and fuses the per-channel ranks with channel-specific weights. The path channel is weighted 1.5× because filename matches are precision-skewed: when a query's keywords match a filename, it's signal worth boosting.
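Here is a minimal sketch of the fusion step. The channel names and the `k = 60` constant (a common RRF default) are assumptions for illustration; only the 1.5× path weight comes from the post, and the real weights live in sverklo's config:

```typescript
// Channelized Reciprocal Rank Fusion: run the standard RRF
// contribution 1/(k + rank) per channel, scaled by a channel weight,
// instead of one RRF pass over a single merged candidate pool.

type Ranking = string[]; // doc ids, best first

const CHANNEL_WEIGHTS: Record<string, number> = {
  fts: 1.0,
  vector: 1.0,
  docSection: 1.0,
  path: 1.5, // filename matches are precision-skewed, so boost them
  symbolName: 1.0,
};

function channelizedRRF(
  channels: Record<string, Ranking>,
  k = 60,
): Map<string, number> {
  const scores = new Map<string, number>();
  for (const [channel, ranking] of Object.entries(channels)) {
    const weight = CHANNEL_WEIGHTS[channel] ?? 1.0;
    ranking.forEach((doc, rank) => {
      scores.set(doc, (scores.get(doc) ?? 0) + weight / (k + rank + 1));
    });
  }
  return scores; // caller sorts descending by score
}
```

The weighting is why a doc ranked first in the path channel can outscore a doc ranked first in FTS: with two channels disagreeing, the 1.5× path weight breaks the tie toward the filename match.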
The full architecture rationale is in "RRF is doing 80% of the work" if you want the deep dive on why per-channel weighting matters more than the embedding model choice.
## Reproducing this
```bash
git clone https://github.com/sverklo/sverklo
npm install
npm run build
npm run bench:primitives
```
Raw outputs (raw.jsonl, summary.json, report.md) land in benchmark/results/<timestamp>/. The report.md mirrors the bench page tables. If your numbers differ, please file an issue with your machine spec and the run timestamp — I want the disagreements.
## What's the takeaway?
If you're choosing between grep and an MCP code-intelligence server for your AI coding agent today:
- If your codebase is small (~30 files), use rg. The MCP server overhead doesn't pay back.
- If you're standing at the terminal yourself doing exploration, learn smart-grep flags. The F1 lands you in the right place.
- If you're running an AI coding agent on a larger codebase and the agent invents function names that don't exist in your repo, the retrieval-token-economy gap is real and material. Sverklo's 1-tool-call retrieval is what unlocks that.
## Try it
```bash
npm install -g sverklo
cd your-project
sverklo init
```
Sverklo is MIT-licensed, runs entirely on your laptop with embedded SQLite + a local ONNX model. No API keys. No cloud. No telemetry by default.
Or read the full bench report first — including the slice where sverklo loses.
## Discuss
What metrics do you use when evaluating retrieval for AI coding agents? Drop a comment if "tokens per correct answer" feels right or wrong as the load-bearing axis.