Mohan Krishna Alavala

I corrected my own benchmark claim from 91.5% to 88%. Here's what changed.

A week ago I shipped v4.4.3 of context-router with a number on the README: "91.5% fewer tokens than code-review-graph."

It was true in the narrow sense that both numbers came from real benchmark runs. It was also wrong in every way that matters. The two tools were running on different repos, on different tasks, with different inputs. I was comparing my best-case workload to their best-case workload and putting a percent sign between them.

This post is about the redo. v4.4.4 ships a workload-matched run on the same SHAs and the same diffs as input, on the same machine. The new headline is ~88% fewer tokens, 2/3 rank-1 hits vs 0/3 on the kubernetes commits I picked. That's a number I'll defend.

What context-router does

context-router is a small Python project for routing AI coding agents to the minimum useful context. You point it at your repo, give it a task type (review, debug, implement, handover), and it returns a ranked pack of files and snippets sized to fit a token budget.

The way you benchmark something like this is straightforward: pick a real bug-fix commit, hide the fix, hand the tool the parent state plus the diff, and check whether the file the human eventually changed shows up in the tool's top-N output. If yes, the tool would have routed the agent to the right place.
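The hit check at the heart of that loop is just membership in the tool's ranked output. A minimal sketch (the function names are illustrative, not context-router's actual API):

```python
def top_n_hit(ranked_files: list[str], changed_file: str, n: int = 3) -> bool:
    """True if the file the human actually changed appears in the top-n output."""
    return changed_file in ranked_files[:n]

def score_task(ranked_files: list[str], changed_file: str) -> tuple[bool, bool]:
    # rank-1 hit: the very first suggestion is the changed file
    rank1 = ranked_files[:1] == [changed_file]
    # recall-at-3: the changed file shows up anywhere in the top three
    recall3 = top_n_hit(ranked_files, changed_file, n=3)
    return rank1, recall3
```

Aggregate those booleans over the task set and you get the "2/3 rank-1, 3/3 recall-at-3" style numbers reported below.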

How I got the wrong number

For v4.4.3 I ran context-router across six OSS repos (gin, actix-web, django, gson, requests, zod). Separately, I ran code-review-graph on a different set of repos and grabbed its average tokens per output. Then I divided.

That isn't a comparison. That's two unrelated measurements with a percent sign glued between them. If code-review-graph happened to be running on repos where it had to emit more boilerplate, or where its scorer was less confident, my number would be flattering for reasons that had nothing to do with my tool.

Someone pointed this out. They were right. I pulled the claim and rebuilt the test.

Workload-matching in one sentence

Both tools see the same SHAs and the same diff as input.

That's the rule. If you can't say that sentence about a benchmark, the percent at the bottom isn't really pointing at anything.

Concretely, here's what v4.4.4's run looks like:

  • I picked three single-source-file bug-fix commits in kubernetes/kubernetes: kubelet status_manager, client-go clientcmd loader, and kube-proxy winkernel proxier. SHAs are pinned in benchmark/holdout/kubernetes/tasks.yaml so anyone can reproduce.
  • For each commit both tools get the same input: the parent tree, with the parent→fix diff handed in. Neither tool gets to "see the answer" in the working copy.
  • context-router: pack --mode review --pre-fix <fix-sha>.
  • code-review-graph: detect-changes --base <fix-sha>^.
  • The diff each tool consumes is git diff <fix-sha>^..<fix-sha>. Identical bytes.

Then I report what each tool predicts in its top-3, what its rank-1 was, and how many tokens it emitted.
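The per-commit setup above can be captured in a small command builder. This is a sketch, assuming the CLI entry points are named after the projects; the real pinned SHAs live in tasks.yaml, not here:

```python
def commands_for(fix_sha: str) -> dict[str, list[str]]:
    """Build the invocations for one pinned bug-fix commit.

    Both tools consume the identical parent->fix diff:
    git diff <fix-sha>^..<fix-sha>. Same bytes in, per the matching rule.
    """
    return {
        "diff": ["git", "diff", f"{fix_sha}^..{fix_sha}"],
        "context-router": ["context-router", "pack",
                           "--mode", "review", "--pre-fix", fix_sha],
        "code-review-graph": ["code-review-graph", "detect-changes",
                              "--base", f"{fix_sha}^"],
    }
```

The point of routing everything through one builder is that the "same input" property holds by construction rather than by discipline.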

The numbers

                     context-router   code-review-graph
Rank-1 hits          2/3              0/3
Recall-at-3          3/3              3/3
Total tokens         406              3,478
Avg tokens / task    135              1,159
Errors               0                0

Token delta on this workload: -88.3%.
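For the record, the headline number is nothing more than the relative change between the two token totals:

```python
def token_delta_pct(ours: int, theirs: int) -> float:
    """Percent change in emitted tokens relative to the other tool."""
    return (ours - theirs) / theirs * 100

# Totals from the table above: 406 vs 3,478.
delta = token_delta_pct(406, 3478)  # rounds to -88.3
```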

A few honest things to note before anyone gets too excited:

Three tasks is a small N. I'm confident in the direction; the precise percentage could easily shift on a different task mix. If you put more weight on the single number than that, you're reading too much into it.

Recall-at-3 is tied. Both tools surfaced the right file in their top three on every task. The useful gap is at rank-1, and at cost. If your agent only reads the top hit, context-router takes you to the right file two times out of three; the other tool zero. If your agent reads the top three, both tools work, but one costs roughly 9× more tokens to do it.

Both tools were tripped by the same fixture noise. I had to reconstruct the kubernetes repo from per-commit GitHub tarballs because depth-50000 clones throttled badly on my network and a full clone is more bandwidth than I had at the time. GitHub's tarball generator stamps the source SHA into a couple of version.sh and version/base.go files at archive time. Those files appear in the synthetic parent→fix diff, but were not in the real upstream commit. Both tools' rank-1 picks on the two missed cases were one of those stamped files. On a real working-tree-diff workflow that noise wouldn't exist. I'll re-run this on a full clone once I have the bandwidth.
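Neither tool currently filters this, but the noise is mechanically identifiable. A hypothetical guard over the synthetic diff's paths might look like this (the suffixes are illustrative, not an exhaustive list of what tarballs stamp):

```python
# Path suffixes that GitHub's tarball generator stamps with the source SHA
# at archive time. Changes to these files in a tarball-reconstructed diff
# are archive artifacts, not part of the real upstream commit.
TARBALL_STAMPED_SUFFIXES = ("version.sh", "version/base.go")

def strip_fixture_noise(diff_paths: list[str]) -> list[str]:
    """Drop paths that are archive-time artifacts rather than real changes."""
    return [p for p in diff_paths
            if not p.endswith(TARBALL_STAMPED_SUFFIXES)]
```

On a working-tree-diff workflow this filter would be a no-op, which is exactly why the noise only showed up in the reconstructed fixture.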

code-review-graph indexes faster. Roughly 80 seconds to build its graph + FTS for the full kubernetes tree. context-router takes 4–5 minutes on the same checkout because it's collecting richer call/symbol metadata. That's a real cost you pay; the precision and token economy at query time are what you get for it.

The full report with per-task tables, predicted top-3 lists, and the reproducer is at benchmarks/comparison-code-review-graph.md. The caveats are in the report itself, not in a corner where nobody looks.

What else shipped in v4.4.4

The benchmark redo wasn't the only thing in this release. The other piece worth mentioning, because it's load-bearing for the 2/3 rank-1 number, is an FTS5 anchor for implement-mode candidate retrieval.

v4.4.3 had a quiet regression on repos with more than 10,000 symbols: implement-mode's candidate set came from a get_all query capped at the first 10K rows with no ORDER BY. If the file you cared about lived past row 10,000 (say, in a 197K-symbol kubernetes graph), it was invisible. The bug was masked on every smaller repo I tested against.

The v4.4.4 fix is a SQLite FTS5 virtual table over (name, signature, file_path) with porter + unicode61 tokenization, kept live by three triggers. SymbolRepository.search_fts(query, repo, limit=200) returns BM25-ranked symbol rows; the orchestrator unions those with the existing 10K slice, FTS first so they survive top-N capping. When FTS returns zero hits and get_all returned ≥10K rows, a stderr warning fires naming the case. No silent degradation.
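The shape of that index is reproducible in stock sqlite3. A minimal sketch, with one of the three sync triggers shown; the table and column names follow the description above, everything else is illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE symbols(
    id INTEGER PRIMARY KEY, name TEXT, signature TEXT, file_path TEXT);

-- External-content FTS5 index over (name, signature, file_path),
-- porter stemming layered on the unicode61 tokenizer.
CREATE VIRTUAL TABLE symbols_fts USING fts5(
    name, signature, file_path,
    content='symbols', content_rowid='id',
    tokenize='porter unicode61'
);

-- One of the triggers that keep the index live (insert case).
CREATE TRIGGER symbols_ai AFTER INSERT ON symbols BEGIN
  INSERT INTO symbols_fts(rowid, name, signature, file_path)
  VALUES (new.id, new.name, new.signature, new.file_path);
END;
""")

con.execute(
    "INSERT INTO symbols(name, signature, file_path) VALUES (?, ?, ?)",
    ("syncPod", "func (m *manager) syncPod(...)",
     "pkg/kubelet/status/status_manager.go"))
con.commit()

# BM25-ranked lookup, capped the same way as search_fts(..., limit=200).
rows = con.execute(
    "SELECT name, file_path FROM symbols_fts "
    "WHERE symbols_fts MATCH ? ORDER BY bm25(symbols_fts) LIMIT 200",
    ("syncPod",),
).fetchall()
```

Because the FTS lookup is a ranked query rather than an unordered slice, a symbol past row 10,000 can still surface, which is the property the get_all path was missing.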

Three things I'd like you to take from this

  1. Workload-matched or it doesn't count. If you read a tool benchmark and can't tell whether both systems saw the same input, treat the result as marketing.
  2. Show the misses. "2/3" with the failed case explained is more credible than "100%" with no commentary. The fixture noise that tripped both tools on this run is right there in the report. Hiding it would have made the rank-1 number look better and the project less trustworthy.
  3. A correction isn't a defeat. v4.4.3 had a claim that didn't hold up. v4.4.4 has one that does. The repo is in better shape than it would have been if nobody had pushed back.

If you want to reproduce the run yourself, the commands are at the bottom of the comparison report. If you find a workload where the numbers don't hold, open an issue with the raw comparison_*.json attached and I'll either fix it or update the README to match what's true.

context-router is on GitHub; v4.4.4 is on PyPI as context-router-cli and on Homebrew as mohankrishnaalavala/context-router/context-router.
