Nikita Groshin

Why I spun my benchmark into its own repo (and why every dev tool with a benchmark should)

This week I shipped a benchmark for code-intelligence MCP servers and posted the results — including the cases where my own tool lost. Within 36 hours, the maintainer of one of the competing tools (jcodemunch-mcp) had shipped three
back-to-back releases addressing specific findings the benchmark exposed. Adding new tests for those fixes then exposed a symmetric blind spot in my own parser. I shipped a fix.

That whole loop — competing maintainers iterating on the same eval, in opposite directions, in 36 hours — is what a public benchmark is supposed to do. It almost never does, and I think most of the time it's because the benchmark lives in the wrong place.

So I moved mine.

The benchmark is now its own repo

github.com/sverklo/sverklo-bench

What's in it:

  • README with the headline 90-task results table — replaces "go read the blog"
  • METHODOLOGY.md documenting what's measured, what isn't, and why these specific datasets (express, lodash, the project's own monorepo)
  • CONTRIBUTING.md with three contribution paths: submit a baseline, challenge the methodology, add a dataset
  • tasks/ directory mirroring the ground-truth seed files — read-only reference (a simplified sketch follows below)

The runtime stays in the main monorepo; this is the methodology + results showcase.
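
To give a feel for the ground-truth format, here's a simplified TypeScript sketch of a task seed. The field names and values are illustrative, not the repo's exact schema; the real files live in tasks/.

```typescript
// Illustrative only: a simplified sketch of a ground-truth task seed.
// Field names are hypothetical; the real schema lives in tasks/.
interface BenchTask {
  id: string;                                  // stable task identifier
  dataset: "express" | "lodash" | "monorepo";  // which of the 3 datasets
  prompt: string;                              // the question posed to each MCP server
  expected: {
    symbols: string[];                         // symbols a correct answer must surface
    files: string[];                           // files a correct answer must point at
  };
}

const exampleTask: BenchTask = {
  id: "express-callsites-03",
  dataset: "express",
  prompt: "Find every call site of Router.prototype.use",
  expected: {
    symbols: ["Router.prototype.use"],
    files: ["lib/router/index.js"],
  },
};
```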

Why split

Most dev tools that ship a benchmark put the eval in their main repo. That's wrong — even when (especially when) the benchmark is honest. Two reasons:

1. Eval mixed with product is indistinguishable from marketing

When the benchmark lives in the same repo as the tool it measures, anyone reading the repo sees the eval surface mixed with the tool's own code. They can't separate "the eval is methodologically sound" from "the tool that wrote the eval also wrote favorable scoring rules for itself."

The fix is structural, not editorial. The benchmark needs its own commit history, its own contributor PRs, its own credibility signal independent of the product.

2. Competitors can't engage with a benchmark they'd have to fork the whole product to access

If a competing maintainer wants to argue with the methodology — say, "your task-3 expected output is wrong because my tool returns Y, not X" — they shouldn't have to fork the entire product repo, navigate its directory tree, and make a 5-line edit hidden among the product source.

A standalone benchmark repo lets competitors:

  • File methodology issues without forking
  • Submit baseline implementations as PRs to a dedicated surface (adapter shape sketched below)
  • Track their tool's score over time as a first-class concern

This is the same reason MLPerf isn't part of any single ML framework's repo. It's the same reason TPC benchmarks aren't part of any database vendor's repo. The eval has to be portable across implementations, and portability requires it to
live somewhere none of the implementations own.
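
To make the "submit a baseline" path concrete, here's a rough sketch of the shape an adapter contract could take. Every name in it is illustrative; the actual contract is whatever CONTRIBUTING.md specifies.

```typescript
// Hypothetical adapter contract for a baseline PR; every name here is
// illustrative, not the bench repo's actual interface.
interface TaskInput {
  id: string;       // task identifier from the seed file
  prompt: string;   // the query the tool must answer
}

interface TaskResult {
  answer: string;     // raw tool output, scored against the seed file
  elapsedMs: number;  // wall-clock time, reported alongside accuracy
}

interface BaselineAdapter {
  name: string;       // e.g. "jcodemunch-mcp"
  version: string;    // pinned release the scores apply to
  run(task: TaskInput): Promise<TaskResult>;
}

// A no-op stub showing the shape a competitor's PR would fill in.
const stub: BaselineAdapter = {
  name: "example-tool",
  version: "0.0.0",
  async run(task) {
    return { answer: `no result for ${task.id}`, elapsedMs: 0 };
  },
};
```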

What this enables

I filed three issues on day one to seed the repo's surface:

  • #1 — Add Python codebase as 4th dataset. The current 3-dataset matrix covers TypeScript, JavaScript modular CommonJS, and JavaScript monolithic IIFE. Zero Python. Glaring gap. Anyone deep in the Python ecosystem can pick this up.
  • #2 — Open invitation to GitNexus's maintainer to refresh their baseline. GitNexus has shipped releases since the original baseline integration was written. Inviting publicly so the bench reflects the latest version, not a snapshot.
  • #3 — Open invitation to jcodemunch's maintainer to refresh against v1.80.9. Same pattern. v1.80.9 added _meta.mode, max_results, and file_pattern parameters the current baseline doesn't exploit (sketched below).

The bench-as-feedback-loop only works if competitors can engage cleanly. That's what these issues operationalize.
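
On #3 specifically, here's a hypothetical sketch of what exploiting the new parameters could look like. Only the three parameter names come from v1.80.9; the callTool helper, the "search" tool name, and all values are assumptions.

```typescript
// Hypothetical only: callTool and the "search" tool name are stand-ins.
// Just the three parameter names (_meta.mode, max_results, file_pattern)
// come from the v1.80.9 notes above; every value is an assumption.
declare function callTool(
  name: string,
  args: Record<string, unknown>
): Promise<unknown>;

async function refreshedBaselineQuery() {
  return callTool("search", {
    query: "Router.prototype.use",
    max_results: 20,              // cap result count (new in v1.80.9)
    file_pattern: "lib/**/*.js",  // scope the search (new in v1.80.9)
    _meta: { mode: "fast" },      // mode value is a guess; _meta.mode is the new knob
  });
}
```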

What I'd suggest if you maintain a dev tool with a benchmark

Three concrete moves:

  1. Move the benchmark out of your tool's repo. It can be a sibling repo (yourname/yourname-bench), a separate org-level repo, or a community-owned repo like MLPerf. The exact structure matters less than the fact that the eval has its own commit log.

  2. Publish where you lose. Every benchmark has an "honesty section" — the slice where your own tool gets beaten by something else. Document those losses prominently. Two reasons: (a) it's the credibility signal a competitor needs to engage seriously, and (b) it's the part competitors can actually help fix. A benchmark that only documents wins is a marketing artifact, not an evaluation.

  3. Invite competitor maintainers to submit baselines. Privately or publicly. If they decline, you control the trust narrative ("the bench is open, here's how to argue with it"). If they engage, the bench becomes the scoreboard for the entire category. Either outcome beats a benchmark only the original author runs.

The first two are easy. The third is uncomfortable. Do it anyway — the bench-as-feedback-loop pattern needs all three to fire, and the third is the only one that's structurally hard to fake.


The repo: github.com/sverklo/sverklo-bench

The original benchmark + the bench-loop story that motivated the spin-out: sverklo.com/bench/

If you maintain a code-intelligence tool and want to argue with the methodology, issues #2 and #3 on the new repo are the cleanest way in.
