Luc B. Perussault-Diallo

How do you benchmark an MCP server you built?

I built a code-intelligence MCP server. Then I built a benchmark for code-intelligence MCP servers. Then my tool placed first on every scenario.

I didn't believe it.

So I threw the harness out and rewrote it from scratch. Same result. I built three held-out scenarios with hand-graded reference scores, ran each iteration of the bench against them, and only trusted a number when it correlated above 0.85 with my own judgment.

That's the part I want to write about. Not which tool won. The methodology decisions you have to make when you are also one of the contestants.

Benchmarking AI tools is harder than benchmarking models because the variable is the tool, not the model.

Disclosure first. I am the author of Sense, one of four MCP servers in the bench, scored alongside a no-MCP baseline (Claude Code with grep, find, and Read). Everything below is in the open: code, scenarios, rubrics, transcripts, judge prompts, and the analysis of where my own tool loses. The repo is public at github.com/luuuc/sense. Don't take my word for any of it.


The result

| Rank | Tool | Fairness | Quality | Tokens | Time | Cost |
|------|------|----------|---------|--------|------|------|
| #1 | sense | 81.3% | 85.4% | 10,896 | 141s | $6.22 |
| #2 | probe | 77.7% | 84.8% | 12,119 | 162s | $6.23 |
| #3 | baseline | 77.2% | 84.2% | 12,716 | 185s | $7.57 |
| #4 | serena | 75.2% | 83.4% | 14,800 | 191s | $7.57 |
| #5 | gitnexus | 74.9% | 84.5% | 12,964 | 173s | $6.87 |

Two of the four MCP servers scored below the no-MCP baseline. Serena and Gitnexus, by default, made the agent worse than just letting Claude use grep, find, and Read.


Why an eval framework wouldn't cut it

That result is the claim worth defending. Two MCP servers below baseline is something I need to back up, which is why most of my time went into the bench's methodology, not its tooling.

I looked at the obvious frameworks first (Inspect AI from UK AISI is the closest fit for agent evals; Promptfoo, DeepEval, OpenAI Evals all have their place). They handle maybe a third of the work: the run loop, the LLM-as-judge calls, the report rendering. The other two-thirds had to be custom.

| What a framework would handle | What I had to build |
|---|---|
| Run loop, prompt construction | MCP server orchestration per cell |
| LLM-as-judge calls | Citation grounding vs pinned repo |
| Report rendering | answer_text / tool_input split |
| | Fairness vs adoption layering |
| | Held-out + Spearman anchor |
| | Judge variance characterization |

The reason is shape. An eval framework is built around "model produces output, output gets graded against expected output." That's a single-turn paradigm. What I needed was: given an MCP server attached to Claude Code, does an agent reach a useful answer faster, cheaper, and with fewer hallucinated file paths than the same agent with no MCP?

The variable isn't the model. The model is fixed (Opus 4.7, 1M context). The variable is the tool the model has access to. None of those custom pieces exist as off-the-shelf primitives.
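To make "MCP server orchestration per cell" concrete, here is a rough sketch of the shape of that loop. The helper run_claude_code_session, the SCENARIOS map, and the paths are illustrative stand-ins of mine, not the bench's actual code:

import json
from itertools import product
from pathlib import Path

TOOLS = ["sense", "probe", "serena", "gitnexus", "baseline"]
SCENARIOS = {"flask": "deadbeef"}  # repo -> pinned commit (placeholder values)

def run_cell(tool: str, repo: str, commit: str) -> None:
    out = Path("bench/results") / tool / repo
    out.mkdir(parents=True, exist_ok=True)
    # baseline gets no MCP config; every other tool gets its bench/tools/<tool>.sh server
    mcp_config = None if tool == "baseline" else f"bench/tools/{tool}.sh"
    transcript = run_claude_code_session(   # hypothetical wrapper around the agent CLI
        repo=repo,
        commit=commit,
        mcp_config=mcp_config,
        scenario=f"bench/scenarios/{repo}.yaml",
    )
    (out / "transcript.json").write_text(json.dumps(transcript))
    (out / "run_meta.json").write_text(json.dumps({"tool": tool, "repo": repo, "commit": commit}))

for tool, (repo, commit) in product(TOOLS, SCENARIOS.items()):
    run_cell(tool, repo, commit)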


1. Citation grounding against a pinned commit

Most evals grade text against text. I needed to grade text against a filesystem.

When an AI agent says "the dispatch logic lives at app.py:1625," that claim is either true or it isn't. The number is either inside the file or beyond EOF. The file is either at that path or it isn't.

So every file:line and file:Symbol reference in the answer gets extracted by regex, then verified against the repo at the pinned commit. Three buckets:

  • Grounded: the file exists and the line is in range, or the symbol resolves within ±5 lines of the cited line
  • Unresolved: the file is not at the cited path (usually a basename-only reference where the agent dropped the directory)
  • Hallucinated: the file exists, but the line is past EOF. Outright fabrication.

Hallucinated is the hard signal. When the cited line doesn't exist in a file that does, the agent isn't paraphrasing. It's inventing.

The numbers across all five were stark:

| Tool | Citation grounding |
|---|---|
| sense | 89.2% |
| baseline | 80.8% |
| gitnexus | 76.9% |
| probe | 72.8% |
| serena | 61.9% |

That gap between 89% and 62% is the difference between trusting the answer and manually verifying every line reference. The LLM-as-judge alone would never have surfaced this. The judge reads the answer; it doesn't crawl the repo.

The naive regex extractor handles ~95% of cases at a fraction of the complexity of a real parser. It misses some edge cases (basename-only paths it correctly cannot resolve, symbols at unusual offsets), but it doesn't lie.
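For illustration, a minimal version of that extractor might look like the following. The regex, the function names, and the choice to score an answer with no citations as 1.0 are my simplifications, not the bench's exact logic, and the ±5-line symbol check and basename handling are omitted:

import re
from pathlib import Path

CITATION = re.compile(r"(?P<path>[\w./-]+\.\w+):(?P<line>\d+)")

def classify_citation(repo_root: Path, cite: re.Match) -> str:
    """Bucket a single file:line citation against the pinned checkout."""
    path = repo_root / cite["path"]
    if not path.is_file():
        return "unresolved"      # file not at the cited path (often basename-only)
    n_lines = sum(1 for _ in path.open(errors="ignore"))
    if int(cite["line"]) <= n_lines:
        return "grounded"        # line is in range
    return "hallucinated"        # file exists, line is past EOF

def grounding_rate(repo_root: Path, answer_text: str) -> float:
    cites = list(CITATION.finditer(answer_text))
    if not cites:
        return 1.0  # nothing cited, nothing to get wrong (a scoring choice, not a given)
    grounded = sum(classify_citation(repo_root, c) == "grounded" for c in cites)
    return grounded / len(cites)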


2. The answer_text vs tool_input split

This one almost killed the bench before it was real.

A keyword check asks: did the answer mention TopicCreator? Pre-fix, the scorer searched the entire transcript, including tool calls. So when the agent ran Grep("TopicCreator"), the keyword TopicCreator got a "hit" inside the grep invocation. Even if grep returned nothing.

That's not a measurement. That's a tax on tools that don't use grep. Sense, which uses semantic search instead of grep, would lose keyword points purely because its tool calls didn't contain English-language symbol names. Probe, baseline, anything grep-flavored would win keyword points just for grepping for the right string.

The fix sounds simple: keyword checks search the assistant's prose only. Tool inputs and results live in a separate audit_text field, available for diagnostics but never scored against.

# Before: scorer searched the whole transcript
hits = count_keyword_in(transcript_text, keyword)

# After: scorer searches assistant prose only
hits = count_keyword_in(answer_text, keyword)
# Tool calls live in audit_text, never scored
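Concretely, the split happens when the transcript is flattened for scoring. A minimal sketch, assuming a Claude-style message list with typed content blocks (the field names below are simplified, not the exact transcript schema):

def split_transcript(messages: list[dict]) -> tuple[str, str]:
    """Separate assistant prose (scored) from tool traffic (audited, never scored)."""
    answer_parts, audit_parts = [], []
    for msg in messages:
        for block in msg.get("content", []):
            if msg["role"] == "assistant" and block.get("type") == "text":
                answer_parts.append(block["text"])   # prose the keyword check may see
            elif block.get("type") in ("tool_use", "tool_result"):
                audit_parts.append(str(block))       # kept for diagnostics only
    return "\n".join(answer_parts), "\n".join(audit_parts)

answer_text, audit_text = split_transcript(transcript_messages)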

A bench cannot be honest if its first instinct rewards using the tools the bench's author dislikes.


3. Fairness vs adoption: two layers, never folded

Two questions that look similar are not the same question:

  • Did the developer get a better answer? That's fairness.
  • Did the agent fluently use the MCP tools available? That's adoption.

If you measure how often the agent uses an MCP server, Sense looks great. But baseline (no MCP attached) gets a structural zero on that axis. Through no fault of its own. The agent literally cannot call a server that isn't there.

So if "adoption" feeds into the headline score, the headline answers the second question, not the first. That's not a benchmark. That's an MCP-adoption survey.

The two-layer model:

fairness  = 0.10 * keyword_coverage
          + 0.55 * llm_quality
          + 0.15 * citation_grounding
          + 0.20 * efficiency

adoption  = 0.60 * tool_fluency
          + 0.40 * discoverability

Adoption is computed, reported, and never folded into fairness. It's there for code-intel-vs-code-intel comparisons only. The headline number is fairness, and baseline can beat any MCP server on it.
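In code, that layering is just two functions that never feed into each other. A sketch using the weights above (the metric field names are mine):

def fairness(m: dict) -> float:
    """Headline score. Every tool, including the no-MCP baseline, gets one."""
    return (0.10 * m["keyword_coverage"]
            + 0.55 * m["llm_quality"]
            + 0.15 * m["citation_grounding"]
            + 0.20 * m["efficiency"])

def adoption(m: dict) -> float | None:
    """Secondary, MCP-only score. Undefined for baseline rather than zero."""
    if not m["has_mcp_server"]:
        return None
    return 0.60 * m["tool_fluency"] + 0.40 * m["discoverability"]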

Two of the four MCP servers I benchmarked did score below baseline on fairness. They added friction without offsetting it with answer quality.

If adoption had been in the formula, that finding would have been invisible.


4. Held-out scenarios + Spearman correlation

The anti-Goodhart move.

A self-improving benchmark can drift away from human judgment. You tune the rubric, the rubric tunes the scores, the scores look better, and at some point you're optimizing the metric instead of the underlying capability. Goodhart's Law applies to your own measurement.

The defense is an anchor the loop cannot touch.

Three held-out scenarios, frozen. Hand-graded reference scores in gold.json. SHA256-pinned in a lockfile:

# bench/locked/held-out.lock
flask-blueprints:
  transcript_sha256: 7f3a...
  rubric_sha256: 91e2...
  gold_sha256: 8d4c...

The improvement loop refuses to start if any hash has drifted. Each iteration re-judges the frozen transcripts against the current rubric, then computes Spearman correlation between the current llm_quality and the gold scores.

Drop below 0.85, convergence fails, the loop must stop or be re-anchored.
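The gate itself is small. A sketch of the two checks, using scipy's spearmanr for the correlation; the file layout and function names approximate the lockfile shown above rather than reproduce the real code:

import hashlib
from pathlib import Path
from scipy.stats import spearmanr

def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def check_anchor(locked: dict, gold_scores: list[float], current_scores: list[float]) -> None:
    # 1. Refuse to run if any frozen held-out artifact has drifted.
    for name, expected in locked.items():
        if sha256(Path("bench/locked") / name) != expected:
            raise SystemExit(f"held-out artifact {name} has drifted; re-anchor before tuning")
    # 2. Refuse to converge if the judge no longer tracks human grading.
    rho, _ = spearmanr(gold_scores, current_scores)
    if rho < 0.85:
        raise SystemExit(f"Spearman vs gold = {rho:.2f} < 0.85; stop or re-anchor")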

It's the line between we tuned the bench and we tuned the bench until it agrees with our own grading. The smallest part of the codebase. Possibly the most important.


5. Judge variance: what's reliable, what isn't

You don't get to claim the LLM-as-judge is consistent. You measure it.

I ran the judge twice over the same 12 transcripts. Same prompt. Same model. Default sampling.

| Layer | Max stdev | Target | Verdict |
|---|---|---|---|
| Per-criterion (raw 0–1 scores) | 0.071 | <0.05 | Fails |
| Per-step (4-criterion weighted sum) | 0.048 | <0.05 | Passes |
| Per-scenario (mean of 4 steps) | 0.014 (max \|Δ\|) | <0.05 | Rock-solid |

The judge is jittery at the criterion level, especially on a fuzzy criterion like "uncertainty." It averages down at the step level. At the scenario level, the number that actually matters for ranking tools, it's stable enough.
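Measuring that is unglamorous: run the judge twice over the same transcripts, then compare the numbers at each level of aggregation. A sketch of the shape, with equal criterion weights standing in for the real rubric weights:

from statistics import mean, stdev

def weighted_step(criteria: dict) -> float:
    # 4-criterion weighted sum; equal weights here as a stand-in for the rubric's
    return mean(criteria.values())

def scenario_quality(steps: dict) -> float:
    # scenario score = mean of its step scores
    return mean(weighted_step(c) for c in steps.values())

def variance_report(run_a: dict, run_b: dict) -> dict:
    """Two judge passes, each shaped {scenario: {step: {criterion: score}}}."""
    crit = [stdev([run_a[s][t][c], run_b[s][t][c]])
            for s in run_a for t in run_a[s] for c in run_a[s][t]]
    step = [stdev([weighted_step(run_a[s][t]), weighted_step(run_b[s][t])])
            for s in run_a for t in run_a[s]]
    scen = [abs(scenario_quality(run_a[s]) - scenario_quality(run_b[s])) for s in run_a]
    return {"max_criterion_stdev": max(crit),
            "max_step_stdev": max(step),
            "max_scenario_abs_delta": max(scen)}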

So I use scenario_quality to rank. I don't gate decisions on a single criterion-level delta of 0.05 between two tools. That's inside the noise floor. Per-criterion rationales are commentary, not data.

The win here isn't proving the judge is perfect. It's knowing where it's reliable and where it isn't, so I know which numbers I can defend and which ones I shouldn't quote at all.


Reproduce it yourself

The bench is open. Adding a new MCP server to test is one shell script in bench/tools/.

git clone https://github.com/luuuc/sense
cd sense

# Run the full bench (5 tools × 6 repos = 30 sessions; sense + baseline alone is 12)
bash bench/bench.sh

# Or run a single (tool, repo) cell
bash bench/run.sh --tool sense --repo flask

Cost: ~$40 in Opus 4.7 tokens for the full 5×6 matrix. Time: ~20 minutes wall-clock. Every transcript ends up under bench/results/<tool>/<repo>/ with transcript.json, scored.json, judged.json, and run_meta.json pinned to the commit the agent worked against.

To add a new MCP server:

# 1. Drop a config script
cat > bench/tools/yourtool.sh <<'EOF'
#!/bin/bash
# MCP server invocation for yourtool
exec your-mcp-binary --your-flags
EOF
chmod +x bench/tools/yourtool.sh

# 2. Run it
bash bench/run.sh --tool yourtool

If you find a methodology hole, the bench is open and replayable. I'll patch what's real.


What the bench does not claim

  • Not a benchmark of AI model quality. The model is fixed.
  • Not a real-world end-to-end task benchmark. Each scenario is bounded, scripted, rubric-anchored.
  • Not a cost benchmark in production terms. Cost is computed from public API pricing for comparability, not from anyone's actual invoice.
  • Not a measure of MCP-tool fluency in isolation. Adoption is a secondary signal, not the headline.

It is, narrowly, this: given a six-scenario, four-step exploration script across six real repos, how each tool affects the agent's answer quality, citation grounding, and efficiency.

A smaller claim than "this is the best code-intel MCP." Also a more useful one.


PS: The hardest part wasn't building the tool. It was building a benchmark I'd still trust after it favored me. Most benchmarks die at that step.

