You are building a Skill, an MCP server, or a custom prompt strategy that is supposed to make an AI coding assistant better at a specific job. You make a change. The next session feels smoother. The agent seems to reach for the right context at the right time.
But how do you know?
That question came up in two parallel problems.
I was building and iterating on MCP servers to support a coding agent. New tool, new tool definition, new prompting strategy. Each change felt like an improvement. Sessions seemed smoother. But I had no numbers. I had vibes.
A colleague was working on the same problem from the other side: he was building and refining AI coding Skills -- structured prompt packs that teach the agent how to work in a specific context. Same issue. A lot of iteration, a lot of gut feel, no hard signal on whether the changes were actually moving the needle.
We joined forces and built something to fix this. The result is Pitlane -- named after the place in motorsport where engineers swap parts, adjust the setup, check the telemetry, and find out if the next lap is faster.
The problem with vibes
When you change an MCP server or a Skill, you are changing something about the environment the agent operates in. The agent gets different tools, different context, different instructions.
Those changes can have real effects: pass rates on tasks go up or down, the agent takes fewer wrong turns, token costs change, time to completion changes, output quality improves or degrades.
Without measurement, you cannot tell which of those things happened. You cannot tell whether the last commit was an improvement or a regression. You cannot tell whether version 3 of your Skill is better than version 1.
You end up making decisions based on a handful of memorable sessions, which is not a reliable signal. Good sessions feel good. Bad sessions get rationalised. The data you are implicitly collecting is not representative.
What you actually need
You need to be able to answer a specific, repeatable question:
With my Skill or MCP present, does the agent complete this task better than without it?
That question has a structure: a defined task with explicit success criteria, two configurations (baseline without your changes and challenger with them), deterministic assertions that verify success independently of the agent's own judgement, and a way to compare results across runs.
That structure is an eval. Not a generic language model benchmark. A benchmark for your specific Skill or MCP server, in your context, on tasks that actually matter to you.
What Pitlane is
Pitlane is an open source command-line tool for running those evals. You define tasks in YAML, configure a baseline and one or more challengers, and race them against each other. The results tell you -- with numbers rather than impressions -- whether your work is paying off.
The loop is simple: tune, race, check the telemetry, repeat.
The assertions are deterministic. File existence checks, command exit codes, pattern matching -- either the file is there and valid or it is not. No LLM-as-judge, no subjectivity baked into the measurement. When you need fuzzy matching for documentation or generated content, similarity metrics (ROUGE, BLEU, BERTScore, cosine similarity) are available with configurable thresholds. These are deterministic numeric metrics, not a second model grading your output.
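To make "deterministic numeric metric" concrete: a ROUGE-1-style recall check can be computed with no model in the loop. This is a minimal sketch of the idea, not Pitlane's implementation, and the function names are made up:

```python
# Illustrative sketch (not Pitlane's code): a deterministic
# ROUGE-1-style recall score with a configurable threshold.
def rouge1_recall(reference: str, candidate: str) -> float:
    """Fraction of reference unigrams that appear in the candidate."""
    ref_tokens = reference.lower().split()
    cand_tokens = set(candidate.lower().split())
    if not ref_tokens:
        return 1.0
    hits = sum(1 for tok in ref_tokens if tok in cand_tokens)
    return hits / len(ref_tokens)

def similarity_passes(reference: str, candidate: str, threshold: float = 0.7) -> bool:
    # Same inputs always produce the same score -- no second model grading.
    return rouge1_recall(reference, candidate) >= threshold
```

Run it twice on the same strings and you get the same number twice, which is exactly the property an LLM-as-judge cannot guarantee.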
Because agent outputs are non-deterministic, Pitlane supports repeated runs with aggregated statistics -- average, minimum, maximum, and standard deviation across runs. A Skill that reliably pushes a hard task from a 50% to a 70% pass rate is a meaningful result, especially when that task used to fail half the time in CI. A Skill that appears to do that in a single run might just be variance.
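The aggregation itself is plain arithmetic. A sketch of what per-task statistics over repeated runs look like (illustrative, not Pitlane's internals):

```python
# Sketch of repeated-run aggregation over one task (illustrative only).
from statistics import mean, stdev

def aggregate(pass_flags: list[bool]) -> dict:
    """Summarise pass/fail results over N runs of the same task."""
    rates = [1.0 if p else 0.0 for p in pass_flags]
    return {
        "runs": len(rates),
        "pass_rate": mean(rates),
        "min": min(rates),
        "max": max(rates),
        # Sample stdev needs at least two runs; one run has no spread.
        "stddev": stdev(rates) if len(rates) > 1 else 0.0,
    }
```

A challenger that passes 7 of 10 runs against a baseline that passes 5 of 10 is a signal; the stddev tells you how much weight to put on it.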
The tool tracks pass rates alongside cost, time, and token usage. A Skill that improves pass rate by 5% while tripling cost is a different trade-off than one that hits the same improvement at the same cost. Both columns appear in the HTML report so you can see the full picture.
Pitlane currently supports Claude Code, Mistral Vibe, OpenCode, and IBM Bob.
Why not use an existing eval tool?
There are good, widely-used tools in this space. promptfoo, Braintrust, LangSmith, DeepEval, and others all solve real problems. The question is whether they solve this problem without requiring you to build the scaffolding yourself.
Take promptfoo as a representative example -- it is mature, well-documented, and genuinely extensible. It runs real agent sessions via its Claude Agent SDK and Codex SDK providers. The agent actually executes. Files actually get written. So far, so good.
The gap shows up in the assertion layer. Promptfoo's built-in assertions are primarily oriented around validating the agent's returned text. In their coding-agent guide, one of the example verification patterns is a JavaScript assertion that parses the agent's final text for keywords like "passed" or "success":
const text = String(output).toLowerCase();
const passed = text.includes('passed') || text.includes('success');
That assertion passes when the agent says the tests passed. It does not verify that the tests actually passed. A model that narrates success while producing broken code passes. A model that silently produces correct code with a terse "done" might not. That is fine for some workflows. It is not the same as asserting on the produced artifacts as first-class primitives.
Promptfoo's JavaScript assertion API is powerful enough to do better -- you can call require('fs') and require('child_process') and wire up real filesystem checks yourself. But you are writing boilerplate from scratch for every benchmark, managing your own working directory scoping, and handling fixture isolation manually. Their documentation acknowledges the gap directly:
"The agent's output is its final text response describing what it did, not the file contents. For file-level verification, read the files after the eval or enable tracing."
"Read the files after the eval" is a step outside the pipeline. That is what building Pitlane felt like when approached from that direction -- assembling scaffolding that should have been there already.
In Pitlane, command_succeeds: "terraform validate" or command_succeeds: "pytest" is a first-class primitive. One line. Every task gets a clean fixture copy automatically. The difference is not what is theoretically possible -- it is what is built in versus what you have to construct.
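A task definition along these lines might look like the following sketch. Only `command_succeeds` is taken from the text above; every other field name is an assumption rather than Pitlane's documented schema, so check the repository docs for the real format:

```yaml
# Illustrative task sketch -- field names other than command_succeeds
# are assumptions, not Pitlane's actual schema.
task: add-retry-logic
prompt: "Add retry with exponential backoff to the HTTP client."
fixture: fixtures/http-client/        # copied fresh for every run
assertions:
  - file_exists: "src/retry.py"
  - command_succeeds: "pytest"        # exit code 0, not the agent's say-so
    weight: 3                         # the critical assertion counts 3x
```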
Benchmarks that don't lie to you
Measurement helps, but measurement can also mislead. Three failure modes are worth keeping in mind.
Gaming your own benchmark. When a metric becomes a target, behaviour adjusts to hit the target rather than the underlying goal. The baseline/challenger structure is the first defence -- you are not asking "does this pass" in isolation, you are asking "does this beat the baseline." The second defence is to include tasks your Skill was not specifically designed for. If adjacent tasks regress when your target tasks improve, you have a problem.
Pass rate is a goal metric, not the whole picture. Pass rate tells you whether the output was correct, not what it cost to get there; since tokens, cost, and time are tracked alongside it, check both before deciding whether a change was worth shipping. A Skill that takes a task from a 60% to an 80% pass rate while doubling token cost is a different trade-off from one that achieves the same improvement at the same cost. The weighted score is also distinct from the binary pass rate -- a task where the critical assertion is weighted 3x tells a different story than a flat count.
Your context is not someone else's context. A generic benchmark tells you how an assistant performs on generic tasks. The meaningful signal comes from tasks you write yourself, against fixture directories that reflect your actual project structure, with assertions that match what "done" means in your specific context. Borrowing a benchmark wholesale and optimising against it is still measuring someone else's problem.
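The distinction between a flat count and a weighted score mentioned above can be made concrete with a few lines of arithmetic (illustrative, not Pitlane's internals):

```python
# Sketch: flat pass count vs weighted score (illustrative arithmetic).
def weighted_score(results: list[tuple[bool, float]]) -> float:
    """results: (passed, weight) per assertion. Returns a score in [0, 1]."""
    total = sum(w for _, w in results)
    earned = sum(w for passed, w in results if passed)
    return earned / total if total else 0.0

# Four assertions; the one that fails is the critical one.
flat = weighted_score([(True, 1), (True, 1), (True, 1), (False, 1)])
tripled = weighted_score([(True, 1), (True, 1), (True, 1), (False, 3)])
```

With equal weights the task scores 0.75 despite failing its most important check; weighting that check 3x drops the score to 0.5, which is a more honest summary of what happened.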
What this changes
The question "is this actually better" becomes answerable.
When you add a new tool to an MCP server, you can benchmark before and after and see whether the task that motivated the tool now passes more reliably. When you tighten a prompt in a Skill, you can see whether that tightening broke anything on tasks that previously passed.
Without measurement, every change is a vibe. With measurement, you have a signal. The signal is not perfect. Benchmarks can be gamed. Task sets can be incomplete. Improvements on a small task set may not generalise. But noisy measurement beats no measurement. You can improve your task set over time. You cannot improve intuition alone.
The lap times do not lie.
Try it
Pitlane is open source, takes a few minutes to set up, and is documented at the repository:
https://github.com/pitlane-ai/pitlane
If you are building MCP servers or AI coding Skills and you want hard numbers instead of gut feel, this is the tool. We built it because we needed it, and we would rather see more people measuring than guessing.
If you find a gap, open an issue. If you add support for a new assistant or improve an existing one, send a PR. The codebase is Python, the architecture is straightforward, and contributions are welcome.