Benchmark-Driven Development: let agents build the harness you never had time for

#ai #performance #testing #architecture

Most teams ship on two signals: does it compile, and do the tests pass. Both are correctness signals. Neither tells you whether the thing is fast, whether it got slower this week, or whether the output is actually good against some ground truth. We close that gap with a third signal, and we let it drive the work. We call it benchmark-driven development.

The loop is simple:

Write the plan and the spec first. Decide what the code should do before writing it.
Use TDD to drive it to correctness. Red, green, refactor, the usual.
Test it for real with proper benchmarks, against ground truth, with profiling.

Correctness and performance are both requirements. So both get measured, on every change, not once at the end when a customer complains.

Why benchmarks used to be a luxury

Everyone agrees benchmarks are good. Almost no one builds a serious one. The reason is cost. A real harness is its own project: a fixture corpus with trustworthy ground truth, scoring that reflects quality and not just a checksum, resource sampling, a way to compare across implementations, and CI plumbing to run it all and publish results. That is weeks of work that ships no features. So it gets cut, and teams fall back to a stopwatch around a hot loop and a vibe about whether output looks right.

This is the part that changed. With an agent in the loop, the harness no longer costs a quarter's worth of engineering budget. It takes a few focused sessions. You describe the corpus, the metrics, and the comparison you want, and you iterate on the harness the same way you iterate on a feature. The economics flipped. The thing that was too expensive to justify is now the cheap part.

A concrete harness

To make this concrete, here is the benchmark harness we run for xberg, our document-extraction engine. It is a Rust CLI, and it does the following.

It compares 13 language bindings of our own engine against 7 external reference frameworks, across a corpus of 318 fixtures that carry markdown ground truth. The corpus spans 17 formats, including PDF, HTML, DOCX, ODT, RTF, XLSX, CSV, EPUB, and PPTX.

It scores quality two ways, because text and structure fail differently:

Text F1 (TF1): token-level bag-of-words F1 between extracted text and ground truth, with a separate numeric-token score for number-heavy documents like financial and scientific PDFs.
Structural F1 (SF1): block-level matching between extracted markdown and ground-truth markdown. Headings, code blocks, formulas, tables, list items, and images are weighted by how much they matter, matched greedily with fuzzy cross-type compatibility, and scored for ordering with a longest-increasing-subsequence pass.
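To make the text metric concrete, here is a minimal sketch of a token-level bag-of-words F1. It is an illustration, not the harness's actual code: the whitespace tokenizer, the lowercasing, the handling of empty inputs, and the names token_counts and token_f1 are all assumptions.

```rust
use std::collections::HashMap;

/// Count tokens in `text`. Assumption: tokens are whitespace-separated
/// and compared case-insensitively; a real tokenizer may differ.
fn token_counts(text: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for token in text.split_whitespace() {
        *counts.entry(token.to_lowercase()).or_insert(0) += 1;
    }
    counts
}

/// Bag-of-words F1 between extracted text and ground truth.
fn token_f1(extracted: &str, truth: &str) -> f64 {
    let ext = token_counts(extracted);
    let gt = token_counts(truth);
    let ext_total: usize = ext.values().sum();
    let gt_total: usize = gt.values().sum();

    // Assumption: two empty documents count as a perfect match.
    if ext_total == 0 && gt_total == 0 {
        return 1.0;
    }
    if ext_total == 0 || gt_total == 0 {
        return 0.0;
    }

    // Overlap is the multiset intersection: for each token, the smaller
    // of its count in the extraction and its count in the ground truth.
    let overlap: usize = ext
        .iter()
        .map(|(token, &n)| n.min(gt.get(token).copied().unwrap_or(0)))
        .sum();
    if overlap == 0 {
        return 0.0;
    }

    let precision = overlap as f64 / ext_total as f64;
    let recall = overlap as f64 / gt_total as f64;
    2.0 * precision * recall / (precision + recall)
}
```

Because the comparison is a bag of words, token_f1("c b a", "a b c") scores the same as a perfectly ordered extraction. That blind spot is why a separate structural score is worth having.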

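When markdown ground truth is present, it combines them into a single score, where f1_text is TF1, f1_numeric is the numeric-token score, and f1_layout is SF1: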

quality_score = 0.5 * f1_text + 0.2 * f1_numeric + 0.3 * f1_layout
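As a worked instance of the formula, the sketch below applies those weights to one set of scores. The weights come from the formula above; the function and constant names are hypothetical, and the sample scores are made up for illustration.

```rust
// Weights taken from the formula above; they sum to 1.0.
const W_TEXT: f64 = 0.5;
const W_NUMERIC: f64 = 0.2;
const W_LAYOUT: f64 = 0.3;

/// Weighted combination of the three F1 scores, each expected in [0, 1].
fn quality_score(f1_text: f64, f1_numeric: f64, f1_layout: f64) -> f64 {
    W_TEXT * f1_text + W_NUMERIC * f1_numeric + W_LAYOUT * f1_layout
}
```

With illustrative inputs of 0.9 for text, 1.0 for numeric tokens, and 0.4 for layout, the result is 0.45 + 0.20 + 0.12 = 0.77. Strong text overlap cannot fully mask a weak structural score, because layout carries 30% of the weight.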

It samples CPU and memory during each run, and behind a feature flag it generates flamegraphs so a regression in latency points you straight at the function that caused it. It runs each binding in both single-file mode for fair latency and batch mode for throughput. And it runs as a CI matrix: build, validate ground truth, fan out one job per binding, gate, run the external frameworks, then consolidate everything into percentiles and publish the aggregate as a release artifact.
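The consolidation step reduces many per-run measurements to percentiles. Here is a minimal sketch of that reduction, assuming the nearest-rank definition. The post does not say which percentile method the harness uses, and the function name is hypothetical.

```rust
/// Nearest-rank percentile of `samples`, with `p` in (0, 100].
/// Returns None for an empty slice or an out-of-range `p`.
fn percentile(samples: &[f64], p: f64) -> Option<f64> {
    if samples.is_empty() || !(p > 0.0 && p <= 100.0) {
        return None;
    }
    let mut sorted = samples.to_vec();
    // total_cmp gives a total order over f64, so sorting cannot panic on NaN.
    sorted.sort_by(|a, b| a.total_cmp(b));

    // Nearest rank: the smallest sample such that at least p% of all
    // samples are less than or equal to it (1-based rank).
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}
```

Nearest-rank always returns a value that was actually observed, which keeps a reported p95 latency traceable to a specific run. Interpolating definitions are equally valid. What matters is using the same one across bindings and across history.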

None of that is exotic. All of it used to be too much work to justify for anything but a flagship. That is the point.

What "real benchmarks" actually means

A benchmark you can drive development with has to clear a higher bar than a stopwatch around a hot loop. Four properties matter.

Ground truth you trust. A number is only as good as what it is measured against. We keep ground truth in version control, validate its integrity on every CI run, and clean known artifacts so the target does not drift silently. If you cannot defend the reference, you cannot defend the score.
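One way to validate integrity on every run is to compare each ground-truth file against a recorded hash. The sketch below assumes that approach; the post does not describe how the real harness does it. The manifest shape and the function names are hypothetical, and FNV-1a stands in for whatever digest you would use in practice.

```rust
use std::collections::BTreeMap;

/// FNV-1a 64-bit hash. A simple stand-in for a real content digest.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Check every fixture named in `manifest` against the loaded documents.
/// Returns one message per fixture that is missing, empty, or changed.
fn validate_ground_truth(
    manifest: &BTreeMap<String, u64>,
    documents: &BTreeMap<String, String>,
) -> Vec<String> {
    let mut problems = Vec::new();
    for (name, &expected) in manifest {
        match documents.get(name) {
            None => problems.push(format!("{name}: missing")),
            Some(text) if text.trim().is_empty() => {
                problems.push(format!("{name}: empty"))
            }
            Some(text) if fnv1a64(text.as_bytes()) != expected => {
                problems.push(format!("{name}: hash mismatch"))
            }
            Some(_) => {}
        }
    }
    problems
}
```

A non-empty result would fail the CI step before any scoring runs. An intentional ground-truth edit then shows up as a reviewed manifest change instead of a silent shift in the target.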

Metrics that reflect the goal. "Bytes match" is not quality for document extraction. A table flattened into a paragraph can have high text overlap and be useless. That is why structure gets its own score and its own weights. Pick metrics that punish the failures you actually care about.
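Ordering is one failure that token overlap cannot see, which is where the longest-increasing-subsequence pass mentioned earlier comes in. The sketch below is one way such a pass could work. It assumes each extracted block has already been matched to a ground-truth block index, and it normalizes by the number of matched blocks. Both the normalization and the function names are assumptions, not the harness's actual scoring.

```rust
/// Length of the longest strictly increasing subsequence, in O(n log n).
fn lis_len(seq: &[usize]) -> usize {
    // tails[k] holds the smallest possible last element of an
    // increasing subsequence of length k + 1 seen so far.
    let mut tails: Vec<usize> = Vec::new();
    for &x in seq {
        // First position whose tail is >= x (tails is always sorted).
        let pos = tails.partition_point(|&t| t < x);
        if pos == tails.len() {
            tails.push(x);
        } else {
            tails[pos] = x;
        }
    }
    tails.len()
}

/// Fraction of matched blocks that appear in ground-truth order.
/// `matched_positions[i]` is the ground-truth index of the i-th
/// extracted block. Assumption: no matched blocks scores 1.0.
fn ordering_score(matched_positions: &[usize]) -> f64 {
    if matched_positions.is_empty() {
        return 1.0;
    }
    lis_len(matched_positions) as f64 / matched_positions.len() as f64
}
```

An extraction that emits blocks in the order [0, 2, 1, 3] keeps three of four blocks in sequence and scores 0.75. A fully reversed document scores 0.25, even though its bag-of-words overlap is unchanged.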

Resource truth, not just wall-clock. Throughput hides memory blowups. We sample CPU and memory and keep flamegraphs around so "it got slower" turns into "this function got slower."

Comparison and regression gates. A score in isolation is trivia. The harness compares pipelines and implementations against each other and against history, and it can fail the build on a quality regression. That is what makes it a development driver rather than a report.
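A regression gate can be very small. The sketch below compares a current run against a stored baseline and returns an error the CI job can turn into a failed build. The post only states that the harness can fail on a quality regression, so the latency check, the tolerance values, and all type and function names here are assumptions.

```rust
/// Summary metrics for one pipeline run.
#[derive(Debug, Clone, Copy)]
struct Metrics {
    quality: f64,        // higher is better, in [0, 1]
    p95_latency_ms: f64, // lower is better
}

/// Hypothetical tolerances; tune them to the noise level of your runs.
struct Tolerance {
    max_quality_drop: f64,     // absolute drop allowed, e.g. 0.02
    max_latency_increase: f64, // relative increase allowed, e.g. 0.10 = 10%
}

/// Returns Ok(()) if `current` is within tolerance of `baseline`,
/// otherwise one message per metric that regressed.
fn gate(baseline: Metrics, current: Metrics, tol: &Tolerance) -> Result<(), Vec<String>> {
    let mut failures = Vec::new();

    if baseline.quality - current.quality > tol.max_quality_drop {
        failures.push(format!(
            "quality regressed: {:.3} -> {:.3}",
            baseline.quality, current.quality
        ));
    }

    let latency_limit = baseline.p95_latency_ms * (1.0 + tol.max_latency_increase);
    if current.p95_latency_ms > latency_limit {
        failures.push(format!(
            "p95 latency regressed: {:.1} ms -> {:.1} ms",
            baseline.p95_latency_ms, current.p95_latency_ms
        ));
    }

    if failures.is_empty() {
        Ok(())
    } else {
        Err(failures)
    }
}
```

The tolerances exist because benchmark runs are noisy. A gate that trips on every small fluctuation gets disabled, so it is usually better to start loose and tighten once the metric has proven stable.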

How to adopt it without a big-bang rewrite

You do not need 318 fixtures on day one.

Start with ten fixtures and one honest metric. Ten documents you understand beat a thousand you cannot defend.
Write the spec, then the failing test, then make it pass. Keep TDD for correctness. The benchmark is the layer on top, not a replacement.
Add a benchmark the first time you guess about performance. The moment you say "this is probably fine," measure it instead.
Put it in CI and gate on regressions once the metric is stable. A benchmark nobody runs is documentation.
Lean on an agent for the unglamorous parts: fixture wrangling, scoring code, CI matrix, percentile aggregation. This is exactly the work that used to make a harness too expensive, and it is exactly the work an agent is good at.
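The "ten fixtures and one honest metric" starting point fits in a few dozen lines. Here is a minimal sketch, assuming fixtures are held in memory and that the extractor and the metric are passed in as plain functions. Every name in it is hypothetical.

```rust
use std::time::{Duration, Instant};

/// One benchmark case: an input document and its trusted reference output.
struct Fixture<'a> {
    name: &'a str,
    input: &'a str,
    truth: &'a str,
}

/// Outcome of running one fixture.
struct RunResult {
    name: String,
    score: f64,
    elapsed: Duration,
}

/// Run `extract` on every fixture, timing only the extraction call,
/// then score its output against ground truth with `metric`.
fn run_benchmark<E, M>(fixtures: &[Fixture<'_>], extract: E, metric: M) -> Vec<RunResult>
where
    E: Fn(&str) -> String,
    M: Fn(&str, &str) -> f64,
{
    fixtures
        .iter()
        .map(|fixture| {
            let start = Instant::now();
            let output = extract(fixture.input);
            let elapsed = start.elapsed();
            RunResult {
                name: fixture.name.to_string(),
                score: metric(&output, fixture.truth),
                elapsed,
            }
        })
        .collect()
}

/// Mean score across results, or None if nothing was run.
fn mean_score(results: &[RunResult]) -> Option<f64> {
    if results.is_empty() {
        return None;
    }
    let total: f64 = results.iter().map(|r| r.score).sum();
    Some(total / results.len() as f64)
}
```

Keeping the extractor and the metric as parameters means the same loop can compare two implementations or two metrics on the same fixtures. Growing the harness later is then mostly a matter of adding fixtures and swapping in better scoring.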

The payoff

When correctness and speed are both measured on every change, you stop guessing and you stop arguing from intuition. A refactor either held the line on quality and latency or it did not, and the harness tells you which before the diff merges. Spec-driven, TDD for correctness, real benchmarks for the rest. Build the harness you never had time for, because now you do.

Top comments (1)

Raju Dandigam • Jul 1

This is a strong use case for agents: not just writing product code, but building the measurement harness teams usually postpone. I like the three-signal model: compile, tests, and benchmarks against real ground truth. For AI-assisted development, that third signal becomes even more important because generated code can be correct in a narrow test sense but worse in latency, cost, quality, or reliability. A good next step would be connecting benchmark results back into agent traces so the agent sees not only what failed, but why the change degraded the system.