benchstat in Go: Comparing Benchmarks Without Fooling Yourself

#go #testing #performance

Me: xgabriel.com | GitHub

You change one function. You run the benchmark. It comes back 12%
faster. You paste the number into the PR description and hit merge.

Then a teammate runs the same benchmark on their laptop and it's 4%
slower than before. Nobody changed the code. Both of you ran
go test -bench. Both of you read a real number off a real terminal.
One of you is wrong, and neither of you can tell which.

This is the trap with Go benchmarks: a single run gives you a number,
and a number feels like a fact. It isn't. It's one sample from a noisy
distribution, and the noise on a normal developer machine is often
larger than the change you're trying to measure. benchstat is the
tool that tells you whether the difference you're looking at is real.

One run proves nothing

A Go benchmark measures wall-clock time on a machine that is doing a
hundred other things. CPU frequency scaling moves your clock speed up
and down. Turbo boost kicks in for a few hundred milliseconds and then
backs off when the chip heats up. The garbage collector runs when it
runs. Your editor indexes a file. A Slack notification lands.

None of that is in your code, and all of it lands in your benchmark
number. Run the same unchanged benchmark ten times and you'll see a
spread. If that spread is ±5% and your optimization is worth 3%, a
single before/after comparison cannot separate the signal from the
jitter. You need repeated samples and a statistical test that reports
its own confidence.

That's the whole job of benchstat: take many samples of "before" and
many of "after", and tell you whether the difference is bigger than the
noise.
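To make the spread-versus-signal argument concrete, here is a minimal sketch that measures how far ten runs of unchanged code stray from their median and compares that with a claimed gain. The ns/op readings and the 3% figure are invented for illustration, and the helper names median and spreadPct are hypothetical, not part of any Go tooling.

```go
package main

import (
	"fmt"
	"sort"
)

// median returns the middle value of xs (the mean of the two middle
// values for an even count). It assumes xs is non-empty.
func median(xs []float64) float64 {
	s := append([]float64(nil), xs...) // copy so the caller's slice is untouched
	sort.Float64s(s)
	mid := len(s) / 2
	if len(s)%2 == 1 {
		return s[mid]
	}
	return (s[mid-1] + s[mid]) / 2
}

// spreadPct returns the largest distance from the median to any sample,
// as a percentage of the median.
func spreadPct(xs []float64) float64 {
	med := median(xs)
	worst := 0.0
	for _, x := range xs {
		d := x - med
		if d < 0 {
			d = -d
		}
		if d > worst {
			worst = d
		}
	}
	return 100 * worst / med
}

func main() {
	// Invented ns/op readings from ten runs of the same, unchanged code.
	runs := []float64{1280, 1310, 1250, 1340, 1225, 1290, 1265, 1330, 1240, 1300}
	const claimedGainPct = 3.0 // hypothetical size of the optimization

	spread := spreadPct(runs)
	fmt.Printf("median %.0f ns/op, spread ±%.1f%%\n", median(runs), spread)
	fmt.Println("gain larger than run-to-run spread:", claimedGainPct > spread)
}
```

When the last line prints false, any single before/after pair could land on either side of zero, which is exactly the argument from the top of the post.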

Write a benchmark that measures the right thing

Before you compare anything, the benchmark has to be honest. Here's a
standard one:

func BenchmarkEncode(b *testing.B) {
    payload := makePayload()
    b.ReportAllocs()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _ = Encode(payload)
    }
}

Two things matter here. b.ReportAllocs() adds allocation columns so
you compare memory too, not just time. b.ResetTimer() throws away the
setup cost of makePayload() so it doesn't pollute the measurement.

On Go 1.24+ you can drop the manual loop and use b.Loop(), which keeps
the compiler from optimizing away the loop body and handles the timer
reset for you:

func BenchmarkEncode(b *testing.B) {
    payload := makePayload()
    b.ReportAllocs()
    for b.Loop() {
        _ = Encode(payload)
    }
}

If Encode has no observable side effect and you throw its result
away, the compiler is allowed to delete the call entirely. Then you're
benchmarking an empty loop and celebrating a 100% speedup. Assign the
result to a package-level sink, or feed it somewhere the compiler can't
prove is dead. b.Loop() guards against this better than the old
b.N loop, but it's still your job not to measure nothing.
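The package-level sink mentioned above could look like the sketch below. Encode and makePayload are hypothetical stand-ins here, since the real ones aren't shown, and the benchmark is driven through testing.Benchmark only so the example runs as a plain program; in a real project the function would live in a _test.go file and be named BenchmarkEncode.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// Encode is a hypothetical stand-in for the function under test.
func Encode(payload []string) string {
	return strings.Join(payload, ",")
}

// makePayload is a hypothetical stand-in for the setup helper.
func makePayload() []string {
	return []string{"a", "b", "c"}
}

// sink is package-level, so the compiler cannot prove the result of
// Encode is unused and delete the call.
var sink string

func benchmarkEncode(b *testing.B) {
	payload := makePayload()
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sink = Encode(payload) // store the result instead of discarding it
	}
}

func main() {
	// testing.Benchmark runs the function outside of "go test".
	res := testing.Benchmark(benchmarkEncode)
	fmt.Println(res.String(), res.MemString())
}
```

The only change from the earlier loop is `sink = Encode(payload)` in place of `_ = Encode(payload)`.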

Run it with -count

The one flag that turns a benchmark into data you can trust is
-count. It reruns each benchmark N times so benchstat has a sample
to work with.

go test -bench=BenchmarkEncode -count=10 > old.txt

Then make your change, and capture the new numbers into a second file:

go test -bench=BenchmarkEncode -count=10 > new.txt

Ten runs is a reasonable floor. With fewer than six, benchstat warns
that it cannot compute a 95% confidence interval, and the p-value gets
shaky. Keep the machine quiet while these run: close the browser, stop
the file watcher, plug in the laptop so it isn't throttling on battery.
The goal is for the only difference between old.txt and new.txt to
be your code, not your background noise.
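One way to see why small counts get shaky is to look at the rank-based test benchstat uses (Mann-Whitney U, covered further down): for n samples per side with no ties, the smallest two-sided p-value an exact test can produce is 2 / C(2n, n), reached only when the two samples don't overlap at all. The sketch below computes that floor; minTwoSidedP is a hypothetical helper name, and this is the textbook formula, not code taken from benchstat.

```go
package main

import "fmt"

// minTwoSidedP returns the smallest two-sided p-value an exact
// Mann-Whitney U test can report for n samples per side with no ties:
// 2 / C(2n, n).
func minTwoSidedP(n int) float64 {
	arrangements := 1.0 // builds C(2n, n) one factor at a time
	for i := 1; i <= n; i++ {
		arrangements = arrangements * float64(n+i) / float64(i)
	}
	p := 2 / arrangements
	if p > 1 {
		p = 1
	}
	return p
}

func main() {
	for _, n := range []int{3, 4, 6, 10} {
		fmt.Printf("count=%d  smallest possible p=%.5f\n", n, minTwoSidedP(n))
	}
}
```

With three samples per side the floor sits above 0.05, so no result can ever be called significant, however large the speedup.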

Keep the benchmark names identical between the two files. benchstat
matches rows by name, so a rename means it can't pair them up.
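A quick pre-flight check for renamed benchmarks can be sketched as follows. It only pulls the first field from lines that look like go test -bench result lines and reports names in the old output with no partner in the new one. The sample lines are invented, benchName and unpaired are hypothetical helpers, and real benchstat does considerably more than this.

```go
package main

import (
	"fmt"
	"strings"
)

// benchName extracts the benchmark name from one result line such as
// "BenchmarkEncode-8   934523   1280 ns/op". It reports false for
// lines that are not benchmark results.
func benchName(line string) (string, bool) {
	fields := strings.Fields(line)
	if len(fields) < 4 || !strings.HasPrefix(fields[0], "Benchmark") {
		return "", false
	}
	return fields[0], true
}

// unpaired lists names that appear in oldOut but never in newOut.
func unpaired(oldOut, newOut string) []string {
	inNew := map[string]bool{}
	for _, line := range strings.Split(newOut, "\n") {
		if name, ok := benchName(line); ok {
			inNew[name] = true
		}
	}
	var missing []string
	seen := map[string]bool{}
	for _, line := range strings.Split(oldOut, "\n") {
		if name, ok := benchName(line); ok && !inNew[name] && !seen[name] {
			seen[name] = true
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Invented sample lines; the second run renamed the benchmark.
	oldOut := "BenchmarkEncode-8   934523   1280 ns/op\n"
	newOut := "BenchmarkEncodeFast-8   1000000   1030 ns/op\n"
	fmt.Println("no partner in new output:", unpaired(oldOut, newOut))
}
```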

benchstat A/B

Install it once from the Go performance tools module:

go install golang.org/x/perf/cmd/benchstat@latest

Then hand it both files, old first:

benchstat old.txt new.txt

You get a table that looks roughly like this:

              │   old.txt   │              new.txt               │
              │   sec/op    │   sec/op     vs base               │
Encode-8        1.28µ ± 2%    1.03µ ± 1%   -19.5% (p=0.000 n=10)

Read it left to right. The old.txt column shows the baseline: median
time per operation, with ± 2% giving the 95% confidence interval around
that median as a percentage, so a smaller number means the ten samples
agreed more closely. The new.txt column shows the same for your change.
The vs base column is the part that matters: the percentage delta and,
in parentheses, the p-value and the sample count.

Add -benchmem at test time and you get B/op and allocs/op blocks
below the time block, each with their own delta and p-value. A change
that's faster but allocates more is a trade, and this is where you see
it.

Reading the delta and the p-value

The delta is the easy half: -19.5% means the new code's median time per
operation is about a fifth lower. The p-value is the half people skip,
and it's the half that stops you fooling yourself.

benchstat compares the two samples with the Mann-Whitney U test, a
rank-based test that doesn't assume your timings are normally
distributed. The p-value answers one question: if the two versions were
actually the same speed, how likely is a difference this large from
noise alone? A small p-value means "noise almost certainly didn't do
this." The default cutoff is 0.05.
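To show what a rank-based test actually computes, here is a minimal sketch of an exact two-sided Mann-Whitney U test. It assumes no tied values and small samples, uses the textbook counting recurrence, and is meant as an illustration of the idea rather than a copy of benchstat's implementation. The names arrangements and mannWhitneyP and the sample values are all hypothetical.

```go
package main

import "fmt"

// arrangements counts the orderings of n "new" and m "old" samples whose
// U statistic equals u. Recurrence: the largest value is either a "new"
// sample (it beats all m old ones) or an "old" sample (it adds nothing).
func arrangements(n, m, u int, memo map[[3]int]float64) float64 {
	if u < 0 {
		return 0
	}
	if n == 0 || m == 0 {
		if u == 0 {
			return 1
		}
		return 0
	}
	key := [3]int{n, m, u}
	if v, ok := memo[key]; ok {
		return v
	}
	v := arrangements(n-1, m, u-m, memo) + arrangements(n, m-1, u, memo)
	memo[key] = v
	return v
}

// mannWhitneyP returns the exact two-sided p-value for samples x and y,
// assuming no tied values.
func mannWhitneyP(x, y []float64) float64 {
	n, m := len(x), len(y)
	u := 0 // number of (x, y) pairs where x is the larger value
	for _, xi := range x {
		for _, yj := range y {
			if xi > yj {
				u++
			}
		}
	}
	if mirror := n*m - u; mirror < u {
		u = mirror // work with the smaller tail
	}
	memo := map[[3]int]float64{}
	var tail, total float64
	for k := 0; k <= n*m; k++ {
		c := arrangements(n, m, k, memo)
		total += c
		if k <= u {
			tail += c
		}
	}
	p := 2 * tail / total
	if p > 1 {
		p = 1
	}
	return p
}

func main() {
	// Invented ns/op samples: one clearly separated pair, one interleaved.
	oldRuns := []float64{1281, 1290, 1275, 1302, 1268}
	newRuns := []float64{1031, 1042, 1025, 1038, 1029}
	noisy := []float64{1279, 1295, 1270, 1305, 1262}

	fmt.Printf("separated:   p=%.3f\n", mannWhitneyP(newRuns, oldRuns))
	fmt.Printf("interleaved: p=%.3f\n", mannWhitneyP(noisy, oldRuns))
}
```

Only the ordering of the samples matters, not their magnitudes, which is why the test makes no assumption about the shape of the timing distribution.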

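So -19.5% (p=0.000 n=10) reads as: the new code's median is ~19.5%
lower, across 10 samples each, and if the two versions really ran at the
same speed, noise would almost never produce a gap this large. The
p=0.000 is rounded to three decimals; it isn't exactly zero. That's a
result you can put in a PR. n=10 is the number of samples per side that
went into the test; if it's lower than your -count, find out which runs
are missing before you trust the row.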

The delta without the p-value is the number that started the argument
at the top of this post. The delta with a low p-value is the number
that ends it.

When benchstat says ~

Sometimes the vs base column shows a tilde instead of a percentage:

              │   old.txt   │              new.txt              │
              │   sec/op    │   sec/op     vs base              │
Encode-8        1.28µ ± 3%    1.26µ ± 4%    ~ (p=0.190 n=10)

The ~ means the p-value came out above 0.05, so benchstat refuses
to claim a difference. There might be a 1.5% improvement hiding in
there. There might not. With this much sample spread, the test can't
tell the change apart from the jitter, and it says so honestly instead
of handing you a false win.
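The decision described above can be sketched in a few lines. The function name vsBase is hypothetical, and treating exactly p = alpha as significant is an assumption. Real benchstat also works from unrounded medians, so its delta can differ slightly in the last digit from one computed off the rounded numbers in the table.

```go
package main

import "fmt"

// vsBase mimics the "vs base" cell: a signed percentage delta when the
// p-value is at or below alpha, and "~" when it is above.
func vsBase(oldMedian, newMedian, p, alpha float64) string {
	if p > alpha {
		return "~" // difference not distinguishable from noise
	}
	delta := (newMedian - oldMedian) / oldMedian * 100
	return fmt.Sprintf("%+.1f%%", delta)
}

func main() {
	const alpha = 0.05 // the default cutoff named earlier in the post

	// Medians and p-values taken from the two example tables above.
	fmt.Println(vsBase(1.28, 1.03, 0.000, alpha))
	fmt.Println(vsBase(1.28, 1.26, 0.190, alpha))
}
```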

This is the outcome people find annoying and it's the most valuable one
the tool produces. A ~ on your "optimization" means one of three
things: the change is real but too small to prove at this sample size,
the change does nothing, or your machine was too noisy to measure it.
The fix for the first is more counts. The fix for the second is to stop
optimizing that path. The fix for the third is a quieter machine or a
dedicated benchmark box. All three are better than merging a ~ and
calling it a speedup.

A workflow you can actually run

Put it together and it's four steps, in order:

go test -bench=. -benchmem -count=10 > old.txt
# make your change
go test -bench=. -benchmem -count=10 > new.txt
benchstat old.txt new.txt

Read the delta for the size of the effect. Read the p-value for whether
you believe it. If you see ~, either raise -count and try again, or
accept that the change didn't move the needle. Don't commit the two
.txt files, but do paste the benchstat table into the PR so the next
reviewer sees the p-value, not just your happy number.

One run proves nothing. Ten runs plus a p-value is the difference
between "it felt faster" and "it is faster." Go gives you the harness
for free. benchstat gives you the honesty.

Benchmarking well is mostly about knowing what the runtime is doing
underneath the number: how the scheduler, the allocator, and the GC
shape a measurement. The Complete Guide to Go Programming digs into
that machinery so your benchmarks measure the thing you think they do.
Hexagonal Architecture in Go is about keeping the hot path you're
measuring behind a boundary you can swap and test in isolation, instead
of a tangle you can only benchmark end to end.