Microbenchmarks lie. Not maliciously, just structurally. You write a tight loop, measure it a thousand times, compare two implementations, and declare a winner. Except your CPU was thermally throttling during the second run. Or the OS decided to schedule a background process halfway through your baseline. Or the memory allocator fragmented differently between runs because you ate lunch and came back.
Most benchmarking harnesses deal with this by collecting more samples and hoping statistics will save you. Run it ten thousand times instead of a thousand. Throw out outliers. Compute confidence intervals. It helps, but it's patching over a fundamental problem: you measured the baseline and the candidate at different times, under different system conditions.
What if you didn't have to?
What Is Tango?
Tango is a Rust microbenchmarking harness built by Denis Bazhenov around a simple idea: run baseline and candidate together. Not sequentially. Interleaved. Baseline, candidate, baseline, candidate, all within the same process, alternating on every iteration. Thermal drift affects both equally. Scheduling jitter affects both equally. By the time you compare results, you're comparing two things that experienced the same system conditions at the same moments.
The project calls this "paired testing." It produces tighter confidence intervals and fewer false positives than traditional sequential benchmarking. And yet the repo sits at under 150 stars. For anyone doing serious performance work in Rust, that's a gap in awareness worth closing.
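The mechanic is easy to sketch in std-only Rust. This toy runner is not Tango's actual implementation (which also shuffles iteration order to cancel ordering bias); it just alternates the two functions and records per-pair differences:

```rust
use std::time::Instant;

// Toy paired runner: alternate baseline and candidate on every iteration,
// recording per-pair differences instead of two separate totals.
fn paired_samples(baseline: impl Fn(), candidate: impl Fn(), iters: usize) -> Vec<i64> {
    (0..iters)
        .map(|_| {
            let t0 = Instant::now();
            baseline();
            let b = t0.elapsed().as_nanos() as i64;

            let t1 = Instant::now();
            candidate();
            let c = t1.elapsed().as_nanos() as i64;

            // Both ran back-to-back, so a slow system moment hits both.
            c - b
        })
        .collect()
}

fn main() {
    let diffs = paired_samples(
        || { std::hint::black_box((0..1_000u64).sum::<u64>()); },
        || { std::hint::black_box((0..2_000u64).sum::<u64>()); },
        100,
    );
    let mean = diffs.iter().sum::<i64>() / diffs.len() as i64;
    println!("mean candidate-baseline diff: {mean} ns");
}
```

Because each pair runs back-to-back, a thermal or scheduling hiccup inflates both measurements and mostly cancels out in the difference.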
The Snapshot
| Project | tango |
| --- | --- |
| Stars | ~148 at time of writing |
| Maintainer | Solo developer, actively committing |
| Code health | Small, dense, well-organized |
| Docs | Solid README with methodology explanation; API docs are thin |
| Contributor UX | Clear architecture, responsive maintainer, open to contributions |
| Worth using | Yes, if you benchmark and care about result stability |
Under the Hood
The whole workspace is about 3,900 lines of Rust, with the core tango-bench crate at ~3,350. That's small. The architecture earns its line count.
The paired-testing approach requires some engineering that sequential harnesses don't. Tango needs both baseline and candidate in the same process, but it also needs them isolated enough that they can't interfere with each other. The solution: dynamic library loading via libloading. Your benchmark compiles as a dylib, and Tango loads two copies of it into the same process. On Linux, it goes further with GOT/PLT patching (using goblin for ELF parsing) to interpose function calls. On Windows, it patches the Import Address Table. This is real systems programming, not a wrapper around std::time::Instant.
The benchmarking API itself is clean:
```rust
benchmark_fn("my_algorithm", |b| {
    b.iter(|| my_function(1000))
})
```
Metric selection is generic, chosen via turbofish syntax. The Metric trait is one method:
```rust
pub trait Metric {
    fn measure_fn(f: impl FnMut()) -> u64;
}
```
Wrap the closure, return a measurement. The trait is monomorphized, so there's no vtable dispatch in the hot path. WallClock uses std::time::Instant by default (or rdtscp directly with the hw-timer feature flag). Users can swap metrics per-benchmark:
```rust
b.metric::<WallClock>().iter(|| ...)
```
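Given the simplified one-method trait above, a wall-clock metric takes only a few lines. This is a sketch against that simplified signature, not Tango's exact WallClock:

```rust
use std::time::Instant;

// The simplified trait from above (Tango's real signature may differ).
pub trait Metric {
    fn measure_fn(f: impl FnMut()) -> u64;
}

// Wall-clock metric: wrap the closure, return elapsed nanoseconds.
// Monomorphized per closure type, so no vtable dispatch in the hot path.
pub struct WallClock;

impl Metric for WallClock {
    fn measure_fn(mut f: impl FnMut()) -> u64 {
        let start = Instant::now();
        f();
        start.elapsed().as_nanos() as u64
    }
}

fn main() {
    // Measure a 1 ms sleep: wall clock should report roughly that much.
    let ns = WallClock::measure_fn(|| {
        std::thread::sleep(std::time::Duration::from_millis(1))
    });
    println!("{ns} ns");
    assert!(ns >= 500_000); // at least ~0.5 ms of wall time
}
```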
Dependencies are reasonable: clap for CLI, rand for shuffled iteration orders, libc and windows for platform calls, goblin/scroll for ELF patching on Linux. No framework bloat. The alloca crate shows up for stack-allocated sampling buffers, which is an unusual but defensible choice for keeping allocation noise out of measurements.
The rough spots are minor. API documentation beyond the README is sparse. There's a pre-existing clippy warning in cli.rs for a function with too many arguments. Test coverage is solid for the core statistics and measurement code, thinner for the CLI and dylib loading paths. None of this is unusual for a focused solo project at this scale.
The Contribution
When Tango added the Metric trait in PR #60, it shipped with exactly one implementation: WallClock. An earlier PR (#57) had proposed switching the default timer to clock_gettime(CLOCK_THREAD_CPUTIME_ID) for per-thread CPU time. The maintainer pushed back, correctly, because CPU time and wall time measure different things. A function that sleeps for 100ms registers 100ms of wall time but near-zero CPU time. You can't just swap one for the other.
But with the pluggable Metric trait in place, both can coexist. The maintainer said as much: "Now we can implement CpuTime as a separate metric to measure. I believe we have no obstacles to implementing your idea in tango code base." That was months ago. Nobody had followed through.
So I implemented CpuTime. On Unix, it uses clock_gettime(CLOCK_THREAD_CPUTIME_ID) for per-thread CPU time with nanosecond precision. On Windows, it calls GetThreadTimes(GetCurrentThread()) and sums user and kernel time. Both platform dependencies were already in Cargo.toml, so no new crates were needed. The implementation follows the same pattern as WallClock: platform-specific code behind cfg attributes on the trait impl, no separate platform module. Two files changed, about 115 lines total including tests.
The integration test is the one I'm most pleased with. It measures CPU time during a thread::sleep(50ms) versus a busy loop, and asserts the busy loop reports at least 10x more CPU time. That's the whole point of the metric in one assertion: sleep consumes wall time but not CPU time.
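That sleep-versus-spin check is easy to reproduce standalone. The sketch below declares clock_gettime by hand so it needs only std; the struct layout and the CLOCK_THREAD_CPUTIME_ID constant assume 64-bit Linux, and the actual PR code differs:

```rust
use std::time::Duration;

// Hand-declared Linux clock_gettime (normally pulled in via the libc crate).
#[repr(C)]
struct Timespec { tv_sec: i64, tv_nsec: i64 }
const CLOCK_THREAD_CPUTIME_ID: i32 = 3; // Linux-specific value

extern "C" {
    fn clock_gettime(clk_id: i32, tp: *mut Timespec) -> i32;
}

// Per-thread CPU time in nanoseconds.
fn thread_cpu_ns() -> u64 {
    let mut ts = Timespec { tv_sec: 0, tv_nsec: 0 };
    unsafe { clock_gettime(CLOCK_THREAD_CPUTIME_ID, &mut ts) };
    ts.tv_sec as u64 * 1_000_000_000 + ts.tv_nsec as u64
}

fn main() {
    // Sleeping burns wall time but almost no CPU time...
    let start = thread_cpu_ns();
    std::thread::sleep(Duration::from_millis(50));
    let slept = thread_cpu_ns() - start;

    // ...while spinning burns CPU time (here, ~20 ms of it).
    let start = thread_cpu_ns();
    let mut x = 0u64;
    while thread_cpu_ns() - start < 20_000_000 {
        x = x.wrapping_add(1);
    }
    let spun = thread_cpu_ns() - start;
    std::hint::black_box(x);

    println!("sleep: {slept} ns cpu, spin: {spun} ns cpu");
    assert!(spun > slept * 10); // the metric's whole point in one line
}
```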
The maintainer reviewed PR #72 with specific, constructive feedback: split the integration test into unit tests, tighten the assertion tolerance, minimize the Windows measurement overhead. All fair. The revised PR was approved and merged within a week of submission.
The Verdict
Tango is for Rust developers who benchmark and have been burned by inconsistent results. If you've ever had a benchmark show a 5% regression that turned out to be your laptop's fan kicking in, the paired-testing approach directly addresses that.
The project has a clear trajectory. The Metric trait landed recently, async benchmark support is in progress, and the maintainer engages thoughtfully with PRs and issues. This isn't a stalled side project. It's actively evolving, and the architecture is clean enough to support that growth.
What would push it further? More metrics (instruction counts via perf_event_open, anyone?), better API docs, and broader awareness. The paired-testing methodology is the kind of idea that, once you understand it, makes sequential benchmarking feel obviously wrong. More people should know about it.
Go Look At This
If you benchmark Rust code, go look at tango. Read the methodology section of the README. Run one of the examples. The paired-testing approach is worth understanding even if you stick with Criterion for now.
Star the repo, try it on a real benchmark, or pick up one of the open issues. Here's my merged PR adding CpuTime if you want to see what a small contribution looks like.
This is Review Bomb #3, a series where I find under-the-radar projects on GitHub, read the code, contribute something, and write it up. If you know a project that deserves more eyeballs, drop it in the comments.