
SEN LLC


Micro-benchmarking TypeScript Without Lying to Yourself


benchmark.js has effectively been abandoned for years, hyperfine is wonderful for CLI tools but has no concept of a JavaScript function, and most teams end up pasting a Date.now() loop into a scratch file every time they need to answer "is this faster?". I wrote a small TypeScript CLI called ts-bench to do the boring parts — warmup, auto-calibration, standard deviation, baseline comparison — correctly and without any dependencies you don't already have.

📦 GitHub: https://github.com/sen-ltd/ts-bench


The thing I wanted, concretely, was this workflow:

# On main
ts-bench bench.ts --save bench.json

# In a feature branch, in CI
ts-bench bench.ts --baseline bench.json --fail-on-regression

And I wanted to point it at a .ts file with zero build step. tsx makes that trivial, so the CLI itself is under 800 lines of TypeScript split across stats, runner, loader, baseline, and three formatters (human, JSON, markdown). This post is about the five or six places where a naive benchmark runner silently lies to you, and what I did about each one.

The problem with "just use Date.now"

Every Date.now() loop has the same shape:

const start = Date.now()
for (let i = 0; i < 1_000_000; i++) { myFunction() }
const elapsed = Date.now() - start
console.log(`${elapsed}ms total`)

There are five things wrong with this code, and every team I've worked with has brought at least one version of it into a performance investigation at some point.

  1. Date.now() has millisecond resolution — and a millisecond is a million nanoseconds. For anything that completes in under a millisecond per call, you're measuring the precision of your clock, not the speed of your code.
  2. There's no warmup. The first few thousand calls run through V8's baseline compiler before the optimizing tier kicks in, and those measurements are not representative of steady-state performance.
  3. There's one sample. You can't distinguish "this took 50ms" from "this took 50ms ± 40".
  4. You don't consume the return value, so V8 is free to conclude the call is pure and delete the loop body entirely. If myFunction is () => 1 + 1, the optimized loop is i++, and your benchmark reports the speed of an empty loop.
  5. You have no baseline. You know the absolute number but not whether your last commit made it worse.
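To make the first and fourth problems concrete, here's a minimal hand-rolled sketch that fixes just those two — nanosecond clock, consumed return value — using nothing beyond Node itself (warmup, multiple samples, and baselines are still missing):

```typescript
// Minimal fix for problems 1 and 4: process.hrtime.bigint() gives
// nanosecond resolution, and storing every return value in a sink
// stops V8 from proving the call is unused and deleting the loop body.
function myFunction(): number {
  return 1 + 1
}

let sink: unknown
const start = process.hrtime.bigint()
for (let i = 0; i < 1_000_000; i++) {
  sink = myFunction() // consuming the result defeats dead-code elimination
}
const elapsedNs = Number(process.hrtime.bigint() - start)
console.log(`${(elapsedNs / 1e6).toFixed(2)}ms total (sink=${sink})`)
```

Still a single sample with no warmup — but at least it's measuring real work with a real clock.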

ts-bench addresses each of these. Let's go in order.

Nanosecond resolution with hrtime

Node has had process.hrtime.bigint() forever. It returns nanoseconds as a BigInt:

import { computeStats, type Stats } from './stats.js'

export type NowFn = () => bigint
export type BenchFn = () => unknown

export class Runner {
  private sink: unknown = undefined

  constructor(private readonly now: NowFn) {}

  private runBucket(fn: BenchFn, iterations: number): number {
    const start = this.now()
    for (let i = 0; i < iterations; i++) {
      this.sink = fn()
    }
    const end = this.now()
    if (this.sink === Symbol.for('ts-bench/unreachable')) {
      throw new Error('unreachable')
    }
    return Number(end - start)
  }
}

Two things worth noting:

  • The clock is injected. The runner takes a NowFn instead of calling process.hrtime.bigint() directly. In production that function is process.hrtime.bigint. In tests it's a fake that returns a pre-programmed sequence of bigints. That's how we get deterministic unit test coverage for a timing-dependent class — more on that in the next section.
  • sink is the dead-code trick. The runner stores the return value of every call into an instance field and touches the field once (inside an impossible if) after the loop. V8 cannot prove the field is unread, so it cannot delete the loop body. This is the single most important detail in the runner, and it's the one every Stack Overflow answer forgets.

Deterministic timing tests with a fake clock

Here's the test that verifies warmup works. I want to prove that warmup buckets are fully discarded — that no matter how slow they are, they don't influence the final stats. The fake clock makes this trivial:

import { it, expect } from 'vitest'
import { Runner, fakeClock, defaultRunOptions } from '../src/runner.js'

it('discards warmup buckets so their timings do not influence stats', () => {
  const clock = fakeClock([
    0n, 10_000n,       // warmup 1 (discarded)
    10_000n, 20_000n,  // warmup 2 (discarded)
    20_000n, 20_100n,  // sample 1: 100ns / 1 iter
    20_100n, 20_200n,  // sample 2: 100ns / 1 iter
  ])
  const runner = new Runner(clock)
  const result = runner.run('const', () => 0, {
    ...defaultRunOptions,
    iterations: 1,
    warmup: 2,
    samples: 2,
    autoCalibrate: false,
  })
  expect(result.stats.meanNs).toBe(100)
  expect(result.stats.stddevNs).toBe(0)
})

The warmup buckets each return 10,000ns — wildly slower than the real samples — and yet the reported mean is 100ns. If the runner were accidentally averaging warmup into the sample pool, that test would fail loudly. The fake clock also throws an exhausted error if you ask for more ticks than you provisioned, so any off-by-one bug in the bucket loop surfaces as a test failure instead of a silent wrong answer.

This is a pattern I reach for whenever I have to unit-test code that otherwise "depends on real time". Injecting the clock is always cheaper than faking timers at the runtime level.
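For reference, the fakeClock helper can be sketched like this — my reconstruction from how the test uses it, not the repo's exact code: each call to the returned NowFn yields the next pre-programmed bigint, and running past the end throws.

```typescript
// Sketch of a fake clock: yields a fixed sequence of bigint timestamps,
// and throws once exhausted so an off-by-one in the bucket loop fails
// loudly instead of producing a silent wrong answer.
type NowFn = () => bigint

function fakeClock(ticks: readonly bigint[]): NowFn {
  let next = 0
  return () => {
    if (next >= ticks.length) {
      throw new Error(`fake clock exhausted after ${ticks.length} ticks`)
    }
    return ticks[next++]
  }
}
```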

Standard deviation is your first line of defense against noise

The single best signal that your benchmark result is garbage is a high relative standard deviation. If mean = 1000ns and stddev = 800ns, you do not have a reliable measurement of 1000ns — you have a distribution that spans most of the range from nothing to twice your mean, and the conclusion you're about to draw from it is somewhere between "wrong" and "spectacularly wrong".

export function computeStats(samplesNs: readonly number[]): Stats {
  const n = samplesNs.length
  let sum = 0
  let min = Number.POSITIVE_INFINITY
  let max = Number.NEGATIVE_INFINITY
  for (const s of samplesNs) {
    sum += s
    if (s < min) min = s
    if (s > max) max = s
  }
  const mean = sum / n

  let variance = 0
  if (n > 1) {
    let sqSum = 0
    for (const s of samplesNs) {
      const d = s - mean
      sqSum += d * d
    }
    variance = sqSum / (n - 1) // Bessel's correction
  }
  const stddev = Math.sqrt(variance)
  const rsd = mean > 0 ? stddev / mean : 0
  const opsPerSec = mean > 0 ? 1e9 / mean : 0

  return { samples: n, meanNs: mean, minNs: min, maxNs: max,
           stddevNs: stddev, rsd, opsPerSec }
}

I use Bessel's correction (n − 1 instead of n) because we're estimating the population variance from a sample, not computing the exact variance of the sample itself. This is the default in R, NumPy, and most stats libraries, and it matters more than you'd think at small sample counts.
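To see the size of that effect at a small sample count, take two hand-picked samples, 90ns and 110ns — the corrected estimate is over 40% larger than the uncorrected one:

```typescript
// Two samples, 90ns and 110ns: dividing squared deviations by n−1 gives
// √200 ≈ 14.14, while dividing by n would give √100 = 10.
const samples = [90, 110]
const mean = samples.reduce((a, b) => a + b, 0) / samples.length // 100

const sqSum = samples.reduce((acc, s) => acc + (s - mean) ** 2, 0) // 200
const besselStddev = Math.sqrt(sqSum / (samples.length - 1)) // ≈ 14.14
const naiveStddev = Math.sqrt(sqSum / samples.length)        // 10

console.log(mean, besselStddev.toFixed(2), naiveStddev)
```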

The human formatter flags any bench with rsd > 10% in yellow so it's visually obvious:

sum_formula: 0.8 ns ± 0.6 ns, 1.18 Gops/s, 73.52%

That last column is the rsd. 73% relative standard deviation for a benchmark claiming to run at 1.18 billion operations per second is the system telling you, in no uncertain terms, that V8 has optimized your bench body down to nothing and what you're measuring is scheduler noise in an empty loop. Which is exactly what happens if your bench is () => (999 * 1000) / 2 — a pure constant the JIT folds into itself at compile time.

The fix, for real benchmarks, is to return a value that depends on something V8 can't precompute — an array length, a call to Math.random(), whatever. But the important thing is that the runner surfaces the problem instead of confidently reporting "1.18 Gops/s" and letting you paste it into a PR description.
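As a concrete illustration — a hypothetical bench file, not one from the repo — compare a constant-foldable body with one that reads runtime data:

```typescript
// Two bench bodies returning the same value, 499500. The first is a
// compile-time constant the JIT folds away; the second depends on an
// array built at runtime, which V8 cannot precompute.
const data = Array.from({ length: 1000 }, (_, i) => i)

export function bench_sum_folded() {
  return (999 * 1000) / 2 // constant-folded — you'd be timing an empty loop
}

export function bench_sum_real() {
  let s = 0
  for (let i = 0; i < data.length; i++) s += data[i]
  return s // same result, but computed from runtime data on every call
}
```

The first would show up with a suspiciously high rsd; the second gives a measurement you can actually trust.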

Auto-calibration: pick the right bucket size automatically

The naive knob to tune is --iterations, which controls how many calls go into each sample bucket. Too few and each bucket is dominated by clock jitter; too many and you wait an hour for a benchmark run. The right answer is: "enough iterations that one bucket takes about 100ms". That depends on the bench function, which you don't know in advance.

So ts-bench can estimate it:

export function estimateIterations(
  probeIters: number,
  probeNs: number,
  targetNs: number
): number {
  if (probeIters <= 0 || probeNs <= 0 || targetNs <= 0) return 1
  const perIter = probeNs / probeIters
  const est = Math.round(targetNs / perIter)
  if (!Number.isFinite(est) || est < 1) return 1
  if (est > 1_000_000_000) return 1_000_000_000
  return est
}

With --auto-calibrate, the runner first executes a short probe of 1,000 iterations, divides the elapsed time to get nanoseconds per iteration, and then picks a bucket size that aims at targetSampleNs (default 100ms). Warmup and samples then run against that auto-selected iteration count, so you get meaningful precision without having to guess. The upper clamp (capping est at 1_000_000_000) stops pathologically fast probes (e.g. a probe that took 1ns) from asking for a hundred billion iterations.

In practice, the ideal behaviour is: fast bench → big bucket, slow bench → small bucket, same wall time either way.
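Plugging in concrete numbers (the function is repeated here so the snippet runs standalone): a 1,000-iteration probe that took 100µs means ~100ns per call, so hitting a 100ms bucket takes about a million iterations.

```typescript
// Repeated from the post so this snippet is self-contained.
function estimateIterations(probeIters: number, probeNs: number, targetNs: number): number {
  if (probeIters <= 0 || probeNs <= 0 || targetNs <= 0) return 1
  const perIter = probeNs / probeIters
  const est = Math.round(targetNs / perIter)
  if (!Number.isFinite(est) || est < 1) return 1
  if (est > 1_000_000_000) return 1_000_000_000
  return est
}

// 1,000-iter probe took 100µs → 100ns/iter → 100ms bucket = 1,000,000 iters.
console.log(estimateIterations(1000, 100_000, 100_000_000)) // 1000000

// A 1ns probe would naively ask for 10^11 iterations; the cap kicks in.
console.log(estimateIterations(1000, 1, 100_000_000)) // 1000000000
```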

Baseline comparison, for CI

benchmark.js-style reporters are great on your laptop and useless in a pull request. What you actually want from a benchmark in CI is: "did this PR make anything slower?". Which is a comparison, not an absolute measurement.

export function compareToBaseline(
  current: readonly BenchResult[],
  baseline: BaselineFile,
  thresholdPct = 5
): ComparisonResult {
  const baseByName = new Map(baseline.entries.map((e) => [e.name, e]))
  const rows: ComparisonRow[] = []
  const regressions: string[] = []

  for (const r of current) {
    const b = baseByName.get(r.name)
    if (!b) {
      rows.push({ name: r.name, status: 'new',
                  currentMeanNs: r.stats.meanNs, baselineMeanNs: null,
                  deltaPct: null })
      continue
    }
    const delta = ((r.stats.meanNs - b.meanNs) / b.meanNs) * 100
    let status: ComparisonStatus
    if (delta > thresholdPct) {
      status = 'slower'
      regressions.push(r.name)
    } else if (delta < -thresholdPct) {
      status = 'faster'
    } else {
      status = 'same'
    }
    rows.push({ name: r.name, status,
                currentMeanNs: r.stats.meanNs,
                baselineMeanNs: b.meanNs, deltaPct: delta })
  }

  return { rows, regressions }
}

The threshold is important. If you compare against a baseline with zero tolerance, every single run will "regress" because the variance between runs is always nonzero. 5% is the default because, empirically, it's wider than the noise floor of a healthy bench (rsd < 2%) and still tight enough to catch a real algorithmic regression.
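Here's the delta arithmetic from compareToBaseline in isolation, with illustrative numbers of my own: an 8% slowdown against the default 5% threshold lands in the slower bucket, while a 3% wobble would be classified same.

```typescript
// The classification logic from compareToBaseline, extracted: the delta
// is a percentage relative to the baseline mean, compared symmetrically
// against the threshold.
const baselineMeanNs = 1000
const currentMeanNs = 1080 // hypothetical: this PR made the bench 8% slower
const thresholdPct = 5

const deltaPct = ((currentMeanNs - baselineMeanNs) / baselineMeanNs) * 100
const status =
  deltaPct > thresholdPct ? 'slower'
  : deltaPct < -thresholdPct ? 'faster'
  : 'same'

console.log(status) // 'slower'
```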

The --fail-on-regression flag makes the process exit 1 if any bench landed in the slower bucket. Drop it into a GitHub Actions workflow behind if: github.event_name == 'pull_request' and you have a perf gate.

Tradeoffs and what this is not

I want to be upfront about what you should not use ts-bench for.

  • It cannot beat the JIT. Micro-benchmarks in JavaScript are fundamentally unstable because V8's optimization decisions depend on call-site inlining, type feedback, and deoptimization history. The best you can do is warm up, consume return values, and check rsd. For anything beyond "which of these two implementations is faster at this specific call site", you want node --prof and a flamegraph.
  • It's single-process. Every bench runs in the same Node process with the same V8 instance. Benches run earlier can influence benches run later through shared inline caches. If that matters to you, run each bench in its own subprocess — ts-bench does not do that.
  • It's not a replacement for macro-benchmarks. "How fast does my HTTP server handle 10k concurrent requests" is a question ts-bench cannot answer. Use wrk, k6, or autocannon.
  • There is no statistical test for "significantly different". If two benches have means within 1% of each other, ts-bench reports them as "same" and leaves the verdict to the human. Real significance testing (Welch's t-test, bootstrap confidence intervals) is a feature I'd add before using this at a tighter threshold.

Within those limits, what it does give you is a zero-config, zero-dependency, deterministic, CI-ready runner for TypeScript functions. The test suite is 63 assertions across stats, runner, baseline round-tripping, comparison logic, all three formatters, and argument parsing — and the runner's timing tests run in microseconds because the clock is a fake.

Try it in 30 seconds

cat > bench.ts << 'EOT'
export function bench_sum_loop() {
  let s = 0
  for (let i = 0; i < 1000; i++) s += i
  return s
}
export function bench_sum_reduce() {
  return Array.from({ length: 1000 }, (_, i) => i)
    .reduce((a, b) => a + b, 0)
}
EOT

docker run --rm -v "$PWD":/work ts-bench /work/bench.ts --auto-calibrate

Or clone the repo and run it directly:

git clone https://github.com/sen-ltd/ts-bench
cd ts-bench
npm install
npm run build
node dist/main.js bench.ts --auto-calibrate

Source, tests, and Dockerfile are at https://github.com/sen-ltd/ts-bench. If you find a case where the runner lies to you — especially one I didn't think of — I'd love to see the issue.
