A while ago — though by current standards that already feels like ancient history — Twitter was buzzing about Pretext.js, a text-layout library whose headline claim is a dramatic speedup over standard DOM measurement. Instead of letting the browser's layout engine compute text dimensions, it does the equivalent work in pure JavaScript on top of data prepared once via `prepare()`. The result, in theory, is no forced reflow on every measurement.
Demos from different people looked impressive, so I got curious about how fast this actually is in practice. There was just one detail: I'm not particularly good at performance measurement. Outside of basic Core Web Vitals work — LCP, INP, the usual suspects — I had no real experience with this kind of micro-benchmarking.
I started reading about how these measurements are actually done, and concluded that I should begin with the simplest possible test: comparing single-call costs across three strategies — DOM, Canvas, Pretext.
I picked the leanest setup I could, to keep the harness from taxing the runtime: Astro with static output, React islands only where interactivity was needed, hand-rolled measurement utilities. I wrote minimal implementations of each strategy and started running tests.
What follows are the things I learned. Almost none of them are about Pretext.
## The timer was lying
The first run of my benchmark came back looking like this:
| Strategy | 100 chars p50 | 1000 chars p50 |
|---|---|---|
| DOM | 0.00 µs | 100.00 µs |
| Canvas | 0.00 µs | 0.00 µs |
| Pretext | 0.00 µs | 0.00 µs |
Most of the cells read zero. The ones that didn't read zero were suspiciously round — exactly 100.00, 200.00, 400.00. Those round numbers were the clue.
Since the Spectre disclosures in 2018, browsers deliberately reduce performance.now() precision to make timing-based memory attacks harder. In Chrome, the timer is rounded to ~100 microseconds. In Firefox and Safari, to ~1 millisecond. You can verify it in the console:
```js
console.log(performance.now() % 1)
// Chrome: ~0.1 increments
// Firefox: 0
```
I'd been benchmarking operations that, in some cases, complete in under a microsecond. Of course they came back as zero. The timer simply had no shorter unit to express them.
crossOriginIsolated pages get ~5 µs performance.now() resolution (and access to SharedArrayBuffer, which can be used as an even finer-grained counter clock), but setting up COOP/COEP headers adds friction. Batched timing achieves the same effective precision without touching response headers.
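For reference, here is roughly what the isolation route involves. The header names are standard; how you attach them depends on the hosting setup, and I didn't go down this path for these benchmarks:

```js
// The page must be served with these two response headers to become
// cross-origin isolated (any static host or dev server can add them):
//   Cross-Origin-Opener-Policy: same-origin
//   Cross-Origin-Embedder-Policy: require-corp

// In the page itself, check whether isolation actually took effect:
if (self.crossOriginIsolated) {
  console.log('isolated: performance.now() resolution is ~5 µs in Chrome')
} else {
  console.log('not isolated: falling back to batched timing')
}
```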
The standard fix in serious benchmark libraries (benchmark.js, tinybench, criterion-style harnesses) is batched timing. Instead of measuring each call individually, you measure batches of N calls and divide:
```js
// What I had — broken
for (let i = 0; i < 1000; i++) {
  const t0 = performance.now()
  measure()
  timings.push(performance.now() - t0)
}

// What it should be
const BATCH = 1000
for (let i = 0; i < iterations; i++) {
  const t0 = performance.now()
  for (let j = 0; j < BATCH; j++) measure()
  timings.push((performance.now() - t0) / BATCH)
}
```
If 1000 calls collectively take 500 µs, the average call cost is 0.5 µs — even though no individual call is timeable.
There are tradeoffs to disclose. Batching hides outlier distribution; the p99 of batch means is not the same as the p99 of individual calls. JIT inlining can over-optimise the inner loop. And if measure() returns a value you don't use, V8 can eliminate the call entirely — you'll need to "sink" the return value into a side effect to keep the optimiser honest:
```js
let sink = 0
for (let j = 0; j < BATCH; j++) sink += measure()
if (sink === 0xDEADBEEF) console.log(sink) // forces the result to be used
```
After adding batching, the same benchmark produced something I could actually interpret:
| Strategy | 100 chars p50 | 1000 chars p50 | 5000 chars p50 |
|---|---|---|---|
| DOM | 22.20 µs | 61.60 µs | 235.15 µs |
| Canvas | 0.80 µs | 7.20 µs | 35.60 µs |
| Pretext | 0.20 µs | 1.50 µs | 7.50 µs |
A side note on percentiles: I'm reporting p50, p95, p99 throughout, not means. Means hide tails, and as you'll see in the DevTools section, the tail is where benchmarks often go wrong.
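For concreteness, this is the kind of helper I mean: a minimal nearest-rank percentile over the per-call estimates from the batched loop. It's one of several reasonable definitions, not the only correct one:

```js
// Nearest-rank percentile over the batched per-call estimates.
function percentile(timings, p) {
  const sorted = [...timings].sort((a, b) => a - b)
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[Math.max(0, Math.min(sorted.length - 1, rank - 1))]
}

const stats = {
  p50: percentile(timings, 50),
  p95: percentile(timings, 95),
  p99: percentile(timings, 99),
}
```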
## The cold-start cliff
The next thing I noticed was that the very first run of any benchmark was much slower than every subsequent one.
| Run | Pretext.prepare(10 chars) |
|---|---|
| 1 | 5400 µs |
| 2 | 300 µs |
| 3 | 200 µs |
That's an 18× difference between the first and second run. In a production build, the gap was even bigger: 7900 → 300 µs, a 26× cliff. In Safari, run 1 took 10,000 µs for the same input.
The standard answer is warm-up: run the function a number of times before you start recording timings, and discard those runs. The cold timings are real, but they reflect engine bootstrap rather than steady-state cost:
```js
// Untimed warm-up
for (let i = 0; i < 500; i++) measure()

// Now measure
const timings = []
for (let i = 0; i < 1000; i++) {
  // batched timing as above
}
```
The number of warm-up iterations isn't arbitrary. For simple functions, ~50 is enough. For branchy or library-heavy code, 500+. There's a quick way to verify: plot timings by iteration index and look for where the line stabilises.
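If you'd rather not eyeball a chart, the same check can be rough-coded. This is only a sketch; the window size and tolerance are arbitrary choices, not measured constants:

```js
// Find the first iteration index from which timings stay close to the
// steady-state estimate (median of the last 100 runs).
function findWarmupEnd(timings, windowSize = 20, tolerance = 1.2) {
  const tail = [...timings.slice(-100)].sort((a, b) => a - b)
  const steady = tail[Math.floor(tail.length / 2)]
  for (let i = 0; i + windowSize <= timings.length; i++) {
    const span = timings.slice(i, i + windowSize)
    if (span.every(t => t <= steady * tolerance)) return i
  }
  return timings.length // never stabilised: warm up longer
}
```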
Why is the first call so slow? Several causes overlap:
- JIT compilation phases. First calls run in the interpreter. The engine collects type information. Once the function is "hot", it's compiled to optimised machine code. If the type assumptions later break, deoptimisation kicks in — back to the interpreter for a moment.
- Cold instruction cache and branch predictor. Modern CPUs penalise unfamiliar code paths.
- Library-internal lazy init. Pretext, like many text-layout tools, lazily creates a hidden `<canvas>` for `measureText` on first use. Loading the font, allocating the canvas, building internal caches — all of this happens once, on the first call.
The honest answer is to surface this rather than hide it. In my benchmark UI, I added a separate "Pre-warm" button. Pre-warming is a feature of the library, not measurement noise to be averaged away.
This led naturally to the next question: what else is silently affecting these numbers? I planned to test multiple browsers, of course — that's an obvious axis. But what's less obvious?
## The states the browser was in when I wasn't looking
A short bit of research surfaced three conditions worth checking: incognito mode, DevTools open, and production vs development build. All three sound mundane. Two of them turned out to matter a lot.
### DevTools open destroys tail latency
Same Chrome, same code; the only difference is whether DevTools is open during the run.
| 5000ch DOM | DevTools closed | DevTools open | Δ |
|---|---|---|---|
| p50 | 239.40 µs | 256.55 µs | +7% |
| p99 | 237.61 µs | 278.14 µs | +17% |
p50 looks fine — a few percent slower, easy to dismiss. But for sub-microsecond operations like Canvas or Pretext layout on small inputs, the tail behaviour collapses:
| 10ch p99 | Closed | Open | Δ |
|---|---|---|---|
| Canvas | 0.30 µs | 17.70 µs | ≈59× |
| Pretext | 0.20 µs | 14.30 µs | ≈70× |
For sub-microsecond operations, opening DevTools makes the p99 seventy times worse. That's because DevTools instruments the main thread with periodic checks; on long operations the overhead is absorbed, on fast ones it dominates.
The implication is uncomfortable: every benchmark you've ever run with DevTools open has unreliable tails. Most of us run benchmarks with DevTools open by default — that's where the performance panel is, after all.
DevTools-open p99s are not comparable to DevTools-closed p99s. If you're publishing tail-latency numbers, record them with DevTools fully closed and note it. A 70× p99 difference is enough to change the story entirely.
### Dev mode lies about DOM performance
I assumed dev mode would be uniformly slower than prod — maybe 10-20% across the board. The reality is asymmetric.
Same code, Chrome warm, prod vs dev:
| Input | Dev DOM p50 | Prod DOM p50 | Δ |
|---|---|---|---|
| 10ch | 17.80 µs | 8.50 µs | +109% |
| 100ch | 22.20 µs | 13.10 µs | +69% |
| 1000ch | 61.60 µs | 54.00 µs | +14% |
| 5000ch | 235.15 µs | 232.75 µs | ≈0% |
The Astro dev middleware, source maps, and unminified runtime add roughly 9–14 µs of fixed overhead per DOM operation, regardless of input size. At 5000 characters, that's lost in the noise of real layout work. At 10 characters, it doubles the result.
Canvas and Pretext were unaffected — same numbers in dev and prod. That tells us where the overhead lives: it's specifically in code paths that touch the DOM. Frameworks instrument their middleware around DOM-relevant operations. Pure computation slips past them.
Dev/prod divergence is a useful heuristic: if an operation is measurably slower in dev, it's touching the DOM (or a framework-instrumented path). If numbers match, it's pure computation. A quick way to classify unfamiliar code without reading it.
The takeaway: dev numbers are useful only for confirming that your benchmark works at all. Every real comparison goes through `pnpm build && pnpm preview`.
### Incognito didn't matter (this time)
For completeness: incognito-mode runs were within 1.5% of regular Chrome runs on my machine. My extension setup turned out not to interfere. For other people with heavier extensions — analytics blockers with main-thread observers, dev-tool extensions, accessibility tools — this could easily be different. Worth checking; rarely the issue.
## Three engines, three answers
This is the part that surprised me the most. Same benchmark, prod build, warm runs, three browsers.
| 5000ch p50 | Chrome (Blink) | Firefox (Gecko) | Safari (WebKit) |
|---|---|---|---|
| DOM | 232.75 µs | 339.00 µs | 87.00 µs |
| Canvas | 35.80 µs | 27.00 µs | 29.00 µs |
| Pretext | 7.30 µs | 11.00 µs | 9.00 µs |
Look at DOM at 5000 characters. Safari's WebKit is 2.7× faster than Chrome's Blink. This is the opposite of the "Chrome is the fastest browser" intuition that most frontend engineers carry around.
Now look at the same operation at 10 characters: Chrome runs at 8.50 µs, Safari at 34.00 µs, Firefox at 73.00 µs. Here Chrome is 4× faster than Safari. Same code, same operation, opposite ranking, depending only on input size.
Different engines have different cost models. WebKit pays more per reflow startup but less per character processed. Blink does the opposite — cheap to initiate, more expensive to scale. Gecko sits between them on DOM and slightly leads on Canvas. Pretext, being pure arithmetic, follows the JIT performance of each engine — and there V8 wins, narrowly.
There is no universally fast browser for layout. Single-engine benchmarks can produce misleading universal claims either way. If I'd run my benchmark in Chrome only, I'd have published one story about Pretext's speedup over DOM. If I'd run it in Safari only, I'd have published a noticeably weaker version of the same story. Both would have been technically correct on their respective engines and silently wrong as generalisations.
The methodological consequence: cross-browser benchmarks need three columns, not an average.
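If juggling three browser windows by hand gets old, the same page can be driven by something like Playwright, which ships all three engines. This is a sketch, not part of my harness; the preview URL and the in-page `runBenchmark()` hook are placeholders for whatever your benchmark page actually exposes:

```js
// Run the same in-page benchmark in all three engines and print one row each.
import { chromium, firefox, webkit } from 'playwright'

for (const engine of [chromium, firefox, webkit]) {
  const browser = await engine.launch()
  const page = await browser.newPage()
  await page.goto('http://localhost:4321/bench') // pnpm preview URL (placeholder)
  // runBenchmark() is assumed to return something like { p50, p95, p99 }
  const result = await page.evaluate(() => window.runBenchmark())
  console.log(engine.name(), result)
  await browser.close()
}
```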
## What this all adds up to
I started building this benchmark to evaluate a library. I came out of it with a much better understanding of why most JavaScript benchmarks I've read on dev.to and elsewhere are quietly off — usually in one of three ways:
- A timer that can't resolve the operation being measured. Sub-microsecond operations in Firefox and Safari are unmeasurable without batching, and ad-hoc benchmarks tend to skip that step.
- A cold start averaged into the headline number, even though run 1 is 10–30× slower than run 2 and the mean describes neither case well.
- A single browser engine standing in for the entire web, which can flip a comparison's ranking depending on input size, as the DOM numbers in the previous section show.
None of these are individually obscure. I knew about JIT warm-up before I started; I'd heard about Spectre mitigations in the abstract; I'd seen "p99 not mean" advice in talks. What I didn't expect was how many of them stack on the same benchmark at once, and how easily a confident-looking number can be wrong in three independent ways.
This is the first experiment in what's going to be a series. The actual verdict on Pretext — where it wins, where it doesn't, what `prepare()` costs in practice across realistic workloads — is going to take several more experiments to establish honestly. I'm publishing this one now because the methodology surprises along the way felt worth sharing on their own, independent of where the Pretext story lands.
The full benchmarking code, calibration log, and methodology page are at pretext-lab — public repository, MIT-licensed. I'll be linking each new experiment from there as it goes up.