How a pure-TypeScript flex layout engine closed the last WASM-Yoga gap

#typescript #javascript #performance #showdev

TL;DR

I've been building Pilates, a flex layout engine for terminal UIs written in pure TypeScript. As of last week, the pure-TS engine is faster than WASM Yoga (the engine Ink uses) on all 9 scenarios in my bench suite. That includes the structural-mutation workload (append + remove a row per frame), where Yoga led by ~5× until phases 15–17 turned it into a ~1.7× Pilates win.

No native bindings. No WASM port. The fix was algorithmic, and the algorithmic fix worked in TS.

The numbers

Median latency, win32-x64, Node 22, ~5s tinybench windows with bootstrap CI95:

Scenario	Pilates	yoga-layout (WASM)	Ratio
tiny (10 nodes)	4.5µs	19.0µs	4.2× faster
realistic (~100)	121µs	328µs	2.7× faster
stress (~1000)	601µs	1.94ms	3.2× faster
big (~5000)	3.32ms	9.17ms	2.8× faster
huge (~10000)	8.62ms	18.5ms	2.1× faster
hot-relayout	16.3µs	83.0µs	5.1× faster
hot-relayout + boundaries	15.8µs	77.8µs	4.9× faster
hot-relayout (text mutation)	8.9µs	90.6µs	10× faster
hot-structural	71.3µs	118.3µs	1.7× faster

Caveats up front: 9 hand-picked scenarios, not a universal claim. Reproduce with pnpm bench — about 5 minutes on a recent machine.

Why pure TS can beat WASM here

Terminal UI is a curiously hostile workload for a WASM engine. Trees are small (10–10,000 nodes), but updates are frequent: one keystroke, one tick, one frame. The crossing cost from JS into WASM dominates. Yoga's layout kernel takes a few microseconds per call, but so does every node.setWidth(N) call that crosses from JS into WASM. A pure-TS engine pays no crossing cost.

That was the thesis going in. Phases 15–17 are evidence the thesis holds even in the worst case — the workload where Yoga's compute kernel is exactly what's being measured, with the tree pre-built and only the structural-mutation layout timed.

How hot-structural went from ~450µs to ~70µs

Two algorithmic changes did the work.

1. Linear-recurrence main-axis positions

The original main-axis position rule was a cumulative sum: each cell's position depended on the size of every prior sibling. A 100-cell row in the stress fixture meant ~300 dependency edges per row.

// Old rule — every cell reads every prior sibling
mainPos[N] = sum(siblings[0..N-1].mainSize + margin + gap)

Replaced with a linear recurrence — each cell only reads the cell immediately before it:

// New rule — each cell only reads the previous one
mainPos[N] = mainPos[N-1] + prev.mainSize + prev.marginEnd + me.marginStart + gap

Reverse-direction layouts (row-reverse / column-reverse) keep the cumulative-sum fallback: the recurrence depends on the prior cell's already-resolved position, and that position is not available yet when iteration runs in reverse.
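A minimal sketch of the two rules side by side, to make the difference in read patterns concrete. The Cell shape and both function names are assumptions made up for illustration, not Pilates' real types, and the sketch covers forward direction only:

```typescript
// Hypothetical cell shape; field names are illustrative, not Pilates' actual types.
interface Cell {
  mainSize: number;
  marginStart: number;
  marginEnd: number;
}

// Old shape: position N re-reads every prior sibling, so a row costs O(n^2) reads.
function positionsByCumulativeSum(cells: Cell[], gap: number): number[] {
  return cells.map((cell, n) => {
    let pos = cell.marginStart;
    for (let i = 0; i < n; i++) {
      const prior = cells[i]!;
      pos += prior.marginStart + prior.mainSize + prior.marginEnd + gap;
    }
    return pos;
  });
}

// New shape: position N reads only cell N-1 and its already-resolved position.
function positionsByRecurrence(cells: Cell[], gap: number): number[] {
  const out: number[] = [];
  for (let n = 0; n < cells.length; n++) {
    const me = cells[n]!;
    if (n === 0) {
      out.push(me.marginStart);
      continue;
    }
    const prev = cells[n - 1]!;
    out.push(out[n - 1]! + prev.mainSize + prev.marginEnd + me.marginStart + gap);
  }
  return out;
}

const row: Cell[] = [
  { mainSize: 10, marginStart: 1, marginEnd: 2 },
  { mainSize: 20, marginStart: 0, marginEnd: 0 },
  { mainSize: 5, marginStart: 3, marginEnd: 1 },
];
console.log(positionsByCumulativeSum(row, 2)); // [ 1, 15, 40 ]
console.log(positionsByRecurrence(row, 2)); // [ 1, 15, 40 ]
```

Both functions produce the same positions. What changes is the dependency shape: in the recurrence, each cell reads only its immediate predecessor rather than every prior sibling.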

2. Fold default-valued style inputs

Observation: roughly half of all input fields in the grammar were sitting at default values forever — margin: 0, minWidth: 0, maxWidth: undefined, etc. They still consumed dirty-flag slots, propagated through dependents, and appeared in dependency sets.

Phase 17 folds these defaults into compile-time constants at grammar-build time. Each per-cell node went from ~15 fields to ~7. The classifier's nodeSig was extended with fold-predicate bits so that mutating from default → non-default correctly triggers a structural rebuild.

Combined, hot-structural went from ~450µs to ~70µs.
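To illustrate the folding idea, here is a rough sketch under stated assumptions: the field list, the DEFAULTS table, and the function names foldSignature, liveFields, and needsStructuralRebuild are all hypothetical, and the real nodeSig carries more than these bits. Each field gets one bit that is set only when the field holds a non-default value, so a change in the bit pattern signals that the set of live fields changed.

```typescript
// Hypothetical style fields and defaults; the real grammar has more.
type StyleKey = "marginStart" | "marginEnd" | "minWidth" | "maxWidth";
type Style = Partial<Record<StyleKey, number>>;

const FIELDS: StyleKey[] = ["marginStart", "marginEnd", "minWidth", "maxWidth"];
const DEFAULTS: Record<StyleKey, number | undefined> = {
  marginStart: 0,
  marginEnd: 0,
  minWidth: 0,
  maxWidth: undefined,
};

// One bit per field: set when the field is non-default and needs a live slot.
function foldSignature(style: Style): number {
  let sig = 0;
  for (let bit = 0; bit < FIELDS.length; bit++) {
    const key = FIELDS[bit]!;
    const value = style[key] ?? DEFAULTS[key];
    if (value !== DEFAULTS[key]) sig |= 1 << bit;
  }
  return sig;
}

// Fields whose bit is clear are folded to constants and get no dirty-flag slot.
function liveFields(sig: number): StyleKey[] {
  return FIELDS.filter((_, bit) => (sig & (1 << bit)) !== 0);
}

// A changed signature means the set of live fields changed: rebuild, don't patch.
function needsStructuralRebuild(before: Style, after: Style): boolean {
  return foldSignature(before) !== foldSignature(after);
}

console.log(liveFields(foldSignature({ marginEnd: 4, maxWidth: 80 }))); // [ 'marginEnd', 'maxWidth' ]
console.log(needsStructuralRebuild({}, { minWidth: 3 })); // true
console.log(needsStructuralRebuild({ minWidth: 3 }, { minWidth: 5 })); // false
```

In this sketch, changing a value that is already non-default keeps the signature stable, so only a default → non-default transition (or the reverse) forces a rebuild.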

Why pure TS over a native rewrite

I considered porting the engine to a language that compiles to WASM before doing the algorithmic work. Glad I didn't.

Yoga's C++ kernel is fast and well-tuned, but raw arithmetic speed wasn't the bottleneck on this workload. Its advantage was the structural-mutation algorithm: Yoga handled it natively, while the pure-TS engine was redoing too much work per mutation.

A native-compiled port of my engine would have inherited the same algorithmic shape and reached parity at best. The fix was algorithmic, and the algorithmic fix worked in TypeScript. "Pure TS is competitive with native code on this workload" is the genuinely interesting result.

Validation, including a same-day hotfix story

1,470 unit + integration tests pass
Structural-differential fuzzer green at 3,000 runs
33 Yoga oracle fixtures (cell-for-cell comparison)
Byte-identical cached-vs-cold differential mode at 833 runs

A small incident worth mentioning: within hours of publishing 2.0.0, the fast-check property fuzzer caught a real bug. createStyleDirtier was throwing on a node whose entire style had been folded out, a case my analysis said couldn't happen. 2.0.1 shipped the same day with the fix and a pinned regression test, and 2.0.0 was deprecated on npm with a pointer to 2.0.1.

Property-based fuzzing earns its keep. I had been on the fence about whether the fuzzer was worth maintaining; this answered it.
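For readers who haven't seen a cached-vs-cold differential check, here is a dependency-free sketch of the idea. It is not the project's fuzzer: a hand-rolled seeded PRNG stands in for fast-check, and coldLayout, CachedRow, and fuzz are toy names invented for this example. The point is the shape of the check: apply random mutations, then require the incremental result to match a from-scratch result after every step.

```typescript
// Small deterministic PRNG (mulberry32) so any failure can be replayed from its seed.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Toy stand-in for a cold layout pass: row positions from sizes, nothing more.
function coldLayout(sizes: number[]): number[] {
  const pos: number[] = [];
  let cursor = 0;
  for (const size of sizes) {
    pos.push(cursor);
    cursor += size;
  }
  return pos;
}

// Toy cached engine: keeps positions between frames, recomputes from the first dirty index.
class CachedRow {
  private sizes: number[] = [];
  private pos: number[] = [];
  private firstDirty = 0;

  get length(): number { return this.sizes.length; }
  snapshot(): number[] { return this.sizes.slice(); }
  append(size: number): void { this.sizes.push(size); this.markDirty(this.sizes.length - 1); }
  remove(i: number): void { this.sizes.splice(i, 1); this.markDirty(i); }
  setSize(i: number, size: number): void { this.sizes[i] = size; this.markDirty(i); }

  layout(): number[] {
    this.pos.length = this.sizes.length;
    for (let i = this.firstDirty; i < this.sizes.length; i++) {
      this.pos[i] = i === 0 ? 0 : this.pos[i - 1]! + this.sizes[i - 1]!;
    }
    this.firstDirty = this.sizes.length;
    return this.pos.slice();
  }

  private markDirty(i: number): void { this.firstDirty = Math.min(this.firstDirty, i); }
}

// Random appends, removes, and resizes; cached output must equal cold output every step.
function fuzz(seed: number, steps: number): number {
  const rand = mulberry32(seed);
  const row = new CachedRow();
  for (let step = 0; step < steps; step++) {
    const op = Math.floor(rand() * 3);
    const size = 1 + Math.floor(rand() * 20);
    if (op === 0 || row.length === 0) row.append(size);
    else if (op === 1) row.remove(Math.floor(rand() * row.length));
    else row.setSize(Math.floor(rand() * row.length), size);

    const cached = JSON.stringify(row.layout());
    const cold = JSON.stringify(coldLayout(row.snapshot()));
    if (cached !== cold) throw new Error(`seed ${seed}, step ${step}: ${cached} !== ${cold}`);
  }
  return steps;
}

for (let seed = 1; seed <= 5; seed++) fuzz(seed, 200);
```

Because the generator is seeded, a failing seed and step number are enough to reproduce a mismatch and pin it as a regression test.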

API stability

The public calculateLayout() is byte-identical between 1.x and 2.x. The SemVer-major bump reflects changes to internal APIs and memory characteristics:

Typed-array runtime (Field.id integer + array storage replacing Map<Field, X>)
LayoutPool grows without bound (FinalizationRegistry-based recycling was tried in phase 15C, caused a 2× regression, and was removed)
Per-property dirty bitmask replacing a single dirty boolean
Linear recurrence + fold default values (the algorithmic changes above)
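As a rough illustration of the first and third items, here is what integer field ids with typed-array storage and a per-property dirty bitmask can look like. This is a sketch under assumptions: FIELD, FIELD_COUNT, and NodeStore are names invented for the example, and the real runtime's layout may differ.

```typescript
// Hypothetical field ids; integer ids let typed arrays replace Map<Field, X>.
const FIELD = { width: 0, height: 1, marginStart: 2, marginEnd: 3 } as const;
type FieldId = (typeof FIELD)[keyof typeof FIELD];
const FIELD_COUNT = 4;

class NodeStore {
  private values: Float64Array; // nodeCount * FIELD_COUNT slots, indexed by node and field id
  private dirty: Uint32Array; // one bitmask per node, one bit per field

  constructor(nodeCount: number) {
    this.values = new Float64Array(nodeCount * FIELD_COUNT);
    this.dirty = new Uint32Array(nodeCount);
  }

  get(node: number, field: FieldId): number {
    return this.values[node * FIELD_COUNT + field]!;
  }

  set(node: number, field: FieldId, value: number): void {
    const slot = node * FIELD_COUNT + field;
    if (this.values[slot] === value) return; // unchanged writes stay clean
    this.values[slot] = value;
    this.dirty[node] = this.dirty[node]! | (1 << field);
  }

  // Unlike a single dirty boolean, the mask says which properties changed.
  isDirty(node: number, field: FieldId): boolean {
    return (this.dirty[node]! & (1 << field)) !== 0;
  }

  dirtyMask(node: number): number {
    return this.dirty[node]!;
  }

  clear(node: number): void {
    this.dirty[node] = 0;
  }
}

const store = new NodeStore(2);
store.set(1, FIELD.height, 5);
console.log(store.isDirty(1, FIELD.height)); // true
console.log(store.isDirty(1, FIELD.width)); // false
```

With a single boolean, any write would force every dependent of the node to be revisited. The mask lets a pass skip dependents of properties that did not change.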

If you're using only the documented public API, you upgrade and the speedup is transparent.

Try it

git clone https://github.com/pilatesjs/pilates
cd pilates
pnpm install
pnpm bench   # ~5 min

Or install the engine directly:

npm install @pilates/core

Full React stack (reconciler + widgets):

npm install @pilates/react @pilates/widgets react

Adversarial benchmarks are very welcome — if there's a workload where this approach breaks down, I'd genuinely like to find it. That's the most valuable feedback the project can get right now.

Repo (MIT): https://github.com/pilatesjs/pilates

npm: https://www.npmjs.com/package/@pilates/core