The frame was fast. The relayout was 65× too slow.

#performance #cpp #ui #graphics

I ported js-framework-benchmark's protocol onto Vel's core — no window, no GPU, no vsync, pure CPU — because I wanted to know if "compiles to native C++" actually translated into the layout numbers I was assuming it did. Each row is a color swatch, a flexible text label, and a button. I created 10,000 of them and measured.

The first number was great. The second number was embarrassing.

What the benchmark found

| rows  | build (tree) | layout (cold) | relayout (warm) |
|------:|-------------:|--------------:|----------------:|
|   100 |      0.03 ms |      0.04 ms  |       0.04 ms   |
|  1000 |      0.26 ms |      0.30 ms  |     →19.6 ms←   |
| 10000 |      1.99 ms |      2.22 ms  |     → 206 ms←   |

Build — constructing 10,000 widgets — costs 2ms. This is where the "native C++" thesis pays off bluntly: C++ allocation crushes JavaScript's createElement. Nothing to do here; it was already faster than the framework I was benchmarking against.

Now look at the last column. Relayout (warm) is the cost of laying out a tree that did not change: same widgets, same text, same constraints, the steady state you're in every frame while the user just moves the mouse. It cost 206 ms for 10k rows. A 60fps frame has a budget of about 16.7 ms, so that's not one frame; that's roughly twelve frames dropped to lay out a list that didn't move.

And the tell is in the comparison: warm relayout cost the same as the cold pass. Re-laying out an unchanged tree did the same amount of work as laying it out for the first time. That's the signature of zero memoization: every frame, from scratch, as if it had never seen this tree before.

The cost wasn't geometry — it was text

My assumption was that "layout is slow" meant the flexbox math was slow: constraint propagation, two passes, intrinsic sizing. I was wrong, and a profiler said so immediately. The 206ms was almost entirely in one place:

Text::measure → FreeTypeRasterizer::measureText, which walks each string codepoint by codepoint and asks FreeType for the advance width of every glyph via FT_Load_Glyph.

Per frame. For every label. Whether or not the text had changed.

FT_Load_Glyph is not free — it's loading and scaling a glyph outline to get its metrics. Doing it for every character of every visible string, 60 times a second, on text that is identical to last frame, is pure waste. The geometry math was a rounding error next to it. I'd been optimizing the wrong mental model of where layout time goes.
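To make the waste concrete, here is a minimal sketch of the uncached shape of that loop. It is not Vel's actual code: `loadGlyphAdvance`, `measureTextUncached`, and `layoutFrame` are hypothetical names, and the loader is a stub that returns a fake advance and counts calls where a real version would map the codepoint to a glyph index and call FT_Load_Glyph.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the expensive FreeType path. It returns a fake
// advance (half the pixel size) and counts how often it is called.
static std::size_t g_glyphLoads = 0;

float loadGlyphAdvance(char32_t codepoint, int pixelSize) {
    ++g_glyphLoads;
    (void)codepoint;  // the stub ignores which glyph was requested
    return 0.5f * static_cast<float>(pixelSize);
}

// Uncached measurement: one glyph load per codepoint, on every call.
float measureTextUncached(const std::u32string& text, int pixelSize) {
    float width = 0.0f;
    for (char32_t cp : text) width += loadGlyphAdvance(cp, pixelSize);
    return width;
}

// One "frame" of layout over a set of labels. Nothing is remembered between
// frames, so unchanged labels cost exactly as much as new ones.
float layoutFrame(const std::vector<std::u32string>& labels, int pixelSize) {
    float total = 0.0f;
    for (const auto& label : labels) total += measureTextUncached(label, pixelSize);
    return total;
}
```

With 1,000 identical nine-character labels, each call to `layoutFrame` performs 9,000 glyph loads, and the second frame performs 9,000 more even though nothing changed.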

Two caches, both valid forever

The fix is in engine/src/text/FreeTypeRasterizer.cpp — two process-lifetime caches:

Per-glyph advance cache, keyed on (face, pixelSize, codepoint). A glyph's advance width never changes for a given face and size, so the first time you measure an 'e' at 32px you load it; every 'e' after that is a hashmap hit. This makes even cold layout of varied text fast, because common characters are shared across every string.
Per-string width cache, keyed on (face, pixelSize, string). Relayout of unchanged text becomes a single lookup: measure the label once, and on every subsequent frame where the text is identical, measuring it again is one hash lookup. It's bounded to 200k entries so that live typing (which generates a new string every keystroke) can't grow it without limit.
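The two caches described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of FreeTypeRasterizer.cpp: `FaceId`, `GlyphLoader`, `TextMeasurer`, and the key and hash types are hypothetical names, the loader is injected so the sketch runs without FreeType, and the "stop inserting once full" policy is just one simple way to honor a fixed ceiling.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical identifiers: FaceId stands in for whatever identifies a loaded
// face; GlyphLoader stands in for the slow FreeType-backed advance lookup.
using FaceId = int;
using GlyphLoader = std::function<float(FaceId, int, char32_t)>;

// Fold one hash value into another (a common hash-combine mix).
inline void hashCombine(std::size_t& seed, std::size_t value) {
    seed ^= value + 0x9e3779b9u + (seed << 6) + (seed >> 2);
}

struct GlyphKey {
    FaceId face;
    int pixelSize;
    char32_t codepoint;
    bool operator==(const GlyphKey& o) const {
        return face == o.face && pixelSize == o.pixelSize && codepoint == o.codepoint;
    }
};

struct StringKey {
    FaceId face;
    int pixelSize;
    std::u32string text;
    bool operator==(const StringKey& o) const {
        return face == o.face && pixelSize == o.pixelSize && text == o.text;
    }
};

struct GlyphKeyHash {
    std::size_t operator()(const GlyphKey& k) const {
        std::size_t h = std::hash<int>{}(k.face);
        hashCombine(h, std::hash<int>{}(k.pixelSize));
        hashCombine(h, std::hash<char32_t>{}(k.codepoint));
        return h;
    }
};

struct StringKeyHash {
    std::size_t operator()(const StringKey& k) const {
        std::size_t h = std::hash<int>{}(k.face);
        hashCombine(h, std::hash<int>{}(k.pixelSize));
        hashCombine(h, std::hash<std::u32string>{}(k.text));
        return h;
    }
};

class TextMeasurer {
public:
    explicit TextMeasurer(GlyphLoader loader, std::size_t maxStrings = 200000)
        : loader_(std::move(loader)), maxStrings_(maxStrings) {}

    float measure(FaceId face, int pixelSize, const std::u32string& text) {
        StringKey key{face, pixelSize, text};
        auto hit = strings_.find(key);
        if (hit != strings_.end()) return hit->second;  // unchanged text: one lookup

        // Today a string's width is the sum of its glyph advances.
        float width = 0.0f;
        for (char32_t cp : text) width += advance(face, pixelSize, cp);

        // Assumed policy: a fixed ceiling with no eviction. Once the cache is
        // full, new strings are measured but not stored.
        if (strings_.size() < maxStrings_) strings_.emplace(std::move(key), width);
        return width;
    }

private:
    float advance(FaceId face, int pixelSize, char32_t cp) {
        GlyphKey key{face, pixelSize, cp};
        auto hit = glyphs_.find(key);
        if (hit != glyphs_.end()) return hit->second;
        float a = loader_(face, pixelSize, cp);  // slow path, taken once per key
        glyphs_.emplace(key, a);
        return a;
    }

    GlyphLoader loader_;
    std::size_t maxStrings_;
    std::unordered_map<GlyphKey, float, GlyphKeyHash> glyphs_;
    std::unordered_map<StringKey, float, StringKeyHash> strings_;
};
```

Measuring "eee" at 32px through this sketch calls the loader once, because the second and third 'e' hit the glyph cache. Measuring the same string again calls it zero times, because the string cache answers first.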

The key insight that makes this safe: these values are immutable. A glyph advance for a fixed face and pixel size is a constant of the universe; it will never be different. So there's no invalidation logic, no staleness, no cache-coherence problem — the hard part of caching simply doesn't exist here. You compute it once and trust it forever.

The result:

| rows  | relayout before | relayout after | speedup |
|------:|----------------:|---------------:|--------:|
|  1000 |         19.6 ms |        0.30 ms |    ~65× |
| 10000 |          206 ms |         2.2 ms |    ~93× |

A 10,000-row list now re-lays out in ~2 ms, comfortably inside the 16.7 ms budget of a 60fps frame. That's the budget Figma-class apps live in, and it was the difference between "compiles to native, therefore fast" being a slogan and being true.

The other half: don't lay out at all

Caching makes a frame cheap. The bigger win is not running the frame. Vel is damage-tracked: an atomic frameDirty flag, raised by any Widget::markDirty(), gates whether the next frame does anything. When nothing's changed, the app sits in glfwWaitEventsTimeout and uses ~0 CPU — no layout, no paint, no spin. Animating widgets re-arm the flag from their tick(); a static page just sleeps.
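A minimal sketch of that gate, assuming nothing about Vel's real classes beyond the frameDirty flag and markDirty() named above. `FrameGate`, `consumeDirty`, and `runOnce` are hypothetical names, and the blocking wait is only described in a comment so the sketch compiles without GLFW.

```cpp
#include <atomic>

// Hypothetical miniature of the dirty-flag gate.
class FrameGate {
public:
    // Called by anything whose state changed; safe to call from any thread.
    void markDirty() { frameDirty_.store(true, std::memory_order_release); }

    // Called once per loop iteration. Clears the flag and reports whether it
    // had been set since the last call.
    bool consumeDirty() { return frameDirty_.exchange(false, std::memory_order_acq_rel); }

private:
    std::atomic<bool> frameDirty_{true};  // start dirty so the first frame runs
};

// One iteration of the main loop. Returns true if a frame was produced.
template <typename LayoutAndPaint>
bool runOnce(FrameGate& gate, LayoutAndPaint&& layoutAndPaint) {
    if (!gate.consumeDirty()) {
        // Nothing changed. A real loop would block here (for example in
        // glfwWaitEventsTimeout) instead of spinning.
        return false;
    }
    layoutAndPaint();  // an animating widget may call gate.markDirty() again
    return true;
}
```

Clearing the flag before running the frame, rather than after, matters: a markDirty() raised during layout or paint survives and triggers the next frame, which is how an animation keeps itself alive.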

So the steady state is: idle costs nothing, and when something does change, the relayout it triggers is ~2ms instead of 206ms. Both halves matter. A fast frame you run 60 times a second on an idle app is still a battery fire.

What it costs

The string cache trades memory for time, bounded crudely. 200k entries is a fixed ceiling, not an LRU — it's a backstop against live-typing churn, not a tuned eviction policy. For pathological workloads (millions of unique strings) you'd want real eviction.
It's still a full tree walk. Relayout re-visits every node; it's just that each visit is now cheap. The honest next step is dirty-subtree layout — skipping subtrees whose constraints and content are unchanged — so a 100k-row tree doesn't pay even the cheap per-node cost. Today the whole tree is re-walked; it's fast enough that I haven't needed to, which is its own kind of answer.
It assumes advances are independent. The per-string cache works because today a string's width is the sum of its glyphs' advances. HarfBuzz shaping will break that assumption (kerning, ligatures, complex scripts) — the cache key still holds (same string, same width) but the per-glyph cache stops being sufficient on its own.
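For the dirty-subtree layout mentioned above, one possible shape is sketched below. This is not something Vel does today. `Node`, its fields, and the toy vertical-stack `layout` function are all hypothetical, chosen only to show the skip condition: a subtree is reused when it is clean and the incoming constraint matches the one its cached result was computed for.

```cpp
#include <memory>
#include <vector>

// Hypothetical node type for a toy vertical-stack layout.
struct Node {
    std::vector<std::unique_ptr<Node>> children;
    Node* parent = nullptr;
    float ownHeight = 0.0f;     // height contributed by this node itself
    float cachedHeight = 0.0f;  // subtree height from the last layout
    float cachedWidth = -1.0f;  // width constraint that cachedHeight was computed for
    bool dirty = true;

    Node* addChild(float height) {
        children.push_back(std::make_unique<Node>());
        Node* child = children.back().get();
        child->parent = this;
        child->ownHeight = height;
        markDirty();  // the parent's content changed
        return child;
    }

    // Mark this node and its ancestors dirty so layout knows to descend here.
    // Stopping at the first dirty node is safe because a dirty node's
    // ancestors are always dirty already.
    void markDirty() {
        for (Node* n = this; n != nullptr && !n->dirty; n = n->parent) n->dirty = true;
    }
};

// Returns the subtree height. `visited` counts nodes that were recomputed.
float layout(Node& node, float width, int& visited) {
    if (!node.dirty && node.cachedWidth == width) {
        return node.cachedHeight;  // unchanged content and constraint: skip subtree
    }
    ++visited;
    float height = node.ownHeight;
    for (auto& child : node.children) height += layout(*child, width, visited);
    node.cachedHeight = height;
    node.cachedWidth = width;
    node.dirty = false;
    return height;
}
```

In a flat 100-row list, changing one row recomputes two nodes (the row and the root) instead of 101. The root still loops over every child to ask whether it is clean, so deep trees benefit more than wide flat ones.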

The meta-lesson is the one I keep relearning: measure before you optimize, because your intuition about where the time goes is usually wrong. I'd have happily spent a week making the flex math faster and moved the 206ms to 204ms. The profiler pointed at text measurement in thirty seconds, and a cache with no invalidation logic (the easy kind) bought nearly two orders of magnitude: 65× at 1,000 rows and 93× at 10,000.