Dmitry Kryaklin
How I built the fastest color manipulation library in TypeScript and the optimization techniques I learned

Introduction

In 2025, I started building a color manipulation library called colordx. The frontend ecosystem is moving towards CSS Color 4: OKLCH, OKLab, Display-P3, Rec.2020. Most existing libraries were designed for the sRGB era and bolted modern color spaces on top. I wanted to build something that treats the modern stuff as a first-class citizen.

But the goal I cared about most was performance. Not just "faster than colord" fast. I wanted colordx to be the fastest option in the benchmarks I cared about, and I wanted to actually understand why.

This article is a short list of the optimization techniques that mattered the most. If you are working on a hot-path JavaScript library, I hope at least a few of these are useful.

Results first

| Benchmark | colordx | colord | culori | chroma-js | color |
| --- | --- | --- | --- | --- | --- |
| Parse HEX → toHsl | 38 ns | 99 ns | 151 ns | 294 ns | 382 ns |
| Parse HEX → lighten → toHex | 64 ns | 176 ns | 206 ns | 850 ns | 1010 ns |
| Mix two colors | 102 ns | 759 ns | 1230 ns | 870 ns | 1900 ns |
| Parse HEX → toOklch | 271 ns | 287 ns | 916 ns | 534 ns | |
| inGamutP3 | 202 ns | | 1030 ns | | |

Now let's get into how.

1. Keep one canonical internal representation

Every Colordx instance stores exactly one thing: an RgbColor object { r, g, b, a }. All conversions go through it.

The reason is V8 monomorphism. The class has a fixed shape, so V8 always sees the same two fields on every method call. A library that stores different color models in different instances ends up with polymorphic inline caches everywhere, and JIT performance drops.
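A minimal sketch of the idea (illustrative names, not the actual colordx source): every instance initializes the same fields, in the same order, on every construction path, so V8 assigns all instances a single hidden class and method call sites stay monomorphic.

```typescript
interface RgbColor { r: number; g: number; b: number; a: number }

class Color {
  private _rgb: RgbColor;
  private _valid: boolean;

  constructor(rgb: RgbColor, valid = true) {
    // Same fields, same order, on every path: one hidden class
    // for every instance V8 ever sees.
    this._rgb = rgb;
    this._valid = valid;
  }

  // Every other model is derived on demand from the canonical RGB.
  alpha(): number {
    return this._valid ? this._rgb.a : NaN;
  }
}
```

An HSL or OKLCH input would be converted to RGB once at parse time and stored in the same `_rgb` slot, so the getters never branch on the input model.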

2. Don't use Object.create to skip the constructor

This was the single biggest win. My first version used Object.create(Colordx.prototype) in the internal factory to skip parsing:

```typescript
private static _make(rgb: RgbColor): Colordx {
  const inst = Object.create(Colordx.prototype);
  inst._rgb = rgb;
  inst._valid = true;
  return inst;
}
```

It looks clean but it is a trap. ES2022 classes with field declarations have a specific V8 hidden class transition chain. Object.create bypasses the constructor, so the field initialization transitions never fire. The resulting instance has a different hidden class than one created with new Colordx(). V8 sees two shapes flowing into every hot method, ICs go polymorphic, performance dies.

Fix: use a sentinel symbol so the constructor can skip parsing while still going through the proper field transition chain.

```typescript
const _SENTINEL: unique symbol = Symbol();

constructor(input: AnyColor | typeof _SENTINEL, _direct?: RgbColor) {
  if (input === _SENTINEL) {
    this._valid = true;
    this._rgb = _direct!;
  } else { /* parse */ }
}

private static _make(rgb: RgbColor): Colordx {
  return new Colordx(_SENTINEL, rgb);
}
```

Around 330 ns → 270 ns on Parse HEX → toOklch. Just from how the object is constructed.

3. Precomputed lookup tables for hex output

toString(16).padStart(2, '0') allocates a string every call. Precompute all 256 possibilities:

```typescript
const HEX_BYTE = /* #__PURE__ */ Array.from(
  { length: 256 },
  (_, i) => i.toString(16).padStart(2, '0')
);
```

Three array lookups instead of three string allocations. Borrowed from color-bits.
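For illustration, a hypothetical rgbToHex built on top of the table (not necessarily the library's actual helper) is three O(1) lookups and one concatenation:

```typescript
// All 256 two-character byte strings, computed once at module load.
const HEX_BYTE: string[] = Array.from(
  { length: 256 },
  (_, i) => i.toString(16).padStart(2, "0")
);

// Hypothetical serializer: no per-call toString/padStart allocations.
const rgbToHex = (r: number, g: number, b: number): string =>
  "#" + HEX_BYTE[r] + HEX_BYTE[g] + HEX_BYTE[b];
```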

4. Bitwise hex parsing

parseInt('ff', 16) is slow because it is a general-purpose parser. Exploit the ASCII layout to decode a hex character with two integer ops:

```typescript
const hexNibble = (c: number): number => (c & 0xf) + 9 * (c >> 6);
```

Based on Lemire's technique.
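The trick works because '0'–'9' occupy char codes 0x30–0x39 (shifting right by 6 gives 0 and the low nibble is the digit), while 'a'–'f' and 'A'–'F' shift to 1 and have a low nibble that is 9 too small. A hypothetical #rrggbb parser built on it (no validation, assumes well-formed input) might look like:

```typescript
// Two integer ops per hex digit; valid only for [0-9a-fA-F] char codes.
const hexNibble = (c: number): number => (c & 0xf) + 9 * (c >> 6);

// Hypothetical parser for "#rrggbb" strings.
const parseHex6 = (s: string): { r: number; g: number; b: number } => ({
  r: (hexNibble(s.charCodeAt(1)) << 4) | hexNibble(s.charCodeAt(2)),
  g: (hexNibble(s.charCodeAt(3)) << 4) | hexNibble(s.charCodeAt(4)),
  b: (hexNibble(s.charCodeAt(5)) << 4) | hexNibble(s.charCodeAt(6)),
});
```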

5. Reuse a module-level buffer when callers always destructure

rgbToHslRaw is the hot path for lighten, darken, saturate, etc. Every call would allocate a fresh { h, s, l, a } object. But all internal callers immediately destructure the result, so there is no aliasing. So I reuse a single object:

```typescript
const _hslBuf: HslColor = { h: 0, s: 0, l: 0, a: 0 };

export const rgbToHslRaw = (rgb) => {
  // ...
  _hslBuf.h = hDeg;
  _hslBuf.s = clamp(s * 100, 0, 100);
  _hslBuf.l = clamp(l * 100, 0, 100);
  _hslBuf.a = clamp(round(a, 3), 0, 1);
  return _hslBuf;
};
```

This works only because the function is internal and I control all callers. I would not expose this pattern in a public API.
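Filled in with the textbook RGB→HSL formulas, a runnable version of the pattern looks roughly like this (a sketch: the real rgbToHslRaw takes an RgbColor and applies the library's own clamping and rounding):

```typescript
interface HslColor { h: number; s: number; l: number; a: number }

// One shared buffer, overwritten on every call.
const _hslBuf: HslColor = { h: 0, s: 0, l: 0, a: 0 };

const rgbToHslRaw = (r: number, g: number, b: number, a = 1): HslColor => {
  const rn = r / 255, gn = g / 255, bn = b / 255;
  const max = Math.max(rn, gn, bn), min = Math.min(rn, gn, bn);
  const d = max - min;
  const l = (max + min) / 2;
  let h = 0, s = 0;
  if (d !== 0) {
    s = d / (1 - Math.abs(2 * l - 1));
    if (max === rn) h = 60 * (((gn - bn) / d) % 6);
    else if (max === gn) h = 60 * ((bn - rn) / d + 2);
    else h = 60 * ((rn - gn) / d + 4);
    if (h < 0) h += 360;
  }
  // Callers must destructure immediately and never hold the reference.
  _hslBuf.h = h;
  _hslBuf.s = s * 100;
  _hslBuf.l = l * 100;
  _hslBuf.a = a;
  return _hslBuf;
};
```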

6. Avoid closure allocation by hoisting helpers to module level

If a helper function is defined inside another function, V8 creates a new closure object on every call. Hoist it to module level and it is allocated once.

```typescript
// at module level, not inside hslToRgb
const _hueToRgb = (p: number, q: number, t: number): number => { ... };
```
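Filled in with the standard hue helper, the pattern looks like this (textbook HSL→RGB math, not the library's exact source); _hueToRgb is allocated once at module load instead of once per hslToRgb call:

```typescript
// Hoisted to module level: created once, never reallocated per call.
const _hueToRgb = (p: number, q: number, t: number): number => {
  if (t < 0) t += 1;
  if (t > 1) t -= 1;
  if (t < 1 / 6) return p + (q - p) * 6 * t;
  if (t < 1 / 2) return q;
  if (t < 2 / 3) return p + (q - p) * (2 / 3 - t) * 6;
  return p;
};

// h in degrees, s/l in percent → r/g/b in 0..255
const hslToRgb = (h: number, s: number, l: number) => {
  const hn = h / 360, sn = s / 100, ln = l / 100;
  if (sn === 0) {
    const v = Math.round(ln * 255);
    return { r: v, g: v, b: v }; // achromatic
  }
  const q = ln < 0.5 ? ln * (1 + sn) : ln + sn - ln * sn;
  const p = 2 * ln - q;
  return {
    r: Math.round(_hueToRgb(p, q, hn + 1 / 3) * 255),
    g: Math.round(_hueToRgb(p, q, hn) * 255),
    b: Math.round(_hueToRgb(p, q, hn - 1 / 3) * 255),
  };
};
```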

7. Inline conversions to avoid intermediate object allocation

rgbToOklch used to call rgbToOklab and destructure the result. The intermediate OklabColor object is pure overhead. Inlining the math saves one allocation per call.

I usually hate duplicated code, but for short, well-tested math the allocation savings are real.
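The pattern in miniature, with deliberately simplified stand-in math (in colordx the intermediate is the OklabColor inside rgbToOklch):

```typescript
// Allocating version: builds an intermediate object only to destructure it.
const toPair = (x: number, y: number) => ({ a: x + y, b: x - y });
const chromaViaObject = (x: number, y: number): number => {
  const { a, b } = toPair(x, y); // one short-lived allocation per call
  return Math.hypot(a, b);
};

// Inlined version: the intermediates live in locals, zero allocation.
const chromaInlined = (x: number, y: number): number =>
  Math.hypot(x + y, x - y);
```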

8. Provide *Into siblings for per-pixel work

For 500×500 OKLCH gradient renders (250k pixels per frame), the natural API allocates 500k–1M short-lived 3-tuples per frame. Wall-clock cost is modest, but the GC pressure causes frame hitches during interactive drag.

So every channel function has a sibling that writes into a caller-provided buffer:

```typescript
export const oklabToLinearInto = (
  out: Float64Array | number[],
  l: number, a: number, b: number
): void => { /* writes out[0/1/2] */ };
```

On a 250k-pixel chained OKLCH→P3 bench, allocations drop from ~9 MB/iter to ~500 kB/iter. Wall-clock is only ~5% better, but interactive renders become visibly smoother.

I rejected the alternative of a shared module-level buffer (slightly faster in micro-bench, around 10%) because it is non-reentrant and a sharp edge in a public API. gl-matrix and three.js use the out-param pattern for the same reason.
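Filled in, such a function might look like this, using the OKLab → linear-sRGB matrices from Björn Ottosson's reference implementation (a sketch; the actual colordx source may differ in structure):

```typescript
// Writes linear-sRGB channels into a caller-provided buffer: no allocation.
const oklabToLinearInto = (
  out: Float64Array | number[],
  L: number, a: number, b: number
): void => {
  // OKLab → non-linear cone responses (inverse of the cbrt matrix)
  const l_ = L + 0.3963377774 * a + 0.2158037573 * b;
  const m_ = L - 0.1055613458 * a - 0.0638541728 * b;
  const s_ = L - 0.0894841775 * a - 1.2914855480 * b;
  // undo the cube root
  const l = l_ * l_ * l_, m = m_ * m_ * m_, s = s_ * s_ * s_;
  // LMS → linear sRGB
  out[0] = +4.0767416621 * l - 3.3077115913 * m + 0.2309699292 * s;
  out[1] = -1.2684380046 * l + 2.6097574011 * m - 0.3413193965 * s;
  out[2] = -0.0041960863 * l - 0.7034186147 * m + 1.7076147010 * s;
};
```

For the gradient case, the caller allocates one Float64Array(3) up front and reuses it for all 250k pixels per frame.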

9. DRY the data, not the structure

Once I had both oklabToLinear and oklabToLinearInto, the obvious refactor was to make the allocating version delegate to the *Into version. Looks great. Regressed the *Into path by ~20%.

The reason was V8 polymorphism. External callers pass a Float64Array, while the new wrapper passes a plain [number, number, number] array. The *Into call site went from monomorphic to polymorphic, and V8's speculative optimizations were disabled.

The compromise: keep the math duplicated, but extract the matrix coefficients into module-level consts.

```typescript
const M1_LR = 0.4122214708, M1_LG = 0.5363325363, M1_LB = 0.0514459929;
// ... 20+ named coefficients ...

export const linearSrgbToOklabInto = (out, lr, lg, lb) => {
  const lv = Math.cbrt(M1_LR * lr + M1_LG * lg + M1_LB * lb);
  // ...
};

export const linearSrgbToOklab = (lr, lg, lb) => {
  const lv = Math.cbrt(M1_LR * lr + M1_LG * lg + M1_LB * lb);
  // ...
};
```

V8 constant-folds module-level consts, so there is no runtime cost vs inline literals. One source of truth for the data, two monomorphic call sites.

The textbook DRY refactor was wrong here. Sometimes you DRY the data and duplicate the structure.

What didn't help

Equally important: things that looked like they should help but didn't. Save yourself the time.

  1. A 256-entry LUT for toLinear was slower on M4. The FP unit executes Math.pow(x, 2.4) fast enough that array lookup overhead is not worth it. Result is architecture-specific.
  2. Manually inlining toLinear inside rgbToOklch made things worse (~270 ns → ~530 ns). The function got too large for V8 to optimize the body as a single unit.
  3. Inlining normalizeHue as an expression instead of a function call: also slower. V8 optimizes named function call sites independently.

The pattern: V8 is smarter than you about inlining small functions. Trust it until you have a profile that says otherwise.

Lessons

The biggest wins came from understanding V8's hidden class model, not from clever algorithms. Monomorphism is a feature you preserve, not a thing you add later.

Allocations matter more than CPU time on hot paths in modern JavaScript. Wall-clock differences are often small, but GC pressure shows up as frame hitches and unpredictable latency.

DRY is a tool, not a rule. V8 cares about call site shape consistency more than your engineering aesthetics.

Always measure on the hardware you care about. The LUT result on M4 might be different on a Cortex-A53 phone or an older Intel laptop.

If you want to play with the library, there is a playground at colordx.dev, and the source is at github.com/dkryaklin/colordx.
