Dmitry Kryaklin
How I built the fastest color manipulation library in TypeScript and the optimization techniques I learned

Introduction

In 2025, I started building a color manipulation library called colordx. The frontend ecosystem is moving towards CSS Color 4: OKLCH, OKLab, Display-P3, Rec.2020. Most existing libraries were designed for the sRGB era and bolted modern color spaces on top. I wanted to build something that treats the modern stuff as a first-class citizen.

But the goal I cared about most was performance. Not just "faster than colord" fast. I wanted colordx to be the fastest option in the benchmarks I cared about, and I wanted to actually understand why.

This article is a short list of the optimization techniques that mattered the most. If you are working on a hot-path JavaScript library, I hope at least a few of these are useful.

Results first

| Benchmark | colordx | colord | culori | chroma-js | color |
| --- | --- | --- | --- | --- | --- |
| Parse HEX → toHsl | 38 ns | 99 ns | 151 ns | 294 ns | 382 ns |
| Parse HEX → lighten → toHex | 64 ns | 176 ns | 206 ns | 850 ns | 1010 ns |
| Mix two colors | 102 ns | 759 ns | 1230 ns | 870 ns | 1900 ns |
| Parse HEX → toOklch | 271 ns | 287 ns | 916 ns | 534 ns | |
| inGamutP3 | 202 ns | | 1030 ns | | |

Now let's get into how.

1. Keep one canonical internal representation

Every Colordx instance stores exactly one thing: an RgbColor object { r, g, b, a }. All conversions go through it.

The reason is V8 monomorphism. The class has a fixed shape, so V8 always sees the same two fields on every method call. A library that stores different color models in different instances ends up with polymorphic inline caches everywhere, and JIT performance drops.
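A minimal sketch of the idea (illustrative names, not the actual colordx source): every instance initializes the same fields, in the same order, on every construction path, so V8 assigns all instances a single hidden class and method call sites stay monomorphic.

```typescript
interface RgbColor { r: number; g: number; b: number; a: number }

class Color {
  private _rgb: RgbColor;
  private _valid: boolean;

  constructor(rgb: RgbColor, valid = true) {
    // Same fields, same order, on every path: one hidden class
    // for every instance V8 ever sees.
    this._rgb = rgb;
    this._valid = valid;
  }

  // Every other model is derived on demand from the canonical RGB.
  alpha(): number {
    return this._valid ? this._rgb.a : NaN;
  }
}
```

An HSL or OKLCH input would be converted to RGB once at parse time and stored in the same `_rgb` slot, so the getters never branch on the input model.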

2. Don't use Object.create to skip the constructor

This was the single biggest win. My first version used Object.create(Colordx.prototype) in the internal factory to skip parsing:

```typescript
private static _make(rgb: RgbColor): Colordx {
  const inst = Object.create(Colordx.prototype);
  inst._rgb = rgb;
  inst._valid = true;
  return inst;
}
```

It looks clean but it is a trap. ES2022 classes with field declarations have a specific V8 hidden class transition chain. Object.create bypasses the constructor, so the field initialization transitions never fire. The resulting instance has a different hidden class than one created with new Colordx(). V8 sees two shapes flowing into every hot method, ICs go polymorphic, performance dies.

Fix: use a sentinel symbol so the constructor can skip parsing while still going through the proper field transition chain.

```typescript
const _SENTINEL: unique symbol = Symbol();

constructor(input: AnyColor | typeof _SENTINEL, _direct?: RgbColor) {
  if (input === _SENTINEL) {
    this._valid = true;
    this._rgb = _direct!;
  } else { /* parse */ }
}

private static _make(rgb: RgbColor): Colordx {
  return new Colordx(_SENTINEL, rgb);
}
```

Around 330 ns → 270 ns on Parse HEX → toOklch. Just from how the object is constructed.

3. Precomputed lookup tables for hex output

toString(16).padStart(2, '0') allocates a string every call. Precompute all 256 possibilities:

```typescript
const HEX_BYTE = /* #__PURE__ */ Array.from(
  { length: 256 },
  (_, i) => i.toString(16).padStart(2, '0')
);
```

Three array lookups instead of three string allocations. Borrowed from color-bits.
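For illustration, a hypothetical rgbToHex built on top of the table (not necessarily the library's actual helper) is three O(1) lookups and one concatenation:

```typescript
// All 256 two-character byte strings, computed once at module load.
const HEX_BYTE: string[] = Array.from(
  { length: 256 },
  (_, i) => i.toString(16).padStart(2, "0")
);

// Hypothetical serializer: no per-call toString/padStart allocations.
const rgbToHex = (r: number, g: number, b: number): string =>
  "#" + HEX_BYTE[r] + HEX_BYTE[g] + HEX_BYTE[b];
```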

4. Bitwise hex parsing

parseInt('ff', 16) is slow because it is a general-purpose parser. Exploit the ASCII layout to decode a hex character with two integer ops:

```typescript
const hexNibble = (c: number): number => (c & 0xf) + 9 * (c >> 6);
```

Based on Lemire's technique.
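The trick works because '0'–'9' occupy char codes 0x30–0x39 (shifting right by 6 gives 0 and the low nibble is the digit), while 'a'–'f' and 'A'–'F' shift to 1 and have a low nibble that is 9 too small. A hypothetical #rrggbb parser built on it (no validation, assumes well-formed input) might look like:

```typescript
// Two integer ops per hex digit; valid only for [0-9a-fA-F] char codes.
const hexNibble = (c: number): number => (c & 0xf) + 9 * (c >> 6);

// Hypothetical parser for "#rrggbb" strings.
const parseHex6 = (s: string): { r: number; g: number; b: number } => ({
  r: (hexNibble(s.charCodeAt(1)) << 4) | hexNibble(s.charCodeAt(2)),
  g: (hexNibble(s.charCodeAt(3)) << 4) | hexNibble(s.charCodeAt(4)),
  b: (hexNibble(s.charCodeAt(5)) << 4) | hexNibble(s.charCodeAt(6)),
});
```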

5. Reuse a module-level buffer when callers always destructure

rgbToHslRaw is the hot path for lighten, darken, saturate, etc. Every call would allocate a fresh { h, s, l, a } object. But all internal callers immediately destructure the result, so there is no aliasing. So I reuse a single object:

```typescript
const _hslBuf: HslColor = { h: 0, s: 0, l: 0, a: 0 };

export const rgbToHslRaw = (rgb) => {
  // ...
  _hslBuf.h = hDeg;
  _hslBuf.s = clamp(s * 100, 0, 100);
  _hslBuf.l = clamp(l * 100, 0, 100);
  _hslBuf.a = clamp(round(a, 3), 0, 1);
  return _hslBuf;
};
```

This works only because the function is internal and I control all callers. I would not expose this pattern in a public API.
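Filled in with the textbook RGB→HSL formulas, a runnable version of the pattern looks roughly like this (a sketch: the real rgbToHslRaw takes an RgbColor and applies the library's own clamping and rounding):

```typescript
interface HslColor { h: number; s: number; l: number; a: number }

// One shared buffer, overwritten on every call.
const _hslBuf: HslColor = { h: 0, s: 0, l: 0, a: 0 };

const rgbToHslRaw = (r: number, g: number, b: number, a = 1): HslColor => {
  const rn = r / 255, gn = g / 255, bn = b / 255;
  const max = Math.max(rn, gn, bn), min = Math.min(rn, gn, bn);
  const d = max - min;
  const l = (max + min) / 2;
  let h = 0, s = 0;
  if (d !== 0) {
    s = d / (1 - Math.abs(2 * l - 1));
    if (max === rn) h = 60 * (((gn - bn) / d) % 6);
    else if (max === gn) h = 60 * ((bn - rn) / d + 2);
    else h = 60 * ((rn - gn) / d + 4);
    if (h < 0) h += 360;
  }
  // Callers must destructure immediately and never hold the reference.
  _hslBuf.h = h;
  _hslBuf.s = s * 100;
  _hslBuf.l = l * 100;
  _hslBuf.a = a;
  return _hslBuf;
};
```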

6. Avoid closure allocation by hoisting helpers to module level

If a helper function is defined inside another function, V8 creates a new closure object on every call. Hoist it to module level and it is allocated once.

```typescript
// at module level, not inside hslToRgb
const _hueToRgb = (p: number, q: number, t: number): number => { ... };
```
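Filled in with the standard hue helper, the pattern looks like this (textbook HSL→RGB math, not the library's exact source); _hueToRgb is allocated once at module load instead of once per hslToRgb call:

```typescript
// Hoisted to module level: created once, never reallocated per call.
const _hueToRgb = (p: number, q: number, t: number): number => {
  if (t < 0) t += 1;
  if (t > 1) t -= 1;
  if (t < 1 / 6) return p + (q - p) * 6 * t;
  if (t < 1 / 2) return q;
  if (t < 2 / 3) return p + (q - p) * (2 / 3 - t) * 6;
  return p;
};

// h in degrees, s/l in percent → r/g/b in 0..255
const hslToRgb = (h: number, s: number, l: number) => {
  const hn = h / 360, sn = s / 100, ln = l / 100;
  if (sn === 0) {
    const v = Math.round(ln * 255);
    return { r: v, g: v, b: v }; // achromatic
  }
  const q = ln < 0.5 ? ln * (1 + sn) : ln + sn - ln * sn;
  const p = 2 * ln - q;
  return {
    r: Math.round(_hueToRgb(p, q, hn + 1 / 3) * 255),
    g: Math.round(_hueToRgb(p, q, hn) * 255),
    b: Math.round(_hueToRgb(p, q, hn - 1 / 3) * 255),
  };
};
```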

7. Inline conversions to avoid intermediate object allocation

rgbToOklch used to call rgbToOklab and destructure the result. The intermediate OklabColor object is pure overhead. Inlining the math saves one allocation per call.

I usually hate duplicated code, but for short, well-tested math the allocation savings are real.
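The pattern in miniature, with deliberately simplified stand-in math (in colordx the intermediate is the OklabColor inside rgbToOklch):

```typescript
// Allocating version: builds an intermediate object only to destructure it.
const toPair = (x: number, y: number) => ({ a: x + y, b: x - y });
const chromaViaObject = (x: number, y: number): number => {
  const { a, b } = toPair(x, y); // one short-lived allocation per call
  return Math.hypot(a, b);
};

// Inlined version: the intermediates live in locals, zero allocation.
const chromaInlined = (x: number, y: number): number =>
  Math.hypot(x + y, x - y);
```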

8. Provide *Into siblings for per-pixel work

For 500×500 OKLCH gradient renders (250k pixels per frame), the natural API allocates 500k–1M short-lived 3-tuples per frame. Wall-clock cost is modest, but the GC pressure causes frame hitches during interactive drag.

So every channel function has a sibling that writes into a caller-provided buffer:

```typescript
export const oklabToLinearInto = (
  out: Float64Array | number[],
  l: number, a: number, b: number
): void => { /* writes out[0/1/2] */ };
```

On a 250k-pixel chained OKLCH→P3 bench, allocations drop from ~9 MB/iter to ~500 kB/iter. Wall-clock is only ~5% better, but interactive renders become visibly smoother.

I rejected the alternative of a shared module-level buffer (slightly faster in micro-bench, around 10%) because it is non-reentrant and a sharp edge in a public API. gl-matrix and three.js use the out-param pattern for the same reason.
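Filled in, such a function might look like this, using the OKLab → linear-sRGB matrices from Björn Ottosson's reference implementation (a sketch; the actual colordx source may differ in structure):

```typescript
// Writes linear-sRGB channels into a caller-provided buffer: no allocation.
const oklabToLinearInto = (
  out: Float64Array | number[],
  L: number, a: number, b: number
): void => {
  // OKLab → non-linear cone responses (inverse of the cbrt matrix)
  const l_ = L + 0.3963377774 * a + 0.2158037573 * b;
  const m_ = L - 0.1055613458 * a - 0.0638541728 * b;
  const s_ = L - 0.0894841775 * a - 1.2914855480 * b;
  // undo the cube root
  const l = l_ * l_ * l_, m = m_ * m_ * m_, s = s_ * s_ * s_;
  // LMS → linear sRGB
  out[0] = +4.0767416621 * l - 3.3077115913 * m + 0.2309699292 * s;
  out[1] = -1.2684380046 * l + 2.6097574011 * m - 0.3413193965 * s;
  out[2] = -0.0041960863 * l - 0.7034186147 * m + 1.7076147010 * s;
};
```

For the gradient case, the caller allocates one Float64Array(3) up front and reuses it for all 250k pixels per frame.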

9. DRY the data, not the structure

Once I had both oklabToLinear and oklabToLinearInto, the obvious refactor was to make the allocating version delegate to the *Into version. Looks great. Regressed the *Into path by ~20%.

The reason was V8 polymorphism. External callers pass a Float64Array, while the new wrapper passes a plain [number, number, number] array. The *Into call site went from monomorphic to polymorphic, and V8's speculative optimizations were disabled.

The compromise: keep the math duplicated, but extract the matrix coefficients into module-level consts.

```typescript
const M1_LR = 0.4122214708, M1_LG = 0.5363325363, M1_LB = 0.0514459929;
// ... 20+ named coefficients ...

export const linearSrgbToOklabInto = (out, lr, lg, lb) => {
  const lv = Math.cbrt(M1_LR * lr + M1_LG * lg + M1_LB * lb);
  // ...
};

export const linearSrgbToOklab = (lr, lg, lb) => {
  const lv = Math.cbrt(M1_LR * lr + M1_LG * lg + M1_LB * lb);
  // ...
};
```

V8 constant-folds module-level consts, so there is no runtime cost vs inline literals. One source of truth for the data, two monomorphic call sites.

The textbook DRY refactor was wrong here. Sometimes you DRY the data and duplicate the structure.

What didn't help

Equally important: things that looked like they should help but didn't. Save yourself the time.

  1. A 256-entry LUT for toLinear was slower on M4. The FP unit executes Math.pow(x, 2.4) fast enough that array lookup overhead is not worth it. Result is architecture-specific.
  2. Manually inlining toLinear inside rgbToOklch made things worse (~270 ns → ~530 ns). The function got too large for V8 to optimize the body as a single unit.
  3. Inlining normalizeHue as an expression instead of a function call: also slower. V8 optimizes named function call sites independently.

The pattern: V8 is smarter than you about inlining small functions. Trust it until you have a profile that says otherwise.

Lessons

The biggest wins came from understanding V8's hidden class model, not from clever algorithms. Monomorphism is a feature you preserve, not a thing you add later.

Allocations matter more than CPU time on hot paths in modern JavaScript. Wall-clock differences are often small, but GC pressure shows up as frame hitches and unpredictable latency.

DRY is a tool, not a rule. V8 cares about call site shape consistency more than your engineering aesthetics.

Always measure on the hardware you care about. The LUT result on M4 might be different on a Cortex-A53 phone or an older Intel laptop.

If you want to play with the library, there is a playground at colordx.dev, and the source is at github.com/dkryaklin/colordx.
