WASM SIMD by example: 16 RGB pixels to grayscale per instruction

#webassembly #performance #deno #typescript

I finally understood WASM SIMD by writing a real kernel: RGB → grayscale luma, 16 pixels per instruction instead of one at a time. Same math, ~4× faster on a single core. This is the condensed version — the full write-up has every op desugared into plain TypeScript, lane-by-lane ASCII diagrams, and an optional detour down to the carry wires.

Spoiler first, theory after. One file, nothing to install besides Deno — it compiles its own AssemblyScript at startup:

🪄 code A → code B, runnable and benchmarked (click to spoil yourself)

// luma_bench.ts, fully self-contained: the AssemblyScript compiles itself on the fly.
// Run: deno bench -A luma_bench.ts
import asc from "npm:assemblyscript/asc";

// ---------- code A: the loop anyone would write ----------

function rgbToLumaNaive(rgb: Uint8Array, out: Uint8Array): void {
  const pixelCount = out.length;
  for (let i = 0; i < pixelCount; i++) {
    const r = rgb[i * 3];
    const g = rgb[i * 3 + 1];
    const b = rgb[i * 3 + 2];
    out[i] = Math.round(0.2126 * r + 0.7152 * g + 0.0722 * b);
  }
}

// ---------- code A.5: same loop, float math swapped for Q15 integers ----------

function rgbToLumaScalar(rgb: Uint8Array, out: Uint8Array): void {
  const pixelCount = out.length;
  for (let i = 0; i < pixelCount; i++) {
    const r = rgb[i * 3];
    const g = rgb[i * 3 + 1];
    const b = rgb[i * 3 + 2];
    out[i] = (6966 * r + 23436 * g + 2366 * b + 0x4000) >> 15;
  }
}

// ---------- code B: the same math, spoken in 16-wide lane language ----------

const source = `
export function luma(inPtr: usize, outPtr: usize, pixels: i32): void {
  const redWeight = i16x8.splat(6966);  // 0.2126 in Q15
  const greenWeight = i16x8.splat(23436); // 0.7152 in Q15
  const blueWeight = i16x8.splat(2366);  // 0.0722 in Q15
  const q15Half = i32x4.splat(0x4000);

  for (let i = 0; i + 16 <= pixels; i += 16) {
    const chunkPtr = inPtr + <usize>(i * 3);

    // load 48 interleaved bytes, de-interleave into R/G/B planes
    const chunk0 = v128.load(chunkPtr);
    const chunk1 = v128.load(chunkPtr, 16);
    const chunk2 = v128.load(chunkPtr, 32);
    const redPartial = i8x16.shuffle(chunk0, chunk1, 0,3,6,9,12,15,18,21,24,27,30,0,0,0,0,0);
    const redPlane = i8x16.shuffle(redPartial, chunk2, 0,1,2,3,4,5,6,7,8,9,10,17,20,23,26,29);
    const greenPartial = i8x16.shuffle(chunk0, chunk1, 1,4,7,10,13,16,19,22,25,28,31,0,0,0,0,0);
    const greenPlane = i8x16.shuffle(greenPartial, chunk2, 0,1,2,3,4,5,6,7,8,9,10,18,21,24,27,30);
    const bluePartial = i8x16.shuffle(chunk0, chunk1, 2,5,8,11,14,17,20,23,26,29,0,0,0,0,0,0);
    const bluePlane = i8x16.shuffle(bluePartial, chunk2, 0,1,2,3,4,5,6,7,8,9,16,19,22,25,28,31);

    // widen u8 -> u16
    const redLow = i16x8.extend_low_i8x16_u(redPlane);
    const redHigh = i16x8.extend_high_i8x16_u(redPlane);
    const greenLow = i16x8.extend_low_i8x16_u(greenPlane);
    const greenHigh = i16x8.extend_high_i8x16_u(greenPlane);
    const blueLow = i16x8.extend_low_i8x16_u(bluePlane);
    const blueHigh = i16x8.extend_high_i8x16_u(bluePlane);

    // multiply + accumulate in Q15, 4 pixels per accumulator
    let luma0to3 = i32x4.extmul_low_i16x8_u(redLow, redWeight);
    luma0to3 = i32x4.add(luma0to3, i32x4.extmul_low_i16x8_u(greenLow, greenWeight));
    luma0to3 = i32x4.add(luma0to3, i32x4.extmul_low_i16x8_u(blueLow, blueWeight));
    let luma4to7 = i32x4.extmul_high_i16x8_u(redLow, redWeight);
    luma4to7 = i32x4.add(luma4to7, i32x4.extmul_high_i16x8_u(greenLow, greenWeight));
    luma4to7 = i32x4.add(luma4to7, i32x4.extmul_high_i16x8_u(blueLow, blueWeight));
    let luma8to11 = i32x4.extmul_low_i16x8_u(redHigh, redWeight);
    luma8to11 = i32x4.add(luma8to11, i32x4.extmul_low_i16x8_u(greenHigh, greenWeight));
    luma8to11 = i32x4.add(luma8to11, i32x4.extmul_low_i16x8_u(blueHigh, blueWeight));
    let luma12to15 = i32x4.extmul_high_i16x8_u(redHigh, redWeight);
    luma12to15 = i32x4.add(luma12to15, i32x4.extmul_high_i16x8_u(greenHigh, greenWeight));
    luma12to15 = i32x4.add(luma12to15, i32x4.extmul_high_i16x8_u(blueHigh, blueWeight));

    // round, shift out of Q15, narrow i32 -> i16 -> u8, store 16 results at once
    luma0to3 = i32x4.shr_s(i32x4.add(luma0to3, q15Half), 15);
    luma4to7 = i32x4.shr_s(i32x4.add(luma4to7, q15Half), 15);
    luma8to11 = i32x4.shr_s(i32x4.add(luma8to11, q15Half), 15);
    luma12to15 = i32x4.shr_s(i32x4.add(luma12to15, q15Half), 15);
    const luma0to7 = i16x8.narrow_i32x4_s(luma0to3, luma4to7);
    const luma8to15 = i16x8.narrow_i32x4_s(luma8to11, luma12to15);
    v128.store(outPtr + <usize>i, i8x16.narrow_i16x8_u(luma0to7, luma8to15));
  }
}
`;

// compile in memory and import the bytes as a module, no files involved
const { error, binary, stderr } = await asc.compileString(source, {
  optimizeLevel: 3,
  runtime: "stub",
  enable: ["simd"],
});
if (error) throw new Error(stderr.toString());

const { memory, luma } = await import(
  `data:application/wasm;base64,${binary!.toBase64()}`
) as { memory: WebAssembly.Memory; luma: (i: number, o: number, n: number) => void };

// ---------- one 12-megapixel photo (4000x3000) ----------

const N = 4000 * 3000;
const IN = 65536, OUT = IN + N * 3;
memory.grow(Math.ceil((OUT + N) / 65536)); // stub runtime starts at 0 pages

const rgbWasm = new Uint8Array(memory.buffer, IN, N * 3); // views AFTER grow (grow detaches)
for (let i = 0; i < rgbWasm.length; i++) rgbWasm[i] = (i * 2654435761) >>> 24;

const rgbJs = rgbWasm.slice(); // the JS versions get their own plain copy, fair is fair
const outJs = new Uint8Array(N);

// ---------- the receipts ----------

Deno.bench("code A: naive float", () => {
  rgbToLumaNaive(rgbJs, outJs);
});

Deno.bench("code A.5: scalar Q15", () => {
  rgbToLumaScalar(rgbJs, outJs);
});

Deno.bench("code B: WASM SIMD", () => {
  luma(IN, OUT, N);
});

The receipts, M2 Pro, one 12-megapixel photo:

benchmark	time/iter	speedup
code A: naive float	23.3 ms	1×
code A.5: scalar Q15	19.1 ms	~1.2×
code B: WASM SIMD	5.7 ms	~4.1×

(The integer trick alone is worth ~20% before any SIMD.)

The one idea behind all of it

A v128 is just 128 bits = 16 bytes. The type prefix on an op decides how you slice those bytes into "lanes":

i8x16:  16 lanes ×  8-bit   [b0][b1]...[b15]
i16x8:   8 lanes × 16-bit   [ h0 ][ h1 ]...[ h7 ]
i32x4:   4 lanes × 32-bit   [   w0   ]...[   w3   ]

"Lane-wise" = the op runs on every lane in parallel. One instruction, N results. That's the whole win.

The problem

Grayscale (Rec.709): $L = 0.2126R + 0.7152G + 0.0722B$ , one output byte per pixel. Every pixel is independent, so it should vectorize perfectly. The friction is purely mechanical: pixels arrive interleaved (RGBRGBRGB…) but SIMD wants each channel contiguous, and bytes are too narrow to multiply without overflowing.

Ditch the floats first: Q15

Store fractions in plain integers: scale by $2^{15} = 32768$ and remember the decimal point lives 15 bits from the right — the binary version of "store dollars as cents":

\begin{aligned} \lfloor 0.2126 \cdot 2^{15} \rceil &= 6966 \ \lfloor 0.7152 \cdot 2^{15} \rceil &= 23436 \ \lfloor 0.0722 \cdot 2^{15} \rceil &= 2366 \end{aligned}

And $6966 + 23436 + 2366 = 32768 = 2^{15}$ exactly — deliberate, so dividing by the total weight is an exact shift and pure white comes out as exactly 255, no bias. Multiply by the constant, add half ( $2^{14}$ = 0x4000) to round, shift right by 15. Bonus: integer math is bit-deterministic across platforms — my float version drifted ±1 between machines, the integer one never does.

out[i] = (6966 * r + 23436 * g + 2366 * b + 0x4000) >> 15;

That line is the entire kernel. The rest is plumbing to make it happen 16 pixels at a time.

The kernel in four moves

1. De-interleave. Three loads grab 48 bytes = 16 RGB pixels. i8x16.shuffle picks each output byte from any of 32 input bytes — "every 3rd byte" turns RGBRGB… into planar RRRR… / GGGG… / BBBB…:

const chunk0 = v128.load(chunkPtr);
const chunk1 = v128.load(chunkPtr, 16);
const chunk2 = v128.load(chunkPtr, 32);
const redPartial = i8x16.shuffle(chunk0, chunk1, 0,3,6,9,12,15, 18,21,24,27,30, 0,0,0,0,0);
const redPlane   = i8x16.shuffle(redPartial, chunk2, 0,1,2,3,4,5,6,7,8,9,10, 17,20,23,26,29);
// redPlane = [R0 R1 ... R15] — same trick for green and blue

2. Widen. 255 × 23436 doesn't fit in a byte, so zero-extend u8 → u16:

const redLow  = i16x8.extend_low_i8x16_u(redPlane);  // pixels 0..7  as u16
const redHigh = i16x8.extend_high_i8x16_u(redPlane); // pixels 8..15 as u16

3. Multiply-accumulate in Q15. splat broadcasts a constant to all lanes; extmul multiplies 8×u16 pairs into 4×i32 products (it has to split low/high — 8 wide products are 256 bits and a register holds 128):

const redWeight = i16x8.splat(6966);
let luma0to3 = i32x4.extmul_low_i16x8_u(redLow, redWeight);
luma0to3 = i32x4.add(luma0to3, i32x4.extmul_low_i16x8_u(greenLow, greenWeight));
luma0to3 = i32x4.add(luma0to3, i32x4.extmul_low_i16x8_u(blueLow, blueWeight));
// max value ≈ 23 bits — can't overflow i32, and summing before rounding
// means exactly one rounding error per pixel

4. Round, narrow, store. Add half, shift out of Q15, then squeeze i32 → i16 → u8 and write all 16 grays with one store. The narrow ops saturate (clamp, not wrap) — that's why they're real ops and not bit-casts:

luma0to3 = i32x4.shr_s(i32x4.add(luma0to3, q15Half), 15);
const luma0to7  = i16x8.narrow_i32x4_s(luma0to3,  luma4to7);
const luma8to15 = i16x8.narrow_i32x4_s(luma8to11, luma12to15);
v128.store(outPtr, i8x16.narrow_i16x8_u(luma0to7, luma8to15)); // 16 px, one store

Every op above is desugared into a plain TS loop (with lane diagrams) in the full version — if you can read for (let lane = 0; lane < 8; lane++), you can read SIMD.

Why it's fast

The lanes aren't "connected" by clever hardware — they're deliberately disconnected. The vector ALU is one wide slab of adders, and the lane prefix just decides where the carry chain gets cut. All 16 byte-adds physically exist side by side and fire in the same cycle. The other half of the win: one instruction fetch/decode/retire now buys 16 results, so the CPU front-end does 1/16th of the work. WASM v128 compiles ~1:1 to NEON/SSE, so none of it is lost in translation. (Full post has the carry-wire detour, down to the AND gate.)

Gotchas

Pass --enable simd to the AssemblyScript compiler or v128 won't compile.
Lane width must match the op — i32x4.add on bytes silently treats 4 bytes as one int.
The loop condition i + 16 <= pixels silently skips the last pixels % 16 pixels — no crash, no error, just stale output bytes. Real code needs a scalar tail loop; mine dodges it because my image dimensions are multiples of 16.
Integer (Q15) math made the kernel bit-exact against its TS reference. Floats wouldn't.

Isn't this just a tiny GPU?

Kind of, yeah. A GPU is the same lane trick scaled until it's the whole chip — a warp is 32 lanes in lock-step, and the shader programming model exists so you write the scalar loop while the hardware runs the lanes. The catch is that the GPU is in another building: pixels have to commute through buffer uploads and a dispatch queue, and those costs don't care that your math is three multiplies per pixel. For one cheap pass over data already in your hands, SIMD on the CPU is hard to argue with.

Condensed from the canonical version, which has the per-op TS desugarings, ASCII lane diagrams, and the silicon detour. Origin story: dithering photos for an e-ink display, 64 ms → 16 ms.