Using SIMD in WebAssembly (Part 1)

#webassembly #javascript #wat

Overview of SIMD in WebAssembly

SIMD in WebAssembly means the same thing as on CPUs: Single Instruction, Multiple Data. A SIMD instruction applies the same operation to several data elements at once, which is what makes vectorized computation possible. Compute-intensive workloads such as audio/video processing, codecs, and image processing use SIMD for performance gains. SIMD ultimately depends on the CPU hardware, and different architectures offer different SIMD capabilities. WebAssembly's SIMD instruction set is relatively conservative: it is currently limited to fixed-width 128-bit (16-byte) vectors.

Most mainstream virtual machines now support SIMD:

Chrome ≥ 91 (May 2021)
Firefox ≥ 89 (June 2021)
Safari ≥ 16.4 (March 2023)
Node.js ≥ 16.4 (June 2021)

Before using SIMD, check client support in your user base, then implement progressive enhancement in your project. This means:

Create two versions of the same wasm module: one with SIMD instructions and one without
Detect host support for SIMD using libraries like wasm-feature-detect
Load the appropriate module based on detection results

wasm-feature-detect tests support for wasm features (including SIMD, 64-bit memory, and threads) and is tree-shakable, so only the detectors you import end up in your bundle.

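// loadWasmModule.js
import { simd } from 'wasm-feature-detect';

// Resolves to a loader function: calling it starts the dynamic import of the
// SIMD build when the host supports SIMD, and of the fallback build otherwise.
export default function (url, simdUrl) {
  return simd().then(isSupported =>
    isSupported ? () => import(simdUrl) : () => import(url)
  );
}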
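
To make the detection step less of a black box, here is a minimal dependency-free sketch. It assumes that validating a tiny hand-assembled module containing SIMD instructions is an adequate probe. The helper names `hasSimd` and `pickModuleUrl` and the two file names are hypothetical, and in a real project wasm-feature-detect is likely the simpler choice.

```javascript
// A tiny module with one function that returns a v128. Its body is
// i32.const 0; i8x16.splat; i8x16.popcnt, so it only validates on hosts
// that understand the SIMD instructions.
const SIMD_PROBE = new Uint8Array([
  0, 97, 115, 109, 1, 0, 0, 0, // "\0asm" magic number + version 1
  1, 5, 1, 96, 0, 1, 123,      // type section: () -> v128
  3, 2, 1, 0,                  // function section: one function of type 0
  10, 10, 1, 8, 0,             // code section: one body, no locals
  65, 0,                       // i32.const 0
  253, 15,                     // i8x16.splat
  253, 98,                     // i8x16.popcnt
  11,                          // end
]);

// True when the host accepts the probe module, false otherwise.
function hasSimd() {
  return typeof WebAssembly === 'object' && WebAssembly.validate(SIMD_PROBE);
}

// Chooses between the two builds of the same module.
function pickModuleUrl(simdSupported, url, simdUrl) {
  return simdSupported ? simdUrl : url;
}

// Hypothetical file names, for illustration only.
console.log(pickModuleUrl(hasSimd(), 'invert.wasm', 'invert.simd.wasm'));
```

`WebAssembly.validate` is synchronous and does not compile or run anything, so a probe like this is cheap to call once at startup.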

SIMD Instruction Set

SIMD instructions resemble scalar operations but process vectors. Key categories include arithmetic, load/store, logical operations, and lane manipulation. Summary of common instructions:

Instruction Format	Description	Example
Load/Store
`v128.load offset=<n> align=<m>`	Load 128-bit vector from memory	`(v128.load offset=0 align=16 (i32.const 0))`
`v128.load8_splat`	Load 8-bit integer and splat to 16 lanes	`(v128.load8_splat (i32.const 42))`
`v128.store offset=<n> align=<m>`	Store 128-bit vector to memory	`(v128.store offset=16 align=16 (i32.const 32) (local.get $vec))`
Constants
`v128.const <type> <values>`	Create constant vector	`(v128.const i32x4 0 1 2 3)`
Integer Arithmetic
`i8x16.add(a, b)`	8-bit integer addition (16 lanes)	`(i8x16.add (local.get $a) (local.get $b))`
`i16x8.sub(a, b)`	16-bit integer subtraction (8 lanes)	`(i16x8.sub (local.get $a) (local.get $b))`
`i8x16.add_sat_s(a, b)`	8-bit signed saturating addition	`(i8x16.add_sat_s (local.get $a) (local.get $b))`
Integer Comparison
`i8x16.eq(a, b)`	8-bit integer equality (returns mask)	`(i8x16.eq (local.get $a) (local.get $b))`
`i32x4.lt_s(a, b)`	32-bit signed integer less-than	`(i32x4.lt_s (local.get $a) (local.get $b))`
Floating Point
`f32x4.add(a, b)`	32-bit float addition (4 lanes)	`(f32x4.add (local.get $a) (local.get $b))`
`f64x2.sqrt(a)`	64-bit float square root (2 lanes)	`(f64x2.sqrt (local.get $a))`
Bitwise
`v128.and(a, b)`	Bitwise AND	`(v128.and (local.get $a) (local.get $b))`
`v128.bitselect(a, b, mask)`	Bitwise selection: bits of a where mask is 1, bits of b where mask is 0	`(v128.bitselect (local.get $a) (local.get $b) (local.get $mask))`
Shifts
`i32x4.shl(a, count)`	32-bit integer left shift (shift count is an i32 operand)	`(i32x4.shl (local.get $a) (i32.const 2))`
Lane Operations
`i8x16.extract_lane_s(idx, a)`	Extract signed 8-bit lane	`(i8x16.extract_lane_s 3 (local.get $a))`
`i8x16.shuffle(mask, a, b)`	Shuffle lanes from two vectors	`(i8x16.shuffle 0 1 2 3 12 13 14 15... (local.get $a) (local.get $b))`
Type Conversion
`i32x4.trunc_sat_f32x4_s(a)`	f32 to i32 (saturated truncation)	`(i32x4.trunc_sat_f32x4_s (local.get $a))`
Other
`v128.any_true(a)`	Check if any lane is non-zero	`(v128.any_true (local.get $a))`
`f32x4.ceil(a)`	32-bit float ceiling	`(f32x4.ceil (local.get $a))`

Instruction set summarized with DeepSeek assistance. Please report any inaccuracies.
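
To make the lane-wise semantics more concrete, the sketch below models a `v128` as 16 bytes in JavaScript and imitates three instructions from the table. The helper names are hypothetical, and the code is only an illustration of the semantics, not a polyfill.

```javascript
// Interpret a byte (0..255) as a signed 8-bit lane (-128..127).
const toSigned = (x) => (x << 24) >> 24;

// i8x16.sub: per-lane subtraction that wraps around modulo 256.
function i8x16Sub(a, b) {
  return Uint8Array.from(a, (x, i) => (x - b[i]) & 0xff);
}

// i8x16.add_sat_s: per-lane signed addition clamped to -128..127.
// Storing the result in a Uint8Array keeps its two's-complement byte.
function i8x16AddSatS(a, b) {
  return Uint8Array.from(a, (x, i) =>
    Math.max(-128, Math.min(127, toSigned(x) + toSigned(b[i]))));
}

// v128.bitselect(a, b, mask): bits of a where mask is 1, bits of b where 0.
function v128Bitselect(a, b, mask) {
  return Uint8Array.from(a, (x, i) => (x & mask[i]) | (b[i] & ~mask[i] & 0xff));
}

const full255 = new Uint8Array(16).fill(255);
const tens = new Uint8Array(16).fill(10);
console.log(i8x16Sub(full255, tens)[0]); // 245
console.log(i8x16AddSatS(new Uint8Array(16).fill(100), new Uint8Array(16).fill(100))[0]); // 127
```

The same operation runs on all 16 lanes in a single instruction, which is where the speedup over a byte-by-byte loop comes from.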

Using SIMD Instructions

Example: Image color inversion

Non-SIMD implementation processes one pixel (4 bytes) per iteration:

(module
  (import "env" "log" (func $log (param i32)))

  (import "env" "memory" (memory 100))

  ;; invert RGB in place, skip Alpha
  (func $invert (param $start i32) (param $length i32)
    (local $end i32)   
    (local $i i32)    

    ;; Calculate end address = start + length * 4
    local.get $start
    (i32.mul (local.get $length) (i32.const 4))
    i32.add
    local.set $end

    local.get $start
    local.set $i

    (block $exit
      ;; Process R, G, B channels individually
      (loop $loop

        local.get $i
        local.get $end
        i32.ge_u
        br_if $exit


        ;; R
        local.get $i
        i32.const 255
        local.get $i
        i32.load8_u     
        i32.sub          
        i32.store8      

        ;; G
        local.get $i
        i32.const 1
        i32.add
        i32.const 255
        local.get $i
        i32.const 1
        i32.add
        i32.load8_u     
        i32.sub          
        i32.store8       

        ;; B
        local.get $i
        i32.const 2
        i32.add
        i32.const 255
        local.get $i
        i32.const 2
        i32.add
        i32.load8_u     
        i32.sub          
        i32.store8       

        ;; i = i + 4
        local.get $i
        i32.const 4
        i32.add
        local.set $i

        br $loop
      )
    )
  )

  (export "invert" (func $invert))
)
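
The article does not show the host side, so here is one hedged sketch of how JavaScript could drive `invert`. It assumes the export takes a byte offset and a pixel count, as the WAT above suggests. `invertReference` and `invertPixels` are hypothetical helpers, and a JavaScript stand-in replaces the real `instance.exports.invert` so the snippet runs without a compiled module.

```javascript
// Reference implementation that mirrors the scalar WAT loop:
// invert R, G and B in place and leave alpha untouched.
function invertReference(bytes, start, length) {
  const end = start + length * 4;
  for (let i = start; i < end; i += 4) {
    bytes[i] = 255 - bytes[i];
    bytes[i + 1] = 255 - bytes[i + 1];
    bytes[i + 2] = 255 - bytes[i + 2];
  }
}

// Copies RGBA pixels into linear memory, calls invert, and copies them back.
function invertPixels(invert, memory, rgba, start = 0) {
  const view = new Uint8Array(memory.buffer);
  view.set(rgba, start);
  invert(start, rgba.length / 4); // second argument is a pixel count
  return view.slice(start, start + rgba.length);
}

// 100 pages matches the (memory 100) import declared by the module.
const memory = new WebAssembly.Memory({ initial: 100 });

// Stand-in for instance.exports.invert, used here only for demonstration.
const fakeInvert = (start, length) =>
  invertReference(new Uint8Array(memory.buffer), start, length);

const out = invertPixels(fakeInvert, memory, Uint8Array.from([0, 128, 255, 77]));
console.log(Array.from(out)); // [255, 127, 0, 77]
```

With a real module, the `env` import object would carry this `memory` plus a `log` function, and `fakeInvert` would be replaced by the instance's exported `invert`. The reference function is also handy for checking that the scalar and SIMD builds produce identical bytes.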

SIMD version processes 4 pixels (16 bytes) per iteration:

(module
  (import "env" "log" (func $log (param i32)))
  (import "env" "memory" (memory 100))

  (func $invert (param $start i32) (param $length i32)
    (local $end i32)        
    (local $i i32)          
    (local $chunk v128)     
    (local $mask v128)     
    (local $full255 v128)  

    ;; end = start + length * 4 + 3 (padded; see the note after this listing)
    local.get $start
    local.get $length
    i32.const 4
    i32.mul

    i32.add
    i32.const 3
    i32.add
    local.set $end

    ;; i = start
    local.get $start
    local.set $i

    ;; Full 255 vector
    v128.const i8x16 255 255 255 255 255 255 255 255
                     255 255 255 255 255 255 255 255
    local.set $full255

    ;; Alpha channel mask (preserve positions 3,7,11,15)
    v128.const i8x16 0 0 0 255 0 0 0 255
                     0 0 0 255 0 0 0 255
    local.set $mask

    (block $exit
      (loop $loop
        ;; if (i >= end) break
        local.get $i
        local.get $end
        i32.ge_u
        br_if $exit

        ;; load 16 bytes (4 pixels)
        local.get $i
        v128.load
        local.set $chunk

        ;; chunk = 255 - chunk, per byte (alpha bytes are inverted too for now)
        local.get $full255
        local.get $chunk
        i8x16.sub
        local.set $chunk

        ;; Restore alpha: original bytes where mask is 255, inverted bytes elsewhere
        local.get $i
        v128.load
        local.get $chunk
        local.get $mask
        v128.bitselect
        local.set $chunk

        ;; store back
        local.get $i
        local.get $chunk
        v128.store

        ;; i += 16
        local.get $i
        i32.const 16
        i32.add
        local.set $i

        br $loop
      )
    )
  )

  (export "invert" (func $invert))
)

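Note: The SIMD version processes 16 bytes per iteration, so when the image's byte length is not a multiple of 16, the last iteration reads and writes past the end of the pixel data. The code also pads the end address by 3 bytes (lines 18-20 of the listing), which means that even an image whose byte length is an exact multiple of 16 gets one extra 16-byte chunk processed. Either way, the overrun can overwrite other data stored right after the image. That is acceptable in this isolated example, but real code should reserve padding after the image or handle the tail separately.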
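
To put numbers on the overrun, the sketch below computes how many bytes the SIMD loop touches for a given pixel count, following the loop bounds in the listing (`end = start + length * 4 + 3`, step 16). The function names are hypothetical.

```javascript
const CHUNK = 16; // bytes handled per loop iteration

// Bytes read and written by the SIMD loop for a given pixel count.
function simdBytesTouched(pixelCount) {
  const span = pixelCount * 4 + 3; // end - start as computed in the WAT
  return Math.ceil(span / CHUNK) * CHUNK;
}

// Bytes touched beyond the end of the actual pixel data.
function simdOverrun(pixelCount) {
  return simdBytesTouched(pixelCount) - pixelCount * 4;
}

console.log(simdOverrun(1));         // 12
console.log(simdOverrun(4));         // 16
console.log(simdOverrun(928 * 927)); // 16
```

Reserving at least `simdBytesTouched(pixelCount)` bytes for the image region is one way to keep the overrun away from unrelated data.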

Performance Comparison:

Left: Original image (928×927 pixels)
Middle: Non-SIMD result (processing time: ~2.9ms)
Right: SIMD result (processing time: 0.5ms)
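
The article does not say how these timings were taken. One hypothetical way to collect comparable numbers is to run each version several times and take the median, as in the sketch below. `medianTimeMs` is an assumed helper name, and the workload shown is a stand-in rather than the wasm call.

```javascript
// Runs fn several times and returns the median duration in milliseconds.
function medianTimeMs(fn, runs = 20) {
  const samples = [];
  for (let r = 0; r < runs; r++) {
    const t0 = performance.now();
    fn();
    samples.push(performance.now() - t0);
  }
  samples.sort((a, b) => a - b);
  return samples[Math.floor(samples.length / 2)];
}

// Stand-in workload: invert RGB of a 256x256 RGBA buffer in plain JavaScript.
const pixels = new Uint8Array(256 * 256 * 4);
const ms = medianTimeMs(() => {
  for (let i = 0; i < pixels.length; i += 4) {
    pixels[i] = 255 - pixels[i];
    pixels[i + 1] = 255 - pixels[i + 1];
    pixels[i + 2] = 255 - pixels[i + 2];
  }
});
console.log(`median: ${ms.toFixed(3)} ms`);
```

Inverting twice restores the original pixels, so timing repeated runs on the same buffer does not change the amount of work per run.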

The SIMD implementation is roughly 6x faster here (~2.9 ms vs. 0.5 ms). Larger images yield greater benefits, but even smaller images like the classic Lenna test image show significant improvements: