DEV Community

Yangholmes

Using SIMD in WebAssembly (Part 1)

Overview of SIMD in WebAssembly

SIMD in WebAssembly means the same thing as on CPUs: Single Instruction, Multiple Data. A SIMD instruction performs the same operation on multiple data elements at once, which enables vectorized computation. Compute-intensive applications such as audio/video processing, codecs, and image processing use SIMD for performance gains. SIMD is implemented in CPU hardware, and different architectures offer different SIMD capabilities. WebAssembly's SIMD instruction set is relatively conservative: it is currently limited to fixed-width 128-bit (16-byte) vectors.
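
To make the 128-bit width concrete, the sketch below views one 16-byte buffer through several typed arrays, mirroring how the same vector can be read as 16, 8, 4, or 2 lanes. It is plain JavaScript for illustration only and executes no SIMD instructions; the helper name laneCounts is hypothetical.

```javascript
// A 128-bit vector is 16 bytes. The instruction prefix (i8x16, i16x8, ...)
// decides how those bytes are split into lanes; typed-array views model that.
const bytes = new ArrayBuffer(16); // one 128-bit vector

function laneCounts(buffer) {
  return {
    i8x16: new Int8Array(buffer).length,
    i16x8: new Int16Array(buffer).length,
    i32x4: new Int32Array(buffer).length,
    f32x4: new Float32Array(buffer).length,
    f64x2: new Float64Array(buffer).length,
  };
}

// Every view shares the same 16 bytes, so a write through one view
// is visible through the others.
new Int32Array(bytes)[0] = 0x01020304;
const firstFourBytes = Array.from(new Uint8Array(bytes, 0, 4));

console.log(laneCounts(bytes));
console.log(firstFourBytes); // byte order depends on host endianness
```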

Most mainstream browsers and runtimes now support WebAssembly SIMD:

  • Chrome ≥ 91 (May 2021)
  • Firefox ≥ 89 (June 2021)
  • Safari ≥ 16.4 (March 2023)
  • Node.js ≥ 16.4 (June 2021)

Before using SIMD, check client support in your user base, then implement progressive enhancement in your project. This means:

  1. Create two versions of the same wasm module: one with SIMD instructions and one without
  2. Detect host support for SIMD using libraries like wasm-feature-detect
  3. Load the appropriate module based on detection results

wasm-feature-detect tests whether the host supports individual wasm features (including SIMD, 64-bit memory, and threads). It is tree-shakable, so only the detectors you import end up in your bundle.

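// loadWasmModule.js
import { simd } from 'wasm-feature-detect';

// Resolves to the SIMD build when the host supports SIMD, otherwise to the
// fallback build. Assumes the bundler can resolve dynamic imports of both URLs.
export default async function loadWasmModule(url, simdUrl) {
  const isSupported = await simd();
  return import(isSupported ? simdUrl : url);
}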
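
As a rough illustration of what a SIMD detector does, the sketch below asks the engine to validate a tiny module whose body uses SIMD opcodes; an engine without SIMD support should reject it. The byte sequence is hand-assembled for this example, and hasSimd and pickModuleUrl are hypothetical names. In a real project, prefer the library.

```javascript
// Minimal module: (func (result v128) i32.const 0  i8x16.splat  i8x16.popcnt)
const SIMD_PROBE = new Uint8Array([
  0, 97, 115, 109, 1, 0, 0, 0, // "\0asm" magic, version 1
  1, 5, 1, 96, 0, 1, 123,      // type section: () -> v128
  3, 2, 1, 0,                  // function section: one function of type 0
  10, 10, 1, 8, 0,             // code section: one body, no locals
  65, 0,                       // i32.const 0
  253, 15,                     // i8x16.splat
  253, 98,                     // i8x16.popcnt
  11,                          // end
]);

// True when the engine accepts a module that contains SIMD instructions.
function hasSimd() {
  return typeof WebAssembly === 'object' && WebAssembly.validate(SIMD_PROBE);
}

// Chooses which build to load; the third parameter allows overriding detection.
function pickModuleUrl(url, simdUrl, simdSupported = hasSimd()) {
  return simdSupported ? simdUrl : url;
}

console.log(pickModuleUrl('invert.wasm', 'invert.simd.wasm'));
```

The file names are placeholders. Validation only checks that the engine understands the opcodes; it does not run any code.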

SIMD Instruction Set

SIMD instructions resemble scalar operations but process vectors. Key categories include arithmetic, load/store, logical operations, and lane manipulation. Summary of common instructions:

| Category | Instruction format | Description | Example |
|---|---|---|---|
| Load/Store | v128.load offset=<n> align=<m> | Load a 128-bit vector from memory | (v128.load offset=0 align=16 (i32.const 0)) |
| Load/Store | v128.load8_splat | Load one 8-bit integer from memory and copy it to all 16 lanes | (v128.load8_splat (i32.const 42)) |
| Load/Store | v128.store offset=<n> align=<m> | Store a 128-bit vector to memory | (v128.store offset=16 align=16 (i32.const 32) (local.get $vec)) |
| Constants | v128.const <type> <values> | Create a constant vector | (v128.const i32x4 0 1 2 3) |
| Integer Arithmetic | i8x16.add(a, b) | 8-bit integer addition (16 lanes) | (i8x16.add (local.get $a) (local.get $b)) |
| Integer Arithmetic | i16x8.sub(a, b) | 16-bit integer subtraction (8 lanes) | (i16x8.sub (local.get $a) (local.get $b)) |
| Integer Arithmetic | i8x16.add_sat_s(a, b) | 8-bit signed saturating addition (named add_saturate_s in early drafts of the proposal) | (i8x16.add_sat_s (local.get $a) (local.get $b)) |
| Integer Comparison | i8x16.eq(a, b) | 8-bit integer equality; each lane of the result is all ones if equal, otherwise 0 | (i8x16.eq (local.get $a) (local.get $b)) |
| Integer Comparison | i32x4.lt_s(a, b) | 32-bit signed integer less-than | (i32x4.lt_s (local.get $a) (local.get $b)) |
| Floating Point | f32x4.add(a, b) | 32-bit float addition (4 lanes) | (f32x4.add (local.get $a) (local.get $b)) |
| Floating Point | f64x2.sqrt(a) | 64-bit float square root (2 lanes) | (f64x2.sqrt (local.get $a)) |
| Bitwise | v128.and(a, b) | Bitwise AND | (v128.and (local.get $a) (local.get $b)) |
| Bitwise | v128.bitselect(a, b, mask) | Take bits from a where the mask bit is 1, otherwise from b | (v128.bitselect (local.get $a) (local.get $b) (local.get $mask)) |
| Shifts | i32x4.shl(a, count) | 32-bit integer left shift; the count is a scalar i32 operand applied to every lane | (i32x4.shl (local.get $a) (i32.const 2)) |
| Lane Operations | i8x16.extract_lane_s(idx, a) | Extract a signed 8-bit lane (idx is an immediate) | (i8x16.extract_lane_s 3 (local.get $a)) |
| Lane Operations | i8x16.shuffle(mask, a, b) | Shuffle lanes from two vectors using 16 immediate lane indices | (i8x16.shuffle 0 1 2 3 12 13 14 15... (local.get $a) (local.get $b)) |
| Type Conversion | i32x4.trunc_sat_f32x4_s(a) | f32 to i32 (saturating truncation) | (i32x4.trunc_sat_f32x4_s (local.get $a)) |
| Other | v128.any_true(a) | Return 1 (as i32) if any bit of the vector is set | (v128.any_true (local.get $a)) |
| Other | f32x4.ceil(a) | 32-bit float ceiling | (f32x4.ceil (local.get $a)) |

Instruction set summarized with DeepSeek assistance. Please report any inaccuracies.
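
To illustrate what a few of these operations compute per lane, here is a scalar JavaScript model of saturating signed addition (i8x16.add_sat_s in the finalized SIMD specification), i8x16.eq, and v128.bitselect. The function names are hypothetical, and the code emulates the semantics on 16-element typed arrays rather than running SIMD.

```javascript
// Saturating signed add: clamp each lane to [-128, 127] instead of wrapping.
function addSatS(a, b) {
  return Int8Array.from(a, (x, i) => Math.max(-128, Math.min(127, x + b[i])));
}

// Lane equality: a lane becomes all ones (0xFF) when equal, otherwise 0.
function eq(a, b) {
  return Uint8Array.from(a, (x, i) => (x === b[i] ? 0xff : 0x00));
}

// bitselect(v1, v2, mask): bits of v1 where mask is 1, bits of v2 elsewhere.
function bitselect(v1, v2, mask) {
  return Uint8Array.from(v1, (x, i) => (x & mask[i]) | (v2[i] & ~mask[i] & 0xff));
}

const hundred = new Int8Array(16).fill(100);
console.log(addSatS(hundred, hundred)[0]); // → 127 (a wrapping add would give -56)

const v1 = new Uint8Array(16).fill(0xaa);
const v2 = new Uint8Array(16).fill(0x55);
const mask = new Uint8Array(16).fill(0x0f);
console.log(bitselect(v1, v2, mask)[0].toString(16)); // → "5a"
console.log(eq(v1, v2)[0], eq(v1, v1)[0]);            // → 0 255
```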

Using SIMD Instructions

Example: Image color inversion

The non-SIMD implementation processes one pixel (4 bytes) per iteration:

(module
  (import "env" "log" (func $log (param i32)))

  (import "env" "memory" (memory 100))

  ;; invert RGB in place, skip Alpha
  (func $invert (param $start i32) (param $length i32)
    (local $end i32)   
    (local $i i32)    

    ;; Calculate end address = start + length * 4
    local.get $start
    (i32.mul (local.get $length) (i32.const 4))
    i32.add
    local.set $end

    local.get $start
    local.set $i

    (block $exit
      ;; Process R, G, B channels individually
      (loop $loop

        local.get $i
        local.get $end
        i32.ge_u
        br_if $exit


        ;; R
        local.get $i
        i32.const 255
        local.get $i
        i32.load8_u     
        i32.sub          
        i32.store8      

        ;; G
        local.get $i
        i32.const 1
        i32.add
        i32.const 255
        local.get $i
        i32.const 1
        i32.add
        i32.load8_u     
        i32.sub          
        i32.store8       

        ;; B
        local.get $i
        i32.const 2
        i32.add
        i32.const 255
        local.get $i
        i32.const 2
        i32.add
        i32.load8_u     
        i32.sub          
        i32.store8       

        ;; i = i + 4
        local.get $i
        i32.const 4
        i32.add
        local.set $i

        br $loop
      )
    )
  )

  (export "invert" (func $invert))
)
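
For checking results on the host side, the same loop can be written as a JavaScript reference implementation. This is a sketch for comparison only; invertScalar is a hypothetical name, and it operates on an RGBA byte array such as the data field of a canvas ImageData.

```javascript
// Inverts R, G, and B in place and leaves alpha untouched,
// one pixel (4 bytes) per iteration, like the scalar wasm loop.
function invertScalar(rgba) {
  for (let i = 0; i < rgba.length; i += 4) {
    rgba[i] = 255 - rgba[i];         // R
    rgba[i + 1] = 255 - rgba[i + 1]; // G
    rgba[i + 2] = 255 - rgba[i + 2]; // B
    // rgba[i + 3] is alpha and is skipped
  }
  return rgba;
}

const pixels = new Uint8ClampedArray([10, 20, 30, 255, 0, 128, 255, 64]);
console.log(Array.from(invertScalar(pixels)));
// → [245, 235, 225, 255, 255, 127, 0, 64]
```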

The SIMD version processes 4 pixels (16 bytes) per iteration:

(module
  (import "env" "log" (func $log (param i32)))
  (import "env" "memory" (memory 100))

  (func $invert (param $start i32) (param $length i32)
    (local $end i32)        
    (local $i i32)          
    (local $chunk v128)     
    (local $mask v128)     
    (local $full255 v128)  

    ;; end = start + length * 4 + 3
    local.get $start
    local.get $length
    i32.const 4
    i32.mul

    i32.add
    i32.const 3
    i32.add
    local.set $end

    ;; i = start
    local.get $start
    local.set $i

    ;; Full 255 vector
    v128.const i8x16 255 255 255 255 255 255 255 255
                     255 255 255 255 255 255 255 255
    local.set $full255

    ;; Alpha channel mask (preserve positions 3,7,11,15)
    v128.const i8x16 0 0 0 255 0 0 0 255
                     0 0 0 255 0 0 0 255
    local.set $mask

    (block $exit
      (loop $loop
        ;; if (i >= end) break
        local.get $i
        local.get $end
        i32.ge_u
        br_if $exit

        ;; load 16 bytes (4 pixels)
        local.get $i
        v128.load
        local.set $chunk

        ;; tmp = 255 - chunk
        local.get $full255
        local.get $chunk
        i8x16.sub
        local.set $chunk

        ;; Preserve alpha channels
        local.get $i
        v128.load
        local.get $chunk
        local.get $mask
        v128.bitselect
        local.set $chunk

        ;; store back
        local.get $i
        local.get $chunk
        v128.store

        ;; i += 16
        local.get $i
        i32.const 16
        i32.add
        local.set $i

        br $loop
      )
    )
  )

  (export "invert" (func $invert))
)
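
A minimal host-side sketch of how either module might be driven: copy the pixels into the imported memory, call invert with a byte offset and a pixel count, and copy the result back. invertWithModule is a hypothetical helper, and fakeInvert stands in for the wasm export so the sketch runs without a compiled module.

```javascript
// Copies RGBA bytes into linear memory, runs invert, and copies them back.
function invertWithModule(invert, memory, rgba, start = 0) {
  const heap = new Uint8Array(memory.buffer);
  heap.set(rgba, start);                               // pixels -> linear memory
  invert(start, rgba.length / 4);                      // (byte offset, pixel count)
  rgba.set(heap.subarray(start, start + rgba.length)); // result -> caller's array
  return rgba;
}

// 100 pages of 64 KiB, matching (memory 100) in the listings.
const memory = new WebAssembly.Memory({ initial: 100 });

// Stand-in for instance.exports.invert (assumption: same signature).
const fakeInvert = (start, pixelCount) => {
  const heap = new Uint8Array(memory.buffer);
  for (let i = start; i < start + pixelCount * 4; i += 4) {
    heap[i] = 255 - heap[i];
    heap[i + 1] = 255 - heap[i + 1];
    heap[i + 2] = 255 - heap[i + 2];
  }
};

const out = invertWithModule(fakeInvert, memory, new Uint8ClampedArray([1, 2, 3, 4]));
console.log(Array.from(out)); // → [254, 253, 252, 4]
```

With a compiled module, the import object has to provide env.memory and env.log to match the imports declared in the listings, for example { env: { memory, log: console.log } } passed to WebAssembly.instantiate.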

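Note: The SIMD version processes 16 bytes per iteration. Because the image's byte length may not be a multiple of 16, lines 18-20 of the listing add 3 to the end address. As a result, the final iteration can read and write up to 16 bytes past the end of the image: when the byte length is not a multiple of 16, the last chunk extends beyond the image, and when it is an exact multiple, the extra 3 bytes trigger one additional full chunk. This overwrites whatever follows the image in memory (or traps if the access leaves linear memory), which is acceptable only in an isolated example like this one. Production code should pad the buffer or process the tail with scalar code.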
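
The loop bounds described in the note can be modeled in a few lines to see how many bytes the SIMD loop actually touches. This is a sketch of the arithmetic only; simdLoopFootprint and paddedByteLength are hypothetical helpers, not part of the module.

```javascript
// Models the SIMD loop: end = start + pixelCount * 4 + 3, i advances by 16.
function simdLoopFootprint(pixelCount, start = 0) {
  const end = start + pixelCount * 4 + 3;
  let iterations = 0;
  for (let i = start; i < end; i += 16) iterations++;
  const bytesTouched = iterations * 16;
  return { iterations, bytesTouched, overrun: bytesTouched - pixelCount * 4 };
}

// Smallest multiple of 16 that holds the image bytes.
function paddedByteLength(pixelCount) {
  return Math.ceil((pixelCount * 4) / 16) * 16;
}

console.log(simdLoopFootprint(5)); // → { iterations: 2, bytesTouched: 32, overrun: 12 }
console.log(simdLoopFootprint(4)); // → { iterations: 2, bytesTouched: 32, overrun: 16 }
console.log(paddedByteLength(5));  // → 32
```

Under these bounds, reserving paddedByteLength(pixelCount) + 16 bytes for the image keeps every access inside the reserved region, including the case where the byte length is already a multiple of 16.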

Performance Comparison:

[Image: Performance comparison]

  • Left: Original image (928×927 pixels)
  • Middle: Non-SIMD result (processing time: ~2.9ms)
  • Right: SIMD result (processing time: 0.5ms)
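
Timings of this kind can be collected with a small harness like the one below. It is a sketch of one reasonable approach (median of repeated runs using performance.now()), not necessarily how the numbers above were measured; medianTimeMs is a hypothetical helper.

```javascript
// Runs fn several times and returns the median wall-clock time in milliseconds.
// The median is less sensitive to one-off pauses (JIT warm-up, GC) than the mean.
function medianTimeMs(fn, runs = 21) {
  const samples = [];
  for (let r = 0; r < runs; r++) {
    const t0 = performance.now();
    fn();
    samples.push(performance.now() - t0);
  }
  samples.sort((a, b) => a - b);
  return samples[Math.floor(samples.length / 2)];
}

// Example workload: a scalar inversion over a 256x256 RGBA buffer.
const buffer = new Uint8Array(256 * 256 * 4);
const elapsed = medianTimeMs(() => {
  for (let i = 0; i < buffer.length; i += 4) {
    buffer[i] = 255 - buffer[i];
    buffer[i + 1] = 255 - buffer[i + 1];
    buffer[i + 2] = 255 - buffer[i + 2];
  }
});
console.log(`median: ${elapsed.toFixed(3)} ms`);
```

To compare the two modules, pass a function that calls the wasm invert export on the same image data for each build.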

The SIMD implementation shows a roughly 6x speedup (2.9 ms vs. 0.5 ms). Larger images yield greater benefits, but even smaller images such as the classic Lenna test image show significant improvements:

[Image: Lenna image comparison]

Next

Part 2 will explore using SIMD in WebAssembly via C/C++ programs.
