Overview of SIMD in WebAssembly
SIMD in WebAssembly has the same meaning as in CPUs: Single Instruction Multiple Data. SIMD instructions achieve parallel data processing by performing the same operation on multiple data elements simultaneously, enabling vectorized computation. Compute-intensive applications like audio/video processing, codecs, and image processing leverage SIMD for performance gains. SIMD implementation depends on CPU hardware, and different architectures support varying SIMD capabilities. WebAssembly's SIMD instruction set is relatively conservative, currently limited to fixed-length 128-bit (16-byte) instructions.
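To make the 128-bit model concrete, here is a small JavaScript sketch (plain JS, not wasm) that views one 16-byte buffer as different lane shapes and emulates a lane-wise add. The helper name `addLanesI8x16` is made up for illustration. A real engine performs the whole loop as a single instruction.

```javascript
// One v128 value is just 16 bytes; the instruction decides how to slice it into lanes.
const v128 = new ArrayBuffer(16);
const asI8x16 = new Uint8Array(v128);   // 16 lanes of 8 bits
const asI32x4 = new Int32Array(v128);   // 4 lanes of 32 bits
const asF64x2 = new Float64Array(v128); // 2 lanes of 64 bits

// Hypothetical helper: emulates a wrapping 8-bit add on each of the 16 lanes.
function addLanesI8x16(a, b) {
  const out = new Uint8Array(16);
  for (let lane = 0; lane < 16; lane++) {
    out[lane] = (a[lane] + b[lane]) & 0xff; // wraps modulo 256
  }
  return out;
}

const a = Uint8Array.from({ length: 16 }, (_, i) => i); // 0, 1, ..., 15
const b = new Uint8Array(16).fill(250);
const sum = addLanesI8x16(a, b);
```

Lanes 0 to 5 of `sum` hold 250 to 255, and lanes 6 onward wrap around to 0, 1, 2 and so on, because each lane is computed independently of its neighbours.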
Most mainstream virtual machines now support SIMD:
- Chrome ≥ 91 (May 2021)
- Firefox ≥ 89 (June 2021)
- Safari ≥ 16.4 (March 2023)
- Node.js ≥ 16.4 (June 2021)
Before using SIMD, check client support in your user base, then implement progressive enhancement in your project. This means:
- Create two versions of the same wasm module: one with SIMD instructions and one without
- Detect host support for SIMD using libraries like wasm-feature-detect
- Load the appropriate module based on detection results
wasm-feature-detect tests for support of individual wasm features (including SIMD, 64-bit memory, and threads). It is tree-shakable, so only the detectors you import end up in your bundle.
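// loadWasmModule.js
import { simd } from 'wasm-feature-detect';

// Resolves to the imported module: the SIMD build when the host supports it,
// the baseline build otherwise. Assumes both URLs can be loaded with import().
export default function loadWasmModule(url, simdUrl) {
  return simd().then(isSupported => import(isSupported ? simdUrl : url));
}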
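If pulling in a library is not an option, detection can be sketched directly with `WebAssembly.validate`, which, to my understanding, is roughly how feature-detection libraries work. The byte array below is a hand-assembled module whose only function uses SIMD opcodes. `hasSimd`, `pickModuleUrl`, and the URLs passed to it are hypothetical names for this sketch.

```javascript
// Minimal module: (func (result v128) i32.const 0  i8x16.splat  i8x16.popcnt)
const SIMD_PROBE = new Uint8Array([
  0, 97, 115, 109, 1, 0, 0, 0, // "\0asm" magic, version 1
  1, 5, 1, 96, 0, 1, 123,      // type section: () -> v128
  3, 2, 1, 0,                  // function section: one function of type 0
  10, 10, 1, 8, 0,             // code section: one body, no locals
  65, 0,                       // i32.const 0
  253, 15,                     // i8x16.splat
  253, 98,                     // i8x16.popcnt
  11,                          // end
]);

// True when the engine accepts a module containing SIMD instructions.
function hasSimd() {
  return typeof WebAssembly === 'object' && WebAssembly.validate(SIMD_PROBE);
}

// Chooses between the baseline build and the SIMD build.
function pickModuleUrl(url, simdUrl) {
  return hasSimd() ? simdUrl : url;
}
```

Validation only checks that the bytes decode and type-check, so the probe never has to be instantiated or run.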
SIMD Instruction Set
SIMD instructions resemble scalar operations but process vectors. Key categories include arithmetic, load/store, logical operations, and lane manipulation. Summary of common instructions:
Instruction Format | Description | Example |
---|---|---|
Load/Store | | |
v128.load offset=<n> align=<m> | Load 128-bit vector from memory | (v128.load offset=0 align=16 (i32.const 0)) |
v128.load8_splat | Load 8-bit integer and splat to 16 lanes | (v128.load8_splat (i32.const 42)) |
v128.store offset=<n> align=<m> | Store 128-bit vector to memory | (v128.store offset=16 align=16 (i32.const 32) (local.get $vec)) |
Constants | | |
v128.const <type> <values> | Create constant vector | (v128.const i32x4 0 1 2 3) |
Integer Arithmetic | | |
i8x16.add(a, b) | 8-bit integer addition (16 lanes) | (i8x16.add (local.get $a) (local.get $b)) |
i16x8.sub(a, b) | 16-bit integer subtraction (8 lanes) | (i16x8.sub (local.get $a) (local.get $b)) |
i8x16.add_sat_s(a, b) | 8-bit signed saturating addition (named add_saturate_s in early drafts) | (i8x16.add_sat_s (local.get $a) (local.get $b)) |
Integer Comparison | | |
i8x16.eq(a, b) | 8-bit integer equality (returns mask) | (i8x16.eq (local.get $a) (local.get $b)) |
i32x4.lt_s(a, b) | 32-bit signed integer less-than (returns mask) | (i32x4.lt_s (local.get $a) (local.get $b)) |
Floating Point | | |
f32x4.add(a, b) | 32-bit float addition (4 lanes) | (f32x4.add (local.get $a) (local.get $b)) |
f64x2.sqrt(a) | 64-bit float square root (2 lanes) | (f64x2.sqrt (local.get $a)) |
Bitwise | | |
v128.and(a, b) | Bitwise AND | (v128.and (local.get $a) (local.get $b)) |
v128.bitselect(a, b, mask) | Bitwise selection: bits of a where mask is 1, bits of b elsewhere | (v128.bitselect (local.get $a) (local.get $b) (local.get $mask)) |
Shifts | | |
i32x4.shl(a, count) | 32-bit integer left shift (shift count is an i32 operand) | (i32x4.shl (local.get $a) (i32.const 2)) |
Lane Operations | | |
i8x16.extract_lane_s(idx, a) | Extract signed 8-bit lane (idx is an immediate) | (i8x16.extract_lane_s 3 (local.get $a)) |
i8x16.shuffle(mask, a, b) | Shuffle lanes from two vectors (mask is 16 immediate lane indices) | (i8x16.shuffle 0 1 2 3 12 13 14 15... (local.get $a) (local.get $b)) |
Type Conversion | | |
i32x4.trunc_sat_f32x4_s(a) | f32 to i32 (saturated truncation) | (i32x4.trunc_sat_f32x4_s (local.get $a)) |
Other | | |
v128.any_true(a) | Check if any bit is non-zero (returns i32) | (v128.any_true (local.get $a)) |
f32x4.ceil(a) | 32-bit float ceiling | (f32x4.ceil (local.get $a)) |
Instruction set summarized with DeepSeek assistance. Please report any inaccuracies.
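A few entries in the table have semantics that are easy to misread, so here is a plain-JavaScript model of three of them: signed saturating addition, `v128.bitselect`, and `i8x16.shuffle`. The function names are invented for this sketch, and they operate on arrays of lane values rather than real v128 values.

```javascript
// Signed saturating add: results clamp to [-128, 127] instead of wrapping.
function addSatS8(a, b) {
  return Int8Array.from(a, (x, i) => Math.max(-128, Math.min(127, x + b[i])));
}

// bitselect(a, b, mask): each result bit comes from a where the mask bit is 1, else from b.
function bitselect(a, b, mask) {
  return Uint8Array.from(a, (x, i) => (x & mask[i]) | (b[i] & ~mask[i] & 0xff));
}

// shuffle: indices 0-15 pick a byte from a, indices 16-31 pick a byte from b.
// The real instruction always takes exactly 16 indices; this model accepts any count.
function shuffle(indices, a, b) {
  return Uint8Array.from(indices, (idx) => (idx < 16 ? a[idx] : b[idx - 16]));
}
```

For example, `addSatS8([100], [100])` yields 127 where a wrapping add would give -56, which is why saturating arithmetic is the usual choice for pixel and audio samples.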
Using SIMD Instructions
Example: Image color inversion
Non-SIMD implementation processes one pixel (4 bytes) per iteration:
(module
  (import "env" "log" (func $log (param i32)))
  (import "env" "memory" (memory 100))
  ;; invert RGB in place, skip Alpha
  (func $invert (param $start i32) (param $length i32)
    (local $end i32)
    (local $i i32)
    ;; Calculate end address = start + length * 4
    local.get $start
    (i32.mul (local.get $length) (i32.const 4))
    i32.add
    local.set $end
    local.get $start
    local.set $i
    (block $exit
      ;; Process R, G, B channels individually
      (loop $loop
        local.get $i
        local.get $end
        i32.ge_u
        br_if $exit
        ;; R
        local.get $i
        i32.const 255
        local.get $i
        i32.load8_u
        i32.sub
        i32.store8
        ;; G
        local.get $i
        i32.const 1
        i32.add
        i32.const 255
        local.get $i
        i32.const 1
        i32.add
        i32.load8_u
        i32.sub
        i32.store8
        ;; B
        local.get $i
        i32.const 2
        i32.add
        i32.const 255
        local.get $i
        i32.const 2
        i32.add
        i32.load8_u
        i32.sub
        i32.store8
        ;; i = i + 4
        local.get $i
        i32.const 4
        i32.add
        local.set $i
        br $loop
      )
    )
  )
  (export "invert" (func $invert))
)
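For checking the wasm output against a known-good result, a reference implementation in JavaScript is handy. This sketch mirrors the loop above on a `Uint8ClampedArray` such as `ImageData.data`. `invertReference` is a name chosen here and is not part of the original project.

```javascript
// Inverts R, G, B of every RGBA pixel in place and leaves alpha untouched.
function invertReference(rgba) {
  for (let i = 0; i + 3 < rgba.length; i += 4) {
    rgba[i] = 255 - rgba[i];         // R
    rgba[i + 1] = 255 - rgba[i + 1]; // G
    rgba[i + 2] = 255 - rgba[i + 2]; // B
  }
  return rgba;
}
```

Running the wasm export and this function on copies of the same pixels and comparing the two buffers byte by byte is a quick way to catch mistakes in the hand-written loop.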
SIMD version processes 4 pixels (16 bytes) per iteration:
(module
  (import "env" "log" (func $log (param i32)))
  (import "env" "memory" (memory 100))
  (func $invert (param $start i32) (param $length i32)
    (local $end i32)
    (local $i i32)
    (local $chunk v128)
    (local $mask v128)
    (local $full255 v128)
    ;; end = start + length * 4 + 3 (the extra 3 is padding, see the note below)
    local.get $start
    local.get $length
    i32.const 4
    i32.mul
    i32.add
    i32.const 3
    i32.add
    local.set $end
    ;; i = start
    local.get $start
    local.set $i
    ;; Full 255 vector
    v128.const i8x16 255 255 255 255 255 255 255 255
                     255 255 255 255 255 255 255 255
    local.set $full255
    ;; Alpha channel mask (preserve positions 3,7,11,15)
    v128.const i8x16 0 0 0 255 0 0 0 255
                     0 0 0 255 0 0 0 255
    local.set $mask
    (block $exit
      (loop $loop
        ;; if (i >= end) break
        local.get $i
        local.get $end
        i32.ge_u
        br_if $exit
        ;; load 16 bytes (4 pixels)
        local.get $i
        v128.load
        local.set $chunk
        ;; chunk = 255 - chunk (per byte)
        local.get $full255
        local.get $chunk
        i8x16.sub
        local.set $chunk
        ;; Preserve alpha channels: take original bytes where mask is set,
        ;; inverted bytes elsewhere
        local.get $i
        v128.load
        local.get $chunk
        local.get $mask
        v128.bitselect
        local.set $chunk
        ;; store back
        local.get $i
        local.get $chunk
        v128.store
        ;; i += 16
        local.get $i
        i32.const 16
        i32.add
        local.set $i
        br $loop
      )
    )
  )
  (export "invert" (func $invert))
)
Note: The SIMD version processes 16 bytes per iteration (the i += 16 step). Image data is not always a multiple of 16 bytes, so the code adds 3 to the end address as padding and lets the final iteration run past the end of the pixel data, by up to 16 bytes. This would overwrite any other data stored directly after the image, but is acceptable in this isolated example.
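On the host side, one way to make that overrun harmless is to reserve a padded region and copy only the real pixels back. The sketch below assumes an instance whose `invert(start, length)` export and imported `memory` match the modules above. `runInvert` and `paddedSize` are hypothetical helpers, and the padding formula is derived from the `+ 3` end address and 16-byte step of the SIMD loop.

```javascript
// Bytes the SIMD loop may touch for `byteLength` bytes of pixels:
// it runs while i < start + byteLength + 3, stepping 16 bytes at a time.
function paddedSize(byteLength) {
  return (Math.floor(byteLength / 16) + 1) * 16;
}

// Copies RGBA bytes into wasm memory, calls the export, and copies the result back.
function runInvert(instance, memory, rgba, start = 0) {
  const needed = start + paddedSize(rgba.length);
  if (needed > memory.buffer.byteLength) {
    throw new RangeError(`need ${needed} bytes of wasm memory`);
  }
  const heap = new Uint8Array(memory.buffer);
  heap.set(rgba, start);
  instance.exports.invert(start, rgba.length / 4); // length is counted in pixels
  rgba.set(heap.subarray(start, start + rgba.length));
  return rgba;
}
```

Anything placed in the same memory should then start at or after `start + paddedSize(rgba.length)`, so the trailing chunk only ever touches scratch bytes.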
Performance Comparison:
- Left: Original image (928×927 pixels)
- Middle: Non-SIMD result (processing time: ~2.9ms)
- Right: SIMD result (processing time: 0.5ms)
The SIMD implementation shows ~6x speedup. Larger images yield greater benefits, but even smaller images like the classic Lenna test image show significant improvements.
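Numbers like these depend heavily on the machine and engine, so it is worth measuring on your own targets. A minimal timing sketch, assuming `performance.now()` is available (browsers and recent Node.js versions). `medianMs` and `speedup` are hypothetical helpers, and the median is used to damp warm-up and garbage-collection noise.

```javascript
// Runs fn several times and returns the median wall-clock time in milliseconds.
function medianMs(fn, runs = 15) {
  const samples = [];
  for (let r = 0; r < runs; r++) {
    const t0 = performance.now();
    fn();
    samples.push(performance.now() - t0);
  }
  samples.sort((x, y) => x - y);
  return samples[Math.floor(samples.length / 2)];
}

// Speedup as quoted in the text: scalar time divided by SIMD time.
function speedup(scalarMs, simdMs) {
  return scalarMs / simdMs;
}
```

With the figures quoted above, `speedup(2.9, 0.5)` comes out at 5.8, which the text rounds to ~6x.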
Next
Part 2 will explore using SIMD in WebAssembly via C/C++ programs.