How Much Faster Is Rust/WASM Than JS for Image Filters? I Ran Both Side by Side and Measured
A small single-page app that runs three image filters (grayscale, box blur, Sobel) in vanilla JavaScript and in Rust-compiled-to-raw-WebAssembly against the same RGBA buffer, then prints the millisecond cost of each and the speedup. No wasm-bindgen, no wasm-pack, ~4.7 KB of .wasm.
Every benchmark blog says "WASM is faster than JS." I wanted to see exactly how much faster for an everyday workload: not a micro-benchmark, just "take a 1024×1024 picture, run a filter, print the time." So I wrote one.
📦 GitHub: https://github.com/sen-ltd/wasm-image-filter
Drop in an image, pick a filter, press Run benchmark. The JS and WASM implementations run identical algorithms: same kernel, same traversal order, same integer arithmetic. I verified pixel-for-pixel parity in the app; 0 bytes differ between the two outputs across 4 million bytes.
The measurements (1024×1024 image, median of 5 runs)
| Filter | JS | WASM | Speedup |
|---|---|---|---|
| Grayscale (BT.601 luma) | 1.90 ms | 1.30 ms | 1.46× |
| Box blur (radius 3) | 130.60 ms | 40.60 ms | 3.22× |
| Sobel edge detection | 20.10 ms | 9.50 ms | 2.12× |
Numbers from an M2 MacBook Air on Chrome stable. They wobble by ±10% run-to-run depending on how warm V8 is, but the ratios are stable at roughly what you see above.
The interesting part isn't any single number; it's the shape of the difference:
- Grayscale is memory-bound. Three multiplications and a shift per pixel. V8's JIT keeps up; WASM only edges it out by ~1.5×.
- Box blur is compute-bound. A radius-3 kernel is 49 samples per pixel, roughly 50× more inner-loop work than grayscale. The speedup jumps to 3×.
- Sobel sits in between. 8 samples, multiply-accumulate, one integer square root.
The rule of thumb that fell out of this: WASM wins in proportion to how much math you do inside the pixel loop. "WASM is 10× faster" is misleading as a blanket claim. "WASM pulls ahead as your inner loop gets heavier" is what actually happens.
Why I skipped wasm-bindgen
wasm-bindgen is wonderful for real applications, but for a benchmark it's noise. Every call goes through a JS glue layer that copies Uint8Array payloads through marshaler functions and manages ownership. For a grayscale pass that completes in ~1 ms, the tens of microseconds of glue per call show up as "WASM overhead" that isn't actually WASM; it's wasm-bindgen.
I wanted the smallest possible boundary between JS and WASM so the measurement reflects the compute cost, not the plumbing cost. Turns out the raw WebAssembly API gives you that in ~40 lines of JS:
let instance = null;

export async function loadWasm(url = './assets/filter.wasm') {
  const res = await fetch(url);
  const bytes = await res.arrayBuffer();
  // No import object: the module is fully self-contained.
  ({ instance } = await WebAssembly.instantiate(bytes, {}));
}

// Copy a JS buffer into the module's exported linear memory at ptr.
function writeInto(ptr, src) {
  new Uint8ClampedArray(instance.exports.memory.buffer, ptr, src.length).set(src);
}

// Copy len bytes back out; slice() detaches the result from wasm memory.
function readOut(ptr, len) {
  return new Uint8ClampedArray(instance.exports.memory.buffer, ptr, len).slice();
}

export function grayscale(buf) {
  const { alloc_buffer, reset_heap, grayscale: run } = instance.exports;
  reset_heap(); // bump allocator back to offset 0
  const ptr = alloc_buffer(buf.length);
  writeInto(ptr, buf);
  run(ptr, buf.length);
  buf.set(readOut(ptr, buf.length)); // result back into the caller's buffer
}
The JS side owns the input Uint8ClampedArray, writes it straight into the exported linear memory, calls one exported function, then reads back. That's it. No externref, no JsValue, no glue.
Why #![no_std] and a bump allocator
The Rust crate is #![no_std] with a hand-rolled bump allocator, for two reasons:
1. The output .wasm is tiny. It comes out to 4781 bytes, and that is the entire module: three filters, an allocator, a panic handler, everything. A roughly equivalent module built through wasm-bindgen with default settings is about 10× bigger because it ships stdlib internals, a real allocator (dlmalloc), and formatter code.
2. The allocator fits the workload. The filters need, at most, two buffers alive at once (source + destination). A bump allocator that grows forward and resets to zero between frames is a perfect fit:
use core::alloc::{GlobalAlloc, Layout};

// HEAP (the static backing array) and CURSOR (the current bump offset)
// are statics defined alongside this impl in the crate.
struct Bump;

unsafe impl GlobalAlloc for Bump {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let align = layout.align();
        let size = layout.size();
        let cur = *CURSOR.offset.get();
        // Round the cursor up to the requested alignment (align is a power of two).
        let aligned = (cur + align - 1) & !(align - 1);
        let end = aligned + size;
        if end > HEAP_SIZE { return core::ptr::null_mut(); }
        *CURSOR.offset.get() = end;
        (HEAP.bytes.get() as *mut u8).add(aligned)
    }
    // Individual frees are no-ops; reset_heap() reclaims everything at once.
    unsafe fn dealloc(&self, _: *mut u8, _: Layout) {}
}
dealloc does nothing. reset_heap() resets the cursor to zero, freeing everything at once. General-purpose malloc/free is over-engineered for this shape of workload and you pay for it in binary size.
Pair that with a tight release profile:
[profile.release]
opt-level = 3
lto = true
codegen-units = 1
panic = "abort"
strip = true
The panic = "abort" line is the biggest single size win; it strips all the unwinder code that stdlib-based Rust emits for panic propagation.
Enabling SIMD128 with a config flag
One line in .cargo/config.toml:
[target.wasm32-unknown-unknown]
rustflags = ["-C", "target-feature=+simd128"]
This tells LLVM that the target supports WebAssembly SIMD128, so the auto-vectorizer is free to emit v128 instructions when it sees the shape of a vector loop. I didn't write a single SIMD intrinsic by hand; the 3× speedup on box blur comes entirely from the compiler noticing that the inner loop sums consecutive memory and turning it into packed adds.
The browser (Chrome, Firefox, Safari 16.4+) consumes the SIMD opcodes transparently. No runtime feature detection needed in 2026.
Building with Docker so you don't have to install a toolchain
I don't love installing Rust toolchains on laptops I use for everything else. The whole build is a Docker one-liner:
#!/usr/bin/env bash
set -euo pipefail
cd "$(dirname "$0")"
docker run --rm -v "$PWD/rust:/work" -w /work \
-e CARGO_TARGET_DIR=/work/target \
rust:1.90-alpine sh -c '
rustup target add wasm32-unknown-unknown >/dev/null 2>&1 || true
cargo build --release
'
cp rust/target/wasm32-unknown-unknown/release/wasm_image_filter.wasm assets/filter.wasm
rust:1.90-alpine is ~200 MB. After the first pull, incremental builds come back in ~1.3 seconds. The .wasm artifact is committed to the repo so npm run serve works without a build step at all.
Traps I walked into while measuring
A few things you'll want to know if you roll your own browser benchmark:
- The first run is always slower. V8 hasn't specialized the inner loop yet. Throw away the first sample; this app does one warmup before the measurement loop.
- performance.now() is clamped. Some browser configurations round it to 5 µs or even 100 µs as a Spectre mitigation. If your measured values refuse to go below 1 ms for something you know should take 50 µs, that's what's happening.
- Use the median. The average gets dragged around by GC pauses and thermal throttling. The median of 5 or 10 runs is far more stable.
- Watch devicePixelRatio. If you're drawing to a canvas sized by CSS and forget to account for the device pixel ratio, a "1024×1024" canvas is actually 2048×2048 pixels on a Retina display, 4× the work. This app works on ImageData directly so the trap doesn't bite, but it absolutely will if you're using getContext('2d') with CSS sizing.
When WASM does not help
Grayscale is the case where "just add WASM" buys you almost nothing. The kernel is so simple that V8's JIT already emits reasonable machine code; the remaining overhead is memory bandwidth, which both implementations share. If your workload looks like grayscale (small per-pixel arithmetic, large image), WASM isn't the lever you want. Bigger tiles, OffscreenCanvas, or a GPU shader will move the needle more.
WASM earns its keep when the inner loop is compute-heavy. Sobel has a meaningful amount of per-pixel math (8 samples, mul-add, sqrt) and benefits by ~2×. Box blur at radius 3 does 49× the work grayscale does and benefits by ~3×. A blur at radius 7, at 225 samples per pixel, ships with a speedup closer to 5–6× in my testing.
If you're sizing up whether to port a hot function to WASM, the crude estimate that lines up with what I saw: multiply (samples per pixel) × (pixel count) × (arithmetic ops per sample). Under ~10^7, WASM won't visibly help. Between 10^7 and 10^8, expect 1.5–2×. Above 10^8, you're in the sweet spot and the speedup grows linearly with inner-loop weight.
Try it
git clone https://github.com/sen-ltd/wasm-image-filter
cd wasm-image-filter
npm run serve # python3 -m http.server 8080
# open http://localhost:8080
Click Load sample for a synthetic 1024×1024 image, or drop in any JPEG/PNG. Flip the filter, adjust runs, press Run benchmark. The history table keeps your previous runs so you can compare filter settings on the same image.
MIT-licensed. Rust source is in rust/src/lib.rs, JS reference implementations in src/filters-js.js. They're the same algorithm on purpose, so the numbers you see are the compiler difference, not the logic difference.
About SEN
This is entry #205 of a project to build 100+ open-source portfolio pieces at sen.ltd/portfolio/. SEN LLC is a small Japanese software studio; if you're evaluating us for contract work, the full catalog is the best sample.
