Arjun R k

Analyzing Performance Improvements of AI Models Using Wasm at the Edge

TL;DR: WebAssembly (Wasm) lets you ship the same AI inference module across wildly different edge devices, with strong sandboxing and near-native speed when using SIMD + threads. Below is a minimal setup to measure latency, throughput, memory, and cold-start across:

Native CPU inference (onnxruntime-node)

Wasm CPU inference (onnxruntime-web running via WebAssembly + SIMD + threads)

We’ll use a small vision model (SqueezeNet) to keep things fast.

Why Wasm

  • Portability: One module → many CPUs/OSes.
  • Security: Strong sandbox limits blast radius on shared edge boxes.
  • Performance: With SIMD + threads and AOT/JIT, Wasm often lands close to native CPU; cold-starts are typically faster than spinning up containers.

Project layout

wasm-edge-ai/
├─ models/
│  └─ squeezenet.onnx
├─ data/
│  └─ cat.jpg
├─ src/
│  ├─ preprocess.js
│  ├─ bench-native.js
│  └─ bench-wasm.js
├─ package.json
└─ README.md

Model: any small ImageNet classifier in ONNX format (e.g., SqueezeNet 1.1).
Image: any small RGB image (the preprocessing step resizes it to 224×224 anyway).

1) Install deps & fetch model

mkdir -p wasm-edge-ai/{models,data,src}
cd wasm-edge-ai
npm init -y

# Native CPU runtime
npm i onnxruntime-node

# Wasm runtime (runs in Node via WebAssembly)
npm i onnxruntime-web

# Image + tensor utils
npm i sharp ndarray

# Optional: system metrics
npm i pidusage

# (Get model & image)
curl -L -o models/squeezenet.onnx https://github.com/onnx/models/raw/main/vision/classification/squeezenet/model/squeezenet1.1-7.onnx
curl -L -o data/cat.jpg https://raw.githubusercontent.com/onnx/models/main/vision/classification/squeezenet/test_data/images/cat.jpg
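
If those curl URLs have moved (the onnx/models repo gets reorganized occasionally), grab any SqueezeNet 1.1 ONNX file and a test image manually. Either way, a quick check that both files landed where the scripts expect them (the model should be a few MB):

ls -lh models/squeezenet.onnx data/cat.jpg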

2) Preprocessing (shared by both paths)

// src/preprocess.js
const sharp = require('sharp');
const ndarray = require('ndarray');

async function loadAndPreprocess(imagePath) {
  // 1) Load & resize to 224×224
  const { data, info } = await sharp(imagePath)
    .resize(224, 224)
    .raw()
    .toBuffer({ resolveWithObject: true });

  // 2) Convert HWC uint8 -> float32 CHW normalized (ImageNet mean/std)
  const W = info.width, H = info.height, C = info.channels; // likely 3
  const float = new Float32Array(C * H * W);
  const mean = [0.485, 0.456, 0.406];
  const std  = [0.229, 0.224, 0.225];

  for (let y = 0; y < H; y++) {
    for (let x = 0; x < W; x++) {
      for (let c = 0; c < C; c++) {
        const idxHWC = (y * W + x) * C + c;
        const idxCHW = c * H * W + y * W + x;
        float[idxCHW] = (data[idxHWC] / 255 - mean[c]) / std[c];
      }
    }
  }

  // ORT expects NCHW; add batch dimension N=1
  return ndarray(float, [1, 3, 224, 224]);
}

module.exports = { loadAndPreprocess };

A tiny helper to build an ort.Tensor from the ndarray (ORT's run() expects Tensor instances, so we pass in whichever ORT module we're using):

// src/arr.js
// ORT's run() expects ort.Tensor instances, so build one from the ndarray.
// Pass in the ORT module (onnxruntime-node or onnxruntime-web) so both benchmarks can share this helper.
function arrToOrtTensor(ort, nd) {
  return new ort.Tensor('float32', nd.data, nd.shape);
}
module.exports = { arrToOrtTensor };

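Before benchmarking anything, it's worth a throwaway check that preprocessing behaves. This sketch (not part of the project layout) just prints the tensor shape and value range; after ImageNet normalization the values should sit roughly between -2.1 and 2.6:

// sanity-check.js (throwaway) — confirm shape and that values look ImageNet-normalized
const { loadAndPreprocess } = require('./src/preprocess');

loadAndPreprocess('./data/cat.jpg').then(nd => {
  let min = Infinity, max = -Infinity;
  for (const v of nd.data) { if (v < min) min = v; if (v > max) max = v; }
  console.log('shape:', nd.shape);                                   // expect [1, 3, 224, 224]
  console.log('value range:', min.toFixed(2), 'to', max.toFixed(2)); // roughly -2.1 to 2.6
});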

3) Native CPU baseline (onnxruntime-node)

// src/bench-native.js
const ort = require('onnxruntime-node');
const pidusage = require('pidusage');
const { loadAndPreprocess } = require('./preprocess');
const { arrToOrtTensor } = require('./arr');
const path = require('path');

async function main() {
  const session = await ort.InferenceSession.create(
    path.join(__dirname, '..', 'models', 'squeezenet.onnx'),
    {
      executionProviders: ['cpu'], // CPU only, to keep the comparison with the Wasm path fair
    }
  );

  const inputNd = await loadAndPreprocess(path.join(__dirname, '..', 'data', 'cat.jpg'));
  const input = arrToOrtTensor(ort, inputNd);

  // Warmup
  for (let i = 0; i < 5; i++) {
    await session.run({ 'data': input }); // SqueezeNet input is 'data'
  }

  // Benchmark
  const N = parseInt(process.env.N || '200', 10);
  const latencies = [];
  const t0 = performance.now();
  for (let i = 0; i < N; i++) {
    const t1 = performance.now();
    await session.run({ 'data': input });
    const t2 = performance.now();
    latencies.push(t2 - t1);
  }
  const tN = performance.now();

  // Stats
  latencies.sort((a, b) => a - b);
  const p = q => latencies[Math.floor(q * latencies.length)];
  const total = tN - t0;
  const throughput = (N * 1000) / total;

  const mem = process.memoryUsage.rss();
  const cpu = await pidusage(process.pid);

  console.log(JSON.stringify({
    kind: 'native',
    N,
    ms_p50: p(0.5),
    ms_p95: p(0.95),
    ms_avg: latencies.reduce((a, b) => a + b, 0) / N,
    throughput_ips: throughput.toFixed(2),
    rss_mb: (mem / (1024*1024)).toFixed(1),
    cpu_percent: cpu.cpu.toFixed(1)
  }, null, 2));
}

main().catch(err => { console.error(err); process.exit(1); });
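
As an optional correctness check (outside the timing loop), you can drop a few lines like these into main() right after the warmup to confirm the model produces sensible output; session.outputNames[0] avoids hard-coding the model's output name:

  // Optional: verify inference output before trusting the benchmark numbers
  const results = await session.run({ data: input });
  const logits = results[session.outputNames[0]].data; // Float32Array of 1000 ImageNet scores
  let best = 0;
  for (let i = 1; i < logits.length; i++) if (logits[i] > logits[best]) best = i;
  console.log('top-1 ImageNet class index:', best);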

4) Wasm CPU (onnxruntime-web with SIMD + threads)

Node supports WebAssembly; onnxruntime-web uses a Wasm backend. We’ll enable SIMD and threads (if your Node build and hardware support them).

// src/bench-wasm.js
// onnxruntime-web exposes a "web" API, but works in Node too.
const ort = require('onnxruntime-web');
const pidusage = require('pidusage');
const { loadAndPreprocess } = require('./preprocess');
const { arrToOrtTensor } = require('./arr');
const path = require('path');

async function main() {
  // Enable SIMD and threads where available (ort.env.wasm is provided by onnxruntime-web)
  ort.env.wasm.simd = true;
  ort.env.wasm.numThreads = Math.max(1, Number(process.env.WASM_THREADS || 4));
  // (Optional) set path for .wasm assets if bundling:
  // ort.env.wasm.wasmPaths = path.join(__dirname, '..', 'node_modules', 'onnxruntime-web', 'dist');

  const session = await ort.InferenceSession.create(
    // wasm session takes the same ONNX model
    path.join(__dirname, '..', 'models', 'squeezenet.onnx'),
    { executionProviders: ['wasm'] }
  );

  const inputNd = await loadAndPreprocess(path.join(__dirname, '..', 'data', 'cat.jpg'));
  const input = arrToOrtTensor(ort, inputNd);

  // Warmup
  for (let i = 0; i < 5; i++) {
    await session.run({ 'data': input });
  }

  // Benchmark
  const N = parseInt(process.env.N || '200', 10);
  const latencies = [];
  const t0 = performance.now();
  for (let i = 0; i < N; i++) {
    const t1 = performance.now();
    await session.run({ 'data': input });
    const t2 = performance.now();
    latencies.push(t2 - t1);
  }
  const tN = performance.now();

  // Stats
  latencies.sort((a, b) => a - b);
  const p = q => latencies[Math.floor(q * latencies.length)];
  const total = tN - t0;
  const throughput = (N * 1000) / total;

  const mem = process.memoryUsage.rss();
  const cpu = await pidusage(process.pid);

  console.log(JSON.stringify({
    kind: 'wasm',
    N,
    threads: ort.env.wasm.numThreads,
    simd: !!ort.env.wasm.simd,
    ms_p50: p(0.5),
    ms_p95: p(0.95),
    ms_avg: latencies.reduce((a, b) => a + b, 0) / N,
    throughput_ips: throughput.toFixed(2),
    rss_mb: (mem / (1024*1024)).toFixed(1),
    cpu_percent: cpu.cpu.toFixed(1)
  }, null, 2));
}

main().catch(err => { console.error(err); process.exit(1); });

Run them:

# Native (CPU)
node src/bench-native.js | tee native.json

# Wasm (SIMD + threads)
WASM_THREADS=4 node --experimental-wasm-simd --experimental-wasm-threads src/bench-wasm.js | tee wasm.json

Modern Node builds enable Wasm SIMD and threads by default, so the flags are usually unnecessary (very recent builds may even reject them as unknown options). If threads aren’t available on your target edge runtime, set WASM_THREADS=1.
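
A quick way to confirm the runtime can actually use the threaded Wasm backend (it relies on SharedArrayBuffer being available):

node -p "process.version + ' | SharedArrayBuffer: ' + (typeof SharedArrayBuffer !== 'undefined')"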

5) Collect & compare

Each script prints a compact JSON report like:

{
  "kind": "wasm",
  "N": 200,
  "threads": 4,
  "simd": true,
  "ms_p50": 7.21,
  "ms_p95": 8.88,
  "ms_avg": 7.49,
  "throughput_ips": "133.51",
  "rss_mb": "138.2",
  "cpu_percent": "96.7"
}

Key metrics to watch:

  • Latency (p50/p95): per-inference latency distribution
  • Throughput (ips): inferences/sec
  • RSS (MB): memory footprint
  • CPU%: how hard the core(s) worked

Test on the actual edge box (Raspberry Pi 4/5, Intel NUC, ARM SBC, router-class CPU, etc.). Then rerun on your CI runner or laptop to see portability vs raw speed trade-offs.
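
A tiny comparison script keeps the runs honest across devices. This is just a sketch, assuming you saved the two reports as native.json and wasm.json with the commands above:

// src/compare.js — sketch: read both reports and print wasm-vs-native ratios
const fs = require('fs');

const native = JSON.parse(fs.readFileSync('native.json', 'utf8'));
const wasm   = JSON.parse(fs.readFileSync('wasm.json', 'utf8'));

console.log('p50 latency ratio (wasm / native):', (wasm.ms_p50 / native.ms_p50).toFixed(2));
console.log('p95 latency ratio (wasm / native):', (wasm.ms_p95 / native.ms_p95).toFixed(2));
console.log('throughput ratio  (wasm / native):', (Number(wasm.throughput_ips) / Number(native.throughput_ips)).toFixed(2));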

6) (Optional) Cold-start & concurrency

Cold-start: time a minimal end-to-end run (module load + create() + first inferences):

# Rough cold start: wall-clock time for module load + session create + warmup + one timed run
time N=1 WASM_THREADS=4 node src/bench-wasm.js

Concurrency: run multiple Node processes (simulate multi-tenant edge node):

# 4 parallel Wasm workers (N and WASM_THREADS are read from the environment)
seq 1 4 | xargs -I{} -P4 bash -c \
  'WASM_THREADS=1 N=150 node src/bench-wasm.js >> wasm-multi.jsonl'

You can then aggregate p50/p95 across workers.
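
A small aggregation sketch for pooling the workers' numbers; it assumes you drop the `null, 2` arguments from JSON.stringify in bench-wasm.js so each worker appends exactly one line to wasm-multi.jsonl:

// aggregate.js — sketch: combine per-worker reports from wasm-multi.jsonl
// (assumes compact, single-line JSON output from each worker)
const fs = require('fs');

const reports = fs.readFileSync('wasm-multi.jsonl', 'utf8')
  .split('\n')
  .filter(Boolean)
  .map(line => JSON.parse(line));

const p50s = reports.map(r => r.ms_p50).sort((a, b) => a - b);
const totalIps = reports.reduce((sum, r) => sum + Number(r.throughput_ips), 0);

console.log('workers:', reports.length);
console.log('median per-worker p50 (ms):', p50s[Math.floor(p50s.length / 2)]);
console.log('aggregate throughput (ips):', totalIps.toFixed(1));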

7) What improvements to expect?

  • On modern x86/ARM with SIMD + threads, Wasm latency can land close to native for CPU-friendly models like SqueezeNet/MobileNet (often within a small constant factor).
  • Cold start tends to be excellent (tiny module, fast init) vs “spin up a containerized Python stack.”
  • Memory is usually smaller for the Wasm runner, helpful when squeezing multiple tenants/models on the same device.
  • Throughput depends on how well your runtime maps threads & vector ops; tune WASM_THREADS to match the device’s core count.

GPU acceleration in Wasm is emerging (WebGPU backends, wasi-nn), but CPU Wasm is the most portable today.

8) Production notes

  • Pin model opset and runtime versions; re-bench when you update.
  • Use SIMD-optimized models (quantized INT8/UINT8 variants often shine on edge CPUs).
  • Pre-warm modules on boot for ultra-low p50.
  • Batch carefully: micro-batches (2–8) can improve throughput with modest latency impact.
  • Resource limits: cgroup/ulimits per module to keep “noisy neighbors” in check on shared gateways (see the sketch below).
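
For that last point, one option on systemd-based edge boxes is to run each worker in its own transient scope; the limits here are purely illustrative and the command may need root (or --user with delegation):

# Cap one Wasm worker at roughly one core and 256 MB of memory
systemd-run --scope -p CPUQuota=100% -p MemoryMax=256M \
  env WASM_THREADS=1 node src/bench-wasm.js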

Conclusion

Wasm gives you a portable, safe, and fast AI runtime for the edge. With a small harness like the one above, you can verify on your own devices that you get the portability and operational simplicity you want, often with performance close enough to native to make the choice a no-brainer.
