Most devs assume running AI models requires Python, GPUs, or cloud APIs. But modern browsers can run full neural network inference using ONNX Runtime Web with WebAssembly — no backend, no cloud, no server.
In this tutorial, we’ll build a fully client-side AI inference engine that runs a real ONNX model (like sentiment analysis or image classification) entirely in the browser using WebAssembly — perfect for privacy-focused tools, offline workflows, or local-first apps.
Looking to sharpen your prompt engineering skills? Check out Vibe Coding: Prompting Best Practices — a focused, 10-part guide packed with real-world techniques to help you write more effective AI prompts. Whether you’re refining outputs or experimenting with creative use cases, this PDF delivers practical, high-impact strategies you can apply immediately.
Step 1: Choose a Small ONNX Model
To keep things performant, pick a lightweight ONNX model, such as a small text classifier (TinyBERT, DistilBERT) or a compact image model (MobileNet, SqueezeNet).
Let’s use a text model for simplicity — TinyBERT.
Download the ONNX model:
wget https://huggingface.co/onnx/tinybert-distilbert-base-uncased/resolve/main/model.onnx
Store this file in your public assets directory (e.g., public/models/model.onnx).
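Before going further, it's worth confirming the browser can actually fetch the file from that path. A quick sanity check, assuming your public/ directory is served at the site root:

// Quick sanity check that the model file is reachable from the browser.
const res = await fetch("/models/model.onnx", { method: "HEAD" });
console.log(res.ok ? "Model is reachable" : `Model missing (HTTP ${res.status})`);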
Step 2: Set Up ONNX Runtime Web
Install the ONNX Runtime Web package:
npm install onnxruntime-web
Then, initialize the inference session in your frontend code:
import * as ort from "onnxruntime-web";

let session;

async function initModel() {
  // Create an inference session backed by the WebAssembly execution provider.
  session = await ort.InferenceSession.create("/models/model.onnx", {
    executionProviders: ["wasm"],
  });
}
This loads the ONNX model into a WASM-based runtime, running entirely in-browser.
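Since loading the model is the slow part, create the session once and reuse it for every request. A minimal sketch, assuming a single-page setup where initModel() is called at startup:

// Kick off model loading as soon as the page loads and keep the promise around,
// so later calls can wait for it instead of reloading the model.
const modelReady = initModel();

async function analyze(text) {
  await modelReady; // make sure the session exists before running inference
  // ...tokenize `text` and call session.run() as shown in the next steps
}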
Step 3: Tokenize Input Text (No HuggingFace Needed)
ONNX models expect pre-tokenized inputs. Instead of pulling in HuggingFace or Python tokenizers, we’ll use a compact JavaScript tokenizer such as bert-tokenizer (make sure the vocabulary it loads matches the model you exported):
npm install bert-tokenizer
Then tokenize user input:
import BertTokenizer from "bert-tokenizer";
const tokenizer = new BertTokenizer();
const { input_ids, attention_mask } = tokenizer.encode("this is great!");
Prepare inputs for ONNX:
const input = {
input_ids: new ort.Tensor("int64", BigInt64Array.from(input_ids), [1, input_ids.length]),
attention_mask: new ort.Tensor("int64", BigInt64Array.from(attention_mask), [1, input_ids.length])
};
Step 4: Run Inference in the Browser
Now run the model, right in the user's browser:
const results = await session.run(input);
const logits = results.logits.data;
Interpret the logits for your task (e.g., choose the argmax index for classification).
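For example, a binary sentiment model could be post-processed like this (the two-label mapping is an assumption about the specific model you exported):

// Softmax over the logits, then pick the most likely class.
// Assumes a two-class (negative/positive) sentiment head.
const scores = Array.from(logits);
const max = Math.max(...scores);
const exps = scores.map((s) => Math.exp(s - max)); // numerically stable softmax
const sum = exps.reduce((a, b) => a + b, 0);
const probs = exps.map((e) => e / sum);

const labels = ["negative", "positive"];
console.log(labels[probs.indexOf(Math.max(...probs))], probs);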
You’ve just run a transformer-based AI model with zero server calls.
Step 5: Add WebAssembly Optimizations (Optional)
ONNX Runtime Web can also use WebAssembly SIMD and multithreading when the browser supports them:
ort.env.wasm.numThreads = 2; // set before creating the InferenceSession
ort.env.wasm.simd = true;
On supported browsers, these flags can noticeably speed up inference.
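Note that these flags must be set before the first InferenceSession is created, and multithreading only takes effect when the page is cross-origin isolated (served with COOP/COEP headers so SharedArrayBuffer is available). A sketch of the adjusted setup:

import * as ort from "onnxruntime-web";

// Configure the WASM backend before creating any sessions.
ort.env.wasm.numThreads = Math.min(2, navigator.hardwareConcurrency || 1);
ort.env.wasm.simd = true; // ignored by browsers without WASM SIMD support

const session = await ort.InferenceSession.create("/models/model.onnx", {
  executionProviders: ["wasm"],
});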
✅ Pros:
- 🧠 Full AI model execution directly in the browser
- 🔐 No cloud, no server, fully private
- 📴 Works offline — ideal for PWAs or local-first apps
- 🚀 Uses ONNX: works with models exported from PyTorch/TensorFlow
⚠️ Cons:
- 🐢 Limited to lightweight models (mobile-scale)
- 👀 Manual preprocessing and tokenization required
- 📦 Bundle size can grow due to model + tokenizer
- ❌ Not supported in all browsers (e.g., some mobile browsers may limit WASM features); a quick feature check is sketched below
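To handle that last point gracefully, you can feature-detect before downloading anything heavy. A rough (not exhaustive) check:

// Rough capability check before loading the model.
function canRunWasmInference() {
  return typeof WebAssembly === "object" &&
    typeof WebAssembly.instantiate === "function";
}

if (!canRunWasmInference()) {
  console.warn("WebAssembly unavailable; falling back to a non-AI experience.");
}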
Summary
Running AI inference in the browser used to sound like science fiction — now it’s just WebAssembly + ONNX. With this setup, you can deliver powerful, privacy-preserving AI capabilities entirely client-side: from offline transcription to secure chat assistants to smart document processors. The performance is real, and the applications are endless — especially in health, security, and creative tools.
Give users smart features without compromising speed or privacy — no server required.
Mastering prompt engineering is no longer optional — it’s essential. With Vibe Coding: Prompting Best Practices, you’ll explore structured methods to create prompts that consistently yield accurate, relevant, and creative results. Designed for developers and AI practitioners alike, this guide cuts through the fluff and delivers what works.