I’ve been a fan of WebAI for a while now, and the more I learn and build in this space, the more convinced I am that on-device is the way forward.
If we want AI that feels fast, private, and easy to access, it should run right where people already are: in the browser. No installs, no waiting on a server, no cloud bills. Just you, your device, and a tab open.
This is possible thanks to two big breakthroughs: WebAssembly (Wasm) and WebGPU.
WebAssembly
WebAssembly changed everything. Runtimes compiled to Wasm run at near-native speed while staying safely sandboxed inside the browser. That opened up a whole new opportunity:
ONNX Runtime Web lets you take models trained elsewhere (in PyTorch or TensorFlow, for example) and run them directly in a tab. Its CPU engine is compiled to Wasm, so it works everywhere without needing a GPU.
Transformers.js (from Hugging Face) builds on this by giving developers a one-line JS API for common NLP and vision tasks—things like embeddings, question answering, image classification, or even running Whisper for speech-to-text. Instead of wiring up a complex runtime yourself, you can just call pipeline('sentiment-analysis') in JavaScript and it works.
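That one-liner really is most of the story. Here’s a minimal sketch of the full flow (the package name shown is the @xenova/transformers build, and the output values are illustrative):

```js
import { pipeline } from '@xenova/transformers';

// Downloads a small default model on first use, then caches it in the browser.
const classifier = await pipeline('sentiment-analysis');

const result = await classifier('Running models in a browser tab is surprisingly fun.');
console.log(result); // e.g. [{ label: 'POSITIVE', score: 0.99 }]
```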
This was enough to get small and medium models running in the browser. Embeddings, sentiment analysis, and simple image classification suddenly became possible without a backend. And it worked everywhere—from laptops to mid-range phones.
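And if you’d rather drive ONNX Runtime Web directly, the flow looks roughly like this. It’s a sketch only: the model file and its input name ("input") are placeholders for whatever your exported model actually uses.

```js
import * as ort from 'onnxruntime-web';

// Create a session; by default this runs on the Wasm (CPU) engine.
const session = await ort.InferenceSession.create('./model.onnx');

// Wrap your data in a tensor matching the model's expected shape.
const input = new ort.Tensor('float32', new Float32Array(1 * 3 * 224 * 224), [1, 3, 224, 224]);

// Feeds are keyed by the model's input names ("input" is a placeholder here).
const results = await session.run({ input });
console.log(results);
```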
Another important player here is TensorFlow.js. This library has been around for a few years and was one of the first to make ML in the browser feel approachable. It offers multiple backends (CPU, WebGL, Wasm, and now WebGPU), and it comes with lots of ready-to-use models like PoseNet (for body pose detection) or MobileNet (for image recognition). For many developers, myself included, TensorFlow.js was the entry point to experimenting with browser ML before the newer frameworks arrived.
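For example, classifying an image with the prebuilt MobileNet model takes only a few lines (a sketch; the image element id is a placeholder):

```js
import * as tf from '@tensorflow/tfjs';
import * as mobilenet from '@tensorflow-models/mobilenet';

// Grab an image element already on the page (the id is a placeholder).
const img = document.getElementById('photo');

// Load the pretrained model and classify the image entirely in the browser.
const model = await mobilenet.load();
const predictions = await model.classify(img);
console.log(predictions); // e.g. [{ className: 'tabby cat', probability: 0.92 }, ...]
```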
WebGPU
Then came WebGPU. For the first time, the browser got a proper GPU compute API. Not just graphics tricks, but compute shaders, parallel math, and control over buffers.
This unlocked new possibilities:
Stable Diffusion generating images in a tab.
Segment Anything producing real-time masks on the user’s own images, entirely client-side.
LLMs streaming tokens locally, without touching a server.
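To make that concrete, here is roughly what raw WebGPU compute looks like with no ML framework involved. It’s a minimal sketch that just doubles an array; real runtimes build their matrix and attention kernels on exactly these primitives.

```js
// Request a GPU device (assumes a WebGPU-capable browser; error handling omitted).
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter.requestDevice();

// A tiny compute shader: each invocation doubles one element of the array.
const shader = device.createShaderModule({
  code: /* wgsl */ `
    @group(0) @binding(0) var<storage, read_write> data: array<f32>;
    @compute @workgroup_size(64)
    fn main(@builtin(global_invocation_id) id: vec3<u32>) {
      if (id.x < arrayLength(&data)) {
        data[id.x] = data[id.x] * 2.0;
      }
    }
  `,
});

const input = new Float32Array([1, 2, 3, 4]);

// Storage buffer the shader reads and writes.
const storage = device.createBuffer({
  size: input.byteLength,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST,
});
device.queue.writeBuffer(storage, 0, input);

// Staging buffer we can map and read back on the CPU.
const readback = device.createBuffer({
  size: input.byteLength,
  usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ,
});

const computePipeline = device.createComputePipeline({
  layout: 'auto',
  compute: { module: shader, entryPoint: 'main' },
});
const bindGroup = device.createBindGroup({
  layout: computePipeline.getBindGroupLayout(0),
  entries: [{ binding: 0, resource: { buffer: storage } }],
});

// Record and submit the compute pass, then copy the result out.
const encoder = device.createCommandEncoder();
const pass = encoder.beginComputePass();
pass.setPipeline(computePipeline);
pass.setBindGroup(0, bindGroup);
pass.dispatchWorkgroups(Math.ceil(input.length / 64));
pass.end();
encoder.copyBufferToBuffer(storage, 0, readback, 0, input.byteLength);
device.queue.submit([encoder.finish()]);

await readback.mapAsync(GPUMapMode.READ);
console.log(new Float32Array(readback.getMappedRange())); // [2, 4, 6, 8]
```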
The performance gains are real too. Chrome added FP16 shaders (half-precision floating point) and packed 8-bit dot products (DP4a), both of which make models faster and lighter. Benchmarks show 2–3× speedups on text embedding and LLM decoding just by enabling these features.
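Whether a given device exposes those fast paths is detectable at runtime. A quick sketch (the feature names below come from the current spec and Chrome’s implementation, so treat them as assumptions rather than guarantees):

```js
const adapter = await navigator.gpu.requestAdapter();

// FP16 in shaders is an adapter feature you have to request on the device.
const hasF16 = adapter.features.has('shader-f16');

// Packed 8-bit dot products (DP4a) surface as a WGSL language feature.
const hasDp4a = navigator.gpu.wgslLanguageFeatures.has('packed_4x8_integer_dot_product');

console.log({ hasF16, hasDp4a });

// Request the optional features you plan to use when creating the device.
const device = await adapter.requestDevice({
  requiredFeatures: hasF16 ? ['shader-f16'] : [],
});
```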
And the ecosystem is catching up:
ONNX Runtime Web now ships with a WebGPU backend, meaning models that were too heavy for Wasm execution suddenly become practical.
WebLLM, built by the MLC team, brings large language models into the browser. It uses WebGPU under the hood and exposes a familiar API similar to OpenAI’s. Developers can call it as if they were talking to an API in the cloud, but everything runs locally in the user’s browser.
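Switching ONNX Runtime Web to its GPU backend is essentially a one-line change (pass executionProviders: ['webgpu'] when creating the session). And calling WebLLM really does feel like calling a cloud API. Here’s a rough sketch; the model ID is illustrative, so pick any entry from WebLLM’s prebuilt model list:

```js
import { CreateMLCEngine } from '@mlc-ai/web-llm';

// Downloads and caches the weights, then runs everything on WebGPU locally.
// (Model ID is illustrative; use one from WebLLM's prebuilt list.)
const engine = await CreateMLCEngine('Llama-3.1-8B-Instruct-q4f32_1-MLC');

// OpenAI-style chat completion, but nothing leaves the browser.
const reply = await engine.chat.completions.create({
  messages: [{ role: 'user', content: 'Explain WebGPU in one sentence.' }],
});
console.log(reply.choices[0].message.content);
```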
Now with Firefox shipping WebGPU on Windows and Safari preparing support, 2025 is shaping up to be the first year you can target GPU compute across all major browsers. That’s huge.
In summary
Here’s how I like to frame it:
Wasm is the backbone. It runs the runtime, keeps things portable, and gives a CPU fallback for older devices.
WebGPU is the muscle. It handles the heavy math—matrix multiplications, attention, convolutions.
Runtimes tie it together. ONNX Runtime Web handles general-purpose models, Transformers.js makes pipelines easy, TensorFlow.js covers a wide range of prebuilt models, and WebLLM brings LLMs to the browser with an already familiar API.
Together, these tools let you ship apps that are both accessible and powerful.
Why on-device matters to me
Latency feels human. In the browser, every second counts; a time-to-first-token under a second lets users relax and explore.
Privacy by default. Inputs never leave the device unless the user chooses. For health, finance, or education apps, this is the best way forward.
Costs. Inference runs on the user’s hardware instead of landing on your server bill. That makes scaling more predictable.
Reach. The browser is the largest runtime in the world. A single link is all you need.
What’s next?
With WebGPU in Chrome, Edge, Firefox, and soon Safari, the baseline is stronger than ever. Mobile support is also improving fast.
I hope 5–10B parameter models will soon become normal on consumer devices, with quantization and smart caching doing the heavy lifting.
Developers don’t need to reinvent the wheel. The frameworks are here, the APIs are stabilizing, and the performance is finally good enough to make real apps.
Resources to explore
ONNX Runtime Web – run ONNX models in the browser with Wasm and WebGPU backends.
Transformers.js – Hugging Face’s library for running models directly in JavaScript.
TensorFlow.js – prebuilt models and training/inference support in the browser.
WebLLM – large language models running fully client-side with WebGPU acceleration.
MDN WebGPU Docs – a good overview of the WebGPU API itself.
Final thoughts
The browser is the runtime. Users already have it. Let’s build there.
With Wasm as the backbone and WebGPU as the muscle, we can deliver fast, private, and accessible AI experiences that were impossible just a couple of years ago.