Gautam Vhavle

Posted on May 25

I Ran a 2-Billion Parameter AI Model in a Browser Tab. No Server.

#ai #machinelearning #llm #webdev

I ran a 2-billion-parameter language model entirely inside a browser tab.

No server. No API key. No cloud. Completely offline. Just Chrome, WebGPU, and my laptop's GPU generating tokens locally.

This was not a frontend talking to a hidden backend. The model loaded on my machine and replied at 20+ tokens/sec on an M1 MacBook Pro.

That means private AI, no inference bill, and no waiting on someone else's backend.

The first time it worked, it felt slightly illegal.

I had spent enough time with local LLM setups to know the usual shape of the problem. Download a model. Load it through a native runtime. Watch your fans spin up. Accept that the browser is just the nice front end sitting on top of the real machine learning system.

Then I hit a question I could not let go of: if my laptop can already run the model locally, why can't the browser talk to the same GPU directly?

That question became BrowserAI. It started as a local experiment and ended up as a public app that runs 100+ models from 13+ families, entirely client-side, with no backend inference at all.

It also runs on modern phones. The UI is responsive, the app is installable as a PWA, and smaller models are surprisingly usable on mobile hardware.

If you want to try the product before reading the internals:

Live demo: BrowserAI
Code: GitHub

My machine snapshot
Device: MacBook Pro (Apple M1, 8 GB RAM)
Browser: Chrome/Brave
Model: Qwen 3.5 2B q4f16_1 (4-bit quantized)
Observed speed: 20+ tokens/sec
Caveat: this is one personal result, not a universal benchmark

This article is the real technical story behind it: what made it possible, what was harder than expected, and how the whole stack works from first page load to streamed token.

It started with a very specific kind of curiosity

I was already deep in the usual local AI rabbit hole: llama.cpp, Ollama, quantized model files, token speed comparisons, context window tuning, all of it. Once you start running models on your own machine, your brain immediately begins asking annoying questions.

Mine was simple:

If the GPU is already here, why am I still treating the browser like a spectator?

At first I assumed the answer was some boring platform limitation. Browsers do UI. Native apps do heavy compute. End of story.

But that stopped being true the moment I started reading about WebGPU.

Then I found WebLLM from MLC-AI. That was the real turning point. WebLLM takes open models, compiles them through Apache TVM, and runs them on top of WebGPU with an API that feels suspiciously familiar if you've used OpenAI's SDK before.

I pointed it at Qwen 3.5 2B, opened Chrome, typed a message, and watched the reply stream back locally. No hidden server call. No magic API endpoint. Just my browser pushing work to my own GPU.

That was the moment the project stopped being an experiment and started becoming a product.

The whole project is really about solving three problems

Once I had the first prototype working, the architecture became much clearer. Running LLMs in the browser is not one problem. It is three separate problems stacked on top of each other:

Compute: How do you get browser code close enough to the GPU to make transformer inference practical?
Memory: How do you fit billion-parameter models into the memory a browser can realistically use?
Responsiveness: How do you do all that without freezing the UI and making the app feel broken?

BrowserAI exists because the modern browser finally has answers to all three.

Problem 1: getting real GPU compute in the browser

The foundation is WebGPU.

If WebGL was about teaching browsers how to draw, WebGPU is about teaching browsers how to compute. That difference matters a lot for AI. Large language models are mostly giant piles of matrix multiplications, and matrix multiplications are exactly the kind of work GPUs are absurdly good at.

What made this click for me was realizing that I did not need the browser to become PyTorch. I only needed it to get close enough to the GPU that a compiled runtime could use the hardware efficiently.

That is what WebGPU enables.

BrowserAI does not hand-write raw WebGPU kernels for inference. The heavy lifting is handled by @mlc-ai/web-llm, which in turn relies on Apache TVM to compile model execution for the browser. But BrowserAI does directly use the WebGPU capability layer to figure out what the device can handle before the user ever downloads a model.

Here is the key starting point:

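if (!("gpu" in navigator)) {
  throw new Error("WebGPU is not available in this browser.");
}

// requestAdapter() resolves to null when no suitable GPU is available.
const adapter = await navigator.gpu.requestAdapter();
if (!adapter) {
  throw new Error("No suitable WebGPU adapter was found.");
}

const adapterInfo = adapter.info; // vendor, architecture, and similar
const features = adapter.features; // e.g. features.has("shader-f16")
const limits = adapter.limits; // e.g. limits.maxStorageBufferBindingSize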

From that adapter, BrowserAI pulls the information that actually matters for model selection:

GPU vendor and architecture
Support for shader-f16
maxStorageBufferBindingSize
compute limits like workgroup size and bind group counts
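As a rough sketch of what pulling those fields into one object could look like (this is not BrowserAI's actual code): the `AdapterLike` and `GpuProfile` shapes and the `summarizeAdapter` name are hypothetical, and the adapter is typed structurally so the snippet runs without WebGPU type definitions. The field names it reads (`info.vendor`, `features.has("shader-f16")`, `limits.maxStorageBufferBindingSize`, and so on) are the standard WebGPU ones.

```typescript
// Structural stand-in for the parts of a WebGPU adapter read below, so the
// sketch type-checks and runs without WebGPU typings or a real GPU.
interface AdapterLike {
  info: { vendor: string; architecture: string };
  features: { has(name: string): boolean };
  limits: {
    maxStorageBufferBindingSize: number;
    maxComputeWorkgroupSizeX: number;
    maxBindGroups: number;
  };
}

// Hypothetical profile shape; the real app may track different fields.
interface GpuProfile {
  vendor: string;
  architecture: string;
  supportsShaderF16: boolean;
  maxStorageBufferBindingSize: number;
  maxComputeWorkgroupSizeX: number;
  maxBindGroups: number;
}

function summarizeAdapter(adapter: AdapterLike): GpuProfile {
  return {
    // Browsers may report empty strings here, so fall back to "unknown".
    vendor: adapter.info.vendor || "unknown",
    architecture: adapter.info.architecture || "unknown",
    supportsShaderF16: adapter.features.has("shader-f16"),
    maxStorageBufferBindingSize: adapter.limits.maxStorageBufferBindingSize,
    maxComputeWorkgroupSizeX: adapter.limits.maxComputeWorkgroupSizeX,
    maxBindGroups: adapter.limits.maxBindGroups,
  };
}
```

In the browser, the object returned by `navigator.gpu.requestAdapter()` can be passed in directly once it has been null-checked.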

The most interesting part is VRAM estimation.

WebGPU does not simply tell you, "here is your VRAM budget." So the app estimates it from maxStorageBufferBindingSize, then applies a correction for unified-memory systems like Apple Silicon. That correction matters because an M-series Mac shares RAM between CPU and GPU, so a naive estimate can be way too optimistic.

In practice, BrowserAI caps unified-memory machines to about 65% of reported system memory for GPU work, then keeps another chunk in reserve so the model has room for the KV cache during generation. That is the difference between a model that loads reliably and one that crashes after making the user wait through a 2 GB download.
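To make the budget logic concrete, here is a minimal sketch under stated assumptions. The 65% unified-memory cap comes from the paragraph above; the 15% KV-cache reserve, the input shape, and the function name are hypothetical, and the step that turns `maxStorageBufferBindingSize` into a raw estimate is deliberately left out because it is not spelled out here.

```typescript
interface VramInputs {
  rawEstimateGB: number; // raw estimate derived from adapter limits (derivation not shown)
  systemMemoryGB: number; // system memory as reported to the app
  unifiedMemory: boolean; // true for shared CPU/GPU memory, e.g. Apple Silicon
}

const UNIFIED_MEMORY_CAP = 0.65; // share of system memory allowed for GPU work
const DEFAULT_KV_RESERVE = 0.15; // hypothetical share held back for the KV cache

function usableVramGB(input: VramInputs, kvReserve: number = DEFAULT_KV_RESERVE): number {
  // On unified-memory machines the GPU competes with the OS and the browser
  // for the same RAM, so never trust more than the capped share.
  const capped = input.unifiedMemory
    ? Math.min(input.rawEstimateGB, input.systemMemoryGB * UNIFIED_MEMORY_CAP)
    : input.rawEstimateGB;
  // Keep headroom for the KV cache that grows during generation.
  return capped * (1 - kvReserve);
}
```

With these assumed numbers, an 8 GB unified-memory laptop ends up with roughly 4.4 GB of usable budget, which is why a model that "should" fit by a naive estimate can still be filtered out.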

This hardware-aware step is why BrowserAI can sort machines into useful tiers instead of just pretending every GPU is the same.

Tier	Approx VRAM	Realistic model range
Low	under 2.5 GB	0.5B to 1B
Medium	2.5 GB to 6 GB	1B to 4B
High	6 GB to 12 GB	4B to 9B
Ultra	12 GB+	9B and above
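The table maps naturally onto a small lookup. This is only an illustration: the function name is made up, and treating each boundary value (2.5, 6, 12) as belonging to the higher tier is my assumption, since the table does not say which side the edges fall on.

```typescript
type Tier = "Low" | "Medium" | "High" | "Ultra";

// Classify an estimated VRAM budget (in GB) using the thresholds from the table.
function classifyTier(vramGB: number): Tier {
  if (vramGB < 2.5) return "Low";
  if (vramGB < 6) return "Medium";
  if (vramGB < 12) return "High";
  return "Ultra";
}
```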

That is also why the default recommendation in BrowserAI is Qwen 3.5 2B q4f16_1. It is small enough to fit on a wide range of modern machines, but strong enough to feel useful immediately.

On my machine, that first setup streamed at 20+ tokens per second. That is not a universal benchmark and I would never present it as one, but it was the moment I knew browser inference had crossed the line from novelty to product-worthy.

Problem 2: fitting a giant into browser memory

Even with GPU compute available, there is a very annoying reality check: model weights are huge.

A 2B model in full 32-bit precision is roughly 8 GB (2 billion weights at 4 bytes each). That is already too big for a lot of browser contexts. A 9B model at fp32 is roughly 36 GB, way beyond what most users can realistically load.
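The arithmetic behind those sizes is just parameters times bits per weight. The helper below is an illustrative sketch (the name is made up, and 1 GB is taken as 10^9 bytes). It counts raw weight storage only, so real footprints come out higher once quantization scales, the KV cache, and runtime buffers are added.

```typescript
// Raw weight storage in GB (10^9 bytes): parameters x bits per weight / 8.
function rawWeightGB(paramCount: number, bitsPerWeight: number): number {
  return (paramCount * bitsPerWeight) / 8 / 1e9;
}

const fp32For2B = rawWeightGB(2e9, 32); // 8
const fp32For9B = rawWeightGB(9e9, 32); // 36
const int4For2B = rawWeightGB(2e9, 4); // 1
```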

This is where quantization stops being an optimization and becomes the entire reason the app exists.

The core idea is simple. At inference time, you usually do not need the full precision the model had during training. So instead of storing every weight as a 32-bit float, you store compressed approximations and recover enough fidelity during computation to keep quality high.
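A toy version of that idea, for intuition only: the sketch below squeezes a small group of weights into 4-bit codes (16 levels) plus a scale and an offset, then reconstructs approximations from them. It is a generic min/max scheme that I am using for illustration, not the exact layout the MLC formats use, and the codes are stored one per byte for readability rather than packed two per byte.

```typescript
interface QuantizedGroup {
  codes: Uint8Array; // one 4-bit code (0..15) per weight, left unpacked for clarity
  scale: number; // step size between adjacent levels
  min: number; // value represented by code 0
}

function quantizeGroup(weights: Float32Array): QuantizedGroup {
  let min = Infinity;
  let max = -Infinity;
  weights.forEach((w) => {
    if (w < min) min = w;
    if (w > max) max = w;
  });
  // 4 bits give 16 levels, so 15 steps span the range; guard the flat case.
  const scale = max > min ? (max - min) / 15 : 1;
  const codes = Uint8Array.from(weights, (w) => Math.round((w - min) / scale));
  return { codes, scale, min };
}

function dequantizeGroup(group: QuantizedGroup): Float32Array {
  // Each reconstructed weight is within half a step of the original.
  return Float32Array.from(group.codes, (c) => c * group.scale + group.min);
}
```

Each weight costs 4 bits instead of 32, plus a small per-group overhead for `scale` and `min`, and the reconstruction error stays within half a step.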

In BrowserAI, the most important formats are:

q4f16_1: 4-bit weights with float16 activations
q4f32_1: 4-bit weights with float32 activations
q0f16: unquantized float16, used only for small models where the trade-off makes sense

The practical effect is massive.

Model	q4f16_1	q4f32_1	q0f16
Qwen 3.5 0.8B	1.6 GB	1.9 GB	2.7 GB
Qwen 3.5 2B	2.2 GB	2.6 GB	-
Qwen 3.5 4B	3.9 GB	4.7 GB	-
Qwen 3.5 9B	6.4 GB	-	-

That is the trick. Not a small trick. The trick.

Without quantization, browser inference is mostly an academic flex. With good 4-bit formats, billion-parameter models become realistic for normal hardware.

The easiest way to think about it is image compression. A raw image can be huge. A compressed image can be tiny while still looking almost identical to your eye. Quantization does something similar for model weights. It compresses the numbers hard enough to make deployment possible without wrecking the behavior that actually matters to the user.

There is still a trade-off, of course. Smaller is not free. But for a browser chat app, this is one of the best deals in modern ML engineering: dramatic memory savings, minor practical quality loss, and a product people can actually use.

One more subtle point matters here too: q4f16_1 is usually the best default, but it depends on shader-f16 support. If a machine cannot do that efficiently, BrowserAI can fall back to broader-compatibility options like q4f32_1. That is why hardware detection and quantization strategy are not separate topics. They are the same decision from two angles.
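That decision can be sketched as a small function. This is a hedged illustration rather than BrowserAI's actual selection code: the function name is hypothetical, and returning `null` when a model ships no compatible variant is my assumption about how a missing fallback might be handled (in the size table above, the 9B model only lists q4f16_1).

```typescript
type Quantization = "q4f16_1" | "q4f32_1";

// Prefer q4f16_1 when the GPU supports shader-f16, otherwise fall back to
// q4f32_1 if the model ships that variant. Null means nothing compatible.
function pickQuantization(
  available: ReadonlyArray<Quantization>,
  supportsShaderF16: boolean,
): Quantization | null {
  if (supportsShaderF16 && available.indexOf("q4f16_1") !== -1) return "q4f16_1";
  if (available.indexOf("q4f32_1") !== -1) return "q4f32_1";
  return null;
}
```

In the browser, `supportsShaderF16` would come from `adapter.features.has("shader-f16")`.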

Problem 3: making it feel like an actual app instead of a frozen tab

Now comes the part most demos get wrong.

Even if the model can run locally, the experience still falls apart if the browser locks up every time the user sends a message. A responsive chat app cannot block the main thread while it downloads gigabytes of weights or generates tokens one by one.

This is where WebLLM, Apache TVM, and Web Workers all click together.

The runtime stack

At a high level, the stack looks like this:

Apache TVM compiles model execution for the browser
WebAssembly handles orchestration and control flow
WebGPU handles the heavy numerical work on the GPU
WebLLM wraps all of that in a developer-friendly runtime
Web Workers keep the inference work off the UI thread

This is what makes the project feel less like a hack and more like a real runtime.

WebLLM exposes an API that is intentionally familiar:

let fullReply = "";

const chunks = await engine.chat.completions.create({
  messages: chatHistory,
  stream: true,
  stream_options: { include_usage: true },
  temperature: 0.7,
  max_tokens: 1024,
});

for await (const chunk of chunks) {
  // The final usage chunk may carry no choices, hence the optional chaining.
  const delta = chunk.choices[0]?.delta?.content ?? "";
  fullReply += delta;
  onToken(fullReply);
}

That surface is deceptively simple. Underneath it, the runtime is tokenizing input, dispatching compiled GPU work, sampling outputs, tracking usage, and streaming token deltas back to the UI.

Why the worker architecture matters

All active inference in BrowserAI runs inside a dedicated worker. The worker file is almost absurdly small:

import { WebWorkerMLCEngineHandler } from "@mlc-ai/web-llm";

const handler = new WebWorkerMLCEngineHandler();

self.onmessage = (msg: MessageEvent) => {
  handler.onmessage(msg);
};

That tiny file is enough because WebLLM does the hard orchestration. The important design choice is not the code size. It is the separation of concerns.

The main thread handles UI, input, rendering, chat state, and progress feedback.

The worker handles model loading and token generation.

That means the user can keep typing, scrolling, switching chats, or watching progress indicators while the browser is doing serious inference work in the background.

BrowserAI also uses the same pattern for background downloads. If the user wants to pre-cache another model while chatting with the current one, the app spins up a separate worker, downloads and initializes the model just long enough to get it into cache, then unloads it. That one detail makes the app feel much more intentional because users can prepare larger models without interrupting their current session.

There are even model-specific runtime quirks hidden in the plumbing. For example, Gemma 3 needs a special sliding-window attention configuration during initialization. That kind of detail does not show up in screenshots, but it is exactly the sort of thing you have to handle if you want a broad model catalog instead of a single hero demo.

The part that turns a demo into a product: caching and persistence

The first load is the tax you pay for local AI.

If a model is 2.2 GB, there is no clever copywriting trick that makes that small. The first visit can be slow. The download can be heavy. On a weak connection, it can be annoying.

The only acceptable answer is to make that pain happen once.

BrowserAI uses three storage layers for that:

Cache API for model files and weight shards
localStorage for chats, preferences, and active session metadata
sessionStorage for remembering the currently loaded model across refreshes

The model cache is the most interesting part. WebLLM stores the downloaded artifacts in browser caches, and BrowserAI inspects those caches to figure out which models are complete. It does not just look for a random file and assume success. It checks for model-specific cache entries, finds the manifest file such as ndarray-cache.json or tensor-cache.json, and verifies that the expected shards are present. If the manifest is missing, it falls back to a heuristic.

That sounds like a tiny detail, but it solves a real product problem: partial downloads. If a user closes the tab halfway through a large model, you do not want the app pretending that the model is fully ready the next time they visit.
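The shard check can be sketched as a pure function. Treat the details as assumptions: I am assuming the manifest exposes a `records` array whose entries carry a `dataPath` for each shard file, and the `missingShards` name and URL layout are hypothetical. In the browser, the set of cached URLs would come from the Cache API (`caches.open(...)` followed by `cache.keys()`).

```typescript
// Assumed manifest shape: one record per shard file, addressed by dataPath.
interface ShardManifest {
  records: Array<{ dataPath: string }>;
}

// Returns the shard URLs the manifest expects but the cache does not hold.
// An empty result means the download completed.
function missingShards(
  manifest: ShardManifest,
  baseUrl: string,
  cachedUrls: ReadonlySet<string>,
): string[] {
  const prefix = baseUrl.endsWith("/") ? baseUrl : baseUrl + "/";
  return manifest.records
    .map((record) => prefix + record.dataPath)
    .filter((url) => !cachedUrls.has(url));
}
```

A model would only be marked as ready when the returned list is empty, which is what keeps a half-finished download from being reported as complete.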

Then there is the Service Worker.

The Service Worker in BrowserAI uses a simple, practical strategy:

navigation requests are network-first
static assets are cache-first

That means the app shell loads quickly on repeat visits, but the HTML can still stay fresh when the user is online. The result is that once the app and a model are cached, BrowserAI feels much closer to a local desktop app than a website.
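The routing decision itself is tiny. The sketch below shows only that decision, under the simplifying assumption that every non-navigation request is a static asset; a real Service Worker would likely treat other request types separately, and the names here are hypothetical. The `mode === "navigate"` check is the standard way to recognize a page navigation on a fetch request.

```typescript
type CacheStrategy = "network-first" | "cache-first";

// Structural stand-in for the one Request field this decision needs.
interface RequestLike {
  mode: string;
}

function pickStrategy(request: RequestLike): CacheStrategy {
  // Page navigations try the network first so the HTML stays fresh;
  // everything else is served from cache when possible.
  return request.mode === "navigate" ? "network-first" : "cache-first";
}
```

Inside a `fetch` event handler, `pickStrategy(event.request)` would select which of the two code paths answers the request.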

There is also a nice quality-of-life piece here: when a model is successfully loaded, the app saves that model ID in sessionStorage. Refresh the tab, and BrowserAI can auto-load the same model from cache instead of making the user start over.

This is the part many technical writeups skip, but it matters enormously. Users do not experience your stack as WebGPU plus TVM plus quantization. They experience it as: "I opened the page, downloaded a model once, and next time it was just there."

This is edge AI, just in a browser tab

At this point it is worth naming what BrowserAI actually is.

This is edge AI.

The model runs where the user already is, on the local device, close to the data and close to the interface. There is no round trip to a remote inference server for every prompt. The browser is acting as the deployment layer, but the execution model is the same core idea driving edge AI everywhere else: move intelligence to the edge instead of forcing every interaction back through the cloud.

What makes the browser version interesting is that it combines two things that usually do not coexist.

Edge execution: inference happens on the user's own hardware
Web distribution: the app is still one URL away, with no install flow, no native packaging, and no platform-specific client build

That combination changes the trade-offs in ways that are both technical and product-facing.

From a technical perspective, edge AI improves four things immediately:

Privacy and data locality. Prompts, drafts, notes, and conversations can stay on-device instead of being shipped to a vendor API by default.
Latency. You remove network round trips, server queueing, and cold-start behavior. The remaining bottleneck is local compute, which is often the better bottleneck to have for interactive work.
Offline resilience. Once the model and app shell are cached, the system can keep working on a plane, in a tunnel, or behind a flaky enterprise network.
Cost structure. There is no per-token bill and no inference fleet to operate for each user session. You pay in client hardware constraints instead of server spend.

That does not mean edge AI is automatically better. It means the bottlenecks move.

On the edge, you stop worrying as much about API throughput and start worrying about VRAM limits, thermal throttling, battery drain, storage quotas, browser support, and model footprint. Debugging also gets more complicated because the hardware matrix is messy. A fast Apple Silicon laptop, an older Windows machine, and a modern phone can all support the same product in principle while delivering very different ceilings in practice.

That is why BrowserAI spends so much effort on detection, model filtering, quantization, caching, and worker isolation. Those are not just implementation details. They are the product surface area of edge AI. If you are going to run models on user devices, you need to adapt to the device you actually got, not the one you wish you had.

I also think edge AI changes what kinds of products become worth building.

If inference can happen locally, you can justify AI features in places where cloud inference feels awkward or too expensive: private note tools, internal enterprise assistants, offline field software, education apps, developer tools, on-device copilots, kiosks, and mobile experiences that should not fall apart when connectivity does. Even when a product eventually uses a hybrid architecture, pushing some tasks to the edge can reduce cloud cost, improve responsiveness, and narrow the privacy boundary.

The bigger shift is philosophical. For the last few years, most AI products trained users to think of intelligence as something that lives in somebody else's data center. Edge AI pushes in the other direction. It says a surprising amount of useful intelligence can live on your laptop, your phone, your workstation, or your browser session.

I do not think the future is "edge everywhere, servers nowhere." Large models, shared memory, enterprise coordination, and heavy retrieval workflows will keep plenty of AI in the cloud. But I do think a lot of current cloud usage is there because the tooling was easier, not because the architecture was inherently right.

That is why projects like this matter to me. They make edge AI feel less like a research demo and more like a normal software design option.

What actually happens when you send a message

The full flow is simpler than it sounds once you flatten it:

You type a message in the chat UI.
BrowserAI appends it to local chat state and sends the formatted history to the WebLLM engine.
The engine call is proxied into a worker.
WebLLM tokenizes the prompt, runs the compiled model through WASM plus WebGPU, and streams token deltas back.
The UI updates the assistant bubble in real time and stores stats like tokens per second, total tokens, and generation time.
The finished response and its metadata get persisted to localStorage.

Those stats are one of my favorite parts of the product. BrowserAI does not just stream text. It surfaces runtime information too. After a generation finishes, the app can show tokens generated, prompt size, generation speed, and context usage. That makes the whole system feel transparent instead of mystical.

And that transparency matters, because browser AI still feels unbelievable to a lot of people the first time they see it.
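For a sense of how stats like these can be derived, here is a minimal sketch. It assumes OpenAI-style usage counts (`prompt_tokens`, `completion_tokens`) as delivered at the end of a stream, wall-clock timestamps taken around the generation, and a known context window size; the function and field names on the result are hypothetical.

```typescript
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
}

interface GenerationStats {
  promptTokens: number;
  completionTokens: number;
  seconds: number;
  tokensPerSecond: number;
  contextUsedFraction: number;
}

function summarizeGeneration(
  usage: Usage,
  startMs: number,
  endMs: number,
  contextWindow: number,
): GenerationStats {
  const seconds = Math.max(0, endMs - startMs) / 1000;
  return {
    promptTokens: usage.prompt_tokens,
    completionTokens: usage.completion_tokens,
    seconds,
    // Wall-clock speed; report 0 rather than dividing by a zero duration.
    tokensPerSecond: seconds > 0 ? usage.completion_tokens / seconds : 0,
    // Share of the context window consumed by prompt plus reply.
    contextUsedFraction: (usage.prompt_tokens + usage.completion_tokens) / contextWindow,
  };
}
```

Note that measuring from send to last token folds prompt processing into the speed figure, so it will read slightly lower than a pure decode rate.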

The constraints are real, and saying that makes the project stronger

One thing I wanted this project to avoid was fake magic.

Browser AI is real. It is also constrained.

Here are the honest limits:

The first download is heavy. A local model that feels instant on later visits still has to arrive somehow.
WebGPU support is not universal. Modern Chrome, Edge, and recent Safari builds are the target. Older browsers are a hard stop.
It runs on mobile too, but mobile still has tighter limits. Modern phones can run the app, especially with smaller models, but thermals, memory ceilings, and browser support still make laptops and desktops the better experience for heavier use.
Bigger models still need real VRAM. Quantization helps a lot, but physics still wins.
Performance is hardware-dependent. Tokens per second varies wildly across machines.

In my experience, being explicit about these limits does not make the project less impressive. It makes it more credible.

The exciting part is not that local browser AI is magically free of constraints. It is that those constraints are now reasonable enough to build a product around.

Why I think this matters

What started as a curiosity project became a much bigger idea for me.

If the browser can do real GPU compute, if open models can be quantized aggressively enough to fit consumer hardware, and if runtimes like WebLLM can package all of that into something usable, then the browser is no longer just a thin client for AI.

It becomes a real runtime.

That changes the product equation in a few important ways:

private inference becomes easier
API costs disappear for many use cases
offline AI starts feeling practical
distribution gets much simpler because the browser is already the platform

That does not mean servers go away. It means we finally get to choose which workloads belong on the server and which ones are perfectly happy living on the device.

BrowserAI is my attempt to push that idea into something people can actually touch.

If you try it, start with Qwen 3.5 2B. It is the best first impression of the stack: small enough to load on ordinary hardware, fast enough to feel alive, and strong enough to make the whole approach click.

Open Chrome. Visit BrowserAI. Pick a model. Let it download once. Then send a prompt and watch your own machine do the work.

That feeling is the whole point.

If you try it, tell me your setup: chip or GPU, browser, model, and tokens/sec. I'm especially curious which low-end machines surprise me. The code is open source if you want to inspect the internals or build on top of it.

⭐ GitHub Repo

Top comments (1)

Harjot Singh • May 31

"It felt slightly illegal" is the perfect reaction, because we've all internalized that real inference happens on someone else's GPU with a meter running. 20+ tokens/sec for a 2B model on an M1 via WebGPU is the genuinely interesting threshold: it's fast enough to be useful, and the no-server, no-key, no-bill, fully-private properties are exactly the ones that matter for a specific tier of work. The honest framing I'd add: this doesn't replace cloud frontier models, it reshapes what you route where. The browser-local 2B is perfect for the high-volume, low-stakes, privacy-sensitive tier (classify, summarize, redact, autocomplete) where shipping data to a cloud was always overkill, and you keep the expensive cloud call for the hard reasoning that earns it. Two tiers, two homes. The privacy angle is underrated too, some data should never leave the device, and local inference makes that a default rather than a promise. That route-the-task-to-the-right-tier thinking is core to how I approach cost and privacy in Moonshift. What's the model load time like, is the first-token wait (downloading and warming weights) the real UX tax versus the steady-state tokens/sec?