<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Roman Solodkyi</title>
    <description>The latest articles on DEV Community by Roman Solodkyi (@roman_solodkyi_a1bb297d8f).</description>
    <link>https://dev.to/roman_solodkyi_a1bb297d8f</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3781852%2F55360229-2db7-4047-9285-479d4c9f9149.jpg</url>
      <title>DEV Community: Roman Solodkyi</title>
      <link>https://dev.to/roman_solodkyi_a1bb297d8f</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/roman_solodkyi_a1bb297d8f"/>
    <language>en</language>
    <item>
      <title>Zero-Cost AI: Running LLMs Locally in the Browser</title>
      <dc:creator>Roman Solodkyi</dc:creator>
      <pubDate>Fri, 10 Apr 2026 05:49:27 +0000</pubDate>
      <link>https://dev.to/roman_solodkyi_a1bb297d8f/zero-cost-ai-running-llms-locally-in-the-browser-o8n</link>
      <guid>https://dev.to/roman_solodkyi_a1bb297d8f/zero-cost-ai-running-llms-locally-in-the-browser-o8n</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Any application that can be written in JavaScript will eventually be written in JavaScript."&lt;/em&gt; — Atwood's Law&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you think running AI models requires a billion-dollar company and a GPU cluster worth the GDP of a small country, I have good news. All you need is a browser and good old JavaScript! Almost.&lt;/p&gt;

&lt;p&gt;The tools shown in this article make it surprisingly easy to run AI inference offline in just a few lines of code, giving you full control to build or prototype neat features at zero cost.&lt;/p&gt;

&lt;p&gt;The pros on paper:&lt;/p&gt;

&lt;p&gt;🆓 No inference cost - runs on the user's device&lt;br&gt;
🔒 Privacy by default - data never leaves the browser&lt;br&gt;
📵 Offline support - works without a connection&lt;br&gt;
🔓 No vendor lock-in - you own the stack&lt;br&gt;
⚡ Low integration overhead - minimal code to get started&lt;/p&gt;

&lt;p&gt;Of course, there are catches. First, models must be downloaded to run on the user's device, so your AI features won't be available immediately on first page load. They will, however, be cached and ready on later visits.&lt;/p&gt;

&lt;p&gt;Second, not every device is created equal, and not every model can run everywhere. I'll cover the limitations, and how to work around them, in more detail below.&lt;/p&gt;

&lt;p&gt;Given the initial loading time, it's reasonable to assume a pretty narrow set of use cases. Most importantly, the user should interact with the feature repeatedly over time. It makes no sense to force someone to wait that long for the model to download just to use it once. A better approach is to load it in the background, enabling additional functionality while the user continues to use your app. For SaaS apps, this works naturally: users return, the model stays cached, and AI becomes part of the product rather than a per-request cost.&lt;/p&gt;
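&lt;p&gt;One way to sketch that background-loading pattern: a shared loader that starts the download once and lets every feature await the same promise. The &lt;code&gt;loadModel&lt;/code&gt; callback here is a placeholder for whatever actually fetches your model - a Transformers.js &lt;code&gt;pipeline()&lt;/code&gt; call, for instance.&lt;/p&gt;

```javascript
// Lazy, shared model loader: starts the download once in the
// background and lets any feature reuse the same in-flight promise.
// `loadModel` is a placeholder for whatever actually fetches the model.
function createModelLoader(loadModel) {
    let promise = null;
    return {
        // Kick off the download without blocking the caller.
        warmUp() {
            if (promise === null) promise = loadModel();
            return promise;
        },
        // Features await this; it reuses the in-flight download.
        get ready() {
            return this.warmUp();
        },
    };
}
```

&lt;p&gt;Call &lt;code&gt;warmUp()&lt;/code&gt; right after app start; every AI feature then awaits &lt;code&gt;loader.ready&lt;/code&gt; and the model is fetched exactly once.&lt;/p&gt;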

&lt;p&gt;Good candidates, in my opinion, include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Smart autofill and suggestions&lt;/li&gt;
&lt;li&gt;Content summarization&lt;/li&gt;
&lt;li&gt;In-app Q&amp;amp;A over existing data&lt;/li&gt;
&lt;li&gt;Auto-tagging and classification&lt;/li&gt;
&lt;li&gt;Writing assistance&lt;/li&gt;
&lt;li&gt;Sentiment and intent detection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;First, a quick look at what makes all of this possible: the technologies that power the tools I'll show you, even if you never interact with them directly.&lt;/p&gt;
&lt;h2&gt;
  
  
  Behind the Scenes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;WebGPU&lt;/strong&gt;&lt;br&gt;
A modern API for GPU compute in the browser - the next-generation successor to WebGL. It exposes the capabilities of modern GPU hardware for high-performance graphics and general-purpose compute, which is exactly what makes in-browser AI/machine learning practical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ONNX&lt;/strong&gt;&lt;br&gt;
A portable cross-platform model format. Instead of being tied to PyTorch or TensorFlow, models can be converted once and run anywhere, including the browser.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WebAssembly (Wasm)&lt;/strong&gt;&lt;br&gt;
A near-native execution layer for the web. When WebGPU isn't available, Wasm runs models on the CPU - slower, but reliable and widely supported.&lt;/p&gt;

&lt;p&gt;With that out of the way, on to the tools.&lt;/p&gt;
&lt;h2&gt;
  
  
  Swiss Knife: Transformers.js
&lt;/h2&gt;

&lt;p&gt;The easiest entry point is &lt;a href="https://huggingface.co/docs/transformers.js" rel="noopener noreferrer"&gt;Transformers.js&lt;/a&gt; - essentially Hugging Face's Python library, but for JavaScript.&lt;/p&gt;

&lt;p&gt;It uses WebGPU when available and falls back to Wasm (CPU) otherwise, and supports a wide range of tasks: classification, summarization, translation, speech recognition, image classification, and more.&lt;/p&gt;

&lt;p&gt;The core abstraction is the pipeline: load a model, pass input, get output.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@huggingface/transformers&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;classifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sentiment-analysis&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;classifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;I love how easy it is to run ML models in the browser!&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// [{ label: 'POSITIVE', score: 0.9998 }]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl6vcuctbp3lsj94tjmis.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl6vcuctbp3lsj94tjmis.png" alt="Transformers.js - Setniment example" width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Same approach for a different model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@huggingface/transformers&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;classifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;zero-shot-classification&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;classifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Large language models can now run directly in the browser&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;technology&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sports&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;politics&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc5ny85vx1q8nvd2dvhpg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc5ny85vx1q8nvd2dvhpg.png" alt="Transformers.js - Zero-shot classification" width="800" height="570"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Models are downloaded once from a CDN and then cached in the browser, so on later visits they load almost instantly.&lt;/p&gt;

&lt;p&gt;For typical NLP tasks, Transformers.js models are generally browser-friendly: lightweight tasks like text embeddings or named-entity recognition are around 30 MB, while sentiment and text classification models are closer to 80 MB. Larger tasks like summarization, zero-shot classification, speech recognition, or image classification can reach hundreds of megabytes or even gigabytes. You can also run these in Node.js, where file size is less of a concern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;600 MB summarization model example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;summarizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;summarization&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Xenova/distilbart-cnn-6-6&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nx"&gt;summary_text&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;summarizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;longArticle&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpaovz466rwlunxmlhcwu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpaovz466rwlunxmlhcwu.png" alt="Transformers.js - Summarization" width="800" height="508"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What stands out is responsiveness. With small models, you can run inference on every keystroke and it still feels real-time. To keep the main thread responsive, I recommend running inference in a Web Worker.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// worker.js&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@huggingface/transformers&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;summarizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;summarization&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Xenova/distilbart-cnn-6-6&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nb"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onmessage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nx"&gt;summary_text&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;summarizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;max_new_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="nb"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;postMessage&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;summary_text&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
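&lt;p&gt;The worker above covers the model side; the page side is mostly wiring plus a debounce, so rapid keystrokes don't flood the worker with stale requests. The worker wiring below is a browser-only sketch; the debounce helper itself is plain JavaScript.&lt;/p&gt;

```javascript
// Debounce helper so per-keystroke input doesn't flood the worker:
// only the last call within `delay` ms actually fires.
function debounce(fn, delay) {
    let timer = null;
    return (...args) => {
        clearTimeout(timer);
        timer = setTimeout(() => fn(...args), delay);
    };
}

// Page-side wiring (browser only), assuming the worker.js above:
//   const worker = new Worker('worker.js', { type: 'module' });
//   worker.onmessage = ({ data }) => render(data.summary);
//   input.addEventListener('input', debounce(
//       (e) => worker.postMessage({ text: e.target.value }), 300));
```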



&lt;p&gt;In my opinion, Transformers.js is best suited for focused NLP tasks in the browser. You can run LLMs with it, but that's not where it truly shines. Classification, summarization, tagging, and other lightweight NLP tasks fit naturally into almost any product.&lt;/p&gt;

&lt;h2&gt;
  
  
  WebLLM: Your True AI Browser Assistant
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://webllm.mlc.ai/" rel="noopener noreferrer"&gt;WebLLM&lt;/a&gt; runs full instruction-tuned LLMs (Llama, Mistral, Qwen, etc.) directly in the browser using WebGPU. Under the hood, it uses &lt;a href="https://llm.mlc.ai/" rel="noopener noreferrer"&gt;MLC-LLM&lt;/a&gt; to compile models into an optimized WebGPU/Wasm target.&lt;/p&gt;

&lt;p&gt;The API is intentionally OpenAI-compatible. If you've built against the OpenAI chat completions API, the mental model transfers directly. Beyond basic chat, WebLLM supports &lt;strong&gt;streaming&lt;/strong&gt; via &lt;code&gt;AsyncGenerator&lt;/code&gt;, &lt;strong&gt;JSON mode&lt;/strong&gt; for structured output, &lt;strong&gt;seeding&lt;/strong&gt; for reproducible results, and preliminary &lt;strong&gt;function calling&lt;/strong&gt; (still WIP). It also supports &lt;strong&gt;Web Workers&lt;/strong&gt; and &lt;strong&gt;Service Workers&lt;/strong&gt;, meaning a loaded model persists across page reloads and works offline indefinitely.&lt;/p&gt;

&lt;p&gt;The setup is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;webllm&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@mlc-ai/web-llm&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;webllm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CreateMLCEngine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Llama-3.2-3B-Instruct-q4f32_1-MLC&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;initProgressCallback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="nx"&gt;progress&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Loading: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;progress&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;% - &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Explain WebGPU briefly&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
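&lt;p&gt;Streaming is worth wiring up early, since waiting seconds for a full completion feels worse than watching tokens arrive. Below is a sketch of consuming an OpenAI-style chunk stream - the shape WebLLM's &lt;code&gt;AsyncGenerator&lt;/code&gt; advertises - with a mock generator standing in for a real engine:&lt;/p&gt;

```javascript
// Consume an OpenAI-style streaming response: each chunk carries a
// small text delta at choices[0].delta.content. With WebLLM you'd
// pass the AsyncGenerator returned by
// engine.chat.completions.create({ stream: true, messages }).
async function collectDeltas(stream, onDelta) {
    let text = '';
    for await (const chunk of stream) {
        const delta = chunk.choices[0]?.delta?.content || '';
        text += delta;
        if (onDelta) onDelta(delta); // e.g. append to the UI as tokens arrive
    }
    return text;
}

// Mock stream standing in for a real engine, for illustration only.
async function* mockStream() {
    for (const piece of ['Web', 'GPU ', 'works.']) {
        yield { choices: [{ delta: { content: piece } }] };
    }
}
```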



&lt;p&gt;The main tradeoff is the initial download. Models are large (hundreds of MBs to several GBs), so progress feedback is essential. After that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The model is cached&lt;/li&gt;
&lt;li&gt;Loads fairly quickly on revisit&lt;/li&gt;
&lt;li&gt;Can work offline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo9dghonzfv64awtqsbm2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo9dghonzfv64awtqsbm2.png" alt="WebLLM - Chat example" width="800" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The API mirrors OpenAI's, so it's easy to adopt.&lt;/p&gt;

&lt;h3&gt;
  
  
  The VRAM reality
&lt;/h3&gt;

&lt;p&gt;The real constraint is VRAM. Every token in the context occupies GPU memory on the user's device, competing with the model weights themselves. A 3B model at 4-bit quantization takes ~2–3 GB just to load - whatever remains is your conversation budget. Precompiled MLC models typically ship with 4K–8K token context windows.&lt;/p&gt;
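&lt;p&gt;A back-of-envelope way to see why context eats VRAM: every token stores a key and a value vector per layer per KV head. The config numbers below are illustrative assumptions for a Llama-3.2-3B-class model (28 layers, 8 KV heads, head dimension 128, fp16 cache), not exact figures for any specific build.&lt;/p&gt;

```javascript
// Rough KV-cache size: 2 tensors (key and value) per token, per
// layer, per KV head, at bytesPerElem precision.
function kvCacheMB(layers, kvHeads, headDim, tokens, bytesPerElem) {
    const bytes = 2 * layers * kvHeads * headDim * tokens * bytesPerElem;
    return bytes / (1024 * 1024);
}

// Illustrative Llama-3.2-3B-class config (assumed values) at a 4K window:
const mb = kvCacheMB(28, 8, 128, 4096, 2);
console.log(`~${Math.round(mb)} MB of VRAM just for the context`); // ~448 MB
```

&lt;p&gt;Doubling the context window doubles that figure, which is why precompiled builds ship with modest windows.&lt;/p&gt;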

&lt;p&gt;What "decent hardware" means in practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Integrated graphics&lt;/strong&gt;: usually fine for office work, weak for local AI, because they rely on shared system memory and limited bandwidth.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget laptop dGPUs&lt;/strong&gt;: usable for lightweight inference, but often constrained by 4–8 GB VRAM and quickly hit limits on larger models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Midrange GPUs&lt;/strong&gt;: 8–12 GB VRAM is a practical sweet spot for local 3B–7B models and casual experimentation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High-end GPUs&lt;/strong&gt;: 16–24 GB VRAM makes bigger models and longer contexts much more manageable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apple Silicon&lt;/strong&gt;: unified memory means the model can use the machine's shared memory pool, which makes 16 GB and up much more useful for local AI than raw specs alone suggest.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The practical bar is roughly 6+ GB for smaller models and 8–12 GB for anything larger. A midrange gaming laptop from the last couple of years clears that. A typical thin-and-light with integrated graphics likely doesn't, which is exactly why this kind of feature should be opt-in.&lt;/p&gt;

&lt;p&gt;WebLLM requires a WebGPU-capable browser (Chrome/Edge 113+, Firefox 141+, Safari 26+) and degrades with a clear error rather than silently hanging, so feature detection and graceful fallback are straightforward.&lt;/p&gt;
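&lt;p&gt;Feature detection can be as small as checking &lt;code&gt;navigator.gpu&lt;/code&gt; and asking for an adapter. The navigator object is injected here so the helper is testable outside a browser; in real code you'd pass the global &lt;code&gt;navigator&lt;/code&gt;.&lt;/p&gt;

```javascript
// WebGPU feature detection. `nav` is injected so the check can run
// outside a browser; pass the global `navigator` in real code.
async function supportsWebGPU(nav) {
    if (!nav.gpu) return false;           // API not exposed at all
    const adapter = await nav.gpu.requestAdapter();
    return adapter !== null;              // null means no usable GPU
}

// Usage sketch: gate the feature, fall back to a server call or hide the UI.
// if (await supportsWebGPU(navigator)) { /* load WebLLM */ }
```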

&lt;p&gt;Where it fits best: an in-app chat assistant with zero per-token cost, contextual onboarding that adapts to app state, draft generation for emails or reports, or full AI functionality that survives going offline. It's also the fastest way to prototype. No backend, no API keys, no cost to iterate.&lt;/p&gt;

&lt;p&gt;Next, let's explore what a "truly native" browser AI could look like.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Zero-Setup Option: Chrome Built-in AI
&lt;/h2&gt;

&lt;p&gt;The third path doesn't involve any npm packages, CDN downloads, or model management at all. Chrome already ships with AI capabilities built directly into the browser - powered by Gemini Nano running on-device.&lt;/p&gt;

&lt;p&gt;Chrome 138 shipped the first batch of stable APIs - &lt;strong&gt;Summarizer&lt;/strong&gt;, &lt;strong&gt;Translator&lt;/strong&gt;, and &lt;strong&gt;Language Detector&lt;/strong&gt;. Since these are Chromium features, they work in Edge, Opera, Brave, and other Chromium-based browsers too. More are in the pipeline: a &lt;strong&gt;Writer&lt;/strong&gt; and &lt;strong&gt;Rewriter&lt;/strong&gt; for content generation, a &lt;strong&gt;Proofreader&lt;/strong&gt;, and a &lt;strong&gt;Prompt API&lt;/strong&gt; for open-ended Gemini Nano access (already stable for Chrome Extensions). No libraries to install, no models to host, no loading spinners. The model is part of the browser itself - if the user has a compatible browser, they have the model (or the browser downloads it transparently in the background, ~2 GB once). The full list and current status of each API is tracked on Chrome's &lt;a href="https://developer.chrome.com/docs/ai/built-in-apis" rel="noopener noreferrer"&gt;Built-in AI page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The Summarizer API is minimal. You create an instance, pass text, get a summary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;summarizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Summarizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;key-points&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// or 'tldr', 'teaser', 'headline'&lt;/span&gt;
    &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;markdown&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;length&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;medium&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;// 'short', 'medium', 'long'&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;summarizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;summarize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;longArticle&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Floe66mo3guebd5qqr9vw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Floe66mo3guebd5qqr9vw.png" alt="Chrome AI - Summarizer" width="800" height="506"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Four summary types cover the most common needs: key-points for bullet lists, tldr for a condensed paragraph, teaser for a hook, and headline for a single line. Combined with three length options, you get a reasonable amount of control without any prompt engineering.&lt;/p&gt;

&lt;p&gt;The Translator API follows the same pattern. You specify a language pair, and Chrome handles everything, including downloading language-specific packs on demand:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;translator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Translator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;sourceLanguage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;en&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;targetLanguage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;es&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;translated&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;translator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;translate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;The browser is the platform.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// "El navegador es la plataforma."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flyz1kpai6ohgjdp1a66o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flyz1kpai6ohgjdp1a66o.png" alt="Chrome AI - Translator" width="800" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can check availability before creating an instance. If a language pair hasn't been used before, the browser will need to download it first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;availability&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Translator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;availability&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;sourceLanguage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;en&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;targetLanguage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ja&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="c1"&gt;// 'available' | 'downloadable' | 'downloading' | 'unavailable'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
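&lt;p&gt;Those four states map naturally onto UI decisions. A sketch - the state strings come from the API, while the returned action labels are made up for illustration:&lt;/p&gt;

```javascript
// Map the Translator/Summarizer availability states to an app-level
// decision. The state strings come from the built-in AI APIs; the
// returned action names are illustrative labels for your own UI.
function planForAvailability(state) {
    switch (state) {
        case 'available':    return 'use';      // create() immediately
        case 'downloadable': return 'offer';    // ask first; create() triggers the download
        case 'downloading':  return 'wait';     // show progress via the monitor callback
        default:             return 'fallback'; // 'unavailable' or unknown
    }
}
```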



&lt;p&gt;Both APIs support a &lt;code&gt;monitor&lt;/code&gt; callback for tracking download progress, useful when Chrome needs to fetch the base model or a language pack for the first time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;summarizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Summarizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tldr&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;monitor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addEventListener&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;downloadprogress&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;loaded&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;%`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The major trade-off is obvious: it's Chromium-only. There's no Firefox support, no Safari support, no polyfill. If you're building for a broad audience, this is a non-starter as a primary feature. But for internal tools, Chrome-first SaaS apps, or progressive enhancements where you can feature-detect and fall back gracefully, it's hard to beat the simplicity. If your users are on Chrome and you need quick summarization or translation with zero setup, this is the path of least resistance.&lt;/p&gt;
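&lt;p&gt;Feature detection is straightforward, since the APIs are exposed as globals - you can branch before committing to the built-in path. A minimal sketch (the helper and the fallback label are hypothetical):&lt;/p&gt;

```javascript
// Decide which summarization backend to use. 'Summarizer' is a global
// in supporting Chromium builds and undefined everywhere else.
function pickSummarizerBackend(globalObj) {
    if (typeof globalObj.Summarizer !== 'undefined') {
        return 'chrome-builtin';
    }
    return 'fallback'; // e.g. Transformers.js, or a server-side call
}

console.log(pickSummarizerBackend(globalThis));
```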

&lt;p&gt;The other constraint is flexibility. You get what Chrome gives you: there's no model selection, no fine-tuning, no custom tasks. Summarizer summarizes. Translator translates. If you need more, you're back to the other tools in this article.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bonus: When the Browser Hits Its Ceiling - Ollama
&lt;/h2&gt;

&lt;p&gt;Sooner or later, the in-browser approaches above hit a wall: the model you need is too large, the context window is too small, the performance requirements are too demanding, or you need stricter guarantees around structured output.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; is the obvious next step. And yes, it's still easy to set up.&lt;/p&gt;

&lt;p&gt;It spins up a local server and exposes an OpenAI-compatible API. You can call it from Node.js or any other backend, or even directly from the browser with a plain &lt;code&gt;fetch&lt;/code&gt; when it's running on your machine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Ollama&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ollama&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ollama&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Ollama&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;llama3.2&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Summarize this incident report in 2 sentences: ...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;await &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
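&lt;p&gt;The same chat can be issued with no client library at all, via the OpenAI-compatible endpoint Ollama serves on its default port (11434). A sketch, with the request split out so it's easy to inspect; the model name assumes you've already pulled &lt;code&gt;llama3.2&lt;/code&gt;:&lt;/p&gt;

```javascript
// Build a request for Ollama's OpenAI-compatible chat endpoint.
function buildChatRequest(model, prompt) {
    return {
        url: 'http://localhost:11434/v1/chat/completions',
        options: {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify({
                model,
                messages: [{ role: 'user', content: prompt }],
            }),
        },
    };
}

// Usage (requires a running Ollama server):
async function chat(model, prompt) {
    const { url, options } = buildChatRequest(model, prompt);
    const res = await fetch(url, options);
    const data = await res.json();
    return data.choices[0].message.content;
}
```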



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fylg4upx9p91ac1w9fxba.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fylg4upx9p91ac1w9fxba.png" alt="Ollama - Chat example" width="800" height="491"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unlike WebLLM, which is limited to smaller quantized models that fit within a browser's VRAM budget, Ollama can run the full spectrum: from nimble 2B models to 70B ones capable of significantly richer reasoning and output quality. And since you control the hardware, context isn't constrained by end-user VRAM. Models like Llama 3.1 and Qwen 2.5 support up to 128K tokens natively.&lt;/p&gt;

&lt;p&gt;Feature-wise, it goes well beyond what a browser can offer. Structured output via JSON Schema, function/tool calling, multimodal input for vision models, embedding generation for semantic search, and a thinking mode for reasoning models like DeepSeek R1. The model library is extensive - from Gemma 3 1B (~1 GB) to Llama 3.3 70B (~43 GB at 4-bit quantization).&lt;/p&gt;
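&lt;p&gt;Structured output works by passing a JSON Schema in the request's &lt;code&gt;format&lt;/code&gt; field; Ollama then constrains generation to schema-conforming JSON. A sketch of the request body - the schema itself is an invented example:&lt;/p&gt;

```javascript
// Invented example schema: classify an incident report.
const incidentSchema = {
    type: 'object',
    properties: {
        severity: { type: 'string', enum: ['low', 'medium', 'high'] },
        summary: { type: 'string' },
    },
    required: ['severity', 'summary'],
};

// Request body for POST http://localhost:11434/api/chat
const body = {
    model: 'llama3.2',
    messages: [{ role: 'user', content: 'Classify this incident: ...' }],
    format: incidentSchema, // constrains the reply to valid JSON
    stream: false,
};

console.log(JSON.stringify(body.format.required)); // ["severity","summary"]
```

&lt;p&gt;The response's &lt;code&gt;message.content&lt;/code&gt; then parses cleanly into an object matching the schema - no prompt-engineering tricks required.&lt;/p&gt;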

&lt;p&gt;Nothing comes free, but the cost structure changes. Cloud AI providers charge per token, and at scale that compounds; at some point, self-hosting becomes cheaper. How soon depends on the model tier and volume - for a SaaS product with consistent AI usage, owning the inference layer is worth modeling out.&lt;/p&gt;

&lt;p&gt;Think internal tools, document processing pipelines, RAG over private data, or anything where the model or context length outgrows what a browser can handle.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reality Check: Performance and Constraints
&lt;/h2&gt;

&lt;p&gt;Performance varies by model, task, and hardware. A sentiment analysis model feels instant. A local LLM might take a few seconds per response. The user's device needs to be decent, and "decent" is doing real work in that sentence.&lt;/p&gt;

&lt;p&gt;These tools are best suited for &lt;strong&gt;desktop browsers&lt;/strong&gt;. On mobile, WebGPU support is limited and inconsistent, and even where it works, the hardware simply isn't built for this kind of workload. Expect slow inference, higher battery drain, and thermal throttling. Transformers.js with very small task models may work on mobile, but WebLLM should be considered desktop-only. Treating mobile as an unsupported tier for these features is a reasonable default.&lt;/p&gt;
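&lt;p&gt;A reasonable gate, then, is to check for a usable WebGPU adapter before offering the heavier in-browser features at all. A minimal sketch (the function name is mine):&lt;/p&gt;

```javascript
// Returns true only when WebGPU is present and an adapter is actually
// available - navigator.gpu can exist while requestAdapter() yields null.
async function canRunWebLLM(nav) {
    if (!nav || !nav.gpu) return false;
    try {
        const adapter = await nav.gpu.requestAdapter();
        return adapter !== null;
    } catch {
        return false;
    }
}

canRunWebLLM(globalThis.navigator).then((ok) =>
    console.log(ok ? 'WebLLM tier enabled' : 'Falling back / hiding feature')
);
```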

&lt;h2&gt;
  
  
  Side by Side
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Transformers.js&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;WebLLM&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Chrome Built-in AI&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Node.js + Ollama&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Runtime&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Browser (Wasm default, WebGPU optional)&lt;/td&gt;
&lt;td&gt;Browser (WebGPU required, no CPU fallback)&lt;/td&gt;
&lt;td&gt;Browser (Chromium 138+)&lt;/td&gt;
&lt;td&gt;Server / local machine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Model types&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Task-specific (NLP, CV, audio)&lt;/td&gt;
&lt;td&gt;Full LLMs (Llama, Mistral, Phi…)&lt;/td&gt;
&lt;td&gt;Summarizer, Translator, Language Detector&lt;/td&gt;
&lt;td&gt;Full LLMs + task models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Model size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~100 MB – ~500 MB&lt;/td&gt;
&lt;td&gt;~500 MB – ~8 GB+&lt;/td&gt;
&lt;td&gt;Ships with browser (~2 GB)&lt;/td&gt;
&lt;td&gt;~300 MB – ~240 GB+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context window&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;N/A (task models)&lt;/td&gt;
&lt;td&gt;4K–8K typical (VRAM-bound)&lt;/td&gt;
&lt;td&gt;N/A (task APIs)&lt;/td&gt;
&lt;td&gt;Up to 128K+ (configurable)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hardware req.&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Modest - Wasm fallback&lt;/td&gt;
&lt;td&gt;Modern GPU required&lt;/td&gt;
&lt;td&gt;Chrome with Gemini Nano&lt;/td&gt;
&lt;td&gt;GPU helpful, CPU works&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Privacy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;100% local&lt;/td&gt;
&lt;td&gt;100% local&lt;/td&gt;
&lt;td&gt;100% local&lt;/td&gt;
&lt;td&gt;Local (server-controlled)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Offline&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (after first load)&lt;/td&gt;
&lt;td&gt;Yes (after first load)&lt;/td&gt;
&lt;td&gt;Yes (after model ready)&lt;/td&gt;
&lt;td&gt;Yes (with local Ollama)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Classification, summarization, embeddings&lt;/td&gt;
&lt;td&gt;Conversational AI, text generation&lt;/td&gt;
&lt;td&gt;Quick summarization, translation&lt;/td&gt;
&lt;td&gt;Dev tools, internal apps, heavier workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What's Coming: WebNN and the NPU Era
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://www.w3.org/TR/webnn/" rel="noopener noreferrer"&gt;WebNN API&lt;/a&gt; (Web Neural Network) is in the W3C spec pipeline, and it's the most significant upcoming piece. Unlike WebGPU (GPU-only) or Wasm (CPU-only), WebNN is hardware-agnostic: it routes inference to whichever accelerator is available - CPU / GPU / NPU. It's currently the &lt;strong&gt;only browser API that exposes NPU access&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That matters as AI PCs with dedicated neural accelerators become mainstream: Intel Core Ultra, Qualcomm Snapdragon X, Apple Silicon's Neural Engine. WebNN reached W3C Candidate Recommendation status in 2024, Chromium-based browsers have experimental support behind flags, and ONNX Runtime Web (which powers Transformers.js) already has a WebNN execution provider in development.&lt;/p&gt;

&lt;p&gt;Neither WebLLM nor Transformers.js taps into WebNN in any meaningful way yet, but the connection is coming. When it lands, even the "modest hardware" bar drops significantly.&lt;/p&gt;
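&lt;p&gt;For the curious, the entry point looks roughly like this today - very much an experimental sketch, behind flags, and subject to change as the spec evolves (the &lt;code&gt;deviceType&lt;/code&gt; option reflects the draft at the time of writing):&lt;/p&gt;

```javascript
// Experimental: probe for WebNN and ask for an NPU-backed context.
// Works only in Chromium with the WebNN flag enabled; elsewhere returns null.
async function getNpuContext() {
    if (typeof navigator === 'undefined' || !('ml' in navigator)) {
        return null;
    }
    try {
        return await navigator.ml.createContext({ deviceType: 'npu' });
    } catch {
        return null; // no NPU, or option not supported by this build
    }
}

getNpuContext().then((ctx) =>
    console.log(ctx ? 'NPU context ready' : 'WebNN not available here')
);
```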

&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;Client-side AI is already here and becoming more accessible, steadily finding its niche in web applications. Next time you reach for an API key to add a simple text classification feature, consider whether a couple-dozen-MB model cached in the user's browser might do the job - for free, forever, with no data leaving their device.&lt;/p&gt;

&lt;p&gt;Personally, I've been shipping lightweight NLP features - like tag suggestions based on user input - in my apps, and Transformers.js makes that effortless. Ollama is my go-to for demos and prototyping: no API keys, no cost, instant iteration. WebLLM and Chrome AI are impressive in their own ways, but still very niche in practice. If you're serious about shipping, be mindful of the constraints.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;To experiment with all four approaches in one place, I built a companion demo - a React playground that lets you switch between Transformers.js, WebLLM, Chrome AI, and Ollama side by side.&lt;/p&gt;

&lt;p&gt;The Transformers.js tab runs live sentiment analysis, zero-shot classification, and text summarization - all in-browser via Web Workers. WebLLM provides a full chat interface with model selection, streaming responses, and configurable system prompts and context. Chrome AI demonstrates the built-in Summarizer and Translator APIs with no setup. Ollama connects to a local server with auto-discovery of installed models. Each tab demonstrates the real trade-offs firsthand: download sizes, loading times, response quality, and hardware requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/SolodkyRoman/browser-llm" rel="noopener noreferrer"&gt;→ Source on GitHub&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The code examples in this article use &lt;code&gt;@huggingface/transformers&lt;/code&gt; v4, &lt;code&gt;@mlc-ai/web-llm&lt;/code&gt; v0.2.x, and &lt;code&gt;ollama&lt;/code&gt; v0.5.x. APIs evolve quickly - check the official docs for the latest.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>javascript</category>
      <category>llm</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
