There's a deeply embedded belief in the dev community about AI in production that's, with all due respect, just wrong: that to put an LLM in your app you absolutely need an API key, a server doing the inference, and someone paying the OpenAI bill at the end of the month. The default architecture in 2025 is: frontend → API call → cloud → response. Always. No exceptions.
Nope.
Last week I ran Gemma — Google's open model — directly in the browser. No API keys. No server. No network latency. The model downloaded, loaded into client memory, and inference ran right there, on the user's device. And the moment I saw the first response generate without a single request leaving the network... hold on. This changes everything.
Gemma LLM in the browser without API keys: what it is and why it matters
Before getting into the code, quick context for anyone who missed the previous post about running a small LLM in Next.js.
Gemma is Google DeepMind's family of open-weights models. The small ones — Gemma 2B, Gemma 3 1B — have a reasonable size for running on consumer hardware. What's new in 2025 is that with WebGPU and the right libraries, that "consumer hardware" includes the user's browser.
The tools that make this possible:
- WebGPU API: direct GPU access from the browser, no plugins
- @huggingface/transformers.js: Transformers ported to the browser, WebAssembly + WebGPU
- MediaPipe LLM Inference API: Google's approach, optimized specifically for Gemma
I went with Transformers.js because I already had experience with the Hugging Face ecosystem, and because the distribution model — loading weights from CDN with browser caching — felt like the most practical approach for a real app context.
The experiment: real code, no magic
I started simple. React component, no server, inference on the client. This is the code I actually ran:
```tsx
// components/GemmaLocal.tsx
// Inference runs entirely in the browser — no API calls
'use client';

import { useState, useRef } from 'react';
// Import pipeline from transformers.js — runs in the browser
import { pipeline, TextGenerationPipeline } from '@huggingface/transformers';

type LoadState = 'idle' | 'loading' | 'ready' | 'error';

export function GemmaLocal() {
  const [state, setState] = useState<LoadState>('idle');
  const [progress, setProgress] = useState(0);
  const [response, setResponse] = useState('');
  const [input, setInput] = useState('');
  const pipelineRef = useRef<TextGenerationPipeline | null>(null);

  const loadModel = async () => {
    setState('loading');
    try {
      // Gemma 2B instruct — ~1.5GB on first load, cached after that
      // The model downloads once and lives in the browser's Cache Storage
      pipelineRef.current = await pipeline(
        'text-generation',
        'Xenova/gemma-2b-it', // quantized version, lighter weight
        {
          // Use WebGPU if available, fallback to WASM
          device: 'webgpu',
          progress_callback: (info: { progress?: number }) => {
            // Check the type, not truthiness — a progress of 0 is valid
            if (typeof info.progress === 'number') {
              setProgress(Math.round(info.progress));
            }
          },
        }
      );
      setState('ready');
    } catch (error) {
      console.error('Error loading Gemma:', error);
      setState('error');
    }
  };

  const generateResponse = async () => {
    if (!pipelineRef.current || !input.trim()) return;
    setResponse('');
    // Gemma instruct template — important for getting good responses
    const prompt = `<start_of_turn>user\n${input}<end_of_turn>\n<start_of_turn>model\n`;
    const result = await pipelineRef.current(prompt, {
      max_new_tokens: 256,
      // Streaming: each token is emitted as soon as it's generated
      // Response appears progressively without waiting for a server
      callback_function: (output: Array<{ generated_text: string }>) => {
        const text = output[0]?.generated_text ?? '';
        // Extract only the model's part, without the prompt
        const pureResponse = text.split('<start_of_turn>model\n').pop() ?? '';
        setResponse(pureResponse);
      },
    });
    return result;
  };

  return (
    <div className="p-6 max-w-2xl mx-auto">
      {state === 'idle' && (
        <button
          onClick={loadModel}
          className="px-4 py-2 bg-blue-600 text-white rounded"
        >
          Load Gemma (first time: ~1.5GB)
        </button>
      )}
      {state === 'loading' && (
        <div>
          <p>Downloading model... {progress}%</p>
          {/* After the first load this won't show — the browser caches it */}
          <p className="text-sm text-gray-500">
            First time only. After that it loads instantly.
          </p>
        </div>
      )}
      {state === 'ready' && (
        <div className="space-y-4">
          <textarea
            value={input}
            onChange={(e) => setInput(e.target.value)}
            className="w-full p-3 border rounded"
            placeholder="Your question..."
            rows={3}
          />
          <button
            onClick={generateResponse}
            className="px-4 py-2 bg-green-600 text-white rounded"
          >
            Generate (no internet needed)
          </button>
          {response && (
            <div className="p-4 bg-gray-50 rounded">
              <p className="whitespace-pre-wrap">{response}</p>
            </div>
          )}
        </div>
      )}
    </div>
  );
}
```

```tsx
// app/demo-local/page.tsx
// Standalone page — zero server components needed for inference
import { GemmaLocal } from '@/components/GemmaLocal';

export default function DemoLocalPage() {
  return (
    <main>
      <h1>Gemma in the browser — 100% local inference</h1>
      {/* This component makes zero fetches to any server of ours */}
      <GemmaLocal />
    </main>
  );
}
```
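The instruct template and the prompt-stripping logic in `generateResponse` are easy to get subtly wrong, so I find it cleaner to isolate them as pure functions you can unit test without a model loaded. A sketch — the helper names are mine, not part of Transformers.js:

```typescript
// Gemma instruct-template helpers — pure string logic, no model needed.
// Helper names are illustrative, not part of any library.

const MODEL_TURN = '<start_of_turn>model\n';

// Wrap user input in Gemma's chat template
function buildGemmaPrompt(userInput: string): string {
  return `<start_of_turn>user\n${userInput}<end_of_turn>\n${MODEL_TURN}`;
}

// Strip the echoed prompt so only the model's reply remains
function extractModelResponse(generatedText: string): string {
  const afterMarker = generatedText.split(MODEL_TURN).pop() ?? '';
  // The model may emit a closing <end_of_turn> of its own — drop it too
  return afterMarker.replace(/<end_of_turn>[\s\S]*$/, '').trim();
}
```

Keeping these pure also means that when a new Gemma version changes the template, there's exactly one place to update.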
What happened: first load, ~1.5GB download (4-bit quantized model). Slow. But after that first load, the browser caches it in Cache Storage. Second visit: the model's already there, loads in seconds.
And inference speed: on a machine with a discrete GPU, typically 5–15 tokens per second; on mine, an RTX 3060, I hit 20 tokens/sec. It's not GPT-4 Turbo, but for specific tasks — classification, short summarization, data extraction — it works.
The "hold on, this changes everything" moment
After it worked, I killed the WiFi. Typed a question. The response came anyway.
I've been watching compute migrate for 30+ years. The pattern has always been the same: power starts centralized, democratizes toward the edge, and at some point lands on the device. The Amiga did in the client what used to require a mainframe. The internet café where I worked at 14 had more compute than entire institutions from ten years prior. Every generation, the client eats a piece of the server.
What just happened with LLMs is exactly that same movement, but in fast-forward.
The concrete implications:
No inference billing. Zero API cost. The user brings their own GPU. If your app has 100,000 active users doing 50 queries a day, with GPT-4 those are numbers that hurt. With client-side inference, that's literally zero dollars of inference cost.
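The back-of-the-envelope math makes the point. The per-token price and tokens-per-query below are illustrative assumptions for the estimate, not anyone's current list price:

```typescript
// Rough monthly inference cost, cloud API vs. client-side.
// All figures are illustrative assumptions, not real pricing.
const USERS = 100_000;
const QUERIES_PER_USER_PER_DAY = 50;
const TOKENS_PER_QUERY = 500;          // assumed prompt + response
const PRICE_PER_1K_TOKENS_USD = 0.01;  // assumed blended cloud price
const DAYS = 30;

const monthlyTokens =
  USERS * QUERIES_PER_USER_PER_DAY * TOKENS_PER_QUERY * DAYS;
const cloudCostUsd = (monthlyTokens / 1_000) * PRICE_PER_1K_TOKENS_USD;
const clientSideCostUsd = 0; // the user brings their own GPU

// At these assumptions: 75 billion tokens/month → $750,000 vs. $0
console.log(`Cloud: ~$${cloudCostUsd.toLocaleString('en-US')}/month`);
```

Even if you assume a price an order of magnitude lower, the client-side column stays at zero.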
No network latency. The round-trip to a server in us-east-1 from Argentina is 200–300ms before the first token even starts arriving. Local: 0ms. For UX this is brutal — the difference between "waiting for it to load" and "responds instantly."
No data leaving the device. For use cases with sensitive data — legal documents, medical notes, proprietary code — local inference changes the game entirely. The data doesn't travel anywhere.
Connecting this to what I wrote about sandboxes for coding agents: part of the problem with giving an agent real autonomy is the cost and latency of every LLM call. If the model runs locally, the economics of the problem change completely.
Mistakes and gotchas I walked straight into
Not everything was pretty. The real problems:
WebGPU isn't everywhere. Firefox has it behind a flag. Safari added it in recent versions. The WebAssembly fallback works, but it's 3–5x slower. You need feature detection and a graceful degraded experience.
```typescript
// Check for support before trying to load
const checkWebGPU = async (): Promise<boolean> => {
  if (!navigator.gpu) return false;
  try {
    const adapter = await navigator.gpu.requestAdapter();
    return adapter !== null;
  } catch {
    return false;
  }
};

// Choose device based on support
const device = (await checkWebGPU()) ? 'webgpu' : 'wasm';
```
The first load is a real UX problem. 1.5GB on the first visit is a lot. I had to add an explicit "installation" screen with clear progress. Treat it like a PWA that installs, not a page that loads.
RAM. The quantized model needs ~1–2GB of RAM. On devices with 4GB total, this can freeze the tab. You need to set expectations and offer a cloud API fallback for devices that can't handle it.
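One way to gate this is `navigator.deviceMemory` — a real API, but Chromium-only and capped at reporting 8GB, so treat a missing value as "unknown" rather than "low". A sketch; the threshold is my assumption:

```typescript
// Decide whether to attempt local inference based on reported RAM.
// navigator.deviceMemory is Chromium-only and capped at 8 (GB),
// so undefined means "unknown", not "low-memory device".
const MIN_DEVICE_MEMORY_GB = 8; // assumption: ~2GB model + OS + tab headroom

type InferenceTarget = 'local' | 'cloud-fallback';

function chooseTarget(deviceMemoryGb: number | undefined): InferenceTarget {
  if (deviceMemoryGb === undefined) return 'cloud-fallback'; // play it safe
  return deviceMemoryGb >= MIN_DEVICE_MEMORY_GB ? 'local' : 'cloud-fallback';
}

// In the browser: chooseTarget((navigator as Navigator & { deviceMemory?: number }).deviceMemory)
```

Defaulting the unknown case to the cloud fallback trades some free inference for never freezing a 4GB laptop's tab.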
The model is small — it acts like it. Gemma 2B is not GPT-4. For short summarization, classification, and tasks with lots of context in the prompt, it handles itself well. For complex reasoning or long generation, the results are noticeably worse. I calibrated my expectations after an hour of testing. The trick is designing the task for the model, not the other way around.
This connected to something I learned optimizing a Next.js app I brought down from 3 seconds to 300ms: performance doesn't come from hitting a magic button, it comes from understanding what's actually happening and designing around that reality.
Limited context window. The quantized model I used has an effective context of 2048 tokens. Send it a long document and it truncates without telling you. I had to implement explicit chunking.
```typescript
// Basic chunking to avoid blowing the context window
const MAX_TOKENS_APPROX = 1500; // safety margin
const CHARS_PER_TOKEN_APPROX = 4;
const MAX_CHARS = MAX_TOKENS_APPROX * CHARS_PER_TOKEN_APPROX;

const truncateContext = (text: string): string => {
  if (text.length <= MAX_CHARS) return text;
  // Truncate from the beginning, preserve the end (usually more relevant)
  return '...' + text.slice(text.length - MAX_CHARS);
};
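Truncation is enough when only the tail of the document matters. When the whole thing matters, the alternative is splitting into overlapping chunks and running one inference per chunk. A sketch using the same chars-per-token heuristic; the overlap size is my assumption:

```typescript
// Split long text into chunks that each fit the context window,
// with some overlap so no chunk loses its surrounding context.
const CHUNK_TOKENS = 1500;
const CHARS_PER_TOKEN = 4;
const CHUNK_CHARS = CHUNK_TOKENS * CHARS_PER_TOKEN; // 6000
const OVERLAP_CHARS = 400; // assumed overlap between consecutive chunks

function chunkText(text: string): string[] {
  if (text.length <= CHUNK_CHARS) return [text];
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + CHUNK_CHARS));
    start += CHUNK_CHARS - OVERLAP_CHARS;
  }
  return chunks;
}
```

The char-count heuristic is crude — a real tokenizer would be more accurate — but for keeping a 2048-token window from silently truncating, it's good enough.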
I felt this same pain while working with Claude Code in February — context management is the problem nobody has fully solved yet.
FAQ: Gemma LLM in the browser without API keys
Which browsers support WebGPU for running Gemma?
Chrome 113+ and Edge have stable support. Safari 18+ supports it. Firefox has it behind dom.webgpu.enabled in about:config — not in production yet. For real production today, Chrome/Edge are the safe target. Always implement a WebAssembly fallback for everyone else.
How big is the model and how do I handle the first download?
Gemma 2B quantized at 4-bit weighs ~1.4–1.6GB. The first download is real and it takes time — on slow connections it can be 5–10 minutes. The key is treating it like a PWA installation: explicit progress screen, explanation that it's a one-time thing, and that the browser caches it in Cache Storage afterward. Subsequent visits: loads in seconds.
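For that progress screen, I like showing an honest time estimate instead of a bare percentage. A sketch — in the browser you could feed it `navigator.connection.downlink` where available, here it's just a parameter:

```typescript
// Estimate download time for the model at a given bandwidth.
function downloadMinutes(modelSizeGb: number, bandwidthMbps: number): number {
  const sizeMegabits = modelSizeGb * 1024 * 8; // GB → megabits
  const seconds = sizeMegabits / bandwidthMbps;
  return Math.round(seconds / 60);
}

// A 1.5GB model: ~4 min at 50 Mbps, ~10 min at 20 Mbps
```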
How fast is inference compared to a cloud API?
It depends a lot on hardware. On a modern discrete GPU (RTX 3060+): 15–25 tokens/second with WebGPU. On integrated hardware (Apple Silicon M1): 8–15 tokens/sec. On CPU via WASM: 1–3 tokens/sec, noticeably slow. OpenAI/Anthropic APIs deliver 50–100 tokens/sec with better quality. The advantage of local isn't raw speed — it's zero network latency and zero cost.
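That tradeoff is easy to quantify: total response time is first-token latency plus token count over throughput. A quick sketch with the numbers above:

```typescript
// Total seconds to produce a response of `tokens` length.
function responseSeconds(
  tokens: number,
  tokensPerSec: number,
  firstTokenLatencyMs: number
): number {
  return firstTokenLatencyMs / 1000 + tokens / tokensPerSec;
}

// 256 tokens: cloud at 75 tok/s + 300ms round-trip vs. local at 20 tok/s
const cloud = responseSeconds(256, 75, 300); // ≈ 3.7s total
const local = responseSeconds(256, 20, 0);   // = 12.8s total
// Cloud wins on total time; local wins on time-to-first-token (0ms vs 300ms)
```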
Does it work completely offline?
Yes, and that's the part that rewired my mental model. Once the model is cached, inference runs without a single network request. I tested it by killing the WiFi. It works. This opens up use cases that were previously impossible: apps for areas with intermittent connectivity, tools that handle sensitive data that can't leave the device, features that work on planes/subways/wherever.
Is this production-ready or just an experiment?
Right now it sits somewhere between advanced experiment and early-adopter production. Cases where it already makes sense: apps with sensitive data (legal, medical, personal notes), nice-to-have features where the fallback is simply not having them, tech-savvy users with good hardware. Cases where it doesn't scale yet: mass consumer experience on mobile with varied hardware, tasks that require the reasoning level of large models, apps where a 1.5GB first download kills the conversion funnel.
What about mobile?
WebGPU on mobile is in development but limited. Chrome on Android is advancing, iOS Safari has partial support. The big problem is RAM — phones with 4–6GB don't have room to load a 1.5GB model. Gemma 1B (the smaller version, ~700MB quantized) is more viable for mobile. Honest reality: mobile-first with local inference is still 1–2 years away from being reliable.
The compute always migrates to the edge
What I experienced with Gemma in the browser is the same pattern I saw when the internet café where I worked started having more power than enterprise servers from five years prior. Compute always migrates to the edge. Always.
I'm not saying cloud APIs are going to disappear. GPT-4, Claude, Gemini Pro — for cases that need maximum capability, they'll keep being the answer. But there's a whole category of features — classification, summarization, extraction, contextual assistance — where a small model running on the client solves the problem just as well, with no API cost, no network latency, and no data leaving the device.
The biggest shift for me wasn't technical. It was conceptual: I stopped thinking of "LLM in my app" as synonymous with "API call to a cloud endpoint." Now it's a real architectural decision: does this model go on the server, at the edge, or on the client?
And once you start asking that question, you can't stop asking it.
If you already read the post about small LLMs in Next.js and want to go one step further, this is that step. Grab Transformers.js, load Gemma, kill your WiFi, and ask it something. The first time it responds without a single packet leaving your network, you'll have the same moment I had.
Worth it.