This is a submission for the Gemma 4 Challenge: Write About Gemma 4
I spent 11 days shipping AULA — an AI tutor that runs Gemma 4 entirely inside the browser for Latin American students without reliable internet. The build forced me to learn things about deploying Gemma 4 in production that the official documentation glosses over.
This post distills the 5 patterns I wish someone had handed me on day one. Every one of them cost me hours (or in one case, an entire afternoon) to figure out. If you're shipping Gemma 4 to real users on real hardware, this is the playbook I would have wanted.
If you want to see the result first, AULA's repo is here: github.com/jpablortiz96/aula. The Build with Gemma 4 submission has the full demo. This post is the technical postmortem.
Pattern 1 — MediaPipe is the right runtime, not transformers.js (yet)
If you Google "run Gemma 4 in the browser", you'll mostly find tutorials using transformers.js (the @huggingface/transformers package). It's a fantastic library and the obvious starting point. I started there too.
On my development laptop — a Windows machine with an NVIDIA RTX 3050 (Ampere, 6 GB VRAM) — transformers.js with WebGPU gave me 2 tokens per second. The benchmarks I'd seen online claimed 20-30 tok/s on similar hardware. Something was very wrong.
After a full afternoon of debugging (chrome://gpu, Task Manager GPU monitor, NVIDIA Control Panel, Vulkan flags, switching to Edge), I found the root cause: on NVIDIA Optimus laptops, dispatch was routing through the integrated Intel UHD GPU, not the discrete NVIDIA. WebGPU's requestAdapter({ powerPreference: 'high-performance' }) is ignored on Windows (Chromium bug 369219127). The model "worked" but ran on the wrong silicon.
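A quick way to check which GPU the browser actually handed you is to log the adapter info before loading any model. The sketch below is a diagnostic under stated assumptions, not code from AULA: `looksIntegrated` and `reportAdapter` are hypothetical helper names, the Intel check is a rough heuristic (vendor strings are browser-defined, and discrete Intel GPUs exist), and `adapter.info` is only exposed in recent Chromium builds.

```typescript
// Minimal structural types so the sketch compiles without @webgpu/types.
interface AdapterInfoLike {
  vendor: string;
  architecture: string;
  description: string;
}
interface AdapterLike {
  info?: AdapterInfoLike;
}
interface GpuLike {
  requestAdapter(opts?: {
    powerPreference?: "low-power" | "high-performance";
  }): Promise<AdapterLike | null>;
}

// Hypothetical heuristic: treat Intel adapters as integrated graphics.
// This is a hint for debugging, not a guarantee.
function looksIntegrated(info: AdapterInfoLike): boolean {
  const text = `${info.vendor} ${info.description}`.toLowerCase();
  return text.includes("intel");
}

// Ask for the high-performance adapter and report what was actually returned.
async function reportAdapter(gpu: GpuLike): Promise<string> {
  const adapter = await gpu.requestAdapter({ powerPreference: "high-performance" });
  if (!adapter) return "no WebGPU adapter available";
  if (!adapter.info) return "adapter found, but adapter.info is not exposed";
  const { vendor, architecture, description } = adapter.info;
  const warning = looksIntegrated(adapter.info) ? " (likely the integrated GPU)" : "";
  return `${vendor} / ${architecture} / ${description}${warning}`;
}
```

In a browser you would pass `navigator.gpu` as the `gpu` argument and log the returned string. If it names the Intel adapter on an Optimus laptop, you are probably hitting the routing problem described above.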
What fixed it: migrating to @mediapipe/tasks-genai with the WebGPU delegate.
```typescript
import { FilesetResolver, LlmInference } from "@mediapipe/tasks-genai";

// Load the WASM runtime files for the GenAI tasks
const fileset = await FilesetResolver.forGenAiTasks(
  "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm"
);

const llm = await LlmInference.createFromOptions(fileset, {
  baseOptions: {
    modelAssetPath: "https://huggingface.co/litert-community/gemma-4-e2b-it/resolve/main/gemma-4-e2b-it-int4-web.task",
  },
  maxTokens: 2048, // combined budget for input and output tokens
  topK: 40,
  temperature: 0.7,
});

const response = await llm.generateResponse(prompt);
```
Same hardware. Same model. Throughput jumped from 2 tok/s to 14-16 tok/s, a 7-8x speedup just from switching runtimes.
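If you want to reproduce this kind of comparison on your own hardware, a small timing helper is enough. This is a minimal sketch, not AULA's benchmark code: `TokenCountingLlm`, `tokensPerSecond`, and `measureThroughput` are hypothetical names, and the interface is assumed to mirror the `generateResponse` and `sizeInTokens` methods of MediaPipe's `LlmInference`.

```typescript
// Structural type covering the two methods this sketch relies on.
interface TokenCountingLlm {
  generateResponse(prompt: string): Promise<string>;
  sizeInTokens(text: string): number | undefined;
}

// Pure helper: tokens per second, guarding against a zero or negative duration.
function tokensPerSecond(tokenCount: number, elapsedMs: number): number {
  if (elapsedMs <= 0) return 0;
  return tokenCount / (elapsedMs / 1000);
}

// Times one full generation and reports throughput for the produced text.
async function measureThroughput(llm: TokenCountingLlm, prompt: string): Promise<number> {
  const start = Date.now();
  const output = await llm.generateResponse(prompt);
  const elapsedMs = Date.now() - start;
  // Fall back to a rough whitespace split if the runtime cannot count tokens.
  const tokens = llm.sizeInTokens(output) ?? output.split(/\s+/).filter(Boolean).length;
  return tokensPerSecond(tokens, elapsedMs);
}
```

Because the timer also covers prompt processing, the result slightly understates pure decode speed. That is fine for a before/after comparison as long as both runs use the same prompt.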
MediaPipe is Google's official runtime for Gemma on edge devices. The team optimized the dispatch path specifically for the WebGPU delegate. It's also the only path that supports the .task artifact format Google publishes for browser deployment.
Lesson: if you're targeting consumer hardware in 2026, MediaPipe is the production runtime. transformers.js is excellent for prototyping but has not yet caught up on dispatch quality across all GPU/OS combinations. Use MediaPipe for the local engine and revisit transformers.js in 6-12 months.
Pattern 2 — Pick the right Gemma 4 variant for the constraint, not the benchmark
Gemma 4 comes in four variants, and the marketing pages emphasize the 31B Dense and 26B MoE as the headline models. For a browser deployment, the only variant that actually matters is the E2B (~2 billion effective parameters, quantized to ~1.5 GB).
Here's the honest tradeoff matrix I built when picking the model for AULA's local engine:
| Variant | Size on disk | Runs in browser? | Math/reasoning quality | When to use |
|---|---|---|---|---|
| Gemma 4 E2B-IT | ~1.5 GB (q4f16) | ✅ Yes, WebGPU | Good for conversational tutoring | Local browser deployments |
| Gemma 4 E4B-IT | ~3 GB (q4f16) | ⚠️ Only on 8 GB+ VRAM GPUs | Slightly better than E2B | Mid-range GPUs only |
| Gemma 4 26B-A4B-IT (MoE) | ~13 GB | ❌ Cloud only | Near-31B quality, lower latency | Cloud API for structured output |
| Gemma 4 31B-IT (Dense) | ~16 GB | ❌ Cloud only | Best reasoning | When latency doesn't matter |
For AULA's offline-first use case, the picking logic was straightforward: the model has to fit in memory on a Raspberry Pi 5 (8 GB of RAM shared between CPU and GPU). E4B is too big once you account for KV cache plus browser overhead. E2B fits with margin.
The non-obvious learning: on my RTX 3050 (6 GB VRAM), I tried to ship with E4B because it scored better on benchmarks. The model loaded but spilled into shared system memory via PCIe, dropping inference to ~1.8 tok/s. Switching to E2B (which actually fits in dedicated VRAM) jumped me back to 14+ tok/s.
Rule of thumb: for in-browser inference, the right model is the largest one that fits entirely in dedicated VRAM after counting ~1.5 GB of browser/system overhead. Anything larger spills to PCIe and is unusable.
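The rule of thumb can be written down as a small selection function. This is a sketch under stated assumptions: `VARIANTS` reuses the approximate on-disk sizes from the table above, and `pickLargestFitting` and its `overheadGb` parameter are hypothetical names, not part of AULA.

```typescript
interface Variant {
  name: string;
  sizeGb: number;
}

// Approximate on-disk sizes taken from the tradeoff table.
const VARIANTS: Variant[] = [
  { name: "E2B-IT", sizeGb: 1.5 },
  { name: "E4B-IT", sizeGb: 3 },
  { name: "26B-A4B-IT", sizeGb: 13 },
  { name: "31B-IT", sizeGb: 16 },
];

// Returns the largest variant whose weights fit in dedicated VRAM
// after subtracting the overhead budget, or null if none fits.
function pickLargestFitting(vramGb: number, overheadGb = 1.5): Variant | null {
  const budget = vramGb - overheadGb;
  const fitting = VARIANTS.filter((v) => v.sizeGb <= budget);
  if (fitting.length === 0) return null;
  return fitting.reduce((best, v) => (v.sizeGb > best.sizeGb ? v : best));
}
```

Note that with the default 1.5 GB overhead a 6 GB card still admits E4B, even though E4B spilled on my 6 GB RTX 3050. On-disk size understates runtime memory (KV cache, activations), so treat the check as a lower bound and raise `overheadGb` until it matches what you measure. For example, `pickLargestFitting(6, 3.5)` selects E2B, but 3.5 is an illustrative value, not a measured one.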
For Cloud Boost (the optional half of AULA), I picked 26B-A4B over 31B Dense despite the lower parameter count. The mixture-of-experts architecture activates only ~4B parameters per forward pass, giving 2-3x lower latency at near-31B quality. For short structured outputs (a quiz JSON, a Mermaid diagram), this latency difference is the difference between "feels instant" and "user gives up".
Pattern 3 — Don't force small models into rigid structured output
This is the pattern I had to relearn three times before accepting it.
Gemma 4 E2B is brilliant at conversational tasks: math explanations, language tutoring, Socratic dialogue, multi-step reasoning in plain text. It is not reliable at:
- Producing valid JSON without surrounding prose
- Generating syntactically-valid Mermaid diagrams
- Outputting coherent SVG with proper geometry
- Following "respond ONLY with X" instructions consistently
This is not a bug. It's a known property of small open models. The "instruction following" capability scales roughly with parameter count, and at 2B effective parameters, E2B sits at the edge.
My first three approaches all failed:
- Stricter prompts. "Respond ONLY with valid JSON, no markdown, no prose." Worked 70% of the time. The other 30% the model added an explanation paragraph or a "Here is the JSON:" prefix.
- Higher temperature for diversity, lower for structure. Marginal improvement, but introduced its own failure modes.
- Tolerant JSON parser that strips fences and reaches for the first { character. Helped, but didn't fix the cases where the model produced almost-valid JSON with unescaped quotes inside string values.
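For reference, the tolerant parser from the third approach looks roughly like this. It is a minimal sketch, not AULA's exact code, and `tolerantJsonParse` is a hypothetical name. It shows both what the technique recovers and where it gives up.

```typescript
// Returns the parsed value, or null when no valid JSON object can be recovered.
function tolerantJsonParse(raw: string): unknown {
  // Drop Markdown code fence markers (three backticks plus an optional language tag).
  const unfenced = raw.replace(/`{3}[a-zA-Z]*/g, "");
  // Keep everything from the first "{" to the last "}".
  const start = unfenced.indexOf("{");
  const end = unfenced.lastIndexOf("}");
  if (start === -1 || end <= start) return null;
  try {
    return JSON.parse(unfenced.slice(start, end + 1));
  } catch {
    // Almost-valid JSON (for example unescaped quotes) still fails here.
    return null;
  }
}
```

This handles a "Here is the JSON:" prefix and fenced output, but a value like "He said "hi"" with unescaped inner quotes still returns null, which is exactly the failure mode that pushed me to the routing approach.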
What actually worked: route structured-output features to a larger model in the cloud (26B-A4B), keep the local model for conversational features, and be transparent about the routing in the UI.
In AULA, every screen shows a badge: green for local, blue for cloud. The user always knows which engine answered. This is the design pattern I'd argue for as a general principle:
Don't pretend your small model can do something it can't. Make the limitation a UX surface, not a hidden failure mode.
Here's the shape of the routing logic:
```typescript
// Routing decision per feature, not per request
function chooseEngine(feature: Feature, hasApiKey: boolean): EngineId {
  const structuredOutputFeatures: Feature[] = [
    "infinite-practice", // requires JSON
    "svg-illustration",  // requires valid SVG
    "mermaid-mindmap",   // requires strict syntax
    "interactive-quiz",  // requires JSON array
    "handwriting-ocr",   // requires vision (E2B is text-only)
  ];

  if (structuredOutputFeatures.includes(feature)) {
    return hasApiKey ? "cloud-boost" : "unavailable";
  }
  return "local"; // chat, voice, calculator, Socratic, etc.
}
```
And the user always sees the routing decision, with an honest reason if cloud isn't available:
```typescript
if (engine === "unavailable") {
  showInfoMessage(
    "This feature needs Cloud Boost. Add your free Google AI Studio API key in Settings to unlock it. The rest of AULA works offline regardless."
  );
}
```
Pattern 4 — LlmInference is exclusive. Build a queue.
This bit me on day 9 and cost me half a day to diagnose.
MediaPipe's LlmInference instance is a singleton with exclusive access. It can process exactly one generation at a time. If you call generateResponse() while a previous generation is still in flight, you get:
Previous invocation or loading is still ongoing.
In a single-page app with multiple routes (chat, practice, mind maps), this is easy to trigger:
- User starts a long response in /chat
- User navigates to /practice before it finishes
- /practice tries to generate an exercise
- The model is locked. Everything breaks.
The fix: a FIFO queue with abort propagation across navigations.
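```typescript
import { LlmInference } from "@mediapipe/tasks-genai";

// A queued generation plus a way to fail it if it is dropped before running
type QueuedTask = { run: () => Promise<void>; cancel: (err: Error) => void };

class LocalEngine {
  private isGenerating = false;
  private currentAbort: AbortController | null = null;
  private queue: QueuedTask[] = [];

  constructor(private llm: LlmInference) {}

  async generate(prompt: string, opts: GenerateOptions): Promise<string> {
    // Cancel any in-flight generation
    if (this.isGenerating) {
      this.abort();
      await new Promise((r) => setTimeout(r, 200)); // small buffer
    }
    return new Promise<string>((resolve, reject) => {
      const run = async () => {
        this.isGenerating = true;
        const controller = new AbortController();
        this.currentAbort = controller;
        try {
          const result = await this.llm.generateResponse(prompt);
          // Discard the output if this generation was aborted while in flight
          if (controller.signal.aborted) reject(new Error("Generation aborted"));
          else resolve(result);
        } catch (err) {
          reject(err);
        } finally {
          this.isGenerating = false;
          this.currentAbort = null;
          void this.queue.shift()?.run();
        }
      };
      // Still busy after the buffer: wait for the running generation to finish
      if (this.isGenerating) this.queue.push({ run, cancel: reject });
      else void run();
    });
  }

  abort() {
    this.currentAbort?.abort();
    // Reject dropped tasks so their callers do not wait forever
    for (const task of this.queue) task.cancel(new Error("Generation aborted"));
    this.queue = [];
  }

  // Recovery path when the model gets stuck
  forceReset() {
    this.isGenerating = false;
    this.currentAbort = null;
    this.queue = [];
  }
}
```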
Critical: every component that uses the engine must call abort() on unmount.
```typescript
useEffect(() => {
  return () => engine.abort();
}, []);
```
Without this cleanup, navigating away mid-generation leaves the model locked, and the next page that wants to generate will silently hang.
Pattern 5 — Gemma 4 26B does not stream reliably. Use generateContent, not streamGenerateContent.
This one took an afternoon and a careful read of the DevTools Network tab to find.
The Gemini API exposes two endpoints for Gemma 4 models:
```
POST .../models/gemma-4-26b-a4b-it:generateContent        ← Full response
POST .../models/gemma-4-26b-a4b-it:streamGenerateContent  ← SSE chunks
```
For chat use cases, you obviously want streaming. So I wired everything through :streamGenerateContent?alt=sse and assumed it would Just Work.
It did, for chat. It returned 400 Bad Request for AULA's Practice and Mind Map features.
The DevTools investigation revealed: when the prompt requested structured output (JSON, SVG, Mermaid), the streaming endpoint failed with 400 while the non-streaming endpoint succeeded with the same payload. I never got a clear root cause from the API — it may be a Gemma-specific quirk in how streamGenerateContent handles certain responseSchema configurations or thinking-mode trailers.
The fix that unblocked everything: two separate API client paths.
```typescript
// For chat: streaming, long responses
async function streamChat(prompt: string, onToken: (t: string) => void) {
  const res = await fetch(
    `${BASE}/gemma-4-26b-a4b-it:streamGenerateContent?alt=sse&key=${apiKey}`,
    { method: "POST", body: JSON.stringify({ contents: [...] }) }
  );
  // ...parse SSE chunks, call onToken per chunk
}

// For structured output: single-shot, short responses
async function cloudGenerate(opts: CloudOptions): Promise<string> {
  const res = await fetch(
    `${BASE}/gemma-4-26b-a4b-it:generateContent?key=${apiKey}`,
    {
      method: "POST",
      body: JSON.stringify({
        systemInstruction: { parts: [{ text: opts.system }] },
        contents: [{ role: "user", parts: [{ text: opts.prompt }] }],
        generationConfig: { temperature: 0.85, maxOutputTokens: 1024 },
      }),
    }
  );
  const data = await res.json();
  // Filter out "thought" parts (Gemma 4 thinking mode)
  const parts = data.candidates?.[0]?.content?.parts ?? [];
  return parts
    .filter((p) => !p.thought)
    .map((p) => p.text ?? "")
    .join("");
}
```
Lesson: for short structured outputs, streaming buys you nothing. The user is waiting for one complete artifact, not a slow reveal of text. Use generateContent. Save streaming for genuine chat.
One more detail worth flagging: Gemma 4 has a thinking mode that emits "thought" parts in the response. If you naively concatenate all parts[].text, you'll surface the model's chain-of-thought in the user-visible output. Filter on part.thought === true and skip those. AULA's chat looked very weird until I added that filter — the model was literally showing its work to the student, which is not the goal.
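The SSE parsing step elided in streamChat can apply the same thought filter chunk by chunk. The sketch below is hedged accordingly: `drainSse`, `StreamPart`, and `StreamChunk` are hypothetical names, and it assumes each SSE event arrives as a single `data:` line containing one complete JSON object with the same candidates/content/parts shape as the non-streaming response.

```typescript
interface StreamPart {
  text?: string;
  thought?: boolean;
}
interface StreamChunk {
  candidates?: Array<{ content?: { parts?: StreamPart[] } }>;
}

// Extracts user-visible text from every complete "data:" line in the buffer
// and returns the unfinished tail so the caller can prepend it to the next read.
function drainSse(buffer: string): { texts: string[]; rest: string } {
  const lines = buffer.split(/\r?\n/);
  const rest = lines.pop() ?? ""; // the last line may be incomplete
  const texts: string[] = [];
  for (const line of lines) {
    if (!line.startsWith("data:")) continue; // skip blank lines and other SSE fields
    const payload = line.slice(5).trim();
    if (payload === "") continue;
    const chunk = JSON.parse(payload) as StreamChunk; // throws on malformed JSON
    const parts = chunk.candidates?.[0]?.content?.parts ?? [];
    const visible = parts
      .filter((p) => !p.thought) // drop thinking-mode parts
      .map((p) => p.text ?? "")
      .join("");
    if (visible !== "") texts.push(visible);
  }
  return { texts, rest };
}
```

Inside streamChat you would decode each network read to text, append it to a running buffer, call `drainSse`, pass every entry of `texts` to `onToken`, and keep `rest` as the new buffer.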
What this means for developers shipping Gemma 4 today
If you're building with Gemma 4 in 2026, the patterns I'd internalize before writing a single line of code:
- MediaPipe for browser, period. Don't waste a week on transformers.js benchmarks. Migrate or start there.
- Pick the model that fits in VRAM, not the model that benchmarks best. Spilling to PCIe destroys throughput. E2B is the only realistic browser model in 2026.
- Design routing as a UX surface. Small models can't do everything. Make the limitation visible and let the user opt into cloud where it matters. Honesty beats hiding limitations behind retries.
- Treat LlmInference as a single-threaded mutex. Queue your requests, abort on unmount, expose a recovery path. The cost of not doing this is a frustrating "the AI broke" experience that the user can't diagnose.
- Streaming is for chat. generateContent is for everything else. Don't fight the API.
These five patterns saved me probably a week of additional debugging once I internalized them. AULA exists because Gemma 4 is genuinely good enough to run in a browser tab — but it only feels good to use because the patterns above turn the rough edges into smooth UX.
What I'm hopeful about
The interesting thing about all five patterns above: none of them are about Gemma 4's quality. They're about deployment ergonomics. The model itself is remarkable. A 2-billion-parameter open model that runs at 15 tok/s in a browser tab and can hold a real tutoring conversation with a high school student is a thing that genuinely did not exist 18 months ago.
For my specific use case — students in rural Latin America who have no other access to AI tools — Gemma 4 is the first model that crosses the practical viability line. It's small enough to download once over a school WiFi connection. It's capable enough to teach. It runs offline. It's free.
If you're working on local-first AI for any underserved population, I'd encourage you to start with Gemma 4. The deployment patterns above will save you a week. The model will do the rest.
If you want to see what the patterns look like in a finished product, AULA is open source under MIT: github.com/jpablortiz96/aula. Pull requests welcome.
About the author: I'm a solo founder in Cali, Colombia building educational tech for Latin American students. AULA was built solo in 11 days for this challenge.
Companion submission (Build track): AULA — The AI tutor that fits in a browser tab — live demo, video walkthrough, full architecture.
🇨🇴 Made in LATAM, for the students the world forgot.