System Rationale
Running AI in the Browser with Gemma 4 (No API, No Server)

Most “AI apps” today are just API wrappers.
That’s fine… until you care about latency, cost, or privacy.

I’ve been exploring what it actually takes to run LLMs inside the browser, and Gemma 4 completely changes what’s possible.

This is not theory; this is what actually works.

Why Gemma 4 is different

Gemma 4 isn’t just another model release.

It’s designed for:
• on-device inference
• agentic workflows
• multimodal tasks (text, audio, vision)

The important part?

👉 The E2B / E4B variants are small enough to run inside a browser tab.

No backend required.

⚙️ How it actually runs in the browser

Let’s cut the hype.

There are only 2 real approaches:

1. MediaPipe LLM Inference (Recommended)

• WebAssembly + WebGPU under the hood
• Load the model with the `@mediapipe/tasks-genai` package (the CDN URL for the WASM fileset is one common choice):

import { FilesetResolver, LlmInference } from "@mediapipe/tasks-genai";

const genai = await FilesetResolver.forGenAiTasks(
  "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm"
);
const llm = await LlmInference.createFromOptions(genai, {
  baseOptions: { modelAssetPath: "/models/gemma-4-E2B.litertlm" },
});

That’s it.

You now have:
• streaming responses
• token control
• temperature, top-k, etc.
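Streaming works through a progress listener passed to `generateResponse`: each call delivers a new partial chunk plus a `done` flag. A small helper to accumulate chunks into the UI (`streamResponse` is an illustrative name, not part of the MediaPipe API):

```javascript
// Accumulate streamed chunks and resolve with the full response.
// Assumes the MediaPipe-style callback: (partialChunk, done) => void.
function streamResponse(llm, prompt, onUpdate) {
  return new Promise((resolve) => {
    let text = "";
    llm.generateResponse(prompt, (chunk, done) => {
      text += chunk;   // each chunk is a new fragment, not the full text so far
      onUpdate(text);  // e.g. render into the chat UI
      if (done) resolve(text);
    });
  });
}
```

Usage: `const answer = await streamResponse(llm, "Summarize this note", (t) => { output.textContent = t; });`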

2. WebGPU (Transformers.js style)

More control, more pain.
• You host quantized model
• Run inference via WebGPU
• Manage decoding loop yourself

👉 Only use this if you need custom pipelines.
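"Manage the decoding loop yourself" looks roughly like this greedy-decoding sketch. Here `nextToken` stands in for whatever WebGPU forward pass you run (via Transformers.js, ONNX Runtime Web, or your own kernels), and `EOS_ID` is a model-specific assumption:

```javascript
// Minimal greedy decoding loop. `nextToken(tokens)` is a stand-in for your
// WebGPU forward pass and must return the next token id; EOS_ID ends generation.
const EOS_ID = 2; // assumption: model-specific end-of-sequence id

async function greedyDecode(nextToken, promptTokens, maxNewTokens = 64) {
  const tokens = [...promptTokens];
  for (let i = 0; i < maxNewTokens; i++) {
    const next = await nextToken(tokens); // on this path, batching & KV cache are your problem
    if (next === EOS_ID) break;
    tokens.push(next);
  }
  return tokens.slice(promptTokens.length); // only the newly generated ids
}
```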

⚡ Performance Reality (What nobody tells you)

Running LLMs in the browser ≠ free magic.

Here’s what actually matters:

1. Model size will kill you if you’re careless

• Raw models → GBs
• Optimized (4-bit) → hundreds of MB

👉 Rule:
• E2B → default
• E4B → only for high-end devices
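The arithmetic behind that rule is simple: weight bytes ≈ parameters × bits-per-weight ÷ 8. A quick estimator (real files add tokenizer and metadata overhead, so treat it as a floor):

```javascript
// Back-of-envelope weight size: params × bits-per-weight ÷ 8 bits-per-byte.
function estimateWeightsMB(paramsBillions, bitsPerWeight) {
  return (paramsBillions * 1e9 * bitsPerWeight) / 8 / 1e6;
}

estimateWeightsMB(1, 4);  // 500 MB at 4-bit for ~1B effective params
estimateWeightsMB(1, 16); // 2000 MB at fp16, which is why raw checkpoints are GBs
```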

2. Token limits = UX

Don’t blindly use 128K context.

You’ll:
• increase latency
• kill memory
• freeze UI

👉 Cap aggressively:

maxTokens: 512
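That cap pairs well with trimming conversation history to fit the budget. A rough sketch using the crude ~4-characters-per-token heuristic (a real tokenizer count is more accurate):

```javascript
// Keep the newest messages that fit a token budget, using the crude
// ~4 characters-per-token heuristic.
function trimHistory(messages, maxTokens = 512) {
  const kept = [];
  let budget = maxTokens;
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = Math.ceil(messages[i].length / 4);
    if (cost > budget) break;
    budget -= cost;
    kept.unshift(messages[i]); // keep chronological order
  }
  return kept;
}
```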

3. Main thread blocking = bad UX

If you don’t handle this:
• UI freezes
• typing lag
• users drop

👉 Always:
• stream tokens
• use Web Workers if custom setup
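Even with streaming, touching the DOM on every single token can still jank the main thread. One sketch: buffer chunks and flush in batches (names here are illustrative; in a real app you would tie flushes to `requestAnimationFrame`):

```javascript
// Buffer streamed chunks and flush in batches so the DOM isn't updated
// on every single token.
function makeBatcher(flush, batchSize = 8) {
  let pending = [];
  return {
    push(chunk) {
      pending.push(chunk);
      if (pending.length >= batchSize) this.flushNow();
    },
    flushNow() {
      if (pending.length) {
        flush(pending.join(""));
        pending = [];
      }
    },
  };
}
```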

4. You need device intelligence

Don’t assume every device can handle it.

👉 Do this:
• Check WebGPU support
• Estimate memory
• Fallback → API model
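A minimal sketch of that check, with the capability object injected so it stays testable. In the browser you would pass `{ gpu: !!navigator.gpu, memoryGB: navigator.deviceMemory }`; the memory threshold is an assumption you should tune:

```javascript
// Decide which model (if any) to run locally.
function pickModel(caps) {
  if (!caps.gpu) return "api-fallback";         // no WebGPU → server model
  if ((caps.memoryGB ?? 0) >= 8) return "E4B";  // high-end devices only
  return "E2B";                                 // safe default
}
```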

🔐 Privacy = Your biggest advantage

This is where things get interesting.

With browser-based Gemma:
• No API calls
• No prompt logging
• No server dependency

Your pitch becomes:

“Your data never leaves your device.”

That’s not marketing — that’s architecture.

📦 How to keep your app lightweight

If you mess this up, your app is dead.

❌ Wrong approach:
• Bundle model in JS
• Load on startup

✅ Correct approach:
1. Lazy load model

if (userClicksAI) {
  loadModel();
}

2. Separate asset hosting

• /models/gemma-4-E2B.litertlm

3. Cache aggressively

• long cache headers
• avoid re-download

4. Progressive upgrade

• start small → offer bigger model later
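Steps 1–4 above can be combined into a tiny memoized loader: the model is fetched on first use only, and concurrent clicks share the same in-flight promise. `loader` stands in for your real `createFromOptions` call:

```javascript
// Lazy-load the model once, on first use, and reuse the in-flight promise
// so repeated clicks never trigger a second download.
function lazyModel(loader) {
  let promise = null;
  return () => (promise ??= loader());
}

// const getLlm = lazyModel(() => LlmInference.createFromOptions(genai, options));
// button.addEventListener("click", async () => { const llm = await getLlm(); /* ... */ });
```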

🧠 Real use cases (not demos)

Where this actually makes sense:
• Private note summarizer
• Offline AI assistant
• In-browser coding helper
• Document parsing (OCR + reasoning)

⚠️ Brutal truth

This is NOT for:
• low-end phones
• heavy reasoning tasks
• large-scale SaaS

🚀 Where this fits in real products

If you’re building something like:
• productivity tools
• education apps
• private assistants

This is a massive differentiator.

🔚 Final thought

We’re moving from:

“AI as API” to “AI as runtime”

And browsers are becoming compute platforms.

If you’re building something real (not demos),
this shift matters more than any model benchmark.

I write about:
• agent workflows
• on-device AI
• system design decisions
• mistakes & trade-offs
→ Follow me on X: https://x.com/systemRationale
