Juan Torchia

Posted on • Edited on • Originally published at juanchi.dev

I Stuffed a Tiny LLM Inside a Next.js App — Here's What I Learned

It was 2am and Chrome was showing me 4.2GB of RAM used in a single tab. The model had been "thinking" for 47 seconds about a three-word response. I was staring at the screen with that specific mix of fascination and horror that technology gives you when it's working and not working at the same time. This is what happened when I decided to shove a tiny LLM inside a Next.js app.


Tiny LLMs in the browser: what they promise, what they deliver

When I saw the Show HN thread with 836 points about tiny LLMs running directly in the browser, my first thought was: this has to go into my stack. Then I saw the Gemma one with 141 points. The idea is simple and powerful: local inference, no API keys, no network latency, no per-token costs. Real privacy.

The technical concept is concrete: quantized models (GGUF, int4, int8) that bring 7B-parameter behemoths down to manageable territory — 1B, 500M, even smaller — running on WebAssembly or WebGPU directly in the browser. No server, no Claude, no OpenAI. Just the client and the model.
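The back-of-the-envelope math on those sizes is worth doing before you download anything. A sketch of my own (real GGUF files add metadata and keep some tensors at higher precision, so treat this as a lower bound):

```typescript
// Rough estimate of a quantized model's weight size in GB.
function estimateWeightsGB(params: number, bitsPerWeight: number): number {
  const bytes = (params * bitsPerWeight) / 8
  return bytes / 1024 ** 3
}

// Gemma 2B (~2.5e9 params) at int4 vs fp16:
const int4 = estimateWeightsGB(2.5e9, 4)   // ≈ 1.16 GB
const fp16 = estimateWeightsGB(2.5e9, 16)  // ≈ 4.66 GB
```

Which is why the int4 download lands in the ~1.5GB range, while the unquantized model would never fit in a browser tab.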

It sounds beautiful. And in part, it is. But there's a chasm between a Show HN demo and actually shipping this in a real app.


The real setup: Next.js, WebLLM, and my first collision with reality

I started with WebLLM from MLC AI — the most mature library for this. The approach is WebGPU when available, with a fallback to WebAssembly. The model I picked: Gemma-2B-it-q4f32_1, which theoretically weighs ~1.5GB.

# Installation — the easiest part of this whole process
npm install @mlc-ai/web-llm

The first problem showed up before I'd written a single line of business logic.

// app/components/LocalLLM.tsx
'use client' // Critical — all of this lives on the client

import { CreateMLCEngine, MLCEngine } from '@mlc-ai/web-llm'
import { useState, useRef } from 'react'

// The model I landed on after several failed attempts
const MODEL_ID = 'Gemma-2B-it-q4f32_1-MLC'

export function LocalLLM() {
  const engineRef = useRef<MLCEngine | null>(null)
  const [status, setStatus] = useState<'idle' | 'loading' | 'ready' | 'error'>('idle')
  const [progress, setProgress] = useState(0)
  const [response, setResponse] = useState('')

  const initEngine = async () => {
    setStatus('loading')

    try {
      // This downloads ~1.5GB on first load — the user needs to know that
      engineRef.current = await CreateMLCEngine(MODEL_ID, {
        initProgressCallback: (report) => {
          // report.progress is a fraction from 0 to 1; report.text is a human-readable status
          setProgress(report.progress * 100)
        }
      })

      setStatus('ready')
    } catch (error) {
      // This fires if the browser doesn't support WebGPU
      // Safari on iOS: straight to error
      console.error('Engine init failed:', error)
      setStatus('error')
    }
  }

  const runInference = async (prompt: string) => {
    if (!engineRef.current) return
    setResponse('') // clear the previous answer so runs don't concatenate

    const reply = await engineRef.current.chat.completions.create({
      messages: [{ role: 'user', content: prompt }],
      // Without this, it waits for the ENTIRE response before showing you anything
      stream: true,
    })

    // Streaming in the browser — the best part of this whole experiment
    for await (const chunk of reply) {
      const delta = chunk.choices[0]?.delta?.content || ''
      setResponse(prev => prev + delta)
    }
  }

  return (
    // Basic UI for the experiment
    <div>
      {status === 'idle' && (
        <button onClick={initEngine}>Load model (~1.5GB)</button>
      )}
      {status === 'loading' && <p>Downloading: {progress.toFixed(1)}%</p>}
      {status === 'ready' && (
        <button onClick={() => runInference('Explain what a neural network is in 2 sentences')}>Run inference</button>
      )}
      {response && <p>{response}</p>}
    </div>
  )
}

This worked. First token appeared. I got excited.

Then I looked at the task manager.


Where everything breaks — the limits nobody mentions in demos

The happy-path tutorial ends when the first token shows up on screen. The real experiment starts there.

Problem 1: The initial download is a UX nightmare

1.5GB on first visit. The model is cached after that, but it re-downloads whenever the browser evicts that storage. The cached weights live in the browser's Cache API/IndexedDB storage — which on Safari is subject to aggressive eviction limits.

WebLLM uses the browser's Cache API automatically, but the UX of "please wait while we download 1.5GB" doesn't exist in any product you've ever actually used. I had to build a progress screen from scratch.
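Before kicking off a download that size, it's worth checking whether the browser can even hold it. A sketch using the Storage API — the headroom factor is my own choice, and the function takes the estimate as a parameter so the decision logic stays unit-testable:

```typescript
// Decide whether it's safe to download a model of `modelBytes`,
// given a StorageEstimate. In the browser you would pass
// `await navigator.storage.estimate()`.
function canFitModel(
  estimate: { usage?: number; quota?: number },
  modelBytes: number,
  headroom = 1.5 // leave 50% slack for runtime buffers
): boolean {
  const { usage = 0, quota = 0 } = estimate
  return quota - usage >= modelBytes * headroom
}

// In the component, before CreateMLCEngine:
// const ok = canFitModel(await navigator.storage.estimate(), 1.5 * 1024 ** 3)
// if (!ok) setStatus('error') // or fall back to a server API
```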

Problem 2: Memory — the number that'll scare you

Gemma 2B quantized to int4 promises ~1GB of RAM. In practice I saw spikes of 3-4GB in Chrome during initial loading. Why: the initialization process loads the full model before moving it to the GPU. On devices with less than 8GB available, it's Russian roulette.

On mobile: just no. iOS Safari doesn't have stable WebGPU. Android Chrome works on some Pixels, unpredictable on everything else.
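Detecting which path a device will get is cheap, so I gate the UI on it up front. A sketch — the navigator object is injected as a parameter (my own pattern, not WebLLM's) so the logic can run outside a browser:

```typescript
type Backend = 'webgpu' | 'wasm' | 'unsupported'

// Decide which inference backend a given environment will get.
function pickBackend(nav: { gpu?: unknown }, wasmOk: boolean): Backend {
  if (nav.gpu) return 'webgpu' // navigator.gpu exists in WebGPU-capable browsers
  if (wasmOk) return 'wasm'    // usable, but expect roughly 10x slower decoding
  return 'unsupported'
}

// In the browser:
// const backend = pickBackend(navigator, typeof WebAssembly !== 'undefined')
```

On iOS Safari this returns 'wasm' at best, which is exactly why "just no" is the honest answer for mobile today.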

Problem 3: Real latency vs. demo latency

  • On an M2 MacBook with WebGPU: 8-12 tokens/second. Decent.
  • On a 2019 i7 without a dedicated GPU (WebAssembly fallback): 0.8-1.2 tokens/second. Unusable.
  • On a Railway server (CPU): doesn't make sense — for that you'd just use an API.

The Show HN demo ran on the perfect setup. Your average user doesn't have that setup.
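Those rates translate directly into wait time. The arithmetic that convinced me the WASM path was unusable, assuming an average reply of about 150 tokens (my own figure):

```typescript
// Seconds to stream a full reply at a given decode rate.
function secondsForReply(tokens: number, tokensPerSecond: number): number {
  return tokens / tokensPerSecond
}

const m2 = secondsForReply(150, 10)  // 15s on the M2 with WebGPU — tolerable with streaming
const wasm = secondsForReply(150, 1) // 150s on the 2019 i7 fallback — nobody waits that long
```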

Problem 4: Next.js and the SSR that blows everything up

// This import explodes on the server — WebGPU doesn't exist in Node
import { CreateMLCEngine } from '@mlc-ai/web-llm'

// The fix: dynamic import with ssr: false
import dynamic from 'next/dynamic'

const LocalLLM = dynamic(
  () => import('./components/LocalLLM'),
  { 
    ssr: false, // Without this, Railway throws an error on build
    loading: () => <p>Loading inference interface...</p>
  }
)

I learned this the hard way. Successful build, deployed to Railway, white screen. Three hours later: ssr: false. I've written before about deploying to Railway and the Next.js optimizations that actually matter — but the ssr: false for WebGPU is one I didn't see documented anywhere.

Problem 5: The model is small — and it shows

Gemma 2B is impressive for its size. But when you compare it to GPT-4o or Claude, the gap in reasoning is a canyon. For simple tasks — classification, short summarization, entity extraction — it works well. For anything requiring complex reasoning, you feel the ceiling immediately.

This isn't a knock on the model. It's about calibrating expectations: it's a 2B running quantized in a browser. The right question isn't "is it as good as GPT-4?" — it's "is it good enough for my specific use case?"


The moment I decided whether it was worth it

After three days of experimenting, I sat down and did the cold analysis. I have a habit of thinking about the stack from the project's perspective, not from the excitement of the technology itself.

Cases where I WOULD use this:

  • An internal tool where you control the user's hardware (always Chrome on a powerful desktop)
  • Privacy as a product differentiator — processing sensitive text without sending it to a server
  • Offline-first apps where API latency is the killer
  • Prototypes and demos where the WOW factor matters more than consistent performance

Cases where I WOULD NOT use this:

  • A public app with a heterogeneous user base across devices
  • Anything where response speed is critical
  • When the small model isn't capable enough for the task (most production cases)

The honest conclusion: this is a technology I'm going to keep watching, but it needs another year or two to mature before I put it in front of real users whose hardware I don't control. The TypeScript patterns I use to abstract these decisions helped me wrap this as a feature flag — the component exists, it's off by default, I turn it on only in contexts where I know it'll work.
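The gate itself is a few lines. A sketch of the kind of check I mean — the flag name and the user-agent heuristic are my own, not from any library:

```typescript
// Enable the local-LLM component only where it's known to work:
// flag on, WebGPU present, desktop Chrome.
function shouldEnableLocalLLM(
  flagOn: boolean,    // e.g. process.env.NEXT_PUBLIC_LOCAL_LLM === 'on'
  hasWebGPU: boolean, // e.g. 'gpu' in navigator
  userAgent: string
): boolean {
  // Crude desktop-Chrome check: Chrome UA, not Android, not a mobile shell.
  const desktopChrome =
    /Chrome\//.test(userAgent) && !/Android|Mobile/.test(userAgent)
  return flagOn && hasWebGPU && desktopChrome
}
```

Everyone who fails the check gets the normal API-backed path; nobody ever sees a broken 1.5GB download.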

On juanchi.dev I have it running as an experiment on a separate route, not as a main feature. That's the right place for this today.


FAQ — What you'd ask me if I talked about this at a meetup

What's the difference between running an LLM in the browser vs. on the edge (Cloudflare Workers, Vercel Edge)?

They're two different things. Browser inference = WebGPU/WASM, runs on the user's machine, no server. Edge inference = the model runs on the edge server, with limited GPU access (Cloudflare has experimental access to models via Workers AI). The browser is more private and has no compute costs for you, but it's totally dependent on the user's hardware. Edge gives you more control over latency and the model, but has costs and the available models are limited.

What's the smallest model that's actually useful for something?

In my experiment, the minimum viable for reasonable natural language tasks was Gemma 2B quantized (~1.5GB download). There are other small options — Phi-3 mini at 3.8B parameters is surprisingly good for its class, and there are 500M-parameter variants for classification — but for free-form text generation, once you go below 1B the quality falls off a cliff. File size isn't the only number that matters: the model's architecture and fine-tuning matter too.

Does this replace using the OpenAI or Anthropic API?

No, and I don't think it will for most cases in the near term. The capability gap between a local 2B and GPT-4o is enormous. What it can replace: simple NLP tasks where you're currently paying for millions of tokens on things that don't need complex reasoning — sentiment classification, keyword extraction, short summaries. For that, a local model makes economic and privacy sense.
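That split can be made explicit in code. A routing sketch — the task taxonomy and function names are my own invention for illustration:

```typescript
type Task = 'sentiment' | 'keywords' | 'short-summary' | 'reasoning' | 'long-form'

// Route cheap NLP tasks to the local model; anything that needs
// real reasoning goes to a hosted API.
function pickModelFor(task: Task, localReady: boolean): 'local' | 'api' {
  const localCapable: Task[] = ['sentiment', 'keywords', 'short-summary']
  return localReady && localCapable.includes(task) ? 'local' : 'api'
}
```

The `localReady` flag matters: until the 1.5GB download finishes, everything routes to the API anyway.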

Is WebGPU production-ready yet?

Depends on your definition of production. Chrome 113+ on desktop: yes, stable. Firefox: available but slower. Safari macOS: available since Safari 18. iOS Safari: in progress, inconsistent. Android Chrome: available on modern devices, unpredictable on mid-to-low-end hardware. If your app has users across multiple browsers and devices, you need a robust WebAssembly fallback and you need to communicate to the user that the experience will be slower.

Can you stream the response or do you have to wait for the full completion?

Yes, WebLLM supports native streaming with the same OpenAI interface (stream: true). Streaming is basically mandatory — without it, the user stares at a blank screen for 30-60 seconds and then all the text dumps at once. With streaming, the first token appears in 2-5 seconds and the response flows in gradually. The UX difference is night and day. I implemented it with the same for await pattern I use with the Anthropic API.

Is it worth it for a side project, or is this only for big companies with resources?

For a side project it's actually perfect — precisely because you don't have to pay for API calls. The real cost is setup time and understanding the limits. If you're building a niche tool where you can assume your users have decent hardware (think: a Chrome extension for developers, a tool for designers on desktop), the use case fits well. For a consumer app with heterogeneous users, I'd wait another 12-18 months.


The technology is there. The maturity, not so much.

What I'm taking away from three days of this experiment: browser inference works. It's not marketing, it's not smoke and mirrors. You can put Gemma in a Chrome tab, ask it questions, and get answers back without sending anything to any server. That's genuinely remarkable.

But there's a big jump between "it works" and "it's ready for real users." The 1.5GB initial download, the WebGPU dependency, the brutal variability between devices — those are product problems, not just implementation problems.

My read: it's the perfect moment to learn this, too early to ship it to mainstream production. I have it on active radar, with working code, waiting for the ecosystem to mature. I'll revisit this question in 2026 and I'm betting the answer will be different.

If you want to reproduce the experiment, the code is in my repo and the notes in this post are the honest map of where you're going to spend your time.
