Pavel Fits
Run an LLM directly in the user's browser via WebGPU with React in 50 lines of code (no API keys or token fees)

Overview

Adding AI to a React app usually means signing up for OpenAI, getting an API key, setting up billing, hiding the key in .env, spinning up a backend proxy so it doesn't leak to the client... And that's before you've written a single line of business logic. Then your prototype gets traffic, and you receive a $200 bill because someone decided to stress-test prompt injections on your demo.

If you go the local WebGPU route instead, you're stuck manually wiring async Web Workers to component lifecycles and hooks - easily 200+ lines of infrastructure before the first feature.

Today we'll do it cleanly. I'll show how to build a fully local LLM chatbot in under 50 lines of React, running entirely in the user's browser with inferis-react - a thin wrapper that handles workers, memory, and WebGPU/WASM fallbacks under the hood.

  • No API keys
  • No server
  • No per-token charges
  • User data never leaves their device

WebGPU is supported in Chrome, Edge, and Firefox, and quantized 1-3B parameter models run on an average laptop at 15-30 tokens/sec.
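That fallback decision is easy to probe yourself: `navigator.gpu` only exists in WebGPU-capable browsers, and even then `requestAdapter()` can resolve to `null` (blocklisted drivers, headless contexts). Here's a minimal detection helper - illustrative, not from inferis-ml - written against a narrowed parameter type so the logic can run outside a browser:

```typescript
// A hand-rolled version of the WebGPU/WASM decision (illustrative, not
// from inferis-ml). The `nav` parameter is a narrowed stand-in for
// `navigator` so the logic is testable outside a browser.
type Backend = 'webgpu' | 'wasm';

interface GpuLike {
  requestAdapter(): Promise<object | null>;
}

async function pickBackend(nav: { gpu?: GpuLike }): Promise<Backend> {
  if (nav.gpu) {
    // requestAdapter() can resolve to null even when navigator.gpu exists
    // (blocklisted drivers, headless contexts), so check the result too.
    const adapter = await nav.gpu.requestAdapter().catch(() => null);
    if (adapter) return 'webgpu';
  }
  return 'wasm'; // SIMD WASM fallback - works everywhere, just slower
}

// In the browser: const backend = await pickBackend(navigator as never);
```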

Let's build a chatbot.


Without a Library: The Real Cost

Let's be honest about what a production-ready local chatbot requires with raw @mlc-ai/web-llm. Here's your file tree:

src/
  llm.worker.ts        ~200 lines - Web Worker: load model, stream tokens via postMessage
  gpu-detect.ts        ~100 lines - check WebGPU, fall back to WASM, pick the right backend
  worker-bridge.ts    ~120 lines - typed postMessage protocol, promise wrappers, error forwarding
  useLLM.ts           ~100 lines - React hook: subscribe to worker events, manage state & cleanup
  Chat.tsx             ~140 lines - the actual UI component

That's 5 files and roughly 660 lines (~520 of infrastructure, ~140 of UI) to get a basic streaming chatbot that doesn't freeze the UI. And you still don't have:

  • Abort/cancel (needs a MessagePort or separate signal channel)
  • Memory cleanup when switching between models
  • Cross-tab deduplication (5 tabs = 5 copies of the model = 10 GB RAM)
  • Progress tracking with download phases and byte counts
  • Retry logic when the worker crashes

Add those and you're well past 700 lines before writing your first feature.
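To make that cost concrete, here's a condensed sketch of what worker-bridge.ts boils down to: pairing postMessage requests with responses via ids and wrapping each call in a promise. The message shapes and the `WorkerLike` interface are illustrative, not from any library; the real file also needs streaming token events, aborts, and worker-crash handling:

```typescript
// Sketch of the promise-wrapper half of a worker bridge. Message shapes
// and WorkerLike are illustrative stand-ins for a real Worker.
type Req = { id: number; type: 'load' | 'generate'; payload?: unknown };
type Res = { data: { id: number; type: 'result' | 'error'; payload?: unknown } };

interface WorkerLike {
  postMessage(msg: Req): void;
  onmessage: ((e: Res) => void) | null;
}

function makeBridge(worker: WorkerLike) {
  let nextId = 0;
  const pending = new Map<number, { resolve: (v: unknown) => void; reject: (e: unknown) => void }>();

  worker.onmessage = (e) => {
    const waiter = pending.get(e.data.id);
    if (!waiter) return; // response for a cancelled/unknown request
    pending.delete(e.data.id);
    if (e.data.type === 'error') waiter.reject(e.data.payload);
    else waiter.resolve(e.data.payload);
  };

  // Every request gets an id; the matching response settles the promise.
  return (type: Req['type'], payload?: unknown) =>
    new Promise<unknown>((resolve, reject) => {
      const id = nextId++;
      pending.set(id, { resolve, reject });
      worker.postMessage({ id, type, payload });
    });
}
```

And this is just the request/response channel - token streaming alone needs a second message type and subscription management on top.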


With inferis-ml: This Is the Entire File

npm install inferis-ml inferis-react @mlc-ai/web-llm
// App.tsx - that's it, no worker files, no bridge, no GPU detection
import { useState } from 'react';
import { InferisProvider, useModel, useStream } from 'inferis-react';
import { webLlmAdapter } from 'inferis-ml/adapters/web-llm';

function Chatbot() {
  const [messages, setMessages] = useState<{ role: string; content: string }[]>([]);
  const [input, setInput] = useState('');
  const { model, state, progress } = useModel<string>('text-generation', {
    model: 'SmolLM2-360M-Instruct-q4f16_1-MLC', autoLoad: true,
  });
  const { text, isStreaming, start, stop, reset } = useStream<string>(model);

  if (state === 'loading')
    return <p>Loading... {progress ? Math.round(progress.loaded / (progress.total || 1) * 100) : 0}%</p>;

  const send = () => {
    if (!input.trim() || isStreaming) return;
    // Commit the previous assistant reply (if any) before the new user turn,
    // so the chat history stays in order.
    const next = [...messages];
    if (text) next.push({ role: 'assistant', content: text });
    next.push({ role: 'user', content: input });
    setMessages(next);
    setInput('');
    reset();
    start({ messages: [{ role: 'system', content: 'You are a helpful assistant.' }, ...next] });
  };

  return (
    <div>
      {messages.map((m, i) => <div key={i}><b>{m.role === 'user' ? 'You' : 'AI'}:</b> {m.content}</div>)}
      {text && <div><b>AI:</b> {text}</div>}
      <input value={input} onChange={e => setInput(e.target.value)}
        onKeyDown={e => e.key === 'Enter' && send()} disabled={isStreaming} />
      <button onClick={send} disabled={isStreaming}>Send</button>
      {isStreaming && <button onClick={stop}>Stop</button>}
    </div>
  );
}

export default () => <InferisProvider adapter={webLlmAdapter()}><Chatbot /></InferisProvider>;

1 file, under 50 lines. No worker files, no bridge code, no GPU detection. It just works.

All those 5 files and 600+ lines from the previous section? They're handled inside inferis-ml:

| Your 5 files | inferis-ml |
| --- | --- |
| llm.worker.ts + worker-bridge.ts | Built-in worker pool with typed async API |
| gpu-detect.ts | defaultDevice: 'auto' - one line |
| useLLM.ts | useModel() + useStream() - two hooks, zero boilerplate |
| abort, memory, cross-tab | stop(), LRU eviction, crossTab: true |

What Happens Under the Hood

When you wrap your app in <InferisProvider>, the library:

┌─────────────┐     postMessage     ┌──────────────────┐
│  React UI   │ ◄─────────────────► │   Web Worker     │
│  (main      │                     │  ┌────────────┐  │
│   thread)   │                     │  │  web-llm   │  │
│             │                     │  │  adapter   │  │
│  useModel() │                     │  └────────────┘  │
│  useStream()│                     │  ┌────────────┐  │
│             │                     │  │  WebGPU /  │  │
│             │                     │  │  WASM      │  │
└─────────────┘                     │  └────────────┘  │
                                    └──────────────────┘
  1. createPool() spins up a Web Worker pool. The model loads and runs in a separate thread - the main React thread stays free for rendering.

  2. pool.load() downloads model weights (cached in the browser's Cache API), initializes the runtime, and returns a ModelHandle.

  3. model.stream() starts inference and returns a standard ReadableStream. The useStream hook subscribes to the stream and updates React state on every token.

  4. WebGPU auto-detection: if a GPU is available - WebGPU is used. If not - WASM with SIMD. No manual navigator.gpu checks needed.

  5. Cross-tab: with crossTab: true, inferis-ml uses a SharedWorker. Five tabs with the same model = one copy in memory.
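Since step 3 says model.stream() returns a standard ReadableStream, you can drain it with the regular reader API. Here's a small generic helper - my own, not part of inferis-react - that works with any `ReadableStream<string>` of tokens:

```typescript
// Draining a token stream with the standard ReadableStream reader API.
// collect() is a generic helper (not part of inferis-react); it works with
// any ReadableStream<string>, including one like model.stream() is
// described as returning.
async function collect(
  stream: ReadableStream<string>,
  onToken?: (t: string) => void,
): Promise<string> {
  const reader = stream.getReader();
  let out = '';
  for (;;) {
    const { done, value } = await reader.read();
    if (done) return out;
    out += value;
    onToken?.(value); // e.g. push each token into React state
  }
}
```

This is essentially what the useStream hook does for you, plus React state batching and cleanup.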


What Models Can You Use?

inferis-ml supports three adapters:

| Adapter | Runtime | Models |
| --- | --- | --- |
| webLlmAdapter() | @mlc-ai/web-llm | Llama 3.2 (1B/3B), SmolLM2, Phi-3, Gemma 2, Qwen 2.5 |
| transformersAdapter() | @huggingface/transformers | Any model from HuggingFace Hub |
| onnxAdapter() | onnxruntime-web | Any ONNX model |

For chatbots, the best choice is webLlmAdapter() with MLC-compiled models:

  • SmolLM2-360M - 200 MB, runs even on low-end devices
  • Llama-3.2-1B - 700 MB, solid quality for its size
  • Llama-3.2-3B - 1.8 GB, approaching GPT-3.5 quality, requires a GPU with 4+ GB VRAM
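If you want to choose between these at runtime, one option is to key off `navigator.deviceMemory` (Chrome-only and capped at 8, so it's only a hint - default to the smallest model elsewhere). The thresholds below are my guesses, not official requirements; the ids follow MLC's prebuilt naming, like the SmolLM2 id used earlier:

```typescript
// Pick a model id by rough device capability. The GB thresholds are
// guesses, not official requirements for these models.
function pickModelId(deviceMemoryGB?: number): string {
  if (deviceMemoryGB && deviceMemoryGB >= 8) return 'Llama-3.2-3B-Instruct-q4f16_1-MLC';
  if (deviceMemoryGB && deviceMemoryGB >= 4) return 'Llama-3.2-1B-Instruct-q4f16_1-MLC';
  return 'SmolLM2-360M-Instruct-q4f16_1-MLC'; // safe default for low-end devices
}

// In the browser: pickModelId((navigator as never)['deviceMemory'])
```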

Bonus: RAG in 10 Minutes

Want the chatbot to answer questions about your documents? inferis-ml includes a full RAG pipeline that also runs entirely in the browser:

import { useRAG, useRAGStream } from 'inferis-react';
import { transformersAdapter } from 'inferis-ml/adapters/transformers';
import { webLlmAdapter } from 'inferis-ml/adapters/web-llm';

function RAGChat() {
  const rag = useRAG({
    embeddingModel: 'Xenova/multilingual-e5-small',
    embeddingAdapter: transformersAdapter(),
    llmModel: 'SmolLM2-360M-Instruct-q4f16_1-MLC',
    llmAdapter: webLlmAdapter(),
    autoInit: true,
  });

  const { answer, isStreaming, ask } = useRAGStream(rag.pipeline);

  const handleUpload = async (files: FileList) => {
    const docs = await Promise.all(
      Array.from(files).map(async (f) => ({
        id: f.name,
        text: await f.text(),
      }))
    );
    await rag.index(docs);
  };

  return (
    <div>
      <input type="file" multiple onChange={(e) => e.target.files && handleUpload(e.target.files)} />
      <p>Indexed: {rag.indexedCount} chunks</p>
      <button onClick={() => ask('What is this document about?')}>Ask</button>
      {isStreaming && <p>Thinking...</p>}
      <p>{answer}</p>
    </div>
  );
}

Documents are chunked, embedded, and stored in IndexedDB - all locally.
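For intuition, the chunking step is typically a fixed-size window with some overlap so sentences split across a boundary still land intact in at least one chunk. This naive character-based chunker is purely an illustration - inferis-ml's actual strategy isn't documented in this post:

```typescript
// A naive fixed-size chunker with overlap - roughly what a RAG pipeline
// does before embedding. Illustrative only; real chunkers usually split
// on sentence or token boundaries instead of raw characters.
function chunkText(text: string, size = 500, overlap = 50): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last window reached the end
  }
  return chunks;
}
```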


Limitations (Honestly)

It would be dishonest not to mention:

  • First load is slow. A 1B parameter model is ~700 MB to download. After the first time, weights are cached in the browser.
  • WebGPU is needed for decent speed. The WASM fallback works but is 5-10x slower. In 2026, WebGPU is available in Chrome, Edge, and Firefox (93%+ coverage).
  • Quality != GPT-4. Models with 1-3B parameters handle simple tasks, summarization, and Q&A well. For complex reasoning, you need larger models (and therefore, an API).
  • Mobile devices. Works on iPhone/Android via WASM, but slowly. Desktop is the primary use case.

When this is the right choice:

  • Prototypes and demos without infrastructure
  • Privacy-first applications (healthcare, legal, finance)
  • Offline scenarios
  • Any case where you don't want to pay for an API

When an API is better:

  • You need GPT-4/Claude-level quality
  • Mobile users are your primary audience
  • Complex multi-step reasoning

Summary

| | Raw web-llm | inferis-ml |
| --- | --- | --- |
| Files | 5 | 1 |
| Lines of code | ~660 (more with edge cases) | under 50 |
| Web Workers | Manual setup | Built-in |
| WebGPU fallback | Manual | Automatic |
| Cross-tab | Not supported | SharedWorker |
| Memory | Manual management | LRU eviction |
| Streaming | Manual async iterator | useStream() hook |
| React integration | Custom hooks | useModel() + useStream() |
| RAG | - | Built-in pipeline |
| Bundle size | - | 6.7 kB (minzip) |

Links:


If this was useful - drop a star on GitHub. It's the only metric that keeps open source maintainers going.
