Pavel Fits
Run an LLM directly in the user's browser via WebGPU with React in 50 lines of code (no API keys or token fees)

Overview

Adding AI to a React app usually means signing up for OpenAI, getting an API key, setting up billing, hiding the key in .env, spinning up a backend proxy so it doesn't leak to the client... And that's before you've written a single line of business logic. Then your prototype gets traffic, and you receive a $200 bill because someone decided to stress-test prompt injections on your demo.

If you go the local WebGPU route instead, you're stuck manually wiring async Web Workers to component lifecycles and hooks - easily 200+ lines of infrastructure before the first feature.

Today we'll do it cleanly. I'll show how to build a fully local LLM chatbot in under 50 lines of React, running entirely in the user's browser with inferis-react - a thin wrapper that handles workers, memory, and WebGPU/WASM fallbacks under the hood.

  • No API keys
  • No server
  • No per-token charges
  • User data never leaves their device

WebGPU is supported in Chrome, Edge, and Firefox, and quantized 1-3B parameter models run on an average laptop at 15-30 tokens/sec.
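That fallback decision is easy to probe yourself: `navigator.gpu` only exists in WebGPU-capable browsers, and even then `requestAdapter()` can resolve to `null` (blocklisted drivers, headless contexts). Here's a minimal detection helper - illustrative, not from inferis-ml - written against a narrowed parameter type so the logic can run outside a browser:

```typescript
// A hand-rolled version of the WebGPU/WASM decision (illustrative, not
// from inferis-ml). The `nav` parameter is a narrowed stand-in for
// `navigator` so the logic is testable outside a browser.
type Backend = 'webgpu' | 'wasm';

interface GpuLike {
  requestAdapter(): Promise<object | null>;
}

async function pickBackend(nav: { gpu?: GpuLike }): Promise<Backend> {
  if (nav.gpu) {
    // requestAdapter() can resolve to null even when navigator.gpu exists
    // (blocklisted drivers, headless contexts), so check the result too.
    const adapter = await nav.gpu.requestAdapter().catch(() => null);
    if (adapter) return 'webgpu';
  }
  return 'wasm'; // SIMD WASM fallback - works everywhere, just slower
}

// In the browser: const backend = await pickBackend(navigator as never);
```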

Let's build a chatbot.


Without a Library: The Real Cost

Let's be honest about what a production-ready local chatbot requires with raw @mlc-ai/web-llm. Here's your file tree:

src/
  llm.worker.ts        ~200 lines - Web Worker: load model, stream tokens via postMessage
  gpu-detect.ts        ~100 lines - check WebGPU, fall back to WASM, pick the right backend
  worker-bridge.ts    ~120 lines - typed postMessage protocol, promise wrappers, error forwarding
  useLLM.ts           ~100 lines - React hook: subscribe to worker events, manage state & cleanup
  Chat.tsx             ~140 lines - the actual UI component

That's 5 files and roughly 660 lines (~520 of infrastructure, ~140 of UI) to get a basic streaming chatbot that doesn't freeze the UI. And you still don't have:

  • Abort/cancel (needs a MessagePort or separate signal channel)
  • Memory cleanup when switching between models
  • Cross-tab deduplication (5 tabs = 5 copies of the model = 10 GB RAM)
  • Progress tracking with download phases and byte counts
  • Retry logic when the worker crashes

Add those and you're well past 700 lines before writing your first feature.
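To make that cost concrete, here's a condensed sketch of what worker-bridge.ts boils down to: pairing postMessage requests with responses via ids and wrapping each call in a promise. The message shapes and the `WorkerLike` interface are illustrative, not from any library; the real file also needs streaming token events, aborts, and worker-crash handling:

```typescript
// Sketch of the promise-wrapper half of a worker bridge. Message shapes
// and WorkerLike are illustrative stand-ins for a real Worker.
type Req = { id: number; type: 'load' | 'generate'; payload?: unknown };
type Res = { data: { id: number; type: 'result' | 'error'; payload?: unknown } };

interface WorkerLike {
  postMessage(msg: Req): void;
  onmessage: ((e: Res) => void) | null;
}

function makeBridge(worker: WorkerLike) {
  let nextId = 0;
  const pending = new Map<number, { resolve: (v: unknown) => void; reject: (e: unknown) => void }>();

  worker.onmessage = (e) => {
    const waiter = pending.get(e.data.id);
    if (!waiter) return; // response for a cancelled/unknown request
    pending.delete(e.data.id);
    if (e.data.type === 'error') waiter.reject(e.data.payload);
    else waiter.resolve(e.data.payload);
  };

  // Every request gets an id; the matching response settles the promise.
  return (type: Req['type'], payload?: unknown) =>
    new Promise<unknown>((resolve, reject) => {
      const id = nextId++;
      pending.set(id, { resolve, reject });
      worker.postMessage({ id, type, payload });
    });
}
```

And this is just the request/response channel - token streaming alone needs a second message type and subscription management on top.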


With inferis-ml: This Is the Entire File

npm install inferis-ml inferis-react @mlc-ai/web-llm
// App.tsx - that's it, no worker files, no bridge, no GPU detection
import { useState } from 'react';
import { InferisProvider, useModel, useStream } from 'inferis-react';
import { webLlmAdapter } from 'inferis-ml/adapters/web-llm';

function Chatbot() {
  const [messages, setMessages] = useState<{ role: string; content: string }[]>([]);
  const [input, setInput] = useState('');
  const { model, state, progress } = useModel<string>('text-generation', {
    model: 'SmolLM2-360M-Instruct-q4f16_1-MLC', autoLoad: true,
  });
  const { text, isStreaming, start, stop, reset } = useStream<string>(model);

  if (state === 'loading')
    return <p>Loading... {progress ? Math.round(progress.loaded / (progress.total || 1) * 100) : 0}%</p>;

  const send = () => {
    if (!input.trim() || isStreaming) return;
    // Commit the previous assistant reply (if any) before the new user turn,
    // so the chat history stays in order.
    const next = [...messages];
    if (text) next.push({ role: 'assistant', content: text });
    next.push({ role: 'user', content: input });
    setMessages(next);
    setInput('');
    reset();
    start({ messages: [{ role: 'system', content: 'You are a helpful assistant.' }, ...next] });
  };

  return (
    <div>
      {messages.map((m, i) => <div key={i}><b>{m.role === 'user' ? 'You' : 'AI'}:</b> {m.content}</div>)}
      {text && <div><b>AI:</b> {text}</div>}
      <input value={input} onChange={e => setInput(e.target.value)}
        onKeyDown={e => e.key === 'Enter' && send()} disabled={isStreaming} />
      <button onClick={send} disabled={isStreaming}>Send</button>
      {isStreaming && <button onClick={stop}>Stop</button>}
    </div>
  );
}

export default () => <InferisProvider adapter={webLlmAdapter()}><Chatbot /></InferisProvider>;

1 file, under 50 lines. No worker files, no bridge code, no GPU detection. It just works.

All those 5 files and 600+ lines from the previous section? They're handled inside inferis-ml:

| Your 5 files | inferis-ml |
| --- | --- |
| llm.worker.ts + worker-bridge.ts | Built-in worker pool with typed async API |
| gpu-detect.ts | defaultDevice: 'auto' - one line |
| useLLM.ts | useModel() + useStream() - two hooks, zero boilerplate |
| abort, memory, cross-tab | stop(), LRU eviction, crossTab: true |

What Happens Under the Hood

When you wrap your app in <InferisProvider>, the library:

┌─────────────┐     postMessage     ┌──────────────────┐
│  React UI   │ ◄─────────────────► │   Web Worker     │
│  (main      │                     │  ┌────────────┐  │
│   thread)   │                     │  │  web-llm   │  │
│             │                     │  │  adapter   │  │
│  useModel() │                     │  └────────────┘  │
│  useStream()│                     │  ┌────────────┐  │
│             │                     │  │  WebGPU /  │  │
│             │                     │  │  WASM      │  │
└─────────────┘                     │  └────────────┘  │
                                    └──────────────────┘
  1. createPool() spins up a Web Worker pool. The model loads and runs in a separate thread - the main React thread stays free for rendering.

  2. pool.load() downloads model weights (cached in the browser's Cache API), initializes the runtime, and returns a ModelHandle.

  3. model.stream() starts inference and returns a standard ReadableStream. The useStream hook subscribes to the stream and updates React state on every token.

  4. WebGPU auto-detection: if a GPU is available - WebGPU is used. If not - WASM with SIMD. No manual navigator.gpu checks needed.

  5. Cross-tab: with crossTab: true, inferis-ml uses a SharedWorker. Five tabs with the same model = one copy in memory.
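Since step 3 says model.stream() returns a standard ReadableStream, you can drain it with the regular reader API. Here's a small generic helper - my own, not part of inferis-react - that works with any `ReadableStream<string>` of tokens:

```typescript
// Draining a token stream with the standard ReadableStream reader API.
// collect() is a generic helper (not part of inferis-react); it works with
// any ReadableStream<string>, including one like model.stream() is
// described as returning.
async function collect(
  stream: ReadableStream<string>,
  onToken?: (t: string) => void,
): Promise<string> {
  const reader = stream.getReader();
  let out = '';
  for (;;) {
    const { done, value } = await reader.read();
    if (done) return out;
    out += value;
    onToken?.(value); // e.g. push each token into React state
  }
}
```

This is essentially what the useStream hook does for you, plus React state batching and cleanup.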


What Models Can You Use?

inferis-ml supports three adapters:

| Adapter | Runtime | Models |
| --- | --- | --- |
| webLlmAdapter() | @mlc-ai/web-llm | Llama 3.2 (1B/3B), SmolLM2, Phi-3, Gemma 2, Qwen 2.5 |
| transformersAdapter() | @huggingface/transformers | Any model from HuggingFace Hub |
| onnxAdapter() | onnxruntime-web | Any ONNX model |

For chatbots, the best choice is webLlmAdapter() with MLC-compiled models:

  • SmolLM2-360M - 200 MB, runs even on low-end devices
  • Llama-3.2-1B - 700 MB, solid quality for its size
  • Llama-3.2-3B - 1.8 GB, approaching GPT-3.5 quality, requires a GPU with 4+ GB VRAM
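If you want to choose between these at runtime, one option is to key off `navigator.deviceMemory` (Chrome-only and capped at 8, so it's only a hint - default to the smallest model elsewhere). The thresholds below are my guesses, not official requirements; the ids follow MLC's prebuilt naming, like the SmolLM2 id used earlier:

```typescript
// Pick a model id by rough device capability. The GB thresholds are
// guesses, not official requirements for these models.
function pickModelId(deviceMemoryGB?: number): string {
  if (deviceMemoryGB && deviceMemoryGB >= 8) return 'Llama-3.2-3B-Instruct-q4f16_1-MLC';
  if (deviceMemoryGB && deviceMemoryGB >= 4) return 'Llama-3.2-1B-Instruct-q4f16_1-MLC';
  return 'SmolLM2-360M-Instruct-q4f16_1-MLC'; // safe default for low-end devices
}

// In the browser: pickModelId((navigator as never)['deviceMemory'])
```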

Bonus: RAG in 10 Minutes

Want the chatbot to answer questions about your documents? inferis-ml includes a full RAG pipeline that also runs entirely in the browser:

import { useRAG, useRAGStream } from 'inferis-react';
import { transformersAdapter } from 'inferis-ml/adapters/transformers';
import { webLlmAdapter } from 'inferis-ml/adapters/web-llm';

function RAGChat() {
  const rag = useRAG({
    embeddingModel: 'Xenova/multilingual-e5-small',
    embeddingAdapter: transformersAdapter(),
    llmModel: 'SmolLM2-360M-Instruct-q4f16_1-MLC',
    llmAdapter: webLlmAdapter(),
    autoInit: true,
  });

  const { answer, isStreaming, ask } = useRAGStream(rag.pipeline);

  const handleUpload = async (files: FileList) => {
    const docs = await Promise.all(
      Array.from(files).map(async (f) => ({
        id: f.name,
        text: await f.text(),
      }))
    );
    await rag.index(docs);
  };

  return (
    <div>
      <input type="file" multiple onChange={(e) => e.target.files && handleUpload(e.target.files)} />
      <p>Indexed: {rag.indexedCount} chunks</p>
      <button onClick={() => ask('What is this document about?')}>Ask</button>
      {isStreaming && <p>Thinking...</p>}
      <p>{answer}</p>
    </div>
  );
}

Documents are chunked, embedded, and stored in IndexedDB - all locally.
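For intuition, the chunking step is typically a fixed-size window with some overlap so sentences split across a boundary still land intact in at least one chunk. This naive character-based chunker is purely an illustration - inferis-ml's actual strategy isn't documented in this post:

```typescript
// A naive fixed-size chunker with overlap - roughly what a RAG pipeline
// does before embedding. Illustrative only; real chunkers usually split
// on sentence or token boundaries instead of raw characters.
function chunkText(text: string, size = 500, overlap = 50): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last window reached the end
  }
  return chunks;
}
```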


Limitations (Honestly)

It would be dishonest not to mention:

  • First load is slow. A 1B parameter model is ~700 MB to download. After the first time, weights are cached in the browser.
  • WebGPU is needed for decent speed. The WASM fallback works but is 5-10x slower. In 2026, WebGPU is available in Chrome, Edge, and Firefox (93%+ coverage).
  • Quality != GPT-4. Models with 1-3B parameters handle simple tasks, summarization, and Q&A well. For complex reasoning, you need larger models (and therefore, an API).
  • Mobile devices. Works on iPhone/Android via WASM, but slowly. Desktop is the primary use case.

When this is the right choice:

  • Prototypes and demos without infrastructure
  • Privacy-first applications (healthcare, legal, finance)
  • Offline scenarios
  • Any case where you don't want to pay for an API

When an API is better:

  • You need GPT-4/Claude-level quality
  • Mobile users are your primary audience
  • Complex multi-step reasoning

Summary

| | Raw web-llm | inferis-ml |
| --- | --- | --- |
| Files | 5 | 1 |
| Lines of code | ~660 (more with edge cases) | under 50 |
| Web Workers | Manual setup | Built-in |
| WebGPU fallback | Manual | Automatic |
| Cross-tab | Not supported | SharedWorker |
| Memory | Manual management | LRU eviction |
| Streaming | Manual async iterator | useStream() hook |
| React integration | Custom hooks | useModel() + useStream() |
| RAG | - | Built-in pipeline |
| Bundle size | - | 6.7 kB (minzip) |

Links:


If this was useful - drop a star on GitHub. It's the only metric that keeps open source maintainers going.
