## Overview
Adding AI to a React app usually means signing up for OpenAI, getting an API key, setting up billing, hiding the key in .env, spinning up a backend proxy so it doesn't leak to the client... And that's before you've written a single line of business logic. Then your prototype gets traffic, and you receive a $200 bill because someone decided to stress-test prompt injections on your demo.
If you go the local WebGPU route instead, you're stuck manually wiring async Web Workers to component lifecycles and hooks - easily 200+ lines of infrastructure before the first feature.
Today we'll do it cleanly. I'll show how to build a fully local LLM chatbot in 48 lines of React, running entirely in the user's browser with inferis-react - a thin wrapper that handles workers, memory, and WebGPU/WASM fallbacks under the hood.
- No API keys
- No server
- No per-token charges
- User data never leaves their device
WebGPU is supported in Chrome, Edge, and Firefox, and quantized 1-3B parameter models run on an average laptop at 15-30 tokens/sec.
Let's build a chatbot.
## Without a Library: The Real Cost

Let's be honest about what a production-ready local chatbot requires with raw `@mlc-ai/web-llm`. Here's your file tree:

```
src/
  llm.worker.ts     ~200 lines  - Web Worker: load model, stream tokens via postMessage
  gpu-detect.ts     ~100 lines  - check WebGPU, fall back to WASM, pick the right backend
  worker-bridge.ts  ~120 lines  - typed postMessage protocol, promise wrappers, error forwarding
  useLLM.ts         ~100 lines  - React hook: subscribe to worker events, manage state & cleanup
  Chat.tsx          ~140 lines  - the actual UI component
```
That's 5 files, ~600+ lines of infrastructure to get a basic streaming chatbot that doesn't freeze the UI. And you still don't have:
- Abort/cancel (needs a `MessagePort` or a separate signal channel)
- Memory cleanup when switching between models
- Cross-tab deduplication (5 tabs = 5 copies of the model = 10 GB of RAM)
- Progress tracking with download phases and byte counts
- Retry logic when the worker crashes
Add those and you're well past 600 lines before writing your first feature.
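To make that cost concrete, here is roughly what the promise-wrapper layer of a hand-written `worker-bridge.ts` looks like. This is an illustrative sketch, not code from any library; the `Endpoint` interface is a hypothetical stand-in for a real `Worker` or `MessagePort`:

```typescript
// Sketch of a typed request/response bridge over postMessage.
// `Endpoint` is a hypothetical stand-in for a Worker or MessagePort.
type Endpoint = {
  postMessage(msg: unknown): void;
  onmessage: ((ev: { data: unknown }) => void) | null;
};

type Request = { id: number; method: string; params: unknown };
type Response = { id: number; ok: boolean; result?: unknown; error?: string };

class WorkerBridge {
  private nextId = 0;
  private pending = new Map<
    number,
    { resolve: (v: unknown) => void; reject: (e: Error) => void }
  >();

  constructor(private endpoint: Endpoint) {
    // Route every worker reply back to the promise that is waiting for it.
    endpoint.onmessage = (ev) => {
      const res = ev.data as Response;
      const entry = this.pending.get(res.id);
      if (!entry) return; // stale or unknown reply
      this.pending.delete(res.id);
      if (res.ok) entry.resolve(res.result);
      else entry.reject(new Error(res.error));
    };
  }

  // Send one request and get a promise for its matching response.
  call(method: string, params: unknown): Promise<unknown> {
    const id = this.nextId++;
    return new Promise((resolve, reject) => {
      this.pending.set(id, { resolve, reject });
      this.endpoint.postMessage({ id, method, params });
    });
  }
}
```

And this still omits streaming responses, aborts, and worker-crash recovery; each of those adds its own message types and bookkeeping.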
## With inferis-ml: This Is the Entire File

```bash
npm install inferis-ml inferis-react @mlc-ai/web-llm
```
```tsx
// App.tsx - that's it: no worker files, no bridge, no GPU detection
import { useState } from 'react';
import { InferisProvider, useModel, useStream } from 'inferis-react';
import { webLlmAdapter } from 'inferis-ml/adapters/web-llm';

function Chatbot() {
  const [messages, setMessages] = useState<{ role: string; content: string }[]>([]);
  const [input, setInput] = useState('');
  const { model, state, progress } = useModel<string>('text-generation', {
    model: 'SmolLM2-360M-Instruct-q4f16_1-MLC',
    autoLoad: true,
  });
  const { text, isStreaming, start, stop, reset } = useStream<string>(model);

  if (state === 'loading')
    return <p>Loading... {progress ? Math.round((progress.loaded / (progress.total || 1)) * 100) : 0}%</p>;

  const send = () => {
    if (!input.trim() || isStreaming) return;
    // Commit the previously streamed answer (if any) BEFORE the new user turn,
    // so the history stays in chronological order.
    const next = [...messages];
    if (text) next.push({ role: 'assistant', content: text });
    next.push({ role: 'user', content: input });
    setMessages(next);
    setInput('');
    reset();
    start({ messages: [{ role: 'system', content: 'You are a helpful assistant.' }, ...next] });
  };

  return (
    <div>
      {messages.map((m, i) => (
        <div key={i}><b>{m.role === 'user' ? 'You' : 'AI'}:</b> {m.content}</div>
      ))}
      {text && <div><b>AI:</b> {text}</div>}
      <input
        value={input}
        onChange={e => setInput(e.target.value)}
        onKeyDown={e => e.key === 'Enter' && send()}
        disabled={isStreaming}
      />
      <button onClick={send} disabled={isStreaming}>Send</button>
      {isStreaming && <button onClick={stop}>Stop</button>}
    </div>
  );
}

export default () => <InferisProvider adapter={webLlmAdapter()}><Chatbot /></InferisProvider>;
```
1 file. 48 lines. No worker files, no bridge code, no GPU detection. It just works.

All those 5 files and ~600 lines from the previous section? They're handled inside inferis-ml:
| Your 5 files | inferis-ml |
|---|---|
| `llm.worker.ts` + `worker-bridge.ts` | Built-in worker pool with typed async API |
| `gpu-detect.ts` | `defaultDevice: 'auto'` (one line) |
| `useLLM.ts` | `useModel()` + `useStream()` (two hooks, zero boilerplate) |
| Abort, memory, cross-tab | `stop()`, LRU eviction, `crossTab: true` |
## What Happens Under the Hood

When you wrap your app in `<InferisProvider>`, the library:
```
┌─────────────┐     postMessage     ┌──────────────────┐
│  React UI   │ ◄─────────────────► │    Web Worker    │
│  (main      │                     │  ┌────────────┐  │
│   thread)   │                     │  │  web-llm   │  │
│             │                     │  │  adapter   │  │
│ useModel()  │                     │  └────────────┘  │
│ useStream() │                     │  ┌────────────┐  │
│             │                     │  │  WebGPU /  │  │
│             │                     │  │  WASM      │  │
└─────────────┘                     │  └────────────┘  │
                                    └──────────────────┘
```
- `createPool()` spins up a Web Worker pool. The model loads and runs in a separate thread, so the main React thread stays free for rendering.
- `pool.load()` downloads the model weights (cached via the browser's Cache API), initializes the runtime, and returns a `ModelHandle`.
- `model.stream()` starts inference and returns a standard `ReadableStream`. The `useStream` hook subscribes to the stream and updates React state on every token.
- WebGPU auto-detection: if a GPU is available, WebGPU is used; if not, WASM with SIMD. No manual `navigator.gpu` checks needed.
- Cross-tab: with `crossTab: true`, inferis-ml uses a SharedWorker. Five tabs with the same model share one copy in memory.
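The `ReadableStream` part is worth seeing concretely: a `useStream`-style hook boils down to reading tokens in a loop and re-rendering with the accumulated text. A minimal sketch, where the stream is a hand-built stand-in rather than an actual `model.stream()` result:

```typescript
// Consume a token stream the way a useStream-style hook would:
// read chunks, accumulate, and report the running text after each token.
async function consumeTokens(
  stream: ReadableStream<string>,
  onUpdate: (textSoFar: string) => void,
): Promise<string> {
  const reader = stream.getReader();
  let text = "";
  try {
    for (;;) {
      const { value, done } = await reader.read();
      if (done) break;
      text += value;
      onUpdate(text); // in React this would be a setState call
    }
  } finally {
    reader.releaseLock();
  }
  return text;
}

// Stand-in stream emitting three tokens, roughly as a model stream might.
const demo = new ReadableStream<string>({
  start(controller) {
    for (const t of ["Hel", "lo ", "world"]) controller.enqueue(t);
    controller.close();
  },
});
```

Calling `consumeTokens(demo, setText)` would re-render with "Hel", then "Hello ", then "Hello world", which is exactly the progressive-typing effect you see in the chatbot.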
## What Models Can You Use?

inferis-ml supports three adapters:

| Adapter | Runtime | Models |
|---|---|---|
| `webLlmAdapter()` | `@mlc-ai/web-llm` | Llama 3.2 (1B/3B), SmolLM2, Phi-3, Gemma 2, Qwen 2.5 |
| `transformersAdapter()` | `@huggingface/transformers` | Any model from the HuggingFace Hub |
| `onnxAdapter()` | `onnxruntime-web` | Any ONNX model |
For chatbots, the best choice is `webLlmAdapter()` with MLC-compiled models:

- `SmolLM2-360M` - ~200 MB, runs even on low-end devices
- `Llama-3.2-1B` - ~700 MB, solid quality for its size
- `Llama-3.2-3B` - ~1.8 GB, approaching GPT-3.5 quality; requires a GPU with 4+ GB of VRAM
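Which of these to load can be decided at runtime. Here is a hedged heuristic; the thresholds and the use of `navigator.deviceMemory` (Chromium-only, often undefined) are my assumptions, not library behavior:

```typescript
// Pick an MLC model id based on rough device capability.
// deviceMemoryGB would come from navigator.deviceMemory (may be undefined);
// hasWebGPU from checking "gpu" in navigator. Thresholds are illustrative guesses.
function pickModel(deviceMemoryGB: number | undefined, hasWebGPU: boolean): string {
  if (!hasWebGPU) return "SmolLM2-360M-Instruct-q4f16_1-MLC"; // WASM path: keep it tiny
  if ((deviceMemoryGB ?? 4) >= 8) return "Llama-3.2-3B-Instruct-q4f16_1-MLC";
  if ((deviceMemoryGB ?? 4) >= 4) return "Llama-3.2-1B-Instruct-q4f16_1-MLC";
  return "SmolLM2-360M-Instruct-q4f16_1-MLC";
}
```

The result can be passed straight into the `model:` option of `useModel()` from the earlier example.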
## Bonus: RAG in 10 Minutes
Want the chatbot to answer questions about your documents? inferis-ml includes a full RAG pipeline that also runs entirely in the browser:
```tsx
import { useRAG, useRAGStream } from 'inferis-react';
import { transformersAdapter } from 'inferis-ml/adapters/transformers';
import { webLlmAdapter } from 'inferis-ml/adapters/web-llm';

function RAGChat() {
  const rag = useRAG({
    embeddingModel: 'Xenova/multilingual-e5-small',
    embeddingAdapter: transformersAdapter(),
    llmModel: 'SmolLM2-360M-Instruct-q4f16_1-MLC',
    llmAdapter: webLlmAdapter(),
    autoInit: true,
  });
  const { answer, isStreaming, ask } = useRAGStream(rag.pipeline);

  const handleUpload = async (files: FileList) => {
    // Read each uploaded file as plain text and index it locally.
    const docs = await Promise.all(
      Array.from(files).map(async (f) => ({
        id: f.name,
        text: await f.text(),
      }))
    );
    await rag.index(docs);
  };

  return (
    <div>
      <input type="file" multiple onChange={(e) => e.target.files && handleUpload(e.target.files)} />
      <p>Indexed: {rag.indexedCount} chunks</p>
      <button onClick={() => ask('What is this document about?')}>Ask</button>
      {isStreaming && <p>Thinking...</p>}
      <p>{answer}</p>
    </div>
  );
}
```
Documents are chunked, embedded, and stored in IndexedDB - all locally.
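Under the hood, a RAG pipeline like this comes down to three small pieces: split text into chunks, embed each chunk, and retrieve the best chunks by cosine similarity. A dependency-free sketch of that math; the chunk size, overlap, and scoring are illustrative, not inferis-ml internals:

```typescript
// Naive fixed-size chunking with overlap, so sentences cut at a
// boundary still appear whole in the neighboring chunk.
function chunk(text: string, size = 400, overlap = 50): string[] {
  const out: string[] = [];
  for (let i = 0; i < text.length; i += size - overlap) {
    out.push(text.slice(i, i + size));
    if (i + size >= text.length) break;
  }
  return out;
}

// Cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Rank embedded chunks against a query embedding and keep the top k;
// the winners get pasted into the LLM prompt as context.
function topK(query: number[], embedded: { text: string; vec: number[] }[], k = 3) {
  return [...embedded]
    .map((e) => ({ text: e.text, score: cosine(query, e.vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

In the real pipeline the vectors come from the embedding model and live in IndexedDB; the retrieval logic itself is this simple.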
## Limitations (Honestly)
It would be dishonest not to mention:
- First load is slow. A 1B parameter model is ~700 MB to download. After the first time, weights are cached in the browser.
- WebGPU is needed for decent speed. The WASM fallback works but is 5-10x slower. In 2026, WebGPU is available in Chrome, Edge, and Firefox (93%+ coverage).
- Quality != GPT-4. Models with 1-3B parameters handle simple tasks, summarization, and Q&A well. For complex reasoning, you need larger models (and therefore, an API).
- Mobile devices. Works on iPhone/Android via WASM, but slowly. Desktop is the primary use case.
When this is the right choice:
- Prototypes and demos without infrastructure
- Privacy-first applications (healthcare, legal, finance)
- Offline scenarios
- Any case where you don't want to pay for an API
When an API is better:
- You need GPT-4/Claude-level quality
- Mobile users are your primary audience
- Complex multi-step reasoning
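In practice you can combine both: feature-detect the browser and route each request. A sketch of that decision; checking for `"gpu" in navigator` is the real WebGPU signal, but the routing policy itself is my assumption:

```typescript
type Backend = "local-webgpu" | "local-wasm" | "cloud-api";

// Decide where to run inference. `needsFrontierQuality` covers the
// GPT-4-level cases above; `isMobile` covers the mobile-speed caveat.
function chooseBackend(opts: {
  hasWebGPU: boolean; // e.g. typeof navigator !== "undefined" && "gpu" in navigator
  isMobile: boolean;
  needsFrontierQuality: boolean;
}): Backend {
  if (opts.needsFrontierQuality) return "cloud-api";
  if (opts.isMobile) return "cloud-api"; // WASM on phones works but is slow
  return opts.hasWebGPU ? "local-webgpu" : "local-wasm";
}
```

A simple prompt on a WebGPU-capable desktop stays free and private; everything else falls through to an API.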
## Summary

| | Raw web-llm | inferis-ml |
|---|---|---|
| Files | 5 | 1 |
| Lines of code | ~600 (more with edge cases) | 48 |
| Web Workers | Manual setup | Built-in |
| WebGPU fallback | Manual | Automatic |
| Cross-tab | Not supported | SharedWorker |
| Memory | Manual management | LRU eviction |
| Streaming | Manual async iterator | `useStream()` hook |
| React integration | Custom hooks | `useModel()` + `useStream()` |
| RAG | - | Built-in pipeline |
| Bundle size | - | 6.7 kB (minzip) |
If this was useful - drop a star on GitHub. It's the only metric that keeps open source maintainers going.