Intro
Most LLM apps have the same shape. Ship text to a server, pay per token, and pray the Wi-Fi stays up.
WebLLM is the fun twist. It runs an LLM inside the browser using WebGPU. That unlocks privacy-friendly demos, offline-ish behavior, and a new kind of “deployment” where your biggest backend cost is your user’s laptop fan spinning up like it just saw a Dark Souls boss.
The goal here is simple: one chat UI, one message format, two interchangeable “brains”:
- Local provider: WebLLM in the browser (WebGPU)
- Remote provider: a server endpoint with an OpenAI-compatible shape (Next.js route handler)
tl;dr: Build a tiny chat app where switching between local WebLLM and a remote model is just a dropdown.
Setting Up / Prerequisites
- Node 18+ (20+ preferred)
- A modern Chromium browser with WebGPU enabled (Chrome or Edge is easiest)
- Basic React + TypeScript comfort
Optional but recommended:
- A machine with decent RAM. Smaller laptops can run it, but you will feel the pain sooner.
- Patience for the first model download.
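Quick way to confirm WebGPU is actually usable before you start: paste a check like this into the DevTools console. A rough sketch; it assumes the browser exposes navigator.gpu, and adapter availability still depends on OS and GPU drivers.
// Quick WebGPU sanity check (the DevTools console supports top-level await)
const adapter = await navigator.gpu?.requestAdapter();
console.log(adapter ? "WebGPU adapter found" : "No WebGPU adapter available");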
Implementation Steps
Step 1: Create the app (Vite + React)
npm create vite@latest webllm-dual-provider-chat -- --template react-ts
cd webllm-dual-provider-chat
npm i
npm i @mlc-ai/web-llm
npm run dev
You now have a normal React app that will become a “two-brain” chat UI.
Step 2: Define a provider interface
This interface is the entire trick. The UI does not care how tokens appear, only that they stream in.
In src/ai/types.ts:
export type Role = "system" | "user" | "assistant";
export type ChatMessage = {
role: Role;
content: string;
};
export type StreamChunk = {
delta: string;
done?: boolean;
};
export type ChatProvider = {
id: string;
label: string;
// Called once when selecting this provider (load model, warmup, etc.)
init?: (opts?: {
signal?: AbortSignal;
onStatus?: (s: string) => void;
}) => Promise<void>;
// Stream response tokens/chunks
streamChat: (args: {
messages: ChatMessage[];
signal?: AbortSignal;
onChunk: (chunk: StreamChunk) => void;
onStatus?: (s: string) => void;
}) => Promise<void>;
// Optional cleanup
dispose?: () => Promise<void>;
};
From this point forward, everything is just “implement the interface.”
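To see just how little the UI will need to know, here is a hypothetical "echo" provider that satisfies the same contract without any model at all. It is only a sketch for testing the plumbing (the file name and behavior are made up), not part of the final app.
In src/ai/echoProvider.ts (optional):
import type { ChatProvider } from "./types";
// A fake provider that streams the user's last message back, word by word.
export function createEchoProvider(): ChatProvider {
  return {
    id: "echo",
    label: "Echo (Fake)",
    streamChat: async ({ messages, onChunk, onStatus }) => {
      onStatus?.("Echoing...");
      const last = messages[messages.length - 1]?.content ?? "";
      for (const word of last.split(" ")) {
        onChunk({ delta: word + " " });
        await new Promise((r) => setTimeout(r, 50)); // simulate token latency
      }
      onChunk({ delta: "", done: true });
      onStatus?.("Done.");
    },
  };
}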
Step 3: Implement the WebLLM local provider
Two important realities:
- First run downloads a model. This can be big. Show status text so it does not look frozen.
- WebGPU is not universal. Feature detect and fall back.
Also, use a model ID that actually exists in the prebuilt list; this one worked at the time of writing:
Llama-3.1-8B-Instruct-q4f32_1-MLC
In src/ai/webllmProvider.ts
import type { ChatMessage, ChatProvider } from "./types";
import * as webllm from "@mlc-ai/web-llm";
function toWebLLMMessages(
messages: ChatMessage[]
): webllm.ChatCompletionMessageParam[] {
return messages.map((m) => ({ role: m.role, content: m.content }));
}
export function createWebLLMProvider(
modelId = "Llama-3.1-8B-Instruct-q4f32_1-MLC"
): ChatProvider {
let engine: webllm.MLCEngineInterface | null = null;
const init: NonNullable<ChatProvider["init"]> = async ({ signal, onStatus } = {}) => {
if (!("gpu" in navigator)) {
throw new Error("WebGPU not available in this browser.");
}
if (engine) return;
onStatus?.("Initializing WebLLM engine...");
engine = await webllm.CreateMLCEngine(modelId, {
initProgressCallback: (p) => {
const msg =
typeof p === "string" ? p : (p as any)?.text ?? "Loading model...";
onStatus?.(msg);
},
});
onStatus?.("Warming up...");
await engine.chat.completions.create({
messages: [{ role: "user", content: "Say 'ready'." }],
temperature: 0,
});
onStatus?.("Ready.");
signal?.throwIfAborted?.();
};
return {
id: "local-webllm",
label: "Local (WebLLM)",
init,
streamChat: async ({ messages, signal, onChunk, onStatus }) => {
if (!engine) {
onStatus?.("Engine not initialized. Initializing now...");
await init({ signal, onStatus });
}
if (!engine) throw new Error("WebLLM engine failed to initialize.");
onStatus?.("Generating...");
const resp = await engine.chat.completions.create({
messages: toWebLLMMessages(messages),
stream: true,
temperature: 0.7,
});
for await (const event of resp) {
signal?.throwIfAborted?.();
const delta = event.choices?.[0]?.delta?.content ?? "";
// Optional cleanup if your model spits out template markers
const cleaned = delta
.replaceAll("<|start_header_id|>", "")
.replaceAll("<|end_header_id|>", "");
if (cleaned) onChunk({ delta: cleaned });
}
onChunk({ delta: "", done: true });
onStatus?.("Done.");
},
dispose: async () => {
// Some builds expose engine.dispose(). If not, dropping the reference is fine.
// @ts-expect-error optional
await engine?.dispose?.();
engine = null;
},
};
}
Model IDs can change across releases and builds. If a model ID fails to load, check the current prebuilt model list for an updated one.
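If you want to see which IDs your installed version knows about, the package ships a prebuilt model list you can log. A small sketch, assuming the exported prebuiltAppConfig shape has not changed in your version:
import * as webllm from "@mlc-ai/web-llm";
// Print every prebuilt model ID bundled with the installed @mlc-ai/web-llm version.
for (const record of webllm.prebuiltAppConfig.model_list) {
  console.log(record.model_id);
}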
Step 4: Implement the remote provider client
Same contract, same streaming shape. The UI should not have to care if the text came from WebGPU wizardry or a server in a trench coat.
You can skip this step if you prefer to only have a local provider.
In src/ai/remoteProvider.ts
import type { ChatProvider } from "./types";
export function createRemoteProvider(endpoint = "/api/chat"): ChatProvider {
return {
id: "remote",
label: "Remote (Server)",
streamChat: async ({ messages, signal, onChunk, onStatus }) => {
onStatus?.("Contacting server...");
const res = await fetch(endpoint, {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ messages }),
signal,
});
if (!res.ok || !res.body) {
throw new Error(`Remote provider error: ${res.status}`);
}
onStatus?.("Streaming...");
const reader = res.body.getReader();
const decoder = new TextDecoder();
while (true) {
const { value, done } = await reader.read();
if (done) break;
const text = decoder.decode(value, { stream: true });
if (text) onChunk({ delta: text });
}
onChunk({ delta: "", done: true });
onStatus?.("Done.");
},
};
}
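Because both providers share the same contract, you can drive either one outside React for a quick smoke test. A rough sketch (assumes an ES module context so top-level await works):
import { createRemoteProvider } from "./remoteProvider";
const provider = createRemoteProvider("/api/chat");
let reply = "";
await provider.streamChat({
  messages: [{ role: "user", content: "Hello!" }],
  onChunk: ({ delta, done }) => {
    reply += delta;
    if (done) console.log("Full reply:", reply);
  },
  onStatus: (s) => console.log("[status]", s),
});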
Step 5: Add a Next.js route handler for /api/chat
You can skip this step if you prefer to only have a local provider.
This route handler:
- receives { messages } from the client
- calls OpenAI's Responses API with stream: true
- converts the SSE stream into a plain text stream your Vite client already understands
In app/api/chat/route.ts
export const runtime = "edge";
type Role = "system" | "user" | "assistant";
type ChatMessage = { role: Role; content: string };
export async function POST(req: Request) {
const { messages } = (await req.json()) as { messages: ChatMessage[] };
const apiKey = process.env.OPENAI_API_KEY;
if (!apiKey) {
return new Response("Missing OPENAI_API_KEY", { status: 500 });
}
const model = process.env.OPENAI_MODEL || "gpt-4o-mini";
const upstream = await fetch("https://api.openai.com/v1/responses", {
method: "POST",
headers: {
Authorization: `Bearer ${apiKey}`,
"Content-Type": "application/json",
},
body: JSON.stringify({
model,
stream: true,
input: messages.map((m) => ({
role: m.role,
// The Responses API expects "output_text" content parts for assistant
// turns and "input_text" for user/system turns.
content: [
{
type: m.role === "assistant" ? "output_text" : "input_text",
text: m.content,
},
],
})),
text: { format: { type: "text" } },
}),
});
if (!upstream.ok || !upstream.body) {
const errText = await upstream.text().catch(() => "");
return new Response(`Upstream error (${upstream.status}): ${errText}`, {
status: 500,
});
}
const encoder = new TextEncoder();
const decoder = new TextDecoder();
let buffer = "";
const stream = new ReadableStream<Uint8Array>({
async start(controller) {
const reader = upstream.body!.getReader();
try {
while (true) {
const { value, done } = await reader.read();
if (done) break;
buffer += decoder.decode(value, { stream: true });
// SSE events are separated by a blank line
let idx;
while ((idx = buffer.indexOf("\n\n")) !== -1) {
const rawEvent = buffer.slice(0, idx);
buffer = buffer.slice(idx + 2);
const dataLines = rawEvent
.split("\n")
.filter((line) => line.startsWith("data:"))
.map((line) => line.replace(/^data:\s?/, "").trim());
for (const data of dataLines) {
if (!data) continue;
if (data === "[DONE]") {
controller.close();
return;
}
let evt: any;
try {
evt = JSON.parse(data);
} catch {
continue;
}
if (
evt.type === "response.output_text.delta" &&
typeof evt.delta === "string"
) {
controller.enqueue(encoder.encode(evt.delta));
}
}
}
}
} catch (e) {
controller.error(e);
return;
}
// Close exactly once; the [DONE] branch above closes and returns early.
controller.close();
},
});
return new Response(stream, {
headers: {
"Content-Type": "text/plain; charset=utf-8",
"Cache-Control": "no-cache, no-transform",
},
});
}
Running Next.js alongside Vite without CORS pain
If the chat UI is running on Vite (localhost:5173) and Next.js is running on localhost:3000, calling /api/chat from Vite will hit Vite’s server, not Next. The easy fix is a dev proxy.
Update vite.config.ts:
import { defineConfig } from "vite";
import react from "@vitejs/plugin-react";
export default defineConfig({
plugins: [react()],
server: {
proxy: {
"/api": "http://localhost:3000",
},
},
});
Now the client can keep using createRemoteProvider("/api/chat") and Vite will forward it to Next.
Environment variables for Next.js
Create .env.local in the Next.js project:
OPENAI_API_KEY=your_key_here
OPENAI_MODEL=gpt-4o-mini
Step 6: Build the chat hook (provider-agnostic brain socket)
The whole job of this hook is to:
- manage messages
- manage streaming state
- route the “append these tokens” events into the last assistant message
The sharp edge: streaming makes state bugs very obvious. If you mutate the last message in place, React will punish you with duplication weirdness, especially in dev.
So we update the last message immutably.
In src/hooks/useChat.ts
import { useMemo, useRef, useState } from "react";
import type { ChatMessage, ChatProvider } from "../ai/types";
export function useChat(providers: ChatProvider[]) {
const [providerId, setProviderId] = useState(providers[0]?.id ?? "");
const provider = useMemo(
() => providers.find((p) => p.id === providerId)!,
[providers, providerId]
);
const [messages, setMessages] = useState<ChatMessage[]>([
{ role: "system", content: "You are a helpful assistant." },
]);
const [status, setStatus] = useState<string>("");
const [isStreaming, setIsStreaming] = useState(false);
const abortRef = useRef<AbortController | null>(null);
async function selectProvider(nextId: string) {
abortRef.current?.abort();
setProviderId(nextId);
const next = providers.find((p) => p.id === nextId);
if (next?.init) {
setStatus("Preparing provider...");
try {
await next.init({ onStatus: setStatus });
} catch (e: any) {
setStatus(e?.message ?? "Failed to initialize provider.");
}
}
}
async function send(userText: string) {
if (!userText.trim()) return;
if (isStreaming) return;
abortRef.current?.abort();
abortRef.current = new AbortController();
const userMsg: ChatMessage = { role: "user", content: userText };
// Add user + placeholder assistant
setMessages((prev) => [...prev, userMsg, { role: "assistant", content: "" }]);
setIsStreaming(true);
setStatus("");
try {
await provider.streamChat({
messages: [...messages, userMsg], // good enough for a demo
signal: abortRef.current.signal,
onStatus: setStatus,
onChunk: ({ delta, done }) => {
if (delta) {
setMessages((prev) => {
const last = prev[prev.length - 1];
if (!last || last.role !== "assistant") return prev;
// Immutable update
const updatedLast = { ...last, content: last.content + delta };
return [...prev.slice(0, -1), updatedLast];
});
}
if (done) setIsStreaming(false);
},
});
} catch (e: any) {
setIsStreaming(false);
setStatus(e?.message ?? "Error while streaming.");
}
}
function stop() {
abortRef.current?.abort();
setIsStreaming(false);
setStatus("Stopped.");
}
return {
providers,
providerId,
provider,
messages,
status,
isStreaming,
selectProvider,
send,
stop,
};
}
React state closure note:
messages: [...messages, userMsg] uses the current render's messages. For normal chat usage, that is fine. If you want to harden it, store messages in a ref and read from that when starting the stream.
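If you do want the hardened version, here is a minimal sketch of the ref approach (useLatest is a hypothetical helper, not something the steps above created):
import { useRef } from "react";
// Returns a ref that always mirrors the latest value passed in on each render.
function useLatest<T>(value: T) {
  const ref = useRef(value);
  ref.current = value;
  return ref;
}
// Inside useChat you would then do:
//   const latestMessages = useLatest(messages);
//   ...
//   await provider.streamChat({
//     messages: [...latestMessages.current, userMsg],
//     ...
//   });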
Step 7: UI component
Keep it simple. Treat the provider dropdown as a “brain toggle” and let the rest of the UI stay boring on purpose.
In src/App.tsx
import { useEffect, useMemo, useState } from "react";
import { createWebLLMProvider } from "./ai/webllmProvider";
import { createRemoteProvider } from "./ai/remoteProvider";
import { useChat } from "./hooks/useChat";
export default function App() {
const providers = useMemo(
() => [createWebLLMProvider(), createRemoteProvider("/api/chat")],
[]
);
const chat = useChat(providers);
const [input, setInput] = useState("");
useEffect(() => {
chat.selectProvider(chat.providerId);
// eslint-disable-next-line react-hooks/exhaustive-deps
}, []);
return (
<div style={{ maxWidth: 900, margin: "0 auto", padding: 16, fontFamily: "system-ui" }}>
<h1>Dual Provider Chat</h1>
<div style={{ display: "flex", gap: 12, alignItems: "center" }}>
<label>
Provider{" "}
<select
value={chat.providerId}
onChange={(e) => chat.selectProvider(e.target.value)}
disabled={chat.isStreaming}
>
{chat.providers.map((p) => (
<option key={p.id} value={p.id}>
{p.label}
</option>
))}
</select>
</label>
<div style={{ opacity: 0.8 }}>{chat.status}</div>
{chat.isStreaming && <button onClick={chat.stop}>Stop</button>}
</div>
<div style={{ marginTop: 16, border: "1px solid #ddd", borderRadius: 8, padding: 12, minHeight: 300 }}>
{chat.messages
.filter((m) => m.role !== "system")
.map((m, idx) => (
<div key={idx} style={{ marginBottom: 12 }}>
<div style={{ fontWeight: 700 }}>{m.role}</div>
<div style={{ whiteSpace: "pre-wrap" }}>{m.content}</div>
</div>
))}
</div>
<form
onSubmit={(e) => {
e.preventDefault();
chat.send(input);
setInput("");
}}
style={{ display: "flex", gap: 8, marginTop: 12 }}
>
<input
value={input}
onChange={(e) => setInput(e.target.value)}
placeholder="Say something..."
style={{ flex: 1, padding: 10 }}
disabled={chat.isStreaming}
/>
<button type="submit" disabled={chat.isStreaming}>
Send
</button>
</form>
</div>
);
}
At this point you have:
- a local WebGPU chat provider
- a remote API chat provider
- a UI that can swap between them without rewriting anything
Why this setup is worth having
1) Privacy-first features without a backend
If the user’s text is sensitive (journaling, medical notes, internal docs), local mode keeps content on-device by default.
2) Cost control and “free” demos
Local mode is effectively “free per token” after the download. It is great for:
- prototypes
- workshops
- dev tooling
- weekend projects that should not come with a monthly bill
3) Graceful degradation
Local mode can be an upgrade path instead of a requirement (see the sketch after this list):
- WebGPU available: local
- WebGPU missing: remote fallback
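A minimal routing sketch, reusing the two factory functions from earlier steps (paths assume the files live under src/ai/):
import type { ChatProvider } from "./ai/types";
import { createWebLLMProvider } from "./ai/webllmProvider";
import { createRemoteProvider } from "./ai/remoteProvider";
// Prefer the local provider when the browser exposes WebGPU; otherwise fall back to remote.
export function pickDefaultProvider(): ChatProvider {
  return "gpu" in navigator
    ? createWebLLMProvider()
    : createRemoteProvider("/api/chat");
}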
4) Offline-ish UX for specific workflows
Full offline is tricky, but “no server call needed for this” is still a huge win for:
- rewriting text
- summarizing
- quick Q&A over content already in the browser
Real Talk
- Download time: first load can be chunky. People will assume it is broken unless you show progress.
- Device limits: mobile can struggle. Low-RAM machines can crash tabs or throttle hard.
- WebGPU support: treat local mode as progressive enhancement, not a hard dependency.
- Privacy win: local mode avoids shipping user text to your server by default.
- Cost win: local mode shifts the cost to user compute, which is nice until it is not.
Watch outs and gotchas
Token junk like <|start_header_id|>
Some model builds emit template markers. Filtering them out is fine for demos. For cleaner output long-term, experiment with model choices and chat templates.
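If you would rather not hard-code each marker, a generic filter like this sketch strips anything shaped like <|...|> (crude, and it will miss markers that get split across streamed chunks):
// Remove any <|...|>-style template markers from a streamed delta.
function stripTemplateMarkers(text: string): string {
  return text.replace(/<\|[^|>]*\|>/g, "");
}
stripTemplateMarkers("Hello<|start_header_id|> world"); // "Hello world"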
Local models are not remote models
Expect differences:
- weaker instruction following
- more formatting quirks
- occasional “why are you like this” moments
Possible Improvements
- Model picker UI
- dropdown of model IDs
- persist selection in localStorage
- show estimated download size if available
- Provider router
- auto-pick local if WebGPU exists
- auto-fallback to remote if init fails
- show a small badge: “Local” or “Remote”
- Conversation memory controls
- send last N messages only
- auto-summarize older messages (local if possible)
- Structured output mode
- have the assistant return JSON “actions”
- validate with zod before rendering anything
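As a taste of that last idea, a hedged sketch of validating a hypothetical "action" payload with zod before the UI touches it (zod is not installed by the steps above, so npm i zod first; the action shape here is made up):
import { z } from "zod";
// Hypothetical action shape the assistant is asked to return as JSON.
const ActionSchema = z.object({
  type: z.enum(["reply", "search", "summarize"]),
  text: z.string(),
});
export function parseAssistantAction(raw: string) {
  try {
    const result = ActionSchema.safeParse(JSON.parse(raw));
    return result.success ? result.data : null;
  } catch {
    return null; // not even valid JSON
  }
}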
Outro
A provider boundary is one of those small architectural choices that pays rent forever. Models change, vendors change, pricing changes, browser capabilities evolve. A chat UI that can swap brains is a lot harder to paint into a corner.
Also, it is extremely satisfying to flip a dropdown and make your browser turn into a tiny AI workstation.