Privacy is the new frontier in AI development. When it comes to sensitive data—like medical symptoms or personal health history—sending data to a central cloud server is a hard "no" for many users. What if we could run a Large Language Model (LLM) entirely within the user's browser? 🤯
By combining WebLLM, WebGPU, and Edge AI principles, we can run inference directly on the client at speeds approaching native performance. This approach eliminates server-side inference costs, keeps data private because it never leaves the device, and still delivers a smooth user experience. If you are looking for advanced patterns on local-first AI and production-ready deployments, I highly recommend the deep dives over at the WellAlly Tech Blog, which served as a major inspiration for this architectural pattern.
Why Run LLMs in the Browser?
Traditionally, serving LLMs has meant renting powerful datacenter GPUs like the A100 or H100. However, with the maturation of the WebGPU standard and the TVM.js compiler stack, we can now tap into the local machine's GPU directly from the browser.
- Zero API Cost, Zero Round-Trips: inference runs locally, so there are no per-token fees and no network latency.
- Ultimate Privacy: Data never leaves the user's device.
- Offline Capability: Once the weights are cached, it works without an internet connection.
The Architecture
Here is how the data flows from a user's symptom description to an AI-generated suggestion, all within the browser's sandbox.
```mermaid
graph TD
  A[User Input: Symptoms] --> B[WebLLM Engine]
  B --> C{WebGPU Support?}
  C -- Yes --> D[Wasm + TVM.js Runtime]
  C -- No --> E[Fallback/Error]
  D --> F[Local IndexedDB Cache]
  F --> G[GPU Accelerated Inference]
  G --> H[Streaming Response to UI]
  H --> I[User Actionable Advice]
```
Prerequisites
Before we dive into the code, ensure your environment is ready:
- Tech Stack: TypeScript, WebLLM, Vite.
- Browser: Chrome 113+ or any browser with WebGPU enabled.
- Hardware: A machine with a decent integrated or dedicated GPU.
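Since the architecture above falls back when WebGPU is unavailable, it's worth probing support before downloading multi-gigabyte weights. Here is a minimal sketch; the `nav` parameter is injected so the logic is testable outside a browser, and in real code you would pass the global `navigator`:

```typescript
// Shape of the WebGPU entry point we rely on; `navigator.gpu` is the
// standard access point defined by the WebGPU spec.
type GPULike = { gpu?: { requestAdapter(): Promise<unknown | null> } };

// Returns "ready" if an adapter is available, "no-adapter" if the API
// exists but no usable GPU was found, "unsupported" otherwise.
async function checkWebGPU(
  nav: GPULike
): Promise<"ready" | "no-adapter" | "unsupported"> {
  if (!nav.gpu) return "unsupported";
  const adapter = await nav.gpu.requestAdapter();
  return adapter ? "ready" : "no-adapter";
}
```

In the app you'd call `checkWebGPU(navigator)` (cast it if your TypeScript lib doesn't ship WebGPU types yet) and only initialize the engine on `"ready"`.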
Step 1: Initializing the WebLLM Engine
First, we need to initialize the MLCEngine. (WebLLM also ships a Web Worker variant, WebWorkerMLCEngine, if you want to keep inference off the main thread.) Since model weights can be large (several GBs), WebLLM caches them locally—via the browser's Cache API or IndexedDB—so they only download once.
```typescript
import { CreateMLCEngine, MLCEngine } from "@mlc-ai/web-llm";

// A quantized model keeps the download and VRAM footprint manageable.
// The ID must match an entry in WebLLM's prebuilt model list.
const selectedModel = "Llama-3-8B-Instruct-q4f16_1-MLC";

async function initializeEngine(
  onProgress: (p: number) => void
): Promise<MLCEngine> {
  console.log("🚀 Initializing WebGPU Engine...");

  const engine = await CreateMLCEngine(selectedModel, {
    initProgressCallback: (report) => {
      // report.progress is a 0–1 fraction of the download/compile work
      onProgress(Math.round(report.progress * 100));
      console.log(report.text);
    },
  });

  return engine;
}
```
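On a first visit the download can take minutes, so the progress callback deserves a real loading state. A small, hypothetical formatting helper (not part of WebLLM) that the `onProgress` callback could feed:

```typescript
// Turn a 0–100 progress value into a label for a loading bar while
// multi-GB weights download on first visit. Hypothetical helper.
function formatProgress(percent: number): string {
  const clamped = Math.min(100, Math.max(0, Math.round(percent)));
  return clamped < 100
    ? `Downloading model weights… ${clamped}%`
    : "Model ready";
}
```

Wire it up as `initializeEngine((p) => progressEl.textContent = formatProgress(p))`, where `progressEl` is whatever DOM node or state setter your framework uses.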
Step 2: Crafting the System Prompt
For a symptom checker, the system prompt is critical. We need to ensure the AI behaves like a supportive assistant while emphasizing that it is not a replacement for professional medical advice.
```typescript
const SYSTEM_PROMPT = `
You are a private, local medical symptom checker.
Analyze the symptoms provided by the user.
Provide potential causes and suggest whether the user should seek urgent care.
ALWAYS include a disclaimer: "This is an AI-generated summary, not a medical diagnosis."
Never suggest uploading data to an external service.
`;
```
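Small quantized models sometimes drop instructions, so it's prudent not to rely on the prompt alone for the disclaimer. A belt-and-braces guard (a hypothetical `ensureDisclaimer` helper, not part of WebLLM) can append it in post-processing:

```typescript
const DISCLAIMER =
  "This is an AI-generated summary, not a medical diagnosis.";

// Append the mandatory disclaimer if the model omitted it.
function ensureDisclaimer(response: string): string {
  return response.includes(DISCLAIMER)
    ? response
    : `${response}\n\n${DISCLAIMER}`;
}
```

Run the final response through this before rendering it, so the disclaimer is guaranteed regardless of what the model produced.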
Step 3: Executing Inference
Now, let's build the chat function. We use a streaming approach to make the UI feel responsive, just like ChatGPT.
```typescript
import type { ChatCompletionMessageParam } from "@mlc-ai/web-llm";

async function checkSymptoms(
  engine: MLCEngine,
  userInput: string
): Promise<string> {
  const messages: ChatCompletionMessageParam[] = [
    { role: "system", content: SYSTEM_PROMPT },
    { role: "user", content: userInput },
  ];

  let fullResponse = "";

  const chunks = await engine.chat.completions.create({
    messages,
    stream: true, // High-fives for streaming! 🖐️
  });

  for await (const chunk of chunks) {
    const content = chunk.choices[0]?.delta?.content || "";
    fullResponse += content;
    updateUI(fullResponse); // Update your React/Vue state here
  }

  return fullResponse;
}
```
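The streaming loop above is essentially a fold over delta chunks; factoring that accumulation into a pure function makes it unit-testable without a GPU. The chunk shape below mirrors the OpenAI-style streaming format WebLLM emits, reduced to just the fields we read:

```typescript
// Minimal chunk shape: each streamed chunk may or may not carry a
// content delta, exactly as in the loop above.
type Delta = { choices: { delta?: { content?: string } }[] };

// Concatenate the content deltas of all chunks into the full response.
function accumulate(chunks: Delta[]): string {
  return chunks.reduce(
    (acc, c) => acc + (c.choices[0]?.delta?.content ?? ""),
    ""
  );
}
```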
Performance Optimization
When working with Edge AI, memory management is your biggest hurdle.
- Quantization: We use q4f16_1 (4-bit weights with float16 activations) to shrink the model size by roughly 70% with only a modest quality loss.
- TVM.js: This handles the bridge between the high-level model logic and the low-level WebGPU shaders.
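The ~70% figure is easy to sanity-check with back-of-envelope arithmetic (illustrative numbers, assuming an 8B-parameter model, ignoring activations and KV cache):

```typescript
// Back-of-envelope weight size: parameters × bits per weight,
// converted to GiB. Estimates only, not measured values.
function weightSizeGB(params: number, bitsPerWeight: number): number {
  return (params * bitsPerWeight) / 8 / 1024 ** 3;
}
```

An 8B model drops from roughly 14.9 GB at fp16 (`weightSizeGB(8e9, 16)`) to about 3.7 GB at 4 bits (`weightSizeGB(8e9, 4)`), a ~75% reduction, in line with the figure above.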
For a deeper dive into how to optimize these shaders for mobile browsers, the team at wellally.tech/blog has published some incredible benchmarks comparing WebLLM performance across different chipsets.
Conclusion
We've just built a fully functional, privacy-preserving symptom checker that runs entirely on the client. No servers, no leaks, just pure GPU-accelerated magic. 🥑
Key Takeaways:
- WebGPU is the backbone of modern browser-based AI.
- WebLLM provides the easiest abstraction for running MLC-compiled models.
- Privacy-first apps are the future of healthcare tech.
What are you planning to build with WebGPU? Let me know in the comments below! Don't forget to star the MLC-LLM repo and keep experimenting.
Love this content? Follow for more "Learning in Public" tutorials! 🚀