wellallyTech
Privacy-First Edge AI: Building a Zero-Server Symptom Checker with WebLLM and WebGPU

Privacy is the new frontier in AI development. When it comes to sensitive data—like medical symptoms or personal health history—sending data to a central cloud server is a hard "no" for many users. What if we could run a Large Language Model (LLM) entirely within the user's browser? 🤯

By leveraging WebLLM, WebGPU, and Edge AI principles, we can now achieve near-native inference speeds directly on the client side. This approach eliminates server costs, ensures 100% data privacy through physical isolation, and provides a seamless user experience. If you are looking for advanced patterns on local-first AI and production-ready deployments, I highly recommend checking out the deep dives over at WellAlly Tech Blog, which served as a major inspiration for this architectural pattern.

Why Run LLMs in the Browser?

Traditionally, LLM inference has demanded datacenter-class GPUs like the A100/H100. However, with the maturation of the WebGPU standard and the TVM.js compiler stack, we can now tap into the local machine's GPU directly from the browser.

  1. No API Costs or Network Latency: No tokens to buy, no round-trips to a remote server.
  2. Ultimate Privacy: Data never leaves the user's device.
  3. Offline Capability: Once the weights are cached, it works without an internet connection.

The Architecture

Here is how the data flows from a user's symptom description to an AI-generated suggestion, all within the browser's sandbox.

```mermaid
graph TD
    A[User Input: Symptoms] --> B[WebLLM Engine]
    B --> C{WebGPU Support?}
    C -- Yes --> D[Wasm + TVM.js Runtime]
    C -- No --> E[Fallback/Error]
    D --> F[Local IndexedDB Cache]
    F --> G[GPU Accelerated Inference]
    G --> H[Streaming Response to UI]
    H --> I[User Actionable Advice]
```

Prerequisites

Before we dive into the code, ensure your environment is ready:

  • Tech Stack: TypeScript, WebLLM, Vite.
  • Browser: Chrome 113+ or any browser with WebGPU enabled.
  • Hardware: A machine with a decent integrated or dedicated GPU.
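Before downloading multi-gigabyte weights, it's worth feature-detecting WebGPU so you can show a graceful fallback (the "No" branch in the diagram above). Here is a minimal sketch: `navigator.gpu` is the standard WebGPU entry point, and requesting an adapter confirms a usable GPU actually exists. The pure check is factored out so it can be unit-tested without a browser.

```typescript
// Pure check: does this navigator-like object expose a WebGPU entry point?
// Factored out of the async path so it can be tested without a real browser.
export function hasWebGPUProperty(nav: { gpu?: unknown }): boolean {
  return nav != null && nav.gpu != null;
}

// Browser-side check: requestAdapter() resolves to null when no suitable
// GPU is available, so we confirm the adapter, not just the API surface.
export async function canRunWebGPU(): Promise<boolean> {
  const nav = navigator as unknown as { gpu?: { requestAdapter: () => Promise<unknown | null> } };
  if (!hasWebGPUProperty(nav)) return false;
  try {
    const adapter = await nav.gpu!.requestAdapter();
    return adapter !== null;
  } catch {
    return false;
  }
}
```

Call `canRunWebGPU()` once at startup and gate the model download behind it.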

Step 1: Initializing the WebLLM Engine

First, we need to initialize the MLCEngine. Since model weights can be large (several GB), WebLLM caches them in the browser's Cache API after the first download, so subsequent loads are fast.

```typescript
import { CreateMLCEngine, MLCEngine } from "@mlc-ai/web-llm";

// We'll use a quantized version of Llama-3 or Phi-3 for efficiency
const selectedModel = "Llama-3-8B-Instruct-v0.1-q4f16_1-MLC";

async function initializeEngine(onProgress: (p: number) => void): Promise<MLCEngine> {
  console.log("🚀 Initializing WebGPU Engine...");

  const engine = await CreateMLCEngine(selectedModel, {
    // Fires repeatedly during download/compilation; progress is 0..1
    initProgressCallback: (report) => {
      onProgress(Math.round(report.progress * 100));
      console.log(report.text);
    },
  });

  return engine;
}
```

Step 2: Crafting the System Prompt

For a symptom checker, the system prompt is critical. We need to ensure the AI behaves like a supportive assistant while emphasizing that it is not a replacement for professional medical advice.

```typescript
const SYSTEM_PROMPT = `
You are a private, local medical symptom checker.
Analyze the symptoms provided by the user.
Provide potential causes and suggest whether they should seek urgent care.
ALWAYS include a disclaimer: "This is an AI-generated summary, not a medical diagnosis."
`;
```
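Even with an ALL-CAPS instruction, small quantized models occasionally drop the mandated disclaimer. A defensive post-processing helper (my own addition, not part of WebLLM) guarantees it always reaches the user:

```typescript
export const DISCLAIMER =
  "This is an AI-generated summary, not a medical diagnosis.";

// Append the disclaimer if the model forgot it; a no-op if it's already there,
// so calling this twice never duplicates the text.
export function ensureDisclaimer(response: string): string {
  return response.includes(DISCLAIMER)
    ? response
    : `${response.trimEnd()}\n\n${DISCLAIMER}`;
}
```

Run the final streamed response through `ensureDisclaimer` before rendering it.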

Step 3: Executing Inference

Now, let's build the chat function. We use a streaming approach to make the UI feel responsive, just like ChatGPT.

```typescript
async function checkSymptoms(
  engine: MLCEngine,
  userInput: string,
  onUpdate: (text: string) => void
): Promise<string> {
  const messages = [
    { role: "system" as const, content: SYSTEM_PROMPT },
    { role: "user" as const, content: userInput },
  ];

  let fullResponse = "";
  const chunks = await engine.chat.completions.create({
    messages,
    stream: true, // High-fives for streaming! 🖐️
  });

  for await (const chunk of chunks) {
    const content = chunk.choices[0]?.delta?.content ?? "";
    fullResponse += content;
    onUpdate(fullResponse); // update your React/Vue state here
  }

  return fullResponse;
}
```
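One small refactor worth making: extract the message construction into a pure helper so it can be unit-tested without a GPU or multi-gigabyte weights in the loop. A sketch (the `ChatMessage` type is my own simplification of WebLLM's message shape):

```typescript
// Simplified message shape; WebLLM accepts objects of this form.
export type ChatMessage = {
  role: "system" | "user" | "assistant";
  content: string;
};

// Pure helper: builds the chat payload for checkSymptoms.
// Testable with zero browser or model dependencies.
export function buildMessages(systemPrompt: string, userInput: string): ChatMessage[] {
  return [
    { role: "system", content: systemPrompt },
    { role: "user", content: userInput },
  ];
}
```

Inside `checkSymptoms`, the inline `messages` array can then be replaced with `buildMessages(SYSTEM_PROMPT, userInput)`.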

Performance Optimization

When working with Edge AI, memory management is your biggest hurdle.

  • Quantization: We use q4f16 (4-bit weights with fp16 activations) to shrink the model size by ~70% relative to fp16 without significant quality loss.
  • TVM.js: This handles the bridge between the high-level model logic and the low-level WebGPU shaders.
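Even quantized, an 8B model still needs several GB of memory, so on constrained devices it's sensible to fall back to a smaller model like Phi-3-mini. A sketch of that tiering logic: the model IDs reuse the one from Step 1 plus a Phi-3 ID that should be verified against the prebuilt model list in your installed `@mlc-ai/web-llm` version, and `navigator.deviceMemory` is a Chromium-only API, so the 4 GB default is an assumption.

```typescript
// Pick a model tier from an estimate of available RAM (in GB).
// Verify these IDs against WebLLM's prebuilt model registry for your version.
export function pickModel(deviceMemoryGB: number): string {
  if (deviceMemoryGB >= 8) return "Llama-3-8B-Instruct-v0.1-q4f16_1-MLC";
  return "Phi-3-mini-4k-instruct-q4f16_1-MLC"; // hypothetical smaller tier
}

// navigator.deviceMemory exists only in Chromium browsers;
// default to a conservative 4 GB when it's unavailable.
export function detectMemoryGB(nav: { deviceMemory?: number }): number {
  return nav.deviceMemory ?? 4;
}
```

Usage: `pickModel(detectMemoryGB(navigator as { deviceMemory?: number }))` in place of the hard-coded `selectedModel`.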

For a deeper dive into how to optimize these shaders for mobile browsers, the team at wellally.tech/blog has published some incredible benchmarks comparing WebLLM performance across different chipsets.

Conclusion

We've just built a fully functional, privacy-preserving symptom checker that runs entirely on the client. No servers, no leaks, just pure GPU-accelerated magic. 🥑

Key Takeaways:

  • WebGPU is the backbone of modern browser-based AI.
  • WebLLM provides the easiest abstraction for running MLC-compiled models.
  • Privacy-first apps are the future of healthcare tech.

What are you planning to build with WebGPU? Let me know in the comments below! Don't forget to star the MLC-LLM repo and keep experimenting.


Love this content? Follow for more "Learning in Public" tutorials! 🚀
