wellallyTech
Privacy-First Edge AI: Building a Zero-Server Symptom Checker with WebLLM and WebGPU

Privacy is the new frontier in AI development. When it comes to sensitive data—like medical symptoms or personal health history—sending data to a central cloud server is a hard "no" for many users. What if we could run a Large Language Model (LLM) entirely within the user's browser? 🤯

By leveraging WebLLM, WebGPU, and Edge AI principles, we can now achieve near-native inference speeds directly on the client side. This approach eliminates server costs, ensures 100% data privacy through physical isolation, and provides a seamless user experience. If you are looking for advanced patterns on local-first AI and production-ready deployments, I highly recommend checking out the deep dives over at WellAlly Tech Blog, which served as a major inspiration for this architectural pattern.

Why Run LLMs in the Browser?

Traditionally, LLM inference has demanded datacenter-class GPUs like the A100/H100. However, with the maturation of the WebGPU standard and the TVM.js compiler stack, we can now tap into the local machine's GPU directly from the browser.

  1. No API Costs or Network Latency: No tokens to buy, no round-trips to a remote server.
  2. Ultimate Privacy: Data never leaves the user's device.
  3. Offline Capability: Once the weights are cached, it works without an internet connection.

The Architecture

Here is how the data flows from a user's symptom description to an AI-generated suggestion, all within the browser's sandbox.

```mermaid
graph TD
    A[User Input: Symptoms] --> B[WebLLM Engine]
    B --> C{WebGPU Support?}
    C -- Yes --> D[Wasm + TVM.js Runtime]
    C -- No --> E[Fallback/Error]
    D --> F[Local IndexedDB Cache]
    F --> G[GPU Accelerated Inference]
    G --> H[Streaming Response to UI]
    H --> I[User Actionable Advice]
```

Prerequisites

Before we dive into the code, ensure your environment is ready:

  • Tech Stack: TypeScript, WebLLM, Vite.
  • Browser: Chrome 113+ or any browser with WebGPU enabled.
  • Hardware: A machine with a decent integrated or dedicated GPU.
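Before downloading multi-gigabyte weights, it's worth feature-detecting WebGPU so you can show a graceful fallback (the "No" branch in the diagram above). Here is a minimal sketch: `navigator.gpu` is the standard WebGPU entry point, and requesting an adapter confirms a usable GPU actually exists. The pure check is factored out so it can be unit-tested without a browser.

```typescript
// Pure check: does this navigator-like object expose a WebGPU entry point?
// Factored out of the async path so it can be tested without a real browser.
export function hasWebGPUProperty(nav: { gpu?: unknown }): boolean {
  return nav != null && nav.gpu != null;
}

// Browser-side check: requestAdapter() resolves to null when no suitable
// GPU is available, so we confirm the adapter, not just the API surface.
export async function canRunWebGPU(): Promise<boolean> {
  const nav = navigator as unknown as { gpu?: { requestAdapter: () => Promise<unknown | null> } };
  if (!hasWebGPUProperty(nav)) return false;
  try {
    const adapter = await nav.gpu!.requestAdapter();
    return adapter !== null;
  } catch {
    return false;
  }
}
```

Call `canRunWebGPU()` once at startup and gate the model download behind it.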

Step 1: Initializing the WebLLM Engine

First, we need to initialize the MLCEngine. Since model weights can be large (several GB), WebLLM caches them in the browser's Cache API after the first download, so subsequent loads are fast.

```typescript
import { CreateMLCEngine, MLCEngine } from "@mlc-ai/web-llm";

// We'll use a quantized version of Llama-3 or Phi-3 for efficiency
const selectedModel = "Llama-3-8B-Instruct-v0.1-q4f16_1-MLC";

async function initializeEngine(onProgress: (p: number) => void): Promise<MLCEngine> {
  console.log("🚀 Initializing WebGPU Engine...");

  const engine = await CreateMLCEngine(selectedModel, {
    // Fires repeatedly during download/compilation; progress is 0..1
    initProgressCallback: (report) => {
      onProgress(Math.round(report.progress * 100));
      console.log(report.text);
    },
  });

  return engine;
}
```

Step 2: Crafting the System Prompt

For a symptom checker, the system prompt is critical. We need to ensure the AI behaves like a supportive assistant while emphasizing that it is not a replacement for professional medical advice.

```typescript
const SYSTEM_PROMPT = `
You are a private, local medical symptom checker.
Analyze the symptoms provided by the user.
Provide potential causes and suggest whether they should seek urgent care.
ALWAYS include a disclaimer: "This is an AI-generated summary, not a medical diagnosis."
`;
```
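Even with an ALL-CAPS instruction, small quantized models occasionally drop the mandated disclaimer. A defensive post-processing helper (my own addition, not part of WebLLM) guarantees it always reaches the user:

```typescript
export const DISCLAIMER =
  "This is an AI-generated summary, not a medical diagnosis.";

// Append the disclaimer if the model forgot it; a no-op if it's already there,
// so calling this twice never duplicates the text.
export function ensureDisclaimer(response: string): string {
  return response.includes(DISCLAIMER)
    ? response
    : `${response.trimEnd()}\n\n${DISCLAIMER}`;
}
```

Run the final streamed response through `ensureDisclaimer` before rendering it.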

Step 3: Executing Inference

Now, let's build the chat function. We use a streaming approach to make the UI feel responsive, just like ChatGPT.

```typescript
async function checkSymptoms(
  engine: MLCEngine,
  userInput: string,
  onUpdate: (text: string) => void
): Promise<string> {
  const messages = [
    { role: "system" as const, content: SYSTEM_PROMPT },
    { role: "user" as const, content: userInput },
  ];

  let fullResponse = "";
  const chunks = await engine.chat.completions.create({
    messages,
    stream: true, // High-fives for streaming! 🖐️
  });

  for await (const chunk of chunks) {
    const content = chunk.choices[0]?.delta?.content ?? "";
    fullResponse += content;
    onUpdate(fullResponse); // update your React/Vue state here
  }

  return fullResponse;
}
```
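One small refactor worth making: extract the message construction into a pure helper so it can be unit-tested without a GPU or multi-gigabyte weights in the loop. A sketch (the `ChatMessage` type is my own simplification of WebLLM's message shape):

```typescript
// Simplified message shape; WebLLM accepts objects of this form.
export type ChatMessage = {
  role: "system" | "user" | "assistant";
  content: string;
};

// Pure helper: builds the chat payload for checkSymptoms.
// Testable with zero browser or model dependencies.
export function buildMessages(systemPrompt: string, userInput: string): ChatMessage[] {
  return [
    { role: "system", content: systemPrompt },
    { role: "user", content: userInput },
  ];
}
```

Inside `checkSymptoms`, the inline `messages` array can then be replaced with `buildMessages(SYSTEM_PROMPT, userInput)`.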

Performance Optimization

When working with Edge AI, memory management is your biggest hurdle.

  • Quantization: We use q4f16 (4-bit weights with fp16 activations) to shrink the model size by ~70% relative to fp16 without significant quality loss.
  • TVM.js: This handles the bridge between the high-level model logic and the low-level WebGPU shaders.
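Even quantized, an 8B model still needs several GB of memory, so on constrained devices it's sensible to fall back to a smaller model like Phi-3-mini. A sketch of that tiering logic: the model IDs reuse the one from Step 1 plus a Phi-3 ID that should be verified against the prebuilt model list in your installed `@mlc-ai/web-llm` version, and `navigator.deviceMemory` is a Chromium-only API, so the 4 GB default is an assumption.

```typescript
// Pick a model tier from an estimate of available RAM (in GB).
// Verify these IDs against WebLLM's prebuilt model registry for your version.
export function pickModel(deviceMemoryGB: number): string {
  if (deviceMemoryGB >= 8) return "Llama-3-8B-Instruct-v0.1-q4f16_1-MLC";
  return "Phi-3-mini-4k-instruct-q4f16_1-MLC"; // hypothetical smaller tier
}

// navigator.deviceMemory exists only in Chromium browsers;
// default to a conservative 4 GB when it's unavailable.
export function detectMemoryGB(nav: { deviceMemory?: number }): number {
  return nav.deviceMemory ?? 4;
}
```

Usage: `pickModel(detectMemoryGB(navigator as { deviceMemory?: number }))` in place of the hard-coded `selectedModel`.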

For a deeper dive into how to optimize these shaders for mobile browsers, the team at wellally.tech/blog has published some incredible benchmarks comparing WebLLM performance across different chipsets.

Conclusion

We've just built a fully functional, privacy-preserving symptom checker that runs entirely on the client. No servers, no leaks, just pure GPU-accelerated magic. 🥑

Key Takeaways:

  • WebGPU is the backbone of modern browser-based AI.
  • WebLLM provides the easiest abstraction for running MLC-compiled models.
  • Privacy-first apps are the future of healthcare tech.

What are you planning to build with WebGPU? Let me know in the comments below! Don't forget to star the MLC-LLM repo and keep experimenting.


Love this content? Follow for more "Learning in Public" tutorials! 🚀
