Stop Sending Your Health Data to the Cloud: Build a 100% Private Symptom Checker with WebLLM & WebGPU 🚀

Privacy is no longer just a feature; it’s a human right—especially when it comes to medical data. With the rise of Edge AI and the maturity of WebGPU, we are witnessing a paradigm shift where "local-first" isn't just a dream for small scripts, but a reality for Large Language Models (LLMs).

In this tutorial, we will build a Privacy-Preserving Symptom Screening Engine. By leveraging WebLLM, WebGPU, and TypeScript, we will run a powerful model directly in the browser. This ensures that sensitive health queries never leave the user's device, providing a 100% offline-capable, secure medical common-sense index.

Why Browser-Based AI? 🥑

Most AI applications rely on sending prompts to a centralized server (like OpenAI or Anthropic). This is a privacy nightmare for health data. By using WebLLM and TVM Runtime, we can execute models like Llama 3 or Mistral directly on the client's GPU via the WebGPU API. This results in:

  1. Zero Network Latency: No round-trips to a server; tokens are generated on-device.
  2. Zero Cost: The user provides the compute.
  3. Total Privacy: Data stays in the RAM/VRAM of the local machine.

The Architecture 🏗️

The flow is straightforward but technically sophisticated. We use Apache TVM (the compiler stack behind MLC-LLM) to compile models into high-performance WASM and WebGPU shaders.

graph TD
    A[User Input: Symptoms] --> B[React Frontend]
    B --> C{WebLLM Engine}
    C --> D[WebGPU API]
    C --> E[WASM Runtime]
    D --> F[Local GPU / VRAM]
    E --> F
    F --> G[Local LLM Weights - Cached]
    G --> H[Inference Result]
    H --> B
    B --> I[100% Private Output]

Prerequisites

Before we dive in, ensure your environment meets the following:

  • Tech Stack: WebLLM, TVM Runtime, TypeScript, React.
  • Browser: Chrome 113+ or any browser with WebGPU enabled (a quick feature-detection snippet follows this list).
  • Hardware: A dedicated or integrated GPU (Apple Silicon M-series works like a charm).
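
If you'd rather verify support at runtime than trust the user agent string, a minimal capability check looks like this. It's a sketch: it casts through `any` so it compiles without the @webgpu/types package installed.

// webgpu-check.ts
export async function hasWebGPU(): Promise<boolean> {
  const gpu = (navigator as any).gpu;
  if (!gpu) return false; // API not exposed by this browser
  // requestAdapter() can still resolve to null, e.g. on blocklisted GPUs.
  const adapter = await gpu.requestAdapter();
  return adapter !== null;
}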

Step 1: Setting up the WebLLM Engine

First, let's install the core dependencies:

npm install @mlc-ai/web-llm react lucide-react

Now, let's create a custom hook to manage the model lifecycle. We need to handle the downloading of weights (cached via the browser's Cache API) and the initialization of the engine. (A Web Worker variant is sketched right after the hook.)

// useWebLLM.ts
import { useState, useEffect } from 'react';
import * as webllm from "@mlc-ai/web-llm";

export function useWebLLM(modelId: string) {
  const [engine, setEngine] = useState<webllm.MLCEngineInterface | null>(null);
  const [progress, setProgress] = useState(0);
  const [isReady, setIsReady] = useState(false);

  useEffect(() => {
    let cancelled = false;
    let loadedEngine: webllm.MLCEngineInterface | null = null;

    async function init() {
      // Create an engine instance. Weights are fetched once and stored in
      // the browser cache, so later visits skip the multi-gigabyte download.
      const newEngine = await webllm.CreateMLCEngine(modelId, {
        initProgressCallback: (report) => {
          if (!cancelled) setProgress(Math.round(report.progress * 100));
        }
      });
      if (cancelled) {
        // The component unmounted (or modelId changed) while loading.
        newEngine.unload();
        return;
      }
      loadedEngine = newEngine;
      setEngine(newEngine);
      setIsReady(true);
    }
    init();

    // Release the model when the hook is torn down.
    return () => {
      cancelled = true;
      loadedEngine?.unload();
    };
  }, [modelId]);

  return { engine, progress, isReady };
}
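
One caveat: CreateMLCEngine runs inference on the main thread, so the UI can stutter while tokens are generated. WebLLM also ships a Web Worker variant; here is a sketch of the two pieces, assuming a bundler that understands new URL(..., import.meta.url) (Vite, webpack 5+).

// llm.worker.ts — hosts the model off the main thread
import { WebWorkerMLCEngineHandler } from "@mlc-ai/web-llm";

const handler = new WebWorkerMLCEngineHandler();
self.onmessage = (msg: MessageEvent) => handler.onmessage(msg);

// In useWebLLM.ts, swap CreateMLCEngine for the worker-backed factory:
const newEngine = await webllm.CreateWebWorkerMLCEngine(
  new Worker(new URL("./llm.worker.ts", import.meta.url), { type: "module" }),
  modelId,
  { initProgressCallback: (report) => setProgress(Math.round(report.progress * 100)) }
);

The rest of the hook stays the same, since the worker-backed engine implements the same MLCEngineInterface.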

Step 2: Building the Symptom Checker Logic

In this advanced implementation, we aren't just chatting; we are constraining the model to act as a Medical Common Sense Indexer. We'll use a specific system prompt to prevent the model from giving definitive diagnoses while still providing helpful, locally computed guidance.

// SymptomChecker.tsx
import React, { useState } from 'react';
import { useWebLLM } from './useWebLLM';

const SYSTEM_PROMPT = `You are a private medical common-sense assistant. 
Analyze the symptoms provided. 
Provide 3 potential causes and urgency levels. 
ALWAYS include a disclaimer that this is not a medical diagnosis. 
Do not store or transmit data.`;

export const SymptomChecker = () => {
  const [input, setInput] = useState("");
  const [response, setResponse] = useState("");
  const { engine, progress, isReady } = useWebLLM("Llama-3-8B-Instruct-v0.1-q4f16_1-MLC");

  const handleScreening = async () => {
    if (!engine) return;

    const messages = [
      // `as const` keeps the role fields as literal types, matching the
      // OpenAI-style message type expected by engine.chat.completions.create.
      { role: "system" as const, content: SYSTEM_PROMPT },
      { role: "user" as const, content: `Symptoms: ${input}` }
    ];

    // Use streaming for better UX
    const chunks = await engine.chat.completions.create({
      messages,
      stream: true,
    });

    let fullText = "";
    for await (const chunk of chunks) {
      const content = chunk.choices[0]?.delta?.content || "";
      fullText += content;
      setResponse(fullText);
    }
  };

  if (!isReady) return <div>Loading Local Model: {progress}%</div>;

  return (
    <div className="p-6 max-w-2xl mx-auto bg-white rounded-xl shadow-md">
      <h2 className="text-2xl font-bold mb-4">Local Health Screener 🩺</h2>
      <textarea 
        className="w-full p-3 border rounded-lg"
        placeholder="Describe your symptoms (e.g., mild fever, persistent cough for 2 days)..."
        value={input}
        onChange={(e) => setInput(e.target.value)}
      />
      <button 
        onClick={handleScreening}
        className="mt-4 px-6 py-2 bg-blue-600 text-white rounded-full hover:bg-blue-700 transition"
      >
        Analyze Locally
      </button>
      {response && (
        <div className="mt-6 p-4 bg-gray-50 rounded-lg border-l-4 border-blue-500">
          <p className="whitespace-pre-wrap text-gray-700">{response}</p>
        </div>
      )}
    </div>
  );
};
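
Streaming also gives you a natural place for a cancel control. The engine exposes an interruptGenerate() method that stops token generation; here is a minimal sketch, where the isGenerating flag and the Stop button are illustrative additions, not part of the component above.

// Illustrative additions inside SymptomChecker:
const [isGenerating, setIsGenerating] = useState(false);

const handleStop = async () => {
  // Ask the engine to stop emitting tokens; the for-await loop in
  // handleScreening simply ends with whatever was generated so far.
  await engine?.interruptGenerate();
  setIsGenerating(false);
};

// Set setIsGenerating(true) at the top of handleScreening and render:
// {isGenerating && <button onClick={handleStop}>Stop</button>}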

Mastering the "Local-First" AI Pattern 🧠

Running models in the browser is more than just importing a library. To make it production-ready, you need to handle memory management (WebGPU memory limits), model quantization, and hybrid fallback strategies.
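
A concrete example of the hybrid fallback idea: pick a model that fits the device before you start downloading gigabytes, and bail out (or route to a server endpoint you control) when WebGPU isn't available at all. Treat this as a sketch: the 8 GiB threshold is illustrative, navigator.deviceMemory is Chromium-only, and you should check webllm.prebuiltAppConfig.model_list for the model IDs actually shipped with your installed version.

// pickModel.ts — choose a model size based on rough device capability (sketch)
export async function pickModel(): Promise<string | null> {
  const gpu = (navigator as any).gpu;
  if (!gpu || !(await gpu.requestAdapter())) {
    return null; // no WebGPU: fall back to a server-side flow, or show a notice
  }
  // Approximate RAM in GiB; only reported by Chromium-based browsers.
  const memGiB = (navigator as any).deviceMemory ?? 8;
  return memGiB < 8
    ? "Phi-3-mini-4k-instruct-q4f16_1-MLC" // illustrative smaller model ID
    : "Llama-3-8B-Instruct-v0.1-q4f16_1-MLC"; // the model used in this post
}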

For a deeper dive into Advanced Edge AI Patterns—including how to optimize WebGPU memory for mobile devices and implementing RAG (Retrieval Augmented Generation) entirely on the client side—check out the engineering deep-dives on the WellAlly Technology Blog. It’s a fantastic resource for developers looking to push the boundaries of what's possible with in-browser machine learning.


Performance Considerations ⚡

When building with WebLLM, keep these three things in mind:

  1. Quantization is Key: We used q4f16_1 quantization (4-bit weights with FP16 compute). This shrinks the 8B model from roughly 16 GB at FP16 to about 4–5 GB, making it feasible for modern laptops to load from the browser cache.
  2. VRAM Management: High-resolution displays and heavy GPU work in other tabs can exhaust memory and cause the WebGPU device to be lost mid-inference. Always implement an error boundary (a minimal sketch follows this list).
  3. The First Load: The first visit downloads several gigabytes of weights. Use a Service Worker to manage this background download or inform the user clearly about the one-time setup; subsequent visits load from the Cache API.
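
For item 2, here is a minimal error boundary sketch. Note that React error boundaries only catch errors thrown during rendering, so pair this with a try/catch inside handleScreening; a lost WebGPU device surfacing in that async handler won't reach the boundary on its own.

// ModelErrorBoundary.tsx — a minimal sketch; adapt the messaging to your app
import React from 'react';

interface Props { children: React.ReactNode; }
interface State { hasError: boolean; }

export class ModelErrorBoundary extends React.Component<Props, State> {
  state: State = { hasError: false };

  static getDerivedStateFromError(): State {
    return { hasError: true };
  }

  componentDidCatch(error: Error) {
    // Typically an out-of-memory condition or a lost WebGPU device.
    console.error("Local inference failed:", error);
  }

  render() {
    if (this.state.hasError) {
      return <div>The local model hit a GPU error. Close other GPU-heavy tabs and reload.</div>;
    }
    return this.props.children;
  }
}

Wrap the screener with it: <ModelErrorBoundary><SymptomChecker /></ModelErrorBoundary>.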

Conclusion

By combining WebGPU and WebLLM, we've built a tool that provides the intelligence of a modern LLM with the privacy of a piece of paper in a safe. No trackers, no data harvesting, just pure local compute.

The future of AI isn't just in the cloud; it's right there in your browser tab.

What will you build with local WebGPU? Let me know in the comments below! 👇
