We’ve all been there—staring at a complex medical report filled with cryptic numbers and Latin terminology. Our first instinct is to paste it into ChatGPT. But wait... do you really want your sensitive health data sitting on a corporate server forever?
In the era of WebGPU acceleration and local LLM inference, you no longer have to choose between AI power and data privacy. Today, we are building a browser-based AI medical analyzer: using WebLLM and edge computing, we will turn a quantized 8-billion-parameter Llama 3 model into a local powerhouse that processes medical documents with zero data leakage.
If you are looking for more production-ready patterns on data privacy and AI, be sure to check out the advanced guides over at WellAlly Tech Blog, which served as a major inspiration for this local-first architecture.
The Architecture: From Pixels to Private Insights
The magic happens through the WebGPU API, which lets the browser tap directly into your device's GPU. Unlike WebGL, which is built around graphics rasterization, WebGPU exposes general-purpose compute shaders, making it well suited to running 4-bit quantized LLMs.
```mermaid
graph TD
    A[User Uploads Health Report] --> B[React Frontend]
    B --> C{WebGPU Support?}
    C -- Yes --> D[WebLLM Engine Initialization]
    C -- No --> E[Fallback: CPU/WASM or Error]
    D --> F[Load Llama-3-8B-q4f16_1]
    F --> G[Local Inference]
    G --> H[Structured JSON Output]
    H --> I[UI Display & Visualization]
    subgraph Browser_Sandbox
        D
        F
        G
    end
```
Prerequisites
To follow this "Learning in Public" journey, you’ll need:
- Tech Stack: React (Vite), TypeScript, WebLLM.
- Hardware: A device with a modern GPU and roughly 5–6 GB of available VRAM for the 4-bit 8B model (an M1/M2/M3 Mac, or a Windows machine with an NVIDIA RTX-series GPU).
- Browser: Chrome/Edge (v113+) with WebGPU enabled.
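Before streaming gigabytes of weights, it is worth probing for WebGPU support up front (the "Fallback" branch in the diagram). Here is a minimal sketch; `describeGpuSupport` is a hypothetical helper, written as a pure function over a navigator-like object so the logic can be tested outside the browser:

```typescript
// A minimal shape for what we need from `navigator`.
type NavigatorLike = { gpu?: unknown };

// Returns a coarse support verdict. In the browser, call it as
// describeGpuSupport(navigator).
export function describeGpuSupport(nav: NavigatorLike): "webgpu" | "unsupported" {
  // `navigator.gpu` is only defined in WebGPU-capable browsers (Chrome/Edge 113+).
  return nav.gpu !== undefined ? "webgpu" : "unsupported";
}

// In real code you would also confirm an adapter is actually available:
// const adapter = await navigator.gpu?.requestAdapter();
// if (!adapter) { /* fall back to WASM or show an error */ }
```

Checking `navigator.gpu` alone is not sufficient in edge cases (a browser can expose the API but fail to find an adapter), which is why the commented `requestAdapter()` call matters in production.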
Step-by-Step Implementation
1. Initializing the WebLLM Engine
First, we need to set up the engine that manages the model weights and the WebGPU pipeline. We'll use Llama-3-8B-Instruct-q4f16_1-MLC, a 4-bit quantized build optimized for the web.
```typescript
// engine.ts
import { CreateMLCEngine, MLCEngine } from "@mlc-ai/web-llm";

export async function initEngine(onProgress: (p: number) => void): Promise<MLCEngine> {
  const modelId = "Llama-3-8B-Instruct-q4f16_1-MLC";

  // Downloads the weights once and caches them in browser storage,
  // so subsequent visits skip the multi-gigabyte download.
  const engine = await CreateMLCEngine(modelId, {
    initProgressCallback: (report) => {
      onProgress(Math.round(report.progress * 100));
      console.log(report.text);
    },
  });
  return engine;
}
```
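Model IDs in WebLLM change between releases, so hard-coding one is brittle. At runtime you can read the library's prebuilt model list (`prebuiltAppConfig.model_list` in `@mlc-ai/web-llm`) and pick the first candidate that actually ships. A sketch under that assumption, with the lookup factored into a pure helper (`pickFirstAvailable` is a hypothetical name):

```typescript
// Given the model IDs the library actually ships and our preferences
// in priority order, return the first preferred ID that is available,
// or null if none match.
export function pickFirstAvailable(
  available: string[],
  preferred: string[],
): string | null {
  const shipped = new Set(available);
  for (const id of preferred) {
    if (shipped.has(id)) return id;
  }
  return null;
}

// Usage sketch (browser):
// import { prebuiltAppConfig } from "@mlc-ai/web-llm";
// const ids = prebuiltAppConfig.model_list.map((m) => m.model_id);
// const modelId = pickFirstAvailable(ids, [
//   "Llama-3-8B-Instruct-q4f16_1-MLC",
//   "Llama-3-8B-Instruct-q4f16_1-MLC-1k", // smaller context-window variant
// ]);
```

Returning `null` instead of throwing lets the caller route to the fallback branch from the architecture diagram.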
2. The Structured Prompting Logic
For medical reports, we don't want a "chatty" AI. We want a structured extractor. We'll use a system prompt to force Llama-3 to output clean JSON.
```typescript
// engine.ts (continued)
const SYSTEM_PROMPT = `
You are a medical data extractor.
Analyze the provided text and return ONLY a JSON object containing:
1. Patient_Summary
2. Abnormal_Indicators (Array)
3. Suggested_Questions_For_Doctor
Do not provide any preamble.
`;

export async function analyzeReport(engine: MLCEngine, reportText: string) {
  const reply = await engine.chat.completions.create({
    messages: [
      { role: "system", content: SYSTEM_PROMPT },
      { role: "user", content: `Analyze this report: ${reportText}` },
    ],
    temperature: 0.1, // keep the extraction deterministic
  });
  return reply.choices[0].message.content ?? "";
}
```
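Even with a low temperature and a strict system prompt, small models occasionally wrap the JSON in markdown fences or add a stray sentence. A defensive parser that pulls out the first balanced-looking JSON object is cheap insurance. A minimal sketch; `extractJson` is a hypothetical helper:

```typescript
// Pull the first {...} span out of a model reply and parse it.
// Returns null instead of throwing, so the UI can show a retry state.
export function extractJson(raw: string): Record<string, unknown> | null {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) return null;
  try {
    return JSON.parse(raw.slice(start, end + 1));
  } catch {
    return null;
  }
}
```

In `analyzeReport` you could then return `extractJson(...)` on the reply content instead of the raw string, and re-prompt once when it comes back `null`.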
3. Creating the React Interface
We need a clean UI that surfaces the heavy lifting, several gigabytes of weights streaming into GPU memory, as a simple progress bar.
```tsx
// MedicalAnalyzer.tsx
import React, { useState } from 'react';
import { initEngine, analyzeReport } from './engine';

const MedicalAnalyzer = () => {
  const [progress, setProgress] = useState(0);
  const [result, setResult] = useState("");

  const handleStart = async () => {
    const engine = await initEngine(setProgress);
    // Demo input; in a real app this would come from a file upload
    const rawText = "WBC: 12.5 (High), RBC: 4.5, Glucose: 110 mg/dL...";
    const jsonOutput = await analyzeReport(engine, rawText);
    setResult(jsonOutput ?? "");
  };

  return (
    <div className="p-8 max-w-2xl mx-auto">
      <h2 className="text-2xl font-bold">🏥 Local Health AI</h2>
      {progress > 0 && progress < 100 && (
        <progress className="w-full" value={progress} max="100" />
      )}
      <button
        onClick={handleStart}
        className="bg-blue-600 text-white px-4 py-2 rounded mt-4"
      >
        Analyze Privately
      </button>
      <pre className="bg-gray-100 p-4 mt-4 rounded">{result}</pre>
    </div>
  );
};

export default MedicalAnalyzer;
```
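One gotcha in the component above: `handleStart` calls `initEngine` on every click, which would re-run initialization each time. Caching the in-flight promise ensures the weights load only once per session. A generic sketch; `once` is a hypothetical helper:

```typescript
// Wrap an async factory so it only ever runs once; concurrent and
// repeated calls all share the same promise (and thus the same engine).
export function once<T>(factory: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | null = null;
  return () => {
    if (cached === null) cached = factory();
    return cached;
  };
}

// Usage sketch:
// const getEngine = once(() => initEngine(setProgress));
// const engine = await getEngine(); // downloads on the first call only
```

In the React component you would create `getEngine` outside the component body (or hold it in a `useRef`) so re-renders do not recreate the cache.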
Why This Matters (The "Official" Way)
Running 8-billion-parameter models in a browser tab was science fiction two years ago. Today, it's a privacy requirement. By keeping inference on the "Edge" (the user's machine), we eliminate network round-trips, server costs, and, most importantly, data exposure.
While this tutorial covers the basics of WebLLM implementation, scaling this to handle complex PDF parsing or multi-modal vision (like analyzing X-ray images locally) requires more advanced design patterns. For deep dives into Production Edge AI and WebGPU optimization, I highly recommend exploring the resources at wellally.tech/blog. They specialize in bridging the gap between "cool demos" and "secure enterprise applications."
Conclusion
We’ve successfully:
- Bootstrapped WebGPU to talk to our hardware.
- Loaded a 4-bit quantized Llama-3 that fits in the browser's memory.
- Built a Privacy-First pipeline for sensitive medical data.
The future of AI isn't just in the cloud; it's right here, in your browser. Stop leaking your data and start building locally!
What are you building with WebGPU? Let me know in the comments below!