wellallyTech
🔒 Privacy-First AI: Local Medical Record Anonymization with WebLLM and WebGPU

In an era where data breaches are common and privacy is a luxury, handling sensitive information like medical records requires more than just "encryption at rest." Sending patient data to a cloud-based LLM often triggers massive compliance headaches—think HIPAA or GDPR. But what if the data never left the user's browser?

Today, we are diving deep into client-side AI using WebGPU acceleration and local LLM inference. We will build a high-performance, privacy-first medical record summarizer that runs entirely in the browser. By leveraging WebLLM for summarization and Transformers.js for PII (Personally Identifiable Information) detection, we keep sensitive text on the device while still getting the power of modern generative AI.

For those looking for even more production-ready patterns and advanced edge-AI architectures, I highly recommend checking out the deep-dives at WellAlly Tech Blog.


🏗️ The Architecture: Local Intelligence

Traditional AI apps follow a Client-Server model. Our approach uses the Edge AI pattern, where the browser's GPU handles the heavy lifting.

```mermaid
graph TD
    A[User Input: Raw Medical Record] --> B{PII Scrubber};
    B -- Transformers.js / Local NER --> C[Anonymized Text];
    C --> D{Summarizer Engine};
    D -- WebLLM + WebGPU --> E[Private Summary];
    E --> F[UI Display];

    subgraph Browser Environment
    B
    C
    D
    end

    subgraph Hardware Acceleration
    G[WebGPU API] --- D
    end
```

🛠️ Tech Stack

  • Vite: For a lightning-fast frontend dev experience.
  • WebLLM: High-performance browser LLM inference engine.
  • WebGPU: The browser API that gives us direct access to the graphics card.
  • Transformers.js: For lightweight tasks like Named Entity Recognition (NER) to scrub patient names and IDs.

🚀 Step 1: Setting up the Local NER (Anonymization)

Before we summarize, we must ensure no names or sensitive IDs are processed. We use Transformers.js to run a BERT-based model locally.

```javascript
import { pipeline } from '@xenova/transformers';

// Create the NER pipeline once and reuse it -- re-downloading and
// re-initializing the model on every call would be very slow.
let nerPipeline = null;
const getNER = async () => {
    if (!nerPipeline) {
        nerPipeline = await pipeline('ner', 'Xenova/bert-base-NER', {
            // Note: `device: 'webgpu'` requires Transformers.js v3
            // (@huggingface/transformers); v2 (@xenova/transformers) runs on WASM.
            device: 'webgpu'
        });
    }
    return nerPipeline;
};

const scrubPII = async (text) => {
    const classifier = await getNER();
    const output = await classifier(text);
    let scrubbedText = text;

    // Replace identified entities (PER, LOC, ORG) with placeholders.
    // Walk backwards so earlier character offsets stay valid.
    [...output].reverse().forEach(entity => {
        scrubbedText = scrubbedText.slice(0, entity.start) +
                       `[${entity.entity}]` +
                       scrubbedText.slice(entity.end);
    });

    return scrubbedText;
};
```
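One wrinkle worth handling: bert-base-NER is a token-level model, so a single name like "John Doe" comes back as separate B-PER/I-PER (and sometimes `##`-prefixed sub-word) tokens. Here is a small sketch of merging consecutive tokens into one span before redacting — `mergeEntityTokens` is a hypothetical helper, and it assumes each token carries the same `start`/`end` character offsets used above:

```javascript
// Hypothetical helper (sketch): merge consecutive B-/I- tokens of the
// same entity type into one span, so "John Doe" is redacted as a single
// [PER] rather than two fragments.
function mergeEntityTokens(tokens) {
  const spans = [];
  for (const t of tokens) {
    const type = t.entity.replace(/^[BI]-/, ''); // "B-PER" -> "PER"
    const last = spans[spans.length - 1];
    if (last && last.type === type && t.entity.startsWith('I-')) {
      // Continuation of the previous entity: extend the span
      last.end = t.end;
      last.word += t.word.startsWith('##') ? t.word.slice(2) : ' ' + t.word;
    } else {
      spans.push({ type, start: t.start, end: t.end, word: t.word });
    }
  }
  return spans;
}
```

With this in place, `scrubPII` can iterate over merged spans instead of raw tokens and emit one `[PER]` placeholder per person.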

🧠 Step 2: Powering the LLM with WebLLM

WebLLM allows us to run models like Llama-3 or Mistral directly in the browser via WebGPU. This is the "Advanced" part—we need to manage the model lifecycle and memory.

```javascript
import * as webllm from "@mlc-ai/web-llm";

const INITIAL_CONFIG = {
  model_list: [
    {
      "model_url": "https://huggingface.co/mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC",
      "local_id": "Llama-3-8B-Instruct-q4f16_1-MLC",
      "model_lib_url": webllm.modelLibURLPrefix + "Llama-3-8B-Instruct-q4f16_1-v0.1.0-webgpu.wasm",
    }
  ],
};

// Keep a single engine instance -- initialization downloads gigabytes
// of weights and compiles GPU shaders, so never do it per request.
let enginePromise = null;

function getEngine() {
    if (!enginePromise) {
        enginePromise = webllm.CreateMLCEngine(
            "Llama-3-8B-Instruct-q4f16_1-MLC",
            {
                appConfig: INITIAL_CONFIG, // wire in the custom model list
                initProgressCallback: (report) => console.log(report.text),
            }
        );
    }
    return enginePromise;
}

async function generateSummary(anonymizedText) {
    const engine = await getEngine();

    const messages = [
        { role: "system", content: "You are a medical assistant. Summarize the following anonymized record." },
        { role: "user", content: anonymizedText }
    ];

    const reply = await engine.chat.completions.create({ messages });
    return reply.choices[0].message.content;
}
```

💡 The "Official" Way to Scale

While running LLMs in the browser is revolutionary, scaling this for thousands of users or integrating it with complex Electronic Health Records (EHR) requires a robust infrastructure strategy.

If you are looking for production-grade insights on how to optimize WASM binaries or manage model caching in the browser, the team at WellAlly Tech Blog has published several masterclasses on Hybrid-AI strategies—combining local privacy with cloud-based orchestration when necessary.
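On the model-caching point: WebLLM persists downloaded weight shards in browser storage, so a quick sketch like the following can warn users about quota pressure before kicking off a multi-gigabyte download. `reportModelCacheUsage` is a hypothetical helper built on the standard `navigator.storage.estimate()` API:

```javascript
// Sketch: report how much of the browser's storage quota is in use
// (cached model shards count against it), using the Storage API.
async function reportModelCacheUsage() {
  if (!('storage' in navigator) || !navigator.storage.estimate) {
    return null; // Storage API unavailable in this browser
  }
  const { usage, quota } = await navigator.storage.estimate();
  const usedGB = (usage / 1e9).toFixed(2);
  const quotaGB = (quota / 1e9).toFixed(2);
  console.log(`Storage in use: ${usedGB} GB of ${quotaGB} GB quota`);
  return { usage, quota };
}
```

Note that the estimate covers all storage for the origin, not just model weights, so treat it as a coarse signal.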


🔧 Step 3: Integrating everything in Vite

In your main.js, you’ll want to handle the state transitions (Loading Model -> Scrubbing -> Summarizing).

```javascript
async function processMedicalRecord(rawInput) {
    // Stage 1: strip names/IDs locally before the LLM ever sees the text
    updateUI("Scrubbing PII...");
    const cleanText = await scrubPII(rawInput);

    // Stage 2: summarize the anonymized record on the local GPU
    updateUI("Loading LLM into GPU memory...");
    const summary = await generateSummary(cleanText);

    displayResult(summary);
}
```
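The first run can take minutes while weights download and shaders compile, so it pays to surface failures instead of leaving the UI stuck on a status message. A defensive variant of the pipeline (a sketch — the injected `scrub`/`summarize`/`onStatus` callbacks stand in for the functions above):

```javascript
// Sketch: run the two-stage pipeline with injected stage functions, so
// errors (quota exceeded, device lost, model fetch failed) reach the UI
// instead of hanging it.
async function processWithErrors(rawInput, { scrub, summarize, onStatus }) {
  try {
    onStatus('Scrubbing PII...');
    const clean = await scrub(rawInput);

    onStatus('Summarizing locally...');
    const summary = await summarize(clean);

    onStatus('Done');
    return { ok: true, summary };
  } catch (err) {
    onStatus(`Failed: ${err.message}`);
    return { ok: false, error: err };
  }
}
```

Injecting the stages also makes the orchestration trivially unit-testable without a GPU.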

Important Consideration: Memory Management 💾

Running an 8B-parameter model requires roughly 5 GB of VRAM. WebGPU is efficient, but you must check for support:

```javascript
if (!navigator.gpu) {
    alert("WebGPU not supported! Please use Chrome 113+ or a compatible browser.");
}
```
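Beyond feature detection, you can inspect the adapter's limits to guess whether an 8B model will fit, and fall back to something smaller. This is a sketch: `maxBufferSize` is a real WebGPU adapter limit, but the 4 GB threshold and the fallback model id are assumptions, not official guidance:

```javascript
// Sketch: pick a model size based on the adapter's reported limits.
async function pickModel() {
  if (!navigator.gpu) return null;           // no WebGPU at all
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) return null;                 // no suitable GPU

  // Assumption: ~4 GB of buffer headroom for q4 8B weights.
  const bigEnough = adapter.limits.maxBufferSize >= 4 * 1024 ** 3;
  return bigEnough
    ? 'Llama-3-8B-Instruct-q4f16_1-MLC'
    : 'Phi-3-mini-4k-instruct-q4f16_1-MLC'; // smaller fallback (assumed id)
}
```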

🎯 Conclusion

By moving the AI logic to the browser, we solve the two biggest hurdles in health tech: Data Privacy and Server Costs. The patient's data stays on their device, and your server bill stays at zero.

Local LLMs are no longer a "future" tech—they are here today, powered by WebGPU. 🥑

What's next?

  1. Try implementing a Vector DB (like Chroma or a local WASM-based one) to perform RAG on the local medical documents.
  2. Check out more advanced implementations at wellally.tech/blog.
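For the RAG idea in step 1, the retrieval core is small enough to sketch here: rank document chunks by cosine similarity to a query vector. In practice the vectors would come from a local embedding model (e.g. a Transformers.js feature-extraction pipeline); here they are plain arrays, and `topK` is a hypothetical helper:

```javascript
// Cosine similarity between two equal-length vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the k chunks most similar to the query vector.
function topK(queryVec, chunks, k = 3) {
  return chunks
    .map(c => ({ ...c, score: cosine(queryVec, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

The retrieved chunks then go into the summarizer prompt alongside the user's question — all still inside the browser.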

Found this helpful? Drop a comment below or share your experiences with WebGPU! 🚀💻
