Beck_Moulton
Stop Sending Medical Data to the Cloud: Build a Local EMR De-identifier with WebLLM and WebGPU

Privacy in Healthcare AI is often a "pick one" scenario: you either get the intelligence of Large Language Models (LLMs) or you keep your data strictly on-premises. But what if you could have both? With the rise of Edge AI and WebGPU acceleration, we can now run powerful models directly in the browser.

In this tutorial, we are building a high-performance Local EMR (Electronic Medical Record) De-identifier. We'll be using WebLLM to leverage WebGPU for hardware-accelerated inference, transforming messy, unstructured clinical notes into structured, de-identified resources conforming to the FHIR data standard—all without a single byte of sensitive patient data leaving the user's machine.

Why Edge AI for Medical Data?

Medical records are among the most sensitive data there is. Traditional API-based LLM solutions send protected health information to third-party servers, which creates significant compliance risk under regulations like HIPAA and GDPR. By utilizing WebLLM and WebGPU, we shift the compute to the client side. This approach ensures 100% data residency, reduces latency by eliminating round-trips to a server, and significantly cuts down on inference costs.

The Architecture

The flow is straightforward but powerful. The browser orchestrates the model loading, utilizes the GPU via the WebGPU API, and processes the raw text locally.

graph TD
    A[Raw Unstructured EMR Text] --> B{Browser Environment}
    B --> C[WebGPU API]
    C --> D[WebLLM Engine]
    D --> E[Llama-3 / Mistral Model]
    E --> F[De-identification Logic]
    F --> G[Structured FHIR JSON]
    G --> H[Local Storage / UI]
    style B fill:#f9f,stroke:#333,stroke-width:2px
    style E fill:#00ff00,stroke:#333,stroke-width:2px

Prerequisites

To follow this advanced guide, you'll need:

  • TypeScript knowledge.
  • A browser with WebGPU support (Chrome 113+ or Edge 113+).
  • WebLLM library (@mlc-ai/web-llm).
  • A basic understanding of the FHIR (Fast Healthcare Interoperability Resources) format.

Step 1: Initializing the WebLLM Engine

First, we need to set up our engine. WebLLM makes it incredibly easy to pull quantized models from the Hugging Face CDN and run them via WebGPU.

import { CreateMLCEngine, MLCEngine } from "@mlc-ai/web-llm";

// Llama-3-8B-Instruct quantized to 4-bit weights (f16 activations) to save VRAM
const selectedModel = "Llama-3-8B-Instruct-q4f16_1-MLC";

async function initializeEngine(onProgress: (p: number) => void): Promise<MLCEngine> {
    // Fail fast if the browser doesn't expose WebGPU at all
    if (!("gpu" in navigator)) {
        throw new Error("WebGPU is not supported in this browser.");
    }

    const engine = await CreateMLCEngine(selectedModel, {
        initProgressCallback: (report) => {
            onProgress(Math.round(report.progress * 100));
            console.log(report.text);
        }
    });

    return engine;
}

Step 2: Defining the FHIR Extraction Prompt

The secret sauce is in the system prompt. We need to instruct the model to act as a medical data architect, identifying PII (Personally Identifiable Information) and mapping the clinical content to a FHIR Observation or Patient resource.

const SYSTEM_PROMPT = `
You are a medical data privacy expert. 
Task: Extract clinical information from the provided text and convert it to a valid FHIR (R4) Observation JSON.
Rules:
1. DE-IDENTIFY: Replace names with "REDACTED", dates with years only, and remove specific IDs.
2. Structure the output as valid JSON only.
3. Use standard LOINC codes if possible.
`;

const processEMR = async (engine: MLCEngine, rawText: string) => {
    // "as const" narrows the roles to the literal types the API expects
    const messages = [
        { role: "system" as const, content: SYSTEM_PROMPT },
        { role: "user" as const, content: `Transform this note: ${rawText}` }
    ];

    const reply = await engine.chat.completions.create({
        messages,
        temperature: 0.0, // deterministic output for reproducible extraction
        response_format: { type: "json_object" }
    });

    // content is nullable in the OpenAI-style typings; normalize to a string
    return reply.choices[0].message.content ?? "";
};
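Even with response_format set, small quantized models occasionally wrap their JSON in markdown fences or add a stray preamble. A defensive extraction step (my own addition, not part of the WebLLM API) keeps the pipeline from crashing on a malformed reply:

```typescript
// Extract the first JSON object from a model reply, tolerating markdown
// fences and stray prose around it. Returns null if nothing parseable.
function extractJSON(reply: string): unknown | null {
    // Strip ```json ... ``` fences if present
    const unfenced = reply.replace(/```(?:json)?/g, "").trim();

    // Find the outermost {...} span
    const start = unfenced.indexOf("{");
    const end = unfenced.lastIndexOf("}");
    if (start === -1 || end <= start) return null;

    try {
        return JSON.parse(unfenced.slice(start, end + 1));
    } catch {
        return null;
    }
}
```

You'd call this on the string returned by processEMR before handing anything to the UI layer.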

Step 3: Handling the Edge Case (The "Advanced" Stuff)

When working with local LLMs, VRAM management is key. Large medical notes can exceed the context window or GPU memory.

Pro Tip: For production-ready implementations involving massive datasets or complex RAG (Retrieval-Augmented Generation) patterns within the browser, you'll want to implement a "Chunk and Stitch" strategy: split long notes into windows that fit the context, de-identify each window independently, then merge the structured results. For more advanced architectural patterns on medical AI, I highly recommend checking out the WellAlly Tech Blog. They have some incredible deep dives into production-grade AI deployments that go beyond the browser.
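As a rough sketch of the chunking half of that strategy (sentence-boundary splitting is a simplification; real clinical notes need smarter segmentation, and ~4 characters per token is only an approximation):

```typescript
// Split a long note into chunks that fit a character budget, breaking on
// sentence boundaries so individual clinical statements stay intact.
function chunkNote(note: string, maxChars = 2000): string[] {
    const sentences = note.split(/(?<=[.!?])\s+/);
    const chunks: string[] = [];
    let current = "";

    for (const sentence of sentences) {
        // Start a new chunk if adding this sentence would exceed the budget
        if (current && current.length + sentence.length + 1 > maxChars) {
            chunks.push(current);
            current = sentence;
        } else {
            current = current ? `${current} ${sentence}` : sentence;
        }
    }
    if (current) chunks.push(current);
    return chunks;
}
```

Each chunk then goes through processEMR independently, and the resulting components get merged ("stitched") into one resource afterwards.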


Step 4: Putting it All Together (The UI Logic)

Here’s how you wire the pieces together in TypeScript and surface progress to the user while the model loads and runs.

// `updateUI` and `renderJSON` are app-specific helpers (e.g. DOM updates)
async function handleConversion(text: string) {
    try {
        const engine = await initializeEngine((progress) => {
            updateUI(`Loading Model: ${progress}%`);
        });

        updateUI("Processing Local De-identification...");
        const fhirJSON = await processEMR(engine, text);

        console.log("Resulting FHIR Resource:", JSON.parse(fhirJSON));
        renderJSON(fhirJSON);
    } catch (err) {
        // Failures here can come from WebGPU, model loading, or JSON parsing
        console.error("Conversion failed:", err);
    }
}

The FHIR Result

The output will transform a note like "John Doe visited on Oct 12, 2023, for a blood pressure check (140/90)" into:

{
  "resourceType": "Observation",
  "status": "final",
  "code": {
    "coding": [{ "system": "http://loinc.org", "code": "85354-9" }]
  },
  "subject": { "display": "REDACTED" },
  "effectiveDateTime": "2023",
  "component": [
    {
      "code": { "text": "Systolic" },
      "valueQuantity": { "value": 140, "unit": "mmHg" }
    }
  ]
}
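Don't trust model output blindly. Before storing or displaying a result, a minimal structural guard helps catch malformed resources. This is an illustrative check of my own, not a real FHIR validator; a production system should run a proper validator and also scan for residual PII:

```typescript
// Minimal structural sanity check for a de-identified FHIR R4 Observation.
function looksLikeDeidentifiedObservation(resource: unknown): boolean {
    if (typeof resource !== "object" || resource === null) return false;
    const r = resource as Record<string, unknown>;

    if (r.resourceType !== "Observation") return false;
    if (typeof r.status !== "string") return false;

    // Per our prompt rules, effectiveDateTime should be year-only
    if (typeof r.effectiveDateTime === "string" &&
        !/^\d{4}$/.test(r.effectiveDateTime)) {
        return false;
    }
    return true;
}
```

Anything that fails the check can be routed back through the model or flagged for manual review.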

Conclusion

Building local-first AI applications isn't just a trend; for industries like healthcare, it's a necessity. By leveraging WebLLM and WebGPU, we've turned the browser into a powerful, private processing hub for medical data.

What's next?

  1. Model Fine-tuning: Use LoRA to fine-tune models specifically for medical nomenclature.
  2. Local Vector DBs: Use Voy or Orama to build a 100% local RAG system.
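For item 2, the scoring primitive at the heart of any local vector search is just embedding similarity. A sketch, assuming you already have embedding vectors from a local model (libraries like Voy and Orama layer indexing on top of exactly this kind of computation):

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
    let dot = 0, normA = 0, normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```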

If you are looking for more high-level strategies on scaling AI in regulated industries, don't forget to explore the resources at wellally.tech/blog. They are doing some pioneering work in the intersection of health tech and modern engineering.

Got questions about WebGPU performance? Drop a comment below!
