wellallyTech

Stop Sending Sensitive Health Data to Servers: Build a Private AI Health Assistant with WebLLM & Transformers.js

Privacy is the "final boss" of healthcare technology. When building digital health tools, the biggest hurdle isn't just the logic—it's the massive responsibility of handling sensitive user data. But what if the data never left the user's device? 🤯

In this tutorial, we’re diving into the world of Edge AI and WebGPU. We will build a 100% Offline Virtual Health Assistant using WebLLM, Transformers.js, and React. This assistant can perform drug interaction checks and basic health consultations directly in the browser.

By running the LLM locally in the browser via WebGPU, we eliminate server costs and, more importantly, ensure total user privacy.


The Architecture: Why WebGPU?

Traditional AI apps send each request to a cloud API (like OpenAI's), which processes the data and returns a response. Our app keeps everything local: the browser's WebGPU API gives us direct GPU access for the heavy computation. Here's the data flow:

graph TD
    A[User Input: Medication Query] --> B{Browser Environment}
    B --> C[WebGPU API]
    C --> D[WebLLM / Llama-3-8B-Web]
    C --> E[Transformers.js / Feature Extraction]
    D --> F[Local AI Inference]
    E --> F
    F --> G[Health Insights/Drug Interaction Info]
    G --> H[UI Update: No Data Transmitted!]
    style D fill:#f96,stroke:#333,stroke-width:2px
    style H fill:#bbf,stroke:#333,stroke-width:4px

Prerequisites 🛠️

Before we start, ensure you have:

  • Node.js installed (v18+).
  • A WebGPU-enabled browser (Chrome 113+, Edge 113+, or Arc).
  • Basic knowledge of React and Hooks.
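
Not sure whether your browser qualifies? Here's a minimal feature-detection sketch (the cast is only there so it compiles without the @webgpu/types package):

// webgpuSupport.ts -- quick check before loading multi-gigabyte models
export async function hasWebGPU(): Promise<boolean> {
  // navigator.gpu only exists in WebGPU-enabled browsers.
  const gpu = (navigator as Navigator & { gpu?: { requestAdapter(): Promise<unknown | null> } }).gpu;
  if (!gpu) return false;
  // requestAdapter() resolves to null when no suitable GPU is available.
  return (await gpu.requestAdapter()) !== null;
}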

Step 1: Setting Up the Project

First, let's bootstrap a React project and install our magic ingredients:

npx create-react-app health-ai-edge --template typescript
cd health-ai-edge
npm install @mlc-ai/web-llm @xenova/transformers
  • WebLLM: For running Large Language Models (like Llama 3 or Mistral) in the browser.
  • Transformers.js: For lightweight tasks like sentiment analysis or named entity recognition (NER).
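
Transformers.js doesn't appear in the main flow below, so here's a quick sketch of how you might use it to pull entity mentions out of a query before handing it to the LLM. The model choice is an assumption: Xenova/bert-base-NER is a general-purpose NER model (drug names will land under MISC at best), so a biomedical NER model would be a better fit in production:

// entityExtraction.ts
import { pipeline } from "@xenova/transformers";

// Token-classification (NER) running fully in-browser; the model is
// downloaded and cached on first call.
export async function extractEntities(text: string) {
  const ner = await pipeline("token-classification", "Xenova/bert-base-NER");
  // Returns items like { word: "Ibuprofen", entity: "B-MISC", score: 0.93, ... }
  return ner(text);
}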

Step 2: Initializing the WebLLM Engine

We need a custom hook to manage the model's loading state and execution. WebLLM can also run the engine inside a Web Worker to keep the UI thread smooth; for simplicity, this example runs on the main thread. 🚀

// useWebLLM.ts
import { useState } from "react";
import * as webllm from "@mlc-ai/web-llm";

export function useWebLLM() {
  const [engine, setEngine] = useState<webllm.MLCEngineInterface | null>(null);
  const [progress, setProgress] = useState(0);

  const initEngine = async () => {
    // Custom model list: the quantized Llama 3 weights plus the matching
    // WebGPU model library. Field names follow recent @mlc-ai/web-llm
    // releases (older versions used model_url / model_lib_url).
    const appConfig: webllm.AppConfig = {
      model_list: [
        {
          model: "https://huggingface.co/mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC",
          model_id: "Llama-3-8B-Instruct-q4f16_1-MLC",
          model_lib: "https://raw.githubusercontent.com/mlc-ai/binary-mlc-llm-libs/main/Llama-3-8B-Instruct-v0.1-q4f16_1-webgpu.wasm",
        },
      ],
    };

    const loadedEngine = await webllm.CreateMLCEngine("Llama-3-8B-Instruct-q4f16_1-MLC", {
      appConfig, // without this, WebLLM falls back to its prebuilt model list
      initProgressCallback: (report) => setProgress(Math.round(report.progress * 100)),
    });
    setEngine(loadedEngine);
  };

  return { engine, progress, initEngine };
}

Step 3: Building the Health Assistant Logic

Now, let's create a component that uses the engine to answer health-related queries. We will use a strict "System Prompt" to ensure the AI stays in "Health Assistant" mode.

// HealthAssistant.tsx
import React, { useState } from 'react';
import { useWebLLM } from './useWebLLM';

const HealthAssistant = () => {
  const { engine, progress, initEngine } = useWebLLM();
  const [input, setInput] = useState("");
  const [response, setResponse] = useState("");

  const handleConsultation = async () => {
    if (!engine) return;

    const messages = [
      { role: "system", content: "You are a private virtual health assistant. Provide information on drug interactions and general health tips. Always advise the user to consult a doctor." },
      { role: "user", content: input }
    ];

    const reply = await engine.chat.completions.create({ messages });
    setResponse(reply.choices[0].message.content ?? "");
  };

  return (
    <div className="p-6 max-w-2xl mx-auto bg-white rounded-xl shadow-md space-y-4">
      <h2 className="text-xl font-bold">🩺 Offline Health Assistant</h2>
      {progress < 100 && progress > 0 && <p>Loading Models: {progress}%</p>}
      {!engine ? (
        <button onClick={initEngine} className="bg-blue-500 text-white p-2 rounded">Initialize Local AI</button>
      ) : (
        <div className="flex flex-col gap-4">
          <textarea
            placeholder="e.g., Can I take Ibuprofen with Aspirin?"
            className="border p-2 rounded"
            value={input}
            onChange={(e) => setInput(e.target.value)}
          />
          <button onClick={handleConsultation} className="bg-green-500 text-white p-2 rounded">Check Locally</button>
          <div className="bg-gray-100 p-4 rounded mt-4">
            <strong>AI Response:</strong>
            <p className="mt-2 whitespace-pre-wrap">{response}</p>
          </div>
        </div>
      )}
    </div>
  );
};

export default HealthAssistant;
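Bonus: engine.chat.completions.create follows the OpenAI streaming convention, so you can render tokens as they arrive instead of waiting for the full reply. Here's a minimal streaming sketch for handleConsultation (same messages array as above), assuming a recent @mlc-ai/web-llm version:

// Streaming variant of handleConsultation (sketch)
const chunks = await engine.chat.completions.create({ messages, stream: true });
let text = "";
for await (const chunk of chunks) {
  // Each chunk carries an incremental delta, OpenAI-style.
  text += chunk.choices[0]?.delta?.content ?? "";
  setResponse(text); // re-render with the partial reply so far
}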

🥑 Going Beyond the Basics: The "Official" Way

While running models in the browser is incredible for privacy, production-ready healthcare applications often require more robust patterns, such as Retrieval-Augmented Generation (RAG) using local vector databases or HIPAA-compliant hybrid clouds.

For developers looking to implement advanced patterns like local model quantization or building secure medical data pipelines, I highly recommend checking out the in-depth guides at Wellally Blog. They specialize in production-grade AI architectures that don't compromise on security or performance.


Why This Matters (The "So What?")

  1. Zero Latency: Once the model is cached in the browser's CacheStorage, inference is nearly instant, regardless of your internet connection.
  2. Zero Server Costs: You aren't paying $0.01 per 1k tokens to OpenAI. The user's hardware does the heavy lifting.
  3. Maximum Trust: In an era of data leaks, telling a user "Your medical data never leaves this screen" is a massive competitive advantage.

Performance Tip 💡

Local models are large (usually 2 GB to 5 GB). Make sure the downloaded weights are cached in the browser so the user only has to download the model once! WebLLM handles this via the Cache API by default, with IndexedDB as an option, as shown below.
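
WebLLM exposes this through its AppConfig. A minimal sketch, assuming a recent @mlc-ai/web-llm release (the Cache API is the default; useIndexedDBCache opts into IndexedDB instead):

// Sketch: persist model weights in IndexedDB so revisits skip the download.
const engine = await webllm.CreateMLCEngine("Llama-3-8B-Instruct-q4f16_1-MLC", {
  appConfig: {
    ...webllm.prebuiltAppConfig, // start from the built-in model list
    useIndexedDBCache: true,     // default is the Cache API (CacheStorage)
  },
  initProgressCallback: (report) => console.log(report.text),
});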


Conclusion 🏁

We’ve just built a fully functional, browser-based AI health assistant! By combining WebLLM with the power of WebGPU, we've pushed the boundaries of what’s possible on the web. This is the future of Edge AI: private, fast, and cost-effective.

What will you build next? Maybe an offline medical image classifier using Transformers.js? 📸

If you enjoyed this tutorial, drop a comment below and let me know how you plan to use local AI in your next project!

Happy coding! 🚀💻


For more production-ready AI examples and advanced Edge AI tutorials, visit wellally.tech/blog.
