DEV Community

Beck_Moulton
Privacy First: Building a 100% Local AI Mental Health Companion with WebLLM and React

The most intimate conversations shouldn't live on a server. When it comes to mental health, privacy isn't just a feature—it's a requirement. Today, we are exploring the frontier of Edge AI and WebGPU technology to build a decentralized, privacy-first counseling bot.

By leveraging WebLLM integration and the power of local inference, we can ensure that a user's most sensitive thoughts never leave their browser's memory. No API keys, no server logs, and absolutely zero data leakage. In this tutorial, we will dive deep into the world of browser-based LLMs and see how the TVM runtime makes "AI in the browser" a reality.

Why Edge AI for Mental Health?

Traditional AI chatbots send every keystroke to a central server (like OpenAI or Anthropic). While these models are powerful, they pose a significant privacy risk for sensitive use cases. WebLLM changes the game by running large language models directly on the client's hardware using the WebGPU API.

The Benefits:

  1. Extreme Privacy: Data stays in the browser sandbox.
  2. Low Latency: No network round-trips during inference (after the one-time model download).
  3. Cost Effective: You (the developer) don't pay for tokens; the user provides the compute!
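Before committing to local inference, you should check that the browser actually exposes WebGPU. A minimal sketch of that feature check (the helper name and loose typing are mine, so it can be exercised outside a browser):

```typescript
// Minimal WebGPU feature detection. The parameter is typed loosely so the
// helper can run outside a browser; in the app you would pass `navigator`.
export function hasWebGPU(nav: { gpu?: unknown } | null | undefined): boolean {
  return !!nav?.gpu;
}

// In a React component you might gate the whole chat UI on this check:
// if (!hasWebGPU(navigator)) return <p>Your browser does not support WebGPU yet.</p>;
```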

The Architecture: How it Works

Before we touch the code, let’s look at the data flow. We are using WebLLM (built on the Apache TVM Unity compiler stack) to orchestrate the model execution.

sequenceDiagram
    participant U as User
    participant R as React UI
    participant W as WebLLM Engine (Worker)
    participant G as WebGPU (Local VRAM)

    U->>R: Types "I'm feeling stressed"
    R->>W: Forward prompt to MLCEngine
    W->>G: Load model weights & Compute tensors
    G-->>W: Return generated tokens
    W-->>R: Stream text response
    R->>U: Display "I'm here for you..."
    Note over U,G: Data never leaves the Client Machine!

Prerequisites

To follow this guide, you’ll need:

  • React (Vite is recommended for speed).
  • A WebGPU-enabled browser (Chrome 113+, Edge, or Opera).
  • WebLLM Library: npm install @mlc-ai/web-llm.
  • A decent GPU (even an integrated M1/M2 chip works wonders!).

Step-by-Step Implementation

1. Initializing the WebLLM Engine

The core of our application is the MLCEngine. Since loading a model (like Llama-3 or Mistral) involves downloading several gigabytes of weights into the browser cache, we need to handle the loading state gracefully.

import { CreateMLCEngine, MLCEngine } from "@mlc-ai/web-llm";
import { useState, useEffect } from "react";

export function useWebLLM(modelId: string) {
  const [engine, setEngine] = useState<MLCEngine | null>(null);
  const [progress, setProgress] = useState(0);

  useEffect(() => {
    let cancelled = false;
    const initEngine = async () => {
      // Initialize the engine with a callback to track download progress
      const instance = await CreateMLCEngine(modelId, {
        initProgressCallback: (report) =>
          setProgress(Math.round(report.progress * 100)),
      });
      // Guard against setting state after unmount or after modelId changed
      if (!cancelled) setEngine(instance);
    };
    initEngine();
    return () => {
      cancelled = true;
    };
  }, [modelId]);

  return { engine, progress };
}

2. Creating the "Therapist" System Prompt

To make our bot act as a counselor, we need a strong system prompt. This instructs the model to be empathetic, professional, and observant.

const SYSTEM_PROMPT = `
You are a supportive and empathetic mental health assistant. 
Your goal is to listen, provide validation, and suggest coping strategies. 
You are NOT a doctor. If the user is in danger, provide emergency resources.
Keep the conversation focused on the user's feelings.
`;
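The prompt delegates crisis handling to the model, but LLMs can miss it. A deterministic client-side keyword check is a cheap, reliable complement; this is a hypothetical safety net, and the pattern list below is illustrative, not clinical guidance:

```typescript
// Hypothetical client-side safety net: a deterministic keyword check on the
// user's message, run before the model sees it. The patterns are
// illustrative examples only, not clinical guidance.
const CRISIS_PATTERNS: RegExp[] = [
  /suicid/i,
  /kill myself/i,
  /self[- ]harm/i,
  /end my life/i,
];

export function needsCrisisResources(message: string): boolean {
  return CRISIS_PATTERNS.some((pattern) => pattern.test(message));
}

// Usage inside handleSend, before calling the engine:
// if (needsCrisisResources(input)) showEmergencyBanner();
```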

3. The Chat Interface

Now, let's wire it up to a React component. We will use a streaming approach so the user sees the response in real-time.

import React, { useState } from "react";

const ChatBot = () => {
  const [input, setInput] = useState("");
  const [messages, setMessages] = useState<{role: string, content: string}[]>([]);
  // Model IDs must match web-llm's prebuilt list (see its prebuiltAppConfig)
  const { engine, progress } = useWebLLM("Llama-3-8B-Instruct-v0.1-q4f16_1-MLC");

  const handleSend = async () => {
    if (!engine || !input.trim()) return;

    const userMsg = { role: "user", content: input };
    setMessages((prev) => [...prev, userMsg]);
    setInput("");

    // Start inference
    const chunks = await engine.chat.completions.create({
      messages: [{ role: "system", content: SYSTEM_PROMPT }, ...messages, userMsg],
      stream: true,
    });

    let reply = "";
    for await (const chunk of chunks) {
      reply += chunk.choices[0]?.delta?.content || "";
      // Update the UI with the streaming text
      setMessages((prev) => {
        const last = prev[prev.length - 1];
        if (last?.role === "assistant") {
            return [...prev.slice(0, -1), { role: "assistant", content: reply }];
        }
        return [...prev, { role: "assistant", content: reply }];
      });
    }
  };

  if (progress < 100) return <div>Loading AI Brain: {progress}%</div>;

  return (
    <div className="chat-container">
      <div className="messages">
        {messages.map((m, i) => (
          <div key={i} className={m.role}>{m.content}</div>
        ))}
      </div>
      <input value={input} onChange={(e) => setInput(e.target.value)} />
      <button onClick={handleSend}>Talk to AI</button>
    </div>
  );
};
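Generations from an 8B model can take a while on modest hardware, so it's worth letting the user cancel one. web-llm's MLCEngine exposes interruptGenerate() for this; the helper below types against just that one method so it's easy to test, and wiring it to a "Stop" button is an assumption of this sketch:

```typescript
// Let the user stop a long generation. Typing against the single method we
// need (MLCEngine.interruptGenerate) keeps this helper self-contained.
type Interruptible = { interruptGenerate: () => void };

export function makeStopHandler(engine: Interruptible | null): () => void {
  return () => {
    // No-op while the engine is still loading
    engine?.interruptGenerate();
  };
}

// In the component, next to the send button:
// <button onClick={makeStopHandler(engine)}>Stop</button>
```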

The "Official" Way to Scale

While running a local bot is amazing for privacy, production environments often require hybrid strategies, sophisticated prompt caching, and model quantization tuning.

💡 Developer Pro-Tip: If you're looking to take your AI implementation to the next level—moving beyond the sandbox into production-grade architectures—I highly recommend checking out the advanced patterns at WellAlly Blog. They offer incredible deep-dives into LLM orchestration and Edge AI optimizations that were a huge inspiration for this build.

Challenges and Considerations

  • VRAM Constraints: Running an 8B model requires at least 5-6 GB of free VRAM. If your users are on mobile, you might want to target smaller models like Phi-3 or TinyLlama.
  • Initial Load: The first load is heavy (several GBs). However, WebLLM uses the browser's Cache API, so subsequent loads are near-instant.
  • Device Support: WebGPU is still rolling out. Always include a fallback message for browsers that don't support it yet.
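The VRAM point can be handled at runtime by picking a model per device. One coarse signal is `navigator.deviceMemory` (Chromium-only, reported in GB); the threshold and the smaller Phi-3 model ID below are assumptions to adapt to web-llm's actual prebuilt list:

```typescript
// Rough model selection for constrained devices. The 8 GB threshold and the
// model IDs are assumptions; verify both against web-llm's prebuilt list.
export function pickModel(deviceMemoryGB: number | undefined): string {
  if (deviceMemoryGB !== undefined && deviceMemoryGB < 8) {
    return "Phi-3-mini-4k-instruct-q4f16_1-MLC"; // lighter VRAM footprint
  }
  return "Llama-3-8B-Instruct-q4f16_1-MLC"; // needs roughly 5-6 GB of VRAM
}

// Usage (deviceMemory is Chromium-only, hence the loose cast):
// const modelId = pickModel((navigator as any).deviceMemory);
```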

Conclusion

Building a decentralized mental health bot isn't just a technical flex—it's about reclaiming user trust. By combining React with WebLLM, we’ve built a tool that is powerful, free to run, and most importantly, private.

The era of "Cloud-Only" AI is ending. Local-first is the future.

What are you planning to build with WebGPU? Drop a comment below and let's discuss! Don't forget to ❤️ and subscribe for more "Learning in Public" content.
