Privacy is the ultimate luxury in the age of AI. When it comes to health data, the stakes are even higher. Users are increasingly wary of sending sensitive symptoms to a remote server. This is where Edge AI and In-browser inference change the game. By leveraging WebGPU and Transformer models, we can build a medical symptom checker that runs entirely on the user's hardware—meaning zero server costs for you and total privacy for them.
In this tutorial, we’ll explore how to use the WebGPU API, ONNX Runtime, and WebLLM to deploy a lightweight Transformer engine directly in the browser. We will focus on creating a high-performance, privacy-first AI solution that bypasses the cloud entirely. If you've been looking for a way to implement On-device LLMs using TypeScript, you're in the right place.
The Architecture: How It Works
Traditional AI apps send a request to a Python backend. Our Wasm-Med architecture keeps everything in the client-side sandbox.
graph TD
A[User Input: 'I have a dry cough'] --> B[Tokenization via Wasm]
B --> C{WebGPU Engine}
C -->|WebLLM| D[Large Language Model Inference]
C -->|ONNX Runtime Web| E[Lightweight Classifier]
D & E --> F[VRAM / Local GPU]
F --> G[JSON Result: Probable Causes]
G --> H[UI Update]
style C fill:#f9f,stroke:#333,stroke-width:2px
Prerequisites
Before we dive into the code, ensure you have the following:
- Tech Stack: TypeScript, ONNX Runtime Web (onnxruntime-web), and WebLLM.
- Browser: A modern browser with WebGPU support (Chrome 113+ or Edge).
- Knowledge: Intermediate understanding of Async/Await and Tensor shapes.
Step 1: Setting Up the WebGPU Engine
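First, we need to confirm that the browser can give us a GPU device. Unlike WebGL, which is built around the graphics pipeline, WebGPU exposes general-purpose compute shaders, which is critical for the matrix multiplications required by Transformers. ONNX Runtime Web and WebLLM each request their own device internally, so the function below mainly serves as an up-front capability check.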
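// Initialize the WebGPU device, failing loudly when no GPU is usable
async function initGPUDevice(): Promise<GPUDevice> {
  if (!navigator.gpu) {
    throw new Error("WebGPU not supported on this browser. 😭");
  }
  // requestAdapter() resolves to null when no suitable GPU is found
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) {
    throw new Error("No suitable GPU adapter found.");
  }
  const device = await adapter.requestDevice();
  // device.label is an empty string unless one was set explicitly
  console.log("🚀 WebGPU Device Ready:", device.label || "(unlabeled device)");
  return device;
}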
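Not every visitor will have WebGPU, so it may help to decide up front which ONNX Runtime Web execution providers to request in the next step. The sketch below assumes the library's 'wasm' provider is an acceptable CPU fallback. The helper names hasWebGPU and pickExecutionProviders are hypothetical and not part of any library.

```typescript
// Hypothetical helpers. Only the strings "webgpu" and "wasm" come from
// ONNX Runtime Web, where they name execution providers.
type ExecutionProvider = "webgpu" | "wasm";

/** Returns true when the current environment exposes navigator.gpu. */
function hasWebGPU(): boolean {
  const nav = (globalThis as unknown as { navigator?: { gpu?: unknown } }).navigator;
  return nav !== undefined && nav.gpu !== undefined;
}

/** Prefer WebGPU when available, and always keep WebAssembly as a fallback. */
function pickExecutionProviders(webgpuAvailable: boolean): ExecutionProvider[] {
  return webgpuAvailable ? ["webgpu", "wasm"] : ["wasm"];
}

// Usage: pass the result as the executionProviders option when creating a session.
const providers = pickExecutionProviders(hasWebGPU());
console.log("Requested execution providers:", providers.join(", "));
```

Note that navigator.gpu being present does not guarantee that an adapter can be obtained, so a stricter check would also await requestAdapter() as initGPUDevice does.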
Step 2: Loading the Lightweight Transformer (ONNX)
For a symptom checker, we don't always need a 70B-parameter model. A fine-tuned DistilBERT or TinyBERT running on ONNX Runtime is often enough to categorize symptoms accurately, and it is small enough to download and run with low latency.
import * as ort from 'onnxruntime-web/webgpu';

async function loadSymptomModel(): Promise<ort.InferenceSession> {
  return ort.InferenceSession.create('./models/symptom_classifier.onnx', {
    executionProviders: ['webgpu'], // Request the WebGPU execution provider
    graphOptimizationLevel: 'all'
  });
}

// Example inference function.
// The names "input_ids" and "logits" must match the exported model's graph;
// many BERT-style exports also expect an "attention_mask" input.
async function predictSymptom(session: ort.InferenceSession, inputIds: bigint[]) {
  // int64 tensor of shape [batch = 1, sequence length]
  const tensor = new ort.Tensor('int64', BigInt64Array.from(inputIds), [1, inputIds.length]);
  const feeds = { input_ids: tensor };
  const results = await session.run(feeds);
  return results.logits.data;
}
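The raw logits returned by predictSymptom are not probabilities yet. A minimal sketch of the post-processing step, assuming the model emits one logit per class and that you keep a matching label list alongside it. The labels used in the usage lines are made up for illustration.

```typescript
interface Prediction {
  label: string;
  probability: number;
}

/** Numerically stable softmax: subtract the max logit before exponentiating. */
function softmax(logits: ArrayLike<number>): number[] {
  const values = Array.from(logits);
  const max = Math.max(...values);
  const exps = values.map((v) => Math.exp(v - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

/** Pairs each probability with its label and returns the k most likely classes. */
function topPredictions(logits: ArrayLike<number>, labels: string[], k = 3): Prediction[] {
  if (logits.length !== labels.length) {
    throw new Error("Expected one label per logit");
  }
  return softmax(logits)
    .map((probability, i) => ({ label: labels[i], probability }))
    .sort((a, b) => b.probability - a.probability)
    .slice(0, k);
}

// Usage with hypothetical labels and logits
const exampleLabels = ["common_cold", "influenza", "allergy"];
const exampleTop = topPredictions(new Float32Array([2.0, 1.0, 0.1]), exampleLabels, 2);
console.log(exampleTop.map((p) => `${p.label}: ${p.probability.toFixed(2)}`));
```

Subtracting the maximum logit does not change the result but avoids overflow in Math.exp for large values.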
Step 3: Integrating WebLLM for Detailed Reasoning
While ONNX is great for classification, WebLLM allows us to run more complex conversational models (like Phi-2 or Llama-3) for explaining why certain symptoms might be occurring—all within the GPU's VRAM.
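import * as webllm from "@mlc-ai/web-llm";

// Load the model once and reuse the engine across calls
let enginePromise: Promise<webllm.MLCEngine> | null = null;
function getEngine(): Promise<webllm.MLCEngine> {
  enginePromise ??= (async () => {
    const engine = new webllm.MLCEngine();
    // Progress callback for UI feedback
    engine.setInitProgressCallback((report) => console.log(report.text));
    // The model ID must match an entry in webllm.prebuiltAppConfig.model_list
    await engine.reload("Phi2-q4f16_1-MLC");
    return engine;
  })();
  return enginePromise;
}

async function chatWithLocalAI(prompt: string) {
  const engine = await getEngine();
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: prompt }]
  });
  return reply.choices[0].message.content ?? "";
}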
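chatWithLocalAI sends the raw prompt as a single user message. One possible refinement is to add a system message and pass the classifier's candidate labels as context. This is only a sketch: buildSymptomMessages and the system prompt wording are hypothetical, and only the role/content message shape comes from the chat completions call above.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

/** Builds the messages array for a symptom explanation request. */
function buildSymptomMessages(symptomText: string, candidates: string[]): ChatMessage[] {
  // Hypothetical system prompt; tune the wording for your own product.
  const system =
    "You explain possible causes of symptoms in plain language. " +
    "You do not diagnose, and you suggest seeing a clinician for severe or persistent symptoms.";
  const hint =
    candidates.length > 0 ? `\nClassifier candidates: ${candidates.join(", ")}` : "";
  return [
    { role: "system", content: system },
    { role: "user", content: `Symptoms: ${symptomText.trim()}${hint}` },
  ];
}

// Usage: the result can be passed as the messages field of
// engine.chat.completions.create().
const exampleMessages = buildSymptomMessages("I have a dry cough", ["common_cold", "allergy"]);
console.log(exampleMessages[1].content);
```

Keeping prompt construction in a pure function makes it easy to unit test without loading a model.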
The "Official" Way to Scale Edge AI
Building a prototype in the browser is the first step, but productionizing Edge AI requires deeper optimization of quantization and memory management. If you are looking for advanced patterns on cross-platform deployment or performance benchmarking of different Transformer architectures on the web, I highly recommend checking out the WellAlly Tech Blog.
The team at WellAlly provides excellent production-ready examples of how to bridge the gap between "cool browser demo" and "robust medical-grade AI utility."
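To make the quantization and memory point concrete, here is a back-of-the-envelope sketch for estimating how much memory a model's weights need at a given precision. The function names are hypothetical. The estimate covers weights only and ignores the KV cache, activations, and per-group quantization metadata, so treat it as a lower bound.

```typescript
/** Approximate bytes needed to hold the weights alone. */
function estimateWeightBytes(parameterCount: number, bitsPerWeight: number): number {
  return (parameterCount * bitsPerWeight) / 8;
}

/** Converts bytes to decimal gigabytes (1 GB = 1e9 bytes). */
function toGigabytes(bytes: number): number {
  return bytes / 1e9;
}

// Phi-2 has roughly 2.7 billion parameters.
const phi2Params = 2.7e9;
const fourBit = toGigabytes(estimateWeightBytes(phi2Params, 4));
const sixteenBit = toGigabytes(estimateWeightBytes(phi2Params, 16));
console.log(`4-bit weights: ~${fourBit.toFixed(2)} GB, 16-bit weights: ~${sixteenBit.toFixed(2)} GB`);
```

The gap between the two figures is a large part of why 4-bit builds are the usual choice for in-browser models.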
Step 4: Connecting the UI
Finally, we wrap everything in a TypeScript class to handle the state. Using Comlink or a standard Web Worker is recommended to prevent the UI from freezing during large model loads.
export class SymptomEngine {
  private session: ort.InferenceSession | null = null;

  async bootstrap() {
    this.session = await loadSymptomModel();
    console.log("✅ Wasm-Med Engine Online");
  }

  async runAnalysis(text: string) {
    if (!this.session) {
      throw new Error("Call bootstrap() before runAnalysis().");
    }
    // 1. Local Tokenization
    // 2. ONNX Classification
    // 3. WebLLM Reasoning (Optional)
    // 4. Return Encrypted Result
  }
}
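The numbered steps in runAnalysis can be sketched as a small pipeline whose stages are injected as functions. This is one possible shape, not the definitive implementation: analyzeSymptoms, the stage types, and the result interface are all hypothetical, and the tokenizer, classifier, and explainer are left to the caller. Token IDs are plain numbers here and would be converted to bigint inside the classifier stage before building an int64 tensor.

```typescript
// Hypothetical stage signatures; each one wraps a real component in the app.
type Tokenize = (text: string) => Promise<number[]>;
type Classify = (inputIds: number[]) => Promise<string[]>;
type Explain = (text: string, candidates: string[]) => Promise<string>;

interface AnalysisDeps {
  tokenize: Tokenize;
  classify: Classify;
  explain?: Explain; // Optional LLM reasoning step
}

interface AnalysisResult {
  causes: string[];
  explanation: string | null;
}

/** Runs tokenization, classification, and optional explanation in order. */
async function analyzeSymptoms(text: string, deps: AnalysisDeps): Promise<AnalysisResult> {
  const trimmed = text.trim();
  if (trimmed.length === 0) {
    throw new Error("Symptom text is empty");
  }
  const inputIds = await deps.tokenize(trimmed);
  const causes = await deps.classify(inputIds);
  const explanation = deps.explain ? await deps.explain(trimmed, causes) : null;
  return { causes, explanation };
}
```

Injecting the stages keeps the orchestration testable with fakes, and the same function can run inside a Web Worker without changes.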
Conclusion: Why This Matters
By moving the "brain" of the application to the edge, we achieve three major wins:
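- Zero Network Latency: No round-trip to a server in Virginia. Response time depends only on the user's hardware.
- Zero Inference Cost: The user provides the compute power, so you pay nothing per request beyond serving static assets.
- Maximum Trust: The user's medical data never leaves their device.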
The future of healthcare apps is local. Using tools like WebGPU and ONNX Runtime, we can build tools that are not only powerful but also ethically sound.
What do you think? Would you trust a browser-based AI for a quick symptom check? Let me know in the comments below! 👇
If you enjoyed this deep dive into Edge AI, don't forget to ❤️ and bookmark! For more advanced technical insights into WebGPU and local-first development, visit wellally.tech/blog.