In the world of digital health, privacy isn't just a feature—it's a requirement. When dealing with sensitive medical data like dermatological photos, users are often (rightfully) hesitant to upload their images to a remote server. Enter Edge AI and the revolution of WebGPU.
By leveraging WebLLM and the TVM (Tensor Virtual Machine) stack, we can now run sophisticated vision models directly inside the browser. This approach enables high-performance, real-time, privacy-preserving AI where the image never leaves the user's device. In this guide, we'll explore how to implement a skin lesion screening tool using WebGPU and TypeScript, moving the heavy lifting from the cloud to the client's GPU.
## 🏗 The Architecture: High-Performance Edge Inference
Traditional web-based AI often relies on slow API calls. Our solution uses the browser's hardware acceleration via WebGPU, allowing us to execute compiled model kernels at near-native speeds.
```mermaid
graph TD
    A[User Image/Camera] --> B{WebGPU Support?}
    B -- No --> C[Fallback: CPU/Wasm]
    B -- Yes --> D[Canvas API / Image Preprocessing]
    D --> E[WebLLM / TVM Runtime]
    E --> F[VLM / Vision Model Shards]
    F --> G[GPU-Accelerated Inference]
    G --> H[Screening Report & Insights]
    H --> I[UI Display]
```
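The decision node at the top of the diagram is a simple feature check. Here is a minimal sketch; the `hasWebGPU` helper is my own name, and production code should additionally request an adapter via `navigator.gpu.requestAdapter()` to confirm a usable device before committing to the GPU path:

```typescript
// Returns true when the WebGPU API is exposed on the given navigator-like
// object. Accepting the object as a parameter keeps the check testable.
function hasWebGPU(nav: { gpu?: unknown }): boolean {
  return nav.gpu !== undefined && nav.gpu !== null;
}

// In the browser:
// if (hasWebGPU(navigator)) { /* WebGPU path */ } else { /* Wasm/CPU fallback */ }
```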
## 🛠 Prerequisites
To follow along, you'll need:
- Tech Stack: WebLLM, WebGPU-capable browser (Chrome 113+), TypeScript, and Vite.
- A Vision Model: We’ll use a quantized version of a vision-language model (VLM) compatible with the TVM runtime.
## 🚀 Step 1: Initializing the WebGPU Engine
First, we need to check for WebGPU compatibility and initialize the WebLLM engine. Unlike standard REST APIs, we are loading the actual model weights into the browser's Cache Storage (or memory) on first use.
```typescript
import * as webllm from "@mlc-ai/web-llm";

async function initializeScreeningEngine() {
  // Example VLM id; check the MLC prebuilt model list for currently available vision models
  const modelId = "Llama-3-8B-Vision-Instruct-q4f16_1-MLC";

  // Progress callback to update the UI during the heavy model download
  const initProgressCallback = (report: webllm.InitProgressReport) => {
    console.log(`Loading Model: ${report.text} - ${Math.round(report.progress * 100)}%`);
  };

  const engine = await webllm.CreateMLCEngine(modelId, { initProgressCallback });
  return engine;
}
```
## 🖼 Step 2: Processing Pixels for the Model
Skin screening requires high-fidelity input. We use the browser's CanvasRenderingContext2D to resize and normalize the image before encoding it for the model.
```typescript
async function processImage(imageElement: HTMLImageElement): Promise<string> {
  const canvas = document.createElement("canvas");
  const ctx = canvas.getContext("2d");
  if (!ctx) {
    throw new Error("2D canvas context is not available");
  }

  // Standardize input size for the vision encoder
  canvas.width = 448;
  canvas.height = 448;
  ctx.drawImage(imageElement, 0, 0, 448, 448);

  // Convert to a Base64 data URL for WebLLM vision input
  return canvas.toDataURL("image/jpeg");
}
```
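One caveat: squashing an arbitrary photo straight into 448×448 distorts the aspect ratio, which can warp lesion geometry. An alternative is to letterbox the image first. The `fitToSquare` helper below is a hypothetical sketch that computes a centered draw rectangle, assuming the model tolerates padded borders:

```typescript
// Compute a centered draw rectangle that fits (w, h) into a square of
// `size` pixels while preserving aspect ratio (letterboxing).
function fitToSquare(w: number, h: number, size: number) {
  const scale = size / Math.max(w, h);
  const drawW = Math.round(w * scale);
  const drawH = Math.round(h * scale);
  return {
    x: Math.floor((size - drawW) / 2),
    y: Math.floor((size - drawH) / 2),
    w: drawW,
    h: drawH,
  };
}

// Usage with the canvas above:
// const r = fitToSquare(imageElement.naturalWidth, imageElement.naturalHeight, 448);
// ctx.drawImage(imageElement, r.x, r.y, r.w, r.h);
```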
## 🧠 Step 3: Local Inference
Now for the magic. We send the processed image and a prompt to our local model. Because the TVM runtime has compiled the model's kernels to run on the user's GPU via WebGPU, inference completes in a few hundred milliseconds rather than the seconds a round trip to a remote API can take.
```typescript
async function runScreening(engine: webllm.MLCEngine, imageBase64: string) {
  const messages: webllm.ChatCompletionMessageParam[] = [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: "Identify potential skin lesions in this image and provide a preliminary risk assessment.",
        },
        { type: "image_url", image_url: { url: imageBase64 } },
      ],
    },
  ];

  const reply = await engine.chat.completions.create({
    messages,
    temperature: 0.2, // Low temperature keeps screening output consistent
  });

  return reply.choices[0].message.content;
}
```
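Putting the three steps together, the flow is init → preprocess → infer. The sketch below wires them via injected functions (the `ScreeningDeps` and `screenImage` names are mine, not part of WebLLM), which keeps the control flow clear and testable without a browser:

```typescript
// Dependency-injected pipeline: each stage is passed in, so the same
// control flow works with the real WebLLM engine or with stubs in tests.
interface ScreeningDeps<I, E> {
  initEngine: () => Promise<E>;
  processImage: (img: I) => Promise<string>;
  runScreening: (engine: E, imageBase64: string) => Promise<string | null>;
}

async function screenImage<I, E>(
  img: I,
  deps: ScreeningDeps<I, E>
): Promise<string | null> {
  const engine = await deps.initEngine();      // Step 1: load model (cached after first run)
  const base64 = await deps.processImage(img); // Step 2: resize + encode
  return deps.runScreening(engine, base64);    // Step 3: local inference
}
```

In the app, you would call `screenImage` with the real `initializeScreeningEngine`, `processImage`, and `runScreening` functions defined above.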
## 💡 The "Official" Way to Scale
While building a prototype in the browser is exciting, productionizing Edge AI requires handling model versioning, weight sharding, and cross-device performance optimization.
For advanced implementation patterns, performance benchmarks on different GPU architectures, and production-ready Edge AI templates, I highly recommend checking out the technical deep dives at WellAlly Tech Blog. It's an incredible resource for developers looking to bridge the gap between "cool demo" and "robust healthcare application."
## 📈 Optimization & Benchmarking
Once the model is cached in the browser's CacheStorage, we observed:
- Cold Start: 5-10 seconds (Model loading).
- Inference Time: ~200ms - 800ms (depending on GPU).
- Data Egress: 0KB (Completely private).
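To reproduce numbers like these, wrap each stage in a small timer. A sketch using the standard `performance.now()` clock (available in browsers and recent Node); the `timed` helper is my own:

```typescript
// Run an async stage, log its wall-clock duration in milliseconds,
// and pass the stage's result through unchanged.
async function timed<T>(label: string, fn: () => Promise<T>): Promise<T> {
  const start = performance.now();
  const result = await fn();
  console.log(`${label}: ${(performance.now() - start).toFixed(1)}ms`);
  return result;
}

// e.g. const engine = await timed("cold start", () => initializeScreeningEngine());
```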
## 🎯 Conclusion
The browser is no longer just a document viewer; it's a powerful AI execution environment. By combining WebLLM and WebGPU, we can build healthcare tools that are fast, cost-effective, and—most importantly—private by design.
What's next?
Try integrating this with a mobile PWA to create a "Skin Journal" app that alerts users to changes in their skin over time, all without a single server-side database.
🥑 Found this helpful? Follow me for more "Learning in Public" notes on Edge AI, and don't forget to visit WellAlly Tech for more high-level architecture insights!