The AI world is racing towards the cloud. Massive models, centralized APIs, and subscription fees are the norm. But what if you need AI that is fast, entirely private, and capable of seeing the world through your webcam—without sending a single pixel over the network?
Enter the Local AI Mission Console, a project that challenges the cloud-first paradigm by building a fully functional, multimodal AI pipeline directly on your hardware.
In this post, we’ll break down the architecture behind this project, highlight the power of the RunAnywhere Web SDK, and show you how to orchestrate local backend reasoning (via Ollama) with browser-side inference (via Sherpa-ONNX WebAssembly).
## The Local-Edge Architecture
Building AI applications usually involves two extremes:
- Cloud AI: powerful, but high-latency, costly, and inherently not private.
- True Edge (In-Browser) AI: extremely private, but limited by WebGL/WebGPU constraints. Running a 7-billion-parameter Vision-Language Model (VLM) purely in a browser tab is often unstable.
This project introduces a "Local-Edge" hybrid:
- The Edge (Browser): Handles perception (WebRTC webcam capture) and speech synthesis (WASM TTS). It drives the UI and orchestrates the flow.
- The Local Backend (Ollama): Acts as the heavy-duty reasoning engine running natively on your hardware’s GPU or CPU, abstracted behind a clean local API.
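For concreteness, the browser-side perception step can be sketched in a few lines (a minimal sketch: the helper names are illustrative, not taken from the project's source, and the code assumes a `<video>` element already attached to a `getUserMedia` stream):

```typescript
// Strip the "data:image/jpeg;base64," prefix from a data URL,
// leaving the raw base64 payload the backend expects.
function stripDataUrlPrefix(dataUrl: string): string {
  const comma = dataUrl.indexOf(',');
  return comma >= 0 ? dataUrl.slice(comma + 1) : dataUrl;
}

// Draw the current video frame onto a canvas and encode it as base64.
async function captureFrameBase64(video: HTMLVideoElement): Promise<string> {
  const canvas = document.createElement('canvas');
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  const ctx = canvas.getContext('2d');
  if (!ctx) throw new Error('2D canvas context unavailable');
  ctx.drawImage(video, 0, 0);
  return stripDataUrlPrefix(canvas.toDataURL('image/jpeg', 0.8));
}
```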
## Orchestrating with the RunAnywhere Web SDK
The secret sauce tying the browser to these complex tasks is the RunAnywhere Web SDK.
RunAnywhere provides standardized, deeply typed interfaces for dealing with AI modalities. Instead of wrestling with fragmented Web Audio APIs for playback or writing custom WebRTC hooks for frame extraction, the SDK handles the boilerplate.
```typescript
import { VisionSystem } from './vision';
import { OllamaClient } from './ollama';
import { SpeechSystem } from './speech';

// The pipeline is simple and declarative:
export class AIPipeline {
  constructor(
    private vision: VisionSystem,
    private ollama: OllamaClient,
    private speech: SpeechSystem,
  ) {}

  async run() {
    const frame = await this.vision.captureFrame(); // 1. Perception
    const text = await this.ollama.describe(frame); // 2. Reasoning
    await this.speech.speak(text);                  // 3. Synthesis
  }
}
```
## The Heavy Lifter: Ollama as a VLM Engine
While text-generation models are shrinking, Vision-Language Models (VLMs) like `qwen2.5-vl` remain computationally heavy. By delegating vision tasks to Ollama, the browser tab stays lightweight and responsive.
The browser simply converts the webcam frame to base64 and fires an HTTP request to `localhost:11434`. Ollama's robust native execution processes the image against the user's prompt ("Describe what you see in one short sentence") and streams back the result. Because the network request never leaves the local machine, the latency is negligible compared to cloud APIs.
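That request can be sketched against Ollama's documented `/api/generate` endpoint, whose `images` field accepts base64-encoded frames (the helper names below are illustrative, not the project's actual code):

```typescript
interface OllamaGenerateRequest {
  model: string;
  prompt: string;
  images: string[]; // base64-encoded webcam frames
  stream: boolean;
}

// Build the JSON payload for a single-frame description request.
function buildDescribeRequest(frameBase64: string): OllamaGenerateRequest {
  return {
    model: 'qwen2.5-vl',
    prompt: 'Describe what you see in one short sentence.',
    images: [frameBase64],
    stream: false,
  };
}

// POST to the local Ollama server and return the generated text.
async function describeFrame(frameBase64: string): Promise<string> {
  const res = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildDescribeRequest(frameBase64)),
  });
  const json = await res.json();
  return json.response; // non-streaming responses put the text here
}
```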
## The Browser Magic: Sherpa-ONNX WASM TTS
The most technically complex part of the Local AI Mission Console occurs entirely within the browser tab: Offline Text-to-Speech.
Instead of making an API call for audio synthesis, the app uses `@runanywhere/web-onnx` to initialize a WebAssembly-compiled Sherpa-ONNX engine.
### Overcoming the WASM Hydration Challenge
The Piper TTS models used by Sherpa require more than just a `.onnx` file. They need a deep dictionary of phonemization rules (`espeak-ng-data`) consisting of over 350 files.
If a single rule file is missing, the C++ engine inside the WASM sandbox crashes with a fatal `exit(-1)`. We solved this by treating the WASM virtual filesystem (FS) like a real hard drive:
```typescript
// 1. Fetch a pre-packaged tar archive of the model and espeak-ng-data
const response = await fetch('/models/piper-en.tgz.bin');
const buffer = await response.arrayBuffer();

// 2. Extract it directly into the memory of the WASM virtual filesystem
// (`bridge` is the SDK's handle to the WASM module's FS)
const { extractTarGz } = await import('@runanywhere/web');
const entries = await extractTarGz(new Uint8Array(buffer));
for (const entry of entries) {
  const cleanPath = entry.name.replace(/^\.\//, ''); // archive-relative path, minus any leading "./"
  const fsPath = `/models/piper-en/${cleanPath}`;
  bridge.writeFile(fsPath, entry.data);
}

// 3. The engine initializes perfectly, finding all 350+ files exactly where it expects them!
```
By explicitly packaging the data into a `.tgz.bin` (the `.bin` extension stops Vite and the dev server from transparently decompressing the archive in transit), the frontend can instantly hydrate the TTS engine.
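A cheap guard against that failure mode is to check the gzip magic bytes (0x1f 0x8b) on the fetched buffer before extracting, so a transparently decompressed archive fails loudly instead of corrupting the hydration step (a sketch, not part of the project's code):

```typescript
// Returns true if the buffer still starts with the gzip magic number,
// i.e. the server delivered the .tgz.bin archive untouched.
function isGzip(bytes: Uint8Array): boolean {
  return bytes.length >= 2 && bytes[0] === 0x1f && bytes[1] === 0x8b;
}

// Usage: fail fast before handing the buffer to the extractor.
// if (!isGzip(new Uint8Array(buffer))) throw new Error('archive was decompressed in transit');
```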
### Solving the "Instance Mismatch"
Vite's module bundling can sometimes duplicate singleton instances across files. To guarantee that the initialization logic in `main.ts` and the synthesis calls in `speech.ts` were talking to the exact same physical WASM runtime, we utilized RunAnywhere's `ExtensionPoint` registry:
```typescript
// Retrieve the *exact* TTS singleton instantiated by the SDK
const tts = ExtensionPoint.getProvider('tts');
const result = await tts.synthesize("I see a developer building cool things.");
```
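The underlying idea is a shared registry that hands every caller the same object. A stripped-down sketch of the pattern (not RunAnywhere's actual implementation) looks like this:

```typescript
// Minimal provider registry: one module registers an instance, every
// other module retrieves that very same object by name (as long as the
// registry module itself is only loaded once).
class ProviderRegistry {
  private static providers = new Map<string, unknown>();

  static register(name: string, provider: unknown): void {
    ProviderRegistry.providers.set(name, provider);
  }

  static get<T>(name: string): T {
    const provider = ProviderRegistry.providers.get(name);
    if (!provider) throw new Error(`No provider registered for "${name}"`);
    return provider as T;
  }
}
```

In this sketch, an initializer would call `ProviderRegistry.register('tts', engine)` once, and any consumer calling `ProviderRegistry.get('tts')` is guaranteed the identical instance rather than a bundler-duplicated copy.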
## Why This Matters
The Local AI Mission Console serves as a foundational template for modern local apps. It proves that you don't need a complex Electron app or a heavy Python backend to build powerful AI tools.
With RunAnywhere, Ollama, and Vite, any web developer can build low-latency, 100% private AI applications that leverage local hardware to its maximum potential.
## Where to go from here?
This pipeline is just the beginning.
- Try modifying the prompt in `ollama.ts` to build a Local Security Guard that only speaks when it detects a person.
- Add a `setInterval` loop to `pipeline.ts` for a Dashcam mode that narrates your drive.
- Implement Local RAG by passing the Ollama description to a ChromaDB instance for contextual recall.
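The Dashcam idea from the list above can be sketched as a guarded interval loop around the pipeline (the interval length and function names here are arbitrary choices, not from the project):

```typescript
// Re-run the pipeline every few seconds, skipping ticks while a
// previous run (capture -> describe -> speak) is still in flight.
function startDashcam(pipeline: { run(): Promise<void> }, everyMs = 5000): () => void {
  let busy = false;
  const id = setInterval(async () => {
    if (busy) return;
    busy = true;
    try {
      await pipeline.run();
    } finally {
      busy = false;
    }
  }, everyMs);
  return () => clearInterval(id); // call the returned function to stop narrating
}
```

The `busy` flag matters because a VLM round-trip plus speech synthesis can easily outlast the interval; without it, overlapping runs would queue up and the narration would drift further and further behind the camera.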
