Every big project starts with a spark of curiosity. For us, it began with the question: "Can we make powerful on-device AI as simple and accessible as browsing a website?"
Try it for yourself: Live Demo
We were exploring amazing tools like Ollama and LM Studio. The power to run large models locally was incredible, but it came with a catch: the complexity of terminals, installers, and configurations. It was a world away from the seamless experience of web-based AI tools.
We saw a gap and an opportunity. This is the story of how we built Gemma Web, a fully private, on-device AI workspace that lives entirely in your browser.
Inspiration from Figma
Our guiding thought was: if a graphically intensive application like Figma can run entirely on the client-side, what's stopping us from doing the same with a language model? This led us down the path of WebAssembly (Wasm), the technology that would become the cornerstone of our project.
The Journey for the Right Tools
Our initial exploration of existing JavaScript AI libraries was a mixed bag. While promising, they often felt slow and couldn't deliver the low-latency experience we were aiming for. We knew we needed something that could leverage the hardware more directly.
The breakthrough came when we found two key pieces of technology from Google:
Gemma Models: A new family of open-weight models specifically optimized for on-device performance.
MediaPipe LLM Task API: A (then in preview) library designed to run these models efficiently using a Wasm backend.
We had found our path forward.
Our Technical Architecture
Here’s a breakdown of the core components we engineered:
The On-Device Inference Engine
The heart of the application is a serverless, local-first inference engine. We used WebAssembly and the MediaPipe LLM Task API to deploy Google’s Gemma models directly in the browser. This approach is what allows us to achieve near zero-latency inference with 100% data privacy and full offline capability.
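To make this concrete, here is a minimal sketch of how a Gemma model can be loaded and queried through the MediaPipe LLM Task API. The CDN URL, model filename, and generation parameters are illustrative assumptions, not the exact configuration Gemma Web ships with.

```typescript
// Minimal sketch of the on-device inference setup with @mediapipe/tasks-genai.
// The Wasm CDN path and model file name below are illustrative.
import { FilesetResolver, LlmInference } from "@mediapipe/tasks-genai";

async function createEngine(): Promise<LlmInference> {
  // Resolve the Wasm backend files that power the LLM task.
  const genai = await FilesetResolver.forGenAiTasks(
    "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm"
  );

  // Load a Gemma model converted for on-device use.
  return LlmInference.createFromOptions(genai, {
    baseOptions: { modelAssetPath: "/models/gemma-2b-it-gpu-int4.bin" },
    maxTokens: 1024,
    temperature: 0.8,
  });
}

// Usage: generate a response entirely in the browser, no server round trip.
const engine = await createEngine();
const answer = await engine.generateResponse("Explain WebAssembly in one sentence.");
console.log(answer);
```

Because both the Wasm runtime and the model weights live on the device, prompts and responses never leave the browser.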
The Client-Side RAG Pipeline
To make the AI truly useful, it needed to understand user-specific context. We implemented a complete Retrieval-Augmented Generation (RAG) pipeline that runs entirely on the client side. To avoid freezing the app while processing large files, we leveraged a Web Worker to handle the heavy lifting. The worker asynchronously processes user-uploaded PDF and TXT files, converting them into vector embeddings with TensorFlow.js so the model can hold context-aware conversations.
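Below is a hedged sketch of what such an embedding worker might look like. It assumes the Universal Sentence Encoder as the TensorFlow.js embedding model and a plain text-chunk message format; the actual model choice, chunking, and PDF parsing in Gemma Web may differ.

```typescript
/// <reference lib="webworker" />
// embedding.worker.ts — sketch of off-main-thread embedding.
// Assumes the Universal Sentence Encoder as the TensorFlow.js embedding model.
import * as tf from "@tensorflow/tfjs";
import * as use from "@tensorflow-models/universal-sentence-encoder";

let modelPromise: ReturnType<typeof use.load> | null = null;

self.onmessage = async (event: MessageEvent<{ chunks: string[] }>) => {
  modelPromise ??= use.load(); // load the encoder once, lazily
  const model = await modelPromise;

  // Embed each text chunk; embed() returns a 2-D tensor of shape [chunks, 512].
  const embeddings = await model.embed(event.data.chunks);
  const vectors = await embeddings.array();
  embeddings.dispose();

  // Send the vectors back; the main thread persists them for similarity search.
  self.postMessage({ vectors });
};
```

On the main thread, the worker is fed text chunks extracted from the uploaded file, and the returned vectors are stored alongside the document so relevant passages can be retrieved at prompt time.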
Robust Local Storage with IndexedDB
To achieve true offline functionality, we needed a way to store everything locally. We architected a client-side storage layer on top of IndexedDB that persists multi-gigabyte AI models, conversation histories, and document embeddings. This storage layer is also what enables Gemma Web to function as a seamless, installable Progressive Web App (PWA).
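As a rough illustration, here is one way such a persistence layer can be set up using the `idb` wrapper around IndexedDB. The database name, store names, and schema are assumptions for the sketch, not the exact schema Gemma Web uses.

```typescript
// Sketch of a client-side persistence layer using the `idb` wrapper.
// Store names and schema are illustrative assumptions.
import { openDB } from "idb";

const dbPromise = openDB("gemma-web", 1, {
  upgrade(db) {
    db.createObjectStore("models");                            // model files as Blobs
    db.createObjectStore("chats", { keyPath: "id" });          // conversation history
    db.createObjectStore("embeddings", { keyPath: "docId" });  // document vectors
  },
});

// Persist a downloaded model file so it survives reloads and works offline.
export async function saveModel(name: string, data: Blob): Promise<void> {
  const db = await dbPromise;
  await db.put("models", data, name);
}

// Read it back later and hand it to the inference engine.
export async function loadModel(name: string): Promise<Blob | undefined> {
  const db = await dbPromise;
  return db.get("models", name);
}
```

A stored model Blob can later be handed back to the inference engine, for example via URL.createObjectURL, so the user never has to re-download gigabytes of weights while offline.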
The Result: A New Kind of AI Workspace
The culmination of this journey is Gemma Web, a tool that we believe represents a step towards the future of personal, private AI.
This project was a deep dive into the future of web-native applications. It taught us that the browser is more than capable of handling complex, computation-heavy tasks that were once reserved for servers.
We'd love for you to check it out and share your feedback in the comments below! Thanks for reading.