I built a proof-of-concept that lets you chat with PDF documents using AI models running entirely in your browser via WebGPU. No backend, no API keys, complete privacy.
🔗 Demo: https://webpizza-ai-poc.vercel.app/
📦 Code: https://github.com/stramanu/webpizza-ai-poc
Why?
I've been following the progress of WebGPU and WebLLM, and I was curious: Can we run a full RAG pipeline in the browser?
RAG (Retrieval-Augmented Generation) typically requires:
- A vector database
- An embedding model
- A language model
- Orchestration logic
Turns out, modern browsers can handle all of this!
Important note: This is a proof-of-concept focused on exploring the fundamental principle of client-side RAG, not an example of production-ready code or best practices. The goal was experimentation with WebGPU and LLMs in the browser, so expect rough edges and architectural shortcuts.
The Stack
- Frontend: Angular 20 (standalone components, zoneless)
- LLM: WebLLM v0.2.79 + WeInfer (optimized fork)
- Embeddings: Transformers.js (all-MiniLM-L6-v2)
- Vector Store: IndexedDB with cosine similarity
- PDF Parser: PDF.js
- Deployment: Vercel
How It Works
1. Model Loading
WebLLM downloads pre-compiled MLC models (Phi-3, Llama, Mistral). The first load is slow (1-4 GB of weights), but the models are cached by the browser for subsequent visits.
await this.llm.initialize();
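Under the hood, initialize() wraps WebLLM's engine setup. Roughly, the raw API looks like this (a sketch, not the repo's actual code; the model id is illustrative, and the repo uses the WeInfer fork, so details may differ):

import { CreateMLCEngine } from '@mlc-ai/web-llm';

// First run downloads the pre-compiled weights; later runs hit the browser cache.
const engine = await CreateMLCEngine('Phi-3-mini-4k-instruct-q4f16_1-MLC', {
  initProgressCallback: (report) => console.log(report.text), // surface download/compile progress in the UI
});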
2. Document Ingestion
Upload a PDF → Parse with PDF.js → Chunk into ~500 char pieces → Embed each chunk → Store in IndexedDB.
const chunks = await this.parser.parseFile(file);        // PDF.js extraction + ~500-char chunking
for (const text of chunks) {
  const embedding = await this.embedder.embed(text);      // all-MiniLM-L6-v2 via Transformers.js
  await this.vectorStore.addChunk({ text, embedding });   // persisted to IndexedDB
}
3. Query Processing
User asks question → Embed query → Similarity search in IndexedDB → Get top-k chunks → Feed to LLM with context.
const queryEmbedding = await this.embedder.embed(question);
const relevantChunks = await this.vectorStore.search(queryEmbedding, 3); // top-3 by cosine similarity
const context = relevantChunks.map(c => c.text).join('\n');
const response = await this.llm.generate(context, question);
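The "search" here is brute force: every stored chunk is scored against the query vector. A sketch of that ranking step (illustrative names, not the repo's actual vector-store code):

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function topK(query: number[], chunks: { text: string; embedding: number[] }[], k: number) {
  return chunks
    .map(c => ({ ...c, score: cosineSimilarity(query, c.embedding) })) // score every chunk
    .sort((a, b) => b.score - a.score)                                 // best match first
    .slice(0, k);
}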
Challenges
1. Cross-Origin Isolation
WebGPU requires SharedArrayBuffer, which needs these headers:
Cross-Origin-Embedder-Policy: require-corp
Cross-Origin-Opener-Policy: same-origin
Vercel makes this easy with vercel.json, but cross-origin isolation blocks any external resources (fonts, scripts, images) that aren't served with the appropriate CORS/CORP headers.
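For reference, the headers section of a vercel.json that enables cross-origin isolation looks roughly like this (the file in the repo may scope it differently):

{
  "headers": [
    {
      "source": "/(.*)",
      "headers": [
        { "key": "Cross-Origin-Embedder-Policy", "value": "require-corp" },
        { "key": "Cross-Origin-Opener-Policy", "value": "same-origin" }
      ]
    }
  ]
}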
2. Memory Management
Browsers aren't designed for 4GB models. I had to:
- Clear the vector store before loading new documents (see the sketch after this list)
- Implement proper cleanup for embeddings
- Handle model caching effectively
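As an example of the first point, wiping the vector store is a single clear() inside an IndexedDB transaction. A sketch with hypothetical database/store names (the repo's schema may differ):

function clearVectorStore(db: IDBDatabase): Promise<void> {
  return new Promise((resolve, reject) => {
    const tx = db.transaction('chunks', 'readwrite'); // 'chunks' is a hypothetical object store name
    tx.objectStore('chunks').clear();                 // drop every stored embedding
    tx.oncomplete = () => resolve();
    tx.onerror = () => reject(tx.error);
  });
}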
3. WebGPU Compatibility
Not all browsers support WebGPU yet. Falling back to WebAssembly works, but it's significantly slower, so I added detection logic to guide users toward compatible browsers.
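Something like this check is enough to decide whether to warn the user or fall back (a sketch, not the repo's exact detection code):

async function hasWebGPU(): Promise<boolean> {
  const gpu = (navigator as any).gpu;         // typed properly via @webgpu/types in a real project
  if (!gpu) return false;                     // API not exposed by this browser
  const adapter = await gpu.requestAdapter(); // can still be null on unsupported hardware/drivers
  return adapter !== null;
}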
What I'd Do Differently
- Implement proper vector indexing - search is currently a brute-force cosine-similarity scan
- Add model quantization options - let users choose between speed and quality
- Better chunking strategies - currently just splitting at ~500 characters
- Streaming for large documents - don't embed everything at once
- Support for multiple document formats - not just PDFs
Privacy Win
One unexpected benefit: Complete privacy by default.
Your documents never leave your device. No API calls, no server uploads, no tracking. Everything happens in your browser.
This makes it useful for:
- Sensitive documents (legal, medical, personal)
- Offline environments
- Privacy-conscious users
- Demos without infrastructure costs
Try It Yourself
🔗 Live Demo: https://webpizza-ai-poc.vercel.app/
Requirements:
- Chrome/Edge 113+ (WebGPU support)
- 4GB+ RAM
- Modern GPU (or patience for CPU fallback)
Quick start:
git clone https://github.com/stramanu/webpizza-ai-poc
cd webpizza-ai-poc
npm install
npm start
Closing Thoughts
This is a proof-of-concept, not production software. It has bugs, rough edges, and questionable architectural decisions.
But it proves that browser-based AI is getting real. WebGPU + WebAssembly + modern JS frameworks = surprisingly capable local inference.
What would you build with this stack?
Inspired by: DataPizza AI
Questions? Issues? Drop a comment or open an issue on GitHub!