Emanuele Strazzullo

Building a Browser-Based RAG System with WebGPU

I built a proof-of-concept that lets you chat with PDF documents using AI models running entirely in your browser via WebGPU. No backend, no API keys, complete privacy.

🔗 Demo: https://webpizza-ai-poc.vercel.app/
📦 Code: https://github.com/stramanu/webpizza-ai-poc

Why?

I've been following the progress of WebGPU and WebLLM, and I was curious: Can we run a full RAG pipeline in the browser?

RAG (Retrieval-Augmented Generation) typically requires:

  1. A vector database
  2. An embedding model
  3. A language model
  4. Orchestration logic

Turns out, modern browsers can handle all of this!

Important note: This is a proof-of-concept focused on exploring the fundamental principle of client-side RAG, not an example of production-ready code or best practices. The goal was experimentation with WebGPU and LLMs in the browser, so expect rough edges and architectural shortcuts.

The Stack

  • Frontend: Angular 20 (standalone components, zoneless)
  • LLM: WebLLM v0.2.79 + WeInfer (optimized fork)
  • Embeddings: Transformers.js (all-MiniLM-L6-v2)
  • Vector Store: IndexedDB with cosine similarity
  • PDF Parser: PDF.js
  • Deployment: Vercel

How It Works

1. Model Loading

WebLLM downloads pre-compiled MLC models (Phi-3, Llama, Mistral). The first load is slow (models weigh 1-4 GB), but they are cached by the browser, so subsequent loads are much faster.

await this.llm.initialize();
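
Under the hood this wraps WebLLM's engine creation. A minimal sketch of what that looks like with the standard @mlc-ai/web-llm API (the model ID and progress handling here are illustrative, not necessarily what the repo uses):

import { CreateMLCEngine, MLCEngine } from "@mlc-ai/web-llm";

// Downloads (or loads from cache) a prebuilt MLC model and compiles it for WebGPU.
async function loadModel(): Promise<MLCEngine> {
  return CreateMLCEngine("Phi-3-mini-4k-instruct-q4f16_1-MLC", {
    // Reports download/compile progress so the UI can show a loading bar.
    initProgressCallback: (report) => console.log(report.text),
  });
}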

2. Document Ingestion

Upload a PDF → Parse with PDF.js → Chunk into ~500 char pieces → Embed each chunk → Store in IndexedDB.

const chunks = await this.parser.parseFile(file);
for (const text of chunks) {
  const embedding = await this.embedder.embed(text);
  await this.vectorStore.addChunk({ text, embedding });
}
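
For reference, embedder.embed can be built directly on Transformers.js. A sketch using the library's standard feature-extraction pipeline (the wrapper function itself is hypothetical):

import { pipeline } from "@xenova/transformers";

// all-MiniLM-L6-v2 produces 384-dimensional sentence embeddings.
const extractor = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");

async function embed(text: string): Promise<number[]> {
  // Mean-pool the token embeddings and L2-normalize them, so cosine
  // similarity later reduces to a dot product.
  const output = await extractor(text, { pooling: "mean", normalize: true });
  return Array.from(output.data as Float32Array);
}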

3. Query Processing

User asks question → Embed query → Similarity search in IndexedDB → Get top-k chunks → Feed to LLM with context.

const queryEmbedding = await this.embedder.embed(question);
const relevantChunks = await this.vectorStore.search(queryEmbedding, 3); // top-k
const context = relevantChunks.map(c => c.text).join('\n');
const response = await this.llm.generate(context, question);
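
The search itself is brute force: every stored chunk is scored against the query embedding with cosine similarity and the k best are kept. A minimal sketch, assuming the chunks have already been loaded from IndexedDB into memory (the interface and function names are illustrative):

interface Chunk {
  text: string;
  embedding: number[];
}

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every chunk and keep the k best matches.
function topK(chunks: Chunk[], query: number[], k: number): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosine(chunk.embedding, query) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((entry) => entry.chunk);
}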

Challenges

1. Cross-Origin Isolation

The stack relies on SharedArrayBuffer, which browsers only expose on cross-origin isolated pages, so the app has to be served with these headers:

Cross-Origin-Embedder-Policy: require-corp
Cross-Origin-Opener-Policy: same-origin

Vercel makes this easy with vercel.json, but cross-origin isolation breaks any external resource that isn't served with matching CORS/CORP headers.
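
In vercel.json that boils down to a headers rule along these lines (the exact source pattern in the repo may differ):

{
  "headers": [
    {
      "source": "/(.*)",
      "headers": [
        { "key": "Cross-Origin-Embedder-Policy", "value": "require-corp" },
        { "key": "Cross-Origin-Opener-Policy", "value": "same-origin" }
      ]
    }
  ]
}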

2. Memory Management

Browsers aren't designed for 4GB models. I had to:

  • Clear the vector store before loading new documents (see the sketch below)
  • Implement proper cleanup for embeddings
  • Handle model caching effectively
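
Clearing the vector store is plain IndexedDB work. A minimal sketch, assuming an object store named "chunks" (the names are illustrative, not the repo's actual schema):

// Wipe all stored chunks before ingesting a new document.
function clearChunks(db: IDBDatabase): Promise<void> {
  return new Promise((resolve, reject) => {
    const tx = db.transaction("chunks", "readwrite");
    tx.objectStore("chunks").clear();
    tx.oncomplete = () => resolve();
    tx.onerror = () => reject(tx.error);
  });
}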

3. WebGPU Compatibility

Not all browsers support WebGPU yet. Falling back to WebAssembly works, but it's significantly slower, so I added detection logic to guide users toward compatible browsers.
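
The detection boils down to checking for navigator.gpu and trying to obtain an adapter; a sketch of that check (the exact logic in the repo may differ):

// Returns true if the browser exposes WebGPU and a GPU adapter is available.
async function hasWebGPU(): Promise<boolean> {
  if (!("gpu" in navigator)) return false;
  try {
    const adapter = await navigator.gpu.requestAdapter();
    return adapter !== null;
  } catch {
    return false;
  }
}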

What I'd Do Differently

  1. Implement proper vector indexing - Currently brute force cosine similarity
  2. Add model quantization options - Let users choose speed vs quality
  3. Better chunking strategies - Currently just splitting at 500 chars; a sentence-aware version is sketched below
  4. Streaming for large documents - Don't embed everything at once
  5. Support for multiple document formats - Not just PDFs
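
As an example of point 3, here is a sentence-aware chunker that packs whole sentences into ~500-character pieces instead of cutting mid-sentence (a sketch, not code from the repo):

// Split on sentence boundaries, then pack sentences into chunks of roughly maxChars.
function chunkBySentence(text: string, maxChars = 500): string[] {
  const sentences = text.match(/[^.!?]+[.!?]+(\s|$)/g) ?? [text];
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    if (current.length + sentence.length > maxChars && current) {
      chunks.push(current.trim());
      current = "";
    }
    current += sentence;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}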

Privacy Win

One unexpected benefit: Complete privacy by default.

Your documents never leave your device. No API calls, no server uploads, no tracking. Everything happens in your browser.

This makes it useful for:

  • Sensitive documents (legal, medical, personal)
  • Offline environments
  • Privacy-conscious users
  • Demos without infrastructure costs

Try It Yourself

🔗 Live Demo: https://webpizza-ai-poc.vercel.app/

Requirements:

  • Chrome/Edge 113+ (WebGPU support)
  • 4GB+ RAM
  • Modern GPU (or patience for CPU fallback)

Quick start:

git clone https://github.com/stramanu/webpizza-ai-poc
cd webpizza-ai-poc
npm install
npm start

Closing Thoughts

This is a proof-of-concept, not production software. It has bugs, rough edges, and questionable architectural decisions.

But it proves that browser-based AI is getting real. WebGPU + WebAssembly + modern JS frameworks = surprisingly capable local inference.

What would you build with this stack?


Inspired by: DataPizza AI

Questions? Issues? Drop a comment or open an issue on GitHub!
