Emanuele Strazzullo

Building a Browser-Based RAG System with WebGPU

I built a proof-of-concept that lets you chat with PDF documents using AI models running entirely in your browser via WebGPU. No backend, no API keys, complete privacy.

🔗 Demo: https://webpizza-ai-poc.vercel.app/
📦 Code: https://github.com/stramanu/webpizza-ai-poc

Why?

I've been following the progress of WebGPU and WebLLM, and I was curious: Can we run a full RAG pipeline in the browser?

RAG (Retrieval-Augmented Generation) typically requires:

  1. A vector database
  2. An embedding model
  3. A language model
  4. Orchestration logic

Turns out, modern browsers can handle all of this!

Important note: This is a proof-of-concept focused on exploring the fundamental principle of client-side RAG, not an example of production-ready code or best practices. The goal was experimentation with WebGPU and LLMs in the browser, so expect rough edges and architectural shortcuts.

The Stack

  • Frontend: Angular 20 (standalone components, zoneless)
  • LLM: WebLLM v0.2.79 + WeInfer (optimized fork)
  • Embeddings: Transformers.js (all-MiniLM-L6-v2)
  • Vector Store: IndexedDB with cosine similarity
  • PDF Parser: PDF.js
  • Deployment: Vercel

How It Works

1. Model Loading

WebLLM downloads pre-compiled MLC models (Phi-3, Llama, Mistral). The first load is slow (models weigh 1-4 GB), but they are cached by the browser, so subsequent loads are much faster.

await this.llm.initialize();
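
Under the hood this wraps WebLLM's engine creation. A minimal sketch of what that looks like with the standard @mlc-ai/web-llm API (the model ID and progress handling here are illustrative, not necessarily what the repo uses):

import { CreateMLCEngine, MLCEngine } from "@mlc-ai/web-llm";

// Downloads (or loads from cache) a prebuilt MLC model and compiles it for WebGPU.
async function loadModel(): Promise<MLCEngine> {
  return CreateMLCEngine("Phi-3-mini-4k-instruct-q4f16_1-MLC", {
    // Reports download/compile progress so the UI can show a loading bar.
    initProgressCallback: (report) => console.log(report.text),
  });
}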

2. Document Ingestion

Upload a PDF → Parse with PDF.js → Chunk into ~500 char pieces → Embed each chunk → Store in IndexedDB.

const chunks = await this.parser.parseFile(file);
for (const text of chunks) {
  const embedding = await this.embedder.embed(text);
  await this.vectorStore.addChunk({ text, embedding });
}
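
For reference, embedder.embed can be built directly on Transformers.js. A sketch using the library's standard feature-extraction pipeline (the wrapper function itself is hypothetical):

import { pipeline } from "@xenova/transformers";

// all-MiniLM-L6-v2 produces 384-dimensional sentence embeddings.
const extractor = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");

async function embed(text: string): Promise<number[]> {
  // Mean-pool the token embeddings and L2-normalize them, so cosine
  // similarity later reduces to a dot product.
  const output = await extractor(text, { pooling: "mean", normalize: true });
  return Array.from(output.data as Float32Array);
}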

3. Query Processing

User asks question → Embed query → Similarity search in IndexedDB → Get top-k chunks → Feed to LLM with context.

const queryEmbedding = await this.embedder.embed(question);
const relevantChunks = await this.vectorStore.search(queryEmbedding, 3); // top-k
const context = relevantChunks.map(c => c.text).join('\n');
const response = await this.llm.generate(context, question);
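
The search itself is brute force: every stored chunk is scored against the query embedding with cosine similarity and the k best are kept. A minimal sketch, assuming the chunks have already been loaded from IndexedDB into memory (the interface and function names are illustrative):

interface Chunk {
  text: string;
  embedding: number[];
}

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every chunk and keep the k best matches.
function topK(chunks: Chunk[], query: number[], k: number): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosine(chunk.embedding, query) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((entry) => entry.chunk);
}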

Challenges

1. Cross-Origin Isolation

The stack relies on SharedArrayBuffer, which browsers only expose on cross-origin isolated pages, so the app has to be served with these headers:

Cross-Origin-Embedder-Policy: require-corp
Cross-Origin-Opener-Policy: same-origin

Vercel makes this easy with vercel.json, but cross-origin isolation breaks any external resource that isn't served with matching CORS/CORP headers.
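
In vercel.json that boils down to a headers rule along these lines (the exact source pattern in the repo may differ):

{
  "headers": [
    {
      "source": "/(.*)",
      "headers": [
        { "key": "Cross-Origin-Embedder-Policy", "value": "require-corp" },
        { "key": "Cross-Origin-Opener-Policy", "value": "same-origin" }
      ]
    }
  ]
}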

2. Memory Management

Browsers aren't designed for 4GB models. I had to:

  • Clear the vector store before loading new documents (see the sketch below)
  • Implement proper cleanup for embeddings
  • Handle model caching effectively
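
Clearing the vector store is plain IndexedDB work. A minimal sketch, assuming an object store named "chunks" (the names are illustrative, not the repo's actual schema):

// Wipe all stored chunks before ingesting a new document.
function clearChunks(db: IDBDatabase): Promise<void> {
  return new Promise((resolve, reject) => {
    const tx = db.transaction("chunks", "readwrite");
    tx.objectStore("chunks").clear();
    tx.oncomplete = () => resolve();
    tx.onerror = () => reject(tx.error);
  });
}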

3. WebGPU Compatibility

Not all browsers support WebGPU yet. Falling back to WebAssembly works, but it's significantly slower, so I added detection logic to guide users toward compatible browsers.
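
The detection boils down to checking for navigator.gpu and trying to obtain an adapter; a sketch of that check (the exact logic in the repo may differ):

// Returns true if the browser exposes WebGPU and a GPU adapter is available.
async function hasWebGPU(): Promise<boolean> {
  if (!("gpu" in navigator)) return false;
  try {
    const adapter = await navigator.gpu.requestAdapter();
    return adapter !== null;
  } catch {
    return false;
  }
}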

What I'd Do Differently

  1. Implement proper vector indexing - Currently brute force cosine similarity
  2. Add model quantization options - Let users choose speed vs quality
  3. Better chunking strategies - Currently just splitting at 500 chars; a sentence-aware version is sketched below
  4. Streaming for large documents - Don't embed everything at once
  5. Support for multiple document formats - Not just PDFs
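
As an example of point 3, here is a sentence-aware chunker that packs whole sentences into ~500-character pieces instead of cutting mid-sentence (a sketch, not code from the repo):

// Split on sentence boundaries, then pack sentences into chunks of roughly maxChars.
function chunkBySentence(text: string, maxChars = 500): string[] {
  const sentences = text.match(/[^.!?]+[.!?]+(\s|$)/g) ?? [text];
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    if (current.length + sentence.length > maxChars && current) {
      chunks.push(current.trim());
      current = "";
    }
    current += sentence;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}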

Privacy Win

One unexpected benefit: Complete privacy by default.

Your documents never leave your device. No API calls, no server uploads, no tracking. Everything happens in your browser.

This makes it useful for:

  • Sensitive documents (legal, medical, personal)
  • Offline environments
  • Privacy-conscious users
  • Demos without infrastructure costs

Try It Yourself

🔗 Live Demo: https://webpizza-ai-poc.vercel.app/

Requirements:

  • Chrome/Edge 113+ (WebGPU support)
  • 4GB+ RAM
  • Modern GPU (or patience for CPU fallback)

Quick start:

git clone https://github.com/stramanu/webpizza-ai-poc
cd webpizza-ai-poc
npm install
npm start

Closing Thoughts

This is a proof-of-concept, not production software. It has bugs, rough edges, and questionable architectural decisions.

But it proves that browser-based AI is getting real. WebGPU + WebAssembly + modern JS frameworks = surprisingly capable local inference.

What would you build with this stack?


Inspired by: DataPizza AI

Questions? Issues? Drop a comment or open an issue on GitHub!
