Most AI chat widgets are just frontends for a remote API.
This one is different. My assistant runs its core retrieval and generation pipeline inside the browser, using WebLLM, Web Workers, WASM, ONNX Runtime Web, and a local RAG architecture built in Next.js.
You can try it here:
https://databro.dev/?chat=open
What makes this fun is not just that it works locally.
It is that the browser is doing real AI work:
- loading model artifacts
- reusing browser cache
- embedding queries
- retrieving relevant chunks
- reranking candidates
- generating grounded answers
- returning data back to the UI
That turns the browser from a thin client into an actual inference runtime.
## Why I built it this way
Most website assistants follow the same pattern:
- User enters a prompt.
- Frontend sends it to a backend.
- Backend calls an LLM API.
- Response comes back.
That works. But it also means:
- extra round trips
- recurring inference cost
- more infrastructure
- more privacy tradeoffs
- less control over local behavior
I wanted a different architecture: a browser-local AI assistant that could answer from a curated knowledge base without depending on server-side inference for the main path.
That is where WebLLM, Web Workers, WASM, ONNX Runtime Web, and RAG start fitting together really well.
## The core idea
This is not "an LLM in a webpage."
It is a layered browser-native AI system where each part has a very specific role:
- Next.js widget → chat UI and state
- Web Worker → orchestration and background execution
- WebLLM → local generation runtime
- WASM → efficient low-level browser execution
- ONNX Runtime Web → browser inference for embedding and reranking tasks
- RAG pipeline → grounding answers against the knowledge base
- Caching → making repeat sessions practical
Once I started thinking about the architecture this way, the implementation became much cleaner.
## What WebLLM actually is
A lot of people hear "WebLLM" and assume it is the model.
It is not.
WebLLM is the browser-side runtime used to load and execute supported language models locally.
That means the model and the runtime are two different things:
- WebLLM = execution engine
- Llama / Phi / Gemma / Mistral = model loaded into that engine
This distinction matters because it changes how you think about browser inference.
You are not calling a model API.
You are creating a local runtime, loading a model into it, and then sending prompt messages into that runtime.
That framing made a huge difference for me.
## Does WebLLM need to be downloaded?
Yes, and this is one of the most important practical details in browser-local AI.
On first use, the browser usually needs to download:
- the selected model artifacts
- runtime support assets
- related files required to initialize inference
That means browser-local AI comes with a real first-run cost.
But after that, things get better fast.
Once those assets are cached, later sessions are much faster. This is one of the biggest UX wins in local inference: the browser starts behaving more like an application runtime than a stateless page.
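The cache-aware cold-start/warm-start decision can be sketched as a pure function. The manifest shape and cache-key naming below are illustrative assumptions, not WebLLM's actual cache format; in a real browser you would compare against keys from the Cache API.

```typescript
// Sketch: deciding which model artifacts still need a first-run download.
// The manifest shape here is an assumption for illustration, not WebLLM's
// actual cache layout.

interface ArtifactManifest {
  modelId: string;
  files: string[]; // e.g. weight shards, tokenizer, config
}

/** Return the artifact URLs that are not yet cached. */
function missingArtifacts(
  manifest: ArtifactManifest,
  cachedUrls: Set<string>,
): string[] {
  return manifest.files.filter((url) => !cachedUrls.has(url));
}

/** A warm session is one where nothing needs to be downloaded. */
function isWarmSession(
  manifest: ArtifactManifest,
  cachedUrls: Set<string>,
): boolean {
  return missingArtifacts(manifest, cachedUrls).length === 0;
}
```

In the browser, `cachedUrls` would come from something like `caches.open(...)` followed by `cache.keys()`; the decision logic itself stays the same.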
## How the WebLLM runtime gets created
At a high level, the runtime lifecycle looks like this:
- Create the WebLLM engine.
- Select a supported model.
- Download artifacts if they are not already cached.
- Load the model into the engine.
- Send structured prompt messages for generation.
- Return the generated output.
So the runtime is not just a helper utility.
It is the execution container for the model.
That is why I think of WebLLM as a browser-native inference runtime rather than a simple wrapper library.
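The lifecycle above can be sketched against a minimal engine interface. The shape loosely mirrors WebLLM's OpenAI-style chat API (in the real library you would create the engine with `CreateMLCEngine` from `@mlc-ai/web-llm`); `EchoEngine` is a test stand-in so the sketch runs anywhere, not the real runtime.

```typescript
// Sketch of the runtime lifecycle: create an engine, load a model,
// then send structured prompt messages into it.

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface GenerationEngine {
  loadModel(modelId: string): Promise<void>;
  generate(messages: ChatMessage[]): Promise<string>;
}

/** Runtime-first flow: load the model into the engine, then prompt it. */
async function runPrompt(
  engine: GenerationEngine,
  modelId: string,
  userPrompt: string,
): Promise<string> {
  await engine.loadModel(modelId); // downloads artifacts on a cold start
  return engine.generate([
    { role: "system", content: "Answer only from the provided context." },
    { role: "user", content: userPrompt },
  ]);
}

// Minimal stand-in engine so the lifecycle can be exercised anywhere.
class EchoEngine implements GenerationEngine {
  private model: string | null = null;
  async loadModel(modelId: string): Promise<void> {
    this.model = modelId;
  }
  async generate(messages: ChatMessage[]): Promise<string> {
    if (!this.model) throw new Error("model not loaded");
    return `[${this.model}] ${messages[messages.length - 1].content}`;
  }
}
```

The point of the shape: the model id is an argument to the runtime, not the runtime itself.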
## Why WASM matters
WASM (WebAssembly) is one of the hidden pillars of browser-local AI.
A lot of browser AI articles mention it in passing, but it deserves more attention than that.
WASM gives the browser a compact, efficient way to execute compute-heavy logic closer to native speed than ordinary JavaScript. That matters because local inference is not light work.
Tasks like these become much more realistic with performant browser execution paths:
- model runtime support
- tensor-heavy execution
- embedding pipelines
- reranking workloads
- token generation infrastructure
Without efficient lower-level execution, the entire local inference stack becomes much harder to make practical.
## WebLLM vs WASM vs ONNX Runtime Web
These are related, but they are not the same thing.
A simple way to separate them:
| Layer | Responsibility |
|---|---|
| WebLLM | Local runtime for browser-based LLM generation |
| WASM | Efficient low-level execution layer in the browser |
| ONNX Runtime Web | Browser inference runtime for ONNX-backed model workloads |
| Web Worker | Background execution boundary that protects UI responsiveness |
So WASM is not competing with WebLLM.
It is one of the technologies helping make browser-native inference feasible.
## What ONNX Runtime Web is doing here
One of the easiest mistakes in local AI architecture is treating every model task as if it belongs to the same runtime.
It does not.
Generation is one kind of workload.
Embedding and reranking are different workloads.
That is why I like this split:
- WebLLM for generation
- ONNX Runtime Web for retrieval-side transformer execution
This is a strong design because retrieval-side tasks often need a different execution path than token-by-token LLM generation.
In practice, browser-local RAG is rarely "one model doing everything."
It is a pipeline of specialized responsibilities.
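To make the retrieval-side workload concrete, here is a dependency-free sketch of similarity search over precomputed chunk vectors. In the real pipeline the vectors come from an embedding model executed under ONNX Runtime Web; plain arrays are used here so the logic stands alone.

```typescript
// Retrieval-side sketch: cosine similarity over precomputed KB vectors.

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  // Guard against zero vectors to avoid dividing by zero.
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

interface Chunk {
  id: string;
  vector: number[];
}

/** Return the ids of the k chunks most similar to the query vector. */
function topK(query: number[], chunks: Chunk[], k: number): string[] {
  return chunks
    .map((c) => ({ id: c.id, score: cosine(query, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((c) => c.id);
}
```

This is a different execution profile from token-by-token generation: one forward pass per query, then cheap vector math, which is why it can live on a separate runtime.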
## Why Web Workers are non-negotiable
If you run browser-local AI on the main thread, the UI will eventually remind you that this was a bad idea.
Maybe not immediately.
Maybe not on your development machine.
But once model loading, chunk scoring, reranking, and generation pile up, the experience starts to degrade fast.
That is why Web Workers are essential.
A worker gives you a separate execution context for heavy tasks so the main thread can stay focused on:
- rendering
- input handling
- scrolling and interaction
- animation
- state updates
For AI-heavy browser apps, that separation is not a nice-to-have.
It is architecture.
## Creating a Web Worker in Next.js
Workers are browser APIs, so they should only be created on the client side.
That means your widget component should be client-rendered and the worker should be created lazily when the chat experience actually begins.
This pattern works especially well in Next.js because it lets you keep rendering concerns in the UI layer while moving heavy orchestration into a background execution boundary.
I also prefer lazy worker creation because it avoids paying the initialization cost for users who never open the assistant.
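The lazy, create-once pattern can be reduced to a small helper. The factory is injected so the pattern is testable outside a browser; in the actual widget you would pass something like `() => new Worker(new URL("./assistant.worker.ts", import.meta.url))` from a client component (the file name is illustrative).

```typescript
// Lazy, create-once access to an expensive resource (here: a worker).
// Nothing is constructed until the first call, so users who never open
// the assistant never pay the initialization cost.

function lazySingleton<T>(create: () => T): () => T {
  let instance: T | null = null;
  return () => (instance ??= create());
}

// Hypothetical usage in a client-only module:
// const getAssistantWorker = lazySingleton(
//   () => new Worker(new URL("./assistant.worker.ts", import.meta.url)),
// );
```

Guarding worker creation behind a client component (or a `typeof window !== "undefined"` check) keeps the code out of server rendering entirely.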
## Widget-to-worker messaging
Once the worker exists, the widget should communicate with it using structured messages rather than trying to share runtime state directly.
That message boundary matters a lot.
The UI sends things like:
- prompt text
- serialized KB context
- request identifier
The worker sends back:
- final answer
- citation metadata
- failure state if something breaks
That separation keeps the frontend simpler and makes the worker a clean orchestration boundary instead of an implementation detail leaking into the UI.
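A discriminated union is a natural way to type that message boundary. The field names below are illustrative, not the actual wire format of my widget.

```typescript
// Sketch of a structured widget <-> worker protocol.

type WidgetToWorker = {
  type: "ask";
  requestId: string;
  prompt: string;
  kbContext: string; // serialized KB context
};

type WorkerToWidget =
  | { type: "answer"; requestId: string; text: string; citations: string[] }
  | { type: "error"; requestId: string; message: string };

/** Route a worker reply to the right UI action (exhaustive switch). */
function describeReply(msg: WorkerToWidget): string {
  switch (msg.type) {
    case "answer":
      return `answer for ${msg.requestId} with ${msg.citations.length} citation(s)`;
    case "error":
      return `failure for ${msg.requestId}: ${msg.message}`;
  }
}
```

With `postMessage` on both sides carrying only these shapes, the worker's internals can change freely without touching the UI.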
## Worker orchestration is where the real engineering happens
The worker is not just a background script.
It is the orchestration layer of the entire local AI lifecycle.
This is where things become much more than "I loaded a model in the browser."
The worker is responsible for coordinating:
- model/runtime initialization
- cache-aware model reuse
- KB artifact loading
- embedding availability
- retrieval and score fusion
- reranking
- confidence checks
- prompt assembly
- answer generation
- packaging result metadata back to the widget
That orchestration layer is what transforms separate browser AI technologies into an actual product.
This, honestly, is where most of the engineering value lives.
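The orchestration itself is mostly sequencing. Here is a sketch with every stage injected, so the flow can be exercised with fakes; in the real worker these stages wrap the ONNX Runtime Web and WebLLM calls. The stage names and fallback message are illustrative.

```typescript
// Orchestration sketch: one pipeline per request. Expensive objects
// (engines, pipelines) live behind the injected stages and are created
// once by the worker, not per request.

interface Stages {
  embed(query: string): Promise<number[]>;
  retrieve(vector: number[]): Promise<string[]>;
  rerank(query: string, chunks: string[]): Promise<string[]>;
  generate(query: string, context: string[]): Promise<string>;
}

async function answer(stages: Stages, question: string): Promise<string> {
  const vector = await stages.embed(question);          // embedding model
  const candidates = await stages.retrieve(vector);     // hybrid retrieval
  const best = await stages.rerank(question, candidates); // reranker
  // Confidence gate: weak context means a safe fallback, not a guess.
  if (best.length === 0) {
    return "I don't have enough context to answer that.";
  }
  return stages.generate(question, best);               // WebLLM generation
}
```

Because the stages are an interface, the same pipeline can be unit-tested on the main thread and executed for real inside the worker.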
## What the worker is really managing
The worker owns the lifecycle of the expensive parts of the system.
That typically includes:
- the generation engine
- embedding model access
- reranker access
- parsed or cached KB context
- warm in-memory session state
This is important because the main thread should not be responsible for managing heavy AI runtime objects.
The UI should care about:
- input
- loading states
- response rendering
- citations
- interaction flow
The worker should care about:
- initialization
- orchestration
- reuse
- inference sequencing
That separation is one of the biggest reasons the app feels stable instead of fragile.
## Overall architecture
## Prompt lifecycle
Letโs walk the whole journey.
1) The user opens the widget
The Next.js app renders the chat interface and waits for interaction.
2) The worker is created lazily
Only when the user opens or uses the assistant does the app create the worker.
3) The worker warms up the AI stack
It checks whether engines, pipelines, and context state already exist.
4) Browser cache is consulted
If model assets are already cached, startup is faster.
If not, the first-run downloads happen here.
5) KB vectors are loaded
The worker loads precomputed vector artifacts or rebuilds what it needs.
6) The user enters a prompt
The widget sends the prompt and context payload to the worker.
7) The embedding model encodes the query
The prompt is turned into a dense vector representation.
8) Retrieval begins
Dense retrieval and sparse retrieval identify candidate KB chunks.
9) Hybrid scoring narrows the pool
The system fuses semantic and lexical signals.
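The post does not name its exact fusion method; one common, rank-based way to fuse dense and sparse result lists is Reciprocal Rank Fusion (RRF), shown here as an illustrative choice.

```typescript
// Reciprocal Rank Fusion: combine multiple rankings of chunk ids.
// Each ranking contributes 1 / (k + rank + 1) to a chunk's score,
// so chunks that rank well in several lists rise to the top.

function rrfFuse(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

RRF is attractive in the browser because it needs only ranks, not comparable raw scores, so dense and lexical retrievers never have to be calibrated against each other.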
10) Reranking refines the candidates
The best chunks are rescored for prompt-specific usefulness.
11) Confidence gating runs
If the candidates are weak, the system can fall back safely.
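A confidence gate can be as simple as a threshold on the best rerank score. The threshold value below is an illustrative assumption, not the one the assistant actually uses.

```typescript
// Confidence gate sketch: if even the best candidate is weak, return null
// so the caller can take a safe fallback path instead of generating from
// poor context. The 0.35 threshold is illustrative.

interface ScoredChunk {
  id: string;
  score: number; // reranker score, assumed sorted descending
}

function confidenceGate(
  candidates: ScoredChunk[],
  threshold = 0.35,
): ScoredChunk[] | null {
  const best = candidates[0]?.score ?? 0;
  return best >= threshold ? candidates : null;
}
```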
12) Grounded context is assembled
The final chunk set is turned into the context window for generation.
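Context assembly is essentially greedy packing under a budget. Real systems budget tokens; the sketch uses characters and a hypothetical separator to stay dependency-free.

```typescript
// Grounded-context sketch: keep chunks in ranked order until the budget
// is exhausted. Character budget and "---" separator are illustrative;
// a production system would count tokens instead.

function buildContext(rankedChunks: string[], maxChars: number): string {
  const kept: string[] = [];
  let used = 0;
  for (const chunk of rankedChunks) {
    if (used + chunk.length > maxChars) break; // budget exhausted
    kept.push(chunk);
    used += chunk.length;
  }
  return kept.join("\n---\n");
}
```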
13) WebLLM executes generation
The worker sends system rules, prompt, and grounded context into the local runtime.
14) The worker packages the result
The answer text and citation metadata are returned to the widget.
15) The UI renders the final response
The user receives a grounded answer without needing the main inference path to leave the browser.
That full lifecycle is what turns a browser-local model into a practical assistant.
## What I learned building this
Here is the short version of what mattered most:
- Treat WebLLM as a runtime, not as "the model."
- Expect first-run downloads and design for cache reuse.
- Keep heavy work off the main thread.
- Use workers as orchestration boundaries, not just compute bins.
- Separate generation from retrieval-side inference.
- Precompute KB vectors whenever possible.
- Use reranking if grounded quality matters.
- Add confidence gates before you need them.
- Design for warm sessions, not just cold starts.
These are the choices that made the assistant feel like a product instead of a demo.
## Final thought
The most exciting part of this project is not that one library made it possible.
It is that several browser-native technologies now fit together well enough to build a real local AI product.
That stack, for me, looks like this:
- WebLLM for generation
- WASM for efficient browser execution
- ONNX Runtime Web for embedding and reranking paths
- Web Workers for orchestration
- RAG for grounding
Put together, they turn the browser into something much more powerful than a UI shell.
They turn it into the runtime.

