Most AI chat widgets are just frontends for a remote API.
This one is different. My assistant runs its core retrieval and generation pipeline inside the browser, using WebLLM, Web Workers, WASM, ONNX Runtime Web, and a local RAG architecture built in Next.js.
You can try it here:
https://databro.dev/?chat=open
What makes this fun is not just that it works locally.
It is that the browser is doing real AI work:
- loading model artifacts
- reusing browser cache
- embedding queries
- retrieving relevant chunks
- reranking candidates
- generating grounded answers
- returning data back to the UI
That turns the browser from a thin client into an actual inference runtime.
## Why I built it this way
Most website assistants follow the same pattern:
- User enters a prompt.
- Frontend sends it to a backend.
- Backend calls an LLM API.
- Response comes back.
That works. But it also means:
- extra round trips
- recurring inference cost
- more infrastructure
- more privacy tradeoffs
- less control over local behavior
I wanted a different architecture: a browser-local AI assistant that could answer from a curated knowledge base without depending on server-side inference for the main path.
That is where WebLLM, Web Workers, WASM, ONNX Runtime Web, and RAG start fitting together really well.
## The core idea
This is not "an LLM in a webpage."
It is a layered browser-native AI system where each part has a very specific role:
- Next.js widget → chat UI and state
- Web Worker → orchestration and background execution
- WebLLM → local generation runtime
- WASM → efficient low-level browser execution
- ONNX Runtime Web → browser inference for embedding and reranking tasks
- RAG pipeline → grounding answers against the knowledge base
- Caching → making repeat sessions practical
Once I started thinking about the architecture this way, the implementation became much cleaner.
## What WebLLM actually is
A lot of people hear "WebLLM" and assume it is the model.
It is not.
WebLLM is the browser-side runtime used to load and execute supported language models locally.
That means the model and the runtime are two different things:
- WebLLM = execution engine
- Llama / Phi / Gemma / Mistral = model loaded into that engine
This distinction matters because it changes how you think about browser inference.
You are not calling a model API.
You are creating a local runtime, loading a model into it, and then sending prompt messages into that runtime.
That framing made a huge difference for me.
## Does WebLLM need to be downloaded?
Yes, and this is one of the most important practical details in browser-local AI.
On first use, the browser usually needs to download:
- the selected model artifacts
- runtime support assets
- related files required to initialize inference
That means browser-local AI comes with a real first-run cost.
But after that, things get better fast.
Once those assets are cached, later sessions are much faster. This is one of the biggest UX wins in local inference: the browser starts behaving more like an application runtime than a stateless page.
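The cache-aware cold-start/warm-start decision can be sketched as a pure function. The manifest shape and cache-key naming below are illustrative assumptions, not WebLLM's actual cache format; in a real browser you would compare against keys from the Cache API.

```typescript
// Sketch: deciding which model artifacts still need a first-run download.
// The manifest shape here is an assumption for illustration, not WebLLM's
// actual cache layout.

interface ArtifactManifest {
  modelId: string;
  files: string[]; // e.g. weight shards, tokenizer, config
}

/** Return the artifact URLs that are not yet cached. */
function missingArtifacts(
  manifest: ArtifactManifest,
  cachedUrls: Set<string>,
): string[] {
  return manifest.files.filter((url) => !cachedUrls.has(url));
}

/** A warm session is one where nothing needs to be downloaded. */
function isWarmSession(
  manifest: ArtifactManifest,
  cachedUrls: Set<string>,
): boolean {
  return missingArtifacts(manifest, cachedUrls).length === 0;
}
```

In the browser, `cachedUrls` would come from something like `caches.open(...)` followed by `cache.keys()`; the decision logic itself stays the same.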
## How the WebLLM runtime gets created
At a high level, the runtime lifecycle looks like this:
- Create the WebLLM engine.
- Select a supported model.
- Download artifacts if they are not already cached.
- Load the model into the engine.
- Send structured prompt messages for generation.
- Return the generated output.
So the runtime is not just a helper utility.
It is the execution container for the model.
That is why I think of WebLLM as a browser-native inference runtime rather than a simple wrapper library.
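The lifecycle above can be sketched against a minimal engine interface. The shape loosely mirrors WebLLM's OpenAI-style chat API (in the real library you would create the engine with `CreateMLCEngine` from `@mlc-ai/web-llm`); `EchoEngine` is a test stand-in so the sketch runs anywhere, not the real runtime.

```typescript
// Sketch of the runtime lifecycle: create an engine, load a model,
// then send structured prompt messages into it.

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface GenerationEngine {
  loadModel(modelId: string): Promise<void>;
  generate(messages: ChatMessage[]): Promise<string>;
}

/** Runtime-first flow: load the model into the engine, then prompt it. */
async function runPrompt(
  engine: GenerationEngine,
  modelId: string,
  userPrompt: string,
): Promise<string> {
  await engine.loadModel(modelId); // downloads artifacts on a cold start
  return engine.generate([
    { role: "system", content: "Answer only from the provided context." },
    { role: "user", content: userPrompt },
  ]);
}

// Minimal stand-in engine so the lifecycle can be exercised anywhere.
class EchoEngine implements GenerationEngine {
  private model: string | null = null;
  async loadModel(modelId: string): Promise<void> {
    this.model = modelId;
  }
  async generate(messages: ChatMessage[]): Promise<string> {
    if (!this.model) throw new Error("model not loaded");
    return `[${this.model}] ${messages[messages.length - 1].content}`;
  }
}
```

The point of the shape: the model id is an argument to the runtime, not the runtime itself.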
## Why WASM matters
WASM (WebAssembly) is one of the hidden pillars of browser-local AI.
A lot of browser AI articles mention it in passing, but it deserves more attention than that.
WASM gives the browser a compact, efficient way to execute compute-heavy logic closer to native speed than ordinary JavaScript. That matters because local inference is not light work.
Tasks like these become much more realistic with performant browser execution paths:
- model runtime support
- tensor-heavy execution
- embedding pipelines
- reranking workloads
- token generation infrastructure
Without efficient lower-level execution, the entire local inference stack becomes much harder to make practical.
## WebLLM vs WASM vs ONNX Runtime Web
These are related, but they are not the same thing.
A simple way to separate them:
| Layer | Responsibility |
|---|---|
| WebLLM | Local runtime for browser-based LLM generation |
| WASM | Efficient low-level execution layer in the browser |
| ONNX Runtime Web | Browser inference runtime for ONNX-backed model workloads |
| Web Worker | Background execution boundary that protects UI responsiveness |
So WASM is not competing with WebLLM.
It is one of the technologies helping make browser-native inference feasible.
## What ONNX Runtime Web is doing here
One of the easiest mistakes in local AI architecture is treating every model task as if it belongs to the same runtime.
It does not.
Generation is one kind of workload.
Embedding and reranking are different workloads.
That is why I like this split:
- WebLLM for generation
- ONNX Runtime Web for retrieval-side transformer execution
This is a strong design because retrieval-side tasks often need a different execution path than token-by-token LLM generation.
In practice, browser-local RAG is rarely "one model doing everything."
It is a pipeline of specialized responsibilities.
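To make the retrieval-side workload concrete, here is a dependency-free sketch of similarity search over precomputed chunk vectors. In the real pipeline the vectors come from an embedding model executed under ONNX Runtime Web; plain arrays are used here so the logic stands alone.

```typescript
// Retrieval-side sketch: cosine similarity over precomputed KB vectors.

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  // Guard against zero vectors to avoid dividing by zero.
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

interface Chunk {
  id: string;
  vector: number[];
}

/** Return the ids of the k chunks most similar to the query vector. */
function topK(query: number[], chunks: Chunk[], k: number): string[] {
  return chunks
    .map((c) => ({ id: c.id, score: cosine(query, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((c) => c.id);
}
```

This is a different execution profile from token-by-token generation: one forward pass per query, then cheap vector math, which is why it can live on a separate runtime.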
## Why Web Workers are non-negotiable
If you run browser-local AI on the main thread, the UI will eventually remind you that this was a bad idea.
Maybe not immediately.
Maybe not on your development machine.
But once model loading, chunk scoring, reranking, and generation pile up, the experience starts to degrade fast.
That is why Web Workers are essential.
A worker gives you a separate execution context for heavy tasks so the main thread can stay focused on:
- rendering
- input handling
- scrolling and interaction
- animation
- state updates
For AI-heavy browser apps, that separation is not a nice-to-have.
It is architecture.
## Creating a Web Worker in Next.js
Workers are browser APIs, so they should only be created on the client side.
That means your widget component should be client-rendered and the worker should be created lazily when the chat experience actually begins.
This pattern works especially well in Next.js because it lets you keep rendering concerns in the UI layer while moving heavy orchestration into a background execution boundary.
I also prefer lazy worker creation because it avoids paying the initialization cost for users who never open the assistant.
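The lazy, create-once pattern can be reduced to a small helper. The factory is injected so the pattern is testable outside a browser; in the actual widget you would pass something like `() => new Worker(new URL("./assistant.worker.ts", import.meta.url))` from a client component (the file name is illustrative).

```typescript
// Lazy, create-once access to an expensive resource (here: a worker).
// Nothing is constructed until the first call, so users who never open
// the assistant never pay the initialization cost.

function lazySingleton<T>(create: () => T): () => T {
  let instance: T | null = null;
  return () => (instance ??= create());
}

// Hypothetical usage in a client-only module:
// const getAssistantWorker = lazySingleton(
//   () => new Worker(new URL("./assistant.worker.ts", import.meta.url)),
// );
```

Guarding worker creation behind a client component (or a `typeof window !== "undefined"` check) keeps the code out of server rendering entirely.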
## Widget-to-worker messaging
Once the worker exists, the widget should communicate with it using structured messages rather than trying to share runtime state directly.
That message boundary matters a lot.
The UI sends things like:
- prompt text
- serialized KB context
- request identifier
The worker sends back:
- final answer
- citation metadata
- failure state if something breaks
That separation keeps the frontend simpler and makes the worker a clean orchestration boundary instead of an implementation detail leaking into the UI.
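A discriminated union is a natural way to type that message boundary. The field names below are illustrative, not the actual wire format of my widget.

```typescript
// Sketch of a structured widget <-> worker protocol.

type WidgetToWorker = {
  type: "ask";
  requestId: string;
  prompt: string;
  kbContext: string; // serialized KB context
};

type WorkerToWidget =
  | { type: "answer"; requestId: string; text: string; citations: string[] }
  | { type: "error"; requestId: string; message: string };

/** Route a worker reply to the right UI action (exhaustive switch). */
function describeReply(msg: WorkerToWidget): string {
  switch (msg.type) {
    case "answer":
      return `answer for ${msg.requestId} with ${msg.citations.length} citation(s)`;
    case "error":
      return `failure for ${msg.requestId}: ${msg.message}`;
  }
}
```

With `postMessage` on both sides carrying only these shapes, the worker's internals can change freely without touching the UI.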
## Worker orchestration is where the real engineering happens
The worker is not just a background script.
It is the orchestration layer of the entire local AI lifecycle.
This is where things become much more than "I loaded a model in the browser."
The worker is responsible for coordinating:
- model/runtime initialization
- cache-aware model reuse
- KB artifact loading
- embedding availability
- retrieval and score fusion
- reranking
- confidence checks
- prompt assembly
- answer generation
- packaging result metadata back to the widget
That orchestration layer is what transforms separate browser AI technologies into an actual product.
This, honestly, is where most of the engineering value lives.
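The orchestration itself is mostly sequencing. Here is a sketch with every stage injected, so the flow can be exercised with fakes; in the real worker these stages wrap the ONNX Runtime Web and WebLLM calls. The stage names and fallback message are illustrative.

```typescript
// Orchestration sketch: one pipeline per request. Expensive objects
// (engines, pipelines) live behind the injected stages and are created
// once by the worker, not per request.

interface Stages {
  embed(query: string): Promise<number[]>;
  retrieve(vector: number[]): Promise<string[]>;
  rerank(query: string, chunks: string[]): Promise<string[]>;
  generate(query: string, context: string[]): Promise<string>;
}

async function answer(stages: Stages, question: string): Promise<string> {
  const vector = await stages.embed(question);          // embedding model
  const candidates = await stages.retrieve(vector);     // hybrid retrieval
  const best = await stages.rerank(question, candidates); // reranker
  // Confidence gate: weak context means a safe fallback, not a guess.
  if (best.length === 0) {
    return "I don't have enough context to answer that.";
  }
  return stages.generate(question, best);               // WebLLM generation
}
```

Because the stages are an interface, the same pipeline can be unit-tested on the main thread and executed for real inside the worker.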
## What the worker is really managing
The worker owns the lifecycle of the expensive parts of the system.
That typically includes:
- the generation engine
- embedding model access
- reranker access
- parsed or cached KB context
- warm in-memory session state
This is important because the main thread should not be responsible for managing heavy AI runtime objects.
The UI should care about:
- input
- loading states
- response rendering
- citations
- interaction flow
The worker should care about:
- initialization
- orchestration
- reuse
- inference sequencing
That separation is one of the biggest reasons the app feels stable instead of fragile.
## Overall architecture
## Prompt lifecycle
Letโs walk the whole journey.
1) The user opens the widget
The Next.js app renders the chat interface and waits for interaction.
2) The worker is created lazily
Only when the user opens or uses the assistant does the app create the worker.
3) The worker warms up the AI stack
It checks whether engines, pipelines, and context state already exist.
4) Browser cache is consulted
If model assets are already cached, startup is faster.
If not, the first-run downloads happen here.
5) KB vectors are loaded
The worker loads precomputed vector artifacts or rebuilds what it needs.
6) The user enters a prompt
The widget sends the prompt and context payload to the worker.
7) The embedding model encodes the query
The prompt is turned into a dense vector representation.
8) Retrieval begins
Dense retrieval and sparse retrieval identify candidate KB chunks.
9) Hybrid scoring narrows the pool
The system fuses semantic and lexical signals.
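The post does not name its exact fusion method; one common, rank-based way to fuse dense and sparse result lists is Reciprocal Rank Fusion (RRF), shown here as an illustrative choice.

```typescript
// Reciprocal Rank Fusion: combine multiple rankings of chunk ids.
// Each ranking contributes 1 / (k + rank + 1) to a chunk's score,
// so chunks that rank well in several lists rise to the top.

function rrfFuse(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

RRF is attractive in the browser because it needs only ranks, not comparable raw scores, so dense and lexical retrievers never have to be calibrated against each other.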
10) Reranking refines the candidates
The best chunks are rescored for prompt-specific usefulness.
11) Confidence gating runs
If the candidates are weak, the system can fall back safely.
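A confidence gate can be as simple as a threshold on the best rerank score. The threshold value below is an illustrative assumption, not the one the assistant actually uses.

```typescript
// Confidence gate sketch: if even the best candidate is weak, return null
// so the caller can take a safe fallback path instead of generating from
// poor context. The 0.35 threshold is illustrative.

interface ScoredChunk {
  id: string;
  score: number; // reranker score, assumed sorted descending
}

function confidenceGate(
  candidates: ScoredChunk[],
  threshold = 0.35,
): ScoredChunk[] | null {
  const best = candidates[0]?.score ?? 0;
  return best >= threshold ? candidates : null;
}
```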
12) Grounded context is assembled
The final chunk set is turned into the context window for generation.
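Context assembly is essentially greedy packing under a budget. Real systems budget tokens; the sketch uses characters and a hypothetical separator to stay dependency-free.

```typescript
// Grounded-context sketch: keep chunks in ranked order until the budget
// is exhausted. Character budget and "---" separator are illustrative;
// a production system would count tokens instead.

function buildContext(rankedChunks: string[], maxChars: number): string {
  const kept: string[] = [];
  let used = 0;
  for (const chunk of rankedChunks) {
    if (used + chunk.length > maxChars) break; // budget exhausted
    kept.push(chunk);
    used += chunk.length;
  }
  return kept.join("\n---\n");
}
```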
13) WebLLM executes generation
The worker sends system rules, prompt, and grounded context into the local runtime.
14) The worker packages the result
The answer text and citation metadata are returned to the widget.
15) The UI renders the final response
The user receives a grounded answer without needing the main inference path to leave the browser.
That full lifecycle is what turns a browser-local model into a practical assistant.
## What I learned building this
Here is the short version of what mattered most:
- Treat WebLLM as a runtime, not as "the model."
- Expect first-run downloads and design for cache reuse.
- Keep heavy work off the main thread.
- Use workers as orchestration boundaries, not just compute bins.
- Separate generation from retrieval-side inference.
- Precompute KB vectors whenever possible.
- Use reranking if grounded quality matters.
- Add confidence gates before you need them.
- Design for warm sessions, not just cold starts.
These are the choices that made the assistant feel like a product instead of a demo.
## Final thought
The most exciting part of this project is not that one library made it possible.
It is that several browser-native technologies now fit together well enough to build a real local AI product.
That stack, for me, looks like this:
- WebLLM for generation
- WASM for efficient browser execution
- ONNX Runtime Web for embedding and reranking paths
- Web Workers for orchestration
- RAG for grounding
Put together, they turn the browser into something much more powerful than a UI shell.
They turn it into the runtime.

