Daniel Chifamba
Run Your Offline AI Chat: Pure Browser, No Backend

Imagine running a ChatGPT-like AI right here in your browser - completely offline. No server needed, no API calls, just pure browser power. Sounds impossible? Not anymore! Modern browsers have evolved into incredibly capable platforms, and I'm excited to show you just what they can do. Together, we will build a React.js web app that you can chat with, even while you're offline.

In this guide, we'll go through three options:

  1. The Quick Start Approach: a ready-to-use version that works offline after downloading your chosen LLM
  2. The Local Development Path: running the source code locally in your own browser
  3. Building from the Ground Up: making a similar project from scratch using React

Background

"Boring...๐Ÿ˜ด" Okay, If you're eager to jump straight into implementation, feel free to skip to "Running your own local models" ๐Ÿ˜œ

The AI revolution has transformed how we work, and for good reason. Consider this: while it might take a person 20-30 minutes to read and summarize a book, an AI can accomplish the same task in seconds. Many of us have embraced tools like ChatGPT, Gemini, Claude and Grok for this very reason. But what if you want these capabilities without sharing your data with cloud services?

Thanks to the open-source AI community, we have many LLMs (Large Language Models) that are publicly available. Take the latest LLMs released by Meta for example, they are compact enough to run on-device and the good news is they are available for anyone to use. But here's the thing - once you've got these models in your hands, how do you actually use them locally?

Popular solutions like Ollama and LMStudio offer impressive local AI capabilities, making it possible for you to run a model on your device. Ollama offers a nice CLI chat, while LMStudio provides a polished GUI experience. Both support API integration for custom applications. But today, we're exploring something cool - running these models directly in your browser.
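For context, here's roughly what "API integration" with one of those tools looks like. This is a minimal sketch against Ollama's local REST API, assuming Ollama is already running on its default port and you've pulled a model (the model name below is just an example):

// Minimal sketch: ask a locally running Ollama server for a completion.
// Assumes Ollama is listening on http://localhost:11434 and "llama3.2" has been pulled.
const response = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "llama3.2", // example model name - use whichever model you have pulled
    prompt: "Summarize WebAssembly in one sentence.",
    stream: false, // ask for a single JSON response instead of a token stream
  }),
});
const data = await response.json();
console.log(data.response); // the generated text

It works nicely, but it still depends on a separate server process running on your machine - which is exactly what we're about to do away with.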

Running your own local models in your browser without a backend

What makes this approach special is that we're running LLM models directly in your browser - no backend required. The concept is straightforward: download a model file once, then run inference locally, right in your browser. The capabilities of modern browsers are truly remarkable.

This magic is possible thanks to WebAssembly, which allows us to leverage powerful C/C++ tools like llama.cpp.
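If you want to confirm your browser is up to the task, a rough feature check (paste it into your DevTools console) looks like this:

// Rough check for WebAssembly support - all modern browsers should pass this.
const hasWasm =
  typeof WebAssembly === "object" &&
  typeof WebAssembly.instantiate === "function";
console.log(hasWasm ? "WebAssembly is supported" : "No WebAssembly support");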

Drum-roll please 🥁...

Introducing Wllama - an impressive project that provides WebAssembly bindings for llama.cpp, enabling seamless in-browser LLM inference.

Let's begin.

The Quick Start Approach 🚀

Visit the pre-built demo site: https://private-ai-chat-assistant.vercel.app/ and experience it yourself:

  1. Choose your preferred model from the dropdown
  2. Type your prompt
  3. Hit enter

That's all it takes. The page downloads your selected model directly to your device (browser cache), and from there, everything runs locally. Want proof? Try disconnecting from the internet or check your browser's network tab - no external calls, just pure local processing power at work.
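Curious where the model actually lives? It sits in your browser's storage for the site. The exact cache names are an implementation detail of the library, so treat this as a peek rather than an API, but these DevTools console snippets give you a rough picture:

// List the Cache Storage buckets this page has created (names vary by library and version).
console.log(await caches.keys());

// Rough estimate of how much storage this origin is using - a downloaded model
// weighing in at hundreds of MB should be clearly visible here.
const { usage, quota } = await navigator.storage.estimate();
console.log(`Using ~${(usage / 1024 / 1024).toFixed(0)} MB of ~${(quota / 1024 / 1024 / 1024).toFixed(1)} GB`);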


Ready to explore more models? Head over to HuggingFace's GGUF model collection. You'll find a vast array of models ready for use - just download any GGUF file (up to 2GB) and load it through the dropdown.
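If you'd rather check a file before committing to a big download, the repo's "resolve" URL is a plain HTTP endpoint. Here's a small sketch (Node 18+, which ships with fetch) that asks for the size of the SmolLM model used later in this guide - handy for staying under the 2GB limit. Whether the size header is reported can depend on the server, so treat it as best-effort:

// Check a GGUF file's size via a HEAD request before downloading it (Node 18+).
const url =
  "https://huggingface.co/neopolita/smollm-135m-instruct-gguf/resolve/main/smollm-135m-instruct_q8_0.gguf";
const head = await fetch(url, { method: "HEAD" });
const size = Number(head.headers.get("Content-Length") ?? 0);
console.log(size ? `${(size / 1024 / 1024).toFixed(0)} MB` : "Size not reported");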


"That's nice, but I want to run it locally myself", I hear someone say. Sure...I got you covered ๐Ÿ™‚

The Local Development Path 🏠

Let's set up your own local instance.

What You'll Need:

  • Basic HTML, JavaScript, and CSS knowledge
  • A web browser with WebAssembly support (I'm betting yours does 🤞😁)
  • Node.js installed. You can get a copy here and follow the installation steps.

Steps:

  1. Clone the repository:
git clone https://github.com/nadchif/in-browser-llm-inference.git
  2. Install dependencies:
cd in-browser-llm-inference && npm install

Note that it will attempt to download a model during installation to be used as the default option when you run the web app.

  3. Launch the web app:
npm run dev
  4. Navigate to http://localhost:5173/

Have fun!! 😊

Building from the Ground Up 💪

Now, let's demystify what is really going on by building our own implementation from scratch.

1. Create a React App with Vite

First, let's set up our development environment:

npm create vite@latest

When prompted:

  • Name your project (we'll use "browser-llm-chat")
  • Select React as your framework
  • Choose JavaScript as your variant

Then initialize your project:

cd browser-llm-chat
npm install
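Depending on your Vite version, the generated project will look roughly like this - the two files we'll touch are src/App.jsx and src/index.css:

browser-llm-chat/
├── index.html
├── package.json
├── vite.config.js
├── public/
└── src/
    ├── main.jsx
    ├── App.jsx
    ├── App.css
    └── index.css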

2. Design the Basic UI

Let's create a clean, functional interface. Replace the contents of App.jsx with:

import { useState } from "react";

function App() {
  const [prompt, setPrompt] = useState("");
  const [output, setOutput] = useState([]);
  const [isLoading, setIsLoading] = useState(false);
  const [progress, setProgress] = useState(0);

  const handlePromptInputChange = (e) => setPrompt(e.target.value);
  const shouldDisableSubmit = isLoading || prompt.trim().length === 0;


  const submitPrompt = () => {
    // We'll implement this next
  };

  return (
    <div>
      <pre>{output.map(({ role, content }) => `${role}: ${content}\n\n`)}</pre>
      {!output.length && (
        <div>
          {isLoading ? <span>Loading {Math.ceil(progress)}%</span> : <h1>Hi, How may I help you?</h1>}
        </div>
      )}
      <div>
        <input
          type="text"
          value={prompt}
          onChange={handlePromptInputChange}
          placeholder="Enter your prompt here"
        />
        <button type="button" onClick={submitPrompt} disabled={shouldDisableSubmit}>
          <div>โ†’</div>
        </button>
      </div>
    </div>
  );
}

export default App;

Add some basic styling by replacing the contents of index.css with:

body {
  font-family: sans-serif;
  display: flex;
  justify-content: center;
}
pre {
  font-family: sans-serif;
  min-height: 30vh;
  white-space: pre-wrap;
  white-space: -moz-pre-wrap;
  white-space: -pre-wrap;
  white-space: -o-pre-wrap;
}
input {
  padding: 12px 20px;
  border: 1px solid #aaa;
  background-color: #f2f2f2;
}
input, pre {
  width: 60vw;
  min-width: 40vw;
  max-width: 640px;
}
button {
  padding: 12px 20px; 
  background-color: #000;
  color: white; 
}

Output preview

3. Integrate Wllama

Now for the exciting part - let's add AI capabilities to our application:

npm install @wllama/wllama @huggingface/jinja

Update your App.jsx to integrate Wllama:

import { useState } from "react";
import { Wllama } from "@wllama/wllama/esm/wllama";
import wllamaSingleJS from "@wllama/wllama/src/single-thread/wllama.js?url";
import wllamaSingle from "@wllama/wllama/src/single-thread/wllama.wasm?url";
import { Template } from "@huggingface/jinja";

const wllama = new Wllama({
  "single-thread/wllama.js": wllamaSingleJS,
  "single-thread/wllama.wasm": wllamaSingle,
});

/* You can find more models at HuggingFace: https://huggingface.co/models?library=gguf
 * You can also download a model of your choice and place it in the /public folder, then update the modelUrl like this:
 * const modelUrl = "/<your-model-file-name>.gguf";
 */
const modelUrl = "https://huggingface.co/neopolita/smollm-135m-instruct-gguf/resolve/main/smollm-135m-instruct_q8_0.gguf";

/* See more about templating here:
* https://huggingface.co/docs/transformers/main/en/chat_templating
*/
const formatChat = async (messages) => {
  const template = new Template(wllama.getChatTemplate() ?? "");
  return template.render({
    messages,
    bos_token: await wllama.detokenize([wllama.getBOS()]),
    eos_token: await wllama.detokenize([wllama.getEOS()]),
    add_generation_prompt: true,
  });
};

function App() {
  // previous state declarations...

  const submitPrompt = async () => {
    setIsLoading(true);

    if (!wllama.isModelLoaded()) {
      await wllama.loadModelFromUrl(modelUrl, {
        n_threads: 1,
        useCache: true,
        allowOffline: true,
        progressCallback: (progress) => setProgress((progress.loaded / progress.total) * 100),
      });
    }
    const promptObject = { role: "user", content: prompt.trim() };
    setOutput([promptObject]);
    await wllama.createCompletion(await formatChat([promptObject]), {
      nPredict: 256,
      sampling: { temp: 0.4, penalty_repeat: 1.3 },
      onNewToken: (token, piece, text) => {
        setOutput([promptObject, { role: "assistant", content: text }]);
      },
    });
    setIsLoading(false);
  };
  // rest of existing code...
}
export default App;
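A quick note on formatChat before we move on: GGUF models ship with a chat template (a Jinja string) describing how user and assistant turns should be wrapped in the special tokens the model was trained on, and @huggingface/jinja renders that template for us. The exact output depends entirely on the model; for a model using the common ChatML convention, a single user message would render to something roughly like this (the trailing assistant tag is what add_generation_prompt: true adds, inviting the model to respond):

<|im_start|>user
What is WebAssembly?<|im_end|>
<|im_start|>assistant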

Congratulations! Nicely done 👏 At this point, you have a functioning AI chat interface running entirely in your browser!


Let's review what we just did.

Your web app will:

  1. Download and cache the model on first use so it works completely offline after that.
  2. Process prompts locally, in your browser, using your CPU
  3. Stream responses as they're generated

If you're not already running the web app, start it using npm run dev and visit http://localhost:5173/. Test it out, and celebrate 🎉👏
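One optional tweak if you'd like the download to start immediately rather than on the first prompt: kick off the model load when the component mounts. This is just a sketch, reusing the wllama instance and modelUrl already defined in App.jsx (remember to add useEffect to the React import):

// Optional: preload the model on mount instead of waiting for the first prompt.
// Reuses the `wllama` instance and `modelUrl` defined earlier in App.jsx.
useEffect(() => {
  if (!wllama.isModelLoaded()) {
    setIsLoading(true);
    wllama
      .loadModelFromUrl(modelUrl, {
        n_threads: 1,
        useCache: true,
        allowOffline: true,
        progressCallback: ({ loaded, total }) => setProgress((loaded / total) * 100),
      })
      .then(() => setIsLoading(false));
  }
}, []);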

Current Limitations 😅

  • Currently CPU-only (no WebGPU support yet)
  • 2GB file size limit for models, though there's a workaround to split them (see the Wllama documentation)

A Huge Thanks To 👏

Cool Similar Projects to Check Out 😎

--

Ready for a challenge? (Pick any)

  • Update your copy of the code and improve the styling/CSS
  • Implement the ability for a user to pick a model file on their device. See: Example and Wllama documentation
  • The chat output is currently shown as plain text. To enhance the visual presentation, integrate a markdown library like react-markdown.
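For that last challenge, the wiring is pleasantly small. A minimal sketch, assuming you've run npm install react-markdown, would swap the <pre> block for something like:

import Markdown from "react-markdown";

// ...then, inside App's JSX, render each message's content as markdown
// instead of joining everything into a single <pre> block:
{output.map(({ role, content }, index) => (
  <div key={index}>
    <strong>{role}:</strong>
    <Markdown>{content}</Markdown>
  </div>
))}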

Share your outcome - let's celebrate your win! 🏆 🕺
