Daniel Chifamba
Run Your Offline AI Chat: Pure Browser, No Backend

Imagine running a ChatGPT-like AI right here in your browser - completely offline. No server needed, no API calls, just pure browser power. Sounds impossible? Not anymore! Modern browsers have evolved into incredibly capable platforms, and I'm excited to show you just what they can do. Together, we will build a React.js web app that you can chat with, even while you're offline.

In this guide, we'll go through three options:

  1. The Quick Start Approach: a ready-to-use version that works offline after downloading your chosen LLM
  2. The Local Development Path: running the source code locally in your own browser
  3. Building from the Ground Up: making a similar project from scratch using React

Background

"Boring...๐Ÿ˜ด" Okay, If you're eager to jump straight into implementation, feel free to skip to "Running your own local models" ๐Ÿ˜œ

The AI revolution has transformed how we work, and for good reason. Consider this: while it might take a person 20-30 minutes to read and summarize a book, an AI can accomplish the same task in seconds. Many of us have embraced tools like ChatGPT, Gemini, Claude and Grok for this very reason. But what if you want these capabilities without sharing your data with cloud services?

Thanks to the open-source AI community, we have many LLMs (Large Language Models) that are publicly available. Take the latest LLMs released by Meta for example, they are compact enough to run on-device and the good news is they are available for anyone to use. But here's the thing - once you've got these models in your hands, how do you actually use them locally?

Popular solutions like Ollama and LMStudio offer impressive local AI capabilities, making it possible for you to run a model on your device. Ollama offers a nice CLI chat, while LMStudio provides a polished GUI experience. Both support API integration for custom applications. But today, we're exploring something cool - running these models directly in your browser.
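For context, here's roughly what "API integration" with one of those tools looks like. This is a minimal sketch against Ollama's local REST API, assuming Ollama is already running on its default port and you've pulled a model (the model name below is just an example):

// Minimal sketch: ask a locally running Ollama server for a completion.
// Assumes Ollama is listening on http://localhost:11434 and "llama3.2" has been pulled.
const response = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "llama3.2", // example model name - use whichever model you have pulled
    prompt: "Summarize WebAssembly in one sentence.",
    stream: false, // ask for a single JSON response instead of a token stream
  }),
});
const data = await response.json();
console.log(data.response); // the generated text

It works nicely, but it still depends on a separate server process running on your machine - which is exactly what we're about to do away with.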

Running your own local models in your browser without a backend

What makes this approach special is that we're running LLM models directly in your browser - no backend required. The concept is straightforward: download a model file once, then run inference locally, right in your browser. The capabilities of modern browsers are truly remarkable.

This magic is possible thanks to WebAssembly, which allows us to leverage powerful C/C++ tools like llama.cpp.
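If you want to confirm your browser is up to the task, a rough feature check (paste it into your DevTools console) looks like this:

// Rough check for WebAssembly support - all modern browsers should pass this.
const hasWasm =
  typeof WebAssembly === "object" &&
  typeof WebAssembly.instantiate === "function";
console.log(hasWasm ? "WebAssembly is supported" : "No WebAssembly support");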

Drum-roll please 🥁...

Introducing Wllama - an impressive project that provides WebAssembly bindings for llama.cpp, enabling seamless in-browser LLM inference.

Let's begin.

The Quick Start Approach 🚀

Visit the pre-built demo site: https://private-ai-chat-assistant.vercel.app/ and experience it yourself:

  1. Choose your preferred model from the dropdown
  2. Type your prompt
  3. Hit enter

That's all it takes. The page downloads your selected model directly to your device (browser cache), and from there, everything runs locally. Want proof? Try disconnecting from the internet or check your browser's network tab - no external calls, just pure local processing power at work.
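Curious where the model actually lives? It sits in your browser's storage for the site. The exact cache names are an implementation detail of the library, so treat this as a peek rather than an API, but these DevTools console snippets give you a rough picture:

// List the Cache Storage buckets this page has created (names vary by library and version).
console.log(await caches.keys());

// Rough estimate of how much storage this origin is using - a downloaded model
// weighing in at hundreds of MB should be clearly visible here.
const { usage, quota } = await navigator.storage.estimate();
console.log(`Using ~${(usage / 1024 / 1024).toFixed(0)} MB of ~${(quota / 1024 / 1024 / 1024).toFixed(1)} GB`);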


Ready to explore more models? Head over to HuggingFace's GGUF model collection. You'll find a vast array of models ready for use - just download any GGUF file (up to 2GB) and load it through the dropdown.
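If you'd rather check a file before committing to a big download, the repo's "resolve" URL is a plain HTTP endpoint. Here's a small sketch (Node 18+, which ships with fetch) that asks for the size of the SmolLM model used later in this guide - handy for staying under the 2GB limit. Whether the size header is reported can depend on the server, so treat it as best-effort:

// Check a GGUF file's size via a HEAD request before downloading it (Node 18+).
const url =
  "https://huggingface.co/neopolita/smollm-135m-instruct-gguf/resolve/main/smollm-135m-instruct_q8_0.gguf";
const head = await fetch(url, { method: "HEAD" });
const size = Number(head.headers.get("Content-Length") ?? 0);
console.log(size ? `${(size / 1024 / 1024).toFixed(0)} MB` : "Size not reported");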


"That's nice, but I want to run it locally myself", I hear someone say. Sure...I got you covered ๐Ÿ™‚

The Local Development Path 🏠

Let's set up your own local instance.

What You'll Need:

  • Basic HTML, JavaScript, and CSS knowledge
  • A web browser with WebAssembly support (I'm betting yours does 🤞😁)
  • Node.js installed. You can get a copy here and follow the installation steps.

Steps:

  1. Clone the repository:
git clone https://github.com/nadchif/in-browser-llm-inference.git
  2. Install dependencies:
cd in-browser-llm-inference && npm install

Note that it will attempt to download a model during installation to be used as the default option when you run the web app.

  3. Launch the web app:
npm run dev
  4. Navigate to http://localhost:5173/

Have fun!! 😊

Building from the Ground Up 💪

Now, let's demystify what is really going on by building our own implementation from scratch.

1. Create a React App with Vite

First, let's set up our development environment:

npm create vite@latest

When prompted:

  • Name your project (we'll use "browser-llm-chat")
  • Select React as your framework
  • Choose JavaScript as your variant

Then initialize your project:

cd browser-llm-chat
npm install
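Depending on your Vite version, the generated project will look roughly like this - the two files we'll touch are src/App.jsx and src/index.css:

browser-llm-chat/
├── index.html
├── package.json
├── vite.config.js
├── public/
└── src/
    ├── main.jsx
    ├── App.jsx
    ├── App.css
    └── index.css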

2. Design the Basic UI

Let's create a clean, functional interface. Replace the contents of App.jsx with:

import { useState } from "react";

function App() {
  const [prompt, setPrompt] = useState("");
  const [output, setOutput] = useState([]);
  const [isLoading, setIsLoading] = useState(false);
  const [progress, setProgress] = useState(0);

  const handlePromptInputChange = (e) => setPrompt(e.target.value);
  const shouldDisableSubmit = isLoading || prompt.trim().length === 0;


  const submitPrompt = () => {
    // We'll implement this next
  };

  return (
    <div>
      <pre>{output.map(({ role, content }) => `${role}: ${content}\n\n`)}</pre>
      {!output.length && (
        <div>
          {isLoading ? <span>Loading {Math.ceil(progress)}%</span> : <h1>Hi, How may I help you?</h1>}
        </div>
      )}
      <div>
        <input
          type="text"
          value={prompt}
          onChange={handlePromptInputChange}
          placeholder="Enter your prompt here"
        />
        <button type="button" onClick={submitPrompt} disabled={shouldDisableSubmit}>
          <div>โ†’</div>
        </button>
      </div>
    </div>
  );
}

export default App;

Add some basic styling by replacing the contents of index.css with:

body {
  font-family: sans-serif;
  display: flex;
  justify-content: center;
}
pre {
  font-family: sans-serif;
  min-height: 30vh;
  white-space: pre-wrap;
  white-space: -moz-pre-wrap;
  white-space: -pre-wrap;
  white-space: -o-pre-wrap;
}
input {
  padding: 12px 20px;
  border: 1px solid #aaa;
  background-color: #f2f2f2;
}
input, pre {
  width: 60vw;
  min-width: 40vw;
  max-width: 640px;
}
button {
  padding: 12px 20px; 
  background-color: #000;
  color: white; 
}

Output preview

3. Integrate Wllama

Now for the exciting part - let's add AI capabilities to our application:

npm install @wllama/wllama @huggingface/jinja

Update your App.jsx to integrate Wllama:

import { useState } from "react";
import { Wllama } from "@wllama/wllama/esm/wllama";
import wllamaSingleJS from "@wllama/wllama/src/single-thread/wllama.js?url";
import wllamaSingle from "@wllama/wllama/src/single-thread/wllama.wasm?url";
import { Template } from "@huggingface/jinja";

const wllama = new Wllama({
  "single-thread/wllama.js": wllamaSingleJS,
  "single-thread/wllama.wasm": wllamaSingle,
});

/* You can find more models at HuggingFace: https://huggingface.co/models?library=gguf
 * You can also download a model of your choice and place it in the /public folder, then update the modelUrl like this:
 * const modelUrl = "/<your-model-file-name>.gguf";
 */
const modelUrl = "https://huggingface.co/neopolita/smollm-135m-instruct-gguf/resolve/main/smollm-135m-instruct_q8_0.gguf";

/* See more about templating here:
* https://huggingface.co/docs/transformers/main/en/chat_templating
*/
const formatChat = async (messages) => {
  const template = new Template(wllama.getChatTemplate() ?? "");
  return template.render({
    messages,
    bos_token: await wllama.detokenize([wllama.getBOS()]),
    eos_token: await wllama.detokenize([wllama.getEOS()]),
    add_generation_prompt: true,
  });
};

function App() {
  // previous state declarations...

  const submitPrompt = async () => {
    setIsLoading(true);

    if (!wllama.isModelLoaded()) {
      await wllama.loadModelFromUrl(modelUrl, {
        n_threads: 1,
        useCache: true,
        allowOffline: true,
        progressCallback: (progress) => setProgress((progress.loaded / progress.total) * 100),
      });
    }
    const promptObject = { role: "user", content: prompt.trim() };
    setOutput([promptObject]);
    await wllama.createCompletion(await formatChat([promptObject]), {
      nPredict: 256,
      sampling: { temp: 0.4, penalty_repeat: 1.3 },
      onNewToken: (token, piece, text) => {
        setOutput([promptObject, { role: "assistant", content: text }]);
      },
    });
    setIsLoading(false);
  };
  // rest of existing code...
}
export default App;
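A quick note on formatChat before we move on: GGUF models ship with a chat template (a Jinja string) describing how user and assistant turns should be wrapped in the special tokens the model was trained on, and @huggingface/jinja renders that template for us. The exact output depends entirely on the model; for a model using the common ChatML convention, a single user message would render to something roughly like this (the trailing assistant tag is what add_generation_prompt: true adds, inviting the model to respond):

<|im_start|>user
What is WebAssembly?<|im_end|>
<|im_start|>assistant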

Congratulations! Nicely done 👏 At this point, you have a functioning AI chat interface running entirely in your browser!


Let's review what we just did.

Your web app will:

  1. Download and cache the model on first use so it works completely offline after that.
  2. Process prompts locally, in your browser, using your CPU
  3. Stream responses as they're generated

If you're not already running the web app, start it using npm run dev and visit http://localhost:5173/. Test it out, and celebrate 🎉👏
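One optional tweak if you'd like the download to start immediately rather than on the first prompt: kick off the model load when the component mounts. This is just a sketch, reusing the wllama instance and modelUrl already defined in App.jsx (remember to add useEffect to the React import):

// Optional: preload the model on mount instead of waiting for the first prompt.
// Reuses the `wllama` instance and `modelUrl` defined earlier in App.jsx.
useEffect(() => {
  if (!wllama.isModelLoaded()) {
    setIsLoading(true);
    wllama
      .loadModelFromUrl(modelUrl, {
        n_threads: 1,
        useCache: true,
        allowOffline: true,
        progressCallback: ({ loaded, total }) => setProgress((loaded / total) * 100),
      })
      .then(() => setIsLoading(false));
  }
}, []);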

Current Limitations 😅

  • Currently CPU-only (no WebGPU support yet)
  • 2GB file size limit for models, though there's a workaround to split them (see the Wllama documentation)

A Huge Thanks To 👏

Cool Similar Projects to Check Out 😎

--

Ready for a challenge? (Pick any)

  • Update your copy of the code and improve the styling/CSS
  • Implement the ability for a user to pick a model file on their device. See: Example and Wllama documentation
  • The chat output is currently shown as plain text. To enhance the visual presentation, integrate a markdown library like react-markdown.
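For that last challenge, the wiring is pleasantly small. A minimal sketch, assuming you've run npm install react-markdown, would swap the <pre> block for something like:

import Markdown from "react-markdown";

// ...then, inside App's JSX, render each message's content as markdown
// instead of joining everything into a single <pre> block:
{output.map(({ role, content }, index) => (
  <div key={index}>
    <strong>{role}:</strong>
    <Markdown>{content}</Markdown>
  </div>
))}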

Share your outcome - let's celebrate your win! 🏆 🕺
