Jaime

Running Language Models Directly in the Browser

TLDR

LLMs are everywhere, but every time we use a hosted one, we give our data away for free. Projects like SmolLM2 are working on Small Language Models that can run completely in the browser, using the GPU when possible and falling back to the CPU (WASM) when needed. While the work behind SmolLM2 is great, its output is noticeably weaker than that of the well-known LLMs.

Intro

LLMs are huge and need a lot of resources: storage, CPU, GPU, and RAM. Proprietary providers don't publish these figures, but here's a rough comparison:

Metric        Llama 3.1 405B   Claude Sonnet 4.6   GPT-4
Est. params   405B             ~175B *             ~1.8T *
Disk          ~243 GB          ~175 GB *           ~900 GB *
GPU VRAM      ~203 GB          ~350 GB *           ~3.6 TB *
RAM           ~1.5 TB          ~700 GB *           ~5 TB *

* These values are educated guesses, not real values.
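
For a feel of where numbers of this magnitude come from, the storage needed for the weights alone can be approximated as parameter count × bits per parameter. This is a rough sketch, not how the table above was derived: `estimateWeightsGB` is a hypothetical helper, and it ignores activations, the KV cache, and runtime overhead.

```javascript
// Rough weight-storage estimate: parameters × bits per parameter.
// Assumption: only the weights are counted (no KV cache, no runtime overhead).
function estimateWeightsGB(paramCount, bitsPerParam) {
  const bytes = (paramCount * bitsPerParam) / 8;
  return bytes / 1e9; // decimal gigabytes
}

// A 405B-parameter model at 16 bits per weight vs. 4 bits per weight.
console.log(estimateWeightsGB(405e9, 16)); // 810
console.log(estimateWeightsGB(405e9, 4)); // 202.5
```

Even with aggressive 4-bit quantization, a model of that size is two orders of magnitude beyond what a browser tab can hold.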

Browsers have limited access to resources; the following table shows some rough estimates:

Resource        Chrome     Firefox          Safari
RAM available   1–4 GB     0.5–2 GB         0.3–1 GB
GPU (VRAM)      WebGPU ✓   WebGPU limited   WebGPU (iOS 17+)

With these limits, it is practically impossible to run the well-known LLMs in the browser. That's why they are usually accessed through an SDK, a RESTful API, or their own GUI, with all the computing happening in the cloud.

LLMs running on a server come with an immense infrastructure cost. There are also privacy concerns: as users, our only option is to trust that the providers will use our data responsibly.

That's why I wanted to play with alternatives that can run completely in the browser.

SmolLM2

Based on this paper, the model was trained on several categories of data, including Knowledge/Reasoning, Math, Code, and Generative Tasks.

That said, I built a simple web app that uses SmolLM2 to generate text.

SmolLM2 demo web app

SmolLM2-135M-Instruct

The model is ~140MB. It has to be downloaded and then compiled in the browser; the compilation step is compute-heavy, which works fine on desktop but not on mobile.

Additionally, Safari doesn't enable WebGPU by default; it's still experimental. It can be turned on (Safari → Settings → Feature Flags → WebGPU), but it's disabled for "security" reasons, which suggests Apple doesn't yet trust GPU instructions coming from the browser. Hence the need for WASM, an older technology that is enabled by default and runs exclusively on the CPU.

import { pipeline } from "@huggingface/transformers";

// Prefer WebGPU when the browser exposes it; otherwise fall back to WASM (CPU).
const device = "gpu" in navigator ? "webgpu" : "wasm";

const model = await pipeline(
  "text-generation",
  "HuggingFaceTB/SmolLM2-135M-Instruct",
  {
    device,
    dtype: "q4f16",
  }
);

const messages = [
  { role: "system", content: "you are a English teacher." },
  { role: "user", content: "list 10 important topics" },
];

const output = await model(messages, { max_new_tokens: 128 });
console.log(output);

// Generated topics:
// 1. History
// 2. Science
// 3. Literature
// 4. Art
// 5. Business
// 6. Sports
// 7. Politics
// 8. Social Issues
// 9. Technology
// 10. Essay

SmolLM2-135M-Instruct storage size
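
The `"gpu" in navigator` check in the snippet only confirms that the WebGPU API is exposed; `navigator.gpu.requestAdapter()` can still resolve to `null` when no usable adapter exists. A slightly more defensive sketch follows. `pickDevice` is a hypothetical helper, not part of Transformers.js, and it takes the navigator object as a parameter so it can be exercised outside a browser.

```javascript
// Returns "webgpu" only when an adapter can actually be obtained,
// otherwise "wasm". `nav` is expected to look like window.navigator.
async function pickDevice(nav) {
  if (!nav || !("gpu" in nav) || !nav.gpu) return "wasm";
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm";
  } catch {
    // Treat any failure while probing the GPU as "not available".
    return "wasm";
  }
}

// Usage in the browser (inside an async context):
//   const device = await pickDevice(navigator);
//   const model = await pipeline("text-generation", modelId, { device, dtype: "q4f16" });
```

I have not verified that this avoids every backend failure on mobile; it only rules out the case where the API is present but no adapter is returned.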

SmolLM2-360M-Instruct-q4f16_1-MLC

This model works on mobile. The main difference is that it is already compiled, hence the MLC (Machine Learning Compilation) suffix. Even though the model is bigger (~200MB), it works because the browser has less work to do (there is no compilation phase). On the other hand, the mobile browser caps the storage buffer size at 128MB, which means a shorter maximum context window.

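import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Load the pre-compiled MLC build of the model.
const model = await CreateMLCEngine("SmolLM2-360M-Instruct-q4f16_1-MLC");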

const messages = [
  {
    role: "system",
    content:
      "You are a helpful assistant. You respond ONLY with a numbered list. No introduction, no explanation, no extra text. Just the list.",
  },
  {
    role: "user",
    content: "list 10 important topics",
  },
];

const output = await model.chat.completions.create({ messages });
console.log(output);
// 1. Acid-base mechanism
// 2. Glucarotides
// 3. Proteases
// 4. Oligopeptidases
// 5. Enzymes involved in lipolysis
// 6. Alcoholic acids
// 7. Fuel cell and biodiesel production
// 8. Hydrogen peroxide degradation pathways building block of ethanol
// 9. Bacteria and their role as biofuel digesters
// 10. Directed evolution of biofuel production and pathways that mimic photosynthetic pathways of familiar autotrophs.

SmolLM2-360M-Instruct-q4f16_1-MLC storage size
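
The two libraries return differently shaped results, so `console.log(output)` prints a wrapper object rather than the bare text. A small sketch for pulling out the reply in both cases. `extractReply` is a hypothetical helper; it assumes Transformers.js returns an array whose first item holds `generated_text` (the chat history including the new assistant message), and that WebLLM returns an OpenAI-style object with `choices`.

```javascript
// Extracts the assistant's text from either result shape.
function extractReply(output) {
  // Transformers.js pipeline: [{ generated_text: [...messages, assistantMessage] }]
  if (Array.isArray(output) && output.length > 0) {
    const generated = output[0].generated_text;
    if (Array.isArray(generated) && generated.length > 0) {
      return generated[generated.length - 1].content ?? "";
    }
    // Plain-string prompts produce a plain-string generated_text.
    return typeof generated === "string" ? generated : "";
  }
  // WebLLM: { choices: [{ message: { role, content } }] }
  if (output && Array.isArray(output.choices) && output.choices.length > 0) {
    return output.choices[0].message?.content ?? "";
  }
  return "";
}

// Usage: console.log(extractReply(output));
```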

Conclusion

The GPU is faster than the CPU for this workload. One reason is that a GPU spins up many parallel workers to distribute the load and do the math simultaneously. LLM inference is dominated by math operations (mostly matrix multiplications), so a GPU processes more tokens per second.
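
The difference can be measured rather than assumed. A minimal sketch, where `tokensPerSecond` is a hypothetical helper; the token count is assumed to come from the `usage.completion_tokens` field of WebLLM's OpenAI-style response, or from `max_new_tokens` as an upper bound with Transformers.js.

```javascript
// Throughput in generated tokens per second.
function tokensPerSecond(tokenCount, elapsedMs) {
  if (elapsedMs <= 0) return 0; // avoid dividing by zero on bad timings
  return tokenCount / (elapsedMs / 1000);
}

// Usage in the browser (inside an async context):
//   const start = performance.now();
//   const output = await model.chat.completions.create({ messages });
//   const elapsed = performance.now() - start;
//   console.log(tokensPerSecond(output.usage.completion_tokens, elapsed));

console.log(tokensPerSecond(128, 4000)); // 32
```

Running the same prompt once with `device: "webgpu"` and once with `device: "wasm"` gives a like-for-like comparison on a given machine.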

It was interesting to learn about MLC. Models that are already compiled save computing resources on the device; however, there are fewer pre-compiled models available, and it is harder to produce compiled versions of custom or fine-tuned models. Finally, the compiled model I used is bigger on disk; part of that is simply its larger parameter count (360M vs. 135M), and the compiled package also bundles extras such as weight layout and configuration files.

SmolLM2-135M-Instruct and SmolLM2-360M-Instruct-q4f16_1-MLC are language models from the same family; however, in my test SmolLM2-135M-Instruct gave the better answer even though it's smaller, which is consistent with the general rule:

A smaller well-quantized, well-tuned model often beats a larger poorly-quantized one.

Finally, both models are heavy for a web page (>100MB). While it's impressive to have language models running completely in the browser without any backend service, the reality is that the outputs are very poor compared to those of the well-known LLMs. Nonetheless, it's great to have access to these models, and kudos to the team behind them. I'm sure Small Language Models will become more effective.

Extra

While working on the demo app I found two errors and one warning:

  • This happens on iPhone when trying to use SmolLM2-135M-Instruct with WebGPU:
webgpu:
Unhandled Promise Rejection: Error: no available backend found. ERR: [webgpu] TypeError: gg().webgpuInit is not a function. (In 'gg().webgpuInit(g=>{A.webgpu.device=g})', 'gg().webgpuInit' is undefined)
  • This happens on iPhone when trying to use SmolLM2-135M-Instruct with WASM:
wasm:
Unhandled Promise Rejection: Error: no available backend found. ERR: [wasm] RangeError: Out of memory
  • This is a warning that iPhone shows when loading SmolLM2-360M-Instruct-q4f16_1-MLC:
Requested maxStorageBufferBindingSize exceeds limit.
requested=1024MB,
limit=683MB.
WARNING: Falling back to 128MB
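
The limit in that last warning can be inspected directly: a WebGPU adapter exposes `limits.maxStorageBufferBindingSize` in bytes. A small sketch for comparing a requested size against it. `checkStorageBufferLimit` is a hypothetical helper, and treating the "MB" in the log as MiB (1024 × 1024 bytes) is an assumption.

```javascript
const MiB = 1024 * 1024;

// Compares a requested buffer size (bytes) against the adapter's limit.
// `limits` is expected to look like GPUAdapter.limits.
function checkStorageBufferLimit(limits, requestedBytes) {
  const limitBytes = limits.maxStorageBufferBindingSize;
  return {
    requestedMiB: Math.round(requestedBytes / MiB),
    limitMiB: Math.round(limitBytes / MiB),
    fits: requestedBytes <= limitBytes,
  };
}

// Usage in the browser (inside an async context):
//   const adapter = await navigator.gpu.requestAdapter();
//   console.log(checkStorageBufferLimit(adapter.limits, 1024 * MiB));

// The values from the iPhone warning, passed in as a mock limits object:
console.log(checkStorageBufferLimit({ maxStorageBufferBindingSize: 683 * MiB }, 1024 * MiB));
// { requestedMiB: 1024, limitMiB: 683, fits: false }
```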
