Alan West
How to Run a 1.7B Parameter LLM in Your Browser With WebGPU

You've got a 1.7 billion parameter model. You want it running locally. In a browser tab. No server, no API keys, no Docker containers.

Sounds impossible, right? A few months ago, I would've agreed with you. But 1-bit quantized models like Bonsai 1.7B have changed the math entirely — we're talking about squeezing a capable language model down to roughly 290MB. That's smaller than most Electron apps.

Here's how to actually make this work, and why it fails when you try the obvious approaches first.

The Problem: LLMs Are Too Fat for the Browser

Let's do some quick napkin math. A standard 1.7B parameter model stored in FP16 (16-bit floating point) takes up about 3.4GB. Even with aggressive 4-bit quantization (GPTQ/AWQ style), you're still looking at roughly 1GB.

That's a non-starter for browser delivery. You'd be asking users to download a gigabyte before they can even interact with your app. And once it's downloaded, you need enough VRAM to actually run inference — WebGL gives you limited compute capabilities, and WASM-based CPU inference is painfully slow for anything above a few hundred million parameters.

I hit this wall last month trying to build an offline-capable writing assistant. The models that were small enough to load were too dumb to be useful, and the models that were useful were too large to load.

Root Cause: Why Traditional Quantization Hits a Floor

Traditional post-training quantization (PTQ) compresses weights after the model has been trained. You take your FP16 weights and map them to INT8 or INT4 representations. The problem is structural: you still need to store scale factors, zero points, and often keep certain sensitive layers at higher precision.

With 4-bit quantization, you're at roughly 0.5 bytes per parameter. For 1.7B parameters, that's still ~850MB before you account for activations and KV cache during inference.

The breakthrough with 1-bit models is fundamentally different. Instead of compressing after training, models like Bonsai are trained from scratch (or fine-tuned) with ternary weights: each weight is constrained to {-1, 0, 1}. That's roughly 1.58 bits per weight in practice (log2(3)), and the storage is dramatically more efficient.

# Traditional 4-bit quantization: still needs scale factors per group
# Memory per parameter ≈ 4 bits + overhead
traditional_size = (1.7e9 * 4) / 8 / 1e9  # ~0.85 GB

# 1-bit ternary quantization: {-1, 0, 1}
# Memory per parameter ≈ 1.58 bits (log2(3))
ternary_size = (1.7e9 * 1.58) / 8 / 1e9  # ~0.34 GB

# With packing optimizations, you can get even lower
# Bonsai 1.7B reportedly fits in ~290MB

That size difference is what makes browser delivery feasible.

The Solution: WebGPU + 1-Bit Models

WebGPU is the key piece that makes this viable. Unlike WebGL, which was designed for rendering triangles and got awkwardly shoehorned into compute tasks, WebGPU was built from the ground up with general-purpose GPU compute in mind. It exposes compute shaders, proper buffer management, and workgroup-level parallelism.

Step 1: Check WebGPU Support

Before you do anything else, verify the user's browser actually supports WebGPU. As of early 2026, Chrome and Edge have shipped it, Firefox has it behind a flag, and Safari support is progressing.

async function checkWebGPUSupport() {
  if (!navigator.gpu) {
    console.error('WebGPU not supported in this browser');
    return null;
  }

  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) {
    console.error('No GPU adapter found');
    return null;
  }

  const device = await adapter.requestDevice();

  // Log adapter details (note: WebGPU does not expose available VRAM);
  // adapter.info replaced the deprecated requestAdapterInfo() in the spec
  console.log('GPU:', adapter.info?.description ?? 'unknown');

  return device;
}

This is where most people's first attempt fails. They assume WebGPU is universally available, ship without a fallback, and get bug reports from Firefox users wondering why the page is blank.
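One way to prevent that failure mode is to make the backend choice an explicit decision rather than an assumption. A minimal sketch (the tier names and capability flags are illustrative, not from any particular library; the flags are passed in so the logic stays testable outside a browser):

```javascript
// Decide which inference backend to use, given detected capabilities.
function pickBackend({ hasWebGPU, hasWasm }) {
  if (hasWebGPU) return 'webgpu'; // full ternary model on GPU
  if (hasWasm) return 'wasm';     // smaller fallback model on CPU
  return 'server';                // last resort: remote API
}

// In the browser, the flags would come from feature detection, e.g.:
// pickBackend({
//   hasWebGPU: !!navigator.gpu,
//   hasWasm: typeof WebAssembly === 'object',
// });
```

Checking `navigator.gpu` alone isn't sufficient, as the earlier snippet shows: `requestAdapter()` can still return null, so a real app would demote to the WASM tier in that case too.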

Step 2: Loading the Model Efficiently

Don't just fetch the entire model as one blob. Stream it in chunks and show progress. A 290MB download with no progress indicator feels broken.

async function loadModelWithProgress(modelUrl, onProgress) {
  const response = await fetch(modelUrl);
  if (!response.ok) {
    throw new Error(`Model fetch failed: ${response.status}`);
  }
  // content-length can be absent (e.g. chunked encoding), so guard against NaN
  const total = parseInt(response.headers.get('content-length') ?? '0', 10);
  let loaded = 0;

  const reader = response.body.getReader();
  const chunks = [];

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    chunks.push(value);
    loaded += value.length;
    onProgress(total > 0 ? loaded / total : 0); // 0.0 to 1.0
  }
  }

  // Concatenate chunks into a single ArrayBuffer
  const buffer = new Uint8Array(loaded);
  let offset = 0;
  for (const chunk of chunks) {
    buffer.set(chunk, offset);
    offset += chunk.length;
  }

  return buffer;
}

Pro tip: use the Cache API to store the model weights after the first download. Nobody wants to re-download 290MB every time they visit your page.

async function getCachedModel(modelUrl) {
  const cache = await caches.open('llm-weights-v1');
  let response = await cache.match(modelUrl);

  if (!response) {
    response = await fetch(modelUrl);
    // Only cache successful responses: a cached 404 would stick around forever
    if (response.ok) {
      await cache.put(modelUrl, response.clone());
    }
  }

  return response.arrayBuffer();
}

Step 3: Writing Efficient Compute Shaders for Ternary Weights

Here's where 1-bit models really shine on WebGPU. Because weights are ternary {-1, 0, 1}, your matrix multiplication kernels become dramatically simpler. You don't need floating point multiply-accumulate — it's essentially additions and subtractions.

// WGSL compute shader for ternary weight matrix-vector multiply
// Each weight is packed: 2 bits per weight, 16 weights per u32
@group(0) @binding(0) var<storage, read> weights: array<u32>;
@group(0) @binding(1) var<storage, read> input: array<f32>;
@group(0) @binding(2) var<storage, read_write> output: array<f32>;

@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
    let row = gid.x;
    // Dispatch size rounds up to a workgroup multiple, so guard the overshoot
    if (row >= arrayLength(&output)) {
        return;
    }
    var sum: f32 = 0.0;

    // Each u32 packs 16 ternary weights (2 bits each)
    // 00 = 0, 01 = 1, 10 = -1
    for (var i: u32 = 0u; i < arrayLength(&input) / 16u; i = i + 1u) {
        let packed = weights[row * (arrayLength(&input) / 16u) + i];
        for (var j: u32 = 0u; j < 16u; j = j + 1u) {
            let w = (packed >> (j * 2u)) & 3u;
            let idx = i * 16u + j;
            if (w == 1u) {
                sum = sum + input[idx];
            } else if (w == 2u) {
                sum = sum - input[idx];
            }
            // w == 0: skip (multiply by zero)
        }
    }

    output[row] = sum;
}

Notice how there's no floating-point multiplication in the inner loop. This is a massive win on GPU hardware where ALU throughput matters.
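For completeness, the host-side packing that this shader expects might look like the sketch below, assuming the same encoding (00 = 0, 01 = +1, 10 = -1, 16 weights per u32, least-significant bits first):

```javascript
// Pack an array of ternary weights {-1, 0, 1} into u32 words,
// 2 bits per weight, 16 weights per word, matching the WGSL decode above.
function packTernary(weights) {
  const words = new Uint32Array(Math.ceil(weights.length / 16));
  for (let i = 0; i < weights.length; i++) {
    const code = weights[i] === 1 ? 1 : weights[i] === -1 ? 2 : 0;
    words[i >> 4] |= code << ((i & 15) * 2); // word i/16, bit offset (i%16)*2
  }
  return words;
}
```

The resulting `Uint32Array` can be uploaded directly into the `weights` storage buffer.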

Why It Still Fails (and How to Fix It)

Even with all of this, you'll hit a few gotchas:

  • Memory pressure. Browsers impose per-tab memory limits. On mobile, you might get killed by the OS if you allocate too aggressively. Pre-allocate your buffers and reuse them across inference steps instead of creating new ones each pass.

  • Shader compilation stalls. The first inference call will feel slow because WebGPU needs to compile your WGSL shaders. Trigger a dummy warm-up pass during model loading so the user doesn't see a freeze on their first message.

  • Tokenizer overhead. Don't forget you need a tokenizer too. Ship a BPE tokenizer compiled to WASM rather than implementing it in JavaScript — the performance difference is significant for longer inputs.

  • KV cache growth. Even with tiny weights, your attention KV cache grows with context length and is stored in FP16 or FP32. Cap your context length aggressively (1024-2048 tokens) to keep memory manageable in-browser.
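That last point is worth quantifying. A rough KV cache estimate is 2 (keys and values) × layers × heads × head dimension × context length × bytes per element. The architecture numbers below are illustrative placeholders, not Bonsai's actual config:

```javascript
// Rough KV cache size: 2 (K and V) per layer, per head, per position.
function kvCacheBytes({ layers, heads, headDim, ctxLen, bytesPerElem }) {
  return 2 * layers * heads * headDim * ctxLen * bytesPerElem;
}

// e.g. 24 layers, 16 heads, headDim 64, 2048-token context, FP16 (2 bytes):
// 2 * 24 * 16 * 64 * 2048 * 2 = 201,326,592 bytes, about 192 MiB
```

With numbers in that ballpark, an uncapped context can make the KV cache rival the size of the ternary weights themselves, which is exactly why an aggressive context cap matters in-browser.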

Prevention: Designing for Browser Inference From the Start

If you're planning to ship a model in-browser, think about this early:

  • Choose architecture carefully. Smaller vocabulary sizes and fewer attention heads reduce both model size and KV cache pressure.
  • Test on real hardware. That beefy M-series MacBook in your lap is not representative. Test on a 2-year-old Windows laptop with integrated graphics.
  • Progressive enhancement. Always have a fallback. WebGPU not available? Offer a smaller model on WASM. Even that won't work? Fall back to a server-side API.
  • Chunk your downloads. Split model shards so users with flaky connections don't have to restart a 290MB download from scratch.
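The chunked-download idea can be sketched with HTTP Range requests, assuming the server (or CDN) supports them; the shard size here is arbitrary:

```javascript
// Compute Range headers for downloading a model in fixed-size shards,
// so a flaky connection only retries the failed shard, not the whole file.
function shardRanges(totalBytes, shardBytes) {
  const ranges = [];
  for (let start = 0; start < totalBytes; start += shardBytes) {
    const end = Math.min(start + shardBytes, totalBytes) - 1;
    ranges.push(`bytes=${start}-${end}`); // inclusive range, per RFC 9110
  }
  return ranges;
}

// Each shard then becomes: fetch(modelUrl, { headers: { Range: range } })
// and completed shards can be stashed in the Cache API as they arrive.
```
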

The Bigger Picture

What excites me about this space isn't any single model — it's the convergence of trends. 1-bit quantization makes models small. WebGPU makes browsers fast. Together, they unlock a category of applications that simply wasn't possible before: fully local, fully private AI that runs in a browser tab.

We're not at GPT-4 quality in 290MB, obviously. But for focused tasks — code completion, text summarization, form filling, local RAG — a 1.7B ternary model is surprisingly capable. And it runs without sending a single byte of user data to a server.

That's a tradeoff I'll take any day.
