I wanted a NumPy-like API that ran on the GPU in the browser. No training graphs, no autograd—just arrays and ops. So I built one.
Why not just use TensorFlow.js?
TensorFlow.js is great, but it's heavy. I needed something small for demos and experiments. I also wanted to understand how GPU compute actually works under the hood. So I started a small library called accel-gpu.
The basics
WebGPU exposes compute shaders via WGSL. You create buffers, write shaders, and dispatch workgroups. The tricky part is making that feel like a.add(b) instead of "create bind group, set pipeline, dispatch, sync."
I went with a simple model: each op has a precompiled WGSL shader. add is a shader that does out[i] = a[i] + b[i]. relu is out[i] = max(0.0, a[i]). No runtime shader generation, no graph IR—just a map of op names to shader strings.
@group(0) @binding(0) var<storage, read> a: array<f32>;
@group(0) @binding(1) var<storage, read> b: array<f32>;
@group(0) @binding(2) var<storage, read_write> out: array<f32>;

@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
  let i = gid.x;
  if (i < arrayLength(&out)) {
    out[i] = a[i] + b[i];
  }
}
That's the whole add kernel. The runner creates a pipeline from it, binds buffers, and dispatches. For 100k elements, that's ⌈100000 / 256⌉ = 391 workgroups of 256 threads.
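The host-side model is small enough to sketch. Below, `KERNELS` and `workgroupCount` are illustrative names, not accel-gpu's real API — just the "map of op names to shader strings" plus the dispatch arithmetic:

```typescript
// A hypothetical op table: each op name maps to a precompiled WGSL
// string (kernel bodies shown here for brevity). No runtime codegen.
const WORKGROUP_SIZE = 256;

const KERNELS: Record<string, string> = {
  add: "out[i] = a[i] + b[i];",
  relu: "out[i] = max(0.0, a[i]);",
};

// How many workgroups of WORKGROUP_SIZE threads cover n elements.
function workgroupCount(n: number): number {
  return Math.ceil(n / WORKGROUP_SIZE);
}

console.log(workgroupCount(100_000)); // 391
```

The runner would look an op up in the table, build (or fetch a cached) pipeline, and call `dispatchWorkgroups(workgroupCount(n))`.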
WebGL fallback
WebGPU isn't everywhere yet. Safari and Firefox need WebGL2. So I added a WebGL backend.
WebGL doesn't have compute shaders. The trick is to use a full-screen quad and a fragment shader. You pack floats into RGBA8 (bit manipulation), render to a texture, and unpack. It's a bit hacky, but it works. Each "compute" op becomes a render pass.
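The packing half of that trick can be shown host-side. This is the generic bit-reinterpretation approach (a float's 32 bits viewed as four 8-bit channels), not necessarily the exact encoding accel-gpu uses; shader-side, the same bit manipulation happens in GLSL:

```typescript
// Shared scratch views: one Float32Array and one Uint8Array over the
// same 4 bytes, so writing a float exposes its raw bytes and vice versa.
const scratch = new Float32Array(1);
const bytes = new Uint8Array(scratch.buffer);

// Reinterpret a float's 32 bits as four 8-bit channels (R, G, B, A).
function packFloat(x: number): [number, number, number, number] {
  scratch[0] = x;
  return [bytes[0], bytes[1], bytes[2], bytes[3]];
}

// Inverse: rebuild the float from its four channels.
function unpackFloat(r: number, g: number, b: number, a: number): number {
  bytes[0] = r; bytes[1] = g; bytes[2] = b; bytes[3] = a;
  return scratch[0];
}
```

Because this is a pure reinterpretation (no quantization), the round trip is lossless for any finite f32 value.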
Then there's a CPU backend for Node and headless runs. Same API, different implementation. The API stays the same whether you're on WebGPU, WebGL, or CPU.
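One way to keep three backends behind one API is a shared interface, with the CPU version doubling as the reference implementation. A minimal sketch (the `Backend` interface and `cpu` object are illustrative, not accel-gpu's actual code):

```typescript
// A hypothetical backend contract: every backend implements the same
// op signatures, so callers never branch on WebGPU vs WebGL vs CPU.
interface Backend {
  add(a: Float32Array, b: Float32Array): Float32Array;
  relu(a: Float32Array): Float32Array;
}

// CPU backend: plain loops over typed arrays. Also the reference
// implementation that GPU backends are tested against.
const cpu: Backend = {
  add: (a, b) => a.map((x, i) => x + b[i]),
  relu: (a) => a.map((x) => Math.max(0, x)),
};
```

A GPU backend would implement the same interface but route each call through the kernel table and a dispatch.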
Shape inference
For matmul, you need M, N, and K. I didn't want users to pass those every time. So the library infers them from array shapes. If A is [2, 3] and B is [3, 4], it knows M=2, N=4, K=3. That's just a bit of shape logic, but it makes the API much nicer.
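That shape logic fits in a few lines. A sketch with a hypothetical helper name (`matmulShapes`), including the mismatch check the inference makes possible:

```typescript
// Infer matmul dimensions from operand shapes: A is [M, K], B is [K, N].
// Throws if the inner dimensions disagree.
function matmulShapes(
  aShape: [number, number],
  bShape: [number, number],
): { M: number; N: number; K: number } {
  const [m, kA] = aShape;
  const [kB, n] = bShape;
  if (kA !== kB) {
    throw new Error(`matmul: inner dims differ ([${aShape}] x [${bShape}])`);
  }
  return { M: m, N: n, K: kA };
}

console.log(matmulShapes([2, 3], [3, 4])); // { M: 2, N: 4, K: 3 }
```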
Reductions are annoying
sum over 1M elements can't fit in one workgroup. You need a multi-pass reduction: each pass sums pairs of elements, halving the active size, until one value remains. The first version had a bug where it mishandled non-power-of-two sizes; that took a while to track down.
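The pass structure can be modeled on the CPU. This sketch mirrors what each GPU pass does — add element `i` to element `i + half`, then shrink the active region — with `Math.ceil` carrying the odd element through, which is exactly the non-power-of-two case that bit me (names here are illustrative):

```typescript
// CPU model of the multi-pass pairwise sum. Each iteration stands in
// for one GPU dispatch over the first half of the active region.
function pairwiseSum(data: Float32Array): number {
  const buf = Float32Array.from(data); // don't clobber the input
  let n = buf.length;
  while (n > 1) {
    const half = Math.ceil(n / 2); // ceil handles non-power-of-two sizes
    for (let i = 0; i < n - half; i++) {
      buf[i] += buf[i + half];     // fold the back half onto the front
    }
    n = half; // the middle element (if n is odd) rides along untouched
  }
  return buf[0];
}

console.log(pairwiseSum(new Float32Array([1, 2, 3, 4, 5]))); // 15
```

For 1M elements that's about 20 passes (log₂ of the size), each one a single dispatch.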
argmax is worse—you need both value and index. I ended up doing it on the CPU by reading the buffer back. Not ideal for huge arrays, but fine for typical use.
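The CPU fallback after the readback is the boring part — a single scan tracking the best index. A sketch (illustrative, not the library's exact code):

```typescript
// Index of the maximum element; ties resolve to the earliest index.
function argmax(data: Float32Array): number {
  let best = 0;
  for (let i = 1; i < data.length; i++) {
    if (data[i] > data[best]) best = i;
  }
  return best;
}
```

The GPU-unfriendly part is that a parallel version has to reduce (value, index) pairs instead of bare values, which doubles the bookkeeping in every pass.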
What I'd do differently
Buffer pooling — I added it later. Creating and destroying GPUBuffers every frame was slow. A simple pool helped a lot.
Error handling — Early on, WebGPU errors were cryptic. I added clearer messages ("reshape: cannot reshape [2,3] to [4]"), and that saved a lot of debugging time.
Testing — I test on the CPU backend. Same ops, same results, no GPU setup. Makes CI straightforward.
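The buffer-pooling idea above is small enough to sketch. Modeled here with ArrayBuffers keyed by byte size — the real pool would hold GPUBuffers (and respect usage flags), and the class name is illustrative:

```typescript
// Reuse allocations keyed by byte length instead of creating and
// destroying a buffer every frame.
class BufferPool {
  private free = new Map<number, ArrayBuffer[]>();

  // Hand back a previously released buffer of this size, or allocate.
  acquire(byteLength: number): ArrayBuffer {
    const list = this.free.get(byteLength);
    return list?.pop() ?? new ArrayBuffer(byteLength);
  }

  // Return a buffer to the pool for later reuse.
  release(buf: ArrayBuffer): void {
    const list = this.free.get(buf.byteLength) ?? [];
    list.push(buf);
    this.free.set(buf.byteLength, list);
  }
}
```

Keying by exact size is the simplest policy; a real pool might also bucket sizes or cap the free list to bound memory.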
The fun parts
Softmax was satisfying: max for numerical stability, then exp, then normalize. Layer norm and attention scores followed. FFT was interesting—Cooley-Tukey, bit reversal, butterfly ops. All on the CPU for now, but the API is there.
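The three softmax steps above fit in a few lines. A plain-JS sketch of the stable formulation (illustrative, not the library's kernel):

```typescript
// Numerically stable softmax: subtract the max before exponentiating
// so exp() never overflows, then normalize to a probability vector.
function softmax(xs: number[]): number[] {
  const m = Math.max(...xs);                 // shift for stability
  const exps = xs.map((x) => Math.exp(x - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}
```

The shift changes nothing mathematically — it multiplies numerator and denominator by the same e⁻ᵐ — but it keeps every exponent ≤ 0, so nothing overflows even for large logits.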
Conv2D and pooling are implemented on the CPU as well. They work, but a real GPU implementation would need proper 2D/3D dispatch and more thought about memory layout.
Is it worth it?
For learning, yes. I understand WebGPU and WGSL much better now. For production, it depends. If you need a small, dependency-free GPU math lib, it might fit. If you need full training, use TensorFlow.js or a similar library.
The repo is accel-gpu on GitHub if you want to poke around. The shaders live in src/kernels/shaders.ts.