A gift to the world, WebGPU promises to bring cutting-edge GPU compute capabilities to the web - to all consumer platforms everywhere, with a shared codebase.
Its predecessor, WebGL, fantastic in its own right, sorely lacked compute shader functionality - limiting the applications it could support.
WGSL - the WebGPU shader/compute language - borrows from the best in its domain: Rust and GLSL.
This tutorial fills a gap in documentation I felt when I was learning to work with WebGPU: I wanted a simple starting point for using a compute shader to calculate data for the vertex and fragment shaders.
The single-file HTML with all the code explained in this tutorial can be found at https://github.com/alexey-dc/webgpu_html/blob/main/000_compute_vertex_fragment.html - read on for the breakdown.
Here's that HTML running on my domain for a single-click demo: https://alexey-dc.com/wgsl_demo (requires a WebGPU-enabled browser like Chrome or Edge: https://caniuse.com/webgpu).
High level setup
This is a particle simulation - that is, it evolves over time, in discrete timesteps.
Time is tracked in JS/on the CPU, passed in as a (float) uniform to the GPU.
The particle data is managed entirely on the GPU - though there is still a handshake with the CPU that allows allocating the memory and setting initial values. It's also possible to read the data back to the CPU, but that's left out of this tutorial.
The magic of this setup is that every particle is updated in parallel with all other particles, enabling compute and render speeds in the browser that were once mind-boggling (the parallelization maxes out at the number of cores on the GPU; dividing the number of particles by the number of cores gives the true number of sequential steps each core runs per update).
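All the JS snippets below assume a GPUDevice (device) and a configured canvas context (context) already exist. As a rough sketch of that boilerplate - hedged, since the exact setup lives in the linked file - it looks like this:
/* Assumes an async context; request an adapter (physical GPU) and a device (logical connection) */
if (!navigator.gpu) {
  throw new Error("WebGPU is not supported in this browser")
}
const adapter = await navigator.gpu.requestAdapter()
const device = await adapter.requestDevice()
/* Configure the canvas so render passes can draw to it */
const canvas = document.querySelector("canvas")
const context = canvas.getContext("webgpu")
context.configure({
  device,
  format: navigator.gpu.getPreferredCanvasFormat()
})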
Bindings
The WebGPU mechanism for data exchange between the CPU and the GPU is bindings - a JS Array (like a Float32Array) can be "bound" to a memory location in WGSL via a WebGPU Buffer. The WGSL memory location is identified by two integers: a group number and a binding number.
In our case, both the compute shader and the vertex shader rely on two data bindings: time and particle positions.
Time - uniforms
The uniform definition exists in both the compute shader (https://github.com/alexey-dc/webgpu_html/blob/main/000_compute_vertex_fragment.html#L43) and the vertex shader (https://github.com/alexey-dc/webgpu_html/blob/main/000_compute_vertex_fragment.html#L69) - the compute shader updates position, and the vertex shader updates color based on time.
Let's take a look at the binding setup in JS and WGSL, starting with the compute shader.
const computeBindGroup = device.createBindGroup({
/*
see computePipeline definition at
https://github.com/alexey-dc/webgpu_html/blob/main/000_compute_vertex_fragment.html#L102
it allows linking a JS string with WGSL code to WebGPU
*/
layout: computePipeline.getBindGroupLayout(0), // group number 0
entries: [{
// time bound at binding number 0
binding: 0,
resource: {
/*
for reference, buffer declared as:
const timeBuffer = device.createBuffer({
  size: Float32Array.BYTES_PER_ELEMENT,
  usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST
})
https://github.com/alexey-dc/webgpu_html/blob/main/000_compute_vertex_fragment.html#L129
*/
buffer: timeBuffer
}
},
{
// particle position data at binding number 1 (still in group 0)
binding: 1,
resource: {
buffer: particleBuffer
}
}]
});
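For reference, the computePipeline above is built from the WGSL compute source (kept in a JS template string). A condensed sketch - computeShaderCode is a placeholder name for that string, and the entry point name main matches the shader below:
/* Sketch: compile the WGSL compute source and build a pipeline from it */
const computeModule = device.createShaderModule({ code: computeShaderCode })
const computePipeline = device.createComputePipeline({
  layout: 'auto', // let WebGPU derive bind group layouts from the WGSL
  compute: {
    module: computeModule,
    entryPoint: 'main'
  }
})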
And the corresponding declarations in the compute shader:
// From the compute shader - similar declaration in vertex shader
@group(0) @binding(0) var<uniform> t: f32;
@group(0) @binding(1) var<storage, read_write> particles : array<Particle>;
Importantly, we bind the JS-side timeBuffer to WGSL by matching the group and binding numbers in JS and WGSL. This gives us the power to control the variable's value from JS:
/* Just need 1 element in the array since time is a single float value */
const timeJs = new Float32Array(1)
let t = 5.3
/* Plain JS, just set the value */
timeJs.set([t], 0)
/* Pass the data from CPU/JS to GPU/WGSL */
device.queue.writeBuffer(timeBuffer, 0, timeJs);
Particle positions - WGSL storage
We store the particle positions directly in GPU-accessible memory, and update them there in parallel - relying on the massively multi-core architecture of the GPU.
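Here's a sketch of the CPU-side handshake mentioned earlier - allocating the storage buffer and seeding initial values. The buffer name matches the bind group code above; the seeding details are illustrative, not a verbatim copy of the linked file:
/* Allocate GPU memory: one vec2<f32> (x, y) per particle */
const particleBuffer = device.createBuffer({
  size: PARTICLE_COUNT * 2 * Float32Array.BYTES_PER_ELEMENT,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST
})
/* Optionally seed initial positions from the CPU - same pattern as the time uniform */
const initialPositions = new Float32Array(PARTICLE_COUNT * 2) // zeroed by default
device.queue.writeBuffer(particleBuffer, 0, initialPositions)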
The parallelization is orchestrated with the help of a workgroup size, declared in the compute shader:
@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) global_id : vec3<u32>) {
// ...
}
The @builtin(global_invocation_id) global_id : vec3<u32> value gives a thread identifier. By definition, global_invocation_id = workgroup_id * workgroup_size + local_invocation_id - which means it can be used as a particle index.
For example, if we have 10k particles and a workgroup_size of 64, we'll need to dispatch Math.ceil(10000/64) workgroups. We'll explicitly tell the GPU to do that amount of work each time we trigger a compute pass from JS:
computePass.dispatchWorkgroups(Math.ceil(PARTICLE_COUNT / WORKGROUP_SIZE));
If PARTICLE_COUNT == 10000 and WORKGROUP_SIZE == 64, we'll launch 157 workgroups (10000/64 = 156.25). In each workgroup, local_invocation_id will range from 0 to 63, while workgroup_id will range from 0 to 156. We'll end up doing slightly more calculations than there are particles, since 157 * 64 = 10048. We deal with the overflow by discarding the extraneous invocations.
Here's what the compute shader ends up looking like with those considerations:
@compute @workgroup_size(${WORKGROUP_SIZE})
fn main(@builtin(global_invocation_id) global_id : vec3<u32>) {
let index = global_id.x;
// Discard extra computations due to workgroup grid misalignment
if (index >= arrayLength(&particles)) {
return;
}
/* Convert integer index to float so we can compute position updates based on index (and time)*/
let fi = f32(index);
particles[index].position = vec2<f32>(
/* No grand intent behind the formulas - just an example of using time+index */
cos(fi * 0.11) * 0.8 + sin((t + fi)/100)/10,
sin(fi * 0.11) * 0.8 + cos((t + fi)/100)/10
);
}
These values will persist across compute passes, because particles is declared as a storage var.
Reading particle positions from the compute shader in the vertex shader
To read the particle positions in the vertex shader from the compute shader, we'll need a read-only view into the data, since only compute shaders are allowed to write to storage.
Here's the WGSL declaration for that:
@group(0) @binding(0) var<uniform> t: f32;
@group(0) @binding(1) var<storage> particles : array<vec2<f32>>;
/*
Or equivalent:
@group(0) @binding(1) var<storage, read> particles : array<vec2<f32>>;
*/
Attempting to re-use the same read_write style from the compute shader would just error out:
var with 'storage' address space and 'read_write' access mode cannot be used by vertex pipeline stage
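With the read-only binding in place, the vertex shader can index into the array directly. Here's a sketch, assuming one vertex per particle as in the draw call further below - the entry point name vertexMain is my own, and the shader in the linked file also derives a color, omitted here:
@vertex
fn vertexMain(@builtin(vertex_index) vi : u32) -> @builtin(position) vec4<f32> {
  /* Read the position the compute shader wrote for this particle */
  let p = particles[vi];
  return vec4<f32>(p, 0.0, 1.0);
}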
Note that the binding numbers in the vertex shader don't have to match the compute shader binding numbers - they just need to match the bind group declaration used for the vertex shader:
const renderBindGroup = device.createBindGroup({
layout: pipeline.getBindGroupLayout(0),
entries: [{
binding: 0,
resource: {
buffer: timeBuffer
}
},
{
binding: 1,
resource: {
buffer: particleBuffer
}
}]
});
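For completeness, a sketch of how the render pipeline referenced by renderBindGroup might be created - hedged: the entry point names and the point-list topology are assumptions consistent with drawing one vertex per particle, not a verbatim copy of the linked file:
/* Sketch: build the render pipeline from the vertex + fragment WGSL source */
const renderModule = device.createShaderModule({ code: renderShaderCode })
const pipeline = device.createRenderPipeline({
  layout: 'auto',
  vertex: { module: renderModule, entryPoint: 'vertexMain' },
  fragment: {
    module: renderModule,
    entryPoint: 'fragmentMain',
    targets: [{ format: navigator.gpu.getPreferredCanvasFormat() }]
  },
  primitive: { topology: 'point-list' } // one vertex per particle
})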
I opted for binding: 2 in the GitHub sample code (https://github.com/alexey-dc/webgpu_html/blob/main/000_compute_vertex_fragment.html#L70) - just as a matter of exploring the boundaries of the constraints imposed by WebGPU.
Running the simulation step by step
After all the setup is in place, the update-and-render loop is orchestrated in JS:
/* Start simulation at t = 0*/
let t = 0
function frame() {
/*
  Use a constant integer timestep for simplicity - the simulation
  advances deterministically, by the same amount each frame.
*/
t += 1
timeJs.set([t], 0)
device.queue.writeBuffer(timeBuffer, 0, timeJs);
// Compute pass to update particle positions
const computePassEncoder = device.createCommandEncoder();
const computePass = computePassEncoder.beginComputePass();
computePass.setPipeline(computePipeline);
computePass.setBindGroup(0, computeBindGroup);
// Important to dispatch the right number of workgroups to process all particles
computePass.dispatchWorkgroups(Math.ceil(PARTICLE_COUNT / WORKGROUP_SIZE));
computePass.end();
device.queue.submit([computePassEncoder.finish()]);
// Render pass
const commandEncoder = device.createCommandEncoder();
const passEncoder = commandEncoder.beginRenderPass({
colorAttachments: [{
view: context.getCurrentTexture().createView(),
clearValue: { r: 0.0, g: 0.0, b: 0.0, a: 1.0 },
loadOp: 'clear',
storeOp: 'store',
}]
});
passEncoder.setPipeline(pipeline);
passEncoder.setBindGroup(0, renderBindGroup);
passEncoder.draw(PARTICLE_COUNT);
passEncoder.end();
device.queue.submit([commandEncoder.finish()]);
requestAnimationFrame(frame);
}
frame();
Final words
WebGPU unlocks the power of massively parallel GPU computations in the browser.
It operates in passes - each pass having localized variables, enabled through pipelines with memory bindings (bridging CPU memory and GPU memory).
Compute passes allow orchestrating parallel workloads through workgroups.
While it does require some heavy setup, in my humble opinion the local binding/state style is a huge improvement over the global state model of WebGL - making it much easier to work with, while also finally bringing the power of GPU compute to the web.