Last time I created a simple autograd. It's interesting but also not super usable, because it can't take advantage of all the compute the computer has. CPU loops are slow, and GPUs can do many parallel ops at a time. I actually started down this road first, before trying CUDA. What I found was that WebGPU is much, much more difficult to get right because WGSL is still pretty early and limited, meaning you can't do the things you want easily. You also have to deal with async operations and a lot of ceremony to set up the calls. If you are trying to follow along, I highly recommend doing the CUDA version first, especially if you are not already comfortable with compute shaders in general.
WGSL Compute Shader Basics
With the advent of WebGPU the web finally has access to compute shaders. "Shader" is GPU parlance for a program that runs on the GPU. These are typically small but do lots of data crunching. It's a "shader" because they were originally (and still are!) used to color in pixel values. They also used to be very inflexible, having only a few functions and fixed pipelines, but modern GPUs can do general-purpose computing (GPGPU) via compute shaders, which make very few assumptions about how the data is used. In essence you upload a shader program to the GPU along with some big arrays, do some parallel computation on them, and send the results back to the javascript program.
The basis of this is the GPUDevice
which is the interface where everything lives. You can get access to it from navigator.gpu.requestAdapter()
which gives you an adapter. The adapter lets you get the device with adapter.requestDevice()
(you can also choose the GPU if you have more than one).
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter.requestDevice();
From the device we have access to all the WebGPU APIs. Setting up the shader is quite complicated though. We need a few things:
- The shader code
- The binding layout
- The pipeline
- The bindings
- The encoder
- Getting our data
The shader code
WebGPU shaders are written in a language called WGSL (WebGPU Shading Language). It's reminiscent of Rust but geared toward shader programs. This step is, I think, the easiest to conceptualize. We pass a string of WGSL to the device
and it will compile it, giving us warnings if something doesn't work, and upload it to the GPU so it can run. This is done with device.createShaderModule({ code })
const module = device.createShaderModule({
  code: `
    @group(0) @binding(0)
    var<storage, read> valuesA: array<f32>;
    @group(0) @binding(1)
    var<storage, read> valuesB: array<f32>;
    @group(0) @binding(2)
    var<storage, read_write> output: array<f32>;

    @compute @workgroup_size(8, 8)
    fn main(
      @builtin(global_invocation_id) global_id: vec3<u32>
    ){
      if(global_id.x >= arrayLength(&valuesA)) {
        return;
      }
      let idx = global_id.x;
      output[idx] = valuesA[idx] + valuesB[idx];
    }`
});
Don't worry too much about the code yet. This is just how we "compile" it.
Binding layout
The bindings are basically how we pass data into the shader. These are fairly complicated because they involve a lot of ceremony. The first part is setting up the layout, which is basically a schema for what gets passed in: an ordered list of buffers and how they get used. It might look something like:
const bindGroupLayout = device.createBindGroupLayout({
entries: [
{
binding: 0,
visibility: GPUShaderStage.COMPUTE,
buffer: {
type: "read-only-storage"
}
},
{
binding: 1,
visibility: GPUShaderStage.COMPUTE,
buffer: {
type: "read-only-storage"
}
},
{
binding: 2,
visibility: GPUShaderStage.COMPUTE,
buffer: {
type: "storage"
}
}
]
});
The binding
number corresponds to the index it gets bound to in the shader program. The visibility tells the GPU where in the pipeline it's available. Since we're just dealing with compute shaders we only have one value. The last gives the GPU hints about how the buffer is used so it can try to make it more optimal (these are not optional). In this example we have 3 bindings. The first 2 would be typical of inputs because inputs only need to be read. The 3rd is typical of an output because outputs are written to. You can think of it like 2 parameters and a return value.
The pipeline
Back in the before times GPUs just had fixed functions you could call. They later evolved to take programmable shaders like the one above, where you could write your own code to run on the GPU. But there were only 2 main types, vertex and fragment/pixel shaders; they always came in a specific order, and rasterization, clipping and blending happened at predefined stages in the pipeline. This is still mostly true for graphics, however pipelines are getting more configurable so you can decide to add extra steps or skip others. Compute shaders are basically do-what-you-want. As a result WebGPU requires us to define the pipeline, which creates extra ceremony not found in WebGL. At least for the compute shader pipeline, this needs two things: a bind group layout and the shader module we created above.
const pipelineLayout = device.createPipelineLayout({
bindGroupLayouts: [bindGroupLayout]
});
const pipeline = device.createComputePipeline({
layout: pipelineLayout,
compute: {
module: module,
entryPoint: "main"
}
});
We need to set the entryPoint
for the shader module so it knows where to start executing.
These are all of the parts that need to be setup once per shader. The rest will need to be done per call to the shader.
Bindings
For each binding we need to allocate a buffer of data. We'll copy data into this buffer and then push it up to the GPU.
const bindGroupEntries = [];
//first parameter
const input1 = new Float32Array([/*some data*/]);
const gpuBuffer1 = device.createBuffer({ size: input1.byteLength, usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST });
bindGroupEntries.push({
binding: 0,
resource: {
buffer: gpuBuffer1
}
});
device.queue.writeBuffer(gpuBuffer1, 0, input1);
//second parameter
const input2 = new Float32Array([/*some data*/]);
const gpuBuffer2 = device.createBuffer({ size: input2.byteLength, usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST });
bindGroupEntries.push({
binding: 1,
resource: {
buffer: gpuBuffer2
}
});
device.queue.writeBuffer(gpuBuffer2, 0, input2);
We need our data in typed arrays. Even if you intend to use structs, the data needs to be in ArrayBuffers (there are also some byte padding issues that we thankfully don't need to deal with yet). We create a GPUBuffer, which is basically a reference to GPU memory; we need to tell it how big it is and how it's going to be used. We also need to keep track of the bind group entries. These correspond with the bind group layout, so we expect there will be 3 total using the example above. Actually writing the data to the GPU is tedious because we have to deal with "mapped" and "unmapped" buffers, which synchronize who is allowed (between the CPU and GPU) to touch the data to prevent races. Thankfully there is a helpful device.queue.writeBuffer
which simplifies this. It takes the GPUBuffer
, an offset to start at, and the data to write, and it will push the data up to the GPU.
Getting data back is not so simple. We need to define an extra intermediate buffer to go along with the output buffer. This is used to shuttle data across to the CPU side and is an extra step WebGPU requires.
const outputBuffer = device.createBuffer({
size: outputByteLength,
usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC
});
bindGroupEntries.push({
binding: 2,
resource: {
buffer: outputBuffer
}
});
const stagingBuffer = device.createBuffer({
size: outputByteLength,
usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST
});
The outputBuffer
is the one actually used in the shader. The stagingBuffer
is an intermediary we'll worry about later. It's not part of the bindings because it's not part of the shader code.
const bindGroup = device.createBindGroup({
layout: bindGroupLayout,
entries: bindGroupEntries
});
The final thing we'll do is create the BindGroup
itself which contains all of the bind group data we just made.
Encoder
Now that we have a pipeline and our data we can actually invoke it. We first start with the CommandEncoder
which is an object that serializes the instructions we will give the GPU.
const commandEncoder = device.createCommandEncoder();
Then we need a PassEncoder
which gives the GPU instructions to use a particular pipeline, bind group and then start processing it with some configuration of threads.
const passEncoder = commandEncoder.beginComputePass();
passEncoder.setPipeline(pipeline);
passEncoder.setBindGroup(0, bindGroup);
passEncoder.dispatchWorkgroups(4);
passEncoder.end();
/* Some buffer stuff happens here (see next section) */
const commands = commandEncoder.finish();
device.queue.submit([commands]);
Setting the pipeline and bind group are, I think, somewhat self-explanatory. The GPU is a big state machine so we need to set the active pipeline and bind group. dispatchWorkgroups
basically follows a similar convention to how CUDA did it. Each workgroup is actually a virtual "cube" of threads as defined by the shader (see previous chapter). This is the actual command that "runs" the pipeline. After that we end
which I assume cleans stuff up or resets a device state of some sort.
Next we finish
the command encoder, which encodes the commands, and then we send them to the GPU with device.queue.submit.
Getting our data back
This is probably the most confusing part if you're just reading the code. As mentioned above, the buffers involve a bit of ceremony, and while there's a simple way to get data in, getting data out requires us to do the whole dance with a staging buffer.
In the above code I commented a section saying some "buffer stuff happens here". This is that "buffer stuff:"
commandEncoder.copyBufferToBuffer(outputBuffer, 0, stagingBuffer, 0, outputByteLength);
We take the data in outputBuffer
(which is on the GPU) and then copy it to stagingBuffer
(also on the GPU). I'm not sure the exact reason for this limitation but if you recall the output and staging buffers had different usage flags. We cannot "map" (transfer ownership to javascript land) the output buffer and we can't set MAP_READ
flag on it because it can only be paired with COPY_DST
. So to get data out of outputBuffer
we copy it into the stagingBuffer
which we can map.
//after device.queue.submit()
await stagingBuffer.mapAsync(GPUMapMode.READ, 0, outputByteLength);
const copyArrayBuffer = stagingBuffer.getMappedRange(0, outputByteLength);
The first step locks the buffer so that the javascript can read that data (and the GPU can't do something to clobber it). We then can extract the data from the GPU buffer with getMappedRange
plus an offset and size. Again, we do another strange copy step:
const data = copyArrayBuffer.slice(0);
stagingBuffer.unmap();
const outputResults = new Float32Array(data);
arrayBuffer.slice(0)
is just a convention to copy the array buffer into another array buffer. We do this because copyArrayBuffer
gets linked to stagingBuffer
. When mapped javascript has access to it. But when we unmap
to release it back to the GPU we lose access to copyArrayBuffer
with it (it becomes "detached"). Basically the copyArrayBuffer
is just a view over the shared memory and when we flip the switch the underlying memory is now owned by the GPU and we can no longer access it. Since we do want to clean up and reset that buffer we need to make a copy that we fully own which is what data
represents. We then make a Float32Array
view over that copied array buffer, which becomes our final output value that we can now freely pass around in javascript.
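If you're curious what "detached" looks like in practice, the old mapped range reports a byteLength of 0 once the buffer is unmapped (a small illustration following the code above):
console.log(data.byteLength);            // the real size, we own this copy
console.log(copyArrayBuffer.byteLength); // 0, detached after stagingBuffer.unmap()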
It's admittedly a lot to take in if you've never seen it.
WGSL Shader
So that was just setting up a pipeline and getting data. Let's go back to step one where we actually wrote the shader code:
@group(0) @binding(0)
var<storage, read> valuesA: array<f32>;
@group(0) @binding(1)
var<storage, read> valuesB: array<f32>;
@group(0) @binding(2)
var<storage, read_write> output: array<f32>;

@compute @workgroup_size(8, 8)
fn main(
  @builtin(global_invocation_id) global_id: vec3<u32>
){
  if(global_id.x >= arrayLength(&valuesA)) {
    return;
  }
  let idx = global_id.x;
  output[idx] = valuesA[idx] + valuesB[idx];
}
Thankfully I think this is a little easier to understand. @group()
and @binding()
correspond to the bind group and the binding within the group. We only have one group so it's always group 0, and the binding numbers correspond to the binding numbers in the layout and the passed-in bindings. So valuesA
gets bound to the first input buffer and valuesB
gets bound to the second input buffer. In the shader these are arrays of float 32s. output
is the output buffer. Within the shader they have their own usages, which I won't really get into other than to say we read the inputs and write to the output, so they need to be set up as such.
Next we mark the function as part of a compute shader with @compute
and we set the workgroup size with @workgroup_size(x,y,z)
. If you recall, it's that virtual "cube" of threads. To recap, 8x8 is 64 threads, just like 64x1x1. Dimensions can't get too big though, so sometimes you need to make the shape more square.
Our function is named "main" which corresponds to what we told the pipeline was the entrypoint. It takes one parameter which will come from WebGPU itself which is why it has that @builtin
annotation. This is the global invocation id. Workgroup "blocks" have the x,y,z size set in the shader but you might remember we called dispatchWorkgroups
with the number of workgroups. So we actually have 4 8x8x1 "blocks". The invocation id is basically an index into the thread blocks; it represents the thread index relative to all of the threads across all workgroups (you can also request the local id, which is the id within the workgroup block). For now over-allocation is fine; we won't worry about the exact numbers. However, we do need the id, because all threads run in parallel so we need some way to tell each thread which part of the data to process.
Each thread will handle one index of the arrays. So if we had 40 elements, we throw 40 threads at it, each one taking the index corresponding to itself so they don't overlap and clobber each other. The rest is a basic add operation over two numbers, storing the result in output.
The final note is the if statement. If the id falls outside the length of the array we just quit because there's nothing to do. If you try to read outside the array it will actually clamp to the end index, and a bunch of threads will try to write the same value, which can be bad. Note that you can't completely right-size the thread count either. The GPU gives you threads in blocks of, say, 16, so you can request a number but it will round up to the nearest block it can allocate, say 48. You will likely have extra threads whether you want them or not, so we always need to make sure they aren't hurting us.
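If you did want to size the dispatch to the data instead of over-allocating, a rough sketch would look like this (the 64 comes from the 8x8x1 workgroup_size in the shader):
const elementCount = input1.length;                  // elements, not bytes
const workgroupCount = Math.ceil(elementCount / 64); // 64 threads per workgroup
passEncoder.dispatchWorkgroups(workgroupCount);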
Making the kernels
Using the foundation from above I made a function that takes in some shader code and some inputs and makes it into a function.
export function compileKernel({
device,
code,
params,
}) {
const module = device.createShaderModule({ code });
const bindGroupLayout = getBindGroupLayoutForParams(device, params);
const pipeline = device.createComputePipeline({
layout: device.createPipelineLayout({
bindGroupLayouts: [bindGroupLayout]
}),
compute: {
module,
entryPoint: "main"
}
});
return async function ({ inputs, outputs }) {
const bindGroupEntries = [];
for (let i = 0; i < params.inputs.length; i++) {
const inputBuffer = inputs[i];
const param = params.inputs[i];
switch (param.type) {
case "array": {
const gpuBuffer = device.createBuffer({ size: inputBuffer.byteLength, usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST });
bindGroupEntries.push({
label: param.name,
binding: bindGroupEntries.length,
resource: {
buffer: gpuBuffer
}
});
device.queue.writeBuffer(gpuBuffer, 0, inputBuffer);
break;
}
}
}
const outputBuffers = new Array(params.outputs.length);
const stagingBuffers = new Array(params.outputs.length);
for (let i = 0; i < params.outputs.length; i++) {
outputBuffers[i] = device.createBuffer({
label: params.outputs[i].name,
size: outputs[i].byteLength,
usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC
});
stagingBuffers[i] = device.createBuffer({
label: `${params.outputs[i].name}-staging`,
size: outputs[i].byteLength,
usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST
});
bindGroupEntries.push({
label: params.outputs[i].name,
binding: bindGroupEntries.length,
resource: {
buffer: outputBuffers[i]
}
});
}
const bindGroup = device.createBindGroup({
layout: bindGroupLayout,
entries: bindGroupEntries
});
const commandEncoder = device.createCommandEncoder();
const passEncoder = commandEncoder.beginComputePass();
passEncoder.setPipeline(pipeline);
passEncoder.setBindGroup(0, bindGroup);
passEncoder.dispatchWorkgroups(Math.ceil(Math.max(...inputs.filter(b => b.byteLength).map(b => b.byteLength)) / 8));
passEncoder.end();
for(let i = 0; i < stagingBuffers.length; i++){
commandEncoder.copyBufferToBuffer(outputBuffers[i], 0, stagingBuffers[i], 0, outputs[i].byteLength);
}
const commands = commandEncoder.finish();
device.queue.submit([commands]);
const outputResults = new Array(outputs.length);
for(let i = 0; i < outputs.length; i++){
await stagingBuffers[i].mapAsync(GPUMapMode.READ, 0, outputs[i].byteLength);
const copyArrayBuffer = stagingBuffers[i].getMappedRange(0, outputs[i].byteLength);
const data = copyArrayBuffer.slice(0);
stagingBuffers[i].unmap();
if(params.outputs[i].subtype === "f32"){
outputResults[i] = new Float32Array(data);
} else if(params.outputs[i].subtype === "u32"){
outputResults[i] = new Uint32Array(data);
}
}
return outputResults;
}
}
Nothing here is new; it's just doing a few transforms to make a pipeline for each op, for example automating the addition of multiple input or output buffers and typing the outputs. In particular, the first part of the function sets everything up once and the returned function can be called multiple times with different data. There is one function of note here, getBindGroupLayoutForParams
.
function getBindGroupLayoutForParams(device, params){
const entries = [];
if(params.inputs){
for(let i = 0; i < params.inputs.length; i++){
entries.push({
binding: entries.length,
visibility: GPUShaderStage.COMPUTE,
buffer: {
type:"read-only-storage"
}
});
}
}
if(params.outputs){
for(let i = 0; i < params.outputs.length; i++){
entries.push({
binding: entries.length,
visibility: GPUShaderStage.COMPUTE,
buffer: {
type: "storage"
}
});
}
}
return device.createBindGroupLayout({
entries
});
}
This just adds a layout entry for each parameter based on whether it was an input or an output. Each op
looks like this:
const addOp = {
forward: {
params: {
inputs: [
{ name: "valuesA", type: "array" },
{ name: "valuesB", type: "array" },
],
outputs: [
{ name: "output", type: "array", subtype: "f32" }
]
},
code: `
@group(0) @binding(0)
var<storage, read> valuesA: array<f32>;
@group(0) @binding(1)
var<storage, read> valuesB: array<f32>;
@group(0) @binding(2)
var<storage, read_write> output: array<f32>;
@compute @workgroup_size(8, 8)
fn main(
@builtin(global_invocation_id) global_id: vec3<u32>
){
let idx = global_id.x;
if global_id.x < arrayLength(&valuesA) {
output[idx] = valuesA[idx] + valuesB[idx];
}
}
`
},
backward: {
params: {
inputs: [
{ name: "gradA", direction: "in", type: "array" },
{ name: "gradB", direction: "in", type: "array" },
{ name: "gradResult", direction: "in", type: "array" }
],
outputs: [
{ name: "gradAOut", direction: "out", type: "array", subtype: "f32" },
{ name: "gradBOut", direction: "out", type: "array", subtype: "f32" },
]
},
code: `
@group(0) @binding(0)
var<storage, read> gradA: array<f32>;
@group(0) @binding(1)
var<storage, read> gradB: array<f32>;
@group(0) @binding(2)
var<storage, read> gradResult: array<f32>;
@group(0) @binding(3)
var<storage, read_write> gradAOut: array<f32>;
@group(0) @binding(4)
var<storage, read_write> gradBOut: array<f32>;
@compute @workgroup_size(8, 8)
fn main(
@builtin(global_invocation_id) global_id: vec3<u32>
){
let idx = global_id.x;
if global_id.x < arrayLength(&gradA) { //buffers are the same size so pick one
gradAOut[idx] = gradA[idx] + gradResult[idx];
gradBOut[idx] = gradB[idx] + gradResult[idx];
}
}
`
}
}
And there's another helper to compile the different parts:
export function compileOp(device, op){
return {
forward: compileKernel({ device, code: op.forward.code, params: op.forward.params }),
backward: compileKernel({ device, code: op.backward.code, params: op.backward.params })
}
}
The params
don't even really need a name as they are indexed by position, but it helps with readability and the label can be used in debugging. Hopefully the setup makes sense. Each op has a forward and a backward operation. The backward is not the same as the forward and it's not always invoked when we do a pass, so it's a separate shader. The backward also takes in more parameters because it needs access to more things, and it has 2 outputs, one for each child of the binary op, as both will be filled in by the backward step. This could have been split into two shader calls but that would just be twice the overhead to run.
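For a concrete picture of how a compiled op gets called, here's a rough usage sketch (the values are made up):
const addKernels = compileOp(device, addOp);

const a = new Float32Array([1, 2, 3, 4]);
const b = new Float32Array([5, 6, 7, 8]);
const [sum] = await addKernels.forward({
  inputs: [a, b],
  outputs: [{ byteLength: a.byteLength }]
});
// sum is a Float32Array: [6, 8, 10, 12]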
Adding the same node
One thing that we found last time was that adding a node to itself somewhere in the graph can cause problems since it updates itself. As with the CUDA implementation there is no way to know that 2 GPU buffers reference the same thing. And since gradA
and gradB
are copies, their gradients won't add up like they're supposed to.
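To make the expected behavior concrete (a sketch using the tensor API shown below; the values are made up):
const t = new WGPUTensor({ values: new Float32Array([3]), shape: [1], device });
const s = await t.add(t);  // forward: s = [6]
// d(t + t)/dt = 2, so after the backward pass t's gradient should be [2] per element,
// not [1] like a naive "write each side separately" update would give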
The first time I tried this I added a boolean parameter to the shader. This was foolish since in the shader both sides of the if are executed, so it's constant overhead. Instead, I realized that both buffers being the same is pretty uncommon, so we can just check for it in javascript and add the two results together when it happens. This works the same as the CUDA code.
//wgpu-tensor.js
//inside of the add method
result[backward] = async () => {
const [thisGradient, otherGradient] = await this.#kernels.add.backward({
inputs: [this.gradient, other.gradient, result.gradient],
outputs: [{ byteLength: this.gradient.byteLength }, { byteLength: other.gradient.byteLength }]
});
if (this === other) { //check if same node
const [combinedGrad] = await this.#kernels.add.forward({
inputs: [thisGradient, otherGradient],
outputs: [{ byteLength: this.#values.byteLength }]
});
this.#gradient = combinedGrad;
} else {
this.gradient = thisGradient;
other.gradient = otherGradient;
}
}
So this is kinda gross and we'll need to do this for all binary op shaders. Now that we have the kernel code we just need to wire it up to the Tensor
.
//wgpu-tensor.js
//inside of WGPUTensor class
async add(other) {
if (other.totalLength != this.totalLength) throw new Error(`Tensor not the right length, argument was ${other.totalLength}, needs to be ${this.totalLength}`);
const [resultValues] = await this.#kernels.add.forward({
inputs: [
this.#values,
other.values
],
outputs: [
{ byteLength: this.#values.byteLength }
]
});
const result = new WGPUTensor({
values: resultValues,
shape: this.#shape,
children: [this, other],
op: "+",
device: this.#device
});
result[backward] = async () => {
const [thisGradient, otherGradient] = await this.#kernels.add.backward({
inputs: [this.gradient, other.gradient, result.gradient],
outputs: [{ byteLength: this.gradient.byteLength }, { byteLength: other.gradient.byteLength }]
});
if (this === other) {
const [combinedGrad] = await this.#kernels.add.forward({
inputs: [thisGradient, otherGradient],
outputs: [{ byteLength: this.#values.byteLength }]
});
this.#gradient = combinedGrad;
} else {
this.gradient = thisGradient;
other.gradient = otherGradient;
}
};
return result;
}
Almost the same as before but we just outsource the calculation to the kernel. We also need to pass in the device
to the tensor so it's available. Originally I just wanted one global device but this became a problem when running unit tests. Deno's test runner doesn't like when you set up GPU resources and don't tear them down. Unfortunately, the most reliable way to get rid of everything is to destroy() the device. In the end, I think this is correct: it's bad to have dangling resources in the test, even if it doesn't completely make sense during a normal run where the device would just be a singleton. It's an extra parameter, but it makes it easier to clean up our mess in the end.
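For reference, a test might look roughly like this (a sketch; the assertions and exact constructor arguments are illustrative):
Deno.test("WGPUTensor add", async () => {
  const adapter = await navigator.gpu.requestAdapter();
  const device = await adapter.requestDevice();
  try {
    const a = new WGPUTensor({ values: new Float32Array([1, 2]), shape: [2], device });
    const b = new WGPUTensor({ values: new Float32Array([3, 4]), shape: [2], device });
    const result = await a.add(b);
    // ...assert on result.values here...
  } finally {
    device.destroy(); // tear down GPU resources so the test runner doesn't complain
  }
});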
Reduction ops
Oh boy. If it was difficult before, it's lunatic difficulty now. I had to find out the hard and painful way how to get this op to work. The biggest problem is that we need to allocate an array to hold the dimensional indices. WGSL cannot currently do allocation, nor can it do dynamically sized arrays (a struct with an array as its final property can be dynamically sized, but only as a binding; you cannot have a dynamic array in WGSL code). To add insult to injury we can't do recursion either. I tried to see if I could re-write it without allocation but I couldn't find a way, and neither did Copilot when I asked. After looking it over I'm somewhat convinced this actually does make sense, as you'd need to somehow store the dimensional indices either in an array or on the stack, and in either case we can't do it. So are we stuck?
Well, we basically have to re-invent allocation. We start with an "arena", which is a big block of memory representing scratch space we can use for dynamic algorithms. This is also tricky because everything is parallel, so every thread needs its own space. To do this well we need to figure out how much each thread will allocate and multiply that by the number of threads.
const outputLength = this.#shape.reduce((prod, x, idx) => {
return idx !== dimensionToReduce ? prod * x : prod;
}, 1);
const threadMemSize = ((this.#shape.length * 4) - 2) * 4; //manually calculated from shader :/
const memSize = threadMemSize * outputLength;
We allocate a shape 4 times, but 2 of those times are the newShape
which is one dimension smaller. Thankfully all we need are ints so we don't have to worry about conversion, and each u32 is 4 bytes so the whole thing is multiplied by 4.
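As a worked example (with a made-up shape), reducing dimension 1 of a [2, 3, 4] tensor gives:
const shape = [2, 3, 4];
const dimensionToReduce = 1;
const outputLength = 2 * 4;                          // 8 output elements, one thread each
const threadMemSize = ((shape.length * 4) - 2) * 4;  // (12 - 2) * 4 = 40 bytes per thread
const memSize = threadMemSize * outputLength;        // 320 bytes of arena space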
To simplify things for this case I added another parameter type called memory. It's the same as an output but it doesn't create a staging buffer or a typed copy; it's just empty memory on the GPU. I'll skip the implementation since it's difficult to show, but you could also just use an output for this, you just don't need to read it (although when starting out, the ability to read the memory as an output is helpful for debugging the implementation).
Let's look at the shader code:
const sumOp = {
forward: {
params: {
memory: [
{ name: "mem", type: "array", },
],
inputs: [
{ name: "shape", type: "array" },
{ name: "dimToReduce", type: "u32" },
{ name: "values", type: "array" },
],
outputs: [
{ name: "output", type: "array", subtype: "f32" }
]
},
code: `
@group(0) @binding(0)
var<storage, read_write> mem: array<u32>; //memory arena because we can't allocate in the shader
@group(0) @binding(1)
var<storage, read> shape: array<u32>;
struct U32 {
value: u32
}
@group(0) @binding(2)
var<storage, read> dim_to_reduce: U32;
@group(0) @binding(3)
var<storage, read> values: array<f32>;
@group(0) @binding(4)
var<storage, read_write> output: array<f32>;
fn remove_at_index(source_ptr: u32, length: u32, destination_ptr: u32, index_to_remove: u32) {
var source_index = 0u;
var destination_index = 0u;
while(destination_index < length - 1){
if(source_index != index_to_remove){
mem[destination_ptr + destination_index] = mem[source_ptr + source_index];
source_index++;
destination_index++;
} else {
source_index++;
}
}
}
fn insert_at_index(source_ptr: u32, length: u32, destination_ptr: u32, index_to_insert: u32, value: u32) {
var source_index = 0u;
var destination_index = 0u;
while(destination_index < length + 1){
if(destination_index != index_to_insert){
mem[destination_ptr + destination_index] = mem[source_ptr + source_index];
source_index++;
destination_index++;
} else {
mem[destination_ptr + destination_index] = value;
destination_index++;
}
}
}
fn get_dimensional_indices(flat_index: u32, shape_ptr: u32, shape_size: u32, destination_ptr: u32) {
var destination_index = destination_ptr;
var current_index = flat_index;
for(var i = 0u; i < shape_size; i++){
mem[destination_ptr + i] = current_index % mem[shape_ptr + i];
current_index = current_index / mem[shape_ptr + i];
}
}
fn get_flat_index(dimensional_index_ptr: u32, shape_ptr: u32, shape_size: u32) -> u32 {
var index = 0u;
for (var i = 0u; i < shape_size; i++)
{
index *= mem[shape_ptr + shape_size - 1 - i];
index += mem[dimensional_index_ptr + shape_size - 1 - i];
}
return index;
}
@compute @workgroup_size(8, 8)
fn main(
@builtin(global_invocation_id) global_id: vec3<u32>
){
let idx = global_id.x;
let shape_size = arrayLength(&shape);
var new_size = 1u;
for(var j = 0u; j < shape_size; j++){
if(j != dim_to_reduce.value){
new_size *= shape[j];
}
}
if idx < new_size {
let base_ptr = idx * ((shape_size * 4) - 2); //manual calc :/
var mem_ptr = base_ptr;
let shape_ptr = base_ptr;
for(var i = 0u; i < shape_size; i++){ //write shape to mem
mem[mem_ptr] = shape[i];
mem_ptr++;
}
let new_shape_ptr = mem_ptr;
remove_at_index(shape_ptr, shape_size, new_shape_ptr, dim_to_reduce.value);
mem_ptr += shape_size - 1;
let partial_dim_index_ptr = mem_ptr;
get_dimensional_indices(idx, new_shape_ptr, shape_size - 1, partial_dim_index_ptr);
mem_ptr += shape_size - 1;
for(var i = 0u; i < shape[dim_to_reduce.value]; i++){
let dim_index_ptr = mem_ptr;
insert_at_index(partial_dim_index_ptr, shape_size - 1, dim_index_ptr, dim_to_reduce.value, i);
let flat_index = get_flat_index(dim_index_ptr, shape_ptr, shape_size);
output[idx] += values[flat_index];
}
}
}
`
}
}
We have the new memory type and it comes first in the binding order. remove_at_index
, insert_at_index
, get_dimensional_indices
and get_flat_index
are the same as in the CUDA implementation. In fact, where that version used C pointers for arrays, here we're also using pointers of a sort; they're just indices into the big mem
pool and we keep track of them as u32
s. The important thing is the per-thread base offset, which prevents the threads from clobbering each other's working memory. We also need the gross and painful step of manually allocating by incrementing a memory pointer. It's not good-looking code, that's for sure. It's also a complete pain to debug. As I mentioned before, making mem
an output and printing it can help you figure out what's going on. One trick to help is to make the mem
buffer much bigger than necessary and write debug values into parts of memory that would not be written to. Just keep in mind everything is parallel and isolate a single thread like if idx == 0 { ... }
. Make sure you correctly count up how much memory you need too; this has to be done manually.
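For instance, a debugging write might look something like this (a sketch in the shader; the spare offsets are arbitrary and assume mem was allocated with extra room):
// debug sketch: stash intermediate values in spare arena space from a single thread
if (idx == 0u) {
    mem[1000u] = shape_size;
    mem[1001u] = new_size;
}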
One other thing to note is that we also can't pass in scalar values like dim_to_reduce
. These have to be boxed up into structs.
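On the javascript side the boxed scalar is still just a 4-byte buffer. A rough sketch of how it could be uploaded (assuming you handle a "u32" param type in the kernel compiler):
const dimBuffer = device.createBuffer({
  size: 4, // one u32, matching the U32 struct in the shader
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST
});
device.queue.writeBuffer(dimBuffer, 0, new Uint32Array([dimensionToReduce]));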
Extremely painful.
More ops
I'm not really going to belabor adding more ops as the basic setup can be followed for each. You should be careful about async functions, passing in the device and making sure you are using the right buffers (mul
needs the gradients, values and outputs so it has 6 parameters!). You can check the source code if you need a reference implementation.
Tips
Unfortunately debugging WebGPU code is just hard. There's no way to log things; the only way you can print stuff is by writing it to a buffer and then printing the buffer contents. However, for getting through syntax errors I found that Chrome's implementation gives much better error messages than Deno's. I recommend setting up a simple test webapp for your code so you can get those error messages to help you. Also, if you're copy/pasting code from the CPU version make sure you are adding the device and awaiting the operations.
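Another thing that can help is pulling the compiler messages off the shader module yourself with getCompilationInfo(), which the WebGPU spec provides (a small sketch):
const info = await module.getCompilationInfo();
for (const message of info.messages) {
  console.log(`${message.type} at line ${message.lineNum}: ${message.message}`);
}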
Next steps
From here it's mainly just an implementation chore of adding more ops. The performance probably still sucks, though: we would need very big blocks of work for the GPU's throughput to overcome the latency, especially with all the memory hauling back and forth, but at some point it will. There are a few things that might be addressable:
One is that we have to wait for each op, then do some javascript, and then do another op, with all the buffer dances in between. It would be much more efficient if there was a way to fuse ops into a single shader so we don't have to pay as much overhead. It's also not ideal that we need to await our math ops, because it's not ergonomic. If we constructed a graph and then calculated everything at the end that would be most efficient; this is how tensorflow js works. The downside is this would almost certainly require codegen, which is really gnarly to deal with. Still, calculating per op has its benefits when messing around, say, in a Jupyter notebook. This is called "eager" mode in tensorflow. So there's a bit of value in both.
We could also combine GPUTensor
and Tensor
and make a CPU kernel. I like the current split because it makes the code a lot easier to understand (which is the most important for this exercise) but for practical purposes it's probably better to combine them. Frameworks like pytorch let you choose and it might make sense as the smaller ops will almost certainly perform better on the CPU.
We could also look at other implementations. WASM is of course an obvious one: less latency and complexity than WebGPU, but also not nearly as fast even with SIMD instructions, and since there's still buffer copying it's slow. This is mainly why I didn't pursue it; I don't know if the niche makes enough sense if WebGPU is available. Tensorflow supports WebGL, which is a bit of a hack, but it was nice in the days before WebGPU to get some extra speed in browsers. However, it doesn't work on the server and WebGPU is much better suited for the task. Maybe one day WebNN will be a thing and can do most of the work for us.
The other thing is that I haven't looked at any tuning or optimization and haven't even tried to benchmark. All we know is that these implementations do the same thing, we have no clue if any of this complexity is actually worthwhile at all.