Building a JS pytorch clone with CUDA

I had actually wanted part 2 to be a WebGPU implementation and only started poking at the CUDA version toward the end. I didn't know anything about CUDA, I hadn't used C outside of school and hadn't played around with FFI, so I was expecting it to be hard. But the neat part was that, while it's still a fair bit of work, it's actually easier than using WebGPU! So much so that I decided to change the order of the series. Writing CUDA kernels is easier for a few reasons:

  • You might have already been exposed to C/C++
  • C/C++ has lots of resources and you can get AI chatbots to help; with WGSL you more or less have to read the spec because few people are writing about it.
  • You can console log from the GPU code!
  • Code can be synchronous
  • C/C++ is designed to manipulate memory, while JavaScript TypedArrays are painful

So what is CUDA? CUDA is Nvidia's proprietary platform for programming Nvidia GPUs. This means we get access to everything at full speed, not just what the WebGPU API gives us. This also means that macOS isn't going to work, but since Nvidia dominates the discrete GPU market it's still reasonable to optimize for their hardware. So let's give those beefy gamer PCs a workout!

We're going to do the kernels in CUDA and then use FFI to link them back to Deno. FFI (Foreign Function Interface) is a way in which we can use native compiled binaries in other code (it's also the reason whenever you start a programming tutorial nothing works on your machine which is ironic given my rant last time). This edition will be mostly Windows based. You might be able to follow using Linux but you'll probably need some supplemental help.

Setting up CUDA

Unfortunately this is platform dependent and because I'm just coding on my PC I can really only offer Windows. We can start with their quick start guide: https://docs.nvidia.com/cuda/cuda-quick-start-guide/index.html#windows . This is mostly straightforward. You will need Visual Studio for Windows https://visualstudio.microsoft.com/vs/ first. You download the CUDA installer and it'll do its thing and install a plugin into Visual Studio. At this point you can write C++ programs that use CUDA (I'd love to be able to use VS Code, but I didn't want to waste too much time figuring out how the Windows CLI for nvcc works).

One thing to note if you follow their instructions and download the sample repository: the project versions might need to be tweaked. To do this you open the project file, e.g. nbody_vs2022.vcxproj, with a text editor (Visual Studio makes this unnecessarily hard) and edit some lines to update the tool version:

<ImportGroup Label="ExtensionSettings">
-  <Import Project="$(CUDAPropsPath)\CUDA 12.5.props" />
+  <Import Project="$(CUDAPropsPath)\CUDA 12.6.props" />
</ImportGroup>
<ImportGroup Label="ExtensionTargets">
-  <Import Project="$(CUDAPropsPath)\CUDA 12.5.targets" />
+  <Import Project="$(CUDAPropsPath)\CUDA 12.6.targets" />
</ImportGroup>

Just check the build output if you are getting errors and figure out which version you have. This will almost certainly become out of date.

Building a simple CUDA app

If you can get the sample running then you should be able to develop a CUDA app. In Visual Studio you should be able to start a new CUDA project which will give you some boilerplate with a kernel.cu file. .cu is the CUDA C++ extension. We can make a version of the add kernel from last time:

#include "stdio.h"

__global__ void add(int n, float* a, float*b, float*c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        c[i] = a[i] + b[i];
    }
}

int main()
{
    int N = 6;
    int BLOCK_SIZE = 2;

    float* a = new float[N];
    float* b = new float[N];
    float* c = new float[N];

    for (int i = 0; i < N; i++)
    {
        a[i] = i;
        b[i] = 2 * i;
    }

    float* a_d;
    float* b_d;
    float* c_d;

    cudaMalloc((void**)&a_d, N * sizeof(float));
    cudaMalloc((void**)&b_d, N * sizeof(float));
    cudaMalloc((void**)&c_d, N * sizeof(float));

    cudaMemcpy(a_d, a, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b, N * sizeof(float), cudaMemcpyHostToDevice);

    // grid size = N / BLOCK_SIZE; this only works cleanly here because N divides evenly (rounding up is covered below)
    add<<<N / (float)BLOCK_SIZE, BLOCK_SIZE>>>(N, a_d, b_d, c_d);

    cudaMemcpy(c, c_d, N * sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++)
    {
        printf("a: %f, b: %f, c%f\n", a[i], b[i], c[i]);
    }

    system("pause");

    cudaFree(a_d);
    cudaFree(b_d);
    cudaFree(c_d);

    return 0;
}

A function marked __global__ is an entry point for code that runs on the GPU, so we can call it from CPU code. It isn't allowed to return anything; it just takes pointers to memory and you copy data in and out as needed. Here it takes the number of elements and 3 pointers to some data. We need the size because with just pointers we don't know how long the buffers are. The pointers only point to the beginning of each array, but the floats should be packed together one after the next, all 6 of them.

a_d, b_d and c_d are all buffers that the GPU has access to. I found this convention looking at some materials, I have no idea if it's common but I think _d means "device" and separates it from a buffer held on the CPU side. We call cudaMemcpy(destinationPointer, sourcePointer, byteCount, cudaMemcpyHostToDevice) to put data into them. We're copying data from one buffer to another and the final parameter tells it which direction the copy goes, in this case to the GPU ("device" in CUDA parlance is the GPU, "host" is the CPU). So we take memory the CPU can access and copy it to where the GPU can access it.

Workgroups

When we make calls to execute code on the GPU we have to deal with workgroup sizes. For whatever reason, it was thought that it would be a good idea to make these indexed in 3 dimensions instead of linearly (I get that it makes sense if you're computing 2 or 3 dimensional tensors, but it's annoying for arbitrary dimensions). A "thread block" (which is the official CUDA term) is a 3 dimensional block where each individual unit is a thread. The index of the current thread within the block is given by threadIdx.x, threadIdx.y, and threadIdx.z. To make it more complicated we also have a "grid", which is a 3D arrangement of blocks. Which block we're in is given by blockIdx.x, blockIdx.y, blockIdx.z. The block size and grid size are defined by us when we call the function. This uses the triple angle bracket syntax <<<gridSize, blockSize>>>. For example:

add<<<2, 8>>>(...)

will execute with a grid of size 2 (one dimension) and 8 threads per block (also one dimension) for a total of 16 threads.

dim3 block_dim(3,3,3);
dim3 grid_dim(2,2,1);
add<<<grid_dim, block_dim>>>(...)

will have a grid of 2x2x1 (4 blocks total) and blocks of 3x3x3 (27 threads each), so there will be 27 x 4 = 108 total threads. For our linear kernels it's best to keep it simple and not invent dimensions.
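For reference, if we ever did launch with multi-dimensional blocks, a common way to turn 2D coordinates back into a flat index looks roughly like this (a sketch; width is an assumed row length, not something from the code in this post):

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int flatIdx = y * width + x; //row-major flattening of the (x, y) coordinate

We won't need this since we're sticking to one dimension.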

In all of the kernels we'll be making we get the absolute index of the current thread like this:

int idx = blockIdx.x * blockDim.x + threadIdx.x;

And we are only using one-dimensional grids/blocks. We define an arbitrary block size and then figure out how many of those blocks are needed to cover at least the size of the output. Note that we need to round up; if we truncated a fractional value (say the data is smaller than one block) we'd get a grid size of 0, which gives invalid configuration argument.

int BLOCK_SIZE = 32;
int GRID_SIZE = (int)ceil(size / (float)BLOCK_SIZE);

But how do we decide the block size? Mostly magic numbers from what I found. It depends on the exact GPU. Dimensions cannot exceed certain limits, so for very long blocks of data we might have to define them in 2D or 3D terms. Also, the hardware schedules threads in groups of 32 (warps), so asking for fewer than that per block won't make any difference. Grid dimensions versus block dimensions also appears to be something you performance tune and there's no single right answer (I haven't done any tuning, by the way).
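As an aside, a purely integer way to get the same round-up without going through floats is the classic ceiling-division idiom (a sketch, equivalent to the ceil version above):

int BLOCK_SIZE = 32;
int GRID_SIZE = (size + BLOCK_SIZE - 1) / BLOCK_SIZE; //integer ceiling of size / BLOCK_SIZE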

variable    dimensions    description
threadIdx   x,y,z         position of the thread within its block
blockIdx    x,y,z         position of the block within the grid
blockDim    x,y,z         size of a block (in each dimension)
gridDim     x,y,z         size of the grid (in each dimension)

When we call add<<<gridDim, blockDim>>>(...) it runs the code on the GPU. Then we cudaMemcpy in the other direction, from the GPU to the CPU, and we get data back in our c buffer. The last part prints the data out, pauses for a keystroke so we can see the output, then politely frees the buffers on the GPU and ends with exit code 0.

Syntax and pointers aside it's not too bad.

Building a dll

Next, let's figure out how to call C code from Deno. Visual Studio will give you a template for building a C++ dll but I immediately found it to be unnecessarily complex so I tried again with a blank project. There are two main components we need: a header file (.h) and a C++ file (.cpp). So I created a simple add function to test this out.

//add.h

extern "C" __declspec(dllexport) float add(const float a, const float b);

If you follow the template you'll get some extra pragma stuff that's just unnecessary. As far as I can tell this is the minimal amount of stuff you actually need. I also don't fully understand everything __declspec(dllexport) does from a syntax point of view, but it tells the compiler to export the symbol from the dll so other code can find it. The rest of the header is basically the function signature (we have only one at the moment) and extern "C" makes it available without C++ name mangling.
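As an aside, if this ever needed to build on non-Windows platforms, the export is usually wrapped in a small macro. Here's a hedged sketch (EXPORT is a made-up name, and this isn't needed for the Windows-only setup in this post):

//add.h (cross-platform variant, for illustration only)
#ifdef _WIN32
#define EXPORT extern "C" __declspec(dllexport)
#else
#define EXPORT extern "C" //symbols in a shared library are visible by default on Linux/macOS
#endif

EXPORT float add(const float a, const float b);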

//add.cpp
#include "add.h"

float add(float a, float b) {
    return a + b;
}

This adds two numbers and includes our header. If you go the template route you get pch.h and a bunch of other random junk that doesn't seem necessary. The other thing is that in order to actually compile this we need to go into the project settings (right click project name in solution explorer -> Configuration Properties -> General -> Configuration Type -> Dynamic Library (.dll)). By default it wants to compile an .exe, which needs a main function, and you'll get cryptic errors to that effect. You also have to do this per build profile (release/debug).

Now we should be able to compile it (I used release mode just to be safe). Then from a Deno project I can import it:

const dylib = Deno.dlopen("./xdlltest.dll", {
    add: { parameters: ["f32", "f32"], result: "f32"}
});

const result = dylib.symbols.add(345, 34);

console.log(result);

The result was 379 which means it's working.

Putting it together

So now we can build a simple CUDA C++ app, and a simple dll that we can use with Deno. So it should just be a matter of putting the two together.

So I created a new project with a .cu file:

//kernel.cu
#include "cuda_runtime.h"
#include <cmath> //for ceil
#include "kernel.h"

__global__ void addKernel(int size, const float* a, const float* b, float* output)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size)
    {
        output[i] = a[i] + b[i];
    }
}

float* add(const int size, const float* valuesA, const float* valuesB)
{
    int BLOCK_SIZE = 32;
    int GRID_SIZE = (int)ceil(size / (float)BLOCK_SIZE);

    float* output = new float[size];

    float* valuesA_d;
    float* valuesB_d;
    float* output_d;

    cudaMalloc((void**)&valuesA_d, size * sizeof(float));
    cudaMalloc((void**)&valuesB_d, size * sizeof(float));
    cudaMalloc((void**)&output_d, size * sizeof(float));

    cudaMemcpy(valuesA_d, valuesA, size * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(valuesB_d, valuesB, size * sizeof(float), cudaMemcpyHostToDevice);

    addKernel<<<GRID_SIZE, BLOCK_SIZE >>> (size, valuesA_d, valuesB_d, output_d);

    cudaMemcpy(output, output_d, size * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(valuesA_d);
    cudaFree(valuesB_d);
    cudaFree(output_d);

    return output;
}

This is almost the same thing as when we tried out CUDA for the first time: we just pass in the size and buffers and return the output buffer. We also need the header file:

//kernel.h

extern "C" __declspec(dllexport) float* add(const int size, const float* valuesA, const float* valuesB);

We need to set it to output as a dll but this should work the same way as the dll example. Note that I did encounter some funkiness testing things out. Certain #includes, like stdio.h, seemed to mess up the output dll due to a name conflict, and then Deno wasn't able to find the exports, which gave me a lot of trouble to debug.
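If you run into that, a useful sanity check (assuming the MSVC tools are on your path) is dumpbin /exports kernel.dll, which lists the symbols the dll actually exports so you can confirm add is there under the unmangled name Deno is looking for.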

Barring that, on the Deno side we can import it. This took a little digging to figure out as Deno's FFI docs are not great. The nice part is, as long as dlopen works you can console log in the C++ code to debug which is a big step up from WGSL and WASM.

const dylib = Deno.dlopen("./kernel.dll", {
    add: { parameters: ["i32", "buffer", "buffer"], result: "pointer"}
});

const a = new Float32Array([1,2,3,4]);
const b = new Float32Array([5,5,7,7]);

const resultPointer = dylib.symbols.add(a.length, a.buffer, b.buffer);
const buffer = Deno.UnsafePointerView.getArrayBuffer(resultPointer, a.byteLength);
const c = new Float32Array(buffer);

console.log(c);

We need to pass the float*s in as "buffer" types, which correspond to ArrayBuffers (or TypedArrays). The return value is a pointer: it's just the start address of the array and carries no length, so we need to provide one when we call Deno.UnsafePointerView.getArrayBuffer with the pointer. It's unsafe because we're doing raw memory access; if you screw this up bad things can happen and you'll read memory you didn't intend to. Not such a big problem for us since we're not letting other people run this code, but something to consider. We can then view that ArrayBuffer as a Float32Array and you can probably see where it goes from here.
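One more thing worth flagging with this pattern: the array that add returns was allocated with new[] on the C++ side and nothing ever frees it, so every call leaks a little native memory. A minimal sketch of one way to handle that would be exporting a hypothetical free_buffer function (this isn't in the post's code, just an illustration):

//kernel.h (hypothetical addition)
extern "C" __declspec(dllexport) void free_buffer(float* buffer);

//kernel.cu (hypothetical addition)
void free_buffer(float* buffer) {
    delete[] buffer; //release the array that add() allocated with new[]
}

On the Deno side it would be declared with a single "pointer" parameter and a "void" result, and called once the values have been copied into a Float32Array you own.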

FFI is kind of a super power so long as you're willing to subject yourself to writing in systems languages. This will likely only run on Windows though and that is the real pain, because dealing with native binaries as a consumer sucks, and is probably the source of all my Python problems that led me down this road in the first place. Unless you plan to support macOS, Windows and Linux with your library it's probably not the tool you want.

Getting rid of Visual Studio

I like VS Code because it's pretty simple (yeah, there are faster editors, but it's fast enough for me most of the time). The same cannot be said of Visual Studio, which is big and bloated and hides a lot of details. If I'm going to be developing this I'd rather:

1) Use one code editor
2) Use one unified project

VS Code can deal with C/C++/CUDA, especially because our setup is simple: we just need to compile some .cu files. So we need to figure out what magic Visual Studio is doing under the hood to output the dll. I already know that it's calling nvcc, which is Nvidia's CUDA compiler, so we just need to work out which args to pass it and then we can call it from a Deno task.

It seems that installing CUDA also installed nvcc and added it to the path. However when I ran it I got an error: nvcc fatal : Cannot find compiler 'cl.exe' in PATH. cl.exe is part of the C++ compiler toolset that comes with Visual Studio. The fix seems to be adding cl.exe to the path. It lives somewhere like C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.23.28105\bin\Hostx64\x64 (the exact path depends on your Visual Studio version and MSVC toolset version). Once we have that on the path then everything should work. In order to compile the code as a dll we need to run a command like this: nvcc -o ./dist/add --shared add.cu. This will output a .dll, .lib and .exp but only the .dll is actually necessary. -o is the output file name and --shared generates a shared library (dll).

We can wire this up to a deno task and remove a heavy tool from the process. Nice!
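For what it's worth, a task like that might look roughly like this in deno.json (a sketch; the task name and file layout are assumptions, not from the repo):

{
  "tasks": {
    "build:kernel": "nvcc -o ./dist/kernel --shared kernel.cu"
  }
}

Then deno task build:kernel rebuilds the dll before running anything that needs it.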

Remaking the tensor class

With this newfound ability to call native CUDA code we can start remaking our tensor class. For this exercise I'm making a new class and implementing each method again. I'm also exporting it as Tensor so it's easy to swap out. In other libraries they make a "kernel" abstraction, that is, a single tensor class that calls out to different implementations. This isn't a bad idea and probably useful for real production use-cases, but I want the code at this point to really spell out what's going on because I found digging through other libraries to be absolutely impenetrable.

Here's the CUDATensor with a single add op implemented.

//CUDATensor.js
import { getTotalLength } from "./tensor-utils.js";
import { topologicalSort } from "./topological-sort.js";

const kernel = Deno.dlopen("./dist/kernel.dll", {
    add: { parameters: ["i32", "buffer", "buffer"], result: "pointer" },
    addBackprop: { parameters: ["i32", "buffer", "buffer", "buffer"], result: "void" }
});

const backward = Symbol("backward");

export class Tensor {
    #shape;
    #values;
    #children;
    #op;
    #label;
    [backward] = () => { };
    #gradient = 0;

    constructor({ values, shape, children, op, label }) {
        const totalLength = getTotalLength(shape);

        this.#shape = shape;
        if (values) {
            if (values.length != totalLength) throw new Error(`Supplied values are the wrong length, need ${totalLength}`);
            if (values instanceof Float32Array) {
                this.#values = values;
            } else {
                this.#values = new Float32Array(values)
            }
        } else {
            this.#values = new Float32Array(totalLength);
        }

        this.#children = children ?? [];
        this.#gradient = new Float32Array(totalLength);
        this.#op = op ?? null;
        this.#label = label ?? null;
    }
    get totalLength() {
        return this.#values.length;
    }
    get shape() {
        return this.#shape;
    }
    get values() {
        return this.#values;
    }
    get gradient() {
        return this.#gradient;
    }
    get children() {
        return this.#children;
    }
    add(other) {
        if (other.totalLength != this.totalLength) throw new Error(`Tensor not the right length, argument was ${other.totalLength}, needs to be ${this.totalLength}`);

        const resultPointer = kernel.symbols.add(this.totalLength, this.#values, other.values);
        const outBuffer = Deno.UnsafePointerView.getArrayBuffer(resultPointer, this.totalLength * 4);
        const values = new Float32Array(outBuffer);

        const result = new Tensor({
            values,
            shape: this.#shape,
            children: [this, other],
            op: "+"
        });

        result[backward] = () => {
            kernel.symbols.addBackprop(this.totalLength, this.#gradient, other.gradient, result.gradient);
        };

        return result;
    }
    backward() {
        this.#gradient = new Float32Array(this.totalLength).fill(1);
        const sortedDependencies = topologicalSort(this, x => x.children).reverse();
        for (const node of sortedDependencies) {
            node[backward]();
        }
    }
}

While we already covered the add op, we need its backprop, so let's add something for that in the kernel.

//kernel.cu
__global__ void addBackpropKernel(const int size, float *gradA, float *gradB, float *gradResult)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < size)
    {
        gradA[idx] = gradA[idx] + gradResult[idx];
        gradB[idx] = gradB[idx] + gradResult[idx];
    }
}

void addBackprop(const int size, float *gradA, float *gradB, const float *gradResult)
{
    int BLOCK_SIZE = 32;
    int GRID_SIZE = (int)ceil(size / (float)BLOCK_SIZE);

    float *gradA_d;
    float *gradB_d;
    float *gradResult_d;

    cudaMalloc((void **)&gradA_d, size * sizeof(float));
    cudaMalloc((void **)&gradB_d, size * sizeof(float));
    cudaMalloc((void **)&gradResult_d, size * sizeof(float));

    cudaMemcpy(gradA_d, gradA, size * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(gradB_d, gradB, size * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(gradResult_d, gradResult, size * sizeof(float), cudaMemcpyHostToDevice);

    addBackpropKernel<<<GRID_SIZE, BLOCK_SIZE>>>(size, gradA_d, gradB_d, gradResult_d);

    cudaMemcpy(gradA, gradA_d, size * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(gradB, gradB_d, size * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(gradA_d);
    cudaFree(gradB_d);
    cudaFree(gradResult_d);
}

And remember to add it to the .h:

//kernel.h
extern "C" __declspec(dllexport) void addBackprop(const int size, float *gradA, float *gradB, const float *gradResult);

I think this should make sense since it's also adding, but instead of one output buffer we have two. This also means that we are not going to return a value but rather modify the 2 array references passed in (dunno if it would make more sense to modify the single return of add to work the same way...). However, while this works in most cases it actually has a bug. If both inputs are the same tensor, gradA and gradB point at the same buffer and both contributions need to accumulate into it; but because we copy them into separate device buffers, each copy only picks up one contribution and whichever one we copy back last wins. Since we need to copy memory onto the GPU it doesn't seem possible to just share a reference for gradA_d and gradB_d. This sucks and means we need to special case it.

void addBackprop(const int size, float *gradA, float *gradB, const float *gradResult)
{
    int BLOCK_SIZE = 32;
    int GRID_SIZE = (int)ceil(size / (float)BLOCK_SIZE);

    float *gradA_d;
    float *gradB_d;
    float *gradResult_d;

    cudaMalloc((void **)&gradA_d, size * sizeof(float));
    cudaMalloc((void **)&gradB_d, size * sizeof(float));
    cudaMalloc((void **)&gradResult_d, size * sizeof(float));

    cudaMemcpy(gradA_d, gradA, size * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(gradB_d, gradB, size * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(gradResult_d, gradResult, size * sizeof(float), cudaMemcpyHostToDevice);

    addBackpropKernel<<<GRID_SIZE, BLOCK_SIZE>>>(size, gradA_d, gradB_d, gradResult_d);

+    if(gradA == gradB){ //if same reference then we need to add them
+        float *gradUnified_d;
+        cudaMalloc((void **)&gradUnified_d, size * sizeof(float));
+        addKernel<<<GRID_SIZE, BLOCK_SIZE>>>(size, gradA_d, gradB_d, gradUnified_d);

+        cudaMemcpy(gradA, gradUnified_d, size * sizeof(float), cudaMemcpyDeviceToHost);
+        cudaMemcpy(gradB, gradUnified_d, size * sizeof(float), cudaMemcpyDeviceToHost);
+        cudaFree(gradUnified_d);
+    } else {
        cudaMemcpy(gradA, gradA_d, size * sizeof(float), cudaMemcpyDeviceToHost);
        cudaMemcpy(gradB, gradB_d, size * sizeof(float), cudaMemcpyDeviceToHost);
+    }

    cudaFree(gradA_d);
    cudaFree(gradB_d);
    cudaFree(gradResult_d);
}

Here we check if gradA and gradB are the same (the pointer values are equal). If so, we special case it. We create a new output buffer and use our existing addKernel to add the two gradients together, then copy the result back to both gradA and gradB. There are probably more elegant ways to deal with this but they're beyond my knowledge at this point. On the plus side, since gradA_d and gradB_d are still sitting in GPU memory for reuse this shouldn't be too bad and is strictly better than calling add again from Deno. One thing I never did figure out was why I needed gradUnified_d instead of just re-using gradA_d. Doing that produced an empty buffer and from what I could tell the assignment silently fails. Something to do with how the GPU allocates it? const issues? I don't know, but we could save an allocation if I could figure it out.

Anyway, the rest of this is just an exercise in reimplementation so I'm not going to spell it out. We have all the tools to reimplement sub, div, mul and other binary ops.

Note if you export a function "div" you will get errors!

error: more than one instance of overloaded function "div" has "C" linkage

This is because it conflicts with the standard library function div (it's declared in stdlib.h, which the headers we include pull in). Since we do want to keep those headers we'll need to rename our function (other strategies got complicated). So from here I ended all my exported functions in _op, e.g. div_op.
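So the header entry for division ends up looking something like this (the exact signature is my assumption, mirroring the add op):

//kernel.h
extern "C" __declspec(dllexport) float* div_op(const int size, const float* valuesA, const float* valuesB);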

Reduction Ops

Last time, reduction ops gave me trouble, and they were certainly complex and required a lot of debugging this time too, so I think they're worth looking at again.

//kernel.cu
__device__ int* insertAtIndex(const int* array, const int length, const int index, const int value){
    int* outArray = (int *)malloc((length + 1) * sizeof(int));
    int sourceIndex = 0;
    int destinationIndex = 0;
    while(destinationIndex < length + 1){
        if(destinationIndex != index){
            outArray[destinationIndex] = array[sourceIndex];
            sourceIndex++;
            destinationIndex++;
        } else {
            outArray[destinationIndex] = value;
            destinationIndex++;
        }
    }
    return outArray;
}
__device__ int *removeAtIndex(const int *array, const int length, const int index)
{
    int *outArray = (int *)malloc((length - 1) * sizeof(int));
    int sourceIndex = 0;
    int destinationIndex = 0;
    while (destinationIndex < length - 1)
    {
        if (sourceIndex != index)
        {
            outArray[destinationIndex] = array[sourceIndex];
            sourceIndex++;
            destinationIndex++;
        }
        else
        {
            sourceIndex++;
        }
    }
    return outArray;
}
//shape is column major
__device__ int* getDimensionalIndex(int flatIndex, const int shape[], const int shapeSize)
{
    int* indices = (int*)malloc(shapeSize * sizeof(int)); //note: sizeof(shape) would only be the size of the pointer
    for(int i = 0; i < shapeSize; i++){
        indices[i] = flatIndex % shape[i];
        flatIndex = flatIndex / shape[i];
    }
    return indices;
}

//shape and dimensionalIndex are column major
__device__ int getFlatIndex(const int dimensionalIndex[], const int shape[], const int shapeSize)
{
    int index = 0;
    for (int i = 0; i < shapeSize; i++)
    {
        index *= shape[shapeSize - 1 - i];
        index += dimensionalIndex[shapeSize - 1 - i];
    }

    return index;
}
__global__ void sumKernel(const int* shape, const int shapeSize, const int dimToReduce, const float* values, float* output)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    int size = 1;
    for (int j = 0; j < shapeSize; j++)
    {
        if (j != dimToReduce)
        {
            size *= shape[j];
        }
    }
    if (idx < size)
    {
        int *newShape = removeAtIndex(shape, shapeSize, dimToReduce); //output shape with the reduced dim removed
        int *partialDimIndex = getDimensionalIndex(idx, newShape, shapeSize - 1); //index into the output shape

        float sum = 0.0f; //accumulate locally, output_d from cudaMalloc isn't zeroed
        for(int i = 0; i < shape[dimToReduce]; i++){
            int* dimIndex = insertAtIndex(partialDimIndex, shapeSize - 1, dimToReduce, i); //full index into the input
            int flatIdx = getFlatIndex(dimIndex, shape, shapeSize);
            sum += values[flatIdx];
            free(dimIndex);
        }
        output[idx] = sum;

        free(partialDimIndex);
        free(newShape);
    }
}

float* sum_op(const int* shape, const int shapeSize, const int dimToReduce, const float* values)
{
    int size = 1;
    for (int i = 0; i < shapeSize; i++)
    {
        size *= shape[i];
    }
    int outSize = size / shape[dimToReduce];

    int BLOCK_SIZE = 32;
    int GRID_SIZE = (int)ceil(outSize / (float)BLOCK_SIZE); //the grid needs to cover the output elements

    float* output = new float[outSize];

    float* values_d;
    float* output_d;
    int* shape_d;

    cudaMalloc((void**)&values_d, size * sizeof(float));
    cudaMalloc((void**)&shape_d, shapeSize * sizeof(int));
    cudaMalloc((void**)&output_d, outSize * sizeof(float));

    cudaMemcpy(values_d, values, size * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(shape_d, shape, shapeSize * sizeof(int), cudaMemcpyHostToDevice);

    sumKernel<<<GRID_SIZE, BLOCK_SIZE>>>(shape_d, shapeSize, dimToReduce, values_d, output_d);

    cudaMemcpy(output, output_d, outSize * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(values_d);
    cudaFree(shape_d);
    cudaFree(output_d);

    return output;
}

Alright, here's what I came up with. The sum function is easy enough to understand, I think. We have 3 buffers: the values, the output and the shape. Even though shape is relatively simple, being an array of dimension sizes, it still needs a length shapeSize and has to be passed over as a typed array. We also need the dimension that we're going to reduce, dimToReduce, which is an index into shape. The only other somewhat interesting thing is that we recover the total size from the shape, so it's not necessary to pass it in. We actually need both the starting and ending sizes, with the ending size just dividing out the dimension we're reducing.

For the GPU kernel itself, we figure out our index and recalculate the output size. This could potentially be optimized by passing the size in, since we already have it on the host. As is, it's recalculated for every element of the tensor. I might go back and do that, so if it looks different in the code that's why.

The algorithm is the same as in JavaScript. I feel deep down that there's a non-allocating way to do this but I never figured it out. I apologize to any FAANG interviewers. So we need methods to insert into and remove from the shape array; these are immutable and allocate new arrays. Note the __device__ keyword, which marks a function that runs on the GPU and can only be called from GPU code. This mirrors what the JS code was doing with toSpliced. We first get the new shape, which is the old shape with the reduced dimension removed. We calculate all the other dimensional indices based on our thread index, leaving a placeholder for the one we're reducing. Then, in a loop over that dimension, we insert the missing dimensional index, convert the array of indices into a flat index over the buffer and add the results up into the corresponding output slot.
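To make the index math concrete, here's a small host-side sketch of the same round trip for a 2x3 tensor (column-major shape {2, 3}, as in the kernel comments). The helpers just mirror the __device__ ones above and are only for illustration:

#include <stdio.h>

//same logic as the __device__ helpers, written for the CPU just to illustrate
void getDimensionalIndexHost(int flatIndex, const int* shape, const int shapeSize, int* indices) {
    for (int i = 0; i < shapeSize; i++) {
        indices[i] = flatIndex % shape[i];
        flatIndex = flatIndex / shape[i];
    }
}
int getFlatIndexHost(const int* dimensionalIndex, const int* shape, const int shapeSize) {
    int index = 0;
    for (int i = 0; i < shapeSize; i++) {
        index *= shape[shapeSize - 1 - i];
        index += dimensionalIndex[shapeSize - 1 - i];
    }
    return index;
}

int main() {
    int shape[2] = { 2, 3 }; //2x3 tensor, column major
    int indices[2];
    getDimensionalIndexHost(4, shape, 2, indices);
    printf("(%i, %i)\n", indices[0], indices[1]);          //flat index 4 -> (0, 2)
    printf("%i\n", getFlatIndexHost(indices, shape, 2));   //and back to 4
    return 0;
}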

Tips if you are having issues

Use printf

printf functions like console.log for C (you need to add your own newline). You get it by including stdio.h. The nice thing is that this works in GPU code! If you've ever worked with WebGL or WebGPU you've probably felt the pain of not having a way to log things. However, if you use it in GPU code you should also call cudaDeviceSynchronize().

//calling the GPU code
cudaMemcpy(gradA_d, gradA, size * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(gradB_d, gradB, size * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(gradResult_d, gradResult, size * sizeof(float), cudaMemcpyHostToDevice);
addKernel<<<GRID_SIZE, BLOCK_SIZE>>>(size, gradA_d, gradB_d, gradResult_d);
cudaDeviceSynchronize();

This waits for the kernel to finish, which flushes the print buffer so the prints don't get missed. Since those prints run once per thread on the GPU they can easily get unreadable and they will interleave themselves in the console. It's recommended that you add the thread id to the message so that you can tell where it came from:

int idx = blockIdx.x * blockDim.x + threadIdx.x;
printf("thread: %i\n", idx);

Also, if you just want to check a global thing, like say the values of a buffer being passed in, you can use an if block to print only once:

if(idx == 0){
  for(int i = 0; i < valueSize; i++){
     printf("%f\n", values[i]);
  }
}

Catch CUDA errors

Sometimes something in the GPU code will not work at runtime but it will silently fail. In this case it will complete but you won't get the expected result and it might impact later unrelated calls too! You can add some debug logging to the kernels. Right after the call to the kernel function:

addKernel<<<GRID_SIZE, BLOCK_SIZE>>>(size, gradA_d, gradB_d, gradResult_d);
cudaDeviceSynchronize();
cudaError_t err = cudaGetLastError();
if(err != cudaSuccess){
    printf("CUDA error: %s\n", cudaGetErrorString(err));
}

This isn't strictly needed if everything works but might be good to guard you from future issues.
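If you end up sprinkling this check everywhere, a common pattern is to wrap it in a small helper macro. Here's a hedged sketch (CUDA_CHECK is my name for it, it isn't in the post's repo):

#include <stdio.h>
#include "cuda_runtime.h"

//hypothetical helper: run a CUDA runtime call and report any failure with file/line info
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err = (call);                                          \
        if (err != cudaSuccess) {                                          \
            printf("CUDA error at %s:%d: %s\n", __FILE__, __LINE__,        \
                cudaGetErrorString(err));                                  \
        }                                                                  \
    } while (0)

//usage:
//CUDA_CHECK(cudaMemcpy(a_d, a, size * sizeof(float), cudaMemcpyHostToDevice));
//addKernel<<<GRID_SIZE, BLOCK_SIZE>>>(size, a_d, b_d, c_d);
//CUDA_CHECK(cudaDeviceSynchronize());
//CUDA_CHECK(cudaGetLastError());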

Double check your sizes

One of the more common problems I ran into was using the wrong size for a buffer, which would truncate it, pad it with zeros or give you otherwise garbage data. One place where it's easy to get tripped up is the length of an array versus its byteLength. For example, I expected sizeof(array) to give the number of elements, when it actually gives the byte size (and for a pointer, just the size of the pointer; sorry, haven't used C since college). Or in another case I forgot ceil on my grid size and rounded down to a grid size of 0, which gives errors. Just check for things like that if values seem off.
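A quick illustration of the trap (just a sketch):

float* a = new float[6];
sizeof(a);              //size of the pointer itself (8 bytes on x64), NOT the array
6 * sizeof(float);      //24 bytes: what cudaMalloc/cudaMemcpy actually want
float b[6];
sizeof(b);              //24 bytes here, because b is a true array, not a pointer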

Double check you aren't using CPU pointers on the GPU and vice-versa

At one point I passed in valuesA instead of valuesA_d, that is I used the CPU pointer instead of the GPU pointer. This will cause an illegal memory access.

Double check your .h matches your .cu

Another amateur mistake, but it's hard to tell when this happens. It will compile fine but Deno will usually tell you that it cannot find a symbol. This also means that types should match exactly. If you use const parameters for immutability, they need to be const in the header file.

Double check you are passing the right param types in Deno

If you copy paste:

negBackprop_op: { parameters: ["i32", "buffer", "buffer"], result: "pointer" },

But change it to:

expBackprop_op: { parameters: ["i32", "buffer", "buffer"], result: "pointer" },

You will get weird crashes. This is because expBackprop_op should have looked like:

expBackprop_op: { parameters: ["i32", "buffer", "buffer", "buffer"], result: "pointer" }

Missing that last buffer will not cause a compile error, but it will cause a crash in the library when it tries to access values that were never passed. You won't get a useful error, it will just stop, so be careful that you are defining the signature correctly on the Deno side too.

If the kernel errors at runtime it can impact future calls

This was a weird thing to debug: adding a new test broke old, existing tests. I was getting silent errors from the runtime and these would break future unrelated calls. You may need to back out new changes if you see odd behavior like this, and make sure you check the CUDA errors.

Check for name collisions

As noted above, it took me a while to figure out that the standard library already declares a function div. If you are getting errors about overloaded functions but you aren't overloading them, this could be a cause.

Code

The code for this post can be found here: https://github.com/ndesmic/mljs/compare/v1.0...v1.1
