I had actually wanted part 2 to be a WebGPU implementation and only started poking at the CUDA version toward the end. I didn't know anything about CUDA, I hadn't used C except in school, and I hadn't played around with FFI, so I was expecting it to be hard. The neat part is that while there's still a fair bit of stuff to set up, it's actually easier than using WebGPU! So much so that I decided to change the order of the series. Writing CUDA kernels is easier for a few reasons:
- You might have already been exposed to C/C++
- C/C++ have lots of resources and you can get AI chatbots to help; with WGSL you more or less have to read the spec because few people are writing about it.
- You can console log from the GPU code!
- Code can be synchronous
- C/C++ are designed to manipulate memory, while JavaScript TypedArrays are painful to work with
So what is CUDA? CUDA is Nvidia's proprietary platform for programming Nvidia GPUs. This means we get access to everything at full speed, not just what the WebGPU API exposes. It also means macOS isn't going to work, but since Nvidia dominates the GPU market it's still reasonable to optimize for their hardware. So let's give those beefy gamer PCs a workout!
We're going to write the kernels in CUDA and then use FFI to link them back to Deno. FFI (Foreign Function Interface) is a way to call native compiled binaries from other code (it's also the reason nothing works on your machine whenever you start a programming tutorial, which is ironic given my rant last time). This edition will be mostly Windows based. You might be able to follow along on Linux but you'll probably need some supplemental help.
Setting up CUDA
Unfortunately this is platform dependent and because I'm just coding on my PC I can really only offer Windows. We can start with Nvidia's quick start guide: https://docs.nvidia.com/cuda/cuda-quick-start-guide/index.html#windows . This is mostly straightforward. You will need Visual Studio for Windows (https://visualstudio.microsoft.com/vs/) first. You download the CUDA installer, it does its thing, and it installs a plugin into Visual Studio. At this point you can write C++ programs that use CUDA (I'd love to be able to use vscode, but I didn't want to waste too much time figuring out how the Windows CLI for nvcc works).
One thing to note if you follow their instructions and download the sample repository: the project versions might need to be tweaked. To do this you open the project file, e.g. nbody_vs2022.vcxproj, with a text editor (Visual Studio makes this unnecessarily hard) and edit some lines to update the tool version:
<ImportGroup Label="ExtensionSettings">
- <Import Project="$(CUDAPropsPath)\CUDA 12.5.props" />
+ <Import Project="$(CUDAPropsPath)\CUDA 12.6.props" />
</ImportGroup>
<ImportGroup Label="ExtensionTargets">
- <Import Project="$(CUDAPropsPath)\CUDA 12.5.targets" />
+ <Import Project="$(CUDAPropsPath)\CUDA 12.6.targets" />
</ImportGroup>
Just check the build output if you are getting errors and figure out which version you have. This will almost certainly become out of date.
Building a simple CUDA app
If you can get the sample running then you should be able to develop a CUDA app. In Visual Studio you should be able to start a new CUDA project, which will give you some boilerplate with a kernel.cu file (.cu is the CUDA C++ extension). We can make a version of the add kernel from last time:
#include "stdio.h"
__global__ void add(int n, float* a, float*b, float*c)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n)
{
c[i] = a[i] + b[i];
}
}
int main()
{
int N = 6;
int BLOCK_SIZE = 2;
float* a = new float[N];
float* b = new float[N];
float* c = new float[N];
for (int i = 0; i < N; i++)
{
a[i] = i;
b[i] = 2 * i;
}
float* a_d;
float* b_d;
float* c_d;
cudaMalloc((void**)&a_d, N * sizeof(float));
cudaMalloc((void**)&b_d, N * sizeof(float));
cudaMalloc((void**)&c_d, N * sizeof(float));
cudaMemcpy(a_d, a, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(b_d, b, N * sizeof(float), cudaMemcpyHostToDevice);
add<<<(int)ceil(N / (float)BLOCK_SIZE), BLOCK_SIZE>>>(N, a_d, b_d, c_d); //round up so every element gets a thread
cudaMemcpy(c, c_d, N * sizeof(float), cudaMemcpyDeviceToHost);
for (int i = 0; i < N; i++)
{
printf("a: %f, b: %f, c%f\n", a[i], b[i], c[i]);
}
system("pause");
cudaFree(a_d);
cudaFree(b_d);
cudaFree(c_d);
return 0;
}
The function marked __global__ is an entrypoint for code that runs on the GPU, which means we can call it from CPU code. It isn't allowed to return anything; it just takes references to memory and you copy data in and out as you need. Here it takes the number of elements and 3 pointers to data. We need the size because with just pointers we don't know how long the buffers are. Each pointer just points to the beginning of an array, but the floats should be packed together one after the next, all 6 of them.
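As a quick reference, CUDA marks where a function runs and where it can be called from with qualifiers. Here's a minimal sketch (the function names are just for illustration):

//runs on the GPU, launched from CPU code with <<<...>>>, must return void
__global__ void myKernel(int n, float* data) { }

//runs on the GPU, callable only from kernels or other __device__ functions
__device__ float gpuHelper(float x) { return x * 2.0f; }

//a normal CPU function (the default when no qualifier is given)
__host__ float cpuHelper(float x) { return x * 2.0f; }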
a_d, b_d, and c_d are all buffers that the GPU has access to. I found this convention looking at some materials; I have no idea if it's common, but I think _d means "device" and separates it from a buffer held on the CPU side. We call cudaMemcpy(destinationPointer, sourcePointer, size, cudaMemcpyHostToDevice) to put data into them. We're copying data from one buffer to another and the final parameter tells it which direction the copy goes, in this case to the GPU ("device" in CUDA parlance is the GPU, "host" is the CPU). So we take memory the CPU can access and copy it to the GPU where the GPU can access it.
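For reference, the last argument to cudaMemcpy is just an enum describing the direction of the copy. These are the ones you'll actually use here:

cudaMemcpy(a_d, a, N * sizeof(float), cudaMemcpyHostToDevice); //CPU -> GPU
cudaMemcpy(c, c_d, N * sizeof(float), cudaMemcpyDeviceToHost); //GPU -> CPU
cudaMemcpy(b_d, a_d, N * sizeof(float), cudaMemcpyDeviceToDevice); //GPU -> GPU, never touches the CPU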
Workgroups
When we make calls to execute code on the GPU we have to deal with workgroup sizes. For whatever reason, it was thought that it would be a good idea to index these in 3 dimensions instead of linearly (I get that it makes sense if you're computing 2 or 3 dimensional tensors, but it's annoying for arbitrary dimensions). A "thread block" (that is the official term) is a 3-dimensional block where each individual unit is a thread. The index of the current thread within the block is given by threadIdx.x, threadIdx.y, and threadIdx.z. To make it more complicated we also have a grid, which is a 3D arrangement of blocks. Which block we're in is given by blockIdx.x, blockIdx.y, and blockIdx.z. The block size and grid size are defined by us when we call the function, using the triple angle bracket syntax <<<gridSize, blockSize>>>. For example
add<<<2, 8>>>(...)
will execute with a grid of size 2 (one dimension) and 8 threads per block (one dimension) for a total of 16 threads.
dim3 block_dim(3,3,3);
dim3 grid_dim(2,2,1);
add<<<grid_dim, block_dim>>>(...)
This will have a grid of 2x2x1 (4 blocks total) and blocks of 3x3x3 (27 threads each), so there will be 27x4 = 108 total threads. For our linear kernels it's best to keep it simple and not invent dimensions.
In all of the kernels we'll be making we get the absolute index of the current thread like this:
int idx = blockIdx.x * blockDim.x + threadIdx.x;
And we are only using one dimensional grids/blocks. We define an arbitrary block size and then figure out how many of those blocks are needed to cover at least the size of the output. Note that we need to round up: if the size is smaller than the block size, truncation gives a grid size of 0, which produces invalid configuration argument (and anything not evenly divisible would come up short).
int BLOCK_SIZE = 32;
int GRID_SIZE = (int)ceil(size / (float)BLOCK_SIZE);
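If you'd rather skip the float cast, the same round-up can be written with the classic integer idiom (equivalent for positive sizes):

int GRID_SIZE = (size + BLOCK_SIZE - 1) / BLOCK_SIZE; //integer ceiling division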
But how do we decide on a block size? Mostly magic numbers from what I found; it depends on the exact GPU. Dimensions cannot exceed certain limits, so for very long blocks of data we might have to define them in 2D or 3D terms. Also, the hardware schedules threads in groups of 32 (a "warp"), so asking for fewer than that per block won't make any difference. Grid dimensions versus block dimensions also appears to be something you performance tune, and there's no universally right answer (I haven't done any tuning, by the way).
variable | dimensions | description |
---|---|---|
threadIdx | x,y,z | position of the thread within its block |
blockIdx | x,y,z | position of the block within the grid |
blockDim | x,y,z | size of a block (threads per dimension) |
gridDim | x,y,z | size of the grid (blocks per dimension) |
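Since the actual limits (max threads per block, max grid dimensions, warp size) depend on the specific GPU, you can query them at runtime with the CUDA runtime API. A minimal sketch:

#include "stdio.h"
#include "cuda_runtime.h"

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); //device 0
    printf("name: %s\n", prop.name);
    printf("warp size: %i\n", prop.warpSize);
    printf("max threads per block: %i\n", prop.maxThreadsPerBlock);
    printf("max block dims: %i x %i x %i\n", prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dims: %i x %i x %i\n", prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}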
When we call add<<<gridDim, blockDim>>> it runs the kernel on the GPU. Then we cudaMemcpy in the other direction, from the GPU to the CPU, and we get data back in our c buffer. The last part prints the data out, pauses for a keystroke so we can see the output, and then politely frees up the buffers on the GPU and ends with exit code 0.
Syntax and pointers aside it's not too bad.
Building a dll
Next, let's figure out how to call C code from Deno. Visual Studio will give you a template for building a C++ dll, but I immediately found it to be unnecessarily complex, so I tried again with a blank project. There are two main components we need: a header file (.h) and a C++ file (.cpp). I created a simple add function to test this out.
//add.h
extern "C" __declspec(dllexport) float add(const float a, const float b);
If you follow the template you'll get some extra pragma stuff that's just unnecessary; as far as I can tell this is the minimal amount you actually need. I don't fully understand everything __declspec(dllexport) does from a syntax point of view, but it tells the compiler to export the function from the dll. The rest of the header is basically the function signature (we have only one at the moment), and extern "C" makes it available without C++ name mangling.
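To make the name mangling point concrete, here's a sketch of the difference (the mangled name shown is illustrative; the exact string is compiler specific):

//exported as a plain symbol named "add" - easy to look up from Deno's dlopen
extern "C" __declspec(dllexport) float add(const float a, const float b);

//without extern "C" the C++ compiler encodes the parameter types into the symbol,
//so the export is a mangled name (something like "?add@@YAMMM@Z" on MSVC)
__declspec(dllexport) float addMangled(const float a, const float b);

You can check what a dll actually exports with dumpbin /EXPORTS from a Visual Studio developer command prompt, which is handy when Deno says it can't find a symbol.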
//add.cpp
#include "add.h"
float add(float a, float b) {
return a + b;
}
This adds two numbers and includes our header. If you go the template route you get pch.h and a bunch of other random junk that doesn't seem necessary. The other thing is that in order to actually compile this we need to go into the project settings (right click the project name in Solution Explorer -> Configuration Properties -> General -> Configuration Type -> Dynamic Library (.dll)). By default it wants to compile an .exe, which needs a main function, and you'll get cryptic errors suggesting that. You also have to do this per build profile (release/debug).
Now we should be able to compile it (I used release mode just to be safe). Then from a Deno project I can import it:
const dylib = Deno.dlopen("./xdlltest.dll", {
add: { parameters: ["f32", "f32"], result: "f32"}
});
const result = dylib.symbols.add(345, 34);
console.log(result);
The result was 379, which means it's working.
Putting it together
So now we can build a simple CUDA C++ app, and a simple dll that we can use with Deno. So it should just be a matter of putting the two together.
So I created a new project with a .cu file:
//kernel.cu
#include "cuda_runtime.h"
#include "math.h" //for ceil
#include "kernel.h"
__global__ void addKernel(int size, const float* a, const float* b, float* output)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < size)
{
output[i] = a[i] + b[i];
}
}
float* add(const int size, const float* valuesA, const float* valuesB)
{
int BLOCK_SIZE = 32;
int GRID_SIZE = (int)ceil(size / (float)BLOCK_SIZE);
float* output = new float[size];
float* valuesA_d;
float* valuesB_d;
float* output_d;
cudaMalloc((void**)&valuesA_d, size * sizeof(float));
cudaMalloc((void**)&valuesB_d, size * sizeof(float));
cudaMalloc((void**)&output_d, size * sizeof(float));
cudaMemcpy(valuesA_d, valuesA, size * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(valuesB_d, valuesB, size * sizeof(float), cudaMemcpyHostToDevice);
addKernel<<<GRID_SIZE, BLOCK_SIZE >>> (size, valuesA_d, valuesB_d, output_d);
cudaMemcpy(output, output_d, size * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(valuesA_d);
cudaFree(valuesB_d);
cudaFree(output_d);
return output;
}
This is almost the same thing as when we tried out CUDA for the first time: we just pass in the size and the input buffers and return the output buffer. We also need the header file:
//kernel.h
extern "C" __declspec(dllexport) float* add(const int size, const float* valuesA, const float* valuesB);
We need to set the project to output a dll, but otherwise this works the same way as the dll example. Note that I did encounter some funkiness testing things out: certain #includes (like stdio.h) seemed to mess up the output dll due to a name conflict, Deno wasn't able to find the exports, and this gave me a lot of trouble debugging.
Barring that, on the Deno side we can import it. This took a little digging to figure out as Deno's FFI docs are not great. The nice part is, as long as dlopen works, you can console log in the C++ code to debug, which is a big step up from WGSL and WASM.
const dylib = Deno.dlopen("./kernel.dll", {
add: { parameters: ["i32", "buffer", "buffer"], result: "pointer"}
});
const a = new Float32Array([1,2,3,4]);
const b = new Float32Array([5,5,7,7]);
const resultPointer = dylib.symbols.add(a.length, a.buffer, b.buffer);
const buffer = Deno.UnsafePointerView.getArrayBuffer(resultPointer, a.byteLength);
const c = new Float32Array(buffer);
console.log(c);
We need to pass the float*s in as "buffer" types, which correspond to array buffers. The return value is a pointer; it's just the start address of the array and has no length, so we need to provide one and call Deno.UnsafePointerView.getArrayBuffer with the pointer. It's "unsafe" because we're doing raw memory access: if you screw this up bad things can happen and you'll read memory you didn't intend to. That's not such a big problem for us since we're not letting other people run this code, but it's something to consider. We can then view that ArrayBuffer as a Float32Array, and you can probably see where it goes from here.
FFI is kind of a superpower, so long as you're willing to subject yourself to writing in systems languages. This will likely only run on Windows though, and that is the real pain, because dealing with native binaries as a consumer sucks; it's probably the source of all my Python problems that led me down this road in the first place. Unless you plan to support macOS, Windows and Linux with your library, it's probably not the tool you want.
Getting rid of Visual Studio
I like vscode because it's pretty simple (yeah there's faster, but it's fast enough for me most of the time). The same cannot be said of Visual Studio which is big and bloated and hides a lot of details. If I'm going to be developing this I'd rather:
1) Use one code editor
2) Use one unified project
vscode can deal with C/C++/CUDA, especially because our setup is simple: we just need to compile some .cu files. So we just need to figure out what magic Visual Studio is doing under the hood to output the dll. I already know that it's calling nvcc, which is Nvidia's CUDA C++ compiler, so we just need to work out which args to pass it and then we can call it from a deno task.
It seems that installing CUDA also installed nvcc and added it to the path. However, when I run it I get an error: nvcc fatal : Cannot find compiler 'cl.exe' in PATH. cl.exe is part of the C++ compiler toolset that comes with Visual Studio, and the fix is adding it to the path. On my machine it was at C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.23.28105\bin\Hostx64\x64 (the exact path varies with the Visual Studio version and MSVC toolset you have installed). Once that's on the path everything should work. In order to compile the code as a dll we run the command like this:
nvcc -o ./dist/add --shared add.cu
This will output a .dll, .lib and .exp, but only the .dll is actually necessary. -o is the output file name and --shared generates a shared library (dll).
We can wire this up to a deno task and remove a heavy tool from the process. Nice!
Remaking the tensor class
With this newfound ability to call native CUDA code we can start remaking our tensor class. For this exercise I'm making a new class and implementing each method again. I'm also exporting it as Tensor so it's easy to swap out. Other libraries make a "kernel" abstraction: there is a single tensor class that calls out to different backend implementations. This isn't a bad idea and is probably useful for real production use-cases, but I want the code at this point to really spell out what's going on, because I found digging through other libraries to be absolutely impenetrable.
Here's the CUDATensor with a single add op implemented.
//CUDATensor.js
import { getTotalLength } from "./tensor-utils.js";
import { topologicalSort } from "./topological-sort.js";
const kernel = Deno.dlopen("./dist/kernel.dll", {
add: { parameters: ["i32", "buffer", "buffer"], result: "pointer" },
addBackprop: { parameters: ["i32", "buffer", "buffer", "buffer"], result: "void" }
});
const backward = Symbol("backward");
export class Tensor {
#shape;
#values;
#children;
#op;
#label;
[backward] = () => { };
#gradient = 0;
constructor({ values, shape, children, op, label }) {
const totalLength = getTotalLength(shape);
this.#shape = shape;
if (values) {
if (values.length != totalLength) throw new Error(`Supplied values are the wrong length, need ${totalLength}`);
if (values instanceof Float32Array) {
this.#values = values;
} else {
this.#values = new Float32Array(values)
}
} else {
this.#values = new Float32Array(totalLength);
}
this.#children = children ?? [];
this.#gradient = new Float32Array(totalLength);
this.#op = op ?? null;
this.#label = label ?? null;
}
get totalLength() {
return this.#values.length;
}
get shape() {
return this.#shape;
}
get values() {
return this.#values;
}
get gradient() {
return this.#gradient;
}
get children() {
return this.#children;
}
add(other) {
if (other.totalLength != this.totalLength) throw new Error(`Tensor not the right length, argument was ${other.totalLength}, needs to be ${this.totalLength}`);
const resultPointer = kernel.symbols.add(this.totalLength, this.#values, other.values);
const outBuffer = Deno.UnsafePointerView.getArrayBuffer(resultPointer, this.totalLength * 4);
const values = new Float32Array(outBuffer);
const result = new Tensor({
values,
shape: this.#shape,
children: [this, other],
op: "+"
});
result[backward] = () => {
kernel.symbols.addBackprop(this.totalLength, this.#gradient, other.gradient, result.gradient);
};
return result;
}
backward() {
this.#gradient = new Float32Array(this.totalLength).fill(1);
const sortedDependencies = topologicalSort(this, x => x.children).reverse();
for (const node of sortedDependencies) {
node[backward]();
}
}
}
While we already covered the add op, we need its backprop, so let's add something for that in the kernel.
//kernel.cu
__global__ void addBackpropKernel(const int size, float *gradA, float *gradB, float *gradResult)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < size)
{
gradA[idx] = gradA[idx] + gradResult[idx];
gradB[idx] = gradB[idx] + gradResult[idx];
}
}
void addBackprop(const int size, float *gradA, float *gradB, const float *gradResult)
{
int BLOCK_SIZE = 32;
int GRID_SIZE = (int)ceil(size / (float)BLOCK_SIZE);
float *gradA_d;
float *gradB_d;
float *gradResult_d;
cudaMalloc((void **)&gradA_d, size * sizeof(float));
cudaMalloc((void **)&gradB_d, size * sizeof(float));
cudaMalloc((void **)&gradResult_d, size * sizeof(float));
cudaMemcpy(gradA_d, gradA, size * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(gradB_d, gradB, size * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(gradResult_d, gradResult, size * sizeof(float), cudaMemcpyHostToDevice);
addBackpropKernel<<<GRID_SIZE, BLOCK_SIZE>>>(size, gradA_d, gradB_d, gradResult_d);
cudaMemcpy(gradA, gradA_d, size * sizeof(float), cudaMemcpyDeviceToHost);
cudaMemcpy(gradB, gradB_d, size * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(gradA_d);
cudaFree(gradB_d);
cudaFree(gradResult_d);
}
And remember to add it to the .h:
//kernel.h
extern "C" __declspec(dllexport) void addBackprop(const int size, float *gradA, float *gradB, const float *gradResult);
I think this should make sense since it's also just adding, but instead of one output buffer we have two. This also means that we are not going to return a value but rather modify the 2 array references passed in (I don't know if it would make more sense to modify add to work the same way...). However, while this works in most cases it actually has a bug: if both tensors are the same tensor it will get the wrong result, because the two gradients need to be added together. Since we copy memory onto the GPU separately, it doesn't seem possible to share a reference for gradA_d and gradB_d. This sucks and means we need to special case it.
void addBackprop(const int size, float *gradA, float *gradB, const float *gradResult)
{
int BLOCK_SIZE = 32;
int GRID_SIZE = (int)ceil(size / (float)BLOCK_SIZE);
float *gradA_d;
float *gradB_d;
float *gradResult_d;
cudaMalloc((void **)&gradA_d, size * sizeof(float));
cudaMalloc((void **)&gradB_d, size * sizeof(float));
cudaMalloc((void **)&gradResult_d, size * sizeof(float));
cudaMemcpy(gradA_d, gradA, size * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(gradB_d, gradB, size * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(gradResult_d, gradResult, size * sizeof(float), cudaMemcpyHostToDevice);
addBackpropKernel<<<GRID_SIZE, BLOCK_SIZE>>>(size, gradA_d, gradB_d, gradResult_d);
+ if(gradA == gradB){ //if same reference then we need to add them
+ float *gradUnified_d;
+ cudaMalloc((void **)&gradUnified_d, size * sizeof(float));
+ addKernel<<<GRID_SIZE, BLOCK_SIZE>>>(size, gradA_d, gradB_d, gradUnified_d);
+ cudaMemcpy(gradA, gradUnified_d, size * sizeof(float), cudaMemcpyDeviceToHost);
+ cudaMemcpy(gradB, gradUnified_d, size * sizeof(float), cudaMemcpyDeviceToHost);
+ cudaFree(gradUnified_d);
+ } else {
cudaMemcpy(gradA, gradA_d, size * sizeof(float), cudaMemcpyDeviceToHost);
cudaMemcpy(gradB, gradB_d, size * sizeof(float), cudaMemcpyDeviceToHost);
+ }
cudaFree(gradA_d);
cudaFree(gradB_d);
cudaFree(gradResult_d);
}
Here we check if gradA and gradB are the same (the pointer values are equal). If so, we special case it: we create a new output buffer and use our existing addKernel to add the two gradients together, then copy the result back to both gradA and gradB. There are probably more elegant ways to deal with this but they're beyond my knowledge at this point. On the plus side, since gradA_d and gradB_d are already sitting in GPU memory for reuse, this shouldn't be too bad and is strictly better than calling add again from Deno. One thing I never did figure out was why I needed gradUnified_d instead of just reusing gradA_d. Doing that produced an empty buffer, and from what I could tell the assignment silently fails. Something to do with how the GPU allocates it? const issues? I don't know, but we could save an allocation if I could figure it out.
Anyway, the rest of this is just an exercise in reimplementation so I'm not going to spell it out. We have all the tools to reimplement sub, div, mul and the other binary ops.
Note that if you export a function named "div" you will get errors!
error: more than one instance of overloaded function "div" has "C" linkage
This is because the C standard library already has a div function that gets pulled in through the headers we're including, and you can't have two extern "C" functions with the same name. Since we probably do want to keep those headers, we'll need to rename ours (other strategies got complicated). So from here on I suffixed all my exported functions with _op, e.g. div_op.
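For example, the renamed header declaration looks something like this (the signature just mirrors add and is illustrative):

//kernel.h
extern "C" __declspec(dllexport) float* div_op(const int size, const float* valuesA, const float* valuesB);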
Reduction Ops
Last time, reduction ops gave me trouble; they were certainly complex and required a lot of debugging, so I think they're worth looking at again.
//kernel.cu
__device__ int* insertAtIndex(const int* array, const int length, const int index, const int value){
int* outArray = (int *)malloc((length + 1) * sizeof(int));
int sourceIndex = 0;
int destinationIndex = 0;
while(destinationIndex < length + 1){
if(destinationIndex != index){
outArray[destinationIndex] = array[sourceIndex];
sourceIndex++;
destinationIndex++;
} else {
outArray[destinationIndex] = value;
destinationIndex++;
}
}
return outArray;
}
__device__ int *removeAtIndex(const int *array, const int length, const int index)
{
int *outArray = (int *)malloc((length - 1) * sizeof(int));
int sourceIndex = 0;
int destinationIndex = 0;
while (destinationIndex < length - 1)
{
if (sourceIndex != index)
{
outArray[destinationIndex] = array[sourceIndex];
sourceIndex++;
destinationIndex++;
}
else
{
sourceIndex++;
}
}
return outArray;
}
//shape is column major
__device__ int* getDimensionalIndex(int flatIndex, const int shape[], const int shapeSize)
{
int* indices = (int*)malloc(shapeSize * sizeof(int)); //sizeof(shape) would only be the size of the pointer
for(int i = 0; i < shapeSize; i++){
indices[i] = flatIndex % shape[i];
flatIndex = flatIndex / shape[i];
}
return indices;
}
//shape and dimensionalIndex are column major
__device__ int getFlatIndex(const int dimensionalIndex[], const int shape[], const int shapeSize)
{
int index = 0;
for (int i = 0; i < shapeSize; i++)
{
index *= shape[shapeSize - 1 - i];
index += dimensionalIndex[shapeSize - 1 - i];
}
return index;
}
__global__ void sumKernel(const int* shape, const int shapeSize, const int dimToReduce, const float* values, float* output)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int size = 1;
for (int j = 0; j < shapeSize; j++)
{
if (j != dimToReduce)
{
size *= shape[j];
}
}
if (idx < size)
{
output[idx] = 0.0f; //cudaMalloc does not zero memory, so initialize before accumulating
int *newShape = removeAtIndex(shape, shapeSize, dimToReduce);
//dimensional index within the reduced (output) shape
int *partialDimIndex = getDimensionalIndex(idx, newShape, shapeSize - 1);
for(int i = 0; i < shape[dimToReduce]; i++){
int* dimIndex = insertAtIndex(partialDimIndex, shapeSize - 1, dimToReduce, i);
int flatIdx = getFlatIndex(dimIndex, shape, shapeSize);
output[idx] += values[flatIdx];
free(dimIndex);
}
free(partialDimIndex);
free(newShape);
}
}
float* sum_op(const int* shape, const int shapeSize, const int dimToReduce, const float* values)
{
int size = 1;
for (int i = 0; i < shapeSize; i++)
{
size *= shape[i];
}
int outSize = size / shape[dimToReduce];
int BLOCK_SIZE = 32;
int GRID_SIZE = (int)ceil(outSize / (float)BLOCK_SIZE); //the grid only needs to cover the output elements
float* output = new float[outSize];
float* values_d;
float* output_d;
int* shape_d;
cudaMalloc((void**)&values_d, size * sizeof(float));
cudaMalloc((void**)&shape_d, shapeSize * sizeof(int));
cudaMalloc((void**)&output_d, outSize * sizeof(float));
cudaMemcpy(values_d, values, size * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(shape_d, shape, shapeSize * sizeof(int), cudaMemcpyHostToDevice);
sumKernel<<<GRID_SIZE, BLOCK_SIZE>>>(shape_d, shapeSize, dimToReduce, values_d, output_d);
cudaMemcpy(output, output_d, outSize * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(values_d);
cudaFree(shape_d);
cudaFree(output_d);
return output;
}
Alright, here's what I came up with. The sum_op function is easy enough to understand, I think. We have 3 buffers: the values, the output, and the shape. Even though shape is relatively simple, being an array of dimension sizes, it still needs a length (shapeSize) and has to be passed over as a typed array. We also need the dimension that we're going to reduce, dimToReduce, which is an index into shape. The only other somewhat interesting thing is that we recover the total size from the shape so it's not necessary to pass it in. We effectively need the starting and ending shape, with the ending shape just dividing out the dimension we're reducing.
For the GPU kernel itself, we figure out our index and recalculate the reduced size. This could potentially be optimized by passing the total size since we already have it; as written it's recalculated for every thread. I might go back and do that, so if it looks different in the code that's why.
The algorithm is the same as in the javascript version. I feel deep down that there's a non-allocating way to do this but I never figured it out (I apologize to any FAANG interviewers). We need methods to insert into and remove from the shape array; these treat their inputs as immutable and allocate new arrays, mirroring what the js code did with toSpliced. Note the __device__ qualifier, which makes a helper function that can only be called from GPU code. We first get the new shape, which is the old shape with the reduced dimension removed. We calculate all the other dimensional indices based on our thread index, leaving a placeholder for the one we're reducing. Then, in a loop over that dimension, we insert the missing dimensional index, convert the array of indices into a flat index into the values buffer, and add the results up into the corresponding output slot.
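If the index juggling is hard to picture, here's a small CPU-only sketch of the same column-major index math for a 2x3 tensor reduced over dimension 1 (the names and shape are just for illustration):

#include "stdio.h"

int main()
{
    //shape {2,3}, column major: dimension 0 varies fastest in the flat buffer
    int shape[2] = { 2, 3 };
    //reducing dimension 1 collapses 6 values into 2 outputs
    for (int outIdx = 0; outIdx < shape[0]; outIdx++)
    {
        printf("output[%i] = ", outIdx);
        for (int i = 0; i < shape[1]; i++)
        {
            //dimensional index is {outIdx, i}, flattened column major
            int flatIdx = i * shape[0] + outIdx;
            printf("values[%i]%s", flatIdx, i == shape[1] - 1 ? "\n" : " + ");
        }
    }
    //prints: output[0] = values[0] + values[2] + values[4]
    //        output[1] = values[1] + values[3] + values[5]
    return 0;
}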
Tips if you are having issues
Use printf
printf works like console.log for C (you need to add your own newline). You get it by including stdio.h. The nice thing is that it works in GPU code! If you've ever worked with WebGL or WebGPU you've probably felt the pain of not having a way to log things. However, if you use it in GPU code you should also call cudaDeviceSynchronize().
//calling the GPU code
cudaMemcpy(gradA_d, gradA, size * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(gradB_d, gradB, size * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(gradResult_d, gradResult, size * sizeof(float), cudaMemcpyHostToDevice);
addKernel<<<GRID_SIZE, BLOCK_SIZE>>>(size, gradA_d, gradB_d, gradResult_d);
cudaDeviceSynchronize();
This waits for the GPU to finish, which flushes the print buffer so prints don't get missed. Since those prints run once per thread on the GPU they can easily become unreadable, and they will interleave themselves in the console. It's recommended that you add the thread id to the message so that you can tell where it came from:
int idx = blockIdx.x * blockDim.x + threadIdx.x;
printf("thread: %i\n", idx);
Also, if you just want to check a global thing, like say the values of a buffer being passed in, you can use an if block to print only once:
if(idx == 0){
for(int i = 0; i < valueSize; i++){
printf("%f\n", values[i]);
}
}
Catch CUDA errors
Sometimes something in the GPU code will not work at runtime but it will silently fail. In this case it will complete but you won't get the expected result and it might impact later unrelated calls too! You can add some debug logging to the kernels. Right after the call to the kernel function:
addKernel<<<GRID_SIZE, BLOCK_SIZE>>>(size, gradA_d, gradB_d, gradResult_d);
cudaDeviceSynchronize();
cudaError_t err = cudaGetLastError();
if(err != cudaSuccess){
printf("CUDA error: %s\n", cudaGetErrorString(err));
}
This isn't strictly needed if everything works but might be good to guard you from future issues.
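If you find yourself sprinkling this everywhere, a small helper macro is a common pattern (a sketch; the name is just a convention, not part of CUDA):

#include "stdio.h"
#include "cuda_runtime.h"

//wraps a CUDA runtime call and prints where it failed
#define CUDA_CHECK(call) do { \
    cudaError_t err = (call); \
    if (err != cudaSuccess) { \
        printf("CUDA error: %s at %s:%d\n", cudaGetErrorString(err), __FILE__, __LINE__); \
    } \
} while (0)

//usage:
//CUDA_CHECK(cudaMemcpy(a_d, a, size * sizeof(float), cudaMemcpyHostToDevice));
//addKernel<<<GRID_SIZE, BLOCK_SIZE>>>(size, a_d, b_d, c_d);
//CUDA_CHECK(cudaGetLastError());
//CUDA_CHECK(cudaDeviceSynchronize());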
Double check your sizes
One of the more common problems I ran into was using the wrong size for a buffer, which would truncate it with zeros or give otherwise garbage data. One place where it's easy to get tripped up is the length of an array versus the byteLength of an array. For example, I mistook sizeof(array) as getting the number of elements when it actually gets a byte size (and on a pointer, just the size of the pointer; sorry, haven't used C since college). In another case I forgot ceil on my grid size and rounded down to a grid size of 0, which gives errors. Just check for things like that if values seem off.
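To make the distinction concrete, here's a quick sketch:

#include "stdio.h"

int main()
{
    int N = 6;
    float* values = new float[N];
    printf("%i\n", (int)sizeof(values));    //8 on x64: just the size of the pointer, not the data
    printf("%i\n", (int)(N * sizeof(float))); //24: the byte size cudaMalloc/cudaMemcpy actually want
    printf("%i\n", N);                      //6: the element count kernels usually take as "size"
    delete[] values;
    return 0;
}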
Double check you aren't using CPU pointers on the GPU and vice-versa
At one point I passed in valuesA instead of valuesA_d, that is, I used the CPU pointer instead of the GPU pointer. This will cause an illegal memory access.
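Using the names from the add example, the difference is just which pointer you hand to the kernel:

//wrong: valuesA and valuesB are host (CPU) pointers - illegal memory access on the GPU
addKernel<<<GRID_SIZE, BLOCK_SIZE>>>(size, valuesA, valuesB, output_d);

//right: the _d pointers came from cudaMalloc and live in GPU memory
addKernel<<<GRID_SIZE, BLOCK_SIZE>>>(size, valuesA_d, valuesB_d, output_d);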
Double check your .h matches your .cu
Another amateur mistake, but it's hard to tell when it happens. It will compile fine but Deno will usually tell you that it cannot find a symbol. This also means that types should match exactly: if you use const parameters for immutability, they need to be const in the header file too.
Double check you are passing the right param types in Deno
If you copy paste:
negBackprop_op: { parameters: ["i32", "buffer", "buffer"], result: "pointer" },
But change it to:
expBackprop_op: { parameters: ["i32", "buffer", "buffer"], result: "pointer" },
You will get weird crashes. This is because expBackprop_op should have looked like:
expBackprop_op: { parameters: ["i32", "buffer", "buffer", "buffer"], result: "pointer" }
Missing that last buffer will not cause an error, but it will cause an exception in the library when it tries to access values. You won't get an error message, it will just stop, so be careful that you are defining the signature correctly on the Deno side too.
If the kernel runtime errors it can impact future calls
This was a weird one to debug: adding a new test broke old, existing tests. I was getting silent errors from the runtime and these broke future unrelated calls. You may need to back out new changes if you see odd behavior like this, and make sure you check the CUDA errors.
Check for name collisions
As noted above, it took me a while to figure out that the C standard library already declares a function named "div". If you are getting errors about overloaded functions that you aren't overloading, this could be the cause.
Code
The code for this post can be found here: https://github.com/ndesmic/mljs/compare/v1.0...v1.1
References
- https://www.youtube.com/playlist?list=PL5XwKDZZlwaY7t0M5OLprpkJUIrF8Lc9j
- https://docs.nvidia.com/cuda/cuda-quick-start-guide/index.html
- https://en.wikipedia.org/wiki/CUDA
- https://docs.deno.com/runtime/reference/deno_namespace_apis/#non-blocking-ffi
- https://medium.com/deno-the-complete-reference/calling-c-functions-from-deno-part-2-pass-buffers-ad168a3b6cc7
- https://stackoverflow.com/questions/8125826/error-compiling-cuda-from-command-prompt
- https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html
- https://leimao.github.io/blog/Proper-CUDA-Error-Checking/