Building a CUDA-Accelerated Neural Network Library in Rust

The library, CorrosiveNet, is organized into three crates: corrosive-tensor handles the core tensor operations and multi-dimensional arrays, corrosive-nn provides neural network components like layers and activations, and corrosive-cuda implements the GPU acceleration backend.
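At a glance, the workspace looks roughly like this (the crate boundaries are real; the one-line summaries are my own shorthand):

CorrosiveNet/
├── corrosive-tensor/   # Tensor type, shapes/strides, CPU ops, broadcasting
├── corrosive-nn/       # layers, activations, Sequential, optimizers
└── corrosive-cuda/     # CUDA backend, device management
    └── kernels/        # .cu files, organized by operation type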

Right now, the tensor operations are mostly working. I've got multi-dimensional arrays with proper shape and stride handling, element-wise operations like addition, subtraction, multiplication, and division, along with unary ops like negation, absolute value, square root, exponentiation, and logarithm. Scalar operations work too, and broadcasting follows NumPy's rules. Reshaping and transposition are implemented as well.
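To give a feel for the API, here's roughly what that looks like in practice (a sketch; method names like reshape and transpose are illustrative, so check the repo for the exact signatures):

// Sketch of the CPU-side API; names are illustrative, not exact signatures
let a = Tensor::randn(&[2, 3]);
let b = Tensor::randn(&[2, 3]);

// Element-wise ops via operator overloads on references
let c = &a + &b;
let d = &a * &b;

// Broadcasting follows NumPy's rules: [2, 3] + [3] broadcasts over rows
let row = Tensor::randn(&[3]);
let e = &a + &row;

// Scalar ops and shape manipulation
let scaled = &a * 2.0;
let reshaped = a.reshape(&[3, 2])?;
let transposed = reshaped.transpose()?;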

The CUDA support is where things get interesting. I've implemented device management for CPU ↔ GPU transfers, written custom CUDA kernels for element-wise operations, and built a PyTorch-style .to(device) API. The tensors use an exclusive storage model, meaning they live on one device at a time rather than maintaining dual CPU/GPU copies.

For the neural network side, I have the basic layer structure sketched out with Linear layers and ReLU activations in progress. Parameter management is working, and there's a Sequential model container that lets you stack layers together.
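The end goal is a familiar, PyTorch-flavored building experience. Here's a sketch of where the API is heading (constructor and method names may still shift):

// Where the nn API is heading (a sketch, not final signatures)
let model = Sequential::new()
    .add(Linear::new(784, 128))
    .add(ReLU::new())
    .add(Linear::new(128, 10));

let logits = model.forward(&input)?;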

The big missing pieces? Automatic differentiation is the elephant in the room. Without autograd, you can't really train networks effectively. Matrix multiplication on GPU needs cuBLAS integration, which is planned but not done yet. And obviously without these core pieces, actual training loops aren't happening. I've started implementing optimizers like SGD and Adam, but they're only partially complete.

Learning from PyTorch

Why reinvent the wheel when you can learn from the masters? PyTorch has been battle-tested on millions of models, so I'm borrowing their design patterns. Here's what I learned and why these decisions matter.

1. The .to(device) API

PyTorch's device transfer API is elegant, so I copied it:

// Create tensor on CPU
let x = Tensor::randn(&[1000, 1000]);

// Move to GPU
let x_gpu = x.cuda()?;

// Move back to CPU
let x_cpu = x_gpu.cpu()?;

// Or use the general form
let x_gpu = x.to(Device::cuda())?;

The .to() method borrows the tensor rather than consuming it, which means the original stays valid. PyTorch does this because it's more flexible - you might want to keep both CPU and GPU versions for debugging, or transfer the same tensor to multiple devices. It's a proven design.
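A small example of what that buys you: after a transfer, the original tensor is still there to work with.

// x is borrowed, not consumed, so it stays usable after the transfer
let x = Tensor::randn(&[1000, 1000]);
let x_gpu = x.to(Device::cuda())?;
let y = &x + &x;  // the CPU copy is still valid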

2. Exclusive Storage (No Dual CPU/GPU Copies)

Should tensors keep both CPU and GPU copies in sync? I looked at how PyTorch handles this:

pub enum TensorStorage<T> {
    CPU(Vec<T>),
    #[cfg(feature = "cuda")]
    CUDA {
        context: Arc<CudaContext>,
        buffer: CudaSlice<T>,
    }
}

A tensor lives on one device at a time. PyTorch does this for good reasons: memory efficiency (no wasting GPU memory on dual copies), clear ownership (you always know where your data is), and explicit transfers force you to think about data movement costs. When PyTorch has been using this pattern successfully for years, it's worth following.

3. Strict Device Checking

PyTorch fails fast if you mix CPU and GPU tensors in the same operation. I implemented the same behavior:

let a = Tensor::randn(&[100, 100]).cuda()?;  // GPU
let b = Tensor::randn(&[100, 100]);          // CPU
let c = &a + &b;  // ❌ ERROR: tensors must be on same device

This prevents hidden performance bugs. If I auto-transferred tensors behind your back, you'd never know why your training is slow. PyTorch chose explicit over convenient, and they were right - performance bugs are much harder to debug than type errors.

The Tensor Implementation

Let me show you the actual core structures. Here's how the Tensor is defined:

/// Internal storage for tensor data - either on CPU or GPU
pub enum TensorStorage<T> {
    CPU(Vec<T>),
    #[cfg(feature = "cuda")]
    CUDA {
        context: Arc<CudaContext>,
        buffer: CudaSlice<T>,
        device_idx: usize,
    }
}

pub struct Tensor<T> {
    storage: TensorStorage<T>,  // Data lives here
    shape: Vec<usize>,          // Dimensions [batch, height, width, ...]
    strides: Vec<usize>,        // How to traverse the data
}

The key insight is that TensorStorage is an enum - data lives in one place at a time, not both. This is what enables the exclusive storage model I mentioned earlier.

Here's how device checking works in practice:

impl<T> Tensor<T> {
    fn device(&self) -> Device {
        match &self.storage {
            TensorStorage::CPU(_) => Device::CPU,
            #[cfg(feature = "cuda")]
            TensorStorage::CUDA { device_idx, .. } => Device::CUDA(*device_idx),
        }
    }

    fn has_same_device(&self, other: &Tensor<T>) -> bool {
        match (self.device(), other.device()) {
            (Device::CPU, Device::CPU) => true,
            (Device::CUDA(idx1), Device::CUDA(idx2)) => idx1 == idx2,
            _ => false,
        }
    }
}

And the .to() method that handles device transfers:

fn to(&self, device: Device) -> Result<Self, TensorError> {
    if self.device() == device {
        return Ok(self.clone());
    }

    match (&self.storage, device) {
        // CPU -> GPU: Copy data to GPU memory
        #[cfg(feature = "cuda")]
        (TensorStorage::CPU(data), Device::CUDA(idx)) => {
            let backend = CudaBackend::new(idx)?;
            let buffer = backend.context().htod_sync_copy(data)?;
            // ... create new tensor with CUDA storage
        }
        // GPU -> CPU: Copy data back to CPU memory
        #[cfg(feature = "cuda")]
        (TensorStorage::CUDA { buffer, .. }, Device::CPU) => {
            let data = buffer.dtoh_sync_copy()?;
            // ... create new tensor with CPU storage
        }
        _ => Err(TensorError::new("Unsupported device transfer"))
    }
}

The beauty of this design is that it's type-safe and explicit. You can't accidentally use GPU memory as CPU memory because the type system won't let you. Pattern matching on the TensorStorage enum forces you to handle both cases.

CUDA Integration: The Fun Part

Getting CUDA working from Rust was... an adventure. I'm using cudarc which provides safe Rust bindings.

My First CUDA Kernel

Here's the element-wise addition kernel (yes, I'm aware this is trivial, but you gotta start somewhere):

extern "C" __global__ void elementwise_add_f32(
    const float* a,
    const float* b,
    float* c,
    size_t n
) {
    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

Calling it from Rust:

#[cfg(feature = "cuda")]
impl Tensor<f32> {
    fn cuda_add(&self, other: &Tensor<f32>) -> Result<Tensor<f32>, TensorError> {
        let backend = self.get_cuda_backend()?;

        // Allocate output buffer on GPU
        let mut c_buf = backend.alloc_zeros::<f32>(self.size())?;

        // Launch kernel
        let kernel = backend.load_kernel("elementwise_add")?;
        backend.launch_elementwise(
            kernel,
            self.buffer(),
            other.buffer(),
            &mut c_buf,
            self.size()
        )?;

        Ok(Tensor::from_cuda_buffer(c_buf, self.shape().to_vec()))
    }
}

Feature Flags: CPU by Default

The library compiles without CUDA dependencies by default. You build with cargo build for CPU-only, or cargo build --features cuda for GPU support. This means smaller binaries for CPU-only users, clear compile errors if you try to use CUDA without the feature flag, and the ability to develop on machines without GPUs.
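The wiring behind this is a plain optional dependency plus a feature gate in Cargo.toml; roughly like this (the version number is a placeholder, not what the repo pins):

[dependencies]
# cudarc only compiles when the `cuda` feature is enabled
cudarc = { version = "0.11", optional = true }

[features]
default = []
cuda = ["dep:cudarc"]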

What I Learned (The Hard Way)

1. Getting cudarc Working Was the Real Challenge

The hardest part wasn't understanding CUDA concepts - it was actually getting the Rust bindings to work correctly. Wrestling with cudarc's API, figuring out the right feature flags, getting kernels to compile and load, and dealing with device context management took way longer than writing the actual CUDA code. Once I had the infrastructure working, writing kernels was straightforward.

2. Type Traits Make Generic Tensors Possible

I defined custom traits like TensorNum and TensorElement to constrain which types can be used in tensors. This lets the same tensor code work with f32, f64, and other numeric types without code duplication. Having all the trait definitions in one place makes it easy to see what operations each type needs to support.
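A simplified sketch of the idea (the actual bounds in the repo are more complete than this):

// Simplified sketch: what "usable as a tensor element" might require
pub trait TensorElement: Copy + Send + Sync + 'static {}

// Numeric elements additionally need arithmetic and comparison
pub trait TensorNum:
    TensorElement
    + std::ops::Add<Output = Self>
    + std::ops::Sub<Output = Self>
    + std::ops::Mul<Output = Self>
    + std::ops::Div<Output = Self>
    + std::ops::Neg<Output = Self>
    + PartialOrd
{
    fn zero() -> Self;
    fn one() -> Self;
}

impl TensorElement for f32 {}
impl TensorNum for f32 {
    fn zero() -> Self { 0.0 }
    fn one() -> Self { 1.0 }
}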

3. Kernel Organization Matters

I organized the CUDA kernels by operation type:

corrosive-cuda/kernels/
├── elementwise/
│   ├── add.cu
│   ├── sub.cu
│   ├── mul.cu
│   └── div.cu
├── activations/
│   ├── relu.cu
│   └── sigmoid.cu
└── reduce/
    └── sum.cu

Each kernel file is small, usually 10-20 lines of CUDA code, which makes them easy to find and modify. This structure also makes it clear what's implemented and what's still missing.

4. Error Handling Needs Work

Right now most of my error handling is just string formatting. I need to properly distinguish between different error types: shape mismatches are user errors that should be recoverable, CUDA out of memory is a system error that might warrant falling back to CPU, device mismatches are user errors with a clear fix, and kernel launch failures are developer errors that might be panic-worthy. This is definitely on the TODO list.
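Concretely, the direction I want to take it is a structured error enum along these lines (a sketch of the plan, not code that exists yet):

// Planned shape of TensorError (sketch; not implemented yet)
pub enum TensorError {
    /// User error: shapes don't match and can't be broadcast together
    ShapeMismatch { expected: Vec<usize>, got: Vec<usize> },
    /// User error: operands live on different devices; the fix is an explicit .to()
    DeviceMismatch { lhs: Device, rhs: Device },
    /// System error: GPU allocation failed; could fall back to CPU
    CudaOutOfMemory { requested_bytes: usize },
    /// Developer error: a kernel failed to compile, load, or launch
    KernelLaunchFailed(String),
}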

The Elephant in the Room: Autograd

The big missing piece is automatic differentiation. Without it, this is basically a fancy array library - you can do forward passes, but manually computing gradients for backpropagation isn't practical for anything beyond toy examples.

Autograd can seem overwhelming when you first approach it. Building a computational graph, tracking operations and their derivatives, making it work efficiently with CUDA so gradients stay on the GPU, and doing all this bookkeeping without destroying performance - it's complex. But it's also the core feature that makes a neural network library actually useful.

The foundation is solid now. Tensor operations work, device management works, CUDA kernels work. A tape-based autograd system is the natural next step, and while it's challenging, it's definitely achievable. This is what I'm planning to tackle next.
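To make "tape-based" concrete, the rough shape I have in mind looks like this (placeholder names, very much a sketch):

// Rough sketch of a tape-based autograd (placeholder names, not library code)
struct TapeEntry {
    // Tape indices of this operation's inputs
    inputs: Vec<usize>,
    // Given the gradient of this op's output, produce gradients for each input
    backward: Box<dyn Fn(&Tensor<f32>) -> Vec<Tensor<f32>>>,
}

struct Tape {
    entries: Vec<TapeEntry>,
}

// Backprop then means: seed the output gradient with ones, walk the tape in
// reverse, call each entry's backward function, and accumulate the resulting
// gradients into the slots of its inputs.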

What's Next

The immediate priority is getting cuBLAS integrated for matrix multiplication, since that's where neural networks spend 90% of their training time. After that comes autograd - the feature that will transform this from an array library into something that can actually train models. I also need to implement activation functions with proper forward and backward passes, and most importantly, get one complete working example running - something like MNIST training from scratch.
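A naive CPU matmul makes a handy correctness reference to check the cuBLAS path against; here's a sketch (row-major, deliberately simple, no blocking or SIMD):

// Row-major reference matmul: C[m x n] = A[m x k] * B[k x n]
// Deliberately naive; only meant for checking GPU results on small sizes.
fn matmul_ref(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        for p in 0..k {
            let a_ip = a[i * k + p];
            for j in 0..n {
                c[i * n + j] += a_ip * b[p * n + j];
            }
        }
    }
    c
}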

Looking further ahead, I want to finish the optimizer implementations for SGD and Adam, add more layer types like Conv2D and BatchNorm, run proper benchmarks to see if the GPU is actually faster (it should be, but I need to prove it), and improve those error messages I mentioned earlier.
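The SGD step itself is the easy part once gradients exist; elementwise it's just param -= lr * grad. A sketch over raw slices (the real optimizer will operate on Tensors):

// Plain SGD update over raw slices (sketch; the real optimizer works on Tensors)
fn sgd_step(params: &mut [f32], grads: &[f32], lr: f32) {
    for (p, g) in params.iter_mut().zip(grads.iter()) {
        *p -= lr * g;
    }
}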

Learning in Public

Sharing this journey serves multiple purposes. Posting publicly creates accountability - it's harder to abandon a project when you've told people about it. Explaining things forces me to actually understand them deeply, not just get them working. And hopefully someone who knows CUDA or Rust better than me will point out mistakes or suggest improvements. Plus, future me will appreciate having documented these early design decisions.

The Rust ML ecosystem is still young. There are great projects like burn and candle, but there's room for more exploration and different approaches. Even if CorrosiveNet never becomes production-ready, understanding how ML frameworks work under the hood is valuable.

The code is open source at github.com/Cemonix/CorrosiveNet. If you're building ML infrastructure in Rust, or just curious about CUDA and GPU programming, feel free to check it out. I'm learning as I go, and feedback is always welcome.


Current status: 15% of a neural network library, 100% learning experience.

Most satisfying moment: First successful GPU tensor addition. It was just adding two arrays, but it was my arrays on my GPU with my kernel. 🎉
