DEV Community

Cover image for Advanced GPU Optimization: How can I teach an LLM with CUDA and ROCm?
Javad
Javad

Posted on

Advanced GPU Optimization: How can I teach an LLM with CUDA and ROCm?

Hey Dev Community!
I'm glad that I'm back after months! Reason is because of my Country!
But I'm here again, and I want to teach you some new interesting things about AI and GPU!

Part 1: From Zero to Matrix Multiplication – Building the Foundation

Introduction

Training a Large Language Model (LLM) is not magic. At its core, it’s massive matrix multiplication (GEMM), attention mechanisms, and gradient descent – all accelerated by GPUs. But how do you actually teach an LLM using CUDA (NVIDIA) or ROCm (AMD)? This two-part guide will take you from absolute zero to a fully functional training loop, with code examples for both platforms.

By the end, you will understand:

  • How to write custom CUDA/HIP kernels for LLM building blocks.
  • How to implement a transformer layer from scratch on GPU.
  • How to optimize memory and compute for large-scale training.

Prerequisites: Basic C++, some Python, and a GPU (NVIDIA or AMD) with CUDA 11+ or ROCm 5+.


1. Setting Up the Environment

For CUDA (NVIDIA)

nvcc --version   # should show 11.x or 12.x
Enter fullscreen mode Exit fullscreen mode

For ROCm (AMD)

hipcc --version
Enter fullscreen mode Exit fullscreen mode

We'll write portable HIP code that runs on both NVIDIA and AMD. HIP is essentially CUDA with a different prefix (hip instead of cuda). We'll use macros to keep the same codebase.

Example kernel launch:

// HIP (works on both)
#include <hip/hip_runtime.h>
#define CHECK_HIP(cmd) { hipError_t err = cmd; if(err != hipSuccess) { printf("HIP error: %s\n", hipGetErrorString(err)); exit(1); } }
Enter fullscreen mode Exit fullscreen mode

2. The Absolute Basic: Vector Addition (Warm-up)

Before an LLM, let's ensure GPU communication works.

__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if(idx < n) c[idx] = a[idx] + b[idx];
}

int main() {
    int n = 1 << 20;
    float *a, *b, *c; // device pointers
    CHECK_HIP(hipMalloc(&a, n*sizeof(float)));
    CHECK_HIP(hipMalloc(&b, n*sizeof(float)));
    CHECK_HIP(hipMalloc(&c, n*sizeof(float)));
    // ... copy data, launch kernel, copy back
}
Enter fullscreen mode Exit fullscreen mode

This is trivial, but an LLM is just billions of such operations.


3. Matrix Multiplication (GEMM) – The Heart of LLMs

Every transformer block is 90% matrix multiplications: Q·Kᵀ, attention·V, FFN projections.

We'll implement a naïve matmul and then tiled with shared memory.

3.1 Naïve matmul kernel

__global__ void matmul_naive(const float* A, const float* B, float* C, int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if(row < M && col < N) {
        float sum = 0.0f;
        for(int k = 0; k < K; ++k)
            sum += A[row * K + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}
Enter fullscreen mode Exit fullscreen mode

Launch configuration:

dim3 block(16,16);
dim3 grid((N+15)/16, (M+15)/16);
matmul_naive<<<grid, block>>>(d_A, d_B, d_C, M, N, K);
Enter fullscreen mode Exit fullscreen mode

This is terribly slow for large matrices (1000+). We need tiling.

3.2 Tiled matmul with shared memory

#define TILE_SIZE 16

__global__ void matmul_tiled(const float* A, const float* B, float* C, int M, int N, int K) {
    __shared__ float As[TILE_SIZE][TILE_SIZE];
    __shared__ float Bs[TILE_SIZE][TILE_SIZE];

    int bx = blockIdx.x, by = blockIdx.y;
    int tx = threadIdx.x, ty = threadIdx.y;

    int row = by * TILE_SIZE + ty;
    int col = bx * TILE_SIZE + tx;

    float sum = 0.0f;

    for(int k_tile = 0; k_tile < (K + TILE_SIZE - 1) / TILE_SIZE; ++k_tile) {
        // Load A and B tiles into shared memory with bounds checking
        if(row < M && (k_tile * TILE_SIZE + tx) < K)
            As[ty][tx] = A[row * K + k_tile * TILE_SIZE + tx];
        else As[ty][tx] = 0.0f;

        if(col < N && (k_tile * TILE_SIZE + ty) < K)
            Bs[ty][tx] = B[(k_tile * TILE_SIZE + ty) * N + col];
        else Bs[ty][tx] = 0.0f;

        __syncthreads();

        for(int k = 0; k < TILE_SIZE; ++k)
            sum += As[ty][k] * Bs[k][tx];

        __syncthreads();
    }

    if(row < M && col < N)
        C[row * N + col] = sum;
}
Enter fullscreen mode Exit fullscreen mode

This tiled kernel is ~10x faster than naïve for large matrices. For LLMs, we use highly optimized libraries like cuBLAS/rocBLAS, but writing your own teaches you memory hierarchy.


4. Building a Single Transformer Block on GPU

A transformer block consists of:

  1. Multi-head self-attention (softmax(QKᵀ/√d) V)
  2. Feed-forward network (two linear layers with ReLU)
  3. LayerNorm + residual connections

We'll implement a simplified version in mixed HIP/C++ that can be called from Python via pybind11 or directly used in a C++ training loop.

4.1 Attention kernel (simplified, single head)

We need:

  • Q, K, V matrices of shape [seq_len, d_head]
  • Output = softmax(Q·Kᵀ/√d) · V

Let's write a kernel that computes the attention output for one head.

__global__ void attention_kernel(const float* Q, const float* K, const float* V,
                                 float* O, int seq_len, int d_head) {
    extern __shared__ float shared_mem[]; // dynamic shared memory for partial softmax
    int row = threadIdx.y + blockIdx.y * blockDim.y; // query position
    int col = threadIdx.x + blockIdx.x * blockDim.x; // output feature

    if(row >= seq_len || col >= d_head) return;

    // Step 1: compute scores for this query row: scores[j] = Q[row,:] dot K[j,:] / sqrt(d_head)
    // We'll compute in registers and store in shared memory for softmax.
    // For simplicity, we compute the whole row of scores sequentially.
    float* scores = shared_mem; // size = seq_len floats
    float scale = rsqrtf((float)d_head);

    float max_score = -INFINITY;
    for(int j = 0; j < seq_len; ++j) {
        float dot = 0.0f;
        for(int k = 0; k < d_head; ++k) {
            dot += Q[row * d_head + k] * K[j * d_head + k];
        }
        scores[j] = dot * scale;
        if(scores[j] > max_score) max_score = scores[j];
    }

    // Step 2: exp and sum (numerically stable)
    float sum_exp = 0.0f;
    for(int j = 0; j < seq_len; ++j) {
        scores[j] = expf(scores[j] - max_score);
        sum_exp += scores[j];
    }

    // Step 3: weighted sum of V
    float out_val = 0.0f;
    for(int j = 0; j < seq_len; ++j) {
        float attn = scores[j] / sum_exp;
        out_val += attn * V[j * d_head + col];
    }
    O[row * d_head + col] = out_val;
}
Enter fullscreen mode Exit fullscreen mode

This is a single-head, no masking version. Real LLMs have multiple heads, causal masking, and flash attention. But this kernel already teaches you the entire attention mechanism on GPU.

Launch with:

dim3 threads(16, 16); // each block handles 16x16 output elements
dim3 blocks((d_head+15)/16, (seq_len+15)/16);
size_t smem = seq_len * sizeof(float); // shared memory for scores
attention_kernel<<<blocks, threads, smem>>>(d_Q, d_K, d_V, d_O, seq_len, d_head);
Enter fullscreen mode Exit fullscreen mode

5. Multi-Head Attention with Batching

In reality, we have batch_size, num_heads, seq_len, d_head. The total dimension d_model = num_heads * d_head.

We can reshape Q,K,V to [batch, num_heads, seq_len, d_head] and launch the above kernel per head. Or better, use cuBLAS/rocBLAS for matmuls and custom softmax kernels.

Here's a batched multi-head attention using rocBLAS (identical to cuBLAS):

#include <hipblas/hipblas.h>
hipblasHandle_t handle;
hipblasCreate(&handle);

// Assume Q, K, V are [batch, seq_len, num_heads, d_head] in row-major.
// We'll use hipblasGemmStridedBatched.
float alpha = 1.0f / sqrtf(d_head), beta = 0.0f;
hipblasSgemmStridedBatched(handle, HIPBLAS_OP_T, HIPBLAS_OP_N,
                           seq_len, seq_len, d_head,
                           &alpha, K, d_head, seq_len*d_head,
                           Q, d_head, seq_len*d_head,
                           &beta, scores, seq_len, seq_len*seq_len,
                           batch_size * num_heads);
// Then softmax (custom kernel), then gemm with V.
Enter fullscreen mode Exit fullscreen mode

This is much faster than custom kernels because it uses tensor cores/matrix cores.


6. The Forward Pass of a Transformer Layer (Full Code)

We'll now combine all pieces into a single HIP/C++ function that runs a transformer block. This is the core of training an LLM.

void transformer_block_forward(
    float* d_X,        // input [batch, seq_len, d_model]
    float* d_Wq, float* d_Wk, float* d_Wv, float* d_Wo, // weights
    float* d_ffn1, float* d_ffn2,
    float* d_Y,        // output
    int B, int S, int D, int H, int D_head) // dimensions
{
    // D = H * D_head
    // Q = X * Wq  etc.
    float alpha = 1.0f, beta = 0.0f;
    // Gemm for Q, K, V
    hipblasSgemm(handle, HIPBLAS_OP_N, HIPBLAS_OP_N,
                 S, D, D, &alpha, d_X, S, d_Wq, D, &beta, d_Q, S);
    // similarly for K, V
    // ... then reshape to [B, H, S, D_head] (using kernel or view)

    // Attention: compute scores = Q*K^T / sqrt(D_head)
    // Use batched gemm as shown above.

    // Softmax (custom kernel)
    softmax_kernel<<<...>>>(d_scores, B*H*S, S);

    // Output = scores * V
    hipblasSgemmStridedBatched(...);

    // Then Wo projection
    // Residual add
    // LayerNorm
    // FFN: two linear layers with ReLU
}
Enter fullscreen mode Exit fullscreen mode

This is the exact blueprint of a transformer block on GPU.

Do you have any question? Leave them here as comments and we will respond to them.

See ya on Part 2 of this interesting blog! Have a nice time!

Top comments (0)