Hey Dev Community!
I'm glad that I'm back after months! Reason is because of my Country!
But I'm here again, and I want to teach you some new interesting things about AI and GPU!
Part 1: From Zero to Matrix Multiplication – Building the Foundation
Introduction
Training a Large Language Model (LLM) is not magic. At its core, it’s massive matrix multiplication (GEMM), attention mechanisms, and gradient descent – all accelerated by GPUs. But how do you actually teach an LLM using CUDA (NVIDIA) or ROCm (AMD)? This two-part guide will take you from absolute zero to a fully functional training loop, with code examples for both platforms.
By the end, you will understand:
- How to write custom CUDA/HIP kernels for LLM building blocks.
- How to implement a transformer layer from scratch on GPU.
- How to optimize memory and compute for large-scale training.
Prerequisites: Basic C++, some Python, and a GPU (NVIDIA or AMD) with CUDA 11+ or ROCm 5+.
1. Setting Up the Environment
For CUDA (NVIDIA)
nvcc --version # should show 11.x or 12.x
For ROCm (AMD)
hipcc --version
We'll write portable HIP code that runs on both NVIDIA and AMD. HIP is essentially CUDA with a different prefix (hip instead of cuda). We'll use macros to keep the same codebase.
Example kernel launch:
// HIP (works on both)
#include <hip/hip_runtime.h>
#define CHECK_HIP(cmd) { hipError_t err = cmd; if(err != hipSuccess) { printf("HIP error: %s\n", hipGetErrorString(err)); exit(1); } }
2. The Absolute Basic: Vector Addition (Warm-up)
Before an LLM, let's ensure GPU communication works.
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if(idx < n) c[idx] = a[idx] + b[idx];
}
int main() {
int n = 1 << 20;
float *a, *b, *c; // device pointers
CHECK_HIP(hipMalloc(&a, n*sizeof(float)));
CHECK_HIP(hipMalloc(&b, n*sizeof(float)));
CHECK_HIP(hipMalloc(&c, n*sizeof(float)));
// ... copy data, launch kernel, copy back
}
This is trivial, but an LLM is just billions of such operations.
3. Matrix Multiplication (GEMM) – The Heart of LLMs
Every transformer block is 90% matrix multiplications: Q·Kᵀ, attention·V, FFN projections.
We'll implement a naïve matmul and then tiled with shared memory.
3.1 Naïve matmul kernel
__global__ void matmul_naive(const float* A, const float* B, float* C, int M, int N, int K) {
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if(row < M && col < N) {
float sum = 0.0f;
for(int k = 0; k < K; ++k)
sum += A[row * K + k] * B[k * N + col];
C[row * N + col] = sum;
}
}
Launch configuration:
dim3 block(16,16);
dim3 grid((N+15)/16, (M+15)/16);
matmul_naive<<<grid, block>>>(d_A, d_B, d_C, M, N, K);
This is terribly slow for large matrices (1000+). We need tiling.
3.2 Tiled matmul with shared memory
#define TILE_SIZE 16
__global__ void matmul_tiled(const float* A, const float* B, float* C, int M, int N, int K) {
__shared__ float As[TILE_SIZE][TILE_SIZE];
__shared__ float Bs[TILE_SIZE][TILE_SIZE];
int bx = blockIdx.x, by = blockIdx.y;
int tx = threadIdx.x, ty = threadIdx.y;
int row = by * TILE_SIZE + ty;
int col = bx * TILE_SIZE + tx;
float sum = 0.0f;
for(int k_tile = 0; k_tile < (K + TILE_SIZE - 1) / TILE_SIZE; ++k_tile) {
// Load A and B tiles into shared memory with bounds checking
if(row < M && (k_tile * TILE_SIZE + tx) < K)
As[ty][tx] = A[row * K + k_tile * TILE_SIZE + tx];
else As[ty][tx] = 0.0f;
if(col < N && (k_tile * TILE_SIZE + ty) < K)
Bs[ty][tx] = B[(k_tile * TILE_SIZE + ty) * N + col];
else Bs[ty][tx] = 0.0f;
__syncthreads();
for(int k = 0; k < TILE_SIZE; ++k)
sum += As[ty][k] * Bs[k][tx];
__syncthreads();
}
if(row < M && col < N)
C[row * N + col] = sum;
}
This tiled kernel is ~10x faster than naïve for large matrices. For LLMs, we use highly optimized libraries like cuBLAS/rocBLAS, but writing your own teaches you memory hierarchy.
4. Building a Single Transformer Block on GPU
A transformer block consists of:
- Multi-head self-attention (softmax(QKᵀ/√d) V)
- Feed-forward network (two linear layers with ReLU)
- LayerNorm + residual connections
We'll implement a simplified version in mixed HIP/C++ that can be called from Python via pybind11 or directly used in a C++ training loop.
4.1 Attention kernel (simplified, single head)
We need:
- Q, K, V matrices of shape [seq_len, d_head]
- Output = softmax(Q·Kᵀ/√d) · V
Let's write a kernel that computes the attention output for one head.
__global__ void attention_kernel(const float* Q, const float* K, const float* V,
float* O, int seq_len, int d_head) {
extern __shared__ float shared_mem[]; // dynamic shared memory for partial softmax
int row = threadIdx.y + blockIdx.y * blockDim.y; // query position
int col = threadIdx.x + blockIdx.x * blockDim.x; // output feature
if(row >= seq_len || col >= d_head) return;
// Step 1: compute scores for this query row: scores[j] = Q[row,:] dot K[j,:] / sqrt(d_head)
// We'll compute in registers and store in shared memory for softmax.
// For simplicity, we compute the whole row of scores sequentially.
float* scores = shared_mem; // size = seq_len floats
float scale = rsqrtf((float)d_head);
float max_score = -INFINITY;
for(int j = 0; j < seq_len; ++j) {
float dot = 0.0f;
for(int k = 0; k < d_head; ++k) {
dot += Q[row * d_head + k] * K[j * d_head + k];
}
scores[j] = dot * scale;
if(scores[j] > max_score) max_score = scores[j];
}
// Step 2: exp and sum (numerically stable)
float sum_exp = 0.0f;
for(int j = 0; j < seq_len; ++j) {
scores[j] = expf(scores[j] - max_score);
sum_exp += scores[j];
}
// Step 3: weighted sum of V
float out_val = 0.0f;
for(int j = 0; j < seq_len; ++j) {
float attn = scores[j] / sum_exp;
out_val += attn * V[j * d_head + col];
}
O[row * d_head + col] = out_val;
}
This is a single-head, no masking version. Real LLMs have multiple heads, causal masking, and flash attention. But this kernel already teaches you the entire attention mechanism on GPU.
Launch with:
dim3 threads(16, 16); // each block handles 16x16 output elements
dim3 blocks((d_head+15)/16, (seq_len+15)/16);
size_t smem = seq_len * sizeof(float); // shared memory for scores
attention_kernel<<<blocks, threads, smem>>>(d_Q, d_K, d_V, d_O, seq_len, d_head);
5. Multi-Head Attention with Batching
In reality, we have batch_size, num_heads, seq_len, d_head. The total dimension d_model = num_heads * d_head.
We can reshape Q,K,V to [batch, num_heads, seq_len, d_head] and launch the above kernel per head. Or better, use cuBLAS/rocBLAS for matmuls and custom softmax kernels.
Here's a batched multi-head attention using rocBLAS (identical to cuBLAS):
#include <hipblas/hipblas.h>
hipblasHandle_t handle;
hipblasCreate(&handle);
// Assume Q, K, V are [batch, seq_len, num_heads, d_head] in row-major.
// We'll use hipblasGemmStridedBatched.
float alpha = 1.0f / sqrtf(d_head), beta = 0.0f;
hipblasSgemmStridedBatched(handle, HIPBLAS_OP_T, HIPBLAS_OP_N,
seq_len, seq_len, d_head,
&alpha, K, d_head, seq_len*d_head,
Q, d_head, seq_len*d_head,
&beta, scores, seq_len, seq_len*seq_len,
batch_size * num_heads);
// Then softmax (custom kernel), then gemm with V.
This is much faster than custom kernels because it uses tensor cores/matrix cores.
6. The Forward Pass of a Transformer Layer (Full Code)
We'll now combine all pieces into a single HIP/C++ function that runs a transformer block. This is the core of training an LLM.
void transformer_block_forward(
float* d_X, // input [batch, seq_len, d_model]
float* d_Wq, float* d_Wk, float* d_Wv, float* d_Wo, // weights
float* d_ffn1, float* d_ffn2,
float* d_Y, // output
int B, int S, int D, int H, int D_head) // dimensions
{
// D = H * D_head
// Q = X * Wq etc.
float alpha = 1.0f, beta = 0.0f;
// Gemm for Q, K, V
hipblasSgemm(handle, HIPBLAS_OP_N, HIPBLAS_OP_N,
S, D, D, &alpha, d_X, S, d_Wq, D, &beta, d_Q, S);
// similarly for K, V
// ... then reshape to [B, H, S, D_head] (using kernel or view)
// Attention: compute scores = Q*K^T / sqrt(D_head)
// Use batched gemm as shown above.
// Softmax (custom kernel)
softmax_kernel<<<...>>>(d_scores, B*H*S, S);
// Output = scores * V
hipblasSgemmStridedBatched(...);
// Then Wo projection
// Residual add
// LayerNorm
// FFN: two linear layers with ReLU
}
This is the exact blueprint of a transformer block on GPU.
Do you have any question? Leave them here as comments and we will respond to them.
See ya on Part 2 of this interesting blog! Have a nice time!
Top comments (0)