CUDA Deep Dive: Demystifying Kernels, Thread Hierarchies, and the GPU Execution Model (Part 1)
Welcome back! In our last discussion, we scratched the surface of CUDA programming, looking at how it extends C to harness the power of GPUs. Now, let's take a more technical plunge. We're still drawing from "Programming Massively Parallel Processors" (focusing on Chapter 3 concepts), but this time we'll dissect the execution model, memory implications, and the intricate dance of threads with more precision.
The Dichotomy: Host (CPU) vs. Device (GPU) Architectures
At its core, CUDA programming acknowledges two distinct computational domains:
- The Host: Your system's CPU, managing system resources, peripherals, and orchestrating the overall application flow. It operates with its own dedicated DRAM (system memory).
- The Device: The NVIDIA GPU, a massively parallel processor with its own high-bandwidth memory (GPU device memory, often GDDR). It consists of multiple Streaming Multiprocessors (SMs), each containing numerous CUDA cores.
The art of CUDA programming lies in efficiently partitioning tasks and managing data transfers between these two domains. Data must be explicitly moved from host memory to device memory for GPU processing, and results moved back. These transfers, typically over the PCIe bus, can be a significant performance bottleneck if not managed carefully.
CUDA C Function Specifiers: Defining Execution Space
CUDA C extends standard C with keywords to specify where functions are compiled for and where they execute.
- __host__ Functions:
  - Standard C functions, compiled for and executed on the host (CPU).
  - Can only be called from other __host__ functions or from the global scope.
  - This is the default if no CUDA-specific keyword is present, which simplifies porting legacy C/C++ code.
- __device__ Functions:
  - Compiled for and executed on the device (GPU).
  - Can only be called from __global__ functions (kernels) or other __device__ functions.
  - They are often inlined by the NVCC compiler to avoid function call overhead on the GPU, which can be substantial compared to CPU function calls.
  - Crucial limitations:
    - No recursion: hardware stack support for deep recursion is limited.
    - No traditional static variables: a single static variable shared across all threads, as C expects, doesn't map directly to the GPU. There are ways to achieve shared state (e.g., shared memory or device-wide variables), but static inside a __device__ function behaves differently (each thread might get its own instance, or it may be disallowed depending on context and compiler version).
    - No indirect function calls through function pointers (in older compute capabilities): this restriction has been relaxed in newer compute capabilities (e.g., dynamic parallelism, separate compilation), but for a foundational understanding, assume it is a constraint. Direct calls are the norm.
- __global__ Functions (Kernels):
  - These are the entry points for GPU computation, launched from the host.
  - Executed on the device (GPU).
  - Must have a void return type.
  - A call to a __global__ function from the host is asynchronous by default: the CPU initiates the kernel launch and can continue executing other host code without waiting for the kernel to complete. Synchronization points such as cudaDeviceSynchronize() or blocking memory copies are needed to wait for completion.
  - When a kernel is launched, its execution is configured by specifying the grid dimensions and block dimensions.
- __host__ __device__ Functions:
  - A powerful directive telling NVCC to compile two versions of the function: one for the host and one for the device.
  - Allows code reuse when the same logic applies in both execution spaces.
  - Useful for common utility functions (e.g., math operations, data transformations) that don't rely on execution-space-specific features.
// Can be called from host code or device code
__host__ __device__ int clamp_value(int val, int min_val, int max_val) {
    if (val < min_val) return min_val;
    if (val > max_val) return max_val;
    return val;
}
The GPU Execution Model: Grids, Blocks, Warps, and Threads
When a __global__ kernel is launched, it executes as a grid of thread blocks. This hierarchical structure is fundamental to how CUDA maps parallelism to the GPU hardware.
- Threads: The most basic unit of execution. Each thread executes the kernel code. Threads are extremely lightweight.
  - Identified within their block by threadIdx (a uint3 variable: threadIdx.x, threadIdx.y, threadIdx.z).
- Warps: Threads are grouped by the hardware into warps. A warp consists of a fixed number of threads (typically 32 in current NVIDIA architectures). Threads within a warp execute in SIMT (Single Instruction, Multiple Thread) fashion.
  - SIMT: All threads in a warp execute the same instruction at the same time. If threads in a warp diverge due to conditional branching (e.g., an if-else statement where some threads take one path and others take another), the warp serially executes each branch path, disabling the threads that are not on that path. This thread divergence can significantly impact performance and should be minimized (see the first sketch after this list).
- Thread Blocks (Cooperative Thread Arrays, CTAs): A group of threads organized in 1D, 2D, or 3D, up to a device-specific maximum (e.g., 1024 threads per block).
  - Threads within the same block can cooperate by:
    - Sharing data via __shared__ memory: a low-latency, on-chip memory space private to that block and visible to all threads within it. Data in shared memory persists for the lifetime of the block.
    - Synchronizing execution using __syncthreads(): a barrier synchronization primitive. When a thread reaches __syncthreads(), it waits until all other threads in its block have also reached that point before any thread proceeds. This is crucial for coordinating access to shared memory (e.g., ensuring all reads happen before any writes, or vice versa); see the second sketch after this list.
  - A block is scheduled by the CUDA runtime to execute on a single Streaming Multiprocessor (SM). Once scheduled on an SM, a block runs to completion on that SM (though its warps may be interleaved with warps from other blocks on the same SM). An SM can often execute multiple blocks concurrently if it has sufficient resources (registers, shared memory).
  - Identified within the grid by blockIdx (a uint3 variable: blockIdx.x, blockIdx.y, blockIdx.z).
  - The dimensions of a block are available within the kernel via blockDim (a dim3 variable: blockDim.x, blockDim.y, blockDim.z).
- Grid: Composed of all thread blocks launched for a given kernel call. Can be 1D, 2D, or 3D.
  - Blocks within a grid execute independently and generally cannot synchronize directly with each other, except through global memory operations (which can be slow and require careful handling, often with atomic operations) or by terminating the kernel and launching a new one.
  - The dimensions of the grid (in blocks) are available within the kernel via gridDim (a dim3 variable: gridDim.x, gridDim.y, gridDim.z).
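First, to make the divergence point concrete, here is a minimal sketch (not from the book; the kernel name and the branch condition are purely illustrative) of a branch that splits every warp, with a comment on how warp-aligned conditions avoid the problem:
__global__ void divergentKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Threads in the same warp take different paths here: the warp executes the
    // 'if' path and then the 'else' path serially, masking off inactive threads.
    if (threadIdx.x % 2 == 0) {
        data[i] *= 2.0f;
    } else {
        data[i] += 1.0f;
    }
    // Branching on a warp-aligned condition instead (e.g., threadIdx.x / warpSize)
    // avoids divergence, because all threads in a warp take the same path.
}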
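Second, here is a minimal sketch of block-level cooperation through __shared__ memory and __syncthreads(): each block reduces its chunk of the input to one partial sum. It assumes blockDim.x is a power of two and that the kernel is launched with blockDim.x * sizeof(float) bytes of dynamic shared memory; the names (blockSumKernel, partial_sums) are illustrative, not from the book.
__global__ void blockSumKernel(const float *in, float *partial_sums, int n) {
    extern __shared__ float sdata[];          // dynamically sized shared memory

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread stages one element (or 0 if out of range) into shared memory.
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                          // all stores complete before any reads

    // Tree reduction within the block; every step is separated by a barrier.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            sdata[tid] += sdata[tid + stride];
        }
        __syncthreads();
    }

    // Thread 0 writes this block's partial result to global memory.
    if (tid == 0) {
        partial_sums[blockIdx.x] = sdata[0];
    }
}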
Computing a Global Thread ID
Since each thread needs to work on a unique piece of data, it's essential to calculate a global index. For a 1D grid of 1D blocks:
int globalThreadId_x = blockIdx.x * blockDim.x + threadIdx.x;
For a 2D grid of 2D blocks, computing a unique 2D global index (gx, gy):
int gx = blockIdx.x * blockDim.x + threadIdx.x;
int gy = blockIdx.y * blockDim.y + threadIdx.y;
This global ID is then used to access elements in global memory arrays.
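As a quick illustration (a sketch, not from the book; vecAddKernel and the pointer names are made up), here is how the 1D global index typically drives per-element work, with a boundary check for sizes that are not a multiple of the block size:
__global__ void vecAddKernel(const float *A_d, const float *B_d, float *C_d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
    if (i < n) {                                    // guard the extra threads in the last block
        C_d[i] = A_d[i] + B_d[i];
    }
}

// Host-side launch: round the number of blocks up so every element is covered.
int threadsPerBlock = 256;
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
vecAddKernel<<<blocksPerGrid, threadsPerBlock>>>(A_d, B_d, C_d, n);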
Matrix Multiplication Revisited (with a Glimpse of Multiple Blocks)
The book's Figure 3.11 presents a kernel that calculates one element of the product matrix P = M * N per thread, using only threadIdx.
P_ij = Σ_k (M_ik * N_kj)
If M is height_M x width_M and N is width_M x width_N, then P is height_M x width_N.
A kernel to compute P might look like this (assuming a 2D grid of 2D blocks with enough blocks to cover the entire P matrix):
__global__ void matrixMulKernel(float *P_d, const float *M_d, const float *N_d,
                                int P_height, int P_width, int M_width) {
    // Global row index (for P and M)
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    // Global column index (for P and N)
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // Boundary check: ensure the thread is within the matrix dimensions
    if (row < P_height && col < P_width) {
        float p_value = 0.0f;
        // Dot product for P[row][col], with M_d and N_d stored row-major:
        //   M[row][k] -> M_d[row * M_width + k]
        //   N[k][col] -> N_d[k * P_width + col]   (N's width equals P's width)
        for (int k = 0; k < M_width; ++k) {
            p_value += M_d[row * M_width + k] * N_d[k * P_width + col];
        }
        P_d[row * P_width + col] = p_value;  // P[row][col]
    }
}
This is a more complete version. The kernel in Figure 3.11 is simplified for pedagogical reasons by assuming a single block, so implicitly blockIdx.x = 0 and blockIdx.y = 0, with row = threadIdx.x and col = threadIdx.y (the book maps tx to the row and ty to the column from threadIdx.x and threadIdx.y, respectively). Chapter 4 will elaborate on multi-block implementations and tiling for performance.
Kernel Launch Configuration: <<<...>>>
The host launches a kernel using the triple-chevron syntax:
kernelName<<< Dg, Db, Ns, S >>>(argument_list);
- Dg: a dim3 value specifying the dimensions of the grid (number of blocks in x, y, z), for Dg.x * Dg.y * Dg.z total blocks.
- Db: a dim3 value specifying the dimensions of each thread block (number of threads in x, y, z), for Db.x * Db.y * Db.z threads per block. The total number of threads per block cannot exceed a device-specific limit (e.g., 1024).
- Ns (optional): a size_t value specifying the bytes of dynamically allocated __shared__ memory per block, in addition to statically allocated shared memory. Defaults to 0.
- S (optional): a cudaStream_t value specifying the CUDA stream the kernel is launched into. Streams allow managing the concurrency of multiple operations. Defaults to stream 0 (the default stream).
Example from Figure 3.14 (note that this launch uses the book's simplified kernel signature with a single WIDTH argument, not the multi-block kernel defined above):
dim3 dimGrid(1, 1); // Only one block in the grid (for the simplified example)
dim3 dimBlock(16, 16); // Each block has 16x16 = 256 threads
matrixMulKernel<<<dimGrid, dimBlock>>>(d_Pd, d_Md, d_Nd, WIDTH);
Here, dimGrid becomes gridDim inside the kernel, and dimBlock becomes blockDim.
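For the multi-block matrixMulKernel shown earlier, a launch configuration might look like the following sketch (BLOCK_SIZE and the surrounding variables are illustrative; the grid is rounded up so every element of P gets a thread):
// Assumes P_height, P_width, M_width and the device pointers P_d, M_d, N_d already exist.
const int BLOCK_SIZE = 16;
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);                    // 16x16 = 256 threads per block
dim3 dimGrid((P_width + BLOCK_SIZE - 1) / BLOCK_SIZE,     // enough blocks in x to cover P's columns
             (P_height + BLOCK_SIZE - 1) / BLOCK_SIZE);   // enough blocks in y to cover P's rows
matrixMulKernel<<<dimGrid, dimBlock>>>(P_d, M_d, N_d, P_height, P_width, M_width);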
Essential Runtime API Functions
The CUDA runtime API provides functions for managing the GPU:
- cudaMalloc(void **devPtr, size_t size): Allocates size bytes of linear global memory on the device and returns a pointer to it in *devPtr.
- cudaMemcpy(void *dst, const void *src, size_t count, cudaMemcpyKind kind): Copies count bytes of data. The kind argument specifies the direction:
  - cudaMemcpyHostToDevice: host to device
  - cudaMemcpyDeviceToHost: device to host
  - cudaMemcpyDeviceToDevice: device to device
  - cudaMemcpyHostToHost: host to host (usually less efficient than standard memcpy)
  - Note: cudaMemcpy operations are generally blocking/synchronous with respect to the host, unless they are part of asynchronous operations involving streams.
- cudaFree(void *devPtr): Frees memory allocated with cudaMalloc.
- cudaDeviceSynchronize(): Blocks host execution until all previously issued CUDA calls (kernels, memory copies in any stream) on the current device have completed. Essential for ensuring results are ready or for accurate timing.
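Putting these together, a typical host-side flow looks like this minimal sketch (error checking omitted; the kernel launch is left as a comment and the buffer names are illustrative):
#include <cuda_runtime.h>
#include <stdlib.h>

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes);                     // host (CPU) buffer
    // ... fill h_data with input values ...

    float *d_data = NULL;
    cudaMalloc((void **)&d_data, bytes);                        // allocate device global memory
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // host -> device transfer

    // someKernel<<<gridDim, blockDim>>>(d_data, n);            // asynchronous kernel launch

    cudaDeviceSynchronize();                                    // wait for all device work to finish
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // device -> host transfer

    cudaFree(d_data);                                           // release device memory
    free(h_data);
    return 0;
}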
Deeper Implications and Performance Considerations
- Occupancy: The ratio of active warps on an SM to the maximum number of warps that SM can support. Higher occupancy can help hide memory latency, as the SM can switch to other ready warps when one warp is stalled (e.g., waiting for global memory access). Block dimensions, register usage per thread, and shared memory usage per block heavily influence occupancy.
- Memory Coalescing: When threads in a warp access global memory, and their accesses fall into a contiguous, aligned segment, the hardware can "coalesce" them into a single (or a few) memory transaction(s), which is much more efficient than scattered accesses. Designing data layouts and access patterns for coalescing is critical for good performance; a short sketch contrasting the two patterns follows this list.
- Shared Memory Banking: Shared memory is divided into banks. Concurrent accesses by threads in a warp to different banks can proceed in parallel. Accesses to the same bank (bank conflicts) are serialized, reducing effective bandwidth. Understanding and avoiding bank conflicts is key when using shared memory heavily.
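Here is that coalescing sketch (illustrative names, not from the book): two kernels with the same arithmetic but very different global-memory access patterns.
__global__ void coalescedRead(const float *data, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Consecutive threads read consecutive addresses, so a warp's 32 accesses
    // fall into one contiguous, aligned segment and can be coalesced.
    if (i < n) out[i] = data[i];
}

__global__ void stridedRead(const float *data, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Consecutive threads read addresses 'stride' elements apart, so a warp's
    // accesses are scattered across many segments and cannot be coalesced.
    if (i * stride < n) out[i] = data[i * stride];
}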
Advanced Built-in Variables (Brief Mention)
Beyond threadIdx, blockIdx, blockDim, and gridDim, CUDA provides other built-in variables, such as:
- warpSize: An integer (typically 32) indicating the number of threads in a warp. This allows for warp-level programming idioms.
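As one example of such an idiom, here is a sketch (beyond what Chapter 3 covers) of a warp-level sum using warpSize and the __shfl_down_sync intrinsic available in modern CUDA versions:
// Sums 'val' across the threads of a full warp; lane 0 ends up with the total.
__device__ float warpReduceSum(float val) {
    // Each step folds the upper half of the active lanes onto the lower half.
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xffffffffu, val, offset);
    }
    return val;
}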