Javad

Advanced GPU Optimization: CUDA & HIP from zero to hero

Welcome, fellow developer! If you’ve ever wanted to tap into the immense power of graphics processors for general-purpose computing, you’re in the right place. GPU programming used to be an NVIDIA-only game with CUDA, but the rise of AMD GPUs in workstations, servers, and supercomputers has made vendor lock‑in a thing of the past. Enter HIP – a C++ runtime API that lets you write code that runs on both NVIDIA and AMD GPUs with minimal changes.

In this comprehensive guide, we’ll start from the very basics (no prior GPU experience assumed) and work our way up to advanced topics like streams, shared memory optimizations, and multi‑GPU programming. Along the way, you’ll learn CUDA inside out and then discover how HIP gives you portability without sacrificing performance. By the end, you’ll be comfortable writing efficient, portable GPU kernels and ready to tackle real‑world parallel problems.

Let’s dive in!


Part 1: CUDA Fundamentals

1.1 Setting Up CUDA

Before we write any code, we need a working CUDA development environment.

  1. Check your hardware: You need an NVIDIA GPU supported by your toolkit version (current CUDA 12.x releases require Compute Capability 5.0 or higher). Run lspci | grep -i nvidia on Linux or check Device Manager on Windows.
  2. Install the CUDA Toolkit:
    • Download the appropriate installer from the NVIDIA CUDA Toolkit download page.
    • Follow the installation guide for your OS.
  3. Verify installation:
   nvcc --version

You should see the CUDA version. Also run nvidia-smi to check driver and GPU status.

  4. Set environment variables (if not done automatically):
   export PATH=/usr/local/cuda/bin:$PATH
   export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

Now you’re ready to compile CUDA code with nvcc.

1.2 The CUDA Programming Model

A GPU contains hundreds or thousands of cores. To use them effectively, we organize work into a hierarchy:

  • Thread: The smallest unit of execution. Each thread runs the same kernel function but works on different data.
  • Block: A group of threads that can cooperate via shared memory and synchronize with __syncthreads(). Threads in the same block run on the same streaming multiprocessor (SM).
  • Grid: A collection of blocks that together cover the entire problem.

When you launch a kernel, you specify the grid and block dimensions (in 1D, 2D, or 3D). Each thread gets unique indices:

  • threadIdx.x, threadIdx.y, threadIdx.z – position inside its block.
  • blockIdx.x, blockIdx.y, blockIdx.z – position of the block in the grid.
  • blockDim.x, blockDim.y, blockDim.z – size of each block.
  • gridDim.x, gridDim.y, gridDim.z – size of the grid.

The typical formula to map a thread to a data element is:

int i = blockIdx.x * blockDim.x + threadIdx.x;

1.3 Memory Hierarchy

  • Global memory: Large (several GB), high latency, accessible by all threads. It persists across kernel launches.
  • Shared memory: Small (tens of KB per block), low latency, shared among threads in the same block. Programmer‑managed cache.
  • Registers: Private to each thread, fastest.
  • Constant memory: Read‑only, cached, good for broadcast data.
  • Texture memory: Specialized for 2D spatial locality.

Understanding this hierarchy is key to writing fast kernels.
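
To make these spaces concrete, here is a small illustrative sketch (not part of the vector-addition example that follows) showing how shared and constant memory are declared; scale is a hypothetical constant-memory value:

// Constant memory: declared at file scope, read-only from device code.
// Set from the host with cudaMemcpyToSymbol(scale, &value, sizeof(float)).
__constant__ float scale;

__global__ void memorySpacesDemo(const float *in, float *out, int n) {
    // Shared memory: one buffer per block, visible to all threads of that block
    __shared__ float tile[256];          // assumes blockDim.x == 256

    // Registers: ordinary local variables usually live in registers
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) {
        tile[threadIdx.x] = in[i];               // global -> shared
        __syncthreads();
        out[i] = tile[threadIdx.x] * scale;      // shared * constant -> global
    }
}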

1.4 Your First CUDA Kernel: Vector Addition

Let’s implement a simple vector addition. This will introduce all the essential steps.

#include <iostream>
#include <cuda_runtime.h>

// Kernel definition
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // Guard against extra threads
        c[i] = a[i] + b[i];
    }
}

int main() {
    int n = 1 << 20;  // 1,048,576 elements
    size_t bytes = n * sizeof(float);

    // Allocate host memory
    float *h_a = new float[n];
    float *h_b = new float[n];
    float *h_c = new float[n];

    // Initialize input arrays
    for (int i = 0; i < n; i++) {
        h_a[i] = i * 1.0f;
        h_b[i] = (i + 1) * 1.0f;
    }

    // Allocate device memory
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // Copy data from host to device
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);

    // Copy result back to host
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    // Verify (simple check)
    for (int i = 0; i < n; i++) {
        if (h_c[i] != h_a[i] + h_b[i]) {
            std::cout << "Mismatch at index " << i << "\n";
            break;
        }
    }

    // Cleanup
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    delete[] h_a;
    delete[] h_b;
    delete[] h_c;

    return 0;
}

Explanation:

  • __global__ marks a function as a kernel – callable from host and executed on device.
  • We compute a global index i and, if within bounds, perform the addition.
  • <<<blocksPerGrid, threadsPerBlock>>> sets the execution configuration.
  • cudaMalloc allocates memory on the GPU.
  • cudaMemcpy transfers data between host and device (the last argument specifies direction).
  • Always free device memory with cudaFree to avoid leaks.

Compile with:

nvcc -o vecAdd vecAdd.cu
./vecAdd

If everything works, you should see no output (or a success message if you add one).

1.5 Understanding Kernel Launch Parameters

The launch configuration <<<grid, block, sharedMem, stream>>> has two required parameters and two optional:

  • grid (dim3): number of blocks per grid. Can be 1D, 2D, or 3D. E.g., dim3 grid(16, 16) creates a 16x16 grid (256 blocks).
  • block (dim3): number of threads per block. Usually 1D, but can be 2D/3D. Total threads per block ≤ 1024 (for modern GPUs).
  • sharedMem (size_t): dynamic shared memory per block (default 0).
  • stream (cudaStream_t): specifies which stream to enqueue the kernel in (default 0, the null stream).

In our example we used 1D grids and blocks. For 2D data (like images), you might use 2D blocks:

dim3 block(16, 16);
dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
kernel<<<grid, block>>>(...);

Inside the kernel, you would then compute:

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int idx = y * width + x;

1.6 Memory Management in Depth

  • cudaMalloc(void **ptr, size_t size) – allocates linear memory on device.
  • cudaFree(void *ptr) – frees device memory.
  • cudaMemcpy(void *dst, const void *src, size_t count, cudaMemcpyKind kind) – synchronously copies memory.
  • cudaMemcpyAsync – asynchronous version; the host buffer should be pinned, otherwise the copy is staged and effectively synchronous.
  • cudaMallocHost – allocates page‑locked (pinned) host memory, which enables faster transfers and asynchronous copies.
  • cudaMemset – sets device memory to a value.

Pinned memory example:

float *h_a;
cudaMallocHost(&h_a, bytes);   // pinned
cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, stream);

Use pinned memory when you perform many transfers or want to overlap transfers with computation.

1.7 Error Handling

CUDA functions return an error code of type cudaError_t. You should always check them:

cudaError_t err = cudaMalloc(&d_a, bytes);
if (err != cudaSuccess) {
    std::cerr << "cudaMalloc failed: " << cudaGetErrorString(err) << std::endl;
    exit(EXIT_FAILURE);
}

For kernel launches, which are asynchronous, you can check for launch errors with cudaGetLastError() and wait for completion with cudaDeviceSynchronize() to catch execution errors:

kernel<<<grid, block>>>(args);
err = cudaGetLastError();   // catch launch errors
if (err != cudaSuccess) { ... }
cudaDeviceSynchronize();    // wait and catch execution errors
err = cudaGetLastError();
if (err != cudaSuccess) { ... }

Better yet, wrap calls in a macro:

#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = call; \
        if (err != cudaSuccess) { \
            std::cerr << "CUDA error in " << __FILE__ << ":" << __LINE__ << ": " \
                      << cudaGetErrorString(err) << std::endl; \
            exit(EXIT_FAILURE); \
        } \
    } while(0)

// Usage:
CUDA_CHECK(cudaMalloc(&d_a, bytes));

1.8 Profiling and Debugging Basics

  • Profiling: NVIDIA provides nvprof (deprecated, and unsupported on recent architectures) and the newer Nsight Systems / Nsight Compute. To profile a simple app:
  nvprof ./vecAdd

This shows kernel execution time, memory transfer times, etc. For deeper analysis, use Nsight Compute: ncu ./vecAdd.

  • Debugging: Use cuda-gdb (Linux) or the Visual Studio debugger on Windows. Compile with -g -G to generate debug info:
  nvcc -g -G -o vecAdd vecAdd.cu
  cuda-gdb ./vecAdd

Inside cuda-gdb, you can set breakpoints, inspect threads, and step through device code.

  • Memory error checking: compute-sanitizer (formerly cuda-memcheck) helps detect out‑of‑bounds accesses, race conditions, etc.
  compute-sanitizer ./vecAdd

1.9 Practice: Matrix Multiplication

Let's solidify our understanding with a classic: matrix multiplication C = A * B (N x N). We'll start with a naive kernel.

Naive implementation (each thread computes one element of C):

__global__ void matMulNaive(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++) {
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}

Launch configuration (assuming 16x16 blocks):

dim3 block(16, 16);
dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
matMulNaive<<<grid, block>>>(d_A, d_B, d_C, N);

Performance issues: Each thread reads an entire row of A and an entire column of B straight from global memory, so every element is fetched N times and there is no data reuse. (The accesses themselves are largely coalesced – consecutive threads read consecutive elements of B and broadcast the same element of A – so the real cost is the redundant traffic.) We'll fix this later with shared memory.


Part 2: Intermediate CUDA

Now that you're comfortable with basic CUDA, let's explore techniques to write faster kernels.

2.1 Performance Considerations

Occupancy: The ratio of active warps to maximum possible warps on an SM. Higher occupancy can hide latency but isn't always the goal. Use the CUDA Occupancy Calculator or the cudaOccupancyMaxPotentialBlockSize function to tune block size.
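
As a quick illustration, cudaOccupancyMaxPotentialBlockSize suggests a block size for a given kernel; a minimal sketch using the vecAdd kernel and buffers from Section 1.4:

int minGridSize = 0, blockSize = 0;
// Ask the runtime for the block size that maximizes occupancy for vecAdd
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, vecAdd, 0, 0);

int gridSize = (n + blockSize - 1) / blockSize;
vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);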

Warp divergence: Threads in a warp execute in lockstep. If they take different paths (e.g., an if‑else), the warp serially executes both paths, reducing performance. Try to keep control flow uniform within warps.

Coalesced memory access: Global memory bandwidth is best utilized when consecutive threads in a warp access consecutive memory locations. In our vector addition, thread i accesses a[i] – perfectly coalesced. The naive matrix multiplication is also mostly coalesced (threads in a warp read consecutive elements of B and broadcast elements of A); its real weakness is the redundant global traffic discussed above, which shared-memory tiling removes.

2.2 Shared Memory and Bank Conflicts

Shared memory is divided into banks (32 banks of 4-byte words on current GPUs). If multiple threads in a warp access different words that map to the same bank, the accesses are serialized – a bank conflict; if they all read the same word, it is broadcast at no extra cost. To avoid conflicts, ensure that threads within a warp touch different banks, often by padding shared arrays. Keep this in mind when designing shared memory layouts; a common padding trick is sketched below.
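
A shared-memory matrix transpose is the classic case: without padding, a warp reading a column of a 32x32 float tile hits the same bank 32 times. A sketch of the padding trick (illustrative only, assuming a 32x32 thread block and an n x n matrix):

#define TILE 32

__global__ void transposePadded(const float *in, float *out, int n) {
    // The extra +1 column staggers rows across banks, avoiding
    // 32-way conflicts when a warp reads a column of the tile.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;   // transposed coordinates
    int ty = blockIdx.x * TILE + threadIdx.y;
    if (tx < n && ty < n)
        out[ty * n + tx] = tile[threadIdx.x][threadIdx.y];   // column read, now conflict-free
}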

2.3 Tiled Matrix Multiplication with Shared Memory

Let's rewrite matrix multiplication to use shared memory. The idea: load tiles of A and B into shared memory, then compute partial sums.

#define TILE_SIZE 16

__global__ void matMulShared(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE_SIZE][TILE_SIZE];
    __shared__ float Bs[TILE_SIZE][TILE_SIZE];

    int bx = blockIdx.x, by = blockIdx.y;
    int tx = threadIdx.x, ty = threadIdx.y;

    // Identify the row and column of the C element to compute
    int row = by * TILE_SIZE + ty;
    int col = bx * TILE_SIZE + tx;

    float sum = 0.0f;

    // Loop over all tiles (assumes N is a multiple of TILE_SIZE)
    for (int tile = 0; tile < N / TILE_SIZE; ++tile) {
        // Load one tile of A and B into shared memory
        As[ty][tx] = A[row * N + (tile * TILE_SIZE + tx)];
        Bs[ty][tx] = B[(tile * TILE_SIZE + ty) * N + col];
        __syncthreads();  // Ensure tile is loaded before computation

        // Compute partial product for this tile
        for (int k = 0; k < TILE_SIZE; ++k) {
            sum += As[ty][k] * Bs[k][tx];
        }
        __syncthreads();  // Avoid overwriting shared memory before next tile
    }

    if (row < N && col < N) {
        C[row * N + col] = sum;
    }
}

What changed?

  • Each block loads a tile of A and a tile of B into shared memory.
  • All threads in the block cooperate to load the tiles (coalesced because threads in a warp load consecutive elements from global memory).
  • Computation uses the fast shared memory.
  • __syncthreads() ensures that all threads have finished loading before any thread starts using the data, and that no thread starts the next tile before all have finished the current tile.

For large matrices, this tiled kernel is typically several times faster than the naive version. Note that, as written, it assumes N is a multiple of TILE_SIZE; handling arbitrary sizes requires bounds checks on the tile loads.

2.4 Atomic Operations

When multiple threads need to update the same location (e.g., histogram, reduction), you can use atomic operations to avoid race conditions.

__global__ void histogram(const unsigned char *data, unsigned int *bins, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(&bins[data[i]], 1);
    }
}

CUDA provides atomicAdd, atomicSub, atomicExch, atomicMin, atomicMax, atomicAnd, atomicOr, atomicXor for various data types.

Performance: Atomics can be slow if many threads contend for the same location. Use shared memory to reduce contention when possible (e.g., each block computes its own partial histogram, then combine with atomics).
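
Here is a sketch of that privatization idea for the histogram above (assuming 256 bins for 8-bit data):

#define NUM_BINS 256

__global__ void histogramPrivate(const unsigned char *data, unsigned int *bins, int n) {
    // Per-block partial histogram in shared memory
    __shared__ unsigned int localBins[NUM_BINS];

    // Cooperatively zero the shared bins
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        localBins[b] = 0;
    __syncthreads();

    // Accumulate into the block-local histogram using cheap shared-memory atomics
    // (grid-stride loop so any grid size covers all n elements)
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        atomicAdd(&localBins[data[i]], 1u);
    __syncthreads();

    // Merge the partial histogram into the global one
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&bins[b], localBins[b]);
}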

2.5 Streams and Concurrency

Streams allow you to overlap kernel execution with data transfers and other operations. A stream is a sequence of operations that execute in order. Operations from different streams may overlap.

Example: overlapping copy and compute:

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Split data into two halves
int half = n / 2;

// Asynchronous memory copies (requires pinned host memory)
cudaMemcpyAsync(d_a1, h_a1, half_bytes, cudaMemcpyHostToDevice, stream1);
cudaMemcpyAsync(d_a2, h_a2, half_bytes, cudaMemcpyHostToDevice, stream2);

// Launch kernels in respective streams
kernel<<<grid, block, 0, stream1>>>(d_a1, d_b1, d_c1, half);
kernel<<<grid, block, 0, stream2>>>(d_a2, d_b2, d_c2, half);

// Copy back
cudaMemcpyAsync(h_c1, d_c1, half_bytes, cudaMemcpyDeviceToHost, stream1);
cudaMemcpyAsync(h_c2, d_c2, half_bytes, cudaMemcpyDeviceToHost, stream2);

cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);

cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);

Note: Use pinned memory (cudaMallocHost) for asynchronous transfers. Also, kernels from different streams may run concurrently if resources permit.

2.6 Events for Timing

CUDA events provide a way to measure elapsed time on the device accurately.

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
kernel<<<grid, block>>>(args);
cudaEventRecord(stop);

cudaEventSynchronize(stop);  // Wait for the stop event to be recorded
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
std::cout << "Kernel time: " << milliseconds << " ms\n";

cudaEventDestroy(start);
cudaEventDestroy(stop);

2.7 Unified Memory

Unified Memory (UM) creates a single pointer accessible from both CPU and GPU. Data is automatically migrated on demand. It simplifies programming but may introduce overhead.

Allocate UM:

float *x;
cudaMallocManaged(&x, N * sizeof(float));

You can then use x on both host and device without explicit cudaMemcpy. To ensure data is ready on the device, you can prefetch:

cudaMemPrefetchAsync(x, N * sizeof(float), deviceId, 0);

UM is great for prototyping and for algorithms with irregular data access. For performance‑critical code, manual transfers are still recommended.
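
Putting it together, a minimal managed-memory version of vector addition could look like this (a sketch reusing the vecAdd kernel from Section 1.4):

int n = 1 << 20;
float *a, *b, *c;
cudaMallocManaged(&a, n * sizeof(float));
cudaMallocManaged(&b, n * sizeof(float));
cudaMallocManaged(&c, n * sizeof(float));

for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }   // host writes directly

vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
cudaDeviceSynchronize();           // wait before touching results on the host

std::cout << c[0] << "\n";         // host reads the migrated data

cudaFree(a); cudaFree(b); cudaFree(c);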

2.8 Multi‑GPU Programming

Modern systems may have multiple GPUs. You can use them to scale your application.

  • Use cudaGetDeviceCount to find the number of devices.
  • Set current device with cudaSetDevice(deviceId) before allocating memory or launching kernels.
  • Each device has its own memory space; you must manage data distribution and communication between devices (e.g., via PCIe or NVLink). Use peer‑to‑peer access if available (cudaDeviceCanAccessPeer, cudaDeviceEnablePeerAccess).

Simple multi‑GPU example:

int numDevices;
cudaGetDeviceCount(&numDevices);

#pragma omp parallel for
for (int dev = 0; dev < numDevices; ++dev) {
    cudaSetDevice(dev);
    // Allocate memory and launch kernel on this device
    // Copy data to/from host (strided or partitioned)
}
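
If two GPUs need to read each other's memory directly, the peer-access calls mentioned above are used like this (a sketch assuming devices 0 and 1 exist):

int canAccess = 0;
cudaDeviceCanAccessPeer(&canAccess, 0, 1);   // can device 0 access device 1's memory?
if (canAccess) {
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);        // second argument is a reserved flag (must be 0)
    // Kernels running on device 0 can now dereference pointers allocated on device 1,
    // and cudaMemcpyPeer / cudaMemcpyPeerAsync can copy directly between the GPUs.
}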

2.9 Advanced Example: Parallel Reduction

Reduction (summing an array) is a common pattern. We'll implement an optimized version that uses shared memory and avoids warp divergence.

__global__ void reduce(float *g_idata, float *g_odata, int n) {
    extern __shared__ float sdata[];  // Dynamic shared memory

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Load data into shared memory (with bounds check)
    sdata[tid] = (i < n) ? g_idata[i] : 0.0f;
    __syncthreads();

    // Perform reduction in shared memory
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // Write result for this block to global memory
    if (tid == 0) {
        g_odata[blockIdx.x] = sdata[0];
    }
}

Launch:

int threads = 256;
int blocks = (n + threads - 1) / threads;
reduce<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n);

After this kernel, you have one partial sum per block. You can launch the kernel again on those partial sums, or copy them back and finish the sum on the CPU.
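
For example, the per-block results can be combined on the host after a device-to-host copy (a sketch, assuming blocks partial sums were written to d_out as above):

// Copy the per-block partial sums back and finish the reduction on the CPU
float *partial = new float[blocks];
cudaMemcpy(partial, d_out, blocks * sizeof(float), cudaMemcpyDeviceToHost);

float total = 0.0f;
for (int b = 0; b < blocks; ++b) total += partial[b];

delete[] partial;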


Part 3: Introduction to HIP

3.1 What is HIP? Why Portability?

HIP (Heterogeneous‑compute Interface for Portability) is an AMD initiative to provide a C++ runtime API that can target both NVIDIA and AMD GPUs. It is essentially a thin abstraction layer: HIP code compiles to CUDA (using a CUDA backend) when targeting NVIDIA, and to AMD’s ROCm when targeting AMD. This means you can write your kernel once and run it everywhere.

Major benefits:

  • Vendor independence: No lock‑in to NVIDIA.
  • Single source: One codebase for multiple platforms.
  • Ease of porting: CUDA code can be mechanically converted using hipify tools.

3.2 Setting Up HIP

On AMD systems with ROCm:

  sudo apt update
  sudo apt install rocm-dev
  • Verify with hipcc --version.

On NVIDIA systems (HIP on CUDA):

  • Install the CUDA Toolkit first.
  • Then install the HIP packages built for the NVIDIA backend from the ROCm repositories (e.g., hip-runtime-nvidia and hip-dev).
  • After installation, hipcc will use the CUDA backend.

Environment:

  • Ensure hipcc is in your PATH.
  • For AMD, you may need to set ROCM_PATH if not default.

3.3 HIP API Overview

HIP intentionally mimics CUDA. Most CUDA functions have direct HIP equivalents with the hip prefix:

CUDA                     HIP
cudaMalloc               hipMalloc
cudaFree                 hipFree
cudaMemcpy               hipMemcpy
cudaMemcpyAsync          hipMemcpyAsync
cudaDeviceSynchronize    hipDeviceSynchronize
cudaStreamCreate         hipStreamCreate
cudaEventCreate          hipEventCreate
cudaGetLastError         hipGetLastError
cudaError_t              hipError_t
cudaSuccess              hipSuccess

The kernel language is essentially the same: __global__, __device__, __shared__, threadIdx, blockIdx, etc., are all supported.

The most visible syntactic difference is the kernel launch. HIP provides the function-like macro hipLaunchKernelGGL:

hipLaunchKernelGGL(kernel_name, dim3(grid), dim3(block), sharedMem, stream, args...);

Recent HIP compilers (hipcc with the clang front end) also accept CUDA's triple-chevron <<<grid, block>>> launch syntax, so in practice you can often keep your launches unchanged; hipLaunchKernelGGL is simply the most portable spelling.

3.4 Porting a CUDA Program to HIP

Let’s port our vector addition from CUDA to HIP.

Original CUDA kernel (unchanged):

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

Host code converted to HIP:

#include <iostream>
#include <hip/hip_runtime.h>

int main() {
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_a = new float[n];
    float *h_b = new float[n];
    float *h_c = new float[n];

    for (int i = 0; i < n; i++) {
        h_a[i] = i * 1.0f;
        h_b[i] = (i + 1) * 1.0f;
    }

    float *d_a, *d_b, *d_c;
    hipMalloc(&d_a, bytes);
    hipMalloc(&d_b, bytes);
    hipMalloc(&d_c, bytes);

    hipMemcpy(d_a, h_a, bytes, hipMemcpyHostToDevice);
    hipMemcpy(d_b, h_b, bytes, hipMemcpyHostToDevice);

    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;

    // HIP kernel launch
    hipLaunchKernelGGL(vecAdd, dim3(blocksPerGrid), dim3(threadsPerBlock), 0, 0, d_a, d_b, d_c, n);

    hipMemcpy(h_c, d_c, bytes, hipMemcpyDeviceToHost);

    // Verification (omitted for brevity)

    hipFree(d_a); hipFree(d_b); hipFree(d_c);
    delete[] h_a; delete[] h_b; delete[] h_c;

    return 0;
}

Compile with:

hipcc -o vecAdd vecAdd.cpp   # on AMD
# or on NVIDIA
hipcc --offload-arch=sm_70 -o vecAdd vecAdd.cpp

Notice that the kernel itself is identical to the CUDA version – that’s the beauty of HIP!

3.5 HIP Kernel Launch Syntax Explained

hipLaunchKernelGGL takes the following arguments:

  • Kernel name (symbol)
  • Grid dimensions (dim3)
  • Block dimensions (dim3)
  • Dynamic shared memory size per block (size_t, default 0)
  • Stream (hipStream_t, default 0 for null stream)
  • Kernel arguments (comma‑separated)

If you have many arguments, it can become verbose. Some projects define a macro like:

#define HIP_LAUNCH(kernel, grid, block, shared, stream, ...) \
    hipLaunchKernelGGL(kernel, grid, block, shared, stream, __VA_ARGS__)

3.6 Error Handling in HIP

HIP error handling is identical to CUDA:

hipError_t err = hipMalloc(&d_a, bytes);
if (err != hipSuccess) {
    std::cerr << "HIP error: " << hipGetErrorString(err) << std::endl;
    exit(EXIT_FAILURE);
}

After a kernel launch, you can check for launch errors with hipGetLastError() and for execution errors after synchronization with hipDeviceSynchronize().
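
As with the CUDA_CHECK macro earlier, it is convenient to wrap HIP calls; a sketch:

#define HIP_CHECK(call) \
    do { \
        hipError_t err = call; \
        if (err != hipSuccess) { \
            std::cerr << "HIP error in " << __FILE__ << ":" << __LINE__ << ": " \
                      << hipGetErrorString(err) << std::endl; \
            exit(EXIT_FAILURE); \
        } \
    } while (0)

// Usage:
HIP_CHECK(hipMalloc(&d_a, bytes));
HIP_CHECK(hipGetLastError());        // e.g., right after a kernel launch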

3.7 Debugging and Profiling HIP

Debugging:

  • On AMD, use rocgdb (similar to cuda‑gdb). Compile with -g and run rocgdb ./vecAdd.
  • On NVIDIA, you can still use cuda-gdb because the HIP code is translated to CUDA under the hood.

Profiling:

  • On AMD: Use rocprof to collect performance counters and traces.
  rocprof --hip-trace ./vecAdd

This generates a results.json file that can be viewed with tools like perfetto.

  • On NVIDIA: Use Nsight Systems / Nsight Compute (or the legacy nvprof) as usual.

Memory checking:

  • On AMD, there is no direct compute-sanitizer equivalent; recent ROCm releases add GPU AddressSanitizer support for HIP, and rocgdb can help track down out-of-bounds accesses.

3.8 HIP Examples: Matrix Multiplication

Our tiled matrix multiplication kernel works unchanged in HIP. Simply change the header and launch syntax. Let's see:

#include <hip/hip_runtime.h>
#include <iostream>

#define TILE_SIZE 16

__global__ void matMulShared(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE_SIZE][TILE_SIZE];
    __shared__ float Bs[TILE_SIZE][TILE_SIZE];

    int bx = blockIdx.x, by = blockIdx.y;
    int tx = threadIdx.x, ty = threadIdx.y;
    int row = by * TILE_SIZE + ty;
    int col = bx * TILE_SIZE + tx;

    float sum = 0.0f;
    for (int tile = 0; tile < N / TILE_SIZE; ++tile) {
        As[ty][tx] = A[row * N + (tile * TILE_SIZE + tx)];
        Bs[ty][tx] = B[(tile * TILE_SIZE + ty) * N + col];
        __syncthreads();

        for (int k = 0; k < TILE_SIZE; ++k) {
            sum += As[ty][k] * Bs[k][tx];
        }
        __syncthreads();
    }

    if (row < N && col < N) {
        C[row * N + col] = sum;
    }
}

int main() {
    int N = 1024;
    size_t bytes = N * N * sizeof(float);

    // ... host allocations and initialization ...

    float *d_A, *d_B, *d_C;
    hipMalloc(&d_A, bytes);
    hipMalloc(&d_B, bytes);
    hipMalloc(&d_C, bytes);

    hipMemcpy(d_A, h_A, bytes, hipMemcpyHostToDevice);
    hipMemcpy(d_B, h_B, bytes, hipMemcpyHostToDevice);

    dim3 block(TILE_SIZE, TILE_SIZE);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);

    hipLaunchKernelGGL(matMulShared, grid, block, 0, 0, d_A, d_B, d_C, N);

    hipMemcpy(h_C, d_C, bytes, hipMemcpyDeviceToHost);

    // ... verification and cleanup ...

    return 0;
}
Enter fullscreen mode Exit fullscreen mode

Compile with hipcc matmul.cpp -o matmul.


Part 4: Advanced HIP and Porting

4.1 Subtle Differences Between CUDA and HIP

While HIP aims for source compatibility, there are some differences to be aware of:

  • Warp vs Wavefront: NVIDIA warps are 32 threads; AMD wavefronts are 64 on CDNA compute GPUs (RDNA consumer GPUs default to 32). Code that assumes a warp size of 32 (e.g., warp shuffles or 32-bit lane masks) may need adjustment. Both CUDA and HIP expose the built-in warpSize variable in device code, and report it in the device properties on the host – use it instead of a hard-coded constant to keep code portable.
  • Intrinsics: __popc and most bit-manipulation intrinsics work unchanged. The CUDA *_sync warp primitives (e.g., __ballot_sync, __shfl_down_sync) generally correspond to the non-sync forms (__ballot, __shfl_down, __any, __all) in HIP, and ballot masks are 64-bit on AMD – check the HIP documentation for what your target supports.
  • Math functions: Most math functions (sin, exp, etc.) are identical. However, double‑precision performance may vary between vendors.
  • Architecture‑specific features: Features like tensor cores (NVIDIA) or matrix cores (AMD) are not portable. If you need them, you may have to write vendor‑specific code guarded by macros.

4.2 Using hipify Tools

AMD provides tools to automatically convert CUDA source to HIP:

  • hipify-perl: A Perl script that performs text‑based replacement.
  hipify-perl vecAdd.cu > vecAdd.cpp

It handles most API renames and converts triple-chevron launches to hipLaunchKernelGGL, but you should still review the output and adjust anything it missed.

  • hipify-clang: A more robust tool based on Clang’s AST. It can handle complex code and provides better accuracy.
  hipify-clang vecAdd.cu --cuda-path=/usr/local/cuda

It outputs HIP code to stdout or a file.

After conversion, review the output, especially for any use of CUDA‑specific features that don’t have direct HIP equivalents.

4.3 Writing Portable Code: Best Practices

To ensure your code runs efficiently on both platforms:

  • Use the HIP runtime API for memory management, streams, and events.
  • Avoid hard-coding warp size: Use the warpSize built-in on the device, or query it from the device properties on the host, and derive block sizes and lane masks from it (see the sketch after this list).
  • Use portable intrinsics: Prefer __syncthreads and __threadfence. CUDA's __syncwarp and the *_sync warp primitives are not uniformly available in HIP, so guard them with platform checks if you need them.
  • Conditional compilation for vendor‑specific optimizations:
  #if defined(__HIP_PLATFORM_NVIDIA__)
      // NVIDIA-specific code (e.g., using CUDA tensor cores)
  #elif defined(__HIP_PLATFORM_AMD__)
      // AMD-specific code
  #endif
  • Profile on both platforms to identify bottlenecks. What's fast on NVIDIA may not be on AMD due to different cache hierarchies, wavefront size, etc. Tune block sizes and tile sizes accordingly.
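
As an illustration of the warp-size point above, here is a minimal sketch that queries the warp/wavefront size at runtime and sizes the blocks from it (assuming the vecAdd kernel and device buffers from Section 3.4):

hipDeviceProp_t prop;
hipGetDeviceProperties(&prop, 0);

int wave = prop.warpSize;            // 32 on NVIDIA, 64 on AMD CDNA GPUs
int threadsPerBlock = 4 * wave;      // always a whole number of warps/wavefronts
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;

hipLaunchKernelGGL(vecAdd, dim3(blocksPerGrid), dim3(threadsPerBlock), 0, 0,
                   d_a, d_b, d_c, n);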

4.4 Performance Optimization on AMD GPUs

When targeting AMD, use ROCm tools:

  • rocprof: For profiling.
  • rocm-smi: To monitor GPU utilization, clock speeds, temperature.
  • OmniTrace: For tracing API calls and kernel execution.

Key optimization considerations for AMD:

  • Wavefront size: 64 threads on CDNA compute GPUs (32 on RDNA). Make your block size a multiple of the wavefront size so no lanes sit idle.
  • Shared memory bank conflicts: AMD’s shared memory has 32 banks (like NVIDIA) but the mapping may differ. Experiment with padding to reduce conflicts.
  • Memory coalescing: Still crucial; consecutive threads should access consecutive memory.
  • Occupancy: Similar concepts apply, but the register file and shared memory sizes differ. Use hipOccupancyMaxPotentialBlockSize to get recommendations.

4.5 Real-world Example: Sobel Edge Detection (Portable)

Let's combine everything into a real‑world kernel: Sobel edge detection on an 8‑bit grayscale image. We'll write it portably for HIP/CUDA.

#include <hip/hip_runtime.h>  // or <cuda_runtime.h> for CUDA

#define TILE_SIZE 16
#define MASK_SIZE 3

__global__ void sobel(const unsigned char *in, unsigned char *out, int width, int height) {
    // Shared memory for the tile plus a 1-pixel halo on each side
    __shared__ unsigned char tile[TILE_SIZE + 2][TILE_SIZE + 2];

    int x = blockIdx.x * TILE_SIZE + threadIdx.x;
    int y = blockIdx.y * TILE_SIZE + threadIdx.y;

    // Cooperatively load the (TILE_SIZE+2) x (TILE_SIZE+2) region, halo included.
    // Each thread loads one or more elements in a strided loop; out-of-image
    // pixels are set to 0.
    for (int i = threadIdx.y * blockDim.x + threadIdx.x;
         i < (TILE_SIZE + 2) * (TILE_SIZE + 2);
         i += blockDim.x * blockDim.y) {
        int ly = i / (TILE_SIZE + 2);
        int lx = i % (TILE_SIZE + 2);
        int gy = blockIdx.y * TILE_SIZE + ly - 1;   // -1 accounts for the halo offset
        int gx = blockIdx.x * TILE_SIZE + lx - 1;
        tile[ly][lx] = (gx >= 0 && gx < width && gy >= 0 && gy < height)
                           ? in[gy * width + gx] : 0;
    }
    __syncthreads();

    // Compute Sobel for this thread's pixel; tile[ty+1][tx+1] corresponds to pixel (y, x)
    if (x < width && y < height) {
        int tx = threadIdx.x, ty = threadIdx.y;
        int Gx = -tile[ty][tx]          + tile[ty][tx + 2]
                 - 2 * tile[ty + 1][tx] + 2 * tile[ty + 1][tx + 2]
                 - tile[ty + 2][tx]     + tile[ty + 2][tx + 2];
        int Gy = -tile[ty][tx] - 2 * tile[ty][tx + 1] - tile[ty][tx + 2]
                 + tile[ty + 2][tx] + 2 * tile[ty + 2][tx + 1] + tile[ty + 2][tx + 2];
        int mag = min(255, (int)hypotf((float)Gx, (float)Gy));
        out[y * width + x] = (unsigned char)mag;
    }
}

Launch:

dim3 block(TILE_SIZE, TILE_SIZE);
dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
sobel<<<grid, block>>>(d_in, d_out, width, height); // CUDA
// HIP: hipLaunchKernelGGL(sobel, grid, block, 0, 0, d_in, d_out, width, height);

This kernel uses shared memory to reduce global memory traffic. Each block cooperatively loads its tile plus a one-pixel halo, then computes Sobel for every pixel in the tile. It's a good template for a real-world, portable GPU kernel.


Part 5: Conclusion and Next Steps

Congratulations! You've journeyed from the absolute basics of GPU programming to writing advanced, portable kernels that run on both NVIDIA and AMD hardware. You've learned:

  • The CUDA programming model, memory hierarchy, and essential APIs.
  • How to write and optimize kernels for maximum performance.
  • The HIP ecosystem and how to write truly portable code.
  • Debugging and profiling techniques for both platforms.

Where to Go from Here

  • Explore the official documentation: the CUDA C++ Programming Guide and the HIP documentation.
  • Study advanced examples: Reduction, scan, stencils, FFTs.
  • Contribute to open‑source projects that use CUDA/HIP (e.g., PyTorch, TensorFlow, ROCm libraries).
  • Experiment with multi‑GPU and distributed computing (e.g., using MPI with GPUs).
  • Learn about new hardware features like tensor cores (NVIDIA) and matrix cores (AMD) – but remember to keep your code portable or use conditional compilation.

GPU programming is a powerful skill that opens doors to high‑performance computing, machine learning, and graphics. With CUDA and HIP under your belt, you're well‑equipped to tackle the most demanding computational challenges. Now go forth and write some fast, portable code!


If you found this guide helpful, please share it with your fellow developers. Have questions or want to share your own experiences? Drop a comment below – I’d love to hear from you. Happy coding!
