In my last post I introduced GPU Flight — a lightweight CUDA observability tool that acts like a flight recorder for your GPU. We covered what it collects: system metrics, device capabilities, and per-kernel events.
Today I want to talk about one specific metric that GPU Flight captures: occupancy. It's one of the most important numbers for understanding GPU performance, and also one of the most misunderstood.
What Is Occupancy?
A GPU is organized around Streaming Multiprocessors (SMs). Each SM can run many threads simultaneously — not by context-switching like a CPU, but by actually running them in parallel. The unit of scheduling on an SM is a warp: a group of 32 threads that execute the same instruction in lockstep.
An SM has a fixed warp budget — say, 48 warps on a typical Ampere GPU. When you launch a kernel with blocks of 256 threads (8 warps each), the SM can hold up to 6 blocks concurrently to fill those 48 warp slots. If something prevents that — too many registers, too much shared memory — fewer blocks fit, and some warp slots sit idle.
Occupancy measures how well those warp slots are filled:
occupancy = active warps / maximum warps per SM
A value of 1.0 means every slot is in use. A value of 0.5 means half of the SM's warp slots sit empty while your kernel runs. That is lost latency-hiding capacity; whether it actually costs you throughput depends on the kernel, which is exactly why the number is so often misunderstood.
How GPU Flight Captures It
GPU Flight records occupancy automatically for every kernel launch. No code changes needed — just initialize with enableKernelDetails: true and it shows up in the log:
{
  "type": "kernel_event",
  "name": "_Z18block_reduce_naivePKfPfi",
  "occupancy": 0.833333,
  "num_regs": 16,
  "static_shared_bytes": 16384,
  "dyn_shared_bytes": 0,
  "block": "(256,1,1)",
  "grid": "(16384,1,1)",
  "max_active_blocks": 5,
  ...
}
Under the hood, GPU Flight calls cudaOccupancyMaxActiveBlocksPerMultiprocessor at kernel launch time to get max_active_blocks, then multiplies by the warps per block and divides by the SM's warp budget to compute occupancy. This happens inside the CUPTI callback on the host, so it adds no overhead to the kernel's execution on the device.
That 0.833333 immediately tells you something is off. This kernel only fills 5 out of 6 possible concurrent blocks on each SM. Some compute is being left on the table.
But What Is Actually Causing It?
Here's where a single number hits its limit.
Is it registers? Shared memory? The hardware block count cap? Looking at the log fields, you can make an educated guess — static_shared_bytes: 16384 is 16 KB of shared memory per block, which is pretty large. But you still have to do the math yourself against your specific GPU's properties to confirm.
That manual detective work is exactly what I wanted to eliminate. So GPU Flight now also computes a per-resource occupancy breakdown and identifies the limiting resource automatically. Let me show what this looks like with a concrete kernel.
The kernel
Here's a simple parallel block reduction — it sums an array by having all 256 threads in a block cooperate through shared memory:
__global__ void block_reduce_naive(const float* in, float* out, int n) {
    __shared__ float smem[4096]; // 16 KB — statically reserved
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    // Load one element per thread into shared memory
    smem[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();

    // Reduce in shared memory — each step halves the active threads
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) smem[tid] += smem[tid + s];
        __syncthreads();
    }

    // Thread 0 writes the block's result
    if (tid == 0) out[blockIdx.x] = smem[0];
}
Launched with 256 threads per block across 4M elements:
const int BLOCK = 256;
const int GRID = (N + BLOCK - 1) / BLOCK; // ~16384 blocks
block_reduce_naive<<<GRID, BLOCK>>>(d_in, d_out, N);
Nothing unusual here — this is a textbook reduction. But GPU Flight flags a problem immediately.
What GPU Flight sees
{
  ...,
  "occupancy": 0.833333,
  "reg_occupancy": 1.0,
  "smem_occupancy": 0.833333,
  "warp_occupancy": 1.0,
  "block_occupancy": 1.0,
  "limiting_resource": "shared_mem"
}
Each *_occupancy field answers the question: "if only this constraint existed, what would occupancy be?" The limiting_resource field names the one that is actually binding. Here, smem_occupancy matches the overall occupancy while every other value is 1.0, so shared memory is definitively the culprit.
Why
The problem is __shared__ float smem[4096]. Static shared memory is sized at compile time and reserved in full for the block's entire lifetime — even if the kernel only uses part of it. With 256 threads per block, this reduction only ever touches smem[0] through smem[255], but all 4096 floats (16 KB) are locked up on the SM regardless. Every block is paying a 16 KB reservation it doesn't actually need, and that prevents the SM from scheduling as many concurrent blocks as the warp budget would otherwise allow.
The fix
Switch to dynamic shared memory, which is sized at launch time rather than compiled in:
__global__ void block_reduce_optimized(const float* in, float* out, int n) {
    extern __shared__ float smem[]; // size comes from the launch call
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    smem[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) smem[tid] += smem[tid + s];
        __syncthreads();
    }

    if (tid == 0) out[blockIdx.x] = smem[0];
}
The kernel body is completely unchanged. The only differences are extern __shared__ instead of a fixed-size array, and passing the size as the third launch argument:
size_t smem_bytes = BLOCK * sizeof(float); // 256 × 4 = 1 KB
block_reduce_optimized<<<GRID, BLOCK, smem_bytes>>>(d_in, d_out, N);
The shared memory footprint drops from 16 KB to 1 KB per block — 16× smaller — and now the SM can fit all 6 concurrent blocks instead of 5.
GPU Flight confirms the fix worked:
{
  "occupancy": 1.0,
  "limiting_resource": "warps"
}
"warps" as the limiting resource means full occupancy — every SM warp slot is filled and shared memory is no longer in the way.
Full Sample Code: GitHub Repo