Advanced GPU Optimization: Microarchitecture of GPU

(This post is part of the "Advanced GPU Optimization" series; if you don't know what's going on, please start here.)

Hey Dev Community!

High‑level GPU optimization (tiling, shared memory, occupancy, etc.) will take you far.

But if you want to squeeze every last drop of performance out of a GPU, you must understand what’s happening inside the SM/CU—at the microarchitecture level.

This post is Part 2 of the series:

  • Part 1: Advanced GPU Optimization — Complete Technical Guide
  • Part 2 (this post): Microarchitecture of GPU

Here, we go from zero to expert on:

  • SIMD lane scheduling
  • Warp/wavefront reconvergence
  • Register file banking
  • Instruction-level parallelism (ILP)
  • Dual-issue pipelines
  • Operand collectors
  • Scoreboarding
  • Thread block scheduling
  • SM/CU residency algorithms

With practical examples, code patterns, and concrete optimization strategies.


📘 Table of contents

  1. SIMD lane scheduling
  2. Warp/wavefront reconvergence
  3. Register file banking
  4. Instruction-level parallelism (ILP)
  5. Dual-issue pipelines
  6. Operand collector
  7. Scoreboarding
  8. Thread block scheduler (TBS)
  9. SM/CU residency algorithms
  10. End-to-end example: from naive kernel to microarchitecture-aware kernel
  11. Final checklist

  1. SIMD lane scheduling

How the GPU actually executes your threads

GPUs execute threads in lockstep groups:

  • NVIDIA: warp = 32 threads
  • AMD: wavefront = 64 threads

Each warp executes one instruction at a time, across all active lanes.

1.1. Basic model

Think of a warp as:

  • A program counter (PC)
  • A mask of active lanes
  • A set of registers per lane

Each cycle, the warp scheduler:

  1. Picks a warp that is ready (not stalled, not waiting on memory, etc.)
  2. Issues the next instruction for that warp
  3. All active lanes execute that instruction in parallel

If a warp is stalled (e.g., waiting for memory), the scheduler picks another warp.

This is how GPUs hide latency: by switching between warps.
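
As a mental model only (real schedulers are more sophisticated and vendor-specific), the per-cycle decision looks roughly like this plain C++ sketch:

// Conceptual sketch of a warp scheduler's per-cycle decision.
// This is a mental model, not how any real SM/CU is implemented.
struct Warp {
    unsigned pc;            // program counter shared by the warp
    unsigned active_mask;   // which of the 32 lanes are active
    bool     stalled;       // e.g. waiting on an outstanding memory load
};

// Each cycle: pick any ready warp (simple round-robin here) and issue
// its next instruction; if none is ready, nothing issues this cycle.
int pick_ready_warp(const Warp* warps, int num_warps, int last_issued) {
    for (int i = 1; i <= num_warps; ++i) {
        int w = (last_issued + i) % num_warps;
        if (!warps[w].stalled)
            return w;       // issue from this warp
    }
    return -1;              // no ready warp: the SM stalls this cycle
}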


1.2. Practical example: memory latency hiding

Consider this naive CUDA kernel:

__global__ void naive_load_compute(const float* __restrict__ in,
                                   float* __restrict__ out,
                                   int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;

    float x = in[idx];          // global memory load (high latency)
    float y = x * 2.0f;         // simple compute
    out[idx] = y;
}

If:

  • Each warp has only one outstanding memory load
  • There are few warps per SM

Then the scheduler quickly runs out of ready warps, and the SM stalls.


1.3. Improving warp scheduling with more work per thread

We can give each thread more independent work, so each warp has more instructions to issue between stalls:

__global__ void better_load_compute(const float* __restrict__ in,
                                    float* __restrict__ out,
                                    int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx * 4 + 3 >= N) return;   // tail (N % 4) assumed handled separately

    // Load 4 elements per thread
    float x0 = in[idx * 4 + 0];
    float x1 = in[idx * 4 + 1];
    float x2 = in[idx * 4 + 2];
    float x3 = in[idx * 4 + 3];

    // Do more math per load
    float y0 = x0 * 2.0f + 1.0f;
    float y1 = x1 * 2.0f + 1.0f;
    float y2 = x2 * 2.0f + 1.0f;
    float y3 = x3 * 2.0f + 1.0f;

    out[idx * 4 + 0] = y0;
    out[idx * 4 + 1] = y1;
    out[idx * 4 + 2] = y2;
    out[idx * 4 + 3] = y3;
}

Now:

  • Each warp has more independent instructions in flight
  • The scheduler more often finds an instruction it can issue
  • Latency is better hidden

  2. Warp/wavefront reconvergence

What really happens when threads diverge

When threads in a warp take different branches, the GPU serializes the paths.

Example:

__global__ void branchy_kernel(const float* __restrict__ in,
                               float* __restrict__ out,
                               int N, float threshold) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;

    float x = in[idx];

    if (x > threshold) {
        out[idx] = x * x;       // path A
    } else {
        out[idx] = sqrtf(x);    // path B
    }
}

If half the threads in a warp go to path A and half to path B:

  • The warp executes path A with only those lanes active
  • Then executes path B with the other lanes active
  • Then reconverges

Effective throughput is cut roughly in half.


2.1. Reconvergence mechanism

Internally, the GPU maintains:

  • A stack of masks
  • A stack of PCs

At a branch:

  1. It computes which lanes go where
  2. Pushes reconvergence info
  3. Executes each path with its mask
  4. Pops and reconverges
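
A heavily simplified model of that mask/PC stack, written as C++ only to make the bookkeeping concrete (real hardware differs, and recent NVIDIA GPUs use independent thread scheduling rather than a pure stack):

#include <cstdint>
#include <stack>

// One outstanding path: which lanes still have to run it, and where.
struct ReconvergenceEntry {
    uint32_t mask;   // lanes that will execute at this PC
    uint32_t pc;     // where those lanes resume
};

// At a divergent branch: push the reconvergence point, push the "else"
// path, and continue executing the "then" path with a reduced mask.
void on_divergent_branch(std::stack<ReconvergenceEntry>& stk,
                         uint32_t current_mask,
                         uint32_t taken_mask,     // lanes where the condition holds
                         uint32_t else_pc,
                         uint32_t reconverge_pc) {
    uint32_t else_mask = current_mask & ~taken_mask;

    stk.push({current_mask, reconverge_pc});  // where the full warp meets again
    if (else_mask != 0)
        stk.push({else_mask, else_pc});       // run the else path later

    // The warp now continues on the taken path with only taken_mask active;
    // each time it reaches the top-of-stack PC, it pops and switches masks.
}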

2.2. Reducing divergence with predication

Instead of branching, sometimes you can use predication:

__global__ void predicated_kernel(const float* __restrict__ in,
                                  float* __restrict__ out,
                                  int N, float threshold) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;

    float x = in[idx];

    float a = x * x;
    float b = sqrtf(x);

    // Select without branch
    out[idx] = (x > threshold) ? a : b;
}

Now:

  • All lanes execute both computations
  • But there is no divergence
  • This is often faster when both paths are cheap

2.3. Practical rule

  • If both branches are cheap → predication is often better
  • If one branch is rare and expensive → keeping the branch is usually better (see the sketch below)
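
For example, when only a small fraction of elements needs an expensive computation, keeping the branch lets most warps skip it entirely. This is only an illustrative sketch (the 64-iteration loop is an arbitrary stand-in for "expensive work"):

// Hypothetical example: keep the branch when the expensive path is rare.
__global__ void rare_expensive_path(const float* __restrict__ in,
                                    float* __restrict__ out,
                                    int N, float threshold) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;

    float x = in[idx];

    if (x > threshold) {
        // Rare, expensive path: predicating this would make every lane
        // pay for the loop even when no lane in the warp needs it.
        float acc = x;
        for (int i = 0; i < 64; ++i)
            acc = acc * 0.5f + sqrtf(acc);
        out[idx] = acc;
    } else {
        // Common, cheap path.
        out[idx] = x * 2.0f;
    }
}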

  3. Register file banking

Why some register accesses are slower than others

GPU register files are banked to allow multiple simultaneous accesses.

Simplified model:

  • Suppose there are 4 banks
  • Each bank can serve 1 read per cycle
  • Registers are mapped to banks by index: bank = reg_id % 4

If multiple lanes in a warp access registers that map to the same bank in the same cycle → bank conflict → serialized access.


3.1. Example: artificial bank conflict

This is conceptual—actual mapping is vendor-specific, but the idea holds.

__global__ void bank_conflict_example(float* __restrict__ out) {
    int lane = threadIdx.x & 31; // warp-local lane ID

    // Hypothetical: registers r0, r4, r8, ... map to same bank
    float r0 = lane * 1.0f;
    float r4 = lane * 2.0f;
    float r8 = lane * 3.0f;
    float r12 = lane * 4.0f;

    // If all these are read in the same instruction group,
    // they may cause bank conflicts.
    float sum = r0 + r4 + r8 + r12;

    out[lane] = sum;
}

In practice, you don’t control physical register IDs directly, but:

  • High register pressure
  • Complex expressions
  • Many live variables

→ increase the chance of conflicts and spills.


3.2. Reducing register pressure

You can often reduce register usage by:

  • Breaking kernels into smaller pieces
  • Reusing variables
  • Avoiding large local arrays
  • Using __restrict__ to help the compiler

Example:

// High register pressure
__global__ void heavy_kernel(const float* a, const float* b, float* c, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;

    float x0 = a[idx];
    float x1 = b[idx];
    float x2 = c[idx];

    float t0 = x0 * 2.0f;
    float t1 = x1 * 3.0f;
    float t2 = x2 * 4.0f;

    float t3 = t0 + t1;
    float t4 = t2 + t3;

    c[idx] = t4;
}

We can compress:

__global__ void lighter_kernel(const float* a, const float* b, float* c, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;

    float x0 = a[idx];
    float x1 = b[idx];
    float x2 = c[idx];

    float t = x0 * 2.0f + x1 * 3.0f + x2 * 4.0f;

    c[idx] = t;
}

Fewer live temporaries → fewer registers → fewer spills → fewer bank conflicts.


  4. Instruction-level parallelism (ILP)

Making the GPU do more per warp per cycle

ILP is about having multiple independent instructions ready to execute from the same warp.

Example of low ILP:

float a = x[i] * 2.0f;
float b = a + 3.0f;
float c = b * 4.0f;

Each instruction depends on the previous one.

Example of higher ILP:

float a = x[i] * 2.0f;
float b = y[i] * 3.0f;
float c = z[i] * 4.0f;
float d = a + b + c;

Now the first three instructions are independent.


4.1. ILP in a kernel

Naive version:

__global__ void low_ilp_kernel(const float* __restrict__ in,
                               float* __restrict__ out,
                               int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;

    float x = in[idx];
    float y = x * 2.0f;
    float z = y + 3.0f;
    float w = z * 4.0f;

    out[idx] = w;
}

Higher ILP version:

__global__ void high_ilp_kernel(const float* __restrict__ in1,
                                const float* __restrict__ in2,
                                const float* __restrict__ in3,
                                float* __restrict__ out,
                                int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;

    float x = in1[idx];
    float y = in2[idx];
    float z = in3[idx];

    float a = x * 2.0f;
    float b = y * 3.0f;
    float c = z * 4.0f;

    float w = a + b + c;

    out[idx] = w;
}

The GPU can overlap the independent multiplies, improving throughput.


4.2. ILP vs occupancy

Sometimes:

  • You can trade occupancy for ILP
  • Fewer warps, but each warp has more independent work

This is often beneficial for compute-bound kernels.
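
One way to express that trade-off (a sketch, not a tuned implementation) is to combine __launch_bounds__ with a grid-stride loop that processes several independent elements per thread, accepting fewer resident threads in exchange for more in-flight instructions per warp:

// Sketch: fewer threads overall, more independent work per thread.
// __launch_bounds__(128, 4) asks the compiler to target 128-thread blocks
// with at least 4 resident blocks per SM; exact effects are compiler-dependent.
__global__ void __launch_bounds__(128, 4)
ilp_over_occupancy(const float* __restrict__ in,
                   float* __restrict__ out,
                   int N) {
    int stride = blockDim.x * gridDim.x;

    // Grid-stride loop: 4 independent elements per thread per iteration
    // (tail elements not covered here are assumed handled separately).
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i + 3 * stride < N;
         i += 4 * stride) {
        float a = in[i];
        float b = in[i + stride];
        float c = in[i + 2 * stride];
        float d = in[i + 3 * stride];

        // Four independent chains the scheduler can interleave.
        out[i]              = a * 2.0f + 1.0f;
        out[i + stride]     = b * 2.0f + 1.0f;
        out[i + 2 * stride] = c * 2.0f + 1.0f;
        out[i + 3 * stride] = d * 2.0f + 1.0f;
    }
}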


  5. Dual-issue pipelines

Using multiple execution units per cycle

Modern GPUs have multiple execution pipelines:

  • FP32 ALU
  • INT ALU
  • SFU (special functions: sin, cos, exp, etc.)
  • Load/store units

If two instructions target different pipelines and are independent, they can be dual-issued.


5.1. Example: mixing INT and FP

__global__ void dual_issue_kernel(const float* __restrict__ in,
                                  float* __restrict__ out,
                                  int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;

    float x = in[idx];

    // FP operation
    float y = x * 2.0f;

    // INT operation
    int lane = threadIdx.x & 31;
    int offset = lane + 1;

    // Use both results
    out[idx] = y + float(offset);
}

Here:

  • y = x * 2.0f; → FP pipeline
  • offset = lane + 1; → INT pipeline

The GPU can often issue these in the same cycle.


5.2. Practical advice

  • Avoid long chains of same-type dependent instructions
  • Interleave integer index math with floating-point math
  • Let the compiler schedule mixed instructions

  6. Operand collector

The hidden buffer that can stall you

Before an instruction executes, its operands must be:

  • Read from registers
  • Possibly forwarded from previous instructions
  • Possibly fetched from shared memory

The operand collector gathers these inputs.

If too many instructions simultaneously demand operands, the collector can become a bottleneck → collector stalls.


6.1. Recognizing collector pressure

You typically see this in profilers as:

  • “Operand collector stall”
  • “Input dependency stall”
  • “Register dependency stall”

6.2. Reducing collector pressure

  • Reduce register pressure
  • Avoid long dependency chains
  • Avoid issuing many complex instructions back-to-back
  • Use fewer live variables

Example: instead of:

float a = f0(x);
float b = f1(a);
float c = f2(b);
float d = f3(c);
float e = f4(d);

Try to restructure:

float a = f0(x);
float b = f1(x);
float c = f2(a, b);
float d = f3(c);
float e = f4(d);

More independence → less pressure.


  7. Scoreboarding

How the GPU knows when an instruction is ready to execute

The scoreboard tracks:

  • Which registers are ready
  • Which instructions are waiting on which registers
  • Which warps are stalled

A warp cannot issue an instruction until:

  • All its operands are ready
  • No structural hazards exist

7.1. Example: load-use dependency

__global__ void load_use_kernel(const float* __restrict__ in,
                                float* __restrict__ out,
                                int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;

    float x = in[idx];      // global load (high latency)
    float y = x * 2.0f;     // depends on x
    out[idx] = y;
}

The scoreboard will:

  • Mark x as not ready until the load completes
  • Prevent the multiply from issuing

If there are no other ready warps, the SM stalls.


7.2. Hiding load latency with independent work

__global__ void better_load_use_kernel(const float* __restrict__ in,
                                       float* __restrict__ out,
                                       int N) {
    // Each thread handles 4 consecutive elements so the loads are independent
    int idx = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    if (idx + 3 >= N) return;   // tail elements assumed handled separately

    // Issue multiple loads
    float x0 = in[idx + 0];
    float x1 = in[idx + 1];
    float x2 = in[idx + 2];
    float x3 = in[idx + 3];

    // Do some independent math while loads complete
    float t0 = float(idx) * 0.1f;
    float t1 = t0 * t0;

    // Now use loaded values
    out[idx + 0] = x0 * 2.0f + t1;
    out[idx + 1] = x1 * 2.0f + t1;
    out[idx + 2] = x2 * 2.0f + t1;
    out[idx + 3] = x3 * 2.0f + t1;
}

Now the scoreboard has more ready instructions to issue while loads are in flight.


  8. Thread Block Scheduler (TBS)

How blocks are assigned to SMs

Each SM has finite resources:

  • Max threads
  • Max warps
  • Max blocks
  • Register file size
  • Shared memory size

The thread block scheduler assigns blocks to SMs subject to these constraints.


8.1. Example: resource-limited occupancy

Suppose an SM has:

  • 2048 max threads
  • 64 max warps
  • 96 KB shared memory

Your kernel uses:

  • 256 threads per block
  • 48 KB shared memory per block

Then:

  • Max blocks by threads: 2048 / 256 = 8
  • Max blocks by shared memory: 96 / 48 = 2

So only 2 blocks per SM can be resident.

If each block has 256 threads → 512 threads per SM → 25% occupancy.


8.2. Reducing shared memory to increase residency

If you can reduce shared memory to 24 KB per block:

  • Max blocks by shared memory: 96 / 24 = 4
  • Now 4 blocks per SM → 1024 threads → 50% occupancy

This often improves performance.


  9. SM/CU residency algorithms

How many blocks/warps can live on an SM at once

Residency is determined by:

  • Threads per block
  • Registers per thread
  • Shared memory per block
  • Hardware limits

The compiler and runtime compute:

  • How many blocks can fit per SM
  • How many warps that yields
  • What occupancy that implies

9.1. Example calculation

Assume:

  • SM supports 2048 threads, 64 warps, 96 KB shared memory
  • Kernel uses 256 threads/block, 32 registers/thread, 24 KB shared memory/block

Then:

  • Max blocks by threads: 2048 / 256 = 8
  • Max blocks by shared memory: 96 / 24 = 4
  • Max blocks by warps: 64 / (256 / 32) = 64 / 8 = 8

Final residency: 4 blocks per SM.

Total threads per SM: 4 × 256 = 1024 → 50% occupancy.
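
You can sanity-check this kind of hand calculation at runtime with the CUDA occupancy API. A minimal sketch (my_kernel is just a placeholder for whatever kernel you are measuring, and the 256-thread / 24 KB figures are the values from this example, passed as dynamic shared memory):

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the one you actually care about.
__global__ void my_kernel(const float* in, float* out, int n) { /* ... */ }

void report_residency() {
    int blocks_per_sm = 0;
    int block_size = 256;
    size_t dyn_smem = 24 * 1024;   // 24 KB dynamic shared memory per block

    // Ask the runtime how many blocks of this kernel fit per SM.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, my_kernel, block_size, dyn_smem);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    float occupancy = (blocks_per_sm * block_size) /
                      float(prop.maxThreadsPerMultiProcessor);
    printf("Resident blocks/SM: %d, occupancy: %.0f%%\n",
           blocks_per_sm, occupancy * 100.0f);
}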


9.2. Using tools

Use:

  • nvcc --ptxas-options=-v to see register usage
  • Nsight Compute to see occupancy and limiting factors
  • ROCm tools on AMD

  10. End-to-end example

From naive kernel to microarchitecture-aware kernel

Let’s take a simple reduction and progressively make it microarchitecture-aware.


10.1. Naive reduction

__global__ void naive_reduce(const float* __restrict__ in,
                             float* __restrict__ out,
                             int N) {
    __shared__ float sdata[256];

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    float x = (idx < N) ? in[idx] : 0.0f;
    sdata[tid] = x;
    __syncthreads();

    // Naive reduction
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        if (tid % (2 * stride) == 0) {
            sdata[tid] += sdata[tid + stride];
        }
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}

Problems:

  • Divergence (tid % (2 * stride) == 0)
  • Poor ILP
  • Many synchronizations
  • Shared memory bank conflicts likely

10.2. Microarchitecture-aware reduction

__global__ void optimized_reduce(const float* __restrict__ in,
                                 float* __restrict__ out,
                                 int N) {
    extern __shared__ float sdata[];

    int tid = threadIdx.x;
    int idx = blockIdx.x * (blockDim.x * 2) + tid;

    float sum = 0.0f;

    // Load two elements per thread (more ILP, better memory throughput)
    if (idx < N) {
        sum += in[idx];
        if (idx + blockDim.x < N)
            sum += in[idx + blockDim.x];
    }

    sdata[tid] = sum;
    __syncthreads();

    // Reduction in shared memory with no divergence
    for (int stride = blockDim.x / 2; stride > 32; stride >>= 1) {
        if (tid < stride) {
            sdata[tid] += sdata[tid + stride];
        }
        __syncthreads();
    }

    // Warp-level reduction (no shared memory, no per-step sync)
    if (tid < 32) {
        // The shared-memory loop above stops once 64 partial sums remain,
        // so fold in the upper 32 first (assumes blockDim.x >= 64).
        float val = sdata[tid] + sdata[tid + 32];

        // Warp shuffle reduction
        #pragma unroll
        for (int offset = 16; offset > 0; offset >>= 1) {
            val += __shfl_down_sync(0xffffffff, val, offset);
        }

        if (tid == 0)
            out[blockIdx.x] = val;
    }
}
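
A possible host-side launch for this version (a sketch: it assumes a power-of-two block size of at least 64 and performs only the first reduction pass):

// Host-side launch sketch for optimized_reduce: each block reduces
// 2 * block input elements into one partial sum.
void launch_reduce(const float* d_in, float* d_partial, int N) {
    int block = 256;                                  // power of two, >= 64
    int grid  = (N + block * 2 - 1) / (block * 2);    // 2 elements per thread
    size_t smem = block * sizeof(float);              // dynamic shared memory

    optimized_reduce<<<grid, block, smem>>>(d_in, d_partial, N);

    // d_partial now holds one partial sum per block; a second pass
    // (or a copy back to the host) finishes the reduction.
}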

  11. Final microarchitecture-aware optimization checklist

When optimizing a GPU kernel at the microarchitecture level, ask:

  • SIMD & warps:

    • Are warps doing similar work?
    • Is there heavy divergence?
  • Reconvergence:

    • Can I replace branches with predication?
    • Can I restructure control flow?
  • Registers & banking:

    • Is register usage too high?
    • Am I causing spills?
  • ILP:

    • Do I have independent instructions?
    • Can I unroll loops to expose more ILP?
  • Dual-issue:

    • Am I mixing INT + FP operations?
  • Operand collector & scoreboard:

    • Do I have long dependency chains?
    • Can I insert independent work between loads and uses?
  • TBS & residency:

    • Is shared memory limiting blocks per SM?
    • Is register usage limiting occupancy?
  • Profiling:

    • What do Nsight / ROCm tools say is the main stall reason?
    • Am I memory-bound or compute-bound?

I hope you enjoyed it. Have a nice time!
