(This post is part of the "Advanced GPU Optimization" series. If you're not sure what's going on, please visit the series overview here.)
Hey Dev Community!
High‑level GPU optimization (tiling, shared memory, occupancy, etc.) will take you far.
But if you want to squeeze every last drop of performance out of a GPU, you must understand what’s happening inside the SM/CU—at the microarchitecture level.
This post is Part 2 of the series:
- Part 1: Advanced GPU Optimization — Complete Technical Guide
- Part 2 (this post): Microarchitecture of GPU
Here, we go from zero to expert on:
- SIMD lane scheduling
- Warp/wavefront reconvergence
- Register file banking
- Instruction-level parallelism (ILP)
- Dual-issue pipelines
- Operand collectors
- Scoreboarding
- Thread block scheduling
- SM/CU residency algorithms
With practical examples, code patterns, and concrete optimization strategies.
📘 Table of contents
- SIMD lane scheduling
- Warp/wavefront reconvergence
- Register file banking
- Instruction-level parallelism (ILP)
- Dual-issue pipelines
- Operand collector
- Scoreboarding
- Thread block scheduler (TBS)
- SM/CU residency algorithms
- End-to-end example: from naive kernel to microarchitecture-aware kernel
- Final checklist
1. SIMD lane scheduling
How the GPU actually executes your threads
GPUs execute threads in lockstep groups:
- NVIDIA: warp = 32 threads
- AMD: wavefront = 64 threads (on GCN/CDNA; RDNA GPUs also support 32-wide wavefronts)
Each warp executes one instruction at a time, across all active lanes.
1.1. Basic model
Think of a warp as:
- A program counter (PC)
- A mask of active lanes
- A set of registers per lane
Each cycle, the warp scheduler:
- Picks a warp that is ready (not stalled, not waiting on memory, etc.)
- Issues the next instruction for that warp
- All active lanes execute that instruction in parallel
If a warp is stalled (e.g., waiting for memory), the scheduler picks another warp.
This is how GPUs hide latency: by switching between warps.
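If it helps to see that loop written down, here is a toy host-side model of a single scheduler slot. This is a conceptual sketch only; real schedulers have multiple issue slots, scoreboards, and vendor-specific selection policies.

#include <vector>

// Toy model of one warp-scheduler slot: each cycle, pick the first warp that
// is not stalled and "issue" its next instruction. If every warp is stalled,
// nothing is issued that cycle and the SM bubbles.
struct WarpSlot {
    int  pc = 0;            // next instruction for this warp
    int  stall_cycles = 0;  // > 0 while waiting on memory, dependencies, ...
    bool finished = false;
};

int pick_ready_warp(const std::vector<WarpSlot>& warps) {
    for (int i = 0; i < static_cast<int>(warps.size()); ++i)
        if (!warps[i].finished && warps[i].stall_cycles == 0)
            return i;       // ready warp found: issue from it this cycle
    return -1;              // every warp is stalled: latency is no longer hidden
}

The more resident warps there are, the less likely pick_ready_warp is to return -1. That is the whole story behind latency hiding.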
1.2. Practical example: memory latency hiding
Consider this naive CUDA kernel:
__global__ void naive_load_compute(const float* __restrict__ in,
                                   float* __restrict__ out,
                                   int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;

    float x = in[idx];   // global memory load (high latency)
    float y = x * 2.0f;  // simple compute
    out[idx] = y;
}
If:
- Each warp has only one outstanding memory load
- There are few warps per SM
Then the scheduler quickly runs out of ready warps, and the SM stalls.
1.3. Improving warp scheduling with more work per thread
We can increase instruction-level work per thread and warp count:
__global__ void better_load_compute(const float* __restrict__ in,
                                    float* __restrict__ out,
                                    int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Guard all four elements (assumes N is padded/handled to a multiple of 4)
    if (idx * 4 + 3 >= N) return;

    // Load 4 elements per thread
    float x0 = in[idx * 4 + 0];
    float x1 = in[idx * 4 + 1];
    float x2 = in[idx * 4 + 2];
    float x3 = in[idx * 4 + 3];

    // Do more math per load
    float y0 = x0 * 2.0f + 1.0f;
    float y1 = x1 * 2.0f + 1.0f;
    float y2 = x2 * 2.0f + 1.0f;
    float y3 = x3 * 2.0f + 1.0f;

    out[idx * 4 + 0] = y0;
    out[idx * 4 + 1] = y1;
    out[idx * 4 + 2] = y2;
    out[idx * 4 + 3] = y3;
}
Now:
- Each warp has more independent instructions
- The scheduler has more ready warps
- Latency is better hidden
2. Warp/wavefront reconvergence
What really happens when threads diverge
When threads in a warp take different branches, the GPU serializes the paths.
Example:
__global__ void branchy_kernel(const float* __restrict__ in,
                               float* __restrict__ out,
                               int N, float threshold) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;

    float x = in[idx];
    if (x > threshold) {
        out[idx] = x * x;      // path A
    } else {
        out[idx] = sqrtf(x);   // path B
    }
}
If half the threads in a warp go to path A and half to path B:
- The warp executes path A with only those lanes active
- Then executes path B with the other lanes active
- Then reconverges
Effective throughput is cut roughly in half.
2.1. Reconvergence mechanism
Internally, the GPU maintains:
- A stack of masks
- A stack of PCs
At a branch:
- It computes which lanes go where
- Pushes reconvergence info
- Executes each path with its mask
- Pops and reconverges
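The steps above can be sketched as a toy software model of the mask/PC stack. This is conceptual only; the struct names and the exact push/pop policy are illustrative, and real hardware differs between vendors and generations.

#include <cstdint>
#include <vector>

// Toy model of a warp's divergence/reconvergence stack.
struct ReconvEntry {
    uint32_t mask;   // lanes to run when this entry is popped
    int      pc;     // program counter where those lanes resume
};

struct WarpState {
    uint32_t active = 0xFFFFFFFFu;    // currently executing lanes
    int      pc     = 0;
    std::vector<ReconvEntry> stack;   // divergence stack
};

// At "if (cond)": taken_mask marks the lanes whose condition is true.
void diverge(WarpState& w, uint32_t taken_mask, int else_pc, int reconv_pc) {
    uint32_t all       = w.active;
    uint32_t not_taken = all & ~taken_mask;

    // Entry that restores the full mask once both paths reach reconv_pc:
    w.stack.push_back({all, reconv_pc});
    // Entry for the lanes that go the other way:
    if (not_taken)
        w.stack.push_back({not_taken, else_pc});

    w.active = all & taken_mask;      // execute the taken path first
}

// Called when the warp's PC reaches a reconvergence point:
// switch to the deferred lanes, or restore the full pre-branch mask.
void pop_and_resume(WarpState& w) {
    ReconvEntry e = w.stack.back();
    w.stack.pop_back();
    w.active = e.mask;
    w.pc     = e.pc;
}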
2.2. Reducing divergence with predication
Instead of branching, sometimes you can use predication:
__global__ void predicated_kernel(const float* __restrict__ in,
                                  float* __restrict__ out,
                                  int N, float threshold) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;

    float x = in[idx];
    float a = x * x;
    float b = sqrtf(x);

    // Select without a branch
    out[idx] = (x > threshold) ? a : b;
}
Now:
- All lanes execute both computations
- But there is no divergence
- This is often faster when both paths are cheap
2.3. Practical rule
- If both branches are cheap → predication is often better
- If one branch is rare and expensive → branch is better
3. Register file banking
Why some register accesses are slower than others
GPU register files are banked to allow multiple simultaneous accesses.
Simplified model:
- Suppose there are 4 banks
- Each bank can serve 1 read per cycle
- Registers are mapped to banks by index: bank = reg_id % 4
If multiple lanes in a warp access registers that map to the same bank in the same cycle → bank conflict → serialized access.
3.1. Example: artificial bank conflict
This is conceptual—actual mapping is vendor-specific, but the idea holds.
__global__ void bank_conflict_example(float* __restrict__ out) {
    int lane = threadIdx.x & 31;  // warp-local lane ID

    // Hypothetical: registers r0, r4, r8, ... map to the same bank
    float r0  = lane * 1.0f;
    float r4  = lane * 2.0f;
    float r8  = lane * 3.0f;
    float r12 = lane * 4.0f;

    // If all of these are read in the same instruction group,
    // they may cause bank conflicts.
    float sum = r0 + r4 + r8 + r12;
    out[lane] = sum;
}
In practice, you don’t control physical register IDs directly, but:
- High register pressure
- Complex expressions
- Many live variables
→ increase the chance of conflicts and spills.
3.2. Reducing register pressure
You can often reduce register usage by:
- Breaking kernels into smaller pieces
- Reusing variables
- Avoiding large local arrays
- Using __restrict__ to help the compiler
Example:
// High register pressure
__global__ void heavy_kernel(const float* a, const float* b, float* c, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;

    float x0 = a[idx];
    float x1 = b[idx];
    float x2 = c[idx];

    float t0 = x0 * 2.0f;
    float t1 = x1 * 3.0f;
    float t2 = x2 * 4.0f;
    float t3 = t0 + t1;
    float t4 = t2 + t3;

    c[idx] = t4;
}
We can compress:
__global__ void lighter_kernel(const float* a, const float* b, float* c, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;

    float x0 = a[idx];
    float x1 = b[idx];
    float x2 = c[idx];

    float t = x0 * 2.0f + x1 * 3.0f + x2 * 4.0f;
    c[idx] = t;
}
Fewer live temporaries → fewer registers → fewer spills → fewer bank conflicts.
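One additional lever the compiler gives you here (not covered above, so treat this as an optional sketch) is CUDA's __launch_bounds__ qualifier: it tells the compiler the maximum block size you will launch with and, optionally, how many blocks per SM you would like resident, so it can budget registers accordingly. The kernel name and numbers below are illustrative, not a recommendation for every kernel:

// Sketch: budget registers for 256-thread blocks and try to keep
// at least 4 blocks resident per SM.
__global__ void
__launch_bounds__(256, 4)
lighter_kernel_bounded(const float* a, const float* b, float* c, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;
    c[idx] = a[idx] * 2.0f + b[idx] * 3.0f + c[idx] * 4.0f;
}

You can check the resulting register count with nvcc --ptxas-options=-v (see section 9.2).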
4. Instruction-level parallelism (ILP)
Making the GPU do more per warp per cycle
ILP is about having multiple independent instructions ready to execute from the same warp.
Example of low ILP:
float a = x[i] * 2.0f;
float b = a + 3.0f;
float c = b * 4.0f;
Each instruction depends on the previous one.
Example of higher ILP:
float a = x[i] * 2.0f;
float b = y[i] * 3.0f;
float c = z[i] * 4.0f;
float d = a + b + c;
Now the first three instructions are independent.
4.1. ILP in a kernel
Naive version:
__global__ void low_ilp_kernel(const float* __restrict__ in,
                               float* __restrict__ out,
                               int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;

    float x = in[idx];
    float y = x * 2.0f;
    float z = y + 3.0f;
    float w = z * 4.0f;
    out[idx] = w;
}
Higher ILP version:
__global__ void high_ilp_kernel(const float* __restrict__ in1,
                                const float* __restrict__ in2,
                                const float* __restrict__ in3,
                                float* __restrict__ out,
                                int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;

    float x = in1[idx];
    float y = in2[idx];
    float z = in3[idx];

    float a = x * 2.0f;
    float b = y * 3.0f;
    float c = z * 4.0f;
    float w = a + b + c;

    out[idx] = w;
}
The GPU can overlap the independent multiplies, improving throughput.
4.2. ILP vs occupancy
Sometimes:
- You can trade occupancy for ILP
- Fewer warps, but each warp has more independent work
This is often beneficial for compute-bound kernels.
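As a sketch of what that trade can look like in code (a hypothetical grid-stride kernel, not from the original example set): launch fewer threads, but give each one four independent accumulators so the scheduler always has independent instructions to pick from.

__global__ void ilp_sum_kernel(const float* __restrict__ in,
                               float* __restrict__ out,
                               int N) {
    int idx    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // Four independent accumulators -> four independent dependency chains
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;

    int i = idx;
    for (; i + 3 * stride < N; i += 4 * stride) {
        s0 += in[i + 0 * stride];
        s1 += in[i + 1 * stride];
        s2 += in[i + 2 * stride];
        s3 += in[i + 3 * stride];
    }
    for (; i < N; i += stride)          // leftover elements
        s0 += in[i];

    out[idx] = (s0 + s1) + (s2 + s3);   // one partial sum per thread
}

Here out needs one element per launched thread, and the per-thread partial sums still need a final reduction (for example with the reduction kernel in section 10).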
5. Dual-issue pipelines
Using multiple execution units per cycle
Modern GPUs have multiple execution pipelines:
- FP32 ALU
- INT ALU
- SFU (special functions: sin, cos, exp, etc.)
- Load/store units
If two instructions target different pipelines and are independent, they can be dual-issued.
5.1. Example: mixing INT and FP
__global__ void dual_issue_kernel(const float* __restrict__ in,
                                  float* __restrict__ out,
                                  int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;

    float x = in[idx];

    // FP operation
    float y = x * 2.0f;

    // INT operation
    int lane = threadIdx.x & 31;
    int offset = lane + 1;

    // Use both results
    out[idx] = y + float(offset);
}
Here:
- y = x * 2.0f; → FP pipeline
- offset = lane + 1; → INT pipeline
The GPU can often issue these in the same cycle.
5.2. Practical advice
- Avoid long chains of same-type dependent instructions
- Interleave integer index math with floating-point math
- Let the compiler schedule mixed instructions
6. Operand collector
The hidden buffer that can stall you
Before an instruction executes, its operands must be:
- Read from registers
- Possibly forwarded from previous instructions
- Possibly fetched from shared memory
The operand collector gathers these inputs.
If too many instructions simultaneously demand operands, the collector can become a bottleneck → collector stalls.
6.1. Recognizing collector pressure
You typically see this in profilers as:
- “Operand collector stall”
- “Input dependency stall”
- “Register dependency stall”
6.2. Reducing collector pressure
- Reduce register pressure
- Avoid long dependency chains
- Avoid issuing many complex instructions back-to-back
- Use fewer live variables
Example: instead of:
float a = f0(x);
float b = f1(a);
float c = f2(b);
float d = f3(c);
float e = f4(d);
Try to restructure:
float a = f0(x);
float b = f1(x);
float c = f2(a, b);
float d = f3(c);
float e = f4(d);
More independence → less pressure.
7. Scoreboarding
How the GPU knows when an instruction is ready to execute
The scoreboard tracks:
- Which registers are ready
- Which instructions are waiting on which registers
- Which warps are stalled
A warp cannot issue an instruction until:
- All its operands are ready
- No structural hazards exist
7.1. Example: load-use dependency
__global__ void load_use_kernel(const float* __restrict__ in,
                                float* __restrict__ out,
                                int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;

    float x = in[idx];    // global load (high latency)
    float y = x * 2.0f;   // depends on x
    out[idx] = y;
}
The scoreboard will:
- Mark x as not ready until the load completes
- Prevent the multiply from issuing
If there are no other ready warps, the SM stalls.
7.2. Hiding load latency with independent work
__global__ void better_load_use_kernel(const float* __restrict__ in,
                                       float* __restrict__ out,
                                       int N) {
    int idx  = blockIdx.x * blockDim.x + threadIdx.x;
    int base = idx * 4;            // each thread owns 4 consecutive elements
    if (base + 3 >= N) return;     // assumes N is padded/handled to a multiple of 4

    // Issue multiple independent loads
    float x0 = in[base + 0];
    float x1 = in[base + 1];
    float x2 = in[base + 2];
    float x3 = in[base + 3];

    // Do some independent math while the loads are in flight
    float t0 = float(idx) * 0.1f;
    float t1 = t0 * t0;

    // Now use the loaded values
    out[base + 0] = x0 * 2.0f + t1;
    out[base + 1] = x1 * 2.0f + t1;
    out[base + 2] = x2 * 2.0f + t1;
    out[base + 3] = x3 * 2.0f + t1;
}
Now the scoreboard has more ready instructions to issue while loads are in flight.
8. Thread Block Scheduler (TBS)
How blocks are assigned to SMs
Each SM has finite resources:
- Max threads
- Max warps
- Max blocks
- Register file size
- Shared memory size
The thread block scheduler assigns blocks to SMs subject to these constraints.
8.1. Example: resource-limited occupancy
Suppose an SM has:
- 2048 max threads
- 64 max warps
- 96 KB shared memory
Your kernel uses:
- 256 threads per block
- 48 KB shared memory per block
Then:
- Max blocks by threads: 2048 / 256 = 8
- Max blocks by shared memory: 96 / 48 = 2
So only 2 blocks per SM can be resident.
If each block has 256 threads → 512 threads per SM → 25% occupancy.
8.2. Reducing shared memory to increase residency
If you can reduce shared memory to 24 KB per block:
- Max blocks by shared memory: 96 / 24 = 4
- Now 4 blocks per SM → 1024 threads → 50% occupancy
This often improves performance.
9. SM/CU residency algorithms
How many blocks/warps can live on an SM at once
Residency is determined by:
- Threads per block
- Registers per thread
- Shared memory per block
- Hardware limits
The compiler and runtime compute:
- How many blocks can fit per SM
- How many warps that yields
- What occupancy that implies
9.1. Example calculation
Assume:
- SM supports 2048 threads, 64 warps, 96 KB shared memory
- Kernel uses 256 threads/block, 32 registers/thread, 24 KB shared memory/block
Then:
- Max blocks by threads: 2048 / 256 = 8
- Max blocks by shared memory: 96 / 24 = 4
- Max blocks by warps: 64 / (256 / 32) = 64 / 8 = 8
Final residency: 4 blocks per SM.
Total threads per SM: 4 × 256 = 1024 → 50% occupancy.
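The same calculation can be written down as a tiny host-side helper. This is a sketch: it ignores allocation granularity (registers and shared memory are reserved in chunks on real hardware), and the 64 K-entry register file per SM is an assumed typical value, not something stated in the example above.

#include <algorithm>

// Residency = the tightest of the per-resource limits.
int blocks_per_sm(int threads_per_block, int regs_per_thread, int smem_per_block_bytes,
                  int max_threads, int max_warps, int regfile_regs, int smem_bytes) {
    const int warp_size = 32;
    int warps_per_block = (threads_per_block + warp_size - 1) / warp_size;

    int by_threads = max_threads / threads_per_block;
    int by_warps   = max_warps / warps_per_block;
    int by_smem    = smem_per_block_bytes ? smem_bytes / smem_per_block_bytes : by_threads;
    int by_regs    = regfile_regs / (regs_per_thread * threads_per_block);

    return std::min({by_threads, by_warps, by_smem, by_regs});
}

// blocks_per_sm(256, 32, 24 * 1024, 2048, 64, 64 * 1024, 96 * 1024) == 4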
9.2. Using tools
Use:
- nvcc --ptxas-options=-v to see register usage
- Nsight Compute to see occupancy and limiting factors
- ROCm tools on AMD
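You can also ask the runtime directly. Below is a minimal sketch using the CUDA occupancy API, assuming the low_ilp_kernel from section 4.1 is visible in the same file and is launched with 256-thread blocks and no dynamic shared memory:

#include <cstdio>
#include <cuda_runtime.h>

void report_occupancy() {
    int max_blocks = 0;
    // How many 256-thread blocks of low_ilp_kernel can be resident per SM?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks, low_ilp_kernel,
                                                  /*blockSize=*/256,
                                                  /*dynamicSMemSize=*/0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    float occupancy = (max_blocks * 256.0f) / prop.maxThreadsPerMultiProcessor;
    printf("Resident blocks/SM: %d, occupancy: %.0f%%\n", max_blocks, occupancy * 100.0f);
}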
10. End-to-end example
From naive kernel to microarchitecture-aware kernel
Let’s take a simple reduction and progressively make it microarchitecture-aware.
10.1. Naive reduction
__global__ void naive_reduce(const float* __restrict__ in,
                             float* __restrict__ out,
                             int N) {
    __shared__ float sdata[256];

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    float x = (idx < N) ? in[idx] : 0.0f;
    sdata[tid] = x;
    __syncthreads();

    // Naive reduction
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        if (tid % (2 * stride) == 0) {
            sdata[tid] += sdata[tid + stride];
        }
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}
Problems:
- Divergence (tid % (2 * stride) == 0)
- Poor ILP
- Many synchronizations
- Shared memory bank conflicts likely
10.2. Microarchitecture-aware reduction
__global__ void optimized_reduce(const float* __restrict__ in,
                                 float* __restrict__ out,
                                 int N) {
    extern __shared__ float sdata[];

    int tid = threadIdx.x;
    int idx = blockIdx.x * (blockDim.x * 2) + tid;

    float sum = 0.0f;

    // Load two elements per thread (more ILP, better memory throughput)
    if (idx < N) {
        sum += in[idx];
        if (idx + blockDim.x < N)
            sum += in[idx + blockDim.x];
    }

    sdata[tid] = sum;
    __syncthreads();

    // Reduction in shared memory with no divergence
    for (int stride = blockDim.x / 2; stride > 32; stride >>= 1) {
        if (tid < stride) {
            sdata[tid] += sdata[tid + stride];
        }
        __syncthreads();
    }

    // Warp-level reduction (no shared memory, no sync)
    if (tid < 32) {
        // Fold in the second half (elements 32..63) left over by the loop above
        float val = sdata[tid] + sdata[tid + 32];

        // Warp shuffle reduction
        #pragma unroll
        for (int offset = 16; offset > 0; offset >>= 1) {
            val += __shfl_down_sync(0xffffffff, val, offset);
        }

        if (tid == 0)
            out[blockIdx.x] = val;
    }
}
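For completeness, here is one way the optimized kernel could be launched (a host-side sketch; the original walkthrough stops at the kernel). Each block reduces 2 × blockDim.x input elements and needs blockDim.x floats of dynamic shared memory, and out must hold one partial sum per block:

void launch_optimized_reduce(const float* d_in, float* d_partial, int N) {
    int block = 256;
    int grid  = (N + 2 * block - 1) / (2 * block);   // each block covers 2*block elements
    size_t smem = block * sizeof(float);             // dynamic shared memory per block

    optimized_reduce<<<grid, block, smem>>>(d_in, d_partial, N);

    // d_partial now holds one partial sum per block; reduce it again
    // (for example by calling the kernel once more, or finishing on the CPU).
}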
11. Final microarchitecture-aware optimization checklist
When optimizing a GPU kernel at the microarchitecture level, ask:
- SIMD & warps:
  - Are warps doing similar work?
  - Is there heavy divergence?
- Reconvergence:
  - Can I replace branches with predication?
  - Can I restructure control flow?
- Registers & banking:
  - Is register usage too high?
  - Am I causing spills?
- ILP:
  - Do I have independent instructions?
  - Can I unroll loops to expose more ILP?
- Dual-issue:
  - Am I mixing INT + FP operations?
- Operand collector & scoreboard:
  - Do I have long dependency chains?
  - Can I insert independent work between loads and uses?
- TBS & residency:
  - Is shared memory limiting blocks per SM?
  - Is register usage limiting occupancy?
- Profiling:
  - What do Nsight / ROCm tools say is the main stall reason?
  - Am I memory-bound or compute-bound?
I hope you enjoyed it. Have a nice time!