(This post is part of the "Advanced GPU Optimization" series. If you're not sure what's going on, please visit the series overview here.)
Hey Dev Community!
High‑level GPU optimization (tiling, shared memory, occupancy, etc.) will take you far.
But if you want to squeeze every last drop of performance out of a GPU, you must understand what’s happening inside the SM/CU—at the microarchitecture level.
This post is Part 2 of the series:
- Part 1: Advanced GPU Optimization — Complete Technical Guide
- Part 2 (this post): Microarchitecture of GPU
Here, we go from zero to expert on:
- SIMD lane scheduling
- Warp/wavefront reconvergence
- Register file banking
- Instruction-level parallelism (ILP)
- Dual-issue pipelines
- Operand collectors
- Scoreboarding
- Thread block scheduling
- SM/CU residency algorithms
With practical examples, code patterns, and concrete optimization strategies.
📘 Table of contents
- SIMD lane scheduling
- Warp/wavefront reconvergence
- Register file banking
- Instruction-level parallelism (ILP)
- Dual-issue pipelines
- Operand collector
- Scoreboarding
- Thread block scheduler (TBS)
- SM/CU residency algorithms
- End-to-end example: from naive kernel to microarchitecture-aware kernel
- Final checklist
1. SIMD lane scheduling
How the GPU actually executes your threads
GPUs execute threads in lockstep groups:
- NVIDIA: warp = 32 threads
- AMD: wavefront = 64 threads (on GCN/CDNA; RDNA GPUs also support 32-wide wavefronts)
Each warp executes one instruction at a time, across all active lanes.
1.1. Basic model
Think of a warp as:
- A program counter (PC)
- A mask of active lanes
- A set of registers per lane
Each cycle, the warp scheduler:
- Picks a warp that is ready (not stalled, not waiting on memory, etc.)
- Issues the next instruction for that warp
- All active lanes execute that instruction in parallel
If a warp is stalled (e.g., waiting for memory), the scheduler picks another warp.
This is how GPUs hide latency: by switching between warps.
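If it helps to see that loop written down, here is a toy host-side model of a single scheduler slot. This is a conceptual sketch only; real schedulers have multiple issue slots, scoreboards, and vendor-specific selection policies.

#include <vector>

// Toy model of one warp-scheduler slot: each cycle, pick the first warp that
// is not stalled and "issue" its next instruction. If every warp is stalled,
// nothing is issued that cycle and the SM bubbles.
struct WarpSlot {
    int  pc = 0;            // next instruction for this warp
    int  stall_cycles = 0;  // > 0 while waiting on memory, dependencies, ...
    bool finished = false;
};

int pick_ready_warp(const std::vector<WarpSlot>& warps) {
    for (int i = 0; i < static_cast<int>(warps.size()); ++i)
        if (!warps[i].finished && warps[i].stall_cycles == 0)
            return i;       // ready warp found: issue from it this cycle
    return -1;              // every warp is stalled: latency is no longer hidden
}

The more resident warps there are, the less likely pick_ready_warp is to return -1. That is the whole story behind latency hiding.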
1.2. Practical example: memory latency hiding
Consider this naive CUDA kernel:
__global__ void naive_load_compute(const float* __restrict__ in,
                                   float* __restrict__ out,
                                   int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;

    float x = in[idx];   // global memory load (high latency)
    float y = x * 2.0f;  // simple compute
    out[idx] = y;
}
If:
- Each warp has only one outstanding memory load
- There are few warps per SM
Then the scheduler quickly runs out of ready warps, and the SM stalls.
1.3. Improving warp scheduling with more work per thread
We can increase instruction-level work per thread and warp count:
__global__ void better_load_compute(const float* __restrict__ in,
                                    float* __restrict__ out,
                                    int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Guard all four elements (assumes N is padded/handled to a multiple of 4)
    if (idx * 4 + 3 >= N) return;

    // Load 4 elements per thread
    float x0 = in[idx * 4 + 0];
    float x1 = in[idx * 4 + 1];
    float x2 = in[idx * 4 + 2];
    float x3 = in[idx * 4 + 3];

    // Do more math per load
    float y0 = x0 * 2.0f + 1.0f;
    float y1 = x1 * 2.0f + 1.0f;
    float y2 = x2 * 2.0f + 1.0f;
    float y3 = x3 * 2.0f + 1.0f;

    out[idx * 4 + 0] = y0;
    out[idx * 4 + 1] = y1;
    out[idx * 4 + 2] = y2;
    out[idx * 4 + 3] = y3;
}
Now:
- Each warp has more independent instructions
- The scheduler has more ready warps
- Latency is better hidden
2. Warp/wavefront reconvergence
What really happens when threads diverge
When threads in a warp take different branches, the GPU serializes the paths.
Example:
__global__ void branchy_kernel(const float* __restrict__ in,
                               float* __restrict__ out,
                               int N, float threshold) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;

    float x = in[idx];
    if (x > threshold) {
        out[idx] = x * x;      // path A
    } else {
        out[idx] = sqrtf(x);   // path B
    }
}
If half the threads in a warp go to path A and half to path B:
- The warp executes path A with only those lanes active
- Then executes path B with the other lanes active
- Then reconverges
Effective throughput is cut roughly in half.
2.1. Reconvergence mechanism
Internally, the GPU maintains:
- A stack of masks
- A stack of PCs
At a branch:
- It computes which lanes go where
- Pushes reconvergence info
- Executes each path with its mask
- Pops and reconverges
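The steps above can be sketched as a toy software model of the mask/PC stack. This is conceptual only; the struct names and the exact push/pop policy are illustrative, and real hardware differs between vendors and generations.

#include <cstdint>
#include <vector>

// Toy model of a warp's divergence/reconvergence stack.
struct ReconvEntry {
    uint32_t mask;   // lanes to run when this entry is popped
    int      pc;     // program counter where those lanes resume
};

struct WarpState {
    uint32_t active = 0xFFFFFFFFu;    // currently executing lanes
    int      pc     = 0;
    std::vector<ReconvEntry> stack;   // divergence stack
};

// At "if (cond)": taken_mask marks the lanes whose condition is true.
void diverge(WarpState& w, uint32_t taken_mask, int else_pc, int reconv_pc) {
    uint32_t all       = w.active;
    uint32_t not_taken = all & ~taken_mask;

    // Entry that restores the full mask once both paths reach reconv_pc:
    w.stack.push_back({all, reconv_pc});
    // Entry for the lanes that go the other way:
    if (not_taken)
        w.stack.push_back({not_taken, else_pc});

    w.active = all & taken_mask;      // execute the taken path first
}

// Called when the warp's PC reaches a reconvergence point:
// switch to the deferred lanes, or restore the full pre-branch mask.
void pop_and_resume(WarpState& w) {
    ReconvEntry e = w.stack.back();
    w.stack.pop_back();
    w.active = e.mask;
    w.pc     = e.pc;
}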
2.2. Reducing divergence with predication
Instead of branching, sometimes you can use predication:
__global__ void predicated_kernel(const float* __restrict__ in,
                                  float* __restrict__ out,
                                  int N, float threshold) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;

    float x = in[idx];
    float a = x * x;
    float b = sqrtf(x);

    // Select without a branch
    out[idx] = (x > threshold) ? a : b;
}
Now:
- All lanes execute both computations
- But there is no divergence
- This is often faster when both paths are cheap
2.3. Practical rule
- If both branches are cheap → predication is often better
- If one branch is rare and expensive → branch is better
3. Register file banking
Why some register accesses are slower than others
GPU register files are banked to allow multiple simultaneous accesses.
Simplified model:
- Suppose there are 4 banks
- Each bank can serve 1 read per cycle
- Registers are mapped to banks by index: bank = reg_id % 4
If multiple lanes in a warp access registers that map to the same bank in the same cycle → bank conflict → serialized access.
3.1. Example: artificial bank conflict
This is conceptual—actual mapping is vendor-specific, but the idea holds.
__global__ void bank_conflict_example(float* __restrict__ out) {
    int lane = threadIdx.x & 31;  // warp-local lane ID

    // Hypothetical: registers r0, r4, r8, ... map to the same bank
    float r0  = lane * 1.0f;
    float r4  = lane * 2.0f;
    float r8  = lane * 3.0f;
    float r12 = lane * 4.0f;

    // If all of these are read in the same instruction group,
    // they may cause bank conflicts.
    float sum = r0 + r4 + r8 + r12;
    out[lane] = sum;
}
In practice, you don’t control physical register IDs directly, but:
- High register pressure
- Complex expressions
- Many live variables
→ increase the chance of conflicts and spills.
3.2. Reducing register pressure
You can often reduce register usage by:
- Breaking kernels into smaller pieces
- Reusing variables
- Avoiding large local arrays
- Using __restrict__ to help the compiler
Example:
// High register pressure
__global__ void heavy_kernel(const float* a, const float* b, float* c, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;

    float x0 = a[idx];
    float x1 = b[idx];
    float x2 = c[idx];

    float t0 = x0 * 2.0f;
    float t1 = x1 * 3.0f;
    float t2 = x2 * 4.0f;
    float t3 = t0 + t1;
    float t4 = t2 + t3;

    c[idx] = t4;
}
We can compress:
__global__ void lighter_kernel(const float* a, const float* b, float* c, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;

    float x0 = a[idx];
    float x1 = b[idx];
    float x2 = c[idx];

    float t = x0 * 2.0f + x1 * 3.0f + x2 * 4.0f;
    c[idx] = t;
}
Fewer live temporaries → fewer registers → fewer spills → fewer bank conflicts.
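One additional lever the compiler gives you here (not covered above, so treat this as an optional sketch) is CUDA's __launch_bounds__ qualifier: it tells the compiler the maximum block size you will launch with and, optionally, how many blocks per SM you would like resident, so it can budget registers accordingly. The kernel name and numbers below are illustrative, not a recommendation for every kernel:

// Sketch: budget registers for 256-thread blocks and try to keep
// at least 4 blocks resident per SM.
__global__ void
__launch_bounds__(256, 4)
lighter_kernel_bounded(const float* a, const float* b, float* c, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;
    c[idx] = a[idx] * 2.0f + b[idx] * 3.0f + c[idx] * 4.0f;
}

You can check the resulting register count with nvcc --ptxas-options=-v (see section 9.2).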
4. Instruction-level parallelism (ILP)
Making the GPU do more per warp per cycle
ILP is about having multiple independent instructions ready to execute from the same warp.
Example of low ILP:
float a = x[i] * 2.0f;
float b = a + 3.0f;
float c = b * 4.0f;
Each instruction depends on the previous one.
Example of higher ILP:
float a = x[i] * 2.0f;
float b = y[i] * 3.0f;
float c = z[i] * 4.0f;
float d = a + b + c;
Now the first three instructions are independent.
4.1. ILP in a kernel
Naive version:
__global__ void low_ilp_kernel(const float* __restrict__ in,
                               float* __restrict__ out,
                               int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;

    float x = in[idx];
    float y = x * 2.0f;
    float z = y + 3.0f;
    float w = z * 4.0f;
    out[idx] = w;
}
Higher ILP version:
__global__ void high_ilp_kernel(const float* __restrict__ in1,
                                const float* __restrict__ in2,
                                const float* __restrict__ in3,
                                float* __restrict__ out,
                                int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;

    float x = in1[idx];
    float y = in2[idx];
    float z = in3[idx];

    float a = x * 2.0f;
    float b = y * 3.0f;
    float c = z * 4.0f;
    float w = a + b + c;

    out[idx] = w;
}
The GPU can overlap the independent multiplies, improving throughput.
4.2. ILP vs occupancy
Sometimes:
- You can trade occupancy for ILP
- Fewer warps, but each warp has more independent work
This is often beneficial for compute-bound kernels.
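As a sketch of what that trade can look like in code (a hypothetical grid-stride kernel, not from the original example set): launch fewer threads, but give each one four independent accumulators so the scheduler always has independent instructions to pick from.

__global__ void ilp_sum_kernel(const float* __restrict__ in,
                               float* __restrict__ out,
                               int N) {
    int idx    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // Four independent accumulators -> four independent dependency chains
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;

    int i = idx;
    for (; i + 3 * stride < N; i += 4 * stride) {
        s0 += in[i + 0 * stride];
        s1 += in[i + 1 * stride];
        s2 += in[i + 2 * stride];
        s3 += in[i + 3 * stride];
    }
    for (; i < N; i += stride)          // leftover elements
        s0 += in[i];

    out[idx] = (s0 + s1) + (s2 + s3);   // one partial sum per thread
}

Here out needs one element per launched thread, and the per-thread partial sums still need a final reduction (for example with the reduction kernel in section 10).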
5. Dual-issue pipelines
Using multiple execution units per cycle
Modern GPUs have multiple execution pipelines:
- FP32 ALU
- INT ALU
- SFU (special functions: sin, cos, exp, etc.)
- Load/store units
If two instructions target different pipelines and are independent, they can be dual-issued.
5.1. Example: mixing INT and FP
__global__ void dual_issue_kernel(const float* __restrict__ in,
                                  float* __restrict__ out,
                                  int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;

    float x = in[idx];

    // FP operation
    float y = x * 2.0f;

    // INT operation
    int lane = threadIdx.x & 31;
    int offset = lane + 1;

    // Use both results
    out[idx] = y + float(offset);
}
Here:
- y = x * 2.0f; → FP pipeline
- offset = lane + 1; → INT pipeline
The GPU can often issue these in the same cycle.
5.2. Practical advice
- Avoid long chains of same-type dependent instructions
- Interleave integer index math with floating-point math
- Let the compiler schedule mixed instructions
6. Operand collector
The hidden buffer that can stall you
Before an instruction executes, its operands must be:
- Read from registers
- Possibly forwarded from previous instructions
- Possibly fetched from shared memory
The operand collector gathers these inputs.
If too many instructions simultaneously demand operands, the collector can become a bottleneck → collector stalls.
6.1. Recognizing collector pressure
You typically see this in profilers as:
- “Operand collector stall”
- “Input dependency stall”
- “Register dependency stall”
6.2. Reducing collector pressure
- Reduce register pressure
- Avoid long dependency chains
- Avoid issuing many complex instructions back-to-back
- Use fewer live variables
Example: instead of:
float a = f0(x);
float b = f1(a);
float c = f2(b);
float d = f3(c);
float e = f4(d);
Try to restructure:
float a = f0(x);
float b = f1(x);
float c = f2(a, b);
float d = f3(c);
float e = f4(d);
More independence → less pressure.
7. Scoreboarding
How the GPU knows when an instruction is ready to execute
The scoreboard tracks:
- Which registers are ready
- Which instructions are waiting on which registers
- Which warps are stalled
A warp cannot issue an instruction until:
- All its operands are ready
- No structural hazards exist
7.1. Example: load-use dependency
__global__ void load_use_kernel(const float* __restrict__ in,
                                float* __restrict__ out,
                                int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;

    float x = in[idx];    // global load (high latency)
    float y = x * 2.0f;   // depends on x
    out[idx] = y;
}
The scoreboard will:
- Mark x as not ready until the load completes
- Prevent the multiply from issuing
If there are no other ready warps, the SM stalls.
7.2. Hiding load latency with independent work
__global__ void better_load_use_kernel(const float* __restrict__ in,
                                       float* __restrict__ out,
                                       int N) {
    int idx  = blockIdx.x * blockDim.x + threadIdx.x;
    int base = idx * 4;            // each thread owns 4 consecutive elements
    if (base + 3 >= N) return;     // assumes N is padded/handled to a multiple of 4

    // Issue multiple independent loads
    float x0 = in[base + 0];
    float x1 = in[base + 1];
    float x2 = in[base + 2];
    float x3 = in[base + 3];

    // Do some independent math while the loads are in flight
    float t0 = float(idx) * 0.1f;
    float t1 = t0 * t0;

    // Now use the loaded values
    out[base + 0] = x0 * 2.0f + t1;
    out[base + 1] = x1 * 2.0f + t1;
    out[base + 2] = x2 * 2.0f + t1;
    out[base + 3] = x3 * 2.0f + t1;
}
Now the scoreboard has more ready instructions to issue while loads are in flight.
8. Thread Block Scheduler (TBS)
How blocks are assigned to SMs
Each SM has finite resources:
- Max threads
- Max warps
- Max blocks
- Register file size
- Shared memory size
The thread block scheduler assigns blocks to SMs subject to these constraints.
8.1. Example: resource-limited occupancy
Suppose an SM has:
- 2048 max threads
- 64 max warps
- 96 KB shared memory
Your kernel uses:
- 256 threads per block
- 48 KB shared memory per block
Then:
- Max blocks by threads: 2048 / 256 = 8
- Max blocks by shared memory: 96 / 48 = 2
So only 2 blocks per SM can be resident.
If each block has 256 threads → 512 threads per SM → 25% occupancy.
8.2. Reducing shared memory to increase residency
If you can reduce shared memory to 24 KB per block:
- Max blocks by shared memory: 96 / 24 = 4
- Now 4 blocks per SM → 1024 threads → 50% occupancy
This often improves performance.
9. SM/CU residency algorithms
How many blocks/warps can live on an SM at once
Residency is determined by:
- Threads per block
- Registers per thread
- Shared memory per block
- Hardware limits
The compiler and runtime compute:
- How many blocks can fit per SM
- How many warps that yields
- What occupancy that implies
9.1. Example calculation
Assume:
- SM supports 2048 threads, 64 warps, 96 KB shared memory
- Kernel uses 256 threads/block, 32 registers/thread, 24 KB shared memory/block
Then:
- Max blocks by threads: 2048 / 256 = 8
- Max blocks by shared memory: 96 / 24 = 4
- Max blocks by warps: 64 / (256 / 32) = 64 / 8 = 8
Final residency: 4 blocks per SM.
Total threads per SM: 4 × 256 = 1024 → 50% occupancy.
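The same calculation can be written down as a tiny host-side helper. This is a sketch: it ignores allocation granularity (registers and shared memory are reserved in chunks on real hardware), and the 64 K-entry register file per SM is an assumed typical value, not something stated in the example above.

#include <algorithm>

// Residency = the tightest of the per-resource limits.
int blocks_per_sm(int threads_per_block, int regs_per_thread, int smem_per_block_bytes,
                  int max_threads, int max_warps, int regfile_regs, int smem_bytes) {
    const int warp_size = 32;
    int warps_per_block = (threads_per_block + warp_size - 1) / warp_size;

    int by_threads = max_threads / threads_per_block;
    int by_warps   = max_warps / warps_per_block;
    int by_smem    = smem_per_block_bytes ? smem_bytes / smem_per_block_bytes : by_threads;
    int by_regs    = regfile_regs / (regs_per_thread * threads_per_block);

    return std::min({by_threads, by_warps, by_smem, by_regs});
}

// blocks_per_sm(256, 32, 24 * 1024, 2048, 64, 64 * 1024, 96 * 1024) == 4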
9.2. Using tools
Use:
- nvcc --ptxas-options=-v to see register usage
- Nsight Compute to see occupancy and limiting factors
- ROCm tools on AMD
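You can also ask the runtime directly. Below is a minimal sketch using the CUDA occupancy API, assuming the low_ilp_kernel from section 4.1 is visible in the same file and is launched with 256-thread blocks and no dynamic shared memory:

#include <cstdio>
#include <cuda_runtime.h>

void report_occupancy() {
    int max_blocks = 0;
    // How many 256-thread blocks of low_ilp_kernel can be resident per SM?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks, low_ilp_kernel,
                                                  /*blockSize=*/256,
                                                  /*dynamicSMemSize=*/0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    float occupancy = (max_blocks * 256.0f) / prop.maxThreadsPerMultiProcessor;
    printf("Resident blocks/SM: %d, occupancy: %.0f%%\n", max_blocks, occupancy * 100.0f);
}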
10. End-to-end example
From naive kernel to microarchitecture-aware kernel
Let’s take a simple reduction and progressively make it microarchitecture-aware.
10.1. Naive reduction
__global__ void naive_reduce(const float* __restrict__ in,
                             float* __restrict__ out,
                             int N) {
    __shared__ float sdata[256];

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    float x = (idx < N) ? in[idx] : 0.0f;
    sdata[tid] = x;
    __syncthreads();

    // Naive reduction
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        if (tid % (2 * stride) == 0) {
            sdata[tid] += sdata[tid + stride];
        }
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}
Problems:
- Divergence (tid % (2 * stride) == 0)
- Poor ILP
- Many synchronizations
- Shared memory bank conflicts likely
10.2. Microarchitecture-aware reduction
__global__ void optimized_reduce(const float* __restrict__ in,
                                 float* __restrict__ out,
                                 int N) {
    extern __shared__ float sdata[];

    int tid = threadIdx.x;
    int idx = blockIdx.x * (blockDim.x * 2) + tid;

    float sum = 0.0f;

    // Load two elements per thread (more ILP, better memory throughput)
    if (idx < N) {
        sum += in[idx];
        if (idx + blockDim.x < N)
            sum += in[idx + blockDim.x];
    }

    sdata[tid] = sum;
    __syncthreads();

    // Reduction in shared memory with no divergence
    for (int stride = blockDim.x / 2; stride > 32; stride >>= 1) {
        if (tid < stride) {
            sdata[tid] += sdata[tid + stride];
        }
        __syncthreads();
    }

    // Warp-level reduction (no shared memory, no sync)
    if (tid < 32) {
        // Fold in the second half (elements 32..63) left over by the loop above
        float val = sdata[tid] + sdata[tid + 32];

        // Warp shuffle reduction
        #pragma unroll
        for (int offset = 16; offset > 0; offset >>= 1) {
            val += __shfl_down_sync(0xffffffff, val, offset);
        }

        if (tid == 0)
            out[blockIdx.x] = val;
    }
}
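For completeness, here is one way the optimized kernel could be launched (a host-side sketch; the original walkthrough stops at the kernel). Each block reduces 2 × blockDim.x input elements and needs blockDim.x floats of dynamic shared memory, and out must hold one partial sum per block:

void launch_optimized_reduce(const float* d_in, float* d_partial, int N) {
    int block = 256;
    int grid  = (N + 2 * block - 1) / (2 * block);   // each block covers 2*block elements
    size_t smem = block * sizeof(float);             // dynamic shared memory per block

    optimized_reduce<<<grid, block, smem>>>(d_in, d_partial, N);

    // d_partial now holds one partial sum per block; reduce it again
    // (for example by calling the kernel once more, or finishing on the CPU).
}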
11. Final microarchitecture-aware optimization checklist
When optimizing a GPU kernel at the microarchitecture level, ask:
- SIMD & warps:
  - Are warps doing similar work?
  - Is there heavy divergence?
- Reconvergence:
  - Can I replace branches with predication?
  - Can I restructure control flow?
- Registers & banking:
  - Is register usage too high?
  - Am I causing spills?
- ILP:
  - Do I have independent instructions?
  - Can I unroll loops to expose more ILP?
- Dual-issue:
  - Am I mixing INT + FP operations?
- Operand collector & scoreboard:
  - Do I have long dependency chains?
  - Can I insert independent work between loads and uses?
- TBS & residency:
  - Is shared memory limiting blocks per SM?
  - Is register usage limiting occupancy?
- Profiling:
  - What do Nsight / ROCm tools say is the main stall reason?
  - Am I memory-bound or compute-bound?
I hope you enjoyed it. Have a nice time!