In the previous post I showed how to set up GPU Flight with Python and read kernel-level profiling data — occupancy, register counts, and resource bottlenecks. That tells you how well a kernel uses the hardware. But it doesn't tell you what's happening inside the kernel.
Today I want to look at one specific problem: thread divergence. When threads within a warp take different code paths, the GPU serializes execution — it runs one branch, then the other, while idle threads wait. If half the threads branch left and half branch right, you're running at 50% efficiency on those instructions.
GPU Flight's SASS Metrics engine gives you a direct way to measure this. It instruments the GPU at the assembly (SASS) level and reports two key counters per instruction:
- smsp__sass_inst_executed — the number of warp-level instruction executions
- smsp__sass_thread_inst_executed — the total number of thread-level instruction executions
The ratio thread_inst_executed / inst_executed tells you the average number of active threads per warp at each instruction. If it's 32.0, every thread was active. If it's 16.0, half the threads were masked off. If it's 8.0, only a quarter were doing useful work.
The Demo: Five Divergence Patterns
I wrote a small CUDA program with five kernels, each demonstrating a different divergence pattern. The full source is in the GPU Flight repo at example/cuda/sass_divergence_demo.cu. Here's a summary:
| Kernel | Pattern | Expected Active Threads |
|---|---|---|
| uniformWork | No divergence (baseline) | 32 |
| branchByWarpLane | if (threadIdx.x % 2 == 0) — even/odd split | 16 in each branch |
| branchByWarpQuad | if (threadIdx.x % 4 == 0) — 1-in-4 | 8 in hot path |
| earlyExit | Data-dependent early return | Varies (~16) |
| indirectBranch | 4-way switch on random data | Varies (~8) |
Each kernel is wrapped in a GFL_SCOPE so GPU Flight can attribute the SASS metrics to the right section.
Kernel 1: Uniform Work (Baseline)
__global__
void uniformWork(float* out, const float* in, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
float val = in[idx];
for (int i = 0; i < 512; ++i) {
val = val * 1.01f + 0.001f;
}
out[idx] = val;
}
}
Every thread does the same thing. No branches inside the loop, no divergence. This is the baseline — you should see thread_executed / inst_executed close to 32 for the loop body instructions.
Kernel 2: Even/Odd Divergence
__global__
void branchByWarpLane(float* out, const float* in, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
float val = in[idx];
if (threadIdx.x % 2 == 0) {
for (int i = 0; i < 512; ++i)
val = val * 1.01f + 0.001f;
} else {
for (int i = 0; i < 512; ++i)
val = val + 0.001f * (float)i;
}
out[idx] = val;
}
}
This is the classic divergence example. Within every warp, 16 threads go left, 16 go right. The GPU executes both paths sequentially with half the threads masked off each time. The SASS metrics will show ~16 active threads for instructions inside each branch.
Kernel 3: Quad Divergence
__global__
void branchByWarpQuad(float* out, const float* in, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
float val = in[idx];
if (threadIdx.x % 4 == 0) {
for (int i = 0; i < 2048; ++i)
val = val * 1.001f + 0.0001f;
}
out[idx] = val;
}
}
Only every 4th thread enters the loop. That's 8 out of 32 threads doing the heavy work while 24 sit idle. Worse than 50/50 — 75% of the warp is wasted during the loop body.
Kernel 4: Early Exit
__global__
void earlyExit(float* out, const float* in, float threshold, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
float val = in[idx];
if (val < threshold) {
out[idx] = val;
return;
}
for (int i = 0; i < 1024; ++i)
val = val * 1.01f - 0.005f;
out[idx] = val;
}
}
This is data-dependent. Threads whose input is below the threshold return early, while the rest do the expensive computation. With random inputs in [0, 1) and a threshold of 0.5, roughly half the threads will exit early. But unlike Kernel 2, the split isn't uniform across warps — some warps might have 20 threads exit, others might have 10.
Kernel 5: Data-Dependent Switch
__global__
void indirectBranch(float* out, const float* in, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
float val = in[idx];
int category = (int)(val * 4.0f) % 4;
switch (category) {
case 0: for (int i = 0; i < 256; ++i) val = val * 1.01f; break;
case 1: for (int i = 0; i < 256; ++i) val = val + 0.01f; break;
case 2: for (int i = 0; i < 256; ++i) val = val - 0.005f; break;
case 3: for (int i = 0; i < 256; ++i) val = val * 0.99f; break;
}
out[idx] = val;
}
}
A 4-way branch driven by random data. On average, each case gets ~8 threads per warp, but the GPU must execute all 4 paths sequentially. This is the worst case — 4x the instruction count for the branch body.
Running the Demo
Build and run from the GPU Flight repo:
git clone https://github.com/gpu-flight/gpufl-client.git
cd gpufl-client
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --target sass_divergence_demo
./build/example/cuda/sass_divergence_demo
The key part is in main() — initializing GPU Flight with the SASS Metrics engine:
gpufl::InitOptions opts;
opts.app_name = "sass_divergence_demo";
opts.log_path = "sass_divergence";
opts.enable_kernel_details = true;
opts.sampling_auto_start = true;
opts.profiling_engine = gpufl::ProfilingEngine::SassMetrics;
gpufl::init(opts);
Setting profiling_engine to SassMetrics tells GPU Flight to instrument every kernel at the SASS level. Each GFL_SCOPE block then collects per-instruction counters for the kernels launched inside it.
Results: RTX 3090
Here's what I got running on an NVIDIA GeForce RTX 3090 (Ampere, SM 8.6, 82 SMs) with 1M elements:
| Kernel | Weighted Avg Active Threads | Instructions |
|---|---|---|
| uniformWork | 32.0 | 277 |
| branchByWarpLane | 16.3 | 796 |
| branchByWarpQuad | 8.2 | 281 |
| earlyExit | 16.2 | 280 |
| indirectBranch | 1.5 | 1062 |
The "Weighted Avg Active Threads" is thread_inst_executed / inst_executed across all SASS instructions in each kernel, weighted by execution count. "Instructions" is the number of unique PC offsets (SASS instructions) instrumented.
Let's walk through what this tells us:
uniformWork — 32.0 active threads. Perfect. Every warp runs at full width. This is the expected baseline for a kernel with no divergence.
branchByWarpLane — 16.3 active threads. Very close to the theoretical 16. The slight overshoot comes from instructions outside the branch (the if (idx < n) guard, loop control, and the final store) where all 32 threads are active. The 796 unique instructions — nearly 3x the baseline — show the cost: the compiler generates separate code for each branch, and both paths must be executed.
branchByWarpQuad — 8.2 active threads. Again close to the theoretical 8 (only 1 in 4 threads enters the loop). Similar instruction count to the baseline since there's only one branch path — but every instruction in the hot loop runs with 75% of threads idle.
earlyExit — 16.2 active threads. Matches the expectation for a 50% threshold with random data. Threads that exit early become inactive for the remaining instructions.
indirectBranch — 1.5 active threads, 1062 instructions. This is the most striking result. A 4-way switch on random data drops the weighted average to just 1.5 active threads per warp — far worse than the other kernels. It also generates the highest instruction count at 1062, nearly 4x the baseline. This is a crucial insight: divergence doesn't just halve your throughput — multi-way branching on random data can drop you below 5% when measured at the instruction level.
What This Means in Practice
Thread divergence is easy to create and hard to notice. Your kernel still produces correct results. But you might be leaving 50-95% of your GPU's compute on the table.
Here are the common patterns to watch for:
Lane-based branching — if (threadIdx.x % N). This is almost always unintentional. Consider rearranging your data so that threads within a warp take the same path.
Data-dependent branches — like the earlyExit kernel. If your input distribution is skewed, some warps diverge heavily while others don't. The average might look okay, but the worst warps are bottlenecks.
Switch statements on computed values — like indirectBranch. This was the worst offender in our test: each additional case adds another serialized path that the whole warp must step through.
The fix depends on the situation:
- Sort or bin your data so threads in the same warp hit the same branch
- Replace branches with predicated arithmetic — branchless code runs all threads at full width
- Restructure your algorithm so the branch happens at the warp or block level, not the thread level