In software engineering, two approaches that are both O(n) are often treated as equivalent for the purposes of discussion.
But in low-level or performance engineering, that is not the end of the story. Even when two algorithms have the same time complexity, their actual performance can differ dramatically depending on how they access memory.
A simple example is iterating through an array versus a linked list. Both are O(n), but arrays are usually much faster in practice because their memory layout is contiguous, which allows the CPU to use caches much more efficiently.
The same idea applies on GPUs too, but the effect is often much bigger because many threads are accessing memory at the same time.
What Is Memory Coalescing?
On NVIDIA GPUs, threads execute in groups called warps, which contain 32 threads.
When those threads access memory in a well-structured way, the GPU can combine their requests into a small number of memory transactions. That is called memory coalescing.
When the access pattern is poor, the opposite happens. Instead of serving the whole warp efficiently, the GPU ends up issuing many separate memory transactions. That wastes bandwidth and increases latency.
So the idea is simple: neighboring threads should access neighboring memory whenever possible.
Measuring It in Practice
The concept itself is well known, but measuring it in real code is not always convenient.
Tools like NVIDIA Nsight Compute usually require attaching a profiler and replaying kernels. That is fine for deep analysis, but it is not something you leave on continuously during normal execution.
With GPUFlight, I wanted to measure this kind of issue continuously during normal runs, without a debugger and without replaying the kernel.
The Setup: Two Matmul Kernels
For this example, I used two simple matrix multiplication kernels:
C = A × B
Both kernels compute the exact same result. The only difference is how the work is assigned to threads.
Row-per-thread
Each thread computes one row of the output matrix:
```cuda
__global__ void matmul_row_per_thread(const float* A, const float* B,
                                      float* C, int M, int K, int N) {
    // Each thread owns one row of C.
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M) return;
    for (int col = 0; col < N; col++) {
        float sum = 0.0f;
        for (int i = 0; i < K; i++)
            sum += A[row * K + i] * B[i * N + col];  // A reads are K floats apart across threads
        C[row * N + col] = sum;
    }
}
```
Col-per-thread
Each thread computes one column of the output matrix:
```cuda
__global__ void matmul_col_per_thread(const float* A, const float* B,
                                      float* C, int M, int K, int N) {
    // Each thread owns one column of C.
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= N) return;
    for (int row = 0; row < M; row++) {
        float sum = 0.0f;
        for (int i = 0; i < K; i++)
            sum += A[row * K + i] * B[i * N + col];  // B reads are consecutive across threads
        C[row * N + col] = sum;
    }
}
```
Same math. Same number of floating-point operations. The only difference is which dimension maps to threadIdx.x.
That small mapping change turns out to matter a lot.
Why It Matters: How GPUs Read Memory
A GPU does not read one float at a time in the way people often imagine. When a warp executes a load instruction, the hardware tries to combine the addresses from all 32 threads into as few memory transactions as possible.
In the best case, all 32 threads access consecutive floats, and the warp can be served efficiently.
In the worst case, each thread touches a different cache line, so the GPU ends up issuing many separate transactions. Most of the fetched data is not even used by that warp.
That is exactly what happens here.
In matmul_row_per_thread, adjacent threads (thread 0, 1, 2, ...) are assigned rows 0, 1, 2, .... When they read A[row * K + i], thread 0 reads address 0*K + i and thread 1 reads 1*K + i — these are K floats apart. With K=256, that's a stride of 1024 bytes between adjacent threads. Every thread hits a different cache line.
In matmul_col_per_thread, adjacent threads access columns 0, 1, 2, .... When they read B[i * N + col], thread 0 reads i*N + 0 and thread 1 reads i*N + 1 — consecutive addresses. One cache line serves all 32 threads.
Measuring with GPUFlight
GPUFlight instruments your CUDA application using CUPTI's SASS metrics and PC sampling APIs. You add a few lines to your code:
```cuda
#include "gpufl/gpufl.hpp"

int main() {
    gpufl::InitOptions opts;
    opts.app_name = "memory_coalescing_demo";
    opts.profiling_engine = gpufl::ProfilingEngine::PcSamplingWithSass;
    gpufl::init(opts);

    // ... allocate and fill d_A, d_B, d_C and choose blocks/threads (omitted) ...

    GFL_SCOPE("row-per-thread") {
        matmul_row_per_thread<<<blocks, threads>>>(d_A, d_B, d_C, M, K, N);
    }

    GFL_SCOPE("col-per-thread") {
        matmul_col_per_thread<<<blocks, threads>>>(d_A, d_B, d_C, M, K, N);
    }

    gpufl::shutdown();
    gpufl::generateReport();
}
```
GPUFlight collects data during normal execution — no debugger, no replay, no kernel serialization.
The Results
Here's the report from an RTX 5060 (Blackwell, sm_120):
```
matmul_row_per_thread (13,268 stall samples)
------------------------------------------------------------------
Stalls:
  Wait                     4,592   34.6%  #######
  Wait (idle)              4,298   32.4%  ######
  Long Scoreboard          1,441   10.9%  ##
  Long Scoreboard (idle)   1,376   10.4%  ##
  Branch Resolving           459    3.5%  #
  Selected                   351    2.6%  #
Instructions:
  Warp Insts:        12,042,560
  Thread Insts:     385,361,920
  Warp Efficiency:   32.0 / 32 (100.0%)
Memory:
  Global Sectors:    69,468,160
  Ideal Sectors:     10,518,528
  Memory Efficiency: 15.1%
Hints:
  * Low memory efficiency (15%) — consider coalesced access
    patterns or shared memory tiling.
```
```
matmul_col_per_thread
------------------------------------------------------------------
Instructions:
  Warp Insts:        10,428,736
  Thread Insts:     333,719,552
  Warp Efficiency:   32.0 / 32 (100.0%)
Memory:
  Global Sectors:    10,518,528
  Ideal Sectors:     10,518,528
  Memory Efficiency: 100.0%
```
Breaking Down the Numbers
Memory Efficiency: 15% vs 100%
This is the main number to look at. GPUFlight measures two things per kernel:
- Global Sectors: actual 32-byte memory sectors transferred
- Ideal Sectors: minimum sectors needed if every access were perfectly coalesced
| Kernel | Actual Sectors | Ideal Sectors | Efficiency | Waste |
|---|---|---|---|---|
| Row-per-thread | 69,468,160 | 10,518,528 | 15.1% | 6.6× |
| Col-per-thread | 10,518,528 | 10,518,528 | 100.0% | 1.0× |
The row-per-thread kernel transfers 6.6× more data than necessary. For each 4-byte float a thread actually needs, the GPU fetches an entire 32-byte sector that no other thread in the warp uses.
Stall Analysis: Where the Time Goes
PC sampling tells us what each warp was doing when sampled:
- Wait (34.6%) + Wait (idle) (32.4%) = 67%
- Long Scoreboard (10.9%) + Long Scoreboard (idle) (10.4%) = 21.3%
In other words, for most of the sampled time the warps are not doing useful math. They are mostly waiting for memory.
This is the part I like most about seeing the data together: the memory inefficiency is not just an abstract metric. You can see it show up directly in the stall breakdown.
The col-per-thread kernel has so few stalls that PC sampling barely accumulates much data there. It simply finishes too quickly.
Wall-Clock Impact
```
Row-per-thread (uncoalesced): 245 ms
Col-per-thread (coalesced):   155 ms
Speedup:                      1.6×
```
The coalesced version is 1.6× faster on this setup.
That is already a meaningful gain, and this is from a very small change in how work is mapped to threads.
Warp Efficiency Can Be Misleading
Both kernels show 100% warp efficiency (32/32 active threads). That means there is no thread divergence here. Every thread in each warp follows the same control flow.
If you only looked at warp efficiency, both kernels would look healthy.
But they are not equally healthy. The real problem is memory access, and memory efficiency exposes it immediately.
What GPUFlight Collects Under the Hood
GPUFlight uses two CUPTI mechanisms that run during normal execution:
SASS Metrics — The GPU binary is patched at load time to count per-instruction execution, thread activity, and memory sector usage. This is how we get the Global Sectors and Ideal Sectors numbers. No sampling bias — every instruction is counted.
PC Sampling — The hardware periodically interrupts each SM and records what every warp is doing: executing, or stalled and why. This gives us the stall reason distribution (Wait, Long Scoreboard, etc.).
GPUFlight also disassembles the GPU binary (SASS assembly) so you can see exactly which instructions are hot:
```
/*0x2a0*/  LDG.E.CONSTANT R20, desc[UR12][R18.64]   ← memory load (hot!)
/*0x2c0*/  LDG.E.CONSTANT R22, desc[UR12][R16.64]   ← memory load (hot!)
/*0x340*/  FFMA R35, R20, R21, R37                  ← fused multiply-add
```
The LDG.E.CONSTANT instructions are the global memory loads. In the row-per-thread kernel, these are where 67% of the time is spent waiting.
The Fix Is One Line
The entire difference between 15% and 100% memory efficiency comes down to which dimension you assign to threadIdx.x:
```diff
- int row = blockIdx.x * blockDim.x + threadIdx.x;  // threads map to rows
+ int col = blockIdx.x * blockDim.x + threadIdx.x;  // threads map to columns
```
That's it. Same algorithm, same math, same number of operations. Just a different mapping of threads to data.
Try It Yourself
The complete example is available as memory_coalescing_demo.cu in the GPUFlight client repository. To run it:
```sh
# Build with GPUFlight
cmake -B build -DCMAKE_CUDA_ARCHITECTURES=native
cmake --build build --target memory_coalescing_demo

# Run (admin/root for PC sampling on some platforms)
./build/example/cuda/memory_coalescing_demo
```
Final Thought
Memory coalescing is one of those concepts that sounds simple when explained in theory, but it becomes much more convincing when you can see the numbers in a real kernel.
In this example, it is not a tiny optimization. It is the difference between 15% and 100% memory efficiency, 6.6× more memory traffic than necessary, and a 1.6× wall-clock slowdown.
That is why memory access patterns matter so much on GPUs.