Notes on CUDA Tensor Core GEMM (WMMA)
2026-05-31 · CUDA / GPU kernels
Working notes on writing a matrix-multiply (GEMM) kernel in CUDA and climbing from a naive
implementation to Tensor Cores via the WMMA API — and, just as important, how to know
how good your kernel actually is by measuring it against the cuBLAS ceiling. GEMM is the
right thing to understand deeply: it is the operation underneath every linear layer and
attention projection in an LLM.
1. Why GEMM is the kernel that matters
A transformer forward pass is, to a first approximation, a stack of GEMMs. If you understand
what makes a GEMM kernel fast on a GPU, you understand where inference latency comes from and
why Tensor Cores exist. The progression below is the standard pedagogy — each step removes the
bottleneck the previous one exposed.
2. Naive GEMM: memory-bound by construction
The textbook kernel assigns one thread per output element C[i][j] and loops over k:
C[i][j] = Σ_k A[i][k] * B[k][j]
It's correct and it's slow. Every thread re-reads entire rows of A and columns of B from
global memory, so the same values are fetched from DRAM hundreds of times. The kernel is
memory-bandwidth-bound — the arithmetic units sit idle waiting on loads. The arithmetic
intensity (FLOPs per byte) is far too low to approach peak.
3. Tiled GEMM: shared memory turns the problem compute-bound
The fix is shared-memory tiling. Threads in a block cooperatively load a tile of A and a
tile of B into fast on-chip shared memory, then every thread in the block reuses those
tiles for its partial sums before loading the next tile:
- Load a
TILE × TILEblock ofAand ofBinto shared memory (coalesced). -
__syncthreads(). - Each thread accumulates its
Cpartial product from the shared tiles. -
__syncthreads(), advance to the next tile alongk.
This raises arithmetic intensity by a factor of TILE: each value loaded from global memory is
now reused TILE times. The kernel crosses from memory-bound toward compute-bound — now
the FP32 ALUs are the limit. This is the single biggest jump, and it's pure data-movement
strategy, not math.
4. Tensor Cores via WMMA: a different compute unit
Tiling saturates the CUDA cores. Tensor Cores are a separate unit that does a small
matrix-multiply-accumulate (MMA) in one instruction — e.g. a 16×16×16 D = A·B + C per warp,
on half-precision inputs with FP32 accumulation. The WMMA (Warp Matrix Multiply-Accumulate)
API exposes them in CUDA C++:
using namespace nvcuda::wmma;
fragment<matrix_a, 16,16,16, half, row_major> a_frag;
fragment<matrix_b, 16,16,16, half, col_major> b_frag;
fragment<accumulator, 16,16,16, float> c_frag;
fill_fragment(c_frag, 0.0f);
for (int k = 0; k < K; k += 16) {
load_matrix_sync(a_frag, A + ..., lda);
load_matrix_sync(b_frag, B + ..., ldb);
mma_sync(c_frag, a_frag, b_frag, c_frag); // Tensor Core MMA
}
store_matrix_sync(C + ..., c_frag, ldc, mem_row_major);
The mental model shifts from "threads computing elements" to "warps cooperating on
fragments." A fragment is an opaque, register-resident tile; you don't index its elements,
you feed whole fragments to mma_sync. Inputs are FP16/BF16 (or FP8/FP4 on newer
architectures), accumulation is FP32 — which is why mixed precision is the native language of
Tensor Cores.
5. The number that tells the truth: % of cuBLAS
A hand-written WMMA kernel will beat your tiled kernel, but it will not beat cuBLAS — and
that's the point. cuBLAS is the practical ceiling (it does register-tiling, double-buffering,
swizzled layouts, and architecture-specific tuning you won't replicate in an afternoon). So the
honest metric isn't raw TFLOP/s, it's percent of the cuBLAS ceiling on the same GPU:
| Kernel | Typical regime |
|---|---|
| Naive | a few % of peak — memory-bound |
| Tiled (shared memory) | much better, still CUDA-core bound |
| WMMA (Tensor Core) | a meaningful fraction of cuBLAS |
| cuBLAS | the ceiling (100%) |
Reporting "X % of cuBLAS on sm_90 and sm_120" is a self-describing result: it's reproducible,
it normalises across GPUs, and it's honest about the gap to a production library. Profiling the
WMMA kernel with Nsight Compute then tells you which wall you're against — memory
throughput, Tensor Core utilisation, or occupancy.
6. Why this matters going to Blackwell
Each GPU generation widens what the MMA unit accepts: Hopper added FP8, Blackwell adds
FP4 and a new generation of Tensor Core instructions (tcgen05). The WMMA mental model —
fragments fed to an MMA, FP32 accumulation — carries forward; what changes is the input
precision and the tile shapes. Understanding the FP16 WMMA path is the on-ramp to reasoning
about NVFP4 inference on Blackwell.
Takeaway
GEMM performance is a ladder: naive is memory-bound, shared-memory tiling makes it
compute-bound, and WMMA moves the compute onto Tensor Cores with mixed precision. Measure
every rung as % of cuBLAS on the same GPU — that's the metric that's honest about how close
you are to the ceiling and portable across Hopper and Blackwell.
→ More field notes on the NVIDIA stack: waynehacking8.github.io
Top comments (0)