member_2e5ba30f

Posted on May 31 • Originally published at waynehacking8.github.io

Notes on CUDA Tensor Core GEMM (WMMA)

#cuda #gpu #cpp #performance

Notes on CUDA Tensor Core GEMM (WMMA)

2026-05-31 · CUDA / GPU kernels

Working notes on writing a matrix-multiply (GEMM) kernel in CUDA and climbing from a naive
implementation to Tensor Cores via the WMMA API — and, just as important, how to know
how good your kernel actually is by measuring it against the cuBLAS ceiling. GEMM is the
right thing to understand deeply: it is the operation underneath every linear layer and
attention projection in an LLM.

1. Why GEMM is the kernel that matters

A transformer forward pass is, to a first approximation, a stack of GEMMs. If you understand
what makes a GEMM kernel fast on a GPU, you understand where inference latency comes from and
why Tensor Cores exist. The progression below is the standard pedagogy — each step removes the
bottleneck the previous one exposed.

2. Naive GEMM: memory-bound by construction

The textbook kernel assigns one thread per output element C[i][j] and loops over k:

C[i][j] = Σ_k A[i][k] * B[k][j]

It's correct and it's slow. Every thread re-reads entire rows of A and columns of B from
global memory, so the same values are fetched from DRAM hundreds of times. The kernel is
memory-bandwidth-bound — the arithmetic units sit idle waiting on loads. The arithmetic
intensity (FLOPs per byte) is far too low to approach peak.

3. Tiled GEMM: shared memory turns the problem compute-bound

The fix is shared-memory tiling. Threads in a block cooperatively load a tile of A and a
tile of B into fast on-chip shared memory, then every thread in the block reuses those
tiles for its partial sums before loading the next tile:

Load a TILE × TILE block of A and of B into shared memory (coalesced).
__syncthreads().
Each thread accumulates its C partial product from the shared tiles.
__syncthreads(), advance to the next tile along k.

This raises arithmetic intensity by a factor of TILE: each value loaded from global memory is
now reused TILE times. The kernel crosses from memory-bound toward compute-bound — now
the FP32 ALUs are the limit. This is the single biggest jump, and it's pure data-movement
strategy, not math.

4. Tensor Cores via WMMA: a different compute unit

Tiling saturates the CUDA cores. Tensor Cores are a separate unit that does a small
matrix-multiply-accumulate (MMA) in one instruction — e.g. a 16×16×16 D = A·B + C per warp,
on half-precision inputs with FP32 accumulation. The WMMA (Warp Matrix Multiply-Accumulate)
API exposes them in CUDA C++:

using namespace nvcuda::wmma;
fragment<matrix_a, 16,16,16, half, row_major> a_frag;
fragment<matrix_b, 16,16,16, half, col_major> b_frag;
fragment<accumulator, 16,16,16, float>        c_frag;

fill_fragment(c_frag, 0.0f);
for (int k = 0; k < K; k += 16) {
    load_matrix_sync(a_frag, A + ..., lda);
    load_matrix_sync(b_frag, B + ..., ldb);
    mma_sync(c_frag, a_frag, b_frag, c_frag);   // Tensor Core MMA
}
store_matrix_sync(C + ..., c_frag, ldc, mem_row_major);

The mental model shifts from "threads computing elements" to "warps cooperating on
fragments." A fragment is an opaque, register-resident tile; you don't index its elements,
you feed whole fragments to mma_sync. Inputs are FP16/BF16 (or FP8/FP4 on newer
architectures), accumulation is FP32 — which is why mixed precision is the native language of
Tensor Cores.

5. The number that tells the truth: % of cuBLAS

A hand-written WMMA kernel will beat your tiled kernel, but it will not beat cuBLAS — and
that's the point. cuBLAS is the practical ceiling (it does register-tiling, double-buffering,
swizzled layouts, and architecture-specific tuning you won't replicate in an afternoon). So the
honest metric isn't raw TFLOP/s, it's percent of the cuBLAS ceiling on the same GPU:

Kernel	Typical regime
Naive	a few % of peak — memory-bound
Tiled (shared memory)	much better, still CUDA-core bound
WMMA (Tensor Core)	a meaningful fraction of cuBLAS
cuBLAS	the ceiling (100%)

Reporting "X % of cuBLAS on sm_90 and sm_120" is a self-describing result: it's reproducible,
it normalises across GPUs, and it's honest about the gap to a production library. Profiling the
WMMA kernel with Nsight Compute then tells you which wall you're against — memory
throughput, Tensor Core utilisation, or occupancy.

6. Why this matters going to Blackwell

Each GPU generation widens what the MMA unit accepts: Hopper added FP8, Blackwell adds
FP4 and a new generation of Tensor Core instructions (tcgen05). The WMMA mental model —
fragments fed to an MMA, FP32 accumulation — carries forward; what changes is the input
precision and the tile shapes. Understanding the FP16 WMMA path is the on-ramp to reasoning
about NVFP4 inference on Blackwell.

Takeaway

GEMM performance is a ladder: naive is memory-bound, shared-memory tiling makes it
compute-bound, and WMMA moves the compute onto Tensor Cores with mixed precision. Measure
every rung as % of cuBLAS on the same GPU — that's the metric that's honest about how close
you are to the ceiling and portable across Hopper and Blackwell.

→ More field notes on the NVIDIA stack: waynehacking8.github.io

DEV Community

Notes on CUDA Tensor Core GEMM (WMMA)

Notes on CUDA Tensor Core GEMM (WMMA)

1. Why GEMM is the kernel that matters

2. Naive GEMM: memory-bound by construction

3. Tiled GEMM: shared memory turns the problem compute-bound

4. Tensor Cores via WMMA: a different compute unit

5. The number that tells the truth: % of cuBLAS

6. Why this matters going to Blackwell

Takeaway

Top comments (0)