NVIDIA CUTLASS Review: CUDA Templates for GEMM Kernels Behind Modern LLMs

If you have ever profiled an LLM inference path and watched 90 percent of the time evaporate inside a single matrix multiply, you already know why CUTLASS exists. General matrix-matrix multiplication (GEMM) is the hot loop of nearly every transformer, most diffusion models, and most deep-learning training steps. cuBLAS handles the common cases well, but the moment you need a non-standard data type, a fused epilogue, or a tile shape the vendor library does not ship, you fall off a cliff. CUTLASS is NVIDIA's open-source answer to that cliff: a hierarchy of CUDA C++ templates (and now Python DSLs) that exposes the building blocks of a high-performance GEMM, so you can assemble custom kernels without rewriting warp-level matrix instructions from scratch.

We spent time reading the CUTLASS source tree on GitHub and tracing how it is consumed by projects like FlashAttention, vLLM, and TensorRT-LLM. This review is for developers building inference engines, training frameworks, or fused kernels — not for application engineers who can stay on PyTorch.

What CUTLASS Actually Gives You

CUTLASS decomposes a GEMM into a stack of abstractions: thread-block tile, warp tile, instruction tile, and the data movement required to feed each level. You compose a kernel by picking template parameters at each layer (tile shapes, swizzling patterns, pipelining stages, epilogue functors), and the compiler emits a specialized kernel for that exact configuration. The library ships reference kernels for FP32, FP16, BF16, TF32, INT8, FP8 (E4M3 and E5M2) on Hopper and later, and the newer narrow-precision formats such as FP4 and FP6 on Blackwell.
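
To make the tile hierarchy concrete, here is a minimal sketch in plain Python. It is a conceptual model, not CUTLASS code: the function names, the `block` and `warp` parameters, and the tile sizes are all our own illustrative assumptions, and the innermost loops merely stand in for what a Tensor Core instruction would do in hardware.

```python
# Conceptual model only: nested loops standing in for the tile hierarchy.
# Names and tile sizes are hypothetical, not CUTLASS template parameters.

def tiled_gemm(A, B, M, N, K, block=(4, 4, 2), warp=(2, 2)):
    """Compute C = A @ B (lists of lists) by walking tiles level by level."""
    bm, bn, bk = block
    wm, wn = warp
    C = [[0.0] * N for _ in range(M)]
    # Level 1: each (m0, n0) pair is one thread-block tile of the output.
    for m0 in range(0, M, bm):
        for n0 in range(0, N, bn):
            # Level 2: the block tile is split into warp tiles.
            for w_m in range(m0, min(m0 + bm, M), wm):
                for w_n in range(n0, min(n0 + bn, N), wn):
                    # Mainloop: step through K in chunks of bk, the unit
                    # that would be staged through shared memory.
                    for k0 in range(0, K, bk):
                        # Level 3: stand-in for one instruction tile.
                        for i in range(w_m, min(w_m + wm, m0 + bm, M)):
                            for j in range(w_n, min(w_n + wn, n0 + bn, N)):
                                for k in range(k0, min(k0 + bk, K)):
                                    C[i][j] += A[i][k] * B[k][j]
    return C


def reference_gemm(A, B, M, N, K):
    """Untiled reference used to check the tiled version."""
    return [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
            for i in range(M)]
```

Changing `block` or `warp` changes the order of the work but not the result, which is roughly the property the real templates rely on: the tile shapes are a performance choice layered on top of the same arithmetic.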

The practical upshot is that you get three things you cannot easily build yourself:

Tensor Core access without inline PTX. The mma and wgmma instructions are wrapped in C++ types you can compose, instead of assembly you have to maintain across architectures.
Asynchronous data movement primitives. TMA (Tensor Memory Accelerator) on Hopper and the older cp.async family are exposed through the CuTe layer, so you can build software pipelines without hand-rolling barriers.
Fused epilogues. Bias add, ReLU, GELU, residual add, dequantize — all of these can be chained into the GEMM tail so you avoid a separate kernel launch and the extra round-trip to global memory.
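
The fused-epilogue point is easiest to see by counting memory writes. The sketch below is a plain-Python model under our own assumptions: `stores` is a hypothetical counter standing in for writes to global memory, and the function names are ours, not anything CUTLASS exposes.

```python
# Conceptual model: compare an unfused GEMM + bias + ReLU with a fused one.
# "stores" counts writes to the output buffer as a proxy for global-memory traffic.

def relu(x):
    return x if x > 0.0 else 0.0


def gemm_then_bias_relu(A, B, bias):
    """Unfused: the GEMM stores C, then a second pass reloads and rewrites it."""
    M, K, N = len(A), len(B), len(B[0])
    stores = 0
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc  # first write of every output element
            stores += 1
    for i in range(M):  # models the separate kernel launch
        for j in range(N):
            C[i][j] = relu(C[i][j] + bias[j])  # second write
            stores += 1
    return C, stores


def gemm_fused_bias_relu(A, B, bias):
    """Fused: bias and ReLU are applied to the accumulator before the only store."""
    M, K, N = len(A), len(B), len(B[0])
    stores = 0
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):
                acc += A[i][k] * B[k][j]
            C[i][j] = relu(acc + bias[j])  # epilogue runs on the accumulator
            stores += 1
    return C, stores
```

Both functions return the same matrix, but the unfused version writes each output element twice. On a GPU that second pass also pays for a kernel launch and a full re-read of C, which is the cost the epilogue fusion removes.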

The trade-off is that template error messages are notorious. A single wrong tile shape produces hundreds of lines of substitution-failure noise, and the iteration loop (edit, nvcc, wait) is slow on a cold cache.

CUTLASS compile times are not a rumor. A single non-trivial GEMM kernel can take 30 to 90 seconds to compile per architecture, and a full library build crosses an hour on a workstation. Budget for ccache, sccache, or a build server before you commit to a CUTLASS-heavy codebase. The Python DSL path mitigates this for prototyping, but the C++ templates remain the production target.
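
A quick back-of-the-envelope calculation shows how those per-kernel times add up. The kernel count, architecture count, and job count below are hypothetical inputs we picked for illustration, and the helper assumes perfect parallel scaling, which real builds will not reach.

```python
# Rough build-time estimate. All inputs are illustrative assumptions.

def build_minutes(kernels, archs, secs_per_kernel, parallel_jobs=1):
    """Estimated wall-clock minutes, assuming ideal scaling across jobs."""
    total_seconds = kernels * archs * secs_per_kernel
    return total_seconds / parallel_jobs / 60.0


# Hypothetical project: 40 kernel instantiations built for 3 architectures.
low_serial = build_minutes(40, 3, 30)       # 30 s per kernel, one job
high_serial = build_minutes(40, 3, 90)      # 90 s per kernel, one job
high_parallel = build_minutes(40, 3, 90, parallel_jobs=8)
```

With those made-up inputs the serial estimate spans one to three hours, which is why a compiler cache or a build server tends to pay for itself quickly.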

CuTe DSL and the Python On-Ramp

The biggest recent shift in CUTLASS is the introduction of CuTe, a layout algebra that describes how logical tensor coordinates map to physical memory, and more recently a Python DSL on top of it. CuTe arrived with the CUTLASS 3.x line; the Python DSL followed in 4.x. CuTe gives you a uniform vocabulary for tiles, shapes, strides, and partitioning across threads. The Python DSL lets you author kernels with the same semantics as the C++ templates, then JIT-compile them. For prototyping a custom attention variant or a quantization scheme, this collapses the edit-compile-test loop from minutes to seconds.
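
The core idea of a layout, a shape paired with strides that maps a logical coordinate to a memory offset, fits in a few lines. This is a plain-Python illustration of the concept, not the CuTe API: `make_layout` and `tile_offsets` are hypothetical helpers we wrote for this sketch, and real CuTe layouts also support nesting and composition that are not modeled here.

```python
# Conceptual model of a layout: offset = sum(coord[i] * stride[i]).
# Helper names are our own; this is not the CuTe C++ or Python interface.

def make_layout(shape, stride):
    """Return a function mapping a logical coordinate to a linear offset."""
    def offset(*coord):
        if len(coord) != len(shape):
            raise ValueError("coordinate rank does not match layout rank")
        for c, extent in zip(coord, shape):
            if not 0 <= c < extent:
                raise IndexError("coordinate out of bounds")
        return sum(c * d for c, d in zip(coord, stride))
    return offset


def tile_offsets(layout, tile_shape, tile_coord):
    """Offsets touched by one 2D tile, listed row by row within the tile."""
    th, tw = tile_shape
    ti, tj = tile_coord
    return [layout(ti * th + r, tj * tw + c)
            for r in range(th) for c in range(tw)]


# The same 4x8 logical tensor under two physical arrangements.
row_major = make_layout((4, 8), (8, 1))
col_major = make_layout((4, 8), (1, 4))
```

The logical coordinate (2, 3) lands at offset 19 in the row-major layout and 14 in the column-major one. Code written against the layout function does not need to know which arrangement it was handed, which is a large part of what makes tiling and partitioning composable.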

This matters because the audience for low-level GPU work has widened. Five years ago, the people writing custom GEMMs were a small group of HPC and graphics veterans. Today, a researcher at a six-person startup might need a fused FP8 GEMM with a custom dequantization epilogue to ship their inference path. The Python DSL is NVIDIA's bet that you should not need to learn 4,000 lines of template metaprogramming to get there.

That said, the Python DSL is newer and the documentation surface is thinner. Expect to read C++ examples and translate. If you are exploring kernels interactively, an AI-aware editor that can keep both languages in context shortens the loop considerably.

When to Reach for CUTLASS — and When Not To

The honest answer is that most teams should not write CUTLASS kernels. If cuBLAS, cuBLASLt, or PyTorch's native operators cover your shapes and data types, use them. NVIDIA ships hand-tuned heuristics that select kernels at runtime, and you will rarely beat them on the standard cases.

CUTLASS earns its place when one of these conditions holds:

You need a data type or layout the vendor library does not ship — for example, an INT4 weight-only GEMM, a grouped GEMM for mixture-of-experts routing, or a sparse format with a specific compression scheme.
You need to fuse a non-trivial epilogue that cannot be expressed as a simple bias-and-activation, such as a per-channel quantization scale fused with a residual add.
You are building a kernel library that will outlive cuBLAS releases, and you need control over the exact instructions emitted so you can profile and tune deterministically.
You are doing research on new tile schedules, pipelining strategies, or async patterns and need primitives at the warp and thread-block level.
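
As an illustration of the second condition, here is what a per-channel quantization scale fused with a residual add looks like when written out. This is a plain-Python sketch under our own assumptions: the function name, the per-output-column scale, and the integer inputs standing in for INT8 data are all illustrative, and none of it is a CUTLASS epilogue functor.

```python
# Conceptual model of a non-trivial epilogue: integer accumulation,
# then a per-output-channel scale and a residual add before the store.

def int8_gemm_scale_residual(A_q, B_q, col_scale, residual):
    """Return (A_q @ B_q) * col_scale[j] + residual, computed per element.

    A_q and B_q hold small integers standing in for INT8 values;
    col_scale holds one dequantization scale per output column.
    """
    M, K, N = len(A_q), len(B_q), len(B_q[0])
    out = []
    for i in range(M):
        row = []
        for j in range(N):
            acc = 0  # integer accumulator, as in INT32 accumulation
            for k in range(K):
                acc += A_q[i][k] * B_q[k][j]
            # Epilogue: dequantize and add the residual while the
            # accumulator is still "in registers", then store once.
            row.append(acc * col_scale[j] + residual[i][j])
        out.append(row)
    return out
```

The point of the sketch is the placement of the last line: the scale and the residual read happen per element inside the GEMM tail, so the integer result never has to be written out and read back by a second kernel.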

Projects like FlashAttention-3, vLLM's CUTLASS-backed FP8 path, and several open-source MoE inference stacks fall into the first two buckets. They use CUTLASS because cuBLAS does not, and would not, ship the exact kernel shape they need.

A reasonable adoption path is to start with cuBLASLt's epilogue fusion API for any custom-tail work, move to CUTLASS only when you hit a wall, and treat the Python DSL as your prototyping surface before committing to C++ templates in production.

The Maintenance Question

One underdiscussed cost is architectural drift. CUTLASS kernels written for Ampere often need rework for Hopper (TMA, wgmma) and again for Blackwell. The library tracks each architecture with separate examples and template specializations, so a kernel you ship today may require a parallel implementation for the next generation. If you adopt CUTLASS, plan for one engineer to own kernel maintenance across the architectures you support — this is not a fire-and-forget dependency.

The payoff, when it lands, is real. Teams that have built custom CUTLASS kernels for FP8 inference report end-to-end throughput improvements that would be hard to match by composing existing PyTorch ops. The question is whether your workload, team, and roadmap justify owning a kernel.

Originally published at pickuma.com. Subscribe to the RSS or follow @pickuma.bsky.social for new reviews.