DEV Community: member_2e5ba30f

Notes on CUDA Tensor Core GEMM (WMMA)

member_2e5ba30f — Sun, 31 May 2026 15:26:36 +0000

2026-05-31 · CUDA / GPU kernels

Working notes on writing a matrix-multiply (GEMM) kernel in CUDA and climbing from a naive
implementation to Tensor Cores via the WMMA API — and, just as important, how to know
how good your kernel actually is by measuring it against the cuBLAS ceiling. GEMM is the
right thing to understand deeply: it is the operation underneath every linear layer and
attention projection in an LLM.

1. Why GEMM is the kernel that matters

A transformer forward pass is, to a first approximation, a stack of GEMMs. If you understand
what makes a GEMM kernel fast on a GPU, you understand where inference latency comes from and
why Tensor Cores exist. The progression below is the standard pedagogy — each step removes the
bottleneck the previous one exposed.

2. Naive GEMM: memory-bound by construction

The textbook kernel assigns one thread per output element C[i][j] and loops over k:

C[i][j] = Σ_k A[i][k] * B[k][j]

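It's correct and it's slow. Each thread re-reads a full row of A and a full column of B from
global memory, so across the grid every element of A is fetched once per output column and
every element of B once per output row: hundreds or thousands of times at realistic sizes.
The kernel is memory-bandwidth-bound, with the arithmetic units idle while they wait on loads.
Ignoring caches, each multiply-add (2 FLOPs) needs two 4-byte FP32 loads, an arithmetic
intensity of about 0.25 FLOP/byte, far too low to approach peak.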
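
To make the reuse problem concrete, here is a minimal CPU sketch in Python (not CUDA) that
mimics the one-thread-per-output structure and counts operand fetches. The function name
naive_gemm_loads and the load counter are illustrative assumptions, not part of any kernel in
these notes, and hardware caches are ignored.

```python
def naive_gemm_loads(A, B):
    """Reference GEMM that mimics one thread per C[i][j] and counts operand fetches.

    Each (i, j) pair plays the role of one GPU thread. Caches are ignored, so the
    counter is an upper bound on global-memory traffic, not a measurement.
    """
    n, k_dim, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    loads = 0
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for k in range(k_dim):
                acc += A[i][k] * B[k][j]   # 2 loads, 2 FLOPs (multiply + add)
                loads += 2
            C[i][j] = acc
    return C, loads


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C, loads = naive_gemm_loads(A, B)
flops = 2 * 2 * 2 * 2                 # 2*M*N*K for a 2x2x2 product
intensity = flops / (loads * 4)       # FP32 operands are 4 bytes each
print(C, loads, intensity)
```

The FLOP count and the load count grow together, so the ratio stays fixed no matter how large
the matrices get; that is the sense in which the naive kernel is memory-bound by construction.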

3. Tiled GEMM: shared memory turns the problem compute-bound

The fix is shared-memory tiling. Threads in a block cooperatively load a tile of A and a
tile of B into fast on-chip shared memory, then every thread in the block reuses those
tiles for its partial sums before loading the next tile:

Load a TILE × TILE block of A and of B into shared memory (coalesced).
__syncthreads().
Each thread accumulates its C partial product from the shared tiles.
__syncthreads(), advance to the next tile along k.

This raises arithmetic intensity by roughly a factor of TILE: each value loaded from global
memory is now reused TILE times out of shared memory. The kernel moves from memory-bound
toward compute-bound, where the FP32 ALUs (and shared-memory bandwidth) become the limit
rather than DRAM. This is the single biggest jump, and it's pure data-movement strategy, not
math.
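
The four steps above can be sketched on the CPU in Python. This is an illustration of the
blocking pattern under stated assumptions (square matrices, size divisible by the tile, lists
standing in for shared memory), not a CUDA kernel; tiled_gemm_loads and its counter are
hypothetical names.

```python
def tiled_gemm_loads(A, B, tile):
    """Blocked GEMM on square n x n matrices; counts fetches from 'global memory'.

    Each (bi, bj) pair plays the role of one thread block. The a_tile / b_tile
    lists stand in for shared memory: they are filled once per k-step and then
    reused by every 'thread' of the block.
    """
    n = len(A)
    assert n % tile == 0, "sketch assumes n is a multiple of tile"
    C = [[0.0] * n for _ in range(n)]
    global_loads = 0
    for bi in range(0, n, tile):
        for bj in range(0, n, tile):
            for bk in range(0, n, tile):
                # Cooperative load: tile*tile values of A and of B.
                a_tile = [A[bi + r][bk:bk + tile] for r in range(tile)]
                b_tile = [B[bk + r][bj:bj + tile] for r in range(tile)]
                global_loads += 2 * tile * tile
                # (a __syncthreads() would sit here on the GPU)
                for r in range(tile):
                    for c in range(tile):
                        acc = 0.0
                        for k in range(tile):
                            acc += a_tile[r][k] * b_tile[k][c]
                        C[bi + r][bj + c] += acc
    return C, global_loads


n, tile = 8, 4
A = [[float(i + j) for j in range(n)] for i in range(n)]
B = [[float(i * j % 3) for j in range(n)] for i in range(n)]
C, loads = tiled_gemm_loads(A, B, tile)
naive_loads = 2 * n ** 3              # one A and one B fetch per multiply-add
print(naive_loads // loads)           # reduction factor equals tile
```

The arithmetic is unchanged; only the number of trips to global memory drops, which is the
factor-of-TILE gain in arithmetic intensity.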

4. Tensor Cores via WMMA: a different compute unit

At best, tiling saturates the CUDA cores. Tensor Cores are a separate unit that performs a
small matrix-multiply-accumulate (MMA) as one warp-wide operation: e.g. WMMA's 16×16×16
D = A·B + C per warp, on half-precision inputs with FP32 accumulation. The WMMA (Warp Matrix
Multiply-Accumulate) API exposes them in CUDA C++:

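#include <mma.h>          // nvcuda::wmma
#include <cuda_fp16.h>    // half
using namespace nvcuda::wmma;

// Inside a __global__ kernel; every thread of the warp must execute these calls.
fragment<matrix_a, 16,16,16, half, row_major> a_frag;
fragment<matrix_b, 16,16,16, half, col_major> b_frag;
fragment<accumulator, 16,16,16, float>        c_frag;

fill_fragment(c_frag, 0.0f);                    // zero the FP32 accumulator
for (int k = 0; k < K; k += 16) {               // K assumed to be a multiple of 16
    load_matrix_sync(a_frag, A + ..., lda);     // 16×16 tile of A at this k-step
    load_matrix_sync(b_frag, B + ..., ldb);     // 16×16 tile of B at this k-step
    mma_sync(c_frag, a_frag, b_frag, c_frag);   // Tensor Core MMA
}
store_matrix_sync(C + ..., c_frag, ldc, mem_row_major);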

The mental model shifts from "threads computing elements" to "warps cooperating on
fragments." A fragment is an opaque tile spread across the registers of the warp's 32 threads;
which thread holds which matrix element is unspecified, so you don't index it by (row, col),
you feed whole fragments to mma_sync. Inputs are FP16/BF16 (newer architectures add FP8 and
FP4 through lower-level MMA instructions rather than WMMA) and accumulation is typically FP32,
which is why mixed precision is the native language of Tensor Cores.
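
The fragment loop and the mixed-precision contract can be emulated on the CPU. The sketch
below is Python, not WMMA: to_half, mma_tile, tile_of and emulated_wmma_gemm are hypothetical
helpers, inputs are rounded to FP16 with the standard struct module, and Python's double
precision stands in for the FP32 accumulator.

```python
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE FP16 value (returned as a float)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def mma_tile(c_frag, a_frag, b_frag):
    """One t x t x t multiply-accumulate on whole tiles: a_frag @ b_frag + c_frag."""
    t = len(a_frag)
    return [[c_frag[r][c] + sum(a_frag[r][k] * b_frag[k][c] for k in range(t))
             for c in range(t)] for r in range(t)]


def tile_of(M, row, col, t):
    """Copy a t x t tile out of M, rounding each element to FP16."""
    return [[to_half(M[row + r][col + c]) for c in range(t)] for r in range(t)]


def emulated_wmma_gemm(A, B, t=16):
    """C = A @ B for square matrices, one output tile per emulated warp."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ti in range(0, n, t):
        for tj in range(0, n, t):
            c_frag = [[0.0] * t for _ in range(t)]        # fill_fragment
            for tk in range(0, n, t):
                a_frag = tile_of(A, ti, tk, t)            # load_matrix_sync
                b_frag = tile_of(B, tk, tj, t)
                c_frag = mma_tile(c_frag, a_frag, b_frag) # mma_sync
            for r in range(t):                            # store_matrix_sync
                C[ti + r][tj:tj + t] = c_frag[r]
    return C


n = 32
A = [[(i - j) / 8.0 for j in range(n)] for i in range(n)]   # exactly representable in FP16
B = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]   # identity
C = emulated_wmma_gemm(A, B)
print(C[3][1], to_half(0.1))
```

Note that no code ever touches an individual element of a fragment between load and store;
the only operations are whole-tile load, multiply-accumulate, and store. The second printed
value shows the input rounding that FP16 introduces before any arithmetic happens.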

5. The number that tells the truth: % of cuBLAS

A hand-written WMMA kernel will beat your tiled kernel, but it will not beat cuBLAS — and
that's the point. cuBLAS is the practical ceiling (it does register-tiling, double-buffering,
swizzled layouts, and architecture-specific tuning you won't replicate in an afternoon). So the
honest metric isn't raw TFLOP/s, it's percent of the cuBLAS ceiling on the same GPU:

Kernel	Typical regime
Naive	a few % of peak — memory-bound
Tiled (shared memory)	much better, still CUDA-core bound
WMMA (Tensor Core)	a meaningful fraction of cuBLAS
cuBLAS	the ceiling (100%)

Reporting "X % of cuBLAS on sm_90 and sm_120" is a self-describing result: it's reproducible,
it normalises across GPUs, and it's honest about the gap to a production library. Profiling the
WMMA kernel with Nsight Compute then tells you which wall you're against — memory
throughput, Tensor Core utilisation, or occupancy.
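
The metric itself is simple arithmetic. A minimal sketch, assuming you already have wall-clock
times for your kernel and for cuBLAS on the same problem and GPU; the function names and the
timings below are placeholders, not measurements.

```python
def gemm_tflops(m, n, k, seconds):
    """Achieved TFLOP/s for one M x N x K GEMM (2*M*N*K floating-point ops)."""
    return 2.0 * m * n * k / seconds / 1e12


def percent_of_ceiling(kernel_seconds, cublas_seconds):
    """Same problem, same GPU: the FLOP count cancels, so the ratio of times suffices."""
    return 100.0 * cublas_seconds / kernel_seconds


# Placeholder timings for illustration only, not measurements.
m = n = k = 4096
kernel_s, cublas_s = 2.0e-3, 1.0e-3
tflops = gemm_tflops(m, n, k, kernel_s)
pct = percent_of_ceiling(kernel_s, cublas_s)
print(tflops, pct)
```

Because the FLOP count cancels, the percentage is immune to disagreements about how FLOPs are
counted, which is part of why it travels well between GPUs.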

6. Why this matters going to Blackwell

Each GPU generation widens what the MMA unit accepts: Hopper added FP8, Blackwell adds
FP4 and a new generation of Tensor Core instructions (tcgen05). The WMMA mental model —
fragments fed to an MMA, FP32 accumulation — carries forward; what changes is the input
precision and the tile shapes. Understanding the FP16 WMMA path is the on-ramp to reasoning
about NVFP4 inference on Blackwell.

Takeaway

GEMM performance is a ladder: naive is memory-bound, shared-memory tiling makes it
compute-bound, and WMMA moves the compute onto Tensor Cores with mixed precision. Measure
every rung as % of cuBLAS on the same GPU — that's the metric that's honest about how close
you are to the ceiling and portable across Hopper and Blackwell.

→ More field notes on the NVIDIA stack: waynehacking8.github.io

Notes on Federated Learning and Differential Privacy

member_2e5ba30f — Sun, 31 May 2026 15:19:12 +0000

2026-05-31 · privacy-preserving ML

Working notes on building federated learning (FL) from scratch, what actually breaks under
Non-IID data, and how differential privacy (DP) and secure aggregation fit on top —
including the honest negative results that the marketing slides leave out. They follow the
implementation in federated-learning-lab (FedAvg / FedProx / SCAFFOLD, DP-SGD, secure
aggregation; 33/33 tests, literature cross-validated).

1. What federated learning actually is

The data never moves. Instead of pooling everyone's data on one server, each client trains
locally and sends model updates to a server that aggregates them. The canonical loop
(FedAvg) is:

Server broadcasts the global model.
Each client does a few local SGD epochs on its own data.
Each client sends back its updated weights.
Server averages the weights (weighted by client data size) → new global model.

That's it. The elegance is that raw data stays on-device; the difficulty is that the clients'
data distributions are not identical.
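
The loop above can be sketched in a few lines of plain Python. This is a toy under stated
assumptions (a one-parameter model y = w·x, two made-up clients, no networking); fedavg and
local_sgd are illustrative names and not the lab's API.

```python
def fedavg(client_weights, client_sizes):
    """Weighted average of client weight vectors, weights proportional to data size."""
    total = float(sum(client_sizes))
    dim = len(client_weights[0])
    return [
        sum(w[d] * n for w, n in zip(client_weights, client_sizes)) / total
        for d in range(dim)
    ]


def local_sgd(weights, data, lr=0.1, epochs=1):
    """Toy local training: fit y = w0 * x by SGD on squared error."""
    w = list(weights)
    for _ in range(epochs):
        for x, y in data:
            grad = 2.0 * (w[0] * x - y) * x
            w[0] -= lr * grad
    return w


# Two hypothetical clients whose data follow different slopes (a Non-IID split).
clients = [
    [(1.0, 2.0), (2.0, 4.0)],                            # slope 2, 2 samples
    [(1.0, 4.0), (0.5, 2.0), (1.5, 6.0), (1.0, 4.0)],    # slope 4, 4 samples
]
global_w = [0.0]
for _ in range(20):                                      # communication rounds
    # Broadcast the global model, train locally, collect the updated weights.
    updates = [local_sgd(global_w, data) for data in clients]
    # Server-side weighted average becomes the new global model.
    global_w = fedavg(updates, [len(d) for d in clients])
print(global_w)
```

Even in this toy, the averaged model settles between the two clients' preferred slopes rather
than at either one, which is the heterogeneity problem in miniature.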

2. The Non-IID problem (where FedAvg starts to hurt)

FedAvg implicitly assumes every client sees roughly the same distribution. Real clients don't —
one hospital sees different cases than another, one phone's keyboard sees different language.
Under Non-IID data, each client's local optimum pulls in a different direction, so averaging
their updates produces client drift: the global model lands somewhere none of them wanted.

Two well-known fixes, both implemented and measured in the lab:

FedProx — add a proximal term that penalises drifting too far from the global model. Stabilises training when clients are heterogeneous.
SCAFFOLD — track control variates (correction terms) that estimate and subtract the drift direction. More state to communicate, but corrects the bias FedProx only damps.
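
Both fixes amount to a change in the local update rule. The sketch below shows one local step
for each, following the update rules in the FedProx and SCAFFOLD papers; function names and the
example numbers are illustrative, and the bookkeeping that refreshes SCAFFOLD's control
variates after each round is omitted.

```python
def fedprox_step(w, grad, w_global, lr, mu):
    """One local SGD step on loss + (mu/2) * ||w - w_global||^2."""
    return [wi - lr * (gi + mu * (wi - wgi))
            for wi, gi, wgi in zip(w, grad, w_global)]


def scaffold_step(w, grad, c_local, c_global, lr):
    """One local SGD step with the SCAFFOLD correction (grad - c_local + c_global)."""
    return [wi - lr * (gi - ci + cg)
            for wi, gi, ci, cg in zip(w, grad, c_local, c_global)]


w_global = [1.0, 1.0]
w = [2.0, 0.0]                       # a client that has drifted from the global model
grad = [0.5, -0.5]
prox = fedprox_step(w, grad, w_global, lr=0.1, mu=1.0)
# If the client's control variate equals its gradient and the global one is zero,
# the corrected step cancels entirely: the drift direction has been subtracted.
scaf = scaffold_step(w, grad, c_local=[0.5, -0.5], c_global=[0.0, 0.0], lr=0.1)
print(prox, scaf)
```

With mu = 0 the FedProx step reduces to plain SGD, so FedAvg is the special case; the proximal
term only pulls harder the further the client has wandered.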

The honest finding worth repeating: on a strongly Non-IID split (e.g. label-skewed MNIST), the
fancy methods don't always beat plain FedAvg by much — and sometimes the dominant lever is
just more communication rounds. Reporting the case where your method doesn't win is what
separates a lab from a brochure.

3. Differential privacy: the model still leaks

Keeping data on-device is not privacy. Model updates leak information about the data that
produced them: membership-inference attacks reveal whether a given sample was in the training
set, and gradient-inversion attacks can reconstruct training samples from gradients. To get a
real guarantee you add differential privacy.

DP-SGD makes each training step private by:

Per-sample gradient clipping: bound each example's contribution by scaling its gradient to an L2 norm of at most C.
Gaussian noise: add noise with standard deviation σ·C (σ is the noise multiplier) to the sum of the clipped gradients.

The result is a formal (ε, δ) guarantee: adding or removing any single example changes the
probability of any training outcome by at most a factor of e^ε, up to a small failure
probability δ. The cost is the privacy-utility trade-off: smaller ε (stronger privacy) means
more noise and lower accuracy. There is no free lunch; the contribution is measuring the
curve, not claiming privacy is costless.
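
A minimal sketch of one DP-SGD gradient computation, in plain Python. It covers only the
clip-and-noise mechanics; the privacy accounting that turns a noise multiplier and a number of
steps into (ε, δ) is deliberately left out, and the function names and example gradients are
assumptions for illustration.

```python
import math
import random


def clip(grad, max_norm):
    """Scale grad down so its L2 norm is at most max_norm (never scale up)."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0.0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_gradient(per_sample_grads, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    dim = len(per_sample_grads[0])
    clipped = [clip(g, max_norm) for g in per_sample_grads]
    summed = [sum(g[d] for g in clipped) for d in range(dim)]
    sigma = noise_multiplier * max_norm          # noise is calibrated to the clip norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [x / len(per_sample_grads) for x in noisy]


rng = random.Random(0)                           # fixed seed so the sketch is repeatable
grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]    # L2 norms 5.0, 0.5, 10.0
print(dp_sgd_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng))
```

The order matters: clipping must happen per example before the sum, because the noise scale is
only justified if no single example can contribute more than C. Clipping the averaged gradient
instead is exactly the kind of bug that silently voids the guarantee.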

4. Secure aggregation: hide the individual update

DP bounds what the final model leaks. Secure aggregation addresses a different threat: a
curious server seeing each client's individual update. With secure aggregation, clients mask
their updates so the server can compute only the sum — no single client's contribution is
visible — yet the masks cancel in aggregate. DP (what the model leaks) and secure aggregation
(what the server sees) are complementary, not substitutes.
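
The cancellation can be shown with pairwise additive masks over integers. This is a sketch of
the core idea only: a seeded random generator stands in for the pairwise secrets a real
protocol would derive from a key exchange, updates are assumed to be quantised to integers,
and client dropout is not handled. All names are illustrative.

```python
import random

MODULUS = 2 ** 32          # updates are assumed quantised to integers mod 2^32


def pairwise_masks(num_clients, dim, seed=0):
    """Client i adds r_ij for j > i and subtracts r_ji for j < i; masks sum to zero."""
    rng = random.Random(seed)   # stand-in for per-pair secrets from a key exchange
    masks = [[0] * dim for _ in range(num_clients)]
    for i in range(num_clients):
        for j in range(i + 1, num_clients):
            r = [rng.randrange(MODULUS) for _ in range(dim)]
            for d in range(dim):
                masks[i][d] = (masks[i][d] + r[d]) % MODULUS
                masks[j][d] = (masks[j][d] - r[d]) % MODULUS
    return masks


def mask_update(update, mask):
    """What a client actually sends to the server."""
    return [(u + m) % MODULUS for u, m in zip(update, mask)]


def aggregate(masked_updates):
    """Server-side sum; the pairwise masks cancel modulo MODULUS."""
    dim = len(masked_updates[0])
    return [sum(u[d] for u in masked_updates) % MODULUS for d in range(dim)]


updates = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
masks = pairwise_masks(len(updates), dim=3)
masked = [mask_update(u, m) for u, m in zip(updates, masks)]
print(aggregate(masked))    # [111, 222, 333]
```

Each masked vector on its own looks uniformly random to the server; only the sum is
meaningful.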

5. Why "from scratch" and "33/33 tests"

Privacy ML is exactly the domain where a subtly wrong implementation gives a false sense of
safety — a clipping bug or a miscalibrated noise multiplier silently voids the ε guarantee. So
the lab:

implements each algorithm from scratch (FedAvg / FedProx / SCAFFOLD, plus FedPer / Byzantine-robust / FedAdam / FedLoRA),
cross-validates against the literature so behaviour matches published results, and
ships 33/33 passing tests and explicit negative results.

For privacy and security work, the test suite and the reproduction are the credibility.

Takeaway

Federated learning moves the model, not the data — but on-device ≠ private. Non-IID data breaks
naive averaging (FedProx/SCAFFOLD help, sometimes only a little); DP-SGD buys a formal (ε, δ)
guarantee at a measurable accuracy cost; secure aggregation hides individual updates from the
server. The trustworthy version of all three is the one with the tests and the honest curves.

→ From-scratch implementations, tests, and negative results:
github.com/waynehacking8/federated-learning-lab

Notes on Serving LLMs with TensorRT-LLM and Triton

member_2e5ba30f — Sun, 31 May 2026 15:18:35 +0000

2026-05-31 · LLM serving / NVIDIA stack

These are working notes on taking an open-weights LLM from a Hugging Face checkpoint to a
production-style serving endpoint on the NVIDIA stack — TensorRT-LLM for the engine,
Triton Inference Server for the deployment surface — and benchmarking it honestly against
vLLM on multi-GPU hardware. They follow the harness in trtllm-triton-serving (4× H100, NVLink).

The goal is to move from "I use vLLM" to "I can stand up the NVIDIA inference stack on real
multi-GPU hardware and reason about the trade-offs."

1. The serving pipeline

The path from checkpoint to endpoint has four stages. Each one is a place where a decision
affects latency, throughput, or accuracy:

Checkpoint: a Hugging Face model.
Engine build: compile to a TensorRT-LLM engine for a fixed tensor-parallel degree, precision, and batching policy.
Model repository: wrap the engine in a model repo for Triton's TensorRT-LLM backend (tensorrtllm_backend).
Serving + load test: trtllm-serve (or Triton) exposes an OpenAI-compatible endpoint; a load generator drives it under controlled concurrency.

The key mental shift from vLLM: TensorRT-LLM does ahead-of-time compilation. vLLM is a
runtime that takes the model and serves it; TensorRT-LLM builds an engine specialized to your
GPU, TP degree, and precision first. That build is where the performance comes from, and also
where the rigidity comes from.

2. Tensor parallelism (TP)

For a model that doesn't fit on one GPU — or to cut latency — TensorRT-LLM shards each layer
across GPUs. On a 4× H100 NVLink box, TP=4 means every forward pass does an all-reduce
across the four GPUs over NVLink.

The all-reduce is not free. On this fabric it tops out around 77 % of the NVLink budget
(see the separate NVLink-wall notes). For prefill (large
tensors) you're bandwidth-bound and TP helps. For decode (one token at a time) you're
pinned against the small-message latency floor, and past a point more TP makes decode
slower. Pick TP for the regime you actually serve.

3. Precision: FP16 vs FP8

The engine is built for a specific precision. The two that matter most on Hopper:

Precision	Memory	Throughput	Accuracy risk
FP16	baseline	baseline	none (reference)
FP8	~½ weights + KV-cache	higher	small, model-dependent

FP8 runs on Hopper's FP8 Tensor Cores (the Transformer Engine) and shrinks the weights and,
when an FP8 KV-cache is enabled, the KV-cache as well; the KV-cache is often the real memory
bottleneck for long contexts. The honest move is to measure the accuracy delta on your task
rather than assume FP8 is free; a quantization study belongs in the same harness as the
throughput numbers.
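
Why the KV-cache dominates at long context is easiest to see from its size formula. The sketch
below uses the standard count (keys plus values, per layer, per KV head, per token); the model
shape is hypothetical and chosen only to make the arithmetic concrete.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    """Bytes for keys + values across all layers: 2 * L * H_kv * d * S * B * bytes."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical model shape and workload, not taken from the harness.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192, batch=16)
fp16 = kv_cache_bytes(bytes_per_elem=2, **shape)
fp8 = kv_cache_bytes(bytes_per_elem=1, **shape)
print(fp16 / 2 ** 30, fp8 / 2 ** 30)   # GiB
```

The cache grows linearly in both sequence length and batch size, so halving the bytes per
element translates directly into room for twice the concurrent context.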

4. The batching policy that actually matters

Two features dominate real serving throughput:

In-flight (continuous) batching — new requests join the running batch at the next iteration instead of waiting for the current batch to drain. This is what keeps GPUs busy under bursty traffic; vLLM and TensorRT-LLM both do it.
Paged KV-cache — the KV-cache is allocated in pages, so memory isn't reserved for the worst-case sequence length per request. This is what lets you fit more concurrent sequences.

If a "benchmark" doesn't enable these, it isn't measuring production serving — it's measuring a
toy.
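
The capacity effect of paging can be sketched with a page counter. This is an illustration of
the allocation policy only, not of either engine's allocator; the function names, page size,
budget, and sequence lengths are all made up for the example.

```python
import math


def pages_needed(seq_len, page_size):
    """Number of fixed-size KV pages a sequence of seq_len tokens occupies."""
    return math.ceil(seq_len / page_size)


def max_concurrent(seq_lens, total_pages, page_size, max_seq_len, paged):
    """Admit requests in order until the page budget is exhausted."""
    used, admitted = 0, 0
    for n in seq_lens:
        # Without paging, every request reserves room for the worst-case length.
        cost = pages_needed(n if paged else max_seq_len, page_size)
        if used + cost > total_pages:
            break
        used += cost
        admitted += 1
    return admitted


# Hypothetical workload: mostly short sequences, a budget of 256 pages of 16 tokens.
lens = [100, 300, 50, 700, 120, 80, 2000, 60]
reserved = max_concurrent(lens, 256, 16, 4096, paged=False)
paged = max_concurrent(lens, 256, 16, 4096, paged=True)
print(reserved, paged)
```

With worst-case reservation a single request consumes the whole budget in this example; with
pages, memory tracks the tokens actually held, so the same budget admits the full list.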

5. The benchmark trap: comparing the same work

The single most common mistake in "X vs Y" LLM benchmarks is not decoding the same number of
tokens. If stack A happens to emit shorter completions, it looks faster while doing less work.

The fix used in the harness is a controlled methodology: every request decodes exactly
256 tokens by setting ignore_eos=True and min_tokens=max_tokens. Now throughput and
latency compare identical work across TensorRT-LLM, Triton, and vLLM. Without this, the numbers
are noise.

Metrics worth reporting, all under matched concurrency:

Throughput (tokens/s, total) — the headline.
TTFT (time to first token) — dominated by prefill; what the user feels first.
Inter-token latency — dominated by decode; what the user feels while reading.

6. Triton as the production surface

The measured runs can use TensorRT-LLM's own OpenAI server (trtllm-serve), but the
production path is a model repository for Triton's TensorRT-LLM backend (triton_model_repo/):

It exposes the engine over a hardened, observable server (metrics, health, dynamic batching config) instead of a script.
It's the same control plane you'd use for an ensemble (tokenizer → engine → de-tokenizer) and for multi-model hosting.

Treat trtllm-serve as the fast path for benchmarking and Triton as the path you'd actually
ship behind a gateway.

7. When does TensorRT-LLM win? (the measured answer)

Not always, and it is the measurement, not a vibe, that says in which regime. Across a
matched-work sweep on 4× H100, the result is a concurrency crossover:

TensorRT-LLM (with CUDA graphs) wins at low-to-mid concurrency — the latency-sensitive regime. The ahead-of-time engine plus CUDA-graph capture removes per-iteration launch overhead that dominates when the batch is small, so TTFT and inter-token latency are lower.
vLLM wins at high concurrency — the throughput-saturated regime, where its scheduler keeps the GPU packed and the launch-overhead advantage no longer matters.

One caveat that cost a real bug: CUDA graphs only help if the config actually enables them.
A run that looks like "TensorRT-LLM is barely faster" can be a mis-set graph config; fixing it
moved the low-concurrency number substantially. Always confirm the optimisation you're
crediting is switched on before drawing the curve.

So the decision rule is about your load, not brand loyalty: latency-bound, low/mid
concurrency → TensorRT-LLM + CUDA graphs; throughput-bound, high concurrency → vLLM. The honest
deliverable is a reproducible serve → benchmark loop with documented methodology that draws
that crossover for your hardware.

Takeaway

Serving an LLM well is mostly about three things: putting tensor parallelism in the regime that
helps, enabling continuous batching + paged KV-cache, and measuring the same work across
stacks. The measured crossover: TensorRT-LLM + CUDA graphs win low/mid concurrency (latency),
vLLM wins high concurrency (throughput) — and a Triton control plane is what you'd actually
put in production.

→ Full pipeline, Triton model repo, and the matched-work harness:
github.com/waynehacking8/trtllm-triton-serving

0% vs 50%: Making a RAG Agent Refuse to Hallucinate

member_2e5ba30f — Sun, 31 May 2026 15:12:07 +0000

2026-05-31 · LLM / RAG

A retrieval-augmented agent is only as trustworthy as its behaviour on questions whose answer
isn't in the corpus. The failure mode is quiet: instead of saying "I don't know," the model
invents a confident, well-formed, wrong answer. This post shows a single guardrail that takes
that from common to never — and, crucially, measures it.

Reference architecture: nim-agent-blueprint, agentic RAG on the NVIDIA NIM stack with a
built-in eval harness.

The ablation

The agent loop is plan → retrieve → generate → validate. The interesting variable is the
generation prompt's contract with the retrieved context:

Configuration	Out-of-corpus hallucination rate
Generate freely from context	~50 %
Guarded prompt (answer only from context; otherwise abstain)	0 %

Same model, same retriever, same questions. The only change is a prompt that makes "I can't
answer that from the provided sources" a first-class, rewarded output, plus a validate
step that checks the answer is grounded in retrieved spans before returning it. On in-corpus
questions, retrieval recall@3 stayed at 94–100 %, so the supporting passages still reach the
generator and the guardrail buys safety without costing retrieval coverage.
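
The validate step can be sketched as a gate in front of the return path. The blueprint scores
groundedness with an LLM-as-judge; the version below swaps in a crude token-overlap check so
that it runs with the standard library alone. The function names, the threshold, and the
example corpus are assumptions for illustration.

```python
import re

ABSTAIN = "I can't answer that from the provided sources."


def tokens(text):
    """Lower-cased alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_grounded(answer, spans, min_overlap=0.8):
    """Crude lexical check: share of answer tokens that appear in some retrieved span."""
    answer_tokens = tokens(answer)
    if not answer_tokens or not spans:
        return False
    span_tokens = set().union(*(tokens(s) for s in spans))
    return len(answer_tokens & span_tokens) / len(answer_tokens) >= min_overlap


def validate(answer, spans):
    """Return the answer only if it is grounded; otherwise abstain."""
    return answer if is_grounded(answer, spans) else ABSTAIN


spans = ["The warranty period is 24 months."]
print(validate("The warranty period is 24 months.", spans))
print(validate("The warranty period is 10 years and covers water damage.", spans))
```

A lexical check like this is easy to fool in both directions, which is why a judge model is
the better scorer; the point of the sketch is the control flow, where abstention is an
ordinary return value rather than an error.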

Why "just prompt better" isn't the lesson

The lesson isn't the prompt — it's that the difference between 50 % and 0 % is invisible
without an eval harness. A demo that only asks in-corpus questions looks perfect in both
configurations. You only see the 50 % when you deliberately ask things the corpus can't
answer and score groundedness. So the blueprint ships with:

retrieval hit-rate (is the answer even retrievable?),
answer groundedness via LLM-as-judge (is the answer supported by what was retrieved?),
latency, and OpenTelemetry traces per agent step.

That's the difference between "it works on my five questions" and "here is the number a
partner can hold me to."
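
The two headline numbers are simple ratios over a labelled eval set. A minimal sketch,
assuming each result record says whether the question was answerable from the corpus and
whether the agent abstained; the field names and the records are hypothetical, not the
blueprint's schema.

```python
def hallucination_rate(results):
    """Share of out-of-corpus questions where the agent answered instead of abstaining."""
    out = [r for r in results if not r["in_corpus"]]
    if not out:
        raise ValueError("eval set has no out-of-corpus questions")
    return sum(1 for r in out if not r["abstained"]) / len(out)


def retrieval_hit_rate(results, k=3):
    """Share of in-corpus questions whose gold passage is in the top-k retrieved ids."""
    inc = [r for r in results if r["in_corpus"]]
    if not inc:
        raise ValueError("eval set has no in-corpus questions")
    return sum(1 for r in inc if r["gold_id"] in r["retrieved_ids"][:k]) / len(inc)


# Synthetic records for illustration only.
results = [
    {"in_corpus": True, "gold_id": "d1", "retrieved_ids": ["d1", "d7", "d3"], "abstained": False},
    {"in_corpus": True, "gold_id": "d2", "retrieved_ids": ["d9", "d2", "d4"], "abstained": False},
    {"in_corpus": False, "gold_id": None, "retrieved_ids": ["d5", "d6", "d8"], "abstained": True},
    {"in_corpus": False, "gold_id": None, "retrieved_ids": ["d3", "d1", "d2"], "abstained": False},
]
print(hallucination_rate(results), retrieval_hit_rate(results))
```

The first function raises when the eval set has no out-of-corpus questions on purpose: a
harness that cannot compute the number should fail loudly rather than report a flattering zero.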

Takeaway

For enterprise RAG, abstention is a feature, not a failure. Make "I don't know" a rewarded
output, validate groundedness before returning, and measure the out-of-corpus rate — it's
the number that separates a demo from something you'd put in front of a customer.

→ Runnable blueprint + eval harness:
github.com/waynehacking8/nim-agent-blueprint

Where Tensor-Parallel Inference Hits the NVLink Wall

member_2e5ba30f — Sun, 31 May 2026 15:11:24 +0000

2026-05-31 · GPU / distributed systems

Tensor parallelism splits each layer across GPUs, so every forward pass pays for an
all-reduce over the network fabric. On a single node that fabric is NVLink/NVSwitch — and
how close you get to its theoretical budget decides whether TP helps or hurts. This post
measures it on 4× H100 and explains where the wall is.

Repo with the full harness and CSVs: nccl-collectives-bench.

What was measured

A bandwidth sweep (message size 8 B → 8 GB) of the three collectives that bound distributed
LLM work — all-reduce, all-gather, reduce-scatter — driving the canonical
nvidia/nccl-tests and adding a parser + analysis layer on top. The headline number:

All-reduce bus bandwidth ≈ 366 GB/s, about 77 % of the per-GPU NVLink uni-directional budget on this box. That 77 % is the practical ceiling TP communication runs into; nccl-tests bus bandwidth already folds in the all-reduce traffic factor 2(n−1)/n, so the remaining gap is mostly protocol overhead.
Algorithm ranking at large messages: NVLS > Ring > Tree. NVLink SHARP (NVLS) offloads the reduction into the switch, which is why it pulls ahead once messages are big enough to amortise setup.
A protocol study (Simple / LL / LL128) showing the small-message latency floor — the regime that actually matters for decode, where each token's all-reduce is tiny.
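
For reference, this is how nccl-tests turns a timed collective into the bus-bandwidth figure,
following the convention documented in that project (factor 2(n−1)/n for all-reduce, (n−1)/n
for all-gather and reduce-scatter). The function name and the inputs are placeholders, not
values from the CSVs.

```python
# Correction factors from algorithm bandwidth to bus bandwidth (nccl-tests convention).
BUS_FACTOR = {
    "all_reduce": lambda n: 2.0 * (n - 1) / n,
    "all_gather": lambda n: (n - 1) / n,
    "reduce_scatter": lambda n: (n - 1) / n,
}


def bus_bandwidth(collective, size_bytes, seconds, num_ranks):
    """busbw = algbw * factor, where algbw = size / time (bytes per second)."""
    algbw = size_bytes / seconds
    return algbw * BUS_FACTOR[collective](num_ranks)


# Placeholder inputs: an 8 GB all-reduce across 4 ranks finishing in 40 ms.
busbw = bus_bandwidth("all_reduce", 8e9, 0.04, 4)
print(busbw / 1e9)   # GB/s
```

The factor is what makes the number comparable to a link budget: it converts "bytes the user
asked to reduce" into "bytes each link actually had to carry".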

Why it matters for inference

Training all-reduces gradients on big tensors, so it lives in the bandwidth-bound regime
where 366 GB/s is good news. Decode is the opposite: one token at a time means small
messages, so you're pinned against the latency floor, not the bandwidth ceiling. That is the
real "TP wall" — past a certain TP degree, the per-token all-reduce latency dominates and
adding GPUs makes decode slower, not faster.
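
A two-parameter cost model makes the two regimes visible. The sketch uses the common
latency-plus-bandwidth (alpha-beta) approximation; the latency, bandwidth, and message sizes
are hypothetical round numbers, not measurements from the repo.

```python
def allreduce_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Alpha-beta cost model: fixed latency plus size over bandwidth."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


def latency_share(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Fraction of the all-reduce time spent on the fixed latency term."""
    return latency_s / allreduce_time(message_bytes, latency_s, bandwidth_bytes_per_s)


# Hypothetical parameters for illustration only.
ALPHA = 10e-6                      # 10 microseconds fixed cost per all-reduce
BANDWIDTH = 300e9                  # 300 GB/s effective bandwidth
decode_msg = 16 * 1024             # e.g. one token's activations, 16 KiB
prefill_msg = 256 * 1024 * 1024    # a large prefill tensor, 256 MiB

decode_share = latency_share(decode_msg, ALPHA, BANDWIDTH)
prefill_share = latency_share(prefill_msg, ALPHA, BANDWIDTH)
print(decode_share, prefill_share)
```

Under these assumptions almost all of the decode-sized all-reduce is fixed latency, so extra
bandwidth buys nothing there, while the prefill-sized one is almost entirely bandwidth. The
same fabric gives opposite answers depending on message size.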

The repo also includes an eager-vs-CUDA-Graph comparison of that decode latency wall:
capturing the per-token step as a graph removes launch overhead that would otherwise be
indistinguishable from communication cost — a reminder to measure the right thing before
blaming the fabric.

Takeaway

"Use tensor parallelism" is not free advice. Measure the all-reduce on your fabric, know
your 77 %, and know that the number that decides decode latency is the small-message floor —
not the big-message bandwidth everyone quotes.

→ Methodology, raw CSVs, and the roofline analysis:
github.com/waynehacking8/nccl-collectives-bench