<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: member_2e5ba30f</title>
    <description>The latest articles on DEV Community by member_2e5ba30f (@member_2e5ba30f).</description>
    <link>https://dev.to/member_2e5ba30f</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3961326%2F350e3e35-e37f-4c56-ba89-8643d4858a95.png</url>
      <title>DEV Community: member_2e5ba30f</title>
      <link>https://dev.to/member_2e5ba30f</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/member_2e5ba30f"/>
    <language>en</language>
    <item>
      <title>Notes on CUDA Tensor Core GEMM (WMMA)</title>
      <dc:creator>member_2e5ba30f</dc:creator>
      <pubDate>Sun, 31 May 2026 15:26:36 +0000</pubDate>
      <link>https://dev.to/member_2e5ba30f/notes-on-cuda-tensor-core-gemm-wmma-2m47</link>
      <guid>https://dev.to/member_2e5ba30f/notes-on-cuda-tensor-core-gemm-wmma-2m47</guid>
      <description>&lt;h1&gt;
  
  
  Notes on CUDA Tensor Core GEMM (WMMA)
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;2026-05-31 · CUDA / GPU kernels&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Working notes on writing a matrix-multiply (GEMM) kernel in CUDA and climbing from a naive&lt;br&gt;
implementation to &lt;strong&gt;Tensor Cores&lt;/strong&gt; via the &lt;strong&gt;WMMA&lt;/strong&gt; API and, just as important, how to tell&lt;br&gt;
how good the kernel actually is by measuring it against the &lt;strong&gt;cuBLAS&lt;/strong&gt; ceiling. GEMM is the&lt;br&gt;
right thing to understand deeply: it is the operation underneath every linear layer and&lt;br&gt;
attention projection in an LLM.&lt;/p&gt;
&lt;h2&gt;
  
  
  1. Why GEMM is the kernel that matters
&lt;/h2&gt;

&lt;p&gt;A transformer forward pass is, to a first approximation, a stack of GEMMs. If you understand&lt;br&gt;
what makes a GEMM kernel fast on a GPU, you understand where inference latency comes from and&lt;br&gt;
why Tensor Cores exist. The progression below is the standard pedagogy — each step removes the&lt;br&gt;
bottleneck the previous one exposed.&lt;/p&gt;
&lt;h2&gt;
  
  
  2. Naive GEMM: memory-bound by construction
&lt;/h2&gt;

&lt;p&gt;The textbook kernel assigns one thread per output element &lt;code&gt;C[i][j]&lt;/code&gt; and loops over &lt;code&gt;k&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;C[i][j] = Σ_k A[i][k] * B[k][j]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's correct and it's slow. Every thread reads a full row of &lt;code&gt;A&lt;/code&gt; and a full column of &lt;code&gt;B&lt;/code&gt; from&lt;br&gt;
&lt;strong&gt;global memory&lt;/strong&gt;, so each input value is requested once for every output element that uses it. The kernel is&lt;br&gt;
&lt;strong&gt;memory-bandwidth-bound&lt;/strong&gt; — the arithmetic units sit idle waiting on loads. The arithmetic&lt;br&gt;
intensity (FLOPs per byte) is far too low to approach peak.&lt;/p&gt;
&lt;h2&gt;
  
  
  3. Tiled GEMM: shared memory turns the problem compute-bound
&lt;/h2&gt;

&lt;p&gt;The fix is &lt;strong&gt;shared-memory tiling&lt;/strong&gt;. Threads in a block cooperatively load a tile of &lt;code&gt;A&lt;/code&gt; and a&lt;br&gt;
tile of &lt;code&gt;B&lt;/code&gt; into fast on-chip &lt;strong&gt;shared memory&lt;/strong&gt;, then every thread in the block reuses those&lt;br&gt;
tiles for its partial sums before loading the next tile:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Load a &lt;code&gt;TILE × TILE&lt;/code&gt; block of &lt;code&gt;A&lt;/code&gt; and of &lt;code&gt;B&lt;/code&gt; into shared memory (coalesced).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;__syncthreads()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Each thread accumulates its &lt;code&gt;C&lt;/code&gt; partial product from the shared tiles.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;__syncthreads()&lt;/code&gt;, advance to the next tile along &lt;code&gt;k&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This raises arithmetic intensity by a factor of &lt;code&gt;TILE&lt;/code&gt;: each value loaded from global memory is&lt;br&gt;
now reused &lt;code&gt;TILE&lt;/code&gt; times. The kernel crosses from memory-bound toward &lt;strong&gt;compute-bound&lt;/strong&gt; — now&lt;br&gt;
the FP32 ALUs are the limit. This is the single biggest jump, and it's pure data-movement&lt;br&gt;
strategy, not math.&lt;/p&gt;
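
&lt;p&gt;The four steps above can be sketched on the CPU in plain Python. This is only a model of the loop structure, not a GPU kernel: it assumes square matrices whose size is a multiple of a hypothetical &lt;code&gt;TILE&lt;/code&gt; constant, and the names &lt;code&gt;tiled_matmul&lt;/code&gt; and &lt;code&gt;naive_loads&lt;/code&gt; are illustrative. Its one job is to count global-memory loads so the factor-of-&lt;code&gt;TILE&lt;/code&gt; reuse is visible.&lt;/p&gt;

```python
TILE = 2  # hypothetical tile edge; real kernels commonly use 16 or 32


def tiled_matmul(A, B, n):
    """Multiply two n x n matrices tile by tile, counting 'global' loads."""
    assert n % TILE == 0, "sketch assumes n is a multiple of TILE"
    C = [[0.0] * n for _ in range(n)]
    global_loads = 0
    # One (bi, bj) pair plays the role of one thread block.
    for bi in range(0, n, TILE):
        for bj in range(0, n, TILE):
            for bk in range(0, n, TILE):
                # Step 1: stage one tile of A and one tile of B ("shared memory").
                a_tile = [[A[bi + r][bk + c] for c in range(TILE)] for r in range(TILE)]
                b_tile = [[B[bk + r][bj + c] for c in range(TILE)] for r in range(TILE)]
                global_loads += 2 * TILE * TILE
                # Step 3: every "thread" (r, c) accumulates from the staged tiles.
                for r in range(TILE):
                    for c in range(TILE):
                        C[bi + r][bj + c] += sum(
                            a_tile[r][k] * b_tile[k][c] for k in range(TILE)
                        )
    return C, global_loads


def naive_loads(n):
    """Loads issued by the one-thread-per-element kernel: 2n per output."""
    return 2 * n * n * n
```

&lt;p&gt;For any &lt;code&gt;n&lt;/code&gt; that is a multiple of &lt;code&gt;TILE&lt;/code&gt;, the tiled version issues &lt;code&gt;naive_loads(n) / TILE&lt;/code&gt; loads for the same arithmetic, which is the arithmetic-intensity gain described above.&lt;/p&gt;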
&lt;h2&gt;
  
  
  4. Tensor Cores via WMMA: a different compute unit
&lt;/h2&gt;

&lt;p&gt;Tiling saturates the &lt;em&gt;CUDA cores&lt;/em&gt;. Tensor Cores are a &lt;strong&gt;separate&lt;/strong&gt; unit that does a small&lt;br&gt;
matrix-multiply-accumulate (MMA) as one warp-wide operation, e.g. a 16×16×16 &lt;code&gt;D = A·B + C&lt;/code&gt; per warp,&lt;br&gt;
on half-precision inputs with FP32 accumulation. The &lt;strong&gt;WMMA&lt;/strong&gt; (Warp Matrix Multiply-Accumulate)&lt;br&gt;
API exposes them in CUDA C++:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;mma.h&amp;gt;&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="k"&gt;namespace&lt;/span&gt; &lt;span class="n"&gt;nvcuda&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;wmma&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;fragment&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;matrix_a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;half&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row_major&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a_frag&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;fragment&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;matrix_b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;half&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col_major&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;b_frag&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;fragment&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;accumulator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;        &lt;span class="n"&gt;c_frag&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;fill_fragment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c_frag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;load_matrix_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a_frag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;...,&lt;/span&gt; &lt;span class="n"&gt;lda&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;load_matrix_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b_frag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;...,&lt;/span&gt; &lt;span class="n"&gt;ldb&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;mma_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c_frag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a_frag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b_frag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c_frag&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;   &lt;span class="c1"&gt;// Tensor Core MMA&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;store_matrix_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;...,&lt;/span&gt; &lt;span class="n"&gt;c_frag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ldc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mem_row_major&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The mental model shifts from "threads computing elements" to "&lt;strong&gt;warps cooperating on&lt;br&gt;
fragments&lt;/strong&gt;." A &lt;code&gt;fragment&lt;/code&gt; is an opaque, register-resident tile spread across the warp's threads. The mapping from its elements to matrix coordinates is unspecified, so you don't index it by row and column;&lt;br&gt;
you feed whole fragments to &lt;code&gt;mma_sync&lt;/code&gt;. Inputs are FP16/BF16 (or FP8/FP4 on newer&lt;br&gt;
architectures), accumulation is FP32 — which is why mixed precision is the native language of&lt;br&gt;
Tensor Cores.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. The number that tells the truth: % of cuBLAS
&lt;/h2&gt;

&lt;p&gt;A hand-written WMMA kernel will beat your tiled kernel, but it will &lt;strong&gt;not&lt;/strong&gt; beat cuBLAS — and&lt;br&gt;
that's the point. cuBLAS is the practical ceiling (it does register-tiling, double-buffering,&lt;br&gt;
swizzled layouts, and architecture-specific tuning you won't replicate in an afternoon). So the&lt;br&gt;
honest metric isn't raw TFLOP/s, it's &lt;strong&gt;percent of the cuBLAS ceiling on the same GPU&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Kernel&lt;/th&gt;
&lt;th&gt;Typical regime&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Naive&lt;/td&gt;
&lt;td&gt;a few % of peak — memory-bound&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tiled (shared memory)&lt;/td&gt;
&lt;td&gt;much better, still CUDA-core bound&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WMMA (Tensor Core)&lt;/td&gt;
&lt;td&gt;a meaningful fraction of cuBLAS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cuBLAS&lt;/td&gt;
&lt;td&gt;the ceiling (100%)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Reporting "X % of cuBLAS on sm_90 and sm_120" is a self-describing result: it's reproducible,&lt;br&gt;
it normalises across GPUs, and it's honest about the gap to a production library. Profiling the&lt;br&gt;
WMMA kernel with &lt;strong&gt;Nsight Compute&lt;/strong&gt; then tells you &lt;em&gt;which&lt;/em&gt; wall you're against — memory&lt;br&gt;
throughput, Tensor Core utilisation, or occupancy.&lt;/p&gt;
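
&lt;p&gt;The metric itself is simple arithmetic, sketched below under the usual convention that an M×N×K GEMM costs 2·M·N·K floating-point operations. The function names are illustrative, and the timings are whatever you measure for your kernel and for cuBLAS on the same problem and GPU.&lt;/p&gt;

```python
def gemm_tflops(m, n, k, seconds):
    """Achieved TFLOP/s for one M x N x K GEMM (2*M*N*K floating-point ops)."""
    return 2.0 * m * n * k / seconds / 1e12


def percent_of_reference(kernel_seconds, reference_seconds):
    """Same problem, same GPU: the ratio of times is the ratio of throughputs."""
    return 100.0 * reference_seconds / kernel_seconds
```

&lt;p&gt;Because both runs execute the same FLOP count, &lt;code&gt;percent_of_reference&lt;/code&gt; needs only the two timings; &lt;code&gt;gemm_tflops&lt;/code&gt; is there for the raw number if you also want to report it.&lt;/p&gt;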

&lt;h2&gt;
  
  
  6. Why this matters going to Blackwell
&lt;/h2&gt;

&lt;p&gt;Each GPU generation widens what the MMA unit accepts: Hopper added FP8, &lt;strong&gt;Blackwell&lt;/strong&gt; adds&lt;br&gt;
&lt;strong&gt;FP4&lt;/strong&gt; and a new generation of Tensor Core instructions (&lt;code&gt;tcgen05&lt;/code&gt;). The WMMA mental model —&lt;br&gt;
fragments fed to an MMA, FP32 accumulation — carries forward; what changes is the input&lt;br&gt;
precision and the tile shapes. Understanding the FP16 WMMA path is the on-ramp to reasoning&lt;br&gt;
about NVFP4 inference on Blackwell.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;GEMM performance is a ladder: naive is memory-bound, &lt;strong&gt;shared-memory tiling&lt;/strong&gt; makes it&lt;br&gt;
compute-bound, and &lt;strong&gt;WMMA&lt;/strong&gt; moves the compute onto Tensor Cores with mixed precision. Measure&lt;br&gt;
every rung as &lt;strong&gt;% of cuBLAS&lt;/strong&gt; on the same GPU — that's the metric that's honest about how close&lt;br&gt;
you are to the ceiling and portable across Hopper and Blackwell.&lt;/p&gt;

&lt;p&gt;→ More field notes on the NVIDIA stack: &lt;a href="https://waynehacking8.github.io/" rel="noopener noreferrer"&gt;waynehacking8.github.io&lt;/a&gt;&lt;/p&gt;

</description>
      <category>cuda</category>
      <category>gpu</category>
      <category>cpp</category>
      <category>performance</category>
    </item>
    <item>
      <title>Notes on Federated Learning and Differential Privacy</title>
      <dc:creator>member_2e5ba30f</dc:creator>
      <pubDate>Sun, 31 May 2026 15:19:12 +0000</pubDate>
      <link>https://dev.to/member_2e5ba30f/notes-on-federated-learning-and-differential-privacy-3cmn</link>
      <guid>https://dev.to/member_2e5ba30f/notes-on-federated-learning-and-differential-privacy-3cmn</guid>
      <description>&lt;h1&gt;
  
  
  Notes on Federated Learning and Differential Privacy
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;2026-05-31 · privacy-preserving ML&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Working notes on building federated learning (FL) from scratch, what actually breaks under&lt;br&gt;
&lt;strong&gt;Non-IID&lt;/strong&gt; data, and how &lt;strong&gt;differential privacy (DP)&lt;/strong&gt; and &lt;strong&gt;secure aggregation&lt;/strong&gt; fit on top —&lt;br&gt;
including the honest negative results that the marketing slides leave out. They follow the&lt;br&gt;
implementation in&lt;br&gt;
&lt;a href="https://github.com/waynehacking8/federated-learning-lab" rel="noopener noreferrer"&gt;&lt;strong&gt;federated-learning-lab&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
(FedAvg / FedProx / SCAFFOLD, DP-SGD, secure aggregation; 33/33 tests, literature&lt;br&gt;
cross-validated).&lt;/p&gt;

&lt;h2&gt;
  
  
  1. What federated learning actually is
&lt;/h2&gt;

&lt;p&gt;The data never moves. Instead of pooling everyone's data on one server, each client trains&lt;br&gt;
locally and sends &lt;strong&gt;model updates&lt;/strong&gt; to a server that aggregates them. The canonical loop&lt;br&gt;
(&lt;strong&gt;FedAvg&lt;/strong&gt;) is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Server broadcasts the global model.&lt;/li&gt;
&lt;li&gt;Each client does a few local SGD epochs on its own data.&lt;/li&gt;
&lt;li&gt;Each client sends back its updated weights.&lt;/li&gt;
&lt;li&gt;Server averages the weights (weighted by client data size) → new global model.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. The elegance is that raw data stays on-device; the difficulty is that the clients'&lt;br&gt;
data distributions are &lt;strong&gt;not&lt;/strong&gt; identical.&lt;/p&gt;
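
&lt;p&gt;Step 4 of the loop is the only server-side math. A minimal sketch, assuming each client's model is flattened to a plain list of floats (the function name &lt;code&gt;fedavg&lt;/code&gt; and this representation are illustrative, not the lab's actual API):&lt;/p&gt;

```python
def fedavg(client_weights, client_sizes):
    """Average client weight vectors, weighted by local dataset size."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * s for w, s in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]
```

&lt;p&gt;A client holding three times as much data pulls the average three times as hard, which is exactly why skewed client distributions matter.&lt;/p&gt;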

&lt;h2&gt;
  
  
  2. The Non-IID problem (where FedAvg starts to hurt)
&lt;/h2&gt;

&lt;p&gt;FedAvg implicitly assumes every client sees roughly the same distribution. Real clients don't —&lt;br&gt;
one hospital sees different cases than another, one phone's keyboard sees different language.&lt;br&gt;
Under &lt;strong&gt;Non-IID&lt;/strong&gt; data, each client's local optimum pulls in a different direction, so averaging&lt;br&gt;
their updates produces &lt;strong&gt;client drift&lt;/strong&gt;: the global model lands somewhere none of them wanted.&lt;/p&gt;

&lt;p&gt;Two well-known fixes, both implemented and measured in the lab:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FedProx&lt;/strong&gt; — add a proximal term that penalises drifting too far from the global model.
Stabilises training when clients are heterogeneous.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SCAFFOLD&lt;/strong&gt; — track &lt;strong&gt;control variates&lt;/strong&gt; (correction terms) that estimate and subtract the
drift direction. More state to communicate, but corrects the bias FedProx only damps.&lt;/li&gt;
&lt;/ul&gt;
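
&lt;p&gt;To make the FedProx proximal term concrete, here is a sketch of one local update step. It assumes weights and gradients are plain lists of floats and that the penalty has the standard form (μ/2)·‖w − w_global‖²; the name &lt;code&gt;fedprox_step&lt;/code&gt; and its parameters are illustrative.&lt;/p&gt;

```python
def fedprox_step(w, grad, w_global, lr, mu):
    """One local SGD step on loss(w) + (mu / 2) * ||w - w_global||^2."""
    return [
        wi - lr * (gi + mu * (wi - gw))  # the extra term pulls wi back toward gw
        for wi, gi, gw in zip(w, grad, w_global)
    ]
```

&lt;p&gt;With &lt;code&gt;mu = 0&lt;/code&gt; this reduces to the plain SGD step that FedAvg clients take, so the proximal strength is the only knob separating the two.&lt;/p&gt;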

&lt;p&gt;The honest finding worth repeating: on a strongly Non-IID split (e.g. label-skewed MNIST), the&lt;br&gt;
fancy methods &lt;strong&gt;don't always beat plain FedAvg by much&lt;/strong&gt; — and sometimes the dominant lever is&lt;br&gt;
just more communication rounds. Reporting the case where your method &lt;em&gt;doesn't&lt;/em&gt; win is what&lt;br&gt;
separates a lab from a brochure.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Differential privacy: the model still leaks
&lt;/h2&gt;

&lt;p&gt;Keeping data on-device is &lt;strong&gt;not&lt;/strong&gt; privacy. Model updates leak information about the data that&lt;br&gt;
produced them: membership-inference attacks reveal whether a given example was in the training set,&lt;br&gt;
and gradient-inversion attacks reconstruct training samples from gradients. To get a real guarantee you add &lt;strong&gt;differential privacy&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DP-SGD&lt;/strong&gt; makes each training step private by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Per-sample gradient clipping&lt;/strong&gt; — bound each example's contribution to a max norm &lt;code&gt;C&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gaussian noise&lt;/strong&gt; — add noise calibrated to &lt;code&gt;C&lt;/code&gt; to the summed gradients.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The result is a formal &lt;strong&gt;(ε, δ)&lt;/strong&gt; guarantee: the distribution over trained models is provably almost the same&lt;br&gt;
whether or not any single example was in the data. The cost is the &lt;strong&gt;privacy–utility&lt;br&gt;
trade-off&lt;/strong&gt; — smaller ε (stronger privacy) means more noise and lower accuracy. There is no&lt;br&gt;
free lunch; the contribution is &lt;em&gt;measuring&lt;/em&gt; the curve, not claiming privacy is costless.&lt;/p&gt;
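
&lt;p&gt;The two DP-SGD steps can be sketched in a few lines. This assumes per-sample gradients are already available as lists of floats and that the noise standard deviation is &lt;code&gt;noise_multiplier * max_norm&lt;/code&gt;; the function names are illustrative. It does not do privacy accounting, so it does not by itself tell you the resulting ε.&lt;/p&gt;

```python
import math
import random


def clip(grad, max_norm):
    """Scale one example's gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_gradient(per_sample_grads, max_norm, noise_multiplier, rng):
    """Clip each example, sum, add Gaussian noise, then average over the batch."""
    dim = len(per_sample_grads[0])
    clipped = [clip(g, max_norm) for g in per_sample_grads]
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm  # noise is calibrated to the clip bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [x / len(per_sample_grads) for x in noisy]
```

&lt;p&gt;Clipping must happen per example, before the sum. Clipping the already-averaged gradient instead is one of the bugs that silently voids the guarantee.&lt;/p&gt;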

&lt;h2&gt;
  
  
  4. Secure aggregation: hide the individual update
&lt;/h2&gt;

&lt;p&gt;DP bounds what the &lt;em&gt;final model&lt;/em&gt; leaks. &lt;strong&gt;Secure aggregation&lt;/strong&gt; addresses a different threat: a&lt;br&gt;
curious server seeing each client's &lt;em&gt;individual&lt;/em&gt; update. With secure aggregation, clients mask&lt;br&gt;
their updates with random values that cancel when summed, so the server can compute only the &lt;strong&gt;sum&lt;/strong&gt;&lt;br&gt;
and no single client's contribution is visible. DP (what the model leaks) and secure aggregation&lt;br&gt;
(what the server sees) are &lt;strong&gt;complementary&lt;/strong&gt;, not substitutes.&lt;/p&gt;
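
&lt;p&gt;The cancellation is easiest to see with pairwise additive masks. The toy sketch below assumes updates are integer-encoded and that every pair of clients already shares a random mask; here a single &lt;code&gt;rng&lt;/code&gt; stands in for that shared secret. Real protocols derive the masks from key agreement and handle client dropouts, both of which are omitted. &lt;code&gt;MODULUS&lt;/code&gt;, &lt;code&gt;mask_updates&lt;/code&gt; and &lt;code&gt;server_sum&lt;/code&gt; are illustrative names.&lt;/p&gt;

```python
import random

MODULUS = 2 ** 32  # hypothetical ring size for integer-encoded updates


def mask_updates(updates, rng):
    """For each pair (i, j) with i before j: client i adds a mask, client j subtracts it."""
    n, dim = len(updates), len(updates[0])
    masked = [list(u) for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            for d in range(dim):
                m = rng.randrange(MODULUS)  # stands in for a shared pairwise secret
                masked[i][d] = (masked[i][d] + m) % MODULUS
                masked[j][d] = (masked[j][d] - m) % MODULUS
    return masked


def server_sum(masked):
    """The server only adds masked vectors; the masks cancel modulo MODULUS."""
    dim = len(masked[0])
    return [sum(v[d] for v in masked) % MODULUS for d in range(dim)]
```

&lt;p&gt;Each masked vector on its own looks uniformly random, yet &lt;code&gt;server_sum&lt;/code&gt; recovers the exact sum of the original updates.&lt;/p&gt;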

&lt;h2&gt;
  
  
  5. Why "from scratch" and "33/33 tests"
&lt;/h2&gt;

&lt;p&gt;Privacy ML is exactly the domain where a subtly wrong implementation gives a &lt;em&gt;false&lt;/em&gt; sense of&lt;br&gt;
safety — a clipping bug or a miscalibrated noise multiplier silently voids the ε guarantee. So&lt;br&gt;
the lab:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;implements each algorithm from scratch (FedAvg / FedProx / SCAFFOLD, plus FedPer /
Byzantine-robust / FedAdam / FedLoRA),&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;cross-validates against the literature&lt;/strong&gt; so behaviour matches published results, and&lt;/li&gt;
&lt;li&gt;ships &lt;strong&gt;33/33 passing tests&lt;/strong&gt; and explicit negative results.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For privacy and security work, the test suite and the reproduction &lt;em&gt;are&lt;/em&gt; the credibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;Federated learning moves the model, not the data — but on-device ≠ private. Non-IID data breaks&lt;br&gt;
naive averaging (FedProx/SCAFFOLD help, sometimes only a little); DP-SGD buys a formal (ε, δ)&lt;br&gt;
guarantee at a measurable accuracy cost; secure aggregation hides individual updates from the&lt;br&gt;
server. The trustworthy version of all three is the one with the tests and the honest curves.&lt;/p&gt;

&lt;p&gt;→ From-scratch implementations, tests, and negative results:&lt;br&gt;
&lt;a href="https://github.com/waynehacking8/federated-learning-lab" rel="noopener noreferrer"&gt;github.com/waynehacking8/federated-learning-lab&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>privacy</category>
      <category>ai</category>
      <category>python</category>
    </item>
    <item>
      <title>Notes on Serving LLMs with TensorRT-LLM and Triton</title>
      <dc:creator>member_2e5ba30f</dc:creator>
      <pubDate>Sun, 31 May 2026 15:18:35 +0000</pubDate>
      <link>https://dev.to/member_2e5ba30f/notes-on-serving-llms-with-tensorrt-llm-and-triton-14ai</link>
      <guid>https://dev.to/member_2e5ba30f/notes-on-serving-llms-with-tensorrt-llm-and-triton-14ai</guid>
      <description>&lt;h1&gt;
  
  
  Notes on Serving LLMs with TensorRT-LLM and Triton
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;2026-05-31 · LLM serving / NVIDIA stack&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These are working notes on taking an open-weights LLM from a Hugging Face checkpoint to a&lt;br&gt;
production-style serving endpoint on the NVIDIA stack — &lt;strong&gt;TensorRT-LLM&lt;/strong&gt; for the engine,&lt;br&gt;
&lt;strong&gt;Triton Inference Server&lt;/strong&gt; for the deployment surface — and benchmarking it honestly against&lt;br&gt;
&lt;strong&gt;vLLM&lt;/strong&gt; on multi-GPU hardware. They follow the harness in&lt;br&gt;
&lt;a href="https://github.com/waynehacking8/trtllm-triton-serving" rel="noopener noreferrer"&gt;&lt;strong&gt;trtllm-triton-serving&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
(4× H100, NVLink).&lt;/p&gt;

&lt;p&gt;The goal is to move from "I use vLLM" to "I can stand up the NVIDIA inference stack on real&lt;br&gt;
multi-GPU hardware and reason about the trade-offs."&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The serving pipeline
&lt;/h2&gt;

&lt;p&gt;The path from checkpoint to endpoint has four stages. Each one is a place where a decision&lt;br&gt;
affects latency, throughput, or accuracy:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Checkpoint&lt;/strong&gt; — a Hugging Face model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engine build&lt;/strong&gt; — compile to a TensorRT-LLM engine for a &lt;em&gt;fixed&lt;/em&gt; tensor-parallel degree,
precision, and batching policy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model repository&lt;/strong&gt; — wrap the engine in a Triton &lt;code&gt;tensorrt_llm&lt;/code&gt;-backend model repo.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Serving + load test&lt;/strong&gt; — &lt;code&gt;trtllm-serve&lt;/code&gt; (or Triton) exposes an OpenAI-compatible endpoint;
a load generator drives it under controlled concurrency.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key mental shift from vLLM: TensorRT-LLM does &lt;strong&gt;ahead-of-time compilation&lt;/strong&gt;. vLLM is a&lt;br&gt;
runtime that takes the model and serves it; TensorRT-LLM &lt;em&gt;builds an engine&lt;/em&gt; specialized to your&lt;br&gt;
GPU, TP degree, and precision first. That build is where the performance comes from, and also&lt;br&gt;
where the rigidity comes from.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Tensor parallelism (TP)
&lt;/h2&gt;

&lt;p&gt;For a model that doesn't fit on one GPU — or to cut latency — TensorRT-LLM shards each layer&lt;br&gt;
across GPUs. On a 4× H100 NVLink box, &lt;code&gt;TP=4&lt;/code&gt; means each forward pass runs an &lt;strong&gt;all-reduce&lt;/strong&gt; after every sharded attention and MLP block,&lt;br&gt;
across the four GPUs over NVLink.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The all-reduce is not free. On this fabric it tops out around 77 % of the NVLink budget&lt;br&gt;
(see the separate &lt;a href="//nccl-nvlink-bandwidth.md"&gt;NVLink-wall notes&lt;/a&gt;). For &lt;strong&gt;prefill&lt;/strong&gt; (large&lt;br&gt;
tensors) you're bandwidth-bound and TP helps. For &lt;strong&gt;decode&lt;/strong&gt; (one token at a time) you're&lt;br&gt;
pinned against the small-message latency floor, and past a point more TP makes decode&lt;br&gt;
&lt;em&gt;slower&lt;/em&gt;. Pick TP for the regime you actually serve.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  3. Precision: FP16 vs FP8
&lt;/h2&gt;

&lt;p&gt;The engine is built for a specific precision. The two that matter most on Hopper:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Precision&lt;/th&gt;
&lt;th&gt;Memory&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;th&gt;Accuracy risk&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FP16&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;td&gt;none (reference)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FP8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~½ weights + KV-cache&lt;/td&gt;
&lt;td&gt;higher&lt;/td&gt;
&lt;td&gt;small, model-dependent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;FP8 uses the Hopper Transformer Engine and shrinks both weights and the KV-cache, which is&lt;br&gt;
often the real bottleneck for long contexts. The honest move is to &lt;strong&gt;measure&lt;/strong&gt; the accuracy&lt;br&gt;
delta on your task rather than assume FP8 is free — a quantization study belongs in the same&lt;br&gt;
harness as the throughput numbers.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. The batching policy that actually matters
&lt;/h2&gt;

&lt;p&gt;Two features dominate real serving throughput:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;In-flight (continuous) batching&lt;/strong&gt; — new requests join the running batch at the next
iteration instead of waiting for the current batch to drain. This is what keeps GPUs busy
under bursty traffic; vLLM and TensorRT-LLM both do it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Paged KV-cache&lt;/strong&gt; — the KV-cache is allocated in pages, so memory isn't reserved for the
worst-case sequence length per request. This is what lets you fit more concurrent sequences.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a "benchmark" doesn't enable these, it isn't measuring production serving — it's measuring a&lt;br&gt;
toy.&lt;/p&gt;
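
&lt;p&gt;A back-of-the-envelope sketch of why paging fits more sequences. It assumes the KV-cache budget is counted in tokens and that sequence lengths are known up front; a real scheduler allocates pages as sequences grow during decode. All names and parameters here are illustrative.&lt;/p&gt;

```python
def pages_needed(seq_len, page_size):
    """Pages required to hold seq_len tokens (ceiling division)."""
    return -(-seq_len // page_size)


def max_concurrent(total_tokens, seq_lens, max_seq_len, page_size):
    """How many of seq_lens fit: (worst-case reservation, paged allocation)."""
    # Without paging, every request reserves max_seq_len tokens up front.
    reserved = min(len(seq_lens), total_tokens // max_seq_len)
    # With paging, a request only consumes the pages its tokens occupy.
    free_pages = total_tokens // page_size
    paged = 0
    for n in seq_lens:
        need = pages_needed(n, page_size)
        if need > free_pages:
            break
        free_pages -= need
        paged += 1
    return reserved, paged
```

&lt;p&gt;The gap between the two numbers grows with the difference between typical and worst-case sequence length, which is the common case in real traffic.&lt;/p&gt;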

&lt;h2&gt;
  
  
  5. The benchmark trap: comparing the same work
&lt;/h2&gt;

&lt;p&gt;The single most common mistake in "X vs Y" LLM benchmarks is &lt;strong&gt;not decoding the same number of&lt;br&gt;
tokens&lt;/strong&gt;. If stack A happens to emit shorter completions, it looks faster while doing less work.&lt;/p&gt;

&lt;p&gt;The fix used in the harness is a &lt;strong&gt;controlled methodology&lt;/strong&gt;: every request decodes &lt;em&gt;exactly&lt;/em&gt;&lt;br&gt;
256 tokens by setting &lt;code&gt;ignore_eos=True&lt;/code&gt; and &lt;code&gt;min_tokens=max_tokens&lt;/code&gt;. Now throughput and&lt;br&gt;
latency compare identical work across TensorRT-LLM, Triton, and vLLM. Without this, the numbers&lt;br&gt;
are noise.&lt;/p&gt;
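
&lt;p&gt;A sketch of what the matched-work contract might look like on the client side. The field names &lt;code&gt;ignore_eos&lt;/code&gt;, &lt;code&gt;min_tokens&lt;/code&gt; and &lt;code&gt;max_tokens&lt;/code&gt; come from the text above. It is an assumption that a given server accepts them as top-level JSON fields of a completions request, so check each stack's documentation. The helper names are illustrative.&lt;/p&gt;

```python
def matched_work_payload(model, prompt, n_tokens=256):
    """Request body intended to force exactly n_tokens decoded tokens."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": n_tokens,
        "min_tokens": n_tokens,  # min_tokens == max_tokens
        "ignore_eos": True,      # keep decoding past end-of-sequence
    }


def is_matched(completion_token_counts, n_tokens=256):
    """True only if every response decoded exactly n_tokens tokens."""
    return all(c == n_tokens for c in completion_token_counts)
```

&lt;p&gt;Running &lt;code&gt;is_matched&lt;/code&gt; over the reported completion lengths before computing any throughput number catches the case where a server silently ignored one of the fields.&lt;/p&gt;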

&lt;p&gt;Metrics worth reporting, all under &lt;em&gt;matched concurrency&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Throughput&lt;/strong&gt; (tokens/s, total) — the headline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TTFT&lt;/strong&gt; (time to first token) — dominated by prefill; what the user feels first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inter-token latency&lt;/strong&gt; — dominated by decode; what the user feels while reading.&lt;/li&gt;
&lt;/ul&gt;
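
&lt;p&gt;All three metrics fall out of per-token arrival timestamps. A minimal sketch, assuming the load generator records a send time and one timestamp per streamed token for each request (the function names and the tuple layout are illustrative):&lt;/p&gt;

```python
def request_metrics(send_time, token_times):
    """TTFT and mean inter-token latency for one request, in seconds."""
    ttft = token_times[0] - send_time
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, itl


def total_throughput(requests):
    """Output tokens per second over the wall-clock span of the whole run."""
    start = min(send for send, _ in requests)
    end = max(times[-1] for _, times in requests)
    tokens = sum(len(times) for _, times in requests)
    return tokens / (end - start)
```

&lt;p&gt;Throughput is computed over the whole run rather than averaged per request, since concurrent requests overlap in time.&lt;/p&gt;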

&lt;h2&gt;
  
  
  6. Triton as the production surface
&lt;/h2&gt;

&lt;p&gt;The measured runs can use TensorRT-LLM's own OpenAI server (&lt;code&gt;trtllm-serve&lt;/code&gt;), but the&lt;br&gt;
&lt;strong&gt;production path&lt;/strong&gt; is the Triton &lt;code&gt;tensorrt_llm&lt;/code&gt;-backend model repository (&lt;code&gt;triton_model_repo/&lt;/code&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It exposes the engine over a hardened, observable server (metrics, health, dynamic batching
config) instead of a script.&lt;/li&gt;
&lt;li&gt;It's the same control plane you'd use for an ensemble (tokenizer → engine → de-tokenizer) and
for multi-model hosting.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Treat &lt;code&gt;trtllm-serve&lt;/code&gt; as the fast path for benchmarking and Triton as the path you'd actually&lt;br&gt;
ship behind a gateway.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. When does TensorRT-LLM win? (the measured answer)
&lt;/h2&gt;

&lt;p&gt;Not always — and the measurement says &lt;em&gt;which&lt;/em&gt; regime, not a vibe. Across a matched-work sweep on&lt;br&gt;
4× H100, the result lands on a &lt;strong&gt;concurrency crossover&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TensorRT-LLM (with CUDA graphs) wins at low-to-mid concurrency&lt;/strong&gt; — the latency-sensitive
regime. The ahead-of-time engine plus CUDA-graph capture removes per-iteration launch overhead
that dominates when the batch is small, so TTFT and inter-token latency are lower.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vLLM wins at high concurrency&lt;/strong&gt; — the throughput-saturated regime, where its scheduler keeps
the GPU packed and the launch-overhead advantage no longer matters.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;One caveat that cost a real bug: &lt;strong&gt;CUDA graphs only help if the config actually enables them.&lt;/strong&gt;&lt;br&gt;
A run that looks like "TensorRT-LLM is barely faster" can be a mis-set graph config; fixing it&lt;br&gt;
moved the low-concurrency number substantially. Always confirm the optimisation you're&lt;br&gt;
crediting is switched on before drawing the curve.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So the decision rule is about your &lt;strong&gt;load&lt;/strong&gt;, not brand loyalty: latency-bound, low/mid&lt;br&gt;
concurrency → TensorRT-LLM + CUDA graphs; throughput-bound, high concurrency → vLLM. The honest&lt;br&gt;
deliverable is a reproducible &lt;strong&gt;serve → benchmark&lt;/strong&gt; loop with documented methodology that draws&lt;br&gt;
that crossover for &lt;em&gt;your&lt;/em&gt; hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;Serving an LLM well is mostly about three things: putting tensor parallelism in the regime that&lt;br&gt;
helps, enabling continuous batching + paged KV-cache, and &lt;strong&gt;measuring the same work&lt;/strong&gt; across&lt;br&gt;
stacks. The measured crossover: TensorRT-LLM + CUDA graphs win &lt;strong&gt;low/mid concurrency&lt;/strong&gt; (latency),&lt;br&gt;
vLLM wins &lt;strong&gt;high concurrency&lt;/strong&gt; (throughput) — and a Triton control plane is what you'd actually&lt;br&gt;
put in production.&lt;/p&gt;

&lt;p&gt;→ Full pipeline, Triton model repo, and the matched-work harness:&lt;br&gt;
&lt;a href="https://github.com/waynehacking8/trtllm-triton-serving" rel="noopener noreferrer"&gt;github.com/waynehacking8/trtllm-triton-serving&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>nvidia</category>
      <category>machinelearning</category>
      <category>performance</category>
    </item>
    <item>
      <title>0% vs 50%: Making a RAG Agent Refuse to Hallucinate</title>
      <dc:creator>member_2e5ba30f</dc:creator>
      <pubDate>Sun, 31 May 2026 15:12:07 +0000</pubDate>
      <link>https://dev.to/member_2e5ba30f/0-vs-50-making-a-rag-agent-refuse-to-hallucinate-13ba</link>
      <guid>https://dev.to/member_2e5ba30f/0-vs-50-making-a-rag-agent-refuse-to-hallucinate-13ba</guid>
      <description>&lt;h1&gt;
  
  
  0 % vs 50 %: making a RAG agent refuse to hallucinate
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;2026-05-31 · LLM / RAG&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A retrieval-augmented agent is only as trustworthy as its behaviour on questions whose answer&lt;br&gt;
&lt;strong&gt;isn't in the corpus&lt;/strong&gt;. The failure mode is quiet: instead of saying "I don't know," the model&lt;br&gt;
invents a confident, well-formed, wrong answer. This post shows a single guardrail that takes&lt;br&gt;
that failure from common to never and, crucially, &lt;em&gt;measures&lt;/em&gt; the change.&lt;/p&gt;

&lt;p&gt;Reference architecture:&lt;br&gt;
&lt;a href="https://github.com/waynehacking8/nim-agent-blueprint" rel="noopener noreferrer"&gt;&lt;strong&gt;nim-agent-blueprint&lt;/strong&gt;&lt;/a&gt; — agentic RAG on&lt;br&gt;
the NVIDIA NIM stack with a built-in eval harness.&lt;/p&gt;

&lt;h2&gt;
  
  
  The ablation
&lt;/h2&gt;

&lt;p&gt;The agent loop is &lt;strong&gt;plan → retrieve → generate → validate&lt;/strong&gt;. The interesting variable is the&lt;br&gt;
generation prompt's contract with the retrieved context:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Out-of-corpus hallucination rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Generate freely from context&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~50 %&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Guarded prompt (answer &lt;em&gt;only&lt;/em&gt; from context; otherwise abstain)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0 %&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Same model, same retriever, same questions. The only change is a prompt that makes "I can't&lt;br&gt;
answer that from the provided sources" a first-class, rewarded output — plus a &lt;strong&gt;validate&lt;/strong&gt;&lt;br&gt;
step that checks the answer is grounded in retrieved spans before returning it. On in-corpus&lt;br&gt;
questions, retrieval &lt;strong&gt;recall@3 stayed at 94–100 %&lt;/strong&gt;, so the guardrail buys safety without&lt;br&gt;
costing coverage.&lt;/p&gt;
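
&lt;p&gt;The guarded prompt and the validate step can be sketched roughly as follows. This is not the&lt;br&gt;
blueprint's code: the function names, the prompt wording, and the token-overlap threshold are&lt;br&gt;
assumptions, and the lexical check is a crude stand-in for a real groundedness check.&lt;/p&gt;

```python
import re

# Hypothetical abstention string; the exact wording is an assumption.
ABSTAIN = "I can't answer that from the provided sources."


def build_guarded_prompt(question, passages):
    """Build a prompt that makes abstention an explicit, allowed output."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using ONLY the sources below. "
        f"If they do not contain the answer, reply exactly: {ABSTAIN}\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def _tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_grounded(answer, passages, min_overlap=0.6):
    """Crude lexical check: share of answer tokens found in any passage."""
    answer_tokens = _tokens(answer)
    if not answer_tokens:
        return False
    source_tokens = set()
    for passage in passages:
        source_tokens.update(_tokens(passage))
    overlap = len(answer_tokens.intersection(source_tokens)) / len(answer_tokens)
    return overlap >= min_overlap


def validate(answer, passages):
    """Return the answer only if it passes the grounding check, else abstain."""
    if answer.strip() == ABSTAIN:
        return ABSTAIN
    return answer if is_grounded(answer, passages) else ABSTAIN
```

&lt;p&gt;A production validate step would more likely use an entailment model or an LLM judge; the&lt;br&gt;
overlap check only marks where that decision sits in the loop.&lt;/p&gt;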

&lt;h2&gt;
  
  
  Why "just prompt better" isn't the lesson
&lt;/h2&gt;

&lt;p&gt;The lesson isn't the prompt — it's that the difference between 50 % and 0 % is &lt;strong&gt;invisible&lt;br&gt;
without an eval harness&lt;/strong&gt;. A demo that only asks in-corpus questions looks perfect in both&lt;br&gt;
configurations. You only see the 50 % when you deliberately ask things the corpus can't&lt;br&gt;
answer and &lt;em&gt;score groundedness&lt;/em&gt;. So the blueprint ships with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;retrieval hit-rate&lt;/strong&gt; (is the answer even retrievable?),&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;answer groundedness&lt;/strong&gt; via LLM-as-judge (is the answer supported by what was retrieved?),&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;latency&lt;/strong&gt;, and OpenTelemetry traces per agent step.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's the difference between "it works on my five questions" and "here is the number a&lt;br&gt;
partner can hold me to."&lt;/p&gt;
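
&lt;p&gt;As a rough illustration of how such numbers fall out of per-question records, here is a&lt;br&gt;
small scoring sketch. The record fields and function names are hypothetical, and counting a&lt;br&gt;
hallucination as "answered, and judged ungrounded" is one reasonable definition rather than&lt;br&gt;
necessarily the one the blueprint uses.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    in_corpus: bool      # does the corpus contain the answer?
    retrieved_hit: bool  # did a gold passage appear in the top-k results?
    abstained: bool      # did the agent decline to answer?
    grounded: bool       # judge verdict: is the answer supported by retrieved text?


def out_of_corpus_hallucination_rate(records):
    """Share of out-of-corpus questions answered with an ungrounded answer."""
    out_of_corpus = [r for r in records if not r.in_corpus]
    if not out_of_corpus:
        return 0.0
    hallucinated = [r for r in out_of_corpus if not r.abstained and not r.grounded]
    return len(hallucinated) / len(out_of_corpus)


def retrieval_hit_rate(records):
    """Share of in-corpus questions whose gold passage was retrieved."""
    in_corpus = [r for r in records if r.in_corpus]
    if not in_corpus:
        return 0.0
    return sum(r.retrieved_hit for r in in_corpus) / len(in_corpus)


# Synthetic records for illustration only.
records = [
    EvalRecord(in_corpus=False, retrieved_hit=False, abstained=True, grounded=False),
    EvalRecord(in_corpus=False, retrieved_hit=False, abstained=False, grounded=False),
    EvalRecord(in_corpus=True, retrieved_hit=True, abstained=False, grounded=True),
    EvalRecord(in_corpus=True, retrieved_hit=True, abstained=False, grounded=True),
]

print(out_of_corpus_hallucination_rate(records))  # 0.5
print(retrieval_hit_rate(records))                # 1.0
```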

&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;For enterprise RAG, abstention is a feature, not a failure. Make "I don't know" a rewarded&lt;br&gt;
output, validate groundedness before returning, and &lt;strong&gt;measure the out-of-corpus rate&lt;/strong&gt; — it's&lt;br&gt;
the number that separates a demo from something you'd put in front of a customer.&lt;/p&gt;

&lt;p&gt;→ Runnable blueprint + eval harness:&lt;br&gt;
&lt;a href="https://github.com/waynehacking8/nim-agent-blueprint" rel="noopener noreferrer"&gt;github.com/waynehacking8/nim-agent-blueprint&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Where Tensor-Parallel Inference Hits the NVLink Wall</title>
      <dc:creator>member_2e5ba30f</dc:creator>
      <pubDate>Sun, 31 May 2026 15:11:24 +0000</pubDate>
      <link>https://dev.to/member_2e5ba30f/where-tensor-parallel-inference-hits-the-nvlink-wall-1l8p</link>
      <guid>https://dev.to/member_2e5ba30f/where-tensor-parallel-inference-hits-the-nvlink-wall-1l8p</guid>
      <description>&lt;h1&gt;
  
  
  Where tensor-parallel inference hits the NVLink wall
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;2026-05-31 · GPU / distributed systems&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Tensor parallelism splits each layer across GPUs, so every forward pass pays for an&lt;br&gt;
&lt;strong&gt;all-reduce&lt;/strong&gt; over the network fabric. On a single node that fabric is NVLink/NVSwitch — and&lt;br&gt;
how close you get to its theoretical budget decides whether TP helps or hurts. This post&lt;br&gt;
measures it on &lt;strong&gt;4× H100&lt;/strong&gt; and explains where the wall is.&lt;/p&gt;

&lt;p&gt;Repo with the full harness and CSVs:&lt;br&gt;
&lt;a href="https://github.com/waynehacking8/nccl-collectives-bench" rel="noopener noreferrer"&gt;&lt;strong&gt;nccl-collectives-bench&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What was measured
&lt;/h2&gt;

&lt;p&gt;A bandwidth sweep (message size 8 B → 8 GB) of the three collectives that bound distributed&lt;br&gt;
LLM work (&lt;strong&gt;all-reduce, all-gather, reduce-scatter&lt;/strong&gt;), run with the canonical&lt;br&gt;
&lt;code&gt;nvidia/nccl-tests&lt;/code&gt; plus a parser + analysis layer on top. The headline results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;All-reduce bus bandwidth ≈ 366 GB/s&lt;/strong&gt;, about &lt;strong&gt;77 % of the per-GPU NVLink uni-directional
budget&lt;/strong&gt; on this box. That 77 % is the practical ceiling TP communication runs into. Bus bandwidth as
nccl-tests reports it already folds in all-reduce's 2(n−1)/n traffic factor, so the remaining gap is overhead such as protocol and synchronisation cost.&lt;/li&gt;
&lt;li&gt;Algorithm ranking at large messages: &lt;strong&gt;NVLS &amp;gt; Ring &amp;gt; Tree&lt;/strong&gt;. NVLink SHARP (NVLS) offloads
the reduction into the switch, which is why it pulls ahead once messages are big enough to
amortise setup.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;protocol study&lt;/strong&gt; (Simple / LL / LL128) showing the small-message latency floor — the
regime that actually matters for &lt;strong&gt;decode&lt;/strong&gt;, where each token's all-reduce is tiny.&lt;/li&gt;
&lt;/ul&gt;
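
&lt;p&gt;For reading the 366 GB/s figure: nccl-tests derives all-reduce bus bandwidth by scaling the&lt;br&gt;
algorithm bandwidth (message size divided by time) by 2(n−1)/n. The sketch below inverts that&lt;br&gt;
for n = 4. The function names are my own, and the time estimate is bandwidth-only, so it is&lt;br&gt;
meaningful for large messages only.&lt;/p&gt;

```python
def allreduce_bus_factor(n_ranks):
    """Factor nccl-tests applies to all-reduce algorithm bandwidth: 2(n-1)/n."""
    return 2 * (n_ranks - 1) / n_ranks


def algbw_from_busbw(busbw_gbs, n_ranks):
    """Recover algorithm bandwidth (GB/s) from reported bus bandwidth."""
    return busbw_gbs / allreduce_bus_factor(n_ranks)


def allreduce_time_ms(message_bytes, busbw_gbs, n_ranks):
    """Bandwidth-only estimate; ignores the small-message latency floor."""
    algbw_bytes_per_s = algbw_from_busbw(busbw_gbs, n_ranks) * 1e9
    return message_bytes / algbw_bytes_per_s * 1e3


print(allreduce_bus_factor(4))      # 1.5
print(algbw_from_busbw(366.0, 4))   # 244.0
```

&lt;p&gt;So on 4 GPUs, 366 GB/s of bus bandwidth corresponds to about 244 GB/s of message data&lt;br&gt;
reduced per second from the caller's point of view.&lt;/p&gt;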

&lt;h2&gt;
  
  
  Why it matters for inference
&lt;/h2&gt;

&lt;p&gt;Training all-reduces gradients on big tensors, so it lives in the bandwidth-bound regime&lt;br&gt;
where 366 GB/s is good news. &lt;strong&gt;Decode is the opposite&lt;/strong&gt;: one token at a time means small&lt;br&gt;
messages, so you're pinned against the &lt;em&gt;latency&lt;/em&gt; floor, not the bandwidth ceiling. That is the&lt;br&gt;
real "TP wall" — past a certain TP degree, the per-token all-reduce latency dominates and&lt;br&gt;
adding GPUs makes decode &lt;em&gt;slower&lt;/em&gt;, not faster.&lt;/p&gt;
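
&lt;p&gt;A toy model makes the shape of that wall concrete. Everything numeric here is an assumption&lt;br&gt;
picked for illustration (single-GPU step time, layer count, per-hop latency), and the&lt;br&gt;
communication term assumes a ring all-reduce with 2(n−1) latency-bound steps; NVLS or tree&lt;br&gt;
algorithms scale differently. Two all-reduces per layer matches Megatron-style TP.&lt;/p&gt;

```python
ALLREDUCES_PER_LAYER = 2  # one after attention, one after the MLP


def decode_ms_per_token(tp, single_gpu_ms=20.0, layers=80, hop_latency_ms=0.005):
    """Toy per-token decode time: ideal 1/tp scaling plus latency-bound all-reduces.

    All default values are hypothetical, not measurements from the post.
    """
    compute_ms = single_gpu_ms / tp
    ring_steps = 2 * (tp - 1)  # zero when tp == 1, so no communication cost
    comm_ms = layers * ALLREDUCES_PER_LAYER * ring_steps * hop_latency_ms
    return compute_ms + comm_ms


degrees = [1, 2, 4, 8]
for tp in degrees:
    print(tp, round(decode_ms_per_token(tp), 1))

best_tp = min(degrees, key=decode_ms_per_token)
print(best_tp)  # 4
```

&lt;p&gt;With these made-up inputs the model bottoms out at TP=4 and gets slower at TP=8, because the&lt;br&gt;
compute term keeps halving while the latency term keeps growing.&lt;/p&gt;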

&lt;p&gt;The repo also includes an &lt;strong&gt;eager-vs-CUDA-Graph&lt;/strong&gt; comparison of that decode latency wall:&lt;br&gt;
capturing the per-token step as a graph removes launch overhead that would otherwise be&lt;br&gt;
indistinguishable from communication cost — a reminder to measure the right thing before&lt;br&gt;
blaming the fabric.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;"Use tensor parallelism" is not free advice. Measure the all-reduce on &lt;em&gt;your&lt;/em&gt; fabric, know&lt;br&gt;
your 77 %, and know that the number that decides decode latency is the small-message floor —&lt;br&gt;
not the big-message bandwidth everyone quotes.&lt;/p&gt;

&lt;p&gt;→ Methodology, raw CSVs, and the roofline analysis:&lt;br&gt;
&lt;a href="https://github.com/waynehacking8/nccl-collectives-bench" rel="noopener noreferrer"&gt;github.com/waynehacking8/nccl-collectives-bench&lt;/a&gt;&lt;/p&gt;

</description>
      <category>cuda</category>
      <category>gpu</category>
      <category>machinelearning</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
