<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shah Fahad</title>
    <description>The latest articles on DEV Community by Shah Fahad (@sfahad).</description>
    <link>https://dev.to/sfahad</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3783516%2F5f595b19-bafe-4d58-85e5-0d83a323d253.jpg</url>
      <title>DEV Community: Shah Fahad</title>
      <link>https://dev.to/sfahad</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sfahad"/>
    <language>en</language>
    <item>
      <title>Roofline Model: Why Your Kernel Is Slow, Geometrically</title>
      <dc:creator>Shah Fahad</dc:creator>
      <pubDate>Sun, 10 May 2026 10:31:05 +0000</pubDate>
      <link>https://dev.to/sfahad/roofline-model-why-your-kernel-is-slow-geometrically-2d3l</link>
      <guid>https://dev.to/sfahad/roofline-model-why-your-kernel-is-slow-geometrically-2d3l</guid>
      <description>&lt;p&gt;Every kernel does two kinds of work: it performs arithmetic, and it moves data. A kernel is fast only when both sides are used well. If the arithmetic units are waiting for data, the kernel is memory-bandwidth-limited. If data is arriving fast enough but the arithmetic units are saturated, the kernel is compute-limited.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Roofline performance model&lt;/strong&gt; helps separate those two cases. It gives a first-order way to ask: is this kernel limited by peak arithmetic throughput, by peak memory bandwidth, or by inefficient use of the hardware, where neither peak compute throughput nor peak memory bandwidth is being achieved?&lt;/p&gt;

&lt;p&gt;The model is intentionally simplified. It ignores most hardware details and keeps only the first-order limits that matter for performance: how fast the machine can do FLOPs, and how fast it can move bytes.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Simplified Machine
&lt;/h2&gt;

&lt;p&gt;Roofline starts with a deliberately simple picture of the hardware:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+----------------------+   bytes moved through HBM   +----------------------+
| GPU SMs              | &amp;lt;-------------------------&amp;gt; | HBM                  |
|                      |                             |                      |
| peak_FLOPs_per_sec   |                             | peak_BW              |
+----------------------+                             +----------------------+

        compute work: FLOPs                 memory work: bytes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The left side is the processor. For this article, assume the compute units are a GPU's SMs. We summarize all of them with one number: the maximum rate at which they can perform floating-point work, &lt;code&gt;peak_FLOPs_per_sec&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The right side is the memory the processor needs to read from and write to. For this article, assume that memory is HBM. We summarize the path between HBM and the SMs with one number too: the maximum rate at which bytes can be transferred between HBM and the compute units, &lt;code&gt;peak_BW&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For a kernel to run, two things must happen:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;memory work:   bytes must move to or from HBM
compute work:  FLOPs must be executed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Terminology matters here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FLOPs&lt;/strong&gt; means total floating-point operations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FLOP/s&lt;/strong&gt; means a rate: floating-point operations per second.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Arithmetic Intensity
&lt;/h2&gt;

&lt;p&gt;The key quantity in Roofline is &lt;strong&gt;arithmetic intensity&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Arithmetic Intensity (AI) = FLOPs / bytes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It measures how much computation the kernel performs for each byte moved to or from HBM. A low-AI kernel moves a lot of data for each unit of arithmetic. A high-AI kernel reuses data well and performs many FLOPs per byte.&lt;/p&gt;

&lt;p&gt;There are two versions of AI, and the difference between them matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Algorithmic AI&lt;/strong&gt; is the ideal value implied by the algorithm itself. You count the FLOPs the algorithm must perform, then divide by the minimum bytes the algorithm must move. In this view, every input is loaded only when it is truly needed, reused perfectly after that, and every output is written only as required. Algorithmic AI answers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If memory reuse were perfect, how many FLOPs could this algorithm get per byte?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Observed AI&lt;/strong&gt; is what the implemented kernel actually achieves at runtime. You still count FLOPs, but now the byte count comes from the real traffic through HBM. If the same value is loaded multiple times, those bytes count multiple times. If an uncoalesced access fetches a full memory sector but uses only part of it, the fetched bytes count. If register spills or cache misses create extra traffic, those bytes count too. Observed AI answers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Given the traffic this implementation really generated, how many FLOPs did it get per byte?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By construction, under consistent FLOP accounting and the same memory boundary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;observed AI &amp;lt;= algorithmic AI
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Roofline point uses &lt;strong&gt;observed AI&lt;/strong&gt;. The algorithmic AI is an upper-bound reference: it tells you how far right the kernel should be able to move if wasted memory traffic is removed.&lt;/p&gt;
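&lt;p&gt;The distinction can be made concrete with a small sketch. The FLOP and byte counts below are made-up placeholders; only the structure of the two ratios matters:&lt;/p&gt;

```python
# Both AI variants side by side. The only difference is the denominator:
# ideal (compulsory) traffic vs. the traffic a profiler actually measured.

def arithmetic_intensity(flops, bytes_moved):
    """FLOPs per byte crossing the chosen memory boundary."""
    return flops / bytes_moved

flops = 1.0e9             # FLOPs the algorithm performs (placeholder)
compulsory_bytes = 1.0e8  # minimum traffic the algorithm requires (placeholder)
measured_bytes = 4.0e8    # traffic actually observed at runtime (placeholder)

algorithmic_ai = arithmetic_intensity(flops, compulsory_bytes)  # 10.0
observed_ai = arithmetic_intensity(flops, measured_bytes)       # 2.5

# Under consistent accounting, observed AI can only be lower.
assert observed_ai <= algorithmic_ai
```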




&lt;h2&gt;
  
  
  Drawing the Roofline
&lt;/h2&gt;

&lt;p&gt;Now take the simplified machine and run one kernel on it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Plot the kernel point
&lt;/h3&gt;

&lt;p&gt;At runtime, suppose the kernel:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;moves some total number of bytes to or from HBM&lt;/li&gt;
&lt;li&gt;performs some total number of FLOPs&lt;/li&gt;
&lt;li&gt;takes some amount of time to finish&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From those observed quantities, we compute two values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;observed AI          = FLOPs / bytes moved through HBM
achieved performance = FLOPs / time
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These two values become the kernel's point on the Roofline chart:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x-position = observed AI
y-position = achieved performance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
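&lt;p&gt;Computing the point from measurements can be sketched directly; the three inputs are placeholders standing in for profiler output:&lt;/p&gt;

```python
# Turning three measured quantities into the kernel's Roofline point.

flops = 8.0e12       # total FLOPs the kernel executed (placeholder)
hbm_bytes = 1.0e11   # total bytes moved through HBM (placeholder)
seconds = 0.02       # kernel wall-clock time (placeholder)

observed_ai = flops / hbm_bytes    # x-position: 80 FLOPs/byte
achieved_flops = flops / seconds   # y-position: 4e14 FLOP/s
```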



&lt;p&gt;Next we draw the two hardware limits.&lt;/p&gt;

&lt;h3&gt;
  
  
  Draw the compute roof
&lt;/h3&gt;

&lt;p&gt;The first limit comes from the SMs. No matter how much data reuse the kernel has, it cannot run faster than the maximum arithmetic throughput of the SMs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;compute roof = peak_FLOPs_per_sec
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a horizontal line on the chart.&lt;/p&gt;

&lt;h3&gt;
  
  
  Draw the bandwidth roof
&lt;/h3&gt;

&lt;p&gt;The second limit comes from HBM bandwidth. The hardware has some peak HBM bandwidth, &lt;code&gt;peak_BW&lt;/code&gt;, measured in bytes per second. For a kernel with arithmetic intensity &lt;code&gt;AI&lt;/code&gt;, every byte moved from HBM supports &lt;code&gt;AI&lt;/code&gt; FLOPs. So if the kernel could use the full HBM bandwidth, the maximum compute throughput that HBM could feed is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bandwidth roof = AI × peak_BW
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a diagonal line. At low AI, each FLOP requires many bytes, so even peak HBM bandwidth cannot feed enough data to reach the compute roof. As AI increases, each byte supports more FLOPs, so the bandwidth-limited ceiling rises until it eventually meets the compute roof.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;log(FLOP/s)
   ^
   |        bandwidth roof                     compute roof
   |                         o======================================
   |                       / |
   |                     /   |
   |                   /     |
   |                 /       |
   |               /         |
   |             /           |
   |           /             |
   |         /               |
   |       /                 |
   |     /                   |
   |   /                     |
   +-------------------------+--------------------&amp;gt; log(AI = FLOPs / Bytes)
                             |
                             ridge point

        bandwidth-limited    |    compute-limited
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Roofline is the lower of those two ceilings. At low AI, the diagonal bandwidth roof is lower, so HBM bandwidth is the applicable ceiling. At high AI, the horizontal compute roof is lower, so SM arithmetic throughput is the applicable ceiling.&lt;/p&gt;
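&lt;p&gt;The combined ceiling can be written as a one-line function. The peak numbers below are illustrative placeholders, not any specific GPU's datasheet values:&lt;/p&gt;

```python
def attainable_flops(ai, peak_flops_per_sec, peak_bw):
    """The Roofline: the lower of the compute roof and the bandwidth roof
    at arithmetic intensity `ai` (FLOPs per byte)."""
    return min(peak_flops_per_sec, ai * peak_bw)

# Illustrative peaks only.
PEAK_FLOPS = 1.0e15   # FLOP/s
PEAK_BW = 3.0e12      # bytes/s

assert attainable_flops(10, PEAK_FLOPS, PEAK_BW) == 3.0e13    # bandwidth side
assert attainable_flops(1000, PEAK_FLOPS, PEAK_BW) == 1.0e15  # compute side
```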

&lt;h3&gt;
  
  
  Find the ridge point
&lt;/h3&gt;

&lt;p&gt;The point where the two ceilings meet is the &lt;strong&gt;ridge point&lt;/strong&gt;, also called the &lt;em&gt;machine balance&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AI_ridge = peak_FLOPs_per_sec / peak_BW
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now compare the kernel's observed AI — the x-position we computed from runtime FLOPs and HBM bytes — against this ridge point.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Observed AI left of the ridge:&lt;/strong&gt; the bandwidth roof is the lower ceiling. At this AI, even perfect HBM bandwidth utilization would not reach peak compute throughput. If the point sits below the diagonal roof, the implementation is also failing to use the available bandwidth efficiently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observed AI right of the ridge:&lt;/strong&gt; the compute roof is the lower ceiling. At this AI, HBM bandwidth is high enough in the Roofline model, so peak arithmetic throughput becomes the main limit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The exact ridge location depends on the machine and on which compute peak you choose. The important question is not the absolute value of the ridge; it is whether the kernel's observed AI lands to the left or right of it.&lt;/p&gt;
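&lt;p&gt;The left-or-right test is a single comparison. The peak values below are rough placeholders in the range of a recent datacenter GPU, not vendor-exact figures:&lt;/p&gt;

```python
def regime(observed_ai, peak_flops_per_sec, peak_bw):
    """Classify a kernel point relative to the ridge point (machine balance)."""
    ai_ridge = peak_flops_per_sec / peak_bw
    return "bandwidth-limited" if observed_ai < ai_ridge else "compute-limited"

# Placeholder peaks: ridge = 1e15 / 3.35e12, roughly 298 FLOPs/byte.
PEAK_FLOPS = 1.0e15
PEAK_BW = 3.35e12

assert regime(80, PEAK_FLOPS, PEAK_BW) == "bandwidth-limited"
assert regime(1000, PEAK_FLOPS, PEAK_BW) == "compute-limited"
```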




&lt;h2&gt;
  
  
  The Two Diagnostic Gaps
&lt;/h2&gt;

&lt;p&gt;Once the kernel point is on the chart, there are two different questions to ask.&lt;/p&gt;

&lt;p&gt;First: &lt;strong&gt;is the point below the Roofline at its current observed AI?&lt;/strong&gt; That is a vertical gap.&lt;/p&gt;

&lt;p&gt;Second: &lt;strong&gt;is the observed AI far left of the algorithmic AI?&lt;/strong&gt; That is a horizontal gap.&lt;/p&gt;

&lt;p&gt;These gaps mean different things. A vertical gap means the kernel is not using the relevant hardware limit efficiently. A horizontal gap means the kernel is moving more bytes than the algorithm ideally requires.&lt;/p&gt;
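&lt;p&gt;Both gaps reduce to simple ratios. A minimal sketch, with every input hypothetical:&lt;/p&gt;

```python
def diagnose(observed_ai, achieved, algorithmic_ai, peak_flops, peak_bw):
    """Return the two gap ratios; a value above 1 means there is a gap to close."""
    roof = min(peak_flops, observed_ai * peak_bw)   # applicable ceiling at this AI
    vertical_gap = roof / achieved                  # headroom below the roof
    horizontal_gap = algorithmic_ai / observed_ai   # wasted-traffic factor
    return vertical_gap, horizontal_gap

v, h = diagnose(observed_ai=50, achieved=5.0e13,
                algorithmic_ai=200, peak_flops=1.0e15, peak_bw=3.0e12)
# roof = min(1e15, 1.5e14) = 1.5e14: a 3x vertical gap and a 4x horizontal gap
assert v == 3.0 and h == 4.0
```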

&lt;h3&gt;
  
  
  Vertical Gap: Below the Roofline
&lt;/h3&gt;

&lt;p&gt;At a fixed observed AI, the Roofline tells you the best performance the hardware could provide. If the kernel point sits below that roof, the kernel is not reaching the applicable ceiling.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;log(FLOP/s)
   ^
   |        bandwidth roof                     compute roof
   |                         R==============================
   |                       / |                         |
   |                     /   |                         |
   |                   /     |                         v compute-side gap
   |                 /       |                         C
   |               /         |
   |             /           |
   |           /             |
   |         /               |
   |       /                 |
   |     / |                 |
   |   /   v memory-side gap |
   | /     M                 |
   +-------------------------+-------------------------&amp;gt; log(AI = FLOPs / Bytes)
                             |
                         ridge point

        bandwidth-limited    |    compute-limited
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Point &lt;code&gt;M&lt;/code&gt; is left of the ridge, so its applicable roof is the diagonal bandwidth roof. The vertical distance between the bandwidth roof and &lt;code&gt;M&lt;/code&gt; means the kernel is not achieving peak HBM bandwidth for its current observed AI. The bytes are what they are, but they are not being moved fast enough.&lt;/p&gt;

&lt;p&gt;Common reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Too few memory operations in flight to saturate HBM bandwidth.&lt;/li&gt;
&lt;li&gt;Synchronous loads that stall instead of overlapping with compute.&lt;/li&gt;
&lt;li&gt;Poor producer-consumer overlap: load phase, then compute phase, instead of a pipeline.&lt;/li&gt;
&lt;li&gt;HBM row-buffer thrashing or memory channel imbalance.&lt;/li&gt;
&lt;li&gt;A problem too small to expose enough parallelism.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Point &lt;code&gt;C&lt;/code&gt; is right of the ridge, so its applicable roof is the horizontal compute roof. The vertical distance between the compute roof and &lt;code&gt;C&lt;/code&gt; means the kernel is not achieving peak arithmetic throughput. HBM bandwidth is no longer the limiting ceiling; the SMs are not being kept fully productive.&lt;/p&gt;

&lt;p&gt;Common reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shared-memory bank conflicts that stall operand delivery.&lt;/li&gt;
&lt;li&gt;Register dependency chains or low instruction-level parallelism.&lt;/li&gt;
&lt;li&gt;Warp divergence.&lt;/li&gt;
&lt;li&gt;Not enough asynchronous-MMA work in flight to hide latency.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Horizontal Gap: Observed AI vs Algorithmic AI
&lt;/h3&gt;

&lt;p&gt;The horizontal gap is different. It compares the kernel's &lt;strong&gt;observed AI&lt;/strong&gt; to the algorithm's &lt;strong&gt;algorithmic AI&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;observed AI    = FLOPs / actual HBM bytes moved
algorithmic AI = FLOPs / minimum bytes required by the algorithm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If observed AI is far to the left of algorithmic AI, the implementation is moving extra bytes. The FLOPs may be the same, but the denominator is larger than it should be.&lt;/p&gt;

&lt;p&gt;Common reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uncoalesced or scattered global loads that over-fetch memory sectors.&lt;/li&gt;
&lt;li&gt;Cache thrashing: data is evicted before reuse and loaded again.&lt;/li&gt;
&lt;li&gt;Redundant loads across thread blocks.&lt;/li&gt;
&lt;li&gt;Register spills that create local-memory traffic.&lt;/li&gt;
&lt;li&gt;Unfused kernels that write intermediates to HBM and read them back later.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not about whether HBM bandwidth is saturated. A kernel can sit exactly on the bandwidth roof and still have a large horizontal gap. That means it is moving too many bytes, but moving them efficiently.&lt;/p&gt;




&lt;h2&gt;
  
  
  Optimization as a 2D Walk
&lt;/h2&gt;

&lt;p&gt;Once you have the two-gap picture, you can think of optimization as walking the dot on the chart. Every optimization moves the dot in a specific direction.&lt;/p&gt;

&lt;h3&gt;
  
  
  Right (raise observed AI)
&lt;/h3&gt;

&lt;p&gt;These optimizations change &lt;em&gt;which bytes&lt;/em&gt; (or &lt;em&gt;how many bytes&lt;/em&gt;) cross the memory boundary you are measuring. Same FLOPs, fewer bytes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tiling and blocking for shared memory and registers.&lt;/li&gt;
&lt;li&gt;Larger thread-block tiles so each loaded byte is reused more times before being evicted.&lt;/li&gt;
&lt;li&gt;Kernel fusion — eliminate HBM round trips for intermediates.&lt;/li&gt;
&lt;li&gt;Multicast loads and cache-residency hints.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The reference target for "right" is the algorithmic AI. Under consistent accounting, it is the upper bound on how much HBM traffic reduction can improve observed AI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Up (raise achieved FLOP/s at current AI)
&lt;/h3&gt;

&lt;p&gt;These optimizations change &lt;em&gt;how fast&lt;/em&gt; the bytes already in flight are processed. Same bytes, higher throughput.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increase ILP — issue more independent MMAs before any synchronizing instruction.&lt;/li&gt;
&lt;li&gt;Software pipelining and double- or multi-buffering to overlap data movement with compute.&lt;/li&gt;
&lt;li&gt;Eliminate shared-memory bank conflicts using the right swizzled layouts.&lt;/li&gt;
&lt;li&gt;Coalesce global memory accesses (this also shifts the dot right, so it's a both-axis optimization).&lt;/li&gt;
&lt;li&gt;Tune occupancy: raise it when the kernel is latency-bound, lower it when fewer threads with more registers each improve ILP. Both cases occur in practice.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The target for "up" is the applicable roof — bandwidth roof if you're memory-bandwidth-limited, compute roof if you're compute-limited.&lt;/p&gt;

&lt;h3&gt;
  
  
  Up-and-right (combined)
&lt;/h3&gt;

&lt;p&gt;This is the typical trajectory of a real rewrite: naive triple-loop GEMM → tiled GEMM crosses the ridge point and continues climbing toward the compute roof. The dot may zig-zag a bit as you fix one bottleneck and uncover the next.&lt;/p&gt;

&lt;p&gt;There is one important constraint to internalize: &lt;strong&gt;once the dot is well into the compute-limited region, further increases in AI are usually no longer the main lever.&lt;/strong&gt; In the simple Roofline model, the horizontal compute roof is now the ceiling. At that point, the important question is how close the kernel gets to peak compute throughput.&lt;/p&gt;

&lt;p&gt;This is why people working on GEMM kernels at any reasonable size obsess over the vertical gap: their algorithmic AI is so far past the ridge that additional tiling buys little, and the remaining question is how close to peak compute they can get.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Optimization Mental Model
&lt;/h2&gt;

&lt;p&gt;Before optimizing, ask what the change is supposed to improve.&lt;/p&gt;

&lt;p&gt;If the optimization reduces the number of HBM bytes needed for the same FLOPs, it increases observed AI. The point moves &lt;strong&gt;right&lt;/strong&gt;. These are reuse and traffic-reduction optimizations.&lt;/p&gt;

&lt;p&gt;If the optimization keeps the same observed AI but makes the kernel run faster, it increases achieved FLOP/s. The point moves &lt;strong&gt;up&lt;/strong&gt;. These are utilization and pipelining optimizations.&lt;/p&gt;

&lt;p&gt;Some optimizations do both: they reduce traffic and improve throughput. But the distinction is still useful, because it tells you what movement you should expect on the chart.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Optimization&lt;/th&gt;
&lt;th&gt;Primary axis&lt;/th&gt;
&lt;th&gt;What changes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Coalescing&lt;/td&gt;
&lt;td&gt;X (right)&lt;/td&gt;
&lt;td&gt;Fewer over-fetched sector bytes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Async copy / pipelining&lt;/td&gt;
&lt;td&gt;Y (up, memory-bandwidth-limited side)&lt;/td&gt;
&lt;td&gt;Memory latency is hidden and bandwidth utilization improves.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SMEM swizzling / bank-conflict fixes&lt;/td&gt;
&lt;td&gt;Y (up, compute-limited side)&lt;/td&gt;
&lt;td&gt;Same HBM bytes, math pipe stalls less.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tiling / blocking&lt;/td&gt;
&lt;td&gt;X (right)&lt;/td&gt;
&lt;td&gt;Same algorithm, fewer HBM round trips per FLOP.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kernel fusion&lt;/td&gt;
&lt;td&gt;X (right)&lt;/td&gt;
&lt;td&gt;Eliminates HBM round trips for intermediates.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multicast loads&lt;/td&gt;
&lt;td&gt;X (right)&lt;/td&gt;
&lt;td&gt;One thread block's load serves many; eliminates redundant cross-block traffic.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  GEMM as a Roofline Example
&lt;/h2&gt;

&lt;p&gt;Now tie the pieces together with GEMM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;C = A × B
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Assume a square GEMM where &lt;code&gt;M = N = K&lt;/code&gt;, with FP16 inputs and FP16 output. The algorithmic work is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FLOPs = 2 * M^3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For this simplified &lt;code&gt;C = A × B&lt;/code&gt; case, with no read of an old &lt;code&gt;C&lt;/code&gt; value, the compulsory HBM traffic is one read of &lt;code&gt;A&lt;/code&gt;, one read of &lt;code&gt;B&lt;/code&gt;, and one write of &lt;code&gt;C&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Compulsory bytes = 6 * M^2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So the algorithmic AI is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;algorithmic AI = FLOPs / compulsory bytes = M / 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the rightward target. It says that as the matrix grows, GEMM can theoretically do more and more FLOPs for each byte moved from HBM. A large GEMM should therefore live far to the right on the Roofline chart, usually in the compute-limited region.&lt;/p&gt;
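&lt;p&gt;The counting above can be sketched directly:&lt;/p&gt;

```python
def gemm_algorithmic_ai(m, bytes_per_element=2):
    """Algorithmic AI of square FP16 GEMM C = A x B (no read of an old C)."""
    flops = 2 * m**3                           # one multiply-add = 2 FLOPs per (i, j, k)
    compulsory = 3 * m**2 * bytes_per_element  # read A, read B, write C once each
    return flops / compulsory

assert gemm_algorithmic_ai(3) == 1.0          # M / 3 with M = 3
assert gemm_algorithmic_ai(4096) == 4096 / 3  # grows linearly with M
```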

&lt;p&gt;But that only describes the algorithm. The implementation still has to earn that AI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Naive GEMM
&lt;/h3&gt;

&lt;p&gt;A naive implementation might reload the same elements of &lt;code&gt;A&lt;/code&gt; and &lt;code&gt;B&lt;/code&gt; many times from HBM. The FLOP count is still the GEMM FLOP count, but the HBM byte count is much larger than the compulsory byte count.&lt;/p&gt;

&lt;p&gt;That means:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;observed AI &amp;lt;&amp;lt; algorithmic AI
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the chart, the point moves far left. If it is left of the ridge, the kernel is in the bandwidth-limited regime. If it is also below the diagonal roof, then it has both problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a &lt;strong&gt;horizontal gap&lt;/strong&gt;, because it moves too many HBM bytes&lt;/li&gt;
&lt;li&gt;a &lt;strong&gt;vertical memory-side gap&lt;/strong&gt;, because it is not using peak HBM bandwidth efficiently&lt;/li&gt;
&lt;/ul&gt;
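&lt;p&gt;A cache-less model of that worst case shows how far left the point can fall. The traffic model is a deliberate simplification: it assumes every operand of every multiply-add is re-read from HBM:&lt;/p&gt;

```python
def naive_gemm_observed_ai(m, bytes_per_element=2):
    """Observed AI of a cache-less naive FP16 GEMM."""
    flops = 2 * m**3
    loads = 2 * m**3 * bytes_per_element  # A and B re-read on every inner iteration
    stores = m**2 * bytes_per_element     # C written once
    return flops / (loads + stores)

ai = naive_gemm_observed_ai(4096)
assert 0.49 < ai < 0.5  # pinned near 0.5 FLOPs/byte, independent of M
```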

&lt;h3&gt;
  
  
  Tiled GEMM
&lt;/h3&gt;

&lt;p&gt;Tiling attacks the horizontal gap. Instead of loading an element from HBM every time it is used, the kernel loads a tile once and reuses it many times from faster on-chip storage.&lt;/p&gt;

&lt;p&gt;The FLOPs are the same, but HBM bytes go down:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;observed AI increases
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the chart, the point moves right toward the algorithmic AI. If the point crosses the ridge, the applicable ceiling changes: HBM bandwidth is no longer the lower roof, and the kernel moves into the compute-limited region.&lt;/p&gt;
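&lt;p&gt;A standard blocking estimate shows how the tile size sets the new observed AI. The traffic model below is approximate: with square output tiles, each of &lt;code&gt;A&lt;/code&gt; and &lt;code&gt;B&lt;/code&gt; is streamed from HBM roughly &lt;code&gt;M / tile&lt;/code&gt; times in full:&lt;/p&gt;

```python
def tiled_gemm_observed_ai(m, tile, bytes_per_element=2):
    """Approximate observed AI of an FP16 GEMM with tile x tile output blocks."""
    flops = 2 * m**3
    loads = 2 * (m**3 // tile) * bytes_per_element  # A and B traffic under blocking
    stores = m**2 * bytes_per_element               # C written once
    return flops / (loads + stores)

ai = tiled_gemm_observed_ai(4096, tile=128)
assert 60 < ai < 64  # approaches tile / 2 = 64 FLOPs/byte for large M
```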

&lt;h3&gt;
  
  
  Well-pipelined GEMM
&lt;/h3&gt;

&lt;p&gt;Once the kernel is in the compute-limited region, moving further right is not enough. The applicable ceiling is now the horizontal compute roof. The remaining problem is the vertical gap between the point and that roof.&lt;/p&gt;

&lt;p&gt;Now the question becomes: are the SMs kept busy?&lt;/p&gt;

&lt;p&gt;Optimizations in this phase do not primarily reduce HBM bytes. They improve achieved FLOP/s at roughly the same observed AI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;overlap HBM loads with computation&lt;/li&gt;
&lt;li&gt;keep enough independent math work in flight&lt;/li&gt;
&lt;li&gt;avoid shared-memory bank conflicts&lt;/li&gt;
&lt;li&gt;avoid long dependency chains&lt;/li&gt;
&lt;li&gt;reduce synchronization stalls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On the chart, these changes move the point up toward the compute roof.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Roofline Reading
&lt;/h3&gt;

&lt;p&gt;GEMM is useful because it shows both directions clearly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bad reuse moves the point left of algorithmic AI.&lt;/li&gt;
&lt;li&gt;Better tiling moves the point right.&lt;/li&gt;
&lt;li&gt;Poor scheduling or operand delivery leaves the point vertically below the applicable roof.&lt;/li&gt;
&lt;li&gt;Better pipelining and utilization move the point up.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The ideal GEMM implementation is therefore not just "high AI" and not just "high FLOP/s." It is both: observed AI close to algorithmic AI, and achieved performance close to the applicable Roofline ceiling.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;Roofline is useful because it turns performance tuning into a sequence of concrete questions.&lt;/p&gt;

&lt;p&gt;Measure the kernel's FLOPs, HBM bytes, and time. Compute observed AI and achieved FLOP/s. Place the point on the chart. Then ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is observed AI left or right of the ridge?&lt;/li&gt;
&lt;li&gt;Is the point below the applicable roof?&lt;/li&gt;
&lt;li&gt;Is observed AI far left of algorithmic AI?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those answers tell you the next direction. Move right by reducing HBM traffic. Move up by improving utilization of the current limiting resource. If the point is already near the roof and near algorithmic AI, the kernel is close to what this model says the hardware can do.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>computerscience</category>
      <category>performance</category>
      <category>programming</category>
    </item>
    <item>
      <title>Programming Hopper GPUs: The Memory Consistency Model</title>
      <dc:creator>Shah Fahad</dc:creator>
      <pubDate>Sat, 25 Apr 2026 14:05:13 +0000</pubDate>
      <link>https://dev.to/sfahad/programming-hopper-gpus-the-memory-consistency-model-24m7</link>
      <guid>https://dev.to/sfahad/programming-hopper-gpus-the-memory-consistency-model-24m7</guid>
      <description>&lt;p&gt;You've decided to write fast code for an NVIDIA Hopper GPU. Maybe you want to build a custom attention kernel. Maybe you're trying to understand how CUTLASS and ThunderKittens work under the hood. Either way, before you can use any of the cool Hopper hardware — TMA, wgmma, mbarriers, clusters — you need to understand one thing: &lt;strong&gt;how memory works when thousands of threads share it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's what the &lt;em&gt;memory consistency model&lt;/em&gt; describes. It's the rulebook for what one thread can or cannot see about another thread's writes. Without it, the rest of the stack is undefined behavior waiting to happen.&lt;/p&gt;

&lt;p&gt;This article covers the minimum you need to write correct multi-threaded GPU code. We'll build it from one concrete bug, then introduce the two primitives that fix it.&lt;/p&gt;

&lt;p&gt;Code examples are schematic unless noted. Full PTX spells out scope, state space, and type — for example &lt;code&gt;fence.release.gpu&lt;/code&gt; or &lt;code&gt;st.release.gpu.global.u32&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The bug we're trying to prevent
&lt;/h2&gt;

&lt;p&gt;Imagine the simplest possible producer-consumer pattern. One thread fills a buffer, then sets a flag to say "I'm done." Another thread waits for the flag, then reads the buffer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Producer (thread T1):              Consumer (thread T2):

  data = 42;       // (1)            while (flag == 0) {}    // (3) wait for flag
  flag = 1;        // (2)            x = data;               // (4) expects 42
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looks fine, right? It's how you'd write it in any single-threaded language. But on a GPU (or any modern CPU, for that matter), this code can produce &lt;code&gt;x == 0&lt;/code&gt; instead of &lt;code&gt;x == 42&lt;/code&gt;. Here's why.&lt;/p&gt;

&lt;p&gt;From the producer's own perspective, &lt;code&gt;(1)&lt;/code&gt; and &lt;code&gt;(2)&lt;/code&gt; are two completely unrelated stores to two different memory locations. There's nothing in T1's own code that depends on the order in which they hit memory. So the &lt;strong&gt;compiler is free to reorder them&lt;/strong&gt;, and the &lt;strong&gt;hardware is free to commit them out of order&lt;/strong&gt; — neither change is visible from inside T1.&lt;/p&gt;

&lt;p&gt;But the consumer T2 &lt;em&gt;can&lt;/em&gt; see the reorder. If &lt;code&gt;flag = 1&lt;/code&gt; arrives in memory &lt;em&gt;before&lt;/em&gt; &lt;code&gt;data = 42&lt;/code&gt; does, then T2 might see &lt;code&gt;flag == 1&lt;/code&gt;, exit its wait loop, and read &lt;code&gt;data&lt;/code&gt; while it's still &lt;code&gt;0&lt;/code&gt;. The consumer's assumption — "if the flag is set, the data is ready" — silently breaks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;What T1 thinks happens:        What T2 might actually see:

  data = 42                      flag = 1     ← becomes visible first
  flag = 1                       data = 42    ← arrives later

  T2 reads:                      T2 reads:
    flag → 1 ✓                     flag → 1 ✓ (exits wait loop)
    data → 42 ✓                    data → 0 ✗ (reads stale value!)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;This is the central problem of shared-memory programming.&lt;/strong&gt; A program that looks logically correct can fail because the system reorders operations that are independent from the issuing thread's perspective but not independent from another thread's perspective.&lt;/p&gt;

&lt;p&gt;The memory consistency model gives us tools to prevent this. The two main tools are called the &lt;strong&gt;release fence&lt;/strong&gt; and the &lt;strong&gt;acquire fence&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The big idea: a pair of fences
&lt;/h2&gt;

&lt;p&gt;Think of fences as a contract between the producer and the consumer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Producer side:&lt;/strong&gt; "I promise to &lt;em&gt;publish&lt;/em&gt; all my prior work before I tell anyone I'm done."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consumer side:&lt;/strong&gt; "I promise to wait for the announcement, then &lt;em&gt;read fresh data&lt;/em&gt; — not anything I might have cached earlier."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The producer holds up its end of the contract using a &lt;strong&gt;release fence&lt;/strong&gt;. The consumer holds up its end using an &lt;strong&gt;acquire fence&lt;/strong&gt;. When both sides cooperate, the bug above goes away.&lt;/p&gt;

&lt;p&gt;Let's see exactly what each fence does, then we'll put them together to fix the bug.&lt;/p&gt;




&lt;h2&gt;
  
  
  The release fence
&lt;/h2&gt;

&lt;p&gt;A release fence sits between two pieces of producer code. Conceptually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  data = 42;          // some prior work

  fence.release;      // "publish everything above me before anything below me"

  flag = 1;           // the announcement
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things to understand:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The compiler cannot move prior memory operations past the fence.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without the fence, a compiler optimization might reorder &lt;code&gt;data = 42&lt;/code&gt; to come &lt;em&gt;after&lt;/em&gt; &lt;code&gt;flag = 1&lt;/code&gt;. With &lt;code&gt;fence.release&lt;/code&gt; in between, that's forbidden. The fence is a one-way wall: stuff above stays above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Without the fence:                With fence.release:

  data = 42                         data = 42
  flag = 1                          fence.release  ← prior writes pinned above
                                    flag = 1
  ⇒ compiler/hardware may swap     ⇒ data = 42 stays before flag = 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The fence is one-way, not two-way.&lt;/strong&gt; This is a subtle but important point. The release fence only blocks downward motion — operations &lt;em&gt;above&lt;/em&gt; it can't sink below. Operations &lt;em&gt;below&lt;/em&gt; the fence are technically allowed to move above it; the fence does not pin them in place. This is by design: the release only promises "everything above me is published," so it has nothing to say about what comes after.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   memory op above                    ← cannot sink below the fence
   memory op above                    ← cannot sink below the fence
   ─── fence.release ───
   memory op below                    ← in general, CAN move above the fence
   memory op below                    ← in general, CAN move above the fence
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This sounds dangerous — wouldn't the compiler also be free to move the &lt;code&gt;flag = 1&lt;/code&gt; write above the fence? In source PTX, the answer is no for the store that forms the release pattern: NVIDIA defines the pattern by program order, so moving that store would change the program's synchronization behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The fence makes the prior writes "publishable."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the more subtle part. After the fence, when the producer writes &lt;code&gt;flag = 1&lt;/code&gt;, that flag write effectively &lt;em&gt;carries with it&lt;/em&gt; the promise that everything before the fence is also visible. It's like attaching a receipt to the announcement: "by the way, my data is ready too."&lt;/p&gt;

&lt;p&gt;This is called a &lt;strong&gt;release pattern&lt;/strong&gt;: a &lt;code&gt;fence.release&lt;/code&gt; followed by a write to a flag. The pair together is what publishes the producer's prior work to anyone watching the flag.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern exception — the paired flag write is pinned.&lt;/strong&gt; The "later operations can move above the fence" rule has one exception: the &lt;em&gt;specific strong write&lt;/em&gt; that pairs with the fence to form the release pattern. NVIDIA's model defines &lt;code&gt;fence.release; st.relaxed [flag], 1&lt;/code&gt; as a release pattern because those instructions occur in that program order. A compiler or assembler cannot preserve the same PTX semantics while hoisting that store above the fence, because doing so would dismantle the release pattern entirely.&lt;/p&gt;

&lt;p&gt;So in practice, the rules around a release fence are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Anything before the fence stays before it.&lt;/li&gt;
&lt;li&gt;Unrelated memory operations after the fence are not protected by the release semantics in the same way.&lt;/li&gt;
&lt;li&gt;The specific strong flag write that pairs with the fence must stay after the fence — it's the whole point of the pattern.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;One important nuance:&lt;/strong&gt; the fence does NOT eagerly broadcast anything to other threads. It just guarantees an &lt;em&gt;ordering&lt;/em&gt; — that the data is ready in memory before the flag is. Other threads will observe the change at their own pace; the release fence doesn't push anything to them. This is why we'll need a spin loop on the consumer side.&lt;/p&gt;




&lt;h2&gt;
  
  
  The acquire fence
&lt;/h2&gt;

&lt;p&gt;The acquire fence is the mirror image. It sits between two pieces of consumer code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  flag_value = flag;   // read the announcement

  fence.acquire;       // "anything below me sees a fresh view of memory"

  x = data;            // read the data, guaranteed up-to-date
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things to understand:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The compiler cannot move later memory operations before the fence.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without the fence, a compiler might prefetch &lt;code&gt;data&lt;/code&gt; into a register before the flag check (a common optimization). With &lt;code&gt;fence.acquire&lt;/code&gt;, that's forbidden. The fence is again a one-way wall, but in the opposite direction this time: stuff below stays below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Without the fence:                With fence.acquire:

  flag_value = flag                 flag_value = flag
  x = data         ← may be         fence.acquire  ← later reads pinned below
                     prefetched     x = data
                     above flag!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The fence is one-way, in the opposite direction from release.&lt;/strong&gt; Symmetrically with &lt;code&gt;fence.release&lt;/code&gt;, the acquire fence only blocks upward motion — operations &lt;em&gt;below&lt;/em&gt; it can't hoist above. Operations &lt;em&gt;above&lt;/em&gt; the fence are technically allowed to move below it; the fence doesn't pin them in place. This is by design: the acquire only promises "everything below me sees a fresh view," so it has nothing to say about what came before.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   memory op above                    ← in general, CAN move below the fence
   memory op above                    ← in general, CAN move below the fence
   ─── fence.acquire ───
   memory op below                    ← cannot hoist above the fence
   memory op below                    ← cannot hoist above the fence
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Just like for release, this raises a question: couldn't the compiler also move the &lt;code&gt;flag_value = flag&lt;/code&gt; read &lt;em&gt;below&lt;/em&gt; the fence? Not for the read that forms the acquire pattern: NVIDIA defines the pattern by program order, so moving that read would change the synchronization behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The fence makes subsequent reads "fresh."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After the fence, later reads are ordered after any matching release that was observed through the flag. In other words, the consumer cannot use an old value of &lt;code&gt;data&lt;/code&gt; from before the producer's published write; it must see the released write or a later write in that location's coherence order.&lt;/p&gt;

&lt;p&gt;This is called an &lt;strong&gt;acquire pattern&lt;/strong&gt;: a read of a flag, followed by a &lt;code&gt;fence.acquire&lt;/code&gt;. The pair together is what makes the consumer pick up the producer's published data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern exception — the paired flag read is pinned.&lt;/strong&gt; The "earlier operations can move below the fence" rule has one exception: the &lt;em&gt;specific strong read&lt;/em&gt; that pairs with the fence to form the acquire pattern. NVIDIA's model defines &lt;code&gt;ld.relaxed [flag]; fence.acquire&lt;/code&gt; as an acquire pattern because those instructions occur in that program order. Moving the flag read below the fence would dismantle the pattern.&lt;/p&gt;

&lt;p&gt;So in practice, the rules around an acquire fence are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Anything after the fence stays after it.&lt;/li&gt;
&lt;li&gt;Unrelated memory operations before the fence are not protected by the acquire semantics in the same way.&lt;/li&gt;
&lt;li&gt;The specific strong flag read that pairs with the fence must stay before the fence — it's the whole point of the pattern.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Putting them together: the bug, fixed
&lt;/h2&gt;

&lt;p&gt;Now let's go back to our buggy code and fix it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Producer (thread T1):                 Consumer (thread T2):

  data = 42;          // (1)            while (1) {
  fence.release;      // (F_R)            flag_value = flag;     // (3)
  flag = 1;           // (2)              if (flag_value == 1) break;
                                        }
                                        fence.acquire;           // (F_A)
                                        x = data;                // (4)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Walk through it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;T1 writes &lt;code&gt;data = 42&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;T1 hits &lt;code&gt;fence.release&lt;/code&gt;. This guarantees &lt;code&gt;data = 42&lt;/code&gt; is committed to memory before anything that follows.&lt;/li&gt;
&lt;li&gt;T1 writes &lt;code&gt;flag = 1&lt;/code&gt;. This is the announcement; it now "carries" the visibility of &lt;code&gt;data = 42&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;T2 spins on &lt;code&gt;flag&lt;/code&gt; until it reads &lt;code&gt;1&lt;/code&gt;. (We need to spin because the flag's new value takes some real time to propagate to T2 — fences don't push, they just promise ordering.)&lt;/li&gt;
&lt;li&gt;T2 hits &lt;code&gt;fence.acquire&lt;/code&gt;. This guarantees subsequent reads see fresh data — no stale cached values can satisfy them.&lt;/li&gt;
&lt;li&gt;T2 reads &lt;code&gt;data&lt;/code&gt;. Because the producer's release published &lt;code&gt;data = 42&lt;/code&gt; before publishing the flag, and T2's acquire ensures fresh reads after seeing the flag, T2 is guaranteed to see &lt;code&gt;42&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The key insight:&lt;/strong&gt; the bug only got fixed because BOTH sides cooperated. The release alone wouldn't help — the consumer would still cache stale data. The acquire alone wouldn't help — the producer's writes might still arrive out of order. Memory ordering is always a &lt;em&gt;contract&lt;/em&gt; between producer and consumer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   Producer side:                       Consumer side:

   data = 42                            spin: flag_value = flag
       │                                              │
       ▼                                              ▼
   fence.release        ─ publishes ─►   fence.acquire   ◄─ acquires
       │                  data + flag    │                  fresh view
       ▼                  together       ▼
   flag = 1                              x = data → 42 ✓
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Shorter forms: baked-in release and acquire
&lt;/h2&gt;

&lt;p&gt;In the example above, the producer wrote two separate instructions for the publish step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fence.release;
st.relaxed [flag], 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PTX provides a shorter form that bakes the release semantics directly into the store. Instead of the two-instruction pair, you can write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = 42;
st.release [flag], 1;     // store with release semantics built in
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For this adjacent publish-store pattern, it gives the same release/acquire synchronization guarantee as &lt;code&gt;fence.release; st.relaxed [flag], 1&lt;/code&gt;. The release behavior is fused into the store, so the release pattern is inherent in one instruction.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important: it has to be &lt;code&gt;st.release&lt;/code&gt;, not &lt;code&gt;st.relaxed&lt;/code&gt;.&lt;/strong&gt; A bare &lt;code&gt;st.relaxed [flag], 1&lt;/code&gt; is &lt;em&gt;not&lt;/em&gt; a release pattern — it's a strong store with no release ordering effect on prior writes. The release pattern requires either &lt;code&gt;st.release&lt;/code&gt; (release baked in) &lt;strong&gt;or&lt;/strong&gt; the explicit pair &lt;code&gt;fence.release; st.relaxed [flag], 1&lt;/code&gt;. Don't drop the &lt;code&gt;fence.release&lt;/code&gt; and assume the &lt;code&gt;.relaxed&lt;/code&gt; qualifier carries any release meaning — it doesn't.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The consumer side has the matching shortcut:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flag_value = ld.acquire [flag];   // load with acquire semantics built in
x = data;                         // sees fresh data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;ld.acquire&lt;/code&gt; is shorthand for "do the read, and include acquire semantics in that same operation." Same synchronization guarantee, one instruction.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Same warning, mirrored.&lt;/strong&gt; A bare &lt;code&gt;ld.relaxed [flag]&lt;/code&gt; is &lt;em&gt;not&lt;/em&gt; an acquire pattern — it's a strong load with no acquire ordering effect on later reads. The acquire pattern requires either &lt;code&gt;ld.acquire&lt;/code&gt; (acquire baked in) &lt;strong&gt;or&lt;/strong&gt; the explicit pair &lt;code&gt;ld.relaxed [flag]; fence.acquire&lt;/code&gt;. Don't drop the &lt;code&gt;fence.acquire&lt;/code&gt; and assume the &lt;code&gt;.relaxed&lt;/code&gt; qualifier carries any acquire meaning.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The fully-baked-in producer/consumer becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Producer:                          Consumer:

  data = 42;                         while (1) {
  st.release [flag], 1;                flag_value = ld.acquire [flag];
                                       if (flag_value == 1) break;
                                     }
                                     x = data;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same contract, less typing. &lt;strong&gt;For most producer-consumer patterns, this is the form you want.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When would you reach for the explicit &lt;code&gt;fence.release&lt;/code&gt; / &lt;code&gt;fence.acquire&lt;/code&gt; form instead? A few cases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;You want one fence to publish multiple flags.&lt;/strong&gt; A single &lt;code&gt;fence.release&lt;/code&gt; followed by several flag writes gives all of them release semantics — no need to repeat the fence per flag.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You want a cheap relaxed read inside a spin loop, then "commit" to acquiring only when you see the right value.&lt;/strong&gt; Repeating &lt;code&gt;ld.acquire&lt;/code&gt; on every spin iteration can be more expensive than a &lt;code&gt;ld.relaxed&lt;/code&gt; loop followed by one &lt;code&gt;fence.acquire&lt;/code&gt; after exit. We'll see this pattern again with &lt;code&gt;mbarrier&lt;/code&gt; in a later article.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You need the combined fence variant&lt;/strong&gt; discussed next.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  When you need both: &lt;code&gt;fence.acq_rel&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;So far we've kept the producer and consumer roles cleanly separated — release on the producer side, acquire on the consumer side. Each side has a single direction to worry about.&lt;/p&gt;

&lt;p&gt;But sometimes a single point in your code needs to do both jobs at once:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Refresh&lt;/strong&gt; its view to see what someone else published (acquire side).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Publish&lt;/strong&gt; its own writes for someone else to see (release side).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's exactly what &lt;code&gt;fence.acq_rel&lt;/code&gt; is for. It's a &lt;code&gt;fence.release&lt;/code&gt; and &lt;code&gt;fence.acquire&lt;/code&gt; rolled into one — both effects fused at the same point.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fence.acq_rel;     // both: prior writes published AND subsequent reads refreshed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both directions of the "one-way wall" apply at once: prior memory ops cannot sink below it (release side), and later memory ops cannot hoist above it (acquire side).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   memory op above                    ← cannot sink below the fence (release side)
   memory op above                    ← cannot sink below the fence
   ─── fence.acq_rel ───
   memory op below                    ← cannot hoist above the fence (acquire side)
   memory op below                    ← cannot hoist above the fence
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Where this actually shows up: atomic operations
&lt;/h3&gt;

&lt;p&gt;The clearest case for &lt;code&gt;acq_rel&lt;/code&gt; semantics is &lt;strong&gt;atomic operations that read AND write at the same time&lt;/strong&gt; — like &lt;code&gt;atom.cas&lt;/code&gt; (compare-and-swap), &lt;code&gt;atom.exch&lt;/code&gt; (exchange), &lt;code&gt;atom.add&lt;/code&gt; (fetch-and-add). These instructions can be simultaneously consumer-like (they read an old value) and producer-like (they write a new value).&lt;/p&gt;

&lt;p&gt;Take a shared work queue as an example. Producers fill slots in a queue, then publish progress by advancing a shared index. Consumers atomically claim index values and then read the corresponding slots. In a queue like that, the index is not just a counter — it's also the handoff point between "someone published work" and "someone else is allowed to consume it."&lt;/p&gt;

&lt;p&gt;Now imagine one thread advances the index with an atomic fetch-and-add:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Atomic fetch-and-add: returns the old value
old = atom.add.acq_rel [queue_index], 1;
slot = old;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why might this atomic want &lt;strong&gt;acq_rel&lt;/strong&gt; semantics? Because the same operation may be doing both jobs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Acquire half (the read).&lt;/strong&gt; The old counter value may point to a slot whose contents were written by another thread before that thread published the index. Before this thread reads &lt;code&gt;queue[slot]&lt;/code&gt;, it needs to acquire those slot writes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Release half (the write).&lt;/strong&gt; The new counter value may become the value that a later thread observes before reading work or metadata this thread prepared. If this thread did setup before advancing the index, the write-half of the atomic can publish that setup.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the kind of situation where &lt;code&gt;atom.add.acq_rel&lt;/code&gt; makes sense: the same read-modify-write is consuming someone else's publication and publishing this thread's own update. PTX bakes both effects into the atomic with the &lt;code&gt;.acq_rel&lt;/code&gt; qualifier, for example &lt;code&gt;atom.acq_rel.gpu.global.add.u32&lt;/code&gt; in full PTX syntax.&lt;/p&gt;

&lt;p&gt;For comparison, an ordinary lock usually does &lt;strong&gt;not&lt;/strong&gt; need &lt;code&gt;acq_rel&lt;/code&gt; on lock acquire. Taking a lock is normally just an acquire operation: it consumes the previous holder's release. Releasing the lock is normally just a release store:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;st.release [lock], 0;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So the common lock pattern is acquire on lock acquisition and release on unlock. Reach for &lt;code&gt;acq_rel&lt;/code&gt; when a single atomic really does both jobs for your algorithm.&lt;/p&gt;

&lt;h3&gt;
  
  
  Standalone &lt;code&gt;fence.acq_rel&lt;/code&gt; is rarer
&lt;/h3&gt;

&lt;p&gt;A standalone &lt;code&gt;fence.acq_rel&lt;/code&gt; (not attached to an atomic) shows up less often, but it's there if you have a non-atomic point in your code that needs both effects. Most of the time it appears inside higher-level synchronization primitives (atomics, mbarriers — which we'll see in later articles) rather than as a programmer-written explicit fence.&lt;/p&gt;

&lt;p&gt;One bit of PTX trivia: &lt;code&gt;fence.acq_rel&lt;/code&gt; is the &lt;strong&gt;default&lt;/strong&gt; when you write a plain &lt;code&gt;fence&lt;/code&gt; without a &lt;code&gt;.sem&lt;/code&gt; qualifier. So &lt;code&gt;fence.gpu&lt;/code&gt; is shorthand for &lt;code&gt;fence.acq_rel.gpu&lt;/code&gt;. This is partly why &lt;code&gt;acq_rel&lt;/code&gt; shows up so often in PTX disassembly.&lt;/p&gt;

&lt;p&gt;For typical one-direction producer-consumer patterns, &lt;strong&gt;stick with the matched &lt;code&gt;release&lt;/code&gt; + &lt;code&gt;acquire&lt;/code&gt; pair&lt;/strong&gt; we built up in the earlier sections — it's simpler and avoids asking for stronger ordering than you need. Reach for &lt;code&gt;acq_rel&lt;/code&gt; when one fence point (or one atomic operation) really does need to do both jobs at once, like the work-queue atomic above.&lt;/p&gt;




&lt;h2&gt;
  
  
  Things to remember
&lt;/h2&gt;

&lt;p&gt;A handful of practical points worth keeping in your head as you write code.&lt;/p&gt;

&lt;h3&gt;
  
  
  The producer's flag write must be a "real" memory write
&lt;/h3&gt;

&lt;p&gt;In PTX, this means using something like &lt;code&gt;st.relaxed [flag], 1&lt;/code&gt; rather than the default &lt;code&gt;st [flag], 1&lt;/code&gt;. The default is what's called a "weak" write — the memory model gives it no cross-thread guarantees, and the release pattern won't form correctly with it. For any flag another thread will read, use a strong store such as &lt;code&gt;st.relaxed&lt;/code&gt; or &lt;code&gt;st.release&lt;/code&gt; (the baked-in form discussed above). (&lt;code&gt;.acquire&lt;/code&gt; and &lt;code&gt;.acq_rel&lt;/code&gt; do not apply to ordinary stores.)&lt;/p&gt;

&lt;p&gt;The same applies to the consumer's flag read: use &lt;code&gt;ld.relaxed [flag]&lt;/code&gt; or &lt;code&gt;ld.acquire [flag]&lt;/code&gt;, not plain &lt;code&gt;ld [flag]&lt;/code&gt;. (&lt;code&gt;.release&lt;/code&gt; only applies to stores, not loads.)&lt;/p&gt;

&lt;h3&gt;
  
  
  Spin loops are normal — fences don't push data
&lt;/h3&gt;

&lt;p&gt;A common confusion is "I issued the release, why doesn't the consumer see it immediately?" The release fence orders your writes; it doesn't shove them down other threads' throats. The consumer's &lt;code&gt;flag&lt;/code&gt; read may still return &lt;code&gt;0&lt;/code&gt; for a while after the producer wrote &lt;code&gt;1&lt;/code&gt;, because the new value has to propagate through the memory system.&lt;/p&gt;

&lt;p&gt;That's why the consumer is in a &lt;code&gt;while (flag == 0)&lt;/code&gt; loop — it's bridging the propagation gap. This is the standard, correct idiom.&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory ordering is about visibility, not synchronization
&lt;/h3&gt;

&lt;p&gt;A second and related confusion: people see "fence" and assume it makes threads pause or wait for each other. It doesn't. Fences and release/acquire qualifiers are entirely about &lt;strong&gt;what one thread sees in memory when it reads&lt;/strong&gt; — not about pausing or aligning threads in time.&lt;/p&gt;

&lt;p&gt;Two different concerns, two different toolboxes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Visibility&lt;/strong&gt; (this article): one thread's writes become observable to another thread's reads, in a well-defined order. Fences and release/acquire qualifiers handle this. A fence may wait until the calling thread's relevant memory operations have reached the point of coherence for its scope, but it &lt;strong&gt;doesn't wait for another thread&lt;/strong&gt; to arrive, acknowledge, or read anything.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synchronization&lt;/strong&gt; (a separate concern): threads actually pause and align in time — e.g., "no thread proceeds past this point until all threads have arrived." That's the job of barriers like &lt;code&gt;bar.sync&lt;/code&gt;, &lt;code&gt;barrier.cluster&lt;/code&gt;, and the &lt;code&gt;mbarrier&lt;/code&gt; object. We'll cover those in the next article.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The spin loop in our consumer code is what bridges the two when you want them together: the consumer wants to wait for the producer, so it busy-loops on the flag. That waiting is the programmer's choice (the &lt;code&gt;while&lt;/code&gt; loop), not something the fence is doing for them.&lt;/p&gt;

&lt;p&gt;Keeping these two ideas separate is one of the most useful mental moves you can make in concurrent GPU programming. The memory model is purely about visibility; for actual "wait until X" behavior, reach for the synchronization primitives in the next article.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fences have a "scope"
&lt;/h3&gt;

&lt;p&gt;In PTX you'll write things like &lt;code&gt;fence.release.gpu&lt;/code&gt; or &lt;code&gt;fence.release.cta&lt;/code&gt;. The scope says which threads can directly participate in the ordering guarantee.&lt;/p&gt;

&lt;p&gt;To understand why scope matters, picture the GPU's memory as a layered hierarchy, with caches and storage at different distances from each thread:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   Per-SM:    L1 / shared-memory crossbar       (closest, fastest)
   Cluster:   DSMEM crossbar (Hopper+)
   Chip:      L2 cache                          (chip-wide)
   System:    HBM and host-memory fabric        (farthest, slowest)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For two threads to communicate via a release/acquire pair, the operations must use a scope that includes both threads. A useful hardware mental model is that wider scopes generally have to order through a farther-away coherence point, so they cost more. The exact cache behavior is an implementation detail; the architectural rule is the set of threads covered by the scope.&lt;/p&gt;

&lt;p&gt;The scopes are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;.cta&lt;/code&gt;&lt;/strong&gt; — between threads in the same thread block. This is the smallest PTX memory-model scope and is usually the cheapest.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;.cluster&lt;/code&gt;&lt;/strong&gt; — between threads in the same thread block cluster (Hopper's new feature). This is useful for cluster-level shared memory and cluster barriers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;.gpu&lt;/code&gt;&lt;/strong&gt; — between any threads in the current program on the same GPU. In the usual global-memory mental model, this means ordering at a chip-wide level such as L2, so it is more expensive than &lt;code&gt;.cta&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;.sys&lt;/code&gt;&lt;/strong&gt; — across the whole program, including host threads and kernels on other GPUs. This may involve the system fabric and is the most expensive scope.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Rough hardware mental model:

   .cta     ──►  usually local CTA-visible path
   .cluster ──►  cluster-level shared-memory network
   .gpu     ──►  chip-wide ordering point, often L2
   .sys     ──►  system-visible fabric

           ──── generally increasing cost ────►
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pick the smallest scope that covers your producer-consumer pair.&lt;/strong&gt; A &lt;code&gt;fence.release.cta&lt;/code&gt; is much cheaper than a &lt;code&gt;fence.release.gpu&lt;/code&gt;, which is much cheaper than a &lt;code&gt;fence.release.sys&lt;/code&gt;. There's no reason to pay for a wider scope than your actual readers need. If both threads are guaranteed to be in the same CTA, use &lt;code&gt;.cta&lt;/code&gt;. If they could be on any SM, you need &lt;code&gt;.gpu&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We'll see scopes again in later articles when we discuss thread block clusters and the TMA engine.&lt;/p&gt;

&lt;h3&gt;
  
  
  Single-threaded code never needs fences
&lt;/h3&gt;

&lt;p&gt;If only one thread is touching some piece of memory, you don't need any of this. The reordering issues only matter when there's a second thread observing. Inside one thread, you get this guarantee for free:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Whenever you read a memory location, you see the most recent value &lt;em&gt;your own thread&lt;/em&gt; wrote to &lt;strong&gt;that same location&lt;/strong&gt; — in source-code order, no fences needed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The important nuance is the phrase &lt;strong&gt;"that same location."&lt;/strong&gt; The compiler and hardware are still free to reorder operations to &lt;em&gt;different&lt;/em&gt; memory addresses, even within a single thread. That reorder is invisible to you, because nothing inside your thread depends on the order. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data[0] = 42;        // (1)
data[1] = 99;        // (2)
x = data[0];         // (3) guaranteed to see 42 — same address as (1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, &lt;code&gt;(1)&lt;/code&gt; and &lt;code&gt;(2)&lt;/code&gt; write to &lt;em&gt;different&lt;/em&gt; addresses, so the compiler may commit them to memory in either order. But &lt;code&gt;(3)&lt;/code&gt; reads from the same address as &lt;code&gt;(1)&lt;/code&gt;, so it's guaranteed to see &lt;code&gt;42&lt;/code&gt; — the per-address ordering rule pins this down. From your thread's perspective, the reorder of &lt;code&gt;(1)&lt;/code&gt; and &lt;code&gt;(2)&lt;/code&gt; is invisible because nothing in your code reads &lt;code&gt;data[1]&lt;/code&gt; afterward to detect it.&lt;/p&gt;

&lt;p&gt;This is exactly the source of the bug at the start of the article: the producer's &lt;code&gt;data&lt;/code&gt; write and &lt;code&gt;flag&lt;/code&gt; write are to &lt;em&gt;different&lt;/em&gt; addresses, so they can be reordered freely from the producer's own perspective. It only becomes a problem because &lt;em&gt;another&lt;/em&gt; thread (the consumer) reads both addresses and notices the reorder. The fences fix it for that exact case.&lt;/p&gt;

&lt;p&gt;So the rule is: fences are needed only at the precise places where two threads communicate, and even then only because they need to coordinate their views of &lt;em&gt;different&lt;/em&gt; memory addresses. Within one thread, same-address reads work without fences.&lt;/p&gt;




&lt;h2&gt;
  
  
  Mental model summary
&lt;/h2&gt;

&lt;p&gt;If you remember nothing else from this article, remember this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Memory ordering on a GPU is a contract between a producer and a consumer.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The producer uses a &lt;strong&gt;release fence&lt;/strong&gt; to publish all its prior writes before announcing "done."&lt;/p&gt;

&lt;p&gt;The consumer uses an &lt;strong&gt;acquire fence&lt;/strong&gt; to ensure all its subsequent reads see fresh data after observing the "done" signal.&lt;/p&gt;

&lt;p&gt;Both halves of the contract are needed. Each fence is a one-way wall — release pins prior writes above it; acquire pins subsequent reads below it. Together they make a producer-consumer pattern correct on a system where independent operations can otherwise be reordered freely.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What's next in this series
&lt;/h2&gt;

&lt;p&gt;The memory consistency model is the foundation. With it in hand, the rest of the Hopper stack starts to make sense. Coming up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Execution synchronization&lt;/strong&gt; — &lt;code&gt;bar.sync&lt;/code&gt;, &lt;code&gt;barrier.cluster&lt;/code&gt;, and how to align threads in time (not just memory).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;mbarrier&lt;/code&gt;&lt;/strong&gt; — a programmable synchronization object that combines memory ordering, thread arrival counting, and async-engine completion tracking. The Hopper workhorse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Asynchronous copies&lt;/strong&gt; — the TMA engine, &lt;code&gt;cp.async.bulk&lt;/code&gt;, and the powerful but subtle &lt;code&gt;CUtensorMap&lt;/code&gt; descriptor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;wgmma&lt;/code&gt;&lt;/strong&gt; — Hopper's warp-group matrix multiply-accumulate, the engine that drives modern GEMM and attention kernels.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each will build on the same release-acquire ideas you just learned. The TMA engine, mbarriers, and warp-group MMA all rely on exactly the same kind of "one side publishes, the other side acquires" contract — just with more sophisticated machinery to support things like async byte-counting and distributed shared memory.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>computerscience</category>
      <category>performance</category>
      <category>programming</category>
    </item>
    <item>
      <title>CUDA Graphs in LLM Inference: Deep Dive</title>
      <dc:creator>Shah Fahad</dc:creator>
      <pubDate>Sat, 21 Feb 2026 07:09:21 +0000</pubDate>
      <link>https://dev.to/sfahad/cuda-graphs-in-llm-inference-deep-dive-36pb</link>
      <guid>https://dev.to/sfahad/cuda-graphs-in-llm-inference-deep-dive-36pb</guid>
      <description>&lt;h2&gt;
  
  
  Why CUDA Graphs Matter for LLM Inference
&lt;/h2&gt;

&lt;p&gt;LLM inference -- especially the token generation (decode) phase -- is &lt;strong&gt;often dominated by CPU overhead rather than GPU compute&lt;/strong&gt;. Each decode step generates a single token per sequence: the actual GPU work (small matmuls, attention over one query) can finish in microseconds, but the CPU can spend tens of microseconds &lt;em&gt;per kernel launch&lt;/em&gt; on launch bookkeeping, driver calls, and synchronization. With hundreds of kernel launches per transformer forward pass, this CPU overhead can become the bottleneck (though at higher batch sizes or with heavier kernels, decode can still become GPU-bound).&lt;/p&gt;

&lt;p&gt;Making matters worse, the CPU isn't just launching kernels -- it's also preparing data for the next batch: updating token IDs, managing the KV cache block table, running the scheduler, and handling request arrivals/completions. All of this competes for CPU time with kernel launches, amplifying the bottleneck. The GPU ends up sitting idle between launches, throughput drops, latency rises, and expensive GPU cycles are wasted on nothing.&lt;/p&gt;

&lt;p&gt;CUDA graphs solve this by &lt;strong&gt;recording the entire kernel sequence once&lt;/strong&gt; and &lt;strong&gt;replaying it with a single CPU call&lt;/strong&gt;. The driver overhead is paid once at capture time; every subsequent replay amortizes hundreds of per-kernel launches into a single replay launch, largely avoiding the repeated per-kernel launch bookkeeping. For decode-heavy workloads, this can eliminate the majority of per-step overhead.&lt;/p&gt;
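&lt;p&gt;To make the amortization concrete, here is a back-of-envelope sketch in plain Python; the overhead and kernel-count figures are assumptions for illustration, not measurements:&lt;/p&gt;

```python
# Illustrative numbers only -- per-launch overhead and kernel counts vary
# widely by model, driver version, and CPU.
LAUNCH_OVERHEAD_US = 20    # assumed CPU cost per eager kernel launch
KERNELS_PER_STEP   = 300   # assumed kernel launches per decode forward pass
REPLAY_OVERHEAD_US = 30    # assumed CPU cost of one whole-graph replay

eager_cpu_us = KERNELS_PER_STEP * LAUNCH_OVERHEAD_US  # 6000 us of CPU per step
graph_cpu_us = REPLAY_OVERHEAD_US                     # 30 us of CPU per step
speedup = eager_cpu_us / graph_cpu_us                 # 200x less CPU time
```

&lt;p&gt;Under these assumed numbers, per-step CPU cost drops from 6 ms to 30 µs -- which is why the effect is most visible when the GPU work itself is small.&lt;/p&gt;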

&lt;p&gt;This post walks through how CUDA graphs work in the context of LLM serving -- why decode is a natural fit, why context/mixed batches are harder, and how TensorRT-LLM (TRT-LLM) implements both monolithic and piecewise CUDA graph strategies.&lt;/p&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;1. CUDA Graphs Fundamentals&lt;/li&gt;
&lt;li&gt;2. Generation (Decode) CUDA Graphs&lt;/li&gt;
&lt;li&gt;3. KV Cache with Static Addresses&lt;/li&gt;
&lt;li&gt;4. Why Context &amp;amp; Mixed Batches Are Hard&lt;/li&gt;
&lt;li&gt;5. Piecewise CUDA Graphs (torch.compile)&lt;/li&gt;
&lt;li&gt;6. Configuration Guide&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  1. CUDA Graphs Fundamentals
&lt;/h2&gt;

&lt;p&gt;A CUDA graph captures a sequence of GPU operations (kernel launches, memory copies) into a single replayable unit.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Gets Captured (Fixed)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------------------------------------------------------------------+
| CUDA Graph Recording                                               |
|                                                                    |
| +----------+      +----------+      +----------+      +----------+ |
| | Kernel A |      | Kernel B |      | Kernel C |      | Kernel D | |
| |grid(4,1) |-----&amp;gt;|grid(8,1) |-----&amp;gt;|grid(4,1) |-----&amp;gt;|grid(2,1) | |
| |@0x100 -&amp;gt; |      |@0x200 -&amp;gt; |      |@0x300 -&amp;gt; |      |@0x400 -&amp;gt; | |
| |  0x200   |      |  0x300   |      |  0x400   |      |  0x500   | |
| +----------+      +----------+      +----------+      +----------+ |
+--------------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Baked into the graph:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which kernels to launch, in what order&lt;/li&gt;
&lt;li&gt;Memory addresses (pointers) each kernel reads/writes&lt;/li&gt;
&lt;li&gt;Kernel launch parameters (grid dims, block dims, shared memory)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;NOT baked (can change between replays):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The actual data at those addresses&lt;/li&gt;
&lt;li&gt;Data-dependent control flow inside kernels (loops, branches)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Replay Contract
&lt;/h3&gt;

&lt;p&gt;On replay, the entire sequence launches with minimal CPU overhead. The user's responsibility is to place correct data at the captured addresses before each replay.&lt;/p&gt;
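&lt;p&gt;The contract can be sketched as a toy simulation in plain Python (all names invented for illustration): a "graph" is a recorded list of kernels closed over fixed addresses, and the caller's only job is to refresh the data at those addresses before each replay:&lt;/p&gt;

```python
# Toy model of the replay contract: kernels are captured once with their
# buffer addresses (here, list indices) baked in. Only the data changes.
memory = [0] * 8                     # pretend device memory, fixed addresses

def kernel_double(src, dst):         # captured with src/dst baked in
    memory[dst] = memory[src] * 2

def kernel_add_one(src, dst):
    memory[dst] = memory[src] + 1

# "Capture": record (kernel, args) once. Addresses 0 -> 1 -> 2 are baked.
graph = [(kernel_double, (0, 1)), (kernel_add_one, (1, 2))]

def replay():                        # one call launches the whole chain
    for kernel, args in graph:
        kernel(*args)

memory[0] = 3                        # caller's job: fresh data at address 0
replay()
print(memory[2])                     # 3*2 + 1 = 7

memory[0] = 10                       # same addresses, new data
replay()
print(memory[2])                     # 10*2 + 1 = 21
```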

&lt;h3&gt;
  
  
  Why It's Fast
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+----------------------------+
| Without CUDA Graph (eager) |
|                            |
| CPU -- launch --&amp;gt; Kernel A |
| CPU -- launch --&amp;gt; Kernel B |
| CPU -- launch --&amp;gt; Kernel C |
| CPU -- launch --&amp;gt; Kernel D |
|                            |
| = 4 launches, each paying  |
|   driver bookkeeping cost  |
+----------------------------+

+------------------------------------------+
| With CUDA Graph                          |
|                                          |
| CPU -- replay --&amp;gt; [ Kernel A, B, C, D ]  |
|                                          |
| = 1 launch, entire chain executes on GPU |
+------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  2. Generation (Decode) CUDA Graphs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why Decode Is Well-Suited
&lt;/h3&gt;

&lt;p&gt;In decode, each sequence contributes exactly &lt;strong&gt;1 new token&lt;/strong&gt; per step. Total tokens = batch size. This makes the input shape predictable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+---------------------------------------------------------------+
| Decode step N                                                 |
|                                                               |
| seq0: 1 token  \                                              |
| seq1: 1 token   \                                             |
|                   &amp;gt;-- batch_size = 4, shape = [4, hidden_dim] |
| seq2: 1 token   /                                             |
| seq3: 1 token  /                                              |
+---------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pre-allocated Static Buffers
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------------------------------------------------------+
| Input token IDs buffer (pre-allocated, max_batch_size = 4096)   |
|                                                                 |
| [ token_0 ][ token_1 ][ token_2 ][ token_3 ] ... [ token_4095 ] |
|   @addr_0    @addr_1    @addr_2    @addr_3          @addr_4095  |
|                                                                 |
|   fixed addresses -- same every replay                          |
+-----------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Multiple Graphs for Different Batch Sizes
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Captured graphs (one per supported batch size, typically powers of two):

  batch_size   grid size     reads
  ----------   ---------     -----
       1  --&amp;gt;  (1, ...)  --&amp;gt; addr_0
       2  --&amp;gt;  (2, ...)  --&amp;gt; addr_0..1
       4  --&amp;gt;  (4, ...)  --&amp;gt; addr_0..3
       8  --&amp;gt;  (8, ...)  --&amp;gt; addr_0..7
       :
    4096  --&amp;gt;  (4096,..) --&amp;gt; addr_0..4095
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At runtime, 5 active sequences map to the &lt;code&gt;batch_size=8&lt;/code&gt; graph, with 3 dummy sequences padded in.&lt;/p&gt;
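&lt;p&gt;Bucket selection can be sketched as a small helper (a sketch assuming power-of-two capture sizes; &lt;code&gt;pick_graph&lt;/code&gt; and the size list are invented for illustration):&lt;/p&gt;

```python
# Assumed captured batch sizes, as in the diagram above (powers of two).
CAPTURED_SIZES = [1, 2, 4, 8, 16, 32, 64, 128]

def pick_graph(active_sequences):
    """Smallest captured batch size that fits; the rest is dummy padding."""
    for size in CAPTURED_SIZES:
        if size >= active_sequences:
            return size, size - active_sequences  # (graph to use, dummy pads)
    raise ValueError("batch exceeds largest captured graph")

print(pick_graph(5))   # (8, 3): use the batch_size=8 graph, pad 3 dummies
```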

&lt;h3&gt;
  
  
  Intermediate Activations Have Stable Addresses
&lt;/h3&gt;

&lt;p&gt;During capture, intermediate tensors are allocated from a graph-private memory pool, giving them stable device addresses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+----------------------------------------------------------+
| Transformer layer (captured; all addresses fixed)        |
|                                                          |
| [QKV Projection] ----&amp;gt; [Attention] ----&amp;gt; [Output Proj]   |
|  in @A, out @B          in @B, out @C    in @C, out @D   |
|                                               |          |
|                                               v          |
| [FFN Layer 1] --------&amp;gt; [FFN Layer 2] ----&amp;gt; (next layer) |
|  in @D, out @E           in @E, out @F                   |
+----------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On replay, the same chain executes at the same addresses. Intermediate buffers are never freed between replays -- they persist in the graph's memory pool. This is why &lt;strong&gt;each captured batch size has its own set of stable-address buffers&lt;/strong&gt;, and capturing many batch sizes consumes significant GPU memory.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the Runtime Updates Before Each Replay
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------------------------------------------+
| 1. input_token_ids[0:B]  &amp;lt;-- new token IDs          |
| 2. position_ids[0:B]     &amp;lt;-- new positions          |
| 3. sequence_lengths[0:B] += 1                       |
| 4. block_table           &amp;lt;-- update if new KV block |
+-----------------------------------------------------+
| 5. &amp;gt;&amp;gt;&amp;gt; REPLAY GRAPH &amp;lt;&amp;lt;&amp;lt;                             |
+-----------------------------------------------------+
| 6. new_logits &amp;lt;-- output_buffer[0:B]                |
+-----------------------------------------------------+
| B = batch_size                                      |
+-----------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  3. KV Cache with Static Addresses
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Apparent Contradiction
&lt;/h3&gt;

&lt;p&gt;KV cache grows every step (new K,V written for each token), yet CUDA graphs require fixed addresses. The solution: &lt;strong&gt;paged/block-based KV cache&lt;/strong&gt; with an &lt;strong&gt;indirection table&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Block-Based KV Cache Pool
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-------------------------------------------------------------+
| KV cache pool (pre-allocated; addresses never change)       |
|                                                             |
| [ Block 0 ][ Block 1 ][ Block 2 ][ Block 3 ][ Block 4 ] ... |
|   @blk_0     @blk_1     @blk_2     @blk_3     @blk_4        |
|  32 slots   32 slots   32 slots   32 slots   32 slots       |
|                                                             |
| each block holds K,V for a fixed number of tokens (e.g. 32) |
+-------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Block Table (Indirection)
&lt;/h3&gt;

&lt;p&gt;Each sequence has a block table mapping logical positions to physical blocks:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Logical positions&lt;/th&gt;
&lt;th&gt;Physical block&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;tokens 0–31&lt;/td&gt;
&lt;td&gt;Block 7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tokens 32–63&lt;/td&gt;
&lt;td&gt;Block 12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tokens 64–95&lt;/td&gt;
&lt;td&gt;Block 3 (partially filled, e.g. up to 82)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Sequence 0's block table at fixed address @tbl_0&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How Attention Kernel Uses Indirection
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Inside the attention kernel (pseudo-code):
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;each&lt;/span&gt; &lt;span class="n"&gt;past&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="n"&gt;position&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sequence_length&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;seq_id&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="n"&gt;block_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;block_table&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;seq_id&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;block_size&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;    &lt;span class="c1"&gt;# read from @tbl_0
&lt;/span&gt;    &lt;span class="n"&gt;offset&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;block_size&lt;/span&gt;
    &lt;span class="n"&gt;K_i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kv_cache_pool&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;block_idx&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;              &lt;span class="c1"&gt;# indirect lookup into pool
&lt;/span&gt;    &lt;span class="n"&gt;V_i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kv_cache_pool&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;block_idx&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K_i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step-by-Step: How KV Cache Grows Within CUDA Graph
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Buffer&lt;/th&gt;
&lt;th&gt;Step N&lt;/th&gt;
&lt;th&gt;Step N+1&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;block_table&lt;/code&gt; @tbl_0&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[7, 12, 3]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[7, 12, 3]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Same address, same indices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;seq_length&lt;/code&gt; @len_0&lt;/td&gt;
&lt;td&gt;&lt;code&gt;82&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;83&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Same address, incremented&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;kv_pool Block 3, slot 18&lt;/td&gt;
&lt;td&gt;K,V for token 82&lt;/td&gt;
&lt;td&gt;K,V for token 82&lt;/td&gt;
&lt;td&gt;Unchanged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;kv_pool Block 3, slot 19&lt;/td&gt;
&lt;td&gt;&lt;em&gt;(empty)&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;K,V for token 83&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;NEW&lt;/strong&gt; — written by kernel&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The kernel wrote to a different slot because &lt;code&gt;sequence_length&lt;/code&gt; told it to. All addresses remain fixed -- only the data changes.&lt;/p&gt;
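&lt;p&gt;The same growth mechanics can be sketched in plain Python -- a toy model, not TRT-LLM code, assuming token &lt;code&gt;t&lt;/code&gt; occupies logical slot &lt;code&gt;t&lt;/code&gt;:&lt;/p&gt;

```python
# Toy paged KV cache: pool and table addresses never move; "growth" is just
# new data written into pre-existing slots. Block size 32, as in the example.
BLOCK_SIZE  = 32
kv_pool     = {}           # (block_idx, offset) -> KV data; the fixed pool
block_table = [7, 12, 3]   # sequence 0's table: fixed buffer, mutable data
seq_length  = 83           # tokens 0..82 written; token 82 sits in
                           # block_table[82 // 32] = 3, offset 82 % 32 = 18

def append_token(token_id):
    """Decode step: write K,V for the next token, then bump the length."""
    global seq_length
    block  = block_table[seq_length // BLOCK_SIZE]
    offset = seq_length % BLOCK_SIZE
    kv_pool[(block, offset)] = "KV(token %d)" % token_id
    seq_length += 1
    return block, offset

print(append_token(83))    # (3, 19): the NEW write from the table above
print(seq_length)          # 84
```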

&lt;h3&gt;
  
  
  Why This Doesn't Violate CUDA Graph Rules
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What's fixed (baked in graph)&lt;/th&gt;
&lt;th&gt;What changes (data at fixed addrs)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;kv_cache_pool&lt;/code&gt; base address&lt;/td&gt;
&lt;td&gt;Which blocks are assigned (block_table data)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;block_table&lt;/code&gt; buffer address&lt;/td&gt;
&lt;td&gt;The integer block indices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;sequence_length&lt;/code&gt; buffer address&lt;/td&gt;
&lt;td&gt;The actual length values&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kernel grid dimensions&lt;/td&gt;
&lt;td&gt;Data-dependent loops inside kernel iterate more/fewer times&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  4. Why Context &amp;amp; Mixed Batches Are Hard
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Core Problem: Variable Total Token Count
&lt;/h3&gt;

&lt;p&gt;In decode, total tokens = batch size (each sequence = 1 token). In context/mixed, total tokens varies wildly:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Batch type&lt;/th&gt;
&lt;th&gt;Sequences&lt;/th&gt;
&lt;th&gt;Total tokens&lt;/th&gt;
&lt;th&gt;Predictable?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Decode&lt;/td&gt;
&lt;td&gt;&lt;code&gt;seq₀(1) + seq₁(1) + seq₂(1)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Yes — always = batch_size&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context&lt;/td&gt;
&lt;td&gt;&lt;code&gt;seq₀(137) + seq₁(2048)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2185&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mixed&lt;/td&gt;
&lt;td&gt;&lt;code&gt;seq₀(512 prefill) + seq₁(1 decode)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;513&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Problem 1: Kernel Grid Dimensions Depend on Total Tokens
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Kernel launch -- grid dims are a function of input shape&lt;/span&gt;
&lt;span class="n"&gt;dim3&lt;/span&gt; &lt;span class="nf"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;total_tokens&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;TILE_M&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;TILE_M&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hidden_dim&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;TILE_N&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;TILE_N&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;matmul_kernel&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hidden_dim&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;total_tokens&lt;/th&gt;
&lt;th&gt;grid size&lt;/th&gt;
&lt;th&gt;Implication&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(4, …)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4 blocks — one graph&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3072&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(24, …)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;24 blocks — &lt;strong&gt;different&lt;/strong&gt; graph required&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The grid is baked at capture time. Different total tokens = different grid = different graph.&lt;/p&gt;
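&lt;p&gt;The table's grid sizes follow directly from the launch formula above (a sketch; &lt;code&gt;TILE_M = 128&lt;/code&gt; is an assumed tile size chosen so the numbers match):&lt;/p&gt;

```python
TILE_M = 128   # assumed tile size; real kernels pick this per architecture

def grid_x(total_tokens):
    # ceil division, mirroring the dim3 grid computation in the C++ snippet
    return (total_tokens + TILE_M - 1) // TILE_M

print(grid_x(512))    # 4  -> the first row of the table
print(grid_x(3072))   # 24 -> different grid, so a different graph is needed
```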

&lt;h3&gt;
  
  
  Problem 2: Attention Grid Depends on Max Context Seq Length and Num Context Requests
&lt;/h3&gt;

&lt;p&gt;For MLP, every token is independent: &lt;code&gt;output[i] = MLP(input[i])&lt;/code&gt;. Fix total_tokens and you're done.&lt;/p&gt;

&lt;p&gt;For attention, the kernel grid depends on &lt;strong&gt;two per-iteration variables&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------------------------------------------------------------+
| TRT-LLM attention grid (simplified call chain)               |
|                                                              |
| Python (trtllm.py)                                           |
|   max_ctx_seq_len = seq_lens[:num_contexts].max()            |
|                             |                                |
|                             v                                |
| C++ (fmhaRunner / fused_multihead_attention_v2)              |
|   |                   |                   |                  |
|   v                   v                   v                  |
|   grid.x              grid.y              grid.z             |
|   ceil(s/unroll)      num_heads           num_ctx_requests   |
|   [VARIES]            [FIXED]             [VARIES]           |
|                                                              |
|   --&amp;gt; grid = ( ceil(s/unroll), num_heads, num_ctx_requests ) |
+--------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Grid = &lt;code&gt;(ceil(max_ctx_seq_len / unroll_step), num_heads, num_context_requests)&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;TRT-LLM uses a &lt;strong&gt;padded tiling strategy&lt;/strong&gt;: the grid is sized for the longest context request, and shorter requests have their extra tiles skip computation (the kernel checks &lt;code&gt;cu_seqlens&lt;/code&gt; internally):&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Padded tiling: 3 context requests, &lt;code&gt;seq_lens = [64, 128, 256]&lt;/code&gt;, &lt;code&gt;unroll_step = 64&lt;/code&gt;.&lt;br&gt;
Grid = &lt;code&gt;(4, num_heads, 3)&lt;/code&gt; — sized for longest request (256).&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Tile 0&lt;/th&gt;
&lt;th&gt;Tile 1&lt;/th&gt;
&lt;th&gt;Tile 2&lt;/th&gt;
&lt;th&gt;Tile 3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Req 0&lt;/strong&gt; (64 tokens)&lt;/td&gt;
&lt;td&gt;compute&lt;/td&gt;
&lt;td&gt;skip&lt;/td&gt;
&lt;td&gt;skip&lt;/td&gt;
&lt;td&gt;skip&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Req 1&lt;/strong&gt; (128 tokens)&lt;/td&gt;
&lt;td&gt;compute&lt;/td&gt;
&lt;td&gt;compute&lt;/td&gt;
&lt;td&gt;skip&lt;/td&gt;
&lt;td&gt;skip&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Req 2&lt;/strong&gt; (256 tokens)&lt;/td&gt;
&lt;td&gt;compute&lt;/td&gt;
&lt;td&gt;compute&lt;/td&gt;
&lt;td&gt;compute&lt;/td&gt;
&lt;td&gt;compute&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
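&lt;p&gt;The compute/skip map above can be reproduced with a few lines of Python (a sketch; &lt;code&gt;tile_map&lt;/code&gt; is invented for illustration):&lt;/p&gt;

```python
def tile_map(seq_lens, unroll_step):
    # grid.x is sized for the longest request (ceil division); shorter
    # requests mark their trailing tiles as "skip".
    num_tiles = -(-max(seq_lens) // unroll_step)
    return [
        ["compute" if length > t * unroll_step else "skip"
         for t in range(num_tiles)]
        for length in seq_lens
    ]

rows = tile_map([64, 128, 256], unroll_step=64)
for row in rows:
    print(row)
# ['compute', 'skip', 'skip', 'skip']
# ['compute', 'compute', 'skip', 'skip']
# ['compute', 'compute', 'compute', 'compute']
```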

&lt;p&gt;Even with this padded approach, the grid changes per iteration because &lt;strong&gt;both &lt;code&gt;max_ctx_seq_len&lt;/code&gt; and &lt;code&gt;num_context_requests&lt;/code&gt;&lt;/strong&gt; change depending on which requests the scheduler assigns to the context phase:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Iteration&lt;/th&gt;
&lt;th&gt;Context requests&lt;/th&gt;
&lt;th&gt;max_len&lt;/th&gt;
&lt;th&gt;grid&lt;/th&gt;
&lt;th&gt;What changed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(2, heads, 32)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(2, heads, 1)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;grid.z&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;256&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(4, heads, 2)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;grid.x and z&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Different iterations produce different grids/launch parameters — the combination space explodes across multiple variables (e.g., &lt;code&gt;max_ctx_seq_len&lt;/code&gt;, &lt;code&gt;num_context_requests&lt;/code&gt;, and sequence-length distributions), making “one reusable CUDA graph” impractical.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A CUDA graph captured with one grid would produce &lt;strong&gt;incorrect results&lt;/strong&gt; if replayed with a different grid/launch configuration (missing tiles = unprocessed tokens; extra tiles = out-of-bounds/garbage work). To make this safe, you’d need to capture graphs for many combinations or pad/standardize to a fixed worst-case launch shape.&lt;/p&gt;
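&lt;p&gt;A rough sense of the explosion (the bucket counts below are assumptions for illustration, not TRT-LLM limits):&lt;/p&gt;

```python
# If you tried to cover context batches by capture alone, you would need one
# graph per distinct (grid.x, grid.z) pair. Bucket counts here are assumed.
SEQ_LEN_BUCKETS    = 64    # assumed distinct ceil(max_ctx_seq_len/unroll) values
CTX_REQUEST_COUNTS = 128   # assumed distinct num_context_requests values

graphs_needed = SEQ_LEN_BUCKETS * CTX_REQUEST_COUNTS
print(graphs_needed)       # 8192 graphs for just these two variables
```

&lt;p&gt;And as noted earlier, each captured graph pins its own set of stable-address buffers, so memory cost scales with the same product.&lt;/p&gt;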
&lt;h3&gt;
  
  
  Why Decode Attention Doesn't Have This Problem
&lt;/h3&gt;

&lt;p&gt;In decode, every sequence has exactly 1 query token, so attention takes a different kernel path:&lt;/p&gt;

&lt;p&gt;Decode attention: &lt;code&gt;grid = (batch_size, num_heads)&lt;/code&gt; — both fixed per captured graph.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;batch_size&lt;/code&gt; is fixed per captured graph (one graph per supported batch size)&lt;/li&gt;
&lt;li&gt;Variable KV cache lengths are handled by data-dependent loops &lt;strong&gt;inside&lt;/strong&gt; the kernel (loop over &lt;code&gt;sequence_length[i]&lt;/code&gt;) -- the grid doesn't change&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Where Each Layer Type Falls
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Shape&lt;/th&gt;
&lt;th&gt;Capturable?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Layer norm&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;[total_tokens, hidden]&lt;/code&gt; — flat&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q, K, V projections&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;[total_tokens, hidden]&lt;/code&gt; — flat matmuls&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Fused attention&lt;/strong&gt; (Q@K^T, softmax, scores@V)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;per-sequence, variable tiles&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;No&lt;/strong&gt; — grid varies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output projection&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;[total_tokens, hidden]&lt;/code&gt; — flat matmul&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MLP&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;[total_tokens, hidden]&lt;/code&gt; — flat matmuls&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h2&gt;
  
  
  5. Piecewise CUDA Graphs (torch.compile)
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Two Separate CUDA Graph Systems
&lt;/h3&gt;

&lt;p&gt;TRT-LLM uses &lt;strong&gt;two independent&lt;/strong&gt; CUDA graph systems -- understanding this distinction is critical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                  Python model forward()
                          |
            +-------------+-------------+
            |                           |
            v                           v
+-------------------------+ +-------------------------+
| torch.compile           | | Native CUDA Graph       |
| (Dynamo tracing)        | | (stream capture)        |
+-------------------------+ +-------------------------+
| Traces Python -&amp;gt; FX     | | Records GPU kernels     |
| Decomposes to ATen ops  | | on the CUDA stream      |
| Custom ops -&amp;gt; split pt  | | Captures everything     |
+-------------------------+ +-------------------------+
| Result: Pieces          | | Result: One monolithic  |
| [graph][eager][graph]...| | graph of full fwd pass  |
+-------------------------+ +-------------------------+
            |                           |
            v                           v
  Used for: mixed/context    Used for: decode-only
  (attn grid varies)         (attn grid fixed)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Generation-only (decode)&lt;/strong&gt;: Uses &lt;strong&gt;native &lt;code&gt;torch.cuda.CUDAGraph&lt;/code&gt;&lt;/strong&gt; capture. This records every kernel launch on the CUDA stream at the driver level -- including FlashAttention. It doesn't need to "understand" the kernels; it just records them. This works because decode attention's grid depends only on &lt;code&gt;batch_size&lt;/code&gt; (fixed per capture).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Piecewise (mixed/context)&lt;/strong&gt;: Uses &lt;strong&gt;torch.compile&lt;/strong&gt; to trace the model into an FX graph, then TRT-LLM's custom backend splits at attention boundaries and captures each non-attention piece as a CUDA graph. Attention runs eagerly.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Piecewise Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------------------------------------------------------+
| CUDA GRAPH -- piece 1                     [captured]   |
|   layer_norm -&amp;gt; qkv_projection                         |
|   pre-allocates output buffer @ addr_X                 |
+--------------------------------------------------------+
|                         |                              |
|                         v                              |
+--------------------------------------------------------+
| EAGER -- not graphed                 [runs every time] |
|   flash_attention(q, k, v, cu_seqlens, ...)            |
|   writes result IN-PLACE to addr_X                     |
+--------------------------------------------------------+
|                         |                              |
|                         v                              |
+--------------------------------------------------------+
| CUDA GRAPH -- piece 2                     [captured]   |
|   reads from addr_X                                    |
|   output_proj -&amp;gt; layer_norm -&amp;gt; mlp_up -&amp;gt;               |
|   activation -&amp;gt; mlp_down -&amp;gt; residual_add               |
+--------------------------------------------------------+
|                         |                              |
|                         v                              |
|                 ... next layer ...                     |
+--------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The in-place attention design is critical: attention writes into a buffer pre-allocated by piece 1, ensuring piece 2's captured graph reads from the correct fixed address.&lt;/p&gt;
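&lt;p&gt;The contract can be sketched in plain Python, modeling a captured piece as a closure over a fixed buffer object (a stand-in for the fixed device address baked into the graph at capture time; all names are illustrative):&lt;/p&gt;

```python
# Plain-Python sketch of the fixed-address contract. A captured piece is
# modeled as a closure over a fixed buffer object (a stand-in for a fixed
# device pointer baked into the graph at capture time).
def capture_piece2(buffer):
    def replay():
        return sum(buffer)          # always reads the captured "address"
    return replay

attn_out = [0.0] * 4                # pre-allocated by piece 1 ("addr_X")
piece2 = capture_piece2(attn_out)

# Correct: eager attention writes IN-PLACE into the same buffer.
attn_out[:] = [1.0, 2.0, 3.0, 4.0]
assert piece2() == 10.0

# Broken pattern: rebinding to a fresh buffer (a new "address") leaves the
# captured graph reading stale data.
attn_out = [9.0] * 4                # new object; piece2 still sees the old one
assert piece2() == 10.0
```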

&lt;h3&gt;
  
  
  Why Attention Is Excluded
&lt;/h3&gt;

&lt;p&gt;Attention is excluded from CUDA graph capture for a &lt;strong&gt;correctness&lt;/strong&gt; reason, not a tracing limitation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The tracing works fine.&lt;/strong&gt; TRT-LLM registers a FakeTensor implementation for the attention custom op, so &lt;code&gt;torch.compile&lt;/code&gt; in fullgraph mode traces the entire forward pass into one FX graph without graph breaks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The exclusion is a deliberate choice.&lt;/strong&gt; TRT-LLM's &lt;code&gt;piecewise_optimizer.py&lt;/code&gt; explicitly identifies attention ops and excludes them from CUDA graph pieces:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tensorrt_llm/_torch/compilation/piecewise_optimizer.py
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;is_call_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ops&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trtllm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attn_custom_op_inplace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ops&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trtllm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mla_custom_op_inplace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="n"&gt;exclude_modules_id&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ← excluded from CUDA graph capture
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The reason: replay correctness.&lt;/strong&gt; If attention were captured in a CUDA graph, the kernel's grid dimensions would be baked in. But attention's grid depends on the per-sequence query distribution, not just total tokens:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Kernel source&lt;/th&gt;
&lt;th&gt;grid.x&lt;/th&gt;
&lt;th&gt;grid.y&lt;/th&gt;
&lt;th&gt;grid.z&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fused_multihead_attention_v2.cpp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ceil(params.s / mUnrollStep)&lt;/code&gt; — &lt;strong&gt;varies&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;params.h&lt;/code&gt; (heads) — fixed&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;params.b&lt;/code&gt; (batch) — &lt;strong&gt;varies&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;triton_attention.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;num_prefill&lt;/code&gt; — &lt;strong&gt;varies&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;n_heads&lt;/code&gt; — fixed&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ceil(max(seq_len) / SEQ_BLOCK)&lt;/code&gt; — &lt;strong&gt;varies&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;unfusedAttentionKernels.cu&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ceil(q_length / 32.0f)&lt;/code&gt; — &lt;strong&gt;varies&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For the same &lt;code&gt;total_tokens=4096&lt;/code&gt;, different sequence distributions can produce different grids/launch metadata. A captured graph replays the capture-time launch configuration; unless you pad/standardize to that same configuration, replaying on a different distribution would be incorrect. MLP doesn't have this problem because its grid depends primarily on &lt;code&gt;total_tokens&lt;/code&gt;.&lt;/p&gt;
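&lt;p&gt;A small sketch of the first row's formula makes this concrete: two batches with the same &lt;code&gt;total_tokens&lt;/code&gt; but different length distributions yield different grids (the unroll step and head count here are made-up values, not the kernel's actual parameters):&lt;/p&gt;

```python
# Illustrative sketch of the first row's grid formula:
# grid = (ceil(max_seq / unroll_step), num_heads, batch). The unroll step
# and head count are made-up values, not the kernel's real parameters.
import math

def fmha_grid(seq_lens, num_heads, unroll_step=64):
    return (math.ceil(max(seq_lens) / unroll_step), num_heads, len(seq_lens))

# Two batches, both total_tokens = 4096:
even   = fmha_grid([1024, 1024, 1024, 1024], num_heads=32)   # (16, 32, 4)
skewed = fmha_grid([3968, 64, 32, 32], num_heads=32)         # (62, 32, 4)
assert even != skewed   # a graph captured for one is wrong for the other
```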

&lt;h3&gt;
  
  
  What &lt;code&gt;capture_num_tokens&lt;/code&gt; Controls
&lt;/h3&gt;

&lt;p&gt;TRT-LLM pre-captures piecewise graphs for specific total token counts. At runtime, it pads the token count &lt;strong&gt;up&lt;/strong&gt; to the next captured value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;capture_num_tokens: [1, 2, 4, 8, ..., 8192]

Runtime: 4160 total tokens → pad up to the next captured value (e.g., 5120)
  - Waste: (5120 - 4160) / 5120 = 18.7% extra compute
  - Benefit: CUDA graph replay for MLP pieces (zero launch overhead)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
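&lt;p&gt;The pad-up rule amounts to a lower-bound search over the sorted capture list; a minimal sketch (the function names are hypothetical, not TRT-LLM's implementation):&lt;/p&gt;

```python
# Sketch of the pad-up rule: the smallest captured token count that covers
# the runtime total, else eager fallback. Function names are hypothetical;
# capture_num_tokens must be sorted ascending for bisect to work.
from bisect import bisect_left

def pick_capture_bucket(total_tokens, capture_num_tokens):
    i = bisect_left(capture_num_tokens, total_tokens)
    if i == len(capture_num_tokens):
        return None                  # exceeds the largest capture: run eager
    return capture_num_tokens[i]

buckets = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
           1024, 2048, 3072, 4096, 5120, 6144, 7168, 8192]
assert pick_capture_bucket(4160, buckets) == 5120    # pad up
assert pick_capture_bucket(4096, buckets) == 4096    # exact hit, no padding
assert pick_capture_bucket(9000, buckets) is None    # eager fallback

def padding_waste(total, bucket):
    return (bucket - total) / bucket

assert round(padding_waste(4160, 5120), 4) == 0.1875
```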



&lt;h3&gt;
  
  
  Graph Type Summary
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Graph Type&lt;/th&gt;
&lt;th&gt;Capture Mechanism&lt;/th&gt;
&lt;th&gt;What It Captures&lt;/th&gt;
&lt;th&gt;When Used&lt;/th&gt;
&lt;th&gt;Key Parameter&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Generation-only&lt;/td&gt;
&lt;td&gt;Native &lt;code&gt;torch.cuda.CUDAGraph&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Full forward pass (including attention)&lt;/td&gt;
&lt;td&gt;Pure decode iterations&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cuda_graph_config.batch_sizes&lt;/code&gt; or &lt;code&gt;max_batch_size&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Piecewise&lt;/td&gt;
&lt;td&gt;torch.compile + native capture per piece&lt;/td&gt;
&lt;td&gt;All non-attention ops (attention runs eager)&lt;/td&gt;
&lt;td&gt;Mixed/context iterations&lt;/td&gt;
&lt;td&gt;&lt;code&gt;torch_compile_config.capture_num_tokens&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Memory vs. Coverage Trade-off
&lt;/h3&gt;

&lt;p&gt;Each piecewise capture at token count N pre-allocates intermediate buffers of size &lt;code&gt;[N, hidden_dim]&lt;/code&gt; per piece per layer. Capturing at large N (e.g., 8192) can consume enough GPU memory to shrink KV cache capacity below usable levels; combined with an aggressive &lt;code&gt;kv_cache_free_gpu_mem_fraction&lt;/code&gt;, this can reduce the KV cache's maximum sequence length enough to cause warmup failures.&lt;/p&gt;
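&lt;p&gt;A back-of-envelope sketch of the buffer cost (the buffers-per-layer count and dtype size are assumptions for illustration, not measured TRT-LLM allocations):&lt;/p&gt;

```python
# Back-of-envelope sketch: per-capture buffer cost grows linearly with the
# capture token count. Buffers-per-layer and dtype size are assumptions,
# not measured TRT-LLM allocations.
def capture_buffer_bytes(capture_tokens, hidden_dim, num_layers,
                         buffers_per_layer=2, dtype_bytes=2):
    # One [capture_tokens, hidden_dim] buffer per piece per layer, fp16 assumed.
    per_buffer = capture_tokens * hidden_dim * dtype_bytes
    return per_buffer * buffers_per_layer * num_layers

gib = capture_buffer_bytes(8192, hidden_dim=8192, num_layers=80) / 2 ** 30
assert gib == 20.0   # roughly 20 GiB for a single 8192-token capture
```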




&lt;h2&gt;
  
  
  6. Configuration Guide
&lt;/h2&gt;

&lt;h3&gt;
  
  
  TensorRT-LLM &lt;code&gt;llm_api_options_yaml&lt;/code&gt; Settings
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Generation-only CUDA graphs (decode phase)&lt;/span&gt;
&lt;span class="na"&gt;cuda_graph_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enable_padding&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;max_batch_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4096&lt;/span&gt;    &lt;span class="c1"&gt;# or explicit batch_sizes list&lt;/span&gt;

&lt;span class="c1"&gt;# Piecewise CUDA graphs (context/mixed phases, requires torch.compile)&lt;/span&gt;
&lt;span class="na"&gt;torch_compile_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enable_piecewise_cuda_graph&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;capture_num_tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;1&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;2&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;4&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;...&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;   &lt;span class="c1"&gt;# Must cover runtime max_num_tokens!&lt;/span&gt;
  &lt;span class="na"&gt;enable_userbuffers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;             &lt;span class="c1"&gt;# Default is true; disable if needed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key Principles for &lt;code&gt;capture_num_tokens&lt;/code&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Must cover &lt;code&gt;max_num_tokens&lt;/code&gt;&lt;/strong&gt;: If the runtime scheduler can produce up to N total tokens, the largest capture point must be &amp;gt;= N. Otherwise, iterations exceeding the max fall back to eager.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dense where iterations cluster&lt;/strong&gt;: Use iteration logs to find the hot zone. Pack capture points there to minimize padding waste.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sparse where few iterations land&lt;/strong&gt;: Ramp-up and transition regions need minimal captures (powers of 2 suffice).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fewer captures = less memory&lt;/strong&gt;: Each capture pre-allocates intermediate buffers sized &lt;code&gt;[capture_tokens, hidden_dim]&lt;/code&gt; per piece. On memory-constrained systems, fewer large captures may be preferable.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
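&lt;p&gt;Principles 1-3 can be sketched as a small generator for the capture list; the hot-zone bounds and stride are placeholders you would read off the iteration logs:&lt;/p&gt;

```python
# Sketch of principles 1-3: a sparse power-of-two ramp, a dense stride inside
# the observed hot zone, and a top point that covers max_num_tokens. The
# hot-zone bounds and stride are placeholders read off iteration logs.
def build_capture_list(hot_lo, hot_hi, hot_stride, max_num_tokens):
    points = set()
    # Sparse ramp: powers of two that do not exceed the limit.
    k = 0
    while min(2 ** k, max_num_tokens) == 2 ** k:
        points.add(2 ** k)
        k += 1
    # Dense hot zone: fixed stride where iterations cluster.
    points.update(range(hot_lo, min(hot_hi, max_num_tokens) + 1, hot_stride))
    # Principle 1: the largest capture point must cover max_num_tokens.
    points.add(max_num_tokens)
    return sorted(points)

caps = build_capture_list(hot_lo=3072, hot_hi=4608, hot_stride=256,
                          max_num_tokens=8192)
assert caps[-1] == 8192 and 3328 in caps
```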

&lt;h3&gt;
  
  
  TorchCompileConfig Defaults (TensorRT-LLM)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;torch_compile_config&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;None&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Torch compile completely off unless explicitly set&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;enable_piecewise_cuda_graph&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;False&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Must opt-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;capture_num_tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;None&lt;/code&gt; (auto: max 3072)&lt;/td&gt;
&lt;td&gt;Auto-generated: &lt;code&gt;[1,2,4,...,128,256,512,...,3072]&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;enable_userbuffers&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;True&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Enabled by default when torch compile is on&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;enable_fullgraph&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;True&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Full graph compilation in torch.compile&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;enable_inductor&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;False&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Inductor backend disabled by default&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Checking Coverage at Runtime
&lt;/h3&gt;

&lt;p&gt;Parse the iteration log and compute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;total_tokens_per_iter = numCtxTokens + numGenRequests

For each iteration:
  - If numCtxTokens == 0: uses generation-only CUDA graph (match on numGenRequests)
  - If numCtxTokens &amp;gt; 0:  uses piecewise CUDA graph (match on total_tokens)

Hit rate = (iterations with total_tokens &amp;lt;= max(capture_num_tokens)) / (total iterations)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Target: &lt;strong&gt;&amp;gt;95% hit rate&lt;/strong&gt; on piecewise graphs for meaningful benefit.&lt;/p&gt;
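&lt;p&gt;A sketch of that computation, assuming the log has already been parsed into &lt;code&gt;(numCtxTokens, numGenRequests)&lt;/code&gt; pairs (the helper is illustrative, not a TRT-LLM utility):&lt;/p&gt;

```python
# Sketch of the coverage check, assuming the iteration log is already parsed
# into (numCtxTokens, numGenRequests) pairs. The helper is illustrative, not
# a TRT-LLM utility. max(t, cap) == cap means "t does not exceed cap".
def piecewise_hit_rate(iterations, capture_num_tokens):
    cap = max(capture_num_tokens)
    # Only mixed/context iterations (numCtxTokens nonzero) use piecewise graphs.
    mixed = [c + g for c, g in iterations if c != 0]
    if not mixed:
        return 1.0
    hits = sum(1 for t in mixed if max(t, cap) == cap)
    return hits / len(mixed)

log = [(0, 64), (512, 8), (4096, 32), (9000, 1)]     # made-up iterations
assert piecewise_hit_rate(log, [1, 2, 4, 8192]) == 2 / 3
```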

</description>
      <category>ai</category>
      <category>deeplearning</category>
      <category>llm</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
