<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shah Fahad</title>
    <description>The latest articles on DEV Community by Shah Fahad (@sfahad).</description>
    <link>https://dev.to/sfahad</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3783516%2F5f595b19-bafe-4d58-85e5-0d83a323d253.jpg</url>
      <title>DEV Community: Shah Fahad</title>
      <link>https://dev.to/sfahad</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sfahad"/>
    <language>en</language>
    <item>
      <title>Programming Hopper GPUs: The Memory Consistency Model</title>
      <dc:creator>Shah Fahad</dc:creator>
      <pubDate>Sat, 25 Apr 2026 14:05:13 +0000</pubDate>
      <link>https://dev.to/sfahad/programming-hopper-gpus-the-memory-consistency-model-24m7</link>
      <guid>https://dev.to/sfahad/programming-hopper-gpus-the-memory-consistency-model-24m7</guid>
      <description>&lt;p&gt;You've decided to write fast code for an NVIDIA Hopper GPU. Maybe you want to build a custom attention kernel. Maybe you're trying to understand how CUTLASS and ThunderKittens work under the hood. Either way, before you can use any of the cool Hopper hardware — TMA, wgmma, mbarriers, clusters — you need to understand one thing: &lt;strong&gt;how memory works when thousands of threads share it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's what the &lt;em&gt;memory consistency model&lt;/em&gt; describes. It's the rulebook for what one thread can or cannot see about another thread's writes. Without it, the rest of the stack is undefined behavior waiting to happen.&lt;/p&gt;

&lt;p&gt;This article covers the minimum you need to write correct multi-threaded GPU code. We'll build it from one concrete bug, then introduce the two primitives that fix it.&lt;/p&gt;

&lt;p&gt;Code examples are schematic unless noted. Full PTX spells out scope, state space, and type — for example &lt;code&gt;fence.release.gpu&lt;/code&gt; or &lt;code&gt;st.release.gpu.global.u32&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The bug we're trying to prevent
&lt;/h2&gt;

&lt;p&gt;Imagine the simplest possible producer-consumer pattern. One thread fills a buffer, then sets a flag to say "I'm done." Another thread waits for the flag, then reads the buffer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Producer (thread T1):              Consumer (thread T2):

  data = 42;       // (1)            while (flag == 0) {}    // (3) wait for flag
  flag = 1;        // (2)            x = data;               // (4) expects 42
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looks fine, right? It's how you'd write it in any single-threaded language. But on a GPU (or any modern CPU, for that matter), this code can produce &lt;code&gt;x == 0&lt;/code&gt; instead of &lt;code&gt;x == 42&lt;/code&gt;. Here's why.&lt;/p&gt;

&lt;p&gt;From the producer's own perspective, &lt;code&gt;(1)&lt;/code&gt; and &lt;code&gt;(2)&lt;/code&gt; are two completely unrelated stores to two different memory locations. There's nothing in T1's own code that depends on the order in which they hit memory. So the &lt;strong&gt;compiler is free to reorder them&lt;/strong&gt;, and the &lt;strong&gt;hardware is free to commit them out of order&lt;/strong&gt; — neither change is visible from inside T1.&lt;/p&gt;

&lt;p&gt;But the consumer T2 &lt;em&gt;can&lt;/em&gt; see the reorder. If &lt;code&gt;flag = 1&lt;/code&gt; arrives in memory &lt;em&gt;before&lt;/em&gt; &lt;code&gt;data = 42&lt;/code&gt; does, then T2 might see &lt;code&gt;flag == 1&lt;/code&gt;, exit its wait loop, and read &lt;code&gt;data&lt;/code&gt; while it's still &lt;code&gt;0&lt;/code&gt;. The consumer's assumption — "if the flag is set, the data is ready" — silently breaks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;What T1 thinks happens:        What T2 might actually see:

  data = 42                      flag = 1     ← becomes visible first
  flag = 1                       data = 42    ← arrives later

  T2 reads:                      T2 reads:
    flag → 1 ✓                     flag → 1 ✓ (exits wait loop)
    data → 42 ✓                    data → 0 ✗ (reads stale value!)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;This is the central problem of shared-memory programming.&lt;/strong&gt; A program that looks logically correct can fail because the system reorders operations that are independent from the issuing thread's perspective but not independent from another thread's perspective.&lt;/p&gt;

&lt;p&gt;The memory consistency model gives us tools to prevent this. The two main tools are called the &lt;strong&gt;release fence&lt;/strong&gt; and the &lt;strong&gt;acquire fence&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The big idea: a pair of fences
&lt;/h2&gt;

&lt;p&gt;Think of fences as a contract between the producer and the consumer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Producer side:&lt;/strong&gt; "I promise to &lt;em&gt;publish&lt;/em&gt; all my prior work before I tell anyone I'm done."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consumer side:&lt;/strong&gt; "I promise to wait for the announcement, then &lt;em&gt;read fresh data&lt;/em&gt; — not anything I might have cached earlier."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The producer holds up its end of the contract using a &lt;strong&gt;release fence&lt;/strong&gt;. The consumer holds up its end using an &lt;strong&gt;acquire fence&lt;/strong&gt;. When both sides cooperate, the bug above goes away.&lt;/p&gt;

&lt;p&gt;Let's see exactly what each fence does, then we'll put them together to fix the bug.&lt;/p&gt;




&lt;h2&gt;
  
  
  The release fence
&lt;/h2&gt;

&lt;p&gt;A release fence sits between two pieces of producer code. Conceptually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  data = 42;          // some prior work

  fence.release;      // "publish everything above me before anything below me"

  flag = 1;           // the announcement
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things to understand:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The compiler cannot move prior memory operations past the fence.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without the fence, a compiler optimization might reorder &lt;code&gt;data = 42&lt;/code&gt; to come &lt;em&gt;after&lt;/em&gt; &lt;code&gt;flag = 1&lt;/code&gt;. With &lt;code&gt;fence.release&lt;/code&gt; in between, that's forbidden. The fence is a one-way wall: stuff above stays above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Without the fence:                With fence.release:

  data = 42                         data = 42
  flag = 1                          fence.release  ← prior writes pinned above
                                    flag = 1
  ⇒ compiler/hardware may swap     ⇒ data = 42 stays before flag = 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The fence is one-way, not two-way.&lt;/strong&gt; This is a subtle but important point. The release fence only blocks downward motion — operations &lt;em&gt;above&lt;/em&gt; it can't sink below. Operations &lt;em&gt;below&lt;/em&gt; the fence are technically allowed to move above it; the fence does not pin them in place. This is by design: the release only promises "everything above me is published," so it has nothing to say about what comes after.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    memory op above                    ← cannot sink below the fence
   memory op above                    ← cannot sink below the fence
   ─── fence.release ───
   memory op below                    ← in general, CAN move above the fence
   memory op below                    ← in general, CAN move above the fence
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This sounds dangerous — wouldn't the compiler also be free to move the &lt;code&gt;flag = 1&lt;/code&gt; write above the fence? In source PTX, the answer is no for the store that forms the release pattern: NVIDIA defines the pattern by program order, so moving that store would change the program's synchronization behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The fence makes the prior writes "publishable."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the more subtle part. After the fence, when the producer writes &lt;code&gt;flag = 1&lt;/code&gt;, that flag write effectively &lt;em&gt;carries with it&lt;/em&gt; the promise that everything before the fence is also visible. It's like attaching a receipt to the announcement: "by the way, my data is ready too."&lt;/p&gt;

&lt;p&gt;This is called a &lt;strong&gt;release pattern&lt;/strong&gt;: a &lt;code&gt;fence.release&lt;/code&gt; followed by a write to a flag. The pair together is what publishes the producer's prior work to anyone watching the flag.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern exception — the paired flag write is pinned.&lt;/strong&gt; The "later operations can move above the fence" rule has one exception: the &lt;em&gt;specific strong write&lt;/em&gt; that pairs with the fence to form the release pattern. NVIDIA's model defines &lt;code&gt;fence.release; st.relaxed [flag], 1&lt;/code&gt; as a release pattern because those instructions occur in that program order. A compiler or assembler cannot preserve the same PTX semantics while hoisting that store above the fence, because doing so would dismantle the release pattern entirely.&lt;/p&gt;

&lt;p&gt;So in practice, the rules around a release fence are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Anything before the fence stays before it.&lt;/li&gt;
&lt;li&gt;Unrelated memory operations after the fence are not protected by the release semantics in the same way.&lt;/li&gt;
&lt;li&gt;The specific strong flag write that pairs with the fence must stay after the fence — it's the whole point of the pattern.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;One important nuance:&lt;/strong&gt; the fence does NOT eagerly broadcast anything to other threads. It just guarantees an &lt;em&gt;ordering&lt;/em&gt; — that the data is ready in memory before the flag is. Other threads will observe the change at their own pace; the release fence doesn't push anything to them. This is why we'll need a spin loop on the consumer side.&lt;/p&gt;




&lt;h2&gt;
  
  
  The acquire fence
&lt;/h2&gt;

&lt;p&gt;The acquire fence is the mirror image. It sits between two pieces of consumer code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  flag_value = flag;   // read the announcement

  fence.acquire;       // "anything below me sees a fresh view of memory"

  x = data;            // read the data, guaranteed up-to-date
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things to understand:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The compiler cannot move later memory operations before the fence.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without the fence, a compiler might prefetch &lt;code&gt;data&lt;/code&gt; into a register before the flag check (a common optimization). With &lt;code&gt;fence.acquire&lt;/code&gt;, that's forbidden. The fence is again a one-way wall, but in the opposite direction this time: stuff below stays below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Without the fence:                With fence.acquire:

  flag_value = flag                 flag_value = flag
  x = data         ← may be         fence.acquire  ← later reads pinned below
                     prefetched     x = data
                     above flag!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The fence is one-way, in the opposite direction from release.&lt;/strong&gt; Symmetrically with &lt;code&gt;fence.release&lt;/code&gt;, the acquire fence only blocks upward motion — operations &lt;em&gt;below&lt;/em&gt; it can't hoist above. Operations &lt;em&gt;above&lt;/em&gt; the fence are technically allowed to move below it; the fence doesn't pin them in place. This is by design: the acquire only promises "everything below me sees a fresh view," so it has nothing to say about what came before.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    memory op above                    ← in general, CAN move below the fence
   memory op above                    ← in general, CAN move below the fence
   ─── fence.acquire ───
   memory op below                    ← cannot hoist above the fence
   memory op below                    ← cannot hoist above the fence
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Just like for release, this raises a question: couldn't the compiler also move the &lt;code&gt;flag_value = flag&lt;/code&gt; read &lt;em&gt;below&lt;/em&gt; the fence? Not for the read that forms the acquire pattern: NVIDIA defines the pattern by program order, so moving that read would change the synchronization behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The fence makes subsequent reads "fresh."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After the fence, later reads are ordered after any matching release that was observed through the flag. In other words, the consumer cannot use an old value of &lt;code&gt;data&lt;/code&gt; from before the producer's published write; it must see the released write or a later write in that location's coherence order.&lt;/p&gt;

&lt;p&gt;This is called an &lt;strong&gt;acquire pattern&lt;/strong&gt;: a read of a flag, followed by a &lt;code&gt;fence.acquire&lt;/code&gt;. The pair together is what makes the consumer pick up the producer's published data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern exception — the paired flag read is pinned.&lt;/strong&gt; The "earlier operations can move below the fence" rule has one exception: the &lt;em&gt;specific strong read&lt;/em&gt; that pairs with the fence to form the acquire pattern. NVIDIA's model defines &lt;code&gt;ld.relaxed [flag]; fence.acquire&lt;/code&gt; as an acquire pattern because those instructions occur in that program order. Moving the flag read below the fence would dismantle the pattern.&lt;/p&gt;

&lt;p&gt;So in practice, the rules around an acquire fence are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Anything after the fence stays after it.&lt;/li&gt;
&lt;li&gt;Unrelated memory operations before the fence are not protected by the acquire semantics in the same way.&lt;/li&gt;
&lt;li&gt;The specific strong flag read that pairs with the fence must stay before the fence — it's the whole point of the pattern.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Putting them together: the bug, fixed
&lt;/h2&gt;

&lt;p&gt;Now let's go back to our buggy code and fix it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Producer (thread T1):                 Consumer (thread T2):

  data = 42;          // (1)            while (1) {
  fence.release;      // (F_R)            flag_value = flag;     // (3)
  flag = 1;           // (2)              if (flag_value == 1) break;
                                        }
                                        fence.acquire;           // (F_A)
                                        x = data;                // (4)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Walk through it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;T1 writes &lt;code&gt;data = 42&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;T1 hits &lt;code&gt;fence.release&lt;/code&gt;. This guarantees &lt;code&gt;data = 42&lt;/code&gt; is committed to memory before anything that follows.&lt;/li&gt;
&lt;li&gt;T1 writes &lt;code&gt;flag = 1&lt;/code&gt;. This is the announcement; it now "carries" the visibility of &lt;code&gt;data = 42&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;T2 spins on &lt;code&gt;flag&lt;/code&gt; until it reads &lt;code&gt;1&lt;/code&gt;. (We need to spin because the flag's new value takes some real time to propagate to T2 — fences don't push, they just promise ordering.)&lt;/li&gt;
&lt;li&gt;T2 hits &lt;code&gt;fence.acquire&lt;/code&gt;. This guarantees subsequent reads see fresh data — no stale cached values can satisfy them.&lt;/li&gt;
&lt;li&gt;T2 reads &lt;code&gt;data&lt;/code&gt;. Because the producer's release published &lt;code&gt;data = 42&lt;/code&gt; before publishing the flag, and T2's acquire ensures fresh reads after seeing the flag, T2 is guaranteed to see &lt;code&gt;42&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The key insight:&lt;/strong&gt; the bug only got fixed because BOTH sides cooperated. The release alone wouldn't help — the consumer would still cache stale data. The acquire alone wouldn't help — the producer's writes might still arrive out of order. Memory ordering is always a &lt;em&gt;contract&lt;/em&gt; between producer and consumer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    Producer side:                       Consumer side:

   data = 42                            spin: flag_value = flag
       │                                              │
       ▼                                              ▼
   fence.release        ─ publishes ─►   fence.acquire   ◄─ acquires
       │                  data + flag    │                  fresh view
       ▼                  together       ▼
   flag = 1                              x = data → 42 ✓
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Shorter forms: baked-in release and acquire
&lt;/h2&gt;

&lt;p&gt;In the example above, the producer wrote two separate instructions for the publish step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fence.release;
st.relaxed [flag], 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PTX provides a shorter form that bakes the release semantics directly into the store. Instead of the two-instruction pair, you can write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = 42;
st.release [flag], 1;     // store with release semantics built in
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For this adjacent publish-store pattern, &lt;code&gt;st.release&lt;/code&gt; gives the same release/acquire synchronization guarantee as &lt;code&gt;fence.release; st.relaxed [flag], 1&lt;/code&gt;. The release behavior is fused into the store, so the whole release pattern lives in a single instruction.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important: it has to be &lt;code&gt;st.release&lt;/code&gt;, not &lt;code&gt;st.relaxed&lt;/code&gt;.&lt;/strong&gt; A bare &lt;code&gt;st.relaxed [flag], 1&lt;/code&gt; is &lt;em&gt;not&lt;/em&gt; a release pattern — it's a strong store with no release ordering effect on prior writes. The release pattern requires either &lt;code&gt;st.release&lt;/code&gt; (release baked in) &lt;strong&gt;or&lt;/strong&gt; the explicit pair &lt;code&gt;fence.release; st.relaxed [flag], 1&lt;/code&gt;. Don't drop the &lt;code&gt;fence.release&lt;/code&gt; and assume the &lt;code&gt;.relaxed&lt;/code&gt; qualifier carries any release meaning — it doesn't.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The consumer side has the matching shortcut:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flag_value = ld.acquire [flag];   // load with acquire semantics built in
x = data;                         // sees fresh data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;ld.acquire&lt;/code&gt; is shorthand for "do the read, and include acquire semantics in that same operation." Same synchronization guarantee, one instruction.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Same warning, mirrored.&lt;/strong&gt; A bare &lt;code&gt;ld.relaxed [flag]&lt;/code&gt; is &lt;em&gt;not&lt;/em&gt; an acquire pattern — it's a strong load with no acquire ordering effect on later reads. The acquire pattern requires either &lt;code&gt;ld.acquire&lt;/code&gt; (acquire baked in) &lt;strong&gt;or&lt;/strong&gt; the explicit pair &lt;code&gt;ld.relaxed [flag]; fence.acquire&lt;/code&gt;. Don't drop the &lt;code&gt;fence.acquire&lt;/code&gt; and assume the &lt;code&gt;.relaxed&lt;/code&gt; qualifier carries any acquire meaning.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The fully-baked-in producer/consumer becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Producer:                          Consumer:

  data = 42;                         while (1) {
  st.release [flag], 1;                flag_value = ld.acquire [flag];
                                       if (flag_value == 1) break;
                                     }
                                     x = data;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same contract, less typing. &lt;strong&gt;For most producer-consumer patterns, this is the form you want.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When would you reach for the explicit &lt;code&gt;fence.release&lt;/code&gt; / &lt;code&gt;fence.acquire&lt;/code&gt; form instead? A few cases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;You want one fence to publish multiple flags.&lt;/strong&gt; A single &lt;code&gt;fence.release&lt;/code&gt; followed by several flag writes gives all of them release semantics — no need to repeat the fence per flag.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You want a cheap relaxed read inside a spin loop, then "commit" to acquiring only when you see the right value.&lt;/strong&gt; Repeating &lt;code&gt;ld.acquire&lt;/code&gt; on every spin iteration can be more expensive than an &lt;code&gt;ld.relaxed&lt;/code&gt; loop followed by one &lt;code&gt;fence.acquire&lt;/code&gt; after exit. We'll see this pattern again with &lt;code&gt;mbarrier&lt;/code&gt; in a later article.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You need the combined fence variant&lt;/strong&gt; discussed next.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  When you need both: &lt;code&gt;fence.acq_rel&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;So far we've kept the producer and consumer roles cleanly separated — release on the producer side, acquire on the consumer side. Each side has a single direction to worry about.&lt;/p&gt;

&lt;p&gt;But sometimes a single point in your code needs to do both jobs at once:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Refresh&lt;/strong&gt; its view to see what someone else published (acquire side).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Publish&lt;/strong&gt; its own writes for someone else to see (release side).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's exactly what &lt;code&gt;fence.acq_rel&lt;/code&gt; is for. It's a &lt;code&gt;fence.release&lt;/code&gt; and &lt;code&gt;fence.acquire&lt;/code&gt; rolled into one — both effects fused at the same point.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fence.acq_rel;     // both: prior writes published AND subsequent reads refreshed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both directions of the "one-way wall" apply at once: prior memory ops cannot sink below it (release side), and later memory ops cannot hoist above it (acquire side).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    memory op above                    ← cannot sink below the fence (release side)
   memory op above                    ← cannot sink below the fence
   ─── fence.acq_rel ───
   memory op below                    ← cannot hoist above the fence (acquire side)
   memory op below                    ← cannot hoist above the fence
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Where this actually shows up: atomic operations
&lt;/h3&gt;

&lt;p&gt;The clearest case for &lt;code&gt;acq_rel&lt;/code&gt; semantics is &lt;strong&gt;atomic operations that read AND write at the same time&lt;/strong&gt; — like &lt;code&gt;atom.cas&lt;/code&gt; (compare-and-swap), &lt;code&gt;atom.exch&lt;/code&gt; (exchange), &lt;code&gt;atom.add&lt;/code&gt; (fetch-and-add). These instructions can be simultaneously consumer-like (they read an old value) and producer-like (they write a new value).&lt;/p&gt;

&lt;p&gt;Take a shared work queue as an example. Producers fill slots in a queue, then publish progress by advancing a shared index. Consumers atomically claim index values and then read the corresponding slots. In a queue like that, the index is not just a counter — it's also the handoff point between "someone published work" and "someone else is allowed to consume it."&lt;/p&gt;

&lt;p&gt;Now imagine one thread advances the index with an atomic fetch-and-add:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Atomic fetch-and-add: returns the old value
old = atom.add.acq_rel [queue_index], 1;
slot = old;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why might this atomic want &lt;strong&gt;acq_rel&lt;/strong&gt; semantics? Because the same operation may be doing both jobs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Acquire half (the read).&lt;/strong&gt; The old counter value may point to a slot whose contents were written by another thread before that thread published the index. Before this thread reads &lt;code&gt;queue[slot]&lt;/code&gt;, it needs to acquire those slot writes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Release half (the write).&lt;/strong&gt; The new counter value may become the value that a later thread observes before reading work or metadata this thread prepared. If this thread did setup before advancing the index, the write-half of the atomic can publish that setup.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the kind of situation where &lt;code&gt;atom.add.acq_rel&lt;/code&gt; makes sense: the same read-modify-write is consuming someone else's publication and publishing this thread's own update. PTX bakes both effects into the atomic with the &lt;code&gt;.acq_rel&lt;/code&gt; qualifier, for example &lt;code&gt;atom.acq_rel.gpu.global.add.u32&lt;/code&gt; in full PTX syntax.&lt;/p&gt;

&lt;p&gt;For comparison, an ordinary lock usually does &lt;strong&gt;not&lt;/strong&gt; need &lt;code&gt;acq_rel&lt;/code&gt; on lock acquire. Taking a lock is normally just an acquire operation: it consumes the previous holder's release. Releasing the lock is normally just a release store:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;st.release [lock], 0;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So the common lock pattern is acquire on lock acquisition and release on unlock. Reach for &lt;code&gt;acq_rel&lt;/code&gt; when a single atomic really does both jobs for your algorithm.&lt;/p&gt;

&lt;h3&gt;
  
  
  Standalone &lt;code&gt;fence.acq_rel&lt;/code&gt; is rarer
&lt;/h3&gt;

&lt;p&gt;A standalone &lt;code&gt;fence.acq_rel&lt;/code&gt; (not attached to an atomic) shows up less often, but it's there if you have a non-atomic point in your code that needs both effects. Most of the time it appears inside higher-level synchronization primitives (atomics, mbarriers — which we'll see in later articles) rather than as a programmer-written explicit fence.&lt;/p&gt;

&lt;p&gt;One bit of PTX trivia: &lt;code&gt;fence.acq_rel&lt;/code&gt; is the &lt;strong&gt;default&lt;/strong&gt; when you write a plain &lt;code&gt;fence&lt;/code&gt; without a &lt;code&gt;.sem&lt;/code&gt; qualifier. So &lt;code&gt;fence.gpu&lt;/code&gt; is shorthand for &lt;code&gt;fence.acq_rel.gpu&lt;/code&gt;. This is partly why &lt;code&gt;acq_rel&lt;/code&gt; shows up so often in PTX disassembly.&lt;/p&gt;

&lt;p&gt;For typical one-direction producer-consumer patterns, &lt;strong&gt;stick with the matched &lt;code&gt;release&lt;/code&gt; + &lt;code&gt;acquire&lt;/code&gt; pair&lt;/strong&gt; we built up in the earlier sections — it's simpler and avoids asking for stronger ordering than you need. Reach for &lt;code&gt;acq_rel&lt;/code&gt; when one fence point (or one atomic operation) really does need to do both jobs at once, like the work-queue atomic above.&lt;/p&gt;




&lt;h2&gt;
  
  
  Things to remember
&lt;/h2&gt;

&lt;p&gt;A handful of practical points worth keeping in your head as you write code.&lt;/p&gt;

&lt;h3&gt;
  
  
  The producer's flag write must be a "real" memory write
&lt;/h3&gt;

&lt;p&gt;In PTX, this means using something like &lt;code&gt;st.relaxed [flag], 1&lt;/code&gt; rather than the default &lt;code&gt;st [flag], 1&lt;/code&gt;. The default is what's called a "weak" write — the memory model gives it no cross-thread guarantees, and the release pattern won't form correctly with it. For any flag another thread will read, use a strong store such as &lt;code&gt;st.relaxed&lt;/code&gt; or &lt;code&gt;st.release&lt;/code&gt; (the baked-in form discussed above). (&lt;code&gt;.acquire&lt;/code&gt; and &lt;code&gt;.acq_rel&lt;/code&gt; do not apply to ordinary stores.)&lt;/p&gt;

&lt;p&gt;The same applies to the consumer's flag read: use &lt;code&gt;ld.relaxed [flag]&lt;/code&gt; or &lt;code&gt;ld.acquire [flag]&lt;/code&gt;, not plain &lt;code&gt;ld [flag]&lt;/code&gt;. (&lt;code&gt;.release&lt;/code&gt; only applies to stores, not loads.)&lt;/p&gt;

&lt;h3&gt;
  
  
  Spin loops are normal — fences don't push data
&lt;/h3&gt;

&lt;p&gt;A common confusion is "I issued the release, why doesn't the consumer see it immediately?" The release fence orders your writes; it doesn't shove them down other threads' throats. The consumer's &lt;code&gt;flag&lt;/code&gt; read may still return &lt;code&gt;0&lt;/code&gt; for a while after the producer wrote &lt;code&gt;1&lt;/code&gt;, because the new value has to propagate through the memory system.&lt;/p&gt;

&lt;p&gt;That's why the consumer is in a &lt;code&gt;while (flag == 0)&lt;/code&gt; loop — it's bridging the propagation gap. This is the standard, correct idiom.&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory ordering is about visibility, not synchronization
&lt;/h3&gt;

&lt;p&gt;A second and related confusion: people see "fence" and assume it makes threads pause or wait for each other. It doesn't. Fences and release/acquire qualifiers are entirely about &lt;strong&gt;what one thread sees in memory when it reads&lt;/strong&gt; — not about pausing or aligning threads in time.&lt;/p&gt;

&lt;p&gt;Two different concerns, two different toolboxes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Visibility&lt;/strong&gt; (this article): one thread's writes become observable to another thread's reads, in a well-defined order. Fences and release/acquire qualifiers handle this. A fence may wait until the calling thread's relevant memory operations have reached the point of coherence for its scope, but it &lt;strong&gt;doesn't wait for another thread&lt;/strong&gt; to arrive, acknowledge, or read anything.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synchronization&lt;/strong&gt; (a separate concern): threads actually pause and align in time — e.g., "no thread proceeds past this point until all threads have arrived." That's the job of barriers like &lt;code&gt;bar.sync&lt;/code&gt;, &lt;code&gt;barrier.cluster&lt;/code&gt;, and the &lt;code&gt;mbarrier&lt;/code&gt; object. We'll cover those in the next article.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The spin loop in our consumer code is what bridges the two when you want them together: the consumer wants to wait for the producer, so it busy-loops on the flag. That waiting is the programmer's choice (the &lt;code&gt;while&lt;/code&gt; loop), not something the fence is doing for them.&lt;/p&gt;

&lt;p&gt;Keeping these two ideas separate is one of the most useful mental moves you can make in concurrent GPU programming. The memory model is purely about visibility; for actual "wait until X" behavior, reach for the synchronization primitives in the next article.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fences have a "scope"
&lt;/h3&gt;

&lt;p&gt;In PTX you'll write things like &lt;code&gt;fence.release.gpu&lt;/code&gt; or &lt;code&gt;fence.release.cta&lt;/code&gt;. The scope says which threads can directly participate in the ordering guarantee.&lt;/p&gt;

&lt;p&gt;To understand why scope matters, picture the GPU's memory as a layered hierarchy, with caches and storage at different distances from each thread:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   Per-SM:    L1 / shared-memory crossbar       (closest, fastest)
   Cluster:   DSMEM crossbar (Hopper+)
   Chip:      L2 cache                          (chip-wide)
   System:    HBM and host-memory fabric        (farthest, slowest)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For two threads to communicate via a release/acquire pair, the operations must use a scope that includes both threads. A useful hardware mental model is that wider scopes generally have to order through a farther-away coherence point, so they cost more. The exact cache behavior is an implementation detail; the architectural rule is the set of threads covered by the scope.&lt;/p&gt;

&lt;p&gt;The scopes are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;.cta&lt;/code&gt;&lt;/strong&gt; — between threads in the same thread block. This is the smallest PTX memory-model scope and is usually the cheapest.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;.cluster&lt;/code&gt;&lt;/strong&gt; — between threads in the same thread block cluster (Hopper's new feature). This is useful for cluster-level shared memory and cluster barriers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;.gpu&lt;/code&gt;&lt;/strong&gt; — between any threads in the current program on the same GPU. In the usual global-memory mental model, this means ordering at a chip-wide level such as L2, so it is more expensive than &lt;code&gt;.cta&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;.sys&lt;/code&gt;&lt;/strong&gt; — across the whole program, including host threads and kernels on other GPUs. This may involve the system fabric and is the most expensive scope.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Rough hardware mental model:

   .cta     ──►  usually local CTA-visible path
   .cluster ──►  cluster-level shared-memory network
   .gpu     ──►  chip-wide ordering point, often L2
   .sys     ──►  system-visible fabric

           ──── generally increasing cost ────►
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pick the smallest scope that covers your producer-consumer pair.&lt;/strong&gt; A &lt;code&gt;fence.release.cta&lt;/code&gt; is much cheaper than a &lt;code&gt;fence.release.gpu&lt;/code&gt;, which is much cheaper than a &lt;code&gt;fence.release.sys&lt;/code&gt;. There's no reason to pay for a wider scope than your actual readers need. If both threads are guaranteed to be in the same CTA, use &lt;code&gt;.cta&lt;/code&gt;. If they could be on any SM, you need &lt;code&gt;.gpu&lt;/code&gt;.&lt;/p&gt;
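&lt;p&gt;The selection rule can be written down as a tiny decision function. This is an illustration of the rule, not a real API; the &lt;code&gt;(gpu_id, cluster_id, cta_id)&lt;/code&gt; triple is just a hypothetical way to encode where each thread lives:&lt;/p&gt;

```python
# Illustrative only: pick the smallest PTX scope that covers both threads.
# Each thread's "location" is a hypothetical (gpu_id, cluster_id, cta_id)
# triple; this encoding is an assumption for the sketch, not CUDA API.
def smallest_scope(a, b):
    if a == b:
        return ".cta"        # same thread block
    if a[:2] == b[:2]:
        return ".cluster"    # same cluster, different CTA
    if a[0] == b[0]:
        return ".gpu"        # same GPU, different cluster
    return ".sys"            # different GPUs (or GPU and host)

print(smallest_scope((0, 1, 2), (0, 1, 2)))  # .cta
print(smallest_scope((0, 1, 2), (0, 1, 3)))  # .cluster
print(smallest_scope((0, 1, 2), (0, 4, 2)))  # .gpu
print(smallest_scope((0, 1, 2), (1, 1, 2)))  # .sys
```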

&lt;p&gt;We'll see scopes again in later articles when we discuss thread block clusters and the TMA engine.&lt;/p&gt;

&lt;h3&gt;
  
  
  Single-threaded code never needs fences
&lt;/h3&gt;

&lt;p&gt;If only one thread is touching some piece of memory, you don't need any of this. The reordering issues only matter when there's a second thread observing. Inside one thread, you get this guarantee for free:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Whenever you read a memory location, you see the most recent value &lt;em&gt;your own thread&lt;/em&gt; wrote to &lt;strong&gt;that same location&lt;/strong&gt; — in source-code order, no fences needed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The important nuance is the phrase &lt;strong&gt;"that same location."&lt;/strong&gt; The compiler and hardware are still free to reorder operations to &lt;em&gt;different&lt;/em&gt; memory addresses, even within a single thread. That reorder is invisible to you, because nothing inside your thread depends on the order. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data[0] = 42;        // (1)
data[1] = 99;        // (2)
x = data[0];         // (3) guaranteed to see 42 — same address as (1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, &lt;code&gt;(1)&lt;/code&gt; and &lt;code&gt;(2)&lt;/code&gt; write to &lt;em&gt;different&lt;/em&gt; addresses, so the compiler may commit them to memory in either order. But &lt;code&gt;(3)&lt;/code&gt; reads from the same address as &lt;code&gt;(1)&lt;/code&gt;, so it's guaranteed to see &lt;code&gt;42&lt;/code&gt; — the per-address ordering rule pins this down. From your thread's perspective, the reorder of &lt;code&gt;(1)&lt;/code&gt; and &lt;code&gt;(2)&lt;/code&gt; is invisible because nothing in your code reads &lt;code&gt;data[1]&lt;/code&gt; afterward to detect it.&lt;/p&gt;

&lt;p&gt;This is exactly the source of the bug at the start of the article: the producer's &lt;code&gt;data&lt;/code&gt; write and &lt;code&gt;flag&lt;/code&gt; write are to &lt;em&gt;different&lt;/em&gt; addresses, so they can be reordered freely from the producer's own perspective. It only becomes a problem because &lt;em&gt;another&lt;/em&gt; thread (the consumer) reads both addresses and notices the reorder. The fences fix it for that exact case.&lt;/p&gt;

&lt;p&gt;So the rule is: fences are needed only at the precise places where two threads communicate, and even then only because they need to coordinate their views of &lt;em&gt;different&lt;/em&gt; memory addresses. Within one thread, same-address reads work without fences.&lt;/p&gt;




&lt;h2&gt;
  
  
  Mental model summary
&lt;/h2&gt;

&lt;p&gt;If you remember nothing else from this article, remember this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Memory ordering on a GPU is a contract between a producer and a consumer.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The producer uses a &lt;strong&gt;release fence&lt;/strong&gt; to publish all its prior writes before announcing "done."&lt;/p&gt;

&lt;p&gt;The consumer uses an &lt;strong&gt;acquire fence&lt;/strong&gt; to ensure all its subsequent reads see fresh data after observing the "done" signal.&lt;/p&gt;

&lt;p&gt;Both halves of the contract are needed. Each fence is a one-way wall — release pins prior writes above it; acquire pins subsequent reads below it. Together they make a producer-consumer pattern correct on a system where independent operations can otherwise be reordered freely.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What's next in this series
&lt;/h2&gt;

&lt;p&gt;The memory consistency model is the foundation. With it in hand, the rest of the Hopper stack starts to make sense. Coming up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Execution synchronization&lt;/strong&gt; — &lt;code&gt;bar.sync&lt;/code&gt;, &lt;code&gt;barrier.cluster&lt;/code&gt;, and how to align threads in time (not just memory).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;mbarrier&lt;/code&gt;&lt;/strong&gt; — a programmable synchronization object that combines memory ordering, thread arrival counting, and async-engine completion tracking. The Hopper workhorse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Asynchronous copies&lt;/strong&gt; — the TMA engine, &lt;code&gt;cp.async.bulk&lt;/code&gt;, and the powerful but subtle &lt;code&gt;CUtensorMap&lt;/code&gt; descriptor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;wgmma&lt;/code&gt;&lt;/strong&gt; — Hopper's warp-group matrix multiply-accumulate, the engine that drives modern GEMM and attention kernels.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each will build on the same release-acquire ideas you just learned. The TMA engine, mbarriers, and warp-group MMA all rely on exactly the same kind of "one side publishes, the other side acquires" contract — just with more sophisticated machinery to support things like async byte-counting and distributed shared memory.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>computerscience</category>
      <category>performance</category>
      <category>programming</category>
    </item>
    <item>
      <title>CUDA Graphs in LLM Inference: Deep Dive</title>
      <dc:creator>Shah Fahad</dc:creator>
      <pubDate>Sat, 21 Feb 2026 07:09:21 +0000</pubDate>
      <link>https://dev.to/sfahad/cuda-graphs-in-llm-inference-deep-dive-36pb</link>
      <guid>https://dev.to/sfahad/cuda-graphs-in-llm-inference-deep-dive-36pb</guid>
      <description>&lt;h2&gt;
  
  
  Why CUDA Graphs Matter for LLM Inference
&lt;/h2&gt;

&lt;p&gt;LLM inference -- especially the token generation (decode) phase -- is &lt;strong&gt;often dominated by CPU overhead rather than GPU compute&lt;/strong&gt;. Each decode step generates a single token per sequence: the actual GPU work (small matmuls, attention over one query) can finish in microseconds, but the CPU can spend tens of microseconds &lt;em&gt;per kernel launch&lt;/em&gt; on launch bookkeeping, driver calls, and synchronization. With hundreds of kernel launches per transformer forward pass, this CPU overhead can become the bottleneck (though at higher batch sizes or with heavier kernels, decode can still become GPU-bound).&lt;/p&gt;

&lt;p&gt;Making matters worse, the CPU isn't just launching kernels -- it's also preparing data for the next batch: updating token IDs, managing the KV cache block table, running the scheduler, and handling request arrivals/completions. All of this competes for CPU time with kernel launches, amplifying the bottleneck. The GPU ends up sitting idle between launches, throughput drops, latency rises, and expensive GPU cycles are wasted on nothing.&lt;/p&gt;

&lt;p&gt;CUDA graphs solve this by &lt;strong&gt;recording the entire kernel sequence once&lt;/strong&gt; and &lt;strong&gt;replaying it with a single CPU call&lt;/strong&gt;. The driver overhead is paid once at capture time; every subsequent replay amortizes hundreds of per-kernel launches into a single replay launch, largely avoiding the repeated per-kernel launch bookkeeping. For decode-heavy workloads, this can eliminate the majority of per-step overhead.&lt;/p&gt;
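&lt;p&gt;A quick back-of-the-envelope illustrates the amortization. The numbers below are assumptions chosen for illustration only; real per-launch cost, kernel count, and replay cost vary by hardware, driver, and model:&lt;/p&gt;

```python
# Back-of-envelope arithmetic for launch-overhead amortization.
# All numbers are illustrative assumptions, not measurements.
launch_overhead_us = 10    # assumed CPU cost per kernel launch
kernels_per_step = 500     # assumed launches per decode forward pass
replay_overhead_us = 20    # assumed CPU cost of one graph replay call

eager_us = launch_overhead_us * kernels_per_step   # per-step CPU overhead, eager
graph_us = replay_overhead_us                      # per-step CPU overhead, graph
speedup = eager_us / graph_us

print(eager_us, graph_us, round(speedup))  # 5000 20 250
```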

&lt;p&gt;This post walks through how CUDA graphs work in the context of LLM serving -- why decode is a natural fit, why context/mixed batches are harder, and how TensorRT-LLM (TRT-LLM) implements both monolithic and piecewise CUDA graph strategies.&lt;/p&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;1. CUDA Graphs Fundamentals&lt;/li&gt;
&lt;li&gt;2. Generation (Decode) CUDA Graphs&lt;/li&gt;
&lt;li&gt;3. KV Cache with Static Addresses&lt;/li&gt;
&lt;li&gt;4. Why Context &amp;amp; Mixed Batches Are Hard&lt;/li&gt;
&lt;li&gt;5. Piecewise CUDA Graphs (torch.compile)&lt;/li&gt;
&lt;li&gt;6. Configuration Guide&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  1. CUDA Graphs Fundamentals
&lt;/h2&gt;

&lt;p&gt;A CUDA graph captures a sequence of GPU operations (kernel launches, memory copies) into a single replayable unit.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Gets Captured (Fixed)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------------------------------------------------------------------+
| CUDA Graph Recording                                               |
|                                                                    |
| +----------+      +----------+      +----------+      +----------+ |
| | Kernel A |      | Kernel B |      | Kernel C |      | Kernel D | |
| |grid(4,1) |-----&amp;gt;|grid(8,1) |-----&amp;gt;|grid(4,1) |-----&amp;gt;|grid(2,1) | |
| |@0x100 -&amp;gt; |      |@0x200 -&amp;gt; |      |@0x300 -&amp;gt; |      |@0x400 -&amp;gt; | |
| |  0x200   |      |  0x300   |      |  0x400   |      |  0x500   | |
| +----------+      +----------+      +----------+      +----------+ |
+--------------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Baked into the graph:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which kernels to launch, in what order&lt;/li&gt;
&lt;li&gt;Memory addresses (pointers) each kernel reads/writes&lt;/li&gt;
&lt;li&gt;Kernel launch parameters (grid dims, block dims, shared memory)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;NOT baked (can change between replays):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The actual data at those addresses&lt;/li&gt;
&lt;li&gt;Data-dependent control flow inside kernels (loops, branches)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Replay Contract
&lt;/h3&gt;

&lt;p&gt;On replay, the entire sequence launches with minimal CPU overhead. The user's responsibility is to place correct data at the captured addresses before each replay.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why It's Fast
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+----------------------------+
| Without CUDA Graph (eager) |
|                            |
| CPU -- launch --&amp;gt; Kernel A |
| CPU &amp;lt;-- wait ----+         |
| CPU -- launch --&amp;gt; Kernel B |
| CPU &amp;lt;-- wait ----+         |
| CPU -- launch --&amp;gt; Kernel C |
| CPU &amp;lt;-- wait ----+         |
| CPU -- launch --&amp;gt; Kernel D |
|                            |
| = 4x CPU round-trips       |
+----------------------------+

+------------------------------------------+
| With CUDA Graph                          |
|                                          |
| CPU -- replay --&amp;gt; [ Kernel A, B, C, D ]  |
|                                          |
| = 1 launch, entire chain executes on GPU |
+------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  2. Generation (Decode) CUDA Graphs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why Decode Is Well-Suited
&lt;/h3&gt;

&lt;p&gt;In decode, each sequence contributes exactly &lt;strong&gt;1 new token&lt;/strong&gt; per step. Total tokens = batch size. This makes the input shape predictable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+---------------------------------------------------------------+
| Decode step N                                                 |
|                                                               |
| seq0: 1 token  \                                              |
| seq1: 1 token   \                                             |
|                   &amp;gt;-- batch_size = 4, shape = [4, hidden_dim] |
| seq2: 1 token   /                                             |
| seq3: 1 token  /                                              |
+---------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pre-allocated Static Buffers
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------------------------------------------------------+
| Input token IDs buffer (pre-allocated, max_batch_size = 4096)   |
|                                                                 |
| [ token_0 ][ token_1 ][ token_2 ][ token_3 ] ... [ token_4095 ] |
|   @addr_0    @addr_1    @addr_2    @addr_3          @addr_4095  |
|                                                                 |
|   fixed addresses -- same every replay                          |
+-----------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Multiple Graphs for Different Batch Sizes
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Captured graphs (one per supported batch size, typically powers of two):

  batch_size   grid size     reads
  ----------   ---------     -----
       1  --&amp;gt;  (1, ...)  --&amp;gt; addr_0
       2  --&amp;gt;  (2, ...)  --&amp;gt; addr_0..1
       4  --&amp;gt;  (4, ...)  --&amp;gt; addr_0..3
       8  --&amp;gt;  (8, ...)  --&amp;gt; addr_0..7
       :
    4096  --&amp;gt;  (4096,..) --&amp;gt; addr_0..4095
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At runtime with 5 active sequences → use batch_size=8 graph, pad 3 dummy sequences.&lt;/p&gt;
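&lt;p&gt;The bucket selection can be sketched as a small helper. The power-of-two capture set is the typical configuration described above; &lt;code&gt;pick_graph_batch&lt;/code&gt; is a hypothetical name for illustration, not a real TRT-LLM API:&lt;/p&gt;

```python
import math

def pick_graph_batch(active_sequences):
    # Smallest captured power-of-two batch size that fits.
    # Assumes graphs were captured for 1, 2, 4, ... as described above.
    return 2 ** math.ceil(math.log2(active_sequences))

def padding(active_sequences):
    # Dummy sequences needed to fill the chosen graph's batch.
    return pick_graph_batch(active_sequences) - active_sequences

print(pick_graph_batch(5), padding(5))  # 8 3
```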

&lt;h3&gt;
  
  
  Intermediate Activations Have Stable Addresses
&lt;/h3&gt;

&lt;p&gt;During capture, intermediate tensors are allocated from a graph-private memory pool, giving them stable device addresses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+----------------------------------------------------------+
| Transformer layer (captured; all addresses fixed)        |
|                                                          |
| [QKV Projection] ----&amp;gt; [Attention] ----&amp;gt; [Output Proj]   |
|  in @A, out @B          in @B, out @C    in @C, out @D   |
|                                               |          |
|                                               v          |
| [FFN Layer 1] --------&amp;gt; [FFN Layer 2] ----&amp;gt; (next layer) |
|  in @D, out @E           in @E, out @F                   |
+----------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On replay, the same chain executes at the same addresses. Intermediate buffers are never freed between replays -- they persist in the graph's memory pool. This is why &lt;strong&gt;each captured batch size has its own set of stable-address buffers&lt;/strong&gt;, and capturing many batch sizes consumes significant GPU memory.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the Runtime Updates Before Each Replay
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------------------------------------------+
| 1. input_token_ids[0:B]  &amp;lt;-- new token IDs          |
| 2. position_ids[0:B]     &amp;lt;-- new positions          |
| 3. sequence_lengths[0:B] += 1                       |
| 4. block_table           &amp;lt;-- update if new KV block |
+-----------------------------------------------------+
| 5. &amp;gt;&amp;gt;&amp;gt; REPLAY GRAPH &amp;lt;&amp;lt;&amp;lt;                             |
+-----------------------------------------------------+
| 6. new_logits &amp;lt;-- output_buffer[0:B]                |
+-----------------------------------------------------+
| B = batch_size                                      |
+-----------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  3. KV Cache with Static Addresses
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Apparent Contradiction
&lt;/h3&gt;

&lt;p&gt;KV cache grows every step (new K,V written for each token), yet CUDA graphs require fixed addresses. The solution: &lt;strong&gt;paged/block-based KV cache&lt;/strong&gt; with an &lt;strong&gt;indirection table&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Block-Based KV Cache Pool
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-------------------------------------------------------------+
| KV cache pool (pre-allocated; addresses never change)       |
|                                                             |
| [ Block 0 ][ Block 1 ][ Block 2 ][ Block 3 ][ Block 4 ] ... |
|   @blk_0     @blk_1     @blk_2     @blk_3     @blk_4        |
|  32 slots   32 slots   32 slots   32 slots   32 slots       |
|                                                             |
| each block holds K,V for a fixed number of tokens (e.g. 32) |
+-------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Block Table (Indirection)
&lt;/h3&gt;

&lt;p&gt;Each sequence has a block table mapping logical positions to physical blocks:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Logical positions&lt;/th&gt;
&lt;th&gt;Physical block&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;tokens 0–31&lt;/td&gt;
&lt;td&gt;Block 7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tokens 32–63&lt;/td&gt;
&lt;td&gt;Block 12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tokens 64–95&lt;/td&gt;
&lt;td&gt;Block 3 (partially filled, through token 82)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Sequence 0's block table at fixed address @tbl_0&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How Attention Kernel Uses Indirection
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Inside the attention kernel (pseudo-code):
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;each&lt;/span&gt; &lt;span class="n"&gt;past&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="n"&gt;position&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sequence_length&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;seq_id&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="n"&gt;block_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;block_table&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;seq_id&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;block_size&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;    &lt;span class="c1"&gt;# read from @tbl_0
&lt;/span&gt;    &lt;span class="n"&gt;offset&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;block_size&lt;/span&gt;
    &lt;span class="n"&gt;K_i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kv_cache_pool&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;block_idx&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;              &lt;span class="c1"&gt;# indirect lookup into pool
&lt;/span&gt;    &lt;span class="n"&gt;V_i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kv_cache_pool&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;block_idx&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K_i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step-by-Step: How KV Cache Grows Within CUDA Graph
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Buffer&lt;/th&gt;
&lt;th&gt;Step N&lt;/th&gt;
&lt;th&gt;Step N+1&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;block_table&lt;/code&gt; @tbl_0&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[7, 12, 3]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[7, 12, 3]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Same address, same indices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;seq_length&lt;/code&gt; @len_0&lt;/td&gt;
&lt;td&gt;&lt;code&gt;82&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;83&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Same address, incremented&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;kv_pool Block 3, slot 18&lt;/td&gt;
&lt;td&gt;K,V for token 82&lt;/td&gt;
&lt;td&gt;K,V for token 82&lt;/td&gt;
&lt;td&gt;Unchanged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;kv_pool Block 3, slot 19&lt;/td&gt;
&lt;td&gt;&lt;em&gt;(empty)&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;K,V for token 83&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;NEW&lt;/strong&gt; — written by kernel&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The kernel wrote to a different slot because &lt;code&gt;sequence_length&lt;/code&gt; told it to. All addresses remain fixed -- only the data changes.&lt;/p&gt;
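&lt;p&gt;The slot arithmetic behind the table can be reproduced directly, using the article's &lt;code&gt;block_size = 32&lt;/code&gt; and block table &lt;code&gt;[7, 12, 3]&lt;/code&gt;:&lt;/p&gt;

```python
# Recomputing the table above: fixed addresses, data-driven slot choice.
block_size = 32
block_table = [7, 12, 3]    # logical chunk index to physical block

def kv_slot(token_pos):
    physical_block = block_table[token_pos // block_size]
    offset = token_pos % block_size
    return physical_block, offset

print(kv_slot(82))  # (3, 18): step N wrote token 82 to Block 3, slot 18
print(kv_slot(83))  # (3, 19): step N+1 writes the next slot, same block
```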

&lt;h3&gt;
  
  
  Why This Doesn't Violate CUDA Graph Rules
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What's fixed (baked in graph)&lt;/th&gt;
&lt;th&gt;What changes (data at fixed addrs)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;kv_cache_pool&lt;/code&gt; base address&lt;/td&gt;
&lt;td&gt;Which blocks are assigned (block_table data)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;block_table&lt;/code&gt; buffer address&lt;/td&gt;
&lt;td&gt;The integer block indices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;sequence_length&lt;/code&gt; buffer address&lt;/td&gt;
&lt;td&gt;The actual length values&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kernel grid dimensions&lt;/td&gt;
&lt;td&gt;Data-dependent loops inside kernel iterate more/fewer times&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  4. Why Context &amp;amp; Mixed Batches Are Hard
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Core Problem: Variable Total Token Count
&lt;/h3&gt;

&lt;p&gt;In decode, total tokens = batch size (each sequence = 1 token). In context/mixed, total tokens varies wildly:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Batch type&lt;/th&gt;
&lt;th&gt;Sequences&lt;/th&gt;
&lt;th&gt;Total tokens&lt;/th&gt;
&lt;th&gt;Predictable?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Decode&lt;/td&gt;
&lt;td&gt;&lt;code&gt;seq₀(1) + seq₁(1) + seq₂(1)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Yes — always = batch_size&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context&lt;/td&gt;
&lt;td&gt;&lt;code&gt;seq₀(137) + seq₁(2048)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2185&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mixed&lt;/td&gt;
&lt;td&gt;&lt;code&gt;seq₀(512 prefill) + seq₁(1 decode)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;513&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Problem 1: Kernel Grid Dimensions Depend on Total Tokens
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Kernel launch -- grid dims are a function of input shape&lt;/span&gt;
&lt;span class="n"&gt;dim3&lt;/span&gt; &lt;span class="nf"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;total_tokens&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;TILE_M&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;TILE_M&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hidden_dim&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;TILE_N&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;TILE_N&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;matmul_kernel&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hidden_dim&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;total_tokens&lt;/th&gt;
&lt;th&gt;grid size&lt;/th&gt;
&lt;th&gt;Implication&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(4, …)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4 blocks — one graph&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3072&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(24, …)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;24 blocks — &lt;strong&gt;different&lt;/strong&gt; graph required&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The grid is baked at capture time. Different total tokens = different grid = different graph.&lt;/p&gt;
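&lt;p&gt;To reproduce the table's numbers, assume &lt;code&gt;TILE_M = 128&lt;/code&gt; (an illustrative choice; the real tile size depends on the kernel):&lt;/p&gt;

```python
TILE_M = 128   # assumed tile size, chosen so the numbers match the table

def grid_x(total_tokens):
    # ceil(total_tokens / TILE_M), as in the launch code above
    return -(-total_tokens // TILE_M)

print(grid_x(512), grid_x(3072))  # 4 24
```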

&lt;h3&gt;
  
  
  Problem 2: Attention Grid Depends on Max Context Seq Length and Num Context Requests
&lt;/h3&gt;

&lt;p&gt;For MLP, every token is independent: &lt;code&gt;output[i] = MLP(input[i])&lt;/code&gt;. Fix total_tokens and you're done.&lt;/p&gt;

&lt;p&gt;For attention, the kernel grid depends on &lt;strong&gt;two per-iteration variables&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------------------------------------------------------------+
| TRT-LLM attention grid (simplified call chain)               |
|                                                              |
| Python (trtllm.py)                                           |
|   max_ctx_seq_len = seq_lens[:num_contexts].max()            |
|                             |                                |
|                             v                                |
| C++ (fmhaRunner / fused_multihead_attention_v2)              |
|   |                   |                   |                  |
|   v                   v                   v                  |
|   grid.x              grid.y              grid.z             |
|   ceil(s/unroll)      num_heads           num_ctx_requests   |
|   [VARIES]            [FIXED]             [VARIES]           |
|                                                              |
|   --&amp;gt; grid = ( ceil(s/unroll), num_heads, num_ctx_requests ) |
+--------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Grid = &lt;code&gt;(ceil(max_ctx_seq_len / unroll_step), num_heads, num_context_requests)&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;TRT-LLM uses a &lt;strong&gt;padded tiling strategy&lt;/strong&gt;: the grid is sized for the longest context request, and shorter requests have their extra tiles skip computation (the kernel checks &lt;code&gt;cu_seqlens&lt;/code&gt; internally):&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Padded tiling: 3 context requests, &lt;code&gt;seq_lens = [64, 128, 256]&lt;/code&gt;, &lt;code&gt;unroll_step = 64&lt;/code&gt;.&lt;br&gt;
Grid = &lt;code&gt;(4, num_heads, 3)&lt;/code&gt; — sized for longest request (256).&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Tile 0&lt;/th&gt;
&lt;th&gt;Tile 1&lt;/th&gt;
&lt;th&gt;Tile 2&lt;/th&gt;
&lt;th&gt;Tile 3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Req 0&lt;/strong&gt; (64 tokens)&lt;/td&gt;
&lt;td&gt;compute&lt;/td&gt;
&lt;td&gt;skip&lt;/td&gt;
&lt;td&gt;skip&lt;/td&gt;
&lt;td&gt;skip&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Req 1&lt;/strong&gt; (128 tokens)&lt;/td&gt;
&lt;td&gt;compute&lt;/td&gt;
&lt;td&gt;compute&lt;/td&gt;
&lt;td&gt;skip&lt;/td&gt;
&lt;td&gt;skip&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Req 2&lt;/strong&gt; (256 tokens)&lt;/td&gt;
&lt;td&gt;compute&lt;/td&gt;
&lt;td&gt;compute&lt;/td&gt;
&lt;td&gt;compute&lt;/td&gt;
&lt;td&gt;compute&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
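
&lt;p&gt;As a toy sketch of the padded tiling above (function and parameter names are mine, not TRT-LLM's actual code), the grid is sized for the longest request and a per-tile bounds check decides compute vs. skip:&lt;/p&gt;

```python
import math

def context_attention_grid(seq_lens, num_heads, unroll_step=64):
    # Grid is sized for the longest context request; shorter requests
    # skip their out-of-range tiles inside the kernel (via cu_seqlens).
    max_ctx_seq_len = max(seq_lens)
    grid = (math.ceil(max_ctx_seq_len / unroll_step), num_heads, len(seq_lens))

    def tile_active(x, z):
        # Tile x of request z computes only if it starts inside the sequence.
        return x * unroll_step < seq_lens[z]

    return grid, tile_active

grid, active = context_attention_grid([64, 128, 256], num_heads=8)
# grid == (4, 8, 3): request 0 computes only tile 0, request 2 all four.
```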

&lt;p&gt;Even with this padded approach, the grid changes per iteration because &lt;strong&gt;both &lt;code&gt;max_ctx_seq_len&lt;/code&gt; and &lt;code&gt;num_context_requests&lt;/code&gt;&lt;/strong&gt; change depending on which requests the scheduler assigns to the context phase:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Iteration&lt;/th&gt;
&lt;th&gt;Context requests&lt;/th&gt;
&lt;th&gt;max_len&lt;/th&gt;
&lt;th&gt;grid&lt;/th&gt;
&lt;th&gt;What changed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(2, heads, 32)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(2, heads, 1)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;grid.z&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;256&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(4, heads, 2)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;grid.x and z&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Different iterations produce different grids/launch parameters — the combination space explodes across multiple variables (e.g., &lt;code&gt;max_ctx_seq_len&lt;/code&gt;, &lt;code&gt;num_context_requests&lt;/code&gt;, and sequence-length distributions), making “one reusable CUDA graph” impractical.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A CUDA graph captured with one grid would produce &lt;strong&gt;incorrect results&lt;/strong&gt; if replayed with a different grid/launch configuration (missing tiles = unprocessed tokens; extra tiles = out-of-bounds/garbage work). To make this safe, you’d need to capture graphs for many combinations or pad/standardize to a fixed worst-case launch shape.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why Decode Attention Doesn't Have This Problem
&lt;/h3&gt;

&lt;p&gt;In decode, every sequence has exactly one query token, so attention takes a different kernel path where:&lt;/p&gt;

&lt;p&gt;Decode attention: &lt;code&gt;grid = (batch_size, num_heads)&lt;/code&gt; — both fixed per captured graph.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;batch_size&lt;/code&gt; is fixed per captured graph (one graph per supported batch size)&lt;/li&gt;
&lt;li&gt;Variable KV cache lengths are handled by data-dependent loops &lt;strong&gt;inside&lt;/strong&gt; the kernel (loop over &lt;code&gt;sequence_length[i]&lt;/code&gt;) -- the grid doesn't change&lt;/li&gt;
&lt;/ul&gt;
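
&lt;p&gt;A toy model of why this is graph-safe (the function is illustrative, not TRT-LLM's API): the grid depends only on batch size, while each (batch, head) block's loop bound is data the kernel reads at run time:&lt;/p&gt;

```python
def decode_attention_launch(kv_lens, num_heads):
    # Grid depends only on batch size; variable KV lengths become
    # data-dependent loop bounds *inside* each (batch, head) block.
    batch_size = len(kv_lens)
    grid = (batch_size, num_heads)
    inner_loop_bounds = {(b, h): kv_lens[b]
                         for b in range(batch_size) for h in range(num_heads)}
    return grid, inner_loop_bounds

g1, _ = decode_attention_launch([100, 2000, 57], num_heads=4)
g2, _ = decode_attention_launch([101, 2001, 58], num_heads=4)
# Same grid even though every KV length changed -> safe to bake into a graph.
```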
&lt;h3&gt;
  
  
  Where Each Layer Type Falls
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Shape&lt;/th&gt;
&lt;th&gt;Capturable?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Layer norm&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;[total_tokens, hidden]&lt;/code&gt; — flat&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q, K, V projections&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;[total_tokens, hidden]&lt;/code&gt; — flat matmuls&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Fused attention&lt;/strong&gt; (Q@K^T, softmax, scores@V)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;per-sequence, variable tiles&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;No&lt;/strong&gt; — grid varies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output projection&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;[total_tokens, hidden]&lt;/code&gt; — flat matmul&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MLP&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;[total_tokens, hidden]&lt;/code&gt; — flat matmuls&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h2&gt;
  
  
  5. Piecewise CUDA Graphs (torch.compile)
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Two Separate CUDA Graph Systems
&lt;/h3&gt;

&lt;p&gt;TRT-LLM uses &lt;strong&gt;two independent&lt;/strong&gt; CUDA graph systems -- understanding this distinction is critical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                  Python model forward()
                          |
            +-------------+-------------+
            |                           |
            v                           v
+-------------------------+ +-------------------------+
| torch.compile           | | Native CUDA Graph       |
| (Dynamo tracing)        | | (stream capture)        |
+-------------------------+ +-------------------------+
| Traces Python -&amp;gt; FX     | | Records GPU kernels     |
| Decomposes to ATen ops  | | on the CUDA stream      |
| Custom ops -&amp;gt; split pt  | | Captures everything     |
+-------------------------+ +-------------------------+
| Result: Pieces          | | Result: One monolithic  |
| [graph][eager][graph]...| | graph of full fwd pass  |
+-------------------------+ +-------------------------+
            |                           |
            v                           v
  Used for: mixed/context    Used for: decode-only
  (attn grid varies)         (attn grid fixed)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Generation-only (decode)&lt;/strong&gt;: Uses &lt;strong&gt;native &lt;code&gt;torch.cuda.CUDAGraph&lt;/code&gt;&lt;/strong&gt; capture. This records every kernel launch on the CUDA stream at the driver level -- including FlashAttention. It doesn't need to "understand" the kernels; it just records them. This works because decode attention's grid depends only on &lt;code&gt;batch_size&lt;/code&gt; (fixed per capture).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Piecewise (mixed/context)&lt;/strong&gt;: Uses &lt;strong&gt;torch.compile&lt;/strong&gt; to trace the model into an FX graph, then TRT-LLM's custom backend splits at attention boundaries and captures each non-attention piece as a CUDA graph. Attention runs eagerly.&lt;/p&gt;
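
&lt;p&gt;The per-iteration dispatch between the two systems can be sketched as (names hypothetical, not TRT-LLM's actual API):&lt;/p&gt;

```python
def choose_graph_path(num_ctx_tokens, num_gen_requests):
    # Pure decode -> native torch.cuda.CUDAGraph capture of the full
    # forward pass; any context tokens -> piecewise (attention eager).
    if num_ctx_tokens == 0 and num_gen_requests > 0:
        return "generation-only"   # keyed by batch size
    return "piecewise"             # keyed by total token count
```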

&lt;h3&gt;
  
  
  The Piecewise Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------------------------------------------------------+
| CUDA GRAPH -- piece 1                     [captured]   |
|   layer_norm -&amp;gt; qkv_projection                         |
|   pre-allocates output buffer @ addr_X                 |
+--------------------------------------------------------+
|                         |                              |
|                         v                              |
+--------------------------------------------------------+
| EAGER -- not graphed                 [runs every time] |
|   flash_attention(q, k, v, cu_seqlens, ...)            |
|   writes result IN-PLACE to addr_X                     |
+--------------------------------------------------------+
|                         |                              |
|                         v                              |
+--------------------------------------------------------+
| CUDA GRAPH -- piece 2                     [captured]   |
|   reads from addr_X                                    |
|   output_proj -&amp;gt; layer_norm -&amp;gt; mlp_up -&amp;gt;               |
|   activation -&amp;gt; mlp_down -&amp;gt; residual_add               |
+--------------------------------------------------------+
|                         |                              |
|                         v                              |
|                 ... next layer ...                     |
+--------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The in-place attention design is critical: attention writes into a buffer pre-allocated by piece 1, ensuring piece 2's captured graph reads from the correct fixed address.&lt;/p&gt;
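
&lt;p&gt;A minimal Python analogy of that in-place contract (not real CUDA graph code, just the addressing discipline): captured pieces always read and write the same buffers, and the eager step must mutate a buffer in place rather than return a fresh one:&lt;/p&gt;

```python
class CapturedPiece:
    # Toy stand-in for a captured graph piece: the output buffer
    # address is fixed at "capture" time and reused on every replay.
    def __init__(self, fn, out_buf):
        self.fn, self.out = fn, out_buf

    def replay(self, inp):
        self.out[:] = self.fn(inp)   # always writes into the same buffer

addr_x = [0.0] * 4                   # buffer "pre-allocated by piece 1"
piece1 = CapturedPiece(lambda xs: [x + 1.0 for x in xs], addr_x)

def eager_attention(buf):
    buf[:] = [x * 2.0 for x in buf]  # in-place: piece 2 still reads addr_x

piece1.replay([1.0, 2.0, 3.0, 4.0])  # addr_x = [2, 3, 4, 5]
eager_attention(addr_x)              # addr_x = [4, 6, 8, 10]
# Piece 2, captured reading addr_x, now sees the attention result.
```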

&lt;h3&gt;
  
  
  Why Attention Is Excluded
&lt;/h3&gt;

&lt;p&gt;Attention is excluded from CUDA graph capture for a &lt;strong&gt;correctness&lt;/strong&gt; reason, not a tracing limitation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The tracing works fine.&lt;/strong&gt; TRT-LLM registers a FakeTensor implementation for the attention custom op, so &lt;code&gt;torch.compile&lt;/code&gt; in fullgraph mode traces the entire forward pass into one FX graph without graph breaks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The exclusion is a deliberate choice.&lt;/strong&gt; TRT-LLM's &lt;code&gt;piecewise_optimizer.py&lt;/code&gt; explicitly identifies attention ops and excludes them from CUDA graph pieces:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tensorrt_llm/_torch/compilation/piecewise_optimizer.py
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;is_call_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ops&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trtllm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attn_custom_op_inplace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ops&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trtllm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mla_custom_op_inplace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="n"&gt;exclude_modules_id&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ← excluded from CUDA graph capture
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The reason: replay correctness.&lt;/strong&gt; If attention were captured in a CUDA graph, the kernel's grid dimensions would be baked in. But attention's grid depends on the per-sequence query distribution, not just total tokens:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Kernel source&lt;/th&gt;
&lt;th&gt;grid.x&lt;/th&gt;
&lt;th&gt;grid.y&lt;/th&gt;
&lt;th&gt;grid.z&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fused_multihead_attention_v2.cpp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ceil(params.s / mUnrollStep)&lt;/code&gt; — &lt;strong&gt;varies&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;params.h&lt;/code&gt; (heads) — fixed&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;params.b&lt;/code&gt; (batch) — &lt;strong&gt;varies&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;triton_attention.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;num_prefill&lt;/code&gt; — &lt;strong&gt;varies&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;n_heads&lt;/code&gt; — fixed&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ceil(max(seq_len) / SEQ_BLOCK)&lt;/code&gt; — &lt;strong&gt;varies&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;unfusedAttentionKernels.cu&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ceil(q_length / 32.0f)&lt;/code&gt; — &lt;strong&gt;varies&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For the same &lt;code&gt;total_tokens=4096&lt;/code&gt;, different sequence distributions can produce different grids/launch metadata. A captured graph replays the capture-time launch configuration; unless you pad/standardize to that same configuration, replaying on a different distribution would be incorrect. MLP doesn't have this problem because its grid depends primarily on &lt;code&gt;total_tokens&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What &lt;code&gt;capture_num_tokens&lt;/code&gt; Controls
&lt;/h3&gt;

&lt;p&gt;TRT-LLM pre-captures piecewise graphs for specific total-token counts; at runtime, each iteration pads &lt;strong&gt;up&lt;/strong&gt; to the next captured value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;capture_num_tokens: [1, 2, 4, 8, ..., 8192]

Runtime: 4160 total tokens → pad up to the next captured value (e.g., 5120)
  - Waste: (5120 - 4160) / 5120 = 18.7% extra compute
  - Benefit: CUDA graph replay for MLP pieces (zero launch overhead)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
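
&lt;p&gt;The pad-up rule and its waste can be sketched in a few lines (the capture list here is a made-up example that includes a 5120 point, matching the numbers above):&lt;/p&gt;

```python
import bisect

def pad_to_capture(total_tokens, capture_num_tokens):
    # Pick the smallest captured size >= total_tokens;
    # None means no capture covers it (eager fallback).
    sizes = sorted(capture_num_tokens)
    i = bisect.bisect_left(sizes, total_tokens)
    return sizes[i] if i < len(sizes) else None

sizes = [1, 2, 4, 8, 1024, 2048, 3072, 4096, 5120, 8192]
padded = pad_to_capture(4160, sizes)   # -> 5120
waste = (padded - 4160) / padded       # 0.1875, i.e. ~18.7% padded-away work
```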



&lt;h3&gt;
  
  
  Graph Type Summary
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Graph Type&lt;/th&gt;
&lt;th&gt;Capture Mechanism&lt;/th&gt;
&lt;th&gt;What It Captures&lt;/th&gt;
&lt;th&gt;When Used&lt;/th&gt;
&lt;th&gt;Key Parameter&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Generation-only&lt;/td&gt;
&lt;td&gt;Native &lt;code&gt;torch.cuda.CUDAGraph&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Full forward pass (including attention)&lt;/td&gt;
&lt;td&gt;Pure decode iterations&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cuda_graph_config.batch_sizes&lt;/code&gt; or &lt;code&gt;max_batch_size&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Piecewise&lt;/td&gt;
&lt;td&gt;torch.compile + native capture per piece&lt;/td&gt;
&lt;td&gt;All non-attention ops (attention runs eager)&lt;/td&gt;
&lt;td&gt;Mixed/context iterations&lt;/td&gt;
&lt;td&gt;&lt;code&gt;torch_compile_config.capture_num_tokens&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Memory vs. Coverage Trade-off
&lt;/h3&gt;

&lt;p&gt;Each piecewise capture at token count N pre-allocates intermediate buffers of size &lt;code&gt;[N, hidden_dim]&lt;/code&gt; per piece per layer. Capturing at large N (e.g., 8192) can consume enough GPU memory to shrink KV cache capacity below usable levels; combined with an aggressive &lt;code&gt;kv_cache_free_gpu_mem_fraction&lt;/code&gt;, this can reduce the KV cache's maximum length enough to cause warmup failures.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Configuration Guide
&lt;/h2&gt;

&lt;h3&gt;
  
  
  TensorRT-LLM &lt;code&gt;llm_api_options_yaml&lt;/code&gt; Settings
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Generation-only CUDA graphs (decode phase)&lt;/span&gt;
&lt;span class="na"&gt;cuda_graph_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enable_padding&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;max_batch_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4096&lt;/span&gt;    &lt;span class="c1"&gt;# or explicit batch_sizes list&lt;/span&gt;

&lt;span class="c1"&gt;# Piecewise CUDA graphs (context/mixed phases, requires torch.compile)&lt;/span&gt;
&lt;span class="na"&gt;torch_compile_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enable_piecewise_cuda_graph&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;capture_num_tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;1&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;2&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;4&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;...&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;   &lt;span class="c1"&gt;# Must cover runtime max_num_tokens!&lt;/span&gt;
  &lt;span class="na"&gt;enable_userbuffers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;             &lt;span class="c1"&gt;# Default is true; disable if needed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key Principles for &lt;code&gt;capture_num_tokens&lt;/code&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Must cover &lt;code&gt;max_num_tokens&lt;/code&gt;&lt;/strong&gt;: If the runtime scheduler can produce up to N total tokens, the largest capture point must be &amp;gt;= N. Otherwise, iterations exceeding the max fall back to eager.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dense where iterations cluster&lt;/strong&gt;: Use iteration logs to find the hot zone. Pack capture points there to minimize padding waste.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sparse where few iterations land&lt;/strong&gt;: Ramp-up and transition regions need minimal captures (powers of 2 suffice).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fewer captures = less memory&lt;/strong&gt;: Each capture pre-allocates intermediate buffers sized &lt;code&gt;[capture_tokens, hidden_dim]&lt;/code&gt; per piece. On memory-constrained systems, fewer large captures may be preferable.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
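
&lt;p&gt;One way to apply these principles, as a hypothetical helper (this is not a TRT-LLM utility): powers of two through the ramp-up, dense steps in the hot zone, and the runtime max as the final point:&lt;/p&gt;

```python
def build_capture_points(hot_lo, hot_hi, hot_step, max_num_tokens):
    # Sparse powers of two below the hot zone (principle 3),
    # dense steps inside it (principle 2), and max_num_tokens
    # as the largest point (principle 1).
    pts = {1 << i for i in range(hot_lo.bit_length())}   # 1, 2, 4, ...
    pts |= set(range(hot_lo, hot_hi + 1, hot_step))      # dense hot zone
    pts.add(max_num_tokens)                              # cover the runtime max
    return sorted(p for p in pts if p <= max_num_tokens)

pts = build_capture_points(hot_lo=2048, hot_hi=4096, hot_step=256,
                           max_num_tokens=8192)
```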

&lt;h3&gt;
  
  
  TorchCompileConfig Defaults (TensorRT-LLM)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;torch_compile_config&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;None&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Torch compile completely off unless explicitly set&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;enable_piecewise_cuda_graph&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;False&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Must opt-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;capture_num_tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;None&lt;/code&gt; (auto: max 3072)&lt;/td&gt;
&lt;td&gt;Auto-generated: &lt;code&gt;[1,2,4,...,128,256,512,...,3072]&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;enable_userbuffers&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;True&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Enabled by default when torch compile is on&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;enable_fullgraph&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;True&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Full graph compilation in torch.compile&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;enable_inductor&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;False&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Inductor backend disabled by default&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Checking Coverage at Runtime
&lt;/h3&gt;

&lt;p&gt;Parse the iteration log and compute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;total_tokens_per_iter = numCtxTokens + numGenRequests

For each iteration:
  - If numCtxTokens == 0: uses generation-only CUDA graph (match on numGenRequests)
  - If numCtxTokens &amp;gt; 0:  uses piecewise CUDA graph (match on total_tokens)

Hit rate = (iterations with total_tokens &amp;lt;= max(capture_num_tokens)) / (total iterations)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
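
&lt;p&gt;The coverage check above can be scripted like this (a sketch; the iteration log here is made-up data, and field handling is simplified to the two counters named above):&lt;/p&gt;

```python
def piecewise_hit_rate(iter_log, capture_num_tokens):
    # iter_log entries are (numCtxTokens, numGenRequests) pairs; only
    # mixed/context iterations (numCtxTokens > 0) use piecewise graphs.
    cap = max(capture_num_tokens)
    piecewise = [(c, g) for c, g in iter_log if c > 0]
    if not piecewise:
        return 1.0
    hits = sum(1 for c, g in piecewise if c + g <= cap)
    return hits / len(piecewise)

log = [(0, 32), (4096, 8), (2048, 16), (9000, 4)]   # made-up iteration log
rate = piecewise_hit_rate(log, [1, 2, 4, 3072, 8192])
# 2 of 3 piecewise iterations fit under the largest capture point.
```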



&lt;p&gt;Target: &lt;strong&gt;&amp;gt;95% hit rate&lt;/strong&gt; on piecewise graphs for meaningful benefit.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>deeplearning</category>
      <category>llm</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
