<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shah Fahad</title>
    <description>The latest articles on DEV Community by Shah Fahad (@sfahad).</description>
    <link>https://dev.to/sfahad</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3783516%2F5f595b19-bafe-4d58-85e5-0d83a323d253.jpg</url>
      <title>DEV Community: Shah Fahad</title>
      <link>https://dev.to/sfahad</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sfahad"/>
    <language>en</language>
    <item>
      <title>Roofline Model: Why Your Kernel Is Slow, Geometrically</title>
      <dc:creator>Shah Fahad</dc:creator>
      <pubDate>Sun, 10 May 2026 10:31:05 +0000</pubDate>
      <link>https://dev.to/sfahad/roofline-model-why-your-kernel-is-slow-geometrically-2d3l</link>
      <guid>https://dev.to/sfahad/roofline-model-why-your-kernel-is-slow-geometrically-2d3l</guid>
      <description>&lt;p&gt;Every kernel does two kinds of work: it performs arithmetic, and it moves data. A kernel is fast only when both sides are used well. If the arithmetic units are waiting for data, the kernel is memory-bandwidth-limited. If data is arriving fast enough but the arithmetic units are saturated, the kernel is compute-limited.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Roofline performance model&lt;/strong&gt; helps separate those two cases. It gives a first-order way to ask: is this kernel limited by peak arithmetic throughput, by peak memory bandwidth, or by inefficient use of the hardware, where neither peak compute throughput nor peak memory bandwidth is being achieved?&lt;/p&gt;

&lt;p&gt;The model is intentionally simplified. It ignores most hardware details and keeps only the first-order limits that matter for performance: how fast the machine can do FLOPs, and how fast it can move bytes.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Simplified Machine
&lt;/h2&gt;

&lt;p&gt;Roofline starts with a deliberately simple picture of the hardware:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+----------------------+   bytes moved through HBM   +----------------------+
| GPU SMs              | &amp;lt;-------------------------&amp;gt; | HBM                  |
|                      |                             |                      |
| peak_FLOPs_per_sec   |                             | peak_BW              |
+----------------------+                             +----------------------+

        compute work: FLOPs                 memory work: bytes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The left side is the processor. For this article, assume the compute units are a GPU's SMs. We summarize all of them with one number: the maximum rate at which they can perform floating-point work, &lt;code&gt;peak_FLOPs_per_sec&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The right side is the memory the processor needs to read from and write to. For this article, assume that memory is HBM. We summarize the path between HBM and the SMs with one number too: the maximum rate at which bytes can be transferred between HBM and the compute units, &lt;code&gt;peak_BW&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For a kernel to run, two things must happen:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;memory work:   bytes must move to or from HBM
compute work:  FLOPs must be executed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Terminology matters here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FLOPs&lt;/strong&gt; means total floating-point operations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FLOP/s&lt;/strong&gt; means a rate: floating-point operations per second.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Arithmetic Intensity
&lt;/h2&gt;

&lt;p&gt;The key quantity in Roofline is &lt;strong&gt;arithmetic intensity&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Arithmetic Intensity (AI) = FLOPs / bytes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It measures how much computation the kernel performs for each byte moved to or from HBM. A low-AI kernel moves a lot of data for each unit of arithmetic. A high-AI kernel reuses data well and performs many FLOPs per byte.&lt;/p&gt;

&lt;p&gt;There are two versions of AI, and the difference between them matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Algorithmic AI&lt;/strong&gt; is the ideal value implied by the algorithm itself. You count the FLOPs the algorithm must perform, then divide by the minimum bytes the algorithm must move. In this view, every input is loaded only when it is truly needed, reused perfectly after that, and every output is written only as required. Algorithmic AI answers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If memory reuse were perfect, how many FLOPs could this algorithm get per byte?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Observed AI&lt;/strong&gt; is what the implemented kernel actually achieves at runtime. You still count FLOPs, but now the byte count comes from the real traffic through HBM. If the same value is loaded multiple times, those bytes count multiple times. If an uncoalesced access fetches a full memory sector but uses only part of it, the fetched bytes count. If register spills or cache misses create extra traffic, those bytes count too. Observed AI answers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Given the traffic this implementation really generated, how many FLOPs did it get per byte?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By construction, under consistent FLOP accounting and the same memory boundary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;observed AI &amp;lt;= algorithmic AI
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Roofline point uses &lt;strong&gt;observed AI&lt;/strong&gt;. The algorithmic AI is an upper-bound reference: it tells you how far right the kernel should be able to move if wasted memory traffic is removed.&lt;/p&gt;
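&lt;p&gt;The distinction can be made concrete with a small sketch. The FLOP and byte counts below are made-up placeholders; only the structure of the two ratios matters:&lt;/p&gt;

```python
# Both AI variants side by side. The only difference is the denominator:
# ideal (compulsory) traffic vs. the traffic a profiler actually measured.

def arithmetic_intensity(flops, bytes_moved):
    """FLOPs per byte crossing the chosen memory boundary."""
    return flops / bytes_moved

flops = 1.0e9             # FLOPs the algorithm performs (placeholder)
compulsory_bytes = 1.0e8  # minimum traffic the algorithm requires (placeholder)
measured_bytes = 4.0e8    # traffic actually observed at runtime (placeholder)

algorithmic_ai = arithmetic_intensity(flops, compulsory_bytes)  # 10.0
observed_ai = arithmetic_intensity(flops, measured_bytes)       # 2.5

# Under consistent accounting, observed AI can only be lower.
assert observed_ai <= algorithmic_ai
```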




&lt;h2&gt;
  
  
  Drawing the Roofline
&lt;/h2&gt;

&lt;p&gt;Now take the simplified machine and run one kernel on it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Plot the kernel point
&lt;/h3&gt;

&lt;p&gt;At runtime, suppose the kernel:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;moves some total number of bytes to or from HBM&lt;/li&gt;
&lt;li&gt;performs some total number of FLOPs&lt;/li&gt;
&lt;li&gt;takes some amount of time to finish&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From those observed quantities, we compute two values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;observed AI          = FLOPs / bytes moved through HBM
achieved performance = FLOPs / time
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These two values become the kernel's point on the Roofline chart:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x-position = observed AI
y-position = achieved performance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
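&lt;p&gt;Computing the point from measurements can be sketched directly; the three inputs are placeholders standing in for profiler output:&lt;/p&gt;

```python
# Turning three measured quantities into the kernel's Roofline point.

flops = 8.0e12       # total FLOPs the kernel executed (placeholder)
hbm_bytes = 1.0e11   # total bytes moved through HBM (placeholder)
seconds = 0.02       # kernel wall-clock time (placeholder)

observed_ai = flops / hbm_bytes    # x-position: 80 FLOPs/byte
achieved_flops = flops / seconds   # y-position: 4e14 FLOP/s
```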



&lt;p&gt;Next we draw the two hardware limits.&lt;/p&gt;

&lt;h3&gt;
  
  
  Draw the compute roof
&lt;/h3&gt;

&lt;p&gt;The first limit comes from the SMs. No matter how much data reuse the kernel has, it cannot run faster than the maximum arithmetic throughput of the SMs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;compute roof = peak_FLOPs_per_sec
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a horizontal line on the chart.&lt;/p&gt;

&lt;h3&gt;
  
  
  Draw the bandwidth roof
&lt;/h3&gt;

&lt;p&gt;The second limit comes from HBM bandwidth. The hardware has some peak HBM bandwidth, &lt;code&gt;peak_BW&lt;/code&gt;, measured in bytes per second. For a kernel with arithmetic intensity &lt;code&gt;AI&lt;/code&gt;, every byte moved from HBM supports &lt;code&gt;AI&lt;/code&gt; FLOPs. So if the kernel could use the full HBM bandwidth, the maximum compute throughput that HBM could feed is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bandwidth roof = AI × peak_BW
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a diagonal line. At low AI, each FLOP requires many bytes, so even peak HBM bandwidth cannot feed enough data to reach the compute roof. As AI increases, each byte supports more FLOPs, so the bandwidth-limited ceiling rises until it eventually meets the compute roof.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;log(FLOP/s)
   ^
   |        bandwidth roof                     compute roof
   |                         o======================================
   |                       / |
   |                     /   |
   |                   /     |
   |                 /       |
   |               /         |
   |             /           |
   |           /             |
   |         /               |
   |       /                 |
   |     /                   |
   |   /                     |
   +-------------------------+--------------------&amp;gt; log(AI = FLOPs / Bytes)
                             |
                             ridge point

        bandwidth-limited    |    compute-limited
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Roofline is the lower of those two ceilings. At low AI, the diagonal bandwidth roof is lower, so HBM bandwidth is the applicable ceiling. At high AI, the horizontal compute roof is lower, so SM arithmetic throughput is the applicable ceiling.&lt;/p&gt;
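&lt;p&gt;The combined ceiling can be written as a one-line function. The peak numbers below are illustrative placeholders, not any specific GPU's datasheet values:&lt;/p&gt;

```python
def attainable_flops(ai, peak_flops_per_sec, peak_bw):
    """The Roofline: the lower of the compute roof and the bandwidth roof
    at arithmetic intensity `ai` (FLOPs per byte)."""
    return min(peak_flops_per_sec, ai * peak_bw)

# Illustrative peaks only.
PEAK_FLOPS = 1.0e15   # FLOP/s
PEAK_BW = 3.0e12      # bytes/s

assert attainable_flops(10, PEAK_FLOPS, PEAK_BW) == 3.0e13    # bandwidth side
assert attainable_flops(1000, PEAK_FLOPS, PEAK_BW) == 1.0e15  # compute side
```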

&lt;h3&gt;
  
  
  Find the ridge point
&lt;/h3&gt;

&lt;p&gt;The point where the two ceilings meet is the &lt;strong&gt;ridge point&lt;/strong&gt;, also called the &lt;em&gt;machine balance&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AI_ridge = peak_FLOPs_per_sec / peak_BW
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now compare the kernel's observed AI — the x-position we computed from runtime FLOPs and HBM bytes — against this ridge point.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Observed AI left of the ridge:&lt;/strong&gt; the bandwidth roof is the lower ceiling. At this AI, even perfect HBM bandwidth utilization would not reach peak compute throughput. If the point sits below the diagonal roof, the implementation is also failing to use the available bandwidth efficiently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observed AI right of the ridge:&lt;/strong&gt; the compute roof is the lower ceiling. At this AI, HBM bandwidth is high enough in the Roofline model, so peak arithmetic throughput becomes the main limit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The exact ridge location depends on the machine and on which compute peak you choose. The important question is not the absolute value of the ridge; it is whether the kernel's observed AI lands to the left or right of it.&lt;/p&gt;
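&lt;p&gt;The left-or-right test is a single comparison. The peak values below are rough placeholders in the range of a recent datacenter GPU, not vendor-exact figures:&lt;/p&gt;

```python
def regime(observed_ai, peak_flops_per_sec, peak_bw):
    """Classify a kernel point relative to the ridge point (machine balance)."""
    ai_ridge = peak_flops_per_sec / peak_bw
    return "bandwidth-limited" if observed_ai < ai_ridge else "compute-limited"

# Placeholder peaks: ridge = 1e15 / 3.35e12, roughly 298 FLOPs/byte.
PEAK_FLOPS = 1.0e15
PEAK_BW = 3.35e12

assert regime(80, PEAK_FLOPS, PEAK_BW) == "bandwidth-limited"
assert regime(1000, PEAK_FLOPS, PEAK_BW) == "compute-limited"
```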




&lt;h2&gt;
  
  
  The Two Diagnostic Gaps
&lt;/h2&gt;

&lt;p&gt;Once the kernel point is on the chart, there are two different questions to ask.&lt;/p&gt;

&lt;p&gt;First: &lt;strong&gt;is the point below the Roofline at its current observed AI?&lt;/strong&gt; That is a vertical gap.&lt;/p&gt;

&lt;p&gt;Second: &lt;strong&gt;is the observed AI far left of the algorithmic AI?&lt;/strong&gt; That is a horizontal gap.&lt;/p&gt;

&lt;p&gt;These gaps mean different things. A vertical gap means the kernel is not using the relevant hardware limit efficiently. A horizontal gap means the kernel is moving more bytes than the algorithm ideally requires.&lt;/p&gt;
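&lt;p&gt;Both gaps reduce to simple ratios. A minimal sketch, with every input hypothetical:&lt;/p&gt;

```python
def diagnose(observed_ai, achieved, algorithmic_ai, peak_flops, peak_bw):
    """Return the two gap ratios; a value above 1 means there is a gap to close."""
    roof = min(peak_flops, observed_ai * peak_bw)   # applicable ceiling at this AI
    vertical_gap = roof / achieved                  # headroom below the roof
    horizontal_gap = algorithmic_ai / observed_ai   # wasted-traffic factor
    return vertical_gap, horizontal_gap

v, h = diagnose(observed_ai=50, achieved=5.0e13,
                algorithmic_ai=200, peak_flops=1.0e15, peak_bw=3.0e12)
# roof = min(1e15, 1.5e14) = 1.5e14: a 3x vertical gap and a 4x horizontal gap
assert v == 3.0 and h == 4.0
```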

&lt;h3&gt;
  
  
  Vertical Gap: Below the Roofline
&lt;/h3&gt;

&lt;p&gt;At a fixed observed AI, the Roofline tells you the best performance the hardware could provide. If the kernel point sits below that roof, the kernel is not reaching the applicable ceiling.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;log(FLOP/s)
   ^
   |        bandwidth roof                     compute roof
   |                         R==============================
   |                       / |                         |
   |                     /   |                         |
   |                   /     |                         v compute-side gap
   |                 /       |                         C
   |               /         |
   |             /           |
   |           /             |
   |         /               |
   |       /                 |
   |     / |                 |
   |   /   v memory-side gap |
   | /     M                 |
   +-------------------------+-------------------------&amp;gt; log(AI = FLOPs / Bytes)
                             |
                         ridge point

        bandwidth-limited    |    compute-limited
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Point &lt;code&gt;M&lt;/code&gt; is left of the ridge, so its applicable roof is the diagonal bandwidth roof. The vertical distance between the bandwidth roof and &lt;code&gt;M&lt;/code&gt; means the kernel is not achieving peak HBM bandwidth for its current observed AI. The bytes are what they are, but they are not being moved fast enough.&lt;/p&gt;

&lt;p&gt;Common reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Too few memory operations in flight to saturate HBM bandwidth.&lt;/li&gt;
&lt;li&gt;Synchronous loads that stall instead of overlapping with compute.&lt;/li&gt;
&lt;li&gt;Poor producer-consumer overlap: load phase, then compute phase, instead of a pipeline.&lt;/li&gt;
&lt;li&gt;HBM row-buffer thrashing or memory channel imbalance.&lt;/li&gt;
&lt;li&gt;A problem too small to expose enough parallelism.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Point &lt;code&gt;C&lt;/code&gt; is right of the ridge, so its applicable roof is the horizontal compute roof. The vertical distance between the compute roof and &lt;code&gt;C&lt;/code&gt; means the kernel is not achieving peak arithmetic throughput. HBM bandwidth is no longer the limiting ceiling; the SMs are not being kept fully productive.&lt;/p&gt;

&lt;p&gt;Common reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shared-memory bank conflicts that stall operand delivery.&lt;/li&gt;
&lt;li&gt;Register dependency chains or low instruction-level parallelism.&lt;/li&gt;
&lt;li&gt;Warp divergence.&lt;/li&gt;
&lt;li&gt;Not enough asynchronous-MMA work in flight to hide latency.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Horizontal Gap: Observed AI vs Algorithmic AI
&lt;/h3&gt;

&lt;p&gt;The horizontal gap is different. It compares the kernel's &lt;strong&gt;observed AI&lt;/strong&gt; to the algorithm's &lt;strong&gt;algorithmic AI&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;observed AI    = FLOPs / actual HBM bytes moved
algorithmic AI = FLOPs / minimum bytes required by the algorithm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If observed AI is far to the left of algorithmic AI, the implementation is moving extra bytes. The FLOPs may be the same, but the denominator is larger than it should be.&lt;/p&gt;

&lt;p&gt;Common reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uncoalesced or scattered global loads that over-fetch memory sectors.&lt;/li&gt;
&lt;li&gt;Cache thrashing: data is evicted before reuse and loaded again.&lt;/li&gt;
&lt;li&gt;Redundant loads across thread blocks.&lt;/li&gt;
&lt;li&gt;Register spills that create local-memory traffic.&lt;/li&gt;
&lt;li&gt;Unfused kernels that write intermediates to HBM and read them back later.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not about whether HBM bandwidth is saturated. A kernel can sit exactly on the bandwidth roof and still have a large horizontal gap. That means it is moving too many bytes, but moving them efficiently.&lt;/p&gt;




&lt;h2&gt;
  
  
  Optimization as a 2D Walk
&lt;/h2&gt;

&lt;p&gt;Once you have the two-gap picture, you can think of optimization as walking the dot on the chart. Every optimization moves the dot in a specific direction.&lt;/p&gt;

&lt;h3&gt;
  
  
  Right (raise observed AI)
&lt;/h3&gt;

&lt;p&gt;These optimizations change &lt;em&gt;which bytes&lt;/em&gt; (or &lt;em&gt;how many bytes&lt;/em&gt;) cross the memory boundary you are measuring. Same FLOPs, fewer bytes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tiling and blocking for shared memory and registers.&lt;/li&gt;
&lt;li&gt;Larger thread-block tiles so each loaded byte is reused more times before being evicted.&lt;/li&gt;
&lt;li&gt;Kernel fusion — eliminate HBM round trips for intermediates.&lt;/li&gt;
&lt;li&gt;Multicast loads and cache-residency hints.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The reference target for "right" is the algorithmic AI. Under consistent accounting, it is the upper bound on how much HBM traffic reduction can improve observed AI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Up (raise achieved FLOP/s at current AI)
&lt;/h3&gt;

&lt;p&gt;These optimizations change &lt;em&gt;how fast&lt;/em&gt; the bytes already in flight are processed. Same bytes, higher throughput.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increase ILP — issue more independent MMAs before any synchronizing instruction.&lt;/li&gt;
&lt;li&gt;Software pipelining and double- or multi-buffering to overlap data movement with compute.&lt;/li&gt;
&lt;li&gt;Eliminate shared-memory bank conflicts using the right swizzled layouts.&lt;/li&gt;
&lt;li&gt;Coalesce global memory accesses (this also shifts the dot right, so it's a both-axis optimization).&lt;/li&gt;
&lt;li&gt;Tune occupancy: raise it when the kernel is latency-bound, lower it when fewer threads with more registers each improve ILP. Both cases occur in practice.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The target for "up" is the applicable roof — bandwidth roof if you're memory-bandwidth-limited, compute roof if you're compute-limited.&lt;/p&gt;

&lt;h3&gt;
  
  
  Up-and-right (combined)
&lt;/h3&gt;

&lt;p&gt;This is the typical trajectory of a real rewrite: naive triple-loop GEMM → tiled GEMM crosses the ridge point and continues climbing toward the compute roof. The dot may zig-zag a bit as you fix one bottleneck and uncover the next.&lt;/p&gt;

&lt;p&gt;There is one important constraint to internalize: &lt;strong&gt;once the dot is well into the compute-limited region, further increases in AI are usually no longer the main lever.&lt;/strong&gt; In the simple Roofline model, the horizontal compute roof is now the ceiling. At that point, the important question is how close the kernel gets to peak compute throughput.&lt;/p&gt;

&lt;p&gt;This is why people working on GEMM kernels at any reasonable size obsess over the vertical gap: their algorithmic AI is so far past the ridge that additional tiling buys little, and the remaining question is how close to peak compute they can get.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Optimization Mental Model
&lt;/h2&gt;

&lt;p&gt;Before optimizing, ask what the change is supposed to improve.&lt;/p&gt;

&lt;p&gt;If the optimization reduces the number of HBM bytes needed for the same FLOPs, it increases observed AI. The point moves &lt;strong&gt;right&lt;/strong&gt;. These are reuse and traffic-reduction optimizations.&lt;/p&gt;

&lt;p&gt;If the optimization keeps the same observed AI but makes the kernel run faster, it increases achieved FLOP/s. The point moves &lt;strong&gt;up&lt;/strong&gt;. These are utilization and pipelining optimizations.&lt;/p&gt;

&lt;p&gt;Some optimizations do both: they reduce traffic and improve throughput. But the distinction is still useful, because it tells you what movement you should expect on the chart.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Optimization&lt;/th&gt;
&lt;th&gt;Primary axis&lt;/th&gt;
&lt;th&gt;What changes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Coalescing&lt;/td&gt;
&lt;td&gt;X (right)&lt;/td&gt;
&lt;td&gt;Fewer over-fetched sector bytes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Async copy / pipelining&lt;/td&gt;
&lt;td&gt;Y (up, memory-bandwidth-limited side)&lt;/td&gt;
&lt;td&gt;Memory latency is hidden and bandwidth utilization improves.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SMEM swizzling / bank-conflict fixes&lt;/td&gt;
&lt;td&gt;Y (up, compute-limited side)&lt;/td&gt;
&lt;td&gt;Same HBM bytes, math pipe stalls less.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tiling / blocking&lt;/td&gt;
&lt;td&gt;X (right)&lt;/td&gt;
&lt;td&gt;Same algorithm, fewer HBM round trips per FLOP.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kernel fusion&lt;/td&gt;
&lt;td&gt;X (right)&lt;/td&gt;
&lt;td&gt;Eliminates HBM round trips for intermediates.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multicast loads&lt;/td&gt;
&lt;td&gt;X (right)&lt;/td&gt;
&lt;td&gt;One thread block's load serves many; eliminates redundant cross-block traffic.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  GEMM as a Roofline Example
&lt;/h2&gt;

&lt;p&gt;Now tie the pieces together with GEMM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;C = A × B
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Assume a square GEMM where &lt;code&gt;M = N = K&lt;/code&gt;, with FP16 inputs and FP16 output. The algorithmic work is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FLOPs = 2 * M^3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For this simplified &lt;code&gt;C = A × B&lt;/code&gt; case, with no read of an old &lt;code&gt;C&lt;/code&gt; value, the compulsory HBM traffic is one read of &lt;code&gt;A&lt;/code&gt;, one read of &lt;code&gt;B&lt;/code&gt;, and one write of &lt;code&gt;C&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Compulsory bytes = 6 * M^2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So the algorithmic AI is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;algorithmic AI = FLOPs / compulsory bytes = M / 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the rightward target. It says that as the matrix grows, GEMM can theoretically do more and more FLOPs for each byte moved from HBM. A large GEMM should therefore live far to the right on the Roofline chart, usually in the compute-limited region.&lt;/p&gt;
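&lt;p&gt;The counting above can be sketched directly:&lt;/p&gt;

```python
def gemm_algorithmic_ai(m, bytes_per_element=2):
    """Algorithmic AI of square FP16 GEMM C = A x B (no read of an old C)."""
    flops = 2 * m**3                           # one multiply-add = 2 FLOPs per (i, j, k)
    compulsory = 3 * m**2 * bytes_per_element  # read A, read B, write C once each
    return flops / compulsory

assert gemm_algorithmic_ai(3) == 1.0          # M / 3 with M = 3
assert gemm_algorithmic_ai(4096) == 4096 / 3  # grows linearly with M
```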

&lt;p&gt;But that only describes the algorithm. The implementation still has to earn that AI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Naive GEMM
&lt;/h3&gt;

&lt;p&gt;A naive implementation might reload the same elements of &lt;code&gt;A&lt;/code&gt; and &lt;code&gt;B&lt;/code&gt; many times from HBM. The FLOP count is still the GEMM FLOP count, but the HBM byte count is much larger than the compulsory byte count.&lt;/p&gt;

&lt;p&gt;That means:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;observed AI &amp;lt;&amp;lt; algorithmic AI
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the chart, the point moves far left. If it is left of the ridge, the kernel is in the bandwidth-limited regime. If it is also below the diagonal roof, then it has both problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a &lt;strong&gt;horizontal gap&lt;/strong&gt;, because it moves too many HBM bytes&lt;/li&gt;
&lt;li&gt;a &lt;strong&gt;vertical memory-side gap&lt;/strong&gt;, because it is not using peak HBM bandwidth efficiently&lt;/li&gt;
&lt;/ul&gt;
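&lt;p&gt;A cache-less model of that worst case shows how far left the point can fall. The traffic model is a deliberate simplification: it assumes every operand of every multiply-add is re-read from HBM:&lt;/p&gt;

```python
def naive_gemm_observed_ai(m, bytes_per_element=2):
    """Observed AI of a cache-less naive FP16 GEMM."""
    flops = 2 * m**3
    loads = 2 * m**3 * bytes_per_element  # A and B re-read on every inner iteration
    stores = m**2 * bytes_per_element     # C written once
    return flops / (loads + stores)

ai = naive_gemm_observed_ai(4096)
assert 0.49 < ai < 0.5  # pinned near 0.5 FLOPs/byte, independent of M
```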

&lt;h3&gt;
  
  
  Tiled GEMM
&lt;/h3&gt;

&lt;p&gt;Tiling attacks the horizontal gap. Instead of loading an element from HBM every time it is used, the kernel loads a tile once and reuses it many times from faster on-chip storage.&lt;/p&gt;

&lt;p&gt;The FLOPs are the same, but HBM bytes go down:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;observed AI increases
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the chart, the point moves right toward the algorithmic AI. If the point crosses the ridge, the applicable ceiling changes: HBM bandwidth is no longer the lower roof, and the kernel moves into the compute-limited region.&lt;/p&gt;
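&lt;p&gt;A standard blocking estimate shows how the tile size sets the new observed AI. The traffic model below is approximate: with square output tiles, each of &lt;code&gt;A&lt;/code&gt; and &lt;code&gt;B&lt;/code&gt; is streamed from HBM roughly &lt;code&gt;M / tile&lt;/code&gt; times in full:&lt;/p&gt;

```python
def tiled_gemm_observed_ai(m, tile, bytes_per_element=2):
    """Approximate observed AI of an FP16 GEMM with tile x tile output blocks."""
    flops = 2 * m**3
    loads = 2 * (m**3 // tile) * bytes_per_element  # A and B traffic under blocking
    stores = m**2 * bytes_per_element               # C written once
    return flops / (loads + stores)

ai = tiled_gemm_observed_ai(4096, tile=128)
assert 60 < ai < 64  # approaches tile / 2 = 64 FLOPs/byte for large M
```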

&lt;h3&gt;
  
  
  Well-pipelined GEMM
&lt;/h3&gt;

&lt;p&gt;Once the kernel is in the compute-limited region, moving further right is not enough. The applicable ceiling is now the horizontal compute roof. The remaining problem is the vertical gap between the point and that roof.&lt;/p&gt;

&lt;p&gt;Now the question becomes: are the SMs kept busy?&lt;/p&gt;

&lt;p&gt;Optimizations in this phase do not primarily reduce HBM bytes. They improve achieved FLOP/s at roughly the same observed AI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;overlap HBM loads with computation&lt;/li&gt;
&lt;li&gt;keep enough independent math work in flight&lt;/li&gt;
&lt;li&gt;avoid shared-memory bank conflicts&lt;/li&gt;
&lt;li&gt;avoid long dependency chains&lt;/li&gt;
&lt;li&gt;reduce synchronization stalls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On the chart, these changes move the point up toward the compute roof.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Roofline Reading
&lt;/h3&gt;

&lt;p&gt;GEMM is useful because it shows both directions clearly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bad reuse moves the point left of algorithmic AI.&lt;/li&gt;
&lt;li&gt;Better tiling moves the point right.&lt;/li&gt;
&lt;li&gt;Poor scheduling or operand delivery leaves the point vertically below the applicable roof.&lt;/li&gt;
&lt;li&gt;Better pipelining and utilization move the point up.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The ideal GEMM implementation is therefore not just "high AI" and not just "high FLOP/s." It is both: observed AI close to algorithmic AI, and achieved performance close to the applicable Roofline ceiling.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;Roofline is useful because it turns performance tuning into a sequence of concrete questions.&lt;/p&gt;

&lt;p&gt;Measure the kernel's FLOPs, HBM bytes, and time. Compute observed AI and achieved FLOP/s. Place the point on the chart. Then ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is observed AI left or right of the ridge?&lt;/li&gt;
&lt;li&gt;Is the point below the applicable roof?&lt;/li&gt;
&lt;li&gt;Is observed AI far left of algorithmic AI?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those answers tell you the next direction. Move right by reducing HBM traffic. Move up by improving utilization of the current limiting resource. If the point is already near the roof and near algorithmic AI, the kernel is close to what this model says the hardware can do.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>computerscience</category>
      <category>performance</category>
      <category>programming</category>
    </item>
    <item>
      <title>Programming Hopper GPUs: The Memory Consistency Model</title>
      <dc:creator>Shah Fahad</dc:creator>
      <pubDate>Sat, 25 Apr 2026 14:05:13 +0000</pubDate>
      <link>https://dev.to/sfahad/programming-hopper-gpus-the-memory-consistency-model-24m7</link>
      <guid>https://dev.to/sfahad/programming-hopper-gpus-the-memory-consistency-model-24m7</guid>
      <description>&lt;p&gt;You've decided to write fast code for an NVIDIA Hopper GPU. Maybe you want to build a custom attention kernel. Maybe you're trying to understand how CUTLASS and ThunderKittens work under the hood. Either way, before you can use any of the cool Hopper hardware — TMA, wgmma, mbarriers, clusters — you need to understand one thing: &lt;strong&gt;how memory works when thousands of threads share it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's what the &lt;em&gt;memory consistency model&lt;/em&gt; describes. It's the rulebook for what one thread can or cannot see about another thread's writes. Without it, the rest of the stack is undefined behavior waiting to happen.&lt;/p&gt;

&lt;p&gt;This article covers the minimum you need to write correct multi-threaded GPU code. We'll build it from one concrete bug, then introduce the two primitives that fix it.&lt;/p&gt;

&lt;p&gt;Code examples are schematic unless noted. Full PTX spells out scope, state space, and type — for example &lt;code&gt;fence.release.gpu&lt;/code&gt; or &lt;code&gt;st.release.gpu.global.u32&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The bug we're trying to prevent
&lt;/h2&gt;

&lt;p&gt;Imagine the simplest possible producer-consumer pattern. One thread fills a buffer, then sets a flag to say "I'm done." Another thread waits for the flag, then reads the buffer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Producer (thread T1):              Consumer (thread T2):

  data = 42;       // (1)            while (flag == 0) {}    // (3) wait for flag
  flag = 1;        // (2)            x = data;               // (4) expects 42
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looks fine, right? It's how you'd write it in any single-threaded language. But on a GPU (or any modern CPU, for that matter), this code can produce &lt;code&gt;x == 0&lt;/code&gt; instead of &lt;code&gt;x == 42&lt;/code&gt;. Here's why.&lt;/p&gt;

&lt;p&gt;From the producer's own perspective, &lt;code&gt;(1)&lt;/code&gt; and &lt;code&gt;(2)&lt;/code&gt; are two completely unrelated stores to two different memory locations. There's nothing in T1's own code that depends on the order in which they hit memory. So the &lt;strong&gt;compiler is free to reorder them&lt;/strong&gt;, and the &lt;strong&gt;hardware is free to commit them out of order&lt;/strong&gt; — neither change is visible from inside T1.&lt;/p&gt;

&lt;p&gt;But the consumer T2 &lt;em&gt;can&lt;/em&gt; see the reorder. If &lt;code&gt;flag = 1&lt;/code&gt; arrives in memory &lt;em&gt;before&lt;/em&gt; &lt;code&gt;data = 42&lt;/code&gt; does, then T2 might see &lt;code&gt;flag == 1&lt;/code&gt;, exit its wait loop, and read &lt;code&gt;data&lt;/code&gt; while it's still &lt;code&gt;0&lt;/code&gt;. The consumer's assumption — "if the flag is set, the data is ready" — silently breaks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;What T1 thinks happens:        What T2 might actually see:

  data = 42                      flag = 1     ← becomes visible first
  flag = 1                       data = 42    ← arrives later

  T2 reads:                      T2 reads:
    flag → 1 ✓                     flag → 1 ✓ (exits wait loop)
    data → 42 ✓                    data → 0 ✗ (reads stale value!)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;This is the central problem of shared-memory programming.&lt;/strong&gt; A program that looks logically correct can fail because the system reorders operations that are independent from the issuing thread's perspective but not independent from another thread's perspective.&lt;/p&gt;

&lt;p&gt;The memory consistency model gives us tools to prevent this. The two main tools are called the &lt;strong&gt;release fence&lt;/strong&gt; and the &lt;strong&gt;acquire fence&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The big idea: a pair of fences
&lt;/h2&gt;

&lt;p&gt;Think of fences as a contract between the producer and the consumer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Producer side:&lt;/strong&gt; "I promise to &lt;em&gt;publish&lt;/em&gt; all my prior work before I tell anyone I'm done."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consumer side:&lt;/strong&gt; "I promise to wait for the announcement, then &lt;em&gt;read fresh data&lt;/em&gt; — not anything I might have cached earlier."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The producer holds up its end of the contract using a &lt;strong&gt;release fence&lt;/strong&gt;. The consumer holds up its end using an &lt;strong&gt;acquire fence&lt;/strong&gt;. When both sides cooperate, the bug above goes away.&lt;/p&gt;

&lt;p&gt;Let's see exactly what each fence does, then we'll put them together to fix the bug.&lt;/p&gt;




&lt;h2&gt;
  
  
  The release fence
&lt;/h2&gt;

&lt;p&gt;A release fence sits between two pieces of producer code. Conceptually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  data = 42;          // some prior work

  fence.release;      // "publish everything above me before anything below me"

  flag = 1;           // the announcement
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things to understand:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The compiler cannot move prior memory operations past the fence.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without the fence, a compiler optimization might reorder &lt;code&gt;data = 42&lt;/code&gt; to come &lt;em&gt;after&lt;/em&gt; &lt;code&gt;flag = 1&lt;/code&gt;. With &lt;code&gt;fence.release&lt;/code&gt; in between, that's forbidden. The fence is a one-way wall: stuff above stays above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Without the fence:                With fence.release:

  data = 42                         data = 42
  flag = 1                          fence.release  ← prior writes pinned above
                                    flag = 1
  ⇒ compiler/hardware may swap     ⇒ data = 42 stays before flag = 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The fence is one-way, not two-way.&lt;/strong&gt; This is a subtle but important point. The release fence only blocks downward motion — operations &lt;em&gt;above&lt;/em&gt; it can't sink below. Operations &lt;em&gt;below&lt;/em&gt; the fence are technically allowed to move above it; the fence does not pin them in place. This is by design: the release only promises "everything above me is published," so it has nothing to say about what comes after.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   memory op above                    ← cannot sink below the fence
   memory op above                    ← cannot sink below the fence
   ─── fence.release ───
   memory op below                    ← in general, CAN move above the fence
   memory op below                    ← in general, CAN move above the fence
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This sounds dangerous — wouldn't the compiler also be free to move the &lt;code&gt;flag = 1&lt;/code&gt; write above the fence? In source PTX, the answer is no for the store that forms the release pattern: NVIDIA defines the pattern by program order, so moving that store would change the program's synchronization behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The fence makes the prior writes "publishable."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the more subtle part. After the fence, when the producer writes &lt;code&gt;flag = 1&lt;/code&gt;, that flag write effectively &lt;em&gt;carries with it&lt;/em&gt; the promise that everything before the fence is also visible. It's like attaching a receipt to the announcement: "by the way, my data is ready too."&lt;/p&gt;

&lt;p&gt;This is called a &lt;strong&gt;release pattern&lt;/strong&gt;: a &lt;code&gt;fence.release&lt;/code&gt; followed by a write to a flag. The pair together is what publishes the producer's prior work to anyone watching the flag.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern exception — the paired flag write is pinned.&lt;/strong&gt; The "later operations can move above the fence" rule has one exception: the &lt;em&gt;specific strong write&lt;/em&gt; that pairs with the fence to form the release pattern. NVIDIA's model defines &lt;code&gt;fence.release; st.relaxed [flag], 1&lt;/code&gt; as a release pattern because those instructions occur in that program order. A compiler or assembler cannot preserve the same PTX semantics while hoisting that store above the fence, because doing so would dismantle the release pattern entirely.&lt;/p&gt;

&lt;p&gt;So in practice, the rules around a release fence are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Anything before the fence stays before it.&lt;/li&gt;
&lt;li&gt;Unrelated memory operations after the fence are not protected by the release semantics in the same way.&lt;/li&gt;
&lt;li&gt;The specific strong flag write that pairs with the fence must stay after the fence — it's the whole point of the pattern.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;One important nuance:&lt;/strong&gt; the fence does NOT eagerly broadcast anything to other threads. It just guarantees an &lt;em&gt;ordering&lt;/em&gt; — that the data is ready in memory before the flag is. Other threads will observe the change at their own pace; the release fence doesn't push anything to them. This is why we'll need a spin loop on the consumer side.&lt;/p&gt;




&lt;h2&gt;
  
  
  The acquire fence
&lt;/h2&gt;

&lt;p&gt;The acquire fence is the mirror image. It sits between two pieces of consumer code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  flag_value = flag;   // read the announcement

  fence.acquire;       // "anything below me sees a fresh view of memory"

  x = data;            // read the data, guaranteed up-to-date
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things to understand:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The compiler cannot move later memory operations before the fence.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without the fence, a compiler might prefetch &lt;code&gt;data&lt;/code&gt; into a register before the flag check (a common optimization). With &lt;code&gt;fence.acquire&lt;/code&gt;, that's forbidden. The fence is again a one-way wall, but in the opposite direction this time: stuff below stays below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Without the fence:                With fence.acquire:

  flag_value = flag                 flag_value = flag
  x = data         ← may be         fence.acquire  ← later reads pinned below
                     prefetched     x = data
                     above flag!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The fence is one-way, in the opposite direction from release.&lt;/strong&gt; Symmetrically with &lt;code&gt;fence.release&lt;/code&gt;, the acquire fence only blocks upward motion — operations &lt;em&gt;below&lt;/em&gt; it can't hoist above. Operations &lt;em&gt;above&lt;/em&gt; the fence are technically allowed to move below it; the fence doesn't pin them in place. This is by design: the acquire only promises "everything below me sees a fresh view," so it has nothing to say about what came before.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   memory op above                    ← in general, CAN move below the fence
   memory op above                    ← in general, CAN move below the fence
   ─── fence.acquire ───
   memory op below                    ← cannot hoist above the fence
   memory op below                    ← cannot hoist above the fence
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Just like for release, this raises a question: couldn't the compiler also move the &lt;code&gt;flag_value = flag&lt;/code&gt; read &lt;em&gt;below&lt;/em&gt; the fence? Not for the read that forms the acquire pattern: NVIDIA defines the pattern by program order, so moving that read would change the synchronization behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The fence makes subsequent reads "fresh."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After the fence, later reads are ordered after any matching release that was observed through the flag. In other words, the consumer cannot use an old value of &lt;code&gt;data&lt;/code&gt; from before the producer's published write; it must see the released write or a later write in that location's coherence order.&lt;/p&gt;

&lt;p&gt;This is called an &lt;strong&gt;acquire pattern&lt;/strong&gt;: a read of a flag, followed by a &lt;code&gt;fence.acquire&lt;/code&gt;. The pair together is what makes the consumer pick up the producer's published data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern exception — the paired flag read is pinned.&lt;/strong&gt; The "earlier operations can move below the fence" rule has one exception: the &lt;em&gt;specific strong read&lt;/em&gt; that pairs with the fence to form the acquire pattern. NVIDIA's model defines &lt;code&gt;ld.relaxed [flag]; fence.acquire&lt;/code&gt; as an acquire pattern because those instructions occur in that program order. Moving the flag read below the fence would dismantle the pattern.&lt;/p&gt;

&lt;p&gt;So in practice, the rules around an acquire fence are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Anything after the fence stays after it.&lt;/li&gt;
&lt;li&gt;Unrelated memory operations before the fence are not protected by the acquire semantics in the same way.&lt;/li&gt;
&lt;li&gt;The specific strong flag read that pairs with the fence must stay before the fence — it's the whole point of the pattern.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Putting them together: the bug, fixed
&lt;/h2&gt;

&lt;p&gt;Now let's go back to our buggy code and fix it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Producer (thread T1):                 Consumer (thread T2):

  data = 42;          // (1)            while (1) {
  fence.release;      // (F_R)            flag_value = flag;     // (3)
  flag = 1;           // (2)              if (flag_value == 1) break;
                                        }
                                        fence.acquire;           // (F_A)
                                        x = data;                // (4)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Walk through it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;T1 writes &lt;code&gt;data = 42&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;T1 hits &lt;code&gt;fence.release&lt;/code&gt;. This guarantees &lt;code&gt;data = 42&lt;/code&gt; is committed to memory before anything that follows.&lt;/li&gt;
&lt;li&gt;T1 writes &lt;code&gt;flag = 1&lt;/code&gt;. This is the announcement; it now "carries" the visibility of &lt;code&gt;data = 42&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;T2 spins on &lt;code&gt;flag&lt;/code&gt; until it reads &lt;code&gt;1&lt;/code&gt;. (We need to spin because the flag's new value takes some real time to propagate to T2 — fences don't push, they just promise ordering.)&lt;/li&gt;
&lt;li&gt;T2 hits &lt;code&gt;fence.acquire&lt;/code&gt;. This guarantees subsequent reads see fresh data — no stale cached values can satisfy them.&lt;/li&gt;
&lt;li&gt;T2 reads &lt;code&gt;data&lt;/code&gt;. Because the producer's release published &lt;code&gt;data = 42&lt;/code&gt; before publishing the flag, and T2's acquire ensures fresh reads after seeing the flag, T2 is guaranteed to see &lt;code&gt;42&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The key insight:&lt;/strong&gt; the bug only got fixed because BOTH sides cooperated. The release alone wouldn't help — the consumer would still cache stale data. The acquire alone wouldn't help — the producer's writes might still arrive out of order. Memory ordering is always a &lt;em&gt;contract&lt;/em&gt; between producer and consumer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   Producer side:                       Consumer side:

   data = 42                            spin: flag_value = flag
       │                                              │
       ▼                                              ▼
   fence.release        ─ publishes ─►   fence.acquire   ◄─ acquires
       │                  data + flag    │                  fresh view
       ▼                  together       ▼
   flag = 1                              x = data → 42 ✓
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Shorter forms: baked-in release and acquire
&lt;/h2&gt;

&lt;p&gt;In the example above, the producer wrote two separate instructions for the publish step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fence.release;
st.relaxed [flag], 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PTX provides a shorter form that bakes the release semantics directly into the store. Instead of the two-instruction pair, you can write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = 42;
st.release [flag], 1;     // store with release semantics built in
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For this adjacent publish-store pattern, it gives the same release/acquire synchronization guarantee as &lt;code&gt;fence.release; st.relaxed [flag], 1&lt;/code&gt;. The release behavior is fused into the store, so the release pattern is inherent in one instruction.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important: it has to be &lt;code&gt;st.release&lt;/code&gt;, not &lt;code&gt;st.relaxed&lt;/code&gt;.&lt;/strong&gt; A bare &lt;code&gt;st.relaxed [flag], 1&lt;/code&gt; is &lt;em&gt;not&lt;/em&gt; a release pattern — it's a strong store with no release ordering effect on prior writes. The release pattern requires either &lt;code&gt;st.release&lt;/code&gt; (release baked in) &lt;strong&gt;or&lt;/strong&gt; the explicit pair &lt;code&gt;fence.release; st.relaxed [flag], 1&lt;/code&gt;. Don't drop the &lt;code&gt;fence.release&lt;/code&gt; and assume the &lt;code&gt;.relaxed&lt;/code&gt; qualifier carries any release meaning — it doesn't.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The consumer side has the matching shortcut:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flag_value = ld.acquire [flag];   // load with acquire semantics built in
x = data;                         // sees fresh data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;ld.acquire&lt;/code&gt; is shorthand for "do the read, and include acquire semantics in that same operation." Same synchronization guarantee, one instruction.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Same warning, mirrored.&lt;/strong&gt; A bare &lt;code&gt;ld.relaxed [flag]&lt;/code&gt; is &lt;em&gt;not&lt;/em&gt; an acquire pattern — it's a strong load with no acquire ordering effect on later reads. The acquire pattern requires either &lt;code&gt;ld.acquire&lt;/code&gt; (acquire baked in) &lt;strong&gt;or&lt;/strong&gt; the explicit pair &lt;code&gt;ld.relaxed [flag]; fence.acquire&lt;/code&gt;. Don't drop the &lt;code&gt;fence.acquire&lt;/code&gt; and assume the &lt;code&gt;.relaxed&lt;/code&gt; qualifier carries any acquire meaning.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The fully-baked-in producer/consumer becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Producer:                          Consumer:

  data = 42;                         while (1) {
  st.release [flag], 1;                flag_value = ld.acquire [flag];
                                       if (flag_value == 1) break;
                                     }
                                     x = data;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same contract, less typing. &lt;strong&gt;For most producer-consumer patterns, this is the form you want.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When would you reach for the explicit &lt;code&gt;fence.release&lt;/code&gt; / &lt;code&gt;fence.acquire&lt;/code&gt; form instead? A few cases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;You want one fence to publish multiple flags.&lt;/strong&gt; A single &lt;code&gt;fence.release&lt;/code&gt; followed by several flag writes gives all of them release semantics — no need to repeat the fence per flag.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You want a cheap relaxed read inside a spin loop, then "commit" to acquiring only when you see the right value.&lt;/strong&gt; Repeating &lt;code&gt;ld.acquire&lt;/code&gt; on every spin iteration can be more expensive than a &lt;code&gt;ld.relaxed&lt;/code&gt; loop followed by one &lt;code&gt;fence.acquire&lt;/code&gt; after exit. We'll see this pattern again with &lt;code&gt;mbarrier&lt;/code&gt; in a later article.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You need the combined fence variant&lt;/strong&gt; discussed next.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  When you need both: &lt;code&gt;fence.acq_rel&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;So far we've kept the producer and consumer roles cleanly separated — release on the producer side, acquire on the consumer side. Each side has a single direction to worry about.&lt;/p&gt;

&lt;p&gt;But sometimes a single point in your code needs to do both jobs at once:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Refresh&lt;/strong&gt; its view to see what someone else published (acquire side).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Publish&lt;/strong&gt; its own writes for someone else to see (release side).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's exactly what &lt;code&gt;fence.acq_rel&lt;/code&gt; is for. It's a &lt;code&gt;fence.release&lt;/code&gt; and &lt;code&gt;fence.acquire&lt;/code&gt; rolled into one — both effects fused at the same point.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fence.acq_rel;     // both: prior writes published AND subsequent reads refreshed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both directions of the "one-way wall" apply at once: prior memory ops cannot sink below it (release side), and later memory ops cannot hoist above it (acquire side).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   memory op above                    ← cannot sink below the fence (release side)
   memory op above                    ← cannot sink below the fence
   ─── fence.acq_rel ───
   memory op below                    ← cannot hoist above the fence (acquire side)
   memory op below                    ← cannot hoist above the fence
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Where this actually shows up: atomic operations
&lt;/h3&gt;

&lt;p&gt;The clearest case for &lt;code&gt;acq_rel&lt;/code&gt; semantics is &lt;strong&gt;atomic operations that read AND write at the same time&lt;/strong&gt; — like &lt;code&gt;atom.cas&lt;/code&gt; (compare-and-swap), &lt;code&gt;atom.exch&lt;/code&gt; (exchange), &lt;code&gt;atom.add&lt;/code&gt; (fetch-and-add). These instructions can be simultaneously consumer-like (they read an old value) and producer-like (they write a new value).&lt;/p&gt;

&lt;p&gt;Take a shared work queue as an example. Producers fill slots in a queue, then publish progress by advancing a shared index. Consumers atomically claim index values and then read the corresponding slots. In a queue like that, the index is not just a counter — it's also the handoff point between "someone published work" and "someone else is allowed to consume it."&lt;/p&gt;

&lt;p&gt;Now imagine one thread advances the index with an atomic fetch-and-add:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Atomic fetch-and-add: returns the old value
old = atom.add.acq_rel [queue_index], 1;
slot = old;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why might this atomic want &lt;strong&gt;acq_rel&lt;/strong&gt; semantics? Because the same operation may be doing both jobs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Acquire half (the read).&lt;/strong&gt; The old counter value may point to a slot whose contents were written by another thread before that thread published the index. Before this thread reads &lt;code&gt;queue[slot]&lt;/code&gt;, it needs to acquire those slot writes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Release half (the write).&lt;/strong&gt; The new counter value may become the value that a later thread observes before reading work or metadata this thread prepared. If this thread did setup before advancing the index, the write-half of the atomic can publish that setup.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the kind of situation where &lt;code&gt;atom.add.acq_rel&lt;/code&gt; makes sense: the same read-modify-write is consuming someone else's publication and publishing this thread's own update. PTX bakes both effects into the atomic with the &lt;code&gt;.acq_rel&lt;/code&gt; qualifier, for example &lt;code&gt;atom.acq_rel.gpu.global.add.u32&lt;/code&gt; in full PTX syntax.&lt;/p&gt;

&lt;p&gt;For comparison, an ordinary lock usually does &lt;strong&gt;not&lt;/strong&gt; need &lt;code&gt;acq_rel&lt;/code&gt; on lock acquire. Taking a lock is normally just an acquire operation: it consumes the previous holder's release. Releasing the lock is normally just a release store:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;st.release [lock], 0;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So the common lock pattern is acquire on lock acquisition and release on unlock. Reach for &lt;code&gt;acq_rel&lt;/code&gt; when a single atomic really does both jobs for your algorithm.&lt;/p&gt;

&lt;h3&gt;
  
  
  Standalone &lt;code&gt;fence.acq_rel&lt;/code&gt; is rarer
&lt;/h3&gt;

&lt;p&gt;A standalone &lt;code&gt;fence.acq_rel&lt;/code&gt; (not attached to an atomic) shows up less often, but it's there if you have a non-atomic point in your code that needs both effects. Most of the time it appears inside higher-level synchronization primitives (atomics, mbarriers — which we'll see in later articles) rather than as a programmer-written explicit fence.&lt;/p&gt;

&lt;p&gt;One bit of PTX trivia: &lt;code&gt;fence.acq_rel&lt;/code&gt; is the &lt;strong&gt;default&lt;/strong&gt; when you write a plain &lt;code&gt;fence&lt;/code&gt; without a &lt;code&gt;.sem&lt;/code&gt; qualifier. So &lt;code&gt;fence.gpu&lt;/code&gt; is shorthand for &lt;code&gt;fence.acq_rel.gpu&lt;/code&gt;. This is partly why &lt;code&gt;acq_rel&lt;/code&gt; shows up so often in PTX disassembly.&lt;/p&gt;

&lt;p&gt;For typical one-direction producer-consumer patterns, &lt;strong&gt;stick with the matched &lt;code&gt;release&lt;/code&gt; + &lt;code&gt;acquire&lt;/code&gt; pair&lt;/strong&gt; we built up in the earlier sections — it's simpler and avoids asking for stronger ordering than you need. Reach for &lt;code&gt;acq_rel&lt;/code&gt; when one fence point (or one atomic operation) really does need to do both jobs at once, like the work-queue atomic above.&lt;/p&gt;




&lt;h2&gt;
  
  
  Things to remember
&lt;/h2&gt;

&lt;p&gt;A handful of practical points worth keeping in your head as you write code.&lt;/p&gt;

&lt;h3&gt;
  
  
  The producer's flag write must be a "real" memory write
&lt;/h3&gt;

&lt;p&gt;In PTX, this means using something like &lt;code&gt;st.relaxed [flag], 1&lt;/code&gt; rather than the default &lt;code&gt;st [flag], 1&lt;/code&gt;. The default is what's called a "weak" write — the memory model gives it no cross-thread guarantees, and the release pattern won't form correctly with it. For any flag another thread will read, use a strong store such as &lt;code&gt;st.relaxed&lt;/code&gt; or &lt;code&gt;st.release&lt;/code&gt; (the baked-in form discussed above). (&lt;code&gt;.acquire&lt;/code&gt; and &lt;code&gt;.acq_rel&lt;/code&gt; do not apply to ordinary stores.)&lt;/p&gt;

&lt;p&gt;The same applies to the consumer's flag read: use &lt;code&gt;ld.relaxed [flag]&lt;/code&gt; or &lt;code&gt;ld.acquire [flag]&lt;/code&gt;, not plain &lt;code&gt;ld [flag]&lt;/code&gt;. (&lt;code&gt;.release&lt;/code&gt; only applies to stores, not loads.)&lt;/p&gt;

&lt;h3&gt;
  
  
  Spin loops are normal — fences don't push data
&lt;/h3&gt;

&lt;p&gt;A common confusion is "I issued the release, why doesn't the consumer see it immediately?" The release fence orders your writes; it doesn't shove them down other threads' throats. The consumer's &lt;code&gt;flag&lt;/code&gt; read may still return &lt;code&gt;0&lt;/code&gt; for a while after the producer wrote &lt;code&gt;1&lt;/code&gt;, because the new value has to propagate through the memory system.&lt;/p&gt;

&lt;p&gt;That's why the consumer is in a &lt;code&gt;while (flag == 0)&lt;/code&gt; loop — it's bridging the propagation gap. This is the standard, correct idiom.&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory ordering is about visibility, not synchronization
&lt;/h3&gt;

&lt;p&gt;A second and related confusion: people see "fence" and assume it makes threads pause or wait for each other. It doesn't. Fences and release/acquire qualifiers are entirely about &lt;strong&gt;what one thread sees in memory when it reads&lt;/strong&gt; — not about pausing or aligning threads in time.&lt;/p&gt;

&lt;p&gt;Two different concerns, two different toolboxes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Visibility&lt;/strong&gt; (this article): one thread's writes become observable to another thread's reads, in a well-defined order. Fences and release/acquire qualifiers handle this. A fence may wait until the calling thread's relevant memory operations have reached the point of coherence for its scope, but it &lt;strong&gt;doesn't wait for another thread&lt;/strong&gt; to arrive, acknowledge, or read anything.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synchronization&lt;/strong&gt; (a separate concern): threads actually pause and align in time — e.g., "no thread proceeds past this point until all threads have arrived." That's the job of barriers like &lt;code&gt;bar.sync&lt;/code&gt;, &lt;code&gt;barrier.cluster&lt;/code&gt;, and the &lt;code&gt;mbarrier&lt;/code&gt; object. We'll cover those in the next article.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The spin loop in our consumer code is what bridges the two when you want them together: the consumer wants to wait for the producer, so it busy-loops on the flag. That waiting is the programmer's choice (the &lt;code&gt;while&lt;/code&gt; loop), not something the fence is doing for them.&lt;/p&gt;

&lt;p&gt;Keeping these two ideas separate is one of the most useful mental moves you can make in concurrent GPU programming. The memory model is purely about visibility; for actual "wait until X" behavior, reach for the synchronization primitives in the next article.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fences have a "scope"
&lt;/h3&gt;

&lt;p&gt;In PTX you'll write things like &lt;code&gt;fence.release.gpu&lt;/code&gt; or &lt;code&gt;fence.release.cta&lt;/code&gt;. The scope says which threads can directly participate in the ordering guarantee.&lt;/p&gt;

&lt;p&gt;To understand why scope matters, picture the GPU's memory as a layered hierarchy, with caches and storage at different distances from each thread:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   Per-SM:    L1 / shared-memory crossbar       (closest, fastest)
   Cluster:   DSMEM crossbar (Hopper+)
   Chip:      L2 cache                          (chip-wide)
   System:    HBM and host-memory fabric        (farthest, slowest)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For two threads to communicate via a release/acquire pair, the operations must use a scope that includes both threads. A useful hardware mental model is that wider scopes generally have to order through a farther-away coherence point, so they cost more. The exact cache behavior is an implementation detail; the architectural rule is the set of threads covered by the scope.&lt;/p&gt;

&lt;p&gt;The scopes are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;.cta&lt;/code&gt;&lt;/strong&gt; — between threads in the same thread block. This is the smallest PTX memory-model scope and is usually the cheapest.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;.cluster&lt;/code&gt;&lt;/strong&gt; — between threads in the same thread block cluster (Hopper's new feature). This is useful for cluster-level shared memory and cluster barriers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;.gpu&lt;/code&gt;&lt;/strong&gt; — between any threads in the current program on the same GPU. In the usual global-memory mental model, this means ordering at a chip-wide level such as L2, so it is more expensive than &lt;code&gt;.cta&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;.sys&lt;/code&gt;&lt;/strong&gt; — across the whole program, including host threads and kernels on other GPUs. This may involve the system fabric and is the most expensive scope.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Rough hardware mental model:

   .cta     ──►  usually local CTA-visible path
   .cluster ──►  cluster-level shared-memory network
   .gpu     ──►  chip-wide ordering point, often L2
   .sys     ──►  system-visible fabric

           ──── generally increasing cost ────►
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pick the smallest scope that covers your producer-consumer pair.&lt;/strong&gt; A &lt;code&gt;fence.release.cta&lt;/code&gt; is much cheaper than a &lt;code&gt;fence.release.gpu&lt;/code&gt;, which is much cheaper than a &lt;code&gt;fence.release.sys&lt;/code&gt;. There's no reason to pay for a wider scope than your actual readers need. If both threads are guaranteed to be in the same CTA, use &lt;code&gt;.cta&lt;/code&gt;. If they could be on any SM, you need &lt;code&gt;.gpu&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We'll see scopes again in later articles when we discuss thread block clusters and the TMA engine.&lt;/p&gt;

&lt;h3&gt;
  
  
  Single-threaded code never needs fences
&lt;/h3&gt;

&lt;p&gt;If only one thread is touching some piece of memory, you don't need any of this. The reordering issues only matter when there's a second thread observing. Inside one thread, you get this guarantee for free:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Whenever you read a memory location, you see the most recent value &lt;em&gt;your own thread&lt;/em&gt; wrote to &lt;strong&gt;that same location&lt;/strong&gt; — in source-code order, no fences needed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The important nuance is the phrase &lt;strong&gt;"that same location."&lt;/strong&gt; The compiler and hardware are still free to reorder operations to &lt;em&gt;different&lt;/em&gt; memory addresses, even within a single thread. That reorder is invisible to you, because nothing inside your thread depends on the order. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data[0] = 42;        // (1)
data[1] = 99;        // (2)
x = data[0];         // (3) guaranteed to see 42 — same address as (1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, &lt;code&gt;(1)&lt;/code&gt; and &lt;code&gt;(2)&lt;/code&gt; write to &lt;em&gt;different&lt;/em&gt; addresses, so the compiler may commit them to memory in either order. But &lt;code&gt;(3)&lt;/code&gt; reads from the same address as &lt;code&gt;(1)&lt;/code&gt;, so it's guaranteed to see &lt;code&gt;42&lt;/code&gt; — the per-address ordering rule pins this down. From your thread's perspective, the reorder of &lt;code&gt;(1)&lt;/code&gt; and &lt;code&gt;(2)&lt;/code&gt; is invisible because nothing in your code reads &lt;code&gt;data[1]&lt;/code&gt; afterward to detect it.&lt;/p&gt;

&lt;p&gt;This is exactly the source of the bug at the start of the article: the producer's &lt;code&gt;data&lt;/code&gt; write and &lt;code&gt;flag&lt;/code&gt; write are to &lt;em&gt;different&lt;/em&gt; addresses, so they can be reordered freely from the producer's own perspective. It only becomes a problem because &lt;em&gt;another&lt;/em&gt; thread (the consumer) reads both addresses and notices the reorder. The fences fix it for that exact case.&lt;/p&gt;

&lt;p&gt;So the rule is: fences are needed only at the precise places where two threads communicate, and even then only because they need to coordinate their views of &lt;em&gt;different&lt;/em&gt; memory addresses. Within one thread, same-address reads work without fences.&lt;/p&gt;




&lt;h2&gt;
  
  
  Mental model summary
&lt;/h2&gt;

&lt;p&gt;If you remember nothing else from this article, remember this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Memory ordering on a GPU is a contract between a producer and a consumer.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The producer uses a &lt;strong&gt;release fence&lt;/strong&gt; to publish all its prior writes before announcing "done."&lt;/p&gt;

&lt;p&gt;The consumer uses an &lt;strong&gt;acquire fence&lt;/strong&gt; to ensure all its subsequent reads see fresh data after observing the "done" signal.&lt;/p&gt;

&lt;p&gt;Both halves of the contract are needed. Each fence is a one-way wall — release pins prior writes above it; acquire pins subsequent reads below it. Together they make a producer-consumer pattern correct on a system where independent operations can otherwise be reordered freely.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What's next in this series
&lt;/h2&gt;

&lt;p&gt;The memory consistency model is the foundation. With it in hand, the rest of the Hopper stack starts to make sense. Coming up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Execution synchronization&lt;/strong&gt; — &lt;code&gt;bar.sync&lt;/code&gt;, &lt;code&gt;barrier.cluster&lt;/code&gt;, and how to align threads in time (not just memory).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;mbarrier&lt;/code&gt;&lt;/strong&gt; — a programmable synchronization object that combines memory ordering, thread arrival counting, and async-engine completion tracking. The Hopper workhorse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Asynchronous copies&lt;/strong&gt; — the TMA engine, &lt;code&gt;cp.async.bulk&lt;/code&gt;, and the powerful but subtle &lt;code&gt;CUtensorMap&lt;/code&gt; descriptor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;wgmma&lt;/code&gt;&lt;/strong&gt; — Hopper's warp-group matrix multiply-accumulate, the engine that drives modern GEMM and attention kernels.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each will build on the same release-acquire ideas you just learned. The TMA engine, mbarriers, and warp-group MMA all rely on exactly the same kind of "one side publishes, the other side acquires" contract — just with more sophisticated machinery to support things like async byte-counting and distributed shared memory.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>computerscience</category>
      <category>performance</category>
      <category>programming</category>
    </item>
    <item>
      <title>CUDA Graphs in LLM Inference: Deep Dive</title>
      <dc:creator>Shah Fahad</dc:creator>
      <pubDate>Sat, 21 Feb 2026 07:09:21 +0000</pubDate>
      <link>https://dev.to/sfahad/cuda-graphs-in-llm-inference-deep-dive-36pb</link>
      <guid>https://dev.to/sfahad/cuda-graphs-in-llm-inference-deep-dive-36pb</guid>
      <description>&lt;h2&gt;
  
  
  Why CUDA Graphs Matter for LLM Inference
&lt;/h2&gt;

&lt;p&gt;LLM inference -- especially the token generation (decode) phase -- is &lt;strong&gt;often dominated by CPU overhead rather than GPU compute&lt;/strong&gt;. Each decode step generates a single token per sequence: the actual GPU work (small matmuls, attention over one query) can finish in microseconds, but the CPU can spend tens of microseconds &lt;em&gt;per kernel launch&lt;/em&gt; on launch bookkeeping, driver calls, and synchronization. With hundreds of kernel launches per transformer forward pass, this CPU overhead can become the bottleneck (though at higher batch sizes or with heavier kernels, decode can still become GPU-bound).&lt;/p&gt;

&lt;p&gt;Making matters worse, the CPU isn't just launching kernels -- it's also preparing data for the next batch: updating token IDs, managing the KV cache block table, running the scheduler, and handling request arrivals/completions. All of this competes for CPU time with kernel launches, amplifying the bottleneck. The GPU ends up sitting idle between launches, throughput drops, latency rises, and expensive GPU cycles are wasted on nothing.&lt;/p&gt;

&lt;p&gt;CUDA graphs solve this by &lt;strong&gt;recording the entire kernel sequence once&lt;/strong&gt; and &lt;strong&gt;replaying it with a single CPU call&lt;/strong&gt;. The driver overhead is paid once at capture time; every subsequent replay amortizes hundreds of per-kernel launches into a single replay launch, largely avoiding the repeated per-kernel launch bookkeeping. For decode-heavy workloads, this can eliminate the majority of per-step overhead.&lt;/p&gt;
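&lt;p&gt;To make the amortization concrete, here is a back-of-envelope sketch in plain Python; the overhead and kernel-count figures are assumptions for illustration, not measurements:&lt;/p&gt;

```python
# Illustrative numbers only -- per-launch overhead and kernel counts vary
# widely by model, driver version, and CPU.
LAUNCH_OVERHEAD_US = 20    # assumed CPU cost per eager kernel launch
KERNELS_PER_STEP   = 300   # assumed kernel launches per decode forward pass
REPLAY_OVERHEAD_US = 30    # assumed CPU cost of one whole-graph replay

eager_cpu_us = KERNELS_PER_STEP * LAUNCH_OVERHEAD_US  # 6000 us of CPU per step
graph_cpu_us = REPLAY_OVERHEAD_US                     # 30 us of CPU per step
speedup = eager_cpu_us / graph_cpu_us                 # 200x less CPU time
```

&lt;p&gt;Under these assumed numbers, per-step CPU cost drops from 6 ms to 30 µs -- which is why the effect is most visible when the GPU work itself is small.&lt;/p&gt;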

&lt;p&gt;This post walks through how CUDA graphs work in the context of LLM serving -- why decode is a natural fit, why context/mixed batches are harder, and how TensorRT-LLM (TRT-LLM) implements both monolithic and piecewise CUDA graph strategies.&lt;/p&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;1. CUDA Graphs Fundamentals&lt;/li&gt;
&lt;li&gt;2. Generation (Decode) CUDA Graphs&lt;/li&gt;
&lt;li&gt;3. KV Cache with Static Addresses&lt;/li&gt;
&lt;li&gt;4. Why Context &amp;amp; Mixed Batches Are Hard&lt;/li&gt;
&lt;li&gt;5. Piecewise CUDA Graphs (torch.compile)&lt;/li&gt;
&lt;li&gt;6. Configuration Guide&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  1. CUDA Graphs Fundamentals
&lt;/h2&gt;

&lt;p&gt;A CUDA graph captures a sequence of GPU operations (kernel launches, memory copies) into a single replayable unit.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Gets Captured (Fixed)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------------------------------------------------------------------+
| CUDA Graph Recording                                               |
|                                                                    |
| +----------+      +----------+      +----------+      +----------+ |
| | Kernel A |      | Kernel B |      | Kernel C |      | Kernel D | |
| |grid(4,1) |-----&amp;gt;|grid(8,1) |-----&amp;gt;|grid(4,1) |-----&amp;gt;|grid(2,1) | |
| |@0x100 -&amp;gt; |      |@0x200 -&amp;gt; |      |@0x300 -&amp;gt; |      |@0x400 -&amp;gt; | |
| |  0x200   |      |  0x300   |      |  0x400   |      |  0x500   | |
| +----------+      +----------+      +----------+      +----------+ |
+--------------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Baked into the graph:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which kernels to launch, in what order&lt;/li&gt;
&lt;li&gt;Memory addresses (pointers) each kernel reads/writes&lt;/li&gt;
&lt;li&gt;Kernel launch parameters (grid dims, block dims, shared memory)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;NOT baked (can change between replays):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The actual data at those addresses&lt;/li&gt;
&lt;li&gt;Data-dependent control flow inside kernels (loops, branches)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Replay Contract
&lt;/h3&gt;

&lt;p&gt;On replay, the entire sequence launches with minimal CPU overhead. The user's responsibility is to place correct data at the captured addresses before each replay.&lt;/p&gt;
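&lt;p&gt;The contract can be sketched as a toy simulation in plain Python (all names invented for illustration): a "graph" is a recorded list of kernels closed over fixed addresses, and the caller's only job is to refresh the data at those addresses before each replay:&lt;/p&gt;

```python
# Toy model of the replay contract: kernels are captured once with their
# buffer addresses (here, list indices) baked in. Only the data changes.
memory = [0] * 8                     # pretend device memory, fixed addresses

def kernel_double(src, dst):         # captured with src/dst baked in
    memory[dst] = memory[src] * 2

def kernel_add_one(src, dst):
    memory[dst] = memory[src] + 1

# "Capture": record (kernel, args) once. Addresses 0 -> 1 -> 2 are baked.
graph = [(kernel_double, (0, 1)), (kernel_add_one, (1, 2))]

def replay():                        # one call launches the whole chain
    for kernel, args in graph:
        kernel(*args)

memory[0] = 3                        # caller's job: fresh data at address 0
replay()
print(memory[2])                     # 3*2 + 1 = 7

memory[0] = 10                       # same addresses, new data
replay()
print(memory[2])                     # 10*2 + 1 = 21
```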

&lt;h3&gt;
  
  
  Why It's Fast
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+----------------------------+
| Without CUDA Graph (eager) |
|                            |
| CPU -- launch --&amp;gt; Kernel A |
| CPU -- launch --&amp;gt; Kernel B |
| CPU -- launch --&amp;gt; Kernel C |
| CPU -- launch --&amp;gt; Kernel D |
|                            |
| = 4 launches, each paying  |
|   driver bookkeeping cost  |
+----------------------------+

+------------------------------------------+
| With CUDA Graph                          |
|                                          |
| CPU -- replay --&amp;gt; [ Kernel A, B, C, D ]  |
|                                          |
| = 1 launch, entire chain executes on GPU |
+------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  2. Generation (Decode) CUDA Graphs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why Decode Is Well-Suited
&lt;/h3&gt;

&lt;p&gt;In decode, each sequence contributes exactly &lt;strong&gt;1 new token&lt;/strong&gt; per step. Total tokens = batch size. This makes the input shape predictable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+---------------------------------------------------------------+
| Decode step N                                                 |
|                                                               |
| seq0: 1 token  \                                              |
| seq1: 1 token   \                                             |
|                   &amp;gt;-- batch_size = 4, shape = [4, hidden_dim] |
| seq2: 1 token   /                                             |
| seq3: 1 token  /                                              |
+---------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pre-allocated Static Buffers
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------------------------------------------------------+
| Input token IDs buffer (pre-allocated, max_batch_size = 4096)   |
|                                                                 |
| [ token_0 ][ token_1 ][ token_2 ][ token_3 ] ... [ token_4095 ] |
|   @addr_0    @addr_1    @addr_2    @addr_3          @addr_4095  |
|                                                                 |
|   fixed addresses -- same every replay                          |
+-----------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Multiple Graphs for Different Batch Sizes
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Captured graphs (one per supported batch size, typically powers of two):

  batch_size   grid size     reads
  ----------   ---------     -----
       1  --&amp;gt;  (1, ...)  --&amp;gt; addr_0
       2  --&amp;gt;  (2, ...)  --&amp;gt; addr_0..1
       4  --&amp;gt;  (4, ...)  --&amp;gt; addr_0..3
       8  --&amp;gt;  (8, ...)  --&amp;gt; addr_0..7
       :
    4096  --&amp;gt;  (4096,..) --&amp;gt; addr_0..4095
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At runtime, 5 active sequences map to the &lt;code&gt;batch_size=8&lt;/code&gt; graph, with 3 dummy sequences padded in.&lt;/p&gt;
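&lt;p&gt;Bucket selection can be sketched as a small helper (a sketch assuming power-of-two capture sizes; &lt;code&gt;pick_graph&lt;/code&gt; and the size list are invented for illustration):&lt;/p&gt;

```python
# Assumed captured batch sizes, as in the diagram above (powers of two).
CAPTURED_SIZES = [1, 2, 4, 8, 16, 32, 64, 128]

def pick_graph(active_sequences):
    """Smallest captured batch size that fits; the rest is dummy padding."""
    for size in CAPTURED_SIZES:
        if size >= active_sequences:
            return size, size - active_sequences  # (graph to use, dummy pads)
    raise ValueError("batch exceeds largest captured graph")

print(pick_graph(5))   # (8, 3): use the batch_size=8 graph, pad 3 dummies
```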

&lt;h3&gt;
  
  
  Intermediate Activations Have Stable Addresses
&lt;/h3&gt;

&lt;p&gt;During capture, intermediate tensors are allocated from a graph-private memory pool, giving them stable device addresses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+----------------------------------------------------------+
| Transformer layer (captured; all addresses fixed)        |
|                                                          |
| [QKV Projection] ----&amp;gt; [Attention] ----&amp;gt; [Output Proj]   |
|  in @A, out @B          in @B, out @C    in @C, out @D   |
|                                               |          |
|                                               v          |
| [FFN Layer 1] --------&amp;gt; [FFN Layer 2] ----&amp;gt; (next layer) |
|  in @D, out @E           in @E, out @F                   |
+----------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On replay, the same chain executes at the same addresses. Intermediate buffers are never freed between replays -- they persist in the graph's memory pool. This is why &lt;strong&gt;each captured batch size has its own set of stable-address buffers&lt;/strong&gt;, and capturing many batch sizes consumes significant GPU memory.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the Runtime Updates Before Each Replay
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------------------------------------------+
| 1. input_token_ids[0:B]  &amp;lt;-- new token IDs          |
| 2. position_ids[0:B]     &amp;lt;-- new positions          |
| 3. sequence_lengths[0:B] += 1                       |
| 4. block_table           &amp;lt;-- update if new KV block |
+-----------------------------------------------------+
| 5. &amp;gt;&amp;gt;&amp;gt; REPLAY GRAPH &amp;lt;&amp;lt;&amp;lt;                             |
+-----------------------------------------------------+
| 6. new_logits &amp;lt;-- output_buffer[0:B]                |
+-----------------------------------------------------+
| B = batch_size                                      |
+-----------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  3. KV Cache with Static Addresses
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Apparent Contradiction
&lt;/h3&gt;

&lt;p&gt;KV cache grows every step (new K,V written for each token), yet CUDA graphs require fixed addresses. The solution: &lt;strong&gt;paged/block-based KV cache&lt;/strong&gt; with an &lt;strong&gt;indirection table&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Block-Based KV Cache Pool
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-------------------------------------------------------------+
| KV cache pool (pre-allocated; addresses never change)       |
|                                                             |
| [ Block 0 ][ Block 1 ][ Block 2 ][ Block 3 ][ Block 4 ] ... |
|   @blk_0     @blk_1     @blk_2     @blk_3     @blk_4        |
|  32 slots   32 slots   32 slots   32 slots   32 slots       |
|                                                             |
| each block holds K,V for a fixed number of tokens (e.g. 32) |
+-------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Block Table (Indirection)
&lt;/h3&gt;

&lt;p&gt;Each sequence has a block table mapping logical positions to physical blocks:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Logical positions&lt;/th&gt;
&lt;th&gt;Physical block&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;tokens 0–31&lt;/td&gt;
&lt;td&gt;Block 7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tokens 32–63&lt;/td&gt;
&lt;td&gt;Block 12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tokens 64–95&lt;/td&gt;
&lt;td&gt;Block 3 (partially filled, e.g. up to 82)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Sequence 0's block table at fixed address @tbl_0&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How Attention Kernel Uses Indirection
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Inside the attention kernel (pseudo-code):
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;each&lt;/span&gt; &lt;span class="n"&gt;past&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="n"&gt;position&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sequence_length&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;seq_id&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="n"&gt;block_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;block_table&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;seq_id&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;block_size&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;    &lt;span class="c1"&gt;# read from @tbl_0
&lt;/span&gt;    &lt;span class="n"&gt;offset&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;block_size&lt;/span&gt;
    &lt;span class="n"&gt;K_i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kv_cache_pool&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;block_idx&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;              &lt;span class="c1"&gt;# indirect lookup into pool
&lt;/span&gt;    &lt;span class="n"&gt;V_i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kv_cache_pool&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;block_idx&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K_i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step-by-Step: How KV Cache Grows Within CUDA Graph
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Buffer&lt;/th&gt;
&lt;th&gt;Step N&lt;/th&gt;
&lt;th&gt;Step N+1&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;block_table&lt;/code&gt; @tbl_0&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[7, 12, 3]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[7, 12, 3]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Same address, same indices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;seq_length&lt;/code&gt; @len_0&lt;/td&gt;
&lt;td&gt;&lt;code&gt;82&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;83&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Same address, incremented&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;kv_pool Block 3, slot 18&lt;/td&gt;
&lt;td&gt;K,V for token 82&lt;/td&gt;
&lt;td&gt;K,V for token 82&lt;/td&gt;
&lt;td&gt;Unchanged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;kv_pool Block 3, slot 19&lt;/td&gt;
&lt;td&gt;&lt;em&gt;(empty)&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;K,V for token 83&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;NEW&lt;/strong&gt; — written by kernel&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The kernel wrote to a different slot because &lt;code&gt;sequence_length&lt;/code&gt; told it to. All addresses remain fixed -- only the data changes.&lt;/p&gt;
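&lt;p&gt;The same growth mechanics can be sketched in plain Python -- a toy model, not TRT-LLM code, assuming token &lt;code&gt;t&lt;/code&gt; occupies logical slot &lt;code&gt;t&lt;/code&gt;:&lt;/p&gt;

```python
# Toy paged KV cache: pool and table addresses never move; "growth" is just
# new data written into pre-existing slots. Block size 32, as in the example.
BLOCK_SIZE  = 32
kv_pool     = {}           # (block_idx, offset) -> KV data; the fixed pool
block_table = [7, 12, 3]   # sequence 0's table: fixed buffer, mutable data
seq_length  = 83           # tokens 0..82 written; token 82 sits in
                           # block_table[82 // 32] = 3, offset 82 % 32 = 18

def append_token(token_id):
    """Decode step: write K,V for the next token, then bump the length."""
    global seq_length
    block  = block_table[seq_length // BLOCK_SIZE]
    offset = seq_length % BLOCK_SIZE
    kv_pool[(block, offset)] = "KV(token %d)" % token_id
    seq_length += 1
    return block, offset

print(append_token(83))    # (3, 19): the NEW write from the table above
print(seq_length)          # 84
```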

&lt;h3&gt;
  
  
  Why This Doesn't Violate CUDA Graph Rules
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What's fixed (baked in graph)&lt;/th&gt;
&lt;th&gt;What changes (data at fixed addrs)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;kv_cache_pool&lt;/code&gt; base address&lt;/td&gt;
&lt;td&gt;Which blocks are assigned (block_table data)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;block_table&lt;/code&gt; buffer address&lt;/td&gt;
&lt;td&gt;The integer block indices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;sequence_length&lt;/code&gt; buffer address&lt;/td&gt;
&lt;td&gt;The actual length values&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kernel grid dimensions&lt;/td&gt;
&lt;td&gt;Data-dependent loops inside kernel iterate more/fewer times&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  4. Why Context &amp;amp; Mixed Batches Are Hard
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Core Problem: Variable Total Token Count
&lt;/h3&gt;

&lt;p&gt;In decode, total tokens = batch size (each sequence = 1 token). In context/mixed, total tokens varies wildly:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Batch type&lt;/th&gt;
&lt;th&gt;Sequences&lt;/th&gt;
&lt;th&gt;Total tokens&lt;/th&gt;
&lt;th&gt;Predictable?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Decode&lt;/td&gt;
&lt;td&gt;&lt;code&gt;seq₀(1) + seq₁(1) + seq₂(1)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Yes — always = batch_size&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context&lt;/td&gt;
&lt;td&gt;&lt;code&gt;seq₀(137) + seq₁(2048)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2185&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mixed&lt;/td&gt;
&lt;td&gt;&lt;code&gt;seq₀(512 prefill) + seq₁(1 decode)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;513&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Problem 1: Kernel Grid Dimensions Depend on Total Tokens
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Kernel launch -- grid dims are a function of input shape&lt;/span&gt;
&lt;span class="n"&gt;dim3&lt;/span&gt; &lt;span class="nf"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;total_tokens&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;TILE_M&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;TILE_M&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hidden_dim&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;TILE_N&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;TILE_N&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;matmul_kernel&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hidden_dim&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;total_tokens&lt;/th&gt;
&lt;th&gt;grid size&lt;/th&gt;
&lt;th&gt;Implication&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(4, …)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4 blocks — one graph&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3072&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(24, …)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;24 blocks — &lt;strong&gt;different&lt;/strong&gt; graph required&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The grid is baked at capture time. Different total tokens = different grid = different graph.&lt;/p&gt;
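&lt;p&gt;The table's grid sizes follow directly from the launch formula above (a sketch; &lt;code&gt;TILE_M = 128&lt;/code&gt; is an assumed tile size chosen so the numbers match):&lt;/p&gt;

```python
TILE_M = 128   # assumed tile size; real kernels pick this per architecture

def grid_x(total_tokens):
    # ceil division, mirroring the dim3 grid computation in the C++ snippet
    return (total_tokens + TILE_M - 1) // TILE_M

print(grid_x(512))    # 4  -> the first row of the table
print(grid_x(3072))   # 24 -> different grid, so a different graph is needed
```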

&lt;h3&gt;
  
  
  Problem 2: Attention Grid Depends on Max Context Seq Length and Num Context Requests
&lt;/h3&gt;

&lt;p&gt;For MLP, every token is independent: &lt;code&gt;output[i] = MLP(input[i])&lt;/code&gt;. Fix total_tokens and you're done.&lt;/p&gt;

&lt;p&gt;For attention, the kernel grid depends on &lt;strong&gt;two per-iteration variables&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------------------------------------------------------------+
| TRT-LLM attention grid (simplified call chain)               |
|                                                              |
| Python (trtllm.py)                                           |
|   max_ctx_seq_len = seq_lens[:num_contexts].max()            |
|                             |                                |
|                             v                                |
| C++ (fmhaRunner / fused_multihead_attention_v2)              |
|   |                   |                   |                  |
|   v                   v                   v                  |
|   grid.x              grid.y              grid.z             |
|   ceil(s/unroll)      num_heads           num_ctx_requests   |
|   [VARIES]            [FIXED]             [VARIES]           |
|                                                              |
|   --&amp;gt; grid = ( ceil(s/unroll), num_heads, num_ctx_requests ) |
+--------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Grid = &lt;code&gt;(ceil(max_ctx_seq_len / unroll_step), num_heads, num_context_requests)&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;TRT-LLM uses a &lt;strong&gt;padded tiling strategy&lt;/strong&gt;: the grid is sized for the longest context request, and shorter requests have their extra tiles skip computation (the kernel checks &lt;code&gt;cu_seqlens&lt;/code&gt; internally):&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Padded tiling: 3 context requests, &lt;code&gt;seq_lens = [64, 128, 256]&lt;/code&gt;, &lt;code&gt;unroll_step = 64&lt;/code&gt;.&lt;br&gt;
Grid = &lt;code&gt;(4, num_heads, 3)&lt;/code&gt; — sized for longest request (256).&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Tile 0&lt;/th&gt;
&lt;th&gt;Tile 1&lt;/th&gt;
&lt;th&gt;Tile 2&lt;/th&gt;
&lt;th&gt;Tile 3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Req 0&lt;/strong&gt; (64 tokens)&lt;/td&gt;
&lt;td&gt;compute&lt;/td&gt;
&lt;td&gt;skip&lt;/td&gt;
&lt;td&gt;skip&lt;/td&gt;
&lt;td&gt;skip&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Req 1&lt;/strong&gt; (128 tokens)&lt;/td&gt;
&lt;td&gt;compute&lt;/td&gt;
&lt;td&gt;compute&lt;/td&gt;
&lt;td&gt;skip&lt;/td&gt;
&lt;td&gt;skip&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Req 2&lt;/strong&gt; (256 tokens)&lt;/td&gt;
&lt;td&gt;compute&lt;/td&gt;
&lt;td&gt;compute&lt;/td&gt;
&lt;td&gt;compute&lt;/td&gt;
&lt;td&gt;compute&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
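&lt;p&gt;The compute/skip map above can be reproduced with a few lines of Python (a sketch; &lt;code&gt;tile_map&lt;/code&gt; is invented for illustration):&lt;/p&gt;

```python
def tile_map(seq_lens, unroll_step):
    # grid.x is sized for the longest request (ceil division); shorter
    # requests mark their trailing tiles as "skip".
    num_tiles = -(-max(seq_lens) // unroll_step)
    return [
        ["compute" if length > t * unroll_step else "skip"
         for t in range(num_tiles)]
        for length in seq_lens
    ]

rows = tile_map([64, 128, 256], unroll_step=64)
for row in rows:
    print(row)
# ['compute', 'skip', 'skip', 'skip']
# ['compute', 'compute', 'skip', 'skip']
# ['compute', 'compute', 'compute', 'compute']
```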

&lt;p&gt;Even with this padded approach, the grid changes per iteration because &lt;strong&gt;both &lt;code&gt;max_ctx_seq_len&lt;/code&gt; and &lt;code&gt;num_context_requests&lt;/code&gt;&lt;/strong&gt; change depending on which requests the scheduler assigns to the context phase:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Iteration&lt;/th&gt;
&lt;th&gt;Context requests&lt;/th&gt;
&lt;th&gt;max_len&lt;/th&gt;
&lt;th&gt;grid&lt;/th&gt;
&lt;th&gt;What changed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(2, heads, 32)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(2, heads, 1)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;grid.z&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;256&lt;/td&gt;
&lt;td&gt;&lt;code&gt;(4, heads, 2)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;grid.x and z&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Different iterations produce different grids/launch parameters — the combination space explodes across multiple variables (e.g., &lt;code&gt;max_ctx_seq_len&lt;/code&gt;, &lt;code&gt;num_context_requests&lt;/code&gt;, and sequence-length distributions), making “one reusable CUDA graph” impractical.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A CUDA graph captured with one grid would produce &lt;strong&gt;incorrect results&lt;/strong&gt; if replayed with a different grid/launch configuration (missing tiles = unprocessed tokens; extra tiles = out-of-bounds/garbage work). To make this safe, you’d need to capture graphs for many combinations or pad/standardize to a fixed worst-case launch shape.&lt;/p&gt;
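&lt;p&gt;A rough sense of the explosion (the bucket counts below are assumptions for illustration, not TRT-LLM limits):&lt;/p&gt;

```python
# If you tried to cover context batches by capture alone, you would need one
# graph per distinct (grid.x, grid.z) pair. Bucket counts here are assumed.
SEQ_LEN_BUCKETS    = 64    # assumed distinct ceil(max_ctx_seq_len/unroll) values
CTX_REQUEST_COUNTS = 128   # assumed distinct num_context_requests values

graphs_needed = SEQ_LEN_BUCKETS * CTX_REQUEST_COUNTS
print(graphs_needed)       # 8192 graphs for just these two variables
```

&lt;p&gt;And as noted earlier, each captured graph pins its own set of stable-address buffers, so memory cost scales with the same product.&lt;/p&gt;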
&lt;h3&gt;
  
  
  Why Decode Attention Doesn't Have This Problem
&lt;/h3&gt;

&lt;p&gt;In decode, every sequence has exactly 1 query token, so attention takes a different kernel path:&lt;/p&gt;

&lt;p&gt;Decode attention: &lt;code&gt;grid = (batch_size, num_heads)&lt;/code&gt; — both fixed per captured graph.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;batch_size&lt;/code&gt; is fixed per captured graph (one graph per supported batch size)&lt;/li&gt;
&lt;li&gt;Variable KV cache lengths are handled by data-dependent loops &lt;strong&gt;inside&lt;/strong&gt; the kernel (loop over &lt;code&gt;sequence_length[i]&lt;/code&gt;) -- the grid doesn't change&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Where Each Layer Type Falls
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Shape&lt;/th&gt;
&lt;th&gt;Capturable?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Layer norm&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;[total_tokens, hidden]&lt;/code&gt; — flat&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q, K, V projections&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;[total_tokens, hidden]&lt;/code&gt; — flat matmuls&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Fused attention&lt;/strong&gt; (Q@K^T, softmax, scores@V)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;per-sequence, variable tiles&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;No&lt;/strong&gt; — grid varies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output projection&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;[total_tokens, hidden]&lt;/code&gt; — flat matmul&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MLP&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;[total_tokens, hidden]&lt;/code&gt; — flat matmuls&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h2&gt;
  
  
  5. Piecewise CUDA Graphs (torch.compile)
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Two Separate CUDA Graph Systems
&lt;/h3&gt;

&lt;p&gt;TRT-LLM uses &lt;strong&gt;two independent&lt;/strong&gt; CUDA graph systems -- understanding this distinction is critical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                  Python model forward()
                          |
            +-------------+-------------+
            |                           |
            v                           v
+-------------------------+ +-------------------------+
| torch.compile           | | Native CUDA Graph       |
| (Dynamo tracing)        | | (stream capture)        |
+-------------------------+ +-------------------------+
| Traces Python -&amp;gt; FX     | | Records GPU kernels     |
| Decomposes to ATen ops  | | on the CUDA stream      |
| Custom ops -&amp;gt; split pt  | | Captures everything     |
+-------------------------+ +-------------------------+
| Result: Pieces          | | Result: One monolithic  |
| [graph][eager][graph]...| | graph of full fwd pass  |
+-------------------------+ +-------------------------+
            |                           |
            v                           v
  Used for: mixed/context    Used for: decode-only
  (attn grid varies)         (attn grid fixed)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Generation-only (decode)&lt;/strong&gt;: Uses &lt;strong&gt;native &lt;code&gt;torch.cuda.CUDAGraph&lt;/code&gt;&lt;/strong&gt; capture. This records every kernel launch on the CUDA stream at the driver level -- including FlashAttention. It doesn't need to "understand" the kernels; it just records them. This works because decode attention's grid depends only on &lt;code&gt;batch_size&lt;/code&gt; (fixed per capture).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Piecewise (mixed/context)&lt;/strong&gt;: Uses &lt;strong&gt;torch.compile&lt;/strong&gt; to trace the model into an FX graph, then TRT-LLM's custom backend splits at attention boundaries and captures each non-attention piece as a CUDA graph. Attention runs eagerly.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Piecewise Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------------------------------------------------------+
| CUDA GRAPH -- piece 1                     [captured]   |
|   layer_norm -&amp;gt; qkv_projection                         |
|   pre-allocates output buffer @ addr_X                 |
+--------------------------------------------------------+
|                         |                              |
|                         v                              |
+--------------------------------------------------------+
| EAGER -- not graphed                 [runs every time] |
|   flash_attention(q, k, v, cu_seqlens, ...)            |
|   writes result IN-PLACE to addr_X                     |
+--------------------------------------------------------+
|                         |                              |
|                         v                              |
+--------------------------------------------------------+
| CUDA GRAPH -- piece 2                     [captured]   |
|   reads from addr_X                                    |
|   output_proj -&amp;gt; layer_norm -&amp;gt; mlp_up -&amp;gt;               |
|   activation -&amp;gt; mlp_down -&amp;gt; residual_add               |
+--------------------------------------------------------+
|                         |                              |
|                         v                              |
|                 ... next layer ...                     |
+--------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The in-place attention design is critical: attention writes into a buffer pre-allocated by piece 1, ensuring piece 2's captured graph reads from the correct fixed address.&lt;/p&gt;
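&lt;p&gt;The contract can be sketched in plain Python, modeling a captured piece as a closure over a fixed buffer object (a stand-in for the fixed device address baked into the graph at capture time; all names are illustrative):&lt;/p&gt;

```python
# Plain-Python sketch of the fixed-address contract. A captured piece is
# modeled as a closure over a fixed buffer object (a stand-in for a fixed
# device pointer baked into the graph at capture time).
def capture_piece2(buffer):
    def replay():
        return sum(buffer)          # always reads the captured "address"
    return replay

attn_out = [0.0] * 4                # pre-allocated by piece 1 ("addr_X")
piece2 = capture_piece2(attn_out)

# Correct: eager attention writes IN-PLACE into the same buffer.
attn_out[:] = [1.0, 2.0, 3.0, 4.0]
assert piece2() == 10.0

# Broken pattern: rebinding to a fresh buffer (a new "address") leaves the
# captured graph reading stale data.
attn_out = [9.0] * 4                # new object; piece2 still sees the old one
assert piece2() == 10.0
```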

&lt;h3&gt;
  
  
  Why Attention Is Excluded
&lt;/h3&gt;

&lt;p&gt;Attention is excluded from CUDA graph capture for a &lt;strong&gt;correctness&lt;/strong&gt; reason, not a tracing limitation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The tracing works fine.&lt;/strong&gt; TRT-LLM registers a FakeTensor implementation for the attention custom op, so &lt;code&gt;torch.compile&lt;/code&gt; in fullgraph mode traces the entire forward pass into one FX graph without graph breaks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The exclusion is a deliberate choice.&lt;/strong&gt; TRT-LLM's &lt;code&gt;piecewise_optimizer.py&lt;/code&gt; explicitly identifies attention ops and excludes them from CUDA graph pieces:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tensorrt_llm/_torch/compilation/piecewise_optimizer.py
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;is_call_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ops&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trtllm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attn_custom_op_inplace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ops&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trtllm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mla_custom_op_inplace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="n"&gt;exclude_modules_id&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ← excluded from CUDA graph capture
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The reason: replay correctness.&lt;/strong&gt; If attention were captured in a CUDA graph, the kernel's grid dimensions would be baked in. But attention's grid depends on the per-sequence query distribution, not just total tokens:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Kernel source&lt;/th&gt;
&lt;th&gt;grid.x&lt;/th&gt;
&lt;th&gt;grid.y&lt;/th&gt;
&lt;th&gt;grid.z&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fused_multihead_attention_v2.cpp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ceil(params.s / mUnrollStep)&lt;/code&gt; — &lt;strong&gt;varies&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;params.h&lt;/code&gt; (heads) — fixed&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;params.b&lt;/code&gt; (batch) — &lt;strong&gt;varies&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;triton_attention.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;num_prefill&lt;/code&gt; — &lt;strong&gt;varies&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;n_heads&lt;/code&gt; — fixed&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ceil(max(seq_len) / SEQ_BLOCK)&lt;/code&gt; — &lt;strong&gt;varies&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;unfusedAttentionKernels.cu&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ceil(q_length / 32.0f)&lt;/code&gt; — &lt;strong&gt;varies&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For the same &lt;code&gt;total_tokens=4096&lt;/code&gt;, different sequence distributions can produce different grids/launch metadata. A captured graph replays the capture-time launch configuration; unless you pad/standardize to that same configuration, replaying on a different distribution would be incorrect. MLP doesn't have this problem because its grid depends primarily on &lt;code&gt;total_tokens&lt;/code&gt;.&lt;/p&gt;
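&lt;p&gt;A small sketch of the first row's formula makes this concrete: two batches with the same &lt;code&gt;total_tokens&lt;/code&gt; but different length distributions yield different grids (the unroll step and head count here are made-up values, not the kernel's actual parameters):&lt;/p&gt;

```python
# Illustrative sketch of the first row's grid formula:
# grid = (ceil(max_seq / unroll_step), num_heads, batch). The unroll step
# and head count are made-up values, not the kernel's real parameters.
import math

def fmha_grid(seq_lens, num_heads, unroll_step=64):
    return (math.ceil(max(seq_lens) / unroll_step), num_heads, len(seq_lens))

# Two batches, both total_tokens = 4096:
even   = fmha_grid([1024, 1024, 1024, 1024], num_heads=32)   # (16, 32, 4)
skewed = fmha_grid([3968, 64, 32, 32], num_heads=32)         # (62, 32, 4)
assert even != skewed   # a graph captured for one is wrong for the other
```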

&lt;h3&gt;
  
  
  What &lt;code&gt;capture_num_tokens&lt;/code&gt; Controls
&lt;/h3&gt;

&lt;p&gt;TRT-LLM pre-captures piecewise graphs for specific total token counts. At runtime, it pads the token count &lt;strong&gt;up&lt;/strong&gt; to the next captured value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;capture_num_tokens: [1, 2, 4, 8, ..., 8192]

Runtime: 4160 total tokens → pad up to the next captured value (e.g., 5120)
  - Waste: (5120 - 4160) / 5120 = 18.7% extra compute
  - Benefit: CUDA graph replay for MLP pieces (zero launch overhead)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
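&lt;p&gt;The pad-up rule amounts to a lower-bound search over the sorted capture list; a minimal sketch (the function names are hypothetical, not TRT-LLM's implementation):&lt;/p&gt;

```python
# Sketch of the pad-up rule: the smallest captured token count that covers
# the runtime total, else eager fallback. Function names are hypothetical;
# capture_num_tokens must be sorted ascending for bisect to work.
from bisect import bisect_left

def pick_capture_bucket(total_tokens, capture_num_tokens):
    i = bisect_left(capture_num_tokens, total_tokens)
    if i == len(capture_num_tokens):
        return None                  # exceeds the largest capture: run eager
    return capture_num_tokens[i]

buckets = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
           1024, 2048, 3072, 4096, 5120, 6144, 7168, 8192]
assert pick_capture_bucket(4160, buckets) == 5120    # pad up
assert pick_capture_bucket(4096, buckets) == 4096    # exact hit, no padding
assert pick_capture_bucket(9000, buckets) is None    # eager fallback

def padding_waste(total, bucket):
    return (bucket - total) / bucket

assert round(padding_waste(4160, 5120), 4) == 0.1875
```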



&lt;h3&gt;
  
  
  Graph Type Summary
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Graph Type&lt;/th&gt;
&lt;th&gt;Capture Mechanism&lt;/th&gt;
&lt;th&gt;What It Captures&lt;/th&gt;
&lt;th&gt;When Used&lt;/th&gt;
&lt;th&gt;Key Parameter&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Generation-only&lt;/td&gt;
&lt;td&gt;Native &lt;code&gt;torch.cuda.CUDAGraph&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Full forward pass (including attention)&lt;/td&gt;
&lt;td&gt;Pure decode iterations&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cuda_graph_config.batch_sizes&lt;/code&gt; or &lt;code&gt;max_batch_size&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Piecewise&lt;/td&gt;
&lt;td&gt;torch.compile + native capture per piece&lt;/td&gt;
&lt;td&gt;All non-attention ops (attention runs eager)&lt;/td&gt;
&lt;td&gt;Mixed/context iterations&lt;/td&gt;
&lt;td&gt;&lt;code&gt;torch_compile_config.capture_num_tokens&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Memory vs. Coverage Trade-off
&lt;/h3&gt;

&lt;p&gt;Each piecewise capture at token count N pre-allocates intermediate buffers of size &lt;code&gt;[N, hidden_dim]&lt;/code&gt; per piece per layer. Capturing at large N (e.g., 8192) can consume enough GPU memory to shrink KV cache capacity below usable levels; combined with an aggressive &lt;code&gt;kv_cache_free_gpu_mem_fraction&lt;/code&gt;, this can reduce the KV cache's maximum sequence length enough to cause warmup failures.&lt;/p&gt;
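&lt;p&gt;A back-of-envelope sketch of the buffer cost (the buffers-per-layer count and dtype size are assumptions for illustration, not measured TRT-LLM allocations):&lt;/p&gt;

```python
# Back-of-envelope sketch: per-capture buffer cost grows linearly with the
# capture token count. Buffers-per-layer and dtype size are assumptions,
# not measured TRT-LLM allocations.
def capture_buffer_bytes(capture_tokens, hidden_dim, num_layers,
                         buffers_per_layer=2, dtype_bytes=2):
    # One [capture_tokens, hidden_dim] buffer per piece per layer, fp16 assumed.
    per_buffer = capture_tokens * hidden_dim * dtype_bytes
    return per_buffer * buffers_per_layer * num_layers

gib = capture_buffer_bytes(8192, hidden_dim=8192, num_layers=80) / 2 ** 30
assert gib == 20.0   # roughly 20 GiB for a single 8192-token capture
```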




&lt;h2&gt;
  
  
  6. Configuration Guide
&lt;/h2&gt;

&lt;h3&gt;
  
  
  TensorRT-LLM &lt;code&gt;llm_api_options_yaml&lt;/code&gt; Settings
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Generation-only CUDA graphs (decode phase)&lt;/span&gt;
&lt;span class="na"&gt;cuda_graph_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enable_padding&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;max_batch_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4096&lt;/span&gt;    &lt;span class="c1"&gt;# or explicit batch_sizes list&lt;/span&gt;

&lt;span class="c1"&gt;# Piecewise CUDA graphs (context/mixed phases, requires torch.compile)&lt;/span&gt;
&lt;span class="na"&gt;torch_compile_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enable_piecewise_cuda_graph&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;capture_num_tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;1&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;2&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;4&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;...&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;   &lt;span class="c1"&gt;# Must cover runtime max_num_tokens!&lt;/span&gt;
  &lt;span class="na"&gt;enable_userbuffers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;             &lt;span class="c1"&gt;# Default is true; disable if needed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key Principles for &lt;code&gt;capture_num_tokens&lt;/code&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Must cover &lt;code&gt;max_num_tokens&lt;/code&gt;&lt;/strong&gt;: If the runtime scheduler can produce up to N total tokens, the largest capture point must be &amp;gt;= N. Otherwise, iterations exceeding the max fall back to eager.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dense where iterations cluster&lt;/strong&gt;: Use iteration logs to find the hot zone. Pack capture points there to minimize padding waste.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sparse where few iterations land&lt;/strong&gt;: Ramp-up and transition regions need minimal captures (powers of 2 suffice).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fewer captures = less memory&lt;/strong&gt;: Each capture pre-allocates intermediate buffers sized &lt;code&gt;[capture_tokens, hidden_dim]&lt;/code&gt; per piece. On memory-constrained systems, fewer large captures may be preferable.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
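&lt;p&gt;Principles 1-3 can be sketched as a small generator for the capture list; the hot-zone bounds and stride are placeholders you would read off the iteration logs:&lt;/p&gt;

```python
# Sketch of principles 1-3: a sparse power-of-two ramp, a dense stride inside
# the observed hot zone, and a top point that covers max_num_tokens. The
# hot-zone bounds and stride are placeholders read off iteration logs.
def build_capture_list(hot_lo, hot_hi, hot_stride, max_num_tokens):
    points = set()
    # Sparse ramp: powers of two that do not exceed the limit.
    k = 0
    while min(2 ** k, max_num_tokens) == 2 ** k:
        points.add(2 ** k)
        k += 1
    # Dense hot zone: fixed stride where iterations cluster.
    points.update(range(hot_lo, min(hot_hi, max_num_tokens) + 1, hot_stride))
    # Principle 1: the largest capture point must cover max_num_tokens.
    points.add(max_num_tokens)
    return sorted(points)

caps = build_capture_list(hot_lo=3072, hot_hi=4608, hot_stride=256,
                          max_num_tokens=8192)
assert caps[-1] == 8192 and 3328 in caps
```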

&lt;h3&gt;
  
  
  TorchCompileConfig Defaults (TensorRT-LLM)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;torch_compile_config&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;None&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Torch compile completely off unless explicitly set&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;enable_piecewise_cuda_graph&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;False&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Must opt-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;capture_num_tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;None&lt;/code&gt; (auto: max 3072)&lt;/td&gt;
&lt;td&gt;Auto-generated: &lt;code&gt;[1,2,4,...,128,256,512,...,3072]&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;enable_userbuffers&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;True&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Enabled by default when torch compile is on&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;enable_fullgraph&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;True&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Full graph compilation in torch.compile&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;enable_inductor&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;False&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Inductor backend disabled by default&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Checking Coverage at Runtime
&lt;/h3&gt;

&lt;p&gt;Parse the iteration log and compute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;total_tokens_per_iter = numCtxTokens + numGenRequests

For each iteration:
  - If numCtxTokens == 0: uses generation-only CUDA graph (match on numGenRequests)
  - If numCtxTokens &amp;gt; 0:  uses piecewise CUDA graph (match on total_tokens)

Hit rate = (iterations with total_tokens &amp;lt;= max(capture_num_tokens)) / (total iterations)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Target: &lt;strong&gt;&amp;gt;95% hit rate&lt;/strong&gt; on piecewise graphs for meaningful benefit.&lt;/p&gt;
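&lt;p&gt;A sketch of that computation, assuming the log has already been parsed into &lt;code&gt;(numCtxTokens, numGenRequests)&lt;/code&gt; pairs (the helper is illustrative, not a TRT-LLM utility):&lt;/p&gt;

```python
# Sketch of the coverage check, assuming the iteration log is already parsed
# into (numCtxTokens, numGenRequests) pairs. The helper is illustrative, not
# a TRT-LLM utility. max(t, cap) == cap means "t does not exceed cap".
def piecewise_hit_rate(iterations, capture_num_tokens):
    cap = max(capture_num_tokens)
    # Only mixed/context iterations (numCtxTokens nonzero) use piecewise graphs.
    mixed = [c + g for c, g in iterations if c != 0]
    if not mixed:
        return 1.0
    hits = sum(1 for t in mixed if max(t, cap) == cap)
    return hits / len(mixed)

log = [(0, 64), (512, 8), (4096, 32), (9000, 1)]     # made-up iterations
assert piecewise_hit_rate(log, [1, 2, 4, 8192]) == 2 / 3
```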

</description>
      <category>ai</category>
      <category>deeplearning</category>
      <category>llm</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
