<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: aa24aa</title>
    <description>The latest articles on DEV Community by aa24aa (@kernelflowops).</description>
    <link>https://dev.to/kernelflowops</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3891752%2F563804a4-9eae-4b61-9716-cfb09d215a23.png</url>
      <title>DEV Community: aa24aa</title>
      <link>https://dev.to/kernelflowops</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kernelflowops"/>
    <language>en</language>
    <item>
      <title>From Black Magic to Science: The Evolution of the CUDA Optimization Skill</title>
      <dc:creator>aa24aa</dc:creator>
      <pubDate>Wed, 22 Apr 2026 05:39:26 +0000</pubDate>
      <link>https://dev.to/kernelflowops/from-black-magic-to-science-the-evolution-of-the-cuda-optimization-skill-3h2a</link>
      <guid>https://dev.to/kernelflowops/from-black-magic-to-science-the-evolution-of-the-cuda-optimization-skill-3h2a</guid>
      <description>&lt;h1&gt;
  
  
  🚀 The Evolution of the CUDA Optimization Skill
&lt;/h1&gt;

&lt;p&gt;If the V1 optimizer felt like a rookie stumbling blindfolded through an arsenal—working hard, but relying mostly on luck—then V2 has evolved into a special-ops soldier equipped with &lt;strong&gt;radar scanning&lt;/strong&gt;, &lt;strong&gt;shadow clones&lt;/strong&gt;, and &lt;strong&gt;post-battle forensics&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let’s peel back the code and see what kind of magic V2 introduced to turn CUDA optimization from something that felt like &lt;strong&gt;black art&lt;/strong&gt; into something much closer to &lt;strong&gt;science&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy6v0kkptwezdo33vsej0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy6v0kkptwezdo33vsej0.png" alt="skill architecture" width="800" height="1139"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🔗 Project link:&lt;br&gt;
&lt;a href="https://github.com/KernelFlow-ops/cuda-optimized-skill" rel="noopener noreferrer"&gt;https://github.com/KernelFlow-ops/cuda-optimized-skill&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If this project inspires you or helps you in any way, please consider giving it a ⭐ &lt;strong&gt;Star&lt;/strong&gt;. Your support is the biggest motivation for continued iteration—thank you!&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Roofline-Driven Axis Budget Allocation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1.1 The problem in V1: fixed allocation, no awareness of bottleneck imbalance
&lt;/h3&gt;

&lt;p&gt;In V1, each iteration hard-coded the choice of &lt;strong&gt;one optimization method per axis&lt;/strong&gt;: one for compute, one for memory, and one for latency, for a fixed total of three methods.&lt;/p&gt;

&lt;p&gt;That means regardless of the kernel’s actual bottleneck distribution—even if compute utilization is only 8% while memory bandwidth is already near saturation—each axis still gets equal treatment. As a result, valuable optimization opportunities may be wasted on axes that have very little headroom left.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.2 What V2 does: dynamic allocation based on performance gaps
&lt;/h3&gt;

&lt;p&gt;In V2, after each Nsight Compute analysis and before method selection, a new &lt;strong&gt;roofline analysis stage&lt;/strong&gt; is inserted, implemented by &lt;code&gt;roofline.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This script reads two inputs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ncu_top.json&lt;/code&gt; from the current iteration, which contains measured performance metrics&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;env.json&lt;/code&gt;, which contains the hardware’s theoretical peak capabilities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It then computes three gap values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Δc (compute gap):&lt;/strong&gt; the distance between current compute utilization and the hardware peak, in the range &lt;code&gt;[0, 1]&lt;/code&gt;. A larger value means the kernel is farther from peak compute performance and has more room to improve.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Δm (memory gap):&lt;/strong&gt; the distance between current bandwidth utilization and the hardware peak.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Δl (latency gap):&lt;/strong&gt; the current maximum stall percentage, reflecting the severity of pipeline stalls.&lt;/li&gt;
&lt;/ul&gt;
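
&lt;p&gt;As a rough sketch, the gap computation boils down to “utilization versus peak” on each axis. The field names below are assumptions for illustration, not the actual keys in &lt;code&gt;ncu_top.json&lt;/code&gt; or &lt;code&gt;env.json&lt;/code&gt;:&lt;/p&gt;

```python
# Sketch of the roofline gap computation. The metric names
# (flops_per_s, peak_flops_per_s, ...) are illustrative assumptions,
# not the actual keys read by roofline.py.
def compute_gaps(ncu, env):
    clamp = lambda x: min(max(x, 0.0), 1.0)
    compute_util = ncu["flops_per_s"] / env["peak_flops_per_s"]
    memory_util = ncu["bytes_per_s"] / env["peak_bytes_per_s"]
    return {
        "compute": clamp(1.0 - compute_util),           # Δc
        "memory": clamp(1.0 - memory_util),             # Δm
        "latency": clamp(ncu["max_stall_pct"] / 100.0), # Δl
    }
```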

&lt;h3&gt;
  
  
  1.3 Budget allocation rules
&lt;/h3&gt;

&lt;p&gt;Once the three Δ values are computed, the script allocates a fixed total budget of &lt;strong&gt;3 method slots&lt;/strong&gt; across the three axes proportionally.&lt;/p&gt;

&lt;p&gt;There are two hard constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Per-axis cap = 2&lt;/strong&gt;
Even if one axis dominates the others, it can receive at most 2 method slots.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total budget = 3&lt;/strong&gt;
The sum across compute, memory, and latency always remains 3.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Δc = 0.92&lt;/li&gt;
&lt;li&gt;Δm = 0.57&lt;/li&gt;
&lt;li&gt;Δl = 0.61&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;then compute has the largest gap, and the allocation might become:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;compute = 2&lt;/li&gt;
&lt;li&gt;memory = 0&lt;/li&gt;
&lt;li&gt;latency = 1&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means that in the current round, the optimizer selects &lt;strong&gt;two methods from the compute axis&lt;/strong&gt;, &lt;strong&gt;none from memory&lt;/strong&gt;, and &lt;strong&gt;one from latency&lt;/strong&gt;.&lt;/p&gt;
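
&lt;p&gt;One plausible way to implement these rules is a greedy pass that hands out the 3 slots one at a time, always to the axis with the largest gap that is still under the cap. This is a sketch, not necessarily the exact rule in &lt;code&gt;roofline.py&lt;/code&gt;:&lt;/p&gt;

```python
def allocate_budget(gaps, total=3, cap=2):
    # Hand each of the `total` slots to the axis with the largest gap,
    # skipping any axis that has already reached the per-axis cap.
    budget = {axis: 0 for axis in gaps}
    for _ in range(total):
        eligible = [axis for axis in gaps if budget[axis] != cap]
        best = max(eligible, key=lambda axis: gaps[axis])
        budget[best] += 1
    return budget

allocate_budget({"compute": 0.92, "memory": 0.57, "latency": 0.61})
# {"compute": 2, "memory": 0, "latency": 1}
```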

&lt;h3&gt;
  
  
  1.4 Early stopping
&lt;/h3&gt;

&lt;p&gt;When all three gaps fall below &lt;code&gt;0.15&lt;/code&gt;, the roofline script outputs &lt;code&gt;near_peak: true&lt;/code&gt;, indicating that the kernel is already very close to the hardware’s theoretical peak and that further optimization is unlikely to produce meaningful returns.&lt;/p&gt;

&lt;p&gt;At that point, the loop terminates automatically.&lt;/p&gt;

&lt;p&gt;V1 had no such mechanism. Even when the kernel was already near its limit, it would still blindly execute all scheduled iterations.&lt;/p&gt;
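
&lt;p&gt;The early-stop check itself can be as small as the sketch below (&lt;code&gt;operator.lt&lt;/code&gt; is simply a strict less-than):&lt;/p&gt;

```python
import operator

def near_peak(gaps, threshold=0.15):
    # True when every axis gap is strictly below the threshold, i.e.
    # the kernel sits close to the roofline on all three axes.
    return operator.lt(max(gaps.values()), threshold)
```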

&lt;h3&gt;
  
  
  1.5 How this changes method selection
&lt;/h3&gt;

&lt;p&gt;Because axis budgets are no longer fixed at 1, the method selection logic changes as well.&lt;/p&gt;

&lt;p&gt;In V1, the rule was essentially:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;scan each axis once and pick the first method that passes the checks&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In V2, method selection becomes &lt;strong&gt;budget-aware&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;If an axis budget is 0:&lt;/strong&gt; skip it entirely&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If an axis budget is 1:&lt;/strong&gt; behave like V1 and take the first method that passes all four checks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If an axis budget is 2:&lt;/strong&gt; collect all valid candidates on that axis, sort them by &lt;strong&gt;trigger strength&lt;/strong&gt; (a continuous value in &lt;code&gt;[0, 1]&lt;/code&gt; representing the strength of evidence that the method should fire), and take the top 2&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is also why V2 adds a new &lt;code&gt;trigger_strength&lt;/code&gt; field to &lt;code&gt;methods.schema.json&lt;/code&gt;: it becomes useful only when multiple methods on the same axis compete for multiple slots.&lt;/p&gt;
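
&lt;p&gt;A sketch of the budget-aware selection loop. The candidate layout and the &lt;code&gt;passes_checks&lt;/code&gt; flag are assumptions for illustration; &lt;code&gt;trigger_strength&lt;/code&gt; is the schema field described above:&lt;/p&gt;

```python
def select_methods(candidates_by_axis, budget):
    # candidates_by_axis maps an axis name to a list of method dicts
    # with hypothetical fields "id", "passes_checks", "trigger_strength".
    chosen = []
    for axis, methods in candidates_by_axis.items():
        slots = budget.get(axis, 0)
        if slots == 0:
            continue  # budget 0: skip the axis entirely
        valid = [m for m in methods if m["passes_checks"]]
        if slots == 1:
            if valid:
                chosen.append(valid[0]["id"])  # V1-style: first passing method
        else:
            # Two slots: rank all valid candidates by trigger strength, take top 2.
            ranked = sorted(valid, key=lambda m: m["trigger_strength"], reverse=True)
            chosen.extend(m["id"] for m in ranked[:slots])
    return chosen
```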

&lt;h3&gt;
  
  
  1.6 Roofline history tracking
&lt;/h3&gt;

&lt;p&gt;Each iteration’s roofline result is appended to the &lt;code&gt;roofline_history&lt;/code&gt; array in &lt;code&gt;state.json&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This enables the final summary to show a &lt;strong&gt;bottleneck migration table&lt;/strong&gt;, for example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;round 1: compute-bound&lt;/li&gt;
&lt;li&gt;round 2: bandwidth-bound&lt;/li&gt;
&lt;li&gt;round 3: near-peak&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes it much easier to visualize how the bottleneck shifts across axes as the kernel improves.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Branch-and-Select Exploration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2.1 The problem in V1: one candidate, hyperparameters chosen by luck
&lt;/h3&gt;

&lt;p&gt;In V1, each iteration generated only &lt;strong&gt;one kernel&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Claude had to choose both:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the optimization methods&lt;/li&gt;
&lt;li&gt;the hyperparameters, such as tile size, pipeline stages, warp count, and so on&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the best hyperparameter combination depends heavily on the hardware and problem size. It is very hard to get right in a single shot purely by reasoning.&lt;/p&gt;

&lt;p&gt;As a result, even if the chosen optimization method is correct, poor hyperparameter choices may hide its actual value and cause the method to be wrongly judged as ineffective.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.2 What V2 does: same method set, multiple hyperparameter branches, then select the winner
&lt;/h3&gt;

&lt;p&gt;V2 decouples &lt;strong&gt;method selection&lt;/strong&gt; from &lt;strong&gt;hyperparameter selection&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In each iteration, Claude first determines a set of optimization methods (step 3c), and then generates &lt;strong&gt;K branch candidates&lt;/strong&gt; based on that exact same method combination. By default, &lt;code&gt;K = 4&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;These branches share the same optimization strategy, but differ in parameters such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tile size:&lt;/strong&gt; e.g. &lt;code&gt;128×128×32&lt;/code&gt; vs &lt;code&gt;128×256×32&lt;/code&gt; vs &lt;code&gt;256×128×32&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pipeline stages:&lt;/strong&gt; e.g. 3-stage vs 4-stage vs 5-stage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Warp count:&lt;/strong&gt; e.g. 4 warps vs 8 warps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implementation variants inside a method:&lt;/strong&gt; e.g. different swizzle patterns or different MMA atom choices&lt;/li&gt;
&lt;/ul&gt;
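
&lt;p&gt;A hypothetical branch plan for one iteration might look like this (all method ids and parameter values are made up for illustration):&lt;/p&gt;

```python
# Hypothetical branch plan for one iteration with K = 4: the method set
# is fixed, only the hyperparameters vary across the candidates.
methods = ["compute.tensor_core", "latency.async_pipeline"]
branch_specs = [
    {"tile": (128, 128, 32), "stages": 3, "warps": 8},
    {"tile": (128, 256, 32), "stages": 4, "warps": 8},
    {"tile": (256, 128, 32), "stages": 4, "warps": 4},
    {"tile": (128, 128, 32), "stages": 5, "warps": 8},
]
```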

&lt;h3&gt;
  
  
  2.3 Champion selection
&lt;/h3&gt;

&lt;p&gt;The script &lt;code&gt;branch_explore.py&lt;/code&gt; compiles and benchmarks all K branches.&lt;/p&gt;

&lt;p&gt;Importantly, this stage does &lt;strong&gt;not&lt;/strong&gt; run Nsight Compute profiling. It only performs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;correctness validation&lt;/li&gt;
&lt;li&gt;timing benchmark&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This keeps overhead under control.&lt;/p&gt;

&lt;p&gt;The selection rule is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;among all branches that pass correctness, pick the one with the lowest execution time as the &lt;strong&gt;champion&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If none of the K branches pass validation, Claude must rewrite and resubmit, with up to 3 retries, just like in V1.&lt;/p&gt;
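
&lt;p&gt;The selection rule fits in a few lines. Field names such as &lt;code&gt;ok&lt;/code&gt; and &lt;code&gt;ms&lt;/code&gt; are assumed, not taken from &lt;code&gt;branch_explore.py&lt;/code&gt;:&lt;/p&gt;

```python
def pick_champion(branch_results):
    # branch_results: list of dicts with hypothetical fields
    # "branch", "ok" (passed correctness), and "ms" (benchmark time).
    passing = [b for b in branch_results if b["ok"]]
    if not passing:
        return None  # no valid branch: rewrite and resubmit (up to 3 retries)
    return min(passing, key=lambda b: b["ms"])
```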

&lt;h3&gt;
  
  
  2.4 Frontier retention
&lt;/h3&gt;

&lt;p&gt;The losing branches are &lt;strong&gt;not discarded&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;All branches that pass correctness but are not selected as champion are stored in the &lt;code&gt;frontier&lt;/code&gt; array in &lt;code&gt;state.json&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;These “suboptimal but valid” candidates form an &lt;strong&gt;exploration frontier&lt;/strong&gt; for future iterations. For example, if a later optimization method happens to pair especially well with the hyperparameters of a previously losing branch, that branch can become a useful starting point.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.5 How this appears in the directory structure
&lt;/h3&gt;

&lt;p&gt;Each iteration now includes a new &lt;code&gt;branches/&lt;/code&gt; subdirectory, where every branch stores its own kernel source and benchmark result.&lt;/p&gt;

&lt;p&gt;The champion kernel is copied to the iteration root as &lt;code&gt;kernel.&amp;lt;ext&amp;gt;&lt;/code&gt;, becoming the official output of that iteration.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.6 What fundamental problem this solves
&lt;/h3&gt;

&lt;p&gt;Branch exploration transforms hyperparameter tuning from &lt;strong&gt;guesswork&lt;/strong&gt; into &lt;strong&gt;experimentation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In V1, if the agent picked &lt;code&gt;tile = 128×128×32&lt;/code&gt; but the true optimum was &lt;code&gt;128×256×32&lt;/code&gt;, the whole iteration could be wasted.&lt;/p&gt;

&lt;p&gt;In V2, both candidates are actually compiled and benchmarked, so the value of the optimization method itself is much less likely to be hidden by a bad parameter choice.&lt;/p&gt;

&lt;p&gt;This dramatically improves the information efficiency of every iteration.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Ablation-Based Attribution
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.1 The problem in V1: bundled judgment, no way to isolate individual contributions
&lt;/h3&gt;

&lt;p&gt;In V1, each iteration applied 3 optimization methods and then checked whether total performance improved by more than a 2% noise threshold.&lt;/p&gt;

&lt;p&gt;If yes, all three methods were added to &lt;code&gt;effective_methods&lt;/code&gt;.&lt;br&gt;
If not, all three were added to &lt;code&gt;ineffective_methods&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This creates a serious attribution problem.&lt;/p&gt;

&lt;p&gt;Suppose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;method A contributes a 3 ms speedup&lt;/li&gt;
&lt;li&gt;method B contributes 0.5 ms&lt;/li&gt;
&lt;li&gt;method C actually slows things down by 1 ms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The total speedup would still be 2.5 ms, so all three methods would be labeled “effective.”&lt;/p&gt;

&lt;p&gt;But in reality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A is highly valuable&lt;/li&gt;
&lt;li&gt;B is marginal&lt;/li&gt;
&lt;li&gt;C is harmful&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;V1 had no way to distinguish them.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.2 What V2 does: per-method ablation experiments
&lt;/h3&gt;

&lt;p&gt;After the champion is selected and fully profiled with Nsight Compute, V2 adds a new &lt;strong&gt;ablation stage&lt;/strong&gt;, driven by &lt;code&gt;ablate.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Suppose the chosen methods in this round are &lt;strong&gt;A, B, and C&lt;/strong&gt;, and the champion kernel runs in &lt;strong&gt;2.14 ms&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Ablation then generates three variants, each removing exactly one method:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;remove A → keep only B + C → benchmark = 4.82 ms&lt;/li&gt;
&lt;li&gt;remove B → keep only A + C → benchmark = 2.31 ms&lt;/li&gt;
&lt;li&gt;remove C → keep only A + B → benchmark = 2.19 ms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The attribution score for each method is defined as:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;attribution(m) = ms_without_m − ms_champion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Using the numbers above:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A:&lt;/strong&gt; &lt;code&gt;4.82 − 2.14 = +2.68 ms&lt;/code&gt;
Removing A makes the kernel much slower → A contributed a lot&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;B:&lt;/strong&gt; &lt;code&gt;2.31 − 2.14 = +0.17 ms&lt;/code&gt;
Removing B makes the kernel slightly slower → B contributed positively, but modestly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;C:&lt;/strong&gt; &lt;code&gt;2.19 − 2.14 = +0.05 ms&lt;/code&gt;
Removing C has almost no effect → below noise threshold, so C is treated as ineffective&lt;/li&gt;
&lt;/ul&gt;
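
&lt;p&gt;The attribution formula is simple enough to compute by hand; a minimal sketch that reproduces the numbers above:&lt;/p&gt;

```python
def attribution_scores(champion_ms, ablated_ms):
    # attribution(m) = ms_without_m - ms_champion, rounded for display.
    # ablated_ms maps each method to the runtime with that method removed.
    return {m: round(ms - champion_ms, 2) for m, ms in ablated_ms.items()}

attribution_scores(2.14, {"A": 4.82, "B": 2.31, "C": 2.19})
# {"A": 2.68, "B": 0.17, "C": 0.05}
```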

&lt;h3&gt;
  
  
  3.3 How attribution affects method classification
&lt;/h3&gt;

&lt;p&gt;Attribution is not used in isolation. It is combined with SASS validation results, which we will discuss in the next section.&lt;/p&gt;

&lt;p&gt;The full classification logic is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Attribution &amp;gt; noise threshold&lt;/strong&gt; and &lt;strong&gt;SASS validation passed&lt;/strong&gt;
→ add to &lt;code&gt;effective_methods&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attribution ≤ noise threshold&lt;/strong&gt; but &lt;strong&gt;SASS validation passed&lt;/strong&gt;
→ add to &lt;code&gt;ineffective_methods&lt;/code&gt;
The method was genuinely implemented by the compiler, but did not help performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SASS validation failed&lt;/strong&gt;, regardless of attribution
→ add to &lt;code&gt;implementation_failed_methods&lt;/code&gt;
The method was never truly realized in the compiled code&lt;/li&gt;
&lt;/ul&gt;
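
&lt;p&gt;A sketch of that classification, assuming an absolute noise threshold in milliseconds (the skill’s actual threshold may be defined differently, e.g. relative to total runtime):&lt;/p&gt;

```python
import operator

def classify_method(attribution_ms, sass_passed, noise_ms=0.10):
    # noise_ms is an assumed absolute threshold for illustration only.
    if not sass_passed:
        # Never truly realized in machine code, regardless of attribution.
        return "implementation_failed_methods"
    if operator.gt(attribution_ms, noise_ms):
        return "effective_methods"
    return "ineffective_methods"
```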

&lt;h3&gt;
  
  
  3.4 Outputs of the ablation stage
&lt;/h3&gt;

&lt;p&gt;Each iteration now contains a new &lt;code&gt;ablations/&lt;/code&gt; subdirectory.&lt;/p&gt;

&lt;p&gt;For every ablated method, a corresponding subfolder stores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the ablated kernel source&lt;/li&gt;
&lt;li&gt;the benchmark result&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The aggregated attribution data is written to &lt;code&gt;iterv{i}/attribution.json&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.5 What fundamental problem this solves
&lt;/h3&gt;

&lt;p&gt;Ablation turns method effectiveness from a &lt;strong&gt;group-level judgment&lt;/strong&gt; into a &lt;strong&gt;per-method causal test&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This provides two major benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;code&gt;effective_methods&lt;/code&gt; list becomes much more trustworthy, so future iterations are less likely to be misled by false positives.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;ineffective_methods&lt;/code&gt; list also becomes more precise. A good method is no longer rejected simply because it happened to be bundled with bad companions.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  4. SASS-Level Instruction Validation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1 The problem in V1: claimed optimizations may exist only on paper
&lt;/h3&gt;

&lt;p&gt;In V1, Claude might claim to have used a certain optimization method—say, tensor cores—and might even write the corresponding source code, such as inline &lt;code&gt;mma.sync&lt;/code&gt; assembly or a CUTLASS HMMA path.&lt;/p&gt;

&lt;p&gt;But the compiler may not actually adopt that intended path.&lt;/p&gt;

&lt;p&gt;Common reasons include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;compiler optimization overriding the handwritten intent&lt;/li&gt;
&lt;li&gt;register pressure forcing a fallback to scalar instructions&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pragma&lt;/code&gt; or &lt;code&gt;attribute&lt;/code&gt; directives being silently ignored&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;V1 acknowledged this in its failure-mode documentation and suggested manually diffing Nsight Compute reports between iterations to see when “nothing really changed.”&lt;/p&gt;

&lt;p&gt;But this remained only a suggestion. It was not automated, and could easily be skipped.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.2 What V2 does: automatic validation at the disassembly level
&lt;/h3&gt;

&lt;p&gt;V2 introduces &lt;code&gt;sass_check.py&lt;/code&gt; and a reference file &lt;code&gt;references/sass_signatures.json&lt;/code&gt;, forming an automated compiled-artifact verification mechanism.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sass_signatures.json&lt;/code&gt; maps each optimization method to the SASS instruction patterns that should appear if the method was truly realized. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;compute.tensor_core&lt;/code&gt; → should contain &lt;strong&gt;HMMA&lt;/strong&gt; instructions&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;latency.async_pipeline&lt;/code&gt; → should contain &lt;strong&gt;LDGSTS&lt;/strong&gt; or &lt;strong&gt;CP.ASYNC&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;memory.smem_swizzle_xor&lt;/code&gt; → should show the expected shared-memory access signature&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The workflow of &lt;code&gt;sass_check.py&lt;/code&gt; is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;run &lt;code&gt;cuobjdump --dump-sass&lt;/code&gt; on the compiled champion kernel&lt;/li&gt;
&lt;li&gt;extract the SASS disassembly&lt;/li&gt;
&lt;li&gt;grep for the expected instruction signatures corresponding to each selected method&lt;/li&gt;
&lt;li&gt;output a pass/fail result per method to &lt;code&gt;iterv{i}/sass_check.json&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
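
&lt;p&gt;A sketch of this workflow, with the grep step split out so it can be exercised without a GPU toolchain. The signature patterns are illustrative, not copied from &lt;code&gt;sass_signatures.json&lt;/code&gt;:&lt;/p&gt;

```python
import re
import subprocess

def match_signatures(sass_text, selected_methods, signatures):
    # Pure grep step: a method passes if any of its expected SASS opcode
    # patterns appears in the disassembly text.
    return {
        m: any(re.search(pat, sass_text) for pat in signatures[m])
        for m in selected_methods
    }

def sass_check(cubin_path, selected_methods, signatures):
    # Steps 1 and 2: dump the SASS with cuobjdump, then grep it.
    dump = subprocess.run(
        ["cuobjdump", "--dump-sass", cubin_path],
        capture_output=True, text=True, check=True,
    )
    return match_signatures(dump.stdout, selected_methods, signatures)
```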

&lt;h3&gt;
  
  
  4.3 How SASS validation and ablation work together
&lt;/h3&gt;

&lt;p&gt;SASS validation and ablation answer two different questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ablation attribution:&lt;/strong&gt; did this method have a causal performance contribution?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SASS validation:&lt;/strong&gt; was this method actually realized by the compiler in the machine code?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, they produce the three-way classification described earlier.&lt;/p&gt;

&lt;p&gt;One especially important case is when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SASS validation fails&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;but the kernel still gets faster overall&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;V2 handles this by marking the method as &lt;code&gt;implementation_failed&lt;/code&gt;, while still allowing the kernel itself to remain the best result if it truly is faster.&lt;/p&gt;

&lt;p&gt;That means the speedup may have come from &lt;strong&gt;hyperparameter changes&lt;/strong&gt;, not from the claimed method itself. This is exactly the sort of distinction that V2’s attribution system captures and that V1 completely missed.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.4 Impact on future iterations
&lt;/h3&gt;

&lt;p&gt;Methods placed into &lt;code&gt;implementation_failed_methods&lt;/code&gt; are &lt;strong&gt;not automatically blacklisted&lt;/strong&gt; in future rounds.&lt;/p&gt;

&lt;p&gt;Instead, Claude must explicitly acknowledge the earlier implementation failure in &lt;code&gt;analysis.md&lt;/code&gt; and explain why the method may succeed this time—for example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;using a different code formulation&lt;/li&gt;
&lt;li&gt;changing compiler flags&lt;/li&gt;
&lt;li&gt;reducing register pressure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is more flexible than V1’s simple blacklist approach. The method itself may still be good; only the previous implementation path failed.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Changes to the State Structure
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5.1 Overview of newly added fields
&lt;/h3&gt;

&lt;p&gt;Compared with V1, V2 adds four new top-level fields to &lt;code&gt;state.json&lt;/code&gt;, each corresponding to one of the mechanisms above.&lt;/p&gt;

&lt;h4&gt;
  
  
  ① &lt;code&gt;branches&lt;/code&gt; (integer)
&lt;/h4&gt;

&lt;p&gt;This records the number of branches explored per iteration, defaulting to 4.&lt;/p&gt;

&lt;p&gt;It is initialized in &lt;code&gt;state.py init&lt;/code&gt; via the &lt;code&gt;--branches&lt;/code&gt; argument and remains fixed throughout the optimization run.&lt;/p&gt;

&lt;p&gt;It controls:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how many hyperparameter variants Claude must generate in step 3d&lt;/li&gt;
&lt;li&gt;how many candidates &lt;code&gt;branch_explore.py&lt;/code&gt; must compile and benchmark in step 3e&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  ② &lt;code&gt;implementation_failed_methods&lt;/code&gt; (list)
&lt;/h4&gt;

&lt;p&gt;V1 had only two buckets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;effective_methods&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ineffective_methods&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;V2 adds a third:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;implementation_failed_methods&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The meanings are now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;effective_methods:&lt;/strong&gt; implemented in SASS and causally beneficial&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ineffective_methods:&lt;/strong&gt; implemented in SASS but not causally beneficial&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;implementation_failed_methods:&lt;/strong&gt; never truly manifested in machine code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This three-way split allows later iterations to distinguish between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“we tried it and it didn’t help”&lt;/li&gt;
&lt;li&gt;“we wrote it, but the compiler never really used it”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are fundamentally different failure modes and should be handled differently.&lt;/p&gt;

&lt;h4&gt;
  
  
  ③ &lt;code&gt;roofline_history&lt;/code&gt; (list)
&lt;/h4&gt;

&lt;p&gt;After each iteration, the round’s roofline result is appended here, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Δc, Δm, Δl&lt;/li&gt;
&lt;li&gt;bound type&lt;/li&gt;
&lt;li&gt;near-peak flag&lt;/li&gt;
&lt;li&gt;axis budget allocation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This serves two purposes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;it provides the data source for bottleneck-shift summaries&lt;/li&gt;
&lt;li&gt;it gives Claude historical trend information for future reasoning&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For example, if Δm keeps increasing across two rounds, that may indicate that previous compute optimizations introduced additional memory pressure.&lt;/p&gt;

&lt;h4&gt;
  
  
  ④ &lt;code&gt;frontier&lt;/code&gt; (list)
&lt;/h4&gt;

&lt;p&gt;This stores all branch candidates from all iterations that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;passed correctness&lt;/li&gt;
&lt;li&gt;but were not selected as champion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each record contains information such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;iteration number&lt;/li&gt;
&lt;li&gt;branch ID&lt;/li&gt;
&lt;li&gt;kernel path&lt;/li&gt;
&lt;li&gt;benchmark timing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The frontier is valuable because these kernels are &lt;strong&gt;valid but suboptimal&lt;/strong&gt; in the current setting. They may perform better on different GPUs, different input sizes, or serve as promising starting points for manual exploration.&lt;/p&gt;
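
&lt;p&gt;A hypothetical frontier record, assuming the fields listed above (the exact key names in &lt;code&gt;state.json&lt;/code&gt; may differ):&lt;/p&gt;

```python
# Hypothetical shape of one frontier record in state.json
# (exact key names may differ).
frontier_entry = {
    "iter": 2,
    "branch": 3,
    "kernel": "iterv2/branches/b3/kernel.cu",
    "ms": 2.31,
}
```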

&lt;h3&gt;
  
  
  5.2 Changes to the inputs of &lt;code&gt;state.py update&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;In V1, the &lt;code&gt;update&lt;/code&gt; command needed only three inputs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;kernel file&lt;/li&gt;
&lt;li&gt;benchmark result&lt;/li&gt;
&lt;li&gt;&lt;code&gt;methods.json&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In V2, it adds two new required inputs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;--attribution&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--sass-check&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means state updates no longer rely solely on the single signal of “did the whole kernel get faster,” but instead incorporate two finer-grained signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;causal attribution&lt;/li&gt;
&lt;li&gt;implementation validation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5.3 Changes to method classification logic
&lt;/h3&gt;

&lt;p&gt;This is the core difference in how the state gets updated.&lt;/p&gt;

&lt;p&gt;In V1, classification was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;binary&lt;/li&gt;
&lt;li&gt;group-based&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the whole kernel was faster than the previous one by more than 2%, all methods were marked effective; otherwise all were marked ineffective.&lt;/p&gt;

&lt;p&gt;In V2, classification becomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ternary&lt;/li&gt;
&lt;li&gt;per-method&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For each method, the update logic first checks SASS validation, then attribution, and assigns the method to one of the three buckets accordingly.&lt;/p&gt;

&lt;p&gt;Three methods selected in the same iteration may now end up in &lt;strong&gt;three different buckets&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.4 Richer history records
&lt;/h3&gt;

&lt;p&gt;In V1, each history entry contained only:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;iter&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;methods&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ms&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;speedup&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;status&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In V2, each history record is extended with attribution and SASS validation information, turning the history into a complete &lt;strong&gt;iteration audit log&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This makes it possible to reconstruct exactly what happened in any round, including the contribution and validation status of every method.&lt;/p&gt;
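
&lt;p&gt;Put together, a V2 history entry might look roughly like this (key names and values are assumptions for illustration):&lt;/p&gt;

```python
# Hypothetical V2 history entry: V1's five fields plus the two new
# per-method signals (key names are assumptions).
history_entry = {
    "iter": 3,
    "methods": ["A", "B", "C"],
    "ms": 2.14,
    "speedup": 1.42,
    "status": "improved",
    "attribution": {"A": 2.68, "B": 0.17, "C": 0.05},
    "sass_check": {"A": True, "B": True, "C": False},
}
```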

&lt;h3&gt;
  
  
  5.5 A shift in design philosophy
&lt;/h3&gt;

&lt;p&gt;V1’s &lt;code&gt;state&lt;/code&gt; was primarily a &lt;strong&gt;recorder&lt;/strong&gt;: it faithfully stored what happened in each round, but used coarse judgment rules.&lt;/p&gt;

&lt;p&gt;V2’s &lt;code&gt;state&lt;/code&gt; is much closer to a &lt;strong&gt;knowledge base&lt;/strong&gt;: it stores not only facts, but also:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;causal attribution&lt;/li&gt;
&lt;li&gt;compiler-level validation&lt;/li&gt;
&lt;li&gt;bottleneck evolution&lt;/li&gt;
&lt;li&gt;exploration frontier&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gives future iterations richer and more reliable context for decision-making.&lt;/p&gt;

&lt;p&gt;In other words, the optimization loop has evolved from:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;trial → record&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;into:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;trial → attribute → validate → learn&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  6. Results
&lt;/h2&gt;

&lt;p&gt;As an example, I ran one of the operator problems from&lt;br&gt;
&lt;a href="https://tensara.org/problems" rel="noopener noreferrer"&gt;https://tensara.org/problems&lt;/a&gt;&lt;br&gt;
in a local test environment built around an &lt;strong&gt;RTX 3060&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Because no A100 hardware was available for validation, the observed optimization gains still do not fully reflect the performance ceiling one might expect on an &lt;strong&gt;A100&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F91juirbrbdtrlniklddc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F91juirbrbdtrlniklddc.png" alt="tensara baseline" width="800" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmt7pua3c59ms769n2k4b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmt7pua3c59ms769n2k4b.png" alt="tensara best" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;⭐ &lt;a href="https://github.com/KernelFlow-ops/cuda-optimized-skill" rel="noopener noreferrer"&gt;KernelFlow-ops/cuda-optimized-skill&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you found this write-up helpful, a ⭐ Star on the repo would be greatly appreciated. Thank you!&lt;/p&gt;

</description>
      <category>cuda</category>
      <category>agents</category>
      <category>cutlass</category>
      <category>triton</category>
    </item>
  </channel>
</rss>
