🚀 The Evolution of the CUDA Optimization Skill
If the V1 optimizer felt like a rookie stumbling blindfolded through an arsenal—working hard, but relying mostly on luck—then V2 has evolved into a special-ops soldier equipped with radar scanning, shadow clones, and post-battle forensics.
Let’s peel back the code and see what kind of magic V2 introduced to turn CUDA optimization from something that felt like black art into something much closer to science.
🔗 Project link:
https://github.com/KernelFlow-ops/cuda-optimized-skill
If this project inspires you or helps you in any way, please consider giving it a ⭐ Star. Your support is the biggest motivation for continued iteration—thank you!
1. Roofline-Driven Axis Budget Allocation
1.1 The problem in V1: fixed allocation, no awareness of bottleneck imbalance
In V1, each iteration hard-coded the choice of one optimization method per axis: one for compute, one for memory, and one for latency, for a fixed total of three methods.
That means regardless of the kernel’s actual bottleneck distribution—even if compute utilization is only 8% while memory bandwidth is already near saturation—each axis still gets equal treatment. As a result, valuable optimization opportunities may be wasted on axes that have very little headroom left.
1.2 What V2 does: dynamic allocation based on performance gaps
In V2, after each Nsight Compute analysis and before method selection, a new roofline analysis stage is inserted, implemented by roofline.py.
This script reads two inputs:
- ncu_top.json from the current iteration, which contains measured performance metrics
- env.json, which contains the hardware's theoretical peak capabilities
It then computes three gap values:
- Δc (compute gap): the distance between current compute utilization and the hardware peak, in the range [0, 1]. A larger value means the kernel is farther from peak compute performance and has more room to improve.
- Δm (memory gap): the distance between current bandwidth utilization and the hardware peak.
- Δl (latency gap): the current maximum stall percentage, reflecting the severity of pipeline stalls.
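The gap computation can be sketched as follows. This is a minimal illustration, and the field names (`achieved_tflops`, `peak_gbps`, `max_stall_pct`, etc.) are stand-ins for whatever keys the real `ncu_top.json` and `env.json` actually use:

```python
def compute_gaps(ncu: dict, env: dict) -> dict:
    """Derive the three roofline gaps, each in [0, 1].

    `ncu` holds measured values from the current iteration;
    `env` holds the hardware's theoretical peaks.
    Field names are illustrative, not the skill's real JSON keys.
    """
    compute_util = min(1.0, ncu["achieved_tflops"] / env["peak_tflops"])
    mem_util = min(1.0, ncu["achieved_gbps"] / env["peak_gbps"])
    return {
        "dc": 1.0 - compute_util,            # Δc: room left on the compute axis
        "dm": 1.0 - mem_util,                # Δm: room left on the memory axis
        "dl": ncu["max_stall_pct"] / 100.0,  # Δl: worst stall share as a fraction
    }
```

For a kernel running at 2 of 25 peak TFLOPS, this yields Δc = 0.92, matching the example below.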
1.3 Budget allocation rules
Once the three Δ values are computed, the script allocates a fixed total budget of 3 method slots across the three axes proportionally.
There are two hard constraints:
- Per-axis cap = 2: even if one axis dominates the others, it can receive at most 2 method slots.
- Total budget = 3: the sum across compute, memory, and latency always remains 3.
For example, if:
- Δc = 0.92
- Δm = 0.57
- Δl = 0.61
then compute has the largest gap, and the allocation might become:
- compute = 2
- memory = 0
- latency = 1
This means that in the current round, the optimizer selects two methods from the compute axis, none from memory, and one from latency.
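One plausible allocation rule — a sketch only; the real roofline.py heuristic may weigh the gaps differently — is to hand out slots one at a time to the axis with the largest gap, letting an axis drop out once it hits the cap:

```python
def allocate_budget(dc: float, dm: float, dl: float,
                    total: int = 3, cap: int = 2) -> dict:
    """Greedy slot allocation: each of `total` slots goes to the
    open axis with the largest gap; an axis closes at `cap` slots.
    Illustrative, not the skill's actual implementation."""
    gaps = {"compute": dc, "memory": dm, "latency": dl}
    budget = {axis: 0 for axis in gaps}
    for _ in range(total):
        # Only axes that still have room compete for this slot.
        open_axes = {a: g for a, g in gaps.items() if budget[a] < cap}
        budget[max(open_axes, key=open_axes.get)] += 1
    return budget
```

With Δc = 0.92, Δm = 0.57, Δl = 0.61, this rule reproduces the 2/0/1 split in the example: compute takes the first two slots (then hits the cap), and latency edges out memory for the third.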
1.4 Early stopping
When all three gaps fall below 0.15, the roofline script outputs near_peak: true, indicating that the kernel is already very close to the hardware’s theoretical peak and that further optimization is unlikely to produce meaningful returns.
At that point, the loop terminates automatically.
V1 had no such mechanism. Even when the kernel was already near its limit, it would still blindly execute all scheduled iterations.
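The early-stop check itself is a one-liner; assuming the 0.15 threshold from the text:

```python
NEAR_PEAK_THRESHOLD = 0.15  # all gaps below this → stop optimizing

def near_peak(dc: float, dm: float, dl: float,
              threshold: float = NEAR_PEAK_THRESHOLD) -> bool:
    """True when every roofline gap is below the threshold,
    i.e. the kernel is close to the hardware's theoretical peak."""
    return all(g < threshold for g in (dc, dm, dl))
```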
1.5 How this changes method selection
Because axis budgets are no longer fixed at 1, the method selection logic changes as well.
In V1, the rule was essentially:
scan each axis once and pick the first method that passes the checks
In V2, method selection becomes budget-aware:
- If an axis budget is 0: skip it entirely
- If an axis budget is 1: behave like V1 and take the first method that passes all four checks
- If an axis budget is 2: collect all valid candidates on that axis, sort them by trigger strength (a continuous value in [0, 1] representing the strength of evidence that the method should fire), and take the top 2
That is also why V2 adds a new trigger_strength field to methods.schema.json: it becomes useful only when multiple methods on the same axis compete for multiple slots.
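The three budget cases can be sketched as below; the candidate shape (`axis`, `name`, `trigger_strength`) mirrors the schema field names but is otherwise an assumption:

```python
def select_methods(candidates: list, budget: dict) -> list:
    """Budget-aware selection over methods that already passed all checks.

    budget 0 → skip the axis; budget 1 → first valid method (V1 behaviour);
    budget 2 → top-2 by trigger_strength. Field names are illustrative.
    """
    selected = []
    for axis, slots in budget.items():
        if slots == 0:
            continue  # no room on this axis in the current round
        axis_cands = [m for m in candidates if m["axis"] == axis]
        if slots == 1:
            selected.extend(axis_cands[:1])  # first method that passed, as in V1
        else:
            # Multiple slots open: let candidates compete on trigger strength.
            axis_cands.sort(key=lambda m: m["trigger_strength"], reverse=True)
            selected.extend(axis_cands[:slots])
    return selected
```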
1.6 Roofline history tracking
Each iteration’s roofline result is appended to the roofline_history array in state.json.
This enables the final summary to show a bottleneck migration table, for example:
- round 1: compute-bound
- round 2: bandwidth-bound
- round 3: near-peak
This makes it much easier to visualize how the bottleneck shifts across axes as the kernel improves.
2. Branch-and-Select Exploration
2.1 The problem in V1: one candidate, hyperparameters chosen by luck
In V1, each iteration generated only one kernel.
Claude had to choose both:
- the optimization methods
- the hyperparameters, such as tile size, pipeline stages, warp count, and so on
But the best hyperparameter combination depends heavily on the hardware and problem size. It is very hard to get right in a single shot purely by reasoning.
As a result, even if the chosen optimization method is correct, poor hyperparameter choices may hide its actual value and cause the method to be wrongly judged as ineffective.
2.2 What V2 does: same method set, multiple hyperparameter branches, then select the winner
V2 decouples method selection from hyperparameter selection.
In each iteration, Claude first determines a set of optimization methods (step 3c), and then generates K branch candidates based on that exact same method combination. By default, K = 4.
These branches share the same optimization strategy, but differ in parameters such as:
- Tile size: e.g. 128×128×32 vs 128×256×32 vs 256×128×32
- Pipeline stages: e.g. 3-stage vs 4-stage vs 5-stage
- Warp count: e.g. 4 warps vs 8 warps
- Implementation variants inside a method: e.g. different swizzle patterns or different MMA atom choices
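Generating the K branches amounts to pairing one fixed method set with K points from a hyperparameter grid. A minimal sketch, where the grid values are illustrative rather than the skill's actual search space:

```python
import itertools

def make_branches(methods: list, k: int = 4) -> list:
    """Produce k branch configs that share `methods` but differ in
    hyperparameters. The parameter grid here is an assumption."""
    grid = itertools.product(
        [(128, 128, 32), (128, 256, 32), (256, 128, 32)],  # tile M×N×K
        [3, 4, 5],                                          # pipeline stages
        [4, 8],                                             # warps per block
    )
    return [
        {
            "branch_id": i,
            "methods": methods,  # identical across all branches
            "tile": tile,
            "stages": stages,
            "warps": warps,
        }
        for i, (tile, stages, warps) in enumerate(itertools.islice(grid, k))
    ]
```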
2.3 Champion selection
The script branch_explore.py compiles and benchmarks all K branches.
Importantly, this stage does not run Nsight Compute profiling. It only performs:
- correctness validation
- timing benchmark
This keeps overhead under control.
The selection rule is simple:
among all branches that pass correctness, pick the one with the lowest execution time as the champion
If none of the K branches pass validation, Claude must rewrite and resubmit, with up to 3 retries, just like in V1.
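The selection rule above fits in a few lines; the result-record fields (`correct`, `ms`) are assumed names:

```python
def pick_champion(results: list):
    """Among branches that pass correctness, return the one with the
    lowest measured time; None means every branch failed and the
    caller must rewrite and resubmit (up to 3 retries)."""
    valid = [r for r in results if r["correct"]]
    if not valid:
        return None  # triggers the rewrite-and-retry path
    return min(valid, key=lambda r: r["ms"])
```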
2.4 Frontier retention
The losing branches are not discarded.
All branches that pass correctness but are not selected as champion are stored in the frontier array in state.json.
These “suboptimal but valid” candidates form an exploration frontier for future iterations. For example, if a later optimization method happens to pair especially well with the hyperparameters of a previously losing branch, that branch can become a useful starting point.
2.5 How this appears in the directory structure
Each iteration now includes a new branches/ subdirectory, where every branch stores its own kernel source and benchmark result.
The champion kernel is copied to the iteration root as kernel.<ext>, becoming the official output of that iteration.
2.6 What fundamental problem this solves
Branch exploration transforms hyperparameter tuning from guesswork into experimentation.
In V1, if the agent picked tile = 128×128×32 but the true optimum was 128×256×32, the whole iteration could be wasted.
In V2, both candidates are actually compiled and benchmarked, so the value of the optimization method itself is much less likely to be hidden by a bad parameter choice.
This dramatically improves the information efficiency of every iteration.
3. Ablation-Based Attribution
3.1 The problem in V1: bundled judgment, no way to isolate individual contributions
In V1, each iteration applied 3 optimization methods and then checked whether total performance improved by more than a 2% noise threshold.
If yes, all three methods were added to effective_methods.
If not, all three were added to ineffective_methods.
This creates a serious attribution problem.
Suppose:
- method A contributes a 3 ms speedup
- method B contributes 0.5 ms
- method C actually slows things down by 1 ms
The total speedup would still be 2.5 ms, so all three methods would be labeled “effective.”
But in reality:
- A is highly valuable
- B is marginal
- C is harmful
V1 had no way to distinguish them.
3.2 What V2 does: per-method ablation experiments
After the champion is selected and fully profiled with Nsight Compute, V2 adds a new ablation stage, driven by ablate.py.
Suppose the chosen methods in this round are A, B, and C, and the champion kernel runs in 2.14 ms.
Ablation then generates three variants, each removing exactly one method:
- remove A → keep only B + C → benchmark = 4.82 ms
- remove B → keep only A + C → benchmark = 2.31 ms
- remove C → keep only A + B → benchmark = 2.19 ms
The attribution score for each method is defined as:
attribution(m) = ms_without_m − ms_champion
Using the numbers above:
- A: 4.82 − 2.14 = +2.68 ms. Removing A makes the kernel much slower → A contributed a lot
- B: 2.31 − 2.14 = +0.17 ms. Removing B makes the kernel slightly slower → B contributed positively, but modestly
- C: 2.19 − 2.14 = +0.05 ms. Removing C has almost no effect → below noise threshold, so C is treated as ineffective
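The attribution formula translates directly into code:

```python
def attribution(ms_without: dict, ms_champion: float) -> dict:
    """attribution(m) = ms_without_m − ms_champion for each ablated
    method; positive values mean removing the method hurt performance."""
    return {m: ms - ms_champion for m, ms in ms_without.items()}
```

Feeding in the numbers from the example (`{"A": 4.82, "B": 2.31, "C": 2.19}` against a 2.14 ms champion) recovers the +2.68 / +0.17 / +0.05 ms scores above.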
3.3 How attribution affects method classification
Attribution is not used in isolation. It is combined with SASS validation results, which we will discuss in the next section.
The full classification logic is:
- Attribution > noise threshold and SASS validation passed → add to effective_methods
- Attribution ≤ noise threshold but SASS validation passed → add to ineffective_methods. The method was genuinely implemented by the compiler, but did not help performance
- SASS validation failed, regardless of attribution → add to implementation_failed_methods. The method was never truly realized in the compiled code
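These rules reduce to a small decision function. The 0.05 ms noise threshold below is illustrative, not the skill's configured value:

```python
def classify(attribution_ms: float, sass_passed: bool,
             noise_ms: float = 0.05) -> str:
    """Map one method's ablation score and SASS result to a bucket."""
    if not sass_passed:
        # Never realized in machine code, whatever the timing says.
        return "implementation_failed_methods"
    if attribution_ms > noise_ms:
        return "effective_methods"
    return "ineffective_methods"
```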
3.4 Outputs of the ablation stage
Each iteration now contains a new ablations/ subdirectory.
For every ablated method, a corresponding subfolder stores:
- the ablated kernel source
- the benchmark result
The aggregated attribution data is written to iterv{i}/attribution.json.
3.5 What fundamental problem this solves
Ablation turns method effectiveness from a group-level judgment into a per-method causal test.
This provides two major benefits:
- The effective_methods list becomes much more trustworthy, so future iterations are less likely to be misled by false positives.
- The ineffective_methods list also becomes more precise. A good method is no longer rejected simply because it happened to be bundled with bad companions.
4. SASS-Level Instruction Validation
4.1 The problem in V1: claimed optimizations may exist only on paper
In V1, Claude might claim to have used a certain optimization method—say, tensor cores—and might even write the corresponding source code, such as inline mma.sync assembly or a CUTLASS HMMA path.
But the compiler may not actually adopt that intended path.
Common reasons include:
- compiler optimization overriding the handwritten intent
- register pressure forcing a fallback to scalar instructions
- pragma or attribute directives being silently ignored
V1 acknowledged this in its failure-mode documentation and suggested manually diffing Nsight Compute reports between iterations to see when “nothing really changed.”
But this remained only a suggestion. It was not automated, and could easily be skipped.
4.2 What V2 does: automatic validation at the disassembly level
V2 introduces sass_check.py and a reference file references/sass_signatures.json, forming an automated compiled-artifact verification mechanism.
sass_signatures.json maps each optimization method to the SASS instruction patterns that should appear if the method was truly realized. For example:
- compute.tensor_core → should contain HMMA instructions
- latency.async_pipeline → should contain LDGSTS or CP.ASYNC
- memory.smem_swizzle_xor → should show the expected shared-memory access signature
The workflow of sass_check.py is:
- run cuobjdump --dump-sass on the compiled champion kernel
- extract the SASS disassembly
- grep for the expected instruction signatures corresponding to each selected method
- output a pass/fail result per method to iterv{i}/sass_check.json
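The steps above can be sketched as follows. The signature table is a hand-written illustration of what `references/sass_signatures.json` maps; `cuobjdump --dump-sass` is the real CUDA binary-utilities flag:

```python
import re
import subprocess

# Illustrative subset of the method → SASS-pattern mapping.
SASS_SIGNATURES = {
    "compute.tensor_core": r"\bHMMA\b",
    "latency.async_pipeline": r"\b(LDGSTS|CP\.ASYNC)\b",
}

def match_signatures(sass_text: str, methods: list) -> dict:
    """Grep the disassembly for each method's expected instructions."""
    return {m: bool(re.search(SASS_SIGNATURES[m], sass_text)) for m in methods}

def sass_check(binary_path: str, methods: list) -> dict:
    """Dump SASS from the compiled kernel and check each method."""
    sass = subprocess.run(
        ["cuobjdump", "--dump-sass", binary_path],
        capture_output=True, text=True, check=True,
    ).stdout
    return match_signatures(sass, methods)
```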
4.3 How SASS validation and ablation work together
SASS validation and ablation answer two different questions:
- Ablation attribution: did this method have a causal performance contribution?
- SASS validation: was this method actually realized by the compiler in the machine code?
Together, they produce the three-way classification described earlier.
One especially important case is when:
- SASS validation fails
- but the kernel still gets faster overall
V2 handles this by marking the method as implementation_failed, while still allowing the kernel itself to remain the best result if it truly is faster.
That means the speedup may have come from hyperparameter changes, not from the claimed method itself. This is exactly the sort of distinction that V2’s attribution system can capture, but V1 completely missed.
4.4 Impact on future iterations
Methods placed into implementation_failed_methods are not automatically blacklisted in future rounds.
Instead, Claude must explicitly acknowledge the earlier implementation failure in analysis.md and explain why the method may succeed this time—for example:
- using a different code formulation
- changing compiler flags
- reducing register pressure
This is more flexible than V1’s simple blacklist approach. The method itself may still be good; only the previous implementation path failed.
5. Changes to the State Structure
5.1 Overview of newly added fields
Compared with V1, V2 adds four new top-level fields to state.json, each corresponding to one of the mechanisms above.
① branches (integer)
This records the number of branches explored per iteration, defaulting to 4.
It is initialized in state.py init via the --branches argument and remains fixed throughout the optimization run.
It controls:
- how many hyperparameter variants Claude must generate in step 3d
- how many candidates
branch_explore.pymust compile and benchmark in step 3e
② implementation_failed_methods (list)
V1 had only two buckets:
- effective_methods
- ineffective_methods
V2 adds a third:
- implementation_failed_methods
The meanings are now:
- effective_methods: implemented in SASS and causally beneficial
- ineffective_methods: implemented in SASS but not causally beneficial
- implementation_failed_methods: never truly manifested in machine code
This three-way split allows later iterations to distinguish between:
- “we tried it and it didn’t help”
- “we wrote it, but the compiler never really used it”
Those are fundamentally different failure modes and should be handled differently.
③ roofline_history (list)
After each iteration, the round’s roofline result is appended here, including:
- Δc, Δm, Δl
- bound type
- near-peak flag
- axis budget allocation
This serves two purposes:
- it provides the data source for bottleneck-shift summaries
- it gives Claude historical trend information for future reasoning
For example, if Δm keeps increasing across two rounds, that may indicate that previous compute optimizations introduced additional memory pressure.
④ frontier (list)
This stores all branch candidates from all iterations that:
- passed correctness
- but were not selected as champion
Each record contains information such as:
- iteration number
- branch ID
- kernel path
- benchmark timing
The frontier is valuable because these kernels are valid but suboptimal in the current setting. They may perform better on different GPUs, different input sizes, or serve as promising starting points for manual exploration.
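Appending to the frontier is straightforward bookkeeping; a sketch with assumed record field names:

```python
def record_frontier(state: dict, iteration: int, losers: list) -> None:
    """Store non-champion branches that passed correctness in
    state['frontier']. Field names follow the list above but are
    assumptions about the real state.json layout."""
    for r in losers:
        state.setdefault("frontier", []).append({
            "iter": iteration,
            "branch_id": r["branch_id"],
            "kernel_path": r["kernel_path"],
            "ms": r["ms"],
        })
```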
5.2 Changes to the inputs of state.py update
In V1, the update command needed only three inputs:
- kernel file
- benchmark result
- methods.json
In V2, it adds two new required inputs:
- --attribution
- --sass-check
This means state updates no longer rely solely on the single signal of “did the whole kernel get faster,” but instead incorporate two finer-grained signals:
- causal attribution
- implementation validation
5.3 Changes to method classification logic
This is the core difference in how the state gets updated.
In V1, classification was:
- binary
- group-based
If the whole kernel was faster than the previous one by more than 2%, all methods were marked effective; otherwise all were marked ineffective.
In V2, classification becomes:
- ternary
- per-method
For each method, the update logic first checks SASS validation, then attribution, and assigns the method to one of the three buckets accordingly.
Three methods selected in the same iteration may now end up in three different buckets.
5.4 Richer history records
In V1, each history entry contained only:
- iter
- methods
- ms
- speedup
- status
In V2, each history record is extended with attribution and SASS validation information, turning the history into a complete iteration audit log.
This makes it possible to reconstruct exactly what happened in any round, including the contribution and validation status of every method.
5.5 A shift in design philosophy
V1’s state was primarily a recorder: it faithfully stored what happened in each round, but used coarse judgment rules.
V2’s state is much closer to a knowledge base: it stores not only facts, but also:
- causal attribution
- compiler-level validation
- bottleneck evolution
- exploration frontier
This gives future iterations richer and more reliable context for decision-making.
In other words, the optimization loop has evolved from:
trial → record
into:
trial → attribute → validate → learn
6. Results
Using one of the operator problems from
https://tensara.org/problems
as an example, local testing was performed on an RTX 3060.
Consequently, the observed gains do not fully reflect the performance ceiling one might expect on an A100; the main limitation is simply that no A100 hardware was available for validation.
⭐ KernelFlow-ops/cuda-optimized-skill
If you found this write-up helpful, a ⭐ Star on the repo would be greatly appreciated. Thank you!


