DEV Community

aa24aa

From Black Magic to Science: The Evolution of the CUDA Optimization Skill

🚀 The Evolution of the CUDA Optimization Skill

If the V1 optimizer felt like a rookie stumbling blindfolded through an arsenal—working hard, but relying mostly on luck—then V2 has evolved into a special-ops soldier equipped with radar scanning, shadow clones, and post-battle forensics.

Let’s peel back the code and see what kind of magic V2 introduced to turn CUDA optimization from something that felt like black art into something much closer to science.

(Image: skill architecture)

🔗 Project link:
https://github.com/KernelFlow-ops/cuda-optimized-skill

If this project inspires you or helps you in any way, please consider giving it a ⭐ Star. Your support is the biggest motivation for continued iteration—thank you!

1. Roofline-Driven Axis Budget Allocation

1.1 The problem in V1: fixed allocation, no awareness of bottleneck imbalance

In V1, each iteration hard-coded the choice of one optimization method per axis: one for compute, one for memory, and one for latency, for a fixed total of three methods.

That means regardless of the kernel’s actual bottleneck distribution—even if compute utilization is only 8% while memory bandwidth is already near saturation—each axis still gets equal treatment. As a result, valuable optimization opportunities may be wasted on axes that have very little headroom left.

1.2 What V2 does: dynamic allocation based on performance gaps

In V2, after each Nsight Compute analysis and before method selection, a new roofline analysis stage is inserted, implemented by roofline.py.

This script reads two inputs:

  • ncu_top.json from the current iteration, which contains measured performance metrics
  • env.json, which contains the hardware’s theoretical peak capabilities

It then computes three gap values:

  • Δc (compute gap): the distance between current compute utilization and the hardware peak, in the range [0, 1]. A larger value means the kernel is farther from peak compute performance and has more room to improve.
  • Δm (memory gap): the distance between current bandwidth utilization and the hardware peak.
  • Δl (latency gap): the current maximum stall percentage, reflecting the severity of pipeline stalls.
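As a sketch, the gap computation might look like the following. The real implementation lives in roofline.py; the field names below are hypothetical stand-ins for values read from ncu_top.json and normalized against the peaks in env.json:

```python
def compute_gaps(metrics):
    """Turn normalized measurements into the three gap values.

    `metrics` holds fractions in [0, 1]: measured throughput already divided
    by the theoretical peak from env.json. Key names are illustrative, not
    the skill's actual JSON schema.
    """
    return {
        "compute": 1.0 - metrics["sm_throughput_frac"],   # Δc: headroom vs. peak compute
        "memory": 1.0 - metrics["dram_throughput_frac"],  # Δm: headroom vs. peak bandwidth
        "latency": metrics["max_stall_frac"],             # Δl: worst stall share
    }
```

A kernel at 8% compute utilization, 43% bandwidth utilization, and a 61% worst stall would yield gaps of roughly 0.92, 0.57, and 0.61.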

1.3 Budget allocation rules

Once the three Δ values are computed, the script allocates a fixed total budget of 3 method slots across the three axes proportionally.

There are two hard constraints:

  • Per-axis cap = 2: even if one axis dominates the others, it can receive at most 2 method slots.
  • Total budget = 3: the sum across compute, memory, and latency always remains 3.

For example, if:

  • Δc = 0.92
  • Δm = 0.57
  • Δl = 0.61

then compute has the largest gap, and the allocation might become:

  • compute = 2
  • memory = 0
  • latency = 1

This means that in the current round, the optimizer selects two methods from the compute axis, none from memory, and one from latency.
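A plausible reconstruction of the allocation rule, which reproduces the example above, is a greedy largest-gap-first scheme with the per-axis cap. The actual logic lives in roofline.py; this particular scheme is an assumption, not the skill's verified implementation:

```python
def allocate_budget(gap_c, gap_m, gap_l, total=3, cap=2, near_peak_eps=0.15):
    """Split `total` method slots across axes, largest gap first, capped per axis.

    Returns (budget_dict, near_peak). A hypothetical reconstruction of the
    rule described in the text, not the verbatim roofline.py code.
    """
    gaps = {"compute": gap_c, "memory": gap_m, "latency": gap_l}
    # Early stopping: every axis is within eps of the hardware roofline.
    if all(g < near_peak_eps for g in gaps.values()):
        return {axis: 0 for axis in gaps}, True
    budget = {axis: 0 for axis in gaps}
    remaining = total
    # Visit axes from largest gap to smallest, giving each up to `cap` slots.
    for axis in sorted(gaps, key=gaps.get, reverse=True):
        take = min(cap, remaining)
        budget[axis] = take
        remaining -= take
    return budget, False
```

With Δc = 0.92, Δm = 0.57, Δl = 0.61, this yields compute = 2, latency = 1, memory = 0, matching the example; when all gaps fall below 0.15 it returns the near-peak flag instead.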

1.4 Early stopping

When all three gaps fall below 0.15, the roofline script outputs near_peak: true, indicating that the kernel is already very close to the hardware’s theoretical peak and that further optimization is unlikely to produce meaningful returns.

At that point, the loop terminates automatically.

V1 had no such mechanism. Even when the kernel was already near its limit, it would still blindly execute all scheduled iterations.

1.5 How this changes method selection

Because axis budgets are no longer fixed at 1, the method selection logic changes as well.

In V1, the rule was essentially:

scan each axis once and pick the first method that passes the checks

In V2, method selection becomes budget-aware:

  • If an axis budget is 0: skip it entirely
  • If an axis budget is 1: behave like V1 and take the first method that passes all four checks
  • If an axis budget is 2: collect all valid candidates on that axis, sort them by trigger strength (a continuous value in [0, 1] representing the strength of evidence that the method should fire), and take the top 2
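A minimal sketch of this budget-aware selection, assuming each candidate carries its trigger_strength and the outcome of the validity checks. The dict shape here is hypothetical, not the actual methods.schema.json layout:

```python
def select_methods(candidates, budget):
    """Pick method IDs per axis according to the axis budget.

    `candidates` maps axis -> list of dicts like
    {"id": "compute.tensor_core", "trigger_strength": 0.8, "valid": True},
    already in scan order. An illustrative shape only.
    """
    selected = []
    for axis, slots in budget.items():
        if slots == 0:
            continue  # budget 0: skip the axis entirely
        valid = [m for m in candidates.get(axis, []) if m["valid"]]
        if slots == 1:
            # V1 behavior: take the first method that passes the checks.
            if valid:
                selected.append(valid[0]["id"])
        else:
            # Rank competing candidates by trigger strength, take the top `slots`.
            ranked = sorted(valid, key=lambda m: m["trigger_strength"], reverse=True)
            selected.extend(m["id"] for m in ranked[:slots])
    return selected
```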

That is also why V2 adds a new trigger_strength field to methods.schema.json: it becomes useful only when multiple methods on the same axis compete for multiple slots.

1.6 Roofline history tracking

Each iteration’s roofline result is appended to the roofline_history array in state.json.

This enables the final summary to show a bottleneck migration table, for example:

  • round 1: compute-bound
  • round 2: bandwidth-bound
  • round 3: near-peak

This makes it much easier to visualize how the bottleneck shifts across axes as the kernel improves.


2. Branch-and-Select Exploration

2.1 The problem in V1: one candidate, hyperparameters chosen by luck

In V1, each iteration generated only one kernel.

Claude had to choose both:

  • the optimization methods
  • the hyperparameters, such as tile size, pipeline stages, warp count, and so on

But the best hyperparameter combination depends heavily on the hardware and problem size. It is very hard to get right in a single shot purely by reasoning.

As a result, even if the chosen optimization method is correct, poor hyperparameter choices may hide its actual value and cause the method to be wrongly judged as ineffective.

2.2 What V2 does: same method set, multiple hyperparameter branches, then select the winner

V2 decouples method selection from hyperparameter selection.

In each iteration, Claude first determines a set of optimization methods (step 3c), and then generates K branch candidates based on that exact same method combination. By default, K = 4.

These branches share the same optimization strategy, but differ in parameters such as:

  • Tile size: e.g. 128×128×32 vs 128×256×32 vs 256×128×32
  • Pipeline stages: e.g. 3-stage vs 4-stage vs 5-stage
  • Warp count: e.g. 4 warps vs 8 warps
  • Implementation variants inside a method: e.g. different swizzle patterns or different MMA atom choices

2.3 Champion selection

The script branch_explore.py compiles and benchmarks all K branches.

Importantly, this stage does not run Nsight Compute profiling. It only performs:

  • correctness validation
  • timing benchmark

This keeps overhead under control.

The selection rule is simple:

among all branches that pass correctness, pick the one with the lowest execution time as the champion

If none of the K branches pass validation, Claude must rewrite and resubmit, with up to 3 retries, just like in V1.
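The champion-selection step can be sketched as follows. `validate` and `benchmark` are placeholders for the real compile/correctness/timing harness inside branch_explore.py, so this is a shape sketch rather than the actual script:

```python
def pick_champion(branches, benchmark, validate):
    """Benchmark all K branches and pick the fastest correct one.

    `branches` is a list of kernel identifiers; `validate` and `benchmark`
    stand in for the real correctness and timing harness. Returns
    (champion, frontier), where `frontier` holds correct-but-slower branches,
    or (None, []) if no branch passes correctness.
    """
    passing = []
    for branch in branches:
        if not validate(branch):          # correctness check only; no ncu here
            continue
        passing.append({"kernel": branch, "ms": benchmark(branch)})
    if not passing:
        return None, []                   # caller must rewrite (up to 3 retries)
    passing.sort(key=lambda b: b["ms"])
    champion, frontier = passing[0], passing[1:]
    return champion, frontier
```

The non-champion survivors in `frontier` are exactly what gets appended to the frontier array in state.json.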

2.4 Frontier retention

The losing branches are not discarded.

All branches that pass correctness but are not selected as champion are stored in the frontier array in state.json.

These “suboptimal but valid” candidates form an exploration frontier for future iterations. For example, if a later optimization method happens to pair especially well with the hyperparameters of a previously losing branch, that branch can become a useful starting point.

2.5 How this appears in the directory structure

Each iteration now includes a new branches/ subdirectory, where every branch stores its own kernel source and benchmark result.

The champion kernel is copied to the iteration root as kernel.<ext>, becoming the official output of that iteration.

2.6 What fundamental problem this solves

Branch exploration transforms hyperparameter tuning from guesswork into experimentation.

In V1, if the agent picked tile = 128×128×32 but the true optimum was 128×256×32, the whole iteration could be wasted.

In V2, both candidates are actually compiled and benchmarked, so the value of the optimization method itself is much less likely to be hidden by a bad parameter choice.

This dramatically improves the information efficiency of every iteration.


3. Ablation-Based Attribution

3.1 The problem in V1: bundled judgment, no way to isolate individual contributions

In V1, each iteration applied 3 optimization methods and then checked whether total performance improved by more than a 2% noise threshold.

If yes, all three methods were added to effective_methods.
If not, all three were added to ineffective_methods.

This creates a serious attribution problem.

Suppose:

  • method A contributes a 3 ms speedup
  • method B contributes 0.5 ms
  • method C actually slows things down by 1 ms

The total speedup would still be 2.5 ms, so all three methods would be labeled “effective.”

But in reality:

  • A is highly valuable
  • B is marginal
  • C is harmful

V1 had no way to distinguish them.

3.2 What V2 does: per-method ablation experiments

After the champion is selected and fully profiled with Nsight Compute, V2 adds a new ablation stage, driven by ablate.py.

Suppose the chosen methods in this round are A, B, and C, and the champion kernel runs in 2.14 ms.

Ablation then generates three variants, each removing exactly one method:

  • remove A → keep only B + C → benchmark = 4.82 ms
  • remove B → keep only A + C → benchmark = 2.31 ms
  • remove C → keep only A + B → benchmark = 2.19 ms

The attribution score for each method is defined as:

attribution(m) = ms_without_m − ms_champion

Using the numbers above:

  • A: 4.82 − 2.14 = +2.68 ms. Removing A makes the kernel much slower, so A contributed a lot.
  • B: 2.31 − 2.14 = +0.17 ms. Removing B makes the kernel slightly slower, so B contributed positively, but modestly.
  • C: 2.19 − 2.14 = +0.05 ms. Removing C has almost no effect; the delta falls below the noise threshold, so C is treated as ineffective.
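The scoring step is a one-line formula per method; a sketch of it, with an illustrative noise floor (the skill's actual threshold value is not specified here):

```python
def attribution_scores(champion_ms, ablation_ms, noise_ms=0.1):
    """attribution(m) = ms_without_m - ms_champion, per ablated method.

    `ablation_ms` maps method id -> runtime with that one method removed.
    `noise_ms` is an illustrative noise floor, not the skill's actual value.
    """
    scores = {}
    for method, ms in ablation_ms.items():
        delta = ms - champion_ms
        scores[method] = {"delta_ms": round(delta, 2),
                          "above_noise": delta > noise_ms}
    return scores
```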

3.3 How attribution affects method classification

Attribution is not used in isolation. It is combined with SASS validation results, which we will discuss in the next section.

The full classification logic is:

  • Attribution > noise threshold and SASS validation passed → add to effective_methods.
  • Attribution ≤ noise threshold but SASS validation passed → add to ineffective_methods. The method was genuinely implemented by the compiler, but did not help performance.
  • SASS validation failed, regardless of attribution → add to implementation_failed_methods. The method was never truly realized in the compiled code.
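The three rules above collapse into a small decision function; a sketch, with an illustrative threshold:

```python
def classify_method(attribution_ms, sass_passed, noise_ms=0.1):
    """Combine causal attribution with SASS validation into one of the three
    buckets. The noise threshold here is illustrative."""
    if not sass_passed:
        # The method was never realized in machine code, whatever its score.
        return "implementation_failed_methods"
    if attribution_ms > noise_ms:
        return "effective_methods"        # realized and causally beneficial
    return "ineffective_methods"          # realized, but did not help
```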

3.4 Outputs of the ablation stage

Each iteration now contains a new ablations/ subdirectory.

For every ablated method, a corresponding subfolder stores:

  • the ablated kernel source
  • the benchmark result

The aggregated attribution data is written to iterv{i}/attribution.json.

3.5 What fundamental problem this solves

Ablation turns method effectiveness from a group-level judgment into a per-method causal test.

This provides two major benefits:

  1. The effective_methods list becomes much more trustworthy, so future iterations are less likely to be misled by false positives.
  2. The ineffective_methods list also becomes more precise. A good method is no longer rejected simply because it happened to be bundled with bad companions.

4. SASS-Level Instruction Validation

4.1 The problem in V1: claimed optimizations may exist only on paper

In V1, Claude might claim to have used a certain optimization method—say, tensor cores—and might even write the corresponding source code, such as inline mma.sync assembly or a CUTLASS HMMA path.

But the compiler may not actually adopt that intended path.

Common reasons include:

  • compiler optimization overriding the handwritten intent
  • register pressure forcing a fallback to scalar instructions
  • pragma or attribute directives being silently ignored

V1 acknowledged this in its failure-mode documentation and suggested manually diffing Nsight Compute reports between iterations to see when “nothing really changed.”

But this remained only a suggestion. It was not automated, and could easily be skipped.

4.2 What V2 does: automatic validation at the disassembly level

V2 introduces sass_check.py and a reference file references/sass_signatures.json, forming an automated compiled-artifact verification mechanism.

sass_signatures.json maps each optimization method to the SASS instruction patterns that should appear if the method was truly realized. For example:

  • compute.tensor_core → should contain HMMA instructions
  • latency.async_pipeline → should contain LDGSTS or CP.ASYNC
  • memory.smem_swizzle_xor → should show the expected shared-memory access signature

The workflow of sass_check.py is:

  1. run cuobjdump --dump-sass on the compiled champion kernel
  2. extract the SASS disassembly
  3. grep for the expected instruction signatures corresponding to each selected method
  4. output a pass/fail result per method to iterv{i}/sass_check.json
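The core of that workflow is string matching over the disassembly; a sketch of steps 3 and 4, with a hypothetical signature table mirroring references/sass_signatures.json (the real table and key names may differ):

```python
import re
import subprocess

# Hypothetical signature table; the real one lives in references/sass_signatures.json.
SASS_SIGNATURES = {
    "compute.tensor_core": [r"\bHMMA\b"],
    "latency.async_pipeline": [r"\bLDGSTS\b", r"CP\.ASYNC"],
}

def dump_sass(binary_path):
    """Disassemble a compiled kernel (requires the CUDA toolkit on PATH)."""
    return subprocess.run(["cuobjdump", "--dump-sass", binary_path],
                          capture_output=True, text=True, check=True).stdout

def check_signatures(sass_text, methods, signatures=SASS_SIGNATURES):
    """Per-method pass/fail: a method passes if any of its expected
    instruction patterns appears in the disassembly."""
    return {m: any(re.search(p, sass_text) for p in signatures.get(m, []))
            for m in methods}
```

Running `check_signatures` on a disassembly that contains HMMA but no LDGSTS/CP.ASYNC would pass compute.tensor_core and fail latency.async_pipeline, which is exactly the "claimed but never realized" case the next subsection discusses.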

4.3 How SASS validation and ablation work together

SASS validation and ablation answer two different questions:

  • Ablation attribution: did this method have a causal performance contribution?
  • SASS validation: was this method actually realized by the compiler in the machine code?

Together, they produce the three-way classification described earlier.

One especially important case is when:

  • SASS validation fails
  • but the kernel still gets faster overall

V2 handles this by marking the method as implementation_failed, while still allowing the kernel itself to remain the best result if it truly is faster.

That means the speedup may have come from hyperparameter changes rather than from the claimed method itself. This is exactly the sort of distinction that V2’s attribution system can capture and that V1 completely missed.

4.4 Impact on future iterations

Methods placed into implementation_failed_methods are not automatically blacklisted in future rounds.

Instead, Claude must explicitly acknowledge the earlier implementation failure in analysis.md and explain why the method may succeed this time—for example:

  • using a different code formulation
  • changing compiler flags
  • reducing register pressure

This is more flexible than V1’s simple blacklist approach. The method itself may still be good; only the previous implementation path failed.


5. Changes to the State Structure

5.1 Overview of newly added fields

Compared with V1, V2 adds four new top-level fields to state.json, each corresponding to one of the mechanisms above.

branches (integer)

This records the number of branches explored per iteration, defaulting to 4.

It is initialized in state.py init via the --branches argument and remains fixed throughout the optimization run.

It controls:

  • how many hyperparameter variants Claude must generate in step 3d
  • how many candidates branch_explore.py must compile and benchmark in step 3e

implementation_failed_methods (list)

V1 had only two buckets:

  • effective_methods
  • ineffective_methods

V2 adds a third:

  • implementation_failed_methods

The meanings are now:

  • effective_methods: implemented in SASS and causally beneficial
  • ineffective_methods: implemented in SASS but not causally beneficial
  • implementation_failed_methods: never truly manifested in machine code

This three-way split allows later iterations to distinguish between:

  • “we tried it and it didn’t help”
  • “we wrote it, but the compiler never really used it”

Those are fundamentally different failure modes and should be handled differently.

roofline_history (list)

After each iteration, the round’s roofline result is appended here, including:

  • Δc, Δm, Δl
  • bound type
  • near-peak flag
  • axis budget allocation

This serves two purposes:

  1. it provides the data source for bottleneck-shift summaries
  2. it gives Claude historical trend information for future reasoning

For example, if Δm keeps increasing across two rounds, that may indicate that previous compute optimizations introduced additional memory pressure.

frontier (list)

This stores all branch candidates from all iterations that:

  • passed correctness
  • but were not selected as champion

Each record contains information such as:

  • iteration number
  • branch ID
  • kernel path
  • benchmark timing

The frontier is valuable because these kernels are valid but suboptimal in the current setting. They may perform better on different GPUs, different input sizes, or serve as promising starting points for manual exploration.
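Putting the four new fields together, a state.json excerpt might look like the following. Every value here is illustrative, and the exact key names inside each record are assumptions based on the descriptions above:

```python
# Hypothetical excerpt of state.json after one iteration; field names follow
# the text above, values and paths are purely illustrative.
STATE_EXAMPLE = {
    "branches": 4,
    "effective_methods": ["compute.tensor_core"],
    "ineffective_methods": ["memory.smem_swizzle_xor"],
    "implementation_failed_methods": ["latency.async_pipeline"],
    "roofline_history": [
        {"iter": 1,
         "gaps": {"compute": 0.92, "memory": 0.57, "latency": 0.61},
         "bound": "compute", "near_peak": False,
         "budget": {"compute": 2, "memory": 0, "latency": 1}},
    ],
    "frontier": [
        {"iter": 1, "branch": "b2",
         "kernel": "iterv1/branches/b2/kernel.cu", "ms": 2.31},
    ],
}
```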

5.2 Changes to the inputs of state.py update

In V1, the update command needed only three inputs:

  • kernel file
  • benchmark result
  • methods.json

In V2, it adds two new required inputs:

  • --attribution
  • --sass-check

This means state updates no longer rely solely on the single signal of “did the whole kernel get faster,” but instead incorporate two finer-grained signals:

  • causal attribution
  • implementation validation

5.3 Changes to method classification logic

This is the core difference in how the state gets updated.

In V1, classification was:

  • binary
  • group-based

If the whole kernel was faster than the previous one by more than 2%, all methods were marked effective; otherwise all were marked ineffective.

In V2, classification becomes:

  • ternary
  • per-method

For each method, the update logic first checks SASS validation, then attribution, and assigns the method to one of the three buckets accordingly.

Three methods selected in the same iteration may now end up in three different buckets.

5.4 Richer history records

In V1, each history entry contained only:

  • iter
  • methods
  • ms
  • speedup
  • status

In V2, each history record is extended with attribution and SASS validation information, turning the history into a complete iteration audit log.

This makes it possible to reconstruct exactly what happened in any round, including the contribution and validation status of every method.

5.5 A shift in design philosophy

V1’s state was primarily a recorder: it faithfully stored what happened in each round, but used coarse judgment rules.

V2’s state is much closer to a knowledge base: it stores not only facts, but also:

  • causal attribution
  • compiler-level validation
  • bottleneck evolution
  • exploration frontier

This gives future iterations richer and more reliable context for decision-making.

In other words, the optimization loop has evolved from:

trial → record

into:

trial → attribute → validate → learn

6. Results

As an example, I ran the skill on one of the operator problems from
https://tensara.org/problems. The local test environment used an RTX 3060.

As a result, the observed optimization gains still do not fully reflect the performance ceiling one might expect on an A100. The main limitation is simply that no A100 hardware was available for validation.

(Image: Tensara baseline result)

(Image: Tensara best result)

(Image: result from KernelFlow-ops/cuda-optimized-skill)

If you found this write-up helpful, a ⭐ Star on the repo would be greatly appreciated. Thank you!
