DEV Community

aa24aa

From Black Magic to Science: The Evolution of the CUDA Optimization Skill

🚀 The Evolution of the CUDA Optimization Skill

If the V1 optimizer felt like a rookie stumbling blindfolded through an arsenal—working hard, but relying mostly on luck—then V2 has evolved into a special-ops soldier equipped with radar scanning, shadow clones, and post-battle forensics.

Let’s peel back the code and see what kind of magic V2 introduced to turn CUDA optimization from something that felt like black art into something much closer to science.

(Image: skill architecture)

🔗 Project link:
https://github.com/KernelFlow-ops/cuda-optimized-skill

If this project inspires you or helps you in any way, please consider giving it a ⭐ Star. Your support is the biggest motivation for continued iteration—thank you!

1. Roofline-Driven Axis Budget Allocation

1.1 The problem in V1: fixed allocation, no awareness of bottleneck imbalance

In V1, each iteration hard-coded the choice of one optimization method per axis: one for compute, one for memory, and one for latency, for a fixed total of three methods.

That means regardless of the kernel’s actual bottleneck distribution—even if compute utilization is only 8% while memory bandwidth is already near saturation—each axis still gets equal treatment. As a result, valuable optimization opportunities may be wasted on axes that have very little headroom left.

1.2 What V2 does: dynamic allocation based on performance gaps

In V2, after each Nsight Compute analysis and before method selection, a new roofline analysis stage is inserted, implemented by roofline.py.

This script reads two inputs:

  • ncu_top.json from the current iteration, which contains measured performance metrics
  • env.json, which contains the hardware’s theoretical peak capabilities

It then computes three gap values:

  • Δc (compute gap): the distance between current compute utilization and the hardware peak, in the range [0, 1]. A larger value means the kernel is farther from peak compute performance and has more room to improve.
  • Δm (memory gap): the distance between current bandwidth utilization and the hardware peak.
  • Δl (latency gap): the current maximum stall percentage, reflecting the severity of pipeline stalls.
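As a sketch, the gap computation might look like the following. The real implementation lives in roofline.py; the field names below are hypothetical stand-ins for values read from ncu_top.json and normalized against the peaks in env.json:

```python
def compute_gaps(metrics):
    """Turn normalized measurements into the three gap values.

    `metrics` holds fractions in [0, 1]: measured throughput already divided
    by the theoretical peak from env.json. Key names are illustrative, not
    the skill's actual JSON schema.
    """
    return {
        "compute": 1.0 - metrics["sm_throughput_frac"],   # Δc: headroom vs. peak compute
        "memory": 1.0 - metrics["dram_throughput_frac"],  # Δm: headroom vs. peak bandwidth
        "latency": metrics["max_stall_frac"],             # Δl: worst stall share
    }
```

A kernel at 8% compute utilization, 43% bandwidth utilization, and a 61% worst stall would yield gaps of roughly 0.92, 0.57, and 0.61.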

1.3 Budget allocation rules

Once the three Δ values are computed, the script allocates a fixed total budget of 3 method slots across the three axes proportionally.

There are two hard constraints:

  • Per-axis cap = 2: even if one axis dominates the others, it can receive at most 2 method slots.
  • Total budget = 3: the sum across compute, memory, and latency always remains 3.

For example, if:

  • Δc = 0.92
  • Δm = 0.57
  • Δl = 0.61

then compute has the largest gap, and the allocation might become:

  • compute = 2
  • memory = 0
  • latency = 1

This means that in the current round, the optimizer selects two methods from the compute axis, none from memory, and one from latency.
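A plausible reconstruction of the allocation rule, which reproduces the example above, is a greedy largest-gap-first scheme with the per-axis cap. The actual logic lives in roofline.py; this particular scheme is an assumption, not the skill's verified implementation:

```python
def allocate_budget(gap_c, gap_m, gap_l, total=3, cap=2, near_peak_eps=0.15):
    """Split `total` method slots across axes, largest gap first, capped per axis.

    Returns (budget_dict, near_peak). A hypothetical reconstruction of the
    rule described in the text, not the verbatim roofline.py code.
    """
    gaps = {"compute": gap_c, "memory": gap_m, "latency": gap_l}
    # Early stopping: every axis is within eps of the hardware roofline.
    if all(g < near_peak_eps for g in gaps.values()):
        return {axis: 0 for axis in gaps}, True
    budget = {axis: 0 for axis in gaps}
    remaining = total
    # Visit axes from largest gap to smallest, giving each up to `cap` slots.
    for axis in sorted(gaps, key=gaps.get, reverse=True):
        take = min(cap, remaining)
        budget[axis] = take
        remaining -= take
    return budget, False
```

With Δc = 0.92, Δm = 0.57, Δl = 0.61, this yields compute = 2, latency = 1, memory = 0, matching the example; when all gaps fall below 0.15 it returns the near-peak flag instead.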

1.4 Early stopping

When all three gaps fall below 0.15, the roofline script outputs near_peak: true, indicating that the kernel is already very close to the hardware’s theoretical peak and that further optimization is unlikely to produce meaningful returns.

At that point, the loop terminates automatically.

V1 had no such mechanism. Even when the kernel was already near its limit, it would still blindly execute all scheduled iterations.

1.5 How this changes method selection

Because axis budgets are no longer fixed at 1, the method selection logic changes as well.

In V1, the rule was essentially:

scan each axis once and pick the first method that passes the checks

In V2, method selection becomes budget-aware:

  • If an axis budget is 0: skip it entirely
  • If an axis budget is 1: behave like V1 and take the first method that passes all four checks
  • If an axis budget is 2: collect all valid candidates on that axis, sort them by trigger strength (a continuous value in [0, 1] representing the strength of evidence that the method should fire), and take the top 2
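A minimal sketch of this budget-aware selection, assuming each candidate carries its trigger_strength and the outcome of the validity checks. The dict shape here is hypothetical, not the actual methods.schema.json layout:

```python
def select_methods(candidates, budget):
    """Pick method IDs per axis according to the axis budget.

    `candidates` maps axis -> list of dicts like
    {"id": "compute.tensor_core", "trigger_strength": 0.8, "valid": True},
    already in scan order. An illustrative shape only.
    """
    selected = []
    for axis, slots in budget.items():
        if slots == 0:
            continue  # budget 0: skip the axis entirely
        valid = [m for m in candidates.get(axis, []) if m["valid"]]
        if slots == 1:
            # V1 behavior: take the first method that passes the checks.
            if valid:
                selected.append(valid[0]["id"])
        else:
            # Rank competing candidates by trigger strength, take the top `slots`.
            ranked = sorted(valid, key=lambda m: m["trigger_strength"], reverse=True)
            selected.extend(m["id"] for m in ranked[:slots])
    return selected
```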

That is also why V2 adds a new trigger_strength field to methods.schema.json: it becomes useful only when multiple methods on the same axis compete for multiple slots.

1.6 Roofline history tracking

Each iteration’s roofline result is appended to the roofline_history array in state.json.

This enables the final summary to show a bottleneck migration table, for example:

  • round 1: compute-bound
  • round 2: bandwidth-bound
  • round 3: near-peak

This makes it much easier to visualize how the bottleneck shifts across axes as the kernel improves.


2. Branch-and-Select Exploration

2.1 The problem in V1: one candidate, hyperparameters chosen by luck

In V1, each iteration generated only one kernel.

Claude had to choose both:

  • the optimization methods
  • the hyperparameters, such as tile size, pipeline stages, warp count, and so on

But the best hyperparameter combination depends heavily on the hardware and problem size. It is very hard to get right in a single shot purely by reasoning.

As a result, even if the chosen optimization method is correct, poor hyperparameter choices may hide its actual value and cause the method to be wrongly judged as ineffective.

2.2 What V2 does: same method set, multiple hyperparameter branches, then select the winner

V2 decouples method selection from hyperparameter selection.

In each iteration, Claude first determines a set of optimization methods (step 3c), and then generates K branch candidates based on that exact same method combination. By default, K = 4.

These branches share the same optimization strategy, but differ in parameters such as:

  • Tile size: e.g. 128×128×32 vs 128×256×32 vs 256×128×32
  • Pipeline stages: e.g. 3-stage vs 4-stage vs 5-stage
  • Warp count: e.g. 4 warps vs 8 warps
  • Implementation variants inside a method: e.g. different swizzle patterns or different MMA atom choices

2.3 Champion selection

The script branch_explore.py compiles and benchmarks all K branches.

Importantly, this stage does not run Nsight Compute profiling. It only performs:

  • correctness validation
  • timing benchmark

This keeps overhead under control.

The selection rule is simple:

among all branches that pass correctness, pick the one with the lowest execution time as the champion

If none of the K branches pass validation, Claude must rewrite and resubmit, with up to 3 retries, just like in V1.
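The champion-selection step can be sketched as follows. `validate` and `benchmark` are placeholders for the real compile/correctness/timing harness inside branch_explore.py, so this is a shape sketch rather than the actual script:

```python
def pick_champion(branches, benchmark, validate):
    """Benchmark all K branches and pick the fastest correct one.

    `branches` is a list of kernel identifiers; `validate` and `benchmark`
    stand in for the real correctness and timing harness. Returns
    (champion, frontier), where `frontier` holds correct-but-slower branches,
    or (None, []) if no branch passes correctness.
    """
    passing = []
    for branch in branches:
        if not validate(branch):          # correctness check only; no ncu here
            continue
        passing.append({"kernel": branch, "ms": benchmark(branch)})
    if not passing:
        return None, []                   # caller must rewrite (up to 3 retries)
    passing.sort(key=lambda b: b["ms"])
    champion, frontier = passing[0], passing[1:]
    return champion, frontier
```

The non-champion survivors in `frontier` are exactly what gets appended to the frontier array in state.json.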

2.4 Frontier retention

The losing branches are not discarded.

All branches that pass correctness but are not selected as champion are stored in the frontier array in state.json.

These “suboptimal but valid” candidates form an exploration frontier for future iterations. For example, if a later optimization method happens to pair especially well with the hyperparameters of a previously losing branch, that branch can become a useful starting point.

2.5 How this appears in the directory structure

Each iteration now includes a new branches/ subdirectory, where every branch stores its own kernel source and benchmark result.

The champion kernel is copied to the iteration root as kernel.<ext>, becoming the official output of that iteration.

2.6 What fundamental problem this solves

Branch exploration transforms hyperparameter tuning from guesswork into experimentation.

In V1, if the agent picked tile = 128×128×32 but the true optimum was 128×256×32, the whole iteration could be wasted.

In V2, both candidates are actually compiled and benchmarked, so the value of the optimization method itself is much less likely to be hidden by a bad parameter choice.

This dramatically improves the information efficiency of every iteration.


3. Ablation-Based Attribution

3.1 The problem in V1: bundled judgment, no way to isolate individual contributions

In V1, each iteration applied 3 optimization methods and then checked whether total performance improved by more than a 2% noise threshold.

If yes, all three methods were added to effective_methods.
If not, all three were added to ineffective_methods.

This creates a serious attribution problem.

Suppose:

  • method A contributes a 3 ms speedup
  • method B contributes 0.5 ms
  • method C actually slows things down by 1 ms

The total speedup would still be 2.5 ms, so all three methods would be labeled “effective.”

But in reality:

  • A is highly valuable
  • B is marginal
  • C is harmful

V1 had no way to distinguish them.

3.2 What V2 does: per-method ablation experiments

After the champion is selected and fully profiled with Nsight Compute, V2 adds a new ablation stage, driven by ablate.py.

Suppose the chosen methods in this round are A, B, and C, and the champion kernel runs in 2.14 ms.

Ablation then generates three variants, each removing exactly one method:

  • remove A → keep only B + C → benchmark = 4.82 ms
  • remove B → keep only A + C → benchmark = 2.31 ms
  • remove C → keep only A + B → benchmark = 2.19 ms

The attribution score for each method is defined as:

attribution(m) = ms_without_m − ms_champion

Using the numbers above:

  • A: 4.82 − 2.14 = +2.68 ms. Removing A makes the kernel much slower, so A contributed a lot.
  • B: 2.31 − 2.14 = +0.17 ms. Removing B makes the kernel slightly slower, so B contributed positively, but modestly.
  • C: 2.19 − 2.14 = +0.05 ms. Removing C has almost no effect; the delta falls below the noise threshold, so C is treated as ineffective.
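The scoring step is a one-line formula per method; a sketch of it, with an illustrative noise floor (the skill's actual threshold value is not specified here):

```python
def attribution_scores(champion_ms, ablation_ms, noise_ms=0.1):
    """attribution(m) = ms_without_m - ms_champion, per ablated method.

    `ablation_ms` maps method id -> runtime with that one method removed.
    `noise_ms` is an illustrative noise floor, not the skill's actual value.
    """
    scores = {}
    for method, ms in ablation_ms.items():
        delta = ms - champion_ms
        scores[method] = {"delta_ms": round(delta, 2),
                          "above_noise": delta > noise_ms}
    return scores
```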

3.3 How attribution affects method classification

Attribution is not used in isolation. It is combined with SASS validation results, which we will discuss in the next section.

The full classification logic is:

  • Attribution > noise threshold and SASS validation passed → add to effective_methods.
  • Attribution ≤ noise threshold but SASS validation passed → add to ineffective_methods. The method was genuinely implemented by the compiler, but did not help performance.
  • SASS validation failed, regardless of attribution → add to implementation_failed_methods. The method was never truly realized in the compiled code.
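The three rules above collapse into a small decision function; a sketch, with an illustrative threshold:

```python
def classify_method(attribution_ms, sass_passed, noise_ms=0.1):
    """Combine causal attribution with SASS validation into one of the three
    buckets. The noise threshold here is illustrative."""
    if not sass_passed:
        # The method was never realized in machine code, whatever its score.
        return "implementation_failed_methods"
    if attribution_ms > noise_ms:
        return "effective_methods"        # realized and causally beneficial
    return "ineffective_methods"          # realized, but did not help
```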

3.4 Outputs of the ablation stage

Each iteration now contains a new ablations/ subdirectory.

For every ablated method, a corresponding subfolder stores:

  • the ablated kernel source
  • the benchmark result

The aggregated attribution data is written to iterv{i}/attribution.json.

3.5 What fundamental problem this solves

Ablation turns method effectiveness from a group-level judgment into a per-method causal test.

This provides two major benefits:

  1. The effective_methods list becomes much more trustworthy, so future iterations are less likely to be misled by false positives.
  2. The ineffective_methods list also becomes more precise. A good method is no longer rejected simply because it happened to be bundled with bad companions.

4. SASS-Level Instruction Validation

4.1 The problem in V1: claimed optimizations may exist only on paper

In V1, Claude might claim to have used a certain optimization method—say, tensor cores—and might even write the corresponding source code, such as inline mma.sync assembly or a CUTLASS HMMA path.

But the compiler may not actually adopt that intended path.

Common reasons include:

  • compiler optimization overriding the handwritten intent
  • register pressure forcing a fallback to scalar instructions
  • pragma or attribute directives being silently ignored

V1 acknowledged this in its failure-mode documentation and suggested manually diffing Nsight Compute reports between iterations to see when “nothing really changed.”

But this remained only a suggestion. It was not automated, and could easily be skipped.

4.2 What V2 does: automatic validation at the disassembly level

V2 introduces sass_check.py and a reference file references/sass_signatures.json, forming an automated compiled-artifact verification mechanism.

sass_signatures.json maps each optimization method to the SASS instruction patterns that should appear if the method was truly realized. For example:

  • compute.tensor_core → should contain HMMA instructions
  • latency.async_pipeline → should contain LDGSTS or CP.ASYNC
  • memory.smem_swizzle_xor → should show the expected shared-memory access signature

The workflow of sass_check.py is:

  1. run cuobjdump --dump-sass on the compiled champion kernel
  2. extract the SASS disassembly
  3. grep for the expected instruction signatures corresponding to each selected method
  4. output a pass/fail result per method to iterv{i}/sass_check.json
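The core of that workflow is string matching over the disassembly; a sketch of steps 3 and 4, with a hypothetical signature table mirroring references/sass_signatures.json (the real table and key names may differ):

```python
import re
import subprocess

# Hypothetical signature table; the real one lives in references/sass_signatures.json.
SASS_SIGNATURES = {
    "compute.tensor_core": [r"\bHMMA\b"],
    "latency.async_pipeline": [r"\bLDGSTS\b", r"CP\.ASYNC"],
}

def dump_sass(binary_path):
    """Disassemble a compiled kernel (requires the CUDA toolkit on PATH)."""
    return subprocess.run(["cuobjdump", "--dump-sass", binary_path],
                          capture_output=True, text=True, check=True).stdout

def check_signatures(sass_text, methods, signatures=SASS_SIGNATURES):
    """Per-method pass/fail: a method passes if any of its expected
    instruction patterns appears in the disassembly."""
    return {m: any(re.search(p, sass_text) for p in signatures.get(m, []))
            for m in methods}
```

Running `check_signatures` on a disassembly that contains HMMA but no LDGSTS/CP.ASYNC would pass compute.tensor_core and fail latency.async_pipeline, which is exactly the "claimed but never realized" case the next subsection discusses.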

4.3 How SASS validation and ablation work together

SASS validation and ablation answer two different questions:

  • Ablation attribution: did this method have a causal performance contribution?
  • SASS validation: was this method actually realized by the compiler in the machine code?

Together, they produce the three-way classification described earlier.

One especially important case is when:

  • SASS validation fails
  • but the kernel still gets faster overall

V2 handles this by marking the method as implementation_failed, while still allowing the kernel itself to remain the best result if it truly is faster.

That means the speedup may have come from hyperparameter changes rather than from the claimed method itself. This is exactly the sort of distinction that V2’s attribution system can capture and that V1 completely missed.

4.4 Impact on future iterations

Methods placed into implementation_failed_methods are not automatically blacklisted in future rounds.

Instead, Claude must explicitly acknowledge the earlier implementation failure in analysis.md and explain why the method may succeed this time—for example:

  • using a different code formulation
  • changing compiler flags
  • reducing register pressure

This is more flexible than V1’s simple blacklist approach. The method itself may still be good; only the previous implementation path failed.


5. Changes to the State Structure

5.1 Overview of newly added fields

Compared with V1, V2 adds four new top-level fields to state.json, each corresponding to one of the mechanisms above.

branches (integer)

This records the number of branches explored per iteration, defaulting to 4.

It is initialized in state.py init via the --branches argument and remains fixed throughout the optimization run.

It controls:

  • how many hyperparameter variants Claude must generate in step 3d
  • how many candidates branch_explore.py must compile and benchmark in step 3e

implementation_failed_methods (list)

V1 had only two buckets:

  • effective_methods
  • ineffective_methods

V2 adds a third:

  • implementation_failed_methods

The meanings are now:

  • effective_methods: implemented in SASS and causally beneficial
  • ineffective_methods: implemented in SASS but not causally beneficial
  • implementation_failed_methods: never truly manifested in machine code

This three-way split allows later iterations to distinguish between:

  • “we tried it and it didn’t help”
  • “we wrote it, but the compiler never really used it”

Those are fundamentally different failure modes and should be handled differently.

roofline_history (list)

After each iteration, the round’s roofline result is appended here, including:

  • Δc, Δm, Δl
  • bound type
  • near-peak flag
  • axis budget allocation

This serves two purposes:

  1. it provides the data source for bottleneck-shift summaries
  2. it gives Claude historical trend information for future reasoning

For example, if Δm keeps increasing across two rounds, that may indicate that previous compute optimizations introduced additional memory pressure.

frontier (list)

This stores all branch candidates from all iterations that:

  • passed correctness
  • but were not selected as champion

Each record contains information such as:

  • iteration number
  • branch ID
  • kernel path
  • benchmark timing

The frontier is valuable because these kernels are valid but suboptimal in the current setting. They may perform better on different GPUs, different input sizes, or serve as promising starting points for manual exploration.
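Putting the four new fields together, a state.json excerpt might look like the following. Every value here is illustrative, and the exact key names inside each record are assumptions based on the descriptions above:

```python
# Hypothetical excerpt of state.json after one iteration; field names follow
# the text above, values and paths are purely illustrative.
STATE_EXAMPLE = {
    "branches": 4,
    "effective_methods": ["compute.tensor_core"],
    "ineffective_methods": ["memory.smem_swizzle_xor"],
    "implementation_failed_methods": ["latency.async_pipeline"],
    "roofline_history": [
        {"iter": 1,
         "gaps": {"compute": 0.92, "memory": 0.57, "latency": 0.61},
         "bound": "compute", "near_peak": False,
         "budget": {"compute": 2, "memory": 0, "latency": 1}},
    ],
    "frontier": [
        {"iter": 1, "branch": "b2",
         "kernel": "iterv1/branches/b2/kernel.cu", "ms": 2.31},
    ],
}
```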

5.2 Changes to the inputs of state.py update

In V1, the update command needed only three inputs:

  • kernel file
  • benchmark result
  • methods.json

In V2, it adds two new required inputs:

  • --attribution
  • --sass-check

This means state updates no longer rely solely on the single signal of “did the whole kernel get faster,” but instead incorporate two finer-grained signals:

  • causal attribution
  • implementation validation

5.3 Changes to method classification logic

This is the core difference in how the state gets updated.

In V1, classification was:

  • binary
  • group-based

If the whole kernel was faster than the previous one by more than 2%, all methods were marked effective; otherwise all were marked ineffective.

In V2, classification becomes:

  • ternary
  • per-method

For each method, the update logic first checks SASS validation, then attribution, and assigns the method to one of the three buckets accordingly.

Three methods selected in the same iteration may now end up in three different buckets.

5.4 Richer history records

In V1, each history entry contained only:

  • iter
  • methods
  • ms
  • speedup
  • status

In V2, each history record is extended with attribution and SASS validation information, turning the history into a complete iteration audit log.

This makes it possible to reconstruct exactly what happened in any round, including the contribution and validation status of every method.

5.5 A shift in design philosophy

V1’s state was primarily a recorder: it faithfully stored what happened in each round, but used coarse judgment rules.

V2’s state is much closer to a knowledge base: it stores not only facts, but also:

  • causal attribution
  • compiler-level validation
  • bottleneck evolution
  • exploration frontier

This gives future iterations richer and more reliable context for decision-making.

In other words, the optimization loop has evolved from:

trial → record

into:

trial → attribute → validate → learn

6. Results

As an example, I ran the skill on one of the operator problems from
https://tensara.org/problems. The local test environment used an RTX 3060.

As a result, the observed optimization gains still do not fully reflect the performance ceiling one might expect on an A100. The main limitation is simply that no A100 hardware was available for validation.

(Image: Tensara baseline result)

(Image: Tensara best result)

(Image: result from KernelFlow-ops/cuda-optimized-skill)

If you found this write-up helpful, a ⭐ Star on the repo would be greatly appreciated. Thank you!
