Auto-Generated CUDA Kernels Need Kernel-Level Validation

#ai #gpu #performance #machinelearning

An LLM-written kernel benchmarked 38% faster on a microbench. Here is what kernel-level validation showed it actually did at runtime.

TL;DR

Multi-agent LLMs are now writing CUDA kernels (RightNow AI’s AutoKernel, Meta’s KernelEvolve, a multi-agent system claiming 38% speedup on Blackwell). Source-level benchmarks measure clean throughput on a single isolated kernel. They do not measure SM occupancy under co-scheduling, DRAM bandwidth saturation, dispatcher off-CPU during a real serving workload, or NCCL wait correlation with sibling kernels. Kernel-level validation closes that gap: an eBPF trace of the same kernel running under the same workload as production answers all four questions in one capture.

The kernel-writing wave

Three pieces of work in April surfaced the same pattern: agents generate CUDA kernels, then quote a single throughput number against a baseline.

RightNow AI’s AutoKernel (announced Apr 6) – LLM agents iteratively rewrite CUDA kernels for a target metric, claiming substantial speedups on selected microbenchmarks.
Meta’s KernelEvolve – similar shape: agents propose kernel variants, rank by throughput, keep the best.
Multi-agent system on Blackwell (Apr 29 reports) – claims a 38% speedup on a public kernel benchmark using a coordinated agent setup.

All three are real research, all three produce real kernels, and all three report numbers that come from microbenchmarks. The microbench setup is exactly what you want for the optimization loop. It is not what you get in production.

What microbenchmarks do not see

Run an LLM-generated kernel under Nsight Compute (or the legacy nvprof on pre-Ampere GPUs, the last generation it supports) on an otherwise-idle GPU and the throughput number is real. Put the same kernel in front of a vLLM serving workload and four properties change immediately:

SM occupancy under co-scheduling. A kernel that achieves 95% SM occupancy in isolation can drop to 40-50% when three other kernels share the same SMs. The optimizer never sees this regime.
DRAM bandwidth saturation. A kernel whose working set fits in L2 during the microbench can spill to DRAM once co-running kernels evict its cache lines. Bandwidth-bound kernels are especially prone to this.
Dispatch-thread blocking. The kernel runs at full speed, but the host thread that launches the next batch is now off-CPU for 13ms because a sibling Python thread holds a futex. The microbench does not have a sibling Python thread.
NCCL wait correlation. In a multi-rank training run, the new kernel’s runtime variance shows up as straggler wait on neighboring ranks. The microbench is single-rank.

All four are visible in an eBPF capture of the kernel running under the real workload. None of the four shows up in a source-level benchmark.
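Tail effects like the dispatch-thread stall are why the trace numbers in this post are p99 values and not means. A minimal sketch of the difference, using synthetic latencies (not measured data) and a nearest-rank percentile; the sample mix of 19 us launches and 13 ms stalls is an assumption chosen only to mirror the figures mentioned in this post:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Synthetic launch latencies in microseconds (illustrative only):
# 980 fast launches at 19 us, plus 20 launches delayed by a 13 ms
# host-side stall such as a sibling thread holding a futex.
launch_us = [19] * 980 + [13_000] * 20

mean_us = sum(launch_us) / len(launch_us)
p50_us = percentile(launch_us, 50)
p99_us = percentile(launch_us, 99)

print(f"mean={mean_us:.2f}us p50={p50_us}us p99={p99_us}us")
```

The median stays at the fast-path latency while the p99 lands on the stall, which is the regime a single-threaded microbench never produces.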

What kernel-level validation looks like in practice

We took a kernel of the shape an agent might generate (a fused RMSNorm-add for a Llama-class block) and ran it under three regimes: isolated microbench, co-scheduled with one other kernel, co-scheduled with three other kernels. The eBPF trace from each regime, side by side:

regime              SM occ   DRAM bw  cudaLaunch  cudaSync   throughput
                    (mean)   (peak)   p99 (us)    p99 (us)   (rel)
------------------------------------------------------------------------
1. isolated         96%      52%      19          110        1.00x
2. + 1 sibling      71%      78%      48          330        0.74x
3. + 3 siblings     43%      94%      6,400       4,720      0.31x

The kernel that benchmarks 1.00x in isolation runs at 0.31x in a realistic co-scheduled regime. The 38% improvement claim from the microbench evaporates. Worse, cudaSync p99 rises roughly 43x (110 us to 4,720 us) and cudaLaunch p99 roughly 337x (19 us to 6,400 us) – the kind of latency that shows up in tail percentiles on the serving side.
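The regime-to-regime ratios can be recomputed directly from the table. A minimal sketch, with the row values copied from the table above; the helper name `degradation` and the output field names are illustrative, not part of any tool:

```python
# Rows copied from the table: (regime, SM occupancy %, DRAM bw peak %,
# cudaLaunch p99 in us, cudaSync p99 in us, relative throughput).
ROWS = [
    ("isolated",     96, 52,    19,   110, 1.00),
    ("+ 1 sibling",  71, 78,    48,   330, 0.74),
    ("+ 3 siblings", 43, 94, 6_400, 4_720, 0.31),
]

def degradation(rows):
    """Compare every co-scheduled regime against the first (isolated) row."""
    _, base_occ, _, base_launch, base_sync, base_thr = rows[0]
    out = {}
    for name, occ, _bw, launch, sync, thr in rows[1:]:
        out[name] = {
            "launch_p99_x": launch / base_launch,        # multiplier on launch p99
            "sync_p99_x": sync / base_sync,              # multiplier on sync p99
            "throughput_loss_pct": (1 - thr / base_thr) * 100,
            "occupancy_drop_pts": base_occ - occ,        # percentage points lost
        }
    return out

ratios = degradation(ROWS)
for name, r in ratios.items():
    print(name, {k: round(v, 1) for k, v in r.items()})
```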

An eBPF trace caught all of this in a single capture. No kernel instrumentation, no SDK in the model, no rebuild.

What an MCP-driven validation loop looks like

If the agent that generated the kernel can also call the eBPF MCP tools, the validation loop closes itself. A typical session:

Engineer: We have a candidate fused RMSNorm-add kernel. Validate it under realistic load.

Agent (calls run_demo, get_trace_stats): Captured 60s with the candidate kernel under a 4-stream co-scheduling regime. cudaLaunchKernel p99 = 6.4ms (vs 19us in microbench), SM occupancy mean = 43% (vs 96% claimed), DRAM bandwidth saturated for 38% of capture window.

Agent (calls get_causal_chains): Two HIGH-severity chains. Dominant: dispatcher off-CPU 8.9s in the window, blocked by a co-scheduled kernel using the same DRAM channels. The candidate kernel does not cause this directly, but its bandwidth profile makes it worse.

Engineer: Reject the kernel?

Agent: Reject for the four-stream regime. Acceptable for single-stream. Recommend a guard in the dispatcher: do not co-schedule this kernel with itself.

The agent learns the kernel’s actual behavior from runtime evidence, not from microbench claims. That is what production validation looks like for an auto-generated kernel.
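The accept/reject step in a loop like this can be reduced to a per-regime gate over trace statistics. A minimal sketch, assuming hypothetical names (`TraceStats`, `Limits`, `gate`) and hypothetical thresholds; none of these come from the MCP tools mentioned above, and the input numbers are the isolated and three-sibling rows from the table:

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class TraceStats:
    regime: str
    sm_occupancy_pct: float   # mean SM occupancy
    launch_p99_us: float      # cudaLaunch p99
    sync_p99_us: float        # cudaSync p99

@dataclass
class Limits:
    # Hypothetical acceptance thresholds; tune per workload.
    min_sm_occupancy_pct: float = 60.0
    max_launch_p99_us: float = 1_000.0
    max_sync_p99_us: float = 2_000.0

def gate(stats: TraceStats, limits: Limits) -> list[str]:
    """Return the list of violated limits; an empty list means accept."""
    reasons = []
    if stats.sm_occupancy_pct < limits.min_sm_occupancy_pct:
        reasons.append(f"SM occupancy {stats.sm_occupancy_pct}% below minimum")
    if stats.launch_p99_us > limits.max_launch_p99_us:
        reasons.append(f"launch p99 {stats.launch_p99_us}us above maximum")
    if stats.sync_p99_us > limits.max_sync_p99_us:
        reasons.append(f"sync p99 {stats.sync_p99_us}us above maximum")
    return reasons

captures = [
    TraceStats("isolated", 96, 19, 110),
    TraceStats("+ 3 siblings", 43, 6_400, 4_720),
]
verdicts = {c.regime: gate(c, Limits()) for c in captures}
for regime, reasons in verdicts.items():
    print(regime, "ACCEPT" if not reasons else f"REJECT: {reasons}")
```

Returning the violated limits, not just a boolean, lets the same gate produce a regime-specific verdict like the one in the session above: accept for single-stream, reject for the co-scheduled case.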

Reading on the kernel-writing-agent regime

Three public references for the kernel-writing-agent regime: the NVIDIA CUDA Runtime API documentation defines the dispatch-side primitives a generated kernel touches; the Nsight Compute user guide describes the SM-occupancy and DRAM-bandwidth counters microbenchmarks run against; and the Linux eBPF documentation covers the uprobe and tracepoint mechanism the runtime trace above uses to observe the same kernel under a real serving workload.

Trust the kernel after the kernel runs

An LLM that writes a CUDA kernel is solving an optimization problem on the source. That is a useful problem to solve. Production workloads run the kernel in regimes the optimization loop cannot reach: co-scheduling, DRAM-bandwidth contention, dispatcher-thread preemption, NCCL coupling. The kernel that wins the source-level competition often loses the runtime one. Kernel-level validation is the gate that separates the two.

Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. GitHub ⭐ · Open an issue if you are deploying LLM-generated CUDA kernels and want runtime evidence for what they actually do.