Generation-Side Tooling Outpaces Validation-Side Tooling

#ai #machinelearning #gpu #programming

The generation side is shipping fast (TileGym, AutoKernel, KernelEvolve). The validation-side surface for “what the kernel actually did at runtime” has not kept pace.

TL;DR

In the past nine months, three significant releases have landed for auto-generation of CUDA kernels: NVIDIA TileGym, RightNow AutoKernel, and Meta’s KernelEvolve. Each ships training infrastructure for kernel generation. Validation infrastructure (what the generated kernel actually did at runtime, on a real workload, in a production-shaped environment) has not kept the same pace. eBPF traces are the ground-truth layer that closes the gap.

What “validation” means at the kernel level

Two distinct validation surfaces:

Pre-launch: the generated CUDA C compiles, the PTX assembles, the kernel passes a numerical-equivalence test against a reference. Standard compiler / unit-test territory. Generation frameworks ship this themselves.
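To make "numerical-equivalence test" concrete, here is a minimal sketch in plain Python. It assumes the candidate and reference kernels have already been run and their outputs copied back to the host as flat lists of floats. The function name and the tolerances are illustrative and are not taken from any of the frameworks named above.

```python
import math

def mismatches(candidate, reference, rel_tol=1e-5, abs_tol=1e-8):
    """Return the indices where candidate and reference disagree.

    Tolerances are illustrative defaults, not values from any framework.
    A NaN in either output never compares close, so it is reported.
    """
    if len(candidate) != len(reference):
        raise ValueError("candidate and reference lengths differ")
    return [
        i
        for i, (c, r) in enumerate(zip(candidate, reference))
        if not math.isclose(c, r, rel_tol=rel_tol, abs_tol=abs_tol)
    ]

# Hypothetical outputs copied back from the GPU.
reference = [1.0, 2.0, 3.0, 0.0]
good = [1.0000001, 2.0, 2.9999999, 1e-9]   # within tolerance everywhere
bad = [1.0, 2.1, 3.0, float("nan")]        # one wrong value, one NaN

print(mismatches(good, reference))
print(mismatches(bad, reference))
```

A check like this says nothing about runtime behavior, which is why the second surface exists.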

Post-launch: the kernel ran, returned, took N microseconds, used M registers per thread, hit X cache miss rate, and did or did not serialize the rest of the stream behind it. An eBPF trace answers the host-visible timing and serialization questions; CUDA driver and profiler counters answer the register and cache questions. Both apply to any kernel, generated or hand-written.

Auto-generation pipelines do not by default close the post-launch loop. They demonstrate “the kernel works in our test setup”. They do not demonstrate “the kernel does not regress p99 latency on production inference traffic”.

What an eBPF trace adds to a generated kernel

Once a generated kernel is in a real workload, the same trace surface used for any CUDA kernel applies: launch latency from cudaLaunchKernel, sync stalls from cudaStreamSynchronize, host-side overhead from the dispatcher, host scheduling preemption while the GPU is busy. None of those signals are visible to a generation framework that evaluates kernels in isolation.
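As a rough illustration of how host-side signals like these can be derived, the sketch below pairs function-entry and function-return records per thread, which is the usual pattern behind entry/return probe latency measurements. The record layout (thread id, function name, "entry" or "return", timestamp in nanoseconds) is a hypothetical one chosen for the example, not the schema of any particular agent.

```python
from collections import defaultdict

def host_call_durations(records):
    """Pair entry/return records per (tid, func); return func -> [duration_ns]."""
    open_calls = {}                # (tid, func) -> entry timestamp
    durations = defaultdict(list)  # func -> list of host-side durations
    for tid, func, kind, ts_ns in sorted(records, key=lambda r: r[3]):
        key = (tid, func)
        if kind == "entry":
            open_calls[key] = ts_ns
        elif kind == "return" and key in open_calls:
            durations[func].append(ts_ns - open_calls.pop(key))
    return dict(durations)

# Hypothetical records from two host threads.
records = [
    (101, "cudaLaunchKernel", "entry", 1_000),
    (101, "cudaLaunchKernel", "return", 6_000),
    (101, "cudaStreamSynchronize", "entry", 7_000),
    (102, "cudaLaunchKernel", "entry", 8_000),
    (102, "cudaLaunchKernel", "return", 12_000),
    (101, "cudaStreamSynchronize", "return", 507_000),
]

result = host_call_durations(records)
print(result)
```

In this made-up trace the launches cost a few microseconds each on the host, while the synchronize call blocks for half a millisecond. That is the kind of stall an isolated kernel benchmark never sees.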

-- post-launch validation: did the new generated kernel regress p99?
SELECT kernel_name,
       COUNT(*)                  AS launches,
       AVG(duration_ns)/1e3      AS avg_us,
       quantile(duration_ns,0.99)/1e3 AS p99_us
  FROM cuda_events
 WHERE timestamp_ns BETWEEN :before AND :after
 GROUP BY kernel_name
 ORDER BY p99_us DESC
 LIMIT 20;

Run that query against a trace captured before the generated kernel replaced the hand-written one, and again against a trace captured after. The diff is the only “did this generation actually help” answer that survives contact with production.
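A minimal sketch of that before/after diff, assuming per-kernel durations have already been pulled out of each trace into plain Python dictionaries. The kernel names, the `rename` mapping, and the nearest-rank percentile are choices made for this example; the query engine's own quantile function may interpolate differently.

```python
def p99(durations):
    """Nearest-rank 99th percentile of a non-empty list (integer math only)."""
    ordered = sorted(durations)
    rank = -(-99 * len(ordered) // 100)  # ceil(0.99 * n), 1-based
    return ordered[rank - 1]

def p99_diff(before, after, rename=None):
    """Return kernel -> (before_p99, after_p99, relative_change).

    `rename` maps a kernel name in the after-trace to its name in the
    before-trace, for the case where the generated kernel has a new name.
    Only kernels present in both traces are compared.
    """
    rename = rename or {}
    after = {rename.get(name, name): d for name, d in after.items()}
    out = {}
    for name in before.keys() & after.keys():
        b, a = p99(before[name]), p99(after[name])
        out[name] = (b, a, (a - b) / b)
    return out

# Hypothetical durations in nanoseconds.
before = {"gemm_handwritten": [100] * 99 + [400], "softmax": [50] * 100}
after = {"gemm_generated": [80] * 98 + [500, 600], "softmax": [50] * 100}

diff = p99_diff(before, after, rename={"gemm_generated": "gemm_handwritten"})
for name, (b, a, change) in sorted(diff.items()):
    print(f"{name}: p99 {b} ns -> {a} ns ({change:+.0%})")
```

The sample data is built so that the generated kernel has a lower average (89.4 ns against 103 ns) and a higher p99 (500 ns against 100 ns), which is exactly the case an average-only benchmark would report as a win.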

Why this gap is widening

Kernel generation is downstream of LLM-coding capability, which has been the fastest-improving category in the past year. Validation infrastructure is downstream of observability tooling, which moves slower for structural reasons (instrumentation needs to be in production, the agent needs to have low overhead, the tool has to be trusted by SREs). The generation side gets a new release every quarter. The validation side gets one every two or three quarters.

The shape of the gap is not unique to GPU kernels. It mirrors the gap between “AI-generated code” and “AI-tested code” across software engineering more broadly. The GPU-kernel version is sharper because the cost of a wrong kernel in production is measured in dollars per minute on a fleet of $30/hour GPU hosts.
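For a sense of scale, here is the arithmetic as a small sketch. Only the $30/hour figure comes from the text; the fleet size and the 5% slowdown are made-up inputs, and the model (lost capacity is a straight fraction of fleet spend) is a deliberate simplification.

```python
def wasted_dollars_per_minute(hosts, dollars_per_host_hour, slowdown_fraction):
    """Fleet spend per minute multiplied by the fraction lost to a regression."""
    fleet_dollars_per_minute = hosts * dollars_per_host_hour / 60
    return fleet_dollars_per_minute * slowdown_fraction

# Hypothetical fleet: 200 hosts at $30/hour, with a kernel regression
# that costs 5% of throughput.
per_minute = wasted_dollars_per_minute(200, 30, 0.05)
per_day = per_minute * 60 * 24

print(f"${per_minute:.2f} per minute, ${per_day:,.0f} per day")
```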

Generation outruns validation by default

Better generation tools without better validation tools means more kernels in production that no one has measured against the workload they are running on. The fix is not to slow generation; it is to make the validation surface as cheap to run as the generation surface. A trace capture and a SQL query for every new generated kernel is one shape that fits.

Ingero – open-source eBPF agent for GPU debugging. One binary, zero deps, <2% overhead. Apache 2.0 + GPL-2.0. GitHub ⭐ · Open an issue if you are running auto-generated CUDA kernels in production and want a cheap, repeatable way to validate them on real workloads.