Jaber Jaber
LLMs Can Now Write GPU Kernels That Beat torch.compile

Six months ago, if you asked me whether an LLM could write a CUDA kernel that actually beats PyTorch's compiler, I would have said no. The optimization space is too complex. Too many hardware details. Too easy to write something that compiles but runs slower than the baseline.

I was wrong.

We're now seeing multi-agent systems that take your PyTorch code and spit out CUDA or Triton kernels with 2x to 14x speedups over torch.compile(mode='max-autotune-no-cudagraphs'). Not on toy benchmarks. On real models like Llama-3.1-8B, Whisper, and Stable Diffusion.

This post breaks down how this works, what the research says, and why I think swarm-based approaches like Forge are the future of kernel optimization.

The Problem: torch.compile Is Good, But Not Good Enough
PyTorch 2.0 introduced torch.compile, and it's genuinely impressive. You add one line of code and get 2x to 5x speedups on many models. Under the hood, TorchDynamo captures your computation graph and TorchInductor generates Triton kernels that fuse operations together.
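For reference, here's what that one line looks like, along with a rough way to time it. This is a minimal sketch assuming a CUDA device; MyModel is a stand-in for any nn.Module, and the mode string is the same max-autotune-no-cudagraphs baseline used throughout this post.

```python
import torch

model = MyModel().cuda().eval()   # MyModel is a placeholder for any nn.Module
x = torch.randn(8, 1024, device="cuda")

# One line: TorchDynamo captures the graph, TorchInductor emits Triton kernels.
compiled = torch.compile(model, mode="max-autotune-no-cudagraphs")

with torch.no_grad():
    compiled(x)                   # first call triggers compilation and autotuning
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(100):
        compiled(x)
    end.record()
    torch.cuda.synchronize()
    print(f"avg latency: {start.elapsed_time(end) / 100:.3f} ms")
```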

But torch.compile has hard limits.

First, graph breaks. Whenever your model has dynamic control flow or unsupported operations, the compiler gives up and falls back to eager mode. Each break is a missed optimization opportunity.
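To see where the breaks happen, torch._dynamo.explain is handy. Here's a minimal sketch; the toy function is my own illustration of data-dependent control flow, and the exact fields on the explain output can vary a bit between PyTorch versions.

```python
import torch

def toy_forward(x):
    # Branching on a tensor value is data-dependent control flow:
    # TorchDynamo can't trace through it, so it breaks the graph here
    # and falls back to eager mode for the untraced part.
    if x.sum() > 0:
        return torch.relu(x)
    return torch.tanh(x)

explanation = torch._dynamo.explain(toy_forward)(torch.randn(32, 32))
print(explanation.graph_break_count)   # how many times Dynamo gave up
print(explanation.break_reasons)       # why each break happened
```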

Second, and more importantly, compilers cannot invent algorithms. FlashAttention gets 3x to 10x speedups by computing attention in a fundamentally different way. It uses online softmax to avoid materializing the full N-squared attention matrix. No compiler will ever discover this on its own. It's an algorithmic insight, not a fusion pattern.
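To make the online softmax point concrete, here's a small PyTorch sketch of the rescaling trick FlashAttention builds on: attention is accumulated tile by tile over the keys, and the running max and normalizer get corrected as each new tile arrives, so the full N-by-N score matrix never exists in memory. This is my own illustration of the recurrence, not FlashAttention's actual kernel.

```python
import torch

def online_softmax_attention(q, k, v, tile=128):
    # q, k, v: (N, d) -- single head, no scaling or masking, for brevity
    N, d = q.shape
    out = torch.zeros_like(q)
    row_max = torch.full((N, 1), float("-inf"))
    row_sum = torch.zeros(N, 1)
    for start in range(0, N, tile):
        kt, vt = k[start:start + tile], v[start:start + tile]
        scores = q @ kt.T                          # only one (N, tile) slab of scores at a time
        new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
        correction = torch.exp(row_max - new_max)  # rescale everything accumulated so far
        p = torch.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ vt
        row_max = new_max
    return out / row_sum

# Matches the naive implementation up to floating-point error:
q, k, v = (torch.randn(512, 64) for _ in range(3))
ref = torch.softmax(q @ k.T, dim=-1) @ v
assert torch.allclose(online_softmax_attention(q, k, v), ref, atol=1e-4)
```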

Third, hardware-specific tricks like H100 warp specialization and TMA require manual tuning that general compilers don't attempt.

This is why companies like Meta still employ kernel engineers. There's a massive gap between what compilers produce and what's actually possible.

KernelBench: Finally, a Real Benchmark
Stanford's Scaling Intelligence Lab released KernelBench in December 2024, and it's now the standard way to evaluate LLM kernel generation. The benchmark has 250 tasks across three difficulty levels, plus a fourth aspirational level:

Level 1: Single operations like convolutions and matrix multiplies (100 tasks)

Level 2: Fused operations like conv + bias + ReLU (100 tasks)

Level 3: Full model architectures like MobileNet and MiniGPT (50 tasks)

Level 4: Hugging Face model optimization (20 aspirational tasks)

The metric that matters is called "fast_p": the percentage of tasks where the generated kernel is both correct AND faster than the PyTorch baseline.
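As I read the benchmark, fast_p actually takes a speedup threshold p (p = 1 is the headline "just faster than baseline" number). A tiny sketch of the computation, over made-up results:

```python
def fast_p(results, p=1.0):
    """results: list of (is_correct, speedup_vs_baseline) tuples, one per task."""
    wins = sum(1 for correct, speedup in results if correct and speedup > p)
    return 100.0 * wins / len(results)

# Made-up run of 100 tasks: a mix of wrong, slow, and genuinely faster kernels.
results = [(True, 1.8), (True, 0.7), (False, 2.5), (True, 1.1)] * 25
print(fast_p(results, p=1.0))  # 50.0 -- only correct AND faster-than-baseline kernels count
print(fast_p(results, p=1.5))  # 25.0 -- raising the bar to a 1.5x speedup
```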

[Table: KernelBench one-shot generation results (fast_p %) by model and level. Source: KernelBench (Stanford Scaling Intelligence Lab, ICML 2025).]

The one-shot numbers are pretty bad. Even the best reasoning models fail to beat PyTorch most of the time on a single attempt.

But here's where it gets interesting. With iterative refinement (giving the model execution feedback and profiler output over 10 turns), DeepSeek R1 jumps from 12% to 43% on Level 1, 36% to 72% on Level 2, and 2% to 18% on Level 3.

The lesson is clear: one-shot generation doesn't work for kernels. You need feedback loops.
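The loop itself is conceptually simple. Here's a sketch of its shape; generate_kernel, compile_and_test, and profile are hypothetical stand-ins for an LLM call, a correctness harness, and a profiler wrapper, not a real API.

```python
def refine_kernel(task, baseline_ms, max_turns=10):
    """Hypothetical iterative-refinement loop: every turn feeds compiler errors,
    test failures, and profiler output back into the next generation prompt."""
    feedback, best = "", None
    for turn in range(max_turns):
        kernel_src = generate_kernel(task, feedback)                 # stand-in LLM call
        ok, error, latency_ms = compile_and_test(kernel_src, task)   # stand-in harness
        if not ok:
            feedback = f"Turn {turn}: failed -- {error}"
            continue
        if best is None or latency_ms < best[1]:
            best = (kernel_src, latency_ms)
        feedback = (f"Turn {turn}: correct, {latency_ms:.2f} ms "
                    f"vs baseline {baseline_ms:.2f} ms.\n"
                    + profile(kernel_src, task))                     # stand-in profiler wrapper
    return best
```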

Why Multi-Agent Systems Work Better
The research on multi-agent code generation is pretty convincing at this point.

AgentCoder (2024) showed that separating code generation and test verification across specialized agents achieves 96.3% pass@1 on HumanEval. That's near ceiling performance. MapCoder achieved state-of-the-art results using four agents that mimic the human programming cycle: retrieval, planning, coding, and debugging.

For kernel generation specifically, STARK demonstrated that multi-agent systems using Anthropic's Claude Sonnet could hit a 100% success rate on KernelBench Level 1 with up to 3x speedups. Single-agent approaches couldn't match this.

The pattern that keeps working is what researchers call "coder-judge" or "generator-verifier". One agent writes code, another agent checks it. This separation matters because generation and verification require different skills.
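In code form, the split is just two differently prompted roles, where the judge gates what the coder produced. A generic sketch of the pattern, with llm_call as a hypothetical wrapper around whatever model you're using:

```python
def coder_judge_round(task, history):
    """One round of the generator-verifier pattern (generic sketch).
    The coder only writes kernels; the judge only critiques them."""
    candidate = llm_call(
        role="coder",
        prompt=f"Write a Triton kernel for:\n{task}\nPrior feedback:\n{history}",
    )
    verdict = llm_call(
        role="judge",
        prompt=("Check this kernel for correctness hazards (races, bad masking, "
                f"shared-memory overflows) and wasted bandwidth:\n{candidate}"),
    )
    accepted = "REJECT" not in verdict
    return candidate, verdict, accepted
```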

There's also strong evidence for inference-time scaling. The paper "Scaling LLM Test-Time Compute Optimally" from ICLR 2025 showed that throwing more compute at test time can be more effective than scaling model parameters. The relationship is log-linear: double your test-time compute, get consistent improvements in output quality.

AlphaCode 2 takes this to the extreme. It samples up to a million candidates per problem, filters out 95% that fail basic checks, clusters the rest by runtime behavior, and uses a scoring model to pick the best. This got them to the 85th percentile on Codeforces.
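The sample-filter-cluster-score funnel is worth seeing in outline. This is a schematic sketch only; sample_candidates, passes_basic_checks, runtime_signature, and score are hypothetical stand-ins, not AlphaCode 2's actual code.

```python
from collections import defaultdict

def sample_filter_cluster_score(problem, n_samples=1_000_000, k=10):
    """Schematic AlphaCode-2-style selection: massive sampling, cheap filtering,
    clustering by runtime behavior, then a scoring model picks the survivors."""
    candidates = sample_candidates(problem, n_samples)        # stand-in LLM sampler

    # 1. Filter: drop the large majority that fail basic input/output checks.
    survivors = [c for c in candidates if passes_basic_checks(c, problem)]

    # 2. Cluster: programs producing identical outputs on probe inputs are
    #    treated as behaviorally equivalent.
    clusters = defaultdict(list)
    for c in survivors:
        clusters[runtime_signature(c, problem)].append(c)

    # 3. Score: rank one representative per cluster and keep the top k.
    representatives = [members[0] for members in clusters.values()]
    return sorted(representatives, key=score, reverse=True)[:k]
```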

The Foundation: AlphaDev and FunSearch
The conceptual roots of AI-generated optimized code go back to two Google DeepMind papers.

AlphaDev (Nature, June 2023) used reinforcement learning and Monte Carlo Tree Search to discover assembly-level sorting algorithms that were up to 70% faster for small sequences than human-written code. These algorithms got merged into the LLVM libc++ standard library. They're now called trillions of times daily. That's the first time RL-discovered algorithms shipped in a major production library.

FunSearch (Nature, December 2023) paired an LLM with island-based evolutionary search. The language model proposes mutations, and the evolutionary algorithm handles selection and population management. This combo achieved the first improvement in 20 years to the cap set problem in combinatorics.
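The island loop is simple enough to sketch. In this schematic version, propose_mutation stands in for the LLM and fitness for the problem-specific evaluator; the migration detail is my simplification of the published idea, not DeepMind's code.

```python
import random

def island_evolution(seed_program, n_islands=4, island_size=20, generations=200):
    """Schematic FunSearch-style loop: the LLM proposes mutations, evolution
    handles selection, and islands evolve mostly independently."""
    islands = [[seed_program] * island_size for _ in range(n_islands)]
    for gen in range(generations):
        for island in islands:
            parent = max(random.sample(island, 3), key=fitness)  # tournament selection
            child = propose_mutation(parent)                     # stand-in LLM call
            island.sort(key=fitness)
            island[0] = child                                    # replace the current weakest
        if gen % 50 == 0:                                        # occasional cross-island migration
            best = max((p for isl in islands for p in isl), key=fitness)
            random.choice(islands).append(best)
    return max((p for isl in islands for p in isl), key=fitness)
```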

The FunSearch approach directly inspired systems like Meta's KernelEvolve, which uses similar LLM + evolutionary dynamics for kernel optimization.

Meta's Production System: KernelEvolve
Meta has the most mature production deployment of LLM-assisted kernel generation. Their KernelEvolve system is detailed in a 2025 paper, and the results are impressive:

Works across NVIDIA GPUs, AMD GPUs, and Meta's custom MTIA accelerators

Achieves 1.25x to 17x speedups across Meta's LLMs and recommendation systems

Deployed for hundreds of models serving billions of users daily

Uses a self-improving state machine with tree search and a persistent knowledge base

Meta also released KernelLLM, an 8B parameter model fine-tuned specifically for Triton kernel generation. Despite being much smaller than frontier models, it matches their performance on KernelBench Level 1. Training took only about 192 GPU hours.

One stat that surprised me: Triton has overtaken CUDA as the dominant kernel programming model at Meta. They have over 8,000 Triton kernels in production.

How Forge Approaches This
Forge takes the multi-agent approach and scales it to 32 parallel coder-judge pairs. Each pair explores a different region of the optimization space: tensor core utilization, memory coalescing, register blocking, shared memory tiling, kernel fusion patterns.

[Figure: Forge system architecture]

The system takes PyTorch input, passes it through a Pattern RAG system (containing 1,711 CUTLASS patterns and 113 Triton patterns), then uses an evolutionary optimizer with MAP-Elites across 36 cells covering memory-bound, compute-bound, fused-op, and tensor core strategies. The 32 swarm agents work in parallel, each as a coder-judge pair, achieving a 100% correctness rate and an average 2.18x speedup on outputs.

The agents compete to find the best kernel configuration. Most will get stuck in local minima (suboptimal but stable configurations). A few will find the global optimum. You keep the winners.

In practice, when 32 independent coder-judge pairs compete to find the optimal kernel configuration on an NVIDIA B200, only about 3 agents typically navigate to the global optimum (e.g., 8.24ms latency), while the rest converge to local minima (14.76ms, 19.32ms) or keep wandering, oscillating, or stuck in suboptimal regions.

The system uses MAP-Elites, which is a quality-diversity algorithm that maintains an archive of solutions across different behavioral characteristics. Instead of just keeping the single best kernel, you keep the best kernel for each combination of traits (memory-bound vs compute-bound, different block sizes, etc). This diversity helps avoid getting trapped in local optima.
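A MAP-Elites archive is basically a dictionary keyed by behavioral traits, where each cell keeps its own elite. A generic sketch (the describe and evaluate functions are illustrative, not Forge's actual 36-cell definition):

```python
def map_elites_step(archive, candidate, evaluate, describe):
    """One MAP-Elites update (generic sketch).
    archive:  dict mapping a behavior-descriptor tuple -> (kernel, latency_ms)
    describe: maps a kernel to its cell, e.g. ("memory_bound", "block_128", "fused")
    evaluate: measures latency on real hardware."""
    cell = describe(candidate)
    latency_ms = evaluate(candidate)
    incumbent = archive.get(cell)
    if incumbent is None or latency_ms < incumbent[1]:
        # Keep the best kernel per cell, not a single global best.
        archive[cell] = (candidate, latency_ms)
    return archive
```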

There's also retrieval-augmented generation from a database of 1,711 CUTLASS patterns and 113 Triton patterns. The agents don't start from scratch. They pull relevant code snippets and optimization strategies from a curated knowledge base.

[Figure: Forge benchmark results, latency on NVIDIA B200]

On benchmarks, Forge shows 5.16x speedup on Llama-3.1-8B (42.3ms down to 8.2ms), 4.23x on Qwen2.5-7B, and 2.87x on SDXL UNet. All measured against torch.compile with max-autotune-no-cudagraphs.

The Economics Make Sense
The performance gaps translate directly to money at scale.

FlashAttention-2 on H100 enables GPT-3 175B training for an estimated $458,000 versus $4.6M without the optimization. That's a 90% cost reduction.

One analysis from Lambda put it bluntly: if you're not using fused kernels, you're leaving 70% of your GPU performance on the table.

For inference, 80% of LLM latency comes from matmul and attention. Fused kernels aren't optional for real-time serving. vLLM and TensorRT-LLM rely on them heavily.
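To make "fused kernel" concrete at the smallest possible scale, here's a minimal Triton sketch that fuses a bias add and a ReLU into one pass over memory instead of two separate elementwise launches. It's an illustration of fusion, not one of the attention or matmul kernels the latency numbers above refer to.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_bias_relu_kernel(x_ptr, bias_ptr, out_ptr, n_elements, n_cols,
                           BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    b = tl.load(bias_ptr + (offsets % n_cols), mask=mask)  # bias broadcast across rows
    y = tl.maximum(x + b, 0.0)            # bias add + ReLU fused: one read, one write
    tl.store(out_ptr + offsets, y, mask=mask)

def fused_bias_relu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = x.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    fused_bias_relu_kernel[grid](x, bias, out, n_elements, x.shape[-1], BLOCK_SIZE=1024)
    return out

x = torch.randn(4096, 4096, device="cuda")
bias = torch.randn(4096, device="cuda")
assert torch.allclose(fused_bias_relu(x, bias), torch.relu(x + bias))
```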

The Cautionary Tale
I should mention Sakana AI's "AI CUDA Engineer" project. They initially claimed 10x to 100x speedups, which would have been incredible. But they later discovered the models were reward hacking. Instead of finding genuine optimizations, the LLMs found ways to exploit the benchmark measurement without actually making kernels faster.

This highlights why verification matters so much. You need robust correctness checks and real performance measurement on actual hardware. LLMs are very good at finding shortcuts. Forge addresses this with a tiered evaluation pipeline: Dedup, Compile, Test, then Benchmark on real GPUs.
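In pipeline form, the tiers are just cheap checks ordered before expensive ones, so reward-hacked or broken candidates never reach the benchmark stage. A generic sketch, with compiles, matches_reference_outputs, and benchmark_on_gpu as hypothetical stand-ins for Forge's actual stages:

```python
def tiered_evaluate(candidates, task):
    """Generic dedup -> compile -> test -> benchmark funnel (sketch).
    Each tier is cheaper than the next and discards most candidates."""
    seen, results = set(), []
    for src in candidates:
        key = hash(src.strip())                        # Tier 1: dedup identical sources
        if key in seen:
            continue
        seen.add(key)
        if not compiles(src):                          # Tier 2: must build
            continue
        if not matches_reference_outputs(src, task):   # Tier 3: correctness vs PyTorch reference
            continue
        latency_ms = benchmark_on_gpu(src, task)       # Tier 4: wall clock on real hardware
        results.append((src, latency_ms))
    return sorted(results, key=lambda r: r[1])
```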

What Comes Next
The research is converging on a few clear principles:

One-shot generation doesn't work for kernels. You need iterative refinement with execution feedback.

Multi-agent systems outperform single agents. Separating generation and verification helps.

Inference-time scaling is real. More test-time compute reliably improves results.

The gap between torch.compile and hand-optimized kernels is 2x to 10x for critical operations. This gap creates the economic incentive for these systems.

The open question is whether LLM-based systems can discover genuinely new algorithms like FlashAttention, or whether they'll be limited to optimizing within known patterns. My guess is we'll see incremental algorithmic improvements but not paradigm-shifting breakthroughs. The training data doesn't contain algorithms that don't exist yet.

But even "just" automating the application of known optimization patterns to arbitrary PyTorch code would be hugely valuable. That's what Forge and similar systems are trying to do.

Part of the RightNow ecosystem.

References
Ouyang et al. "KernelBench: Can LLMs Write Efficient GPU Kernels?" ICML 2025. Stanford Scaling Intelligence Lab.

Mankowitz et al. "Faster sorting algorithms discovered using deep reinforcement learning." Nature, June 2023.

Romera-Paredes et al. "Mathematical discoveries from program search with large language models." Nature, December 2023.

Meta AI. "KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators." arXiv:2512.23236, 2025.

Snell et al. "Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Parameters." ICLR 2025.

Huang et al. "AgentCoder: Multi-Agent Code Generation with Effective Testing and Self-optimisation." arXiv:2312.13010, 2024.

Dao et al. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022.

NVIDIA. "Automating GPU Kernel Generation with DeepSeek-R1 and Inference Time Scaling." NVIDIA Technical Blog, 2025.

Chen et al. "Teaching Large Language Models to Self-Debug." ICLR 2024.

Tillet et al. "Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations." OpenAI, 2019.

Try Forge Agent →
