Andrej Karpathy released autoresearch on GitHub last week, and the results are worth understanding carefully. Not because of the hype, but because of how the architecture actually works.
The framework is 630 lines of Python. It runs an AI agent in a loop: read a training script, form a hypothesis, modify the code, run a short training job (five minutes), evaluate results against a scalar metric, repeat. On Karpathy's own ML training setup, the agent ran 700 experiments over two days on a single GPU and found an 11% training speedup through 20 optimizations he says he hadn't discovered in 20 years of working on the same codebase.
Then Shopify's CEO ran the same approach on internal data. 37 overnight experiments. 19% performance gain. Applied to their Liquid templating engine: 53% faster rendering, 61% fewer memory allocations, 93 automated commits, all 974 unit tests passing. The repo hit 42,000 GitHub stars in its first week.
The Architecture Is the Lesson
The design is deliberately minimal. The entire agent contract lives in one file: program.md. That file carries:
- What to optimize (the objective, stated in natural language)
- Constraints (what the agent must not do: break tests, increase memory footprint, etc.)
- Stopping criteria (when to declare success or give up)
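To make the shape of that contract concrete, here is a hypothetical program.md along these lines. The wording, section names, and numbers below are illustrative, not the repo's actual file:

```markdown
# Objective
Minimize validation loss of train.py, measured after a 5-minute training run.

# Constraints
- All unit tests must continue to pass.
- Peak GPU memory usage must not increase over the baseline run.
- Modify only train.py; do not touch the data pipeline.

# Stopping criteria
- Stop after 700 experiments, or
- Stop once validation loss improves by 10% over the baseline.
```

The point of the single file is that the objective, the guardrails, and the exit condition all live in one place the agent reads at the start of every iteration.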
The agent reads program.md, modifies the training script, runs the job, parses the metric from the output, logs the result, and loops. No external tool calls. No internet access. No vector database of prior experiments. Just: read, modify, train, evaluate, repeat.
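That read-modify-train-evaluate loop is, structurally, greedy search over code modifications guided by one scalar. A minimal sketch, with the LLM call and the training job abstracted away as injected functions (`propose`, `evaluate`), everything here is illustrative rather than autoresearch's actual code:

```python
def experiment_loop(initial, propose, evaluate, budget, target=None):
    """Greedy scalar-guided search: propose a modification, score it,
    keep it only if the metric improves (lower is better), repeat.

    `propose(best)` stands in for the agent editing the script;
    `evaluate(candidate)` stands in for running the short training
    job and parsing the metric from its output.
    """
    best = initial
    best_score = evaluate(best)
    log = [(best, best_score)]          # every run is logged, kept or not
    for _ in range(budget):
        candidate = propose(best)
        score = evaluate(candidate)
        if score < best_score:          # accept only strict improvements
            best, best_score = candidate, score
        log.append((candidate, score))
        if target is not None and best_score <= target:
            break                       # stopping criterion from program.md
    return best, best_score, log
```

A toy usage, minimizing `(x - 3)^2` by random perturbation, shows why the single-scalar constraint matters: the acceptance test on line `if score < best_score` is only well-defined because runs are comparable on one number.

```python
import random

random.seed(0)
best, score, log = experiment_loop(
    initial=0.0,
    propose=lambda x: x + random.uniform(-1, 1),
    evaluate=lambda x: (x - 3.0) ** 2,
    budget=500,
)
```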
Karpathy's phrase for this pattern is "program synthesis via experiment." The agent isn't writing the optimizer from scratch. It's running empirical search over the space of code modifications, guided by a metric signal.
The Constraint That Actually Matters
Here's where a lot of the coverage has been imprecise: autoresearch only works where quality is measurable with a single scalar value.
Training loss, rendering time, memory allocations, test pass rate. These are scalar metrics. You can compare them across runs. An agent can know unambiguously whether run N+1 was better than run N.
Natural language quality isn't scalar. Alignment properties aren't scalar. Whether a piece of code is readable isn't scalar. Whether a product decision is the right one isn't scalar.
This constraint is the boundary condition for the entire framework. Karpathy acknowledges it: "It works best on problems where you have a clear eval." The framing in some coverage ("AI will now do all research autonomously") misses this. Autonomous research works for ML training, hyperparameter optimization, compiler tuning, and similar problems with quantifiable objectives. It doesn't yet work for the domains where human judgment is most irreplaceable.
That said, Shopify's result is a useful demonstration that the "clear eval" bar isn't as narrow as it might seem. Rendering time for a templating engine is a straightforward metric, but deriving a 53% improvement from 37 overnight experiments against that metric is genuinely impressive.
What to Take From This
If you're doing any ML work that involves iterative training runs, autoresearch is now the default first step before manual hyperparameter search. The framework is on GitHub; read program.md in particular. The single-file design combining agent instructions, constraints, and stopping criteria is a pattern worth stealing for any iterative agent task, not just ML optimization.
Karpathy's framing of the bigger picture: "Humans are now the bottleneck in AI research with easy-to-measure results." That's precise language. For the domains where measurement is hard, humans remain central. For the domains where it's easy, the leverage has shifted.
This story is from Edge Briefing: AI, a weekly newsletter curating the signal from AI noise. Subscribe for free to get it every Tuesday.