"The human might be asleep." One line in Karpathy's program.md started 100 automatic experiments per night.

[Image: autoresearch architecture diagram]

The biggest bottleneck in code optimization is the human in the loop. You think of an idea, implement it, test it, check results, then think again. In March 2026, Andrej Karpathy removed that bottleneck. He released autoresearch, a tool that lets an AI agent edit code, run experiments, evaluate results, and keep or discard changes automatically. It hit 42,921 GitHub stars in under two weeks (GitHub API, 2026-03-19 11:56 UTC).

The surprising part is where it spread. Shopify CEO Tobi Lutke applied the pattern to Liquid, a template engine running in production for 20 years. He reported a 53% reduction in parse+render time in PR #2056. LangChain CEO hwchase17 used it to optimize agent quality scores. Ole Lehmann reported raising a Claude Code skill eval score from 56% to 92%. This is not an ML research tool anymore. It is a pattern for any task with a measurable metric.

[Image: autoresearch loop diagram]

Why three files are enough

The architecture is stripped to the minimum. There are three core files.

program.md is the instruction file. A human writes it. It defines what to optimize, how to run experiments, and what must not break. train.py is the only file the agent edits. prepare.py is the evaluation harness. Nobody touches it.

This separation works because the boundary between "what changes" and "what stays fixed" is clear. The agent edits train.py, runs a 5-minute experiment, checks the metric. If it improved, git commit. If not, git reset. About 12 experiments per hour. Leave it running overnight and you get about 100.

The 5-minute cap is what makes this work. It forces every experiment into the same compute budget, so results can be compared fairly. Without a fixed budget, a slow-converging change that simply got more time would look just as good as a genuinely faster one.
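Here is a rough sketch of that loop in Python. Everything in it is an assumption for illustration: the agent_edit_train_py placeholder, the metric.txt handoff, and the exact commands are stand-ins, not autoresearch's real interface.

```python
import subprocess

BUDGET_SECONDS = 5 * 60      # same fixed compute budget for every experiment
N_EXPERIMENTS = 100          # roughly one night at ~12 experiments per hour


def sh(cmd, timeout=None):
    """Run a shell command and fail loudly if it errors or exceeds the timeout."""
    subprocess.run(cmd, shell=True, check=True, timeout=timeout)


def score():
    """Read the metric written by the evaluation harness.
    Assumes prepare.py writes a single number to metric.txt --
    a placeholder, not the actual autoresearch interface."""
    with open("metric.txt") as f:
        return float(f.read())


def agent_edit_train_py():
    """Placeholder for the coding agent's step: in the real tool an LLM
    rewrites train.py here based on program.md and the experiment log."""
    pass


# Baseline: run the current train.py once to establish the metric to beat.
sh("python train.py", timeout=BUDGET_SECONDS)
sh("python prepare.py")
best = score()

for i in range(N_EXPERIMENTS):
    agent_edit_train_py()
    try:
        sh("python train.py", timeout=BUDGET_SECONDS)   # hard 5-minute cap
        sh("python prepare.py")                          # harness is never edited
        new = score()
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        new = float("inf")                               # over budget or crashed = failure
    if new < best:                                       # lower val_bpb is better
        sh(f"git commit -am 'experiment {i}: {new:.4f}'")
        best = new
    else:
        sh("git reset --hard")                           # discard the change and move on
```

The design choice that matters here is that prepare.py and the scoring path sit outside the loop, so the agent cannot "improve" the metric by editing the judge.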

program.md includes this line: "The human might be asleep, or gone from a computer and expects you to continue working indefinitely until you are manually stopped." That single instruction removes the human bottleneck.
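For a feel of what such an instruction file can look like, here is a hypothetical program.md skeleton. It is not Karpathy's actual file, just a sketch of the three things described above: what to optimize, how to run experiments, and what must not break.

```markdown
# program.md (hypothetical skeleton, not the original)

## Goal
Minimize val_bpb as reported by prepare.py. Lower is better.

## How to experiment
- You may only edit train.py. Never edit prepare.py or this file.
- Every run must finish within 5 minutes.
- After each run: if val_bpb improved, git commit; otherwise git reset --hard.
- Log every experiment and its result, including failures.

## Constraints
- prepare.py must keep passing without modification.
- The human might be asleep, or gone from a computer and expects you to
  continue working indefinitely until you are manually stopped.
```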

[Image: autoresearch applications diagram]

126 experiments from Karpathy, 974 tests from Shopify

Karpathy ran 126 experiments on a single H100 in about 10.5 hours. He published the full log in Discussion #43. Out of 126 experiments, 23 were kept. That is about 18%. Most experiments fail or make things worse. But the ones that improve stack up. val_bpb went from 0.9979 to 0.9697.

The biggest win was halving the batch size (524K to 262K), which cut val_bpb by 0.0119. The biggest failure was weight tying (sharing the embedding and unembedding layers), which added 2.24 to val_bpb and completely broke the model. The dead-end log is valuable too: knowing what does not work saves future experiments from going down the same path.

Shopify took a different approach. The target was a Ruby library (lib/liquid/*.rb), not ML training code. The metric was combined_us (parse + render time), not val_bpb. The critical difference was a 3-gate validation system. Every change had to pass 974 unit tests, then a liquid-spec compliance check, then a performance benchmark. Only changes that passed all three gates and improved the metric were kept. About 120 experiments produced 93 commits. Parse time dropped 61%. Render time dropped 20%. Total dropped 53%.
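The gating logic is easy to sketch. Below is a rough Python version of a "keep only if every gate passes" check. The command names, the rake task, the benchmark script path, and the benchmark_us.txt handoff are all assumptions for illustration, not Shopify's actual setup.

```python
import subprocess


def passes(cmd, timeout=600):
    """A gate passes only if its command exits 0 within the timeout."""
    try:
        return subprocess.run(cmd, shell=True, timeout=timeout).returncode == 0
    except subprocess.TimeoutExpired:
        return False


def combined_us():
    """Read the parse+render benchmark result in microseconds.
    Assumes the benchmark writes a single number to benchmark_us.txt --
    a placeholder, not Liquid's real benchmark harness."""
    with open("benchmark_us.txt") as f:
        return float(f.read())


def keep_change(baseline_us):
    # Gate 1: the full unit test suite must pass.
    if not passes("bundle exec rake test"):            # assumed task name
        return False
    # Gate 2: the liquid-spec compliance check must pass.
    if not passes("bundle exec rake spec_compliance"):  # assumed task name
        return False
    # Gate 3: the benchmark must run AND combined_us must actually improve.
    if not passes("ruby perf/benchmark.rb"):            # assumed script path
        return False
    return combined_us() < baseline_us
```

Only a change that clears all three gates and beats the baseline gets committed; everything else is reverted, just as in Karpathy's loop.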

The key insight was that garbage collection was consuming 74% of CPU time. Focusing on reducing object allocations drove most of the improvement. Allocations went from 62,620 to 24,530, a 61% reduction.

Caveats

Shopify PR #2056 was still OPEN as of 2026-03-19. It has not been merged. Comments on the PR mention test failures. The 53% figure is self-reported and has not been independently verified.

Metric gaming is a known issue. After 30+ iterations, agents start finding ways to move the metric without genuinely improving the underlying program. Random seed engineering is one example: Karpathy's log includes fragile improvements like the "seed 137 effect" that may not reproduce.

The pattern is also spreading beyond the original repo. autoresearch-at-home (440+ stars) extends it to distributed collaboration, SETI@home style. autoresearch-anything (by zkarimi22) generates the setup files for any project via npx autoresearch-anything. The MLX port for Apple Silicon found that a depth=4 model beats depth=8 under the 5-minute budget: smaller models that fit more optimizer steps into the budget win. The optimal setup depends on the hardware.

Conclusion

autoresearch's success does not depend on model capability. It depends on three design choices. Metric: what you measure. Scope: what the agent is allowed to change. Verify: what tests and constraints protect the things that must not break.

Shopify's 53% improvement happened because they built a 3-gate Verify system with 974 tests, spec compliance, and benchmarks. If you want to apply this pattern, start by asking two questions. Do you have a measurable metric? Do you have a test suite that protects what matters? If the answer to both is yes, you can let an AI run 100 experiments while you sleep.
