## TL;DR
Andrej Karpathy released autoresearch: 630 lines of Python that let an AI agent autonomously run hundreds of ML experiments while you sleep. 50.7k GitHub stars in two weeks. MIT license.
## What is autoresearch?
The entire project is 3 files:

```
train.py    # ~630 lines; the agent modifies this
prepare.py  # data preprocessing (immutable)
program.md  # research direction (human-edited)
```
The loop is simple:

1. Agent reads train.py
2. Forms a hypothesis
3. Modifies the code
4. Runs a 5-minute training cycle
5. Evaluates val_bpb
6. Keeps the change if val_bpb improved, discards it if worse
7. Repeats, hundreds of times
That's it. No complex orchestration. No multi-GPU setup. One GPU, one metric, one loop.
## The Results
Karpathy's run:
| Metric | Before | After |
|---|---|---|
| Experiments | 0 | 700 (2 days) |
| Training time | 2.02h | 1.80h (11% faster) |
Shopify CEO Tobi Lütke:
| Metric | Result |
|---|---|
| Experiments overnight | 37 |
| Performance improvement | 19% |
| Key finding | 0.8B model > 1.6B model |
A smaller model outperforming a model twice its size through agent-driven optimization — that's the headline.
## The Karpathy Loop
Fortune coined the term. Karpathy himself put it bluntly:
"All LLM frontier labs will do this. It's the final boss battle."
His X post hit 8.6M views. The concept resonated because it's both simple and powerful.
## Architecture Deep Dive
```python
# Simplified autoresearch loop (conceptual; read_file, write_file,
# agent, and run_training are stand-ins for the real components)
best_val_bpb = float("inf")

while True:
    # Agent reads the current train.py
    current_code = read_file("train.py")

    # Forms a hypothesis based on the current state and program.md
    hypothesis = agent.analyze(current_code, program_md)

    # Modifies the code
    modified_code = agent.modify(current_code, hypothesis)
    write_file("train.py", modified_code)

    # Short training run (5 minutes)
    result = run_training(timeout=300)

    # Evaluate: keep the change if val_bpb improved...
    if result.val_bpb < best_val_bpb:
        best_val_bpb = result.val_bpb
    else:
        # ...otherwise revert to the previous version
        write_file("train.py", current_code)
```
The genius is in the constraints:
- Single metric: val_bpb (validation bits-per-byte) eliminates ambiguity
- 5-minute cycles: Fast feedback, many iterations
- Immutable prepare.py: Agent can't mess up data preprocessing
- program.md: Human sets direction, agent executes
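Since everything hinges on that one number, it is worth being precise about it. As a sketch (the repo's exact formula may differ), bits-per-byte can be derived from mean cross-entropy loss by converting nats to bits and normalizing by the byte length of the validation text, which makes results comparable across tokenizers:

```python
import math

def val_bpb(loss_nats_per_token: float, total_tokens: int, total_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) to bits-per-byte.

    bits = nats / ln(2); dividing by the byte length of the validation
    text normalizes away the choice of tokenizer.
    """
    total_bits = loss_nats_per_token * total_tokens / math.log(2)
    return total_bits / total_bytes

# Example: loss of 1.2 nats/token over 1000 tokens covering 4200 bytes
bpb = val_bpb(1.2, 1000, 4200)
```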
## Beyond ML: The SETI@home Vision
Karpathy referenced SETI@home — distributed agents worldwide running experiments independently and sharing results. He called it "research community emulation."
The ecosystem is already expanding:
- DarkMatter: Extended autoresearch variant
- Optimization Arena: Competitive experiment platform
- NanoClaw: Lightweight fork
- Community forks: Windows (RTX), Apple Silicon (M1-M4), smaller GPUs
## Getting Started
```bash
# Prerequisites:
# - NVIDIA GPU (20GB+ VRAM)
# - Python 3.10+
# - uv package manager

# Clone and set up
git clone https://github.com/karpathy/autoresearch
cd autoresearch

# Install dependencies with uv
uv sync

# Download the dataset (ClimbMix from Hugging Face)
python prepare.py

# Connect your coding agent (Claude Code, Cursor, etc.):
# point it at train.py and program.md, then let it run overnight.
```
## Limitations
Be aware of these before running:
- Goodhart's Law: optimizing a single metric invites gaming it; the agent can lower val_bpb without a real capability gain
- Overfitting: Short training + automated decisions = potential overfitting to validation set
- Single GPU only: No multi-GPU or distributed training support yet
- Narrow search space: Agent modifies only within train.py boundaries
- Scale transfer: Results on small experiments may not transfer to production scale
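One cheap mitigation for the first two points is to score every candidate change against a second, never-optimized held-out split and reject changes that improve val_bpb while degrading it. This is not part of autoresearch; the second split and the slack threshold are assumptions for illustration:

```python
def accept_change(val_bpb: float, heldout_bpb: float,
                  best_val: float, best_heldout: float,
                  slack: float = 0.02) -> bool:
    """Keep a change only if the optimized metric improves AND the
    untouched held-out split does not degrade beyond `slack` (2% by
    default) -- a cheap Goodhart's Law check.
    """
    improved = val_bpb < best_val
    not_gamed = heldout_bpb <= best_heldout * (1 + slack)
    return improved and not_gamed
```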
## What This Means for Developers
The autoresearch pattern isn't limited to ML research. The loop — hypothesize, modify, test, evaluate, keep/discard — applies to:
- Performance optimization (automated benchmarking loops)
- A/B testing at scale
- Code quality improvement
- Hyperparameter tuning
If you have a clear metric and a modifiable codebase, this pattern works.
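The pattern above can be sketched as a generic minimize-one-metric loop. Here `mutate` is a toy stand-in for the agent's code modification and the quadratic `score` is a placeholder for your real benchmark; both are illustrative, not part of autoresearch:

```python
import random

def optimize(candidate: dict, score, steps: int = 100) -> dict:
    """Generic hypothesize -> modify -> test -> keep/discard loop.

    `score` is the single metric to minimize (lower is better).
    """
    best, best_score = dict(candidate), score(candidate)
    for _ in range(steps):
        trial = mutate(best)   # hypothesis + modification
        s = score(trial)       # short evaluation run
        if s < best_score:     # keep only strict improvements
            best, best_score = trial, s
    return best

def mutate(params: dict) -> dict:
    """Toy stand-in for the agent: perturb one numeric parameter."""
    out = dict(params)
    key = random.choice(list(out))
    out[key] *= random.uniform(0.8, 1.2)
    return out

# Toy usage: "tune" a learning rate toward the optimum at 0.01
best = optimize({"lr": 0.1}, lambda p: (p["lr"] - 0.01) ** 2, steps=500)
```

The same skeleton fits any of the use cases above once `mutate` edits the right artifact (a config, a query plan, a code file) and `score` runs the right benchmark.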
Links:
- GitHub: autoresearch (50.7k stars, MIT)
- Fortune: The Karpathy Loop
- VentureBeat: autoresearch analysis
- DataCamp: autoresearch guide
Have you tried autoresearch? Would love to hear about results on different hardware setups.