정상록

Karpathy's autoresearch: How 630 Lines of Python Run 700 ML Experiments Overnight

TL;DR

Andrej Karpathy released autoresearch — 630 lines of Python that let an AI agent autonomously run hundreds of ML experiments while you sleep. 50.7k GitHub stars in 2 weeks. MIT license.

What is autoresearch?

The entire project is 3 files:

train.py     # ~630 lines, the agent modifies this
prepare.py   # Data preprocessing (immutable)
program.md   # Research direction (human-edited)

The loop is simple:

Agent reads train.py
  → Forms hypothesis
    → Modifies code
      → Runs 5-min training
        → Evaluates val_bpb
          → Keep (improved) or Discard (worse)
            → Repeat hundreds of times

That's it. No complex orchestration. No multi-GPU setup. One GPU, one metric, one loop.

The Results

Karpathy's run:

| Metric | Before | After |
| --- | --- | --- |
| Experiments | 0 | 700 (over 2 days) |
| Training time | 2.02h | 1.80h (~11% faster) |

Shopify CEO Tobi Lütke:

| Metric | Result |
| --- | --- |
| Experiments overnight | 37 |
| Performance improvement | 19% |
| Key finding | 0.8B model > 1.6B model |

A smaller model outperforming a model twice its size through agent-driven optimization — that's the headline.

The Karpathy Loop

Fortune coined the term. Karpathy himself put it bluntly:

"All LLM frontier labs will do this. It's the final boss battle."

His X post hit 8.6M views. The concept resonated because it's both simple and powerful.

Architecture Deep Dive

# Simplified autoresearch loop (conceptual pseudocode)

best_val_bpb = float("inf")  # best score seen so far

while True:
    # Agent reads the current train.py
    current_code = read_file("train.py")

    # Forms a hypothesis based on the current state and program.md
    hypothesis = agent.analyze(current_code, program_md)

    # Modifies the code
    modified_code = agent.modify(current_code, hypothesis)
    write_file("train.py", modified_code)

    # Short training run (5 minutes)
    result = run_training(timeout=300)

    # Evaluate: keep the change only if val_bpb improved
    if result.val_bpb < best_val_bpb:
        best_val_bpb = result.val_bpb
    else:
        # Revert to the previous version
        write_file("train.py", current_code)

The genius is in the constraints:

  • Single metric: val_bpb (validation bits-per-byte) eliminates ambiguity
  • 5-minute cycles: Fast feedback, many iterations
  • Immutable prepare.py: Agent can't mess up data preprocessing
  • program.md: Human sets direction, agent executes
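For readers unfamiliar with the metric: bits-per-byte is just the model's mean cross-entropy loss converted from nats per token to bits, then rescaled to bytes of raw text. A minimal sketch of the conversion (the function name and arguments are my own, not from the repo):

```python
import math

def bits_per_byte(mean_loss_nats: float, num_tokens: int, num_bytes: int) -> float:
    """Convert mean cross-entropy (nats per token) to bits-per-byte.

    Divide by ln(2) to go from nats to bits, then rescale from
    per-token to per-byte using the dataset's token and byte counts.
    """
    bits_per_token = mean_loss_nats / math.log(2)
    return bits_per_token * num_tokens / num_bytes

# A loss of ln(2) nats/token on data where one token covers
# exactly one byte works out to 1.0 bit per byte:
print(bits_per_byte(math.log(2), 1000, 1000))  # → 1.0
```

Lower is better, and because it is normalized by bytes rather than tokens, it stays comparable even if an experiment changes the tokenizer.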

Beyond ML: The SETI@home Vision

Karpathy referenced SETI@home — distributed agents worldwide running experiments independently and sharing results. He called it "research community emulation."

The ecosystem is already expanding:

  • DarkMatter: Extended autoresearch variant
  • Optimization Arena: Competitive experiment platform
  • NanoClaw: Lightweight fork
  • Community forks: Windows (RTX), Apple Silicon (M1-M4), smaller GPUs

Getting Started

# Prerequisites
# - NVIDIA GPU (20GB+ VRAM)
# - Python 3.10+
# - uv package manager

# Clone and setup
git clone https://github.com/karpathy/autoresearch
cd autoresearch

# Install with uv
uv sync

# Download dataset (ClimbMix from Hugging Face)
python prepare.py

# Connect your coding agent (Claude Code, Cursor, etc.)
# Point it at train.py and program.md
# Let it run overnight

Limitations

Be aware of these before running:

  1. Goodhart's Law: Single metric optimization can game the metric without real improvement
  2. Overfitting: Short training + automated decisions = potential overfitting to validation set
  3. Single GPU only: No multi-GPU or distributed training support yet
  4. Narrow search space: Agent modifies only within train.py boundaries
  5. Scale transfer: Results on small experiments may not transfer to production scale
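One cheap mitigation for points 1 and 2 is to hold out a second validation split and accept a change only when both splits improve by some margin. This two-split guard is my own sketch, not part of autoresearch:

```python
def should_keep(new: dict[str, float], best: dict[str, float],
                margin: float = 0.001) -> bool:
    """Accept a change only if val_bpb improves on *both* held-out
    splits by at least `margin`, reducing single-split overfitting.

    `new` and `best` map split names ("val", "val2") to bpb values.
    """
    return all(new[s] <= best[s] - margin for s in ("val", "val2"))

print(should_keep({"val": 0.90, "val2": 0.92},
                  {"val": 0.95, "val2": 0.93}))  # → True
print(should_keep({"val": 0.90, "val2": 0.94},
                  {"val": 0.95, "val2": 0.93}))  # → False (val2 got worse)
```

The margin also filters out accepts driven purely by run-to-run noise in short 5-minute trainings.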

What This Means for Developers

The autoresearch pattern isn't limited to ML research. The loop — hypothesize, modify, test, evaluate, keep/discard — applies to:

  • Performance optimization (automated benchmarking loops)
  • A/B testing at scale
  • Code quality improvement
  • Hyperparameter tuning

If you have a clear metric and a modifiable codebase, this pattern works.
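The pattern is small enough to write down generically: pass in an `evaluate` metric (lower is better) and a `mutate` step, and you get a minimal hill-climbing version of the loop. Here the "agent" is reduced to random mutation, a deliberate simplification:

```python
import random

def optimize(evaluate, mutate, initial, iterations: int = 100):
    """Generic hypothesize-modify-test-keep/discard loop.

    `evaluate(cfg) -> float` scores a config (lower is better);
    `mutate(cfg) -> cfg` proposes a change. Changes that don't
    improve the score are discarded, exactly as in autoresearch.
    """
    best_cfg, best_score = initial, evaluate(initial)
    for _ in range(iterations):
        candidate = mutate(best_cfg)
        score = evaluate(candidate)
        if score < best_score:          # keep
            best_cfg, best_score = candidate, score
        # else: discard (best_cfg unchanged)
    return best_cfg, best_score

# Toy usage: minimize (x - 3)^2 by randomly perturbing x.
random.seed(0)
cfg, score = optimize(
    evaluate=lambda c: (c["x"] - 3.0) ** 2,
    mutate=lambda c: {"x": c["x"] + random.uniform(-0.5, 0.5)},
    initial={"x": 0.0},
    iterations=500,
)
print(cfg["x"])  # accepted moves only improve, so x drifts toward 3.0
```

Swap the lambdas for "run the benchmark suite" and "apply the agent's patch" and you have the ML-free version of the same loop.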


Have you tried autoresearch? Would love to hear about results on different hardware setups.
