Andrej Karpathy Just Open-Sourced the Future of AI Research: Meet autoresearch
Five days ago, Andrej Karpathy dropped a repo on GitHub that already has nearly 25,000 stars. It's called autoresearch, and if you haven't looked at it yet, you should — because it's less a tool and more a glimpse at what AI research is going to look like for the next decade.
The pitch is deceptively simple: give an AI agent a small but real LLM training setup and let it experiment autonomously overnight. You go to sleep. The agent modifies training code, runs 5-minute experiments, checks whether validation loss improved, keeps or discards the change, and repeats. You wake up to a log of ~100 experiments and (hopefully) a better model.
That's it. That's the whole thing. And it's kind of terrifying how obvious it sounds in retrospect.
The Setup: Deliberately Minimal, Deliberately Clever
The repo is built around three files:
- `prepare.py` — fixed constants, one-time data prep (downloads training data, trains a BPE tokenizer), and the evaluation harness. Do not modify. This is the ground truth.
- `train.py` — the single file the agent edits. It contains a full GPT model (with Flash Attention 3, RoPE, GQA, ResFormer value residuals, RMSNorm, a Muon + AdamW hybrid optimizer, sliding window attention). Everything is fair game.
- `program.md` — the agent's marching orders. Written in Markdown, by you, the human. This is the only thing you're really "programming" anymore.
The metric is val_bpb (validation bits per byte) — lower is better, vocabulary-size-independent so you can compare architectures fairly even if the agent changes vocab_size.
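Bits per byte falls straight out of the usual token-level cross-entropy; here's a minimal sketch of the conversion (the repo's exact accounting may differ):

```python
import math

def val_bpb(total_ce_nats: float, total_bytes: int) -> float:
    """Convert summed validation cross-entropy (in nats, over all tokens)
    into bits per byte. Normalizing by the raw byte count of the text,
    rather than the token count, is what makes the metric independent
    of vocab_size and tokenizer choice."""
    return total_ce_nats / (total_bytes * math.log(2))
```

A tokenizer with a bigger vocabulary produces fewer tokens, but the byte count of the validation text is fixed, so two runs with different tokenizers still land on the same scale.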
The training time budget is fixed at 5 minutes (wall clock, excluding startup/compilation). This is one of the most interesting design decisions in the repo.
Why a Fixed Time Budget Changes Everything
At first glance, a fixed 5-minute budget seems like a constraint. It's actually what makes the whole thing work.
With a fixed budget:
- All experiments are directly comparable. If the agent changes the model size, the optimizer, the batch size — doesn't matter. Every run gets 5 minutes on your GPU. The metric is apples-to-apples.
- The system auto-optimizes for your hardware. A smaller, more efficient model that trains faster might get more steps in 5 minutes and actually outperform a larger one on your specific GPU. The autoresearch loop will discover this.
- Rate of iteration stays predictable. ~12 experiments per hour, ~100 while you sleep. You can reason about the search budget in concrete terms.
The tradeoff is that your results won't be comparable to someone else running on different hardware. But Karpathy's framing here is: the goal isn't a universal benchmark, it's the best model for your machine in your time budget. That's actually the right goal for most practical use cases.
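The fixed budget is also trivially enforceable in code. A sketch of the idea (not the repo's actual training loop):

```python
import time

TRAIN_BUDGET_SEC = 5 * 60  # fixed wall clock; startup/compilation excluded

def train_with_budget(step_fn, budget_sec: float = TRAIN_BUDGET_SEC) -> int:
    """Run training steps until the wall-clock budget is spent.
    A faster model/config simply gets more steps in the same window,
    which is exactly what makes runs comparable on one machine."""
    start = time.monotonic()  # start the clock after setup/compile
    steps = 0
    while time.monotonic() - start < budget_sec:
        step_fn()
        steps += 1
    return steps
```

Note the loop counts steps rather than fixing them in advance: an architecture change that halves step time automatically doubles the data seen.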
The Architecture the Agent Is Iterating On
Let's look at what train.py actually contains, because it's not trivial — and the fact that an AI agent is being asked to improve it makes it all the more impressive.
The base model is a GPT-style transformer with some modern upgrades:
```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    sequence_len: int = 2048
    vocab_size: int = 32768
    n_layer: int = 12
    n_head: int = 6
    n_kv_head: int = 6
    n_embd: int = 768
    window_pattern: str = "SSSL"
```
Notable things baked in from the start:
- Flash Attention 3 via kernels, with automatic Hopper vs non-Hopper routing (H100 gets FA3, everything else gets the kernels-community fallback)
- Grouped Query Attention (GQA) — `n_kv_head` can be smaller than `n_head`, as in Llama 3
- RMSNorm on Q/K — QK-norm, the trick that stabilizes attention at scale
- Value Embeddings (ResFormer) — alternating layers get a value residual path with input-dependent gating; a recent technique from the ResFormer paper that improves training stability
- Sliding window attention — the `window_pattern` string ("SSSL") controls which layers get local vs global attention
- RoPE — Rotary Position Embeddings, standard at this point
- Muon + AdamW — a hybrid optimizer: Muon (Newton-Schulz orthogonalized SGD) for weight matrices, AdamW for embeddings and everything else. This combo has been quietly putting up strong numbers in nanoscale training benchmarks
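The `window_pattern` string is easiest to picture as a per-layer schedule. A hypothetical decoding — the window sizes and the handling of layer counts that don't divide the pattern are my assumptions, not the repo's constants:

```python
def layer_windows(n_layer: int, pattern: str = "SSSL",
                  short: int = 1024, seq_len: int = 2048) -> list[int]:
    """Tile the pattern across layers: 'S' layers get a short sliding
    window, 'L' layers attend over the full sequence. Assumes the
    pattern repeats cyclically over the layer stack."""
    return [short if pattern[i % len(pattern)] == "S" else seq_len
            for i in range(n_layer)]
```

Under this reading, the default 12-layer config gets a full-attention layer at every fourth position and cheap local attention everywhere else.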
The agent can modify any or all of this. And importantly, the repo's program.md gives the agent a simplicity criterion: a tiny improvement that adds 20 lines of ugly code isn't worth it. Complexity must earn its keep.
What "Programming in Markdown" Actually Means
The most philosophically interesting part of autoresearch is program.md. It's the agent's instructions, written in natural language. Karpathy describes it as a "super lightweight skill." Here's what it tells the agent to do in a new experiment session:
- Agree on a run tag (e.g., `mar11`)
- Create a branch: `autoresearch/<tag>`
- Read all in-scope files for context
- Initialize a `results.tsv` with a header row
- Run the baseline training first to establish the starting val_bpb
- Iterate: propose a change, modify `train.py`, run, record the result, keep or discard
The agent cannot install new packages. It cannot modify prepare.py. It cannot cheat the evaluation. The sandbox is tight. But within train.py, everything is fair game.
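The keep-or-discard bookkeeping against `results.tsv` can be sketched like this — the column schema is my guess, since program.md only mandates a header row:

```python
import csv

def record_experiment(path: str, tag: str, idx: int, change: str,
                      bpb: float, best_bpb: float) -> float:
    """Append one experiment row and apply the acceptance rule:
    a change is kept only if it lowers val_bpb (lower is better).
    Returns the new best val_bpb for the next iteration."""
    keep = bpb < best_bpb
    with open(path, "a", newline="") as f:
        csv.writer(f, delimiter="\t").writerow(
            [tag, idx, change, f"{bpb:.4f}", "keep" if keep else "discard"])
    return bpb if keep else best_bpb
```

Greedy hill-climbing on a single scalar is the whole acceptance policy; everything interesting lives in what change the agent proposes next.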
This is what "programming the agent" looks like in 2026. You're not writing Python. You're writing research organization documentation, and the agents follow it. Karpathy's readme even opens with a slightly satirical note:
"One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ritual of 'group meeting'. That era is long gone."
That's a joke. For now. Give it a year.
The Real Implication: Research Speed Is Now Hardware-Bound (Mostly)
Here's the thing that keeps jumping out at me about autoresearch. We've been talking abstractly for years about "AI accelerating AI research." This repo makes it concrete and runnable on your laptop's GPU (or rather, your H100, if you've got one).
The questions shift dramatically:
Old world: How many researchers can you hire? How good are they? How fast do they iterate?
New world: How many GPUs can you spin up? How good is your program.md? How fast is your evaluation loop?
The humans are now writing the research organization spec — the program.md — and letting agents execute within it. Karpathy even hints at the natural next step: the best program.md is itself something you can autoresearch for. Meta-research. Agents finding the best way to run agents.
Getting Started: What You Actually Need
Hardware: A single NVIDIA GPU. The repo says H100 specifically, but the codebase routes to kernels-community/flash-attn3 for non-Hopper GPUs, so A100s and others should work. There's already a fork specifically targeting a single RTX 3080.
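The Hopper-vs-everything-else routing boils down to a compute-capability check. A sketch of that decision — the return labels are illustrative, not the repo's identifiers:

```python
def pick_attention_backend(compute_capability: tuple[int, int]) -> str:
    """Route by GPU generation: Hopper (compute capability 9.x, e.g. H100)
    gets Flash Attention 3; older architectures fall back to the
    kernels-community flash-attn3 implementation."""
    major, _ = compute_capability
    return "fa3-hopper" if major >= 9 else "kernels-community/flash-attn3"
```

On a real machine you'd feed this `torch.cuda.get_device_capability()`, which returns the `(major, minor)` tuple.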
Software: Python 3.10+, uv (the fast Rust-based Python package manager).
```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Setup
uv sync
uv run prepare.py  # downloads data, trains tokenizer (~2 min)

# Test a single run
uv run train.py

# Then: point Claude/Codex at the repo, reference program.md, let it go
```
The last step is "spin up your AI coding assistant of choice in this directory, disable all permissions except file editing and shell exec, and prompt it to start a new autoresearch run." Claude, Codex, whatever you're using — the program.md functions as the skill spec.
The Community Response: Already Forking and Building
In five days, 3,154 forks. People are already:
- Building web-based dashboards to visualize `results.tsv` across runs
- Creating forks targeting specific consumer GPUs (3080, 4090, etc.)
- Building "SelfEvo" — self-evolving AI training systems inspired by autoresearch, but adding support for Gemini, OpenAI, and Claude
- Writing meta-repos that distill portable patterns from autoresearch for other domains beyond LLM training
The GitHub issue tracker is moving fast. Two bug fixes landed just this morning (March 11): an infinite loop guard when no training shards exist, and a fix for NaN loss not being caught by the fast-fail check. Karpathy is actively maintaining it.
What This Actually Means for ML Practitioners
If you're doing ML research or applied ML, here's the honest take:
You should run this at least once. Not necessarily because you'll get a breakthrough model out of it, but because the experience of watching an AI agent run experiments autonomously is clarifying. It changes your mental model of what "doing research" is.
The 5-minute budget is a useful primitive. Adapt this pattern for your own work. Fixed time budgets + automated metric evaluation + agent-in-a-loop = a research acceleration pattern that generalizes well beyond LLM training.
The program.md is the real moat. Hardware and models are increasingly commoditized. The people who figure out how to write good research organization specs — the program.md equivalent for their domain — will compound their advantages.
The simplicity criterion matters. Karpathy's insistence that the agent prefer simpler solutions even at small accuracy cost is not just aesthetic. Complexity in ML training code is where bugs hide, where reproducibility breaks, and where future iteration becomes painful. It's good research hygiene encoded into the agent instructions.
The Bottom Line
autoresearch hit nearly 25,000 stars in 5 days because it's real. It's runnable. It's not a demo or a paper — it's a workflow. And it feels like one of those rare open-source releases that doesn't just ship a tool, but ships a mental model shift.
The shift is: ML research is becoming more like devops. You don't write every line of the model yourself. You write the evaluation harness, the constraints, the research org spec — and then you run the loop.
Karpathy's comment that this "is the story of how it all began" is funnier the more you think about it. The 10,205th generation of the codebase may be a joke. Today.
The code is at github.com/karpathy/autoresearch. Go run an experiment tonight.