Andrej Karpathy Is Automating AI Research Itself — Here's What That Actually Means
If you missed it: four days ago, Andrej Karpathy quietly pushed a repo called autoresearch to GitHub. By Tuesday morning it had nearly 20,000 stars. That's not a fluke — it's recognition that something genuinely different just happened.
The pitch, in his own words: give an AI agent a small but real LLM training setup and let it experiment autonomously overnight. It modifies the code, trains for 5 minutes, checks if the result improved, keeps or discards, and repeats. You wake up in the morning to a log of experiments and (hopefully) a better model.
That sounds deceptively simple. But sit with it for a minute.
What autoresearch Actually Does
The repo is deliberately minimal. Three files that matter:
- prepare.py — fixed constants, data prep, runtime utilities. The agent never touches this.
- train.py — the full GPT model, the optimizer (Muon + AdamW), and the training loop. This is what the agent edits.
- program.md — the "skill file" you write as a human, directing the agent's behavior.
You spin up Claude, Codex, or whatever LLM you prefer in this repo, point it at program.md, and let it loose. The agent is given a simple goal: lower the val_bpb (validation bits per byte) metric. Lower is better. It can change anything in train.py — architecture, optimizer settings, batch size, attention patterns, everything.
The clever constraint is the fixed 5-minute time budget. Every experiment runs for exactly five minutes of wall-clock training time, regardless of what the agent changed. This makes experiments directly comparable — if the agent decides to try a larger model, a different optimizer, or a novel attention kernel, the comparison is still fair because the time budget is identical. It also means you get roughly 12 experiments per hour, or about 100 experiments overnight on a single H100.
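The keep-or-discard loop is simple enough to sketch. Here's a toy Python outline of the greedy hill-climb described above — my own illustration, not the repo's actual code, with "5 minutes of training" stubbed out as a scoring function over hyperparameter configs:

```python
def run_experiment(config):
    """Stand-in for a fixed 5-minute training run that reports val_bpb.
    Toy objective: bpb is best when lr is near 0.02. Lower is better."""
    return abs(config["lr"] - 0.02) + 0.8

def research_loop(baseline, proposals):
    """Greedy keep-or-discard: try each proposed edit under the same
    budget, keep it only if the validation metric strictly improves."""
    best_cfg = dict(baseline)
    best_bpb = run_experiment(best_cfg)
    log = []
    for edit in proposals:                # each edit is a small config change
        candidate = {**best_cfg, **edit}
        bpb = run_experiment(candidate)
        kept = bpb < best_bpb             # keep only strict improvements
        log.append((edit, round(bpb, 4), kept))
        if kept:
            best_cfg, best_bpb = candidate, bpb
    return best_cfg, best_bpb, log

baseline = {"lr": 0.05}
proposals = [{"lr": 0.03}, {"lr": 0.1}, {"lr": 0.02}]
cfg, bpb, log = research_loop(baseline, proposals)
print(cfg, bpb)  # the second proposal is discarded; the others are kept
```

The fixed budget is what makes the `bpb < best_bpb` comparison meaningful: every candidate got the same wall-clock allowance, so the metric difference is attributable to the edit.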
One hundred experiments. While you sleep. On one GPU.
Let's compare: a typical human ML researcher might run five or ten thoughtful experiments in a good week, mostly because running experiments is slow, context-switching is expensive, and humans have the audacity to require sleep. The agent doesn't.
The Architecture Is What Makes It Interesting
The beauty of this design isn't just automation — it's the structure of the automation.
Karpathy wrote in the README's opening (which reads more like a sci-fi story than a README):
One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ritual of "group meeting". That era is long gone.
Cheeky, but he's making a real point. The research loop that humans do — hypothesize, implement, train, evaluate, iterate — is exactly the loop the agent runs. And by making it explicit in code (program.md), it becomes programmable. You can iterate on your research process the same way you'd iterate on any software.
The program.md file is basically a lightweight "research culture document." It tells the agent what to try, what to avoid, what the evaluation metric means, and how to log results. And here's the part that's easy to miss: you can improve program.md over time. Your research org becomes a codebase that gets better.
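To make "research culture document" concrete, a minimal program.md might look something like this. This is entirely illustrative — not the file Karpathy ships — but it shows the kind of intuitions the file can encode:

```markdown
# Goal
Lower val_bpb. Only edits to train.py count; never touch prepare.py.

# Heuristics
- One change per experiment, so results stay attributable.
- Don't bother with dropout at this model scale.
- Try optimizer and learning-rate changes before architecture changes.

# Logging
After each run, record: what changed, val_bpb before/after, keep/discard.
```

The point is that every line here is editable by a human between runs — the research process itself becomes a reviewable, versionable artifact.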
Karpathy's setup is intentionally bare-bones. But the design implies you could build:
- A program.md that encodes months of accumulated intuitions ("don't bother with dropout on small models," "always try Muon before Adam," etc.)
- Multiple agents with different specializations running simultaneously
- A feedback loop where successful experiments update the program.md automatically
Which brings us to the second repo.
AgentHub: "GitHub Is For Humans"
A day after autoresearch dropped, Karpathy pushed agenthub. The description is one of the best I've read in a while:
GitHub is for humans. AgentHub is for agents.
AgentHub is a collaboration platform for agent swarms working on the same codebase. The vision: instead of one agent running experiments on your machine overnight, thousands of agents across the internet contribute to a shared research effort. Each agent pushes its commits (and results) to a shared bare git repo. Instead of branches and PRs, there's a sprawling DAG of commits going in every direction, with a message board for agents to coordinate.
The architecture is impressively scrappy: one Go binary, one SQLite database, one bare git repo. A thin CLI called ah wraps the API. Agents push code via git bundles, can browse the commit DAG, find frontier leaves (commits with no children — unexplored directions), and post to a message board where they coordinate.
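The "frontier leaves" idea is easy to state in code: given the commit DAG as a mapping from each commit to its parents, a frontier leaf is any commit that no other commit lists as a parent. A toy sketch (my own illustration, not AgentHub's implementation):

```python
def frontier_leaves(parents):
    """parents maps each commit id to a list of its parent ids.
    A frontier leaf is a commit nothing else descends from --
    an unexplored direction an agent could pick up."""
    has_child = set()
    for commit, ps in parents.items():
        has_child.update(ps)
    return sorted(c for c in parents if c not in has_child)

# A tiny DAG: root -> a -> b, and root -> c. Frontier leaves: b and c.
dag = {"root": [], "a": ["root"], "b": ["a"], "c": ["root"]}
print(frontier_leaves(dag))  # ['b', 'c']
```

In a swarm setting, each agent would pick a leaf, build on it, and push a new commit — extending the frontier rather than piling onto one branch.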
The message board piece is subtle and important. Agents don't just push code in isolation — they can post results, hypotheses, failures, and coordination notes. It's asynchronous communication between non-human researchers. That's... genuinely new infrastructure.
As of this writing, AgentHub has crossed 1,000 stars in under 48 hours.
Why This Matters More Than Another "AI Coding Tool"
We've had a lot of "AI makes coding faster" announcements. This isn't that.
The distinction is that autoresearch targets the research loop itself, not the implementation loop. It's not helping a human write code faster. It's replacing the human in the experimental cycle entirely.
Consider what that means for the economics of ML research. Right now, running serious hyperparameter searches requires either:
- Enormous compute budgets (grid searches across cloud GPU clusters), or
- Human intuition to narrow the search space intelligently
Karpathy's setup offers a third option: cheap, systematic, agentic exploration. The agent brings intuition (from its training data — billions of tokens of ML papers, code, and discussions) and applies it with machine-scale persistence. It doesn't get bored. It doesn't lose track of what it tried. It doesn't go to lunch.
The fixed time budget is particularly elegant for making this economically tractable. An H100 rents for roughly $2-3/hour. At 12 experiments/hour, you're paying maybe 20 cents per experiment. A hundred experiments overnight costs roughly $20. That's within reach of individual researchers, not just labs.
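The back-of-envelope math is worth writing down explicitly. The rates here are the assumptions from the paragraph above, not quoted prices:

```python
gpu_rate_per_hour = 2.50        # assumed H100 rental rate, USD
experiments_per_hour = 60 / 5   # fixed 5-minute budget -> 12 per hour
cost_per_experiment = gpu_rate_per_hour / experiments_per_hour
overnight = 100 * cost_per_experiment  # ~100 experiments in an 8-hour night

print(f"${cost_per_experiment:.2f} per experiment, ${overnight:.0f} overnight")
```

Even doubling the GPU rate keeps an overnight run in the tens of dollars — the budget of a hobbyist, not a lab.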
The "Boring" Practical Reality
Let me be honest about the current limitations, because hype without grounding is annoying.
It requires an H100 (or equivalent). The autoresearch repo runs on a single NVIDIA GPU, and Karpathy is explicit that it's tested on H100s. The community has already started pushing forks for macOS (MLX, with MPS support) and Windows (RTX GPUs), but those are early. If you want to reproduce the headline numbers, you're looking at a beefy cloud instance.
The benchmark is small. autoresearch trains on a curated text corpus with a fixed training window. The improvements you find won't necessarily transfer to GPT-4-scale models or completely different architectures. This is Karpathy's explicit framing — it's a nanochat training setup, deliberately simplified. The question of whether autonomous research on toy-scale setups generalizes to frontier-scale research is real and open.
The agent still needs good program.md authorship. The human is still in the loop — just upstream. You're not writing train.py, but you're writing the instructions that shape the agent's behavior. Getting that right is its own skill, and a poorly written program.md will produce chaotic or useless experiments.
What Comes Next
The community response has been striking. Within four days:
- Nearly 20,000 stars on autoresearch
- Community forks for macOS and Windows
- 1,000+ stars on agenthub within 48 hours of dropping
Karpathy linked to a tweet with more context, and the thread is worth reading. But the repo README itself is actually the richest artifact — it's written with unusual clarity about the design philosophy and tradeoffs.
The broader question this project raises: what happens when the most productive "researcher" in the world is an AI agent running on rented compute? What does academic ML research look like when even a modest team can run thousands of experiments a month for a few hundred dollars?
My guess: we see a dramatic shift toward "meta-research" — humans figuring out what questions to ask (the program.md layer), and agents handling the empirical exploration. Which is honestly how the best researchers already operate. They think at the level of experiments, not implementations.
Karpathy's framing in the README hints at this: the program.md is "intentionally kept as a bare bones baseline, though it's obvious how one would iterate on it over time to find the 'research org code' that achieves the fastest research progress."
Research org code. The meta-level artifact that tells your swarm of AI researchers how to behave. That's a genuinely new kind of thing to build.
Try It Yourself
If you have an NVIDIA GPU and Python 3.10+, the quickstart is four commands:
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
uv run prepare.py
uv run train.py # manually verify your setup, then...
Then open your LLM coding tool of choice in the repo directory, point it at program.md, and watch it go.
For Apple Silicon users, check out the community forks — autoresearch-mlx is the most active right now.
If you're more interested in the collaborative/distributed angle, agenthub is still early but the architecture is clean and the concept is compelling. It's one Go binary and a git repo. Easy to spin up locally and experiment with.
The Bottom Line
autoresearch is one of those repos that's simple enough to understand in an afternoon and deep enough to think about for months. The question it poses — what if AI ran its own experiments? — isn't new. But this is the first time I've seen it implemented with enough clarity and restraint to actually be usable.
The irony Karpathy leans into in his opening paragraph is real: we're building AI systems to do AI research. The ouroboros has arrived, it runs on Python, and it's MIT licensed.
Whether or not this specific project becomes the standard framework for agentic research, the direction is clear. The research loop is becoming automated. The people who figure out how to write good program.md files — how to pose good research questions at the meta level — will have an enormous advantage.
Worth paying attention to.
Links: autoresearch · agenthub · Karpathy's tweet thread