Andrej Karpathy's autoresearch repo has exactly three files that matter.
Two of them are the agent's domain. The agent edits the training code, runs five-minute experiments, checks validation loss, keeps or discards the result, and loops. All night. Without you.
The third file is yours. It's a Markdown document. You write it once. Then you go to sleep.
83 experiments while you sleep. 15 kept improvements. The agent never touches your file. You never touch its file. The arena is the interface between you.
That Markdown file — program.md — is the most important artifact in the system. Not the training code. Not the model architecture. The document that tells the agent how to think about the problem.
The bottleneck isn't compute. It's your program.md.
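The overnight loop is simple enough to sketch. Everything below is a toy stand-in (the quadratic loss surface, the learning-rate mutation, the numbers), not autoresearch's actual code, but the keep-or-discard shape is the same:

```python
import random

def run_experiment(lr):
    """Stand-in for one fixed-budget training run. Returns a score on a
    fixed metric (lower is better). The quadratic is a toy loss surface
    whose best learning rate is 0.003."""
    return (lr - 0.003) ** 2 + 0.8

def overnight_loop(n_experiments=83, seed=0):
    rng = random.Random(seed)
    lr = 0.01                                   # the agent's current setup
    best = run_experiment(lr)                   # baseline score
    kept = 0
    for _ in range(n_experiments):
        candidate = lr * rng.uniform(0.5, 1.5)  # the agent edits the code
        score = run_experiment(candidate)       # same budget, same metric
        if score < best:                        # keep or discard
            lr, best, kept = candidate, score, kept + 1
    return kept, best
```

Run it and most candidates lose while a few stick, which is the shape of the night: the agent proposes, the fixed metric disposes.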
## The Fixed Metric Problem
Here's what most people miss about autoresearch: the 5-minute training budget.
Every run gets exactly 5 minutes. Not "roughly 5 minutes." Not "5 minutes unless the experiment looks promising." Exactly 5 minutes, every time, judged on the same vocabulary-size-independent metric.
That fixed budget is doing most of the work. Without it, you have 100 data points that can't be meaningfully compared. A run that trained for 8 minutes on a smaller architecture versus a run that trained for 3 minutes on a larger one — which was better? You can't know. The comparison breaks.
The fixed metric is what makes the arena work. It's the one thing the agent cannot change. The agent can modify the neural network architecture, the optimizer, all the hyperparameters. It cannot change the evaluation criterion. That stays human-defined.
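The division of labor is easy to state in code. In this sketch the agent supplies `train_step` and the harness owns both the clock and the metric; the names and the evaluation criterion are illustrative assumptions, not the real harness:

```python
import time

BUDGET_SECONDS = 300     # exactly 5 minutes, every run, no exceptions

def evaluate(state):
    """The human-defined metric. The agent never edits this function."""
    return state.get("val_loss", float("inf"))

def run(train_step, budget=BUDGET_SECONDS):
    """Run the agent's training code under a fixed wall-clock budget,
    then score it with the fixed metric. Every run is comparable
    because neither the budget nor the metric ever varies."""
    state = {}
    deadline = time.monotonic() + budget
    while time.monotonic() < deadline:
        train_step(state)                 # the agent's territory
    return evaluate(state)                # the human's territory
```

The agent can put anything it likes inside `train_step`. It cannot touch `evaluate` or the deadline, and that is the whole point.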
This is the design question every arena has to answer: what is the one thing that cannot move?
Without an anchoring criterion, the arena has no walls. The agent optimizes toward a target that is also shifting. You wake up to 100 experiments that can't be compared and a model that may or may not be better than when you started.
The fixed metric is not a technical parameter. It is a judgment about what success means. Encoded into architecture before the experiments begin.
## The Same Pattern Across Scales
Karpathy built this for ML research. The pattern is older and runs deeper.
A developer working with Claude Code creates a CLAUDE.md file — not a document but a logical router. Hard rules at the top. Decision trees for ambiguous cases. Explicit end conditions before any session starts. The agent reads it at boot. The agent operates inside it. The developer never touches the code directly — they maintain the specification that shapes how the agent approaches the code.
A team running agents on a shared codebase defines mode separation — Discover, Plan, Build, Review, Bugfix, Loop. Each mode is a different arena with different rules. You don't go straight to Build. You move through Discover first. The modes are the walls. They exist so the agent can't run indefinitely in the wrong direction.
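A transition table makes the walls literal. The mode names come from the paragraph above; the allowed edges and the enforcement class are a hypothetical sketch, not any particular team's tooling:

```python
# Allowed transitions between modes; anything else is rejected.
TRANSITIONS = {
    "Discover": {"Plan"},
    "Plan":     {"Build", "Discover"},
    "Build":    {"Review"},
    "Review":   {"Bugfix", "Loop"},
    "Bugfix":   {"Review"},
    "Loop":     {"Discover"},
}

class Arena:
    """The modes are the walls: the agent moves only along allowed edges."""
    def __init__(self):
        self.mode = "Discover"            # you don't go straight to Build

    def move(self, target):
        if target not in TRANSITIONS[self.mode]:
            raise ValueError(f"illegal transition: {self.mode} -> {target}")
        self.mode = target
        return self.mode
```

The agent never decides whether a transition is legal. The table does, and the table is human-maintained.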
An organization deploying agents across workflows builds a governance framework — authority boundaries, escalation paths, audit trails. What the agent can do alone. What requires sign-off. What gets logged. The governance layer is an arena at institutional scale.
Research lab. Individual developer. Team. Organization. All arenas. All the same design question: what are the walls, what is the fixed metric, and who defines them?
The human artifact at each level:
- program.md tells the agent how to think about research
- CLAUDE.md tells the agent how to think about code
- TASK_CONTRACT.md defines done before the session starts
- The governance framework defines authorization scope before the agent acts
These are all the same file. Different domains, different scales, same function. The specification that shapes how the agent thinks is now the primary artifact of human creative work.
## What Happens Without an Arena
Victor Taelin wanted a feature built. He gave his agents two days and a budget. The agents ran. They were productive by every execution metric — they wrote code, ran tests, iterated on output. They were thorough. They were fast.
He spent $1,000. He got eighteen rounds of wrong work.
There was no fixed metric. There was no anchoring criterion. There was no arena. The agents ran experiments inside a problem that wasn't well-defined, and they optimized efficiently toward the wrong answer because nothing in the system could tell them they were answering the wrong question.
This is the failure mode the arena prevents. Not incompetence — the agents were competent. Not laziness — they ran around the clock for two days. The failure was the absence of walls. Without an arena, the agent has no way to know whether it's making real progress or just making commits.
The bash loop version of this is Ralph Wiggum — feed a prompt to an agent, loop indefinitely, keep whatever emerges. It works when the prompt is precise enough to function as an arena. It fails when the prompt is vague enough to contain the wrong answer alongside the right one.
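In pseudocode the pattern is a bare loop plus an optional acceptance predicate. Here `agent` and `accept` are hypothetical stand-ins; the permissive default predicate is exactly the no-walls failure mode:

```python
def ralph_wiggum_loop(agent, prompt, rounds, accept=lambda output: True):
    """Feed the same prompt to the agent, loop, keep whatever `accept`
    lets through. The default predicate keeps everything that emerges,
    which is the failure mode: no acceptance criterion, no walls."""
    kept = []
    for _ in range(rounds):
        output = agent(prompt)
        if accept(output):
            kept.append(output)
    return kept
```

The precision lives entirely in `prompt` and `accept`. A vague prompt with the default predicate is Taelin's eighteen rounds of wrong work; a precise prompt with a real predicate is an arena.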
The discipline is not in the agent. The discipline is in the specification.
## The Recursive Problem
Andrej Karpathy's program.md is already 90% AI-written. He said so himself.
Someone in the thread asked: why not let the agents iterate on the prompt too? If the agent can improve the training code, why can't it improve the specification that guides the training?
It's the right question. And the answer is: it can, but only if something else stays fixed.
If program.md iterates and the training metric iterates, you have a system optimizing toward a target that is also rewriting its own success criteria. That's not research. That's drift.
Karpathy's fix is val_bpb — the validation metric that doesn't move. The agent can rewrite everything else. The evaluation criterion stays human-defined.
This is the anchoring requirement stated as a design principle: recursive self-improvement requires at least one layer that cannot self-improve. The evaluation criterion for the layer below has to come from the layer above.
Foundation's answer to this is provenance. The evaluator decides what knowledge gets promoted to semantic memory — but the criteria for evaluation are human-defined and human-maintained. The agent generates knowledge candidates. The human defines what counts as knowledge worth keeping. That layer doesn't automate.
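The promotion gate can be sketched as a predicate the agent cannot edit. The criteria and the memory structure below are illustrative assumptions, not Foundation's actual design:

```python
def human_criteria(candidate):
    """Human-defined and human-maintained. This is the layer that
    does not automate."""
    return (candidate.get("verified", False)
            and candidate.get("source") is not None)

def promote(candidates, memory, criteria=human_criteria):
    """The evaluator: agent-generated candidates come in, and only
    those the human-defined criteria approve are promoted to
    semantic memory."""
    for candidate in candidates:
        if criteria(candidate):
            memory.append(candidate["fact"])
    return memory
```

The agent is free to generate as many candidates as it likes. What counts as knowledge worth keeping is decided one layer up.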
The loop can run indefinitely. The anchoring criterion has to stay.
## The Bottleneck Moved
A year ago the bottleneck was writing the code. Then it was writing the prompt. Now it's writing program.md — the document that tells the agent how to think about the entire problem domain.
The best labs won't have the most engineers. They'll have the best program.md.
This sounds like good news. In some ways it is — the leverage available to a single person with a well-designed arena is genuinely enormous. Karpathy runs 100 ML experiments overnight on one GPU. A developer with a good CLAUDE.md ships what used to require a team. A researcher with a precise specification wakes up to 15 kept improvements from 83 experiments she didn't run herself.
But the bottleneck moved. It didn't disappear.
Writing program.md well requires understanding the problem domain deeply enough to define what success looks like before the experiments begin. It requires knowing which metric to fix and why. It requires building walls that are tight enough to prevent drift and loose enough to allow genuine discovery.
That's harder than writing code. Code either compiles or it doesn't. An arena either produces useful experiments or it produces 100 comparable runs toward the wrong answer — and you might not know which until you've slept through several nights of agent activity.
The shift is not from hard work to easy work. It's from the hard work of execution to the harder work of specification. The agent handles the experiments. The human handles the judgment about what the experiments should be measuring.
That judgment doesn't automate. The loop runs. The arena stays yours.
This is part of a series on what AI actually changes in software development. Previous pieces: The Gatekeeping Panic, The Meter Was Always Running, Who Said What to Whom, The Token Economy, I Shipped Broken Code and Wrote an Article About It.