Karpathy's AutoResearch: What Happens When You Let an AI Run Its Own Experiments Overnight
Written by Arshdeep Singh
In March 2026, Andrej Karpathy released something quietly remarkable: AutoResearch — a framework that lets an AI agent autonomously run machine learning experiments, iterate on its own training code, and improve a model overnight without human intervention.
This isn't a research paper. It's a working system you can clone and run today. And it points to something significant about where AI-assisted research is heading.
GitHub: karpathy/autoresearch
The Problem It Solves
ML research is, fundamentally, an experimental loop:
- Have an idea
- Implement it
- Train a model
- Evaluate results
- Keep or discard the change
- Go back to step 1
This loop is slow because it's human-bottlenecked. Steps 2-5 can take hours per cycle. A researcher might run 5-10 experiments per day if they're focused. Most of that time is waiting — waiting for training runs, waiting for evaluations to complete, waiting to context-switch back to the right mental model.
AutoResearch removes the human from steps 2-5 entirely. The loop runs overnight. You wake up to 50 completed experiments.
The Architecture
AutoResearch is built on nanochat — Karpathy's minimal GPT implementation designed for single-GPU training runs. Each training job takes about 5 minutes, which is the key design constraint: fast enough to run many experiments in a single overnight session.
The system has exactly three files that matter:
prepare.py — Fixed
Data preparation and tokenization. The agent never touches this. The dataset and preprocessing are locked in, giving the agent a stable foundation to experiment from.
train.py — Agent's Playground
This is where the agent operates. It contains everything about the model: architecture decisions, hyperparameters, optimizer configuration, learning rate schedules, regularization. The agent reads this file, proposes a modification, implements it, and measures whether it helped.
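To make that concrete, here is a hypothetical sketch of the kind of knobs a file like train.py exposes to the agent. The field names and values here are illustrative assumptions, not taken from the actual repository:

```python
# Illustrative sketch only: a config surface an agent could read and edit.
# Names and defaults are hypothetical, not copied from karpathy/autoresearch.
from dataclasses import dataclass

@dataclass
class TrainConfig:
    n_layers: int = 6            # architecture: transformer depth
    n_heads: int = 6             # attention head count
    d_model: int = 384           # embedding width (must divide evenly by heads)
    learning_rate: float = 3e-4  # optimizer configuration
    warmup_steps: int = 100      # learning rate schedule shape
    weight_decay: float = 0.1    # regularization
    dropout: float = 0.0

config = TrainConfig()
```

Because everything about the model lives in one file, an edit to a single field is a complete, measurable experiment.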
program.md — Your Research Direction
Here's the clever part: you don't write Python to configure AutoResearch. You write markdown.
program.md is the research charter. You describe what you're trying to achieve, which directions seem promising, and which constraints to respect. The agent reads this document and uses it to guide its experimental strategy.
Want to focus on attention mechanisms? Write that in program.md. Want to avoid changes that increase parameter count beyond a threshold? Write that too. The agent follows it.
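A program.md might read something like this (an illustrative sketch, not taken from the repo; the 10% parameter budget is an invented example of the kind of threshold you might set):

```markdown
# Research Program

## Goal
Reduce validation loss on the fixed dataset without increasing
total parameter count by more than 10%.

## Promising directions
- Attention head configurations and head-dimension trade-offs
- Learning rate schedule shapes (warmup length, decay curve)

## Constraints
- Do not modify prepare.py or the tokenizer.
- Keep each training run under ~5 minutes.
```

Everything here is prose the agent interprets, not parameters it parses, which is exactly the point.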
How the Loop Works
Read program.md → Understand current model state
→ Propose an experiment (hypothesis + implementation)
→ Edit train.py
→ Run training (~5 min)
→ Evaluate validation loss
→ Compare against baseline
→ If improvement: commit change, update baseline
→ If regression: revert, log the failure
→ Record findings in experiment log
→ Repeat
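The loop above can be sketched in a few lines of Python. Function names here are hypothetical, and the "training run" is a stand-in toy objective; in the real system that step is a roughly five-minute nanochat training job:

```python
# Minimal sketch of the commit-or-revert experiment loop.
# All names are hypothetical; train_and_evaluate() is a toy stand-in
# for a real ~5-minute training run plus validation-loss measurement.
import copy
import random

def train_and_evaluate(config):
    # Toy objective: pretends the best learning rate is near 3e-4.
    return abs(config["lr"] - 3e-4) + 0.1 * config["n_layers"]

def propose_experiment(config):
    # Stand-in for the agent's edit to train.py: perturb one hyperparameter.
    candidate = copy.deepcopy(config)
    candidate["lr"] *= random.choice([0.5, 2.0])
    return candidate

def overnight_run(config, n_experiments=50, seed=0):
    random.seed(seed)
    baseline_loss = train_and_evaluate(config)
    log = []
    for i in range(n_experiments):
        candidate = propose_experiment(config)
        loss = train_and_evaluate(candidate)
        if loss < baseline_loss:
            # Improvement: commit the change and update the baseline.
            config, baseline_loss = candidate, loss
            log.append((i, "commit", loss))
        else:
            # Regression: revert and record the failure.
            log.append((i, "revert", loss))
    return config, baseline_loss, log

best, loss, log = overnight_run({"lr": 1e-3, "n_layers": 4})
```

The key invariant is that the baseline only ever moves when a candidate beats it, so the model at the end of the night is never worse than the one you launched with.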
In a single overnight run, the system executed 50 experiments. It explored:
- Attention head configurations
- Activation functions
- Layer normalization variants
- Learning rate schedule shapes
- Optimizer hyperparameters
- Residual connection patterns
By morning, it had found configurations that meaningfully improved the baseline model — without a single human keypress after the initial launch.
What This Actually Means
Let's be precise about what AutoResearch is and isn't.
It is:
- A working demonstration that AI agents can run meaningful ML experiments autonomously
- A practical tool for exploring hyperparameter and architecture spaces overnight
- A framework that treats research direction as a natural-language configuration problem
- Open-source and runnable today on a single GPU
It isn't:
- A system that generates novel research ideas from scratch
- A replacement for human intuition in designing experiments
- A tool that works on arbitrary codebases (it's scoped to nanochat)
- Guaranteed to find improvements — many experiments fail
The honest framing: AutoResearch is an automated experimental assistant, not an autonomous research scientist. You still define the direction. It executes and iterates faster than you can.
The program.md Insight
The design decision to use a markdown file for research configuration is worth dwelling on.
Most automation systems are configured with code or structured config files. AutoResearch deliberately chooses natural language. This means:
- Researchers without strong coding skills can participate — you describe your research intent in prose, not parameters
- The configuration is human-readable — you can audit what the agent understood and adjust it
- The boundary between researcher and agent is clear — humans write intent, agents write code
This is a small but meaningful step toward AI systems that collaborate with humans at the level of ideas rather than just implementation.
Getting Started
git clone https://github.com/karpathy/autoresearch
cd autoresearch
pip install -r requirements.txt
# Configure your research direction
vim program.md
# Prepare your dataset
python prepare.py
# Launch overnight run
python autorun.py
You'll need a single GPU (the system is designed for consumer hardware — an RTX 3090 or 4090 is ideal). Set it running before you sleep. Review results in the morning.
Implications for the Field
AutoResearch is a prototype of something bigger: AI as an active participant in scientific research.
The current model is: human researchers use AI tools to accelerate their work. The next model is: AI systems run experiments in parallel with human researchers, exploring the search space faster than any single person could.
We're not at the "AI has research ideas" stage yet. But we're clearly at the "AI can run research experiments faster than humans" stage. AutoResearch makes that concrete and tangible.
Karpathy has a track record of releasing tools that become foundational — nanoGPT, micrograd, minbpe. AutoResearch feels like another one of those releases: minimal, clearly designed, and pointing at something important.
Final Thoughts
The most interesting thing about AutoResearch isn't the technical implementation — it's the workflow it enables. Run experiments while you sleep. Wake up to data. Make decisions informed by 50 trials instead of 3.
That's not a marginal improvement in research productivity. It's a structural shift in what a single researcher can accomplish.
For anyone doing ML research or experimentation on single-GPU hardware, AutoResearch is worth studying and running. Even if you don't use it directly, the design philosophy — natural language research configuration, autonomous experimental loops, fast iteration over small models — is worth internalizing.
👉 github.com/karpathy/autoresearch