TL;DR: Karpathy's autoresearch runs hundreds of GPT pretraining experiments overnight. It doesn't tell you which ones mattered. I built three CLIs that do: autojudge (noise floor + Pareto analysis), autosteer (what to try next), autoevolve (competing agents, cross-pollinate winners).
The problem
After running autoresearch for a week I had a TSV with thousands of rows and no idea what to trust.
The built-in keep/discard logic is: did val_bpb go down? That's it. No noise floor estimation. No way to know if a 0.02% improvement is real signal or run-to-run jitter. After 700 experiments I had 6 "improvements" and zero confidence in any of them.
The eval layer isn't there. Karpathy left it as an exercise.
What I built
autojudge
Reads results.tsv and run.log, estimates the noise floor from recent experiments, checks if the improvement is on the Pareto front (val_bpb vs memory), and returns a verdict with a confidence score.
pip install autojudge
autojudge --results results.tsv --run run.log
Output looks like:
experiment_042: STRONG_KEEP (confidence: 0.91)
val_bpb delta: -0.0041 | noise floor: ±0.0008
pareto status: EFFICIENT
experiment_043: RETEST (confidence: 0.44)
val_bpb delta: -0.0009 | noise floor: ±0.0011
delta within noise -> not enough signal
Exit codes are scripting-friendly: 0 = keep, 1 = discard, 2 = retest. You can pipe directly into your loop.
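The Pareto check above is the standard dominance test over the two objectives. A minimal sketch, assuming points are (val_bpb, memory) pairs with lower-is-better on both axes (the function name and exact rule are my assumptions, not autojudge's actual code):

```python
def pareto_efficient(points):
    """Return indices of Pareto-efficient points.

    points: list of (val_bpb, memory) tuples; lower is better on both.
    A point is efficient if no other point is at least as good on both
    axes and strictly different.
    """
    efficient = []
    for i, (bpb_i, mem_i) in enumerate(points):
        dominated = any(
            bpb_j <= bpb_i and mem_j <= mem_i and (bpb_j, mem_j) != (bpb_i, mem_i)
            for j, (bpb_j, mem_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            efficient.append(i)
    return efficient
```

An experiment that improves val_bpb but blows up memory can still land off the front, which is why the verdict isn't just "did the number go down".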
What didn't work at first: I tried estimating the noise floor from a single baseline run, but that estimate is too noisy itself. I needed a rolling window of recent experiments (I settled on the last 5) to get a stable estimate.
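The rolling-window idea is simple enough to sketch: treat the spread of the last few comparable runs as the floor, and only trust deltas that clear it. This is a simplified stand-in for autojudge's logic (the threshold rule and three-way verdict are my assumptions; autojudge also reports a confidence score, which this omits):

```python
import statistics

def noise_floor(recent_val_bpb, window=5):
    """Estimate run-to-run jitter as the stdev of the last `window` runs."""
    tail = recent_val_bpb[-window:]
    if len(tail) < 2:
        return float("inf")  # not enough history to judge anything yet
    return statistics.stdev(tail)

def verdict(delta, floor):
    """delta = new_bpb - baseline_bpb; negative means improvement."""
    if delta <= -floor:   # improvement clears the noise floor
        return "KEEP"
    if delta >= floor:    # regression clears the noise floor
        return "DISCARD"
    return "RETEST"       # within noise, not enough signal either way
```

With a floor around ±0.001, a -0.0041 delta is a clear keep and a -0.0009 delta falls inside the noise, matching the sample output above.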
autosteer
Looks at your history of kept/discarded experiments, groups them by category (architecture, hyperparams, optimizer, regularization, etc.), and suggests what to try next.
pip install autosteer
autosteer --results results.tsv --mode exploit
Two modes:

- exploit: you're winning in a category, so it suggests more variations there
- explore: you're stuck, so it suggests underexplored categories
Output looks like:
Category analysis (last 50 experiments):
architecture: 12 tried | 8 kept (67%) | EXPLOIT
hyperparams: 18 tried | 6 kept (33%) | NEUTRAL
optimizer: 8 tried | 1 kept (12%) | AVOID
regularization: 4 tried | 0 kept (0%) | EXPLORE
Suggested next: architecture variations (high success rate)
Specific angles: attention head count, layer depth, skip connections
Caveat: suggestions are category-level, not causal. It can tell you architecture changes tend to work for your setup. It can't tell you why.
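The category scoring shown above boils down to keep rates over a recent window. A minimal sketch under assumed thresholds (the cutoffs for EXPLOIT/AVOID/EXPLORE are my guesses, not autosteer's actual values):

```python
from collections import defaultdict

def categorize(history, recent=50):
    """history: list of (category, kept) pairs, oldest first.

    Returns {category: (tried, keep_rate, label)} over the last
    `recent` experiments.
    """
    tried, kept = defaultdict(int), defaultdict(int)
    for cat, was_kept in history[-recent:]:
        tried[cat] += 1
        kept[cat] += int(was_kept)
    out = {}
    for cat, n in tried.items():
        rate = kept[cat] / n
        if n < 5:
            label = "EXPLORE"   # too few samples to judge, worth probing
        elif rate >= 0.5:
            label = "EXPLOIT"   # keep pushing here
        elif rate <= 0.2:
            label = "AVOID"
        else:
            label = "NEUTRAL"
        out[cat] = (n, rate, label)
    return out
```

Run against counts like the sample output (12 architecture tries with 8 kept, 4 regularization tries with 0 kept), these thresholds reproduce the EXPLOIT and EXPLORE labels.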
autoevolve
The experimental one. Puts multiple agents on separate git worktrees with different strategies. They compete on the same problem. Winning ideas cross-pollinate into the next generation.
pip install autoevolve
autoevolve --strategies conservative aggressive random --rounds 3
Each agent gets its own worktree and runs the standard autoresearch loop with its strategy. After each round, the best-performing config gets merged into all agents as the new baseline.
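Stripped of the git worktree plumbing, the round loop is a small selection procedure. A toy sketch where `run_experiment` is a hypothetical stand-in for one agent's full autoresearch pass (it must return a `(config, val_bpb)` pair; lower val_bpb wins):

```python
def evolve(strategies, run_experiment, rounds=3, baseline=None):
    """Each round, every strategy runs against the shared baseline;
    the best-scoring config becomes the next round's baseline for all
    agents. Real worktree setup/teardown is elided."""
    for _ in range(rounds):
        results = [run_experiment(s, baseline) for s in strategies]
        baseline, _ = min(results, key=lambda r: r[1])  # lowest val_bpb
    return baseline
```

This also makes the current limitation concrete: `min` picks exactly one winner per round, discarding everything the losing agents learned.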
This is the least polished of the three. It works, and the git worktree management is clean, but the cross-pollination heuristic is simplistic: I'm picking the best single config per round rather than doing anything clever with ensembles. That's next.
Installation
pip install autojudge autosteer autoevolve
Python 3.10+, MIT license. Each tool plugs into the standard autoresearch loop: it reads results.tsv and run.log, with no other dependencies on the autoresearch internals.
Repo: github.com/dean0x/autolab
What I'd do differently
The noise floor estimation in autojudge took three rewrites. My first approach (single baseline) was too noisy. My second approach (fixed window of 10) was too slow to adapt early in a run. Rolling window of 5 was the right tradeoff.
If you're using autoresearch seriously, the eval layer is where the leverage is. The overnight loop is the easy part.