DEV Community

Dean Sharon

How I Built Eval Tools for Karpathy's Autoresearch

TL;DR: Karpathy's autoresearch runs hundreds of GPT pretraining experiments overnight. It doesn't tell you which ones mattered. I built three CLIs that do: autojudge (noise floor + Pareto analysis), autosteer (what to try next), autoevolve (competing agents, cross-pollinate winners).

The problem

After running autoresearch for a week I had a TSV with thousands of rows and no idea what to trust.

The built-in keep/discard logic is: did val_bpb go down? That's it. No noise floor estimation. No way to know if a 0.02% improvement is real signal or run-to-run jitter. After 700 experiments I had 6 "improvements" and zero confidence in any of them.

The eval layer isn't there. Karpathy left it as an exercise.

What I built

autojudge

Reads results.tsv and run.log, estimates the noise floor from recent experiments, checks if the improvement is on the Pareto front (val_bpb vs memory), and returns a verdict with a confidence score.

pip install autojudge
autojudge --results results.tsv --run run.log

Output looks like:

experiment_042: STRONG_KEEP (confidence: 0.91)
  val_bpb delta: -0.0041 | noise floor: ±0.0008
  pareto status: EFFICIENT

experiment_043: RETEST (confidence: 0.44)
  val_bpb delta: -0.0009 | noise floor: ±0.0011
  delta within noise -> not enough signal
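The Pareto check is the part that's easy to get subtly wrong, so here it is as I'd sketch it. The two-metric setup (val_bpb vs memory, lower is better for both) is from above; the function itself is my illustration, not autojudge's internals:

```python
def is_pareto_efficient(candidate, history):
    """A (val_bpb, memory) point is Pareto-efficient if no previous
    experiment is at least as good on both metrics and strictly
    better on at least one. Lower is better for both."""
    c_bpb, c_mem = candidate
    for h_bpb, h_mem in history:
        dominates = (h_bpb <= c_bpb and h_mem <= c_mem
                     and (h_bpb < c_bpb or h_mem < c_mem))
        if dominates:
            return False
    return True

history = [(1.050, 8.0), (1.042, 9.5), (1.046, 7.2)]
is_pareto_efficient((1.041, 8.8), history)  # → True: no prior run dominates it
```

An experiment can beat the baseline on val_bpb and still be dominated by an older run that was both better and cheaper, which is exactly the case a raw "did val_bpb go down?" check misses.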

Exit codes are scripting-friendly: 0 = keep, 1 = discard, 2 = retest. You can pipe directly into your loop.
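A driver only needs the return code. This sketch assumes the flags shown above; the wrapper function and the loop around it are mine, not part of autojudge:

```python
import subprocess

# Exit-code contract from above: 0 = keep, 1 = discard, 2 = retest.
ACTIONS = {0: "keep", 1: "discard", 2: "retest"}

def judge_experiment(results_tsv: str, run_log: str) -> str:
    """Run autojudge and map its exit code to an action string."""
    proc = subprocess.run(
        ["autojudge", "--results", results_tsv, "--run", run_log])
    return ACTIONS.get(proc.returncode, "error")

# In the overnight loop (sketch):
#   action = judge_experiment("results.tsv", "run.log")
#   if action == "retest":
#       queue_rerun()   # hypothetical helper in your own harness
```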

What didn't work at first: I tried estimating the noise floor from a single baseline run. A single run is itself too noisy. I needed a rolling window of recent experiments (I settled on the last 5) to get a stable estimate.
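Concretely, the rolling estimate can be as simple as the standard deviation of the last few val_bpb deltas. This is a sketch of the idea, not autojudge's exact estimator, and the thresholds are illustrative:

```python
from statistics import stdev

def noise_floor(deltas, window=5):
    """Estimate run-to-run jitter as the stdev of the most recent
    val_bpb deltas. A rolling window adapts as the run matures,
    which a single baseline run can't do."""
    recent = deltas[-window:]
    if len(recent) < 2:
        return float("inf")   # not enough history: trust nothing yet
    return stdev(recent)

def verdict(delta, deltas, window=5):
    floor = noise_floor(deltas, window)
    if delta < -floor:
        return "KEEP"      # improvement clears the noise floor
    if abs(delta) <= floor:
        return "RETEST"    # inside the noise band
    return "DISCARD"

deltas = [0.0005, -0.0003, 0.0007, -0.0006, 0.0002]
verdict(-0.0041, deltas)  # → "KEEP": the delta clears the ~±0.0005 floor
```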

autosteer

Looks at your history of kept/discarded experiments, groups them by category (architecture, hyperparams, optimizer, regularization, etc.), and suggests what to try next.

pip install autosteer
autosteer --results results.tsv --mode exploit

Two modes:

  • exploit: you're winning in a category, suggests more variations there
  • explore: you're stuck, suggests underexplored categories

Output looks like:

Category analysis (last 50 experiments):
  architecture:    12 tried | 8 kept (67%) | EXPLOIT
  hyperparams:     18 tried | 6 kept (33%) | NEUTRAL
  optimizer:        8 tried | 1 kept (12%) | AVOID
  regularization:   4 tried | 0 kept (0%)  | EXPLORE

Suggested next: architecture variations (high success rate)
Specific angles: attention head count, layer depth, skip connections

Caveat: suggestions are category-level, not causal. It can tell you that architecture changes tend to work for your setup; it can't tell you why.
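Under the hood, labeling like this only needs keep rates per category. A minimal sketch (the rate thresholds and the min-tries cutoff are my guesses, not autosteer's actual rules):

```python
from collections import defaultdict

def categorize(history, exploit_rate=0.5, avoid_rate=0.2, min_tries=5):
    """history: list of (category, kept) pairs from past experiments.
    Returns a label per category in the spirit of the output above."""
    stats = defaultdict(lambda: [0, 0])          # category -> [tried, kept]
    for cat, kept in history:
        stats[cat][0] += 1
        stats[cat][1] += int(kept)
    labels = {}
    for cat, (tried, kept) in stats.items():
        if tried < min_tries:
            labels[cat] = "EXPLORE"              # too little data to judge
        elif kept / tried >= exploit_rate:
            labels[cat] = "EXPLOIT"
        elif kept / tried >= avoid_rate:
            labels[cat] = "NEUTRAL"
        else:
            labels[cat] = "AVOID"
    return labels
```

Feeding it the counts from the table above (12 architecture tries with 8 kept, and so on) reproduces the same four labels.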

autoevolve

The experimental one. Puts multiple agents on separate git worktrees with different strategies. They compete on the same problem. Winning ideas cross-pollinate into the next generation.

pip install autoevolve
autoevolve --strategies conservative aggressive random --rounds 3

Each agent gets its own worktree and runs the standard autoresearch loop with its strategy. After each round, the best-performing config gets merged into all agents as the new baseline.

This is the least polished of the three. It works, and the git worktree management is clean, but the cross-pollination heuristic is simplistic: I'm picking the best single config per round rather than doing anything clever with ensembles. That's next.
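The round structure is simple enough to show in full. Everything here is my sketch of the scheme described above, not autoevolve's code; in the real tool each agent runs the autoresearch loop in its own git worktree, which run_agent stands in for:

```python
def evolve(strategies, rounds, run_agent):
    """Round-based compete-then-cross-pollinate loop.
    run_agent(strategy, baseline) -> (config, val_bpb); lower bpb wins.
    The winning config becomes every agent's baseline for the next round."""
    baseline = None
    for _ in range(rounds):
        results = [run_agent(s, baseline) for s in strategies]
        baseline = min(results, key=lambda r: r[1])[0]
    return baseline
```

The cross-pollination step is just that one `min` line, which is exactly why "best single config per round" is the simplistic part.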

Installation

pip install autojudge autosteer autoevolve

Python 3.10+, MIT license. The tools plug into the standard autoresearch loop by reading results.tsv and run.log; there are no other dependencies on autoresearch internals.

Repo: github.com/dean0x/autolab

What I'd do differently

The noise floor estimation in autojudge took three rewrites. My first approach (a single baseline run) was too noisy. My second (a fixed window of 10) was too slow to adapt early in a run. A rolling window of 5 was the right tradeoff.

If you're using autoresearch seriously, the eval layer is where the leverage is. The overnight loop is the easy part.
