DEV Community

Dean Sharon

How I Built Eval Tools for Karpathy's Autoresearch

TL;DR: Karpathy's autoresearch runs hundreds of GPT pretraining experiments overnight. It doesn't tell you which ones mattered. I built three CLIs that do: autojudge (noise floor + Pareto analysis), autosteer (what to try next), autoevolve (competing agents, cross-pollinate winners).

The problem

After running autoresearch for a week I had a TSV with thousands of rows and no idea what to trust.

The built-in keep/discard logic is: did val_bpb go down? That's it. No noise floor estimation. No way to know if a 0.02% improvement is real signal or run-to-run jitter. After 700 experiments I had 6 "improvements" and zero confidence in any of them.

The eval layer isn't there. Karpathy left it as an exercise.

What I built

autojudge

Reads results.tsv and run.log, estimates the noise floor from recent experiments, checks if the improvement is on the Pareto front (val_bpb vs memory), and returns a verdict with a confidence score.

pip install autojudge
autojudge --results results.tsv --run run.log

Output looks like:

experiment_042: STRONG_KEEP (confidence: 0.91)
  val_bpb delta: -0.0041 | noise floor: ±0.0008
  pareto status: EFFICIENT

experiment_043: RETEST (confidence: 0.44)
  val_bpb delta: -0.0009 | noise floor: ±0.0011
  delta within noise -> not enough signal
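The Pareto check is the part that's easy to get subtly wrong, so here it is as I'd sketch it. The two-metric setup (val_bpb vs memory, lower is better for both) is from above; the function itself is my illustration, not autojudge's internals:

```python
def is_pareto_efficient(candidate, history):
    """A (val_bpb, memory) point is Pareto-efficient if no previous
    experiment is at least as good on both metrics and strictly
    better on at least one. Lower is better for both."""
    c_bpb, c_mem = candidate
    for h_bpb, h_mem in history:
        dominates = (h_bpb <= c_bpb and h_mem <= c_mem
                     and (h_bpb < c_bpb or h_mem < c_mem))
        if dominates:
            return False
    return True

history = [(1.050, 8.0), (1.042, 9.5), (1.046, 7.2)]
is_pareto_efficient((1.041, 8.8), history)  # → True: no prior run dominates it
```

An experiment can beat the baseline on val_bpb and still be dominated by an older run that was both better and cheaper, which is exactly the case a raw "did val_bpb go down?" check misses.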

Exit codes are scripting-friendly: 0 = keep, 1 = discard, 2 = retest. You can pipe directly into your loop.
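A driver only needs the return code. This sketch assumes the flags shown above; the wrapper function and the loop around it are mine, not part of autojudge:

```python
import subprocess

# Exit-code contract from above: 0 = keep, 1 = discard, 2 = retest.
ACTIONS = {0: "keep", 1: "discard", 2: "retest"}

def judge_experiment(results_tsv: str, run_log: str) -> str:
    """Run autojudge and map its exit code to an action string."""
    proc = subprocess.run(
        ["autojudge", "--results", results_tsv, "--run", run_log])
    return ACTIONS.get(proc.returncode, "error")

# In the overnight loop (sketch):
#   action = judge_experiment("results.tsv", "run.log")
#   if action == "retest":
#       queue_rerun()   # hypothetical helper in your own harness
```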

What didn't work at first: I tried estimating the noise floor from a single baseline run. A single run is itself too noisy. I needed a rolling window of recent experiments (I settled on the last 5) to get a stable estimate.
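Concretely, the rolling estimate can be as simple as the standard deviation of the last few val_bpb deltas. This is a sketch of the idea, not autojudge's exact estimator, and the thresholds are illustrative:

```python
from statistics import stdev

def noise_floor(deltas, window=5):
    """Estimate run-to-run jitter as the stdev of the most recent
    val_bpb deltas. A rolling window adapts as the run matures,
    which a single baseline run can't do."""
    recent = deltas[-window:]
    if len(recent) < 2:
        return float("inf")   # not enough history: trust nothing yet
    return stdev(recent)

def verdict(delta, deltas, window=5):
    floor = noise_floor(deltas, window)
    if delta < -floor:
        return "KEEP"      # improvement clears the noise floor
    if abs(delta) <= floor:
        return "RETEST"    # inside the noise band
    return "DISCARD"

deltas = [0.0005, -0.0003, 0.0007, -0.0006, 0.0002]
verdict(-0.0041, deltas)  # → "KEEP": the delta clears the ~±0.0005 floor
```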

autosteer

Looks at your history of kept/discarded experiments, groups them by category (architecture, hyperparams, optimizer, regularization, etc.), and suggests what to try next.

pip install autosteer
autosteer --results results.tsv --mode exploit

Two modes:

  • exploit: you're winning in a category, suggests more variations there
  • explore: you're stuck, suggests underexplored categories

Output looks like:

Category analysis (last 50 experiments):
  architecture:    12 tried | 8 kept (67%) | EXPLOIT
  hyperparams:     18 tried | 6 kept (33%) | NEUTRAL
  optimizer:        8 tried | 1 kept (12%) | AVOID
  regularization:   4 tried | 0 kept (0%)  | EXPLORE

Suggested next: architecture variations (high success rate)
Specific angles: attention head count, layer depth, skip connections

Caveat: suggestions are category-level, not causal. It can tell you that architecture changes tend to work for your setup; it can't tell you why.
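Under the hood, labeling like this only needs keep rates per category. A minimal sketch (the rate thresholds and the min-tries cutoff are my guesses, not autosteer's actual rules):

```python
from collections import defaultdict

def categorize(history, exploit_rate=0.5, avoid_rate=0.2, min_tries=5):
    """history: list of (category, kept) pairs from past experiments.
    Returns a label per category in the spirit of the output above."""
    stats = defaultdict(lambda: [0, 0])          # category -> [tried, kept]
    for cat, kept in history:
        stats[cat][0] += 1
        stats[cat][1] += int(kept)
    labels = {}
    for cat, (tried, kept) in stats.items():
        if tried < min_tries:
            labels[cat] = "EXPLORE"              # too little data to judge
        elif kept / tried >= exploit_rate:
            labels[cat] = "EXPLOIT"
        elif kept / tried >= avoid_rate:
            labels[cat] = "NEUTRAL"
        else:
            labels[cat] = "AVOID"
    return labels
```

Feeding it the counts from the table above (12 architecture tries with 8 kept, and so on) reproduces the same four labels.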

autoevolve

The experimental one. Puts multiple agents on separate git worktrees with different strategies. They compete on the same problem. Winning ideas cross-pollinate into the next generation.

pip install autoevolve
autoevolve --strategies conservative aggressive random --rounds 3

Each agent gets its own worktree and runs the standard autoresearch loop with its strategy. After each round, the best-performing config gets merged into all agents as the new baseline.

This is the least polished of the three. It works, and the git worktree management is clean, but the cross-pollination heuristic is simplistic: I'm picking the best single config per round rather than doing anything clever with ensembles. That's next.
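The round structure is simple enough to show in full. Everything here is my sketch of the scheme described above, not autoevolve's code; in the real tool each agent runs the autoresearch loop in its own git worktree, which run_agent stands in for:

```python
def evolve(strategies, rounds, run_agent):
    """Round-based compete-then-cross-pollinate loop.
    run_agent(strategy, baseline) -> (config, val_bpb); lower bpb wins.
    The winning config becomes every agent's baseline for the next round."""
    baseline = None
    for _ in range(rounds):
        results = [run_agent(s, baseline) for s in strategies]
        baseline = min(results, key=lambda r: r[1])[0]
    return baseline
```

The cross-pollination step is just that one `min` line, which is exactly why "best single config per round" is the simplistic part.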

Installation

pip install autojudge autosteer autoevolve

Python 3.10+, MIT license. The tools plug into the standard autoresearch loop by reading results.tsv and run.log; there are no other dependencies on autoresearch internals.

Repo: github.com/dean0x/autolab

What I'd do differently

The noise floor estimation in autojudge took three rewrites. My first approach (a single baseline run) was too noisy. My second (a fixed window of 10) was too slow to adapt early in a run. A rolling window of 5 was the right tradeoff.

If you're using autoresearch seriously, the eval layer is where the leverage is. The overnight loop is the easy part.
