Karpathy released autoresearch last week. 31,000 stars.
100 ML experiments overnight on one GPU.
Everyone wrote about the ML training loop.
I saw something different: a pattern.
One file. One metric. One loop. Modify → Evaluate → Keep or Discard → Repeat.
That pattern has nothing to do with machine learning.
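The loop fits in a few lines. A minimal sketch, assuming a single target file and two callables the agent supplies; the names (run_loop, mutate, evaluate, TARGET_FILE) are placeholders I made up for illustration, not the skill's actual API:

```python
import shutil

TARGET_FILE = "optimize_me.py"  # the one file under optimization (placeholder name)

def run_loop(mutate, evaluate, iterations=100):
    best_score = evaluate(TARGET_FILE)
    for _ in range(iterations):
        shutil.copy(TARGET_FILE, TARGET_FILE + ".bak")   # snapshot before touching anything
        mutate(TARGET_FILE)                              # Modify
        score = evaluate(TARGET_FILE)                    # Evaluate
        if score > best_score:
            best_score = score                           # Keep
        else:
            shutil.copy(TARGET_FILE + ".bak", TARGET_FILE)  # Discard: roll back
    return best_score                                    # Repeat happens in the for loop
```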
So I built a skill that applies it to:
→ API response time (benchmark_speed evaluator, sketched after this list)
→ Bundle size (benchmark_size evaluator)
→ Headline click-through (LLM judge evaluator)
→ System prompt quality (LLM judge evaluator)
→ Test pass rate, build speed, memory usage
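The numeric evaluators are the easy half. Here is one way a benchmark_speed-style evaluator could look; the "time a script's wall-clock runtime" framing is my assumption for the sketch, not the skill's actual implementation:

```python
import subprocess
import time

def benchmark_speed(path):
    # Assumption: the file under optimization is a runnable script and its
    # wall-clock runtime is the metric. Negate so "higher score is better"
    # matches the loop sketch above.
    start = time.perf_counter()
    subprocess.run(["python", path], check=True, capture_output=True)
    return -(time.perf_counter() - start)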
Works across 11 tools: Claude Code, Codex, Gemini CLI,
Cursor, Windsurf, OpenClaw, and more.
The hardest problem: evaluating things that are not numbers.
Headlines do not come with a val_bpb metric.
Solution: LLM judges using the agent's own subscription.
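Roughly, the judge is just another evaluator that shells out to the agent instead of a stopwatch. A hedged sketch; the "agent" CLI name, the prompt, and the 1-10 scale are placeholders, not the skill's real judge:

```python
import subprocess

JUDGE_PROMPT = (
    "Rate this headline from 1 to 10 for clarity and click-worthiness. "
    "Reply with only the number.\n\nHeadline: {headline}"
)

def llm_judge(path):
    headline = open(path).read().strip()
    result = subprocess.run(
        ["agent", "-p", JUDGE_PROMPT.format(headline=headline)],  # placeholder CLI call
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())  # the judge's score feeds the same loop
```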
Critical constraint: the agent cannot modify its own evaluator.
(The alignment problem in miniature.)
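One simple way to enforce that constraint is to pin a hash of the evaluator before the run and refuse any score produced after it changes. This is my sketch of the idea, not the skill's actual mechanism, and evaluator.py is a hypothetical path:

```python
import hashlib

EVALUATOR_PATH = "evaluator.py"  # hypothetical location of the benchmark / judge code

def file_hash(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

PINNED_HASH = file_hash(EVALUATOR_PATH)  # captured once, before the loop starts

def assert_evaluator_untouched():
    # If the agent edited its own evaluator, the score can't be trusted.
    if file_hash(EVALUATOR_PATH) != PINNED_HASH:
        raise RuntimeError("Evaluator was modified mid-run; discarding this result.")
```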
What I have not done yet: run 100 experiments overnight.
The skill shipped this week. The architecture is solid.
The validation is ahead of me.
Full architecture + honest limitations:
On GitHub
What manual optimization loop are you running that should be automated?