DEV Community

Reza Rezvani

I Turned Karpathy's Autoresearch Into a Skill That Optimizes Anything — Here Is the Architecture

Karpathy released autoresearch last week. 31,000 stars.
100 ML experiments overnight on one GPU.

Everyone wrote about the ML training loop.
I saw something different: a pattern.

One file. One metric. One loop. Modify → Evaluate → Keep or Discard → Repeat.

That pattern has nothing to do with machine learning.
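Stripped of the ML specifics, the loop is plain hill climbing. A minimal sketch (function names and the toy task are mine, not the skill's actual code):

```python
import random

def optimize(candidate, mutate, evaluate, steps=100):
    """Generic Modify -> Evaluate -> Keep or Discard -> Repeat loop.

    Higher evaluate() scores are better; worse trials are discarded.
    """
    best_score = evaluate(candidate)
    for _ in range(steps):
        trial = mutate(candidate)        # Modify
        score = evaluate(trial)          # Evaluate
        if score > best_score:           # Keep or Discard
            candidate, best_score = trial, score
    return candidate, best_score

# Toy usage: nudge a number toward 10 with random mutations.
result, score = optimize(
    0.0,
    mutate=lambda x: x + random.uniform(-1, 1),
    evaluate=lambda x: -abs(x - 10),
)
```

Nothing in `optimize` knows what a candidate is. Swap the mutate/evaluate pair and the same loop tunes a config file, a headline, or a training script.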

So I built a skill that applies it to:
→ API response time (benchmark_speed evaluator)
→ Bundle size (benchmark_size evaluator)

→ Headline click-through (LLM judge evaluator)
→ System prompt quality (LLM judge evaluator)
→ Test pass rate, build speed, memory usage
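For the numeric targets, an evaluator can be as small as a stopwatch. A sketch of what a `benchmark_speed`-style evaluator might look like (my simplification, not the shipped implementation):

```python
import time

def benchmark_speed(fn, runs=5):
    """Median wall-clock time of fn() in seconds; lower is better.

    Median rather than mean, so one slow outlier run does not
    mislead the keep-or-discard decision.
    """
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return sorted(timings)[len(timings) // 2]
```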

Works across 11 tools: Claude Code, Codex, Gemini CLI,
Cursor, Windsurf, OpenClaw, and more.

The Full Medium Article

The hardest problem: evaluating things that are not numbers.
Headlines do not come with a val_bpb metric.

Solution: LLM judges using the agent's own subscription.
Critical constraint: the agent cannot modify its own evaluator.
(The alignment problem in miniature.)

What I have not done yet: run 100 experiments overnight.
The skill shipped this week. The architecture is solid.
The validation is ahead of me.

Full architecture + honest limitations:
On GitHub

What manual optimization loop are you running that should be automated?
