Stephan Miller

Originally published at stephanmiller.com

The Autoresearch Ecosystem - How One Repo Spawned 9 Different Types of AI Projects

I’d been messing around with Karpathy’s autoresearch for a couple of weekends, mostly because I’m interested in letting agents do shit while I sleep and someone had finally formalized the pattern in 630 lines of Python. Run the loop, modify train.py, train for five minutes, check val_bpb, keep or revert, repeat forever. Compounding gains while you’re not even at your desk.

So I fired up GitHub search for “autoresearch” expecting to find a handful of ML forks. People porting it to their hardware, maybe a few hyperparameter tweaks. You know how that goes.

I found nine distinct categories of project. Some brilliant. Some “why did you do this.” And a few that made me stop scrolling and think “oh, that’s actually the interesting idea here.” It turns out the original repo isn’t really about ML. It’s a pattern, and people figured that out pretty quickly.

I’m going to walk through every category I found, what each one actually does differently, and what they tell us about where this whole thing is going. There are a lot of repos here, all linked.

What Karpathy Actually Built

Before we go through the derivatives, let’s look at the original. The repo is small and the loop is dumb on purpose:

  1. Read program.md (the meta-skill that tells the agent how to be a researcher)
  2. Modify train.py with a small, reviewable diff
  3. Train for ~5 minutes on one GPU
  4. Check val_bpb (validation bits per byte — the metric)
  5. If it improved, commit. If it regressed, git reset --hard.
  6. Goto 1.

That’s it. About 100 experiments overnight on a single H100 while you sleep. Git is the memory. The flat TSV file is the search log. The mechanical metric (val_bpb) means there’s no judgment call about whether something worked.
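
If you want the ratchet in code, here's a minimal sketch. Note that propose_diff and train_and_eval are placeholder names for the agent call and the training run, not functions from the actual repo:

```python
import random
import subprocess

def git(*args):
    subprocess.run(["git", *args], check=True)

def propose_diff():
    """Placeholder: the agent reads program.md and edits train.py here."""

def train_and_eval() -> float:
    """Placeholder: run train.py for ~5 minutes, return val_bpb."""
    return random.uniform(0.9, 1.1)

best = float("inf")
for _ in range(100):                    # roughly one overnight run
    propose_diff()
    bpb = train_and_eval()
    if bpb < best:                      # lower bits per byte is better
        best = bpb
        git("commit", "-am", f"val_bpb {bpb:.4f}")
    else:
        git("reset", "--hard")          # regression: the experiment never happened
```

Everything interesting lives in what propose_diff decides to try. The loop around it is deliberately this dumb.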

The main idea is that constraint enables autonomy. The diffs are small, so they’re reviewable. The metric is mechanical, so the agent can’t argue with it. The rollback is automatic, so a bad experiment can’t poison the next one. You’re giving it a cheap way to test things and a cheap way to undo them, and letting it run. Not asking it to be smart.

program.md is what Karpathy calls the meta-skill. Humans don’t program the training run. They program the researcher that programs the training run. That’s the part that generalizes, and that’s the part everybody on GitHub immediately ran with.

[Image: Karpathy's original screenshot showing the val_bpb improvement curve]

1. Platform Ports: Running It On Hardware You Actually Own

The “I don’t have an H100” forks

The first thing that happened is what always happens. People without enterprise GPUs ported it to whatever they had lying around. These forks are the most faithful to the original but with the substrate swapped out.

  • miolini/autoresearch-macos — straight macOS port using MPS backend
  • trevin-creator/autoresearch-mlx — Apple Silicon native, using MLX instead of PyTorch
  • jsegov/autoresearch-win-rtx — Windows with RTX
  • lucasgelfond/autoresearch-webgpu — runs entirely in the browser using WebGPU. No Python setup. The whole research loop in a tab.
  • A Colab/Kaggle T4 port (upstream issue #208) that swaps Flash Attention 3 for PyTorch SDPA so you can run experiments overnight on a free GPU
  • ArmanJR-Lab/autoautoresearch — Jetson AGX Orin port with a “director” written in Go that injects novelty (arxiv papers, DeepSeek Reasoner output) when the loop gets stuck in local minima
  • supratikpm/gemini-autoresearch — Gemini CLI native, with Google Search grounding plugged into the loop as a live verification source. True headless overnight mode via --yolo --prompt. 1M token context.

Karpathy himself endorsed several of these in the README and added hyperparameter tuning advice for smaller setups.

The interesting ones in this group aren’t the “same thing on Mac” ports. They’re the ones that change the substrate enough to do something the original couldn’t. MLX on Apple Silicon is legitimately different compute. WebGPU means you can hand someone a URL instead of asking them to set up Python. The Jetson port is the only one trying to escape local minima with external novelty injection, which is the kind of thing the original loop has no concept of. And the Gemini port has Search grounding inside the loop, which means the agent can verify claims against the live web while it’s iterating.

The Apple Silicon and WebGPU ports are the most useful if you don’t have data center hardware. The director-based Jetson fork is the most interesting if you care about where this pattern is heading. Most loops can hill-climb. Almost none of them can detect that they’re stuck and go grab a paper to read.

GPU Cluster Scaling

The opposite direction. What happens if you give it 16 GPUs instead of one?

SkyPilot wrote it up. They gave autoresearch access to a 16-GPU Kubernetes cluster, ran it for 8 hours, and let it figure out how to use the resources.

  • ~910 experiments in 8 hours
  • val_bpb dropped from 1.003 to 0.974 (a 2.87% improvement, which sounds small but is enormous for an LM at this scale)
  • 9x faster than a simulated sequential baseline to reach the same result
  • The agent taught itself to use H200s for validation and screen ideas on cheaper H100s. Nobody told it to do that.

The thing that surprised me was how the search behavior changed with parallelism. Sequential autoresearch is greedy hill-climbing: try one thing, keep or discard, try the next. Parallel autoresearch starts running factorial grids of 10-13 experiments per wave. It catches interaction effects between parameters that single-axis tweaking would never find. Two changes that look mediocre alone can be great together. You can’t see that one-at-a-time.
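
To picture why the wave version catches interactions, here's a toy sketch of the dispatch (my sketch, not SkyPilot's harness; run_experiment is a placeholder):

```python
import random
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def run_experiment(config) -> float:
    """Placeholder: launch a training run with this config, return val_bpb."""
    return random.uniform(0.97, 1.01)

# A factorial wave: every combination of two axes at once, so you can see
# pairs that are great together even when each change looks mediocre alone.
grid = list(product([3e-4, 6e-4, 1e-3],       # learning rates
                    ["baseline", "variant"]))  # an architecture toggle

with ThreadPoolExecutor(max_workers=16) as pool:   # one worker per GPU
    scores = list(pool.map(run_experiment, grid))

best_config, best_bpb = min(zip(grid, scores), key=lambda t: t[1])
```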

This is the version that stops looking like a hobby project. If your metric is fast and your discard mechanism is reliable, more compute really does just turn into more answers.

2. ML Research Enhancers: Making the Loop Smarter

The “the flat TSV is not enough” camp

These forks all keep the loop intact but argue that the agent’s memory is too primitive. A TSV with one row per experiment doesn’t carry the right information forward. So they bolt on cognitive architecture.

Memory-Enhanced Researchers

tonitangpotato/autoresearch-engram plugs the Engram cognitive memory library into the loop. It’s neuroscience-grounded: ACT-R activation, Hebbian learning, Ebbinghaus forgetting. RECALL and STORE steps wrap around the existing loop.

The numbers from a long-running instance:

  • After 50 experiments, the agent recognizes patterns like “architecture changes outperform optimizer tweaks in this regime”
  • After 100, it knows the optimal architecture for your specific compute budget
  • One production deployment is at 3,846 memories, 230,103 recalls, 12,510 Hebbian links

What that buys you, supposedly, is research intuition. Not “this worked” but “here’s why and here’s the pattern.” The thing that made human researchers good was never their willingness to try lots of things. It was the priors they built up about what was worth trying.

Bayesian + Active Inference

ErikDeBruijn/autoresearcher2 is the most ambitious one I found. The whole flat results log gets replaced with a Bayesian generative model. Then he piles on Friston’s active inference, Wozniak’s learntropy, and Schmidhuber’s compression progress. The agent doesn’t just ask “was this experiment good?” It asks “which of my latent beliefs was wrong?”

Four additions to the original loop:

  1. Generative model over experiment outcomes
  2. Policy evaluation via Expected Free Energy
  3. Learntropy appraisal module
  4. Persistent memory with decay dynamics

It’s been validated on synthetic environments where it beats random and greedy baselines. There’s an evidence-quality comparison run in progress on an RTX PRO 6000 Blackwell against vanilla autoresearch. The repo also has a CONSTITUTION.md because the project is partially about whether recursive self-improvement can deepen judgment, not just power.

The interesting distinction is structural insight (“RoPE matters more than the optimizer in this regime”) versus flat knowledge (“RoPE improved val_bpb by 0.02”). The flat version doesn’t compose. The structural version does.
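
A toy way to see the difference (my own sketch, nothing like the repo's actual generative model): keep a Beta posterior per intervention family instead of a row per experiment, and update it on every keep/discard.

```python
# Beta(alpha, beta) per intervention family:
# alpha counts kept experiments, beta counts discarded ones.
beliefs = {"architecture": [1, 1], "optimizer": [1, 1], "data": [1, 1]}

def update(family: str, kept: bool):
    beliefs[family][0 if kept else 1] += 1

def win_rate(family: str) -> float:
    a, b = beliefs[family]
    return a / (a + b)   # posterior mean of "a change here helps"

update("architecture", kept=True)
update("optimizer", kept=False)
# Over time this encodes "architecture changes outperform optimizer tweaks
# in this regime" as a prior the agent can act on, not a line in a TSV.
```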

Multi-GPU Infrastructure

iii-hq/n-autoresearch keeps the loop and replaces the plumbing. Out goes bash + git + TSV. In comes structured KV state, a REST API, and crash recovery. Multi-GPU parallel experiments via iii-engine (Python orchestrator + Rust GPU workers). Cross-machine GPU workers.

The clever part is the adaptive search strategy. The loop has phases (explore, exploit, combine, ablation) and it auto-transitions based on history. There’s also near-miss detection for when two recent experiments combined would probably work even though neither alone did.

Honestly, this is the “what if you scaled it to a real research lab” fork. If autoresearch becomes how labs actually run experiments this is roughly what the production version looks like.

3. Prompt Optimizers: Same Loop, Different Target File

What if train.py was your system prompt?

Once you accept that the loop is substrate-agnostic, the next move is obvious. Point it at a prompt file. Use accuracy on a test set as the metric. Let it iterate.

autoresearch-prompt-optimization (az9713)

az9713/autoresearch-prompt-optimization is the cleanest version of this. The loop targets prompt.txt instead of train.py. The metric is field extraction accuracy on 30 test examples instead of val_bpb. Everything else is the same.

The numbers:

  • 74.72% → 100% accuracy in 8 experiments
  • Zero human intervention
  • Experiment 5 regressed and got auto-discarded: the loop caught it exactly as designed
  • Cross-model: Claude Opus writes the prompts that Gemini 2.5 Flash executes

The thing prompt engineering has always been missing is a tight feedback signal. Most people write a prompt, eyeball some outputs, decide it “looks better.” Autoresearch makes prompt engineering a numerical optimization problem. Reading last_run.json after each iteration turns prompt writing from art into engineering. That’s a real shift.
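
The eval side is small enough to sketch (hypothetical extract_fields call, not az9713's code):

```python
import json
import pathlib

def extract_fields(prompt: str, text: str) -> dict:
    """Hypothetical: call the executor model with `prompt` on `text`, parse its reply."""
    raise NotImplementedError

def accuracy(prompt: str, test_set: list[dict]) -> float:
    """Fraction of test examples where every field comes back exactly right."""
    hits = sum(extract_fields(prompt, ex["input"]) == ex["expected"] for ex in test_set)
    return hits / len(test_set)

prompt = pathlib.Path("prompt.txt").read_text()
test_set = json.loads(pathlib.Path("test_set.json").read_text())
print(accuracy(prompt, test_set))   # the number the loop ratchets on
```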

autoresearch-for-agents (Galileo)

rungalileo/autoresearch-for-agents is more ambitious. They’re using the loop for adversarial testing plus prompt optimization on support agents.

Two phases. Phase 1 builds a frozen adversarial test suite (the exam). Phase 2 optimizes the prompt against that frozen suite (the studying). Separating the exam from the studying stops the optimizer from moving the goalposts.

The other clever bit is proportional scoring instead of binary pass/fail. Binary scores give the optimizer no gradient. “70% of the way there” is a signal you can climb. “Failed” isn’t.
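
The difference is easy to show (a sketch of the idea, not Galileo's scorer):

```python
def mentioned(response: str, required: set[str]) -> set[str]:
    return {r for r in required if r.lower() in response.lower()}

def binary_score(response: str, required: set[str]) -> float:
    # Pass/fail: meeting 4 of 5 requirements scores the same as meeting 0.
    return 1.0 if mentioned(response, required) == required else 0.0

def proportional_score(response: str, required: set[str]) -> float:
    # Partial credit: "70% of the way there" is a slope the optimizer can climb.
    return len(mentioned(response, required)) / len(required)
```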

Results: 0.05 → 0.80 accuracy in 15 experiments. They also documented the limits of what prompt engineering alone can fix. Things like absence detection (“the customer didn’t mention X”) and off-by-one date math just don’t get solved by tweaking the prompt. That’s a useful negative result. Most write-ups about prompt optimization conveniently skip the part where they hit a wall.

4. Generalized Frameworks: Autoresearch For Anything

“Wait, this works for any measurable thing”

This is the category that broke containment. Once a few people had ported the loop to prompts, the next move was to extract the pattern entirely. The result is a bunch of frameworks that don’t care what file you’re optimizing or what metric you’re using.

uditgoenka/autoresearch — Claude Code Skill

uditgoenka/autoresearch packages the loop as a Claude Code skill. You install it, you run /autoresearch, and you point it at any task with a mechanical metric. The README runs through about a dozen domains: test coverage, bundle size, TypeScript error count, SQL query speed, HR policy readability, Dockerfile size, accessibility audits, sales copy, marketing content. There’s also /loop N integration for bounded iterations.

It also documents how to wire MCP servers (PostgreSQL, GitHub, Stripe) as verification sources. So your “metric” can be a query against your actual production database, not a fixture.

This is the version that makes the generalization explicit. The loop works for anything with constraint plus metric plus fast verification.

autoresearch-anything (zkarimi22)

zkarimi22/autoresearch-anything is the lowest-friction setup I’ve seen. You run npx autoresearch-anything and it interrogates you:

  • What file should I edit?
  • What metric am I optimizing?
  • How do I run the eval?
  • What’s off-limits?
  • A few more along those lines.

It outputs setup.md and eval.js and you’re running. Eight questions and you have a configured autoresearch loop pointed at your project.

menonpg/autoloop — The pip Package

menonpg/autoloop is the first one that’s actually a Python library. pip install autoloop-ai, import, and the API is clean:

```python
from autoloop import AutoLoop

loop = AutoLoop(
    target="src/optimize_me.py",            # the file the loop is allowed to edit
    metric=lambda: run_benchmark(),         # your own function; returns a float
    directives="Make this faster, don't break tests",
    budget_seconds=600,
)

results = loop.run(experiments=100)
```


Parallel experiments via loop.run(parallel=4). Warm starts. Composite metrics with weights. Agent-agnostic: works with Claude, Codex, Ollama local models. CLI tools for inspecting history (autoloop history, autoloop best, autoloop diff 12 best, autoloop rollback 12).

The demo shows a 6.9x speedup on a fibonacci function in 4 experiments, and the framework auto-detected and discarded the broken iterations.

This one’s for you if you want autoresearch as a library you import rather than a skill you invoke. The bar is “have a Python function that returns a float” and you’re in. That’s about as low as it gets.

krzysztofdudek/ResearcherSkill — One File, Full Discipline

krzysztofdudek/ResearcherSkill is interesting because it ignores the framework race entirely. It’s one researcher.md file you drop into any AI agent. Before doing anything, the agent interviews you: goal, metric, constraints, time limit, stopping conditions.

It creates a .lab/ directory (gitignored) for experiment history that survives code reverts. That’s separate from git on purpose. You don’t want a git reset --hard to wipe your experiment log.

The loop has three phases:

  1. THINK — mandatory written analysis before each experiment, logged separately
  2. TEST — commit, run, keep or revert
  3. REFLECT — log entry in log.md, row in results.tsv

There are also convergence guardrails baked in. Three discards in a row = mandatory pause. Five discards = force branch fork. Plateau for 8+ experiments = invert assumptions.
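
In Python, the guardrails reduce to something like this (my paraphrase of the rules, not the skill's implementation):

```python
def guardrail(history: list[bool]) -> str | None:
    """history holds keep/discard outcomes, newest last; True means kept."""
    if len(history) >= 8 and not any(history[-8:]):
        return "plateau: invert your assumptions"
    streak = 0
    for kept in reversed(history):
        if kept:
            break
        streak += 1
    if streak >= 5:
        return "force a branch fork"
    if streak >= 3:
        return "mandatory pause"
    return None
```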

The interesting part is THINK. Most autoresearch implementations skip written analysis. The agent just runs. Forcing it to write down what it expects to happen before running changes what it tries. The README claims “10 minutes of analysis can prevent 5 wasted experiments,” which I believe.

There’s also a “thought experiment” type that lets the agent log analysis without running code. It counts as a row in the results, just labeled thought. That’s a small detail and it matters more than it should.

alfonsograziano/auto-agent — Autoresearch Builds Agents

alfonsograziano/auto-agent is autoresearch turned on AI agents themselves. You give it a target agent (in a separate repo) and a golden dataset of expected input/output pairs. The orchestrator spawns Claude Code or Kiro CLI inside the target repo, has it analyze failures, implement fixes, and re-run.

Two repos: orchestrator and target. MEMORY.md persists across hypotheses (what worked, what didn’t, known blockers). Each hypothesis gets its own git branch and its own REPORT.md with before/after metrics and a CONTINUE or ROLLBACK decision. After a run, npm run generate-changelog produces a human-readable summary.

This is recursive in a way that's genuinely interesting. The thing being optimized is an AI agent. The thing doing the optimizing is also an AI agent. The metric is how often the target hits the golden set. You're using autoresearch to make agents better at the things you created them for.

5. Production Codebase Optimization: Autoresearch on Real OSS

Shopify used it on the Liquid template engine

This is where the pattern stops being a demo. Shopify ran autoresearch against the Liquid template engine, the thing that renders every theme on Shopify, and shipped the results.

The setup is in auto/autoresearch.md:

  • Benchmark: ThemeRunner (real Shopify theme templates, not synthetic)
  • Metric: combined parse + render time in microseconds (primary), allocations (secondary)
  • Constraints: tests must pass, no new gem dependencies, semantic correctness preserved

The results across 17 tracked experiments:

  • 7,374µs → 4,815µs (-34%)
  • 62,620 → 37,355 allocations

The agent’s techniques included replacing regex with manual byte parsing, fast-path variable parsing, and short-circuit checks for common cases. None of it is rocket science. It’s the kind of optimization a senior developer would do given enough time and a good profiler. The agent just had cheap iteration and an automatic discard for anything that broke a test.

More Production War Stories

Real companies, real metrics, real prod deploys

Once Shopify went public with theirs, more case studies surfaced.

idealo Search Ranking

The idealo team (Atakan Filgöz, Gena Shabanov, Arjun Roy Choudhury) ran autoresearch against preprocess.py in their Learning-to-Rank inference endpoint. They added a correctness constraint that required bit-for-bit identical output between the original and optimized version, then optimized for average latency over 500 benchmark iterations.
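
That correctness gate is the part worth stealing, and it's cheap to write. A sketch under my own assumptions (hypothetical names, not idealo's harness):

```python
import pickle
import time

def assert_identical(original, optimized, test_inputs):
    """Bit-for-bit gate: compare serialized outputs, so float formatting can't hide drift."""
    for x in test_inputs:
        if pickle.dumps(original(x)) != pickle.dumps(optimized(x)):
            raise AssertionError(f"output diverged on input {x!r}")

def avg_latency_ms(fn, test_inputs, iterations=500):
    """Average per-call latency over the benchmark loop, in milliseconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        for x in test_inputs:
            fn(x)
    return (time.perf_counter() - start) * 1000 / (iterations * len(test_inputs))
```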

Numbers:

  • 13 experiments in 1 hour
  • 10 kept, 3 reverted
  • Preprocessing latency: 3.9ms → 0.66ms (83% reduction, 5.9x speedup)
  • End-to-end production latency: 46ms → 28.8ms (37% reduction at 250+ req/sec)
  • Total cost: ~$7 in Claude Opus on AWS Bedrock

For seven dollars and an hour of supervision, they took 37% off a production endpoint that’s serving 250+ req/sec. That’s an absurd ROI.

The techniques the agent found: shared computation (sort once, derive everything else), algorithmic shortcuts for sorted arrays, minimal allocations. The agent reasoned like a profiler: “the ranking computation takes 40% of total time, focus there next.” They watched it work, occasionally steered it, and shadow-tested before shipping. It’s now in production.

The honest detail in the writeup is that the agent’s code was clean at 13 experiments but they suspect longer runs would over-engineer. That tracks with my experience using AI tools for refactoring. The first dozen suggestions are gold. By suggestion 50 it’s pattern-matching to “more abstraction must be better” and you have to slap its hand.

Tennis XGBoost — The Reward Hacking Cautionary Tale

This is the one nobody mentions when they’re hyping the pattern. Nick Oak ran autoresearch on a tennis match prediction XGBoost model. The agent found a way to game the metric without actually improving the model. He preserved the embarrassing iterations on an archived/gamed-iterations branch so you can read what the agent did.

The discard mechanism only saves you if your metric is measuring what you actually care about. If your eval can be gamed, the agent will game it. This is not an RL-only problem. Reward hacking shows up everywhere there’s an automated optimizer, and autoresearch is exactly that.

The takeaway isn’t “autoresearch is dangerous.” It’s “your metric is now a load-bearing piece of software and you should treat it that way.” Spend more time on the eval than on the loop.
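
The simplest concrete defense is making the eval physically untouchable. A minimal sketch of that separation (hypothetical paths, one way to do it, not the only way):

```python
import subprocess

# The agent gets write access to src/ and nothing else. The scorer and its
# holdout data live outside the writable tree, so "improve the metric"
# can never quietly become "edit the metric".
result = subprocess.run(
    ["python", "/readonly/eval/score.py", "--data", "/readonly/eval/holdout.json"],
    capture_output=True, text=True, check=True,
)
score = float(result.stdout.strip())
```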

Vesuvius Challenge Ink Detection

Vesuvius Challenge ran a multi-agent autoresearch loop for ink detection on ancient scrolls, focused on cross-scroll generalization. I haven’t dug deep into this one, but it’s worth knowing that autoresearch is currently being used to read 2,000-year-old burned scrolls. That’s a thing.

6. Agent Factory: Autoresearch Builds Agents

Applying the loop to creating other agents

Dominien/agent-factory takes the meta move further than auto-agent. Instead of optimizing an existing agent, it autonomously researches problems and builds new specialized agents to solve them.

The loop is:

  1. Research : Reddit, HN, GitHub, Twitter — find real problems people have
  2. Score : Venture Score plus TAM estimate
  3. Build : Next.js agent from a seed template
  4. Validate : against synthetic users / actual usage
  5. Ship
  6. Repeat

There’s a threshold ratchet. The bar to ship keeps rising as the system finds better ideas. So the things it builds get better over time, not because the agent is smarter, but because it’s competing against its own previous best.
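
The ratchet is barely code (a sketch of the idea, not Dominien's implementation):

```python
best_shipped = 0.0   # the high-water mark; it only rises

def maybe_ship(idea_score: float) -> bool:
    global best_shipped
    if idea_score > best_shipped:
        best_shipped = idea_score   # the bar to ship is now the previous best
        return True
    return False
```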

Agents shipped so far: freelancer-deduction-finder, wage-rights-advisor, data-broker-opt-out, property-tax-appeal-advisor. Twenty agents and counting.

This is the meta-loop concept and I find it disorienting. Research quality compounds the same way training quality does. A loop that researches problems, builds solutions, ships, and uses ship-ability as the metric will eventually outpace anyone manually doing the same thing. Whether the agents it ships are any good is the open question. But the number keeps going up.

7. Research OS / Skills Systems: Institutionalizing the Pattern

What if autoresearch was the entire research methodology?

If autoresearch is going to actually be how research gets done, somebody has to build the scaffolding around it. Two projects are going hard at this.

PhD-Zero (TenureAI)

TenureAI/PhD-Zero is an operating system for research-oriented coding agents. Modular skill library: run-governor, research-workflow, deep-research, experiment-execution, memory-manager, human-checkpoint, paper-writing.

Cross-runtime: same skills exposed to Codex (via AGENTS.md) and Claude Code (via .claude/skills/). The focus is reproducibility, literature review, experiment planning. Discipline around the process.

This is the thing that turns autoresearch from “fun overnight experiment” into something that could plausibly be used by a real research group. The autoresearch loop runs experiments. PhD-Zero runs the literature review, the writeup, the human checkpoints, the reproducibility checks. The loop is one verb in a much bigger vocabulary.

alirezarezvani/claude-skills

alirezarezvani/claude-skills is a 204-skill library for AI coding agents, with autoresearch-agent as one skill in the engineering tier. Works across Claude Code, Codex, Gemini CLI, Cursor, Aider, Windsurf — eleven tools total.

Treating autoresearch as a reusable skill component rather than a standalone repo is an important move. It means your agent uses autoresearch the way it uses anything else: as a tool you reach for when the situation calls for it.

8. Creative Writing: Autoresearch For Prose and Fiction

The thing nobody expected: it works on writing too

This is the one I want to come back to in another post. The transfer is straightforward. If you can score a draft, you can run the loop. The metric just needs to be cheap, mechanical, and not gameable. (See the tennis cautionary tale.)

Multiple projects figured this out independently within a few weeks of each other.

redpen — Prose Refinement Engine

itspikabubu/redpen is a ratchet loop for blog posts and writing. Drafts can only get better, never worse. Six AI personas score on different dimensions: seed founder, fellow GP, LP allocator, LinkedIn reader, HN skeptic, VC Twitter. Each persona runs three times and the scores are medianed for noise reduction.

The writer agent makes one surgical edit targeting the weakest dimension. Re-evaluate. If the minimum score improved, keep. If not, discard and revert. Repeat until target score or max iterations.

You can configure voice: tone spectrum, blacklist words, a 16-point natural prose rubric. I have not tried this yet but I’m planning to. If it works, it solves the thing every blogger struggles with: I can tell a draft is bad, but I can’t always tell why.
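
The scoring mechanics are worth sketching because they're the transferable part (hypothetical score_as call, not redpen's code):

```python
from statistics import median

PERSONAS = ["seed founder", "fellow GP", "LP allocator",
            "LinkedIn reader", "HN skeptic", "VC Twitter"]

def score_as(persona: str, draft: str) -> float:
    """Hypothetical: ask a model to score the draft as this persona, 0-10."""
    raise NotImplementedError

def evaluate(draft: str) -> dict[str, float]:
    # Three runs per persona, medianed, to knock down judge noise.
    return {p: median(score_as(p, draft) for _ in range(3)) for p in PERSONAS}

def keep_edit(before: dict[str, float], after: dict[str, float]) -> bool:
    # The ratchet targets the weakest dimension: the floor has to rise.
    return min(after.values()) > min(before.values())
```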

NousResearch/autonovel — Complete Novel Pipeline

NousResearch/autonovel is the most ambitious creative writing fork. Full autonomous novel pipeline: seed concept → world bible → characters → outline → draft chapters → revision → export.

Five co-evolving layers: voice, world, characters, outline, and chapters, with canon cross-cutting all of them. Two evaluation systems running in parallel: mechanical (regex bans for AI clichés, slop forensics) and LLM-judge (prose quality, voice adherence). Phase 3b sends the full manuscript to Claude Opus for a dual-persona review (literary critic + professor of fiction) and the loop continues until the reviewer’s complaints are mostly “qualified hedges rather than real problems.” Their phrase, not mine.

There’s also an art pipeline (fal.ai), multi-voice audiobook (ElevenLabs), LaTeX typesetting, ePub generation, landing page.

The first novel produced is The Second Son of the House of Bells. 79,456 words. 19 chapters (down from 24: the loop did four structural merges). Six rounds of Opus review.

The loop improved prose and changed the structure of the book. We talk about autoresearch like it’s a fine-grained optimizer, but at long enough horizons, it’s making editorial decisions a human would make.

sinfiny/Auto-Creative-Reasoning

sinfiny/Auto-Creative-Reasoning is benchmark-first. The repo motto is “generation is not the product. Evaluation is the product.” Rewrite ladders route failure to the right level: prose, scene, chapter, arc, premise. Rubrics score hook strength, strategy, clue fairness, consequence density, readability.

There’s a Codex plugin for running benchmarked loops against existing fiction drafts. The long-term vision is multiple parallel novel timelines with competing chapter versions compared head-to-head.

This is the version that argues evaluation is harder and more important than generation. Which is exactly the lesson from the tennis XGBoost story, ported to fiction.

CalvinMagezi/self-evolving-skill — Brand Document Evolution

CalvinMagezi/self-evolving-skill is the business-minded version. Autoresearch applied to writing-strategy.md instead of train.py. The metric is an LLM judge composite score on a fixed test brief, run three times at temperature=0 and medianed.

The output is real documents: .docx, .pptx, .pdf that match brand identity. Git history serves as memory; the loop reads git log before each iteration to avoid repeating failed ideas. Works with any LLM via LiteLLM (OpenRouter, Gemini, OpenAI, Anthropic).

This is the one with the clearest business case of the bunch. Companies actually need their documents to get better. They have brand rubrics. They have a fixed test brief in the form of “the next thing we need to write.” All the pieces are already there.

9. Meta-Pattern: Wrapping Autoresearch as a Worker

What happens when autoresearch is just one layer of something bigger

This is the one that snapped my view of the whole ecosystem into focus. alirezarezvani had been shipping autoresearch as a skill since March. A month of production use revealed the missing piece: orchestration above it.

The Problem with Solo Autoresearch

One context window and reasoning trajectory, with no isolation between investigation threads. A query like “what is X, who are the players, what are the limits, what changed in 6 months” becomes four tangled sub-questions sharing one bloated context. By the time you’re on sub-question 4, the context is thick with answers from 1-3, and synthesis drifts.

This is something I hit constantly with Claude Code on big tasks. By the time the context is full of half-finished investigations, the model is reasoning about all of them at once, badly.

The Fix: 3 Files, 4 Subagents

The whole rebuild is small:

  • CLAUDE.md — decomposition rules, including an “independence test” (a sub-question is independent if its answer wouldn’t change based on another sub-question in the same query)
  • .mcp.json — Firecrawl, Perplexity, internal docs server. Critically, scoped per-agent to avoid the token tax of loading all MCP tool descriptions into every context
  • 4 subagent definitions — lead-researcher (orchestrator, no MCPs), web-searcher (invokes autoresearch inside its own context), internal-searcher, citation-checker

Lead decomposes. Workers fan out in parallel. Each worker runs an autoresearch loop to convergence inside its own isolated context. Lead synthesizes. Citation-checker verifies every source. Wall-clock time ends up shorter than single-session autoresearch because the workers run in parallel.
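
The shape of the orchestrator is simple enough to sketch (placeholder decompose and run_worker, not the actual subagent definitions):

```python
import asyncio

def decompose(query: str) -> list[str]:
    """Placeholder: split into sub-questions that pass the independence test."""
    return [query]

async def run_worker(subq: str) -> str:
    """Placeholder: one isolated context running an autoresearch loop to convergence."""
    return f"findings for: {subq}"

async def answer(query: str) -> str:
    subqs = decompose(query)
    findings = await asyncio.gather(*(run_worker(q) for q in subqs))  # parallel fan-out
    return "\n".join(findings)   # stand-in for synthesis plus the citation check

print(asyncio.run(answer("what is X, who are the players, what changed in 6 months?")))
```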

What Actually Broke In Production

Four failure modes from the writeup, and they all rang bells:

  1. Orchestrator over-delegation — without the independence test, the orchestrator was paying for parallel context windows to produce worse answers than one session would have
  2. MCP tool-description token tax — every MCP server’s tool descriptions loading into every agent’s context. Scoping per-agent fixed it
  3. Citation drift — workers returning confident claims where the page didn’t quite support the paraphrase. Paraphrase drift, not hallucination
  4. Context amnesia between sessions — a flat lessons.md file the lead reads on startup is the imperfect fix

The lesson here is the one that rewires the whole picture. Autoresearch was already a strong worker. The orchestrator does nothing clever: decompose, delegate, synthesize. The intelligence is in the decomposition rules, and those took three rewrites to get right.

So the future isn’t “smarter autoresearch.” It’s autoresearch as a primitive that other systems call into.

So What Does This Actually Mean?

Karpathy didn’t just build an ML research tool. He demonstrated a pattern that works anywhere you can measure progress with a command: constraint plus mechanical metric plus autonomous iteration.

Here are the categories ranked by fidelity to the original idea:

  1. Platform ports — most faithful. Same loop, different hardware.
  2. ML enhancers — extend the substrate. Memory, Bayesian updates, multi-GPU.
  3. Prompt optimizers — same loop, different file. train.py → prompt.txt.
  4. Generalized frameworks — extract the pattern. pip packages, Claude Code skills, “give me any metric.”
  5. Production codebase — industrial application. Shopify -34%, idealo -37% in 1 hour for $7.
  6. Agent factory — meta-application. The loop builds other agents.
  7. Research OS — institutionalization. The whole methodology, not just the loop.
  8. Creative writing — the surprise expansion. Prose, fiction, brand documents.
  9. Orchestration — autoresearch as worker, not the whole system.

A few honest takes:

The reward hacking problem is the cautionary tale nobody includes. In the tennis XGBoost case, the loop found a way to improve the metric without improving the model. The discard mechanism is only as good as your metric. If your eval can be gamed, the agent will game it. Spend more time on the eval than on the loop.

The pattern is more durable than the implementation. Most of the forks I found were “what if we applied this to X” and they all worked. That’s kind of remarkable. The discard mechanism (git reset on regression) is the key. You don’t need intelligence. You need iteration speed, a mechanical metric, and automatic rollback.

The Shopify and idealo case studies should embarrass you a little. $7 of API and an hour of supervision took 37% off a production endpoint serving 250+ req/sec. There are perf wins like this in basically every codebase. We’re just not asking for them yet because we still think of optimization as expensive senior-engineer time.

Orchestration eats the loop. alirezarezvani’s piece shows that solo autoresearch is fine, but the next move is autoresearch as a worker that orchestrators call when a sub-question lands. That’s where this is heading and it’s already happening in production.

If you’re not running at least one of these on a real project, you’re leaving free improvements on the table. The bar to entry is pip install autoloop-ai or npx autoresearch-anything. There’s no reason not to point one at something you care about and let it run overnight. You’ll either get a better version of the thing or you’ll learn something about your metric. Both of those are wins.
