Everyone's writing about Karpathy's autoresearch. Most of it is "here's how the loop works" or "imagine the possibilities." I wanted to see what happens when you point it at a real codebase with a real metric, not a training script. So I tried it.
So I ran two rounds. 60 total iterations. The first round improved things. The second round found nothing - and that turned out to be even more interesting.
## The System
I work on a hybrid search system: Cohere embeddings in pgvector for semantic similarity, then a keyword re-ranking layer on top. Django, PostgreSQL, Bedrock. The kind of search stack a lot of teams are probably running right now.
The ranking logic lives in one file: `utils.py`. It takes the top 100 vector search candidates, scores them on keyword and tag matches across location, activity, and general terms, normalizes everything with z-scores, applies adaptive correlation-based weighting to avoid double-counting, and combines it all into a final score: `similarity * (1 + keyword_boost)`.
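To make the shape of that concrete, here's a rough sketch of the combination step. Names and structure are my paraphrase for the post, not the actual `utils.py`:

```python
def combined_score(similarity: float, keyword_z: float, weight: float) -> float:
    """Sketch of the final ranking formula described above.

    similarity: semantic similarity from pgvector (higher = closer)
    keyword_z:  z-scored, clipped keyword/tag match signal
    weight:     per-query-type base weight after correlation shrinkage
    """
    keyword_boost = weight * keyword_z
    return similarity * (1 + keyword_boost)
```

Every number feeding into `keyword_boost` is one of the knobs described next.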
There are a lot of knobs. Base weights for three query types. A scoring formula for body keyword matches. Z-score clipping bounds. A correlation shrinkage function. The final combination formula. All hand-tuned. All "seems about right."
Perfect autoresearch target.
## The Setup
The autoresearch pattern is simple: one file, one metric, one loop. The agent edits the file, runs the eval, keeps improvements, reverts failures, repeats.
Here's what I set up:
The constrained file: utils.py — ranking logic only. The embedding service, query metadata extraction, database schema — all frozen.
The metric: a composite of 80% Precision@12 (how many of the top 12 results are actually relevant) and 20% MRR (whether the best result is near the top). I weighted it this way because MRR was already at 0.975 — almost every query already had the right #1 result. The room to improve was in the rest of the top 12.
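Both metrics are standard; a minimal implementation of the composite, matching the weighting above:

```python
def precision_at_k(ranked_ids, relevant_ids, k=12):
    """Fraction of the top-k results that are labeled relevant."""
    top = ranked_ids[:k]
    return sum(1 for r in top if r in relevant_ids) / k

def reciprocal_rank(ranked_ids, relevant_ids):
    """1/rank of the first relevant result, 0 if none appear."""
    for i, r in enumerate(ranked_ids, start=1):
        if r in relevant_ids:
            return 1.0 / i
    return 0.0

def composite(queries):
    """queries: list of (ranked_ids, relevant_ids) pairs.

    SCORE = 0.8 * mean P@12 + 0.2 * mean MRR.
    """
    p = sum(precision_at_k(r, rel) for r, rel in queries) / len(queries)
    mrr = sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
    return 0.8 * p + 0.2 * mrr
```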
The test set: 20 queries across three types (location, activity, general) with hand-labeled expected results. Things like "best hiking trails near Aspen Colorado," "beginner backpacking gear list," "avalanche safety backcountry skiing." I ran each query, looked at the top 50 results, and picked the ones that actually answered the question.
The eval caching trick: Each query hits Bedrock twice (query metadata + embedding). That's 15 seconds per query. But the agent only modifies the ranking logic — the embeddings and metadata don't change between iterations. So I cached all the API results on the first run and monkey-patched them in on subsequent runs. Eval went from 6 minutes to about 30 seconds.
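The caching itself is simple. Here's a minimal sketch of the idea: a disk-backed memoizer you wrap around the slow Bedrock calls. The cache file name and the patched function names at the bottom are illustrative, not my real code.

```python
import functools
import json

CACHE_PATH = "eval_cache.json"  # hypothetical cache file name

def disk_cached(fn):
    """Cache expensive API results on disk, keyed by function name plus
    arguments. The first eval run pays for the Bedrock calls; every run
    after that reads the cache instead."""
    try:
        with open(CACHE_PATH) as f:
            cache = json.load(f)
    except FileNotFoundError:
        cache = {}

    @functools.wraps(fn)
    def wrapper(*args):
        key = f"{fn.__name__}:{json.dumps(args)}"
        if key not in cache:
            cache[key] = fn(*args)
            with open(CACHE_PATH, "w") as f:
                json.dump(cache, f)
        return cache[key]

    return wrapper

# In the eval, monkey-patch the slow calls (names are illustrative):
# embedding_service.get_embedding = disk_cached(embedding_service.get_embedding)
# embedding_service.extract_query_metadata = disk_cached(embedding_service.extract_query_metadata)
```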
I wrote an instructions.md that told Claude Code exactly what it could touch, what it couldn't, and what strategies to try in roughly what order. Here's the skeleton:
## The Constrained File
- `src/service/utils.py` — ONLY file you may edit
## What You Cannot Modify
- eval script, test queries, embedding service, cache files, schema
## Eval
- `uv run manage.py run_autoresearch_eval`
- SCORE = 0.8 * Precision@12 + 0.2 * MRR
## Strategy Guidance (roughly in this order)
1. Quick wins: base weights, pool size, zclip range
2. Scoring function: damping, formula shape
3. Weight optimization: fine-tune per query type
4. Experimental: combine best ideas
## Do NOT
- Add API calls or new dependencies
- Edit frozen files
- Spend 3+ iterations on the same dead approach
Then I walked away. Literally, I went and played with my kids.
## Round 1: The Results
Baseline: 0.6933 composite (P@12: 0.6292, MRR: 0.9500)
Final: 0.7200 composite (P@12: 0.6500, MRR: 1.0000)
44 iterations. 3 kept. 41 reverted.
Let that sink in: 93% of experiments made things worse or changed nothing.
Here are the three changes that survived:
1. Bigger base weights, scaled by query type. Location queries got 5x the original weight. Activity queries 3x. General 2x. The system had been under-weighting the keyword signals relative to the embedding similarity.
2. Exponential scoring formula. Swapped `(1-d) * (1+boost)` for `(1-d) * exp(boost*0.3)`. Better separation between boosted and unboosted items. This also fixed the one query where MRR wasn't perfect.
3. Higher general weights. Pushed 5x on the general query type weights specifically, which improved "best hikes in the world" from P@12 0.667 to 0.750.
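For reference, the second change side by side with the original, where `d` is vector distance and `boost` is the weighted keyword signal:

```python
import math

def linear_score(d, boost):
    """Original formula: closeness (1-d) scaled linearly by the boost."""
    return (1 - d) * (1 + boost)

def exp_score(d, boost):
    """Surviving change: exponential boost. On this test set it gave
    better separation between boosted and unboosted items."""
    return (1 - d) * math.exp(boost * 0.3)
```

For small boosts, `exp(0.3 * boost) ≈ 1 + 0.3 * boost`, so the two formulas agree near zero; the separation shows up as boosts stack.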
None of these are surprising in hindsight. That's kind of the point.
## What Didn't Work (the Actually Useful Part)
This is where the value is.
Bigger candidate pools don't help. I expected that going from 100 to 150 or 200 re-ranking candidates would surface articles that were just barely outside the original pool. Nope. The expected articles were already in the top 100 by vector distance. The problem was ranking, not recall.
Title matching is noise. Seemed like a slam dunk — articles with query terms in the title should rank higher, right? In practice, tons of irrelevant articles also have those terms in their titles. Net negative.
Disabling adaptive weighting hurts. The correlation shrinkage I'd built in (reduce keyword weight when keywords correlate with embedding similarity) was actually pulling its weight. Removing it caused regressions.
Keyword density scoring backfires. Normalizing keyword counts by article length seemed smart. It wasn't. Shorter articles aren't more relevant — they just have fewer words.
Body keyword damping doesn't matter. Whether you use `1 + log1p(count) * 0.5` or `1 + log1p(count) * 0.3` or `min(count, 3)`, the scores barely move. The exact damping formula is not where the signal is.
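You can see why by plugging in realistic match counts. All three variants (the formulas quoted above) flatten out fast, so the differences between them get swamped by the base weights:

```python
import math

def damp_log_05(count):
    return 1 + math.log1p(count) * 0.5

def damp_log_03(count):
    return 1 + math.log1p(count) * 0.3

def damp_cap(count):
    return min(count, 3)

# Going from 10 matches to 100 barely moves the log variants,
# and the cap stops moving entirely after count=3.
```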
Each of these would have been a reasonable thing to try manually. And each would have taken 15-30 minutes of "change, test, evaluate, think about it, decide." The agent burned through all of them in a few hours and proved definitively that they're dead ends.
And fiddling with just the weights manually would have taken forever, if I'd even bothered going that far. Instead, this is basically LLM-led gradient descent: propose a change, measure, keep or revert, repeat.
## Round 2: Optimizing the Prompt (and Finding Nothing)
Round 1's final log said the ceiling was upstream — the quality of keywords extracted by Claude Haiku from the user's query. So I ran a second round targeting the Haiku prompt in embedding_service.py. Same test set, same metric, Round 1's ranking changes frozen.
16 iterations. Zero improvements. But two findings that were worth the entire round:
The Redis trap. The metadata extraction function caches results by `hash(query)` — not `hash(query + prompt)`. My first two iterations showed improvements that weren't real. The eval was reading cached metadata from the old prompt. I only caught it when I cleared the cache manually and the "improvements" vanished. If you're running autoresearch on anything with a caching layer, make sure the cache key includes everything that could change between iterations.
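The fix is to fold everything mutable into the key. A minimal sketch, with an illustrative function name:

```python
import hashlib

def cache_key(query: str, prompt: str) -> str:
    """Key the metadata cache on everything that can change between
    iterations: the query AND the prompt, not just the query."""
    h = hashlib.sha256()
    h.update(query.encode("utf-8"))
    h.update(prompt.encode("utf-8"))
    return h.hexdigest()
```

Hashing the prompt text itself (rather than a manually bumped version string) means you can't forget to invalidate.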
The co-optimization ceiling. Round 1 tuned the ranking weights to work with the specific metadata distribution the original prompt produces. Changing the prompt changes that distribution, and the frozen ranking can't adapt. Every prompt change that improved location queries degraded activity queries. The two components were coupled, and optimizing them sequentially hit a wall that optimizing them together wouldn't have.
This is the thing nobody mentions about autoresearch: sequential rounds have a structural ceiling. Round 1 overfits to the current state of the frozen components. Round 2 can't improve those components without undoing Round 1's gains. If you're planning multi-round autoresearch, either co-optimize both components in one round, or know that each round's ceiling will be lower than the last.
## Where the Ceiling Is
After 60 iterations across two rounds, the score settled at 0.72. The ranking math is near-optimal. The prompt is at a Pareto boundary. The remaining weak queries are ones where the right articles are far away in embedding space — "bike packing routes Pacific Northwest" returns road trip content because the embeddings think those are similar.
The next improvement needs a cross-encoder re-ranker or a better embedding model. That's a different project, not a different autoresearch run.
## Was It Worth It?
For the ranking improvements alone? Probably not. A +0.03 on a composite score is real but marginal.
For the knowledge? Absolutely. I now know, with 60 data points, that my ranking logic was already close to optimal, that the adaptive weighting I built actually works, that keywords are essentially decorative in this system (the embeddings do all the real work), that the Redis cache doesn't key on prompt changes, and that my next improvement has to come from the embedding layer.
I would not have arrived at any of that from manual tuning. I would have tried 8-10 things, gotten frustrated, and moved on with lingering uncertainty.
The autoresearch pattern works best not when it finds big wins, but when it maps the ceiling of a system. "You can stop tuning this" is an underrated finding.
## If You Want to Try This
You don't need GPUs. You don't need an ML training loop. You need:
- One file the agent can edit
- One metric that goes up when things get better
- A fast eval (cache everything that doesn't change between iterations)
- An `instructions.md` that tells the agent the rules
Write the eval first. Label some test data. Cache your API calls. Then let it run.
I've open-sourced the skill I built from this experiment as a Claude Code plugin: pjhoberman/autoresearch. It generates the full experiment harness (instructions, eval script, test data template, launch prompt) scoped to your codebase. The references/lessons.md file has everything I learned from both rounds.
The hard part isn't the loop. It's writing an eval that actually measures what you care about.
