In January 2025, Andrej Karpathy shared a concept that quietly changed how serious practitioners think about AI-assisted work: autoresearch. The idea is deceptively simple — instead of using an AI agent for one-shot tasks, you set up a loop where the agent autonomously runs experiments, evaluates results, keeps what works, discards what doesn't, and iterates. Over and over. While you sleep.
This isn't theoretical anymore. In early 2026, ARK Invest's research showed that AI coding agents can now work reliably and autonomously for 55+ minutes before needing human intervention. That's enough time for dozens of experiment-evaluate-iterate cycles. The autoresearch pattern has gone from interesting idea to practical workflow.
In this tutorial, we'll build an autonomous research loop from scratch. You'll learn the core pattern, see three real use cases, and walk away with a working setup you can adapt to your own projects.
The Autoresearch Pattern: Experiment → Evaluate → Keep/Discard → Iterate
At its core, Karpathy's autoresearch is a hill-climbing algorithm for knowledge work. Here's the loop:
- Experiment: The agent tries something — runs a code change, tests a hypothesis, tweaks a parameter
- Evaluate: It measures the result against a clear metric — did the test pass? Did the score improve? Did the output match expectations?
- Keep or Discard: If the result improved, keep the change. If not, revert it
- Iterate: Go back to step 1 with the updated state
This is fundamentally different from asking ChatGPT a question and getting an answer. Autoresearch is closed-loop — the agent acts on the world, observes the consequences, and adapts. It's the difference between reading a recipe and actually cooking, tasting, and adjusting.
┌─────────────┐
│ EXPERIMENT  │ ← Agent tries a change
└──────┬──────┘
       ▼
┌─────────────┐
│  EVALUATE   │ ← Measure against metric
└──────┬──────┘
       ▼
┌─────────────┐      ┌──────────┐
│   BETTER?   │─No──▶│ DISCARD  │──┐
└──────┬──────┘      └──────────┘  │
       │ Yes                       │
       ▼                           │
┌─────────────┐                    │
│    KEEP     │                    │
└──────┬──────┘                    │
       │                           │
       ▼                           │
┌─────────────┐                    │
│   ITERATE   │◀───────────────────┘
└─────────────┘
The critical insight is that this loop requires three things to work:
- An automated experiment the agent can run without human intervention
- A measurable evaluation metric — not vibes, actual numbers
- A version control mechanism to revert failed experiments cleanly
If you have all three, you can run an autoresearch loop. If you're missing any one of them, you'll need a human in the loop.
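These three requirements map directly onto a minimal loop skeleton. Here's a sketch in Python; `experiment`, `evaluate`, `keep`, and `revert` are placeholders you'd wire up to your agent, your metric script, and git:

```python
def autoresearch_loop(experiment, evaluate, keep, revert, iterations=20):
    """Generic hill-climbing loop: keep a change only if the metric improves."""
    best = evaluate()               # baseline score before any experiment
    for i in range(iterations):
        experiment()                # 1. agent tries a change
        score = evaluate()          # 2. measure against the metric
        if score > best:            # 3. keep or discard
            best = score
            keep(i, score)
        else:
            revert()
    return best                     # 4. iterate happens via the for-loop
```

The bash loops later in this tutorial are just this skeleton with `claude -p` as the experiment, a script as the evaluator, and `git commit`/`git checkout` as keep/revert.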
What You Need: The Toolchain
Before we build anything, here's the practical toolchain that makes autoresearch loops work in 2026:
Orchestration Layer: An AI Coding Agent
You need an agent that can read files, write files, run shell commands, and iterate on its own output. The best options right now:
- Claude Code — Anthropic's CLI agent. Runs in your terminal, has full filesystem and shell access, and can operate autonomously for extended periods. Our recommended choice for autoresearch loops due to its strong reasoning and tool-use capabilities.
- OpenAI Codex — OpenAI's sandboxed coding agent. Good isolation model but more constrained environment.
- Gemini CLI — Google's command-line agent. Strong multimodal capabilities if your research involves images or documents.
Any agent that supports autonomous tool use will work. The key requirement is that it can execute commands, read output, and decide what to do next — without waiting for you to hit Enter.
Experiment Log: Git
Every experiment needs a paper trail. Git gives you:
- Atomic commits for each experiment attempt
- Easy reverts when an experiment fails
- Branch isolation for parallel experiment tracks
- Full history to analyze what the agent tried and why
Create a dedicated repository for your autoresearch project. The commit history becomes your experiment log.
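Because each kept experiment becomes a commit, the history is easy to mine afterwards. A small sketch, assuming your commit messages follow a convention like "research: iteration 3 — score 5":

```python
import re

def parse_experiment_log(log_lines):
    """Extract (iteration, score) pairs from `git log --oneline` output
    whose messages contain 'iteration N' and 'score S'."""
    pattern = re.compile(r"iteration (\d+).*score (\d+)")
    return [
        (int(m.group(1)), int(m.group(2)))
        for m in (pattern.search(line) for line in log_lines)
        if m
    ]

# Usage: feed it real git output, e.g.
#   log = subprocess.run(["git", "log", "--oneline"],
#                        capture_output=True, text=True)
#   parse_experiment_log(log.stdout.splitlines())
```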
Evaluation: Scripts That Return Numbers
Your evaluation metric needs to be a script that prints a score or signals success through its exit code. Examples:
- pytest --tb=short → pass/fail count
- python benchmark.py → performance score
- npm run test:coverage → coverage percentage
- A custom scoring script that evaluates output quality
The more unambiguous your metric, the better the loop works.
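Whatever the tool, the job is to boil its output down to one number the loop can compare. Here's a sketch that scores a pytest summary line as passed minus failed; the summary format matches pytest's usual terminal output, but verify it against your version and adjust the regex for other tools:

```python
import re

def score_from_pytest_summary(output):
    """Turn a pytest summary like '3 passed, 1 failed in 0.12s'
    into a single integer: passed minus failed."""
    passed = re.search(r"(\d+) passed", output)
    failed = re.search(r"(\d+) failed", output)
    p = int(passed.group(1)) if passed else 0
    f = int(failed.group(1)) if failed else 0
    return p - f
```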
Use Case 1: Automated Research Paper Analysis
Goal: Build an agent that reads research papers, extracts key findings, and maintains a structured knowledge base.
Setup
Create a project directory with this structure:
autoresearch-papers/
├── papers/          # Drop PDFs or URLs here
├── findings/        # Agent writes structured summaries
├── knowledge.json   # Accumulated knowledge base
├── evaluate.py      # Scores completeness and accuracy
└── run-loop.sh      # The main loop script
The Loop Script
Here's a practical run-loop.sh that drives the autoresearch cycle:
#!/bin/bash
# run-loop.sh — Autonomous research paper analysis loop
REPO_DIR="$(cd "$(dirname "$0")" && pwd)"
MAX_ITERATIONS=20
LOG_FILE="$REPO_DIR/loop-log.txt"
BEST_SCORE=0

cd "$REPO_DIR" || exit 1

for i in $(seq 1 "$MAX_ITERATIONS"); do
    echo "=== Iteration $i ===" | tee -a "$LOG_FILE"

    # EXPERIMENT: Agent processes next unanalyzed paper
    claude -p "Look in papers/ for any paper not yet in findings/.
        Read it, extract key findings, methodology, and results.
        Write a structured summary to findings/.
        Update knowledge.json with new cross-references.
        Be thorough but concise." \
        --allowedTools "Read,Write,Bash" 2>&1 | tee -a "$LOG_FILE"

    # EVALUATE: Check quality of findings
    SCORE=$(python evaluate.py)
    echo "Score: $SCORE" | tee -a "$LOG_FILE"

    # KEEP or DISCARD: commit only if the score improved
    if [ "$SCORE" -gt "$BEST_SCORE" ]; then
        BEST_SCORE=$SCORE
        git add -A && git commit -m "research: iteration $i — score $SCORE"
        echo "✓ Kept iteration $i" | tee -a "$LOG_FILE"
    else
        git checkout -- .        # revert tracked changes
        git clean -fd findings/  # drop new, uncommitted findings files
        echo "✗ Discarded iteration $i" | tee -a "$LOG_FILE"
    fi
done
The Evaluation Script
# evaluate.py — Score the knowledge base quality
import json
import os

def evaluate():
    score = 0

    # Check findings exist and have content
    findings_dir = "findings"
    if not os.path.exists(findings_dir):
        return 0

    for f in os.listdir(findings_dir):
        filepath = os.path.join(findings_dir, f)
        with open(filepath) as fh:
            content = fh.read()
        # Score based on structural completeness
        if "## Key Findings" in content: score += 1
        if "## Methodology" in content: score += 1
        if "## Results" in content: score += 1
        if len(content) > 500: score += 1

    # Check knowledge base coherence
    if os.path.exists("knowledge.json"):
        with open("knowledge.json") as fh:
            kb = json.load(fh)
        if isinstance(kb, dict) and len(kb) > 0:
            score += 2

    return score

if __name__ == "__main__":
    print(evaluate())
You drop papers into the papers/ folder, start the loop, and come back to a structured knowledge base with cross-referenced findings. The git history shows exactly what the agent found and when.
Use Case 2: Trading Strategy Optimization
Goal: Iteratively improve a trading strategy by testing parameter variations against historical data.
This is where autoresearch really shines — parameter search over a well-defined objective function.
Setup
autoresearch-trading/
├── strategy.py   # The trading strategy with configurable params
├── backtest.py   # Runs strategy against historical data
├── data/         # Historical price data
├── results/      # Backtest results per iteration
└── run-loop.sh   # The loop
The Core Loop
#!/bin/bash
# Autonomous trading strategy optimization
BEST_SHARPE=0
MAX_ITERATIONS=50

for i in $(seq 1 "$MAX_ITERATIONS"); do
    # EXPERIMENT: Agent modifies strategy parameters
    claude -p "Review the current strategy.py and recent backtest results
        in results/. Analyze what's working and what isn't.
        Make ONE targeted parameter change to improve the Sharpe ratio.
        Document your reasoning in a comment.
        Do NOT change the core strategy logic, only parameters." \
        --allowedTools "Read,Write,Bash"

    # EVALUATE: Run backtest
    RESULT=$(python backtest.py 2>&1)
    SHARPE=$(echo "$RESULT" | grep "Sharpe:" | awk '{print $2}')

    # Guard: a broken backtest that prints no Sharpe would crash bc below
    if [ -z "$SHARPE" ]; then
        git checkout -- strategy.py
        echo "✗ Backtest produced no Sharpe ratio, reverted"
        continue
    fi

    # KEEP or DISCARD
    if (( $(echo "$SHARPE > $BEST_SHARPE" | bc -l) )); then
        BEST_SHARPE=$SHARPE
        git add -A && git commit -m "strategy: sharpe=$SHARPE (iteration $i)"
        echo "✓ New best Sharpe: $SHARPE"
    else
        git checkout -- strategy.py
        echo "✗ Sharpe $SHARPE < best $BEST_SHARPE, reverted"
    fi

    # Save result for agent context
    echo "Iteration $i: Sharpe=$SHARPE" >> results/history.txt
done
The agent sees the full history of what it's tried, so it can learn from failed experiments. After 50 iterations overnight, you wake up to an optimized parameter set with full documentation of every change attempted.
Important caveat: See the "When Autoresearch Fails" section below for why you should treat these results as starting points, not final answers.
Use Case 3: Code Quality Improvement
Goal: Autonomously improve a codebase's test coverage, performance, or code quality metrics.
This is the most immediately practical use case for most developers.
The Loop
#!/bin/bash
# Autonomous code improvement loop
BASELINE_COVERAGE=$(pytest --cov=src --cov-report=term 2>&1 | \
    grep TOTAL | awk '{print $4}' | tr -d '%')

for i in $(seq 1 30); do
    # EXPERIMENT: Agent writes tests or refactors for coverage
    claude -p "Run 'pytest --cov=src --cov-report=term-missing' and
        identify the module with the lowest coverage.
        Write meaningful tests for uncovered code paths.
        Focus on edge cases and error handling.
        Do not write trivial tests just to inflate coverage." \
        --allowedTools "Read,Write,Bash"

    # EVALUATE: Check that tests pass AND coverage improved
    RESULT=$(pytest --cov=src --cov-report=term 2>&1)
    PASS=$?
    NEW_COVERAGE=$(echo "$RESULT" | grep TOTAL | awk '{print $4}' | tr -d '%')

    if [ "$PASS" -eq 0 ] && \
       (( $(echo "$NEW_COVERAGE > $BASELINE_COVERAGE" | bc -l) )); then
        BASELINE_COVERAGE=$NEW_COVERAGE
        git add -A && git commit -m "tests: coverage $NEW_COVERAGE% (iteration $i)"
        echo "✓ Coverage: $NEW_COVERAGE%"
    else
        git checkout -- .  # revert tracked changes
        git clean -fd      # also drop newly created, uncommitted test files
        echo "✗ Tests failed or coverage didn't improve, reverted"
    fi
done
We've seen this pattern take a project from 45% to 78% test coverage overnight — with meaningful, well-written tests, not just assertion-free stubs.
Making It Robust: Practical Tips
1. Set Resource Limits
Autonomous loops can burn through API credits fast. Set hard limits:
# Cap total API spend
export MAX_TOKENS=500000    # Per iteration
export MAX_ITERATIONS=30    # Total loop count

# Add a cost check inside the loop
COST=$(calculate_api_cost)  # Your cost tracking function
if (( $(echo "$COST > 50.00" | bc -l) )); then
    echo "Cost limit reached: \$$COST"
    exit 0
fi
2. Use Git Branches, Not Main
Always run autoresearch on a feature branch:
git checkout -b autoresearch/experiment-$(date +%Y%m%d)
# ... run loop ...
# Review results before merging to main
3. Log Everything
The agent's reasoning is as valuable as its results. Capture full output:
claude -p "..." 2>&1 | tee -a "logs/iteration-$i.log"
4. Add Circuit Breakers
Stop the loop if something goes obviously wrong:
# Stop if 5 consecutive failures
FAIL_COUNT=0  # initialize once, before the loop starts

# Inside the loop body:
if [ "$IMPROVED" = "false" ]; then
    FAIL_COUNT=$((FAIL_COUNT + 1))
    if [ "$FAIL_COUNT" -ge 5 ]; then
        echo "5 consecutive failures — stopping loop"
        exit 1
    fi
else
    FAIL_COUNT=0
fi
When Autoresearch Fails: Honest Assessment
Autoresearch loops are powerful, but they have real limitations. Here's when they struggle:
Local Optima
This is the biggest risk. Hill-climbing algorithms, autoresearch included, can get stuck in a local optimum: the agent finds a parameter set that's better than its neighbors but far from globally optimal. It keeps making tiny changes, none of which improve the metric, and the loop stalls.
Mitigation: Run multiple loops from different starting points. Add randomization to the experiment step. Use the loop to explore, then apply human judgment to the best results.
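The restart mitigation can be sketched directly. The code below is a toy illustration, not a production optimizer: `evaluate` stands in for your backtest or metric script, and the numeric ranges are arbitrary assumptions:

```python
import random

def random_restart_search(evaluate, initial, n_restarts=5, n_steps=20, seed=0):
    """Hill-climb from several randomized starting points and keep
    the best (params, score) found across all restarts."""
    rng = random.Random(seed)
    best_params, best_score = initial, evaluate(initial)
    for _ in range(n_restarts):
        params = initial + rng.uniform(-10, 10)      # randomized start
        score = evaluate(params)
        for _ in range(n_steps):
            candidate = params + rng.uniform(-1, 1)  # small local move
            candidate_score = evaluate(candidate)
            if candidate_score > score:              # keep only improvements
                params, score = candidate, candidate_score
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

In a real setup, each restart would be a fresh git branch and each local move one agent iteration; the hill-climbing structure is the same.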
Goodhart's Law
"When a measure becomes a target, it ceases to be a good measure." If your evaluation metric is test coverage, the agent might write trivial tests that inflate the number without catching real bugs. If your metric is Sharpe ratio, the agent might overfit to historical data.
Mitigation: Use multiple evaluation metrics. Add qualitative checks. Review the agent's output periodically rather than blindly trusting the final number.
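One cheap guard against metric gaming is a gate that requires the primary metric to improve while no secondary metric regresses. A sketch with hypothetical metric names:

```python
def should_keep(old, new, primary="coverage", tolerance=0.0):
    """Keep a change only if the primary metric improves and no
    secondary metric regresses by more than `tolerance`.
    `old` and `new` map metric name -> value (higher is better)."""
    if new[primary] <= old[primary]:
        return False
    for name, old_value in old.items():
        if name != primary and new.get(name, old_value) < old_value - tolerance:
            return False
    return True
```

Pairing coverage with something harder to game (mutation score, runtime, lint warnings) makes it costlier for the agent to inflate one number at the expense of the others.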
Tasks Without Clear Metrics
Autoresearch works best when you can reduce success to a number. It struggles with subjective tasks: "Is this blog post good?" "Is this UI intuitive?" "Is this architecture clean?" If you can't script the evaluation, you can't close the loop.
Mitigation: For subjective tasks, use an LLM-as-judge approach (have a second model evaluate the output against a rubric). It's imperfect but can work for rough filtering.
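If you try LLM-as-judge, keep the judge's reply machine-parseable, for example by asking it to end with a `SCORE: <n>` line you can extract. The prompt format here is an assumption, and `call_judge_model` is a hypothetical function you'd back with whatever model serves as the judge:

```python
import re

JUDGE_PROMPT = """Rate the following draft against the rubric (1-10).
End your reply with a single line: SCORE: <number>.

Rubric: {rubric}

Draft: {draft}"""

def extract_score(judge_reply):
    """Pull the numeric score from a reply containing 'SCORE: N'; None if absent."""
    m = re.search(r"SCORE:\s*(\d+)", judge_reply)
    return int(m.group(1)) if m else None

# Usage (call_judge_model is hypothetical):
#   reply = call_judge_model(JUDGE_PROMPT.format(rubric=rubric, draft=draft))
#   score = extract_score(reply)
```

Treat a `None` result as a discard, not a zero, so a malformed judge reply can't silently tank the loop.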
Compounding Errors
Over many iterations, small errors can compound. The agent makes a change that's slightly wrong, future iterations build on that wrong foundation, and you end up far from where you want to be.
Mitigation: Periodic human review checkpoints. Run the full test suite (not just the incremental metric) every N iterations. Compare against the original baseline, not just the previous iteration.
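The baseline comparison can be a one-function checkpoint: every few iterations, compare the recent average score against the original baseline and stop if it has drifted too far. A minimal sketch, with the window and tolerance as arbitrary assumptions:

```python
def baseline_guard(history, baseline, window=5, max_drop=0.05):
    """Return False (stop the loop) if the average of the last `window`
    scores has drifted below the original baseline by more than
    `max_drop` (fractional). `history` is oldest-first per-iteration scores."""
    if len(history) < window:
        return True  # not enough data yet to judge drift
    recent = sum(history[-window:]) / window
    return recent >= baseline * (1 - max_drop)
```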
The 55-Minute Threshold: Why This Works Now
ARK Invest's 2026 research on AI agent autonomy revealed a key data point: the best coding agents can now work reliably for 55+ minutes without human intervention. This number matters because of what it enables.
A single autoresearch iteration typically takes 2–5 minutes (experiment + evaluation). In a 55-minute autonomous window, that's 11–27 iterations per session. Run multiple sessions overnight, and you're looking at hundreds of experiment-evaluate-iterate cycles by morning.
Two years ago, agents would derail after 5–10 minutes — not enough time for the loop to produce meaningful results. The jump to 55+ minutes is what makes autoresearch practically useful, not just theoretically interesting.
Getting Started Today
Here's your minimal starting point:
- Pick a project with a clear, scriptable metric (test coverage is a great first choice)
- Create a git repo dedicated to the experiment
- Write your evaluation script — this is the hardest part and the most important
- Write a simple loop script using the patterns above
- Run it for 10 iterations while you watch, to calibrate
- Then let it run overnight on a longer loop
Start small. A 10-iteration loop on a test coverage task will teach you more about autoresearch than any amount of reading. Once you see the pattern work, you'll immediately recognize other problems in your workflow where it applies.
The tools are ready. The agents are capable enough. The pattern is proven. The only question is what you'll point it at first.
Originally published on ComputeLeap. We cover practical AI tools, tutorials, and industry analysis.