Introduction
Andrej Karpathy (OpenAI co-founder) released autoresearch — an experiment where an LLM autonomously modifies a GPT training script, runs 5-minute experiments, keeps what improves val_bpb, and discards what doesn't. The original uses Claude Code (cloud API) as the researcher.
SohniSwatantra's fork replaces Claude Code with Qwen 3.5 9B running locally via ollama. Single GPU, zero API cost, fully autonomous.
Architecture: LLM + Training on One GPU
The key innovation is running both the LLM agent and GPT training on the same GPU:
GPU (48GB VRAM)
├── Qwen 3.5 9B via ollama (~12GB)
└── GPT training via train.py (~35GB)
To fit within VRAM constraints, hyperparameters are adjusted from the original:
| Component | Original | This Fork |
|---|---|---|
| Depth | 8 layers | 4 layers |
| Device batch size | 128 | 64 |
| Total batch tokens | 524K | 65K |
| Window pattern | SSSL | L |
The model is smaller, but the agent compensates by running more experiments.
The Autonomous Research Loop
Step 1: LLM Proposes a Modification
agent.py sends the current train.py code and experiment history (results.tsv) to Qwen 3.5. The LLM proposes specific code modifications to lower val_bpb.
The prompt includes clear constraints:
- Only train.py can be modified (prepare.py is read-only)
- No new package installations
- Fixed 5-minute time budget
- ~35GB VRAM available for training
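As a rough illustration of Step 1, the sketch below assembles a prompt from the current script, the experiment history, and the constraints listed above. The function name `build_prompt`, the `CONSTRAINTS` list, and the prompt wording are assumptions made for this example; the actual prompt in agent.py may be structured differently.

```python
# Constraints paraphrased from the list above; the wording in agent.py may differ.
CONSTRAINTS = [
    "Only train.py may be modified; prepare.py is read-only.",
    "Do not install new packages.",
    "Each experiment has a fixed 5-minute time budget.",
    "About 35GB of VRAM is available for training.",
]


def build_prompt(train_code: str, history_tsv: str) -> str:
    """Assemble one prompt from the current script, past results, and constraints."""
    rules = "\n".join(f"- {c}" for c in CONSTRAINTS)
    return (
        "You are an ML researcher. Propose a modified train.py that lowers val_bpb.\n\n"
        f"Constraints:\n{rules}\n\n"
        f"Experiment history (results.tsv):\n{history_tsv}\n\n"
        f"Current train.py:\n{train_code}\n\n"
        "Reply with the full new train.py in a single Python code block."
    )


if __name__ == "__main__":
    # Hypothetical inputs, only to show the shape of the prompt.
    print(build_prompt("print('training...')", "commit\tval_bpb\tstatus"))
```

Asking for the full file in a single code block keeps the later extraction step simple, at the cost of longer responses.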
Step 2: Syntax Validation + Git Commit
The proposed code is validated with ast.parse(). If valid, train.py is overwritten and git committed.
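A minimal sketch of the syntax gate, assuming a helper named `is_valid_python` (a hypothetical name, not necessarily the one used in agent.py):

```python
import ast


def is_valid_python(source: str) -> bool:
    """Return True if the source parses as Python. This checks syntax only."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        # ValueError covers inputs such as null bytes on some Python versions.
        return False


if __name__ == "__main__":
    print(is_valid_python("x = 1\n"))          # syntactically valid
    print(is_valid_python("def f(:\n  pass"))  # syntax error
```

A check like this only rejects code that cannot be parsed. Runtime failures such as undefined names, shape mismatches, or out-of-memory errors still surface during the experiment, which is why the loop needs separate crash handling.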
Step 3: Run 5-Minute Experiment
uv run train.py executes with a 10-minute timeout (normally completes in 5 minutes).
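The run step can be sketched with `subprocess.run` and its `timeout` argument. The function names and the assumed log line format (`val_bpb: 1.2345`) are illustrative guesses; the real train.py output and the parsing in agent.py may differ.

```python
import re
import subprocess


def run_experiment(cmd, timeout_s=600):
    """Run a training command and return (status, combined log text)."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The child process is killed when the timeout expires.
        return "timeout", ""
    status = "ok" if proc.returncode == 0 else "crash"
    return status, proc.stdout + proc.stderr


def parse_val_bpb(log: str):
    """Return the last val_bpb value found in the log, or None if absent."""
    # Assumed log format: a line such as "val_bpb: 1.2345"
    matches = re.findall(r"val_bpb[:=]\s*([0-9]*\.?[0-9]+)", log)
    return float(matches[-1]) if matches else None


if __name__ == "__main__":
    # 600 seconds mirrors the 10-minute timeout described above.
    status, log = run_experiment(["uv", "run", "train.py"], timeout_s=600)
    print(status, parse_val_bpb(log))
```

Taking the last match rather than the first assumes the final reported value is the one that counts, for scripts that print val_bpb more than once.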
Step 4: Keep or Discard
- val_bpb improved → keep (branch advances)
- val_bpb same or worse → discard (git reset --hard)
- Crash → crash (log included in next prompt for LLM to fix)
A failsafe resets to baseline after 3 consecutive crashes.
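The decision rule and the crash failsafe can be expressed as two small pure functions. This is a sketch under stated assumptions: the names `decide` and `update_crash_streak` are hypothetical, and the git commands that actually advance or reset the branch are left out.

```python
MAX_CONSECUTIVE_CRASHES = 3  # failsafe threshold described above


def decide(status, val_bpb, best_bpb):
    """Map one experiment outcome to 'keep', 'discard', or 'crash'."""
    if status != "ok" or val_bpb is None:
        return "crash"
    # Lower val_bpb is better; a tie counts as no improvement.
    return "keep" if val_bpb < best_bpb else "discard"


def update_crash_streak(decision, streak):
    """Return (new_streak, reset_to_baseline) after one decision."""
    streak = streak + 1 if decision == "crash" else 0
    if streak >= MAX_CONSECUTIVE_CRASHES:
        return 0, True  # the caller should reset to the baseline commit
    return streak, False


if __name__ == "__main__":
    streak = 0
    for outcome in ["crash", "crash", "crash"]:
        streak, reset = update_crash_streak(outcome, streak)
        print(outcome, streak, reset)
```

Keeping the decision separate from the git side effects makes the rule easy to test without a repository.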
agent.py Design
The entire agent is ~250 lines in a single file:
- Ollama API calls (simple requests.post)
- Git operations (commit, reset, rev-parse)
- Experiment execution and log parsing
- Results logging to TSV
- Code block extraction from LLM responses
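For the first item in the list, a non-streaming call to Ollama's `/api/generate` endpoint might look like the sketch below. The fork uses `requests.post`; this example swaps in the standard library's `urllib.request` so that it has no third-party dependency. The helper names and the timeout value are assumptions.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(prompt, model="qwen3.5:9b"):
    """Build a POST request asking Ollama for one complete (non-streamed) reply."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_llm(prompt, model="qwen3.5:9b", timeout_s=600):
    """Send the prompt and return the generated text (needs a running Ollama server)."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout_s) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    print(ask_llm("Reply with the single word: ready"))
```

Setting `"stream": False` returns one JSON object with the full text in its `response` field, which avoids having to stitch streamed chunks together.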
The code extraction pipeline is simple: a regex finds Python code blocks in the response, ast.parse() validates the syntax, and only valid code proceeds to an experiment. The extraction step looks like this:
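    import re

    def extract_code_from_response(response):
        # Match fenced code blocks, with or without a "python" language tag
        blocks = re.findall(r"```(?:python)?\s*\n(.*?)```", response, re.DOTALL)
        if blocks:
            return max(blocks, key=len)  # Take the longest code block
        return None  # No code block found in the response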
Cost Comparison
| Setup | Cost per experiment | 100 experiments |
|---|---|---|
| Original (Claude Code API) | ~$0.05-0.20 | $5-20 |
| This fork (Nosana Pro 6000) | $0.08 | ~$8 |
| This fork (own GPU) | $0 | $0 |
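The totals in the table follow from simple multiplication, and the same arithmetic gives the number of experiments that fit in a night. The sketch below reproduces both. It assumes each experiment takes exactly the 5-minute budget and ignores the time the LLM spends generating each proposal, so real throughput will be somewhat lower.

```python
def total_cost(cost_per_experiment: float, n_experiments: int) -> float:
    """Total cost in dollars, rounded to cents."""
    return round(cost_per_experiment * n_experiments, 2)


def experiments_in(hours: float, minutes_each: float = 5) -> int:
    """Number of whole experiments that fit in the given time."""
    return int(hours * 60 // minutes_each)


if __name__ == "__main__":
    # Per-experiment costs taken from the table above.
    setups = [
        ("Claude Code API (low)", 0.05),
        ("Claude Code API (high)", 0.20),
        ("Nosana Pro 6000", 0.08),
        ("Own GPU", 0.0),
    ]
    for label, per_exp in setups:
        print(f"{label}: ${total_cost(per_exp, 100):.2f} per 100 experiments")
    print("Experiments in 8 hours:", experiments_in(8))
```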
program.md — The Research Philosophy
Karpathy's original program.md contains key design philosophies:
- "NEVER STOP" — Run indefinitely until manually stopped
- Simplicity criterion — A 0.001 improvement requiring 20 lines of hacky code? Not worth it. Code deletion with equal results? Definitely keep.
- Assume the user is sleeping — At 5 min/experiment, that's 12/hour, ~100 experiments during 8 hours of sleep
This is the essence of autoresearch: let AI do research while you sleep.
Why Local LLM Matters
This fork demonstrates that:
- A 9B-parameter model (Qwen 3.5 9B) can sustain autonomous ML research loops
- No rate limits or API costs, so the loop can run indefinitely
- Anyone with a 24GB+ GPU can automate their own research
Setup
# Install ollama and pull the model
curl -fsSL https://ollama.com/install.sh | sh
ollama serve &
ollama pull qwen3.5:9b
# Clone and setup
git clone https://github.com/SohniSwatantra/autoresearch-local-llm.git
cd autoresearch-local-llm
pip install uv && uv sync
# Run
bash run_pipeline.sh
Requires 24GB+ VRAM (48GB recommended).
Links
- Fork: SohniSwatantra/autoresearch-local-llm
- Original: karpathy/autoresearch