How I Let an AI Agent Run 100 ML Experiments Overnight on a $500 GPU
Last week I let an AI agent run 100 machine learning experiments overnight on my RTX 3070. I woke up to a 25% model improvement. Here's exactly how it works.
The Setup
The agent is built on Karpathy's autoresearch concept, powered by Claude Sonnet. It runs in a loop:
- Propose — The agent analyzes current model performance and proposes a specific code change
- Implement — It writes the actual Python code to modify the neural network
- Train — The modified model trains on PubMed medical text data
- Evaluate — Loss metrics are compared against the baseline
- Decide — If improvement > threshold, keep the change. Otherwise, revert.
- Repeat — Go back to step 1 with updated context
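Stripped of the actual LLM calls and training code, the loop above can be sketched in a few lines of Python. This is an illustrative skeleton, not the real implementation: the callback names, the baseline loss, and the 0.01 threshold are all assumptions.

```python
BASELINE_LOSS = 2.40   # hypothetical starting eval loss
THRESHOLD = 0.01       # minimum improvement required to keep a change

def run_agent_loop(propose, implement, train_and_eval, n_experiments=100):
    """Propose -> implement -> train -> evaluate -> decide -> repeat."""
    best_loss = BASELINE_LOSS
    history = []
    for _ in range(n_experiments):
        idea = propose(history)        # LLM proposes a specific code change
        implement(idea)                # apply the patch to the model code
        loss = train_and_eval()        # short training run, returns eval loss
        kept = best_loss - loss > THRESHOLD
        if kept:
            best_loss = loss           # keep the change as the new baseline
        else:
            implement(None)            # revert (e.g. via git checkout)
        history.append({"idea": idea, "loss": loss, "kept": kept})
    return best_loss, history
```

The `history` list is what gives step 1 its "updated context": the proposer sees every past experiment and its outcome.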
The Results
Out of 100 experiments:
- 93 failed — proposed changes made the model worse or had no effect
- 7 succeeded — measurable improvements that the agent kept
- Net result — a 25% improvement over the baseline, measured by evaluation loss
The 7% hit rate sounds low, but that's the point. Research is mostly failure. The agent runs experiments I'd never have time to try manually.
What the Agent Discovered
The 7 successful experiments included:
- Learning rate scheduling changes I wouldn't have tried
- A specific attention head configuration that improved convergence
- Batch size adjustments that were counterintuitive but worked
- Layer normalization placement that contradicted my assumptions
The Hardware
This runs on consumer hardware:
- GPU: NVIDIA RTX 3070 (8GB VRAM) — ~$500
- CPU: Standard desktop AMD Ryzen
- RAM: 32GB
- Storage: 1TB NVMe SSD
Total cost for the overnight run: about $0.50 in electricity, plus Claude API calls. The per-experiment cost stays near zero because training and evaluation run locally rather than on rented cloud GPUs; only the agent's reasoning goes through the API. Here's what serving a live LLM from a consumer GPU actually looks like in production.
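As a sanity check on the electricity figure, here's the back-of-envelope arithmetic. The wattage, run length, and price per kWh are assumptions, not measurements:

```python
# Rough electricity cost for one overnight run (all figures are assumptions)
gpu_watts = 250        # RTX 3070 under load, plus some system overhead
hours = 8              # overnight
price_per_kwh = 0.15   # USD; varies a lot by region

cost = gpu_watts / 1000 * hours * price_per_kwh
print(f"${cost:.2f}")  # in the same ballpark as the ~$0.50 quoted above
```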
Why This Matters
The traditional ML research loop is: human thinks of experiment → human implements it → human waits for training → human evaluates → human thinks of next experiment.
Each cycle takes hours or days of human attention. My agent does it in minutes and runs 24/7.
The Code
The agent is ~300 lines of Python orchestrating:
- Claude Sonnet for reasoning and code generation
- PyTorch for training
- A simple SQLite database tracking all experiments
- Git for version control of each experiment
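The SQLite piece is the simplest part of that stack. A minimal sketch, assuming a hypothetical schema (the idea text, the git commit SHA for the experiment, the eval loss, and whether the change was kept):

```python
import sqlite3

def init_db(path=":memory:"):
    """Create the experiment log. One row per experiment."""
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS experiments (
        id INTEGER PRIMARY KEY,
        idea TEXT,
        commit_sha TEXT,
        eval_loss REAL,
        kept INTEGER)""")
    return conn

def log_experiment(conn, idea, commit_sha, eval_loss, kept):
    conn.execute(
        "INSERT INTO experiments (idea, commit_sha, eval_loss, kept) "
        "VALUES (?, ?, ?, ?)",
        (idea, commit_sha, eval_loss, int(kept)))
    conn.commit()

def kept_experiments(conn):
    """The successes, best eval loss first."""
    return conn.execute(
        "SELECT idea, eval_loss FROM experiments "
        "WHERE kept = 1 ORDER BY eval_loss").fetchall()
```

Pairing each row with a git commit SHA is what makes "revert" cheap: a failed experiment is just a checkout of the previous commit.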
It's not magic. It's a loop with good prompts and clear evaluation criteria.
What I Learned
- Autonomy requires clear metrics — the agent needs an unambiguous way to measure success
- Failure is the feature — 93% failure rate is fine when experiments are cheap
- Consumer hardware is enough — you don't need cloud GPUs for meaningful research
- Overnight is the killer use case — run experiments while you sleep, review results over coffee
Try It Yourself
You need:
- A GPU (even a 3060 works)
- An API key for Claude or GPT
- A clear metric to optimize
- Patience to debug the loop
The hardest part isn't the code — it's defining what "better" means for your specific model.
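One concrete way to pin that down: make "better" an explicit acceptance test rather than a vibe. A hypothetical version that demands a minimum relative drop in eval loss, so noise-level wins don't get kept:

```python
def is_better(candidate_loss, baseline_loss, min_relative_gain=0.005):
    """'Better' = eval loss drops by at least 0.5% relative to baseline.

    Improvements smaller than the margin are treated as run-to-run
    noise and rejected; the 0.5% figure is an illustrative choice.
    """
    relative_gain = (baseline_loss - candidate_loss) / baseline_loss
    return relative_gain >= min_relative_gain
```

Whatever definition you choose, the agent will optimize it literally, so it has to be something a worse model genuinely can't score well on.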
Curious about commissioning something like this rather than building it yourself? Here's what custom autonomous agents actually cost in 2026.
Want me to build an autonomous agent for your workflow? Start a project →