How I Let an AI Agent Run 100 ML Experiments Overnight on a $500 GPU
Last week I let an AI agent run 100 machine learning experiments overnight on my RTX 3070. I woke up to a 25% model improvement. Here's exactly how it works.
The Setup
The agent is built on Karpathy's autoresearch concept, powered by Claude Sonnet. It runs in a loop:
- Propose — The agent analyzes current model performance and proposes a specific code change
- Implement — It writes the actual Python code to modify the neural network
- Train — The modified model trains on PubMed medical text data
- Evaluate — Loss metrics are compared against the baseline
- Decide — If improvement > threshold, keep the change. Otherwise, revert.
- Repeat — Go back to step 1 with updated context
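Stripped of the actual LLM calls and training code, the loop above can be sketched in a few lines of Python. This is an illustrative skeleton, not the real implementation: the callback names, the baseline loss, and the 0.01 threshold are all assumptions.

```python
BASELINE_LOSS = 2.40   # hypothetical starting eval loss
THRESHOLD = 0.01       # minimum improvement required to keep a change

def run_agent_loop(propose, implement, train_and_eval, n_experiments=100):
    """Propose -> implement -> train -> evaluate -> decide -> repeat."""
    best_loss = BASELINE_LOSS
    history = []
    for _ in range(n_experiments):
        idea = propose(history)        # LLM proposes a specific code change
        implement(idea)                # apply the patch to the model code
        loss = train_and_eval()        # short training run, returns eval loss
        kept = best_loss - loss > THRESHOLD
        if kept:
            best_loss = loss           # keep the change as the new baseline
        else:
            implement(None)            # revert (e.g. via git checkout)
        history.append({"idea": idea, "loss": loss, "kept": kept})
    return best_loss, history
```

The `history` list is what gives step 1 its "updated context": the proposer sees every past experiment and its outcome.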
The Results
Out of 100 experiments:
- 93 failed — proposed changes made the model worse or had no effect
- 7 succeeded — measurable improvements that the agent kept
- Net result — a 25% improvement over the baseline, measured by evaluation loss
The 7% hit rate sounds low, but that's the point. Research is mostly failure. The agent runs experiments I'd never have time to try manually.
What the Agent Discovered
The 7 successful experiments included:
- Learning rate scheduling changes I wouldn't have tried
- A specific attention head configuration that improved convergence
- Batch size adjustments that were counterintuitive but worked
- Layer normalization placement that contradicted my assumptions
The Hardware
This runs on consumer hardware:
- GPU: NVIDIA RTX 3070 (8GB VRAM) — ~$500
- CPU: Standard desktop AMD Ryzen
- RAM: 32GB
- Storage: 1TB NVMe SSD
Total cost for the overnight run: about $0.50 in electricity, plus Claude API calls. The per-experiment cost stays near zero because training and evaluation run locally rather than on rented cloud GPUs; only the agent's reasoning goes through the API. Here's what serving a live LLM from a consumer GPU actually looks like in production.
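As a sanity check on the electricity figure, here's the back-of-envelope arithmetic. The wattage, run length, and price per kWh are assumptions, not measurements:

```python
# Rough electricity cost for one overnight run (all figures are assumptions)
gpu_watts = 250        # RTX 3070 under load, plus some system overhead
hours = 8              # overnight
price_per_kwh = 0.15   # USD; varies a lot by region

cost = gpu_watts / 1000 * hours * price_per_kwh
print(f"${cost:.2f}")  # in the same ballpark as the ~$0.50 quoted above
```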
Why This Matters
The traditional ML research loop is: human thinks of experiment → human implements it → human waits for training → human evaluates → human thinks of next experiment.
Each cycle takes hours or days of human attention. My agent does it in minutes and runs 24/7.
The Code
The agent is ~300 lines of Python orchestrating:
- Claude Sonnet for reasoning and code generation
- PyTorch for training
- A simple SQLite database tracking all experiments
- Git for version control of each experiment
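The SQLite piece is the simplest part of that stack. A minimal sketch, assuming a hypothetical schema (the idea text, the git commit SHA for the experiment, the eval loss, and whether the change was kept):

```python
import sqlite3

def init_db(path=":memory:"):
    """Create the experiment log. One row per experiment."""
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS experiments (
        id INTEGER PRIMARY KEY,
        idea TEXT,
        commit_sha TEXT,
        eval_loss REAL,
        kept INTEGER)""")
    return conn

def log_experiment(conn, idea, commit_sha, eval_loss, kept):
    conn.execute(
        "INSERT INTO experiments (idea, commit_sha, eval_loss, kept) "
        "VALUES (?, ?, ?, ?)",
        (idea, commit_sha, eval_loss, int(kept)))
    conn.commit()

def kept_experiments(conn):
    """The successes, best eval loss first."""
    return conn.execute(
        "SELECT idea, eval_loss FROM experiments "
        "WHERE kept = 1 ORDER BY eval_loss").fetchall()
```

Pairing each row with a git commit SHA is what makes "revert" cheap: a failed experiment is just a checkout of the previous commit.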
It's not magic. It's a loop with good prompts and clear evaluation criteria.
What I Learned
- Autonomy requires clear metrics — the agent needs an unambiguous way to measure success
- Failure is the feature — 93% failure rate is fine when experiments are cheap
- Consumer hardware is enough — you don't need cloud GPUs for meaningful research
- Overnight is the killer use case — run experiments while you sleep, review results over coffee
Try It Yourself
You need:
- A GPU (even a 3060 works)
- An API key for Claude or GPT
- A clear metric to optimize
- Patience to debug the loop
The hardest part isn't the code — it's defining what "better" means for your specific model.
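One concrete way to pin that down: make "better" an explicit acceptance test rather than a vibe. A hypothetical version that demands a minimum relative drop in eval loss, so noise-level wins don't get kept:

```python
def is_better(candidate_loss, baseline_loss, min_relative_gain=0.005):
    """'Better' = eval loss drops by at least 0.5% relative to baseline.

    Improvements smaller than the margin are treated as run-to-run
    noise and rejected; the 0.5% figure is an illustrative choice.
    """
    relative_gain = (baseline_loss - candidate_loss) / baseline_loss
    return relative_gain >= min_relative_gain
```

Whatever definition you choose, the agent will optimize it literally, so it has to be something a worse model genuinely can't score well on.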
Curious about commissioning something like this rather than building it yourself? Here's what custom autonomous agents actually cost in 2026.
Want me to build an autonomous agent for your workflow? Start a project →