Urvil Joshi

Posted on May 27 • Originally published at Medium on Apr 13

Karpathy’s Auto Research and its application beyond ML

#ai #autoresearch #autonomousai #andrejkarpathy

What if you could hand an AI agent a single file and a scoring function, say “make this better,” and then go to sleep? You wake up to a hundred experiments completed, the best ones committed to your git history, and a better result than what you started with.

Andrej Karpathy open-sourced exactly this a while ago. It’s called Auto Research, and once you understand the pattern, you start seeing places to apply it everywhere.

Who Is Andrej Karpathy and Why Does This Matter?

If you write code for a living, you’ve probably used something Karpathy built without knowing it.

He was a co-founder of OpenAI. He led Tesla’s Autopilot team, which built the neural networks that power self-driving. He created nanoGPT, minbpe, and llm.c, three of the most influential open-source AI projects in existence. He also coined the term vibe coding, which, love it or hate it, is now part of every developer's vocabulary.

So when Karpathy open-sources something, it’s worth paying attention.

🍥 What Is Auto Research?

The story behind Auto Research is simple. Karpathy had a training script for GPT-2 that he’d been hand-optimizing for months. Tweaking hyperparameters. Trying different learning rate schedules. Adjusting batch sizes. At some point he asked himself the obvious question:

“Why am I doing this manually? Why not have an AI agent run these experiments for me?”

That question became Auto Research.

Auto Research is a closed-loop autonomous optimization system. An AI agent runs experiments in a tight loop: hypothesize, modify, evaluate, keep or revert. Then repeat. Forever, or until you tell it to stop.

Here’s the loop in plain terms:

Hypothesize. The agent reads the current state of the file we want to modify, looks at previous results, and forms a theory about what to try next.
Modify. It edits exactly one file.
Evaluate. It runs an evaluation script that returns a single score.
Keep or revert. If the score improved, git commit. If it got worse, git reset --hard. Clean slate.
Loop. Back to step 1 with the new context.
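
To make the loop concrete, here is a minimal in-memory sketch of it. This is not Karpathy's implementation: `run_loop`, `propose`, and `evaluate` are hypothetical names, the "file" is just a string, and keeping or reverting is modeled by keeping or discarding the candidate instead of calling git.

```python
from typing import Callable, List, Tuple


def run_loop(
    initial: str,
    propose: Callable[[str, List[Tuple[str, float]]], str],
    evaluate: Callable[[str], float],
    n_experiments: int,
    higher_is_better: bool = True,
) -> Tuple[str, float, List[Tuple[str, float]]]:
    """Hypothesize, modify, evaluate, keep or revert, then repeat."""
    best, best_score = initial, evaluate(initial)
    # Only kept experiments are recorded, mirroring a git history of wins.
    history: List[Tuple[str, float]] = [(initial, best_score)]
    for _ in range(n_experiments):
        candidate = propose(best, history)  # hypothesize + modify
        score = evaluate(candidate)         # evaluate: one scalar
        if higher_is_better:
            improved = score > best_score
        else:
            improved = score < best_score
        if improved:
            # Keep: this is where a real setup would `git commit`.
            best, best_score = candidate, score
            history.append((candidate, score))
        # Otherwise revert: the candidate is simply discarded.
    return best, best_score, history
```

In a real run, `propose` would be the AI agent editing the file and `evaluate` would be the locked scoring script. The `higher_is_better` flag is there because some metrics (accuracy) go up and others (latency) go down.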

There’s a detail in here that’s easy to miss but absolutely critical: fixed time budget per experiment.

Every experiment gets the same amount of compute. Why? Because otherwise the agent could cheat. If experiment A gets five minutes and experiment B gets fifty, of course B might look better: it had ten times the compute. By fixing the time budget, you force the agent to win on the quality of its ideas, not on brute force.
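
One simple way to enforce a budget like this is a hard wall-clock cap on the evaluation command. The sketch below is an assumption about how you might do it, not how the original repo does it: `run_with_budget` is a hypothetical helper, and it assumes the evaluation prints its score as the last line of stdout.

```python
import subprocess
import sys
from typing import List, Optional


def run_with_budget(cmd: List[str], budget_seconds: float) -> Optional[float]:
    """Run an evaluation command under a fixed wall-clock budget.

    Returns the score parsed from the last line of stdout, or None if the
    command timed out, failed, or printed something that is not a number.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=budget_seconds
        )
    except subprocess.TimeoutExpired:
        return None  # over budget: treat it as a failed experiment
    if result.returncode != 0:
        return None
    lines = result.stdout.strip().splitlines()
    if not lines:
        return None
    try:
        return float(lines[-1])
    except ValueError:
        return None


if __name__ == "__main__":
    # Stand-in evaluation: a child Python process that prints one number.
    print(run_with_budget([sys.executable, "-c", "print(0.75)"], budget_seconds=30))
```

A timeout only caps the budget. If you want every experiment to use exactly the same amount of compute, the script being evaluated also has to be written to run for the full window.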

And notice what’s acting as the memory: git. Your git log becomes a complete trail of every successful experiment. Every commit message says what the agent tried and what score it achieved. At the end of an overnight run, you can run git log --oneline and see the entire optimization journey.
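
If the commit messages follow a predictable shape, that journey is easy to read back programmatically. The sketch below assumes a made-up convention, "<what was tried> (score=<number>)", which is not something the original project prescribes. `parse_journey` is a hypothetical helper that takes the text output of git log --oneline.

```python
import re
from typing import List, Tuple

# Assumed (hypothetical) commit convention: "<what was tried> (score=<number>)"
SCORE_PATTERN = re.compile(
    r"^(?P<sha>[0-9a-f]+) (?P<msg>.*) \(score=(?P<score>-?\d+(?:\.\d+)?)\)$"
)


def parse_journey(oneline_log: str) -> List[Tuple[str, str, float]]:
    """Turn `git log --oneline` output into (sha, message, score), oldest first."""
    rows = []
    for line in oneline_log.strip().splitlines():
        match = SCORE_PATTERN.match(line.strip())
        if match:  # skip commits that don't follow the convention
            rows.append((match["sha"], match["msg"], float(match["score"])))
    rows.reverse()  # git log prints newest first
    return rows
```

Feeding it the captured output of git log --oneline gives you a list you can plot or scan for the biggest jumps.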

If you start it before bed, you can wake up to roughly a hundred experiments completed.

✨ The Three-File Architecture

Auto Research works because of a constraint system built around three files. Each one has a specific role, and the boundaries between them are what prevent the whole thing from collapsing into chaos.

File 1: program.md

This is the file you write. Think of it as a system prompt for the experiment loop. You define three things:

The objective. What are you optimizing? “Minimize p99 latency.” “Maximize test pass rate.” “Reduce image size.”
The constraints. What can’t the agent do? “Don’t exceed 512MB of memory.”
The protocol. How should the agent behave? “Run eval after every change.” “Commit if better, revert if worse.” “Don’t stop to ask questions.”

program.md is the job description. You're hiring an AI employee, and this is their employment contract.

File 2: train.py

This is the one and only file the agent can edit. The name comes from Karpathy’s original use case (training GPT-2), but it doesn’t have to be a Python script. It can be:

A system prompt
A SQL query
A Dockerfile
A CSS file
A config file
Literally anything you want to optimize

The single-file constraint is deliberate. By giving the agent one degree of freedom, you prevent it from making sprawling changes you can’t review. The agent has a focused surface area; you have a reviewable diff.

File 3: prepare.py

This is the most important file in the entire system, and the agent absolutely cannot touch it.

prepare.py defines what better means. It runs the evaluation, computes the metric, and outputs a single scalar number. The agent reads this score and decides whether to commit or revert.

Why is it locked? Because if the agent could edit the evaluation, it could just rewrite the scoring function to always return a perfect score. Game over. Optimization meaningless.

There’s a subtle but important corollary here: if you set the wrong metric, the agent will confidently optimize the wrong thing. It will improve the number you gave it, even if that number doesn’t measure what you actually care about. Choosing the right metric is your job. The agent handles execution. You handle direction.
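
The lock on the evaluation file can also be checked mechanically. One possible guard, sketched below, is to record a hash of the file before the run and refuse to trust any score if the hash changes. This is an illustration rather than part of Karpathy's setup, and `file_digest` and `evaluator_is_untouched` are hypothetical names.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def evaluator_is_untouched(evaluator: Path, trusted_digest: str) -> bool:
    """True if the evaluation file still matches the digest recorded before the run."""
    return file_digest(evaluator) == trusted_digest
```

Record the digest once before starting the loop, then check it before every keep-or-revert decision. If the check fails, the safest move is to stop the run rather than compare scores.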

🚨 The Misconception That’s Costing People

Most people who hear about Auto Research think it’s a machine learning thing. They see Karpathy’s GPT-2 example and assume the pattern only applies to training models.

This is wrong, and it’s the most expensive misconception you can have about this technology.

Auto Research is a pattern, not a tool. The pattern works anywhere you have three things:

One scalar metric. A single number that tells you if things got better or worse.
Automated evaluation. No human in the loop. If you need a person to look at the result and judge, it’s too slow.
One mutable file. A focused surface area for the agent to work with.

If all three conditions are met, you can Auto Research it. Here’s what that opens up:

Prompt engineering. Your file is system_prompt.txt. Your metric is accuracy on a labeled test set. The agent tries different phrasings, few-shot examples, chain-of-thought instructions, even different languages. Each experiment runs the prompt against your test data and reports accuracy.
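
As a rough sketch, the scoring side of the prompt case could look like this. `prompt_accuracy` and `ask_model` are hypothetical names: `ask_model` stands in for whatever client you use to call a model, and exact-match comparison after lowercasing is an assumption that only suits classification-style tasks.

```python
from typing import Callable, Iterable, Tuple


def prompt_accuracy(
    system_prompt: str,
    labeled_examples: Iterable[Tuple[str, str]],
    ask_model: Callable[[str, str], str],
) -> float:
    """Fraction of labeled examples the prompt gets exactly right."""
    examples = list(labeled_examples)
    if not examples:
        raise ValueError("need at least one labeled example")
    correct = sum(
        ask_model(system_prompt, text).strip().lower() == label.strip().lower()
        for text, label in examples
    )
    return correct / len(examples)
```

Because the model call is passed in, you can test the scoring logic with a fake model before spending anything on real API calls.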

API performance. Your file is the handler code. Your metric is p99 latency under load. The agent experiments with caching, connection pooling, query batching, async patterns. The eval script fires a thousand requests and measures the 99th percentile.
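
The metric itself is a few lines once you have the latencies. This sketch uses the nearest-rank definition of a percentile, which is one of several common definitions and an assumption on my part. `p99` is a hypothetical helper, and collecting the latencies with a load generator is left out.

```python
from typing import Sequence


def p99(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of a batch of request latencies."""
    n = len(latencies_ms)
    if n == 0:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # 1-based nearest rank: ceil(0.99 * n), done in integers to avoid float error.
    rank = (99 * n + 99) // 100
    return ordered[rank - 1]
```

With a thousand samples this returns the 990th slowest-sorted value, so the ten worst requests are ignored. Lower is better here, so the loop has to treat a decrease as an improvement.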

Dockerfile optimization. Your file is the Dockerfile. Your metric is build_time × image_size. The agent tries multi-stage builds, different base images, layer reordering. The eval runs docker build and measures both numbers.

SQL query tuning. Your file is query.sql. Your metric is execution time on a fixed dataset. The agent tries index hints, join strategies, CTEs vs subqueries. The eval just runs the query and reports wall clock time.
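
Here is what a timing harness for the SQL case might look like, using SQLite from the Python standard library so it runs anywhere. The `orders` table, its contents, and the helper names are all made up for illustration. Taking the median of several runs instead of a single wall-clock reading is my own substitution to damp noise.

```python
import sqlite3
import statistics
import time


def build_fixed_dataset() -> sqlite3.Connection:
    """Create the same in-memory dataset on every call (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
        [(i % 100, float(i)) for i in range(10_000)],
    )
    conn.commit()
    return conn


def time_query(conn: sqlite3.Connection, sql: str, repeats: int = 5) -> float:
    """Median wall-clock seconds to run `sql` and fetch every row."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)
```

One caution: timing alone doesn't check that the rewritten query still returns the right rows. A real evaluation should compare results against a known-good answer before it reports a time.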

The pattern doesn’t change. The loop doesn’t change. The three files don’t change. Only the contents change.

The rule is simple: if you can score it, you can Auto Research it. If you can’t score it, don’t try.

🔍 Why This Is Bigger Than It Looks

I want to zoom out for a second, because the implications of Auto Research go beyond “neat trick for optimizing code.”

Karpathy has talked about his end vision for this. Back in the early 2000s, there was a project called SETI@home, where you could donate spare computing power on your home PC to the search for extraterrestrial life. Karpathy wants to build the same thing, but for AI research. Millions of agents, distributed across thousands of computers, all running Auto Research loops on different problems.

Think about what that means. Right now, AI labs spend tens of millions of dollars on researchers whose job is essentially to be the experiment loop: propose changes, run training runs, look at results, decide what to try next. That work is now scriptable.

Karpathy’s prediction is that every frontier AI lab will eventually adopt some form of Auto Research internally. And he made the basic version open source.

The person who can set up the loop correctly will out-produce a team doing it manually.

🧰 What I Think You Should Do With This

If you’ve read this far and you’re a developer, here’s my honest take.

Try it once. Clone Karpathy’s repo. Pick the smallest possible problem in your codebase that has a clear metric: maybe a slow function, a bloated Dockerfile, or a config file with too many knobs. Set up the three files. Let an AI agent loop on it for an hour. You’ll learn more from one run than from reading ten articles like this one.

Then start noticing. Once the pattern is in your head, you’ll start seeing Auto Research opportunities everywhere. That nightly batch job that takes 40 minutes? The system prompt you’ve been hand-tuning for weeks? The query that’s too slow but you don’t have time to optimize? All of these are candidates.

Get good at picking metrics. This is the hard part and the only part the agent can’t do for you. A bad metric gives you confident, automated, beautifully-committed garbage. A good metric gives you measurable progress toward something that actually matters. Spend more time on this than on anything else.

The closing thought I keep coming back to is something Karpathy himself said: