Julien Avezou

Autonomous AI Research Does Not Need a Giant Framework

A lot of the conversation around AI agents has drifted toward increasingly complex frameworks, orchestration layers, memory systems, and multi-agent abstractions. At the same time, a quieter and more interesting movement is emerging: small, purpose-built loops that optimise for one thing well.

That is what drew me to autoresearch, a recently published open-source repo by Andrej Karpathy that has drawn a lot of attention in the AI community this week.


Why is autoresearch important?

At its core, autoresearch is a very simple idea: give an agent a real training loop, let it make small changes, run short experiments, evaluate results, and keep iterating in a self-contained setup. No huge platform or any over-engineered agent stack. Just a tight feedback loop around real model training.

That simplicity is exactly what makes it so appealing. Just a couple hundred lines of Python...

autoresearch is built around three important files:

  • prepare.py: handles one-time data preparation, tokenizer setup, and runtime utilities
  • train.py: the main training file that the agent is allowed to modify
  • program.md: the instruction file that defines how the agent should behave during the experiment. You define this.

The agent reads program.md, modifies train.py, runs training, checks metrics, and decides what to do next.
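That loop can be sketched in a few lines of Python. The helper names below (`read_instructions`, `propose_edit`, `run_training`) are stand-ins I made up, with stub bodies in place of the real LLM calls and the real training run:

```python
# Minimal sketch of an autoresearch-style loop. Helper names are
# hypothetical; in the real repo the agent edits train.py and a real
# training run produces the metric.

def read_instructions():
    # Stand-in for reading program.md.
    return "minimize val_bpb with small, isolated edits"

def propose_edit(instructions, history):
    # Stand-in for the agent proposing the next change to train.py.
    return {"param": "learning_rate", "value": 3e-4 * (1 + len(history))}

def run_training(edit):
    # Stand-in for executing train.py; returns a fake validation metric.
    return 1.65 - 0.01 * edit["value"] / 3e-4

def research_loop(n_runs=3):
    instructions = read_instructions()
    history = []
    for _ in range(n_runs):
        edit = propose_edit(instructions, history)
        val_bpb = run_training(edit)
        history.append({"edit": edit, "val_bpb": val_bpb})
    return history

history = research_loop()
best = min(history, key=lambda r: r["val_bpb"])
```

The point of the sketch is how little machinery sits between "propose a change" and "observe a metric": everything else in the repo exists to serve this loop.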

This strips autonomous research down to its essence. Instead of building a huge agent framework first, you start with a minimal loop that already does useful work.

It is a good reminder that progress in agentic systems may come less from adding more layers and more from building tighter, more disciplined loops.

My idea

I wanted to run a real autonomous research experiment locally, on my own machine, with minimal infrastructure and no cloud GPU dependency.

My constraint was simple:

  • run on my MacBook Pro
  • use CPU only
  • keep it practical
  • generate insight, not just raw performance chasing

That led me to an experiment around reflective iteration.

My experiment question was:

Does adding a structured self-reflection step help an autonomous agent make better model improvement decisions over time?

The setup compared two loops:

A: non-reflective loop
B: reflective loop

The difference was small but important.

In mode A, the agent simply proposed the next change from a fixed queue and ran it.

In mode B, the agent first reflected on the previous result in a structured way before choosing the next change.

The goal was not only to improve validation performance, but to test whether reflection improved the quality of experimental decision-making.
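To make the A/B difference concrete, here is a toy sketch assuming a queue of candidate changes and a per-run metric delta. The real agent reflects in natural language; the hand-written `reflect` rule below is purely illustrative:

```python
# Illustrative sketch of the two modes. The real agent's reflection is
# free-form text, not a scoring rule; data structures here are invented.

def next_change_a(queue, history):
    # Mode A: take the next change from a fixed queue, ignoring results.
    return queue.pop(0)

def reflect(history):
    # Mode B's structured reflection: summarise what the last run showed.
    last = history[-1]
    return {
        "improved": last["delta"] < 0,  # lower val_bpb is better
        "changed": last["change"],
    }

def next_change_b(queue, history):
    # Mode B: reflect first, then prefer changes related to what worked.
    note = reflect(history)
    if note["improved"]:
        related = [c for c in queue if c["group"] == note["changed"]["group"]]
        if related:
            queue.remove(related[0])
            return related[0]
    return queue.pop(0)

queue = [
    {"name": "warmup", "group": "schedule"},
    {"name": "lr_decay", "group": "schedule"},
    {"name": "wider_mlp", "group": "arch"},
]
history = [{"change": {"name": "lower_lr", "group": "schedule"}, "delta": -0.002}]
choice = next_change_b(list(queue), history)
```

Mode A's choice never depends on `history`; mode B's does, which is the whole hypothesis under test.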

How I set up my experiment

If you want to run this on a MacBook like I did, the easiest starting point is this fork.

The steps I followed:

  • Fork the repo to my own GitHub account and clone it locally.

  • Open the repo with my coding agent or assistant.

For this setup, I used Codex 5.3 with medium reasoning depth. It was especially useful for planning, checking dependencies, repository structure, experiment constraints, and execution.

  • Prepare the experiment.

Verify that all dependencies are installed and that the prerequisite steps, such as data preparation via prepare.py, have been run.

  • Define the experiment in program.md.

What worked well for me was not replacing the original file, but enhancing it. I created a standalone experiment definition and combined it with the base repository rules, so I could preserve important constraints from the original setup.

  • Optimise my execution environment.

After defining the experiment, I asked Codex to help me refine the setup without changing the actual experiment design.

The idea was to keep the experiment logic in program.md and use prompts only for the execution environment and run discipline.

This approach kept the repo simple and prevented setup conversations from contaminating the actual experiment logic.

  • Run the experiment loop itself.

Once everything was ready, the experiment loop was straightforward:

  1. the agent edits train.py
  2. the run executes
  3. results get logged
  4. the next run is chosen
  5. the loop repeats

What mattered most was keeping each change small and isolated so the outcomes stayed interpretable.
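A lightweight way to enforce that interpretability is to record exactly one change and one metric per run. The JSON-lines format and field names below are my own choice, not the repo's format:

```python
# Append one record per run so every outcome maps to a single change.
# File layout and field names are illustrative, not the repo's format.
import json
import os
import tempfile

def log_run(path, run_id, change, val_bpb):
    record = {"run": run_id, "change": change, "val_bpb": val_bpb}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def load_runs(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

log_path = os.path.join(tempfile.mkdtemp(), "runs.jsonl")
log_run(log_path, "B01", {"param": "lr", "value": 2e-4}, 1.621094)
log_run(log_path, "B02", {"param": "warmup", "value": 200}, 1.6235)
runs = load_runs(log_path)
```

Because each line is self-contained, the agent (or you) can re-read the whole history at any point without parsing training logs.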

Some practical tips

  • For longer experiments, I strongly recommend hardening the execution environment a bit. My practical setup for a long-running experiment was:
    • tmux to protect against terminal closure
    • caffeinate to prevent the Mac from sleeping
    • log files for crash visibility
    • nice to slightly reduce CPU priority
    • laptop plugged into power
    • heavy apps closed

This is a setup I will reuse for local long-running agent experiments going forward.
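For reference, the hardening list above can be combined into a single launch command. The session name, nice level, and log path here are illustrative choices, not something the repo prescribes:

```shell
# Run training in a detached tmux session, keep the Mac awake while it
# runs (caffeinate -i), lower CPU priority (nice), and tee output to a
# log file for crash visibility.
tmux new-session -d -s autoresearch \
  "caffeinate -i nice -n 10 python train.py 2>&1 | tee -a run.log"

# Reattach later to check progress:
# tmux attach -t autoresearch
```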

  • Explicitly tell your coding assistant that you want the full experiment to continue automatically without manual input. In my case, it set up a runner script so I could let the experiment continue while I worked on other things.

  • Also tell it to provide regular status updates. That makes the process much easier to monitor over time.

Experiment results

Here are the results from the experiment:

Baseline (lower is better):
  val_bpb = 1.622929

Comparison outcomes:
  Best A (non-reflective): 1.623505
  Best B (reflective):     1.621094

Runs to beat baseline:
  A: never beat baseline
  B: beat baseline on B01

Improvement ratio:
  A: 0/5 = 0%
  B: 3/5 = 60%

Average runtime:
  A: 501.38 s
  B: 478.14 s

Iteration quality trend:
  A: flat at 0.50 every run
  B: consistently high at 0.93-1.00, average 0.944

This was a small experiment but it showed something meaningful.

Under this CPU-constrained local setup, structured reflection improved the quality of the agent’s decisions. The reflective loop not only produced better validation outcomes, it also got there faster and with a much cleaner iteration pattern.

So the takeaway is not just that reflection helped performance.
It helped reasoning quality and efficiency. And it produced a more coherent experimental trajectory.

The best reflective run converged on a small set of parameter changes that outperformed the baseline.

Reflection is a habit I have been practicing as an engineer, and it has helped my career and personal growth. Agents seem to agree...

Why I think this matters

The most interesting thing about this experiment is not the absolute performance gain. It is the fact that a lightweight, local, interpretable agent loop was able to produce measurable differences in research behavior so quickly.

That suggests a useful direction for agentic AI work:

  • less framework overhead
  • tighter loops
  • better constraints
  • clearer reasoning
  • more emphasis on experiment quality, not just experiment quantity
  • more focus on setting up an optimised evaluation function

If that direction holds up in larger studies, it could become a more practical path for real-world autonomous research systems.

What I want to try next

The next obvious step is scale.

I want to rerun this experiment with a much larger run budget, probably around 80+ runs, to see whether the signal strengthens or weakens over longer search trajectories.

That would make the conclusion much more robust.

This was a small pilot, not a definitive benchmark. The comparison used 5 runs per mode on a single local machine, so I treat the outcome as directional evidence rather than proof. Still, the signal was clear enough to justify running a larger follow-up.


Links

If you want to explore the setup of my experiment or run your own version, here are the two key repos:

Discussion

Have you tinkered with autoresearch yet? What do you think of the approach? What would be some useful experiments to run with this setup?

Also, what does your agentic setup look like today? I am curious to know.

Top comments (4)

Aryan Choudhary

This has made me really curious about the concept of autoresearch, Julien. It sounds like such a fascinating project - using an open-source AI research framework to optimize for one thing well, rather than trying to tackle everything at once. The idea of comparing a non-reflective loop to a reflective loop is intriguing, and I'm interested in learning more about how reflection improved the agent's decision-making quality and efficiency.

Julien Avezou

Exactly, Aryan! I encourage you to give it a try yourself. I learned a lot in the process about running experiments on my local machine using agents, and I want to pursue this further to see what the possibilities are.

klement Gunndu

The prepare.py/train.py/program.md split is the part that makes this click — it separates what the agent can modify from what stays fixed. That same boundary pattern works well for any autonomous loop where you want controlled experimentation without runaway drift.

Julien Avezou

Agreed, klement. The main setup is easy to grasp, and it makes the agent's changes easy to track over time.