Julien Avezou

Autonomous AI Research Does Not Need a Giant Framework

A lot of the conversation around AI agents has drifted toward increasingly complex frameworks, orchestration layers, memory systems, and multi-agent abstractions. At the same time, a quieter and more interesting movement is emerging: small, purpose-built loops that optimise for one thing well.

That is what drew me to autoresearch, a recently published open-source repo by Andrej Karpathy that has drawn a lot of attention in the AI community this week.


Why is autoresearch important?

At its core, autoresearch is a very simple idea: give an agent a real training loop, let it make small changes, run short experiments, evaluate results, and keep iterating in a self-contained setup. No huge platform or any over-engineered agent stack. Just a tight feedback loop around real model training.

That simplicity is exactly what makes it so appealing. Just a couple hundred lines of Python...

autoresearch is built around three important files:

  • prepare.py: handles one-time data preparation, tokenizer setup, and runtime utilities
  • train.py: the main training file that the agent is allowed to modify
  • program.md: the instruction file that defines how the agent should behave during the experiment. You define this.

The agent reads program.md, modifies train.py, runs training, checks metrics, and decides what to do next.
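That loop can be sketched in a few lines of Python. The helper names below (`read_instructions`, `propose_edit`, `run_training`) are stand-ins I made up, with stub bodies in place of the real LLM calls and the real training run:

```python
# Minimal sketch of an autoresearch-style loop. Helper names are
# hypothetical; in the real repo the agent edits train.py and a real
# training run produces the metric.

def read_instructions():
    # Stand-in for reading program.md.
    return "minimize val_bpb with small, isolated edits"

def propose_edit(instructions, history):
    # Stand-in for the agent proposing the next change to train.py.
    return {"param": "learning_rate", "value": 3e-4 * (1 + len(history))}

def run_training(edit):
    # Stand-in for executing train.py; returns a fake validation metric.
    return 1.65 - 0.01 * edit["value"] / 3e-4

def research_loop(n_runs=3):
    instructions = read_instructions()
    history = []
    for _ in range(n_runs):
        edit = propose_edit(instructions, history)
        val_bpb = run_training(edit)
        history.append({"edit": edit, "val_bpb": val_bpb})
    return history

history = research_loop()
best = min(history, key=lambda r: r["val_bpb"])
```

The point of the sketch is how little machinery sits between "propose a change" and "observe a metric": everything else in the repo exists to serve this loop.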

This strips autonomous research down to its essence. Instead of building a huge agent framework first, you start with a minimal loop that already does useful work.

It is a good reminder that progress in agentic systems may come less from adding more layers and more from building tighter, more disciplined loops.

My idea

I wanted to run a real autonomous research experiment locally, on my own machine, with minimal infrastructure and no cloud GPU dependency.

My constraint was simple:

  • run on my MacBook Pro
  • use CPU only
  • keep it practical
  • generate insight, not just raw performance chasing

That led me to an experiment around reflective iteration.

My experiment question was:

Does adding a structured self-reflection step help an autonomous agent make better model improvement decisions over time?

The setup compared two loops:

A: non-reflective loop
B: reflective loop

The difference was small but important.

In mode A, the agent simply proposed the next change from a fixed queue and ran it.

In mode B, the agent first reflected on the previous result in a structured way before choosing the next change.

The goal was not only to improve validation performance, but to test whether reflection improved the quality of experimental decision-making.
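To make the A/B difference concrete, here is a toy sketch assuming a queue of candidate changes and a per-run metric delta. The real agent reflects in natural language; the hand-written `reflect` rule below is purely illustrative:

```python
# Illustrative sketch of the two modes. The real agent's reflection is
# free-form text, not a scoring rule; data structures here are invented.

def next_change_a(queue, history):
    # Mode A: take the next change from a fixed queue, ignoring results.
    return queue.pop(0)

def reflect(history):
    # Mode B's structured reflection: summarise what the last run showed.
    last = history[-1]
    return {
        "improved": last["delta"] < 0,  # lower val_bpb is better
        "changed": last["change"],
    }

def next_change_b(queue, history):
    # Mode B: reflect first, then prefer changes related to what worked.
    note = reflect(history)
    if note["improved"]:
        related = [c for c in queue if c["group"] == note["changed"]["group"]]
        if related:
            queue.remove(related[0])
            return related[0]
    return queue.pop(0)

queue = [
    {"name": "warmup", "group": "schedule"},
    {"name": "lr_decay", "group": "schedule"},
    {"name": "wider_mlp", "group": "arch"},
]
history = [{"change": {"name": "lower_lr", "group": "schedule"}, "delta": -0.002}]
choice = next_change_b(list(queue), history)
```

Mode A's choice never depends on `history`; mode B's does, which is the whole hypothesis under test.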

How I set up my experiment

If you want to run this on a MacBook like I did, the easiest starting point is this fork.

The steps I followed:

  • Fork the repo to my own GitHub account and clone it locally.

  • Open the repo with my coding agent or assistant.

For this setup, I used Codex 5.3 with medium reasoning depth. It was especially useful for planning, checking dependencies, repository structure, experiment constraints, and execution.

  • Prepare the experiment.

Verify that all dependencies are installed and that the prerequisite steps, such as data preparation via prepare.py, have been run.

  • Define the experiment in program.md.

What worked well for me was not replacing the original file, but enhancing it. I created a standalone experiment definition and combined it with the base repository rules, so I could preserve important constraints from the original setup.

  • Optimise my execution environment.

After defining the experiment, I asked Codex to help me refine the setup without changing the actual experiment design.

The idea was to keep the experiment logic in program.md and use prompts only for the execution environment and run discipline.

This approach kept the repo simple and prevented setup conversations from contaminating the actual experiment logic.

  • Run the experiment loop itself.

Once everything was ready, the experiment loop was straightforward:

  1. the agent edits train.py
  2. the run executes
  3. results get logged
  4. the next run is chosen
  5. the loop repeats

What mattered most was keeping each change small and isolated so the outcomes stayed interpretable.
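A lightweight way to enforce that interpretability is to record exactly one change and one metric per run. The JSON-lines format and field names below are my own choice, not the repo's format:

```python
# Append one record per run so every outcome maps to a single change.
# File layout and field names are illustrative, not the repo's format.
import json
import os
import tempfile

def log_run(path, run_id, change, val_bpb):
    record = {"run": run_id, "change": change, "val_bpb": val_bpb}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def load_runs(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

log_path = os.path.join(tempfile.mkdtemp(), "runs.jsonl")
log_run(log_path, "B01", {"param": "lr", "value": 2e-4}, 1.621094)
log_run(log_path, "B02", {"param": "warmup", "value": 200}, 1.6235)
runs = load_runs(log_path)
```

Because each line is self-contained, the agent (or you) can re-read the whole history at any point without parsing training logs.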

Some practical tips

  • For longer experiments, I strongly recommend hardening the execution environment a bit. My practical setup for a long-running experiment was:
    • tmux to protect against terminal closure
    • caffeinate to prevent the Mac from sleeping
    • log files for crash visibility
    • nice to slightly reduce CPU priority
    • laptop plugged into power
    • heavy apps closed

This is a setup I will reuse for local long-running agent experiments going forward.
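For reference, the hardening list above can be combined into a single launch command. The session name, nice level, and log path here are illustrative choices, not something the repo prescribes:

```shell
# Run training in a detached tmux session, keep the Mac awake while it
# runs (caffeinate -i), lower CPU priority (nice), and tee output to a
# log file for crash visibility.
tmux new-session -d -s autoresearch \
  "caffeinate -i nice -n 10 python train.py 2>&1 | tee -a run.log"

# Reattach later to check progress:
# tmux attach -t autoresearch
```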

  • Explicitly tell your coding assistant that you want the full experiment to continue automatically without manual input. In my case, it set up a runner script so I could let the experiment continue while I worked on other things.

  • Also tell it to provide regular status updates. That makes the process much easier to monitor over time.

Experiment results

Here are the results from the experiment:

Baseline (lower is better):
  val_bpb = 1.622929

Comparison outcomes:
  Best A (non-reflective): 1.623505
  Best B (reflective):     1.621094

Runs to beat baseline:
  A: never beat baseline
  B: beat baseline on B01

Improvement ratio:
  A: 0/5 = 0%
  B: 3/5 = 60%

Average runtime:
  A: 501.38 s
  B: 478.14 s

Iteration quality trend:
  A: flat at 0.50 every run
  B: consistently high at 0.93-1.00, average 0.944

This was a small experiment but it showed something meaningful.

Under this CPU-constrained local setup, structured reflection improved the quality of the agent’s decisions. The reflective loop not only produced better validation outcomes, it also got there faster and with a much cleaner iteration pattern.

So the takeaway is not just that reflection helped performance.
It helped reasoning quality and efficiency. And it produced a more coherent experimental trajectory.

The best reflective run converged on a small set of parameter changes that outperformed the baseline.

Reflection is a habit I have been practicing as an engineer, and it has helped my career and personal growth. Agents seem to agree...

Why I think this matters

The most interesting thing about this experiment is not the absolute performance gain. It is the fact that a lightweight, local, interpretable agent loop was able to produce measurable differences in research behavior so quickly.

That suggests a useful direction for agentic AI work:

  • less framework overhead
  • tighter loops
  • better constraints
  • clearer reasoning
  • more emphasis on experiment quality, not just experiment quantity
  • more focus on setting up an optimised evaluation function

If that direction holds up in larger studies, it could become a more practical path for real-world autonomous research systems.

What I want to try next

The next obvious step is scale.

I want to rerun this experiment with a much larger run budget, probably around 80+ runs, to see whether the signal strengthens or weakens over longer search trajectories.

That would make the conclusion much more robust.

This was a small pilot, not a definitive benchmark. The comparison used 5 runs per mode on a single local machine, so I treat the outcome as directional evidence rather than proof. Still, the signal was clear enough to justify running a larger follow-up.


Links

If you want to explore the setup of my experiment or run your own version, here are the two key repos:

Discussion

Have you tinkered with autoresearch yet? What do you think of the approach? What would be some useful experiments to run with this setup?

Also, what does your agentic setup look like today? I am curious to know.

Top comments (4)

Aryan Choudhary

This has made me really curious about the concept of autoresearch, Julien. It sounds like such a fascinating project - using an open-source AI research framework to optimize for one thing well, rather than trying to tackle everything at once. The idea of comparing a non-reflective loop to a reflective loop is intriguing, and I'm interested in learning more about how reflection improved the agent's decision-making quality and efficiency.

Julien Avezou

Exactly, Aryan! I encourage you to give it a try yourself. I learned a lot in the process about running experiments on my local machine using agents, and I want to pursue this further to see what the possibilities are.

klement Gunndu

The prepare.py/train.py/program.md split is the part that makes this click — it separates what the agent can modify from what stays fixed. That same boundary pattern works well for any autonomous loop where you want controlled experimentation without runaway drift.

Julien Avezou

Agreed, klement. The main setup is easy to grasp, and it makes the agent's changes easy to track over time.