DEV Community


Autonomous AI Research Does Not Need a Giant Framework

Julien Avezou on March 12, 2026

A lot of the conversation around AI agents has drifted toward increasingly complex frameworks, orchestration layers, memory systems, and multi-agen...
Aryan Choudhary

This has made me really curious about the concept of autoresearch, Julien. It sounds like such a fascinating project - using an open-source AI research framework to optimize for one thing well, rather than trying to tackle everything at once. The idea of comparing a non-reflective loop to a reflective loop is intriguing, and I'm interested in learning more about how reflection improved the agent's decision-making quality and efficiency.

Julien Avezou

Exactly, Aryan! I encourage you to try it yourself. I learned a lot in the process about running agent experiments on my local machine, and I want to pursue this further to see what's possible.

Apex Stack

The "tight loop" philosophy here maps perfectly to what I've seen building agent pipelines for large-scale content sites. I manage a multilingual programmatic SEO site with 100k+ pages, and the biggest productivity gains came not from adding more agent capabilities, but from constraining each agent's scope to a single, well-defined task with a clear evaluation function.

Your reflection structure — previous change, outcome, likely reason, confidence, next best step — is essentially what we ended up building into our content generation pipeline. Before each batch of pages, the agent reviews quality metrics from the previous batch and adjusts its approach. The improvement in output quality was immediate and measurable, similar to your 60% vs 0% improvement ratio.
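The "review the previous batch before generating the next" step can be sketched in a few lines. This is only an illustration of the idea, not Apex Stack's actual pipeline; the names `QualityMetrics` and `adjust_plan`, and the specific thresholds, are all made up for the example.

```python
# Illustrative sketch: adjust the next batch based on the previous batch's metrics.
from dataclasses import dataclass

@dataclass
class QualityMetrics:
    avg_quality: float   # 0.0-1.0 evaluation score for the previous batch
    failure_rate: float  # fraction of pages that failed validation

def adjust_plan(metrics: QualityMetrics, batch_size: int) -> int:
    """Shrink the next batch when quality dips, grow it when quality holds."""
    if metrics.failure_rate > 0.1 or metrics.avg_quality < 0.8:
        return max(10, batch_size // 2)   # slow down and inspect
    return min(500, batch_size * 2)       # quality is stable, scale up

print(adjust_plan(QualityMetrics(0.95, 0.02), 100))  # -> 200
print(adjust_plan(QualityMetrics(0.70, 0.15), 100))  # -> 50
```

The point is the clear evaluation function: the agent's scope is one decision (next batch size), driven by one measurable signal.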

The prepare.py/train.py/program.md separation that klement mentioned is key. We use a similar pattern: a fixed data schema the agent can't touch, a generation template it can modify, and an instruction set that defines boundaries. That boundary discipline is what prevents the agent from drifting into changes that look reasonable in isolation but break the system at scale.

Curious about your next steps with the 80+ run budget — I'd expect the reflective advantage to compound over longer trajectories since the agent builds a richer decision history to reason from.

Julien Avezou

Great to hear about your experience, and thanks for validating my experiment.
I'll try to run a longer experiment over the next few weeks and see what results I get.

Benjamin Nguyen

cool!

Julien Avezou

Thanks Benjamin!

Benjamin Nguyen

no problem

Benjamin Nguyen

Let's connect on LinkedIn if you want?

Julien Avezou

For sure! Would love to connect: Julien Avezou

Benjamin Nguyen

nice

klement Gunndu

The prepare.py/train.py/program.md split is the part that makes this click — it separates what the agent can modify from what stays fixed. That same boundary pattern works well for any autonomous loop where you want controlled experimentation without runaway drift.
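That boundary can be enforced mechanically rather than just by prompt. A minimal sketch, assuming `train.py` is the file the agent may rewrite while `prepare.py` and `program.md` stay fixed (the post doesn't spell out the exact permissions, so treat the lists as illustrative):

```python
# Illustrative gate: the loop only applies agent edits to whitelisted files,
# and every write is confined to the sandbox root.
from pathlib import Path

MODIFIABLE = {"train.py"}                 # the agent may rewrite this
FIXED = {"prepare.py", "program.md"}      # data pipeline and instructions stay put

def apply_agent_edit(path: str, new_source: str, root: Path) -> bool:
    """Apply the agent's proposed edit only if the target is on the
    modifiable list; otherwise reject it."""
    name = Path(path).name                # strip any directory components
    if name not in MODIFIABLE:
        return False                      # reject edits to fixed files
    (root / name).write_text(new_source)
    return True
```

A dozen lines like this are what keep "controlled experimentation" controlled: the agent can propose anything, but only changes inside the boundary ever land.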

Julien Avezou

Agreed, klement. The main setup is easy to grasp, and it makes the agent's changes easy to track over time.

Harsh

This is the kind of experiment I wish more people were running. The reflective vs. non-reflective loop comparison is elegant: simple enough to interpret, yet the results (60% improvement ratio, better runtime) are genuinely interesting.

One question: how did you handle the reflection prompt in mode B? Did you give the agent a specific structure (like "what worked, what didn't, what to try next") or was it freeform? The consistency of the iteration quality scores (0.93-1.00) suggests the reflection was well-structured.

Also, love the practical tips - tmux + caffeinate + nice is exactly the kind of 'boring infrastructure' that makes real research possible. 🙌

Julien Avezou

Thanks Harsh!

For mode B, I defined a short reflection block in program.md that the agent had to produce before each new run. The structure was:

  • Previous change
  • Outcome
  • Likely reason
  • Confidence
  • Next best step

That structure was intentional. I wanted reflection to stay concise, analytical, and engineering-oriented rather than turning into vague chain-of-thought style narration.
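One way to keep that discipline is to validate the block before each run, so a sloppy reflection stops the loop instead of silently degrading it. A minimal sketch, assuming a simple `field: value` format (the parsing code and the example values are illustrative, not the actual program.md mechanism):

```python
# Illustrative validator for the five-field reflection block.
REQUIRED_FIELDS = [
    "previous change", "outcome", "likely reason", "confidence", "next best step",
]

def parse_reflection(text: str) -> dict:
    """Parse 'field: value' lines and fail loudly if any field is missing."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(":")
        fields[key.strip().lower()] = value.strip()
    missing = [f for f in REQUIRED_FIELDS if f not in fields]
    if missing:
        raise ValueError(f"reflection is missing: {missing}")
    return fields

example = """\
Previous change: lowered learning rate
Outcome: validation loss improved
Likely reason: less oscillation late in training
Confidence: high
Next best step: try a longer warmup
"""
print(parse_reflection(example)["confidence"])  # -> high
```

Forcing a fixed schema like this is also what makes the reflections comparable across runs, which matters once you start measuring iteration quality.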

Subscription Index

Great deep dive!

Julien Avezou

Thanks!