DEV Community


Autonomous AI Research Does Not Need a Giant Framework

Julien Avezou on March 12, 2026

A lot of the conversation around AI agents has drifted toward increasingly complex frameworks, orchestration layers, memory systems, and multi-agen...
Aryan Choudhary

This has made me really curious about the concept of autoresearch, Julien. It sounds like such a fascinating project - using an open-source AI research framework to optimize for one thing well, rather than trying to tackle everything at once. The idea of comparing a non-reflective loop to a reflective loop is intriguing, and I'm interested in learning more about how reflection improved the agent's decision-making quality and efficiency.

Julien Avezou

Exactly, Aryan! I encourage you to try it yourself. I learned a lot in the process about running agent experiments on my local machine, and I want to pursue this further to see what's possible.

Apex Stack

The "tight loop" philosophy here maps perfectly to what I've seen building agent pipelines for large-scale content sites. I manage a multilingual programmatic SEO site with 100k+ pages, and the biggest productivity gains came not from adding more agent capabilities, but from constraining each agent's scope to a single, well-defined task with a clear evaluation function.

Your reflection structure — previous change, outcome, likely reason, confidence, next best step — is essentially what we ended up building into our content generation pipeline. Before each batch of pages, the agent reviews quality metrics from the previous batch and adjusts its approach. The improvement in output quality was immediate and measurable, similar to your 60% vs 0% improvement ratio.
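The "review the previous batch before generating the next" step can be sketched in a few lines. This is only an illustration of the idea, not Apex Stack's actual pipeline; the names `QualityMetrics` and `adjust_plan`, and the specific thresholds, are all made up for the example.

```python
# Illustrative sketch: adjust the next batch based on the previous batch's metrics.
from dataclasses import dataclass

@dataclass
class QualityMetrics:
    avg_quality: float   # 0.0-1.0 evaluation score for the previous batch
    failure_rate: float  # fraction of pages that failed validation

def adjust_plan(metrics: QualityMetrics, batch_size: int) -> int:
    """Shrink the next batch when quality dips, grow it when quality holds."""
    if metrics.failure_rate > 0.1 or metrics.avg_quality < 0.8:
        return max(10, batch_size // 2)   # slow down and inspect
    return min(500, batch_size * 2)       # quality is stable, scale up

print(adjust_plan(QualityMetrics(0.95, 0.02), 100))  # -> 200
print(adjust_plan(QualityMetrics(0.70, 0.15), 100))  # -> 50
```

The point is the clear evaluation function: the agent's scope is one decision (next batch size), driven by one measurable signal.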

The prepare.py/train.py/program.md separation that klement mentioned is key. We use a similar pattern: a fixed data schema the agent can't touch, a generation template it can modify, and an instruction set that defines boundaries. That boundary discipline is what prevents the agent from drifting into changes that look reasonable in isolation but break the system at scale.

Curious about your next steps with the 80+ run budget — I'd expect the reflective advantage to compound over longer trajectories since the agent builds a richer decision history to reason from.

Julien Avezou

Great to hear about your experience, and thanks for validating my experiment.
I'll try to run a longer experiment over the next few weeks and see what results I get.

Benjamin Nguyen

cool!

Julien Avezou

Thanks Benjamin!

Benjamin Nguyen

no problem

Benjamin Nguyen

Let's connect on LinkedIn if you want?

Julien Avezou

For sure! Would love to connect: Julien Avezou

Benjamin Nguyen

nice

klement Gunndu

The prepare.py/train.py/program.md split is the part that makes this click — it separates what the agent can modify from what stays fixed. That same boundary pattern works well for any autonomous loop where you want controlled experimentation without runaway drift.
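That boundary can be enforced mechanically rather than just by prompt. A minimal sketch, assuming `train.py` is the file the agent may rewrite while `prepare.py` and `program.md` stay fixed (the post doesn't spell out the exact permissions, so treat the lists as illustrative):

```python
# Illustrative gate: the loop only applies agent edits to whitelisted files,
# and every write is confined to the sandbox root.
from pathlib import Path

MODIFIABLE = {"train.py"}                 # the agent may rewrite this
FIXED = {"prepare.py", "program.md"}      # data pipeline and instructions stay put

def apply_agent_edit(path: str, new_source: str, root: Path) -> bool:
    """Apply the agent's proposed edit only if the target is on the
    modifiable list; otherwise reject it."""
    name = Path(path).name                # strip any directory components
    if name not in MODIFIABLE:
        return False                      # reject edits to fixed files
    (root / name).write_text(new_source)
    return True
```

A dozen lines like this are what keep "controlled experimentation" controlled: the agent can propose anything, but only changes inside the boundary ever land.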

Julien Avezou

Agreed, klement. The main setup is easy to grasp, and it makes the agent's changes easy to track over time.

Harsh

This is the kind of experiment I wish more people were running. The reflective vs. non-reflective loop comparison is elegant: simple enough to interpret, yet the results (60% improvement ratio, better runtime) are genuinely interesting.

One question: how did you handle the reflection prompt in mode B? Did you give the agent a specific structure (like "what worked, what didn't, what to try next") or was it freeform? The consistency of the iteration quality scores (0.93-1.00) suggests the reflection was well-structured.

Also, love the practical tips - tmux + caffeinate + nice is exactly the kind of 'boring infrastructure' that makes real research possible. 🙌

Julien Avezou

Thanks Harsh!

For mode B, I defined a short reflection block in program.md that the agent had to produce before each new run. The structure was:

  • Previous change
  • Outcome
  • Likely reason
  • Confidence
  • Next best step

That structure was intentional. I wanted reflection to stay concise, analytical, and engineering-oriented rather than turning into vague chain-of-thought style narration.
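One way to keep that discipline is to validate the block before each run, so a sloppy reflection stops the loop instead of silently degrading it. A minimal sketch, assuming a simple `field: value` format (the parsing code and the example values are illustrative, not the actual program.md mechanism):

```python
# Illustrative validator for the five-field reflection block.
REQUIRED_FIELDS = [
    "previous change", "outcome", "likely reason", "confidence", "next best step",
]

def parse_reflection(text: str) -> dict:
    """Parse 'field: value' lines and fail loudly if any field is missing."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(":")
        fields[key.strip().lower()] = value.strip()
    missing = [f for f in REQUIRED_FIELDS if f not in fields]
    if missing:
        raise ValueError(f"reflection is missing: {missing}")
    return fields

example = """\
Previous change: lowered learning rate
Outcome: validation loss improved
Likely reason: less oscillation late in training
Confidence: high
Next best step: try a longer warmup
"""
print(parse_reflection(example)["confidence"])  # -> high
```

Forcing a fixed schema like this is also what makes the reflections comparable across runs, which matters once you start measuring iteration quality.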

Subscription Index

Great deep dive!

Julien Avezou

Thanks!