A lot of the conversation around AI agents has drifted toward increasingly complex frameworks, orchestration layers, memory systems, and multi-agen...
This has made me really curious about the concept of autoresearch, Julien. It sounds like such a fascinating project - using an open-source AI research framework to optimize for one thing well, rather than trying to tackle everything at once. The idea of comparing a non-reflective loop to a reflective loop is intriguing, and I'm interested in learning more about how reflection improved the agent's decision-making quality and efficiency.
Exactly, Aryan! I encourage you to give it a try yourself. I learned a lot in the process about how to run experiments on my local machine using agents. I want to pursue this further to see what the possibilities are.
The "tight loop" philosophy here maps perfectly to what I've seen building agent pipelines for large-scale content sites. I manage a multilingual programmatic SEO site with 100k+ pages, and the biggest productivity gains came not from adding more agent capabilities, but from constraining each agent's scope to a single, well-defined task with a clear evaluation function.
Your reflection structure — previous change, outcome, likely reason, confidence, next best step — is essentially what we ended up building into our content generation pipeline. Before each batch of pages, the agent reviews quality metrics from the previous batch and adjusts its approach. The improvement in output quality was immediate and measurable, similar to your 60% vs 0% improvement ratio.
The prepare.py/train.py/program.md separation that klement mentioned is key. We use a similar pattern: a fixed data schema the agent can't touch, a generation template it can modify, and an instruction set that defines boundaries. That boundary discipline is what prevents the agent from drifting into changes that look reasonable in isolation but break the system at scale.
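That boundary can be made explicit in code. Below is a minimal sketch of a file-access guard for an agent loop; the `check_edit` helper and the exact file sets are illustrative (only the prepare.py/train.py/program.md names come from the post), not an actual implementation from either pipeline:

```python
# Sketch: enforce which files an autonomous agent may modify.
# The prepare.py/train.py/program.md names follow the split described
# in the post; the guard itself is a hypothetical illustration.
from pathlib import Path

EDITABLE = {"train.py", "program.md"}  # the agent may change these
FROZEN = {"prepare.py"}                # fixed data schema: hands off

def check_edit(path: str) -> bool:
    """Return True if the agent is allowed to modify this file."""
    name = Path(path).name
    if name in FROZEN:
        return False
    return name in EDITABLE
```

Gating every proposed edit through a check like this is what keeps "reasonable in isolation" changes from drifting past the system's boundaries.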
Curious about your next steps with the 80+ run budget — I'd expect the reflective advantage to compound over longer trajectories since the agent builds a richer decision history to reason from.
Great to hear about your experience, and thanks for validating my experiment.
I will try to run a longer experiment over the next few weeks to see what results I get.
cool!
Thanks Benjamin!
no problem
Let's connect on LinkedIn if you want?
For sure! Would love to connect: Julien Avezou
nice
The prepare.py/train.py/program.md split is the part that makes this click — it separates what the agent can modify from what stays fixed. That same boundary pattern works well for any autonomous loop where you want controlled experimentation without runaway drift.
Agreed, klement. The main setup is easy to grasp, and it makes the agent's changes easy to track over time.
This is the kind of experiment I wish more people were running. The reflective vs non-reflective loop comparison is elegant: simple enough to interpret, but the results (60% improvement ratio, better runtime) are genuinely interesting.
One question: how did you handle the reflection prompt in mode B? Did you give the agent a specific structure (like 'what worked, what didn't, what to try next') or was it freeform? The consistency of the iteration quality scores (0.93-1.00) suggests the reflection was well-structured.
Also, love the practical tips - tmux + caffeinate + nice is exactly the kind of 'boring infrastructure' that makes real research possible. 🙌
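For anyone wanting to reproduce that setup, a launcher along these lines is a plausible sketch (assumes macOS, since caffeinate is macOS-specific; the session name and script name are placeholders, not from the post):

```shell
# Start a detachable tmux session so the run survives a closed terminal.
# caffeinate -i keeps the machine from idle-sleeping during the run;
# nice -n 10 lowers the job's CPU priority so the machine stays usable.
tmux new-session -d -s autoresearch \
  'caffeinate -i nice -n 10 python train.py'

# Reattach later to check progress:
tmux attach -t autoresearch
```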
Thanks Harsh!
For mode B, I defined a short reflection block in program.md that the agent had to produce before each new run. The structure was: previous change, outcome, likely reason, confidence, and next best step. That structure was intentional. I wanted reflection to stay concise, analytical, and engineering-oriented rather than turning into vague chain-of-thought style narration.
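For illustration, a reflection block with those fields (previous change, outcome, likely reason, confidence, next best step, as described earlier in the thread) might look like the sketch below; the example values are invented, and this is not the original template:

```markdown
## Reflection (before next run)
- Previous change: lowered learning rate from 3e-3 to 1e-3
- Outcome: validation loss improved; runtime roughly unchanged
- Likely reason: smaller steps stabilized late training
- Confidence: medium
- Next best step: try a learning-rate schedule with the same peak
```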
Great deep dive!
Thanks!