It's been over a year since I published my view on the state of AI/LLMs, and how it could be significantly improved, in my post "Failure Is Not An Option For AI". While scrolling through the rStar2-Agent technical report, I couldn't help but mumble to myself: "I told you so!".
Consider how quickly the model improved during training, especially given that we are comparing a 14B-parameter model to the 671B-parameter DeepSeek-R1.
This doesn't just show how vast the universe of optimization is; it demonstrates the fundamental principle:
➡️ The cost of inaction often far exceeds the cost of a reversible mistake.
When comparing the traditional Chain-of-Thought (CoT) approach to this new "Chain of Action", the most striking difference is how early the environmental feedback starts providing value to the training process. To appreciate what was achieved, let's look at the technical side. Using this approach effectively during training required a high-throughput execution environment. The one used for this project was capable of handling 45,000 concurrent tool calls, returning feedback in just 0.3s on average.
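To make the shape of such an environment concrete, here is a minimal asyncio sketch of a batched, isolated code-execution service. It only illustrates the idea: the names, the timeout, and the `run_in_sandbox()` placeholder are my assumptions, not the actual rStar2-Agent infrastructure.

```python
import asyncio

# Hypothetical sketch of a batched, isolated code-execution service; NOT the
# actual rStar2-Agent infrastructure. Names, the timeout, and run_in_sandbox()
# are illustrative assumptions.

MAX_CONCURRENT_CALLS = 45_000   # concurrency ceiling mentioned in the report
CALL_TIMEOUT_S = 10             # assumed per-call safety timeout

async def run_in_sandbox(code: str) -> str:
    """Placeholder for an isolated interpreter (e.g. a containerized worker)."""
    await asyncio.sleep(0.3)    # stand-in for the ~0.3s average latency
    return "ok"

async def execute_tool_call(code: str, sem: asyncio.Semaphore) -> str:
    """Run one model-issued tool call, bounding concurrency and wall time."""
    async with sem:
        try:
            return await asyncio.wait_for(run_in_sandbox(code), CALL_TIMEOUT_S)
        except asyncio.TimeoutError:
            # A timeout comes back as feedback instead of crashing the rollout,
            # so the model still gets a signal it can learn from.
            return "error: execution timed out"

async def execute_batch(snippets: list[str]) -> list[str]:
    """Fan out tool calls collected from many concurrent rollouts."""
    sem = asyncio.Semaphore(MAX_CONCURRENT_CALLS)
    return await asyncio.gather(*(execute_tool_call(s, sem) for s in snippets))

if __name__ == "__main__":
    print(asyncio.run(execute_batch(["print(1 + 1)"] * 4)))
```

The key property is that failures and timeouts are returned as feedback rather than thrown away, which is exactly what the training loop needs.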
Compounding is a fundamental principle of investing. Becoming the richest person in the world is possible by investing small sums and compounding them well enough for long enough. But compounding cuts both ways when applied to chain of thought: subtle errors made early in a reasoning process compound into a long, inefficient, and ultimately incorrect reasoning trajectory.
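A quick back-of-the-envelope calculation shows how harshly this punishes long chains. The numbers below are illustrative figures of my own, not results from the report:

```python
# If each reasoning step is independently correct with probability p,
# the chance that an n-step chain contains no error at all is p**n.
for p in (0.99, 0.95, 0.90):
    for n in (10, 30, 100):
        print(f"p={p:.2f}, n={n:>3}: P(all steps correct) = {p**n:.3f}")
```

Even a 95%-reliable step leaves you with roughly a one-in-five chance of a flawless 30-step chain, which is why catching errors early matters so much.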
Anyone who has sat through lengthy corporate meetings or a drawn-out group decision-making process understands this intuitively. The first practical attempt provides more knowledge than a month of theoretical planning. Feedback, even when negative, is crucial and must be delivered swiftly. I cannot stress this enough.
The emergent ability of the model trained with this novel approach is particularly striking: it learned to react productively to negative feedback. Researchers observed the model using specific "forking" and "reflection" tokens. It was effectively talking to itself: course-correcting, pausing to analyze an error, and exploring alternative approaches.
This suggests a universal success formula, for humans and AI alike:
➡️ Form a hypothesis, take action, observe feedback, and repeat.
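Here is a minimal sketch of that loop in code, assuming a generic `llm` object with `propose()` and `reflect()` methods and a sandboxed `run()` helper; none of these names are taken from the rStar2-Agent codebase.

```python
from dataclasses import dataclass

@dataclass
class ExecResult:
    ok: bool
    output: str = ""
    error: str = ""

def solve(task: str, llm, run, max_rounds: int = 8) -> str | None:
    """Hypothesis -> action -> feedback -> repeat; llm and run are assumed interfaces."""
    feedback = ""
    for _ in range(max_rounds):
        # 1. Form a hypothesis: the model drafts a plan and a code action.
        plan, code = llm.propose(task, feedback)
        # 2. Take action: execute the code in an isolated environment.
        result: ExecResult = run(code)
        # 3. Observe feedback: success ends the loop; errors feed the next round.
        if result.ok:
            return result.output
        # 4. Repeat: reflect on the error before the next attempt, mirroring
        #    the "reflection" behaviour the researchers observed.
        feedback = llm.reflect(plan, code, result.error)
    return None
```

The important design choice is that an execution error does not end the attempt; it becomes the input to the next round.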
The best part of this story is that the rStar2-Agent codebase has been released under an MIT license on GitHub.
[1] rStar2-Agent: Agentic Reasoning Technical Report
[2] rStar2-Agent on GitHub