The ReAct (Reasoning and Acting) framework has become the de facto standard for building LLM agents in recent years.
ReAct: Synergizing Reasoning and Acting in Language Models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao
Since the paper was published in 2022, it has been abstracted by libraries like LangChain and is increasingly used as a black box.
This is a summary I put together as part of my LangChain studies.
Rough Summary of the Paper
Abstract
Proposes synergizing reasoning and acting. The ReAct framework has the LLM generate reasoning traces and task-specific actions in an interleaved manner; the abstract summarizes how interaction with external environments (such as Wikipedia) suppressed reasoning errors and reduced hallucinations.
1. Introduction
Identifies the separation of "Reasoning" (CoT, etc.) and "Acting" (WebShop, etc.) in existing language models as problematic.
- Deficiencies in Reasoning
- Factual errors due to lack of external knowledge, and static reasoning processes
- Deficiencies in Acting
- Wandering due to lack of abstract goals
By integrating these, the paper describes the motivation for introducing a loop that resembles human cognitive processes: "formulating action plans through reasoning" and "correcting reasoning based on action results".
2. ReAct: Synergizing Reasoning and Acting in Language Models
- Environment definition
  - A Markov Decision Process (MDP)-like environment consisting of state `s ∈ S`, action `a ∈ A`, and observation `o ∈ O`
- Input sequence (Trajectory) construction
  - Defines the context `c(t)` at time `t` as `c(t) = (x, r(1), a(1), o(1), ..., r(t-1), a(t-1), o(t-1))`
- Output structure
  - Models a probability distribution `p(r(t), a(t) | c(t))` that takes `c(t)` as input and generates a thought `r(t)` or an action `a(t)` (a small code sketch of this trajectory structure follows)
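To make the formulation concrete, here is a minimal sketch of how the trajectory context `c(t)` could be held as a plain data structure. This is my own illustration rather than code from the paper; the names `Step`, `Trajectory`, and `as_context` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    thought: str      # r(t)
    action: str       # a(t)
    observation: str  # o(t)


@dataclass
class Trajectory:
    instruction: str                                  # x
    steps: list[Step] = field(default_factory=list)

    def as_context(self) -> str:
        """Flatten (x, r(1), a(1), o(1), ...) into the text the LLM conditions on."""
        lines = [self.instruction]
        for s in self.steps:
            lines += [f"Thought: {s.thought}",
                      f"Action: {s.action}",
                      f"Observation: {s.observation}"]
        return "\n".join(lines)
```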
3. Knowledge-Intensive Reasoning Tasks (HotpotQA, FEVER)
Evaluation experiments on knowledge-intensive tasks.
- Method
- Uses Wikipedia API as the external environment and defines three actions: Search, Lookup, Finish
- Results
- Compared to CoT alone (reasoning only), factual accuracy improved; compared to acting alone, information exploration was confirmed to be more efficient
- Analysis
- Qualitatively evaluated that ReAct has the ability to self-correct hallucinations
4. Decision Making Tasks (ALFWorld, WebShop)
Evaluation experiments on decision-making and manipulation tasks.
- Environment
- Physical room manipulation (ALFWorld) and online shopping (WebShop)
- Findings
- Demonstrated that by interposing language-based "Thought," tasks can be completed without losing sight of long-term goals even under sparse reward settings
5. Related Work
- Chain-of-Thought (CoT)
- Differences from prior research focused solely on reasoning
- Modular Deep RL
- Advantages of flexibility through few-shot learning of LLMs over reinforcement learning approaches
6. Conclusion
Concludes that ReAct is effective for both knowledge-intensive and decision-making tasks. As future work, it proposes addressing context-length limitations and the dependence on few-shot prompting through model fine-tuning.
Content I Organized Out of Interest
Problems Under Sparse Reward Settings
Problems and Challenges
A sparse reward setting refers to a condition in reinforcement learning or decision-making tasks where the agent receives a positive reward (feedback) only at the final stage, when the goal is fully achieved, while rewards during the intermediate process remain at zero or a constant value.
Under these conditions, the following phenomena occur, making it difficult for simple action models (acting only) to complete tasks.
- Delayed evaluation
- It's not immediately clear whether the results of actions are heading in the "right direction"
- Exploration difficulty
- Since no useful signals are obtained until accidentally reaching the goal, the efficiency of trial and error (exploration) becomes extremely low
ReAct's Solution
The paper shows that these difficulties can be overcome by having the LLM's "Thought" play the following roles.
- Self-generation of intermediate goals
  - In processes where no final reward is obtained, the LLM generates logical intermediate steps within the context, such as "I should check XX next"
- Substitution for an internal reward function
  - Even when no physical reward is obtained, by linguistically evaluating whether the Observation aligns with its own Thought (plan), the LLM forms a de facto self-feedback loop
What Observation Specifically Is
Conclusion
In short, it is the physical data processing that synchronizes and maps outputs from the external environment into the working memory called the context. Concretely, it refers to the series of steps that integrate the value returned by the external environment into the input sequence (`c(t+1)`) for the agent's (LLM's) next reasoning cycle.
In More Detail
- Data reception and serialization (Input)
  - After external tool execution completes, the system synchronously receives its return value
  - Data flow
    - Structured data (JSON, binary, etc.) is transferred from the external process (API, DB, interpreter, etc.) to the system side (the agent control program)
  - Memory structure
    - The received data is held in a temporary buffer and serialized into UTF-8 text that the LLM can interpret
- Context token appending (State Transition)
  - Physically concatenates the serialized text to the end of the existing token sequence (Trajectory) held in the LLM's context window
  - State transition
    - Transition from state `S(t) = (x, r(1), a(1), o(1), ..., r(t), a(t))` to state `S(t') = (x, r(1), a(1), o(1), ..., r(t), a(t), o(t))` (x: instruction, r: Thought, a: Action, o: Observation)
  - Formatting
    - A token identifier defined by the system (e.g., `\nObservation: `) is added as a prefix so the LLM can recognize the boundary of "external input"
- Context window integrity control
  - Logic intervenes to calculate the token length after concatenation and keep it from exceeding the LLM's maximum context window
  - Internal mechanism
    - Token counting
      - Calculates the token count of the entire sequence including the serialized `o(t)`
    - Truncation/summarization (when necessary)
      - When exceeding the limit is anticipated, part of the preceding `o(1...t-1)` or `r(1...t-1)` is deleted or compressed to secure the memory space needed for reasoning at the current time `t`
- Trigger to the reasoning phase (Output)
  - Sends the updated context `S(t')` as the payload to the LLM's reasoning endpoint
  - Data flow
    - The entire updated sequence is transferred from the control program to the LLM engine
  - Purpose
    - Presents the result `o(t)` of the immediately preceding `a(t)` to the LLM as established fact (grounding) and induces generation of the next step `r(t+1)` (a code sketch of this whole flow follows below)
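As a rough sketch of the processing above, the whole Observation step boils down to: serialize, prefix with the boundary marker, append, then guard the context window. This is my own assumption of one possible implementation, not the paper's code or any library's API; the word-based budget is a crude stand-in for real token counting.

```python
import json


def integrate_observation(context: str, raw_result, max_words: int = 3000) -> str:
    """Fold a tool's return value into the context as o(t)."""
    # 1. Serialization: turn structured data into text the LLM can interpret
    text = raw_result if isinstance(raw_result, str) else json.dumps(raw_result, ensure_ascii=False)

    # 2. Formatting: prefix with the boundary marker for "external input"
    new_context = context + "\nObservation: " + text

    # 3. Context-window integrity control: drop the oldest lines (keeping the
    #    instruction at the top) if the budget would be exceeded
    lines = new_context.splitlines()
    while sum(len(line.split()) for line in lines) > max_words and len(lines) > 2:
        lines.pop(1)
    return "\n".join(lines)


# Example: appending a JSON result returned by a hypothetical search tool
ctx = "Question: Who designed the Apple Remote?\nThought: I should search for the Apple Remote.\nAction: Search[Apple Remote]"
print(integrate_observation(ctx, {"summary": "The Apple Remote is a remote control device ..."}))
```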
CoT vs ReAct
Chain-of-Thought (CoT) and ReAct are different frameworks in LLM reasoning processes, and their main difference lies in "interaction with external environments (closed system or open system)."
From an engineering perspective, ReAct can be interpreted as extending CoT by defining a control loop that incorporates external I/O (Observation).
Chain-of-Thought (CoT)
- Definition
- A method where LLMs generate logical intermediate steps to reach an answer using only their own internal parameters (learned knowledge)
- Data flow
- A linear process that generates a thought process for input and outputs a final answer
- System boundary
- Closed system. Does not receive inputs (Observations) from external environments, and reasoning is internally self-contained within the LLM
ReAct
- Definition
- A framework that integrates "reasoning (Thought)" like CoT with "action (Action)" that operates external tools and its result "observation (Observation)"
- Data flow
- A dynamic process that repeats the Thought → Action → Observation cycle
- System boundary
- Open system. Incorporates dynamic feedback from external environments into context and modifies next-step reasoning based on it
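The structural difference can be expressed in a few lines of code. This is a schematic sketch of my own; `llm` and `run_tool` are hypothetical stand-ins for a model call and a tool executor, not real library APIs.

```python
def chain_of_thought(llm, question: str) -> str:
    # Closed system: a single pass of internal reasoning, no external I/O
    return llm(question + "\nLet's think step by step.")


def react(llm, run_tool, question: str, max_steps: int = 8) -> str:
    # Open system: a control loop that folds external Observations back into the context
    context = question
    for _ in range(max_steps):
        generated = llm(context + "\nThought:")            # yields Thought + Action text
        context += "\nThought:" + generated
        action = generated.split("Action:")[-1].strip()
        if action.startswith("Finish["):
            return action[len("Finish["):-1]               # terminate with the final answer
        context += "\nObservation: " + run_tool(action)    # modifies the next reasoning step
    return "(no answer within the step budget)"
```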
ReAct Flow Summary Based on the Above
1. Initial State (t=0)
- Input
  - User command (Query: `x`) and a few-shot prompt (`p`) that defines ReAct's operating rules
- State (`c(0)`)
  - `c(0) = (p, x)` (see the sketch below)
- Memory structure
  - A static instruction set is placed at the beginning of the LLM's context window
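A small sketch of how `c(0) = (p, x)` might be assembled. The exemplar text is an abbreviated placeholder of my own, not the actual few-shot prompt used in the paper.

```python
# p: few-shot prompt defining ReAct's operating rules (placeholder exemplars)
FEW_SHOT_PROMPT = """Answer the question by interleaving Thought, Action, and Observation steps.
Question: <example question>
Thought: <example reasoning>
Action: Search[<example query>]
Observation: <example result>
Action: Finish[<example answer>]
"""


def initial_context(query: str) -> str:
    # c(0) = (p, x): the static instruction set sits at the start of the context window
    return FEW_SHOT_PROMPT + "\nQuestion: " + query + "\n"


print(initial_context("Who designed the Apple Remote?"))
```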
2. Reasoning Cycle (t=n)
Step A: Thought Generation
- Process
  - `p(r(t) | c(t-1))`
- Processing
  - The LLM references the current context `c(t-1)` and verbalizes the validity of the next action as `r(t)`
- Internal mechanism
  - Does not interfere with the external environment; formulates intermediate goals in the LLM's internal logical space
Step B: Action Generation
- Process
  - `p(a(t) | c(t-1), r(t))`
- Processing
  - The LLM outputs an identifier and arguments (`a(t)`) to invoke an external tool, following a specific format (e.g., `search[query]`); in practice Steps A and B are often produced by a single generation, as sketched below
- System output
  - The generated `a(t)` is sent from the LLM engine to the agent control program
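A common implementation pattern (my assumption, not something the paper mandates) is to obtain `r(t)` and `a(t)` in one completion call and stop generation before the model starts writing the Observation itself. `complete` is a hypothetical stand-in for whatever completion API is used.

```python
def generate_thought_and_action(complete, context: str) -> tuple[str, str]:
    """One call yields r(t) and a(t); stop before the model invents its own o(t)."""
    generated = complete(prompt=context + "\nThought:", stop=["\nObservation:"])
    # Split the generated text back into the Thought and Action parts
    thought, _, action = generated.partition("Action:")
    return thought.strip(), action.strip()
```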
Step C: Execution
- Processing
  - The control program parses `a(t)` and drives the corresponding external tool (API, DB, etc.); a parsing sketch follows below
- Note
  - This phase occurs outside the LLM, and the LLM's reasoning is suspended while it runs
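A minimal sketch of this execution step, again my own illustration: the controller parses an action string such as `Search[Apple Remote]` and dispatches it to a registered tool. The dummy tools are placeholders.

```python
import re


def execute_action(action: str, tools: dict) -> str:
    # Parse the "Name[argument]" format emitted by the LLM
    match = re.match(r"(\w+)\[(.*)\]$", action.strip())
    if not match:
        return f"Invalid action format: {action}"
    name, arg = match.groups()
    tool = tools.get(name)
    if tool is None:
        return f"Unknown tool: {name}"
    return tool(arg)  # runs outside the LLM; the model is suspended meanwhile


# Example wiring with dummy tools
tools = {
    "Search": lambda q: f"(top Wikipedia paragraph for '{q}')",
    "Lookup": lambda k: f"(next sentence containing '{k}')",
}
print(execute_action("Search[Apple Remote]", tools))
```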
Step D: Observation Integration
- Input
  - Return value (raw data) from the external tool
- Processing
  - The system converts the data to text, adds an identifier, and makes it `o(t)`
- State transition
  - `c(t) = (c(t-1), r(t), a(t), o(t))`
- Memory update
  - Concatenates `o(t)` to the end of the context and determines the next request payload to the LLM
3. Termination Judgment and Final Output
- Condition
  - When the LLM recognizes "task completion" during reasoning, it outputs a specific action (e.g., `Finish[answer]`)
- Output
  - Returns the final answer (`y`) to the user and terminates the process; the whole loop is sketched below
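Putting Steps A-D and the termination check together, one possible shape of the whole loop is sketched below. It is deliberately simplified, and everything in it (the `llm` stand-in, the `Name[argument]` action format, the step cap) is assumed for illustration rather than taken from the paper's implementation.

```python
import re


def react_loop(llm, tools: dict, query: str, max_steps: int = 8) -> str:
    context = query  # c(0); a real agent would prepend the few-shot prompt p
    for _ in range(max_steps):
        # Steps A + B: sample Thought r(t) and Action a(t) from the LLM
        generated = llm(context + "\nThought:")
        context += "\nThought:" + generated
        action = generated.split("Action:")[-1].strip()

        # Termination judgment: Finish[answer] ends the episode
        finished = re.match(r"Finish\[(.*)\]$", action)
        if finished:
            return finished.group(1)  # final answer y

        # Step C: execute a(t) outside the LLM
        name, _, arg = action.partition("[")
        tool = tools.get(name, lambda x: f"Unknown tool: {name}")
        result = tool(arg.rstrip("]"))

        # Step D: integrate o(t) back into the context
        context += "\nObservation: " + str(result)
    return "(no answer within the step budget)"
```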
Important Properties in the Sequence
- Synchrony
  - `r(t+1)` is not generated until `o(t)` has been concatenated to the context
- Accumulation
  - Since all `r`, `a`, `o` from every cycle accumulate in the context, computational cost (token count) grows in proportion to the number of steps
- Recursive correction
  - When the content of `o(t)` diverges from the prediction made in `r(t)`, plan reconstruction (Dynamic Replanning) occurs in `r(t+1)` of the next cycle