Demystifying ReAct: Why Reasoning and Acting is the Standard for LLM Agents

The ReAct (Reasoning and Acting) framework has become the de facto standard for building LLM agents in recent years.

ReAct: Synergizing Reasoning and Acting in Language Models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao

Since the paper was published in 2022, the framework has been abstracted away by libraries such as LangChain and is increasingly used as a black box.

This is a summary I put together as part of my LangChain studies.

Rough Summary of the Paper

Abstract

Proposes the synergy of reasoning and acting. Summarizes how the ReAct framework, in which an LLM alternates between generating thought traces and task-specific actions, suppresses reasoning errors and reduces hallucinations through interaction with external environments (such as Wikipedia).

1. Introduction

Identifies the separation of "Reasoning" (e.g., CoT) and "Acting" (e.g., WebShop) in prior work on language models as problematic.

  • Deficiencies in Reasoning
    • Factual errors due to lack of external knowledge, and static reasoning processes
  • Deficiencies in Acting
    • Wandering due to lack of abstract goals

By integrating these, the paper describes the motivation for introducing a loop that resembles human cognitive processes:
"Formulating action plans through reasoning"
and
"Correcting reasoning based on action results"

2. ReAct: Synergizing Reasoning and Acting in Language Models

  • Environment definition
    • A Markov Decision Process (MDP)-like environment consisting of state s ∈ S, action a ∈ A, and observation o ∈ O
  • Input sequence (Trajectory) construction

    • Defines context c(t) at time t as:
    c(t) = (x, r(1), a(1), o(1), ..., r(t-1), a(t-1), o(t-1))
    
  • Output structure

    • Models a probability distribution p(r(t), a(t) | c(t)) that takes c(t) as input and generates a thought r(t) or an action a(t) (a rough sketch of this structure follows below)
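
As a minimal sketch (my own, not from the paper), the trajectory can be held as a plain data structure and flattened into prompt text; `Step` and `build_context` are hypothetical names, and the joint distribution p(r(t), a(t) | c(t)) is what the LLM samples from when conditioned on that text.

```python
from dataclasses import dataclass

# One cycle in the trajectory: thought r(t), action a(t), observation o(t).
@dataclass
class Step:
    thought: str      # r(t)
    action: str       # a(t)
    observation: str  # o(t)

def build_context(x: str, steps: list[Step]) -> str:
    """Assemble c(t) = (x, r(1), a(1), o(1), ..., r(t-1), a(t-1), o(t-1)) as prompt text."""
    lines = [f"Question: {x}"]
    for s in steps:
        lines += [f"Thought: {s.thought}", f"Action: {s.action}", f"Observation: {s.observation}"]
    return "\n".join(lines)

# The LLM then samples (r(t), a(t)) ~ p(r(t), a(t) | c(t)) conditioned on this text,
# typically as two generations: first the thought, then the action.
```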

3. Knowledge-Intensive Reasoning Tasks (HotpotQA, FEVER)

Evaluation experiments on knowledge-intensive tasks.

  • Method
    • Uses the Wikipedia API as the external environment and defines three actions: Search, Lookup, Finish (see the sketch after this list)
  • Results
    • Compared to CoT alone (reasoning only), accuracy of factual information improved. Compared to Action alone, confirmed more efficient information exploration
  • Analysis
    • Qualitatively evaluated that ReAct has the ability to self-correct hallucinations
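
The three actions can be pictured as a tiny tool interface. The stubs below only illustrate the semantics described in the paper (search[entity], lookup[keyword], finish[answer]); the exact signatures and the `ACTIONS` table are my own assumptions, and a real agent would back `search`/`lookup` with the Wikipedia API.

```python
def search(entity: str) -> str:
    """search[entity]: return the opening sentences of the entity's Wikipedia page,
    or suggest similar titles if the exact page does not exist."""
    raise NotImplementedError  # would call the Wikipedia API in a real agent

def lookup(keyword: str) -> str:
    """lookup[keyword]: return the next sentence on the current page containing the keyword."""
    raise NotImplementedError  # would scan the page fetched by the last search

def finish(answer: str) -> str:
    """finish[answer]: end the episode with the final answer."""
    return answer

ACTIONS = {"search": search, "lookup": lookup, "finish": finish}
```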

4. Decision Making Tasks (ALFWorld, WebShop)

Evaluation experiments on decision-making and manipulation tasks.

  • Environment
    • Physical room manipulation (ALFWorld) and online shopping (WebShop)
  • Findings
    • Demonstrated that by interposing language-based "Thought," tasks can be completed without losing sight of long-term goals even under sparse reward settings

5. Related Work

  • Chain-of-Thought (CoT)
    • Differences from prior research focused solely on reasoning
  • Modular Deep RL
    • The flexibility of LLM few-shot learning is presented as an advantage over reinforcement-learning approaches

6. Conclusion

Concludes that ReAct is effective for both knowledge-intensive tasks and decision-making tasks. As future work, it points to addressing context-length limitations and reducing few-shot dependency through model fine-tuning.

Content I Organized Out of Interest

Problems Under Sparse Reward Settings

Problems and Challenges

A sparse reward setting refers to a situation in reinforcement learning or decision-making tasks where the agent receives a positive reward (feedback) only at the final stage, when the goal is fully achieved; throughout the intermediate process the reward stays at zero or some constant value. A toy example follows the list below.

Under these conditions, the following phenomena occur, making it difficult for simple action models (acting only) to complete tasks.

  • Delayed evaluation
    • It's not immediately clear whether the results of actions are heading in the "right direction"
  • Exploration difficulty
    • Since no useful signals are obtained until accidentally reaching the goal, the efficiency of trial and error (exploration) becomes extremely low
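
As a toy illustration (not from the paper), a sparse reward function gives the agent nothing to go on until the goal is fully reached:

```python
def sparse_reward(state: str, goal: str) -> float:
    """Toy sparse reward: positive feedback only when the goal is fully achieved."""
    return 1.0 if state == goal else 0.0

# A long episode therefore yields 0, 0, ..., 0 until (and unless) the goal is hit,
# so an acting-only agent gets no signal to tell "right direction" from wandering.
```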

ReAct's Solution

Shows that these problems can be overcome when the LLM's "Thought" plays the following roles.

  • Self-generation of intermediate goals

    • In processes where final rewards are not obtained, the LLM generates logical intermediate steps within the context, such as "I should check XX next"
  • Substitution for internal reward function

    • Even when physical rewards are not obtained, by linguistically evaluating whether Observations align with its own Thought (plan), it forms a de facto self-feedback loop

What Observation Specifically Is

Conclusion

In short, it is "the data processing that synchronizes external outputs and maps them into the working memory called the context."

Refers to the series of data processing steps that integrate outputs returned from the external environment into the input sequence (c(t+1)) for the agent's (LLM's) next reasoning cycle, as sketched in code after the step-by-step breakdown below.

In More Detail

  1. Data reception and serialization (Input)
    • After external tool execution completes, synchronously receives its return value
      • Data flow
        • Structured data (JSON, Binary, etc.) is transferred from external processes (API, DB, Interpreter, etc.) to the system side (agent control program)
        • Memory structure
          • Received data is held in a temporary buffer and serialized into UTF-8 format text data that the LLM can interpret
  2. Context token appending (State Transition)
    • Physically concatenates serialized text to the end of the existing token sequence (Trajectory) held within the LLM's context window
      • State transition
        • Transition from state S(t) = (x, r(1), a(1), o(1), ..., r(t), a(t)) to state S(t') = (x, r(1), a(1), o(1), ..., r(t), a(t), o(t)). (x: instruction, r: Thought, a: Action, o: Observation)
      • Formatting
        • A token identifier defined by the system (e.g., \nObservation: ) is added as a prefix so the LLM can recognize it as a boundary for "external input"
  3. Context window integrity control
    • Logic intervenes to calculate the token length after concatenation and keep it from exceeding the LLM's maximum context window
      • Internal mechanism
        • Token counting
          • Calculates the token count of the entire sequence including serialized o(t)
        • Truncation/summarization (when necessary)
          • When exceeding the limit is anticipated, deletes or compresses part of preceding o(1...t-1) or r(1...t-1) to secure memory space needed for reasoning at current time t
  4. Trigger to reasoning phase (Output)
    • Sends (requests) updated context S(t') as payload to the LLM's reasoning endpoint
      • Data flow
        • The entire updated sequence is transferred from the control program to the LLM engine
      • Purpose
        • Presents the result o(t) for the immediately preceding a(t) to the LLM as established fact (grounding) and induces generation of the next step r(t+1)
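
A rough sketch of steps 1–4, assuming hypothetical `count_tokens` and `call_llm` callables (the truncation strategy here is deliberately crude; real agents may summarize instead of dropping lines):

```python
import json

MAX_CONTEXT_TOKENS = 8192  # assumed limit; depends on the model

def integrate_observation(context: str, raw_result: object,
                          count_tokens, call_llm) -> str:
    # 1. Data reception and serialization: turn the tool's return value into text.
    o_t = raw_result if isinstance(raw_result, str) else json.dumps(raw_result, ensure_ascii=False)

    # 2. Context token appending: concatenate with the system-defined prefix
    #    so the LLM recognizes the boundary of external input.
    context = context + "\nObservation: " + o_t

    # 3. Context window integrity control: keep the instruction x at the head and
    #    drop the oldest trajectory lines while the limit would be exceeded.
    while count_tokens(context) > MAX_CONTEXT_TOKENS and context.count("\n") > 1:
        head, _, rest = context.partition("\n")   # preserve the instruction
        _, _, rest = rest.partition("\n")         # drop the oldest r/a/o line
        context = head + "\n" + rest

    # 4. Trigger to reasoning phase: send the updated context S(t') to the LLM,
    #    which generates the next Thought r(t+1) grounded in o(t).
    return context + "\n" + call_llm(context)
```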

CoT vs ReAct

Chain-of-Thought (CoT) and ReAct are different frameworks for LLM reasoning; the main difference between them is whether they interact with an external environment (closed system vs. open system).

From an engineering perspective, ReAct can be interpreted as extending CoT with a control loop that incorporates external I/O (Observation); the contrast is sketched in code after the two definitions below.

Chain-of-Thought (CoT)

  • Definition
    • A method where LLMs generate logical intermediate steps to reach an answer using only their own internal parameters (learned knowledge)
  • Data flow
    • A linear process that generates a thought process for input and outputs a final answer
  • System boundary
    • Closed system. Does not receive inputs (Observations) from external environments, and reasoning is internally self-contained within the LLM

ReAct

  • Definition
    • A framework that integrates "reasoning (Thought)" like CoT with "action (Action)" that operates external tools and its result "observation (Observation)"
  • Data flow
    • A dynamic process that repeats the Thought → Action → Observation cycle
  • System boundary
    • Open system. Incorporates dynamic feedback from external environments into context and modifies next-step reasoning based on it
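
The boundary difference shows up directly in control flow: CoT is a single closed call, while ReAct wraps the model in a loop that feeds tool output back into the context. A schematic sketch, where `llm` and `run_tool` are assumed callables (`llm` returns plain text in the CoT case and a (thought, action) pair in the ReAct case):

```python
def chain_of_thought(llm, question: str) -> str:
    # Closed system: a single generation; reasoning stays inside the model.
    return llm(f"{question}\nLet's think step by step.")

def react(llm, run_tool, question: str, max_steps: int = 8) -> str:
    # Open system: Thought -> Action -> Observation, repeated until Finish.
    context = question
    for _ in range(max_steps):
        thought, action = llm(context)           # generate r(t) and a(t)
        if action.startswith("finish["):
            return action[len("finish["):-1]     # final answer
        observation = run_tool(action)           # external I/O
        context += f"\nThought: {thought}\nAction: {action}\nObservation: {observation}"
    return "No Finish action within the step budget."
```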

ReAct Flow Summary Based on the Above

1. Initial State (t=0)

  • Input
    • User command (Query: x) and few-shot prompt (p) that defines ReAct's operating rules.
  • State (c(0))

    c(0) = (p, x)
    
  • Memory structure

    • A static instruction set is placed at the beginning of the LLM's context window (see the sketch below)
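
A sketch of how c(0) might be assembled; the exemplar text in `FEW_SHOT_PROMPT` is abbreviated and purely illustrative:

```python
FEW_SHOT_PROMPT = """Solve the task by interleaving Thought, Action, and Observation steps.
Available actions: search[entity], lookup[keyword], finish[answer].

Question: <example question>
Thought: <example reasoning>
Action: search[<example entity>]
Observation: <example result>
...
Action: finish[<example answer>]
"""  # p: static operating rules plus few-shot exemplars

def initial_context(query: str) -> str:
    """c(0) = (p, x): the static instruction set first, then the user query."""
    return FEW_SHOT_PROMPT + "\nQuestion: " + query
```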

2. Reasoning Cycle (t=n)

Step A: Thought Generation

  • Process
    • p(r(t) | c(t-1))
  • Processing
    • The LLM references the current context c(t-1) and verbalizes the validity of the next action as r(t)
  • Internal mechanism
    • Does not interfere with external environments and formulates intermediate goals in the LLM's internal logical space

Step B: Action Generation

  • Process
    • p(a(t) | c(t-1), r(t))
  • Processing
    • The LLM outputs an identifier and arguments (a(t)) to invoke external tools according to a specific format (e.g., search[query]).
  • System output
    • Generated a(t) is sent from the LLM engine to the agent control program

Step C: Execution

  • Processing
    • The control program parses a(t) and drives the corresponding external tool (API, DB, etc.).
  • Note
    • This phase occurs outside the LLM, and LLM reasoning is in a suspended state

Step D: Observation Integration

  • Input
    • Return value (Raw Data) from external tools.
  • Processing
    • The system converts data to text, adds an identifier, and makes it o(t).
  • State transition

    c(t) = (c(t-1), r(t), a(t), o(t))
    
  • Memory update

    • Concatenates o(t) to the end of the context and determines the next request payload to the LLM

3. Termination Judgment and Final Output

  • Condition
    • When the LLM recognizes "task completion" during reasoning, it outputs a specific action (e.g., Finish[answer])
  • Output
    • Returns the final answer (y) to the user and terminates the process (the full loop is sketched below)
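
Putting Steps A–D and the termination check together in one loop; `generate_thought`, `generate_action`, and `execute_tool` are assumed interfaces rather than any specific library's API:

```python
def react_episode(generate_thought, generate_action, execute_tool,
                  query: str, prompt: str, max_steps: int = 10) -> str:
    context = prompt + "\nQuestion: " + query            # c(0) = (p, x)

    for _ in range(max_steps):
        # Step A: Thought generation, p(r(t) | c(t-1)) -- no external interaction.
        thought = generate_thought(context)

        # Step B: Action generation, p(a(t) | c(t-1), r(t)), e.g. "search[query]".
        action = generate_action(context + "\nThought: " + thought)

        # Termination judgment: a Finish action carries the final answer y.
        if action.startswith("finish["):
            return action[len("finish["):-1]

        # Step C: Execution -- happens outside the LLM; reasoning is suspended.
        observation = execute_tool(action)

        # Step D: Observation integration -- c(t) = (c(t-1), r(t), a(t), o(t)).
        context += (f"\nThought: {thought}"
                    f"\nAction: {action}"
                    f"\nObservation: {observation}")

    return "Step budget exhausted without a Finish action."
```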

Important Properties in the Sequence

  • Synchrony
    • r(t+1) is not generated until o(t) is concatenated to the context.
  • Accumulation
    • Since all r, a, o from each cycle accumulate in the context, computational cost (token count) increases proportionally with the number of steps.
  • Recursive correction
    • When the content of o(t) diverges from predictions in r(t), plan reconstruction (Dynamic Replanning) occurs in r(t+1) of the next cycle
