DEV Community

Gabriel Anhaia
ReAct, Plan-and-Execute, or Reflection? The Three Agent Patterns Every Engineer Needs in 2026


You have an agent framework decision and three weeks to ship. The docs for LangGraph, LlamaIndex Agents, and OpenAI's Agents SDK all promise the same thing. None of them tell you the actual architectural choice you are making.

The architectural choice is older than any of those frameworks. It is one of three patterns that were written down in academic papers between 2022 and 2023, and every production agent you will ship in 2026 is a specialization of one of them.

The three are ReAct, Plan-and-Execute, and Reflection. Picking the wrong one costs you latency, cost, or reliability in predictable shapes. This post walks each pattern at the level you need to ship: definition, when to pick it, a Python skeleton, a concrete failure mode, and how the span tree looks under the OpenTelemetry GenAI semantic conventions.

ReAct: think, act, observe, repeat

ReAct was introduced in the 2022 paper ReAct: Synergizing Reasoning and Acting in Language Models by Yao et al. The loop is simple. The model produces a thought, picks a tool, sees the tool output, and decides the next thought. The loop terminates when the model emits a final answer.

Pick ReAct when the task is interactive and the next step genuinely depends on what the previous step returned. Customer support bots that look up orders, then decide whether to refund. Coding agents that read a file, grep for a symbol, then edit. Anything where the search tree is small and exploration is cheap.

The minimum skeleton in Python, using the OpenAI SDK and a couple of stub tools:

import json

from openai import OpenAI

client = OpenAI()
TOOLS = [
    {"type": "function", "function": {
        "name": "search_orders",
        "parameters": {"type": "object", "properties": {
            "query": {"type": "string"}}}}},
    {"type": "function", "function": {
        "name": "refund_order",
        "parameters": {"type": "object", "properties": {
            "order_id": {"type": "string"}}}}},
]

def dispatch(name: str, arguments: str) -> str:
    # Stub: route the call to your real tool implementations.
    args = json.loads(arguments)  # tool arguments arrive as a JSON string
    return f"called {name} with {args}"

def react(user_msg: str, max_steps: int = 8) -> str:
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model="gpt-4o-mini", messages=messages, tools=TOOLS,
        )
        msg = resp.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:
            return msg.content
        for call in msg.tool_calls:
            result = dispatch(call.function.name, call.function.arguments)
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": result})
    raise RuntimeError("step budget exhausted")

The max_steps bound is the one line that separates a working agent from a $47K bill. If you forget it, the loop can run until your budget or your patience breaks. In November 2025, four LangChain agents ran for 11 days on that exact mistake and billed $47,000 before anyone noticed.

Failure mode: the verifier stall. ReAct agents are happy to call a verify_result tool twice, then three times, then twenty. The model has no memory of past failed verifications beyond the context window, and it will try the same tool again with a slightly reworded argument. Guard with a per-tool call cap, not only a global step cap.
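A per-tool cap is a few lines layered on top of the global step cap. A minimal sketch; `ToolCallBudget` and `per_tool_max` are names of my own invention, not a framework API:

```python
from collections import Counter

class ToolCallBudget:
    """Per-tool call cap, checked before each dispatch in the loop."""

    def __init__(self, per_tool_max: int = 3):
        self.per_tool_max = per_tool_max
        self.counts = Counter()

    def check(self, tool_name: str) -> None:
        self.counts[tool_name] += 1
        if self.counts[tool_name] > self.per_tool_max:
            raise RuntimeError(
                f"tool {tool_name!r} called more than "
                f"{self.per_tool_max} times; likely a verifier stall")
```

Create one budget per request and call `budget.check(call.function.name)` just before `dispatch` in the ReAct loop; the exception becomes your stall signal instead of a silent burn.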

Real-world example. SWE-agent (Princeton, 2024) is a ReAct-style loop that reads files, runs shell commands, and edits code to fix real GitHub issues. The interaction is fundamentally ReAct because the next edit depends on the compiler error from the previous one.

Under OTel GenAI semconv, a ReAct run emits a single invoke_agent parent span with alternating chat and execute_tool children. The signal that tells you ReAct is healthy is the child count: three to six children for a well-scoped task, token counts climbing monotonically with each chat, and a clean finish_reasons=["stop"] on the last span.

invoke_agent [agent=support, gen_ai.agent.id=run-abc]
├── chat gpt-4o-mini [input=310, output=64]
├── execute_tool search_orders [tool.call.id=call-01]
├── chat gpt-4o-mini [input=420, output=88]
├── execute_tool refund_order [tool.call.id=call-02]
└── chat gpt-4o-mini [input=540, output=40, finish=stop]

If the child count runs past ten and the tool names repeat, you are in a verifier stall. Alert on that shape.
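That alert can be a heuristic over the children of one invoke_agent span. A sketch, assuming each span is exported as a dict carrying the semconv attributes `gen_ai.operation.name` and `gen_ai.tool.name` (the dict shape is illustrative, not a collector API):

```python
def looks_like_verifier_stall(spans: list[dict],
                              max_children: int = 10,
                              max_repeats: int = 3) -> bool:
    """Flag a ReAct trace whose child count is high AND whose
    tool names repeat -- the verifier-stall shape."""
    tools = [s["gen_ai.tool.name"] for s in spans
             if s.get("gen_ai.operation.name") == "execute_tool"]
    if len(spans) <= max_children:
        return False
    most_common = max((tools.count(t) for t in set(tools)), default=0)
    return most_common >= max_repeats
```

The thresholds are starting points to tune against your own traffic, not standards.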

Plan-and-Execute: decide the steps first

Plan-and-Execute was formalized by Wang et al. in Plan-and-Solve Prompting (2023) and popularized in the LangChain ecosystem as the split between a planner LLM call and an executor that runs each planned step. The planner produces a numbered list of steps up front. The executor walks the list.

Pick Plan-and-Execute when the task has enough structure that a plan is worth writing, and when you want to use a smaller or cheaper model for execution. Research workflows, multi-document summarization, ETL-style agent pipelines. If every step is going to call the same two tools in the same order, let the planner commit to that order once and pay for it once.

The minimum skeleton:

import json
from openai import OpenAI

client = OpenAI()

def plan(goal: str) -> list[str]:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system",
                   "content": ('Return a JSON object with a "steps" '
                               "key holding an array of short steps.")},
                  {"role": "user", "content": goal}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)["steps"]

def execute_step(step: str, context: list[str]) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": "Do the step. Be brief."},
                  {"role": "user",
                   "content": f"Context: {context}\nStep: {step}"}],
    )
    return resp.choices[0].message.content

def plan_and_execute(goal: str) -> list[str]:
    steps = plan(goal)
    results = []
    for step in steps:
        results.append(execute_step(step, results))
    return results

The planner runs once on a capable model. The executor runs N times on a cheap one. The cost profile is roughly 1 * strong_model + N * cheap_model, which for N > 3 tends to beat ReAct on the same task.
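The break-even arithmetic is worth making concrete. The per-call dollar figures below are made-up round numbers, not current pricing; plug in your own averages:

```python
def pattern_cost(n_steps: int,
                 strong_call: float = 0.01,
                 cheap_call: float = 0.001) -> tuple[float, float]:
    """Illustrative per-request cost of the two patterns.

    ReAct pays for every step on the strong model; Plan-and-Execute
    pays once for the planner, then n_steps times on the cheap model.
    """
    react = n_steps * strong_call
    plan_and_execute = strong_call + n_steps * cheap_call
    return react, plan_and_execute
```

At five steps and a 10x price gap between models, the plan-first variant costs roughly a third of the ReAct run; the gap widens with every additional step.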

Failure mode: the brittle plan. The planner commits to a plan before seeing any tool output. If step 2 returns something the planner did not anticipate, step 3 is already written and wrong. The fix is a re-plan gate: after every K steps, or on any step whose output exceeds a confidence threshold, ask the planner whether to revise the remaining steps. BabyAGI added this as its core loop; LangChain's Plan-and-Execute agent does the same.
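The gate itself is a small control loop around the executor. A sketch with injected callables so the control flow is visible; `execute_with_replan`, `executor`, and `replanner` are names of my own invention, meant to wire up to the `execute_step` and `plan` calls from the skeleton above:

```python
def execute_with_replan(goal: str, steps: list[str],
                        executor, replanner, k: int = 2) -> list[str]:
    """Re-plan gate: every k steps, let the planner revise what is left.

    executor(step, results) -> str runs one step;
    replanner(goal, results, remaining) -> list[str] returns a revised
    list of remaining steps given what has actually happened so far.
    """
    results: list[str] = []
    remaining = list(steps)
    while remaining:
        step = remaining.pop(0)
        results.append(executor(step, results))
        if len(results) % k == 0 and remaining:
            remaining = replanner(goal, results, remaining)
    return results
```

The design choice is that the re-planner only sees real outputs, never its own earlier guesses, so a wrong step 2 cannot silently poison steps 3 through N.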

Real-world example. Devin (Cognition, 2024) uses a plan-first architecture for long-horizon software tasks, with re-planning when the executor stalls.

Under OTel GenAI semconv, Plan-and-Execute shows two distinct span shapes at the top: one chat (or invoke_agent) for the planner, followed by N invoke_agent children for each executed step. The planner span has high input tokens and low output tokens (it reads the goal, writes a short list). The executor spans are uniform in size. If the executor spans start diverging wildly in size, your plan is misaligned with reality. That is the re-plan signal you want to alert on.

plan_and_execute [goal="summarize Q1 reviews"]
├── chat gpt-4o [input=820, output=140]       # planner
├── invoke_agent step_1 [children=2, tokens=1.2k]
├── invoke_agent step_2 [children=3, tokens=4.8k]  ← outlier
├── invoke_agent step_3 [children=2, tokens=1.3k]
└── invoke_agent step_4 [children=2, tokens=1.1k]

The step-2 outlier is the interesting span. Route it to a re-plan.
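Spotting that outlier programmatically is a median check over the executor spans' token counts. A sketch; the 2.5x factor is a tuning starting point, not a standard:

```python
from statistics import median

def replan_candidates(step_tokens: list[int],
                      factor: float = 2.5) -> list[int]:
    """Indices of executor steps whose token count is an outlier
    relative to the median step -- the re-plan signal."""
    if not step_tokens:
        return []
    m = median(step_tokens)
    return [i for i, t in enumerate(step_tokens) if t > factor * m]
```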

Reflection: critique what you just did

Reflection was introduced in Reflexion: Language Agents with Verbal Reinforcement Learning by Shinn et al. (2023) and the closely related Self-Refine by Madaan et al. The loop is: generate an answer, have a critic LLM call review the answer, feed the critique back, generate again. Stop when the critic signs off or after a fixed number of rounds.

Pick Reflection when the cost of a wrong final answer is high and the task has a clear quality signal. Code generation where you can run the tests. Legal or financial drafts where a second pass of review catches hallucinations. Anything where you would rather spend 3x the tokens to get to right than ship fast and wrong.

The minimum skeleton:

from openai import OpenAI

client = OpenAI()

def draft(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def critique(prompt: str, answer: str) -> tuple[bool, str]:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system",
                   "content": ("Review the answer. Respond with "
                               "'OK' on first line if acceptable, "
                               "else concrete fixes.")},
                  {"role": "user",
                   "content": f"Q: {prompt}\nA: {answer}"}],
    )
    text = resp.choices[0].message.content
    return text.startswith("OK"), text

def reflect(prompt: str, max_rounds: int = 3) -> str:
    answer = draft(prompt)
    for _ in range(max_rounds):
        ok, feedback = critique(prompt, answer)
        if ok:
            return answer
        answer = draft(f"{prompt}\n\nPrior answer:\n{answer}\n\n"
                       f"Fix:\n{feedback}")
    return answer

The critic should be a stronger model than the drafter, or at minimum a different prompt against the same model. A critic that shares the drafter's blind spots will sign off on bad answers.

Failure mode: self-bias. A critic that is the same model as the drafter tends to approve its own outputs. The LLM Evaluators Recognize and Favor Their Own Generations study (Panickssery et al., 2024) measured this on GPT-4 and Llama 2 and found consistent preference for the model's own text. If you run Reflection with one model wearing two hats, your quality gate is a rubber stamp. Fix by using a distinct model family for the critic, or by grounding the critique in an external signal (test runs, schema validation, citation checks).
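An external-signal critic can be as simple as a validator that does not consult a model at all. A sketch where the signal is "parses as JSON with the required keys"; `grounded_critique` is my own name, and you would swap the body for test runs or citation checks in your domain:

```python
import json

def grounded_critique(answer: str,
                      required_keys: tuple[str, ...]) -> tuple[bool, str]:
    """Critic grounded in structure validation instead of model opinion.

    Returns (ok, feedback) in the same shape as the critique() call
    in the skeleton above, so it can drop into the reflect() loop.
    """
    try:
        data = json.loads(answer)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc}"
    missing = [k for k in required_keys if k not in data]
    if missing:
        return False, f"missing keys: {missing}"
    return True, "OK"
```

A critic like this cannot rubber-stamp, because it shares no weights with the drafter; its blind spot is anything the schema does not capture.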

Real-world example. The AlphaCodium paper (Codium AI, 2024) used a reflection-heavy loop (draft, run tests, reflect on failures, re-draft) and pushed GPT-4 from 19% to 44% on CodeContests. The tests are the external signal that stops the critic from rubber-stamping.

Under OTel GenAI semconv, a Reflection run emits a pattern of paired chat spans: drafter, critic, drafter, critic. You can tag them with a custom attribute like gen_ai.agent.role=drafter|critic (the spec does not yet standardize this but accepts custom keys) to make the pattern explicit in the trace. The interesting signal is the rounds-to-OK distribution across your traffic: if 90% of requests converge in one round and 10% need three, the 10% is where your quality issues live.

reflect [prompt_id=q-9f2]
├── chat gpt-4o-mini [role=drafter, output=210]
├── chat gpt-4o      [role=critic,  output=80]
├── chat gpt-4o-mini [role=drafter, output=240]
└── chat gpt-4o      [role=critic,  output=12]   # "OK"
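Computing the rounds-to-OK distribution from exported traces is a one-pass count. A sketch assuming each trace is the ordered list of role strings for one reflect() run (the list-of-roles shape is illustrative, not an exporter format):

```python
from collections import Counter

def rounds_to_ok(traces: list[list[str]]) -> Counter:
    """Histogram of drafter/critic rounds per request.

    One round = one critic span, so counting "critic" occurrences
    per trace gives rounds-to-OK for that request.
    """
    return Counter(roles.count("critic") for roles in traces)
```

The long tail of that histogram is the slice of traffic to pull traces for.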

Decision matrix

| Pattern | Latency | Cost | Best for | Instrument for |
| --- | --- | --- | --- | --- |
| ReAct | Variable (depends on steps) | Pay per step on one model | Interactive tasks, unknown search depth, tool chains that branch | Child-span count per invoke_agent, repeated tool names |
| Plan-and-Execute | Bounded after plan | 1 strong + N cheap | Structured pipelines, known task shape, cost-sensitive | Plan token size, executor span-size variance |
| Reflection | Multiplied by rounds | N rounds of 2 calls each | High-stakes output, available quality signal | Rounds-to-OK distribution, critic model identity |

A rule of thumb for 2026: start with ReAct for anything interactive. Move to Plan-and-Execute when you notice your ReAct agent is re-deriving the same plan for every request. Add Reflection as an outer loop around either when final-answer quality matters more than wall-clock time.

Observing the three in one trace

The three patterns compose. A production coding assistant in 2026 commonly runs a Plan-and-Execute outer loop, where each executor step is a ReAct agent with its own tools, and the whole run is wrapped in a Reflection pass that re-runs the failing tests. That span tree is four levels deep and emits invoke_agent spans at each level.

The only way to keep that readable in production is to use the GenAI semconv attributes consistently: gen_ai.agent.name to tell the planner from the executor from the critic, gen_ai.agent.id to correlate all spans from one request, and gen_ai.operation.name to distinguish chat from execute_tool. When your trace UI lets you filter by gen_ai.agent.name=critic and see only the reflection passes across last week, you have the observability surface the pattern was always supposed to give you.
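That filter is trivial once the attributes are consistent. A sketch assuming spans are exported as dicts with an "attributes" key, as in a typical OTLP-JSON export; adjust to your exporter's shape:

```python
def spans_for_role(spans: list[dict], role: str) -> list[dict]:
    """All spans emitted by one agent role, e.g. role="critic"
    to see only the reflection passes across a batch of traces."""
    return [s for s in spans
            if s.get("attributes", {}).get("gen_ai.agent.name") == role]
```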

Pick the pattern that matches your task, bound the loops, instrument the spans. The framework you end up using is a detail.


If this was useful

The agent patterns in this post all show up in Observability for LLM Applications, my book on instrumenting, debugging, and running LLM systems in production. Paperback and hardcover are live on Amazon; the ebook launches on April 22. Chapter 9 walks through the OpenTelemetry GenAI semantic conventions and shows the span trees for ReAct, Plan-and-Execute, and Reflection side by side.

If you work with Claude Code or other AI coding tools, I am also building Hermes IDE, an IDE shaped around the way those tools change the edit-run-review loop. The GitHub repo is where it lives.
