Shafiq Ur Rehman
From Simple LLMs to Reliable AI Systems: Building Reflexion-Based Agents with LangGraph

"An LLM that cannot reflect on its mistakes is not an agent — it is an autocomplete on steroids."
— Common wisdom in modern AI engineering


Introduction: Why "Just Prompting" Is No Longer Enough

You have probably seen it: you give a large language model (LLM) a complex task — write a research report, debug a multi-file codebase, plan a multi-step strategy — and it confidently produces something that sounds right but is subtly, sometimes catastrophically, wrong. It has no mechanism to pause, question itself, or try again with a better strategy.

This is the fundamental gap between a simple LLM call and a reliable AI system.

This article bridges that gap. It walks you through the evolution from raw LLM prompting to building production-grade, self-improving AI agents using two powerful ideas: Reflexion (a technique where AI agents critique and retry their own outputs) and LangGraph (a framework for building stateful, graph-based agent workflows).


Section 1: The Reliability Problem with Bare LLMs

Large language models are extraordinarily powerful pattern completers. Given a well-formed prompt, they can write poetry, generate code, summarize documents, and reason through logic puzzles. But they have a structural weakness that every practitioner eventually hits:

They do not know when they are wrong.

The Core Failure Modes

  • Hallucination (making up information that sounds plausible but is factually incorrect): An LLM asked to cite sources may invent URLs, author names, or statistics that feel authoritative but do not exist.
  • Premature convergence: The model "settles" on its first reasonable-sounding answer without exploring whether a better one exists. This is especially damaging in multi-step reasoning tasks.
  • Context blindness at scale: As tasks grow — spanning multiple documents, steps, or tool calls — the model loses track of earlier constraints, leading to contradictions deep in a workflow.
  • Silent failure: Unlike a software crash, a wrong LLM output looks identical to a correct one. There is no error message. The system "succeeds" by returning something.

Counter-view: Some researchers argue that sufficiently large models with good prompting (chain-of-thought, self-consistency) can sidestep many reliability issues. This is partially true for isolated reasoning tasks — but it breaks down when tasks are long-horizon, multi-step, or require external tool use, where real-world feedback is necessary.

Real-World Case: The Air Canada Chatbot Incident (2024)

Air Canada deployed an LLM-powered chatbot that confidently told a customer they could apply for a bereavement fare after their trip and receive a refund retroactively — which was false. The chatbot hallucinated a policy that did not exist. Air Canada was held legally liable. The system had no feedback loop, no validation layer, and no ability to catch its own mistake.

This is not a prompt engineering failure. It is an architectural failure.

📖 Further Reading: [Search: "Reliability of LLMs in production systems 2024"]


📌 Background: What Is a "Forward Pass"?

When you send a prompt to an LLM, it runs a single forward pass — meaning it reads your input from left to right through billions of parameters and generates tokens one by one until it stops. There is no internal loop, no checking, no going back. It is a one-way function. This is why, without external scaffolding, LLMs cannot self-correct.



Section 2: Enter Reflexion — Teaching AI to Think Twice


Reflexion is a framework introduced in a 2023 research paper by Shinn et al. at Northeastern University. The core idea is elegant:

Instead of training a model to be better (which requires compute and data), give it the ability to reflect on its own failures in natural language, store that reflection as memory, and try again.

This is significant because it requires no weight updates — no fine-tuning, no retraining. It is a pure inference-time technique that turns a static model into a self-improving agent.

The Three Components of Reflexion

  1. Actor — The LLM that actually does the task. It takes the current task description + any memory from past attempts and generates an output (text, code, a plan, a tool call, etc.).

  2. Evaluator (also called the "Critic") — A scoring function that judges the Actor's output. This can be:

    • Another LLM call that critiques the output
    • A deterministic function (e.g., unit test pass/fail, a factuality checker, a code linter)
    • A human-in-the-loop signal
  3. Reflector — The component that reads the Actor's output and the Evaluator's feedback, then produces a verbal self-critique — a natural language paragraph explaining what went wrong and how to do better. This critique is stored in a persistent episodic memory and injected into the Actor's next attempt.
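Concretely, the three components form a loop: act, score, critique, retry. The following is a minimal, framework-free sketch of that control flow; the `actor`, `evaluator`, and `reflector` bodies are deterministic stand-ins for real LLM calls and scoring functions, so only the loop structure should be taken literally.

```python
def actor(task: str, reflections: list[str]) -> str:
    # Stand-in for an LLM call: a real Actor would receive the task
    # plus all accumulated self-critiques in its prompt.
    return f"{task} :: draft v{len(reflections) + 1}"

def evaluator(output: str) -> float:
    # Stand-in for a scorer (unit tests, LLM judge, etc.).
    # Toy rule: later drafts score higher. The version parse is
    # deliberately naive; it only works for the stub format above.
    version = int(output.rsplit("v", 1)[-1])
    return min(1.0, 0.5 + 0.2 * version)

def reflector(output: str, score: float) -> str:
    # Stand-in for an LLM self-critique call.
    return f"Draft scored {score:.2f}; be more specific next time."

def run_reflexion(task: str, max_iterations: int = 3,
                  threshold: float = 0.85) -> tuple[str, list[float]]:
    reflections: list[str] = []   # episodic memory of critiques
    attempts: list[str] = []
    scores: list[float] = []
    for _ in range(max_iterations):
        output = actor(task, reflections)      # 1. Actor
        score = evaluator(output)              # 2. Evaluator
        attempts.append(output)
        scores.append(score)
        if score >= threshold:                 # good enough: stop
            break
        reflections.append(reflector(output, score))  # 3. Reflector
    return attempts[-1], scores
```

The essential property is that `reflections` only ever grows and is re-injected into every subsequent `actor` call; that accumulation is what distinguishes Reflexion from naive retrying.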

Why Verbal Reflection Works

The key insight is that LLMs are remarkably good at describing their mistakes, even though they could not avoid making them. By externalizing the critique into language (rather than gradient updates), you leverage the skill LLMs are best at: fluent verbal reasoning. "I failed because I did not account for edge case X. Next time I should check for X first." This verbalized lesson, fed back into the context window, measurably improves next-attempt quality.

Counter-view: Critics point out that Reflexion can get "stuck" — if the Actor's initial attempt is wrong in a way the Evaluator cannot detect, the reflection loop simply reinforces the error. The quality of the Evaluator is the ceiling of the entire system. A bad judge produces bad feedback.

Example: HotpotQA Multi-Hop Reasoning

In the original Reflexion paper, the technique was benchmarked on HotpotQA — a dataset of questions requiring reasoning across multiple Wikipedia articles. A plain GPT-4 agent answered correctly ~30% of the time on hard questions. The same model with Reflexion reached ~60% accuracy after three reflection cycles, without any fine-tuning. The improvement came purely from the agent saying: "I missed that the question asked about the founding date, not the founding country. Let me re-read the passage with that in mind."

📖 Further Reading: [Search: "Reflexion: Language Agents with Verbal Reinforcement Learning — Shinn et al. 2023"]


⚠️ CRITICAL NOTE: Token Budget and Cost

Every reflection cycle is an additional LLM call. On a 3-cycle Reflexion loop with a GPT-4-class model, you are paying for 3–6× the tokens of a single call. For high-volume production systems, this cost must be budgeted explicitly. Always add a max_iterations guard and use cheaper models for the Evaluator when possible.
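This budgeting can be done up front with simple arithmetic. The sketch below is a rough upper bound: it assumes each cycle issues one Actor call and one Reflector call, and that the prompt grows by the previous attempt and critique each cycle. The token counts and the price are illustrative placeholders, not real model pricing.

```python
def reflexion_cost(prompt_tokens: int, output_tokens: int,
                   cycles: int, price_per_1k: float) -> float:
    """Worst-case token cost of a Reflexion loop.

    Assumes the context grows by one attempt plus one critique
    per cycle (no compression), and that actor and reflector
    calls are roughly the same size.
    """
    total = 0
    context = prompt_tokens
    for _ in range(cycles):
        total += context + output_tokens   # actor call
        total += context + output_tokens   # reflector call
        context += 2 * output_tokens       # memory grows each cycle
    return total / 1000 * price_per_1k
```

For example, a 1,000-token task with 500-token outputs over 3 cycles consumes 15,000 tokens under these assumptions, versus 1,500 for a single pass: a 10× multiplier that the `max_iterations` guard exists to cap.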



Section 3: LangGraph — Stateful Agents as Executable Graphs

LangGraph is a library built on top of LangChain that lets you define agent workflows as directed graphs — where nodes are functions (or LLM calls) and edges are transitions between them, which can be conditional.

This is a fundamentally better model for complex agents than a simple chain or a while-loop in Python, for three reasons:

Why Graphs Beat Chains for Agents

  • Explicit state management: LangGraph makes the agent's "working memory" — what it knows, what it has tried, what it is doing — into a typed, inspectable Python object called the State. You always know what data is flowing through your system.
  • Conditional branching: Edges in LangGraph can be conditional. After the evaluator runs, you can route: "If score is good enough → END; else → reflect_node." This is the architectural backbone of the retry loop.
  • Built-in persistence: LangGraph supports checkpointing — saving the agent's state to a database between steps. This means long-running agents can be paused, resumed, debugged, or even handed off to a human mid-execution.

Counter-view: Some engineers prefer simpler approaches — a while-loop in Python with direct API calls — arguing that LangGraph adds abstraction overhead. This is valid for simple use cases. The graph model truly pays off when you have branching logic, human-in-the-loop steps, or parallel sub-agents that need to join results.

Example: LangGraph vs. LangChain Sequential Chain

Imagine an agent that writes code, runs it, and fixes errors.

With a LangChain sequential chain, you predefine the steps: write → run → fix → done. But what if it needs 3 fix cycles? Or what if the code is correct on the first try? The chain cannot dynamically decide.

With LangGraph, you define: write_node → run_node → conditional_edge(pass? → END, fail? → fix_node) → run_node. The graph routes itself based on runtime results. This is the difference between a flowchart and a script.

📖 Further Reading: [Search: "LangGraph documentation stateful agents 2024"]


📌 Key Term: Conditional Edges

In LangGraph, a conditional edge is a function that inspects the current State and returns the name of the next node to visit. This is how you implement decision logic — "if the evaluator score is above 0.8, go to END; otherwise go to reflect_node." Without conditional edges, you have a chain, not an agent.



Section 4: Architecting the Reflexion Agent — Design Deep Dive

Now we get into the engineering. A Reflexion agent in LangGraph is built around three decisions that determine everything else: what the state looks like, what each node does, and how the conditional router decides when to stop.

4.1 Designing the State

The State is the agent's "working memory." Every node reads from it and writes to it. A well-designed State captures:

  • The task — immutable, set at the start
  • All past attempts — so the Actor can see what it already tried
  • All past reflections — so the Actor has accumulated lessons
  • Scores per attempt — for the router to decide stop/continue
  • Iteration counter — the safety valve against infinite loops
  • Final answer — populated when done

A common mistake is storing only the latest attempt and reflection, discarding history. This strips the agent of its learning advantage — the whole point is that accumulated reflections compound across cycles.
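The node sketches below assume a state shape along these lines. A `TypedDict` is one idiomatic choice for a LangGraph state; the field names here follow this article's design, not any library requirement.

```python
from typing import TypedDict

class ReflexionState(TypedDict):
    task: str               # immutable, set at the start
    attempts: list[str]     # every Actor output so far
    reflections: list[str]  # every self-critique so far
    scores: list[float]     # evaluator score per attempt
    iteration: int          # loop counter (the safety valve)
    max_iterations: int     # hard cap on reflection cycles
    final_answer: str       # populated when done

def initial_state(task: str, max_iterations: int = 3) -> ReflexionState:
    """Build a fresh state with empty history."""
    return ReflexionState(
        task=task, attempts=[], reflections=[], scores=[],
        iteration=0, max_iterations=max_iterations, final_answer="",
    )
```

Note that `attempts`, `reflections`, and `scores` are lists, not scalars: the full history is the point.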

4.2 The Actor Node

The Actor prompt is the most important prompt in the system: it must carry the task plus the full history of past attempts and self-critiques. A sketch (assuming a pre-configured `llm` chat-model client and the `ReflexionState` type used throughout this section):

def actor_node(state: ReflexionState) -> ReflexionState:
    # Build context from accumulated memory
    memory_context = ""
    for i, (attempt, reflection) in enumerate(
        zip(state["attempts"], state["reflections"])
    ):
        memory_context += f"\n\n--- Attempt {i+1} ---\n{attempt}"
        memory_context += f"\n--- Self-Critique {i+1} ---\n{reflection}"

    prompt = f"""
    Task: {state['task']}

    {f"Your previous attempts and self-critiques:{memory_context}" if memory_context else "This is your first attempt."}

    Now produce your best answer, learning from any past mistakes.
    """
    response = llm.invoke(prompt)
    return {
        **state,
        "attempts": state["attempts"] + [response.content],
        "iteration": state["iteration"] + 1,
    }

Notice how the full history of attempts and reflections is injected. This is episodic memory — the agent is literally given its autobiography.

4.3 The Evaluator Node

This is the most context-dependent part. The right evaluator depends entirely on your task:

| Task type | Best evaluator |
| --- | --- |
| Code generation | Unit test runner (deterministic) |
| Factual Q&A | Another LLM with a fact-check prompt |
| Essay writing | Rubric-based LLM judge |
| API calls | HTTP response status + schema validation |
| Math | Symbolic solver or a sandboxed numeric check (never raw `eval()` on model output) |

A deterministic evaluator (like running tests) is always preferable when available, because it is objective and cheap. LLM-as-judge is useful but introduces its own biases.
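For structured outputs, a deterministic evaluator can be as simple as parse-and-validate. The sketch below corresponds to the "API calls" row: it scores a model-generated JSON payload by the fraction of required fields present. The field names are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical schema for an illustrative payment-API payload.
REQUIRED_FIELDS = {"customer_id", "amount", "currency"}

def evaluate_api_payload(output: str) -> float:
    """Deterministic evaluator: 0.0 if the output is not a JSON
    object, otherwise the fraction of required fields present.
    Cheap, objective, and free of LLM-judge bias."""
    try:
        payload = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(payload, dict):
        return 0.0
    present = REQUIRED_FIELDS & payload.keys()
    return len(present) / len(REQUIRED_FIELDS)
```

A partial-credit score like this gives the Reflector something specific to critique ("the `currency` field is missing"), which a binary pass/fail would hide.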

4.4 The Reflector Node

def reflect_node(state: ReflexionState) -> ReflexionState:
    last_attempt = state["attempts"][-1]
    last_score = state["scores"][-1]

    prompt = f"""
    You attempted this task: {state['task']}

    Your output was:
    {last_attempt}

    The evaluator gave it a score of {last_score:.2f} out of 1.0.

    Write a concise, specific self-critique (3–5 sentences):
    - What specifically went wrong?
    - What did you overlook or misunderstand?
    - What concrete change will you make next time?

    Do not be vague. Be precise and actionable.
    """
    reflection = llm.invoke(prompt).content
    return {**state, "reflections": state["reflections"] + [reflection]}

The prompt instructs the LLM to be specific and actionable, not vague. "I should do better" is useless. "I failed to handle the case where the input list is empty, causing an IndexError. Next time I will add a guard clause at line 1" is useful.

4.5 The Conditional Router

# assumes: from langgraph.graph import END
def should_continue(state: ReflexionState) -> str:
    if not state["scores"]:
        return "actor"  # First iteration, no score yet

    last_score = state["scores"][-1]
    iteration = state["iteration"]

    if last_score >= 0.85:
        return END  # Good enough
    if iteration >= state["max_iterations"]:
        return END  # Safety stop
    return "reflect"  # Not done yet, reflect and retry

The threshold (0.85 here) is a hyperparameter (a design-time setting you tune rather than something the model learns), and the right value is domain-specific. For medical or legal agents, set it close to 1.0; for creative writing suggestions, 0.7 may suffice.

Example (Real-World Case): Reflexion for Competitive Programming

DeepMind's AlphaCode 2 and similar code-agent research use Reflexion-like loops where the actor writes code, a test suite evaluates it, and failure messages are reflected into the next attempt. On LeetCode Hard problems, this pattern lifted solve rates from ~15% (single pass) to ~45% (5 reflection cycles) in published ablations. The key: tests provided a perfect, deterministic evaluator — no LLM-as-judge ambiguity.

📖 Further Reading: [Search: "LangGraph Reflexion agent code tutorial LangChain 2024"]


⚠️ WARNING: Reflection Can Degrade Quality

There is a known failure mode called "reflection poisoning" where a poor reflection actually steers the actor away from a correct answer it found. If your evaluator has a bug or blind spot, a correct output might be scored low, causing the reflector to critique something that was actually right. Always log and inspect all intermediate states, especially on tasks where correctness is hard to verify.



Section 5: Pros, Cons, and When to Use This Pattern

Reflexion + LangGraph: Honest Trade-offs

| Dimension | Pros | Cons |
| --- | --- | --- |
| Quality | Measurably higher accuracy on complex tasks | Ceiling is set by the evaluator's accuracy |
| Cost | No fine-tuning needed; inference-only | Multiple LLM calls per task; 3–10× base cost |
| Flexibility | Works with any LLM; swappable components | Adds significant engineering complexity vs. one-shot |
| Debuggability | State is fully inspectable at every step | More surface area for bugs; harder to trace failures |
| Latency | Best answer given a time budget | Latency scales with iterations; not for real-time apps |
| Reliability | Handles task types that single-pass fails at | Can loop indefinitely without a hard iteration cap |

When TO Use Reflexion-Based Agents

  • Tasks where errors are catchable and measurable (code, math, structured outputs)
  • Workflows where cost of a wrong answer exceeds the cost of extra API calls (legal, medical, financial drafting)
  • Asynchronous or batch tasks where latency is not the primary constraint
  • Tasks involving tool use where real-world feedback naturally forms the evaluation signal

When NOT To Use

  • Real-time, low-latency applications (chatbots with <2s response requirement)
  • Tasks where the evaluator itself would need to be an expensive LLM call — the economics may not hold
  • Simple, well-scoped tasks where a single well-crafted prompt already performs well
  • Domains where you cannot define a reliable evaluation metric at all

Counter-view: With inference costs falling ~50% every 12–18 months historically, the cost argument against multi-cycle agents is weakening. By 2026 standards, what costs $0.10 per task today may cost $0.01. Cost-based objections have a short half-life.

Example: When NOT to Use It — The Customer Service Case

A retail company tested Reflexion for their live chat customer support bot. The latency of 3 reflection cycles (avg. 12 seconds per loop) made conversations feel broken. Customers expected responses in 2–3 seconds. The agent was technically more accurate, but user satisfaction scores dropped because of perceived slowness. Architecture must match use-case constraints, not just quality targets.

📖 Further Reading: [Search: "LLM agent latency optimization production 2024"]



Section 6: Production Hardening — What Research Papers Don't Tell You

Research papers show the happy path. Production systems face messier realities. Here is what you must address:

Critical Production Concerns

  • Context window overflow: By iteration 3, the state contains the original task + 3 attempts + 3 reflections. On long tasks, this can exceed the model's context window (the maximum text length a model can process at once). Implement a compression step that summarizes older reflections into a brief "lessons learned" paragraph.

  • Checkpointing for resilience: LangGraph's SqliteSaver and RedisSaver let you persist state between steps. If your agent is doing a 10-step task and fails at step 8, you can resume from step 8 without rerunning the first 7 steps. This is non-negotiable for long-running agents.

  • Observability: Use LangSmith (or equivalent tracing tools) to visualize every node's inputs and outputs in real time. Reflexion agents that fail silently are far harder to debug than a simple chain, because the error may be in the evaluator logic, the reflection prompt, or the routing condition.

  • Human-in-the-loop escalation: If after max_iterations the agent has not reached a satisfactory score, route to a human review queue instead of silently returning the best-so-far. This is the most important reliability upgrade for production.
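The compression step from the first bullet can be as simple as folding everything but the newest reflections into one "lessons learned" entry. In the sketch below, `summarize` is a stub standing in for a cheap LLM summarization call; the truncation rule it uses is purely illustrative.

```python
def summarize(texts: list[str]) -> str:
    # Stand-in summarizer: keep only the first clause of each lesson.
    # In production this would be a cheap LLM call.
    return "; ".join(t.split(".")[0] for t in texts)

def compress_reflections(reflections: list[str],
                         keep_last: int = 2) -> list[str]:
    """Fold all but the newest reflections into a single summary
    entry so the prompt stops growing linearly with iterations."""
    if len(reflections) <= keep_last:
        return reflections
    older, recent = reflections[:-keep_last], reflections[-keep_last:]
    return [f"Lessons from earlier attempts: {summarize(older)}"] + recent
```

Run this inside (or just before) the Actor node once the reflection list crosses a size threshold; the newest critiques stay verbatim because they carry the most actionable detail.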

Example: The GitHub Copilot Workspace Model

GitHub Copilot Workspace (released 2024) uses a multi-step agentic loop that resembles Reflexion: it generates a plan, the user can review/edit it (human evaluator), then it generates code, runs tests, and iterates on failures. The "human-as-evaluator" in the planning step is a deliberate design choice that combines automated iteration with human judgment — the best of both worlds.

📖 Further Reading: [Search: "GitHub Copilot Workspace agent architecture 2024"]


⚠️ SECURITY NOTE: Prompt Injection in Agentic Loops

When the agent's tool outputs (e.g., web search results, code execution stdout) are fed back into the prompt, malicious content in those results can hijack the agent's behavior. This is called prompt injection. Always sanitize tool outputs before injecting them into prompts, and consider running evaluators and reflectors on a separate, sandboxed model invocation.
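A first line of defense is to cap, clean, and clearly delimit tool output before it re-enters the prompt. The sketch below is deliberately minimal; the blocklist pattern is illustrative only, and real deployments need layered defenses rather than a single regex.

```python
import re

# Illustrative injection patterns only; not a complete defense.
SUSPICIOUS = re.compile(
    r"ignore (all )?previous instructions|disregard the system prompt",
    re.IGNORECASE,
)

def sanitize_tool_output(text: str, max_chars: int = 4000) -> str:
    """Cap length, strip control characters, redact instruction-like
    content, and delimit the result so the model is prompted to
    treat it as data rather than as instructions."""
    text = text[:max_chars]
    text = re.sub(r"[\x00-\x08\x0b-\x1f]", "", text)  # keep \t and \n
    if SUSPICIOUS.search(text):
        text = "[REDACTED: output contained instruction-like content]"
    return f"<tool_output>\n{text}\n</tool_output>"
```

Delimiting alone does not stop a determined attacker, which is why the note above also recommends running evaluators and reflectors in a separate, sandboxed invocation.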



Section 7: The Bigger Picture — Where Reflexion Fits in the AI Stack

Reflexion is one pattern in a growing taxonomy of agent architectures. Understanding where it sits helps you choose the right tool:

Agent Architecture Taxonomy

  • Single-pass LLM — One prompt, one response. Fast. No self-correction.
  • Chain-of-thought (prompting the model to "think step by step" before answering) — Better reasoning, but still single-pass.
  • ReAct (Reasoning + Acting: the model alternates between thinking and calling tools) — Good for tool use, but no explicit self-correction loop.
  • Reflexion — Adds a verbal self-correction cycle on top of any base agent pattern.
  • Multi-agent systems — Multiple specialized agents (planner, executor, critic) each running independently, coordinated by an orchestrator. Reflexion can live inside each agent.
  • RLHF / fine-tuning (Reinforcement Learning from Human Feedback — training the model's weights to be better using human preferences) — Bakes improvements into the model permanently, but requires data and compute. Reflexion is the inference-time alternative.

Reflexion sits at a sweet spot: more reliable than ReAct, cheaper than fine-tuning, easier to implement than multi-agent systems. It is the right starting point when single-pass quality is insufficient but you cannot yet justify the infrastructure cost of a full multi-agent system.

Counter-view: Some teams argue that investing engineering time in Reflexion scaffolding would be better spent curating fine-tuning data. For domain-specific, high-volume tasks, a fine-tuned small model often outperforms a Reflexion-looped large model at a fraction of the cost. This is a genuine trade-off worth modeling quantitatively before committing.

Example: Cognition AI's Devin (2024)

Devin, marketed as the first "AI software engineer," uses a multi-step loop where the agent writes code, runs it in a sandboxed terminal, observes the output (evaluator), and iterates on failures — a Reflexion-like architecture at its core. The real innovation was the deterministic evaluator: actual code execution. Devin's benchmark scores (14% on SWE-bench) became meaningful precisely because the evaluation was objective, not LLM-based.

📖 Further Reading: [Search: "Cognition AI Devin architecture evaluation 2024"]



Conclusion: The Engineering Mindset Shift

The move from simple LLMs to reliable AI systems is not about finding a better model. It is about changing your architectural mindset:

  • From one-shot generation to iterative refinement
  • From static prompts to stateful, memory-carrying agents
  • From hoping the model is right to building systems that verify and retry

Reflexion and LangGraph together give you the building blocks for this shift. Reflexion provides the cognitive loop — the ability to criticize and improve. LangGraph provides the execution infrastructure — typed state, conditional routing, persistence, and observability.

Neither is magic. Both require careful engineering: a well-designed evaluator, a well-tuned reflector prompt, a sensible iteration cap, and proper production hardening. But applied correctly, they transform an LLM from a clever autocomplete into a system that can be trusted with consequential tasks.

The difference between a demo and a production AI system is not the model. It is the scaffolding around it.
