"A language model that answers questions is a tool. A language model that decides which questions to ask and then acts on the answers is something else entirely."
Introduction: When Models Started Deciding
For the first several years of modern NLP, the task was always the same: given input, produce output. One forward pass. One completion. Done.
In 2022, a paper from researchers at Princeton and Google Brain asked a different question. What if, instead of producing an answer directly, a model could reason about what information it needs, act to retrieve it, and revise its thinking based on what it found?
The paper was ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2022). Its prompting scheme turned an LLM into something qualitatively different: a model that could take real-world actions and adapt its reasoning based on what came back.
A completion model is a calculator. An agent is a process: it has a goal, takes steps toward it, and updates when things go wrong. This week I went deep on the architecture behind these systems, the frameworks that define them, and what the open problems look like from a research perspective.
Part 1: What Makes a System "Agentic"?
The word "agent" gets used loosely in current literature. A clean definition comes from Russell and Norvig's Artificial Intelligence: A Modern Approach:
An agent is anything that perceives its environment through sensors and acts upon that environment through actuators.
For an LLM-based system, this is a loop: perceive an observation, reason about what to do, act via a tool call or output, observe the result, and loop again. But not every loop qualifies as agentic. Three properties distinguish genuinely agentic systems from tool-augmented chatbots:
| Property | What It Means |
|---|---|
| Goal persistence | Maintains the original goal across multiple steps without re-prompting |
| Adaptive planning | Revises its approach based on intermediate results |
| Tool autonomy | Decides when and which tools to use, not just how to use one it was told to call |
Most production systems in 2026 satisfy the first two reliably. The third, genuine tool autonomy where an agent discovers appropriate tools from scratch, is still largely unsolved.
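To make the loop concrete, here is a minimal sketch of perceive, reason, act, observe in plain Python. Everything in it is a stand-in I made up for illustration: `AgentState`, `run_agent`, `toy_policy`, and the `lookup` tool are hypothetical names, and the "policy" is a hard-coded function where a real system would call a model.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    goal: str                                     # held for the whole run (goal persistence)
    history: list = field(default_factory=list)   # ((tool, arg), observation) pairs


def run_agent(goal, policy, tools, max_steps=5):
    """Loop: reason -> act -> observe, until the policy says 'finish'."""
    state = AgentState(goal=goal)
    for _ in range(max_steps):
        tool_name, arg = policy(state)            # reason: pick the next action
        if tool_name == "finish":
            return arg
        observation = tools[tool_name](arg)       # act
        state.history.append(((tool_name, arg), observation))  # observe
    return None                                   # step budget exhausted


# Hypothetical stand-ins: a real policy would be an LLM call.
def toy_policy(state):
    if not state.history:
        return ("lookup", state.goal)             # nothing known yet, go look
    return ("finish", state.history[-1][1])       # answer with the last observation


toy_tools = {"lookup": lambda query: f"stub result for: {query}"}

answer = run_agent("capital of France", toy_policy, toy_tools)
print(answer)
```

The policy sees the full state on every step, which is what lets it keep the goal and change course. A policy that never returns `finish` simply runs out of steps and yields `None`.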
Part 2: The Core Frameworks
ReAct: Reasoning and Acting Together
The core contribution of ReAct (Yao et al., ICLR 2023) is structuring the model's output as alternating Thought and Action blocks.
Thought: I need current statistics on LLM deployment.
Action: search("LLM production deployment 2025")
Observation: 68% of enterprises report using LLMs in production workflows...
Thought: Enough context. I can now answer the original question.
Two things happen here that do not happen in single-pass completion. The model commits to a reasoning step before acting, and every action has a traceable reason. The original paper evaluated ReAct on HotpotQA, FEVER, ALFWorld, and WebShop. On the knowledge tasks (HotpotQA and FEVER), the strongest results came from combining ReAct with chain-of-thought rather than from either alone. On the interactive tasks (ALFWorld and WebShop), ReAct outperformed act-only prompting and imitation-learning baselines.
The intuition: chain-of-thought helps a model reason over information it already has. ReAct helps it acquire what it needs, then reason over it.
What ReAct does not solve. It is reactive. If the first three steps go down the wrong path, there is no mechanism to step back and reconsider.
Figure 1: The ReAct agent loop — Perceive, Reason, Act, Observe. Every action traces back to a reasoning step.
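A rough sketch of what driving that Thought/Action/Observation format looks like in code. This is not the paper's implementation: the regex, `react_loop`, `fake_generate`, and `fake_search` are my own hypothetical names, and the model is replaced by a scripted list of outputs so the example runs without an API key.

```python
import re

# Matches lines like: Action: search("some query")
ACTION_RE = re.compile(r'^Action:\s*(\w+)\("(.*)"\)\s*$', re.MULTILINE)


def parse_action(model_output):
    """Return (tool_name, argument) from the first Action line, or None."""
    match = ACTION_RE.search(model_output)
    return (match.group(1), match.group(2)) if match else None


def react_loop(question, generate, tools, max_steps=4):
    """Alternate model output and real tool observations in one transcript."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        output = generate(transcript)          # model emits a Thought and maybe an Action
        transcript += output + "\n"
        action = parse_action(output)
        if action is None:                     # no Action line: treat as done
            break
        name, arg = action
        # The observation comes from running the tool, never from the model.
        transcript += f"Observation: {tools[name](arg)}\n"
    return transcript


# Scripted stand-in for the model (assumption: a real run would call an LLM here).
scripted = iter([
    'Thought: I need current statistics on LLM deployment.\n'
    'Action: search("LLM production deployment 2025")',
    'Thought: Enough context. I can now answer the original question.',
])


def fake_generate(transcript):
    return next(scripted)


def fake_search(query):
    return f"[stub results for '{query}']"


trace = react_loop("How widely are LLMs deployed?", fake_generate, {"search": fake_search})
print(trace)
```

Appending the observation from outside the model is the design choice that matters: it keeps every Observation line tied to an actual tool call.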
Reflexion: Learning From Failure Without Gradient Updates
Reflexion (Shinn et al., NeurIPS 2023) addresses exactly that. After each failed attempt, the agent generates a verbal self-reflection analyzing what went wrong. This is stored in a memory buffer and prepended to context at the start of the next episode.
[Episode 1 fails]
Reflection: "I searched by title, which broke on the colon. Next time: search by author and year."
[Episode 2]
Agent searches "Shinn 2023 language agent" and succeeds.
The model improves across attempts through language, not weight updates. On HumanEval, the paper reports 91% pass@1 for Reflexion against 80% for its GPT-4 baseline, a gain of roughly 11 percentage points. On ALFWorld, ReAct with Reflexion completed 130 of 134 tasks (about 97%) over 12 consecutive trials.

Figure 2: Reflexion vs ReAct on HumanEval and ALFWorld. Numbers from Shinn et al., NeurIPS 2023.
The fundamental limitation. The memory buffer lives in the context window. As episodes accumulate, early reflections get pushed out. True long-term learning from experience requires something outside the context window entirely.
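Here is a small sketch of the reflection buffer idea, including the eviction problem. It is an illustration under my own assumptions, not the Reflexion codebase: `ReflectionBuffer`, `run_episodes`, and the toy `attempt`/`reflect` functions are hypothetical, and a fixed-size `deque` stands in for a context window that can only hold so much.

```python
from collections import deque


class ReflectionBuffer:
    """Bounded store of verbal reflections; the oldest entries fall out first."""

    def __init__(self, max_reflections=3):
        self._items = deque(maxlen=max_reflections)

    def add(self, reflection):
        self._items.append(reflection)

    def as_prompt_prefix(self):
        if not self._items:
            return ""
        lines = "\n".join(f"- {item}" for item in self._items)
        return f"Lessons from earlier attempts:\n{lines}\n\n"


def run_episodes(task, attempt, reflect, max_episodes=3, buffer=None):
    """Retry a task, prepending stored reflections to each new attempt."""
    if buffer is None:
        buffer = ReflectionBuffer()
    for episode in range(1, max_episodes + 1):
        prompt = buffer.as_prompt_prefix() + task
        success, trajectory = attempt(prompt)
        if success:
            return episode                      # number of episodes it took
        buffer.add(reflect(trajectory))         # learn in language, not in weights
    return None


# Toy stand-ins (assumption: real versions would both be LLM calls).
def toy_attempt(prompt):
    strategy = "author and year" if "author and year" in prompt else "title"
    return strategy == "author and year", f"searched by {strategy}"


def toy_reflect(trajectory):
    return "Title search failed; next time search by author and year."


episodes_needed = run_episodes("Find the Reflexion paper.", toy_attempt, toy_reflect)
print(episodes_needed)

# Eviction: with room for two reflections, the first one is lost.
small = ReflectionBuffer(max_reflections=2)
for note in ["first lesson", "second lesson", "third lesson"]:
    small.add(note)
print(small.as_prompt_prefix())
```

The second half shows the limitation directly: once the buffer is full, the earliest lesson disappears from the prompt, and the agent is free to repeat that mistake.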
Multi-Agent Frameworks
Single-agent systems face one bottleneck: one model handling planning, retrieval, tool use, and synthesis simultaneously. Multi-agent frameworks decompose this. The standard pattern has an orchestrator that breaks the goal into subtasks, specialist agents that handle each, and an aggregator that synthesizes the result.
The three dominant frameworks take meaningfully different approaches:
| Framework | Communication Model | Key Distinction |
|---|---|---|
| AutoGen (Microsoft, 2023) | Agents converse with each other | Human-in-the-loop as a first-class citizen |
| CrewAI (2024) | Role-based delegation | Each agent has an explicit role and goal |
| LangGraph (LangChain, 2024) | Directed graph with shared state | Explicit control flow, most debuggable in production |
LangGraph models the entire workflow as a directed graph: nodes are agents, edges are transitions. This makes execution paths readable and failures traceable, which matters significantly in production.
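The graph idea is easy to sketch without any framework. To be clear, the code below does not use LangGraph's API. It is a plain-Python illustration of "nodes plus edges over shared state", and every name in it (`plan`, `work`, `aggregate`, `run_graph`) is hypothetical.

```python
# Each node is a function that reads and updates one shared state dict.
def plan(state):
    state["subtasks"] = ["retrieve", "summarize"]
    return state


def work(state):
    task = state["subtasks"].pop(0)
    state["results"].append(f"done:{task}")
    return state


def aggregate(state):
    state["answer"] = ", ".join(state["results"])
    return state


NODES = {"plan": plan, "work": work, "aggregate": aggregate}

# Each edge inspects the state and names the next node (None means stop).
EDGES = {
    "plan": lambda state: "work",
    "work": lambda state: "work" if state["subtasks"] else "aggregate",
    "aggregate": lambda state: None,
}


def run_graph(start, state, max_steps=20):
    """Walk the graph from `start`, recording the path for debugging."""
    path, node = [], start
    while node is not None and len(path) < max_steps:
        path.append(node)
        state = NODES[node](state)
        node = EDGES[node](state)
    return state, path


final_state, path = run_graph("plan", {"results": []})
print(path)
print(final_state["answer"])
```

The recorded `path` is the debuggability argument in miniature: when something fails, you can see exactly which node ran and in what order.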
Part 3: Memory Architecture
Memory is where most agentic systems underperform. The naive approach of keeping everything in context breaks at scale. A well-designed agent needs four distinct memory types:
In-context memory is the active context window. Fast and immediate, but size-limited and cleared between sessions. Use for current task state and the ongoing reasoning chain.
External memory is a persistent vector database. Facts are stored as embeddings and retrieved by cosine similarity. This is essentially RAG applied to the agent's own accumulated knowledge rather than a document corpus.
Episodic memory is a log of past trajectories: what the agent did, what succeeded, what failed. Reflexion's verbal buffer is a simple version. More sophisticated implementations store full (observation, action, outcome) tuples and retrieve by similarity, enabling few-shot learning from experience without retraining.
Procedural memory is the agent's fixed capabilities: tool schemas and system prompts. What it contains, particularly how tools are described, has outsized influence on behavior.
The memory architecture determines the learning capacity of the system. Getting the interaction between these layers right is still an open engineering and research problem.
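A minimal sketch of the external memory layer: store text, retrieve by cosine similarity. The embedding here is a toy bag-of-words count vector that I am using as a stand-in for a real embedding model, so it only matches on shared words. `VectorMemory`, `embed`, and the `min_score` threshold are illustrative assumptions.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: token -> count. A stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorMemory:
    def __init__(self):
        self._entries = []                      # list of (text, embedding)

    def store(self, text):
        self._entries.append((text, embed(text)))

    def retrieve(self, query, k=1, min_score=0.1):
        """Return up to k stored texts whose similarity clears min_score."""
        q = embed(query)
        scored = [(cosine(q, emb), text) for text, emb in self._entries]
        scored = [pair for pair in scored if pair[0] >= min_score]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


memory = VectorMemory()
memory.store("reflexion stores verbal reflections in a memory buffer")
memory.store("langgraph models workflows as a directed graph")

hits = memory.retrieve("how does reflexion use its memory buffer")
print(hits)
```

The `min_score` cutoff matters more than it looks. Without it, the store always returns its nearest neighbor, even when nothing stored is actually relevant.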
My Experiment: Building the Architecture From Scratch
Most tutorials on agentic AI use LangChain or AutoGen. For this week, I deliberately avoided both and built a minimal ReAct agent using only the Anthropic API. The goal was not to produce novel empirical results — it was to understand what these frameworks are actually abstracting away.
The pipeline has four tools: web_search, memory_store, memory_retrieve, and final_answer. The orchestrator is Claude running in tool-use mode. Memory is a simple cosine similarity store over embeddings. Two queries run sequentially, sharing the same memory instance, so facts retrieved in Query 1 are available to Query 2.
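For readers who have not used tool-use mode: Anthropic's Messages API takes each tool as a name, a description, and a JSON Schema under `input_schema`. The sketch below shows roughly what four such definitions could look like. The description strings and parameter names are my own wording for illustration, not the ones from the repo, and `make_tool` is a hypothetical helper.

```python
def make_tool(name, description, param, param_description):
    """Build one tool definition with a single required string parameter."""
    return {
        "name": name,
        "description": description,
        "input_schema": {
            "type": "object",
            "properties": {param: {"type": "string", "description": param_description}},
            "required": [param],
        },
    }


# Illustrative descriptions (assumed wording). Note the explicit ordering hints.
TOOLS = [
    make_tool(
        "memory_retrieve",
        "Look up facts stored earlier in this session. Call this FIRST for any "
        "factual question; use web_search only if nothing relevant comes back.",
        "query", "What to look up in memory.",
    ),
    make_tool(
        "web_search",
        "Search the web. Use only after memory_retrieve returned nothing useful.",
        "query", "The search query.",
    ),
    make_tool(
        "memory_store",
        "Save one verified fact so later queries can reuse it.",
        "fact", "A single self-contained factual sentence.",
    ),
    make_tool(
        "final_answer",
        "Return the final answer to the user and end the task.",
        "answer", "The complete answer text.",
    ),
]

for tool in TOOLS:
    print(tool["name"], "->", tool["description"][:40])
```

A list like this would be passed as the `tools` argument of a Messages API call. The model only ever sees these strings, so the description is effectively the tool's entire interface.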
To be clear about what this is and is not: this is an architectural walkthrough, not an empirical study. The outcome — that a warm memory store reduces tool calls — is exactly what theory predicts. I was not testing whether it works. I was making visible how it works, because the mechanism only becomes concrete when you can see every tool call in sequence rather than having a framework handle it silently.
Two things became clear that I had not fully appreciated from reading papers alone. First, tool description quality matters more than I expected. A vague tool description produces inconsistent selection — the model sometimes calls web_search when memory_retrieve was the right first step, purely because the description did not make the priority explicit. This is a grounding problem that frameworks handle through opinionated defaults, which means when their defaults are wrong, you often cannot see why. Second, the memory store without real semantic embeddings is brittle. I used mock embeddings seeded by text hash, which are consistent but not meaningful. On queries where surface-level keyword overlap is low, retrieval fails entirely. The framework abstracts this away. Building without it made the failure visible immediately.
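The brittleness of hash-seeded embeddings is easy to reproduce. The sketch below is one plausible reading of "seeded by text hash", hashing the whole string to seed a random vector; the scheme in the actual repo may differ (for example, it may hash per token). `mock_embedding` and its `dim` parameter are hypothetical.

```python
import hashlib
import math
import random


def mock_embedding(text, dim=16):
    """Deterministic pseudo-random vector seeded by a hash of the text."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same text: identical vector, so similarity is 1 (consistent).
same = cosine(mock_embedding("ReAct paper venue"), mock_embedding("ReAct paper venue"))

# Paraphrase: an unrelated vector, so similarity is effectively noise (not meaningful).
paraphrase = cosine(
    mock_embedding("ReAct paper venue"),
    mock_embedding("Where was ReAct published?"),
)

print(f"same text:  {same:.3f}")
print(f"paraphrase: {paraphrase:.3f}")
```

Under this scheme, retrieval only works when the query string exactly repeats what was stored. Any rewording lands on a vector with no relationship to the original.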

Figure 3: Agent trace showing cold start (Query 1, 4 steps) vs warm start (Query 2, 2 steps). Facts stored in Query 1 were retrieved directly in Query 2, eliminating web search entirely.
The full code is in the GitHub repo linked below. The more interesting exercise, which I plan to run properly in a later week, is a controlled comparison of prompted versus trained agents on a fixed benchmark, ideally reproducing part of the Reflexion evaluation on ALFWorld to see whether my numbers match the paper.
Part 4: Failure Modes in Production
Agentic systems fail in ways that single-pass models do not, and the failures follow predictable patterns.
Tool call loops. The agent calls the same tool repeatedly with slightly different inputs without making progress. This happens when the tool returns unhelpful results and the agent has no mechanism for declaring failure. Step limits and explicit "I cannot find this" states help.
Hallucinated observations. The model predicts what a tool would return rather than waiting for the actual result. It is a context management error and subtle to catch without logging every call.
Memory poisoning. An incorrect fact stored early gets retrieved for related queries and contaminates future reasoning. Errors compound. Confidence-weighted storage and verification before storing are partial mitigations.
Goal drift. Past roughly 15 steps, agents frequently lose track of the original objective and optimize for the most recent subtask. Re-injecting the original goal into the prompt at every turn reduces this.
Prompt injection. A web search result or document contains text designed to override the agent's instructions. This is a real attack vector in production, not a theoretical one.
Each has partial mitigations. None has a clean solution.
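As one example of a partial mitigation, here is a sketch of a guard against tool call loops: a step limit plus a cap on near-identical calls, with an explicit stop reason the agent can surface as "I cannot find this". `LoopGuard` and its thresholds are hypothetical, and the normalization only catches differences in case and whitespace, so genuinely reworded queries would slip through.

```python
class LoopGuard:
    """Stops an agent run on a step limit or on repeated near-identical calls."""

    def __init__(self, max_steps=10, max_repeats=2):
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self.seen = {}                           # normalized call -> count

    def check(self, tool_name, argument):
        """Return None if the call may proceed, else a reason to stop."""
        self.steps += 1
        if self.steps > self.max_steps:
            return "step limit reached"
        # Normalize case and whitespace so trivially varied inputs count as repeats.
        key = (tool_name, " ".join(argument.lower().split()))
        self.seen[key] = self.seen.get(key, 0) + 1
        if self.seen[key] > self.max_repeats:
            return f"repeated call to {tool_name}"
        return None


guard = LoopGuard(max_steps=5, max_repeats=2)
calls = [
    ("web_search", "LLM agents"),
    ("web_search", "llm  agents"),
    ("web_search", "LLM Agents "),
]
verdicts = [guard.check(name, arg) for name, arg in calls]
print(verdicts)
```

Returning a reason instead of raising keeps the decision with the orchestrator, which can then emit a clean failure state rather than an exception.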
What We Still Do Not Know
Is prompting-based agency enough? Current agents improve through prompting: ReAct, Reflexion, tool descriptions, with no weight updates. As tasks grow longer and environments more complex, will this hit a ceiling? If training-based agents eventually replace prompted ones, what does that change about interpretability and control?
How do you evaluate trustworthiness? An agent scoring 80% on SWE-bench may still fail unpredictably on cases outside the benchmark distribution. We do not have frameworks for measuring agent reliability statistically, just average performance on fixed task sets.
Memory or fine-tuning for domain adaptation? When specializing an agent for a domain, is it better to give it a rich external memory store or to fine-tune on domain trajectories? The tradeoffs in cost, latency, generalization, and catastrophic forgetting are not well characterized.
Can we formally bound agent behavior? A calculator is provably correct within its domain. An LLM agent is evaluated empirically on benchmarks. There is no formal framework for specifying what an agent will and will not do, analogous to how formal verification works for software. Whether this is achievable for learned systems is an open question.
Papers Worth Reading
| Paper | Contribution | Venue |
|---|---|---|
| Yao et al. (2022) | ReAct: Reasoning + Acting loop | ICLR 2023 |
| Shinn et al. (2023) | Reflexion: Verbal self-reflection for improvement | NeurIPS 2023 |
| Wu et al. (2023) | AutoGen: Multi-agent conversation framework | arXiv 2023 |
| Schick et al. (2023) | Toolformer: Self-supervised tool learning | NeurIPS 2023 |
| Liu et al. (2023) | AgentBench: Evaluating LLMs as agents | ICLR 2024 |
| Park et al. (2023) | Generative Agents: Simulating believable behavior | UIST 2023 |
Research Groups Doing Interesting Work
Princeton NLP (Yao, Narasimhan, and collaborators) has extended agent reasoning with Tree of Thoughts and beyond. The Reflexion authors (Shinn et al., spanning Northeastern, MIT, and Princeton) continue on self-improvement mechanisms. Microsoft Research is focused on multi-agent reliability in production. DeepMind is working on agent training at scale through SIMA. LangChain's infrastructure team publishes pragmatic findings on what actually breaks in production.
Benchmarks Worth Knowing
AgentBench evaluates agents across 8 environments including code, database, and web tasks. WebArena tests realistic web navigation. SWE-bench is the most demanding: real GitHub issues requiring working code fixes. ALFWorld is the interactive environment from the original ReAct paper. ToolBench evaluates tool selection across a large library.
Conclusion
Agentic AI is not primarily a model story. The models have not changed fundamentally. What changed is the architecture around them. ReAct gave agents a structured reasoning-action cycle. Reflexion gave them a mechanism to improve from failure within a session. Multi-agent frameworks gave them specialization. Memory systems gave them persistence across queries.
The hard problems are real. Long-horizon planning still drifts. Memory poisoning is still a live issue. Prompt injection has no clean solution. There is no formal way to guarantee agent behavior.
But the direction is clear. The next frontier for LLMs is not better completions. It is better decisions.
References
- Yao, S., et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023.
- Shinn, N., et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023.
- Wu, Q., et al. (2023). AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. arXiv:2308.08155.
- Schick, T., et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS 2023.
- Liu, X., et al. (2023). AgentBench: Evaluating LLMs as Agents. ICLR 2024.
- Park, J. S., et al. (2023). Generative Agents: Interactive Simulacra of Human Behavior. UIST 2023.
- Russell, S., and Norvig, P. (2020). Artificial Intelligence: A Modern Approach, 4th ed. Pearson.
This is part of a weekly series on AI/ML research. Each post covers theory, recent papers, and experiments I run myself.
Connect on LinkedIn | [GitHub]: weekly-AI-ML research