Because building agents without understanding memory is like hiring an employee who forgets everything by morning.
Introduction
Your Agent Is Not Broken. It Was Never Built to Remember.
Here is something most people get wrong when they first build an AI agent. They set it up, give it context, run a few tasks, it works great. Then they come back the next session and it has no idea who they are, what the project is, or what was decided. So they open a GitHub issue. They try different prompts. They assume something is misconfigured.
Nothing is misconfigured. The agent is working exactly as designed.
The hard truth is this: agent memory is not a model problem. It is an infrastructure problem. The LLM at the core of your agent is stateless by design: every inference call starts completely fresh. No history, no context, no record of what happened before. That is never going to change, because statelessness is precisely what allows LLMs to scale to millions of users at once.
What this means for builders is important: you cannot give the model memory. You have to build memory infrastructure around it.
The agent does not remember. The infrastructure remembers. The agent only knows what the infrastructure decides to place in front of it inside the context window.
That distinction is the foundation of everything in this post. Once you understand it, the five memory types stop being abstract concepts and start being concrete engineering decisions you make when designing an agent system.
The Context Window: Why It's at the Center of Every Memory Decision
Before we get into the memory types, you need to understand one thing clearly: the context window is the only reality the LLM has.
Every token the model can reason about (your message, the conversation history, retrieved documents, tool outputs, system instructions) must be inside the context window at the moment of inference. If it is not in the window, the model does not know it exists. Full stop.
This is why memory architecture matters so much. Context windows are finite: they have token limits, they cost money to fill, and they reset completely between sessions. You cannot just dump everything into them and call it done. You need a system that intelligently decides what information gets retrieved and injected into that window, and when.
That system is agent memory. And because different situations demand different kinds of information (recent conversation turns, user preferences, mid-task reasoning state, past interaction history, domain facts), there is not one type of memory but five, each built to retrieve and inject the right information at the right moment.
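At its core, that decision is a budgeting problem: which candidate snippets make it into the window before the token limit is hit. A minimal sketch, with illustrative names (`estimate_tokens`, `assemble_context` are not a real API) and a crude word-count proxy for tokens:

```python
# Minimal sketch of context-window assembly under a token budget.
# All names here are illustrative, not from any real framework.

def estimate_tokens(text: str) -> int:
    # Crude proxy: ~1 token per word. Real systems use the model's tokenizer.
    return len(text.split())

def assemble_context(candidates: list[str], budget: int) -> list[str]:
    """Greedily pack snippets that fit in the budget.

    `candidates` is assumed to be pre-sorted by priority, most important first.
    """
    window, used = [], 0
    for snippet in candidates:
        cost = estimate_tokens(snippet)
        if used + cost <= budget:
            window.append(snippet)
            used += cost
    return window

snippets = [
    "System: You are a helpful assistant.",                    # highest priority
    "User preference: concise answers.",
    "Earlier conversation summary: discussed pricing models.",
    "Full transcript of last week's 40-message session ...",   # large, low priority
]
print(assemble_context(snippets, budget=15))
```

The point of the sketch is the trade-off itself: something always gets left out, and the memory system is what decides what.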
How the Memory Problem Got Serious
AI applications did not start as agents. They started as simple request-response systems: you send a message, the model replies, nothing is retained. Each call was completely isolated from the previous one.
The first attempt to fix this was brute force: send the entire conversation history with every request. It worked well enough for short conversations, but it was never really memory; it was just a growing pile of text being thrown at the model each time. Once conversations got long enough, older messages fell off the token limit and disappeared. The "memory" was already leaking.
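The leak is easy to demonstrate. A minimal sketch of the brute-force approach, with an illustrative token limit and word-count tokenizer (both stand-ins, not real values):

```python
# Sketch of the brute-force approach: resend the whole history every call,
# truncating from the front once it exceeds the (illustrative) token limit.

def estimate_tokens(text: str) -> int:
    return len(text.split())  # crude word-count stand-in for a tokenizer

def build_prompt(history: list[str], new_message: str, limit: int = 20) -> list[str]:
    messages = history + [new_message]
    # Drop the oldest messages until everything fits: this is the "leak".
    while sum(estimate_tokens(m) for m in messages) > limit and len(messages) > 1:
        messages.pop(0)
    return messages

history = [
    "User: My name is Priya and I run a bakery.",
    "Assistant: Nice to meet you, Priya!",
    "User: I want a website for online orders.",
    "Assistant: Sure, let's plan the pages you need.",
]
prompt = build_prompt(history, "User: What was my name again?")
print(prompt)  # the earliest turns, including the name, have already fallen off
```

The model cannot answer the final question: the message that contained the name was silently truncated away.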
Then models gained the ability to call tools (APIs, databases, search engines) and the use case jumped entirely. Now you could build agents: systems that take a goal, break it into steps, call tools, observe results, and loop until the task is complete. Then came multi-agent systems, where specialized agents work as a team, routing tasks between each other like a coordinated workforce.
Each step forward made the memory problem worse. A single chatbot forgetting context is annoying. An agent losing state mid-task is a failure. A multi-agent system where no agent knows what the others have decided is a broken system. The "stuff everything into the context window" approach simply does not hold at this level of complexity.
What you need instead is intentional memory architecture: a layer that knows what to store, how long to keep it, and exactly when to surface it. That layer is built on five distinct memory types, each designed to solve a different part of the problem.
The 5 Types of Agent Memory
1. Short-Term Memory (STM): The Conversation Buffer
Short-Term Memory (STM) is the simplest form of agent memory and the one you are almost certainly already using without thinking about it.
Every message the user sends and every response the agent gives gets stored in a session buffer. That buffer gets assembled into the context window on every subsequent request. This is how the agent understands follow-up questions: when you say "make it shorter," it knows what "it" refers to because the prior exchange is sitting in the context window.
The technical implementation is a rolling token buffer. When the buffer approaches the model's token limit, older messages get truncated or summarized before dropping off. New inputs overwrite old ones. When the session ends, the buffer clears entirely.
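A rolling buffer like this can be sketched in a few lines. This is a minimal illustration, not any framework's actual class; the word-count tokenizer and 12-token budget are stand-ins:

```python
from collections import deque

# Minimal short-term memory: a rolling buffer that drops the oldest turns
# when an (illustrative) token budget is exceeded, and clears at session end.

class SessionBuffer:
    def __init__(self, max_tokens: int = 50):
        self.max_tokens = max_tokens
        self.turns: deque[str] = deque()

    def _tokens(self, text: str) -> int:
        return len(text.split())  # crude word-count proxy for tokens

    def add(self, turn: str) -> None:
        self.turns.append(turn)
        while sum(self._tokens(t) for t in self.turns) > self.max_tokens:
            self.turns.popleft()  # oldest turn falls out of the window

    def context(self) -> str:
        return "\n".join(self.turns)

    def clear(self) -> None:
        self.turns.clear()  # session over: STM is gone

buf = SessionBuffer(max_tokens=12)
buf.add("User: Write a product description for our blender.")
buf.add("Assistant: Here is a 200-word description ...")
buf.add("User: Make it shorter.")
print(buf.context())  # the earliest turn has already been truncated
```

Note that a production buffer would usually summarize the dropped turns rather than discard them outright, as the post mentions.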
Think of it like RAM in a computer: fast, active, and useful right now. But the moment you turn it off, it's gone.
What it solves: Conversation coherence within a single session. Follow-up questions. Context continuity across a short interaction.
What it does not solve: Anything beyond the current session. Come back tomorrow, and the agent has no idea who you are.
2. Long-Term Memory (LTM): Persistence Across Sessions
Long-Term Memory is what makes an agent feel like it actually knows you.
Instead of losing everything when a session ends, LTM stores important information in a persistent external store: user preferences, past decisions, project context, communication style, recurring constraints. The next time you interact with the agent, the most relevant pieces of that stored knowledge get retrieved and injected into the context window before the model ever sees your message.
The standard implementation uses a vector database like Pinecone, Weaviate, or ChromaDB. When something worth remembering happens, it gets converted into a vector embedding and stored with metadata. On future sessions, incoming queries trigger a similarity search; the top-k most semantically relevant memories are retrieved and quietly injected into context. The model then responds as if it already knew those things about you, because from its perspective, it does.
The workflow in practice:
- User shares something reusable: preferences, goals, constraints, project structure
- That information is embedded and stored in the vector database
- On every future session, a similarity search retrieves what is relevant
- Retrieved memories are injected into the context window before the model processes the request
- Memory updates when new important information is provided
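The workflow above can be sketched end to end. To stay self-contained, this uses a bag-of-words "embedding" and cosine similarity as stand-ins for a real embedding model and vector database; the class and method names are illustrative:

```python
import math

# Sketch of long-term memory with an in-memory stand-in for a vector DB.
# Real systems use an embedding model and a store like Pinecone or ChromaDB;
# bag-of-words vectors and cosine similarity illustrate the same flow.

def embed(text: str) -> dict[str, int]:
    vec: dict[str, int] = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a: dict[str, int], b: dict[str, int]) -> float:
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class MemoryStore:
    def __init__(self):
        self.items: list[tuple[dict[str, int], str]] = []

    def remember(self, fact: str) -> None:
        self.items.append((embed(fact), fact))  # embed and store

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[0]), reverse=True)
        return [fact for _, fact in ranked[:k]]  # top-k most similar

ltm = MemoryStore()
ltm.remember("User prefers cost over speed in trade-off decisions.")
ltm.remember("User's team uses the weekly-report format with bullet summaries.")
ltm.remember("User's favourite language is Python.")

# Next session: retrieve what is relevant, inject before the model sees the query.
print(ltm.retrieve("Should we prioritize cost or speed for this decision?", k=1))
```

The retrieved memory would then be prepended to the context window, so the model answers as if it had always known the preference.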
What it solves: Cross-session personalization. User preference retention. Long-running project continuity. Making the agent feel like a real colleague who knows your context.
Real example: An AI assistant that remembers your name, your team's preferred report format, and the fact that you always prioritize cost over speed in trade-off decisions, even when you return after weeks away.
3. Working Memory: The Reasoning Scratchpad
Working Memory is what the agent uses while it is actively thinking through a complex, multi step task.
Imagine you ask an agent to research five competitors, extract their pricing, compare them against your product, and write a summary recommendation. That is not one step; it is a chain of steps where each result feeds into the next. Working memory is the temporary store that holds intermediate results across those steps, so the agent does not lose track of what it has already done.
Without working memory, each loop iteration in an agentic workflow would start with no knowledge of previous iterations. The agent would spin in circles or repeat steps it had already completed.
The implementation is typically an in-memory structure (a dict or JSON object) maintained by the agent framework across loop iterations. At each step, the current working memory state gets injected into the context window alongside the new task, so the model can build on prior results. Once the task is complete, working memory is cleared.
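A minimal sketch of that loop, with the tool calls stubbed out (the competitor names and prices are invented for illustration):

```python
# Sketch of working memory as a dict carried across agent loop iterations.
# The research "tool" is a stub; a real agent would call search/scraping tools.

def research_competitor(name: str) -> dict:
    fake_prices = {"Acme": 49, "Globex": 79, "Initech": 59}  # stand-in data
    return {"name": name, "price": fake_prices[name]}

def run_task(competitors: list[str], our_price: int) -> str:
    working_memory: dict = {"findings": [], "our_price": our_price}

    # Each loop iteration reads and extends working memory, so step N
    # can build on the results of steps 1..N-1 without losing the thread.
    for name in competitors:
        working_memory["findings"].append(research_competitor(name))

    cheaper = [f["name"] for f in working_memory["findings"]
               if f["price"] < working_memory["our_price"]]
    summary = f"{len(cheaper)} of {len(competitors)} competitors undercut us: {cheaper}"

    working_memory.clear()  # task done: working memory is discarded
    return summary

print(run_task(["Acme", "Globex", "Initech"], our_price=60))
```

Without the shared dict, the comparison step would have no access to what the research steps found.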
What it solves: Multi-step task execution. Complex reasoning chains. Agentic loops that need to carry state from one iteration to the next without losing the thread.
Real example: An agent planning a travel itinerary holds flights, hotel constraints, budget limits, and date conflicts in working memory, building the full picture step by step before producing a final recommendation.
4. Episodic Memory: The Interaction Log
Episodic Memory gives an agent the ability to recall specific things that happened in the past: not just general preferences, but actual events with context and outcomes.
Where Long-Term Memory stores what you like, Episodic Memory stores what happened. It is a structured log of past interactions, each saved as an event record with a timestamp, the task that was performed, inputs, actions taken, and the outcome. Think of it as the agent's diary: specific, timestamped, retrievable.
When you come back and ask "what did we work on last week?" or "remind me of the decision we made on the pricing model," the agent queries the episodic store by timestamp, keyword, or semantic similarity, retrieves the relevant episodes, compresses them into a summary, and injects that summary into the current context window.
This is also what enables agents to say things like: "Last time you reviewed this type of document, you flagged the legal section first; want me to start there again?" That is episodic memory working correctly.
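An episodic store can be as simple as a list of timestamped records with a query method. This sketch uses keyword lookup only; the dates, tasks, and class name are invented for illustration, and a real implementation would also support semantic search over the records:

```python
# Sketch of an episodic store: timestamped event records queried by keyword.

class EpisodicStore:
    def __init__(self):
        self.episodes: list[dict] = []

    def log(self, when: str, task: str, outcome: str) -> None:
        # Each record is one "episode": what happened, and with what result.
        self.episodes.append({"when": when, "task": task, "outcome": outcome})

    def recall(self, keyword: str) -> list[dict]:
        kw = keyword.lower()
        return [e for e in self.episodes
                if kw in e["task"].lower() or kw in e["outcome"].lower()]

store = EpisodicStore()
store.log("2024-05-02", "Review vendor contract", "Flagged the legal section first")
store.log("2024-05-09", "Choose pricing model", "Picked Option A over B due to budget")

# "Remind me of the decision we made on the pricing model"
for episode in store.recall("pricing"):
    print(f'{episode["when"]}: {episode["outcome"]}')
```

The retrieved episode would be summarized and injected into the context window before the model responds.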
What it solves: Specific past event recall. Long-running project continuity. Agents that learn from experience and build on prior decisions rather than repeating mistakes.
Real example: "Last time you chose Option A over Option B because of budget; should I apply the same logic here?" That sentence could only come from an agent with episodic memory.
5. Semantic Memory: The Knowledge Layer
Semantic Memory is the agent's understanding of the world (facts, concepts, domain knowledge, relationships between things), independent of any specific interaction with you.
It is not about your history with the agent. It is about what the agent knows to be true. That Python is a programming language. That Singapore's corporate tax rate is 17%. That a JWT token expires and must be refreshed. This kind of knowledge lives either in the model's pre-trained weights or, more usefully for domain-specific and up-to-date needs, in an external knowledge base accessed through RAG (Retrieval-Augmented Generation).
When you ask a factual or domain-specific question, the agent does a semantic search against the knowledge base, retrieves the most relevant facts, injects them into the context window, and generates a grounded response. This is how you build agents that give accurate answers in specialized domains without hallucinating details they were never trained on.
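The retrieve-then-ground pattern can be sketched with a toy fact base. Word-overlap scoring stands in for the embedding search a real RAG pipeline would use; the fact base and function names are illustrative:

```python
# Sketch of RAG-style semantic memory: retrieve relevant facts, then build
# a grounded prompt. Real systems use embeddings and a vector database.

KNOWLEDGE_BASE = [
    "Python is a programming language created in the early 1990s.",
    "Singapore's corporate tax rate is 17%.",
    "JWT access tokens expire and must be refreshed.",
]

def retrieve_facts(question: str, k: int = 1) -> list[str]:
    q_words = set(question.lower().split())
    scored = sorted(
        KNOWLEDGE_BASE,
        key=lambda fact: len(q_words & set(fact.lower().split())),
        reverse=True,
    )
    return scored[:k]  # facts with the most word overlap, a stand-in for similarity

def grounded_prompt(question: str) -> str:
    facts = "\n".join(retrieve_facts(question))
    return f"Answer using only these facts:\n{facts}\n\nQuestion: {question}"

print(grounded_prompt("What is Singapore's corporate tax rate?"))
```

The model then answers from the injected facts rather than guessing from its training data, which is what keeps it grounded.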
What it solves: Factual accuracy. Domain-specific expertise. Keeping agents grounded in verified knowledge beyond their training cutoff. Enterprise knowledge bases where accuracy is non-negotiable.
Real example: An agent asked "Is Bangalore more populous than Amaravathi?" does not guess from training data; it queries semantic memory, retrieves the fact, and answers with confidence.
How All Five Work Together
These memory types are not mutually exclusive; a well-designed agent uses all of them simultaneously, each handling a different layer of the memory problem.
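At inference time, the combination is concrete: each layer contributes a slice of the context window. A sketch with all five stores stubbed out (every string below is invented for illustration):

```python
# Sketch of how the five memory types combine into one context window.
# Each store is a stub; in practice these are real retrieval calls.

def build_context(user_message: str) -> str:
    semantic   = ["Fact: Singapore's corporate tax rate is 17%."]       # knowledge base
    long_term  = ["Preference: user favours concise answers."]          # vector store
    episodic   = ["2024-05-09: chose Option A over B due to budget."]   # event log
    working    = ["Step 2/4 done: competitor prices collected."]        # scratchpad
    short_term = ["User: compare the two pricing options."]             # session buffer

    sections = semantic + long_term + episodic + working + short_term
    return "\n".join(sections + [f"User: {user_message}"])

print(build_context("Which option is cheaper after tax?"))
```

The model sees one flat prompt, but each line in it was put there by a different memory layer making a different retrieval decision.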
The Tools or Frameworks That Make This Real
This is not theoretical. The tooling is production ready right now.
LangChain handles buffer memory, summary memory, and vector-based LTM out of the box. It is the most flexible starting point for composing memory types together in one agent.
LlamaIndex is purpose-built for connecting external knowledge sources (PDFs, APIs, databases, knowledge graphs), making it the go-to for RAG-heavy Semantic Memory implementations.
Pinecone, Weaviate, and ChromaDB are dedicated vector stores that power both LTM and Semantic Memory with fast, scalable similarity-based retrieval.
LangGraph brings graph-based orchestration to stateful, multi-step agentic workflows; this is what Part 2 uses to wire all five memory types into a real working system.
AWS Strands Agents provides production-grade agent infrastructure with memory at cloud scale, also covered hands-on in Part 2.
Thanks
Sreeni Ramadorai





