Ryan Palo for Daily Context

Posted on Jul 2

Beyond Context

#aie #ai #development #architecture

AI Engineer World's Fair Coverage

The models are getting better. As you may have seen on multiple graphs presented here at the World's Fair, that 50% success-rate task duration horizon is growing longer and longer. As the horizons get longer, context management is getting more critical, and the relatively new and highly active topic of memory management is even more important still.

When I talk about memory, I'm not talking about computer memory (RAM). That's not exciting or novel (currently, it's mostly depressing). I'm talking about your agent's memories, or, how an agent keeps track of what it has done, what it knows, and what it needs to do. I'm sorry for the confusion: naming software things is hard. That being said, for the rest of this article, you should read the word "memory" as "facts that the agent definitely knows."

This is a very exciting research space that's evolving rapidly. In this article, we'll go over a collection of 30 different memory management paradigms, tools, and concepts and when you might want to reach for them. Throughout this, I'll be pulling from a very cool repository:

NirDiamant / Agent_Memory_Techniques

Agent memory for LLMs: 30 runnable Jupyter notebooks covering conversation buffers, vector stores, knowledge graphs, episodic and semantic memory, MemGPT, Mem0, Letta, Zep, Graphiti, LoCoMo benchmarks, and production patterns.

🧠 Agent Memory Techniques

Learn every agent memory technique for LLM agents.

⭐ If you find this useful, please star the repo so more learners can discover it.

🧭 New here? Start with 01 Conversation Buffer Memory or pick a Learning Path. Prefer a visual? See the Decision Tree below. 30 runnable Jupyter notebooks covering conversation buffers, vector stores, knowledge graphs, episodic and semantic memory, working memory, MemGPT, Mem0, Letta, Zep, Graphiti, LoCoMo benchmarks, and production memory patterns.

📖 Go deeper on RAG

RAG Made Simple - the 400-page visual guide to RAG, by the author of this repo Amazon Bestseller in Generative AI · 1,500+ readers · ⭐ 4.6

Get it - 33% off with code RAGKING → · Read Chapter 1 free

📫 Stay Updated

🚀
Weekly
Updates

💡
Expert
Insights

🎯
Top 0.1%
Content

Join over 50,000 readers getting clear AI tutorials every week. Subscribers also…

View on GitHub

I recommend reading through yourself to do a deeper dive. These memory management tools are broken into six categories based on the type of issues they solve, and we'll go over those six categories below.

Short-Term

These short-term solutions are methods for "[k]eep[ing] recent turns in memory without filling up the context window."

The first one becomes the foundation for all the others to improve upon. When using Conversation Buffer memory, you save the entire chat history verbatim and send the full entirety on every LLM call. It's clear to see that this becomes untenable fairly quickly, context-wise, but it is the smallest and requires the least setup. The main concerns to keep track of here are who said what and how many tokens you're using.

Sliding Window memory is the same in effect, but you only keep the last N messages in memory to avoid context overrun. You can be sure you're not going to run into context limits, but your model will have a strict "long-term memory-loss" issue. It won't remember anything that happened a fixed number of tokens ago.

Summary memory replaces the raw running conversation history with a running LLM-generated summary of the previous history. This also keeps you from overrunning context, and it keeps at least a modicum of long-term history around, as long as that information is important enough to make it into the summary. This has a hidden side benefit of dropping out things our summarizer considers irrelevant, keeping hopefully the most important things in context. The other hidden downside is the secondary LLM call to summarize the context for every turn.

Summary Buffer strikes a balance between the previous two techniques and gets the closest to the basic functionality of current coding tools: keep the most recent N messages in memory verbatim, but summarize the older messages instead of directly forgetting about them. This balance comes at the cost of needing its own tuning. You can see we're starting to get more into "systems" of memory requiring configuration and places to stash data. You have to tune your verbatim buffer size vs. your summary refresh rate vs. your token budget, compacting only when it is cost-effective to do so.

The final technique for this section is Token Buffer. The mechanics are similar to the Summary Buffer, but you need access to your model's tokenizer so that you can keep manual track of exact tokens in the conversation history. The upside is that it accounts for individual messages that are longer and require more tokens, keeping tighter control over token cost at the sacrifice of some amount of extra context.

Long-Term

Long-term solutions solve the problem of "[s]aving knowledge across sessions, users, and time." This problem comes into play when you start to think more "agentically" about running an agent to do a task and then disposing of it while still persisting anything it learned along the way. These are six different ways to store memories.

We'll start with the first real "big kid" memory type: Vector Store. Each turn of the conversation gets embedded into a vector database and can be retrieved based on similarity with past exchanges. This requires (you guessed it) a vector database and an embedding model to generate these embeddings. One benefit here is that it interleaves nicely in systems where you already need embeddings and a vector DB for RAG-style generation.

While vector stores can get you semantic similarity, if you want to give your agent real memory about "things," or, more specifically "entities," you'll want to use Entity Memory. It involves using Named Entity Recognition (NER) through LLM prompts or NER models to identify people, places, items, concepts, and more, and then saving those into a durable place. It can keep this as a continuously evolving, living store, and inject relevant entity records into the context as needed. This is especially great for research and other places where remembering definitions and context about specifically loaded terms is important. That being said, similar terms with different meanings (like our "memory" confusion in the intro!) can give this method trouble.

Entity memory stores facts about concrete things, but not their relationships. Knowledge Graph Memory is the solution that solves that problem. This extends the NER mentioned above and specifically asks a model to extract relationships as "subject-predicate-object" triples, e.g., shark - eats - fish, and fish - live in - ocean. This would allow you to ask a more complex question like "where can I find the things sharks eat?" and follow the graph to get much more targeted context. The downside here is that the process of generating these relationships is token-heavy and tokenizing graphs or subgraphs is also token-heavy. Maintaining an accurate and not-noisy graph is also a challenge. This approach usually makes the most sense for applications that involve complex relationship webs at their core.

Episodic Memory captures separate conversation sessions as discrete episodes indexed by time and topic. This is great for maintaining information where when the conversation took place is just as important as what the conversation was about: coaching agents, meeting-based agents, and the like.

By contrast, Semantic Memory is the opposite: after every turn, an LLM prompt scans for statements that express persistent facts and keeps a store of those. These facts must be embedded to check for similarity and/or contradiction. These are the memories where a user might say, "Well, I told the agent that; it ought to remember it." There are two difficulties here. First, the LLM has a hard time differentiating between fact and strongly stated opinion by default, so it may track opinions as facts in a way that needs to be reviewed. Second, a mechanism for decay over time can sometimes be needed to invalidate old facts as they age unless they are reinforced every so often.

Surprisingly, Procedural Memory is probably the kind you have the most experience with if you've been working with AI for any period of time. It's skills! Procedural memory is a durable recollection of detailed step-by-step workflows that allow an agent to follow a procedure rather than reason it through from start to finish. These often need to be curated by the user, in conjunction with the agent, but they can be extremely powerful.

Cognitive Architectures

Cognitive architectures provide a way for your agent to decide on its own what context is meaningful to the task at hand. We're going to breeze through these a little faster because there are eight of them, but definitely reference the GitHub repo mentioned to dive deeper. The main goal of the techniques in this section is deciding what to keep in working context memory, what to get rid of, and what to do with the memories it's getting rid of.

Working Memory & Context Window is the base option: let the agent decide which memories and facts are relevant and occasionally dump the least relevant memories when storage pressure gets too high. Hierarchical Memory builds on that by adding a multiple stages of "cold storage" where memories get demoted when their importance is deemed less important, and facts that are most important or accessed the most frequently are pulled toward the top layers.

Memory Consolidation is the process of consolidating memories that are similar enough into one stored entity to reduce space requirements, and Memory Compaction is taking enough of these similar or related entities and converting them to a single smaller entity by summarization.

Self-Reflecting Memory is when the agent generates memories by reflecting on the success or failure of past actions in order to improve performance on future tasks: "I tried running X command and it only worked when I added the --please-work flag."

Memory Routing is when you have multiple different types of memory stores based on what kind of memory it is and you need to search or update amongst them. Maybe you have an entity store for definitions as well as an episodic store for various user sessions.

Temporal Memory is the idea that more recent memories are likely more important, and Forgetting and Decay is the flip side of that coin, that memories should age and eventually be forgotten if they're not used or referenced.

Retrieval & Routing

In the vein of having multiple ways of storing memories and needing to figure out what to recall, from where, and when, these are four concepts that dive deeper:

Cross-Session and Multi-Agent memory are reasonably straightforward: persisting memories to share with an agent's future self or with other agents. Memory with Tools is a neat idea of giving the agent a tool to perform CRUD operations on its own memories, using it just like any other tool it has. The term here with the most weight (LLM pun mediumly intended) is Memory Retrieval Patterns.

With all of the other techniques from the previous section for weighting, classifying, grouping, and prioritizing memories, memory retrieval patterns are really all about how the most relevant and important memories are retrieved, and it involves a synthesis of multiple storage modes and techniques. Semantic and keyword searches are run, all of the mixed results are ranked together and then top candidates are scored again more precisely, and then the final selected results, optimized for diversity, are finally injected into the context window. It's important to keep in mind that each layer of storing, retrieving, ranking, encoding, and processing adds a little latency to each turn, and that overall accuracy should be balanced versus overall latency.

Frameworks

These four tools are all production-ready memory libraries to help you avoid rebuilding the memory wheel. Again, for content-length, and because the previous sections already covered the main fundamentals, we'll just list these as references.

Graphiti: A Neo4j-backed temporal memory and knowledge graph with automatic entity and relationship extraction
Mem0: A REST API and Python SDK for keeping track of user-scoped memories
Letta (formerly MemGPT): A Hierarchical (see, you know what that means now!) memory store with three tiers: Core memory, Archival (searchable) memory, and Recall (full conversation) memory
- If you check out the repo, there's an iPython notebook where you can follow along building your own MemGPT-style memory loop from scratch!
Zep: A managed memory service that handles all of the classifiers, entity extraction, and relationship graphing (powered by Graphiti, if you'll believe it) for you.

Evaluation & Production

These last three concepts are ones that you will need to keep in mind as you head to production at a larger scale: Evaluation is the idea of intermittently measuring the quality of your agent's memory system using metrics like precision, recall, staleness detection, and LLM-as-a-Judge scoring. Benchmarks are the idea of evaluating your system against stable, standardized datasets to measure against a published baseline. The last item in our list, Production Memory Patterns is another catch-all for concepts and topics needed to fully productionize a memory system at scale: TTL, sharding, compliance, and observability. Check out the repo to dive deeper.

Memory is Data

The important thing to remember is that agent memory is, at its core, just data. Depending on what your application needs, you'll need to decide which pieces of memory metadata are important to you, how to quantify the things that need to be quantified, how to store the things that need to be stored, how to look them up again later, and how to do that reliably at whatever scale you need. Unlocking the separation between "what does my agent know" and "what does my agent need to know right now" is the key to helping our agents keep up with this ever-expanding capability horizon.

DEV Community