In the first part of this 3-part series, I covered AI memory and its classification.
Here, I will start with a deep dive into the types of AI memory, and then discuss the pain points in AI memory architectures.
A Deep Dive Into AI Memory Classification
As I mentioned in the first part, human and AI memory work differently. The broad classification into short-term and long-term AI memory deserves a closer look.
Short-Term Memory: Managing Ephemeral Context
This spans a single conversation session: the memory is preserved temporarily across a handful of prompts and responses. The context window therefore encompasses the whole chat in the session, typically managed with sliding windows or token-based buffers.

Short-term memory can hence be defined as the system’s ability to maintain continuity within a specific session or task. Working memory is a subset of short-term memory: while working memory deals only with the tokens being processed immediately, short-term memory can strategically manage continuity as the conversation grows.
Implementation Strategies
Let's start by understanding why short-term memory fails: the conversation outgrows the context window, so the model cannot remember anything beyond its token limit. Once the window is full, the system either overwrites earlier information or fails to add new information, leading to lost context and possible hallucination.
There are three techniques to manage this:
- Conversation Buffers: These maintain a list of recent messages. There are two types of buffers - full, which remembers everything, and windowed, which retains the last few turns, ensuring the model works within its limits and the earlier context is not totally lost.
- Summarization: This technique compresses the conversation history into a concise summary of essential facts by forgetting redundant details. While this can extend the session limits, nuances from earlier context might get filtered out.
- Token Budgeting: This optimization technique prunes rather than compresses: only the system prompt and the latest five turns are preserved, and the rest is discarded. Research on the so-called "lost in the middle" effect suggests models often ignore this middle data anyway, so discarding that portion of the context is an acceptable trade-off.
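The windowed-buffer and turn-based budgeting ideas above can be sketched in a few lines. This is a minimal illustration, not a production implementation; the 2-turn window and the `(role, text)` tuple representation are arbitrary choices for the example.

```python
class ConversationBuffer:
    """Windowed buffer: keep the system prompt plus the last N turns."""

    def __init__(self, system_prompt, window=4):
        self.system_prompt = system_prompt
        self.window = window
        self.turns = []  # list of (role, text) tuples

    def add(self, role, text):
        self.turns.append((role, text))

    def context(self):
        # Always keep the system prompt; prune everything except the
        # most recent `window` turns (budgeting by turn count).
        recent = self.turns[-self.window:]
        return [("system", self.system_prompt)] + recent


buf = ConversationBuffer("You are a helpful assistant.", window=2)
for i in range(5):
    buf.add("user", f"question {i}")
ctx = buf.context()
print(len(ctx))     # 3: system prompt + last 2 turns
print(ctx[-1][1])   # question 4
```

A real system would budget by token count rather than turn count, but the pruning logic is the same.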
The Key-Value (KV) cache plays a vital role in short-term memory. It preserves the exact attention states (keys and values) computed for recent turns, so they need not be recomputed, and can sometimes perform better than standard Retrieval-Augmented Generation (RAG). There is, however, a trade-off between speed and memory capacity.
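The reuse mechanism behind KV caching can be illustrated with a toy prefix cache. Real KV caches hold per-layer attention keys and values inside the model; here the cached "state" is just a processed-token count, purely to show why a shared prefix saves work.

```python
class PrefixCache:
    """Toy prefix cache: sequences sharing a cached prefix only
    pay for their new suffix."""

    def __init__(self):
        self.cached_prefixes = []  # token tuples already "processed"

    def process(self, tokens):
        tokens = tuple(tokens)
        # Find the longest cached prefix of this token sequence.
        best = 0
        for p in self.cached_prefixes:
            if tokens[:len(p)] == p and len(p) > best:
                best = len(p)
        self.cached_prefixes.append(tokens)
        return len(tokens) - best  # tokens that must be recomputed


cache = PrefixCache()
turn1 = ["sys", "hello", "assistant_reply"]
print(cache.process(turn1))                       # 3 (cold cache)
print(cache.process(turn1 + ["next_question"]))   # 1 (prefix reused)
```

In a chat session every turn extends the same prefix, which is exactly why KV caching is so effective for short-term memory, and why it consumes memory proportional to the conversation length.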
Long-Term Memory: Shift to Persistent and Adaptive Knowledge
This spans multiple conversation sessions. The memory is preserved for longer periods, resulting in better context. External databases (vector databases, key-value stores) help to refer to persistent data outside the model.
Long-term memory can hence be defined as the system’s ability to maintain continuity across sessions and days, sometimes even weeks, months, or years. It builds relationships rather than transactions, as external databases synchronize with the model’s built-in knowledge, and RAG is the single most critical technology to achieve this.
RAG Pipeline and Role of Vector Databases
A typical RAG pipeline flows one way. Documents are converted into high-dimensional vector embeddings and stored in a vector database. The advantage of the RAG system is that the data is organized by semantic meaning instead of literal keyword matches.

So, when a user submits a prompt, the system embeds the query and retrieves the chunks of data most similar to it. This information is pulled into the context window, and the AI can generate answers based on private, proprietary, or real-time data that was not necessarily part of its original training. Research suggests this can cut down factual errors and increase efficiency.
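The embed-and-retrieve step can be sketched as follows. The bag-of-words "embedding" and in-memory index are stand-ins for a real embedding model and vector database; only the shape of the pipeline (embed documents, embed query, rank by cosine similarity) carries over.

```python
import math
from collections import Counter

def embed(text):
    # Stand-in for a real embedding model: a bag-of-words vector.
    return Counter(w.strip(".,?!") for w in text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

documents = [
    "the billing cycle resets on the first of each month",
    "password resets require a verified email address",
    "our office is closed on public holidays",
]
index = [(doc, embed(doc)) for doc in documents]  # toy "vector database"

def retrieve(query, k=1):
    q = embed(query)
    ranked = sorted(index, key=lambda d: cosine(q, d[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

print(retrieve("How do I reset my password?"))
# -> ['password resets require a verified email address']
```

The retrieved chunk would then be prepended to the prompt before the model generates its answer.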
From Stateless RAG to Stateful Memory Loops
An essential point to note here, if it has not been apparent, is that RAG is fundamentally stateless. Every query is independent, and the system learns nothing from interacting with the user. The next logical step is therefore moving toward stateful AI memory architectures that operate as a continuous learning loop.

There are four critical components of a truly stateful memory system:
- Extraction: First, the LLM evaluates user interactions to identify and filter salient facts or user preferences.
- Synthesis and Learning: Next, the new information is reconciled with existing knowledge. The system determines whether the new information is an entirely new addition or should overwrite older, redundant data.
- Conflict Resolution: Next, any contradiction arising from the new data addition is resolved by the intelligent agentic system so that consistency and continuity are maintained.
- Consolidation: Finally, all knowledge deemed as significant is moved from the short-term context to a long-term Memory Graph or relational store.
It stands to reason that such a stateful loop needs multi-agent collaboration.
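The four-stage loop above can be sketched in miniature. The keyword-based "extraction" and last-write-wins conflict policy below are placeholder heuristics; a real system would use an LLM (or several collaborating agents) for each stage, and a memory graph rather than a flat dictionary.

```python
class MemoryStore:
    def __init__(self):
        self.long_term = {}  # consolidated facts: key -> value

    def extract(self, message):
        # Extraction: pull salient "key: value" facts from the text.
        facts = {}
        for part in message.split(";"):
            if ":" in part:
                key, value = part.split(":", 1)
                facts[key.strip()] = value.strip()
        return facts

    def update(self, message):
        for key, value in self.extract(message).items():
            if key in self.long_term and self.long_term[key] != value:
                # Conflict resolution: here, the newer statement wins.
                print(f"resolving conflict on {key!r}")
            # Synthesis + consolidation: merge into long-term store.
            self.long_term[key] = value


store = MemoryStore()
store.update("timezone: UTC; language: English")
store.update("timezone: CET")  # triggers conflict resolution
print(store.long_term["timezone"])  # CET
print(store.long_term["language"])  # English
```

Even this toy version shows why the loop is stateful: the second interaction changes what the system remembers from the first.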
Since we are discussing the various types of AI memory, there is another that deserves mention - User Profile Memory. It is usually a subset of the long-term memory with particular emphasis on the user and their preferred language, time zone, conversational and response style, etc. These structured user profiles are stored in databases and injected into prompts, giving the appearance of personalized conversations.
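Profile injection is simple to illustrate. The profile fields, the in-memory table, and the prompt template below are all illustrative assumptions, not a fixed schema.

```python
profiles = {  # stand-in for a database table keyed by user id
    "u123": {"name": "Ada", "timezone": "UTC", "style": "concise"},
}

def build_prompt(user_id, question):
    # Inject the stored profile into the prompt so the model can
    # tailor its response without any true long-term learning.
    profile = profiles.get(user_id, {})
    header = "; ".join(f"{k}={v}" for k, v in profile.items())
    return f"[user profile: {header}]\nUser asks: {question}"

print(build_prompt("u123", "What's on my calendar today?"))
```

Because the profile is re-injected on every request, the personalization is an appearance layered on top of a stateless model, exactly as described above.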
Security and Privacy Risks in AI Memory Architectures
AI memory is built over time from huge, complex datasets that can contain critical and sensitive information. As vector databases and RAG pipelines become live memory, they inevitably become targets for serious security threats.
Data Leakage and Unauthorized Access
This is related to the exposure of Personally Identifiable Information (PII) or proprietary content. RAG systems pull context from a vast range of documents and sources, so all the data needs to be properly cleaned, classified, and scoped to stop the model from revealing sensitive facts by mistake. This type of context leakage is common when one user’s prompts retrieve another user’s private embeddings due to misconfigured access controls.
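One common mitigation is to scope retrieval before ranking ever happens. A sketch, assuming an illustrative per-chunk owner tag:

```python
chunks = [  # each stored chunk carries an owner tag (illustrative scheme)
    {"owner": "alice", "text": "alice's salary is 90k"},
    {"owner": "bob",   "text": "bob prefers email support"},
    {"owner": "*",     "text": "support hours are 9am-5pm"},  # public
]

def retrieve_for(user, query):
    # Access control happens BEFORE similarity ranking: only the
    # user's own chunks and public ones are even candidates, so a
    # prompt can never surface another user's private embeddings.
    visible = [c for c in chunks if c["owner"] in (user, "*")]
    # (embedding-similarity ranking of `visible` omitted for brevity)
    return [c["text"] for c in visible]

print(retrieve_for("bob", "when is support open?"))
# -> ["bob prefers email support", "support hours are 9am-5pm"]
```

The key design point is that filtering is enforced in the retrieval layer itself, not left to the model's discretion after sensitive text has already entered the context window.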
Embedding Inversion and Data Poisoning
Embedding Inversion occurs when vector embeddings, which encode high-dimensional semantic relationships, are inverted to reconstruct the original source text, compromising anonymity.

Data Poisoning occurs when malicious actors inject false or adversarial data, such as instructions hidden inside documents, into a knowledge base accessed by the model, leading to erroneous or harmful responses.
In the concluding part of the series, I will discuss a potential solution in decentralized and privacy-first AI, along with a mention of working use cases of portable AI memory.

