View this blog on my personal site: www.kalebgarner.dev/work/r-memory
Have you ever noticed that LLMs have a hard time remembering smaller, specific details? A highly productive, multi-turn conversation can veer off course the moment the model forgets a crucial piece of information mentioned just a few minutes earlier. This isn't a quirk; it is an architectural flaw. LLMs are great at reasoning, but they should not be used as a database.
When you think about high-stakes, detail-oriented fields like medicine, using the LLM's short-term memory is not reliable and, at times, not safe. But there is a way to make our critical details persistent, structured, and immune to being pushed out of a context window.
Finite Memory in LLMs
Before we take a deeper dive, what exactly are context windows?
A context window is the amount of text an LLM can "remember" at one time. The text in this window is what is processed by the model to generate its next response. But there is a key design flaw with these context windows when it comes to longer, complex conversations.
The biggest issue is that this window is finite. As you continue your conversation and feed the model more tokens, older information is pushed out of the window and lost for the rest of the conversation. This is often referred to as context window overflow. Mention a detail, change the subject for a few turns, and that detail can be lost to the overflow.
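To make this concrete, here is a minimal sketch of the failure mode. The token budget and the word-count "tokenizer" are stand-ins; real providers truncate differently, but the effect is the same: once the budget is exceeded, the oldest turns are gone.

```python
# Minimal sketch of context-window overflow: a fixed token budget forces the
# oldest turns out of the history. Word count stands in for a real tokenizer.
MAX_TOKENS = 50  # tiny illustrative budget; real models allow far more

def truncate_history(messages: list[str], max_tokens: int = MAX_TOKENS) -> list[str]:
    kept, used = [], 0
    for msg in reversed(messages):      # walk backwards: keep the newest turns
        cost = len(msg.split())         # crude stand-in for token counting
        if used + cost > max_tokens:
            break                       # everything older is silently dropped
        kept.append(msg)
        used += cost
    return list(reversed(kept))

history = ["I am severely allergic to Penicillin."] + [
    f"Unrelated turn {i} about scheduling, billing, and other topics." for i in range(10)
]
print(truncate_history(history))  # the allergy statement no longer fits in the window
```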
This limitation can cause major issues when it comes to crucial details in fields like medicine, law, and finance. When it comes to these high-stakes conversations, forgetting a single detail can be catastrophic.
The Cost of Forgetting
Let's picture an LLM being used in a clinical setting, designed to assist doctors in diagnosing patients.
Crucial Detail
A new patient, John Doe, is seen for an initial physical. During the history intake, a key detail is mentioned: "I am severely allergic to Penicillin."
The LLM processes this detail and logs it into the current context window alongside the rest of the visit notes.
Token Overflow
Two days later, Mr. Doe returns to the clinic complaining of a severe sore throat. Since then, the doctor has used the same model for research, intake for different patients, and other daily tasks, generating thousands of tokens that are now stored in the context window.
The doctor orders a rapid strep test, which returns positive. The doctor asks the LLM: "Patient John Doe has tested positive for Strep. Recommend the standard first-line antibiotic."
When the model processes this request, it has no lasting context of this patient's allergy to penicillin. The only information about Mr. Doe is in the final prompt. Based on the information in the context window, the LLM would recommend the first-line treatment for strep throat... Penicillin. This would cause a life-threatening reaction for Mr. Doe.
Current Solutions and Their Drawbacks
So, what are LLMs currently doing to solve this context overflow problem?
1. Increase Context Windows
An easy thought to have is to simply increase the context window size. This could help, but it has significant drawbacks.
- The first issue is cost. Supporting and processing more tokens demands more compute and memory, which can be a significant cost barrier.
- The second issue is that the context window is still finite. Sure, we increased the window and allowed more tokens to be processed, but there is still a cutoff point where information is lost.
- Finally, we have to consider the computational cost of processing more tokens. The self-attention mechanism in these models is quadratic (O(N²)), meaning doubling the context window roughly quadruples the attention computation and memory, as the quick illustration below shows.
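As a rough back-of-the-envelope illustration (ignoring constants, attention heads, and kernel-level optimizations), the N x N score matrix is what drives this scaling:

```python
# Rough illustration of quadratic self-attention scaling: the N x N score
# matrix dominates as the context grows (constants and optimizations ignored).
def attention_score_entries(n_tokens: int) -> int:
    return n_tokens * n_tokens  # one score per (query, key) pair

for n in (4_000, 8_000, 16_000):
    print(f"{n} tokens -> {attention_score_entries(n):,} score entries")
# Doubling the window from 4k to 8k tokens quadruples the score matrix;
# 16k costs 16x as much as 4k.
```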
2. Summarization
This process involves running a secondary function to compress the existing conversation into a shorter summary that is prepended to the context window.
- This approach is lossy compression: the conversation is shortened by discarding data, and there is no guarantee that a summary preserves every key detail. In the case of Mr. Doe, the summary may remember that he is a patient, but it isn't guaranteed to retain his allergy to penicillin.
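A typical summarize-and-prepend loop looks something like the sketch below. The `summarize` function here is only a stand-in for a real LLM call; the point is that every compression pass forces old facts to compete with new ones for space.

```python
# Minimal sketch of summarize-and-prepend. `summarize` is a stand-in for an
# LLM call; here it keeps only the most recent words, mimicking a summary
# that favors recent content over older details.
def summarize(text: str, max_words: int = 100) -> str:
    return " ".join(text.split()[-max_words:])

def compact_history(summary: str, new_turns: list[str]) -> str:
    combined = summary + "\n" + "\n".join(new_turns)
    return summarize(combined)  # old facts compete with new ones for space

summary = compact_history("", ["Patient reports a severe Penicillin allergy."])
for _ in range(5):
    summary = compact_history(summary, [f"Routine note {i}..." for i in range(50)])
# After a few passes, the allergy has been squeezed out of the summary entirely.
```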
3. Retrieval Augmented Generation (RAG)
RAG is a process where we utilize an external database to retrieve relevant information and include it within our prompt. This process involves three key steps:
- Embedding: All external documents are broken into text chunks and converted into vector embeddings.
- Retrieval: The system embeds the query and searches the database for chunks that are semantically similar to it.
- Generation: The LLM processes the retrieved chunks and the original query to generate a response.
In our medical LLM, we could have a process that chunks and embeds our patient data. But, even in this process, there is still an issue:
- Even though we are retrieving relevant documents, there is no structure; the details we retrieve are buried in large blocks of text, and retrieval is driven by semantic similarity. If the critical detail we need is only weakly similar to the query, the RAG system may not retrieve it, as the sketch below shows. While this is a valid approach for some applications, it isn't reliable when crucial, high-stakes details are on the line.
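A toy sketch of the retrieval step makes the failure mode visible. Word overlap stands in for real vector embeddings here, which is a deliberate oversimplification, but the ranking behavior is the same: chunks are retrieved by similarity to the query, not by how critical they are.

```python
# Toy sketch of similarity-based retrieval. Word overlap stands in for vector
# embeddings; the failure mode is the same either way: chunks are ranked by
# similarity to the query, not by how critical they are.
def similarity(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

chunks = [
    "Visit note: patient reports severe allergy to Penicillin (anaphylaxis).",
    "Strep throat is usually treated with a standard first-line antibiotic.",
    "Clinic guideline: antibiotics should be prescribed per first-line protocol.",
]
query = "Recommend the standard first-line antibiotic for strep"

ranked = sorted(chunks, key=lambda c: similarity(query, c), reverse=True)
print(ranked[:2])  # top-k retrieval: the allergy chunk ranks last and is dropped
```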
These approaches are valid, but they can still be unreliable, costly, or inefficient. What if we could maintain key details from our conversation outside of the context window, in a reliable, structured format?
Relational Memory (R-Memory)
This is where the Relational Memory (R-Memory) architecture comes in. R-Memory moves these critical facts out of the context window and into a permanent, structured external store. In this architecture, we utilize the LLM as a reasoning engine, not a database.
The Structure
This R-Memory system will be orchestrated by a Multi-Context Processing (MCP) Server that consists of three primary components:
- Extraction Model (The Translator):
This is a specialized, smaller model whose sole purpose is to ingest unstructured text, extract the high-signal, relational facts, and convert them into a structured format: Triples (Subject, Relation, Object).
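A minimal sketch of the extractor's contract might look like the following. The `call_extraction_model` callable, the prompt wording, and the JSON output format are illustrative assumptions, not a fixed specification:

```python
# Minimal sketch of the Extraction Model's contract. `call_extraction_model`
# is a placeholder for whatever small model you deploy; the prompt wording and
# JSON output format are illustrative assumptions.
import json

EXTRACTION_PROMPT = (
    "Extract the high-signal facts from the clinical note below as a JSON list of "
    'triples: [{"subject": ..., "relation": ..., "object": ..., "properties": {...}}]. '
    "Only include facts that must never be forgotten (allergies, contraindications, dose limits).\n\n"
    "Note:\n"
)

def extract_triples(note: str, call_extraction_model) -> list[tuple]:
    raw = call_extraction_model(EXTRACTION_PROMPT + note)
    return [
        (t["subject"], t["relation"], t["object"], t.get("properties", {}))
        for t in json.loads(raw)
    ]

# Expected shape for "I am severely allergic to Penicillin.":
# [("John Doe", "HAS_ALLERGY_TO", "Penicillin", {"severity": "Anaphylaxis"})]
```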
- Knowledge Graph (KG) (The Memory):
The KG is a specialized database designed to store data as Nodes (entities) and Edges (relationships), rather than as tables or vectors. This gives the data the structure and integrity needed for reliable retrieval. As the architect, you define the core schema.
For example, our initial patient visit results in the Extraction Model converting the text into Triples:

```cypher
// Cypher - query language for Neo4j
CREATE (p:Patient {id: 'JD001', name: 'John Doe'})
CREATE (d:Drug {name: 'Penicillin'})
CREATE (p)-[:HAS_ALLERGY_TO {severity: 'Anaphylaxis'}]->(d)

// Interpretation:
// (Patient Node) -[Relation/Edge with property]-> (Drug Node)
```
- Node Labels: You define the entities (e.g., `:Patient`, `:Drug`, `:Condition`, `:Policy`).
- Edges: You define the links between the Nodes (e.g., `:HAS_ALLERGY_TO`, `:HAS_FAMILY_HISTORY`, `:DOSE_LIMIT`).
By defining this schema, we prevent ambiguity and ensure the system's memory is consistent and predictable.
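For example, if Neo4j backs the KG, the MCP Server could persist an extracted triple with the official Python driver roughly as follows. The connection details are placeholders, and the MERGE pattern is just one way to keep repeated extractions from duplicating nodes:

```python
# Sketch of persisting an extracted triple into Neo4j using the official
# Python driver. Connection details are placeholders; MERGE keeps repeated
# extractions from creating duplicate nodes or edges.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def store_allergy(patient_id: str, name: str, drug: str, severity: str) -> None:
    query = """
    MERGE (p:Patient {id: $patient_id}) SET p.name = $name
    MERGE (d:Drug {name: $drug})
    MERGE (p)-[r:HAS_ALLERGY_TO]->(d) SET r.severity = $severity
    """
    with driver.session() as session:
        session.run(query, patient_id=patient_id, name=name, drug=drug, severity=severity)

store_allergy("JD001", "John Doe", "Penicillin", "Anaphylaxis")
```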
- Intent Classifier (The Navigator):
This is the middleware component that enables precise retrieval. Instead of the MCP Server querying the entire graph for every fact, the classifier converts the user's query into a specified Action Type (Intent), which is mapped to the Edge Types required for that action.
As the architect, you must define and set up this safety mapping. While classification can be handled by a small, dedicated LLM, the safety logic resides in the mapping between the Action Types and the Edge Types. An example of this mapping looks like:
| Action Type (Intent) | Description | Critical KG Edge Types to Query |
|---|---|---|
| `MED_ORDER` | Request to prescribe, order, dispense, or recommend medication/treatment. | `:HAS_ALLERGY_TO`, `:IS_CONTRAINDICATED_WITH`, `:DOSE_LIMIT` |
| `DIAG_ORDER` | Request to order a test, scan, or lab work. | `:HAS_CONTRAST_ALLERGY`, `:HAS_IMPLANTS`, `:REQUIRES_PREP` |
How this works: When the query is classified as MED_ORDER, the MCP Server is explicitly told to only query the KG for the specified Edge Types. This ensures that the relevant details are retrieved without the noise and cost of querying the entire graph.
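In code, this safety mapping can be as simple as a dictionary the MCP Server consults before it ever touches the graph. The intent labels, edge names, and `classify_intent` stub below are illustrative:

```python
# Sketch of the Intent Classifier's safety mapping. The intent labels, edge
# names, and `classify_intent` stub are illustrative; the key idea is that
# the mapping, not the LLM, decides which edges must be checked.
INTENT_EDGE_MAP = {
    "MED_ORDER": ["HAS_ALLERGY_TO", "IS_CONTRAINDICATED_WITH", "DOSE_LIMIT"],
    "DIAG_ORDER": ["HAS_CONTRAST_ALLERGY", "HAS_IMPLANTS", "REQUIRES_PREP"],
}

def classify_intent(query: str) -> str:
    # Placeholder: a small dedicated model (or rules) would classify here.
    return "MED_ORDER" if "antibiotic" in query.lower() else "DIAG_ORDER"

def build_kg_query(patient_id: str, intent: str) -> tuple[str, dict]:
    edge_types = "|".join(INTENT_EDGE_MAP[intent])   # e.g. HAS_ALLERGY_TO|DOSE_LIMIT
    cypher = (
        f"MATCH (p:Patient {{id: $patient_id}})-[r:{edge_types}]->(x) "
        "RETURN type(r) AS relation, properties(r) AS props, x"
    )
    return cypher, {"patient_id": patient_id}

intent = classify_intent("Recommend the standard first-line antibiotic.")
print(build_kg_query("JD001", intent))  # query restricted to MED_ORDER edge types
```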
This system solves the context overflow problem by letting the LLM do the reasoning, the context window hold the immediate conversation, and the R-Memory handle the facts that must never be lost.
R-Memory in Action
The true power of this R-Memory system is shown when a critical fact must be retrieved under pressure, guaranteeing safety despite context overflow. Let's see how this system would handle the case of Mr. Doe.
Retrieval
When the doctor asks the LLM, "Recommend the standard first-line antibiotic," the MCP Server handles the request:
- The Intent Classifier classifies the query as a `MED_ORDER`.
- The MCP Server uses the defined mapping to execute a query targeting only the necessary Edge Types (e.g., `:HAS_ALLERGY_TO`) in the Knowledge Graph (KG).
The KG performs a fast, precise path traversal, bypassing all the irrelevant noise, and returns the crucial safety constraint:
```cypher
// Cypher query used by the MCP Server
MATCH (p:Patient {id: 'JD001'})-[r:HAS_ALLERGY_TO]->(a:Drug)
RETURN a.name AS DrugAllergy, r.severity AS Severity
```
KG Result: "CRITICAL ALLERGY: Patient John Doe is severely allergic to Penicillin (Anaphylaxis risk)."
Prompt Augmentation
The MCP Server constructs the final prompt, placing the retrieved facts at the very front of the context window.
```python
FINAL_PROMPT = f"""
[R-MEMORY CONTEXT]: CRITICAL ALLERGY: Patient John Doe is severely allergic to Penicillin (Anaphylaxis risk).
[QUERY]: Patient John Doe has tested positive for Strep. Recommend the standard first-line antibiotic.
"""
```
Now that the LLM has this critical knowledge available, the key detail will be included in the reasoning process.
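Putting the pieces together, one possible shape for the MCP Server's request handler (reusing the illustrative helpers sketched earlier, with `run_kg_query` and `call_llm` left as injection points) is:

```python
# End-to-end sketch of how the MCP Server could handle one request, tying the
# earlier pieces together. All four helpers are the illustrative stand-ins from
# the sketches above, passed in here so the flow itself stays explicit.
def handle_request(patient_id: str, user_query: str,
                   classify_intent, build_kg_query, run_kg_query, call_llm) -> str:
    intent = classify_intent(user_query)                 # 1. route the request
    cypher, params = build_kg_query(patient_id, intent)  # 2. restrict to safety-critical edges
    facts = run_kg_query(cypher, params)                 # 3. precise KG traversal
    prompt = f"[R-MEMORY CONTEXT]: {facts}\n[QUERY]: {user_query}"
    return call_llm(prompt)                              # 4. the LLM reasons over the facts
```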
Conclusion
As LLMs and AI continue to be deployed in high-stakes fields (including healthcare), reliable, structured memory becomes essential. The R-Memory architecture decouples reasoning from memory, solving the context overflow problem and preserving safety and integrity regardless of conversation length or complexity.