1. The core problem: why memory layers exist
The fundamental limitation
LLMs do not remember past conversations. Each request looks like this:
(prompt tokens) → LLM → response
Once the response is sent:
- The model forgets everything.
- There is no built-in long-term state.
Why chat history is not enough
Naively appending past messages:
- Explodes token usage
- Adds noise (irrelevant history)
- Does not create understanding, only repetition
Example:
User: I prefer C++ for DSA.
(200 messages later)
User: Solve this problem.
Without memory: The preference is gone unless re-sent. The system behaves statelessly.
We need durable, selective memory.
2. What are memory layers?
Memory layers are external systems that store and retrieve useful information across interactions. They live outside the LLM.
High-level definition
A memory layer is a pipeline that:
- Extracts candidate memory from interactions
- Validates and deduplicates it
- Stores it in retrievable form
- Injects it back into future prompts
Types of external memory
Vector memory
- Semantic recall
- "Find things similar to this"
Structured memory
- Clean facts
- Preferences, decisions, profile data
Most real systems use both.
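As a rough sketch (the MemoryRecord shape and the to_vector_entry helper are illustrative, not from any particular library), the two kinds of memory can be pictured like this:

from dataclasses import dataclass

@dataclass
class MemoryRecord:
    # Structured memory: a clean, typed fact about the user.
    type: str          # e.g. "preference", "project", "decision"
    content: str       # e.g. "User prefers C++ for DSA"
    confidence: float

# Vector memory: the same content stored as an embedding for semantic search.
# embed() is a stand-in for whatever embedding model the system uses.
def to_vector_entry(record, embed):
    return {"embedding": embed(record.content), "memory": record}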
Examples of what counts as memory
Memory-worthy:
- "User prefers C++ for DSA"
- "User is building a SaaS with Next.js"
- "Session store chosen: Redis"
Not memory-worthy:
- "Binary search is O(log n)"
- "Here is how HTTP works"
3. High-level request–response cycle
Before diving deeper, here is the big picture:
Write path: user query → store/ignore decision → extract → validate → dedupe → resolve conflicts → store
Read path: new request → retrieve relevant memory → inject into prompt → main LLM → response → store/ignore decision for the response
4. Beginner view: how user queries are stored (vector DB basics)
At the simplest level:
- Take user query
- Convert it into an embedding
- Store it in a vector DB
Example:
embedding = embed("I am building a SaaS using Next.js")
vector_db.add(embedding)
Later:
similar = vector_db.search(embed("Next.js project"))
This enables semantic recall.
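To make this concrete, here is a minimal in-memory sketch; it is a toy stand-in for a real vector database, and embed is assumed to return a list of floats from whatever embedding model you use. It also keeps the original text alongside each embedding so search can return something readable.

import math

class TinyVectorStore:
    # A toy stand-in for a vector DB: stores (embedding, text) pairs in memory.
    def __init__(self):
        self.items = []

    def add(self, embedding, text):
        self.items.append((embedding, text))

    def search(self, query_embedding, top_k=3):
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0
        scored = sorted(self.items, key=lambda item: cos(query_embedding, item[0]), reverse=True)
        return [text for _, text in scored[:top_k]]

With something like this in place, the vector_db calls above behave as written, minus persistence and real indexing.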
But this is too naive.
Problems:
- Stores junk
- Stores duplicates
- Stores temporary questions
We need selection logic.
5. Deciding whether a user query should be stored
This is the first critical decision point.
There are three main approaches
5.1 Using an LLM (semantic judgment)
Idea: Ask an LLM: "Is this user query a stable, long-term fact?"
Example prompt:
Does this user message contain:
- a preference
- a long-term project detail
- a stable goal
Return only: STORE or IGNORE.
Example:
User: I prefer C++ for DSA. → STORE
User: Explain quicksort. → IGNORE
Pros / cons:
- Very accurate
- Extra cost and latency
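A minimal sketch of this judge, assuming llm is any callable that takes a prompt string and returns the model's text:

STORE_JUDGE_PROMPT = """Does this user message contain:
- a preference
- a long-term project detail
- a stable goal

Return only: STORE or IGNORE.

Message: {message}
"""

def should_store_llm(llm, message):
    # One extra model call per message: accurate, but adds cost and latency.
    verdict = llm(STORE_JUDGE_PROMPT.format(message=message)).strip().upper()
    return verdict == "STORE"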
5.2 Using a classifier model
Idea: Train a lightweight model on labeled examples.
"I prefer TypeScript" → STORE
"What is TCP?" → IGNORE
Runtime logic:
if classifier.predict(query) == "STORE":
    store(query)
Pros / cons:
- Fast, cheap
- Less nuanced
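One way such a classifier could be built (a sketch using scikit-learn on a tiny labeled set; a real system would train on far more data and likely use embeddings as features):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set; a real one would have thousands of examples.
texts = ["I prefer TypeScript", "I am building a SaaS with Next.js",
         "What is TCP?", "Explain quicksort"]
labels = ["STORE", "STORE", "IGNORE", "IGNORE"]

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(texts, labels)

if classifier.predict(["I prefer C++ for DSA"])[0] == "STORE":
    print("store it")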
5.3 Using rules (heuristics)
Idea: Use deterministic checks.
def should_store(query):
    return (
        "I prefer" in query or
        "I am building" in query
    )
Often combined with embeddings:
if rule_pass and similarity > 0.8:
    store()
Pros / cons:
- Zero inference cost
- Brittle
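Putting the two checks together (a sketch; the rule patterns, the 0.8 threshold, and profile_vec, an embedding of known long-term user context, are all assumptions):

RULE_PATTERNS = ("i prefer", "i am building", "i use", "my project")

def rule_pass(query):
    q = query.lower()
    return any(p in q for p in RULE_PATTERNS)

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def should_store_rules(query, embed, profile_vec):
    # Cheap keyword gate first, then a semantic check against known user context.
    return rule_pass(query) and cosine(embed(query), profile_vec) > 0.8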
6. Assume LLM-based selection is used
From now on, we assume:
The system uses an LLM to decide whether a user query becomes memory.
Once accepted, the query enters the memory pipeline described in the next section.
7. Memory pipeline
This is the core of the article.
Step 1: Memory extraction (LLM call)
The LLM is asked to extract facts, not to chat.
def extract_memory(llm, conversation):
    # Ask the model to extract durable facts, not to chat.
    prompt = f"""
Extract facts from the conversation below that are:
- stable over time
- about the user
- useful later

Return a JSON list of memory objects.

Conversation:
{conversation}
"""
    return llm(prompt)
Example output:
[
  {
    "type": "preference",
    "content": "User prefers C++ for DSA",
    "confidence": 0.92
  }
]
Flowchart:
Conversation Turn → Memory Extractor LLM → Candidate Memory Objects
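In practice the extractor returns raw text that still has to be parsed before it can enter the rest of the pipeline; a hedged usage sketch, assuming llm returns a string:

import json

raw = extract_memory(llm, "User: I prefer C++ for DSA.\nAssistant: Noted.")
try:
    candidates = json.loads(raw)   # expected: a list of memory objects
except json.JSONDecodeError:
    candidates = []                # extractor did not return valid JSON; skip this turn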
Step 2: Validation gate
Filters low-quality memory.
def validate(mem):
    if mem["confidence"] < 0.7:
        return False
    if len(mem["content"]) < 10:
        return False
    return True
Step 3: Deduplication (semantic)
Avoid storing the same thing repeatedly.
def is_duplicate(new_text, existing, embed):
    # cosine() computes cosine similarity between two embedding vectors.
    new_vec = embed(new_text)
    for m in existing:
        if cosine(new_vec, m["embedding"]) > 0.9:
            return m
    return None
Step 4: Conflict resolution (update vs insert)
Example:
Old: "User prefers Java"
New: "User prefers C++"
def resolve(new, old):
    if not old:
        return "insert", new
    if new["type"] == old["type"]:
        old["content"] = new["content"]
        return "update", old
    return "ignore", None
Step 5: Embedding and storage
vec = embed(mem["content"])
vector_db.add(vec)
structured_db.insert(mem)
Now memory is:
- Durable
- Queryable
- Conflict-aware
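Tying the steps together, the write path for one candidate memory might look like this (a sketch reusing validate, is_duplicate, and resolve from above; embed, vector_db, and structured_db are assumed interfaces, not a specific library):

def write_memory(mem, existing, embed, vector_db, structured_db):
    # Step 2: drop low-quality candidates.
    if not validate(mem):
        return "rejected"
    # Step 3: look for a semantically similar existing memory.
    duplicate = is_duplicate(mem["content"], existing, embed)
    # Step 4: decide between insert, update, and ignore.
    action, record = resolve(mem, duplicate)
    if action == "ignore":
        return "ignored"
    # Step 5: write to both the vector index and the structured store.
    vec = embed(record["content"])
    vector_db.add(vec)
    structured_db.insert(record)
    return action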
8. Main LLM call (brief)
Once memory is stored, future requests do:
- Retrieve relevant memory (vector search)
- Inject into prompt
- Call main LLM
- Generate response
This is a normal RAG-style call, not special.
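A sketch of that read path, assuming vector_db.search returns the most similar stored memories as plain text:

def answer_with_memory(llm, embed, vector_db, user_query):
    # Retrieve memories semantically related to the current query.
    memories = vector_db.search(embed(user_query))
    context = "\n".join(f"- {m}" for m in memories)
    prompt = f"""Known facts about the user:
{context}

User: {user_query}
Assistant:"""
    return llm(prompt)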
9. How LLM responses are stored
Now we repeat the same logic, but with stricter criteria.
Key rule: an LLM response is stored only if it finalizes meaning, i.e. it states a settled fact or decision about the user or their project.
Examples:
"You are using Prisma with Next.js" → STORE
"Binary search is O(log n)" → IGNORE
10. Deciding whether an LLM response is stored
Same three approaches again.
10.1 LLM-based judge
Prompt: Does this response define a stable user fact?
10.2 Classifier
Trained on:
"You prefer X" → STORE
Explanations → IGNORE
10.3 Rules + embeddings
Rules:
- Must reference "you / your"
- Must be declarative
Embeddings:
- Must be close to user context snapshot
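A sketch of such a gate for responses; the "declarative" check is a crude heuristic, and user_context_vec stands in for an embedding of the user's known context:

def should_store_response(response, embed, user_context_vec):
    # cosine() is the same cosine-similarity helper defined in section 5.3.
    text = response.strip().lower()
    references_user = "you " in text or "your " in text
    is_declarative = not text.endswith("?")   # crude proxy for a declarative statement
    near_user_context = cosine(embed(response), user_context_vec) > 0.8
    return references_user and is_declarative and near_user_context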