Ansh Deshwal

LLM Memory Layers — From Zero to Production

1. The core problem: why memory layers exist

The fundamental limitation

LLMs do not remember past conversations. Each request looks like this:

(prompt tokens) → LLM → response

Once the response is sent:

  • The model forgets everything.
  • There is no built-in long-term state.

Why chat history is not enough

Naively appending past messages:

  • Explodes token usage
  • Adds noise (irrelevant history)
  • Does not create understanding, only repetition
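
To make the token problem concrete, here is a minimal sketch of the naive approach, assuming a hypothetical llm helper that accepts a list of chat messages:

history = []

def naive_chat(llm, user_message):
    # Every past message is resent on every turn, so the prompt grows without bound
    history.append({"role": "user", "content": user_message})
    response = llm(history)
    history.append({"role": "assistant", "content": response})
    return response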

Example:

User: I prefer C++ for DSA.
(200 messages later)
User: Solve this problem.

Without memory: The preference is gone unless re-sent. The system behaves statelessly.

We need durable, selective memory.

2. What are memory layers?

Memory layers are external systems that store and retrieve useful information across interactions. They live outside the LLM.

High-level definition

A memory layer is a pipeline that:

  1. Extracts candidate memory from interactions
  2. Validates and deduplicates it
  3. Stores it in retrievable form
  4. Injects it back into future prompts
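
As a minimal sketch, using hypothetical llm, embed, and store helpers (later sections fill in extract_memory, validate, is_duplicate, and resolve), the write path looks like this:

def memory_pipeline(conversation, existing_memories):
    # 1. Extract candidate memory objects from the interaction
    candidates = extract_memory(llm, conversation)

    for mem in candidates:
        # 2. Validate and deduplicate
        if not validate(mem):
            continue
        duplicate = is_duplicate(mem["content"], existing_memories, embed)

        # 3. Store in retrievable form (insert or update)
        action, record = resolve(mem, duplicate)
        if action in ("insert", "update"):
            store(record)

    # 4. Injection back into future prompts happens at query time (section 8)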

Types of external memory

Vector memory

  • Semantic recall
  • "Find things similar to this"

Structured memory

  • Clean facts
  • Preferences, decisions, profile data

Most real systems use both.

Examples of what counts as memory

Memory-worthy:

  • "User prefers C++ for DSA"
  • "User is building a SaaS with Next.js"
  • "Session store chosen: Redis"

Not memory-worthy:

  • "Binary search is O(log n)"
  • "Here is how HTTP works"

3. High-level request–response cycle

Before diving deeper, here is the big picture:

User query → store decision → memory pipeline → vector + structured storage
Later query → retrieve relevant memories → inject into prompt → main LLM → response

4. Beginner view: how user queries are stored (vector DB basics)

At the simplest level:

  1. Take user query
  2. Convert it into an embedding
  3. Store it in a vector DB

Example:

embedding = embed("I am building a SaaS using Next.js")
vector_db.add(embedding)

Later:

similar = vector_db.search(embed("Next.js project"))

This enables semantic recall.
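
As a minimal in-memory sketch (a real system would use an actual vector database, but the recall logic is the same; TinyVectorStore and the embed function are assumptions here):

import numpy as np

class TinyVectorStore:
    def __init__(self):
        self.vectors, self.texts = [], []

    def add(self, vec, text):
        self.vectors.append(vec)
        self.texts.append(text)

    def search(self, query_vec, k=3):
        # rank stored texts by cosine similarity to the query vector
        sims = [
            float(np.dot(query_vec, v) / (np.linalg.norm(query_vec) * np.linalg.norm(v)))
            for v in self.vectors
        ]
        top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
        return [(self.texts[i], sims[i]) for i in top]

store = TinyVectorStore()
store.add(embed("I am building a SaaS using Next.js"), "I am building a SaaS using Next.js")
similar = store.search(embed("Next.js project"))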

But storing every query like this is too naive.

Problems:

  • Stores junk
  • Stores duplicates
  • Stores temporary questions

We need selection logic.

5. Deciding whether a user query should be stored

This is the first critical decision point.

There are three main approaches

5.1 Using an LLM (semantic judgment)

Idea: Ask an LLM: "Does this user query contain a stable, long-term fact?"

Example prompt:

Does this user message contain:
- a preference
- a long-term project detail
- a stable goal

Return only: STORE or IGNORE.

Example:

User: I prefer C++ for DSA. → STORE
User: Explain quicksort. → IGNORE

Pros / cons:

  • Very accurate
  • Extra cost and latency
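
A minimal sketch of this judge, assuming a generic llm(prompt) callable that returns raw text:

STORE_PROMPT = """Does this user message contain:
- a preference
- a long-term project detail
- a stable goal

Return only: STORE or IGNORE.

Message: {message}"""

def should_store_llm(llm, message):
    # One extra, cheap LLM call per message; any unexpected output is treated as IGNORE
    verdict = llm(STORE_PROMPT.format(message=message)).strip().upper()
    return verdict == "STORE"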

5.2 Using a classifier model

Idea: Train a lightweight model on labeled examples.

"I prefer TypeScript" → STORE
"What is TCP?" → IGNORE

Runtime logic:

if classifier.predict(query) == "STORE":
    store(query)

Pros / cons:

  • Fast, cheap
  • Less nuanced
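
As a sketch of what such a classifier could look like, here is a toy TF-IDF + logistic regression model in scikit-learn (the examples and labels are illustrative; a real dataset would be much larger):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled data in the STORE / IGNORE scheme described above
examples = ["I prefer TypeScript", "I am building a SaaS with Next.js",
            "What is TCP?", "Explain quicksort"]
labels = ["STORE", "STORE", "IGNORE", "IGNORE"]

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(examples, labels)

if classifier.predict(["I prefer C++ for DSA"])[0] == "STORE":
    store("I prefer C++ for DSA")  # store() is the same hypothetical helper as before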

5.3 Using rules (heuristics)

Idea: Use deterministic checks.

def should_store(query):
    return (
        "I prefer" in query or
        "I am building" in query
    )

Often combined with embeddings:

if rule_pass and similarity > 0.8:
    store()

Pros / cons:

  • Zero inference cost
  • Brittle
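
A sketch of the combined check, where similarity is measured against a few prototype "memory-worthy" sentences (embed and cosine are assumed helpers, and the 0.8 threshold is illustrative):

PROTOTYPES = [
    "I prefer X for Y",
    "I am building a project with X",
    "My long-term goal is X",
]

def should_store_rules(query, embed, cosine):
    # Cheap lexical rules first
    rule_pass = any(p in query for p in ("I prefer", "I am building", "My goal"))

    # Then require semantic closeness to at least one prototype sentence
    query_vec = embed(query)
    similarity = max(cosine(query_vec, embed(p)) for p in PROTOTYPES)

    return rule_pass and similarity > 0.8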

6. Assume LLM-based selection is used

From now on, we assume:

The system uses an LLM to decide whether a user query becomes memory.

Once accepted, the query enters the memory pipeline described in the next section.

7. Memory pipeline

This is the core of the article.

Step 1: Memory extraction (LLM call)

The LLM is asked to extract facts, not to chat.

def extract_memory(llm, conversation):
    prompt = f"""
    Extract facts from the conversation below that are:
    - stable over time
    - about the user
    - useful later

    Return JSON (a list of objects with type, content, confidence).

    Conversation:
    {conversation}
    """
    return llm(prompt)

Example output:

[
    {
        "type": "preference",
        "content": "User prefers C++ for DSA",
        "confidence": 0.92
    }
]

Flowchart:
Conversation Turn → Memory Extractor LLM → Candidate Memory Objects

Step 2: Validation gate

Filters low-quality memory.

def validate(mem):
    if mem["confidence"] < 0.7:
        return False
    if len(mem["content"]) < 10:
        return False
    return True

Step 3: Deduplication (semantic)

Avoid storing the same thing repeatedly.

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_duplicate(new_text, existing, embed):
    new_vec = embed(new_text)
    for m in existing:
        # anything above 0.9 cosine similarity is treated as the same memory
        if cosine(new_vec, m["embedding"]) > 0.9:
            return m
    return None

Step 4: Conflict resolution (update vs insert)

Example:

Old: "User prefers Java"
New: "User prefers C++"
Enter fullscreen mode Exit fullscreen mode
def resolve(new, old):
    if not old:
        return "insert", new
    if new["type"] == old["type"]:
        old["content"] = new["content"]
        return "update", old
    return "ignore", None

Step 5: Embedding and storage

vec = embed(mem["content"])
vector_db.add(vec)
structured_db.insert(mem)

Now memory is:

  • Durable
  • Queryable
  • Conflict-aware

8. Main LLM call (brief)

Once memory is stored, future requests do:

  1. Retrieve relevant memory (vector search)
  2. Inject into prompt
  3. Call main LLM
  4. Generate response

This is a standard RAG-style call; nothing here is specific to the memory layer.
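
A minimal sketch of that flow, reusing the hypothetical vector_db, embed, and llm helpers from earlier:

def answer(query, vector_db, embed, llm, k=5):
    # 1. Retrieve the k most relevant memories for this query
    memories = vector_db.search(embed(query), k=k)

    # 2. Inject them into the prompt
    memory_block = "\n".join(f"- {m}" for m in memories)
    prompt = f"Known facts about the user:\n{memory_block}\n\nUser: {query}"

    # 3. Call the main LLM and return its response
    return llm(prompt)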

9. How LLM responses are stored

Now we repeat the same logic, but with stricter criteria.

Key rule: LLM responses are stored only if they finalize meaning, i.e. they settle a fact about the user or the project rather than merely explain a concept.

Examples:

"You are using Prisma with Next.js" → STORE
"Binary search is O(log n)" → IGNORE

10. Deciding whether an LLM response is stored

Same three approaches again.

10.1 LLM-based judge

Prompt: Does this response define a stable user fact?

10.2 Classifier

Trained on:

"You prefer X" → STORE
Explanations → IGNORE

10.3 Rules + embeddings

Rules:

  • Must reference "you / your"
  • Must be declarative

Embeddings:

  • Must be close to user context snapshot
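
A sketch of this stricter, response-side check (embed, cosine, and the user_context_vec snapshot are assumed, and the 0.75 threshold is illustrative):

def should_store_response(response, user_context_vec, embed, cosine):
    text = response.strip().lower()

    # Rule: must reference the user directly ("you" / "your")
    references_user = " you " in f" {text} " or "your" in text

    # Rule: must be declarative, not a question
    declarative = not text.endswith("?")

    # Embedding: must be close to the current user context snapshot
    close_to_context = cosine(embed(response), user_context_vec) > 0.75

    return references_user and declarative and close_to_context

Only responses that pass all three gates are sent through the same extraction-and-storage pipeline as user queries.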
