Ansh Deshwal

LLM Memory Layers — From Zero to Production

1. The core problem: why memory layers exist

The fundamental limitation

LLMs do not remember past conversations. Each request looks like this:

(prompt tokens) → LLM → response

Once the response is sent:

  • The model forgets everything.
  • There is no built-in long-term state.

Why chat history is not enough

Naively appending past messages:

  • Explodes token usage
  • Adds noise (irrelevant history)
  • Does not create understanding, only repetition
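
To make the token problem concrete, here is a minimal sketch of the naive approach, assuming a hypothetical llm helper that accepts a list of chat messages:

history = []

def naive_chat(llm, user_message):
    # Every past message is resent on every turn, so the prompt grows without bound
    history.append({"role": "user", "content": user_message})
    response = llm(history)
    history.append({"role": "assistant", "content": response})
    return response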

Example:

User: I prefer C++ for DSA.
(200 messages later)
User: Solve this problem.

Without memory: The preference is gone unless re-sent. The system behaves statelessly.

We need durable, selective memory.

2. What are memory layers?

Memory layers are external systems that store and retrieve useful information across interactions. They live outside the LLM.

High-level definition

A memory layer is a pipeline that:

  1. Extracts candidate memory from interactions
  2. Validates and deduplicates it
  3. Stores it in retrievable form
  4. Injects it back into future prompts
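
As a minimal sketch, using hypothetical llm, embed, and store helpers (later sections fill in extract_memory, validate, is_duplicate, and resolve), the write path looks like this:

def memory_pipeline(conversation, existing_memories):
    # 1. Extract candidate memory objects from the interaction
    candidates = extract_memory(llm, conversation)

    for mem in candidates:
        # 2. Validate and deduplicate
        if not validate(mem):
            continue
        duplicate = is_duplicate(mem["content"], existing_memories, embed)

        # 3. Store in retrievable form (insert or update)
        action, record = resolve(mem, duplicate)
        if action in ("insert", "update"):
            store(record)

    # 4. Injection back into future prompts happens at query time (section 8)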

Types of external memory

Vector memory

  • Semantic recall
  • "Find things similar to this"

Structured memory

  • Clean facts
  • Preferences, decisions, profile data

Most real systems use both.

Examples of what counts as memory

Memory-worthy:

  • "User prefers C++ for DSA"
  • "User is building a SaaS with Next.js"
  • "Session store chosen: Redis"

Not memory-worthy:

  • "Binary search is O(log n)"
  • "Here is how HTTP works"

3. High-level request–response cycle

Before diving deeper, here is the big picture:

User query → store decision → memory pipeline → vector + structured storage
Later query → retrieve relevant memories → inject into prompt → main LLM → response

4. Beginner view: how user queries are stored (vector DB basics)

At the simplest level:

  1. Take user query
  2. Convert it into an embedding
  3. Store it in a vector DB

Example:

embedding = embed("I am building a SaaS using Next.js")
vector_db.add(embedding)

Later:

similar = vector_db.search(embed("Next.js project"))

This enables semantic recall.
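
As a minimal in-memory sketch (a real system would use an actual vector database, but the recall logic is the same; TinyVectorStore and the embed function are assumptions here):

import numpy as np

class TinyVectorStore:
    def __init__(self):
        self.vectors, self.texts = [], []

    def add(self, vec, text):
        self.vectors.append(vec)
        self.texts.append(text)

    def search(self, query_vec, k=3):
        # rank stored texts by cosine similarity to the query vector
        sims = [
            float(np.dot(query_vec, v) / (np.linalg.norm(query_vec) * np.linalg.norm(v)))
            for v in self.vectors
        ]
        top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
        return [(self.texts[i], sims[i]) for i in top]

store = TinyVectorStore()
store.add(embed("I am building a SaaS using Next.js"), "I am building a SaaS using Next.js")
similar = store.search(embed("Next.js project"))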

But storing every query like this is too naive.

Problems:

  • Stores junk
  • Stores duplicates
  • Stores temporary questions

We need selection logic.

5. Deciding whether a user query should be stored

This is the first critical decision point.

There are three main approaches

5.1 Using an LLM (semantic judgment)

Idea: Ask an LLM: "Does this user query contain a stable, long-term fact?"

Example prompt:

Does this user message contain:
- a preference
- a long-term project detail
- a stable goal

Return only: STORE or IGNORE.

Example:

User: I prefer C++ for DSA. → STORE
User: Explain quicksort. → IGNORE

Pros / cons:

  • Very accurate
  • Extra cost and latency
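
A minimal sketch of this judge, assuming a generic llm(prompt) callable that returns raw text:

STORE_PROMPT = """Does this user message contain:
- a preference
- a long-term project detail
- a stable goal

Return only: STORE or IGNORE.

Message: {message}"""

def should_store_llm(llm, message):
    # One extra, cheap LLM call per message; any unexpected output is treated as IGNORE
    verdict = llm(STORE_PROMPT.format(message=message)).strip().upper()
    return verdict == "STORE"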

5.2 Using a classifier model

Idea: Train a lightweight model on labeled examples.

"I prefer TypeScript" → STORE
"What is TCP?" → IGNORE

Runtime logic:

if classifier.predict(query) == "STORE":
    store(query)

Pros / cons:

  • Fast, cheap
  • Less nuanced
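
As a sketch of what such a classifier could look like, here is a toy TF-IDF + logistic regression model in scikit-learn (the examples and labels are illustrative; a real dataset would be much larger):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled data in the STORE / IGNORE scheme described above
examples = ["I prefer TypeScript", "I am building a SaaS with Next.js",
            "What is TCP?", "Explain quicksort"]
labels = ["STORE", "STORE", "IGNORE", "IGNORE"]

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(examples, labels)

if classifier.predict(["I prefer C++ for DSA"])[0] == "STORE":
    store("I prefer C++ for DSA")  # store() is the same hypothetical helper as before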

5.3 Using rules (heuristics)

Idea: Use deterministic checks.

def should_store(query):
    return (
        "I prefer" in query or
        "I am building" in query
    )

Often combined with embeddings:

if rule_pass and similarity > 0.8:
    store()

Pros / cons:

  • Zero inference cost
  • Brittle
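
A sketch of the combined check, where similarity is measured against a few prototype "memory-worthy" sentences (embed and cosine are assumed helpers, and the 0.8 threshold is illustrative):

PROTOTYPES = [
    "I prefer X for Y",
    "I am building a project with X",
    "My long-term goal is X",
]

def should_store_rules(query, embed, cosine):
    # Cheap lexical rules first
    rule_pass = any(p in query for p in ("I prefer", "I am building", "My goal"))

    # Then require semantic closeness to at least one prototype sentence
    query_vec = embed(query)
    similarity = max(cosine(query_vec, embed(p)) for p in PROTOTYPES)

    return rule_pass and similarity > 0.8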

6. Assume LLM-based selection is used

From now on, we assume:

The system uses an LLM to decide whether a user query becomes memory.

Once accepted, the query enters the memory pipeline described in the next section.

7. Memory pipeline

This is the core of the article.

Step 1: Memory extraction (LLM call)

The LLM is asked to extract facts, not to chat.

def extract_memory(llm, conversation):
    prompt = f"""
    Extract facts from the conversation below that are:
    - stable over time
    - about the user
    - useful later

    Return JSON (a list of objects with type, content, confidence).

    Conversation:
    {conversation}
    """
    return llm(prompt)

Example output:

[
    {
        "type": "preference",
        "content": "User prefers C++ for DSA",
        "confidence": 0.92
    }
]

Flowchart:
Conversation Turn → Memory Extractor LLM → Candidate Memory Objects

Step 2: Validation gate

Filters low-quality memory.

def validate(mem):
    if mem["confidence"] < 0.7:
        return False
    if len(mem["content"]) < 10:
        return False
    return True

Step 3: Deduplication (semantic)

Avoid storing the same thing repeatedly.

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_duplicate(new_text, existing, embed):
    new_vec = embed(new_text)
    for m in existing:
        # anything above 0.9 cosine similarity is treated as the same memory
        if cosine(new_vec, m["embedding"]) > 0.9:
            return m
    return None

Step 4: Conflict resolution (update vs insert)

Example:

Old: "User prefers Java"
New: "User prefers C++"
Enter fullscreen mode Exit fullscreen mode
def resolve(new, old):
    if not old:
        return "insert", new
    if new["type"] == old["type"]:
        old["content"] = new["content"]
        return "update", old
    return "ignore", None

Step 5: Embedding and storage

vec = embed(mem["content"])
vector_db.add(vec)
structured_db.insert(mem)

Now memory is:

  • Durable
  • Queryable
  • Conflict-aware

8. Main LLM call (brief)

Once memory is stored, future requests do:

  1. Retrieve relevant memory (vector search)
  2. Inject into prompt
  3. Call main LLM
  4. Generate response

This is a standard RAG-style call; nothing here is specific to the memory layer.
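
A minimal sketch of that flow, reusing the hypothetical vector_db, embed, and llm helpers from earlier:

def answer(query, vector_db, embed, llm, k=5):
    # 1. Retrieve the k most relevant memories for this query
    memories = vector_db.search(embed(query), k=k)

    # 2. Inject them into the prompt
    memory_block = "\n".join(f"- {m}" for m in memories)
    prompt = f"Known facts about the user:\n{memory_block}\n\nUser: {query}"

    # 3. Call the main LLM and return its response
    return llm(prompt)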

9. How LLM responses are stored

Now we repeat the same logic, but with stricter criteria.

Key rule: LLM responses are stored only if they finalize meaning, i.e. they settle a fact about the user or the project rather than merely explain a concept.

Examples:

"You are using Prisma with Next.js" → STORE
"Binary search is O(log n)" → IGNORE

10. Deciding whether an LLM response is stored

Same three approaches again.

10.1 LLM-based judge

Prompt: Does this response define a stable user fact?

10.2 Classifier

Trained on:

"You prefer X" → STORE
Explanations → IGNORE

10.3 Rules + embeddings

Rules:

  • Must reference "you / your"
  • Must be declarative

Embeddings:

  • Must be close to user context snapshot
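
A sketch of this stricter, response-side check (embed, cosine, and the user_context_vec snapshot are assumed, and the 0.75 threshold is illustrative):

def should_store_response(response, user_context_vec, embed, cosine):
    text = response.strip().lower()

    # Rule: must reference the user directly ("you" / "your")
    references_user = " you " in f" {text} " or "your" in text

    # Rule: must be declarative, not a question
    declarative = not text.endswith("?")

    # Embedding: must be close to the current user context snapshot
    close_to_context = cosine(embed(response), user_context_vec) > 0.75

    return references_user and declarative and close_to_context

Only responses that pass all three gates are sent through the same extraction-and-storage pipeline as user queries.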
