Cloyou

How to Build a Simple Persistent Memory Layer for LLM Apps (With Code)

Most LLM-powered apps feel impressive for five minutes.

Then they forget everything.

You ask a chatbot something. It responds intelligently. You close the tab, come back later, and it behaves like you’ve never met.

That’s not a model problem. That’s an architecture problem.

In this article, we’ll build a simple persistent memory layer for an LLM app using:

  • Python
  • OpenAI embeddings
  • A lightweight vector store (FAISS)
  • Basic retrieval logic

By the end, you’ll understand how to move from “stateless prompt wrapper” to a structured LLM system.


Why Stateless LLM Apps Break in Production

Most basic LLM apps work like this:

  1. User sends input
  2. Input is sent to model
  3. Model responds
  4. Conversation disappears

Even if you store chat history, once you exceed the context window, you’re forced to truncate earlier messages.

Problems this creates:

  • No long-term personalization
  • No user memory
  • Repeated explanations
  • Poor multi-session experience

If you're building anything beyond a demo, you need persistent memory.


What Is a Persistent Memory Layer?

A persistent memory layer:

  • Stores meaningful interactions
  • Converts them into embeddings
  • Saves them in a vector database
  • Retrieves relevant memories for future conversations

Instead of stuffing everything into context, you retrieve only what matters.

Architecture overview:

User Input
    ↓
Embed Input
    ↓
Store in Vector DB
    ↓
Retrieve Relevant Past Memories
    ↓
Build Context
    ↓
Send to LLM

Simple. Powerful.
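To see the retrieval idea in isolation before wiring up any APIs, here's a dependency-free sketch of the loop above. The `embed` function is a toy stand-in that maps a few known phrases to fixed 3-dimensional vectors, not a real embedding model:

```python
import numpy as np

# Toy stand-in for an embedding model: fixed 3-d vectors for a few
# known phrases, so the example is deterministic and offline.
def embed(text):
    fake = {
        "likes short answers": [1.0, 0.0, 0.0],
        "building a SaaS tool": [0.0, 1.0, 0.0],
        "favorite color is blue": [0.0, 0.0, 1.0],
        "how long should replies be?": [0.9, 0.1, 0.0],
    }
    return np.array(fake[text], dtype="float32")

memories = ["likes short answers", "building a SaaS tool", "favorite color is blue"]
vectors = np.stack([embed(m) for m in memories])  # our "vector DB"

query = embed("how long should replies be?")
distances = np.linalg.norm(vectors - query, axis=1)  # L2, like IndexFlatL2
best = memories[int(np.argmin(distances))]
print(best)  # → likes short answers
```

The question about reply length lands nearest the "likes short answers" memory, even though the two strings share no keywords. That's the whole trick: semantic proximity, not string matching.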


Step 1: Install Dependencies

We’ll use:

  • openai
  • faiss
  • numpy

Install:

pip install openai faiss-cpu numpy

Step 2: Create a Memory Store

Let’s build a minimal memory system.

import faiss
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

dimension = 1536  # output dimension of text-embedding-3-small
index = faiss.IndexFlatL2(dimension)

memory_texts = []

This creates a simple in-memory FAISS vector store.


Step 3: Store Memories

Every time the user sends something meaningful, embed and store it.

def add_memory(text):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )

    embedding = np.array(response.data[0].embedding).astype('float32')
    index.add(np.array([embedding]))
    memory_texts.append(text)

Now we can persist interactions semantically.

Example:

add_memory("User prefers short technical explanations.")
add_memory("User is building a SaaS AI tool.")

Step 4: Retrieve Relevant Memories

When the user sends a new query, embed it and search the vector index.

def retrieve_memories(query, k=3):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    )

    query_embedding = np.array(response.data[0].embedding).astype('float32')

    distances, indices = index.search(np.array([query_embedding]), k)

    # FAISS pads results with -1 when fewer than k vectors exist,
    # so guard both bounds (a bare `i < len(...)` would let -1 through)
    return [memory_texts[i] for i in indices[0] if 0 <= i < len(memory_texts)]

Now we can pull relevant historical context.
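One detail worth knowing: `IndexFlatL2` ranks by Euclidean distance, while you may see other tutorials use cosine similarity. For unit-length vectors (and OpenAI embeddings are normalized to unit length) the two produce identical rankings, since ||a - b||^2 = 2 - 2(a . b). A quick numpy check of that claim:

```python
import numpy as np

rng = np.random.default_rng(0)

def unit(v):
    return v / np.linalg.norm(v)

# Random unit vectors standing in for normalized embeddings
query = unit(rng.standard_normal(8))
docs = np.stack([unit(rng.standard_normal(8)) for _ in range(5)])

l2_rank = np.argsort(np.linalg.norm(docs - query, axis=1))  # smaller = closer
cos_rank = np.argsort(-docs @ query)                        # larger = closer
print(np.array_equal(l2_rank, cos_rank))  # True
```

So for this setup the choice of metric is cosmetic; it starts to matter only if you store unnormalized vectors from another model.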


Step 5: Build the Context for the LLM

We combine:

  • Retrieved memory
  • Current user input

def build_prompt(user_input):
    relevant_memories = retrieve_memories(user_input)

    memory_section = "\n".join(relevant_memories)

    return f"""
You are an AI assistant.

Relevant past information:
{memory_section}

Current user message:
{user_input}

Respond accordingly.
"""

This ensures the model receives structured context, not raw history.
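A small refactor worth considering: pass the retrieved memories in as an argument, so the formatting step is a pure function you can unit-test without an API call. (The name `format_prompt` is my own, not from the code above.)

```python
def format_prompt(user_input, relevant_memories):
    """Pure formatting step: no embedding or retrieval calls,
    so it can be tested deterministically."""
    memory_section = "\n".join(relevant_memories)
    return (
        "You are an AI assistant.\n\n"
        f"Relevant past information:\n{memory_section}\n\n"
        f"Current user message:\n{user_input}\n\n"
        "Respond accordingly."
    )

prompt = format_prompt(
    "How should I explain this?",
    ["User prefers short technical explanations."],
)
print(prompt)
```

`build_prompt` then becomes a one-liner that calls `retrieve_memories` and hands the result to `format_prompt`.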


Step 6: Generate Response

def generate_response(user_input):
    prompt = build_prompt(user_input)

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": prompt}
        ]
    )

    return response.choices[0].message.content

Now your app has semantic long-term memory.


Why This Works

Instead of:

“Dump entire conversation into context”

We’re doing:

“Retrieve only relevant past knowledge”

This improves:

  • Scalability
  • Relevance
  • Token efficiency
  • Personalization

And most importantly, it shifts your app from demo-tier to architecture-tier.


Common Pitfalls

1. Storing Everything

Don’t embed trivial small talk. Store meaningful information only.

2. Memory Drift

Over time, irrelevant memories may surface. Consider tagging or pruning.

3. Cost Explosion

Embedding every interaction can become expensive. Add filtering logic.

4. Latency

Vector search is fast, but remote DB calls add delay. Optimize if needed.
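For pitfalls 1 and 3, even a crude heuristic gate in front of `add_memory` helps. This sketch (the thresholds and keyword lists are arbitrary, tune them for your app) skips greetings and short messages, and only stores statements that look like durable facts or preferences:

```python
SMALL_TALK = {"hi", "hello", "thanks", "thank you", "ok", "lol"}
SIGNALS = ("prefer", "always", "never", "my name", "i am", "i'm building", "remember")

def worth_remembering(text):
    """Heuristic filter: drop greetings and very short messages,
    keep messages carrying preference/identity signals."""
    cleaned = text.strip().lower()
    if cleaned in SMALL_TALK or len(cleaned) < 15:
        return False
    return any(signal in cleaned for signal in SIGNALS)

print(worth_remembering("hi"))                                     # False
print(worth_remembering("I prefer short technical explanations"))  # True
print(worth_remembering("What's the weather like today?"))         # False
```

A fancier version might ask a cheap model to classify "is this worth remembering?", but a keyword gate like this already cuts most of the embedding spend.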


Taking This Further

You can improve this system by:

  • Adding user IDs for multi-user support
  • Using persistent storage (e.g., Pinecone, Weaviate, Redis)
  • Creating memory types (preferences, facts, decisions)
  • Adding time-decay weighting
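The time-decay idea from the list above can be as simple as multiplying each memory's similarity score by an exponential decay over its age. Here's a sketch; the 30-day half-life is an arbitrary choice:

```python
def decayed_score(similarity, age_days, half_life_days=30.0):
    """Weight a similarity score by exponential decay:
    a memory half_life_days old counts half as much."""
    return similarity * 0.5 ** (age_days / half_life_days)

# A slightly less similar but fresh memory outranks a stale one
fresh = decayed_score(0.80, age_days=1)    # ≈ 0.78
stale = decayed_score(0.90, age_days=120)  # ≈ 0.056
print(fresh > stale)  # True
```

To use this, retrieve a larger candidate set (say `k=10`), re-score each candidate with `decayed_score`, and keep the top few, so recency and relevance both get a vote.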

This article shows the core pattern. From here, you can productionize.


Final Thoughts

Prompt engineering is not enough for serious AI products.

If your system forgets everything, it’s not intelligent — it’s reactive.

Adding a memory layer is one of the simplest architectural upgrades you can make to move beyond basic wrappers.

And the good news?

It’s not complicated.

It’s just structured design.
