Most LLM-powered apps feel impressive for five minutes.
Then they forget everything.
You ask a chatbot something. It responds intelligently. You close the tab, come back later, and it behaves like you’ve never met.
That’s not a model problem. That’s an architecture problem.
In this article, we’ll build a simple persistent memory layer for an LLM app using:
- Python
- OpenAI embeddings
- A lightweight vector store (FAISS)
- Basic retrieval logic
By the end, you’ll understand how to move from “stateless prompt wrapper” to a structured LLM system.
## Why Stateless LLM Apps Break in Production
Most basic LLM apps work like this:
- User sends input
- Input is sent to model
- Model responds
- Conversation disappears
Even if you store chat history, once you exceed the context window, you’re forced to truncate earlier messages.
Problems this creates:
- No long-term personalization
- No user memory
- Repeated explanations
- Poor multi-session experience
If you're building anything beyond a demo, you need persistent memory.
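To make the failure mode concrete, here is a minimal sketch of the naive pattern. The `MAX_TURNS` budget is a stand-in for a real token limit; the names are illustrative, not from any library.

```python
MAX_TURNS = 4  # stand-in for a real context-window token budget

history = []

def send(message):
    history.append({"role": "user", "content": message})
    # Truncate: keep only the most recent turns that "fit".
    del history[:-MAX_TURNS]
    return history  # this is everything the model would ever see

for i in range(1, 7):
    send(f"message {i}")

# "message 1" and "message 2" are gone for good -- along with
# whatever the user said in them.
print([m["content"] for m in history])
```

Once the budget is exceeded, the oldest turns are silently dropped. Any preference or fact stated early in the conversation simply stops existing.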
## What Is a Persistent Memory Layer?
A persistent memory layer:
- Stores meaningful interactions
- Converts them into embeddings
- Saves them in a vector database
- Retrieves relevant memories for future conversations
Instead of stuffing everything into context, you retrieve only what matters.
Architecture overview:

```
User Input
    ↓
Embed Input
    ↓
Store in Vector DB
    ↓
Retrieve Relevant Past Memories
    ↓
Build Context
    ↓
Send to LLM
```
Simple. Powerful.
## Step 1: Install Dependencies

We’ll use:

- openai
- faiss
- numpy

Install:

```bash
pip install openai faiss-cpu numpy
```
## Step 2: Create a Memory Store

Let’s build a minimal memory system.

```python
import faiss
import numpy as np
from openai import OpenAI

client = OpenAI()

dimension = 1536  # output size of text-embedding-3-small
index = faiss.IndexFlatL2(dimension)  # exact L2-distance index
memory_texts = []  # maps vector positions back to their original text
```

This creates a simple in-memory FAISS vector store.
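As an aside: `IndexFlatL2` does exact nearest-neighbour search, so at small scale a brute-force NumPy scan is functionally equivalent. Here is a sketch of such a fallback (the `NumpyMemoryIndex` class is hypothetical, not part of FAISS) in case `faiss` isn’t available in your environment:

```python
import numpy as np

class NumpyMemoryIndex:
    """Exact L2 nearest-neighbour search, mirroring IndexFlatL2 at small scale."""

    def __init__(self, dimension):
        self.vectors = np.empty((0, dimension), dtype="float32")

    def add(self, vecs):
        self.vectors = np.vstack([self.vectors, vecs.astype("float32")])

    def search(self, query, k):
        # Squared L2 distance from the query to every stored vector.
        dists = ((self.vectors - query) ** 2).sum(axis=1)
        order = np.argsort(dists)[:k]
        return dists[order], order

idx = NumpyMemoryIndex(dimension=3)
idx.add(np.array([[0, 0, 0], [1, 1, 1], [5, 5, 5]], dtype="float32"))
dists, ids = idx.search(np.array([1.1, 1.0, 0.9], dtype="float32"), k=2)
print(ids)  # nearest stored vector comes first
```

The linear scan stops being practical somewhere in the tens of thousands of vectors; that’s where FAISS’s optimized indexes earn their keep.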
## Step 3: Store Memories

Every time the user sends something meaningful, embed it and store it.

```python
def add_memory(text):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    embedding = np.array(response.data[0].embedding).astype('float32')
    index.add(np.array([embedding]))  # FAISS expects a 2D array
    memory_texts.append(text)
```

Now we can persist interactions semantically.
Example:

```python
add_memory("User prefers short technical explanations.")
add_memory("User is building a SaaS AI tool.")
```
## Step 4: Retrieve Relevant Memories

When the user sends a new query, embed it and search the vector index.

```python
def retrieve_memories(query, k=3):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    )
    query_embedding = np.array(response.data[0].embedding).astype('float32')
    distances, indices = index.search(np.array([query_embedding]), k)
    # FAISS pads with -1 when fewer than k vectors exist, so drop invalid IDs.
    return [memory_texts[i] for i in indices[0] if 0 <= i < len(memory_texts)]
```

Now we can pull relevant historical context.
## Step 5: Build the Context for the LLM

We combine:

- Retrieved memories
- The current user input

```python
def build_prompt(user_input):
    relevant_memories = retrieve_memories(user_input)
    memory_section = "\n".join(relevant_memories)
    return f"""
You are an AI assistant.

Relevant past information:
{memory_section}

Current user message:
{user_input}

Respond accordingly.
"""
```

This ensures the model receives structured context, not raw history.
## Step 6: Generate a Response

```python
def generate_response(user_input):
    prompt = build_prompt(user_input)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content
```

Now your app has semantic long-term memory.
## Why This Works
Instead of:
“Dump entire conversation into context”
We’re doing:
“Retrieve only relevant past knowledge”
This improves:
- Scalability
- Relevance
- Token efficiency
- Personalization
And most importantly, it shifts your app from demo-tier to architecture-tier.
## Common Pitfalls
### 1. Storing Everything
Don’t embed trivial small talk. Store meaningful information only.
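One way to do this is a cheap gate in front of `add_memory()`. This sketch is purely illustrative: the keyword set and word-count cutoff are arbitrary assumptions to tune for your own app, not a recommended vocabulary.

```python
# Illustrative filter: skip throwaway messages before paying for an embedding.
SMALL_TALK = {"hi", "hello", "thanks", "thank you", "ok", "lol"}

def is_worth_storing(text, min_words=4):
    stripped = text.strip().lower()
    if stripped in SMALL_TALK:
        return False
    # Very short messages rarely carry durable facts or preferences.
    return len(stripped.split()) >= min_words

print(is_worth_storing("thanks"))                                      # False
print(is_worth_storing("User prefers short technical explanations."))  # True
```

In production you might replace the heuristic with a classifier or an LLM call that decides what counts as a durable fact, but the shape stays the same: filter first, embed second.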
### 2. Memory Drift
Over time, irrelevant memories may surface. Consider tagging or pruning.
### 3. Cost Explosion
Embedding every interaction can become expensive. Add filtering logic.
### 4. Latency
Vector search is fast, but remote DB calls add delay. Optimize if needed.
## Taking This Further
You can improve this system by:
- Adding user IDs for multi-user support
- Using persistent storage (e.g., Pinecone, Weaviate, Redis)
- Creating memory types (preferences, facts, decisions)
- Adding time-decay weighting
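Time-decay weighting, for instance, can be as simple as discounting each memory’s similarity score by its age. This is a sketch under assumed numbers: the 72-hour half-life and the function name are invented for illustration.

```python
def decayed_score(similarity, age_hours, half_life_hours=72):
    # Halve a memory's weight every `half_life_hours`,
    # so fresher memories win ties against stale ones.
    decay = 0.5 ** (age_hours / half_life_hours)
    return similarity * decay

fresh = decayed_score(0.80, age_hours=0)    # full weight
stale = decayed_score(0.90, age_hours=720)  # a month old: heavily discounted
print(fresh > stale)
```

You would apply this at retrieval time: fetch more candidates than you need (say `k=10`), rescore them with the decay, and keep the top few.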
This article shows the core pattern. From here, you can productionize.
## Final Thoughts
Prompt engineering is not enough for serious AI products.
If your system forgets everything, it’s not intelligent — it’s reactive.
Adding a memory layer is one of the simplest architectural upgrades you can make to move beyond basic wrappers.
And the good news?
It’s not complicated.
It’s just structured design.