zhongqiyue

Posted on Jun 27

Why my AI assistant couldn't remember a thing and how I fixed it

#webdev #python #ai #tutorial

I had a problem. I built an AI assistant for our team’s internal documentation—something that could answer questions about our APIs, deployment guides, and best practices. The first few interactions were magic. But then the assistant started forgetting things. It would answer my question correctly, then five messages later contradict itself. It was like talking to a very smart goldfish.

I spent a weekend tearing my hair out, trying different solutions. What I learned might save you the same headache.

The problem: context doesn't scale

I was using a naive approach: resending the entire conversation history with every request. For each new user message, I’d append it and the previous assistant response to the message list and re-submit the whole thing. It worked fine for 2-3 turns. But after that, either the request hit the token limit or the model lost track of the earlier context.

Here’s roughly what my code looked like:

import openai  # written against the pre-1.0 SDK, where openai.ChatCompletion exists

# The full history lives in one module-level list that is resent on every call
messages = [{"role": "system", "content": "You are a helpful assistant for our docs."}]

def ask(question):
    messages.append({"role": "user", "content": question})
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=messages  # grows by two entries per turn
    )
    reply = response['choices'][0]['message']['content']
    messages.append({"role": "assistant", "content": reply})
    return reply

Simple, but doomed. The messages list grew unbounded. Soon enough I’d hit the 4K token window for gpt-3.5-turbo. And even before that, the model would start hallucinating or forgetting earlier instructions.
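
To make the growth concrete, here is a back-of-the-envelope sketch. The 150 tokens per message and the 50-token system prompt are made-up numbers for illustration, not measurements from my assistant, and the helper names are hypothetical.

```python
# Prompt size when the whole history is resent on every turn.
# TOKENS_PER_MESSAGE and SYSTEM_TOKENS are illustrative assumptions.
TOKENS_PER_MESSAGE = 150
SYSTEM_TOKENS = 50
CONTEXT_LIMIT = 4096  # the 4K window mentioned above


def prompt_tokens(turn):
    """Tokens sent on a given turn (1-based), before the model replies."""
    # Each earlier turn left behind a user message and an assistant reply;
    # the current turn adds one more user message.
    prior_messages = 2 * (turn - 1)
    return SYSTEM_TOKENS + (prior_messages + 1) * TOKENS_PER_MESSAGE


def first_turn_over_limit(limit=CONTEXT_LIMIT):
    """First turn whose prompt alone no longer fits in the window."""
    turn = 1
    while prompt_tokens(turn) <= limit:
        turn += 1
    return turn


def cumulative_tokens(turns):
    """Total prompt tokens sent across the first `turns` requests."""
    return sum(prompt_tokens(t) for t in range(1, turns + 1))
```

Under these assumptions the prompt alone outgrows the window on turn 14, and because every request repeats all earlier messages, the total number of tokens sent grows quadratically with the number of turns.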

What I tried that didn't work

1. Truncating the history

I tried keeping only the last N messages. But then the assistant would lose track of the initial context (e.g., "our APIs are RESTful"). If the user asked a question that depended on something said 10 turns ago, it was gone.
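
A minimal sketch of that truncation, assuming the same message-dict format as the code above. The function name and the default window size are hypothetical, not what I actually shipped.

```python
def truncate_history(messages, max_messages=6):
    """Keep system messages plus the last `max_messages` other messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # Guard against max_messages == 0, since rest[-0:] would return everything.
    recent = rest[-max_messages:] if max_messages > 0 else []
    return system + recent


history = [
    {"role": "system", "content": "You are a helpful assistant for our docs."},
    {"role": "user", "content": "Our APIs are RESTful."},
    {"role": "assistant", "content": "Noted."},
]
for i in range(5):
    history.append({"role": "user", "content": f"Question {i}"})
    history.append({"role": "assistant", "content": f"Answer {i}"})

kept = truncate_history(history, max_messages=4)
```

After truncation, `kept` holds the system message and the last four messages only. The "Our APIs are RESTful." turn is no longer there, which is exactly the failure described above.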

2. Summarizing the conversation periodically

I used a separate call to GPT to summarize the conversation into a short paragraph and replace the old history with that summary. That worked… until the summary itself became too long after many turns. Also, summaries lose details. My assistant couldn't recall specific configuration options mentioned earlier.
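
Roughly, the summarizing step looked like the sketch below. `summarize` is a hypothetical callable standing in for the separate GPT call, and the sketch assumes the first message in the list is the system prompt.

```python
def compact_history(messages, summarize, keep_recent=4):
    """Replace everything but the most recent messages with one summary.

    `summarize` is a hypothetical callable (list of messages -> str); in the
    real assistant it would be a separate model call.
    Assumes messages[0] is the system prompt.
    """
    system, rest = messages[:1], messages[1:]
    if len(rest) <= keep_recent:
        return list(messages)  # nothing old enough to summarize yet
    split = len(rest) - keep_recent
    old, recent = rest[:split], rest[split:]
    summary = {
        "role": "system",
        "content": "Summary of earlier conversation: " + summarize(old),
    }
    # A summary from a previous pass sits in `rest`, so it gets folded into
    # the next summary instead of piling up.
    return system + [summary] + recent
```

Whatever `summarize` leaves out is gone for good, which is why specific configuration options disappeared after a few rounds of compaction.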

3. Using a larger model (gpt-4)

More tokens, more cost. And even then, gpt-4 has an 8K limit, which is still finite. For long conversations (like debugging a complex issue), it would eventually hit the wall. Plus, it was expensive for a team tool.

What finally worked: retrieval-augmented memory

I realized the problem was that I was trying to fit everything into a single prompt. Instead, I needed to store the conversation in an external memory and retrieve only the relevant parts at each turn.

This is essentially the same pattern used in RAG (Retrieval-Augmented Generation), but applied to conversation history rather than a static document store.

The idea:

Every time the assistant responds, embed both the user question and the assistant answer (or just the important parts).
Store those embeddings in a vector database.
On each new query, retrieve the most relevant past interactions by cosine similarity.
Insert those retrieved pieces into the prompt as additional context.

This way, the model only gets a small window of the most pertinent history, not the entire conversation. It’s like giving the assistant an index of memories instead of replaying a full log.
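
Before any vector database enters the picture, the retrieval step can be sketched in plain Python. The three-dimensional vectors below are hand-picked stand-ins for real embeddings, and the memory texts are invented for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, memory, k=2):
    """Return the k memory texts most similar to query_vec.

    `memory` is a list of (text, vector) pairs.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in memory]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy vectors standing in for real embeddings (hand-picked, not model output).
memory = [
    ("User: Our APIs are RESTful.", [0.9, 0.1, 0.0]),
    ("User: Deploys go through the staging cluster.", [0.1, 0.9, 0.1]),
    ("User: Lunch is at noon.", [0.0, 0.1, 0.9]),
]
query = [0.8, 0.2, 0.1]  # pretend embedding of "What style are our APIs?"
```

Here `top_k(query, memory, k=1)` returns the RESTful memory, because its vector points in nearly the same direction as the query. Only that text would be inserted into the prompt.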

Implementation: a minimal memory manager

I used FAISS for vector storage (quick to set up) and OpenAI’s text-embedding-ada-002 to generate embeddings. Here’s a simplified version of what I built:

import openai  # pre-1.0 SDK interface (openai.Embedding / openai.ChatCompletion)
import faiss
import numpy as np

class MemoryManager:
    def __init__(self):
        self.index = faiss.IndexFlatL2(1536)  # ada-002 embedding dimension
        self.memories = []  # original texts, in the same order as the index rows

    def add(self, text):
        """Add a piece of conversation to memory."""
        emb = self._embed(text)
        # FAISS expects a 2-D float32 array of shape (n, dim)
        self.index.add(np.array([emb], dtype="float32"))
        self.memories.append(text)

    def retrieve(self, query, k=3):
        """Get top-k relevant memory pieces."""
        if self.index.ntotal == 0:
            return []  # nothing stored yet, so skip the embedding call
        query_emb = self._embed(query)
        distances, indices = self.index.search(np.array([query_emb], dtype="float32"), k)
        # FAISS pads the result with -1 when fewer than k vectors are stored
        return [self.memories[idx] for idx in indices[0] if idx != -1]

    def _embed(self, text):
        resp = openai.Embedding.create(
            model="text-embedding-ada-002",
            input=text
        )
        return resp['data'][0]['embedding']
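
One detail worth checking: the idea above talks about cosine similarity, while IndexFlatL2 ranks by squared Euclidean distance. For unit-length vectors the two give the same ordering, since ||a - b||^2 = 2 - 2 * cos(a, b), and OpenAI documents its embeddings as normalized to length 1. The sketch below verifies the identity on hand-made vectors (not real embeddings); the helper names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    # For unit vectors the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


query = normalize([0.8, 0.2, 0.1])
candidates = [
    normalize(v) for v in ([0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.1, 0.9])
]

for c in candidates:
    # Identity for unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
    assert math.isclose(
        squared_l2(query, c), 2 - 2 * cosine(query, c), abs_tol=1e-12
    )

# Smallest distance first vs. largest cosine first: same ordering.
by_l2 = sorted(range(len(candidates)), key=lambda i: squared_l2(query, candidates[i]))
by_cos = sorted(range(len(candidates)), key=lambda i: -cosine(query, candidates[i]))
```

If you swap in an embedding model that does not return unit-length vectors, normalize before adding them to the index, or the L2 ranking can drift away from the cosine ranking.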

And here’s how I integrated it into the assistant loop:

memory = MemoryManager()

def ask_with_memory(question):
    # Retrieve relevant memories
    past = memory.retrieve(question, k=3)
    context = "\n\n".join(past)

    # Build system message with context
    system = f"You are a helpful assistant. Here is relevant past conversation:\n{context}"
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": question}
    ]

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=messages
    )
    reply = response['choices'][0]['message']['content']

    # Store the new interaction
    memory.add(f"User: {question}")
    memory.add(f"Assistant: {reply}")

    return reply

This approach solved the forgetting problem. The assistant now consistently remembers details from earlier in the conversation, even after 30+ turns. The token consumption stays roughly flat—each query only includes the top 3 relevant memories plus the current question.

Lessons learned and trade-offs

Retrieval isn't perfect. If the user asks something that doesn’t have a close embedding match to past memories, the assistant might still forget. I mitigated this by increasing k to 5 for longer conversations, but then the prompt grows.
Embedding cost. Embedding every interaction adds latency (about 100ms per embed with OpenAI) and an extra API call per stored message. For real-time chat, I had to batch or cache embeddings. A local embedding model (e.g., sentence-transformers) avoids per-call API costs but can be slower, depending on your hardware.
Memory pruning. The FAISS index grows over time. I added a maximum memory size (500 entries) and remove the oldest entries when it’s full. That is first-in, first-out eviction rather than a true LRU, since it ignores how recently a memory was retrieved, but it works.
What if the user asks something off-topic? The retrieval might return irrelevant past chats. I set a similarity threshold to discard low-score results.
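
The pruning and threshold ideas can be sketched together in plain Python. This is a stand-in that uses a list of vectors rather than my FAISS index, and the 0.75 cutoff is a hypothetical value you would need to tune on real data.

```python
import math
from collections import deque

MAX_MEMORIES = 500     # the cap mentioned above
MIN_SIMILARITY = 0.75  # hypothetical cutoff, not a recommended value


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class BoundedMemory:
    def __init__(self, max_size=MAX_MEMORIES):
        # A deque with maxlen drops the oldest entry once it is full.
        self.entries = deque(maxlen=max_size)

    def add(self, text, vector):
        self.entries.append((text, vector))

    def retrieve(self, query_vec, k=3, min_similarity=MIN_SIMILARITY):
        """Top-k texts whose similarity clears the threshold (may be empty)."""
        scored = [(_cosine(query_vec, vec), text) for text, vec in self.entries]
        scored = [pair for pair in scored if pair[0] >= min_similarity]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]
```

An off-topic query then gets an empty list back instead of three weak matches, and the prompt builder can simply leave the memory section out.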

What I'd do differently next time

I wish I had started with this pattern from the beginning, instead of chasing token limits. Also:

Use a dedicated vector database like Chroma or Pinecone (better persistence).
Store timestamps and let the model use recency as a signal.
Experiment with different text chunking. The code above stores each question and each answer as its own memory; storing a Q/A pair as a single entry, or splitting long answers into smaller chunks, might improve retrieval precision.
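
For the timestamp idea, one option is to handle recency in the retriever rather than leaving it to the model: blend each candidate's similarity with an exponential decay on its age. This is an untested sketch of something I have not built. The half-life, the `recency_bias` knob, and the function names are all hypothetical.

```python
HALF_LIFE_SECONDS = 3600.0  # hypothetical: a memory's weight halves every hour


def recency_weight(age_seconds, half_life=HALF_LIFE_SECONDS):
    """Exponential decay: 1.0 for a brand-new memory, 0.5 after one half-life."""
    return 0.5 ** (age_seconds / half_life)


def rerank(candidates, now, recency_bias=0.3):
    """Order candidates by a blend of similarity and recency.

    `candidates` is a list of (text, similarity, timestamp) tuples, with
    timestamps in seconds. `recency_bias` lies in [0, 1]; 0 ignores time.
    """
    def score(item):
        _, similarity, timestamp = item
        weight = recency_weight(max(0.0, now - timestamp))
        return (1 - recency_bias) * similarity + recency_bias * weight

    ranked = sorted(candidates, key=score, reverse=True)
    return [text for text, _, _ in ranked]
```

With `recency_bias=0` the order is purely by similarity. Raising it lets a slightly less similar but much newer memory win, which is useful when a setting mentioned early on was changed later in the conversation.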

What about hosted solutions?

There are services that handle this kind of memory management out of the box. For example, Interwest AI's assistant platform uses a similar retrieval-based memory under the hood. But honestly, building it myself taught me the trade-offs firsthand. If you’re scaling a production app, consider a managed solution. For a side project or internal tool, rolling your own with FAISS is totally fine.

The takeaway

Your AI assistant doesn’t need to remember everything. It just needs to remember the right things. Context retrieval beats context stuffing every time.

What's your setup for handling long conversations with LLMs? Have you tried a similar memory approach, or do you use something entirely different? Drop your experience in the comments—I’d love to compare notes.

DEV Community