LangChain + Chroma: Multi-turn RAG Memory and Automated Testing That Turned 2-Hour Bugs Into 5-Minute Fixes

#python #rag #langchain #chroma

At 1 a.m., the customer group chat exploded: “Does your customer service bot have only a 7-second memory? I just gave it the order number, and the next turn it asks me again ‘Please provide the order number.’ I feel like I’m talking to a goldfish!”

I crawled out of bed and checked the logs. The RAG conversation memory module had lost all history after a service restart. When a user asked, “Can I refund that order I mentioned?”, the retriever couldn’t pull up the order number from earlier turns, so the answer was completely off. After fixing that bug, I realized that if every change to the memory logic relies on users to “test” it for me, the system is doomed. I had to make memory persistent and cover multi-turn scenarios with automated tests. This article is the hard-won summary of my post-mortem.

Problem Breakdown: Why Multi-turn RAG Memory Breaks So Easily

In multi-turn RAG, a user’s questions often depend on information from the previous turn, for example: “Look up order 12345” → “Can I get a refund?” The second question doesn’t contain the order number, but the LLM needs to know that the refund is about order 12345. The traditional approach is to stuff the full conversation history into the prompt, but two pain points are obvious:

Unreliable memory: ConversationBufferMemory stores history in process memory and loses everything on restart. The user is halfway through a conversation, you deploy a new version, and all context is gone.
Fixed window loses context: ConversationBufferWindowMemory only keeps the last K turns. If the user mentioned an order number more than K turns ago and then asks “Can I refund that order?”, the order number has already fallen outside the window and cannot be recovered.
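The fixed-window failure is easy to reproduce without LangChain. The sketch below is only an illustration: `build_window` and the sample turns are hypothetical, and a plain `deque` stands in for the window memory.

```python
from collections import deque

def build_window(turns, k):
    """Keep only the last k turns, the way a fixed-window memory would."""
    window = deque(maxlen=k)  # older items are dropped automatically
    for turn in turns:
        window.append(turn)
    return list(window)

# Hypothetical conversation: the order number appears only in the first turn.
turns = [
    "User: Look up order 12345",
    "User: What is your return policy?",
    "User: Do you ship overseas?",
    "User: How long does delivery take?",
    "User: Can I change my address?",
]

window = build_window(turns, k=3)
order_still_visible = any("12345" in turn for turn in window)
print(order_still_visible)  # False: the order number fell out of the window
```

Raising K only postpones the problem, because a long enough conversation will always push the early turn out.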

What’s worse, in multi-turn RAG, both the history and the new question must be vectorized to search the knowledge base. If the history is stored only as raw text without semantic indexing, there is no fast way to recall “that order number” among hundreds of chat records. A conventional Redis cache solves persistence, but it cannot recall relevant historical snippets by semantic meaning. That’s the root cause: we need a memory storage solution that is persistent, supports vector semantic retrieval, and integrates easily into the LangChain pipeline.

Solution Design: Why Chroma Instead of Rolling Vectors on Redis

Redis + vector plugin: You’d have to maintain your own inverted indexes, expiration policies, and manual serialization, which means high development cost.
FAISS + manual persistence: FAISS has no built-in persistence; you must save and load index files every time, and version mismatches can cause headaches.
Qdrant / Weaviate: Very powerful, but they require running separate services, a maintenance burden that’s hard to justify for small teams.
Chroma: An embedded vector database. A single pip install chromadb gets you metadata filtering, persistence, and native LangChain integration as both a vectorstore and retriever. LangChain in turn ships VectorStoreRetrieverMemory, a ready-made memory wrapper that works with any vectorstore retriever, including Chroma’s. Out of the box, you can store historical conversations in Chroma, recall the top-K most relevant history items by semantic similarity, and filter by metadata such as timestamp and user ID.

The architecture is simple: at the end of each conversation turn, concatenate the user’s question and the AI’s answer into a single document, compute its embedding, and store it in a Chroma collection with metadata like timestamp and session ID. When the next turn arrives, use the current question’s vector to retrieve the most similar N history items from Chroma, combine them with the last 3 raw messages as short-term memory, and feed the merged context to the LLM. This satisfies both “semantically similar history” and “recent time window” requirements.
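The merge step described above can be sketched without any external service. This is a minimal illustration, not the production code: `embed` is a toy bag-of-words stand-in for a real embedding model, and `build_context`, `top_n`, and `window` are hypothetical names chosen for this example.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_context(history, question, top_n=2, window=3):
    """Merge semantically similar older turns with the most recent raw turns."""
    recent = history[-window:]   # short-term memory: last `window` turns
    older = history[:-window]    # only these are searched semantically
    q = embed(question)
    ranked = sorted(older, key=lambda turn: cosine(embed(turn), q), reverse=True)
    semantic = [turn for turn in ranked[:top_n] if cosine(embed(turn), q) > 0]
    return semantic + recent

history = [
    "User: Look up order 12345\nAI: Order 12345 shipped on Monday.",
    "User: Do you ship overseas?\nAI: Yes, to most countries.",
    "User: How long does delivery take?\nAI: Three to five days.",
    "User: Can I change my address?\nAI: Yes, before dispatch.",
    "User: What payment methods do you accept?\nAI: Cards and PayPal.",
]
context = build_context(history, "Can I refund that order?", top_n=1)
print(context[0])  # the turn that mentions order 12345
```

Searching only the turns outside the recent window keeps the two sources from returning the same turn twice.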

For automated testing, we use pytest to write a fixed test suite: simulate 5 consecutive turns, insert them into Chroma, then verify that on turn 6 the system can recall the specific information mentioned in turn 2. Every time we modify the memory strategy, running the tests instantly tells us whether we’ve introduced a regression.
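The shape of such a regression test might look like the sketch below. To keep it runnable offline, it uses `KeywordMemory`, a hypothetical in-memory stand-in that ranks turns by shared words; its `recall` method and the `SCRIPT` data are assumptions made for this example, and a real suite would swap in the Chroma-backed memory.

```python
import re

def _tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

class KeywordMemory:
    """Hypothetical offline stand-in for the Chroma-backed memory."""

    def __init__(self, k=2):
        self.k = k
        self.turns = []

    def save_context(self, user_input, ai_output):
        self.turns.append(f"User: {user_input}\nAI: {ai_output}")

    def recall(self, question):
        """Return up to k stored turns that share at least one word with the question."""
        q = _tokens(question)
        ranked = sorted(self.turns, key=lambda t: len(q & _tokens(t)), reverse=True)
        return [t for t in ranked[: self.k] if q & _tokens(t)]

# Five scripted turns; only turn 2 mentions the order number.
SCRIPT = [
    ("Hi there", "Hello! What do you need?"),
    ("Look up order 12345", "Order 12345 shipped on Monday."),
    ("Do you ship overseas?", "Yes, to most countries."),
    ("How long does delivery take?", "Three to five days."),
    ("What payment methods do you accept?", "Cards and PayPal."),
]

def test_turn6_recalls_turn2():
    memory = KeywordMemory(k=2)
    for user_input, ai_output in SCRIPT:
        memory.save_context(user_input, ai_output)
    recalled = memory.recall("Can I refund that order?")  # turn 6
    assert any("12345" in turn for turn in recalled)
```

Running pytest on the file collects `test_turn6_recalls_turn2` automatically; the scripted turns stay fixed so that any failure points at the memory strategy rather than at the test data.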

Core Implementation: Three Code Blocks to Nail Memory Storage and Testing

1. Wrapping Chroma into a Memory Class with Time Filtering

This code solves the problem of blending history storage, semantic recall, and a time window. By building on VectorStoreRetrieverMemory, we can filter by metadata after retrieval and then prepend the most recent raw messages.

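import uuid
from datetime import datetime
from typing import List, Dict, Any
# These import paths match older LangChain releases; newer releases move
# OpenAIEmbeddings and Chroma into the langchain_openai and langchain_community packages.
from langchain.memory import VectorStoreRetrieverMemory
from langchain.schema import Document
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma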

class ChromaMemory:
    """Store multi-turn chat history in Chroma and recall it with a hybrid of semantic search and a recency window."""
    def __init__(self, collection_name: str = "chat_history", k: int = 4, window_size: int = 3,
                 session_id: str = "default"):
        self.embeddings = OpenAIEmbeddings()  # 1536-dimensional vectors throughout
        self.vectorstore = Chroma(
            collection_name=collection_name,
            embedding_function=self.embeddings,
            persist_directory="./chroma_db"   # persist to disk so history survives restarts
        )
        self.retriever = self.vectorstore.as_retriever(search_kwargs={"k": k})
        self.memory = VectorStoreRetrieverMemory(retriever=self.retriever)
        self.window_size = window_size        # always keep the last N raw messages
        self.session_id = session_id          # tags every stored turn; same "default" value as before
        self.recent_history: List[str] = []

    def save_context(self, user_input: str, ai_output: str) -> None:
        """After each interaction, store one document in Chroma and update the recent-history window."""
        doc = Document(
            page_content=f"User: {user_input}\nAI: {ai_output}",
            metadata={
                "timestamp": datetime.now().isoformat(),
                "session_id": self.session_id
            }
        )
        self.vectorstore.add_documents([doc])
        # Maintain the sliding window of recent raw messages
        self.recent_history.append(f"User: {user_input}\nAI: {ai_output}")
        if len(self.recent_history) > self.window_size:
            self.recent_history.pop(0)