12 Hours Debugging Memory Drift in a LangChain+Chroma Code QA Bot

#python #programming

At 2 a.m., I stared at the nonsensical reply in my terminal and felt a chill down my spine. A moment earlier I had told the bot to “only look at the auth module”, and now it was recommending a color scheme for a frontend button. Building a local codebase Q&A bot with accurate long-conversation memory is far trickier than it seems.

Problem Deconstruction

Our scenario was clear: a Python monorepo with hundreds of thousands of lines, business logic scattered across hundreds of files. New hires and senior devs troubleshooting cross-module calls all wanted a bot that could “understand the entire codebase” and engage in multi-turn conversations—like chatting with a senior developer, not a Ctrl+F session from scratch.

The typical approach is to build a RAG pipeline with LangChain: split code files → embed → store in a vector DB → upon user query, retrieve similar chunks → stitch them into a prompt for the LLM to generate an answer. You can find tons of tutorials and have it running in 10 minutes. But in practice, as soon as the conversation exceeds three turns, the bot starts "forgetting"—mixing up previously mentioned file paths, forgetting already confirmed context, or even blending two separate questions into one answer.

The root causes are twofold:

The default ConversationBufferMemory blindly stuffs the entire message history into the prompt. Once the prompt outgrows the token limit, something has to be cut, and what gets cut is often exactly the part carrying critical context.
The retrieval stage completely ignores conversation history and searches the vector store with only the latest user question. A referential query like "modify the login function in that file" therefore retrieves nothing useful, because the query text never names the file.
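Both failure modes can be reproduced without any LLM. The sketch below is only an illustration, not LangChain's actual trimming logic: it assumes a newest-first truncation policy, uses whitespace word counts as a crude stand-in for tokens, and the path auth/views.py is a hypothetical example.

```python
from typing import List


def truncate_to_budget(messages: List[str], max_tokens: int) -> List[str]:
    """Keep the newest messages that fit the budget, dropping the oldest first."""
    kept: List[str] = []
    used = 0
    for message in reversed(messages):
        cost = len(message.split())  # crude token proxy: whitespace-separated words
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))


history = [
    "user: only look at the auth module",
    "bot: understood, restricting answers to auth",
    "user: where is the login function defined",
    "bot: in auth/views.py",  # hypothetical path, for illustration only
    "user: modify the login function in that file",
]

# Failure 1: the scoping instruction from the first turn no longer fits.
prompt_history = truncate_to_budget(history, max_tokens=20)

# Failure 2: retrieval sees only the last message, which never names the file.
retrieval_query = history[-1]

print(prompt_history)
print(retrieval_query)
```

With a budget of 20 "tokens", the first two messages are dropped, so the model never sees the instruction to stay inside the auth module, and the retrieval query contains "that file" instead of a path.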

The vanilla setup works for demos but collapses under real conversations. I needed a code Q&A bot that truly remembers context, plus an automated way to keep verifying memory accuracy. Otherwise every refactor of the memory logic would mean eyeballing answers by hand, and sooner or later a regression would slip through.

Solution Design

Core stack: LangChain + Chroma + custom layered memory.

Why not Pinecone or Weaviate? This is a local codebase whose data must stay on-prem, and during development we frequently destroy and rebuild indexes. Embedded Chroma (SQLite plus an in-memory HNSW index) is lightweight and fast enough, has zero deployment dependencies, and persists to a single directory, so a backup is as simple as copying the folder.

For memory, I did not use LangChain's built-in ConversationSummaryMemory, because summarization loses details such as an exact module path, which is fatal for code Q&A. Instead I designed a two-layer memory structure:

Short-term window: keep the last N full conversation turns without compression to preserve referential context.
Long-term compression: for turns beyond the window, use an LLM to extract key decisions (e.g., "user is investigating concurrency issues in the payment module") and inject them as a fixed prefix into the prompt.
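The two layers can be sketched in plain Python before wiring them into LangChain. This is a minimal illustration under assumptions, not the implementation used later in this post: the class name LayeredMemory, the extract_fact callback, and the "Known context:" prefix are all hypothetical, and the callback here is a trivial stub where a real version would call an LLM.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

Turn = Tuple[str, str]  # (user_message, bot_reply)


class LayeredMemory:
    """Short-term window of full turns plus long-term extracted facts."""

    def __init__(self, window_size: int, extract_fact: Callable[[Turn], str]):
        self.window_size = window_size
        self.extract_fact = extract_fact  # hypothetical hook; an LLM call in practice
        self.window: Deque[Turn] = deque()
        self.long_term: List[str] = []

    def add_turn(self, user: str, bot: str) -> None:
        self.window.append((user, bot))
        # Turns that fall out of the window are compressed, not discarded.
        while len(self.window) > self.window_size:
            evicted = self.window.popleft()
            self.long_term.append(self.extract_fact(evicted))

    def render(self) -> str:
        """Build the history section of the prompt: facts first, then full turns."""
        lines: List[str] = []
        if self.long_term:
            lines.append("Known context:")
            lines.extend(f"- {fact}" for fact in self.long_term)
        for user, bot in self.window:
            lines.append(f"User: {user}")
            lines.append(f"Assistant: {bot}")
        return "\n".join(lines)


# Stub extractor: keeps the user's message verbatim instead of calling an LLM.
memory = LayeredMemory(window_size=2, extract_fact=lambda turn: f"user said: {turn[0]}")
memory.add_turn("only look at the auth module", "understood")
memory.add_turn("where is login defined", "in the auth package")
memory.add_turn("what does it return", "a session token")
print(memory.render())
```

After the third turn, the first turn leaves the window but survives as a one-line fact in the prefix, so the scoping instruction is still visible to the model.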

Most importantly, the retrieval query must be rewritten. Before hitting the vector store, I feed the user's raw question along with the last three turns into the LLM to produce a de-referenced standalone query. For example: "What exception did it throw?" becomes "What exception did the process_order function in payment/service.py throw?".
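The rewriting step might look roughly like the sketch below. The prompt wording, the function names build_rewrite_prompt and rewrite_query, and the fake_llm stub are assumptions made for illustration; llm is any callable that maps a prompt string to a completion string, so a real model client can be dropped in.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (user_message, bot_reply)

# Hypothetical prompt wording; tune it for the model actually in use.
REWRITE_TEMPLATE = (
    "Rewrite the final question so it can be understood without the "
    "conversation. Replace pronouns and vague references with the exact "
    "file paths and function names mentioned earlier. Return only the "
    "rewritten question.\n\n"
    "Conversation:\n{history}\n\n"
    "Final question: {question}\n"
    "Standalone question:"
)


def build_rewrite_prompt(question: str, turns: List[Turn], max_turns: int = 3) -> str:
    """Format the last max_turns turns plus the raw question into one prompt."""
    recent = turns[-max_turns:] if max_turns > 0 else []
    history = "\n".join(f"User: {u}\nAssistant: {a}" for u, a in recent)
    return REWRITE_TEMPLATE.format(history=history, question=question)


def rewrite_query(
    question: str,
    turns: List[Turn],
    llm: Callable[[str], str],
    max_turns: int = 3,
) -> str:
    """Return a standalone retrieval query, falling back to the raw question."""
    if not turns:
        return question  # nothing to resolve on the first turn
    rewritten = llm(build_rewrite_prompt(question, turns, max_turns)).strip()
    return rewritten or question


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model so the sketch runs offline."""
    return "What exception did the process_order function in payment/service.py throw?"


turns = [("Show me process_order in payment/service.py", "Here is the function ...")]
print(rewrite_query("What exception did it throw?", turns, fake_llm))
```

The rewritten string, not the raw question, is what gets sent to the vector store. The fallback to the original question keeps retrieval working if the model returns an empty rewrite.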

For automated testing, I designed a set of fixed conversation scripts, each ending with a question that requires memory of earlier context; then a script compares the answer with expected output to verify memory accuracy.
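A bare-bones version of such a harness is sketched below. It makes several assumptions that are not from the original setup: the MemoryCase structure, the case-insensitive substring check as the comparison rule, and the make_bot factory that returns a fresh ask(message) callable per conversation. The remembering bot is only a stand-in so the harness can be run without a model.

```python
from dataclasses import dataclass
from typing import Callable, List

Bot = Callable[[str], str]  # takes one user message, returns the reply


@dataclass
class MemoryCase:
    name: str
    setup_turns: List[str]          # earlier messages that establish context
    final_question: str             # can only be answered using that context
    expected_substrings: List[str]  # must all appear in the final answer


def run_case(make_bot: Callable[[], Bot], case: MemoryCase) -> bool:
    """Replay one scripted conversation on a fresh bot and check the last answer."""
    ask = make_bot()
    for message in case.setup_turns:
        ask(message)
    answer = ask(case.final_question).lower()
    return all(s.lower() in answer for s in case.expected_substrings)


def memory_accuracy(make_bot: Callable[[], Bot], cases: List[MemoryCase]) -> float:
    """Fraction of scripted conversations whose final answer passes."""
    if not cases:
        return 0.0
    return sum(run_case(make_bot, c) for c in cases) / len(cases)


def make_remembering_bot() -> Bot:
    """Stand-in bot that 'answers' by repeating everything it has been told."""
    seen: List[str] = []

    def ask(message: str) -> str:
        seen.append(message)
        return " | ".join(seen)

    return ask


CASES = [
    MemoryCase(
        name="scope restriction survives",
        setup_turns=["Only look at the auth module.", "Where is login defined?"],
        final_question="Which module are we restricted to?",
        expected_substrings=["auth"],
    ),
    MemoryCase(
        name="pronoun resolves to earlier function",
        setup_turns=["We are debugging process_order in payment/service.py."],
        final_question="What exception did it throw?",
        expected_substrings=["process_order"],
    ),
]

print(memory_accuracy(make_remembering_bot, CASES))
```

Swapping make_remembering_bot for a factory that builds the real chain turns this into a regression check that can run after every change to the memory logic. Substring matching is deliberately strict and cheap; a looser comparison may be needed for free-form answers.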

Core Implementation

1. Code Indexing into Chroma (persisted locally)

This code converts code files into searchable vectors. The key is the splitting strategy: Language.PYTHON's separators make the splitter prefer function and class boundaries when cutting chunks.

from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
import os

# Load the code directory, excluding test files and __pycache__ directories
loader = DirectoryLoader(
    "./src",
    glob="**/*.py",
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"},
    exclude=["**/test_*", "**/__pycache__/*"],
)
docs = loader.load()

# Split on Python syntax boundaries: chunk size 1000, with 200 characters of overlap to preserve context
splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1000,
    chunk_overlap=200,
)
chunks = splitter.split_documents(docs)

# Use a local embedding model to avoid network dependencies
embedding_model = HuggingFaceEmbeddings(
    model_name="BAAI/bge-base-zh-v1.5",
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True},
)

# Store in Chroma; specifying a persist directory keeps the index across restarts
vectordb = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory="./chroma_code_index",
    collection_name="code_base",
)
vectordb.persist()  # flush the index to disk (newer Chroma releases persist automatically)

2. Q&A Chain with Dual-Layer Memory

This code tackles "how to make the LLM remember history without blowing the token limit" by combining a short-term window with long-term compression, and injecting a retrieval rewriting step.

from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferWindowMemory, ConversationSummaryMemory
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Short-term memory: keep the last 4 full conversation turns
short_memory = ConversationBufferWindowMemory(
    k=4,
    memor