It was 1 a.m. when the product manager dropped a screenshot in the group chat: “A user chatted for 20 minutes, refreshed the page, and lost all their history. Did you guys even implement the memory feature?” My stomach tightened — this was the third memory-loss report this week. What stung more was that we did write tests, but our manual multi-turn conversation test cases never touched the browser’s refresh button. I decided to write an automated test suite that actually mimics real user behavior, using Playwright and LangChain, specifically targeting memory persistence. Not only did I reproduce the bug, I followed the breadcrumbs and unearthed three hidden issues. Here’s the full post-mortem.
Why Memory Persistence Is So Hard to Test
The scenario is classic: a user opens a chat page, has a long multi-turn conversation, and at some point refreshes the page, closes and reopens the tab, or even backgrounds the app on mobile. The AI must remember the previous context — no lost history, no session mix-ups. Our chat backend uses LangChain’s ConversationBufferMemory for memory management. The frontend is an SPA, bound to a session_id on the backend.
Standard tests only cover "continuous conversation within a single page load," because manually simulating complex refresh timings is brutal, not to mention verifying consistency among localStorage, sessionStorage, cookies, and backend memory. We'd considered automating this before: the team had tried Selenium, but page reloads caused timeout after timeout, and multi-tab scenarios turned into a callback nightmare with a maintenance cost through the roof.
The root cause: testing memory persistence is fundamentally a stateful, cross-session, timing-sensitive E2E scenario. You must simultaneously drive the browser UI and inspect backend state — you can’t have one without the other. That’s why pure API tests (e.g., only hitting /chat) never catch the bug: when a user refreshes the page, can the frontend correctly re-fetch history from the backend? Will the session_id be wiped? Does the backend memory regress due to a serialization error? You have to let a real browser walk through it.
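For contrast, here's roughly the shape of an API-only test (a simplified sketch; the /chat payload matches the FastAPI service shown later, everything else is illustrative). It keeps passing as long as backend memory works for a session_id the test itself controls, so it can never notice the frontend dropping that session_id on refresh.

# test_chat_api_only.py -- the kind of test that never catches the refresh bug
import requests

BASE_URL = "http://localhost:8000"  # assumed local dev address of the chat backend

def test_multi_turn_memory_api_only():
    session_id = "api-test-session"  # fixed id, handed to the backend by the test itself

    # Turn 1: give the model a fact to remember.
    r1 = requests.post(f"{BASE_URL}/chat", json={
        "session_id": session_id,
        "content": "My favorite color is teal.",
    })
    assert r1.status_code == 200

    # Turn 2: ask about the fact. The same session_id goes out again, so backend
    # memory looks fine -- but no browser, no refresh, no localStorage is involved.
    r2 = requests.post(f"{BASE_URL}/chat", json={
        "session_id": session_id,
        "content": "What is my favorite color?",
    })
    assert "teal" in r2.json()["reply"].lower()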
Solution Design: A Playwright + LangChain Memory Testing Sandbox
I needed a test harness that was quick to set up, could plug in different memory backends, and accurately simulate real user actions. Here's how I chose the stack:
- Why Playwright over Selenium or Cypress: Playwright natively supports multiple pages and browser contexts, auto-waits for elements, and lets you inject scripts to manipulate cookies and localStorage directly. That's a hard requirement for scenarios like "refresh the page and reload history." Selenium's wait strategies are too primitive, and Cypress's multi-tab support is limited, so the choice was easy.
- Why LangChain: Not just hype. LangChain's memory abstractions are genuinely useful here: you can swap ConversationBufferMemory's in-memory storage for a Redis-backed message history with a one-line change, which makes it easy to test behavioral differences across persistence strategies. Its built-in message history interface also let me assert memory content directly in tests instead of scraping the frontend DOM for history records.
- Architecture: A simple FastAPI chat endpoint wrapping a LangChain ConversationChain. It accepts a session_id and a user message, and returns an AI reply. Playwright test scripts simulate user interactions and use page.evaluate() to read and write the session_id in the frontend's localStorage, even simulating edge cases like corrupted storage (see the sketch after this list).
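To make that last point concrete, here is a minimal sketch of driving localStorage from a test (Playwright's sync Python API; the app URL and the chat_session_id key are assumptions about our frontend, not anything Playwright dictates):

# sketch: poking at the frontend's session_id from a Playwright test
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("http://localhost:3000")  # assumed dev URL of the chat SPA

    # Read the session_id the frontend stored (the key name is our app's convention).
    session_id = page.evaluate("() => localStorage.getItem('chat_session_id')")

    # Simulate corrupted storage, then refresh and observe how the app recovers.
    page.evaluate("() => localStorage.setItem('chat_session_id', '{not-json')")
    page.reload()

    browser.close()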
Core Implementation: Building a Testable Framework from Scratch
1. Chat Service: Exposing Memory as Assertable State
The code below solves the problem: “How do I make backend memory usable in real scenarios, yet precisely assertable in tests?” I wrapped a ConversationChain in FastAPI, with a dictionary holding the memory instance for each session. This allows a test-only endpoint to directly retrieve memory content — no dependency on the frontend DOM.
# chat_server.py
from fastapi import FastAPI
from pydantic import BaseModel
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain
from langchain.chat_models import ChatOpenAI

app = FastAPI()

# Chain instances per session. A real production setup would use Redis;
# an in-memory dict is enough for this demo.
chains = {}

class Message(BaseModel):
    session_id: str
    content: str

def get_or_create_chain(session_id: str):
    if session_id not in chains:
        memory = ConversationBufferMemory(memory_key="history", return_messages=True)
        llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
        chains[session_id] = ConversationChain(llm=llm, memory=memory, verbose=False)
    return chains[session_id]

@app.post("/chat")
def chat(msg: Message):
    chain = get_or_create_chain(msg.session_id)
    response = chain.run(msg.content)
    return {"reply": response}

# Test helper: expose memory contents directly so tests don't have to parse the frontend.
@app.get("/memory/{session_id}")
def get_memory(session_id: str):
    chain = chains.get(session_id)
    if not chain:
        return {"messages": []}
    # With return_messages=True, ConversationBufferMemory's buffer is the list of messages.
    messages = chain.memory.buffer
    return {"messages": [{"role": m.type, "content": m.content} for m in messages]}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
Once this service is running, any UI action driven by Playwright can later assert backend memory directly via the /memory/{session_id} endpoint — making the test clean and deterministic.
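Here is a sketch of the refresh scenario end to end. It assumes the SPA runs on localhost:3000, exposes a #message-input box and .ai-message bubbles, and stores its id under chat_session_id; those names are stand-ins for our frontend, while /chat and /memory come from the service above.

# test_refresh_persistence.py -- sketch of the refresh scenario, run with pytest
import requests
from playwright.sync_api import sync_playwright

APP_URL = "http://localhost:3000"   # assumed frontend dev server
API_URL = "http://localhost:8000"   # FastAPI service from chat_server.py

def test_history_survives_page_refresh():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(APP_URL)

        # Turn 1 through the real UI (selectors are our app's; adjust to yours).
        page.fill("#message-input", "Remember that my order number is 42.")
        page.press("#message-input", "Enter")
        page.wait_for_selector(".ai-message")

        # Grab the session_id the frontend persisted, then refresh like a user would.
        session_id = page.evaluate("() => localStorage.getItem('chat_session_id')")
        page.reload()

        # After the refresh the rendered history should still be there...
        page.wait_for_selector(".ai-message")

        # ...and the backend memory for that session should hold both sides of turn 1.
        memory = requests.get(f"{API_URL}/memory/{session_id}").json()["messages"]
        assert any("order number is 42" in m["content"] for m in memory)
        assert len(memory) >= 2  # one human message, one AI reply

        browser.close()

The key design choice is that the final assertion targets backend memory rather than the DOM: even if the UI happens to re-render history from a local cache, the test still fails when the backend has forgotten the session.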