Pitfalls of Testing LLM Long-Term Memory: A 3‑Day Debugging Saga

#llm #playwright #long-term-memory #automated-testing

I was jolted awake at 2 a.m. by a PagerDuty alert: users were complaining that the AI “called me Mr. Wang yesterday, but today it doesn’t recognise me at all.” Groggily I pulled up the monitoring dashboards and saw that the vector database’s retrieval latency had spiked and that the last two conversation turns were missing from the recalled memory. What made my blood run cold was the realisation that our existing manual test suite couldn’t even reach this cross‑session scenario. Long‑term memory was silently “forgetting”, and we had no way to prove it.

This article is the story of how I turned LLM memory verification from “winging it” into a reliable automated test using Playwright — and all the traps I stepped into along the way. If you’re building the memory module for a RAG pipeline or an Agent, this might save you a few days.

Breaking down the problem: why manual testing is “pretend testing”

Our product’s long‑term memory follows a typical RAG architecture: after a conversation ends, key facts or summaries are stored in a vector database. The next time the user asks a question, the most relevant memory snippets are retrieved and stitched into the System Prompt. Sounds simple, but memory accuracy actually depends on three linked steps:

Memory writing: did the summarisation model extract the correct facts? Is the embedding properly aligned?
Retrieval recall: does the vector similarity search rank the relevant older memory near the top? Is a similarity threshold cutting out relevant fragments?
Fusion & reasoning: once the LLM sees the prompt, is it truly “recalling” or is it just improvising?
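
To make the retrieval step concrete, here is a toy sketch of how a similarity threshold can silently drop a relevant memory. The three-dimensional vectors, the `retrieve` helper and its `min_score` parameter are all invented for illustration; a real system would use an embedding model and a vector database.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, memories, top_k=2, min_score=0.0):
    """Return up to top_k (score, text) pairs scoring at least min_score."""
    scored = [(cosine(query_vec, vec), text) for text, vec in memories]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(score, text) for score, text in scored[:top_k] if score >= min_score]


# Hypothetical toy "embeddings"; real ones come from an embedding model.
MEMORIES = [
    ("User's name is 张三-2027", [0.9, 0.1, 0.0]),
    ("User prefers dark mode", [0.1, 0.9, 0.0]),
    ("User asked about the refund policy", [0.0, 0.2, 0.9]),
]

# Stand-in for an imperfectly embedded "What name did I use?" query.
QUERY = [0.6, 0.5, 0.1]

if __name__ == "__main__":
    # The name memory ranks first with a score of roughly 0.83 ...
    print(retrieve(QUERY, MEMORIES, min_score=0.5))
    # ... so a threshold of 0.85 filters it out and nothing is recalled.
    print(retrieve(QUERY, MEMORIES, min_score=0.85))
```

The memory was written correctly and ranks first, yet a slightly too strict threshold makes the system look as if it forgot.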

The conventional approach is to test the API — create a session, push a few messages, then query the vector database or check the model’s output. But real users switch devices, clear caches, hop between networks. The behaviour of the frontend and the API gateway is completely missed. The streaming response parsing logic, the session‑ID binding — all of that is a blind spot in API‑only testing. We had to go end‑to‑end, through a real browser, the whole journey.

Why not Cypress or Selenium? Playwright has first‑class support for isolated browser contexts, which lets you cleanly simulate “two independent sessions”. It also allows you to intercept network requests and observe exactly when the memory API is called — something that’s quite painful to achieve with Selenium. So the choice fell on Playwright + pytest.
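
As a rough sketch of those two capabilities: the helpers below open several isolated contexts and record the requests that hit a memory endpoint. The helper names and the `/api/memory` path are hypothetical, since the article does not name the real endpoint; `browser.new_context()`, `context.new_page()` and `page.on("request", ...)` are standard Playwright calls.

```python
def open_isolated_pages(browser, count=2):
    """Open `count` pages, each inside its own browser context.

    Contexts do not share cookies or storage, so each page behaves
    like a separate visit.
    """
    contexts = [browser.new_context() for _ in range(count)]
    return [(context, context.new_page()) for context in contexts]


def watch_requests(page, url_fragment):
    """Record (method, url) for every request whose URL contains url_fragment.

    The returned list keeps filling up while the page is in use.
    """
    seen = []

    def on_request(request):
        if url_fragment in request.url:
            seen.append((request.method, request.url))

    page.on("request", on_request)
    return seen


def demo(base_url):
    """Example wiring; requires Playwright and an installed browser."""
    from playwright.sync_api import sync_playwright  # imported lazily on purpose

    with sync_playwright() as p:
        browser = p.chromium.launch()
        (ctx_a, page_a), (ctx_b, page_b) = open_isolated_pages(browser)
        memory_calls = watch_requests(page_b, "/api/memory")  # hypothetical path
        page_b.goto(base_url + "/chat")
        print(memory_calls)
        ctx_a.close()
        ctx_b.close()
        browser.close()
```

Watching the requests makes it possible to tell "the memory API was never called" apart from "it was called but returned nothing useful".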

Designing the test: a “dual‑session” pattern to verify memory persistence

The architectural idea is bluntly simple:

Open two completely isolated browser contexts with Playwright (simulating two independent visits: each context starts with its own empty cookies and storage).
In the first context, run a “teaching conversation” and plant a unique fact (e.g., the user’s name is 张三-2027).
Wait for the backend’s async processing to finish — memory writing and vector index refresh.
In the second context, ask: “What name did I use when we first met?” and assert that the reply contains 张三-2027.
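
The four steps above can be sketched as one helper. This is a minimal outline under stated assumptions: `open_chat` is a hypothetical callable that takes a Playwright page, signs in as the same test user (the article does not say how the backend links the two visits to one user), opens the chat, and returns an object with a `send_and_wait_reply` method. The fixed `settle_seconds` pause is a crude stand-in for waiting on the backend's async processing.

```python
import time


def run_dual_session_check(browser, open_chat, fact, settle_seconds=5.0):
    """Teach `fact` in one context, then ask for it in a fresh one.

    Returns the reply from the second session so the caller can assert on it.
    """
    # Session 1: plant the unique fact.
    teach_context = browser.new_context()
    try:
        chat = open_chat(teach_context.new_page())
        chat.send_and_wait_reply(f"Hi, my name is {fact}. Please remember it.")
    finally:
        teach_context.close()

    # Give the backend time to write the memory and refresh the index.
    time.sleep(settle_seconds)

    # Session 2: a brand-new context with no cookies or storage from session 1.
    recall_context = browser.new_context()
    try:
        chat = open_chat(recall_context.new_page())
        return chat.send_and_wait_reply("What name did I use when we first met?")
    finally:
        recall_context.close()
```

Closing the first context before opening the second rules out any accidental reuse of in-browser state between the two sessions.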

A failed test means the memory is inaccurate: either it was dropped or the model hallucinated. The assertion cannot compare against one fixed sentence either. The model might answer “You told me your name is 张三-2027” or “I remember you said 张三-2027”, so the check has to tolerate different phrasings around the planted fact.
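
One possible way to build in that tolerance is to normalise both the reply and the planted fact before checking containment. The helper names below are illustrative, and the normalisation rules (Unicode NFKC, dash folding, whitespace removal, case folding) are an assumption about which variations are worth forgiving, not the article's exact assertion.

```python
import re
import unicodedata

# Dash-like characters a model may emit instead of a plain hyphen.
_DASHES = "\u2010\u2011\u2012\u2013\u2014\u2212"


def normalise(text):
    """Fold width, dash, whitespace and case differences out of `text`."""
    text = unicodedata.normalize("NFKC", text)  # e.g. full-width digits -> ASCII
    text = text.translate({ord(ch): "-" for ch in _DASHES})
    text = re.sub(r"\s+", "", text)  # drop all whitespace
    return text.casefold()


def reply_mentions(reply, fact):
    """True if the normalised reply contains the normalised fact."""
    return normalise(fact) in normalise(reply)
```

Because the planted fact is a unique token, plain containment is enough here; a fuzzier fact (say, a preference described in a sentence) would need a different strategy.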

The test wraps the chat UI in a Page Object. The three crucial aspects to verify:

Wait for the AI’s response to finish completely — never take a half‑streamed sentence.
Cross‑session isolation is bullet‑proof — no accidental reuse of the session ID.
Memory retrieval timeliness — if you query right after writing, the record might not be indexed yet, so you must build in a retry window.
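
The retry window from the last point could look roughly like this. `ask_until` and its parameters are hypothetical names; `sleep` and `clock` are injectable only so the helper can be exercised without real waiting.

```python
import time


def ask_until(ask, accept, window_s=30.0, interval_s=3.0,
              sleep=time.sleep, clock=time.monotonic):
    """Call ask() repeatedly until accept(reply) is true or the window closes.

    Returns (reply, attempts) on success and raises AssertionError otherwise.
    """
    deadline = clock() + window_s
    attempts = 0
    while True:
        attempts += 1
        reply = ask()
        if accept(reply):
            return reply, attempts
        # Stop if another wait would take us past the end of the window.
        if clock() + interval_s > deadline:
            raise AssertionError(
                f"memory not recalled after {attempts} attempts; last reply: {reply!r}"
            )
        sleep(interval_s)
```

In a real test, `ask` would ideally open a fresh context on every attempt. Otherwise the earlier failed question stays in the session history and the retries are no longer independent.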

Here’s how the implementation actually looks.

Core implementation: copy‑pastable Playwright test code

1. Chat Page Object

This code deals with two headaches: ① how to tell that “the AI has finished streaming”; and ② providing a high‑level send_and_wait_reply method so the test cases stay clean.

# chat_page.py
from playwright.sync_api import Page, Locator
import time

class ChatPage:
    def __init__(self, page: Page):
        self.page = page
        self.input_box = page.locator('textarea[placeholder="输入消息"]')
        self.send_btn = page.locator('button:has-text("发送")')
        # The stop button is only visible while streaming
        self.stop_btn = page.locator('button:has-text("停止生成")')
        # The last AI reply
        self.last_ai_msg = page.locator('[data-role="assistant-message"]').last

    def goto(self, url="/chat"):
        self.page.goto(url)
        self.page.wait_for_selector('textarea', timeout=10000)

    def send_message(self, text: str):
        self.input_box.fill(text)
        self.send_btn.click()

    def wait_for_reply_complete(self, timeout=30000):
        """Wait until streaming finishes: the stop button disappears and the last message text stabilises.

        timeout is in milliseconds, matching Playwright's own timeout convention.
        """
        # time.monotonic() counts seconds, so convert the millisecond timeout first
        deadline = time.monotonic() + timeout / 1000
        last_text = ""
        stable_count = 0
        while time.monotonic() < deadline:
            # If the stop button is still visible, generation is ongoing
            if self.stop_btn.is_visible():
                stable_count = 0
                time.sleep(0.3)
                continue
            current = self.last_ai_msg.text_content() or ""
            if current == last_text and len(current) > 0:
                stable_count += 1
                if stable_count >= 3:   # text unchanged for 3 consecutive polls -> done
                    return current
            else:
                stable_count = 0
                last_text = current
            time.sleep(0.3)
        raise TimeoutError("AI reply did not finish within the timeout")

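    def send_and_wait_reply(self, text: str) -> str:
        # Count existing replies so an earlier answer is not mistaken for the new one
        # (assumes each reply renders as one new assistant-message element)
        replies = self.page.locator('[data-role="assistant-message"]')
        seen = replies.count()
        self.send_message(text)
        # nth() is zero-based, so nth(seen) is the first reply added after sending
        replies.nth(seen).wait_for(state="attached")
        return self.wait_for_reply_complete()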