I was jolted awake at 2 a.m. by a PagerDuty alert — users were complaining that the AI “called me Mr. Wang yesterday, but today it doesn’t recognise me at all.” Groggily I pulled up the monitoring dashboards and saw that the vector database’s retrieval latency had spiked, and the last two conversation turns had simply vanished from the memory recall. What made my blood run cold was the realisation: our existing manual test suite couldn’t even reach this cross‑session scenario. Long‑term memory was silently “forgetting”, and we didn’t even have a way to prove it.
This article is the story of how I turned LLM memory verification from “winging it” into a reliable automated test using Playwright — and all the traps I stepped into along the way. If you’re building the memory module for a RAG pipeline or an Agent, this might save you a few days.
Breaking down the problem: why manual testing is “pretend testing”
Our product’s long‑term memory follows a typical RAG architecture: after a conversation ends, key facts or summaries are stored in a vector database. The next time the user asks a question, the most relevant memory snippets are retrieved and stitched into the System Prompt. Sounds simple, but memory accuracy actually depends on three linked steps:
- Memory writing: did the summarisation model extract the correct facts? Is the embedding properly aligned?
- Retrieval recall: does the vector similarity search still rank the relevant older memory near the top? Is a similarity threshold cutting out relevant fragments?
- Fusion & reasoning: once the LLM sees the prompt, is it truly “recalling” or is it just improvising?
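To make the retrieval step concrete, here is a toy sketch of how a similarity threshold can silently drop a relevant memory. Everything in it is an illustrative assumption: the Memory class, the recall and build_system_prompt helpers, and the hand-written two-dimensional vectors stand in for a real embedding model and vector database, which the product's actual pipeline would use instead.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    vector: list[float]  # hand-written stand-in for a real embedding


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall(query: list[float], store: list[Memory],
           top_k: int = 2, min_score: float = 0.0) -> list[str]:
    """Return up to top_k memory texts whose similarity is >= min_score, best first."""
    scored = [(cosine(query, m.vector), m.text) for m in store]
    kept = [pair for pair in scored if pair[0] >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in kept[:top_k]]


def build_system_prompt(memories: list[str]) -> str:
    """Stitch recalled snippets into a system prompt (hypothetical wording)."""
    if not memories:
        return "You have no stored memories about this user."
    return "Known facts about this user:\n" + "\n".join(f"- {m}" for m in memories)


store = [
    Memory("The user's name is 张三-2027", [1.0, 0.0]),
    Memory("The user likes green tea", [0.0, 1.0]),
]
query = [0.8, 0.6]  # similarity is roughly 0.8 to the name fact, 0.6 to the tea fact

print(build_system_prompt(recall(query, store, top_k=1, min_score=0.5)))
print(build_system_prompt(recall(query, store, top_k=1, min_score=0.85)))
```

With the looser threshold the name fact reaches the prompt; with the stricter one nothing does, and the model can only improvise. From the outside both cases look like "the AI forgot", which is why the three steps need to be checked together.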
The conventional approach is to test the API — create a session, push a few messages, then query the vector database or check the model’s output. But real users switch devices, clear caches, hop between networks. The behaviour of the frontend and the API gateway is completely missed. The streaming response parsing logic, the session‑ID binding — all of that is a blind spot in API‑only testing. We had to go end‑to‑end, through a real browser, the whole journey.
Why not Cypress or Selenium? Playwright has first‑class support for isolated browser contexts, which lets you cleanly simulate “two independent sessions”. It also allows you to intercept network requests and observe exactly when the memory API is called — something that’s quite painful to achieve with Selenium. So the choice fell on Playwright + pytest.
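As a rough illustration of those two features, the sketch below opens an isolated context and records every request that looks like a memory API call. The base URL and the /api/memory path fragment are assumptions made up for this example, as are the helper names; swap in whatever your own frontend and gateway actually use.

```python
from __future__ import annotations

BASE_URL = "http://localhost:3000"  # assumption: where the chat frontend is served
MEMORY_API_MARKER = "/api/memory"   # assumption: hypothetical path fragment of the memory endpoints


def open_observed_session(browser, memory_calls: list[str]):
    """Open a fresh, isolated context and log every memory-API request it makes."""
    # Each new context gets its own cookies and storage, like a separate visit.
    context = browser.new_context(base_url=BASE_URL)
    page = context.new_page()

    def record(request) -> None:
        # Playwright calls this for every outgoing request from the page.
        if MEMORY_API_MARKER in request.url:
            memory_calls.append(f"{request.method} {request.url}")

    page.on("request", record)
    return context, page


def run_demo() -> None:
    """Requires `pip install playwright` and `playwright install chromium`."""
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        calls_a: list[str] = []
        calls_b: list[str] = []
        ctx_a, page_a = open_observed_session(browser, calls_a)
        ctx_b, page_b = open_observed_session(browser, calls_b)
        page_a.goto("/chat")
        page_b.goto("/chat")
        print("session A memory calls:", calls_a)
        print("session B memory calls:", calls_b)
        ctx_a.close()
        ctx_b.close()
        browser.close()
```

Calling run_demo() against a running instance of the app prints the memory requests seen by each session separately, which makes it easy to check when (and whether) the frontend actually asked for memories.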
Designing the test: a “dual‑session” pattern to verify memory persistence
The architectural idea is bluntly simple:
- Open two completely isolated browser contexts with Playwright (simulating two independent visits: each context starts with its own empty cookies and storage).
- In the first context, run a “teaching conversation” and plant a unique fact (e.g., the user’s name is 张三-2027).
- Wait for the backend’s async processing to finish: memory writing and vector index refresh.
- In the second context, ask: “What name did I use when we first met?” and assert that the reply contains 张三-2027.
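The four steps above can be sketched as a single function. This is only an outline under stated assumptions: make_chat is a hypothetical factory that wraps a page in some object offering goto and send_and_wait_reply, wait_for_indexing is a hypothetical callback that blocks until the backend has finished writing the memory, and the base URL is a placeholder. How the two contexts are tied to the same user account (login, token, and so on) is not covered in the text, so that would have to happen inside make_chat.

```python
from __future__ import annotations

from typing import Callable


def run_dual_session(
    browser,
    make_chat: Callable,                  # hypothetical: page -> chat page object
    wait_for_indexing: Callable[[], None],  # hypothetical: blocks until memory is indexed
    fact: str = "张三-2027",
    base_url: str = "http://localhost:3000",  # assumption: local dev server
) -> str:
    """Teach a fact in one browser context, then ask for it in a second one."""
    # Session 1: plant the unique fact, then leave.
    teach_ctx = browser.new_context(base_url=base_url)
    try:
        chat = make_chat(teach_ctx.new_page())
        chat.goto()
        chat.send_and_wait_reply(f"Hi, my name is {fact}. Please remember it.")
    finally:
        teach_ctx.close()

    # Give the backend time to write the memory and refresh the vector index.
    wait_for_indexing()

    # Session 2: a brand-new context with no cookies or storage from session 1.
    recall_ctx = browser.new_context(base_url=base_url)
    try:
        chat = make_chat(recall_ctx.new_page())
        chat.goto()
        return chat.send_and_wait_reply("What name did I use when we first met?")
    finally:
        recall_ctx.close()
```

Returning the reply instead of asserting inside the function keeps the flow reusable: the caller decides how strict the check on the answer should be.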
A failed test means the memory is inaccurate — either it was dropped or the model hallucinated. This is not a trivial keyword‑match either; the model might answer “You told me your name is 张三-2027” or “I remember you said 张三-2027”, so the assertion needs some tolerance.
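One way to build in that tolerance is to normalise both the reply and the expected fact before comparing, so that differences in spacing, full-width characters, or dash variants don't cause false failures. The helper names below are made up for this sketch, and it is deliberately simple: it only checks that the fact appears somewhere in the reply, so it would not catch a reply that mentions the name while denying it.

```python
import re
import unicodedata

# Dash look-alikes that models and rich-text editors sometimes emit instead of "-".
_DASHES = dict.fromkeys(map(ord, "\u2010\u2011\u2012\u2013\u2014\u2212"), "-")


def normalise(text: str) -> str:
    """Fold full-width forms, unify dashes, drop whitespace, ignore case."""
    text = unicodedata.normalize("NFKC", text).translate(_DASHES)
    return re.sub(r"\s+", "", text).casefold()


def reply_mentions_fact(reply: str, fact: str) -> bool:
    """True if the normalised fact occurs anywhere in the normalised reply."""
    return normalise(fact) in normalise(reply)


print(reply_mentions_fact("You told me your name is 张三-2027.", "张三-2027"))
print(reply_mentions_fact("I remember you said 张三 \u2011 ２０２７", "张三-2027"))
print(reply_mentions_fact("Sorry, I don't know your name.", "张三-2027"))
```

The first two calls should report a match despite the different phrasing and characters, and the third should not.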
The test wraps the chat UI in a Page Object. The three crucial aspects to verify:
- Wait for the AI’s response to finish completely — never take a half‑streamed sentence.
- Cross‑session isolation is bullet‑proof — no accidental reuse of the session ID.
- Memory retrieval timeliness — if you query right after writing, the record might not be indexed yet, so you must build in a retry window.
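The retry window in the last point can be as simple as a polling helper. The sketch below is one possible shape, not the only one: retry_until and its parameters are names invented here, and the default timings are placeholders to tune against how long your own indexing actually takes.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_until(
    ask: Callable[[], T],
    accept: Callable[[T], bool],
    timeout_s: float = 60.0,
    interval_s: float = 5.0,
) -> T:
    """Call ask() repeatedly until accept(answer) is true or the time budget runs out."""
    deadline = time.monotonic() + timeout_s
    attempts = 0
    while True:
        attempts += 1
        answer = ask()
        if accept(answer):
            return answer
        # Stop if another sleep would push us past the deadline.
        if time.monotonic() + interval_s > deadline:
            raise AssertionError(
                f"no acceptable answer after {attempts} attempt(s); last answer: {answer!r}"
            )
        time.sleep(interval_s)


# Example with a fake backend that only "remembers" on the third try.
replies = iter(["I'm not sure.", "I'm not sure.", "Your name is 张三-2027."])
result = retry_until(lambda: next(replies), lambda r: "张三-2027" in r,
                     timeout_s=2.0, interval_s=0.01)
print(result)
```

In a real run, ask would open a fresh context and send the recall question, so that earlier failed attempts don't leak into the conversation being checked. Including the last answer in the error message makes it easier to tell a dropped memory from a hallucinated one.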
Here’s how the implementation actually looks.
Core implementation: copy‑pastable Playwright test code
1. Chat Page Object
This code deals with two headaches: ① how to tell that “the AI has finished streaming”; and ② providing a high‑level send_and_wait_reply method so the test cases stay clean.
# chat_page.py
import time

from playwright.sync_api import Page


class ChatPage:
    def __init__(self, page: Page):
        self.page = page
        # Selectors match the app's Chinese UI: "输入消息" = "Type a message",
        # "发送" = "Send", "停止生成" = "Stop generating"
        self.input_box = page.locator('textarea[placeholder="输入消息"]')
        self.send_btn = page.locator('button:has-text("发送")')
        # The stop button is only visible while streaming
        self.stop_btn = page.locator('button:has-text("停止生成")')
        # All AI replies; .last is resolved on each use, so it tracks the newest one
        self.ai_msgs = page.locator('[data-role="assistant-message"]')
        self.last_ai_msg = self.ai_msgs.last
        # Number of AI replies on the page just before the most recent send
        self._replies_before = 0

    def goto(self, url="/chat"):
        # A relative URL relies on base_url being set on the browser context
        self.page.goto(url)
        self.page.wait_for_selector('textarea', timeout=10000)

    def send_message(self, text: str):
        # Remember the reply count so a previous answer is never mistaken for the new one
        self._replies_before = self.ai_msgs.count()
        self.input_box.fill(text)
        self.send_btn.click()

    def wait_for_reply_complete(self, timeout=30000):
        """Wait until streaming finishes: the stop button disappears and the
        last message text stabilises. `timeout` is in milliseconds."""
        # time.monotonic() counts seconds, so convert the millisecond timeout
        deadline = time.monotonic() + timeout / 1000
        last_text = ""
        stable_count = 0
        while time.monotonic() < deadline:
            # Keep waiting while generation is ongoing or the new reply has not appeared yet
            if self.stop_btn.is_visible() or self.ai_msgs.count() <= self._replies_before:
                stable_count = 0
                time.sleep(0.3)
                continue
            current = self.last_ai_msg.text_content() or ""
            if current and current == last_text:
                stable_count += 1
                if stable_count >= 3:  # 3 consecutive identical samples -> done
                    return current
            else:
                stable_count = 0
                last_text = current
            time.sleep(0.3)
        raise TimeoutError("AI reply did not finish within the timeout")

    def send_and_wait_reply(self, text: str) -> str:
        self.send_message(text)
        return self.wait_for_reply_complete()