At 1 a.m., a colleague sent me a screenshot: a user had said, “My name is Xiao Ming, remember I take my coffee without sugar.” In the next conversation, the bot served a full-sugar latte. The product manager @-mentioned everyone in the group chat: “Is memory storage broken again?” I stared at the chat history, sighed, opened my spreadsheet, and started my Nth round of manual regression: clear cache, open browser, run 10 turns of dialogue, compare against expected results, take screenshots, fill in results… Two hours later I had covered five scenarios. My eyes were exhausted and I had missed three boundary cases. At that moment I decided: a machine has to do this.
Problem Breakdown: Why RAG Memory Testing Is So Painful
Memory storage in RAG applications isn’t like a traditional API where you can verify everything with a few asserts. It involves long-term memory, session windows, vector retrieval, and LLM generation—any weak link leaves the user feeling that “the bot has amnesia.” A typical test scenario looks like this: chat with the bot for 10 rounds. In round 3, plant the information “My favorite movie is Let the Bullets Fly.” In round 7, discuss the weather. In round 10, suddenly ask, “What movie did I say I liked earlier?” and see whether the bot retrieves it from memory.
Doing this manually has three fatal flaws:
- Hard to trace long-conversation state – Memory glitches normally happen after several context shifts. By the fourth or fifth round of manual testing, even the tester has forgotten what was said earlier.
- Streaming output makes assertions unstable – LLM responses arrive token by token. Check for a keyword before the sentence finishes and you fail a response that would have been correct a second later, producing a high rate of false negatives.
- Regression cost grows exponentially with memory types – Short-term memory, long-term memory, summary memory, vector memory… every additional storage type doubles the number of test cases. Manual testing simply can’t keep up.
The usual fix is to write unit tests—but LLM output is non-deterministic. Even if the memory is correct, the phrasing can vary wildly. Fixed-string asserts immediately break down. The real challenge is that you need something that can simulate a real user across multiple conversation turns—waiting, observing, asserting—and still run unattended in CI. Playwright is practically built for this.
Design Decision: Why Playwright over Selenium or Puppeteer
There were three candidates: Selenium, Puppeteer, and Playwright. Selenium was eliminated first—its automatic waiting mechanisms for modern web apps are too weak; you end up sprinkling sleep everywhere, making tests slow and brittle. Puppeteer only supports Chromium, while our RAG application has production users on Safari and Firefox, so we needed cross-browser validation.
What won me over with Playwright:
- Auto-waiting – It handles element interactability, page loads, and network idleness for you. No need to litter the test with sleep calls.
- Multi-browser support – The same script runs on Chromium, Firefox, and WebKit by changing a single line of configuration.
- Screenshots and video – When a test fails, the trace replays each step instead of forcing you to stare at logs and doubt everything.
- Network interception and mocking – You can intercept API requests, even simulate memory-service failures, to verify degradation logic.
Architecturally, we designed a memory-accuracy automation suite:
- Script definition – Describe multi-turn conversations in YAML (a sketch follows this list). Each turn contains the user input, expected memory fields, and mandatory keywords.
- Executor – A Playwright browser instance reads the script, sends messages in order, listens for the streaming-response completion event, and collects the full generated text.
- Assertion layer – Perform semantic-level checks on the generated text instead of relying on exact string matching. Check “whether the response contains key information linked to the memory.” When necessary, plug in a small model for secondary verification.
- Report generation – Each run produces an HTML report with failing screenshots, dropped straight into CI artifacts.
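To make the script layer concrete, here is a minimal sketch of one such YAML scenario. The schema (scenario, turns, user, expect_keywords) is illustrative, not a fixed format:

# long_term_memory.yaml -- illustrative schema, adapt to your executor
scenario: long_term_memory_coffee
turns:
  - user: "我叫赵大宝,最喜欢的咖啡是冰美式。"   # plant the memory
  - user: "今天天气不错,适合工作。"             # distractor turn
  - user: "我之前说过我喜欢什么咖啡?"           # quiz the bot
    expect_keywords: ["冰美式", "美式"]          # any match passes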
Why not use API tests directly? Because in many RAG apps, state management, the front-end conversation window, and token-refresh logic are all embedded in the page. You simply cannot reproduce real-world failures without a real browser.
Core Implementation: Turning Manual Steps into Automation
The code below solves the problem of “how to make Playwright wait for each generation to finish before sending the next message” in a multi-turn dialogue. During streaming, the send button is typically disabled or shows a stop icon, then returns to normal once generation is complete. We use this change as a synchronization point.
import asyncio
from playwright.async_api import async_playwright, expect

async def send_message_and_wait(page, text: str, timeout: int = 30000):
    """
    Send a message into the chat box and wait for the LLM to finish generating.
    Assumption: the send button is disabled while streaming, re-enabled when done.
    """
    # Selectors match our app's Chinese UI: the placeholder contains "输入"
    # ("type here") and the send button is labeled "发送" ("send")
    textarea = page.locator('textarea[placeholder*="输入"]')
    send_btn = page.locator('button:has-text("发送")')
    await textarea.fill(text)
    await send_btn.click()
    # Key step: wait for the send button to become enabled again,
    # which signals that generation is complete
    await expect(send_btn).to_be_enabled(timeout=timeout)
    # For safety, wait a touch longer so rendering animations finish
    await page.wait_for_timeout(500)
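A note on this synchronization point: if your UI keeps the send button enabled throughout streaming, the trick above won't work. One fallback, sketched here under the assumption that assistant messages carry a [data-role="assistant"] attribute (the same one used in the test below), is to poll the last reply until its text stops changing:

async def wait_for_stream_end(page, quiet_ms: int = 1500, timeout_s: int = 60):
    """Fallback sync point: treat generation as done once the last
    assistant message's text has been stable for quiet_ms milliseconds."""
    last_msg = page.locator('[data-role="assistant"]').last
    previous, elapsed = "", 0
    while elapsed < timeout_s * 1000:
        await page.wait_for_timeout(quiet_ms)
        elapsed += quiet_ms
        current = await last_msg.inner_text()
        if current and current == previous:
            return current  # no change during the quiet window: stream is done
        previous = current
    raise TimeoutError("streaming response never stabilized")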
Next, we build a complete memory test scenario: the user states their name and a preference in the first turn, then much later suddenly quizzes the bot, and we check whether the response contains the earlier information. Notice we use a locator with .last to precisely grab the bot's latest reply.
async def test_long_term_memory():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto("https://your-rag-app.example.com/chat")
        # Script: plant the memories
        # "My name is Zhao Dabao; my favorite coffee is an iced Americano."
        await send_message_and_wait(page, "我叫赵大宝,最喜欢的咖啡是冰美式。")
        # "Nice weather today, good for getting work done."
        await send_message_and_wait(page, "今天天气不错,适合工作。")
        # "Note this down: I have a checkup next Wednesday, remind me."
        await send_message_and_wait(page, "帮我记一下,下周三要去体检,别忘了提醒我。")
        # Distractor turns
        await send_message_and_wait(page, "那明天呢?")  # "What about tomorrow?"
        await send_message_and_wait(page, "再说一下项目排期的事情吧。")  # "Let's talk project schedule."
        # The key test: ask about information given earlier
        # "What coffee did I say I liked?"
        await send_message_and_wait(page, "我之前说过我喜欢什么咖啡?")
        # Locate the last bot message (assuming each message carries [data-role="assistant"])
        last_bot_msg = page.locator('[data-role="assistant"]').last
        response_text = await last_bot_msg.inner_text()
        # Semantic assertion: must mention "冰美式" (iced Americano) or "美式" (Americano)
        assert "冰美式" in response_text or "美式" in response_text
        await context.close()
        await browser.close()
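To run the scenario outside a test runner, a plain asyncio entry point is enough; under pytest you would typically reach for pytest-asyncio or pytest-playwright instead:

if __name__ == "__main__":
    asyncio.run(test_long_term_memory())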
Plugging into CI: From 2 Hours Down to 4 Minutes
We integrated the scripts into GitHub Actions. A typical workflow (sketched after this list):
- Spin up a Playwright Docker container with all browsers pre-installed.
- Pull the YAML memory scripts from the repo and execute them in parallel matrix jobs (short-term memory, long-term memory, summary memory each in its own job).
- Collect HTML reports and screenshots as artifacts. If any job fails, post a summary comment on the PR.
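A minimal sketch of such a workflow, assuming the suites live under tests/memory/ and run with pytest (the image tag, paths, and job names are illustrative):

name: memory-regression
on: [pull_request]
jobs:
  memory-tests:
    runs-on: ubuntu-latest
    container: mcr.microsoft.com/playwright/python:v1.44.0-jammy
    strategy:
      matrix:
        suite: [short_term, long_term, summary]
    steps:
      - uses: actions/checkout@v4
      - run: pip install pytest pytest-asyncio playwright pyyaml
      - run: pytest tests/memory/${{ matrix.suite }} --tb=short
      - uses: actions/upload-artifact@v4
        if: failure()
        with:
          name: report-${{ matrix.suite }}
          path: reports/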
The numbers speak for themselves:
- Before (manual): 5 scenarios took ~2 hours, with a missed-defect rate of about 20%.
- After (Playwright automation): The same 5 scenarios, plus 15 more we never had time to run, finish in 4 minutes. The missed-defect rate dropped below 4%, because the machine executes every assertion precisely, without fatigue.
- Fewer false negatives: Because assertions are semantic, a correct answer phrased differently ("冰美式" vs "美式咖啡") still passes; the machine doesn't trip over wording.
Advanced Tips
- Mock the memory service for negative testing – Use page.route() to intercept requests to the memory backend and return 500 errors, verifying that the bot gracefully handles "I can't access my memory right now":
await page.route("**/api/memory/**", lambda route: route.fulfill(status=500))
- Deal with slow streaming – Some models emit tokens very slowly. Instead of waiting for the send button, you can listen for the page.on("websocket") event or wait for a "done" indicator. The auto-waiting approach is solid, but if your UI differs, adjust accordingly.
- Semantic validation with a tiny model – For facts that can be expressed in many ways, we run the bot's response and the expected fact through an embeddings model and check cosine similarity (see the sketch after this list). It's optional but reduces false negatives to near zero.
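Here is a minimal sketch of that semantic check using sentence-transformers; the model choice and the 0.6 threshold are assumptions you should tune against your own data:

from sentence_transformers import SentenceTransformer, util

# A small multilingual model; swap in whatever embeddings service you already run
_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def semantically_contains(response: str, expected_fact: str, threshold: float = 0.6) -> bool:
    """Return True if the response is close enough in embedding space
    to the expected fact, regardless of exact wording."""
    embeddings = _model.encode([response, expected_fact], convert_to_tensor=True)
    score = util.cos_sim(embeddings[0], embeddings[1]).item()
    return score >= threshold

# e.g. assert semantically_contains(response_text, "用户最喜欢的咖啡是冰美式")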
Conclusion
RAG memory testing doesn't have to be a nightmare of manual spreadsheet checking. With Playwright as the "hands" and a few YAML scripts as the "brain," you can turn a flaky two-hour regression into a four-minute, reliable pipeline. The robot doesn't get tired, doesn't miss edge cases, and won't serve your user a full-sugar latte when they specifically asked for no sugar.