How We Dropped LLM Memory Validation Miss Rate from 25% to 0.1% with Playwright

At 1 a.m., a Feishu message buzzed my phone like it was about to catch fire—“The AI assistant remembered our client’s company name as a competitor’s.” While I rolled back the memory service, I pulled up the test report and saw the problem: manual spot checks covered only 50 memory samples, but the system was generating nearly 20,000 new memories daily. The parameter change that caused the incident? Naturally, it wasn’t in that sample.

This is the testing dilemma for LLM long-term memory: the memory storage pipeline changes constantly (swap embedding models, tweak chunking strategies, adjust similarity thresholds), but verification still relies on “just chat a few times and see.” What we needed was an automated validation solution that covers all scenarios, supports regression, and is quantifiable. I ended up building an end-to-end memory accuracy testing framework with Playwright + pytest. It slashed the miss rate from 25% to 0.1%, and now it automatically runs 2,000+ memory validations before every merge. Let me break it down.

Why Memory Validation Is So Hard

LLM long-term memory typically follows this flow: user conversation → extract facts → store in a vector database (like Chroma or Milvus) → semantic search on the next conversation → weave the relevant memories into the prompt. The test goal is clear: verify that what was stored can be accurately recalled in later conversations.
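
To make the flow concrete, here is a toy sketch of the store-then-recall loop. It is not how Chroma or Milvus work internally: the bag-of-words `embed` function, the `ToyMemoryStore` class, `build_prompt`, and the 0.7 threshold are all assumptions standing in for a real embedding model, vector database, and prompt template.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyMemoryStore:
    """Hypothetical in-memory stand-in for a vector database."""

    def __init__(self, threshold: float = 0.7):
        self.threshold = threshold
        self.items = []  # list of (fact, vector) pairs

    def add(self, fact: str) -> None:
        self.items.append((fact, embed(fact)))

    def recall(self, query: str) -> list:
        """Return stored facts whose similarity to the query meets the threshold."""
        q = embed(query)
        scored = [(cosine(q, vec), fact) for fact, vec in self.items]
        return [fact for score, fact in sorted(scored, reverse=True) if score >= self.threshold]


def build_prompt(store: ToyMemoryStore, question: str) -> str:
    """Weave recalled memories into the prompt sent to the model."""
    memories = "\n".join(f"- {fact}" for fact in store.recall(question))
    return f"Known facts about the user:\n{memories}\n\nUser: {question}"


store = ToyMemoryStore(threshold=0.7)
store.add("my company is acme")
prompt = build_prompt(store, "what is my company")
```

In this toy setup the query and the stored fact share three of four words, giving a similarity of 0.75. The fact is recalled at a threshold of 0.7 but silently dropped at 0.8, so even a small threshold tweak needs regression coverage.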

In practice, though, it runs into several hard problems:

Long chain: From frontend rendering through backend memory extraction, embedding, retrieval, and prompt assembly, a break at any stage produces a “wrong answer.” Testing a single API in isolation can’t catch drift across the whole chain.
Expensive end-to-end validation: A manual QA tester can get through only a few dozen chat cases per day, and it’s extremely hard to precisely reproduce a scenario like “first establish 5 facts, then ask about the 2nd one.”
Timing traps: Memory writes are asynchronous (e.g., via message queue + bulk insert). A newly created memory might not be indexed yet, so querying it immediately causes a false negative.
Recall uncertainty: Even with no bugs, a change in the similarity threshold can cause some memories to be missed. This kind of borderline failure is nearly impossible to spot by eye.
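
The timing trap is easy to reproduce in miniature. The sketch below uses a hypothetical `QueuedMemoryStore` whose writes sit in a pending queue until `flush()` runs. This is an assumed simplification of a message queue plus bulk insert, not the real service. A lookup issued before the flush returns nothing even though the write was accepted.

```python
from collections import deque


class QueuedMemoryStore:
    """Hypothetical store where writes are queued and indexed later in bulk."""

    def __init__(self):
        self.pending = deque()  # accepted but not yet searchable
        self.index = {}         # searchable memories, keyed by topic

    def write(self, key: str, fact: str) -> None:
        """Accept a write; it is NOT searchable until flush() runs."""
        self.pending.append((key, fact))

    def flush(self) -> int:
        """Bulk-insert everything pending into the index; return the count moved."""
        moved = 0
        while self.pending:
            key, fact = self.pending.popleft()
            self.index[key] = fact
            moved += 1
        return moved

    def lookup(self, key: str):
        """Return the indexed fact for this key, or None if it is not indexed yet."""
        return self.index.get(key)


store = QueuedMemoryStore()
store.write("company", "Acme")
too_early = store.lookup("company")    # queried before the bulk insert: a false negative
store.flush()
after_flush = store.lookup("company")  # the same query succeeds once indexing has run
```

A test that asserts on `too_early` would fail even though nothing is broken, so any validation framework has to give the write path time to settle before asking the verification question.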

Conventional “backend API automation + frontend screenshot comparison” approaches simply can’t handle this, because the core question in memory validation isn’t whether the UI is correct, but whether specific semantic information is accurately reproduced across multiple conversation turns.

Why Browser Automation Instead of API Testing

I chose Playwright for three reasons, deliberately avoiding two alternatives: calling the chat API directly, and testing against stubs or a mocked vector store.

Real user path: Playwright drives the complete browser flow (log in, send a message, wait for the AI reply), so frontend rendering, SSE streaming output, and memory embedding are all exercised. Calling the /chat API directly bypasses SSE parsing and the frontend assembly logic, which is exactly where bugs have occurred (e.g., words dropped when stream chunks are merged).
Assertion power: Playwright’s expect(page.locator(...)).to_contain_text(...) can directly verify that the stored fact appears in the reply, and because the assertion retries until its timeout expires, it is well suited to slow AI responses.
Maintenance-free test environment: The system under test is a real, accessible web chat interface (internal deployment), so there’s no need to create stubs or mock vector stores. Playwright launches a browser and connects directly; each test gets a fresh session with isolated memory.
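
The retrying behavior is what matters most for slow AI replies. As a conceptual stand-in (this is not Playwright's implementation), the hypothetical `wait_until_contains` helper below polls a text source until the expected fact shows up or a deadline passes. The `get_text` callable, the timeout, and the polling interval are illustrative assumptions.

```python
import time
from typing import Callable


def wait_until_contains(get_text: Callable[[], str], expected: str,
                        timeout: float = 30.0, interval: float = 0.1) -> str:
    """Poll get_text() until it contains `expected`; raise AssertionError on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        text = get_text()
        if expected in text:
            return text
        if time.monotonic() >= deadline:
            raise AssertionError(f"{expected!r} not found within {timeout}s; last text: {text!r}")
        time.sleep(interval)


# Simulate a reply that grows by one piece per poll, like a streamed answer.
chunks = iter(["Your", "Your company", "Your company is Acme."])
last_seen = {"text": ""}


def fake_reply() -> str:
    """Return the next partial reply, or the last one once the stream is exhausted."""
    last_seen["text"] = next(chunks, last_seen["text"])
    return last_seen["text"]


final_text = wait_until_contains(fake_reply, "Acme", timeout=2.0, interval=0.01)
```

A one-shot assertion against the first partial reply would have failed here. A retrying assertion passes as soon as the fact appears and fails only if it never shows up within the timeout.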

Architecture: pytest manages the test cases. Each case defines two phases, “memory building” and “verification question,” and all cases are driven by YAML data. During execution, Playwright controls Chromium, sends messages according to the YAML steps, and checks the replies. After the memory-building phase, we explicitly wait 3 seconds so that async writes finish and indexes land, then proceed to verification. This avoids false failures caused by asynchronous writes.
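
A minimal sketch of the two-phase case structure follows. The field names (`build`, `question`, `expected`) and the `run_case` helper are hypothetical, since the actual YAML schema isn't shown here. The case is written as a Python object mirroring what one YAML entry might contain, and `send` stands in for a Playwright-driven function that sends a message and returns the reply text.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MemoryCase:
    """One data-driven case; field names are assumptions, not the real schema."""
    name: str
    build: List[str]     # phase 1: messages that establish facts
    question: str        # phase 2: the verification question
    expected: List[str]  # substrings the reply must contain


def run_case(case: MemoryCase, send: Callable[[str], str],
             settle_seconds: float = 3.0,
             sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Run both phases and return the expected substrings missing from the reply."""
    for message in case.build:
        send(message)             # phase 1: build memories
    sleep(settle_seconds)         # let async writes and indexing finish
    reply = send(case.question)   # phase 2: ask the verification question
    return [fact for fact in case.expected if fact not in reply]


# Dry run with a fake chat backend instead of a browser.
sent = []


def fake_send(message: str) -> str:
    """Record the message and return a canned reply (questions get the 'recalled' fact)."""
    sent.append(message)
    return "You work at Acme." if message.endswith("?") else "Got it."


case = MemoryCase(
    name="recall_company",
    build=["My favorite color is teal.", "I work at Acme."],
    question="Where do I work?",
    expected=["Acme"],
)
missing = run_case(case, fake_send, sleep=lambda seconds: None)
```

Returning the list of missing facts, rather than a bare pass/fail, makes the failure report say which memory was lost. Injecting `sleep` keeps the dry run fast while the real run still waits the full settle time.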

Core Implementation: Building It Step by Step

1. Playwright Fixture: Reusing a Browser Session

This code handles browser startup and management so we don’t have to launch a fresh browser for every test function, saving tens of seconds of overhead. Each test still gets its own context and page, which keeps sessions isolated.

# conftest.py
import pytest
from playwright.sync_api import sync_playwright, Page

@pytest.fixture(scope="session")
def browser():
    # Launch a local Chromium; in a real CI environment, switch to headless=True
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False, slow_mo=100)
        yield browser
        browser.close()

@pytest.fixture(scope="function")
def page(browser):
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://your-ai-app.com/chat")  # chat page under test
    # Login: assumes token-based auth; in practice, inject the cookie/localStorage entry here
    page.evaluate("localStorage.setItem('token','test-token-xxx')")
    page.reload()
    yield page
    context.close()

slow_mo=100 makes it easier to see each action during debugging. In CI, set it to 0 and run with headless=True.
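
One way to avoid editing the fixture between local runs and CI is to derive the launch options from the environment. This is only a sketch: the `launch_options` helper is hypothetical, and it assumes your CI system sets a `CI` environment variable (many do, but check yours).

```python
import os
from typing import Mapping, Optional


def launch_options(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return keyword arguments for chromium.launch() based on the environment."""
    env = os.environ if env is None else env
    in_ci = env.get("CI", "").lower() in ("1", "true", "yes")
    return {
        "headless": in_ci,               # no visible window on CI machines
        "slow_mo": 0 if in_ci else 100,  # slow down only when a human is watching
    }


local_opts = launch_options({})
ci_opts = launch_options({"CI": "true"})
```

In the browser fixture, this could be passed as `p.chromium.launch(**launch_options())`, so the same conftest.py works in both places.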

2. Test Case: Sending Memories + Verifying Recall

This code encapsulates the core operation of sending a message and waiting for the AI reply to fully render—since the UI keeps changing during streaming, a naive wait_for_selector might grab half-completed content.

# test_memory.py
import time
from playwright.sync_api import expect

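def send_message(page, text: str):
    """Simulate the user typing and sending a message, then wait for the AI reply to finish rendering."""
    input_box = page.locator("textarea[placeholder*='输入']")  # adjust to your actual selector
    input_box.fill(text)
    page.locator("button:has-text('发送')").click()  # the "Send" button
    # Key step: wait for streaming to end, detected by the "Stop generating" (停止生成) button disappearing.
    # Wait for the button to appear first; otherwise state="hidden" passes immediately
    # if the button has not rendered yet. The 10-second timeout is an illustrative value.
    stop_button = page.locator("button:has-text('停止生成')")
    stop_button.wait_for(state="visible", timeout=10000)
    stop_button.wait_for(state="hidden", timeout=30000)
    # Extra fixed wait so the DOM finishes updating (some UIs have transition animations)
    time.sleep(0.5)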

def test_user_memory_recall(page):
    # Phase 1: build memories
    memories = [
        "我叫王