We Caught 90% More AI Memory Bugs Using Playwright E2E Tests

#python #programming

At 1 a.m., the user group chat exploded: "I just spent half an hour walking the AI through my project details, and then it asked 'Who are you?' Everything I said was wasted." When we checked the backend, the session data was plainly still sitting in the Redis-backed Memory Store, yet the context log broke off after the 23rd turn. This was already the third "amnesia" incident that month. We had been clicking through long conversations by hand, but before each release we could test 10 turns at most; anything longer and we ran out of patience. That night, I cursed at the screen: "If we don't automate this, it'll drive us insane sooner or later." So we made up our minds: build an end-to-end test suite with Playwright + pytest that simulates 100 turns and catches memory consistency bugs right in CI.

The Problem: Why Long-Conversation Memory Is a Blind Spot for Manual Testing

One of our AI product's core selling points is "remembering what you've said": you tell it in turn 3 that your name is "Lao Zhang" and you like iced Americano, and by turn 50 it can still ask "Lao Zhang, iced Americano as usual?" This experience relies on the underlying Memory Store (usually Redis or a vector DB) writing, reading, and updating the session context at every turn. The problem is that memory loss tends to show up in long-distance dependencies: the context window fills up and gets truncated, the session TTL expires, serialization or deserialization fails, concurrent writes overwrite each other, and so on. Manual testing never reaches these scenarios. Can a QA engineer spend half an hour clicking through 50 turns by hand? Unrealistic. Manual tests cover the first 10 turns at most and then get marked "passed." The result? As soon as a real user has a longer chat, the bug surfaces, and we end up fixing it in the middle of the night, every time.
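
To make the long-distance failure mode concrete, here is a minimal sketch of just one of the causes listed above, context-window truncation. The window size of 20 messages and the build_context helper are assumptions made up for illustration; they are not our production code.

```python
# Hypothetical illustration: a naive sliding window silently drops early facts.
MAX_MESSAGES = 20  # assumed window size, for illustration only


def build_context(history, max_messages=MAX_MESSAGES):
    """Keep only the most recent messages, as a naive truncation would."""
    return history[-max_messages:]


def remembers(context, fact):
    """Return True if any message still in the context mentions the fact."""
    return any(fact in message for message in context)


history = [f"turn {i}: small talk" for i in range(1, 51)]
history[2] = "turn 3: my name is Lao Zhang, I like iced Americano"

early = build_context(history[:10])   # what a 10-turn manual test sees
late = build_context(history)         # what a 50-turn real conversation sees

print(remembers(early, "Lao Zhang"))  # True
print(remembers(late, "Lao Zhang"))   # False
```

A 10-turn manual pass never leaves the window, so it cannot observe the fact falling out of it.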

Conventional unit and API tests can only verify that "memory is written" and "memory can be read." They can't reproduce the complete interaction chain in the browser: how does the frontend bundle historical messages and send them to the API? Is the context truncated during streaming responses? Does the memory persist after a WebSocket reconnection? To catch these, you need a real browser simulating user behavior from start to finish. That is why we turned to end-to-end (E2E) automated testing.

The Solution: Playwright + pytest Combo

When choosing a tool, we considered Selenium and Cypress but eventually settled on Playwright. This wasn't trend-chasing; several hard requirements dictated it:

Multi-browser support: our admin backend and production use Chromium, but some clients use Firefox, so we had to test both in parallel.
Built-in auto-waiting and network control: AI conversations are asynchronous streaming. Playwright actions auto-wait for elements, and its Python API offers page.wait_for_selector, page.expect_response, and the networkidle load state, which is far more robust than sprinkling fixed sleeps through Selenium scripts and keeps tests both fast and stable.
Seamless integration with pytest: our backend team uses Python; pytest fixtures, parametrize, and marks can be reused, so writing tests feels like writing normal Python scripts.

Architecturally, we don't test the API directly. Instead we simulate a real user conversation: Playwright opens the frontend page, types the questions one by one, waits for each AI response to complete, and at the end makes a single assertion that the final reply contains the personal information set in the first turn. This way one test case covers the full flow of N turns. Test data is driven by JSON files, and pytest's @pytest.mark.parametrize loads the different scenarios (ultra-long conversations, mixed Chinese-English input, nicknames with special characters, etc.), so one script runs all cases.
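
As a rough sketch of what "JSON-driven" can look like, the snippet below parses a scenario list, expands each scenario into the messages to send, and checks the final reply for the expected facts. The field names (setup, filler_turns, probe, expected) and the helper functions are hypothetical, chosen only for this illustration; the real scenario files may be shaped differently.

```python
import json

# Hypothetical scenario data; in practice this would live in a JSON file.
SCENARIOS_JSON = """
[
  {"id": "long-chat",
   "setup": "My name is Lao Zhang and I like iced Americano.",
   "filler_turns": 98,
   "probe": "What do I usually drink?",
   "expected": ["Lao Zhang", "iced Americano"]},
  {"id": "special-nickname",
   "setup": "Call me @Zhang_007!",
   "filler_turns": 48,
   "probe": "What is my nickname?",
   "expected": ["@Zhang_007"]}
]
"""


def load_scenarios(raw=SCENARIOS_JSON):
    """Parse the scenarios and return (scenarios, ids) for parametrization."""
    scenarios = json.loads(raw)
    return scenarios, [s["id"] for s in scenarios]


def build_turns(scenario):
    """Expand a scenario into the ordered list of user messages to send."""
    filler = [f"Filler question {i}" for i in range(1, scenario["filler_turns"] + 1)]
    return [scenario["setup"], *filler, scenario["probe"]]


def missing_facts(reply, expected):
    """Return the expected facts that the final reply fails to mention."""
    return [fact for fact in expected if fact.lower() not in reply.lower()]


scenarios, ids = load_scenarios()
turns = build_turns(scenarios[0])
print(len(turns))  # 100
print(missing_facts("Iced Americano as usual, Lao Zhang?", scenarios[0]["expected"]))  # []
```

In a test module, the two return values would typically be passed as @pytest.mark.parametrize("scenario", scenarios, ids=ids), and the single end-of-conversation assertion becomes a check that missing_facts returns an empty list.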

Core Implementation: Building a Long-Conversation Memory Test from Scratch

The first piece of code answers one question: how do we start a browser instance in a fixture and manage configuration and cleanup in one place?

First, install the dependencies with pip install pytest-playwright requests, then download the browsers with playwright install. We create conftest.py, use the page fixture that pytest-playwright provides, and add an autouse fixture that clears the memory store before every test so each one starts from a clean slate.

# conftest.py
import pytest
import requests

@pytest.fixture(scope="session")
def api_base():
    return "http://localhost:8000/api"

@pytest.fixture(autouse=True)
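def reset_memory(api_base):
    """Before each test case, call the backend to clear the memory store so cases don't interfere."""
    resp = requests.post(f"{api_base}/memory/reset", json={"namespace": "test"}, timeout=10)
    resp.raise_for_status()  # fail fast if the reset did not succeed
    yield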

@pytest.fixture
def chat_page(page, base_url):
    """Return a page with the chat UI already open; base_url is the fixture that ships with pytest-playwright (set via --base-url)."""
    page.goto(f"{base_url}/chat")
    page.wait_for_selector("#chat-input", state="visible")
    return page

Note: inside reset_memory we directly call the backend API to clear the test namespace in Redis, so leftover memory from a previous case won't contaminate the current one. In a real project you might need authentication or a separate test database.

The second piece of code answers the next question: how do we automate multiple conversation turns and wait for each streaming reply to finish?

The DOM structure of the streaming reply is dynamic: the AI answer appears character by character inside a div.message-content, with the inner text constantly changing. We need to wait until it stops changing to consider a turn finished. I wrote a utility function that uses "text length no longer grows" as the end signal.

# utils.py
import time

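def wait_for_reply_done(page, selector=".message-content:last-of-type", timeout=30):
    """Wait until the text length of the last AI reply stops changing, which we treat as the end of streaming."""
    # :last-of-type can match more than one element, so explicitly take the final match
    last_reply = page.locator(selector).last
    prev_len = 0
    stable_count = 0
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # Check count() first: inner_text() would otherwise block waiting for a missing element
        current_len = len(last_reply.inner_text()) if last_reply.count() else 0
        if current_len == prev_len and current_len > 0:
            stable_count += 1
            if stable_count >= 3:   # length unchanged for 3 consecutive samples: treat as done
                return
        else:
            stable_count = 0
        prev_len = current_len
        page.wait_for_timeout(300)  # Playwright-aware pause, preferred over time.sleep in the sync API
    raise TimeoutError("Timed out waiting for reply to finish")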
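
One caveat with a length-stability check: if it runs immediately after a message is sent, the last reply element on the page may still be the previous turn's answer, which is already stable. A possible guard, sketched below, is to record how many reply elements exist before sending and wait for that count to grow before checking stability. The send_turn and wait_for_new_reply helpers are hypothetical, and the sketch assumes that pressing Enter submits the message and that .message-content matches AI replies only.

```python
import time


def wait_for_new_reply(page, count_before, selector=".message-content", timeout=30):
    """Block until more reply elements exist than before the message was sent."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if page.locator(selector).count() > count_before:
            return
        page.wait_for_timeout(100)  # short Playwright-aware pause between polls
    raise TimeoutError("No new reply element appeared")


def send_turn(page, text, wait_done, input_selector="#chat-input",
              reply_selector=".message-content"):
    """Send one user message and return once its reply has finished streaming."""
    count_before = page.locator(reply_selector).count()
    page.fill(input_selector, text)
    page.press(input_selector, "Enter")  # assumption: Enter submits the message
    wait_for_new_reply(page, count_before, reply_selector)
    wait_done(page)  # e.g. wait_for_reply_done from utils.py
```

A 100-turn conversation then reduces to a loop such as: for text in turns: send_turn(chat_page, text, wait_for_reply_done).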