How I Cut LLM Memory Bug Diagnosis from 2 Hours to 5 Minutes with Playwright & Allure

#python #programming

At 2 a.m., the customer group chat erupted: "Your AI assistant completely forgot the client background I provided last week and even fabricated new details!" I rolled out of bed, opened the logs, and stared at tens of thousands of conversation lines—like finding a needle in a haystack. That night, I spent nearly two hours tracking down the cause: when conversations exceeded 24 turns, our history truncation logic silently dropped the system prompts in the middle, causing memory to break completely. The next day, I decided this torture had to end with automation. If you're building LLM applications and are tormented by "occasional amnesia" or "intermittent hallucinations," this article will hand you a ready-to-use end-to-end memory testing solution.

Problem Breakdown: LLM "Amnesia" Is Scarier Than Hallucination

Once an LLM application is live, users assume it will remember conversation history. The reality is: any link in the chain involving context window management, RAG retrieval, or multi-agent state passing can cause "amnesia"—needed information fails to reach the model, or gets truncated and overwritten along the way. The risk with such bugs is that they don't fail every time; they typically surface only under specific conversation lengths or message sequences, making manual reproduction extremely hard.

Conventional testing methods fall into two buckets: (1) Unit tests on the model API, verifying single-turn input/output, never touching memory logic. (2) Manually clicking through dozens of conversation turns on the UI, time-consuming, exhausting, and impossible to precisely replicate the exact path. One forgotten step by the tester and the bug slips away. What we truly lack is an automated approach that can converse with the system like a real user and clearly record what it remembers and at which turn it starts forgetting.

Solution Design: Why Playwright + Allure?

Facing this requirement, I evaluated several paths:

Direct API calls: Fast, but bypass front-end logic like message assembly, timestamps, and user identity injection, missing the full real-world chain.
Selenium: Mature ecosystem, but async waiting and Shadow DOM handling in modern front-end frameworks are painful, and the community has clearly shifted toward Playwright.
Cypress: Limited to JS/TS ecosystem. Our backend and model services are Python-based, making stack unification too costly.

The reasons for finally choosing Playwright are straightforward: native sync/async support, auto-wait mechanism drastically reducing flaky tests, and perfect control over Chromium to capture screenshots of every critical step. Allure was chosen because its reports come with built-in step trees, attachment display, and failure highlighting, enabling non-technical stakeholders (like product managers) to instantly grasp at which turn the memory broke.

The architecture is simple: pytest + playwright + allure-pytest. Test cases use Playwright to operate the live front-end, execute multiple conversation turns, and assert each turn that key information is still remembered by the model. Allure packages the conversation content, model responses, assertion results, and page screenshots for each round into a single HTML report, making the reproduction path crystal clear.

Core Implementation: From Opening the Browser to Automated Memory Assertions

First Code Block: Solving Browser Reuse and Login State

Often tests slow down drastically due to login flows, requiring scanning QR codes or entering verification codes from scratch each time. We directly save the logged-in storage_state and reuse it.


python
# conftest.py - 复用登录态，避免每次测试都登录
import pytest
from playwright.sync_api import sync_playwright
from pathlib import Path

STATE_FILE = Path(__file__).parent / "auth_state.json"

@pytest.fixture(scope="session")
def browser():
    with sync_playwright() as p:
        # 如果想看到执行过程可设为 False
        browser = p.chromium.launch(headless=True, slow_m