BAOFUFAN

Posted on Jul 1

How We Caught 90% of Our LLM Memory Bugs with Playwright

#python #programming

At 1 a.m., our product manager dropped a screenshot in the group chat: a user asked, “Repeat the order number I just gave you,” and our AI customer-service bot replied, “Please provide your order number.” The user cursed three times, and the screenshot ended up on a complaint platform.

This wasn’t a model capability problem—the model had simply “forgotten” the context. Even worse, I manually tested the same conversation three times and couldn’t reproduce it once. Then I wrote 200 automated memory tests using Playwright, and 90% of the forgetfulness pointed to one embarrassingly basic problem: inconsistent serialization and deserialization in the front-end storage layer.

Breaking down the problem: how memory silently disappears

Our customer-service bot is built on an LLM with LangChain, and the front end uses a React chat component. Conversation history is kept in the browser’s localStorage. Every time a user refreshes or reopens the page, the front end reads the history from localStorage and stitches it into the prompt to maintain long-context memory.

Sounds great in theory. In reality, many users access the service from mobile browsers, and the page gets killed frequently by the OS. When the session restores, we rely on that locally cached conversation to rebuild memory. If anything goes wrong during storage or retrieval, the model suffers amnesia.

Why didn’t manual testing catch it? Because we unconsciously test with the “correct” sequence—send message A, then message B, then ask about message A’s content. Real users, however, may refresh the page, switch apps, or even have their phone die and restart. The effect of these actions on the front-end storage is almost impossible to reproduce manually. Traditional unit tests only cover JS logic; they can’t test the complete “browser-level persistence and restoration” path.

Why Playwright for end-to-end memory verification?

I needed a tool that could control a real browser, manipulate localStorage, intercept network requests, and assert page content. Selenium is too heavy, Puppeteer only supports JS, Cypress is intrusive and has weak multi-tab support. Playwright supports Python, has a clean API, and can directly push arbitrary content into localStorage via page.evaluate(). It also simulates page refreshes and cross-page restores perfectly.

Architecturally, I divided the tests into three layers:

Scenario layer: defines memory test scenarios, e.g., “order number recallable after multiple conversation turns,” “memory not lost after refresh.”
Driver layer: wraps Playwright operations and provides high-level actions like send_message and reload_and_restore.
Verification layer: asserts memory consistency from two dimensions—page text and localStorage contents.

This way, even if the front-end framework changes from React to Vue, I only need to update selectors; the test logic stays intact.

Core implementation: let Playwright repeat every bizarre user action

1. Setting up a repeatable test environment

We need a runnable chat page. I used a small Streamlit app as a stand-in (launched with subprocess and killed after tests to ensure isolation). You can replace it with any front-end page.

import subprocess
import time
import pytest
from playwright.sync_api import sync_playwright

# 启动一个本地 Streamlit 聊天应用，模拟 LLM 对话服务
def start_chat_server():
    proc = subprocess.Popen(
        ["streamlit", "run", "chat_ui.py", "--server.port", "8510"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL
    )
    time.sleep(3)  # 等待服务就绪
    return proc

@pytest.fixture(scope="module")
def page():
    server = start_chat_server()
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context()
        page = context.new_page()
        page.goto("http://localhost:8510")
        yield page
        browser.close()
        server.terminate()

This snippet solves “how to guarantee each test starts from a clean state.” new_context() ensures isolated storage space, preventing cross-test contamination.

2. Simulating multi-turn conversation and asserting memory retention

The test below is the core: the user first provides an order number, chats about unrelated topics, then asks the bot to repeat the number. We expect the model to correctly recall it.

def test_order_id_recall_after_conversation(page):
    # 封装发送消息的函数，处理前端可能的异步渲染
    def send_message(text: str):
        # 找到输入框并键入消息。实际选择器按你的前端结构调整
        page.fill('textarea[aria-label="输入消息"]', text)
        page.click('button:has-text("发送")')
        # 等待 AI 回复出现（以最后一条气泡里的文本为准）
        page.wait_for_selector(".message.assistant:last-of-type .content", timeout=5000)

    # 用户提供订单号
    send_message("我的订单号是 ORD-9876，查一下物流")
    # 中间几轮无关对话，目的是扩大上下文距离，更容易触达遗忘边界
    send_message("这个商品支持七天无理由吗？")
    send_message("发货一般要多久？")
    # 关键轮次：再次询问订单号
    send_message("我刚才说的订单号是什么？你重复一遍")

    # 获取最后一条助手的回复文本
    last_reply = page.text_content(".message.assistant:last-of-type .content")
    # 核心断言：回复中必须包含之前给出的订单号
    assert "ORD-9876" in last_reply, (
        f"记忆丢失！模型没有复述订单号，实际回复: {last_reply}"
    )

If this step fails, it’s classic long-context forgetting—something manual testing would likely never hit because nobody manually pads exactly the right number of distraction turns.

3. Testing memory recovery after refresh — the real danger zone

When a user refreshes the page or restores it from the background, the front end reads from localStorage to rebuild memory. Common serialization mistakes (e.g., storing and reading bigint, Date objects) cause the restored conversation to be truncated or empty, leading to immediate amnesia.

def test_memory_persists_after_reload(page):
    def send_message(text: str, sender="user"):
        # 这个测试中我们直接操作 localStorage 模拟前置对话
        # 更贴近“用户之前聊过，现在刷新页面回来”的场景
        pass

In the actual test script I’ll first inject a complete conversation into localStorage, then reload the page, and send a new message referencing the conversation. If the LLM can’t recall, the serialization is broken. The code below demonstrates how to inject and verify.

def test_memory_after_reload_and_send(page):
    # 直接往 localStorage 里写入一段历史对话（模拟用户之前聊过）
    existing_conversation = [
        {"role": "user", "content": "我的订单号是 ORD-1234"},
        {"role": "assistant", "content": "好的，我记下了您的订单号 ORD-1234。"}
    ]
    page.evaluate(
        "window.localStorage.setItem('chat_history', JSON.stringify(arguments[0]))",
        existing_conversation
    )

    # 刷新页面，模拟用户重新打开或刷新
    page.reload()
    # 等待页面加载完成
    page.wait_for_selector('textarea[aria-label="输入消息"]', timeout=5000)

    # 续接对话：询问之前告知的订单号
    page.fill('textarea[aria-label="输入消息"]', "我之前说的订单号是什么？")
    page.click('button:has-text("发送")')
    page.wait_for_selector(".message.assistant:last-of-type .content", timeout=5000)

    last_reply = page.text_content(".message.assistant:last-of-type .content")
    assert "ORD-1234" in last_reply, f"刷新后记忆丢失，回复内容: {last_reply}"

4. Verifying `localStorage` contents directly — catching hidden corruption

Sometimes the model still works in isolated tests but the stored data is silently mangled. By adding a direct assertion on the localStorage string, we can detect inconsistencies early.

def test_localstorage_integrity(page):
    page.fill('textarea[aria-label="输入消息"]', "我的会员ID是 VIP-2024")
    page.click('button:has-text("发送")')
    page.wait_for_selector(".message.assistant:last-of-type .content", timeout=5000)

    raw = page.evaluate("window.localStorage.getItem('chat_history')")
    assert raw is not None, "localStorage 中没有对话历史"
    parsed = json.loads(raw)
    # 验证关键数据没有被序列化破坏
    assert any("VIP-2024" in msg.get("content", "") for msg in parsed), (
        "localStorage 中的数据丢失了原始信息"
    )

This test alone caught 8 different serialization bugs during the sprint.

The real culprit: serialization silently breaks your context

Most people treat JSON.stringify and JSON.parse as transparent. They aren’t, especially when custom objects, undefined, or non-serializable values sneak into the conversation array. In our case:

A Date object from a message timestamp was directly stored, and after parsing it became a string, causing downstream code to skip that message because the valid-time check failed.
Some messages had content fields set to undefined (due to streaming glitches). JSON.stringify converts undefined to null, but the restoration logic expected typeof checks on undefined, causing an entire dialog block to disappear.
On mobile browsers with tight storage quotas, partial writes happened, leaving the JSON structure broken.

After fixing all these, the memory test pass rate went from 32% to 96% overnight.

Takeaways: make memory testing a habit, not an afterthought

Test memory logic in the browser, not in unit tests. Serialization is a runtime concern; unit tests with mocked localStorage won’t catch real-world quirks.
Write adversarial memory scenarios. Refresh, background-kill, and restore from partial state—these are exactly Playwright’s strengths.
Verify at multiple layers. Page output and raw localStorage content must align.
Automate relentlessly. If you have an AI feature that depends on context, you should have hundreds of Playwright tests that wear down the context like a user with a habit of switching apps.

That 1 a.m. bug report was a gift. It pushed me to stop trusting manual memory tests and start treating front-end storage as a first-class citizen. Playwright gave me the power to simulate the chaotic behavior of real users, and those 200 tests now run on every CI build. Memory bugs? They show up in 90 seconds now, not in customer complaints.

If you’re struggling with similar long-context memory issues, consider adding a “memory canary” test—a simple prompt-response pair that must survive whatever your front-end storage does. It might be the most productive 20 lines of code you write this week.

DEV Community

How We Caught 90% of Our LLM Memory Bugs with Playwright

Breaking down the problem: how memory silently disappears

Why Playwright for end-to-end memory verification?

Core implementation: let Playwright repeat every bizarre user action

1. Setting up a repeatable test environment

2. Simulating multi-turn conversation and asserting memory retention

3. Testing memory recovery after refresh — the real danger zone

4. Verifying `localStorage` contents directly — catching hidden corruption

The real culprit: serialization silently breaks your context

Takeaways: make memory testing a habit, not an afterthought

Top comments (0)

Breaking down the problem: how memory silently disappears

Why Playwright for end-to-end memory verification?

Core implementation: let Playwright repeat every bizarre user action

1. Setting up a repeatable test environment

2. Simulating multi-turn conversation and asserting memory retention

3. Testing memory recovery after refresh — the real danger zone

4. Verifying localStorage contents directly — catching hidden corruption

The real culprit: serialization silently breaks your context

Takeaways: make memory testing a habit, not an afterthought

4. Verifying `localStorage` contents directly — catching hidden corruption