I Spent 48 Hours Debugging Multi-Turn LLM Memory Loss—Then Playwright + Pytest Locked It Down

#python #programming

At 2 a.m., I was jolted awake by an ops call: “The smart customer service bot went nuts. A user asked, ‘What’s the order number I just told you?’, and it replied, ‘You haven’t given me an order number yet.’” I snapped wide awake — we had spent two months refining the memory storage module and all unit tests passed, so how did it lose its memory in production? Over the next 48 hours, I went from questioning my sanity to a permanent fix. If you’re building multi-turn conversations with LLMs, this postmortem should save you some painful detours.

The Diagnosis: Why All Unit Tests Passed, But Production Failed

Our architecture wasn’t complex: frontend chat UI → API gateway → conversation service → LLM. Inside the conversation service, a MemoryStore saved each turn into Redis and assembled context for the next turn. Unit tests used a mocked Redis client and covered every line of MemoryStore with over 95% coverage — looking rock solid.

But the real problem was a “timing amplification effect” that only surfaced with real user interactions:

Users could send messages in rapid succession; the write of one turn might not finish before the next turn reads history — a race condition your synchronous mocks never produce.
Session management might be cached or cross-bound across the frontend, gateway, and service layers; unit tests only cover the service layer, never touching the real chain.
After a WebSocket reconnection in the browser, some frontend frameworks lose the context identifier, causing the backend to treat it as a new session and wipe memory.

Any one of these gremlins is easy to fix in isolation, but together they turn multi-turn consistency into a dark art. Manually clicking through the browser just isn’t feasible and misses many cases.

The Plan: Lock Down Memory with End-to-End Tests

I needed an end-to-end test approach that could simulate the complete user path — sending multiple messages from the browser — and assert that memory was truly retained. I considered three directions:

Backend integration tests: direct API calls with requests, but they skip the WebSocket handshake and frontend session management, missing critical realism.
Selenium: the veteran, but async waits and modern frontend frameworks often clash, and handling iframes/shadow DOM adds friction.
Playwright: natively supports auto-waiting, network interception, and multiple browsers. Its API feels modern (like it’s from 2024), and it integrates seamlessly with Pytest.

I chose Playwright + Pytest because it can control Chromium directly, simulate continuous input, wait for LLM streaming output, inspect chat bubbles on the page, and even call backend APIs to cross-check memory data. The test chain becomes:

Pytest test → Playwright controlling browser → Frontend Chat UI → API → MemoryStore → Redis

A single script validates both frontend display and backend storage, leaving no room for amnesia.

The Implementation: Building Memory-Verification Tests Step by Step

1. Test Fixture: a Browser That Auto-Cleans Memory

This fixture ensures each test starts from a clean session to avoid memory contamination between tests.

# conftest.py
import pytest
from playwright.sync_api import sync_playwright, Page
import requests

BASE_URL = "http://localhost:3000"        # 前端地址
API_BASE = "http://localhost:8000/api"   # 后端地址

@pytest.fixture(scope="function")
def browser_page():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)  # 生产环境请用 headless
        context = browser.new_context()
        page = context.new_page()
        # 先通过 API 确保会话清理干净，官方文档没告诉你这一步能避免一半的 flaky
        requests.post(f"{API_BASE}/session/reset", json={"user_id": "test_user"})
        yield page
        context.close()
        browser.close()

Why not clean up after the test? Because the test might exit abnormally, and cleanup code after yield isn’t guaranteed to run. It’s safer to force‑reset during setup, guaranteeing a clean start every time.

2. Core Test Case: Verifying Multi-Turn Memory Consistency

This test simulates a user first telling the bot their name, then asking “What’s my name?” and asserts the reply contains the previously given name. We also directly fetch the memory store via the backend API for double assurance.

# test_memory.py
import time, requests
from playwright.sync_api import Page

def test_multi_turn_memory_consistency(browser_page: Page):
    page = browser_page
    page.goto(f"{BASE_URL}/chat")

    # 第一轮：告诉它名字
    send_message(page, "我叫李白，记住这个。")
    # 等待 LLM 流式输出完成，这里自动等元素出现避免假失败
    page.wait_for_selector("div.message-bot:last-child", timeout=10000)
    bot_reply_1 = get_last_bot_message(page)

    # 第二轮：询问名字，必须在回复中找到“李白”
    send_message(page, "我刚才说我叫什么名字？")
    page.wait_for_selector("div.message-bot:last-child", timeout=10000)
    bot_reply_2 = get_last_bot_message(page)

    assert "李白" in bot_reply_2, f"预期回复包含名字，实际：{bot_reply_2}"

    # 后端双重验证：确认记忆存储确实有相应历史
    session_history = requests.get(
        f"{API_BASE}/memory/test_user"
    ).json()
    messages = [m["content"] for m in session_history["messages"]]
    assert any("我叫李白" in m for m in messages), "记忆存储中没有第一轮用户消息"
    assert any