From 2 Hours to 3 Minutes: Automated Agent Memory Regression Testing with Playwright & Pytest

#python #programming

Last month we added a "long-term memory" module to our intelligent customer-service agent. A user says they like coffee, and three days later the agent can proactively suggest "Still going with an Americano?" Sounds beautiful, right? I stopped smiling after the first round of testing. Before every minor release I had to manually open a browser, switch users, and walk through dozens of dialog scripts step by step — it never took less than two hours, and I often missed nasty issues like "cross-session memory not clearing." Even more ridiculous: once we changed a single sentence in the prompt and the model suddenly forgot half the names; we only found out after users complained.

Regressing memory functionality purely by hand is a slow suicide. So I spent a weekend building an automated regression test suite with Playwright + pytest. It slashed the time from two hours down to three minutes, and since then no release with memory degradation has slipped through. This article breaks down the entire process — from tooling decisions to complete test code, plus every pitfall I encountered.

Problem Breakdown: Why Is Memory Testing So Painful?

Let’s set the scene. Our agent is a web chat app. Each conversation has a session_id, but the memory module shares a user_id across sessions. Before shipping, we need to verify three things:

In-session memory: the user says "My name is John, I like coffee", then asks "What did I say my name was?" The reply must include "John".
Cross-session memory persistence: after logging in again and asking "What's my name?", the agent should still get it right.
No cross-contamination: one user’s memory must not leak into another user’s context.

It looks simple, but manual testing is a nightmare: you have to complete a conversation as user A, switch to user B and do another round, then switch back to A to verify — after a few rounds you just want to scream. Even worse, the memory module sometimes has write delays, so deciding "how long to wait before verifying" is pure guesswork.

The common approach of "calling the API to test the model directly" doesn’t work here. The memory logic is the result of layers — backend, prompt, gateway — all wrapped together. Interferences like URL routing, authentication, and WebSocket streaming only show up in end-to-end tests. You have to walk the full chain from the perspective of a real browser.

Solution Design: Why Not Selenium? Why Not Just `requests`?

My bottom line: the test code must simulate real human actions, and the assertions must be rock solid.

Tech choices:

Playwright: Compared to Selenium, it brings built-in smart waiting, network monitoring, and isolated browser contexts, which make it easy to simulate completely separate login states for multiple users. Plus, playwright codegen can record actions directly — quick to get started.
pytest: Fixtures manage user state, parametrization handles multiple dialog scripts, and with pytest-repeat you can even run stability stress tests.
No direct API calls with requests: We need to cover the full front-end-to-model chain, especially since the memory module trigger point might sit after the gateway and auth layer. Using a real browser is the most faithful reproduction.
Why not Cypress: Our backend team is more comfortable in the Python ecosystem, and our existing test infrastructure is all pytest. Adding Playwright was basically a pip install job.

Overall architecture: use pytest fixtures to create two fully isolated browser contexts (UserA and UserB), run a complete multi-turn conversation + cross-session verification in the same test file. The dialog process is abstracted into a run_dialogs helper that takes a list of turns and executes them in order, with assertions at the end — adding a new test case just means appending a dialog script.

Core Implementation: From Fixtures to Multi-User Memory Verification

Directory structure:

tests/
  conftest.py          # browser fixtures
  test_memory.py       # memory verification cases
  dialog_utils.py      # dialog execution & assertion helpers

Creating isolated test environments for different users

In conftest.py we declare two user-level fixtures, each bound to an independent browser context, guaranteeing complete cookie/localStorage isolation and preventing memory mix-ups.

# conftest.py
import pytest
from playwright.sync_api import sync_playwright, Browser, BrowserContext, Page

BASE_URL = "http://localhost:3000"  # your chat front-end URL

@pytest.fixture(scope="session")
def browser():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)  # use headless in CI
        yield browser
        browser.close()

@pytest.fixture
def user_a_context(browser: Browser) -> BrowserContext:
    # create a fully isolated context, equivalent to a fresh incognito window
    context = browser.new_context(viewport={"width": 1280, "height": 720})
    yield context
    context.close()

@pytest.fixture
def user_b_context(browser: Browser) -> BrowserContext:
    context = browser.new_context()
    yield context
    context.close()

Executing multi-turn conversations in the chat window and capturing responses

We encapsulate a dialog_runner that handles message input, sending, and waiting for the streaming response to finish (our frontend uses WebSocket, so we need to wait in the DOM until the "typing" indicator disappears).

# dialog_utils.py
from playwright.sync_api import Page, expect
import re, time

def send_message(page: Page, text: str, timeout: int = 15000):
    """发送一条消息并等待回复完全输出"""
    # 使用data-testid定位，避免class变来变去
    input_box = page.locator('[data-testid="chat-input"]')
    input_box.fill(text)
    page.locator('[data-testid="send-btn"]').click()

    # 等待“正在输入”指示器出现再消失，确保模型答完
    typing_indicator = page.locator('[data-testid="typing-indicator"]')
    typing_indicator.wait_for(state="visib

DEV Community

From 2 Hours to 3 Minutes: Automated Agent Memory Regression Testing with Playwright & Pytest

Problem Breakdown: Why Is Memory Testing So Painful?

Solution Design: Why Not Selenium? Why Not Just `requests`?

Core Implementation: From Fixtures to Multi-User Memory Verification

Creating isolated test environments for different users

Executing multi-turn conversations in the chat window and capturing responses

Top comments (0)

Problem Breakdown: Why Is Memory Testing So Painful?

Solution Design: Why Not Selenium? Why Not Just requests?

Core Implementation: From Fixtures to Multi-User Memory Verification

Creating isolated test environments for different users

Executing multi-turn conversations in the chat window and capturing responses

Solution Design: Why Not Selenium? Why Not Just `requests`?