We Automated LLM Memory Tests and Got 8x Efficiency

#python #programming

At 2 a.m., I was jolted awake by a DingTalk message from my QA colleague: “ChatGPT’s Memory is broken again. It forgot all the preferences I taught it yesterday.” I got up, opened my laptop, logged in, switched sessions, typed prompts, and compared results. Half an hour gone, and this was already the third time this month. Worse, nobody wanted to do this tedious manual check every day, and with people just clicking through by hand, missed regressions were almost inevitable. I started thinking: could I write a Playwright script, hook it up to GitHub Actions to run automatically every day, and get an alert when something breaks? This article walks through the entire implementation.

Breaking Down the Problem: Why Memory Is Hard to Test

The “Memory” of a large language model isn’t the same as traditional conversation context. Conversation context only lives within a single session window, while Memory persists across sessions. For example, if you tell the model “I prefer replies in Chinese,” it should remember that preference even when you start a brand new chat.
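To make the distinction concrete, here is a toy model in Python. It is only a conceptual sketch: MemoryStore and ChatSession are hypothetical names invented for illustration, and nothing here reflects how any vendor actually implements Memory.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Per-user facts that outlive any single chat session (hypothetical)."""
    facts: list[str] = field(default_factory=list)

    def remember(self, fact: str) -> None:
        self.facts.append(fact)


@dataclass
class ChatSession:
    """One chat window: its context starts empty and is never shared."""
    memory: MemoryStore
    context: list[str] = field(default_factory=list)

    def send(self, message: str) -> None:
        self.context.append(message)

    def visible_to_model(self) -> list[str]:
        # The model sees persisted memory plus this session's own messages only.
        return self.memory.facts + self.context


memory = MemoryStore()

first = ChatSession(memory)
first.send("I prefer replies in Chinese")
memory.remember("User prefers replies in Chinese")

second = ChatSession(memory)  # a brand new chat for the same user
second.send("What language do I prefer?")

print(second.visible_to_model())
```

The second session never sees the first session's raw messages, only the fact that was written to memory. That is exactly the property a cross-session test has to check.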

The challenges in testing this feature are:

It depends on real user interaction paths. Many commercial LLMs (like ChatGPT, Claude) only allow memory configuration through the web UI. The API either lacks memory management or goes through a different backend path, so API tests cannot truly replicate the end-user experience.
Cross-session state isolation. You must completely end one session and start a new one between “setting memory” and “verifying memory.” You cannot reuse the same conversation ID. Pure API scripts struggle to precisely simulate the front-end behavior of starting a new chat.
Manual regression is expensive and unreliable. Running through a dozen test cases takes half an hour. Nobody wants to do that daily, so testing often only happens right before a release — by then the bugs have been hiding for several versions.

In one sentence: we needed a test solution that can genuinely drive a browser, switch sessions automatically, and run on a schedule every day.

Solution Design: Playwright + GitHub Actions

I compared a few technical options:

Selenium: The old guard. Its waiting mechanisms for modern web apps aren’t as elegant as Playwright’s, and you end up writing a lot of WebDriverWait calls. Isolating sessions from each other within one browser also feels clunky.
Direct API calls: As mentioned, the memory functionality isn’t fully exposed via APIs, and this wouldn’t test the real user path anyway.
Playwright: Natively supports multiple Browser Contexts, has built-in auto-waiting and network monitoring, and writing a script feels just like describing user actions. Plus, its storageState capability (storage_state in the Python API) makes it easy to persist the login state. You can keep the user logged in while still starting a completely fresh conversation, which is perfect for simulating cross-session scenarios.

The architecture is dead simple: a Python + pytest test script uses Playwright’s synchronous API to define the business flow, and a GitHub Actions workflow runs it daily at 2:00 AM UTC (10:00 AM Beijing Time). When the run finishes, either the results are pushed to Slack or the failing workflow run itself triggers the alert.
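The Slack push could look roughly like the sketch below. It assumes a Slack incoming webhook, which accepts a JSON body with a text field. The SLACK_WEBHOOK_URL variable name, the message wording, and the example counts and URL are placeholders, not part of the original setup.

```python
import json
import os
import urllib.request


def build_slack_payload(passed: int, failed: int, run_url: str) -> dict:
    """Build the JSON body for a Slack incoming webhook message."""
    status = "FAILED" if failed else "passed"
    return {
        "text": (
            f"LLM memory tests {status}: "
            f"{passed} passed, {failed} failed. Details: {run_url}"
        )
    }


def post_to_slack(webhook_url: str, payload: dict, timeout: float = 10.0) -> int:
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # SLACK_WEBHOOK_URL is a hypothetical secret name; counts and URL are placeholders.
    webhook_url = os.environ.get("SLACK_WEBHOOK_URL")
    payload = build_slack_payload(passed=11, failed=1, run_url="https://example.com/run/1")
    if webhook_url:
        post_to_slack(webhook_url, payload)
    else:
        # Without a webhook configured, just show what would have been sent.
        print(json.dumps(payload))
```

Keeping the payload builder separate from the network call means the message format can be unit-tested without touching Slack.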

The biggest win here is zero extra operational cost. There is no need to host your own runner: the 2,000 free GitHub Actions minutes per month (the GitHub Free allowance for private repositories) are plenty for most teams.
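A quick back-of-the-envelope check of that budget, as a sketch. The 10-minute run time is an assumed figure for illustration, not a measurement from this project; substitute your own.

```python
FREE_MINUTES_PER_MONTH = 2000  # the free allowance quoted above


def monthly_minutes(minutes_per_run: float, runs_per_day: int = 1, days: int = 31) -> float:
    """Minutes consumed by a month of scheduled runs."""
    return minutes_per_run * runs_per_day * days


# Assumption: one scheduled run per day, about 10 minutes each.
usage = monthly_minutes(minutes_per_run=10)
share = usage / FREE_MINUTES_PER_MONTH
print(f"{usage:.0f} of {FREE_MINUTES_PER_MONTH} minutes ({share:.1%})")
```

Under those assumptions a daily run uses 310 minutes, about 15.5% of the allowance, which leaves plenty of room for reruns and other workflows.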

Core Implementation: Building a Working Test Step by Step

1. Handling Login and Memory Setup Flow

This snippet logs into the platform, sends an instruction to the LLM to set a memory, and confirms the setup was successful. Using sync_playwright is straightforward; every line reads like “click here, type that.”

# test_memory.py
import os
import time
from playwright.sync_api import sync_playwright, expect

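BASE_URL = "https://chat.example.com"          # replace with your LLM's web URL
EMAIL = os.getenv("MODEL_EMAIL")                # read from environment variables, never hard-coded
PASSWORD = os.getenv("MODEL_PASSWORD")

def test_memory_persists_across_sessions():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)   # headless mode is required in CI
        context = browser.new_context()
        page = context.new_page()

        # ---- Log in ----
        page.goto(f"{BASE_URL}/login")
        page.fill('input[name="email"]', EMAIL)
        page.fill('input[name="password"]', PASSWORD)
        page.click('button[type="submit"]')
        # Wait for an element that only appears after a successful login, e.g. the sidebar
        page.wait_for_selector('nav.sidebar', timeout=10000)

        # ---- Set the memory: tell the model the user's preferences ----
        # Click "New chat" first so the memory is set in a clean conversation
        page.click('button:has-text("New chat")')
        page.wait_for_selector('textarea[placeholder*="Send a message"]')
        msg_input = page.locator('textarea[placeholder*="Send a message"]')
        # Prompt: "Remember: reply in Chinese in all future conversations, and call me Lao Zhang."
        msg_input.fill("记住，我接下来所有对话都请用中文回复，且称呼我为老张。")
        page.click('button[type="submit"]')     # send button
        # Wait for the model's reply and require a confirmation ("记住了" = "got it, remembered"),
        # so we don't move on before the memory has been saved
        page.wait_for_selector('text=记住了', timeout=15000)

        # ---- Crucial step: wait for the memory to be written in the background ----
        # The official docs don't mention this delay, but in our tests memory sync took a few seconds
        time.sleep(5)

        # Save the browser state so a later session can reuse the login (we don't use it to "start a new session")
        context.storage_state(path="/tmp/state.json")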
        context.close()
        browser.close()
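Two small hardening helpers are sketched below, as optional additions rather than part of the script above. require_env fails fast with a readable message when a credential is missing, instead of letting page.fill receive None. wait_until is a generic polling loop that could replace the fixed five-second sleep, but only if the UI exposes some observable signal that the memory was saved; whether such a signal exists is an assumption. Both function names are hypothetical.

```python
import os
import time
from typing import Callable


def require_env(name: str) -> str:
    """Return an environment variable's value, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


def wait_until(
    predicate: Callable[[], bool],
    timeout: float = 15.0,
    interval: float = 0.5,
) -> bool:
    """Poll predicate until it returns True or timeout seconds have elapsed.

    Returns True as soon as the predicate succeeds, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In the test, EMAIL = require_env("MODEL_EMAIL") would replace the bare os.getenv call. The predicate passed to wait_until could be a lambda that checks the page for a "memory updated" indicator, if the platform you are testing shows one.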

2. Verifying That Memory Persists Across Sessions

This step creates a completely independent Browser Context and loads the saved login state, so the user identity stays the same while the conversation session is brand new. It then asks the model “What’s my name?” and expects a reply that contains “老张” (Lao Zhang) or is written in Chinese.