At 1:23 AM our ops group chat exploded—users were reporting that the agent had completely lost its memory. Every conversation felt like it was starting from scratch. I dug into the logs and found the memory module returning an empty list, even though the records were sitting right there in the database. It wasn’t a model hallucination. The consistency of the memory storage had been silently broken: two concurrent requests had overwritten the last message of the session with an older version. And the mocked MemoryStore in our unit tests would never tell you that. That night I decided to build a reproducible, real-storage verification setup using Pytest + Docker. I started at midnight and didn’t stop until dawn—and I hit way more pitfalls than I expected. Here’s the full postmortem so you can save yourself a few sleepless nights.
Breaking Down the Problem: Why Mocks Can’t Catch Memory’s Fatal Flaws
An AI agent’s memory storage seems deceptively simple: insert a row per conversation turn with session_id, role, content, and created_at, then fetch the most recent N rows per session to build the context. We used PostgreSQL with SQLAlchemy + asyncpg. It all looked harmless—until concurrency showed up and the gremlins came out.
- Concurrent insert ordering chaos: Instead of relying on the database’s auto-increment sequence, we generated `created_at` timestamps in application code. But server clock drift, or `datetime.utcnow()` calls being reordered across coroutines, could push later messages before earlier ones.
- “Vanishing writes” under read/write splitting: The primary accepted the write, but the subsequent query hit a read replica. Replication lag made the freshly inserted message invisible, so the agent simply “forgot” it.
- Fake snapshots due to transaction isolation: Under default READ COMMITTED, a long transaction could see different versions of the same session on successive reads. This introduced phantom rows while assembling the context.
How do typical unit tests handle this? They swap out the repository with unittest.mock and assert “the insert method was called.” That never touches a real storage engine. Isolation levels, concurrent scheduling, network delays—all gone. Testing memory storage with mocks is like learning to parallel park in a simulator—you’ll never learn the real thing.
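Here is the shape of mock-based test that stays green no matter how broken the real storage is. `MemoryStore`’s `fetch_recent` method and `build_context` are hypothetical stand-ins for illustration:

```python
# A mock-based test of the kind described above: it asserts the call
# happened, but never exercises real ordering, isolation, or replication.
# MemoryStore / fetch_recent / build_context are hypothetical names.
import asyncio
from unittest.mock import AsyncMock


async def build_context(store, session_id: str, n: int = 10):
    # Production-style code under test: fetch the last n turns for a session.
    return await store.fetch_recent(session_id, n)


def test_context_uses_store():
    store = AsyncMock()
    # The mock happily returns whatever we tell it to -- in any order,
    # including an older row that a concurrent write has overwritten.
    store.fetch_recent.return_value = [
        {"role": "assistant", "content": "stale"},
    ]
    result = asyncio.run(build_context(store, "s1"))

    # These assertions pass even when the real database would return garbage.
    store.fetch_recent.assert_awaited_once_with("s1", 10)
    assert result
```

The test "covers" the repository call yet says nothing about what a real PostgreSQL instance would return under concurrency.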
The Plan: Pull a Real Database Into Tests with Docker
To verify both correctness and consistency, you have to swing at real pitches. The plan was straightforward: Pytest organizes the test cases, and Docker provides a disposable, genuine database. At test time you spin up a PostgreSQL container, wait for its health check, run migrations, execute concurrent scenarios, then tear it all down. Every run starts from a clean slate.
Why not other approaches?
- ❌ Testcontainers-Python: Nice idea, but it requires a Docker daemon in CI and its abstraction isn’t transparent enough. When things break you can’t tell if the container never started or the port mapping went sideways.
- ❌ SQLite in-memory mode: Its isolation level and concurrency model are too different from PostgreSQL. It won’t surface transaction conflicts or simulate replication lag—a total waste.
- ✅ Docker Compose: A single YAML describes the dependencies, works the same in CI and locally. The way ops orchestrates production is how we orchestrate tests, reproducing ~90% of real behavior.
The architecture in text form:
Test startup
├─ docker compose up -d (postgres, optionally pgvector, redis)
├─ wait for health check
├─ run alembic migrations / create tables
├─ pytest cases (correctness + concurrency consistency)
└─ docker compose down -v
One easily overlooked point: concurrency tests must run with real async I/O. You can’t just rely on pytest-asyncio’s default loop. We need to control the event loop lifecycle so all async fixtures share the same loop, giving us the same coroutine scheduling behavior as production.
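With pytest-asyncio, sharing one loop has historically meant overriding the `event_loop` fixture. A sketch of that override follows; note this is the pre-0.23 pytest-asyncio idiom, and newer releases deprecate it in favor of loop-scope settings:

```python
# conftest.py excerpt: one event loop for the whole test session (sketch).
# This `event_loop` override is the older pytest-asyncio idiom; recent
# releases prefer loop-scope configuration instead of redefining it.
import asyncio

import pytest


@pytest.fixture(scope="session")
def event_loop():
    # Session scope means async fixtures (container setup, the DB pool) and
    # all async tests run on this same loop, so coroutine scheduling in
    # tests matches what production sees.
    loop = asyncio.new_event_loop()
    yield loop
    loop.close()
```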
Core Implementation: Building It Step by Step
First, the docker-compose.yml. Keep it minimal, but get the health check right—screw this up and you’ll step on landmines.
```yaml
# docker-compose.yml
version: "3.9"
services:
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: agent_test
      POSTGRES_PASSWORD: test_pass
      POSTGRES_DB: memory_test
    ports:
      - "0:5432"  # random host port, avoids local conflicts
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "agent_test"]
      interval: 1s
      timeout: 3s
      retries: 10
```
The port mapping 0:5432 tells Docker to assign a random host port. In Python we’ll grab it with docker compose port, so parallel test runs never collide.
Now for conftest.py, which manages the container lifecycle and the database connection pool. I’ve stepped on enough landmines here—here’s the final working version.
```python
# conftest.py
import asyncio
import subprocess
import time

import asyncpg
import pytest
import pytest_asyncio


def _get_port() -> int:
    """Resolve the host port Docker mapped for us via `docker compose port`."""
    result = subprocess.run(
        ["docker", "compose", "port", "postgres", "5432"],
        capture_output=True, text=True, check=True,
    )
    # Output format: "0.0.0.0:54321"
    return int(result.stdout.strip().split(":")[1])


@pytest.fixture(scope="session")
def docker_services():
    """Start the docker compose services and yield the mapped host port."""
    subprocess.run(["docker", "compose", "up", "-d"], check=True)
    # Wait for the health check to pass instead of a blind sleep
    for _ in range(30):
        try:
            port = _get_port()
            # Confirm readiness once more with pg_isready
            subprocess.run(
                ["pg_isready", "-h", "127.0.0.1", "-p", str(port),
                 "-U", "agent_test"],
                check=True, capture_output=True,
            )
            break
        except (subprocess.CalledProcessError, ValueError):
            time.sleep(1)
    else:
        raise RuntimeError("postgres container never became healthy")
    yield port
    # Tear everything down, including the data volume: every run is clean
    subprocess.run(["docker", "compose", "down", "-v"], check=True)
```