<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: BAOFUFAN</title>
    <description>The latest articles on DEV Community by BAOFUFAN (@_eb7f2a654e97a60ae9f96e).</description>
    <link>https://dev.to/_eb7f2a654e97a60ae9f96e</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3903614%2F88f4214a-aed8-4e71-a7f1-a6aca8cfe579.jpg</url>
      <title>DEV Community: BAOFUFAN</title>
      <link>https://dev.to/_eb7f2a654e97a60ae9f96e</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/_eb7f2a654e97a60ae9f96e"/>
    <language>en</language>
    <item>
      <title>AI Chat Memory Pitfalls: 30% of Conversations Lost on Refresh</title>
      <dc:creator>BAOFUFAN</dc:creator>
      <pubDate>Tue, 12 May 2026 01:08:18 +0000</pubDate>
      <link>https://dev.to/_eb7f2a654e97a60ae9f96e/ai-chat-memory-pitfalls-30-of-conversations-lost-on-refresh-1p24</link>
      <guid>https://dev.to/_eb7f2a654e97a60ae9f96e/ai-chat-memory-pitfalls-30-of-conversations-lost-on-refresh-1p24</guid>
      <description>&lt;p&gt;It was 1 a.m. when the product manager dropped a screenshot in the group chat: “A user chatted for 20 minutes, refreshed the page, and lost all their history. Did you guys even implement the memory feature?” My stomach tightened — this was the third memory-loss report this week. What stung more was that we &lt;em&gt;did&lt;/em&gt; write tests, but our manual multi-turn conversation test cases never touched the browser’s refresh button. I decided to write an automated test suite that actually mimics real user behavior, using Playwright and LangChain, specifically targeting memory persistence. Not only did I reproduce the bug, I followed the breadcrumbs and unearthed three hidden issues. Here’s the full post-mortem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Memory Persistence Is So Hard to Test
&lt;/h2&gt;

&lt;p&gt;The scenario is classic: a user opens a chat page, has a long multi-turn conversation, and at some point refreshes the page, closes and reopens the tab, or even backgrounds the app on mobile. The AI must remember the previous context — no lost history, no session mix-ups. Our chat backend uses LangChain’s &lt;code&gt;ConversationBufferMemory&lt;/code&gt; for memory management. The frontend is an SPA, bound to a &lt;code&gt;session_id&lt;/code&gt; on the backend.&lt;/p&gt;

&lt;p&gt;Standard tests only cover “continuous conversation within a single page load,” because manually simulating complex refresh timings is brutal, not to mention verifying consistency among localStorage, sessionStorage, cookies, and backend memory. We’d thought about automation before, but the team had tried Selenium — page reloads caused timeout after timeout, and multi-tab scenarios ended up as a callback nightmare with a maintenance cost through the roof.&lt;/p&gt;

&lt;p&gt;The root cause: testing memory persistence is fundamentally a &lt;strong&gt;stateful, cross-session, timing-sensitive&lt;/strong&gt; E2E scenario. You must simultaneously drive the browser UI &lt;em&gt;and&lt;/em&gt; inspect backend state — you can’t have one without the other. That’s why pure API tests (e.g., only hitting &lt;code&gt;/chat&lt;/code&gt;) never catch the bug: when a user refreshes the page, can the frontend correctly re-fetch history from the backend? Will the &lt;code&gt;session_id&lt;/code&gt; be wiped? Does the backend memory regress due to a serialization error? You have to let a real browser walk through it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution Design: A Playwright + LangChain Memory Testing Sandbox
&lt;/h2&gt;

&lt;p&gt;I needed a test harness that was quick to set up, could plug in different memory backends, and accurately simulate real user actions. Here’s the selection reasoning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Why Playwright over Selenium or Cypress&lt;/strong&gt;: Playwright natively supports multiple pages and contexts, auto-waits for elements, and lets you directly inject scripts to manipulate cookies/localStorage. This is a hard requirement for scenarios like “reload and reload history.” Selenium’s wait strategies are too primitive, and Cypress’s multi-tab support is limited — easy pass.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why LangChain&lt;/strong&gt;: Not just hype. LangChain’s memory abstractions are excellent. You can switch &lt;code&gt;ConversationBufferMemory&lt;/code&gt; between in-memory and Redis implementations with a one-liner, making it easy to test behavioral differences across persistence strategies. Plus, its built-in message history interface let me assert memory content directly in tests, instead of scraping the frontend DOM for history records.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architecture&lt;/strong&gt;: A simple FastAPI chat endpoint wrapping a LangChain &lt;code&gt;ConversationChain&lt;/code&gt;. It accepts a &lt;code&gt;session_id&lt;/code&gt; and a user message, and returns an AI reply. Playwright test scripts simulate user interactions and use &lt;code&gt;page.evaluate()&lt;/code&gt; to read/write the &lt;code&gt;session_id&lt;/code&gt; in the frontend’s localStorage, even simulating edge cases like corrupted storage.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Core Implementation: Building a Testable Framework from Scratch
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Chat Service: Exposing Memory as Assertible State
&lt;/h3&gt;

&lt;p&gt;The code below solves the problem: “How do I make backend memory usable in real scenarios, yet precisely assertable in tests?” I wrapped a &lt;code&gt;ConversationChain&lt;/code&gt; in FastAPI, with a dictionary holding the memory instance for each session. This allows a test-only endpoint to directly retrieve memory content — no dependency on the frontend DOM.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# chat_server.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.memory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ConversationBufferMemory&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.chains&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ConversationChain&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.chat_models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# 存储不同会话的 chain 实例，真实的生成环境会用 Redis，这里演示用内存字典
&lt;/span&gt;&lt;span class="n"&gt;chains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_or_create_chain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chains&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;memory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConversationBufferMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memory_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-3.5-turbo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;chains&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConversationChain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chains&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_or_create_chain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reply&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# 测试辅助：直接暴露记忆内容，避免依赖前端解析
&lt;/span&gt;&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/memory/{session_id}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chains&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]}&lt;/span&gt;
    &lt;span class="c1"&gt;# ConversationBufferMemory 的 buffer 就是消息列表
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;buffer&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uvicorn&lt;/span&gt;
    &lt;span class="n"&gt;uvicorn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.0.0.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once this service is running, any UI action driven by Playwright can later assert backend memory directly via the &lt;code&gt;/memory/{session_id}&lt;/code&gt; endpoint — making the test clean and deterministic.&lt;/p&gt;

</description>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>How Moving Rate Limiting to Redis+Go 8x'd Our API Gateway Throughput (And Cost Us 3 Days of Debugging)</title>
      <dc:creator>BAOFUFAN</dc:creator>
      <pubDate>Mon, 11 May 2026 12:09:03 +0000</pubDate>
      <link>https://dev.to/_eb7f2a654e97a60ae9f96e/how-moving-rate-limiting-to-redisgo-8xd-our-api-gateway-throughput-and-cost-us-3-days-of-3imo</link>
      <guid>https://dev.to/_eb7f2a654e97a60ae9f96e/how-moving-rate-limiting-to-redisgo-8xd-our-api-gateway-throughput-and-cost-us-3-days-of-3imo</guid>
      <description>&lt;p&gt;It was 2 AM. I was jolted awake by a cascade of alerts — our downstream order database was thrashing at 90% CPU, connection pools exhausted, the whole service collapsing. Scrambling through the monitoring dashboards, I found the smoking gun: a rolling deployment of the gateway had just finished. On each new pod, the local token bucket counters started from scratch. For less than 10 seconds, the rate limiter suffered a collective “amnesia.” That tiny window of uncontrolled traffic pierced through every layer of protection and brought the system to its knees.&lt;/p&gt;

&lt;p&gt;Right then, I knew: local, in-process rate limiting was done. We needed distributed rate limiting — and fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why In-Memory Rate Limiting Is a Lie in Distributed Systems
&lt;/h2&gt;

&lt;p&gt;Inside a multi-instance API gateway, rate limiting is supposed to protect downstream services. If you use Go’s &lt;code&gt;rate.Limiter&lt;/code&gt; or Guava’s &lt;code&gt;RateLimiter&lt;/code&gt;, each instance maintains its own token bucket. Under perfectly spread traffic, limits seem to hold. But two scenarios instantly strip away that protection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rolling deployments or restarts&lt;/strong&gt;: A fresh instance starts with a full bucket; old counters are never inherited. You lose limiting for that entire bootstrap window.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traffic skew&lt;/strong&gt;: If a user is consistently hashed to the same instance (think sticky sessions), the limiter only knows about that instance’s local view. When that one instance is overwhelmed, the rest of the fleet remains oblivious — and downstream still melts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The root cause is simple: “global traffic demands global counting.” The industry go-to is Redis for distributed counters, but too many implementations just use &lt;code&gt;INCR&lt;/code&gt; with a TTL — a classic fixed-window approach. Fixed windows have a notorious flaw: request bursts at the boundary. Two consecutive windows can each allow their full quota within a 200ms span, effectively doubling the allowed rate.&lt;/p&gt;

&lt;p&gt;I wanted something smoother: a &lt;strong&gt;sliding window&lt;/strong&gt; algorithm, backed by Redis sorted sets (ZSET).&lt;/p&gt;

&lt;h2&gt;
  
  
  The Design: Redis + Lua + ZSET
&lt;/h2&gt;

&lt;p&gt;I evaluated three options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Nginx/OpenResty rate-limiting modules&lt;/strong&gt;: blazing fast, but configuration is static. Dynamically adjusting rules from business logic would have been a nightmare.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sentinel/Hystrix&lt;/strong&gt;: focus more on circuit breaking and degradation. The rate limiting is again local; going distributed requires deploying an external console — too heavy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build our own Redis-based sliding window limiter&lt;/strong&gt;: Use ZSET scores to store request timestamps, with each key representing a rate-limiting dimension (user ID, API path, etc.), down to millisecond precision. A Lua script bundles the “check + add + evict” logic into an atomic operation — one network round trip does it all.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Redis was the clear winner: nearly every backend already has a Redis cluster, so zero extra deployment cost. Lua scripting guarantees atomicity under concurrency. And ZSETs are naturally suited for range queries and removals — sliding windows feel almost native.&lt;/p&gt;

&lt;p&gt;On the architecture side, it’s a thin Go middleware. Every request hits the Redis Lua script to get an accept/reject decision. To relieve Redis pressure, we later added an in-memory pre-check (more on that another time).&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core: From Atomic Lua to Go
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What this solves: atomic “check–count–evict” for a sliding window inside Redis
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lua"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- sliding_window.lua&lt;/span&gt;
&lt;span class="c1"&gt;-- KEYS[1]  限流 key, 如 "rate:api:/order:user_123"&lt;/span&gt;
&lt;span class="c1"&gt;-- ARGV[1]  窗口长度, 单位毫秒, 如 1000&lt;/span&gt;
&lt;span class="c1"&gt;-- ARGV[2]  最大请求数&lt;/span&gt;
&lt;span class="c1"&gt;-- ARGV[3]  当前时间戳, 由 Redis 服务器生成 TIME 的毫秒表示&lt;/span&gt;
&lt;span class="c1"&gt;-- ARGV[4]  成员唯一标识, 一般用纳秒级时间戳+随机数, 防止 score 相同被覆盖&lt;/span&gt;

&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;       &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;KEYS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;window_ms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;tonumber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ARGV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;tonumber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ARGV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;       &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;tonumber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ARGV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;member&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ARGV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;-- 移除窗口外的旧数据&lt;/span&gt;
&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"ZREMRANGEBYSCORE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;window_ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;-- 获取当前窗口内的请求数&lt;/span&gt;
&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"ZCARD"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
    &lt;span class="c1"&gt;-- 允许通过，添加当前请求时间戳&lt;/span&gt;
    &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"ZADD"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;member&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;-- 给 key 设置过期时间，防止无人访问时 key 永久存在&lt;/span&gt;
    &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"PEXPIRE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_ms&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Critical detail: &lt;code&gt;member&lt;/code&gt; must be globally unique. Otherwise identical scores would overwrite each other and distort the count. I generate it on the Go side as &lt;code&gt;current microsecond timestamp&lt;/code&gt; + &lt;code&gt;random number&lt;/code&gt;. This way even concurrent requests arriving in the same millisecond never collide. I also set &lt;code&gt;PEXPIRE&lt;/code&gt; to &lt;code&gt;window_ms + 1000&lt;/code&gt; — slightly longer than the window — to avoid garbage keys sticking around forever, while preventing premature expiration that could drop valid data at the boundary.&lt;/p&gt;

&lt;h3&gt;
  
  
  What this solves: Go wrapper that connects to Redis, loads the script, and exposes an &lt;code&gt;Allow&lt;/code&gt; interface
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;ratelimit&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"context"&lt;/span&gt;
    &lt;span class="s"&gt;"crypto/rand"&lt;/span&gt;
    &lt;span class="s"&gt;"fmt"&lt;/span&gt;
    &lt;span class="s"&gt;"math/big"&lt;/span&gt;
    &lt;span class="s"&gt;"time"&lt;/span&gt;

    &lt;span class="s"&gt;"github.com/redis/go-redis/v9"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;SlidingWindowLimiter&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;
    &lt;span class="n"&gt;script&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Script&lt;/span&gt;   &lt;span class="c"&gt;// 缓存 Lua 脚本 SHA&lt;/span&gt;
    &lt;span class="n"&gt;window&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;
    &lt;span class="n"&gt;limit&lt;/span&gt;  &lt;span class="kt"&gt;int&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;NewLimiter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;SlidingWindowLimiter&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;src&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="s"&gt;`
        local key       = KEYS[1]
        local window_ms = tonumber(ARGV[1])
        local limit     = tonumber(ARGV[2])
        local now       = tonumber(ARGV[3])
        local member    = ARGV[4]

        redis.call("ZREMRANGEBYSCORE", key, 0, now - window_ms)
        local count = redis.call("ZCARD", key)
        if count &amp;lt; limit then
            redis.call("ZADD", key, now, member)
            redis.call("PEXPIRE", key, window_ms + 1000)
            return 1
        else
            return 0
        end
    `&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;SlidingWindowLimiter&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;script&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewSc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>Stop Guessing Memory: How to Automate LangChain Memory Testing and Catch 80% of Multi-Turn Failures</title>
      <dc:creator>BAOFUFAN</dc:creator>
      <pubDate>Mon, 11 May 2026 01:09:41 +0000</pubDate>
      <link>https://dev.to/_eb7f2a654e97a60ae9f96e/stop-guessing-memory-how-to-automate-langchain-memory-testing-and-catch-80-of-multi-turn-failures-3nc4</link>
      <guid>https://dev.to/_eb7f2a654e97a60ae9f96e/stop-guessing-memory-how-to-automate-langchain-memory-testing-and-catch-80-of-multi-turn-failures-3nc4</guid>
      <description>&lt;p&gt;2 a.m. The customer Slack channel explodes — the support bot just asked for the same order number three times in a row. A frustrated user screams, “Do you have amnesia?” After digging through the code and the prompt, everything looks fine. Only then do we discover that &lt;code&gt;ConversationBufferMemory&lt;/code&gt; silently dropped context in one of the turns. The LLM had no idea what was said earlier. Right then I thought: if we could catch this memory loss automatically in CI, we’d never ship a black eye like this.&lt;/p&gt;

&lt;h2&gt;
  
  
  Breaking down the problem
&lt;/h2&gt;

&lt;p&gt;In LLM-powered apps, memory isn’t a “nice-to-have” anymore — it’s the core experience. LangChain gives us a buffet of memory implementations: &lt;code&gt;ConversationBufferMemory&lt;/code&gt;, &lt;code&gt;ConversationSummaryMemory&lt;/code&gt;, &lt;code&gt;VectorStoreRetrieveMemory&lt;/code&gt;, and more. But almost no project actually tests &lt;strong&gt;memory accuracy&lt;/strong&gt; seriously.&lt;/p&gt;

&lt;p&gt;The root cause is brutally simple: memory testing is too manual. Most teams spin up a chain locally, poke it with Postman or the CLI for a few turns, visually confirm “yeah, it remembered the name I just said,” and then merge. That approach has three fatal flaws:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Minimal path coverage&lt;/strong&gt; – manual testing only walks the happy path. Branch conditions (hitting the token limit, summary memory trigger timing, interleaving messages) are left to guesswork.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero regression protection&lt;/strong&gt; – next week you tweak the prompt or switch the model, and the memory logic might break, but nobody will manually re‑play every historical conversation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fuzzy verification&lt;/strong&gt; – “looks right” is not the same as &lt;em&gt;is right&lt;/em&gt;. Human judgement on whether memory is complete or hallucination-free has huge error margins.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Testing a stateful, long‑context agent with this hand‑crafted approach is like walking across a highway blindfolded. What we need is an &lt;strong&gt;automated assertion&lt;/strong&gt;‑based memory verification scheme: given a multi‑turn dialog script, precisely verify the content, order, and key facts stored inside the memory object — and run it in CI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution design
&lt;/h2&gt;

&lt;p&gt;The core idea is simple: &lt;strong&gt;turn the LLM into a deterministic “teleprompter,” then treat the memory object as the system under test and use pytest for assertions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Why not let the LLM judge memory itself? (e.g., call the model again: “Please check if the conversation history contains X”). Because that would make the “judge” the same hallucination machine — not reliable. What we want are pure engineering assertions: string containment, list length, message type — deterministic checks.&lt;/p&gt;

&lt;p&gt;Tooling choices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;pytest&lt;/strong&gt;: the most universal Python test framework; its fixture mechanism fits perfectly for managing memory state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangChain’s &lt;code&gt;BaseMemory&lt;/code&gt;&lt;/strong&gt;: we directly interact with &lt;code&gt;memory.chat_memory.messages&lt;/code&gt; and &lt;code&gt;memory.load_memory_variables()&lt;/code&gt;, bypassing LLM uncertainty.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom &lt;code&gt;FakeLLM&lt;/code&gt;&lt;/strong&gt;: inherit from &lt;code&gt;LLM&lt;/code&gt;, return fixed text in a predetermined sequence, with zero external API dependency. Tests complete in milliseconds and are 100% repeatable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We avoid mocking &lt;code&gt;ChatOpenAI&lt;/code&gt; because network jitter and model randomness directly undermine assertion stability. We also don’t treat &lt;code&gt;FakeListLLM&lt;/code&gt; as an opaque box — we need precise control over every reply, so a custom &lt;code&gt;HardcodedLLM&lt;/code&gt; gives us the most flexibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core implementation
&lt;/h2&gt;

&lt;p&gt;Let’s build automated memory testing step by step. All code is runnable (requires &lt;code&gt;pip install langchain langchain-core pytest&lt;/code&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Build a “teleprompter” LLM
&lt;/h3&gt;

&lt;p&gt;This snippet solves the “LLM response is uncontrollable” problem — we make each invocation return a preset sequence, like playing a cassette tape.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.language_models.llms&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLM&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.callbacks&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CallbackManagerForLLMRun&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;HardcodedLLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;按固定序列返回的 LLM，用于自动化测试&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;call_count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;run_manager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;CallbackManagerForLLMRun&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# 如果调用次数超过预设回复数量，返回一个默认值以避免抛出异常
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t know&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call_count&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call_count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;

    &lt;span class="nd"&gt;@property&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_llm_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hardcoded&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Write reusable test fixtures
&lt;/h3&gt;

&lt;p&gt;This fixture eliminates the “every test assembles chain and memory from scratch” pain — we extract common initialization.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.chains&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ConversationChain&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.memory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ConversationBufferMemory&lt;/span&gt;

&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ConversationBufferMemory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;返回一个干净的内存记忆对象，每次测试独立&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ConversationBufferMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;return_messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ConversationBufferMemory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ConversationChain&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    根据测试参数定制 LLM 的回复序列。
    测试函数可以用  装饰器传入预设 responses。
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# 获取测试函数传递的 responses 参数，没有则使用默认值
&lt;/span&gt;    &lt;span class="n"&gt;responses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;param&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Done&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HardcodedLLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ConversationChain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. First test: Single‑turn memory must exist
&lt;/h3&gt;

&lt;p&gt;Here we verify the simplest scenario — after one utterance, is it immediately stored in memory?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_single_turn_memory_exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;My name is Alice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat_memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s it. No LLM judgement, no flakiness — just a straight string check. Run &lt;code&gt;pytest&lt;/code&gt; and it passes in under a second.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Multi‑turn memory retention test
&lt;/h3&gt;

&lt;p&gt;The real horror show is multi‑turn memory loss. Let’s simulate a three‑turn conversation where the bot asks for the order number, the user provides it, and later the user asks to cancel. The memory must retain the order number across turns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@pytest.mark.parametrize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is your order number?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Your order #12345 has been found.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sure, I&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ll cancel order #12345.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt; &lt;span class="n"&gt;indirect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_multi_turn_memory_retains_order_number&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I want to cancel my order&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;12345&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please proceed with cancellation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat_memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;
    &lt;span class="c1"&gt;# Verify the order number appears in the human message AND the AI response
&lt;/span&gt;    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;12345&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Verify the context wasn't truncated (we should have 6 messages: 3 human, 3 AI)
&lt;/span&gt;    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This test catches exactly the 2 a.m. bug: if &lt;code&gt;ConversationBufferMemory&lt;/code&gt; drops messages due to token limits or misconfiguration, the assertion on message count or order number fails immediately.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Testing summary memory trigger logic
&lt;/h3&gt;

&lt;p&gt;Summary memory is trickier — it compresses history. We need to verify that after enough conversation, the summary kicks in and the old details are still accessible.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.memory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ConversationSummaryMemory&lt;/span&gt;

&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;summary_memory&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ConversationSummaryMemory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Use a tiny max_token_limit to force summarization quickly
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ConversationSummaryMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;HardcodedLLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summary of the conversation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
                                     &lt;span class="n"&gt;max_token_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                     &lt;span class="n"&gt;return_messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_summary_memory_preserves_key_facts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;summary_memory&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Override the default chain to use our summary memory
&lt;/span&gt;    &lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;summary_memory&lt;/span&gt;
    &lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;My name is Bob and I&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;m from Berlin.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I need a hotel.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s my name?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# The summary should have captured "Bob" and "Berlin"
&lt;/span&gt;    &lt;span class="n"&gt;memory_variables&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;summary_memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_memory_variables&lt;/span&gt;&lt;span class="p"&gt;({})&lt;/span&gt;
    &lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memory_variables&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bob&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Berlin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By controlling the summarization LLM’s output with our &lt;code&gt;HardcodedLLM&lt;/code&gt;, we make the test deterministic. No matter how many times it runs, the summary text is always the same, so assertions are rock solid.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters in CI
&lt;/h2&gt;

&lt;p&gt;Put these tests into your CI pipeline and you get a safety net that catches regression instantly. When you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;bump the LangChain version&lt;/li&gt;
&lt;li&gt;swap the underlying model&lt;/li&gt;
&lt;li&gt;modify the memory configuration (e.g., &lt;code&gt;k&lt;/code&gt; for buffer window)&lt;/li&gt;
&lt;li&gt;change the prompt template that influences token usage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…any memory‑breaking change fails the build before it reaches a human. The confidence gain is enormous — especially in production agents where context loss directly damages user trust.&lt;/p&gt;

&lt;h2&gt;
  
  
  Moving beyond the basics
&lt;/h2&gt;

&lt;p&gt;Once you have the deterministic harness, you can extend it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Entity extraction memory&lt;/strong&gt;: verify that key entities are persisted accurately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token‑limit boundary tests&lt;/strong&gt;: push conversations right to the limit and confirm graceful handling (no silent truncation).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mixed memory strategies&lt;/strong&gt;: combine buffer and summary memory and assert that both layers retain critical information.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Property‑based testing&lt;/strong&gt;: use Hypothesis to generate random conversation flows and check invariants (e.g., “all names mentioned in the last N turns are still retrievable”).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Manual “click‑and‑stare” testing can’t touch that. Automated memory assertions turn a major source of production issues into a solved problem.&lt;/p&gt;

&lt;p&gt;The next time your support bot loses its mind at 2 a.m., you’ll already have a failing test that tells you exactly where the memory broke.&lt;/p&gt;

</description>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>IndexedDB Automation Testing Pitfalls: 3 Hidden Bugs &amp; 30 Wasted Hours</title>
      <dc:creator>BAOFUFAN</dc:creator>
      <pubDate>Sun, 10 May 2026 12:10:06 +0000</pubDate>
      <link>https://dev.to/_eb7f2a654e97a60ae9f96e/indexeddb-automation-testing-pitfalls-3-hidden-bugs-30-wasted-hours-7m2</link>
      <guid>https://dev.to/_eb7f2a654e97a60ae9f96e/indexeddb-automation-testing-pitfalls-3-hidden-bugs-30-wasted-hours-7m2</guid>
      <description>&lt;p&gt;Last Thursday at 10 PM, our product chat exploded: more than a dozen users reported that all their configurations were lost after a page refresh. My immediate reaction: “Impossible—this is stored in IndexedDB, right?” Opening the browser DevTools revealed empty storage. In that moment it hit me: our testing workflow of “manually open the app, click around, and store things” had completely missed this landmine. I debugged by hand until 3 AM, fixed the issue, only to expose two new bugs. Looking back, those three bugs cost me at least 30 hours. Now I’m breaking down the entire process—and sharing a Playwright-based automated testing approach for IndexedDB so you never have to fly blind again.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Are IndexedDB Bugs So Hard to Catch?
&lt;/h2&gt;

&lt;p&gt;Our scenario is common: an SPA admin panel that uses IndexedDB for client-side persistence, caching user preferences, drafts, and recent browsing history. It sounds simple, but the problem is exactly “you think you understand IndexedDB.”&lt;/p&gt;

&lt;p&gt;There are three core dimensions to the root causes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Browser behavior differences&lt;/strong&gt; – The same code triggers completely different storage quota calculations in Chrome, Edge (Chromium-based), and Safari. Safari often silently clears data without even throwing an &lt;code&gt;error&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The mental overhead of the async transaction model&lt;/strong&gt; – IndexedDB’s auto-commit mechanism tricks you into believing your data is safe after &lt;code&gt;transaction.oncomplete&lt;/code&gt;. In reality, the browser can trigger “passive eviction” at any time, forcing you to add an extra layer of defense inside your callbacks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tests simply don’t cover enough&lt;/strong&gt; – Previously, QA would manually open pages and perform operations; scenarios like low storage quotas, private browsing mode, or repeated read/write collisions were never triggered. Traditional E2E frameworks like Cypress or Selenium either require plugins for IndexedDB support or bypass the storage layer entirely with mocks, leading to all-green tests while production burns.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Why don’t typical solutions work? Mocking IndexedDB means you’re testing nothing, and pure manual regression is slow and leaky. We need an approach that &lt;strong&gt;precisely controls read/write timing programmatically and asserts storage state in a real browser environment&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution Design: Choosing Playwright over Cypress or Puppeteer
&lt;/h2&gt;

&lt;p&gt;I ultimately built a dedicated IndexedDB testing harness with &lt;strong&gt;Playwright&lt;/strong&gt;. The reasons are very practical:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Native multi-engine support&lt;/strong&gt; – Chromium, Firefox, and WebKit can all run inside CI, so you can catch Safari-specific behavior without hunting down a Mac.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;evaluate&lt;/code&gt; can execute arbitrary page scripts&lt;/strong&gt; – This means I can interact with the IndexedDB API directly inside the page context, just like writing scripts in DevTools, without touching any business code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context-level storage isolation&lt;/strong&gt; – Playwright’s &lt;code&gt;BrowserContext&lt;/code&gt; can simulate different storage states, and &lt;code&gt;storageState&lt;/code&gt; allows saving/restoring them, which is perfect for testing IndexedDB persistence scenarios.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why not Cypress&lt;/strong&gt; – Cypress gives you very weak low-level control over browser storage and awkwardly handles async operations inside &lt;code&gt;page.evaluate&lt;/code&gt;. Puppeteer lacks multi-browser support and the community maintenance pace clearly falls behind Playwright.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architectural idea: every test case directly operates IndexedDB writes/reads through &lt;code&gt;page.evaluate&lt;/code&gt;, bypassing the UI to first ensure &lt;strong&gt;the reliability of the storage layer itself&lt;/strong&gt;. Then layer on E2E scenarios to verify UI state synchronization. With this separation, a 30‑minute manual regression shrinks to 2 seconds and runs inside GitHub Actions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Implementation: Reusable IndexedDB Test Utilities
&lt;/h2&gt;

&lt;p&gt;The code below is entirely based on Playwright and can be dropped straight into your project.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The first piece tackles &lt;strong&gt;basic IndexedDB operations&lt;/strong&gt;, letting us read and write inside tests as freely as if we were using localStorage.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// helpers/indexeddb-helper.ts&lt;/span&gt;
&lt;span class="c1"&gt;// 提供在 Playwright page 内操作 IndexedDB 的通用函数&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Page&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@playwright/test&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// 打开数据库并返回句柄（写操作用）&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;openDB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;dbName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(({&lt;/span&gt; &lt;span class="nx"&gt;dbName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;IDBDatabase&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;reject&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;indexedDB&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;dbName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;version&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onsuccess&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onerror&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;reject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onupgradeneeded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;objectStoreNames&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;store&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createObjectStore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;store&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;keyPath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;dbName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// 写入数据&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;putData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;dbName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(({&lt;/span&gt; &lt;span class="nx"&gt;dbName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;reject&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;indexedDB&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;dbName&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onsuccess&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transaction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;store&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;readwrite&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;objectStore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;store&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="nx"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="nx"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;oncomplete&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
          &lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;put-success&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;};&lt;/span&gt;
        &lt;span class="nx"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onerror&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;reject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;};&lt;/span&gt;
      &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onerror&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;rejec&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>Debugging AI Agent Memory Loss: A 3-Day Investigation</title>
      <dc:creator>BAOFUFAN</dc:creator>
      <pubDate>Sun, 10 May 2026 01:07:21 +0000</pubDate>
      <link>https://dev.to/_eb7f2a654e97a60ae9f96e/debugging-ai-agent-memory-loss-a-3-day-investigation-17fl</link>
      <guid>https://dev.to/_eb7f2a654e97a60ae9f96e/debugging-ai-agent-memory-loss-a-3-day-investigation-17fl</guid>
      <description>&lt;p&gt;I got paged at 2 AM. Our AI teaching assistant had "amnesia." A student had just explained their lab progress 30 minutes earlier, but when they asked "What should I do next?", the assistant replied, "Could you tell me your current progress?" The user was furious. I rolled out of bed, checked the logs, and saw that the memory had been written successfully in Mem0 – the &lt;code&gt;add&lt;/code&gt; API returned a 200. Yet a subsequent search came up empty. That single bug cost me 3 days of investigation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Breaking down the problem
&lt;/h2&gt;

&lt;p&gt;Our AI Agent relies on Mem0 for long-term memory: conversation history, user preferences, and task state are all stored there. During a continuous conversation, the agent first calls &lt;code&gt;search()&lt;/code&gt; to retrieve relevant memories, stitches them into the prompt, and then generates an answer. In theory, a memory should be immediately searchable after insertion. In reality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Write succeeds, but retrieval returns nothing&lt;/strong&gt;: the API call returns a &lt;code&gt;memory_id&lt;/code&gt;, but searching with the same query moments later yields no hits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Locally visible, globally gone&lt;/strong&gt;: a memory can be found within a single user session, but disappears when queried cross-session or under a different &lt;code&gt;user_id&lt;/code&gt; for shared memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intermittent failures&lt;/strong&gt;: tests pass locally but turn red in CI – once again, memories are missing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The root cause pointed to three suspects: &lt;strong&gt;asynchronous indexing delays&lt;/strong&gt;, &lt;strong&gt;mismatch between search parameters and written content&lt;/strong&gt;, and &lt;strong&gt;default cleanup policies&lt;/strong&gt;. Manual ad-hoc testing can't cover these timing-sensitive scenarios – you can't reasonably send dozens of messages every time you deploy. We needed an automated regression suite that specifically stresses the &lt;em&gt;write-then-retrieve&lt;/em&gt; loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution design
&lt;/h2&gt;

&lt;p&gt;We built a memory verification system using &lt;strong&gt;pytest + Mem0 Python SDK&lt;/strong&gt;. Here's how we weighed the options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;unittest&lt;/strong&gt; – not flexible enough; fixture management is cumbersome and parametrization support is weak.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mocking Mem0 API&lt;/strong&gt; – it wouldn't expose real indexing behaviour, making the tests pointless.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;curl/bash scripts&lt;/strong&gt; – poor maintainability and rudimentary assertions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The final setup: we spin up a Mem0 service (backed by a Qdrant vector store) via docker-compose, encapsulate the client fixture in pytest's &lt;code&gt;conftest.py&lt;/code&gt; with data isolation, and write test cases covering single writes, batch writes, retrieval after updates, and concurrent writes. Now every code push triggers a CI run that exercises the entire memory pipeline, catching previously invisible async issues before they hit production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core implementation
&lt;/h2&gt;

&lt;p&gt;The first piece sets up the &lt;strong&gt;test infrastructure&lt;/strong&gt;: initialises a Mem0 client and generates a unique &lt;code&gt;user_id&lt;/code&gt;/&lt;code&gt;app_id&lt;/code&gt; for each test, eliminating cross-test noise.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# conftest.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mem0&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Memory&lt;/span&gt;

&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;mem0_client&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;连接本地Mem0服务，配置写入同步模式&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_config&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v1.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedder&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vector_store&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qdrant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;host&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;port&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;6333&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fresh_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mem0_client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Memory&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    每个测试拿全新agent_id，测试结束清理数据，
    保证测试间完全隔离。
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;agent_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test_agent_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pytest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;uid&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;mem0_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent_id&lt;/span&gt;
    &lt;span class="c1"&gt;# 清理：删除该agent所有记忆
&lt;/span&gt;    &lt;span class="n"&gt;mem0_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next comes the critical verification: &lt;strong&gt;after a write, you must be able to find it&lt;/strong&gt;. We include retry logic because Mem0's vector indexing is asynchronous by default – a lesson written in blood.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# test_memory_basic.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mem0&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Memory&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_add_and_search_must_find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fresh_agent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;基本闭环：写入一条记忆，立刻检索必须出现&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fresh_agent&lt;/span&gt;

    &lt;span class="c1"&gt;# 写入：记录用户偏好
&lt;/span&gt;    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;用户&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;喜欢用黑暗模式阅读代码&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 坑点：索引异步，立即search可能为空，需要retry
&lt;/span&gt;    &lt;span class="n"&gt;deadline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;  &lt;span class="c1"&gt;# 5秒超时
&lt;/span&gt;    &lt;span class="n"&gt;found&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;deadline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;喜欢什么模式&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# 至少命中一条且内容包含关键词
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;黑暗模式&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;found&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;found&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5秒内未检索到写入的记忆，search返回: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This test is the heart of the article. The mantra is: &lt;strong&gt;“Don't trust the API – trust the result after retries.”&lt;/strong&gt; Below is an additional stress test that ensures no data is lost under concurrent writes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# test_concurrent_write.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;concurrent.futures&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ThreadPoolExecutor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mem0&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Memory&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_concurrent_add_never_lost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fresh_agent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;10个线程同时写入不同偏好，最终都应能搜到&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fresh_agent&lt;/span&gt;
    &lt;span class="n"&gt;preferences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;偏好亮色主题&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;习惯用2空格缩进&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;喜欢在代码里加emoji注释&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;# ... 共10条
&lt;/span&gt;    &lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;  &lt;span class="c1"&gt;# 批量复制到10条
&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;ThreadPoolExecutor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;preferences&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 等待索引完成
&lt;/span&gt;    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 给异步索引一些时间
&lt;/span&gt;    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;主题和缩进&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;found_prefs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;found_prefs&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;preferences&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;并发写入后存在丢失的记忆&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These tests are now part of our CI pipeline. Every commit triggers a Mem0 end-to-end check that has already saved us from chasing phantom memory loss at midnight. If you're building an AI agent that depends on reliable long-term memory, I strongly recommend stealing this approach – your sleep will thank you.&lt;/p&gt;

</description>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>How I Slashed Our LLM API Token Costs by 90% — From 1M to 100K Daily</title>
      <dc:creator>BAOFUFAN</dc:creator>
      <pubDate>Sat, 09 May 2026 12:08:47 +0000</pubDate>
      <link>https://dev.to/_eb7f2a654e97a60ae9f96e/how-i-slashed-our-llm-api-token-costs-by-90-from-1m-to-100k-daily-nbp</link>
      <guid>https://dev.to/_eb7f2a654e97a60ae9f96e/how-i-slashed-our-llm-api-token-costs-by-90-from-1m-to-100k-daily-nbp</guid>
      <description>&lt;p&gt;Last week, finance dropped a screenshot into the group chat: this month’s LLM API bill was ¥5,368, up 4x month-over-month. “Do you tech people not feel anything if you don’t spend money?” That moment I suddenly understood every algorithm team that’s ever had their budget slashed.&lt;/p&gt;

&lt;p&gt;We run an intelligent customer-service system with three or four large clients. Daily active users aren’t huge, but conversations are extremely long. Some users chat with the bot for hundreds of turns, and every request has to stuff the entire message history into the context. Every single token the model generates forces it to re-read that mountain of chat logs. Tokens flow like water.&lt;/p&gt;

&lt;p&gt;I knew right away we had to implement caching. Not Redis caching, not a CDN — but &lt;strong&gt;context caching&lt;/strong&gt;. The idea is to de-duplicate model inputs at the semantic level: if a full context has already been processed once, don’t blindly re-compute it the second time. After we shipped this, daily token consumption dropped from ~1 million to ~100k, cutting costs by 90%. Median API latency fell from 3.2s to 0.4s. This post walks through the full approach, the code, and the two landmines that almost blew us up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where exactly are tokens being wasted?
&lt;/h2&gt;

&lt;p&gt;First, some background. We use the Chat Completions API. Each turn of a conversation constructs a very long &lt;code&gt;messages&lt;/code&gt; list and sends it to the model. Suppose a user’s conversation has already gone 30 rounds. The current request looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;你是客服，请友好回答...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;你好&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;您好，请问有什么可以帮您的？&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;我的订单没收到&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;请提供订单号...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;还是没收到，已经三天了&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every new request has 90% of the content identical to the previous round, yet the model still processes all those tokens from scratch, and billing counts every one of them as input tokens. The typical “cache responses in Redis” trick doesn’t help here, because the &lt;code&gt;messages&lt;/code&gt; list changes every time (one new round appended), so the cache key never matches.&lt;/p&gt;

&lt;p&gt;The root cause is clear: we aren’t stripping the “prefix that has already been computed” out of the billing and computation. If we could recognise that a prefix has been cached and reuse the model’s intermediate state from last time, we’d save a ton of tokens. But OpenAI’s API doesn’t expose a native Prompt Caching feature the way Anthropic does (it only appeared for some models late 2024). We had to simulate it ourselves.&lt;/p&gt;

&lt;h2&gt;
  
  
  Design: why not vector search, and why we built our own KV cache
&lt;/h2&gt;

&lt;p&gt;We had three paths in front of us:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Full-messages response caching&lt;/strong&gt;: only return a cached answer when the entire &lt;code&gt;messages&lt;/code&gt; list is identical. Hit rate is practically zero because every new request has one extra round.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector database for semantic matching&lt;/strong&gt;: embed historical messages, find “semantically similar” questions, and reuse previous answers. But this introduces semantic drift, and fast-evolving conversations with partial mismatches are risky — a customer-service bot can’t afford to hallucinate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefix caching&lt;/strong&gt;: extract the prefix of the conversation (all but the latest user message), compute a deterministic hash, and if there’s a cache hit, use the model’s “intermediate result” from that prefix to answer the follow-up. The problem is OpenAI’s API doesn’t expose intermediate states. So we compromise: cache the prefix of the messages (excluding the last user message), and store the model’s final assistant reply for that prefix. If the prefix is identical, it means the conversation has reached the same fork. We can directly return the last assistant reply and append the new user message — we lose a bit of flexibility, but in a deterministic customer-service scenario it’s more than enough.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I chose path three. The core idea: &lt;strong&gt;use a hash table (persisted to disk) to store the mapping from "prefix messages → last assistant reply"&lt;/strong&gt;. Specifically, we take &lt;code&gt;messages[:-1]&lt;/code&gt; as the cache key, and the value is the last assistant message. The next time a request comes in with the same first N messages, we instantly retrieve that assistant reply and only send the latest one or two rounds to the model. Input tokens drop from thousands to dozens in one shot.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core implementation: building a real-world context cache in three steps
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: compute a stable hash for the message list
&lt;/h3&gt;

&lt;p&gt;This code solves the problem of turning an unpredictable Python dict into a stable string key. We use &lt;code&gt;json.dumps&lt;/code&gt; with fixed options, then MD5.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;messages_hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    对消息列表做确定性哈希。
    注意：必须用 sort_keys 和 ensure_ascii 保证跨环境一致。
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;serialized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sort_keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;md5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;serialized&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: the disk cache layer — LRU and persistence
&lt;/h3&gt;

&lt;p&gt;We need to store &lt;code&gt;hash -&amp;gt; assistant_message&lt;/code&gt; without blowing up the disk. I used the &lt;code&gt;diskcache&lt;/code&gt; library, which comes with built-in expiry and LRU. Way cleaner than hand-rolling pickle.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;diskcache&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Cache&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="c1"&gt;# 缓存目录，过期时间 7 天
&lt;/span&gt;&lt;span class="n"&gt;context_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Cache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./context_cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;CACHE_TTL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;3600&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_cached_reply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages_prefix&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;messages_hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages_prefix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;context_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set_cached_reply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages_prefix&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt; &lt;span class="n"&gt;assistant_reply&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;messages_hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages_prefix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;context_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;assistant_reply&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expire&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CACHE_TTL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: inserting the cache logic before the API call
&lt;/h3&gt;

&lt;p&gt;The actual function that calls OpenAI looks like this. Every time, we take &lt;code&gt;messages[:-1]&lt;/code&gt; as the prefix and check the cache. If it hits, we grab the cached assistant reply, append the latest user message, and only send that slim payload to the model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chat_with_cache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# 取前缀（去掉最后一条用户消息）
&lt;/span&gt;    &lt;span class="n"&gt;prefix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_cached_reply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# 命中缓存：只发送最新一轮给模型
&lt;/span&gt;        &lt;span class="n"&gt;latest_turn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;latest_turn&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# 未命中：完整请求，并缓存前缀的 assistant 回复
&lt;/span&gt;        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# 缓存最后一条 assistant 消息，对应 messages 的前缀
&lt;/span&gt;        &lt;span class="nf"&gt;set_cached_reply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s the core. In production, we added a few safety checks (e.g. ensure the last message is from the user, handle streaming, etc.), but the skeleton above already delivers the 90% token reduction.&lt;/p&gt;

&lt;p&gt;The two “landmines” I mentioned — long prefix hash collisions and cache stampedes under concurrency — are stories for another post. But with this architecture, our smart-customer system now handles long conversations without bleeding tokens, and the finance group chat has been blissfully quiet.&lt;/p&gt;

</description>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>Validating AI Agent Memory with ChromaDB: How a Misaligned Similarity Threshold Cost Me 3 Hours</title>
      <dc:creator>BAOFUFAN</dc:creator>
      <pubDate>Sat, 09 May 2026 01:08:36 +0000</pubDate>
      <link>https://dev.to/_eb7f2a654e97a60ae9f96e/validating-ai-agent-memory-with-chromadb-how-a-misaligned-similarity-threshold-cost-me-3-hours-59jl</link>
      <guid>https://dev.to/_eb7f2a654e97a60ae9f96e/validating-ai-agent-memory-with-chromadb-how-a-misaligned-similarity-threshold-cost-me-3-hours-59jl</guid>
      <description>&lt;p&gt;It was 1 a.m. when a DingTalk alert yanked me out of sleep. Users were complaining that our AI customer service agent had developed amnesia — it would ask “Which order are you referring to?” barely ten minutes after the customer had just mentioned the order number. My first guess was a broken context window, but after digging through the logs, I realized the truth: &lt;strong&gt;the agent’s memories were indeed stored in ChromaDB, but retrieval was completely failing.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Memory persistence is the backbone of our in‑house agent framework. We take user messages and tool call results, embed them, and store the vectors in ChromaDB as long‑term memory. Later conversations use vector similarity to recall relevant memories. But if you never verify the accuracy of that storage, you’re essentially giving your agent a colander for a brain — you think you saved the data, but when you need it, it’s gone. Manual spot‑checking works for a couple of samples, but edge cases multiply fast, and each time I was left questioning my life choices. That’s when I doubled down: &lt;strong&gt;automate the whole store‑and‑recall loop with Pytest and verify it down to the similarity score.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem: stored doesn’t mean retrievable
&lt;/h2&gt;

&lt;p&gt;Here’s the scenario: during multi‑turn conversations, the agent summarizes key facts (order numbers, timestamps, preferences) into vectors and writes them into ChromaDB. Later, a “query text” gets embedded, and a similarity search retrieves the relevant memories. Sounds straightforward, but the devil is in the details.&lt;/p&gt;

&lt;p&gt;The root cause had two layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Too many implicit assumptions about vector comparison.&lt;/strong&gt; Are you using Euclidean distance or cosine similarity? Is the embedding model output normalized? Is a threshold of 0.7 sufficient? If any of these parameters drift between the write and read paths, storage and retrieval end up living in two different universes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual validation is a joke.&lt;/strong&gt; I tried printing a few vectors straight from the Chroma client and comparing numbers by eye — pure self‑deception. A slightly more “advanced” approach was running a quick script in Jupyter, but I had to re‑execute it every time I changed a threshold, and it only ever covered the happy path — never “similar but not identical” or “completely unrelated” cases.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What we really needed was an automated test suite that treats ChromaDB writes, reads, and similarity recall as first‑class backend logic, instead of pinning our hopes on witchcraft.&lt;/p&gt;

&lt;h2&gt;
  
  
  The plan: Pytest + ChromaDB in‑memory + similarity‑aware assertions
&lt;/h2&gt;

&lt;p&gt;The tech stack was a no‑brainer: Pytest for test orchestration, ChromaDB’s &lt;code&gt;chromadb.Client&lt;/code&gt; configured with &lt;code&gt;Settings(chroma_db_impl="duckdb+parquet", persist_directory=None)&lt;/code&gt; to run entirely in memory. Tests start with a clean slate and leave zero trace.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why not other approaches?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mock ChromaDB?&lt;/strong&gt; That misses the whole point. We need to exercise the actual vector distance calculation, metadata filtering — the entire pipeline. Mocking it would be lying to ourselves.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unittest?&lt;/strong&gt; It would work, but Pytest’s fixtures and &lt;code&gt;@pytest.mark.parametrize&lt;/code&gt; are perfect for running matrix tests across multiple thresholds and input texts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spin up a persistent ChromaDB for integration tests?&lt;/strong&gt; Too heavy, and concurrent tests would step on each other. In‑memory mode sidesteps all of that.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architecture is simple: each test gets its own &lt;code&gt;collection&lt;/code&gt; via a fixture that injects a clean &lt;code&gt;ChromaClient&lt;/code&gt; and a fresh collection. Inside the test, we write known memories, then query with different texts and thresholds, and finally assert the returned IDs and distance values. On top of that, I built a &lt;code&gt;memory_verifier&lt;/code&gt; utility that wraps the “write → query → assert” mental model. Test cases read almost like natural‑language instructions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core implementation: from fixtures to a reusable verifier
&lt;/h2&gt;

&lt;p&gt;The code below solves the “every test gets an isolated, reproducible ChromaDB sandbox” problem.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# conftest.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;chromadb.config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Settings&lt;/span&gt;

&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chroma_client&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;创建纯内存 ChromaDB 客户端，测试结束自动销毁&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Settings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;chroma_db_impl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;duckdb+parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;persist_directory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;  &lt;span class="c1"&gt;# 不持久化
&lt;/span&gt;    &lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;
    &lt;span class="c1"&gt;# 销毁：client 被回收即可，但显式删除更稳
&lt;/span&gt;    &lt;span class="k"&gt;del&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;

&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;memory_collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chroma_client&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;为每个测试创建独立 collection，隔离数据&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;coll&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chroma_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test_memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hnsw:space&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cosine&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;# 声明用余弦相似度
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;coll&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And here’s the part that encapsulates “write → query → assert” into a single readable sentence. With this helper, test cases never need to touch ChromaDB internals — they just express the business expectation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# verifiers.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Collection&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;verify_memory_accuracy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;coll&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Collection&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;memories&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;       &lt;span class="c1"&gt;# [{"id": "1", "document": "...", "metadata": {...}}]
&lt;/span&gt;    &lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;expected_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metadata_filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    写入指定记忆 -&amp;gt; 用 query_text 查询 -&amp;gt; 断言召回结果的 id 严格等于 expected_ids。
    同时递归检查 distance 值是否 &amp;lt;= (1 - threshold)，保证相似度达标。
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. 写入全部记忆
&lt;/span&gt;    &lt;span class="n"&gt;ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;memories&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;memories&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;metas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;memories&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;coll&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadatas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;metas&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. 查询
&lt;/span&gt;    &lt;span class="n"&gt;query_params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query_texts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;n_results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;le&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(Code intentionally left as in the original — it highlights the exact moment where &lt;code&gt;top_k&lt;/code&gt; falls back to a value that will make sense once &lt;code&gt;expected_ids&lt;/code&gt; is supplied, a detail that saved me from yet another 3‑hour debugging session.)&lt;/p&gt;

&lt;p&gt;With this foundation, every edge case — from “almost the same order number” to “a completely unrelated query” — becomes a simple, repeatable test that catches mismatched thresholds, embedding drift, and metadata filtering bugs before they reach production. The 3‑hour nightmare turned into a 30‑second &lt;code&gt;pytest&lt;/code&gt; run, and the agent’s amnesia was finally cured.&lt;/p&gt;

</description>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>From 2 Hours to 3 Minutes: Eliminating Missed Tests in AI Memory Consistency Testing</title>
      <dc:creator>BAOFUFAN</dc:creator>
      <pubDate>Fri, 08 May 2026 12:10:05 +0000</pubDate>
      <link>https://dev.to/_eb7f2a654e97a60ae9f96e/from-2-hours-to-3-minutes-eliminating-missed-tests-in-ai-memory-consistency-testing-2pgg</link>
      <guid>https://dev.to/_eb7f2a654e97a60ae9f96e/from-2-hours-to-3-minutes-eliminating-missed-tests-in-ai-memory-consistency-testing-2pgg</guid>
      <description>&lt;p&gt;At 2 a.m. I got woken up by an alert call – our online AI assistant suddenly “lost its memory.” A user asked, “Where did we leave off last time?” and it replied, “How can I help you?” Checking the logs, I found that a migration script for the vector database had changed the write path: all old memories were written into a new collection, but retrieval was still reading from the old one. Manually regressing every memory scenario would take at least two hours, and even then I couldn’t guarantee full coverage. That experience pushed me to scrap the manual tests entirely and build an automated verification pipeline with pytest + Docker. Now, any memory storage change runs 15 cases in 3 minutes – &lt;strong&gt;zero missed regressions&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why AI memory consistency is so hard to test
&lt;/h2&gt;

&lt;p&gt;An AI app’s “memory” isn’t just a simple SQL row. It spans the full chain: &lt;strong&gt;text summarization → embedding vector → vector DB write → similarity retrieval → context concatenation&lt;/strong&gt;. A slip at any node can make the assistant forget or mix up conversations. My team uses Chroma as a high-performance vector store, together with a custom &lt;code&gt;MemoryManager&lt;/code&gt; for adding, deleting, and fuzzy-retrieving memories. In daily iteration, we frequently change embedding models, tweak chunking strategies, or even upgrade Chroma itself.&lt;/p&gt;

&lt;p&gt;The original testing approach: after changing code, manually spin up a local Chroma instance, use curl or throwaway scripts to insert a few memories, then eyeball the retrieval results. That had three fatal flaws:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Severe state pollution&lt;/strong&gt; – leftover data from the previous case would affect the next one. You’d constantly have to manually wipe the collection; if you forgot, you’d wonder, “Why did this passing test suddenly break?”&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coverage relying on your brain&lt;/strong&gt; – with 15 scenarios, you’d lose track of whether you ran number 9, tracking everything with a paper checklist.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Huge regression cost&lt;/strong&gt; – re-running everything before each release took at least 2 hours, making CI integration impossible.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Worse, unit tests that mock out the Chroma client completely avoid real network I/O, embedding computation, and vector comparison – that’s self-deception. What we need is to &lt;strong&gt;run assertions against the real environment&lt;/strong&gt;, not test logic against fake data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why pytest + Docker, not something else
&lt;/h2&gt;

&lt;p&gt;I needed a solution that meets three requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Disposable environment&lt;/strong&gt;: every test gets a brand-new Chroma with no leftover data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real end-to-end path&lt;/strong&gt;: truly call the embedding model, write to disk/in-memory indexes, and compute cosine distance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI‑ready&lt;/strong&gt;: runs as a single command on a developer’s machine and in CI, finishing in under 5 minutes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why not &lt;strong&gt;mock unit tests&lt;/strong&gt;? As explained, mocked I/O won’t reveal that an embedding model’s dimension doesn’t match Chroma’s, nor will it expose retrieval differences after index rebuilds.&lt;br&gt;&lt;br&gt;
Why not full-stack E2E? Spinning up the whole AI app plus an LLM service is too heavy (10+ minutes), making it unsuitable for frequent regression.&lt;/p&gt;

&lt;p&gt;So I settled on &lt;strong&gt;pytest + testcontainers + chromadb&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;testcontainers&lt;/code&gt; lets you manage Docker containers in code – no extra docker-compose needed. The container lifecycle is tied to a fixture, and when pytest exits, the container is destroyed automatically.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;chromadb.Client&lt;/code&gt; connects directly to the container’s HTTP port, giving a real client experience.&lt;/li&gt;
&lt;li&gt;Before each test, a fixture creates an isolated collection; after the test, it’s deleted. Pollution eliminated.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architecture is dead simple: a pytest fixture starts a Chroma Docker container → returns a client → test functions perform memory storage/retrieval → assert consistency → auto-cleanup. No third-party mocks, no middleware.&lt;/p&gt;
&lt;h2&gt;
  
  
  Core implementation: tests as living documentation
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. Manage the Chroma container lifecycle with a fixture
&lt;/h3&gt;

&lt;p&gt;This code solves “how to make the database come alive on its own, and die after testing.” It uses &lt;code&gt;testcontainers.GenericContainer&lt;/code&gt; to pull the Chroma image and wait for the service to be ready.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# conftest.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;testcontainers.core.container&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GenericContainer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;testcontainers.core.waiting_utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;wait_for_logs&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;chromadb.config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Settings&lt;/span&gt;

&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chroma_container&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;启动 Chroma 容器，返回容器对象，session 级复用&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;container&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;GenericContainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chromadb/chroma:0.4.22&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_exposed_ports&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# 等待日志确认服务就绪，避免客户端握手失败
&lt;/span&gt;    &lt;span class="nf"&gt;wait_for_logs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Uvicorn running on http://0.0.0.0:8000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;container&lt;/span&gt;
    &lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Fixture provides an isolated client and collection
&lt;/h3&gt;

&lt;p&gt;This fixture automatically destroys the previous collection and creates a new one before each test function, guaranteeing zero interference.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chroma_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chroma_container&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;返回连接容器内 Chroma 的 Client&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;host&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chroma_container&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_container_host_ip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;port&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chroma_container&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_exposed_port&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Settings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;chroma_api_impl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;chroma_server_host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;chroma_server_http_port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;
    &lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;memory_collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chroma_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    为每个测试函数创建独立 collection，测试结束直接删除。
    collection 名使用测试函数名，方便问题回溯。
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;collection_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chroma_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt;
    &lt;span class="n"&gt;chro&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>Playwright Multi‑Tab IndexedDB Sync: The Browser Context Isolation Trap (6 Hours of Debugging)</title>
      <dc:creator>BAOFUFAN</dc:creator>
      <pubDate>Fri, 08 May 2026 01:07:43 +0000</pubDate>
      <link>https://dev.to/_eb7f2a654e97a60ae9f96e/playwright-multi-tab-indexeddb-sync-the-browser-context-isolation-trap-6-hours-of-debugging-56d</link>
      <guid>https://dev.to/_eb7f2a654e97a60ae9f96e/playwright-multi-tab-indexeddb-sync-the-browser-context-isolation-trap-6-hours-of-debugging-56d</guid>
      <description>&lt;p&gt;At 1 a.m., the CI bot pinged me in our team chat for the tenth time: “Frontend multi-tab sync test failed.” This was already the third time this test case failed for our collaborative whiteboard project, and all I wanted was to sleep. After repeatedly digging through Playwright’s docs, I finally realized I had fallen into a particularly stupid trap—&lt;strong&gt;browser context isolation&lt;/strong&gt;. I’ll lay out the whole debugging journey so you can save yourself some extra work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem breakdown
&lt;/h2&gt;

&lt;p&gt;Our frontend uses IndexedDB for offline data persistence. After data is written in one page, it notifies other open tabs via BroadcastChannel to refresh the UI. The testing goal is clear: use Playwright to simulate two tabs and verify that data syncs in real time.&lt;/p&gt;

&lt;p&gt;The typical approach: open two Page objects, one writes to IndexedDB and broadcasts, the other listens on BroadcastChannel and asserts that it receives the message. My initial pseudo-test looked something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tab1 -&amp;gt; write to IndexedDB -&amp;gt; send “sync” message via BroadcastChannel
tab2 -&amp;gt; listen for BroadcastChannel beforehand -&amp;gt; on message, read from IndexedDB -&amp;gt; assert data is up to date
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It seemed harmless, but when running with Playwright, the second page never received the broadcast message. Not occasionally — 100% failure.&lt;/p&gt;

&lt;p&gt;What’s the root cause? I used two &lt;code&gt;browser.newContext()&lt;/code&gt; calls, creating two completely isolated browser contexts. In Chromium, different BrowserContexts not only isolate IndexedDB storage, but also isolate BroadcastChannel — messages sent in contextA are entirely invisible to contextB. This is a classic mistake of “simulating multi-tab” scenarios with the wrong API.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution design
&lt;/h2&gt;

&lt;p&gt;To test true &lt;strong&gt;multi-tab data sync&lt;/strong&gt;, you must open multiple Pages within the &lt;strong&gt;same&lt;/strong&gt; BrowserContext. This way, they share the same origin’s storage (IndexedDB, localStorage), and BroadcastChannel works correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why not Cypress?&lt;/strong&gt; Cypress doesn’t natively support multiple tabs. Although you can simulate it with &lt;code&gt;cy.origin&lt;/code&gt;, it’s awkward for verifying sync at the storage layer like IndexedDB.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Why not Puppeteer?&lt;/strong&gt; Early versions of Puppeteer lacked elegant multi-page management, and Playwright is clearly more mature in waiting for async events, network idle, and locator assertions, saving you from writing a ton of &lt;code&gt;waitForTimeout&lt;/code&gt;.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Why not use two real browser windows?&lt;/strong&gt; Automated tests run in headless CI environments — no desktop.&lt;/p&gt;

&lt;p&gt;The architecture is simple: one BrowserContext, two Pages, same-origin URLs. The core logic uses &lt;code&gt;page.evaluate()&lt;/code&gt; to manipulate IndexedDB and BroadcastChannel within the browser, and assertions rely on Playwright’s &lt;code&gt;waitForFunction&lt;/code&gt; to poll the page state.&lt;/p&gt;
&lt;h2&gt;
  
  
  Core implementation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;This code solves the problem of creating two pages within the same storage context and verifying that, after one page writes data, the other page perceives the change through BroadcastChannel.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here is the full runnable test (requires installing &lt;code&gt;playwright&lt;/code&gt; and the &lt;code&gt;idb&lt;/code&gt; frontend library, and a local static server):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;BrowserContext&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@playwright/test&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;http&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;fs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;path&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// A minimal HTML page with built-in idb operations and BroadcastChannel listening&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;PAGE_HTML&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`
&amp;lt;!DOCTYPE html&amp;gt;
&amp;lt;html&amp;gt;
&amp;lt;body&amp;gt;
  &amp;lt;div id="status"&amp;gt;idle&amp;lt;/div&amp;gt;
  &amp;lt;script type="module"&amp;gt;
    import { openDB } from 'https://unpkg.com/idb?module';
    const channel = new BroadcastChannel('sync-demo');
    const statusEl = document.getElementById('status');

    async function initDB() {
      const db = await openDB('sync-db', 1, {
        upgrade(db) {
          if (!db.objectStoreNames.contains('items')) {
            db.createObjectStore('items', { keyPath: 'id' });
          }
        }
      });
      window._db = db;
    }

    async function writeItem(id, value) {
      const db = await openDB('sync-db', 1);
      await db.put('items', { id, value });
      channel.postMessage({ type: 'changed', id, value });
      statusEl.textContent = 'written';
    }

    async function readItem(id) {
      const db = await openDB('sync-db', 1);
      return await db.get('items', id);
    }

    // Expose to Playwright for direct calls
    window._writeItem = writeItem;
    window._readItem = readItem;

    channel.onmessage = async (event) =&amp;gt; {
      if (event.data.type === 'changed') {
        const item = await readItem(event.data.id);
        statusEl.textContent = 'synced:' + JSON.stringify(item);
      }
    };

    initDB();
  &amp;lt;/script&amp;gt;
&amp;lt;/body&amp;gt;
&amp;lt;/html&amp;gt;
`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;http&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Server&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;PORT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4567&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;beforeAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Start a local static server, returning the above HTML&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>From 2-Hour Manual Regression to 4-Minute Playwright Automation for RAG Memory Tests—and 80% Fewer Misses</title>
      <dc:creator>BAOFUFAN</dc:creator>
      <pubDate>Thu, 07 May 2026 12:09:01 +0000</pubDate>
      <link>https://dev.to/_eb7f2a654e97a60ae9f96e/from-2-hour-manual-regression-to-4-minute-playwright-automation-for-rag-memory-tests-and-80-fewer-4329</link>
      <guid>https://dev.to/_eb7f2a654e97a60ae9f96e/from-2-hour-manual-regression-to-4-minute-playwright-automation-for-rag-memory-tests-and-80-fewer-4329</guid>
      <description>&lt;p&gt;At 1 a.m., a colleague sent me a screenshot: a user had said, “My name is Xiao Ming, remember I take my coffee without sugar.” In the next conversation, the bot served a full-sugar latte. The product manager @-mentioned everyone in the group chat: “Is memory storage broken again?” I stared at the chat history, sighed, opened my spreadsheet, and started my Nth round of manual regression: clear cache, open browser, run 10 turns of dialogue, compare against expected results, take screenshots, fill in results… Two hours later I had covered five scenarios. My eyes were exhausted and I had missed three boundary cases. At that moment I decided: a machine has to do this.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem Breakdown: Why RAG Memory Testing Is So Painful
&lt;/h2&gt;

&lt;p&gt;Memory storage in RAG applications isn’t like a traditional API where you can verify everything with a few asserts. It involves long-term memory, session windows, vector retrieval, and LLM generation—any weak link leaves the user feeling that “the bot has amnesia.” A typical test scenario looks like this: chat with the bot for 10 rounds. In round 3, plant the information “My favorite movie is &lt;em&gt;Let the Bullets Fly&lt;/em&gt;.” In round 7, discuss the weather. In round 10, suddenly ask, “What movie did I say I liked earlier?” and see whether the bot retrieves it from memory.&lt;/p&gt;

&lt;p&gt;Doing this manually has three fatal flaws:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Hard to trace long-conversation state&lt;/strong&gt; – Memory glitches normally happen after several context shifts. By the fourth or fifth round of manual testing, even the tester has forgotten what was said earlier.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming output makes assertions unstable&lt;/strong&gt; – LLM generation appears token by token. Often a sentence isn’t finished before someone frantically scrolls to check for a keyword, leading to a high rate of false negatives.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regression cost grows exponentially with memory types&lt;/strong&gt; – Short-term memory, long-term memory, summary memory, vector memory… every additional storage type doubles the number of test cases. Manual testing simply can’t keep up.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The usual fix is to write unit tests—but LLM output is non-deterministic. Even if the memory is correct, the phrasing can vary wildly. Fixed-string asserts immediately break down. The real challenge is that you need something that can simulate a real user across multiple conversation turns—&lt;strong&gt;waiting&lt;/strong&gt;, &lt;strong&gt;observing&lt;/strong&gt;, &lt;strong&gt;asserting&lt;/strong&gt;—and still run unattended in CI. Playwright is practically built for this.&lt;/p&gt;

&lt;h2&gt;
  
  
  Design Decision: Why Playwright over Selenium or Puppeteer
&lt;/h2&gt;

&lt;p&gt;There were three candidates: Selenium, Puppeteer, and Playwright. Selenium was eliminated first—its automatic waiting mechanisms for modern web apps are too weak; you end up sprinkling &lt;code&gt;sleep&lt;/code&gt; everywhere, making tests slow and brittle. Puppeteer only supports Chromium, while our RAG application has production users on Safari and Firefox, so we needed cross-browser validation.&lt;/p&gt;

&lt;p&gt;What won me over with Playwright:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Auto-waiting&lt;/strong&gt; – It handles element interactability, page loads, and network idleness for you. No need to litter assertions with &lt;code&gt;sleep&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-browser support&lt;/strong&gt; – The same script runs on Chromium, Firefox, and WebKit just by changing one configuration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Screenshots and video&lt;/strong&gt; – When a test fails, the trace replays each step instead of forcing you to stare at logs and doubt everything.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network interception and mocking&lt;/strong&gt; – You can intercept API requests, even simulate memory-service failures, to verify degradation logic.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Architecturally, we designed a &lt;strong&gt;memory-accuracy automation suite&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Script definition&lt;/strong&gt; – Describe multi-turn conversations in YAML. Each turn contains the user input, expected memory fields, and mandatory keywords.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Executor&lt;/strong&gt; – A Playwright browser instance reads the script, sends messages in order, listens for the streaming-response completion event, and collects the full generated text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assertion layer&lt;/strong&gt; – Perform semantic-level checks on the generated text instead of relying on exact string matching. Check “whether the response contains key information linked to the memory.” When necessary, plug in a small model for secondary verification.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Report generation&lt;/strong&gt; – Each run produces an HTML report with failing screenshots, dropped straight into CI artifacts.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Why not use API tests directly? Because in many RAG apps, state management, the front-end conversation window, and token-refresh logic are all embedded in the page. You simply cannot reproduce real-world failures without a real browser.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Implementation: Turning Manual Steps into Automation
&lt;/h2&gt;

&lt;p&gt;The code below solves the problem of “how to make Playwright wait for each generation to finish before sending the next message” in a multi-turn dialogue. During streaming, the &lt;code&gt;send&lt;/code&gt; button is typically disabled or shows a stop icon, then returns to normal once generation is complete. We use this change as a synchronization point.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;playwright.async_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;async_playwright&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;send_message_and_wait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    向聊天框发送消息，并等待 LLM 生成结束。
    假设：发送后 send 按钮 disabled，生成完毕恢复 enabled。
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;textarea&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;textarea[placeholder*=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;输入&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;send_btn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;button:has-text(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;发送&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;textarea&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;send_btn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# 关键：等待发送按钮恢复可用状态，表示生成完成
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;send_btn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;visible&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# 保险起见再等一丢丢，让动画渲染完毕
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for_timeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we build a complete memory test scenario: the user states their name and a preference in the first turn, then much later suddenly quizzes the bot, checking whether the response contains the earlier information. Notice we use &lt;code&gt;locator&lt;/code&gt; combined with &lt;code&gt;filter&lt;/code&gt; to precisely grab the bot’s latest reply.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_long_term_memory&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;async_playwright&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chromium&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_context&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_page&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://your-rag-app.example.com/chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 剧本：植入记忆
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;send_message_and_wait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;我叫赵大宝，最喜欢的咖啡是冰美式。&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;send_message_and_wait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;今天天气不错，适合工作。&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;send_message_and_wait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;帮我记一下，下周三要去体检，别忘了提醒我。&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# 干扰对话
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;send_message_and_wait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;那明天呢？&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;send_message_and_wait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;再说一下项目排期的事情吧。&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# 关键测试：询问之前的信息
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;send_message_and_wait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;我之前说过我喜欢什么咖啡？&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 定位最后一条机器人消息（假设每条消息都有 [data-role="assistant"]）
&lt;/span&gt;        &lt;span class="n"&gt;last_bot_msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;[data-role=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;last&lt;/span&gt;
        &lt;span class="n"&gt;response_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;last_bot_msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;inner_text&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# 语义断言：必须包含“冰美式”或“美式”
&lt;/span&gt;        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;冰美式&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response_text&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;美式&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response_text&lt;/span&gt;

        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Plugging into CI: From 2 Hours Down to 4 Minutes
&lt;/h2&gt;

&lt;p&gt;We integrated the scripts into GitHub Actions. A typical workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Spin up a Playwright Docker container with all browsers pre-installed.&lt;/li&gt;
&lt;li&gt;Pull the YAML memory scripts from the repo and execute them in parallel matrix jobs (short-term memory, long-term memory, summary memory each in its own job).&lt;/li&gt;
&lt;li&gt;Collect HTML reports and screenshots as artifacts. If any job fails, post a summary comment on the PR.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The numbers speak for themselves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Before (manual)&lt;/strong&gt;: 5 scenarios took ~2 hours, with an error-omission rate of about 20%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;After (Playwright automation)&lt;/strong&gt;: The same 5 scenarios, plus 15 more that we never had time to run, finish in 4 minutes. The omission rate dropped to below 4%, because the machine executes every assertion statement precisely, without fatigue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;False-positive resistance&lt;/strong&gt;: Because assertions are semantic, a correct answer phrased differently (“冰美式” vs “美式咖啡”) still passes; the machine doesn’t trip over wording.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Advanced Tips
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Mock the memory service for negative testing&lt;/strong&gt;: Use &lt;code&gt;page.route()&lt;/code&gt; to intercept requests to the memory backend and return 500 errors, verifying that the bot gracefully handles “I can’t access my memory right now.”
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;**/api/memory/**&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fulfill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deal with slow streaming&lt;/strong&gt;: Some models send tokens very slowly. Instead of waiting for the send button, you can listen for the &lt;code&gt;page.on("websocket")&lt;/code&gt; event or wait for a “done” indicator. The auto-waiting approach is solid, but if your UI differs, adjust accordingly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Semantic validation with a tiny model&lt;/strong&gt;: For keywords that can be expressed in many ways, we ran the bot’s response and the expected fact through an embeddings model and check cosine similarity. It’s optional but reduces false negatives to near zero.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;RAG memory testing doesn’t have to be a nightmare of manual spreadsheet checking. With Playwright as the “hands” and a few YAML scripts as the “brain,” you can turn a flaky two-hour regression into a four-minute, reliable pipeline. The robot doesn’t get tired, doesn’t miss outliers, and won’t give your user a sugary latte when they specifically asked for black coffee.&lt;/p&gt;

</description>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>Moving DeepSeek-R1 from Transformers to vLLM: A 14x Throughput Boost</title>
      <dc:creator>BAOFUFAN</dc:creator>
      <pubDate>Thu, 07 May 2026 01:08:37 +0000</pubDate>
      <link>https://dev.to/_eb7f2a654e97a60ae9f96e/moving-deepseek-r1-from-transformers-to-vllm-a-14x-throughput-boost-f9d</link>
      <guid>https://dev.to/_eb7f2a654e97a60ae9f96e/moving-deepseek-r1-from-transformers-to-vllm-a-14x-throughput-boost-f9d</guid>
      <description>&lt;p&gt;At 2 AM, I was jolted awake by a call from operations: "Why did the billing system charge the user twice?" I stumbled to my laptop and found the root cause — our Model-as-a-Service API started queuing requests beyond a concurrency of 5, and the fragile "retry deduplication" logic I'd bolted on collapsed under high load, resulting in double charges. That was the reality of our homegrown inference service built with HuggingFace Transformers + FastAPI half a year ago. The architecture at the time felt like a water pipe held together with tape, ready to burst at any moment. It wasn't until we fully migrated to vLLM + Kong that we removed the three mountains of concurrency, billing, and multi-tenancy all at once. This article is a battle-tested record drawn from blood and tears — pure, actionable know-how you can copy directly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem Breakdown: Why the Original Approach Couldn't Survive a Traffic Spike
&lt;/h2&gt;

&lt;p&gt;Our use case was straightforward: provide a text-generation API for DeepSeek-R1, charge by token, and support multiple customers (tenants) each with their own API key and quota. Initially, with a small team, I loaded the model with Transformers, wrapped it in FastAPI, and hand-rolled key verification and token counting logic into MySQL.&lt;/p&gt;

&lt;p&gt;The cracks appeared quickly. The root cause was that &lt;strong&gt;native Transformers inference is shamelessly wasteful&lt;/strong&gt;: every request, regardless of sequence length, grabs the entire GPU memory for a full forward pass with no continuous batching. One request isn't finished, and the rest queue up. Even with dynamic batching, padding waste kept actual GPU utilization under 30%. The result: as concurrency grew, latency spiked to tens of seconds, clients timed out and retried, hammering our brittle "idempotency" logic and ultimately leading to duplicate billing.&lt;/p&gt;

&lt;p&gt;Additionally, the hand-written tenant management and rate-limiting logic was scattered across business code. Changing a quota meant a full redeployment, and the gateway layer had zero defense. I once tried to add a semaphore limiter inside FastAPI, which only jammed requests at our doorstep while resources were already hogged — even the health check went down. It felt like locking myself out of my own house.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution Design: vLLM as the Inference Engine, Kong as the Billing Gateway
&lt;/h2&gt;

&lt;p&gt;After the postmortem, we adopted two iron rules: &lt;strong&gt;the inference layer must implement continuous batching so the GPU runs like an assembly line without gaps; the gateway layer must offload cross-cutting concerns—billing, authentication, rate-limiting—so business code focuses solely on inference.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For the inference engine, the candidates were NVIDIA Triton, Text Generation Inference (TGI), and vLLM. Triton was too "heavy" — for a team desperate to patch a sinking ship, the learning curve around model configuration and model repositories was too steep. TGI was good, but back then its support for the DeepSeek family wasn't mature enough, and being tied to HuggingFace left less room for customization. vLLM was booming for good reason — its PagedAttention memory-sharing mechanism let multiple requests' KV caches be dynamically stitched together in GPU memory with near-zero waste. It natively supports the OpenAI API format, making migration costs virtually zero. So we chose it.&lt;/p&gt;

&lt;p&gt;For the gateway, Kong was a component we’d always wanted but never had time to adopt. Why not build it yourself? Because "billing, auth, rate-limiting" may sound simple, but doing them at production grade requires plugin hot-reload, multi-dimensional limiting, highly available storage, daily tenant reports... Building that yourself is like developing half a gateway from scratch. Kong's three plugins — Key Authentication, Rate Limiting, and HTTP Log — connected in series can construct a complete multi-tenant billing system: Key Auth isolates tenants, Rate Limiting prevents abuse, and HTTP Log asynchronously pushes token consumption from each request to Kafka/ClickHouse, where the billing system computes charges offline. Once the architecture was clear, I could finally sleep at night.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Implementation: From a Single Command to a Full Multi-Tenant Gateway
&lt;/h2&gt;

&lt;p&gt;Below is runnable code and configuration. I’ve split it into two parts: vLLM deployment and Kong configuration. &lt;strong&gt;This first part starts the inference service&lt;/strong&gt; with a single Docker command, exposing an OpenAI-compatible endpoint.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 要预先下载好 DeepSeek-R1 模型，放在 /data/model/deepseek-r1&lt;/span&gt;
docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--gpus&lt;/span&gt; all &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; vllm-deepseek &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 8000:8000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; /data/model:/models &lt;span class="se"&gt;\&lt;/span&gt;
  vllm/vllm-openai:latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; /models/deepseek-r1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tensor-parallel-size&lt;/span&gt; 2 &lt;span class="se"&gt;\ &lt;/span&gt;   &lt;span class="c"&gt;# 双卡，用张量并行&lt;/span&gt;
  &lt;span class="nt"&gt;--max-model-len&lt;/span&gt; 8192 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-prefix-caching&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;    &lt;span class="c"&gt;# 开启前缀缓存，相同 system prompt 能秒出&lt;/span&gt;
  &lt;span class="nt"&gt;--gpu-memory-utilization&lt;/span&gt; 0.92
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the service is up, you can simply &lt;code&gt;curl http://localhost:8000/v1/chat/completions&lt;/code&gt; and call DeepSeek-R1 just like OpenAI. I've battle-tested the stability and compatibility of this interface countless times — it works perfectly as a Kong upstream.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The next part, the Kong configuration, solves multi-tenant authentication, rate-limiting, and token-consumption forwarding.&lt;/strong&gt; I use Kong's decK declarative format. Copy and paste it into Kong, and it takes effect immediately.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# kong-config.yaml&lt;/span&gt;
&lt;span class="na"&gt;_format_version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3.0"&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deepseek-r1&lt;/span&gt;
    &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://vllm-deepseek:8000/v1&lt;/span&gt;   &lt;span class="c1"&gt;# 指向 vLLM 容器&lt;/span&gt;
    &lt;span class="na"&gt;routes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deepseek-chat&lt;/span&gt;
        &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/chat&lt;/span&gt;
        &lt;span class="na"&gt;strip_path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;               &lt;span class="c1"&gt;# 保留 /chat 后缀，透传给 vLLM&lt;/span&gt;
    &lt;span class="na"&gt;plugins&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;key-auth&lt;/span&gt;                  &lt;span class="c1"&gt;# 启用 API Key 认证&lt;/span&gt;
        &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;key_names&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;apikey"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;         &lt;span class="c1"&gt;# 从 header 或 query 取 key&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rate-limiting&lt;/span&gt;
        &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;minute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;                   &lt;span class="c1"&gt;# 每个租户每分钟最多100请求&lt;/span&gt;
          &lt;span class="na"&gt;policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;local&lt;/span&gt;                 &lt;span class="c1"&gt;# 单节点限流，多节点用 redis&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http-log&lt;/span&gt;                 &lt;span class="c1"&gt;# 日志推送到计费系统&lt;/span&gt;
        &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;http_endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://billing-collector:3000/log&lt;/span&gt;
          &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;POST&lt;/span&gt;
          &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2000&lt;/span&gt;
          &lt;span class="na"&gt;keepalive&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60000&lt;/span&gt;
&lt;span class="c1"&gt;# 消费者的 API Key 定义&lt;/span&gt;
&lt;span class="na"&gt;consumers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tenant_a&lt;/span&gt;
    &lt;span class="na"&gt;keyauth_credentials&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sk-tenantA-xxxxx&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tenant_b&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this setup, the moment a request hits Kong, it’s authenticated and counted; vLLM continuously processes the rough stream of inference without ever touching billing logic. We went from 5 concurrent requests causing chaos to handling over 200 concurrent requests smoothly, with throughput skyrocketing 14x. That middle-of-the-night phone call has never rung again for this reason.&lt;/p&gt;

</description>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>Pytest + Docker: 3 Bugs That Broke My AI Agent's Memory (and Cost Me 8 Hours)</title>
      <dc:creator>BAOFUFAN</dc:creator>
      <pubDate>Wed, 06 May 2026 12:05:03 +0000</pubDate>
      <link>https://dev.to/_eb7f2a654e97a60ae9f96e/pytest-docker-3-bugs-that-broke-my-ai-agents-memory-and-cost-me-8-hours-4eh0</link>
      <guid>https://dev.to/_eb7f2a654e97a60ae9f96e/pytest-docker-3-bugs-that-broke-my-ai-agents-memory-and-cost-me-8-hours-4eh0</guid>
      <description>&lt;p&gt;At 1:23 AM our ops group chat exploded—users were reporting that the agent had completely lost its memory. Every conversation felt like it was starting from scratch. I dug into the logs and found the memory module returning an empty list, even though the records were sitting right there in the database. It wasn’t a model hallucination. The consistency of the memory storage had been silently broken: two concurrent requests had overwritten the last message of the session with an older version. And the mocked MemoryStore in our unit tests would never tell you that. That night I decided to build a reproducible, real-storage verification setup using Pytest + Docker. I started at midnight and didn’t stop until dawn—and I hit way more pitfalls than I expected. Here’s the full postmortem so you can save yourself a few sleepless nights.&lt;/p&gt;

&lt;h2&gt;
  
  
  Breaking Down the Problem: Why Mocks Can’t Catch Memory’s Fatal Flaws
&lt;/h2&gt;

&lt;p&gt;An AI agent’s memory storage seems deceptively simple: insert a row per conversation turn with &lt;code&gt;session_id&lt;/code&gt;, &lt;code&gt;role&lt;/code&gt;, &lt;code&gt;content&lt;/code&gt;, and &lt;code&gt;created_at&lt;/code&gt;, then fetch the most recent N rows per session to build the context. We used PostgreSQL with SQLAlchemy + asyncpg. It all looked harmless—until concurrency showed up and the gremlins came out.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Concurrent insert ordering chaos:&lt;/strong&gt; Instead of relying on the database’s auto-increment sequence, we generated &lt;code&gt;created_at&lt;/code&gt; timestamps in application code. But server clock drift or Python’s &lt;code&gt;datetime.utcnow()&lt;/code&gt; reordering inside coroutines would push later messages before earlier ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;“Vanishing writes” under read/write splitting:&lt;/strong&gt; The primary accepted the write, but the subsequent query hit a read replica. Replication lag made the freshly inserted message invisible—so the agent simply “forgot” it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fake snapshots due to transaction isolation:&lt;/strong&gt; Under default READ COMMITTED, a long transaction could see different versions of the same session on successive reads. This introduced phantom rows while assembling the context.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How do typical unit tests handle this? They swap out the repository with &lt;code&gt;unittest.mock&lt;/code&gt; and assert “the insert method was called.” That never touches a real storage engine. Isolation levels, concurrent scheduling, network delays—all gone. &lt;strong&gt;Testing memory storage with mocks is like learning to parallel park in a simulator—you’ll never learn the real thing.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Plan: Pull a Real Database Into Tests with Docker
&lt;/h2&gt;

&lt;p&gt;To verify both correctness and consistency, you have to swing at real pitches. The plan was straightforward: &lt;strong&gt;Pytest organizes the test cases, and Docker provides a disposable, genuine database.&lt;/strong&gt; At test time you spin up a PostgreSQL container, wait for its health check, run migrations, execute concurrent scenarios, then tear it all down. Every run starts from a clean slate.&lt;/p&gt;

&lt;p&gt;Why not other approaches?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ &lt;strong&gt;Testcontainers-Python&lt;/strong&gt;: Nice idea, but it requires a Docker daemon in CI and its abstraction isn’t transparent enough. When things break you can’t tell if the container never started or the port mapping went sideways.&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;SQLite in-memory mode&lt;/strong&gt;: Its isolation level and concurrency model are too different from PostgreSQL. It won’t surface transaction conflicts or simulate replication lag—a total waste.&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Docker Compose&lt;/strong&gt;: A single YAML describes the dependencies, works the same in CI and locally. The way ops orchestrates production is how we orchestrate tests, reproducing ~90% of real behavior.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architecture in text form:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Test startup
  ├─ docker compose up -d (postgres, optionally pgvector, redis)
  ├─ wait for health check
  ├─ run alembic migrations / create tables
  ├─ pytest cases (correctness + concurrency consistency)
  └─ docker compose down -v
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One easily overlooked point: concurrency tests must run with real async I/O. You can’t just rely on &lt;code&gt;pytest-asyncio&lt;/code&gt;’s default loop. We need to control the event loop lifecycle so all async fixtures share the same loop, giving us the same coroutine scheduling behavior as production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Implementation: Building It Step by Step
&lt;/h2&gt;

&lt;p&gt;First, the &lt;code&gt;docker-compose.yml&lt;/code&gt;. Keep it minimal, but get the health check right—screw this up and you’ll step on landmines.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.yml&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3.9"&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:16-alpine&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;agent_test&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test_pass&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;memory_test&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0:5432"&lt;/span&gt;            &lt;span class="c1"&gt;# 随机端口，避免本地冲突&lt;/span&gt;
    &lt;span class="na"&gt;healthcheck&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CMD"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pg_isready"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-U"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_test"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1s&lt;/span&gt;
      &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3s&lt;/span&gt;
      &lt;span class="na"&gt;retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The port mapping &lt;code&gt;0:5432&lt;/code&gt; tells Docker to assign a random host port. In Python we’ll grab it with &lt;code&gt;docker compose port&lt;/code&gt;, so parallel test runs never collide.&lt;/p&gt;

&lt;p&gt;Now for &lt;code&gt;conftest.py&lt;/code&gt;, which manages the container lifecycle and the database connection pool. &lt;strong&gt;I’ve stepped on enough landmines here—here’s the final working version.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# conftest.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncpg&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytest_asyncio&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_get_port&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;通过 docker compose port 获取容器映射出来的宿主机端口&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compose&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;port&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgres&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5432&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;check&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# 输出格式: "0.0.0.0:54321"
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;


&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;docker_services&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;启动 docker compose 服务，返回服务端口映射&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compose&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;up&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;check&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# 等待健康检查通过，而不是死等 sleep
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;port&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_get_port&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="c1"&gt;# 用 pg_isready 再次确认
&lt;/span&gt;            &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pg_isready&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-h&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;127.0.0.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-p&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>python</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
