TL;DR
AI agents often fail not due to lack of intelligence, but because of faulty memory architectures. Understanding the four types of agent memory, their storage mechanisms, and how they affect API behavior enables you to build more reliable agents and catch bugs before production.
Introduction
Most AI agent failures aren’t about bad models—they’re about broken memory layers.
If your agent forgets what happened a few turns ago, loses user context between sessions, or contradicts itself mid-task, the issue is likely in the memory design or lack of proper testing.
Hippo, an open-source agent memory system, takes a biologically inspired approach by modeling short-term, long-term, and episodic memory separately. Its existence highlights a real gap: many developers treat agent memory as an afterthought and only discover the problems once the agent is live.
💡 Pro Tip: Apidog’s Test Scenarios let you verify stateful, multi-turn agent conversations before release. You can ensure session state persists across API calls, assert on context structures, and simulate memory failures using Smart Mock. This article covers both the memory architecture and how to test it effectively. For a primer on broader testing, see [internal: api-testing-tutorial].
What is AI Agent Memory?
Agent memory refers to mechanisms that let an AI system retain or access information beyond a single input. Without memory, every API call is stateless—the model sees only the current prompt and forgets everything else.
There are four main types of agent memory, each serving distinct roles.
The Four Types of Agent Memory
1. Working Memory
Working memory is the agent’s active context—the data in the current prompt. For LLM-based agents, this is the context window (e.g., GPT-4o’s 128K tokens, Claude 3.5 Sonnet’s 200K, Gemini 1.5 Pro’s 1M).
- Pros: Fast, precise.
- Cons: Expensive (per-token cost), size-bounded (oldest context drops off). Common bug: Context overflow in long-running tasks.
2. Episodic Memory
Episodic memory logs past interactions, decisions, and observations—a diary for the agent.
- Typical implementation: Vector DB (Chroma, Pinecone, Qdrant) or structured event log.
- Access pattern: Agent retrieves relevant episodes via semantic search.
- Example: Hippo stores interactions with timestamps and decay weights for prioritization.
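The decay-weight idea can be sketched as recency-damped ranking. This is an illustration of the general technique, not Hippo's actual code; the exponential half-life and the precomputed similarity scores are assumptions.

```python
import math
import time

# Illustrative sketch (not Hippo's implementation): rank stored episodes by
# semantic similarity damped by an exponential recency decay, so stale
# episodes sink in the ranking even when they match the query well.

HALF_LIFE_S = 7 * 24 * 3600  # assumed one-week half-life

def decayed_score(similarity: float, stored_at: float, now: float) -> float:
    age = now - stored_at
    decay = 0.5 ** (age / HALF_LIFE_S)  # halves every HALF_LIFE_S seconds
    return similarity * decay

now = time.time()
episodes = [
    {"text": "user prefers PostgreSQL", "sim": 0.9, "ts": now - 3600},        # 1 hour old
    {"text": "debugged a Redis issue",  "sim": 0.9, "ts": now - 30 * 86400},  # 30 days old
]
ranked = sorted(episodes, key=lambda e: decayed_score(e["sim"], e["ts"], now), reverse=True)
print(ranked[0]["text"])  # the recent episode wins despite equal similarity
```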
3. Semantic Memory
Semantic memory stores facts, domain knowledge, stable information, and user preferences. It’s not time-ordered.
- Implementation: Pre-loaded (system prompt), dynamically built (knowledge graph), or external (RAG over docs).
4. Procedural Memory
Procedural memory encodes how to do things: action patterns, tool usage, and learned skills.
- Implementation: Few-shot examples in prompts, or a library of stored action plans.
- Note: Often neglected in production agents.
How Memory is Stored in Real Systems
In practice, these types rarely map to separate stores. Typical approaches:
- Context Window (Working): Managed by the agent framework; expires with conversation end.
- External Vector Store (Episodic + Semantic): Chroma, Pinecone, or Qdrant store embeddings of past data. Queried each turn; relevant chunks injected into the prompt.
- Structured DB (Semantic + Procedural): PostgreSQL/SQLite for user preferences, state, and action templates.
- In-memory Cache (Working Overflow): Redis or simple dict for fast access to recent context.
Hippo’s three-tier system explicitly hands off working memory to episodic when stale, then summarizes episodic into semantic—mirroring human memory consolidation. (Hippo even includes a "sleep" command for consolidation.)
How Agent Memory Affects API Behavior
Memory design directly impacts API usage and failure modes. Key considerations:
- Session IDs: Most agent APIs use session/thread IDs (e.g., OpenAI's thread_id). Dropping or reusing IDs causes context loss or session blending.
- Context Size: Injecting memory grows request payloads. Small conversations (2KB) can balloon to 40KB after 20 turns. Watch for client payload limits.
- Retrieval Latency: Vector store lookups can add 50–200ms per turn.
- Inconsistent State After Failures: If a tool call fails mid-task, the episodic log may record partial actions, causing corrupted state on the next turn. Good agents checkpoint state around tool use.
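The checkpoint pattern from the last point can be sketched in a few lines: write a "pending" record before the tool call and mark it committed only on success, so the next turn can detect and exclude half-finished actions. The in-memory dict is a stand-in for a durable store; the tool and log shapes are illustrative.

```python
import uuid

# Sketch of checkpointing around tool use: an action only enters the usable
# episodic record once the tool call has actually succeeded.

episodic_log: dict[str, dict] = {}  # stand-in for a durable store

def call_tool_checkpointed(tool, args: dict):
    entry_id = str(uuid.uuid4())
    episodic_log[entry_id] = {"tool": tool.__name__, "args": args, "status": "pending"}
    try:
        result = tool(**args)
    except Exception:
        episodic_log[entry_id]["status"] = "failed"   # never recorded as done
        raise
    episodic_log[entry_id]["status"] = "committed"
    episodic_log[entry_id]["result"] = result
    return result

def flaky_search(query: str):
    raise TimeoutError("backend down")

try:
    call_tool_checkpointed(flaky_search, {"query": "pg tuning"})
except TimeoutError:
    pass

# On the next turn, only committed entries feed the prompt:
usable = [e for e in episodic_log.values() if e["status"] == "committed"]
print(len(usable))  # 0 -- the partial action is excluded
```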
How to Test Agent Memory via API with Apidog
Testing stateful agent APIs requires more than single-request checks. You need to verify context carryover across calls, memory-backed response changes, and graceful degradation on failure.
Apidog Test Scenarios make this straightforward. Here’s how to set up key memory tests for agent APIs:
Test 1: Context Carryover
Create a scenario with three steps:
1. POST /agent/chat — Introduce a fact (e.g., "My project uses PostgreSQL 16").
2. POST /agent/chat — Ask a follow-up requiring recall (e.g., "What database should I optimize for?").
3. Assert: Step 2's response contains "PostgreSQL".
If memory works, the agent recalls and uses the fact. Otherwise, you’ll get a generic answer.
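The same three-step scenario, expressed as plain Python so the assertion logic is explicit. Here post_chat is a stand-in for your HTTP client (e.g., a POST to /agent/chat); it wraps a toy in-process "agent" purely so the sketch runs without a live server.

```python
# Context-carryover check as a plain test. The toy agent below only echoes
# a database name it has already seen in the same session -- a stand-in for
# a real memory-backed endpoint.

sessions: dict[str, list[str]] = {}

def post_chat(session_id: str, message: str) -> str:
    history = sessions.setdefault(session_id, [])
    history.append(message)
    return "PostgreSQL" if any("PostgreSQL" in m for m in history) else "It depends."

def test_context_carryover():
    sid = "sess-1"
    post_chat(sid, "My project uses PostgreSQL 16")                   # step 1: seed a fact
    answer = post_chat(sid, "What database should I optimize for?")   # step 2: force recall
    assert "PostgreSQL" in answer                                     # step 3: assert

test_context_carryover()
print("carryover ok")
```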
Test 2: Session Isolation
Run the above sequence twice with different session_id values. Assert that the second session’s response does not contain context from the first session. This catches multi-tenant memory leaks.
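The isolation assertion follows the same pattern with two session IDs. Again, post_chat is a stand-in for your HTTP client backed by a toy per-session store so the sketch is runnable.

```python
# Session-isolation check: a fact seeded in session A must be invisible to
# session B. The per-session dict is a stand-in for a real agent backend.

sessions: dict[str, list[str]] = {}

def post_chat(session_id: str, message: str) -> str:
    history = sessions.setdefault(session_id, [])
    history.append(message)
    return "PostgreSQL" if any("PostgreSQL" in m for m in history) else "It depends."

def test_session_isolation():
    post_chat("sess-A", "My project uses PostgreSQL 16")
    answer_b = post_chat("sess-B", "What database should I optimize for?")
    assert "PostgreSQL" not in answer_b   # sess-B must not inherit sess-A's context

test_session_isolation()
print("isolation ok")
```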
Test 3: Memory Failure Degradation
Use Apidog’s Smart Mock to simulate a backend failure (e.g., return 503 on vector store lookup). Assert that:
- The agent responds without crashing.
- The response includes a fallback (e.g., "I don't have enough context to answer that").
- The session resumes normally after the mock is removed.
Test 4: Context Window Overflow
Send 30+ rapid messages to exceed the context window. Assert that:
- No context_length_exceeded error is thrown (agent should truncate gracefully).
- The agent still answers correctly using episodic retrieval.
- response.usage token counts stay within expected limits.
Chain these as a single Test Scenario in Apidog, using shared variables for session IDs and responses. For background on context windows at the model level, see [internal: how-to-build-tiny-llm-from-scratch].
Common Memory Failure Modes
- Silent Context Truncation: Context window overflows and older messages disappear. Catch this by asserting response.usage.prompt_tokens stays below your model's limit.
- Session Bleed: Memory leaks between user sessions. Detect with session isolation tests.
- Stale Semantic Memory: Old knowledge contradicts current facts. Assert on current values in tests.
- Embedding Drift: Switching embedding models breaks vector store retrievals. While not directly testable, you can assert semantic relevance in retrieved context.
- Prompt Injection via Memory: Malicious inputs manipulate stored memory. Test with adversarial inputs; store a “user preference” containing a system prompt override and ensure it’s ignored. See [internal: rest-api-best-practices] for API security testing.
Conclusion
Agent memory distinguishes an intelligent assistant from one that’s amnesiac. Understanding working, episodic, semantic, and procedural memory—and how they’re actually stored—lets you identify and test for real-world bugs.
Tools like Hippo show the shift toward principled memory architectures. Whatever your memory system, Apidog Test Scenarios provide the API testing layer to verify behavior, especially for failures that only appear at scale.
FAQ
What’s the simplest way to add memory to an agent?
Use a sliding window over conversation history—keep the last N turns in the prompt. For longer tasks, add a vector store and semantic retrieval.
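A sliding window is a few lines of Python; this sketch uses a bounded deque so the oldest turn drops automatically.

```python
from collections import deque

# Minimal sliding-window memory: keep only the last N turns in the prompt.

class SlidingWindowMemory:
    def __init__(self, max_turns: int = 10):
        self.turns = deque(maxlen=max_turns)   # oldest turn drops automatically

    def add(self, role: str, content: str):
        self.turns.append({"role": role, "content": content})

    def as_prompt(self) -> list[dict]:
        return list(self.turns)

mem = SlidingWindowMemory(max_turns=3)
for i in range(5):
    mem.add("user", f"message {i}")
print([t["content"] for t in mem.as_prompt()])  # ['message 2', 'message 3', 'message 4']
```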
How does the OpenAI Assistants API handle memory?
It manages a server-side thread object with conversation history. You can add file/code tools for external knowledge. The memory layer is abstracted, which simplifies development but can make debugging harder.
What’s the best vector database for agent memory?
For local dev: Chroma. For production: Qdrant (self-hosted) or Pinecone (managed). Hippo supports pluggable storage. See [internal: claude-code] for details on Claude Code’s memory approach.
How do I prevent agents from hallucinating past interactions?
Log interactions with metadata (timestamp, confidence, source). When retrieving, include metadata in the prompt:
"According to our conversation on [date], you mentioned X."
Explicit citations reduce hallucination.
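The citation pattern can be sketched as a small formatting step at retrieval time. The memory record's shape (text, ts, source) is an assumption, not a standard schema.

```python
from datetime import datetime, timezone

# Sketch: inject each retrieved memory with its timestamp, so the model
# quotes grounded facts instead of inventing past conversations.

def format_with_citation(memory: dict) -> str:
    date = datetime.fromtimestamp(memory["ts"], tz=timezone.utc).strftime("%Y-%m-%d")
    return f'According to our conversation on {date}, you mentioned: "{memory["text"]}"'

mem = {"text": "the staging cluster runs PostgreSQL 16", "ts": 1718409600, "source": "chat"}
print(format_with_citation(mem))
```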
Can I test agent memory without a running agent?
Yes. Use Apidog’s Smart Mock to simulate agent API responses, including memory-backed ones. Define mock responses based on session ID or request content to test your frontend or integration layer.
How much does vector storage cost in production?
Pinecone’s free tier: 1 index/100K vectors. Paid: ~$0.096/hour for 1M vectors. Qdrant self-hosted is free. Typically, embedding generation costs outweigh storage. See [internal: what-is-mcp-server] for production memory integration details.
What’s the difference between RAG and agent memory?
RAG retrieves static knowledge at query time. Agent memory is dynamic, updating as the agent interacts. RAG answers “What do the docs say?”—agent memory answers “What do I know about this user and our interactions?”