DEV Community

Cover image for Memory in production agents: what most tutorials skip
SapotaCorp
SapotaCorp

Posted on • Originally published at sapotacorp.vn

Memory in production agents: what most tutorials skip

#ai

A B2B SaaS founder pinged us with a confused question: "We built an AI assistant following a tutorial. It works for one question, but if the user asks a follow-up, it acts like the previous message never happened. Why?"

The team had assumed that calling the OpenAI API in a loop would make the model "remember" the conversation. It does not. LLMs are stateless by design. What looks like memory in ChatGPT or Claude is the application sending the entire conversation history back into the model on every turn.

This is the layer most teams discover the hard way. Memory in AI agents is a system design problem, not a property of the model. Here is the four-layer pattern Sapota uses in production and the failure modes each layer prevents.

Why memory is harder than it looks

Every LLM API call is independent. The model has no internal state between calls. When a user sends a follow-up message, the model has no idea what was said earlier unless the application explicitly includes the previous turns in the new prompt.

Naive solutions immediately hit two problems:

Token cost grows linearly. Turn 1 costs $0.001. Turn 10, with 9 previous turns appended, costs around $0.01. Turn 50 costs $0.05 per message. By turn 100, every reply is a $0.10 LLM call.

Context window fills up. Even a 200k-token context window on Claude or GPT-4o eventually fills. Long sessions hit the limit and either crash or silently truncate, dropping critical context.

The solution is not "more memory." It is engineering memory as a layered system, where different types of state live in different stores with different retention policies.

Layer 1: Short-term memory (within session)

This is the conversation history within the current session. The naive version is "append every turn forever." The production version is one of three strategies, depending on use case.

Sliding window keeps the last N turns and drops older ones. Cheap, predictable, but loses anything beyond the window. Good for FAQ-style chatbots where each interaction is mostly self-contained.

Summarization compresses older turns into a paragraph and keeps recent turns full. The summary is regenerated incrementally as new turns arrive. Production default for customer support and assistant-style products.

Hierarchical combines both: a high-level user-profile summary, a session-level summary, plus the last few turns in full. Necessary when conversations span hundreds of turns or multiple sessions.

The mistake we see is teams skipping this layer entirely and just sending the full history every time. It works for the first ten turns of testing. It breaks at scale.

Layer 2: Long-term memory (across sessions)

This is the layer most tutorials never mention. When a user comes back tomorrow, what does the agent remember about them?

Long-term memory holds three categories of information:

User preferences: communication style, language, expertise level, what to avoid. These rarely change and should be loaded on every session start.

Past interaction summaries: a one-paragraph summary of each previous session. Retrieved selectively based on relevance to the current query.

Entity memory: structured facts about people, projects, products mentioned in past conversations. Stored as key-value records, not chunked text.

Storage choice matters here. We default to Redis for preferences (instant lookup), Postgres for entity memory (structured queries), and a vector database (Qdrant) for semantic recall of episodic summaries. The hybrid setup is not over-engineering; each store solves a problem the others do not.

Layer 3: Entity memory (the one teams underestimate)

This is the layer that turns a generic chatbot into something that feels like a real assistant. Without entity memory, the agent has no way to track that "John" mentioned in turn 3 is the same "Mr. Smith" referenced in turn 12 and the "the founder" referred to in turn 20.

The structured version looks like:

{
  "Acme Corp": {
    "type": "customer",
    "tier": "enterprise",
    "primary_contact": "John Smith",
    "open_tickets": 3,
    "last_interaction": "2026-04-25"
  }
}
Enter fullscreen mode Exit fullscreen mode

Entity memory updates online (during the conversation) and persists across sessions. When the agent encounters "Acme" in a new session, it loads the entity record and acts with full context.

The pattern that fails: relying on the LLM to "remember" entities just by reading conversation history. The LLM sees text, not structured records. It will conflate similar names, miss disambiguating context, and confidently answer with the wrong account when there are two customers with similar names.

Layer 4: Retrieval-based memory (when scale matters)

For agents that accumulate hundreds or thousands of past interactions per user, injecting all of long-term memory into every prompt is wasteful. Retrieval-based memory treats past conversations as a searchable corpus.

The pattern: embed each past session summary, store in a vector database with metadata (user_id, timestamp, topic). On each new query, embed the query and search for the top three to five most relevant past memories. Inject only those into the current prompt.

Retrieval-based memory bridges the gap between "agent remembers everything" (impossible at scale) and "agent forgets everything" (current state of most production agents). It also enables time-decay weighting, so a memory from yesterday ranks higher than one from a year ago at equivalent semantic similarity.

The privacy layer most teams forget

Long-term memory is by definition personal data. Production memory systems have to handle:

  • User control: view what the AI remembers, delete specific memories, export all data
  • Retention policy: TTL on episodic memory (we default to 90-180 days), permanent for explicit preferences only
  • Encryption: at rest and in transit, especially for cross-border deployments
  • Audit log: every read and write to long-term memory, for GDPR compliance

The teams that skip this layer ship into a regulatory minefield. The teams that bolt it on after launch spend months retrofitting compliance into a system that was not designed for it.

What we shipped for the founder

The diagnostic took two hours. The fix was a three-layer setup:

  1. Short-term: summarization with token-based trigger at 8000 tokens. Last five turns kept full, older turns compressed into a running summary.

  2. Long-term: a Redis store for user preferences (instant lookup) and a Postgres table for past session summaries (loaded by user_id on session start).

  3. Entity memory: structured records updated online from conversation, stored in Postgres with the user_id as a foreign key.

We skipped retrieval-based memory for v1. The user volume was 200 users, not 200,000, and the simpler architecture was sufficient. The team can add the vector layer when scale demands it.

The agent stopped forgetting the user's name. More importantly, it started recognizing returning users, recalling their account context, and personalizing responses. User satisfaction scores went up about 25% in the first month.

When to build each layer

  • Just shipping? Add short-term memory (summarization). One layer is enough for v1.
  • Returning users? Add long-term memory (preferences + episodic).
  • Multi-entity domain? Add entity memory (CRM, project management, sales tools).
  • Hundreds of interactions per user? Add retrieval-based memory.
  • Regulated industry or international? Add the privacy layer from day one, not after launch.

The order matters. Layer 1 first, then 2, then 3, then 4. Skipping forward without the foundation makes the upper layers brittle.

If your agent keeps forgetting things it should know

If your team has shipped an AI assistant and users keep complaining that the agent does not remember context it should, the failure is almost certainly at the memory layer, not the model.

Sapota offers a memory architecture review that takes your current setup, identifies which layers are missing, and ships the implementation as a working integration. We have done this for personal assistants, customer support agents, and sales tools. The patterns are similar; the storage choices vary.

Reach out via the AI engineering page with a description of what users are saying and what the current memory setup looks like. The diagnosis usually surfaces in the first call.

Top comments (0)