Most AI chatbots forget you exist after a few messages. Here's how we built a memory system that doesn't.
I've been building EchoMelon — a roleplay and companion chat platform — for a while now. Early on, the most common complaint we got was brutal in its simplicity:
"Why doesn't my character remember what happened last week?"
Fair question. You'd pour hours into building a relationship with an AI character, share secrets, go on adventures, name things together — and then the character would just... blank on all of it. Because under the hood, all it sees is the last handful of messages.
This post is a deep dive into how we solved that. No hand-wavy theory. Actual patterns, actual trade-offs, actual scars.
The Problem: Context Windows Are a Lie
Every LLM has a context window — the amount of text it can "see" at once. Claude gives you 200K tokens. Gemini offers a million. Sounds like a lot, right?
It's not. Here's why:
- Your system prompt eats a chunk. Character personality, world-building, behavioral rules — for a rich roleplay character, this alone can be 3,000–8,000 tokens.
- Cost scales linearly with context. Stuffing 200K tokens into every API call would bankrupt you before lunch.
- More context ≠ better responses. Models get confused with too much raw history. They start contradicting earlier events, mixing up details, hallucinating scenes that never happened.
So you can't just dump the entire chat history into the prompt. You need to be surgical about what the model sees.
Our Approach: Memo-Based Rolling Memory
The core idea is dead simple: summarize old conversations into structured "memos" and inject those summaries into the prompt alongside recent messages.
Think of it like how your own memory works. You don't remember the exact words of a conversation from three months ago. But you remember: "That was the night she told me about her past. We were on the rooftop. Things changed after that."
That's what we're building — compressed, meaningful memories that capture what mattered, not what was said verbatim.
Here's how the analogy maps:
| Your Brain | Our System |
|---|---|
| Last 10 minutes of conversation — crystal clear | Last 8 raw message pairs — full fidelity |
| Older events — fuzzy highlights, not exact words | Memo summaries — structured highlights |
| You forget the mundane, remember what mattered | Prompt filters routine events, keeps milestones |
| Memories form passively, in the background | Summaries generated async, never blocking |
| You don't replay every detail when reminded | No RAG — chronological summaries, not flashbacks |
Step 1: The Short-Term Memory (Recent Chat History)
The simplest layer. We keep the last 8 message pairs as raw conversation — the model sees exact words, tone, nuance. This is your "working memory."
     older messages          ┃        last 8 turns
┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄ ┃ ┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
msg msg msg msg msg          ┃ msg msg msg msg msg msg msg msg
                             ┃
forgotten by the model       ┃ ← these go to the LLM as-is
Why 8? It's a balance. Enough for conversational coherence ("wait, you just said X two messages ago"), cheap enough to not blow up our API bill, and short enough that the model doesn't lose focus.
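In code, this layer is little more than a slice. Here's a minimal sketch — `Message` and `recentWindow` are simplified names for illustration, not our actual types:

```typescript
interface Message {
  role: "user" | "assistant";
  text: string;
}

// Keep only the last N user/assistant pairs as raw context.
// With pairs = 8, that's 16 messages sent to the model verbatim.
function recentWindow(history: Message[], pairs = 8): Message[] {
  return history.slice(-pairs * 2);
}
```

Everything older than the slice simply never reaches the model as raw text — it only survives as memo summaries, covered next.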
Step 2: The Long-Term Memory (Memo Summaries)
This is where it gets interesting. Those "forgotten" older messages aren't truly lost — they've been compressed into memo summaries.
Every 8 messages, we check: "Is the recent batch full AND none of them have a memo attached?" If yes, it's time to summarize.
 expired  │      rolling window: last 15 batches summarized      │ working memory
┄┄┄┄┄┄┄┄┄ │ ┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄ │ ┄┄┄┄┄┄┄┄┄┄┄┄┄┄
msg msg ┄ │ msg msg ┄   msg msg ┄    ···    msg msg ┄            │ msg msg ┄
    ↓     │     ↓           ↓                   ↓                │
Memo 4 ✕  │  Memo 5      Memo 6      ···    Memo 19 ← new        │ sent raw
Each batch of 8 messages gets compressed into one memo with 2-3 highlights. New memos are appended to the end. The last 8 message pairs stay raw. We keep only the last 15 memos — when a new one is created, the oldest rolls off. Simple as that.
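The rolling window itself is a few lines of logic. A sketch, with `Memo` as a simplified stand-in for our actual record shape:

```typescript
interface Memo {
  highlights: string[]; // 2-3 one-sentence beats per memo
}

const MAX_MEMOS = 15;

// Append the newest memo; once we exceed the cap,
// the oldest memo rolls off the front of the window.
function appendMemo(memos: Memo[], next: Memo): Memo[] {
  return [...memos, next].slice(-MAX_MEMOS);
}
```

The expired memos aren't deleted from the database — they just stop being injected into the prompt, which keeps the memory cost bounded no matter how long the conversation runs.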
How a Memo Gets Created
When triggered, here's what happens — all in the background, never blocking the user:
① Recent 8 messages
│
▼
② "Summarize the above"
│
▼
③ Cheap, fast model + summarization prompt
│
▼
④ Structured highlights:
【Highlight 1】: Emily named the stray cat "Mochi"
【Highlight 2】: Kai revealed his fear of abandonment
│
▼
⑤ Saved to DB on the chat row itself
⚠️ If anything fails → memo = null, move on. Chat never breaks.
The key design decisions:
1. Fire-and-forget. This whole flow runs async in the background. The user gets their chat response instantly — they never wait for summarization.
2. Use a cheap model. The summary doesn't need GPT-4-level intelligence. A fast, inexpensive model with good instruction-following works great. We're extracting facts, not generating creative fiction.
3. Fail gracefully. If summarization throws, we set memo = null and move on. The worst case is a gap in the memory timeline, not a crashed conversation.
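Decisions 1 and 3 together are maybe ten lines. A sketch of the shape — the summarizer and save callback are hypothetical stand-ins for your model call and DB write:

```typescript
type Summarizer = (batch: string[]) => Promise<string[]>;
type SaveMemo = (memo: string[] | null) => Promise<void>;

// Runs the cheap summarization model and persists the result.
// On any failure, saves null — a gap in the timeline, never a crash.
async function createMemo(
  batch: string[],
  summarize: Summarizer,
  save: SaveMemo,
): Promise<void> {
  try {
    await save(await summarize(batch));
  } catch {
    await save(null);
  }
}

// In the request handler: fire-and-forget — note the deliberately
// missing `await`, so the user's chat response is never blocked.
// void createMemo(batch, callCheapModel, saveToDb);
```

The `void` on the call site is the whole "fire-and-forget" pattern: the promise runs in the background and the handler returns immediately.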
The Summarization Prompt (The Secret Sauce)
This is where we spent months iterating. A generic "summarize this conversation" prompt produces garbage — it's either too verbose (defeating the purpose) or too vague (missing critical details).
Our prompt instructs the model to extract only structured "Journey Highlights" across four categories:
- Relationship Progression — trust, affection, betrayal, power shifts
- Significant Milestones — naming events, first words, emotional breakthroughs
- Notable Items & Keepsakes — symbolic objects exchanged or discovered
- Major Story Turning Points — plot twists, revelations, narrative pivots
And critically, it tells the model what not to record:
- No routine daily activities (eating, sleeping, bathing)
- No temporary emotional states ("felt nervous" doesn't make the cut)
- No minor first-time events unless they trigger something bigger
- No status bar changes (health, hunger — this is roleplay, not a game HUD)
The output format is structured:
【Highlight 1】: Emily named the stray cat "Mochi" — their first shared act of care.
【Highlight 2】: Kai revealed his fear of abandonment, deepening Emily's understanding.
Concise. Factual. No fluff. Each highlight is one sentence that captures a meaningful beat.
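One practical detail: the model's text output still has to be parsed back into structured data, and models don't always follow a format perfectly. A sketch of a forgiving parser for the 【Highlight N】 format above:

```typescript
// Extract highlight sentences from the model's raw output.
// Lines that don't match the expected format are simply skipped,
// so a slightly off-format response degrades to fewer highlights
// instead of a parse failure.
function parseHighlights(raw: string): string[] {
  const matches = raw.matchAll(/【Highlight \d+】:\s*(.+)/g);
  return [...matches].map((m) => m[1].trim());
}
```

Treating format drift as lossy rather than fatal is the same graceful-failure philosophy as the `memo = null` fallback.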
Step 3: Retrieval and Prompt Assembly
When a new message comes in, we pull the memo summaries from the DB and layer everything into a single prompt:
const promptComponents = {
systemPrompt: { text: systemPrompt, isCacheable: true },
memoryInstructionPrompt: { text: memoryInstructionPrompt, isCacheable: false },
characterPrompt: { text: characterPrompt, isCacheable: true },
userPrompt: { text: userPrompt, isCacheable: false },
memoriesPrompt: { text: memoriesPrompt, isCacheable: true },
// + recent 8 messages as the conversation history
// + current user message
};
The isCacheable flags tie into API-level prompt caching (e.g., Claude's cache control). Components that change rarely — system prompt, character info — get cached so we don't pay full price for resending them every turn. The memories prompt is also cacheable because it only changes every ~8 messages when a new memo is created.
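To make that concrete, here's a sketch of how `isCacheable` flags could map onto Anthropic-style cache markers — the block shapes are simplified, and other providers expose caching differently:

```typescript
interface PromptComponent {
  text: string;
  isCacheable: boolean;
}

// Flatten prompt components into system blocks, tagging the stable
// ones with an ephemeral cache-control marker so the provider can
// reuse them across turns instead of reprocessing the full prefix.
function toSystemBlocks(components: PromptComponent[]) {
  return components.map((c) => ({
    type: "text" as const,
    text: c.text,
    ...(c.isCacheable ? { cache_control: { type: "ephemeral" as const } } : {}),
  }));
}
```

One caveat worth knowing: providers limit how many cache breakpoints you can set per request (Anthropic currently allows four), so in practice you mark a few strategic boundaries rather than every cacheable component.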
This saves us 30-40% on API costs on average. When you're processing millions of messages per month, that adds up fast.
But here's the thing — getting prompt caching to actually work with rolling memories and sliding chat windows is a genuinely hard problem. Every time the history window slides forward by one turn, or a new memo gets created, your cache can get invalidated. We've spent significant engineering effort on cache-aligned batching to keep hit rates high. That's a deep dive on its own — coming soon. Follow along if you don't want to miss it.
With 15 memos of 8 messages each, plus 8 recent turns, the model effectively "remembers" roughly the last 120 older messages in compressed form, plus the last 16 messages verbatim. For most conversations, this covers weeks or months of chatting.
Mistakes We Made (So You Don't Have To)
1. Our first summarization prompt was too permissive
Early versions would produce summaries like: "The characters had a pleasant conversation about the weather and then discussed dinner plans." Utterly useless. We had to be extremely prescriptive about what constitutes a "memorable" event and provide tons of good/bad examples in the prompt.
2. We tried summarizing in the main request path
Our first implementation generated the memo synchronously — the user had to wait for both the summary AND the response. Response times jumped from 2s to 5s. Moving to fire-and-forget was an obvious win.
3. We didn't handle memo creation failures gracefully
If the summarization call threw an error, the chat would crash. Adding a try/catch that sets memo = null on failure was embarrassingly simple but took us a production incident to learn.
Unexpected Benefits
The Memo Book as Navigation
Here's something we didn't plan for: our users love revisiting their earlier chat history. Scrolling through thousands of messages to find "that one scene where they confessed" is painful. Nobody wants to do it.
The memo summaries accidentally became a table of contents for the conversation. Each memo is attached to a specific message in the timeline, and the highlights tell you exactly what happened in that stretch. Users can scan the memo book, find the entry that mentions the event they're looking for, and jump straight to that point in the chat.
We didn't build this as a feature — it just fell out of the architecture. But it's become one of the things users mention most when they talk about why they stick around.
Users Hijacked the Memo Book (And We Love It)
We made the memo summaries editable — figured users might want to correct mistakes or add missing details. What actually happened was way more interesting.
Users started writing entirely new memories — things that never happened in the conversation. They'd add backstory, inside jokes, shared history they wanted the character to "remember." One user wrote three memos of detailed lore about a fictional road trip the characters supposedly took together.
Our users loved it, so we leaned into it. The memo book isn't just a technical artifact anymore. It's a creative tool. Users shape the character's memory the way you'd fill in a shared journal with a close friend — part real, part wishful, part world-building. And the AI picks it all up naturally.
Why Not RAG?
The first thing most people suggest is RAG — embed all your messages, do similarity search, pull in the most relevant chunks. We tried it. It felt wrong.
The problem is that RAG retrieval is too precise. It pulls in specific memories with crystal-clear detail based on keyword similarity, and the model keeps bringing up the same moments over and over. "Oh you mentioned a rooftop — let me recall every rooftop scene in perfect detail!" That's not how memory works. You don't replay a moment at full fidelity just because something vaguely related came up.
Human memory is lossy and chronological. You remember recent things clearly, older things as impressions, and ancient things as a few key beats. RAG gives you the opposite — a random grab bag of high-fidelity flashbacks regardless of when they happened. It's unnatural and users notice. The conversations feel uncanny.
So we went a different direction.
TL;DR
If you're building a chat app and want your AI to remember things:
- Keep a short window of raw recent messages (8-10 turns) for conversational coherence.
- Periodically summarize older messages into structured memos using a cheap model.
- Store summaries on the chat records themselves — don't over-engineer a separate memory store.
- Be extremely specific in your summarization prompt about what's worth remembering. Generic "summarize this" produces junk.
- Run summarization asynchronously — never block the user's response.
- Use prompt caching on the stable parts of your context to cut costs — but know that making it work well with rolling windows is its own challenge.
- Fail gracefully — a missing memo is way better than a crashed chat.
- Skip RAG for conversational memory — chronological summaries feel more natural than similarity-search flashbacks.
The whole system is maybe 200 lines of actual logic. The hard part isn't the code — it's the prompt engineering and knowing when to summarize vs. when to keep raw context.
I'm building EchoMelon — an AI companion platform where characters actually remember your story. Follow for more deep dives on the real engineering behind AI products.
X/Twitter: @launchingmonkey · Reddit: u/Calm_Appearance_7337
