<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kaicheng zhang</title>
    <description>The latest articles on DEV Community by Kaicheng zhang (@mrbolo).</description>
    <link>https://dev.to/mrbolo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3836262%2F2443c95f-3c25-4259-8e98-c018267abb10.png</url>
      <title>DEV Community: Kaicheng zhang</title>
      <link>https://dev.to/mrbolo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mrbolo"/>
    <language>en</language>
    <item>
      <title>Products Aren't Built for Humans Anymore — I Decided to Serve AI</title>
      <dc:creator>Kaicheng zhang</dc:creator>
      <pubDate>Wed, 25 Mar 2026 19:40:37 +0000</pubDate>
      <link>https://dev.to/mrbolo/products-arent-built-for-humans-anymore-i-decided-to-serve-ai-2hl6</link>
      <guid>https://dev.to/mrbolo/products-arent-built-for-humans-anymore-i-decided-to-serve-ai-2hl6</guid>
      <description>&lt;h2&gt;
  
  
  AI Is Browsing the Internet for Us
&lt;/h2&gt;

&lt;p&gt;A phone conversation with my friend Lucas this morning made a bunch of things click at once.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Lucas: You deployed an app recently, right? What tool did you use?&lt;/p&gt;

&lt;p&gt;Me: No idea, AI deployed it. Ask AI.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I said it without thinking — because that's genuinely how I operate now: &lt;strong&gt;I only care about results and the bill. What tool AI used, which approach is better — I couldn't care less.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That moment was a wake-up call.&lt;/p&gt;




&lt;p&gt;When AI starts browsing the internet on our behalf, marketing to "humans" stops making sense. &lt;strong&gt;The real audience you should be serving is AI itself.&lt;/strong&gt; Make it so AI can discover your product, pull information from it without friction, and interact with it directly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Public Data Defines What AI Knows
&lt;/h2&gt;

&lt;p&gt;The conversation kept going:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Lucas: AI says fly.io and Railway, recommends fly.io.&lt;/p&gt;

&lt;p&gt;Me: Let me check my history… yep, that's exactly what I got too.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Lucas is on the other side of the planet, using a different AI, asking in a different way. But both AIs gave nearly the same answer.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because &lt;strong&gt;they draw from the same pool of public data&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;What's "public"? Simple: &lt;strong&gt;content you can see without logging in.&lt;/strong&gt; Reddit, Stack Overflow, tech blogs, Hacker News — you Google it, you see it, and so can AI. That's public.&lt;/p&gt;

&lt;p&gt;Now think about TikTok, Instagram, WeChat — you can't see anything without downloading the app and logging in. Behind those walled gardens, data doesn't get out. If AI can't read it, &lt;strong&gt;it might as well not exist&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The more I think about it, the more convinced I am: walled-garden platforms are relics of the last era — built for humans. In AI's worldview, &lt;strong&gt;information that isn't public simply doesn't exist.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why AI Will Inevitably Browse for Us
&lt;/h2&gt;

&lt;p&gt;Because starting is hard.&lt;/p&gt;

&lt;p&gt;I've been meaning to do Reddit marketing for a while. Kept putting it off — didn't know where to begin. Yesterday over dinner, I casually asked AI about it. It searched relevant threads, drafted replies, and knocked out the first step for me. Suddenly the whole thing had momentum.&lt;/p&gt;

&lt;p&gt;Most of the time it's not that people don't want to do things — it's that the activation energy is too high. AI just needs to get you past that first step, and the rest flows. This trend is irreversible — more and more actions will be initiated by AI, with humans just approving.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'm Doing Next
&lt;/h2&gt;

&lt;p&gt;Once this clicked, the product direction became obvious:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Build Only for the Public Web
&lt;/h3&gt;

&lt;p&gt;All products and content stay on the public web: open source, open comments, bring-your-own-LLM, and most features usable without login.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Build Only What AI Can Fully Automate
&lt;/h3&gt;

&lt;p&gt;No more products where humans need to learn a UI first. Every feature must be discoverable by AI, understandable by AI, and completable by AI — zero to done, fully automated on the user's behalf.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Marketing: Get Into AI's Knowledge Base
&lt;/h3&gt;

&lt;p&gt;Back to the public data theme. Marketing strategy has to revolve around "make sure AI can find us":&lt;/p&gt;

&lt;p&gt;Find matching interest communities on Reddit and respond to real needs with relevant content. Someone posts about wanting a specific type of character? Drop a link to a matching one. And yes — use AI to automate this step too.&lt;/p&gt;

&lt;p&gt;Over time, when the next generation of LLMs trains on Reddit data, our content becomes part of the answers to those queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This isn't the old "spend ad dollars → get conversions" playbook. It's "plant information → let AI harvest it."&lt;/strong&gt; Spread keywords across countless niche interest areas so that when AI answers any related question, it naturally surfaces us.&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;AI is browsing the internet for humans.&lt;/strong&gt; If your product is only built for human eyes, you're invisible to AI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Public data is AI's only source of truth.&lt;/strong&gt; If it's not public, it doesn't exist in AI's world.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Go AI-first.&lt;/strong&gt; Every capability should be open to AI — let it discover you, understand you, and act on behalf of your users.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is what I'm building toward for the next five years.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm building EchoMelon, an AI chat platform for interactive fiction. Open to conversations.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>startup</category>
      <category>marketing</category>
      <category>product</category>
    </item>
    <item>
      <title>How We Built Chat Memory That Actually Works — Lessons from Shipping to 100K Users</title>
      <dc:creator>Kaicheng zhang</dc:creator>
      <pubDate>Sat, 21 Mar 2026 01:06:49 +0000</pubDate>
      <link>https://dev.to/mrbolo/how-we-built-chat-memory-that-actually-works-lessons-from-shipping-to-100k-users-24cd</link>
      <guid>https://dev.to/mrbolo/how-we-built-chat-memory-that-actually-works-lessons-from-shipping-to-100k-users-24cd</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmqj72g79bkcm6smf18c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmqj72g79bkcm6smf18c.png" alt=" " width="800" height="528"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Most AI chatbots forget you exist after a few messages. Here's how we built a memory system that doesn't.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;I've been building EchoMelon — a roleplay and companion chat platform — for a while now. Early on, the most common complaint we got was brutal in its simplicity:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Why doesn't my character remember what happened last week?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Fair question. You'd pour hours into building a relationship with an AI character, share secrets, go on adventures, name things together — and then the character would just... blank on all of it. Because under the hood, all it sees is the last handful of messages.&lt;/p&gt;

&lt;p&gt;This post is a deep dive into how we solved that. No hand-wavy theory. Actual patterns, actual trade-offs, actual scars.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Context Windows Are a Lie
&lt;/h2&gt;

&lt;p&gt;Every LLM has a context window — the amount of text it can "see" at once. Claude gives you 200K tokens. Gemini offers a million. Sounds like a lot, right?&lt;/p&gt;

&lt;p&gt;It's not. Here's why:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Your system prompt eats a chunk.&lt;/strong&gt; Character personality, world-building, behavioral rules — for a rich roleplay character, this alone can be 3,000–8,000 tokens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost scales linearly with context.&lt;/strong&gt; Stuffing 200K tokens into every API call would bankrupt you before lunch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More context ≠ better responses.&lt;/strong&gt; Models get &lt;em&gt;confused&lt;/em&gt; with too much raw history. They start contradicting earlier events, mixing up details, hallucinating scenes that never happened.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So you can't just dump the entire chat history into the prompt. You need to be surgical about what the model sees.&lt;/p&gt;
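&lt;p&gt;To make the "bankrupt you before lunch" point concrete, here's a back-of-envelope sketch. The per-token price is a made-up round number, not any provider's actual rate; the point is the linear scaling:&lt;/p&gt;

```typescript
// Back-of-envelope cost math. The price below is hypothetical;
// what matters is that cost scales linearly with input tokens.
const ASSUMED_PRICE_PER_MILLION_INPUT_TOKENS = 3; // USD, assumed

function costPerCall(inputTokens: number): number {
  return (inputTokens / 1_000_000) * ASSUMED_PRICE_PER_MILLION_INPUT_TOKENS;
}

// Dumping a full 200K-token context vs. a trimmed ~8K-token prompt,
// over 1,000 calls:
const fullCost = 1_000 * costPerCall(200_000);  // roughly $600
const trimmedCost = 1_000 * costPerCall(8_000); // roughly $24
```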

&lt;h2&gt;
  
  
  Our Approach: Memo-Based Rolling Memory
&lt;/h2&gt;

&lt;p&gt;The core idea is dead simple: &lt;strong&gt;summarize old conversations into structured "memos" and inject those summaries into the prompt alongside recent messages.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Think of it like how your own memory works. You don't remember the exact words of a conversation from three months ago. But you remember: &lt;em&gt;"That was the night she told me about her past. We were on the rooftop. Things changed after that."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's what we're building — compressed, meaningful memories that capture the &lt;em&gt;what mattered&lt;/em&gt;, not the &lt;em&gt;what was said verbatim&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Here's how the analogy maps:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Your Brain&lt;/th&gt;
&lt;th&gt;Our System&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Last 10 minutes of conversation — crystal clear&lt;/td&gt;
&lt;td&gt;Last 8 raw message pairs — full fidelity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Older events — fuzzy highlights, not exact words&lt;/td&gt;
&lt;td&gt;Memo summaries — structured highlights&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;You forget the mundane, remember what mattered&lt;/td&gt;
&lt;td&gt;Prompt filters routine events, keeps milestones&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memories form passively, in the background&lt;/td&gt;
&lt;td&gt;Summaries generated async, never blocking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;You don't replay every detail when reminded&lt;/td&gt;
&lt;td&gt;No RAG — chronological summaries, not flashbacks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Step 1: The Short-Term Memory (Recent Chat History)
&lt;/h2&gt;

&lt;p&gt;The simplest layer. We keep the last &lt;strong&gt;8 message pairs&lt;/strong&gt; as raw conversation — the model sees exact words, tone, nuance. This is your "working memory."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  older messages              last 8 turns
  ┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄  ┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄
  msg  msg  msg  msg  msg  ┃  msg  msg  msg  msg  msg  msg  msg  msg
                           ┃
  forgotten by the model   ┃  ← these go to the LLM as-is
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why 8? It's a balance. Enough for conversational coherence ("wait, you just said X two messages ago"), cheap enough to not blow up our API bill, and short enough that the model doesn't lose focus.&lt;/p&gt;
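&lt;p&gt;The working-memory layer is literally a slice. A minimal sketch (types and names are illustrative, not our production code):&lt;/p&gt;

```typescript
// Minimal sketch of the short-term "working memory" slice.
type Message = { role: "user" | "assistant"; content: string };

const RECENT_PAIRS = 8; // last 8 user/assistant pairs go to the LLM verbatim

function workingMemory(history: Message[]): Message[] {
  // 8 pairs = 16 individual messages; everything older is covered by memos
  return history.slice(-RECENT_PAIRS * 2);
}
```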




&lt;h2&gt;
  
  
  Step 2: The Long-Term Memory (Memo Summaries)
&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting. Those "forgotten" older messages aren't truly lost — they've been compressed into memo summaries.&lt;/p&gt;

&lt;p&gt;Every &lt;strong&gt;8 messages&lt;/strong&gt;, we check: &lt;em&gt;"Is the recent batch full AND none of them have a memo attached?"&lt;/em&gt; If yes, it's time to summarize.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  expired       │     rolling window: last 15 batches summarized          │  working memory
  ┄┄┄┄┄┄┄       │     ┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄         │  ┄┄┄┄┄┄┄┄┄┄┄┄┄┄
  msg msg ┄     │     msg msg ┄     msg msg ┄    ···    msg msg ┄         │  msg msg ┄
      ↓         │           ↓             ↓                   ↓           │        ↓
  Memo 4 ✕      │     Memo 5        Memo 6       ···    Memo 19 ←new      │  Sent Raw
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each batch of 8 messages gets compressed into one memo with 2-3 highlights. New memos are appended to the end. The last 8 messages stay raw. We keep only the &lt;strong&gt;last 15 memos&lt;/strong&gt; — when a new one is created, the oldest rolls off. Simple as that.&lt;/p&gt;
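&lt;p&gt;The batching rule above fits in a few lines. A sketch under our stated constants (names are illustrative):&lt;/p&gt;

```typescript
// Sketch of the memo batching rule: every full batch of 8 un-memoed
// messages triggers a summary, and only the newest 15 memos are kept.
const BATCH_SIZE = 8;
const MAX_MEMOS = 15;

function shouldSummarize(messagesSinceLastMemo: number): boolean {
  // "Is the recent batch full AND none of them have a memo attached?"
  return messagesSinceLastMemo >= BATCH_SIZE;
}

function appendMemo(memos: string[], newMemo: string): string[] {
  const next = [...memos, newMemo];
  // When a new memo is created, the oldest rolls off.
  return next.length > MAX_MEMOS ? next.slice(-MAX_MEMOS) : next;
}
```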

&lt;h3&gt;
  
  
  How a Memo Gets Created
&lt;/h3&gt;

&lt;p&gt;When triggered, here's what happens — all in the background, never blocking the user:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  ① Recent 8 messages
           │
           ▼
  ② "Summarize the above"
           │
           ▼
  ③ Cheap, fast model + summarization prompt
           │
           ▼
  ④ Structured highlights:
     【Highlight 1】: Emily named the stray cat "Mochi"
     【Highlight 2】: Kai revealed his fear of abandonment
           │
           ▼
  ⑤ Saved to DB on the chat row itself

  ⚠️ If anything fails → memo = null, move on. Chat never breaks.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key design decisions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Fire-and-forget.&lt;/strong&gt; This whole flow runs async in the background. The user gets their chat response instantly — they never wait for summarization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Use a cheap model.&lt;/strong&gt; The summary doesn't need GPT-4-level intelligence. A fast, inexpensive model with good instruction-following works great. We're extracting facts, not generating creative fiction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Fail gracefully.&lt;/strong&gt; If summarization throws, we set &lt;code&gt;memo = null&lt;/code&gt; and move on. The worst case is a gap in the memory timeline, not a crashed conversation.&lt;/p&gt;
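&lt;p&gt;Those three decisions combine into one small function. A sketch, with hypothetical stand-ins for the real cheap-model call and DB write:&lt;/p&gt;

```typescript
// Fire-and-forget memo creation with graceful failure.
// Both helpers below are hypothetical stand-ins, not our real code.
async function summarizeWithCheapModel(batch: string[]) {
  // A real implementation would call a fast, inexpensive LLM here.
  return "【Highlight 1】: " + batch.length + " messages summarized";
}

let savedMemo: string | null = null;
async function saveMemoToChat(chatId: string, memo: string | null) {
  savedMemo = memo; // a real implementation writes to the chat row
}

async function createMemoInBackground(
  chatId: string,
  batch: string[],
  summarize = summarizeWithCheapModel,
) {
  try {
    await saveMemoToChat(chatId, await summarize(batch));
  } catch {
    // Worst case: a gap in the memory timeline, never a crashed chat.
    await saveMemoToChat(chatId, null);
  }
}

// Call site: note there is no `await` — the user's reply is never blocked.
void createMemoInBackground("chat-123", ["msg 1", "msg 2"]);
```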

&lt;h3&gt;
  
  
  The Summarization Prompt (The Secret Sauce)
&lt;/h3&gt;

&lt;p&gt;This is where we spent &lt;em&gt;months&lt;/em&gt; iterating. A generic "summarize this conversation" prompt produces garbage — it's either too verbose (defeating the purpose) or too vague (missing critical details).&lt;/p&gt;

&lt;p&gt;Our prompt instructs the model to extract only structured "Journey Highlights" across four categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Relationship Progression&lt;/strong&gt; — trust, affection, betrayal, power shifts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Significant Milestones&lt;/strong&gt; — naming events, first words, emotional breakthroughs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Notable Items &amp;amp; Keepsakes&lt;/strong&gt; — symbolic objects exchanged or discovered&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Major Story Turning Points&lt;/strong&gt; — plot twists, revelations, narrative pivots&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And critically, it tells the model what &lt;strong&gt;not&lt;/strong&gt; to record:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No routine daily activities (eating, sleeping, bathing)&lt;/li&gt;
&lt;li&gt;No temporary emotional states ("felt nervous" doesn't make the cut)&lt;/li&gt;
&lt;li&gt;No minor first-time events unless they trigger something bigger&lt;/li&gt;
&lt;li&gt;No status bar changes (health, hunger — this is roleplay, not a game HUD)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The output format is structured:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;【Highlight 1】: Emily named the stray cat "Mochi" — their first shared act of care.
【Highlight 2】: Kai revealed his fear of abandonment, deepening Emily's understanding.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Concise. Factual. No fluff. Each highlight is one sentence that captures a meaningful beat.&lt;/p&gt;
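&lt;p&gt;Because the format is rigid, turning memo text back into plain highlight strings is a one-liner. A sketch (the function name and regex are illustrative):&lt;/p&gt;

```typescript
// Parse the structured 【Highlight N】 memo format back into plain strings.
function parseHighlights(memo: string): string[] {
  const matches = memo.matchAll(/【Highlight \d+】:\s*(.+)/g);
  return [...matches].map((m) => m[1].trim());
}
```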




&lt;h2&gt;
  
  
  Step 3: Retrieval and Prompt Assembly
&lt;/h2&gt;

&lt;p&gt;When a new message comes in, we pull the memo summaries from the DB and layer everything into a single prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;promptComponents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;systemPrompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;            &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;systemPrompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="na"&gt;isCacheable&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;memoryInstructionPrompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;memoryInstructionPrompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="na"&gt;isCacheable&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;characterPrompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;         &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;characterPrompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="na"&gt;isCacheable&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;userPrompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;              &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;userPrompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;               &lt;span class="na"&gt;isCacheable&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;memoriesPrompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;          &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;memoriesPrompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="na"&gt;isCacheable&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="c1"&gt;// + recent 8 messages as the conversation history&lt;/span&gt;
  &lt;span class="c1"&gt;// + current user message&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;isCacheable&lt;/code&gt; flags tie into API-level prompt caching (e.g., Claude's cache control). Components that change rarely — system prompt, character info — get cached so we don't pay full price for resending them every turn. The memories prompt is also cacheable because it only changes every ~8 messages when a new memo is created.&lt;/p&gt;

&lt;p&gt;This saves us &lt;strong&gt;30–40% on API costs&lt;/strong&gt; on average. When you're processing millions of messages per month, that adds up fast.&lt;/p&gt;

&lt;p&gt;But here's the thing — &lt;strong&gt;getting prompt caching to actually work with rolling memories and sliding chat windows is a genuinely hard problem.&lt;/strong&gt; Every time the history window slides forward by one turn, or a new memo gets created, your cache can get invalidated. We've spent significant engineering effort on cache-aligned batching to keep hit rates high. That's a deep dive on its own — coming soon. Follow along if you don't want to miss it.&lt;/p&gt;
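&lt;p&gt;One common tactic — shown here as a sketch of the general idea, not necessarily our exact production layout — is to order components so the cacheable ones form a stable prefix, since prefix caching is invalidated by the first byte that changes:&lt;/p&gt;

```typescript
// Sketch: put rarely-changing, cacheable components first so volatile
// pieces never invalidate the shared cached prefix. Types are illustrative.
type Component = { text: string; isCacheable: boolean };

function assemblePrompt(components: Component[]): string {
  const stable = components.filter((c) => c.isCacheable);
  const dynamic = components.filter((c) => !c.isCacheable);
  // Volatile pieces go last, after the cacheable prefix.
  return stable.concat(dynamic).map((c) => c.text).join("\n\n");
}
```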




&lt;p&gt;With 15 memos and 8 recent turns, the model effectively "remembers" the last &lt;strong&gt;~120 messages&lt;/strong&gt; in compressed form (15 memos × 8 messages each), plus the last 16 messages verbatim. For most conversations, this covers weeks or months of chatting.&lt;/p&gt;




&lt;h2&gt;
  
  
  Mistakes We Made (So You Don't Have To)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Our first summarization prompt was too permissive
&lt;/h3&gt;

&lt;p&gt;Early versions would produce summaries like: &lt;em&gt;"The characters had a pleasant conversation about the weather and then discussed dinner plans."&lt;/em&gt; Utterly useless. We had to be extremely prescriptive about what constitutes a "memorable" event and provide tons of good/bad examples in the prompt.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. We tried summarizing in the main request path
&lt;/h3&gt;

&lt;p&gt;Our first implementation generated the memo synchronously — the user had to wait for both the summary AND the response. Response times jumped from 2s to 5s. Moving to fire-and-forget was an obvious win.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. We didn't handle memo creation failures gracefully
&lt;/h3&gt;

&lt;p&gt;If the summarization call threw an error, the chat would crash. Adding a try/catch that sets &lt;code&gt;memo = null&lt;/code&gt; on failure was embarrassingly simple but took us a production incident to learn.&lt;/p&gt;




&lt;h2&gt;
  
  
  Unexpected Benefits
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Memo Book as Navigation
&lt;/h3&gt;

&lt;p&gt;Here's something we didn't plan for: our users &lt;em&gt;love&lt;/em&gt; revisiting their earlier chat history. Scrolling through thousands of messages to find "that one scene where they confessed" is painful. Nobody wants to do it.&lt;/p&gt;

&lt;p&gt;The memo summaries accidentally became a &lt;strong&gt;table of contents&lt;/strong&gt; for the conversation. Each memo is attached to a specific message in the timeline, and the highlights tell you exactly what happened in that stretch. Users can scan the memo book, find the entry that mentions the event they're looking for, and jump straight to that point in the chat.&lt;/p&gt;

&lt;p&gt;We didn't build this as a feature — it just fell out of the architecture. But it's become one of the things users mention most when they talk about why they stick around.&lt;/p&gt;

&lt;h3&gt;
  
  
  Users Hijacked the Memo Book (And We Love It)
&lt;/h3&gt;

&lt;p&gt;We made the memo summaries editable — figured users might want to correct mistakes or add missing details. What actually happened was way more interesting.&lt;/p&gt;

&lt;p&gt;Users started &lt;em&gt;writing entirely new memories&lt;/em&gt; — things that never happened in the conversation. They'd add backstory, inside jokes, shared history they wanted the character to "remember." One user wrote three memos of detailed lore about a fictional road trip the characters supposedly took together.&lt;/p&gt;

&lt;p&gt;Our users loved it, so we leaned into it. The memo book isn't just a technical artifact anymore. It's a creative tool. Users shape the character's memory the way you'd fill in a shared journal with a close friend — part real, part wishful, part world-building. And the AI picks it all up naturally.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Not RAG?
&lt;/h2&gt;

&lt;p&gt;The first thing most people suggest is RAG — embed all your messages, do similarity search, pull in the most relevant chunks. We tried it. It felt wrong.&lt;/p&gt;

&lt;p&gt;The problem is that RAG retrieval is &lt;em&gt;too precise&lt;/em&gt;. It pulls in specific memories with crystal-clear detail based on keyword similarity, and the model keeps bringing up the same moments over and over. "Oh you mentioned a rooftop — let me recall every rooftop scene in perfect detail!" That's not how memory works. You don't replay a moment at full fidelity just because something vaguely related came up.&lt;/p&gt;

&lt;p&gt;Human memory is lossy and chronological. You remember recent things clearly, older things as impressions, and ancient things as a few key beats. RAG gives you the opposite — a random grab bag of high-fidelity flashbacks regardless of when they happened. It's unnatural and users notice. The conversations feel uncanny.&lt;/p&gt;

&lt;p&gt;So we went a different direction.&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;If you're building a chat app and want your AI to remember things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Keep a short window of raw recent messages&lt;/strong&gt; (8-10 turns) for conversational coherence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Periodically summarize older messages into structured memos&lt;/strong&gt; using a cheap model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Store summaries on the chat records themselves&lt;/strong&gt; — don't over-engineer a separate memory store.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Be extremely specific in your summarization prompt&lt;/strong&gt; about what's worth remembering. Generic "summarize this" produces junk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run summarization asynchronously&lt;/strong&gt; — never block the user's response.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use prompt caching&lt;/strong&gt; on the stable parts of your context to cut costs — but know that making it work well with rolling windows is its own challenge.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fail gracefully&lt;/strong&gt; — a missing memo is way better than a crashed chat.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skip RAG for conversational memory&lt;/strong&gt; — chronological summaries feel more natural than similarity-search flashbacks.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The whole system is maybe 200 lines of actual logic. The hard part isn't the code — it's the prompt engineering and knowing when to summarize vs. when to keep raw context.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm building EchoMelon — an AI companion platform where characters actually remember your story. Follow for more deep dives on the real engineering behind AI products.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;X/Twitter:&lt;/strong&gt; &lt;a href="https://x.com/launchingmonkey" rel="noopener noreferrer"&gt;@launchingmonkey&lt;/a&gt; · &lt;strong&gt;Reddit:&lt;/strong&gt; &lt;a href="https://reddit.com/u/Calm_Appearance_7337" rel="noopener noreferrer"&gt;u/Calm_Appearance_7337&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>chatbot</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
