Your app's AI assistant re-sends the whole conversation on every message — that's the bill

#aiagentmemory #webdev #fullstack #chatbot

The chat assistant you added to your app feels cheap in the demo and expensive in production. The reason is the same one that makes long chats “forget” the start of the conversation — and it is fixable.

Why the second page of a chat costs more than the first

When a user talks to the AI assistant in your app, each message is a fresh call to the model. To stay coherent, that call re-sends the system prompt and the entire conversation so far, then the new message. Message one is cheap. Message twenty re-sends the previous nineteen. So the cost of a single reply climbs as the conversation grows, and a busy support chat or a long planning session is exactly where it climbs fastest.
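To make the growth concrete, here is a minimal sketch of the input size of each call when the full history is replayed. The sizes (`system_tokens=200`, `tokens_per_message=80`) are illustrative assumptions, not measurements, and the model assumes every user message and assistant reply is the same length.

```python
def input_tokens_per_reply(turns, system_tokens=200, tokens_per_message=80):
    """Input tokens sent for each user message when the full history is replayed.

    Turn k re-sends the system prompt, the k-1 earlier user messages and
    their k-1 assistant replies, and then the new user message.
    The default sizes are made-up round numbers for illustration.
    """
    costs = []
    for k in range(1, turns + 1):
        history_messages = 2 * (k - 1)  # earlier user messages + assistant replies
        costs.append(system_tokens + (history_messages + 1) * tokens_per_message)
    return costs


costs = input_tokens_per_reply(20)
print("message 1: ", costs[0], "input tokens")    # 280
print("message 20:", costs[-1], "input tokens")   # 3320
print("session:   ", sum(costs), "input tokens")  # 36000
```

Under these assumptions the twentieth reply carries roughly twelve times the input of the first, and the session total is dominated by the later turns.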

It is also why long chats start dropping the earlier details: once the conversation outgrows the context window, something has to be cut, and the assistant quietly loses what the user told it ten minutes ago.
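The cut can be sketched as a simple "keep the newest messages that fit" rule. This is one common truncation strategy, shown as an assumption about how an app might trim history rather than how any particular assistant does it. `fit_to_window` and `rough_token_count` are hypothetical helpers, and the word-count tokenizer is a crude stand-in for a real one.

```python
def fit_to_window(system_prompt, history, new_message, max_tokens, count_tokens):
    """Keep the newest history messages that fit the budget; drop the oldest first."""
    budget = max_tokens - count_tokens(system_prompt) - count_tokens(new_message)
    kept = []
    for message in reversed(history):  # walk from newest to oldest
        cost = count_tokens(message)
        if cost > budget:
            break  # everything older than this point is silently dropped
        kept.append(message)
        budget -= cost
    kept.reverse()  # restore chronological order
    return [system_prompt, *kept, new_message]


def rough_token_count(text):
    # Crude stand-in for a real tokenizer: count whitespace-separated words.
    return len(text.split())


history = [
    "user: my order number is 48213",
    "assistant: thanks, I have noted the order",
    "user: it arrived damaged last week",
    "assistant: sorry to hear that, do you want a refund",
]
prompt = fit_to_window(
    "system: you are a support assistant",
    history,
    "user: yes, refund it please",
    max_tokens=30,
    count_tokens=rough_token_count,
)
print("\n".join(prompt))
```

With this small budget the two oldest messages fall out of the prompt, so the order number the user gave at the start is no longer visible to the model when it handles the refund.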

The dynamic, measured

Re-sending the whole transcript every turn makes total context spend grow far faster than the conversation itself: each reply re-sends everything before it, so the per-reply cost grows roughly linearly with the turn number and the running total grows roughly with its square. SAIHM measured it on a reproducible, offline benchmark and saw 62.8%–85.9% fewer context tokens across a session when the assistant recalls a compact memory instead of replaying the full history, with the gap widening the longer the session runs. The benchmark is open source and runs locally, so you can model your own chat lengths and see where the curve lands for your app.
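Why the gap widens with session length can be seen in a toy model. This is not the SAIHM benchmark and does not reproduce its figures. Every number below (`system_tokens`, `tokens_per_message`, `recalled_tokens`) is an invented assumption chosen only to show the shape of the curve.

```python
def session_input_tokens(turns, system_tokens, tokens_per_message, recalled_tokens=None):
    """Total input tokens over a session.

    recalled_tokens=None models full replay of the history on every turn;
    a number models a fixed-size recalled memory sent instead of the history.
    """
    total = 0
    for k in range(1, turns + 1):
        if recalled_tokens is None:
            context = 2 * (k - 1) * tokens_per_message  # all earlier messages
        else:
            context = recalled_tokens                   # compact recall, constant size
        total += system_tokens + context + tokens_per_message
    return total


def savings_percent(turns, system_tokens=200, tokens_per_message=80, recalled_tokens=300):
    """Percentage of input tokens saved by recall versus replay (toy numbers)."""
    replay = session_input_tokens(turns, system_tokens, tokens_per_message)
    recall = session_input_tokens(turns, system_tokens, tokens_per_message, recalled_tokens)
    return 100 * (replay - recall) / replay


for turns in (10, 20, 40):
    print(turns, "turns:", round(savings_percent(turns), 1), "% fewer input tokens")
```

In this model replay grows with the square of the turn count while recall grows linearly, so the saving increases as sessions get longer. For very short sessions the recalled memory can cost more than the history it replaces, which is why the real figures depend on session length and scenario.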

Recall the few facts a reply needs — per user

The fix is to stop re-sending the transcript. SAIHM keeps the durable facts of each user’s history (their preferences, their account context, the decisions made earlier in the thread) as separate memory cells. Each reply recalls only the handful it needs instead of replaying the whole chat, so a reply on message twenty costs about what it did on message two. The memory persists between sessions too, so a returning user is remembered without you stuffing their entire history back into the prompt. And because the store is addressable from any model (Claude, GPT, DeepSeek, Qwen, Kimi, GLM) and through LangChain or LlamaIndex, you can switch the model behind your feature without re-teaching it about your users.
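The idea of per-user cells with selective recall can be sketched in a few lines. This is a conceptual illustration only: `MemoryCell`, `MemoryStore`, `remember`, and `recall` are hypothetical names, not the SAIHM API, and the keyword-overlap scoring is a deliberately naive stand-in for whatever retrieval a real store uses.

```python
from dataclasses import dataclass


def _words(text):
    # Naive normalization: lowercase words with surrounding punctuation removed.
    return [w.strip(".,?!").lower() for w in text.split()]


@dataclass
class MemoryCell:
    user_id: str
    text: str
    keywords: frozenset


class MemoryStore:
    """Hypothetical in-memory store: one list of small fact cells per user."""

    def __init__(self):
        self._cells = {}  # user_id -> list of MemoryCell

    def remember(self, user_id, text):
        cell = MemoryCell(user_id, text, frozenset(_words(text)))
        self._cells.setdefault(user_id, []).append(cell)

    def recall(self, user_id, message, limit=3):
        """Return up to `limit` of this user's facts that share words with the message."""
        query = set(_words(message))
        scored = [(len(cell.keywords & query), cell) for cell in self._cells.get(user_id, [])]
        hits = [pair for pair in scored if pair[0] > 0]
        hits.sort(key=lambda pair: pair[0], reverse=True)
        return [cell.text for _, cell in hits[:limit]]


store = MemoryStore()
store.remember("u1", "Prefers email over phone for support contact")
store.remember("u1", "Account is on the annual billing plan")
store.remember("u1", "Decided to postpone the data migration until March")
store.remember("u2", "Account is on the monthly billing plan")

message = "Can you change my billing plan?"
facts = store.recall("u1", message)
prompt = "\n".join(["system: you are a support assistant", *facts, "user: " + message])
print(prompt)
```

Only the billing fact for user `u1` reaches the prompt. The other two cells and everything belonging to `u2` stay out, so the prompt size depends on how many facts are relevant, not on how long the user's history is.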

It is your users' data — so hold the keys to it

A per-user assistant memory is some of the most sensitive data your app holds: what each person asked, shared, and decided. With most hosted-memory products that history lives on a vendor’s servers under the vendor’s keys, which becomes your problem the moment a user invokes their right to be forgotten or your compliance team asks where that data sits. SAIHM keeps it yours: the memory is encrypted under keys you control, and erasure is per-record and provable. When a user asks you to delete their data, that user’s cells are cryptographically destroyed with an audit trail you can show, not merely flagged as hidden in a table you hope nobody queries. For a web app carrying real user data, that is the difference between a one-click compliance answer and an open risk.
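The general technique behind this kind of erasure is often called crypto-shredding: encrypt each record under its own key, and delete the key to make the record unreadable. The sketch below illustrates that idea with a hash-chained audit log. It is not SAIHM's implementation. `ErasableVault` is a hypothetical class, and the SHAKE-256 keystream is a toy cipher used only to keep the example standard-library-only. A real system would use a vetted authenticated cipher such as AES-GCM and a proper key store.

```python
import hashlib
import json
import secrets


def _keystream_xor(key, nonce, data):
    # Toy stream cipher for illustration only (assumption: replace with a
    # vetted AEAD such as AES-GCM in anything real).
    stream = hashlib.shake_256(key + nonce).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


class ErasableVault:
    """Hypothetical per-record encrypted store with key-destruction erasure."""

    def __init__(self):
        self._keys = {}      # record_id -> key, held by the app owner
        self._blobs = {}     # record_id -> (nonce, ciphertext)
        self.audit_log = []  # hash-chained erasure entries

    def put(self, record_id, plaintext):
        key, nonce = secrets.token_bytes(32), secrets.token_bytes(16)
        self._keys[record_id] = key
        self._blobs[record_id] = (nonce, _keystream_xor(key, nonce, plaintext.encode()))

    def get(self, record_id):
        key = self._keys[record_id]  # raises KeyError once the key is destroyed
        nonce, ciphertext = self._blobs[record_id]
        return _keystream_xor(key, nonce, ciphertext).decode()

    def erase(self, record_id):
        # Destroying the key leaves only unreadable ciphertext behind.
        del self._keys[record_id]
        previous = self.audit_log[-1]["entry_hash"] if self.audit_log else ""
        entry = {"event": "erased", "record_id": record_id, "previous_hash": previous}
        payload = json.dumps(entry, sort_keys=True).encode()
        entry["entry_hash"] = hashlib.sha256(payload).hexdigest()
        self.audit_log.append(entry)


vault = ErasableVault()
vault.put("user-42/cell-1", "Prefers email over phone")
print(vault.get("user-42/cell-1"))
vault.erase("user-42/cell-1")
print(vault.audit_log[0]["event"], vault.audit_log[0]["record_id"])
```

Each audit entry includes the hash of the previous one, so removing or altering an earlier entry breaks the chain. Note that `del` in Python does not securely wipe memory. The sketch shows the structure of the approach, not a hardened implementation.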

The honest close

SAIHM is a paid product, with no free tier — that is stated up front rather than buried behind a trial. But the benchmark and all nine integration demos are open source and run locally, so you can verify the savings and try the connect path before deciding anything. The tool surface and setup steps are at /developers; pricing is at /pricing.


— Architect

Independence notice. SAIHM is an Apache-2.0 protocol authored independently. The benchmark referenced here is open source and reproducible offline; the figures are produced by the published script and depend on session length and scenario. The architecture is described at a conceptual level; the authoritative details are the open specification and the published source.

Originally published at the SAIHM blog on 2026-06-30. SAIHM is the Sovereign AI Horizontal Memory protocol — Apache 2.0, open spec at saihm.coti.global.