Building AI side projects is fun until you have to pay for them.
I built Synapse, an AI companion with deep memory powered by a knowledge graph. My wife uses it daily for therapy, coaching, and reflection. The AI knows her life, her patterns, her goals, her emotional triggers. It remembers things across weeks and months.
Two weeks ago, I connected PostHog to track LLM costs. Here is what I saw:
$24 in two weeks. Four users. One of her sessions hit $2.42 for 28 messages. A single conversation.
I looked at the token breakdown and the problem was obvious. Every message sends roughly 30,000 tokens of system context. Her knowledge graph has grown rich after weeks of daily use. Entities, relationships, temporal facts, emotional patterns. All of it compiled into a structured text snapshot and injected into every single message.
And 80 to 90% of those tokens are the exact same compiled knowledge repeated on every single turn.
The memory quality is great. The cost structure is not. So I made two changes: I restructured how context is assembled, and I added an explicit cache layer using Gemini's CachedContent API. Together they cut the cost per message by more than half.
Quick Context: The Hybrid Memory Architecture
If you have been following this series, you know the backstory. If not, here is the short version. (The full technical deep dive is in Scaling AI Memory: How I Tamed a 120K Token Prompt with Deterministic GraphRAG)
Synapse uses a two-layer approach to give the AI long-term memory:
1. Base Compilation (Working Memory). When a session starts, Synapse Cortex compiles the knowledge graph into a structured text summary. Entities, relationships, temporal facts. The most connected nodes always make it in. A waterfill algorithm caps the budget at roughly 120,000 characters (~30K tokens). This is the "always-on" context.
2. GraphRAG (Episodic Recall). When the graph is too large for the budget, a second layer retrieves long-tail memories per-turn using hybrid search. It uses the graph UUIDs from the compilation metadata to avoid duplicating what is already in the base. Zero-latency, deterministic, no agent loops.
This works well for quality. The AI still feels like it knows everything about you. But the cost story has a gap: that 30K compilation is the same text for the entire session, and it gets billed as fresh input tokens on every single message.
In a 28-message session, that is 28 × 30K = 840K tokens just from the base knowledge. Almost all of it identical.
The Problem: One Blob, One Bill
Before this change, the context assembly on the Convex side (the frontend backend) looked like this:
```typescript
// prepareContext: before
let systemContent = session.cachedSystemPrompt;
systemContent += `\n\nCurrent date and time: ${currentDateTime}`;
if (userKnowledge) {
  systemContent += `\n\n${userKnowledge}`;
}
const apiMessages = [
  { role: "system", content: systemContent },
  ...conversationHistory,
];
```
One string. Persona prompt, datetime, and the entire 30K compilation concatenated together. Sent as a single system message on every request.
This design was simple and it worked fine when the graph was small. But it has two problems at scale:
You cannot cache part of a blob. AI providers (OpenAI, Google, Anthropic) offer automatic cache layers in theory. When you send the same text prefix multiple times, the provider may cache it behind the scenes and charge less. But this is unreliable. A small change anywhere in the prompt can break the matching pattern. The provider may choose not to cache for reasons you cannot see or control. You have zero visibility into whether it is working.
In my case, the systemContent blob included the current datetime on every message. That single line changing every turn was enough to break any automatic prefix matching, even though the other 25K tokens were identical.
The persona prompt (~500 tokens) is lightweight and rarely changes. The datetime changes every turn. The knowledge compilation (~25-30K tokens) is heavy but stable for the entire session. Treating them as one string means the lightweight parts are hostage to the heavy part.
Change #1: Splitting the Context Snapshot
The first change was structural. Instead of returning one systemContent string, prepareContext now returns three separate fields:
```typescript
// prepareContext: after
const systemInstruction =
  `${session.cachedSystemPrompt}\n\nCurrent date and time: ${currentDateTime}`;
const cacheName = knowledgeCache?.cacheName ?? undefined;

return {
  apiMessages,       // user/assistant turns only, no system message
  systemInstruction, // lightweight persona + datetime (~500 tokens)
  compilation,       // heavy knowledge (~25K tokens), stable per session
  cacheName,         // Gemini cache pointer (if available)
};
```
The HTTP layer sends these as separate JSON parameters to Cortex:
```typescript
body: JSON.stringify({
  model,
  system_instruction: systemInstruction,
  ...(compilation !== undefined && { compilation }),
  ...(cacheName !== undefined && { cache_name: cacheName }),
  messages: apiMessages,
})
```
This separation is what makes everything else possible. The compilation is now an independent unit that the server can handle differently from the volatile parts.
The Episodic Section
I also adjusted the compilation itself. I added a new section that summarizes the previous session. Instead of relying only on the graph's entity and relationship definitions, the model now gets a short episodic recap: "Last session you talked about X, explored Y, and mentioned Z."
This serves two purposes. First, it gives the model easy session continuity without loading raw message history. Second, it let me trim the budget for less-connected facts and concepts. The total max tokens dropped from ~30K to ~25K. That is ~5,000 fewer tokens per message before caching even enters the picture.
You can see the effect in the PostHog data. After April 10th, the average tokens per message for my wife's account dropped visibly.
Change #2: Gemini Explicit Cache
Here is where the real savings come from.
Gemini CachedContent: The Concept
Gemini offers an explicit caching API. You create a cache resource by uploading content to caches.create(). You get back a resource name like cachedContents/abc123. On subsequent requests, you pass that name and Gemini uses the cached content as a prefix instead of re-processing the input.
The economics: cached tokens cost roughly 75% less than regular input tokens. For a 25K token compilation, that means paying for about 6,250 tokens instead of 25,000. On every single turn.
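The arithmetic behind that claim is easy to sanity-check. A few lines, using only the numbers from above (the ~75% discount on cached input tokens is the one pricing fact assumed):

```python
# Illustrative sanity check of the cache economics.
COMPILATION_TOKENS = 25_000   # heavy knowledge compilation per message
CACHE_DISCOUNT = 0.75         # cached tokens cost ~75% less than fresh input
MESSAGES_PER_SESSION = 28     # the expensive session from the intro

# Effective billed tokens for the compilation on a cache hit.
effective_tokens = COMPILATION_TOKENS * (1 - CACHE_DISCOUNT)  # 6,250

# Compilation tokens billed across the whole session, before and after.
session_before = MESSAGES_PER_SESSION * COMPILATION_TOKENS    # 700,000
session_after = MESSAGES_PER_SESSION * effective_tokens       # 175,000
```

Note that this covers only the compilation; conversation history and output tokens are billed normally either way.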
Gemini's minimum cacheable size is around 1,024 tokens; I use 4,000 characters as a conservative threshold.
How I Integrated It
The cache lifecycle follows the existing Synapse pipeline. No new infrastructure. No new services. Just a new step after compilation:
When a session starts (hydration), Cortex compiles the knowledge and creates a Gemini cache from it. The cacheName is returned to the client alongside the compilation. The client stores both in user_knowledge_cache (Convex table).
During the session, the client sends cache_name and compilation on every chat request. If the cache is valid, Cortex passes it to Gemini via cached_content and the compilation is served from cache. If there is no cache, Cortex inlines the compilation into the prompt. Same result, different price.
When the session closes (ingestion), new messages are processed into the knowledge graph, a fresh compilation is generated, and a new cache is created. The cycle repeats.
The entire cache manager is about 100 lines of Python:
```python
from google.genai import types

MIN_CHARS_FOR_CACHE = 4_000  # ~1K tokens, Gemini's explicit-cache minimum


class CacheManager:
    async def create_compilation_cache(self, user_id, compilation_text):
        # Below the minimum cacheable size, caching is not worth it.
        if len(compilation_text) < MIN_CHARS_FOR_CACHE:
            return None, "compilation_too_small"
        cache = await self._client.aio.caches.create(
            model=self._model,
            config=types.CreateCachedContentConfig(
                display_name=f"compilation_{user_id}",
                system_instruction=compilation_text,
                ttl="3600s",
            ),
        )
        return cache.name, ""

    async def invalidate_by_name(self, cache_name):
        await self._client.aio.caches.delete(name=cache_name)

    async def refresh_ttl(self, cache_name):
        # Push expiration another hour into the future.
        await self._client.aio.caches.update(
            name=cache_name,
            config=types.UpdateCachedContentConfig(ttl="3600s"),
        )
```
The backend is stateless. It does not track which user owns which cache. The client persists the cacheName and forwards it. This keeps the architecture clean and avoids a new data store.
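To make the per-request decision concrete, here is a minimal sketch of the branch Cortex has to take on every chat request. The function name, parameters, and dict-shaped config are my own illustration, not the actual Cortex code:

```python
from typing import Optional

def build_generation_config(
    system_instruction: str,
    compilation: Optional[str],
    cache_name: Optional[str],
) -> dict:
    """Decide between the cached compilation and inlining it.

    A plain dict stands in for the SDK's config object so the branching
    logic is easy to see (and test) in isolation.
    """
    if cache_name:
        # Cache path: the heavy compilation already lives server-side in
        # the CachedContent resource; only the lightweight persona and
        # datetime text rides along with the request.
        return {
            "system_instruction": system_instruction,
            "cached_content": cache_name,
        }
    # Fallback path: no valid cache, so inline the compilation into the
    # system text. Same content, full input-token price.
    full_system = system_instruction
    if compilation:
        full_system += f"\n\n{compilation}"
    return {"system_instruction": full_system}
```

Because the client always sends both fields, this function never has to ask anyone which state the cache is in; it just uses what it was given.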
The TTL Problem
Gemini caches expire by wall clock, not by usage; the default TTL is 1 hour. So a user in a long conversation would hit expiration at exactly the 60-minute mark, regardless of how many messages they had sent.
My solution: after every successful cache hit, I spawn a fire-and-forget task that pushes the TTL forward by another hour. Active users never expire mid-session. The task runs in the background, does not block the response, and if it fails nothing breaks. The worst case is the cache expires and the fallback kicks in.
```python
if cache_hit and active_cache_name and cache_manager:
    _spawn_background(cache_manager.refresh_ttl(active_cache_name))
```
The Fallback: Engineering for the Unhappy Path
One thing I like about Gemini's implementation: if the cache is expired or not found, the request fails explicitly. It does not silently fall back to full-price tokens. It tells you. That gives you control to decide what to do next, whether that is retry with the full compilation or create a fresh cache.
But that also means a stale cache is a new way for the app to break. Caches expire by wall clock. They can get deleted upstream. The model in the cache might not match the model in the request. If I built a system that only works when the cache is hot, it would only be a matter of time before my wife woke me up at midnight asking why her precious app had stopped working.
The design principle is simple: the client always sends everything. Both the cache_name and the full compilation. The server decides which to use.
If the cache is valid, use it. 75% cheaper. If the cache is stale, fall back to inlining the compilation. Full price, but it works. The user never notices.
The Peek Pattern
There is a subtlety with the Gemini SDK. It is lazy. When you open a streaming request, the actual HTTP call does not happen until you pull the first chunk. That means cache errors do not surface when you create the stream. They surface when you iterate.
So I peek the first chunk inside a try/except:
```python
try:
    stream_iter, first_chunk = await _open_and_peek(active_cache_name)
except Exception as err:
    if use_cache and _looks_like_cache_error(err):
        # Cache is stale, invalidate it
        await cache_manager.invalidate_by_name(active_cache_name)
        # Rebuild with compilation inlined and retry
        gemini_contents = await self._build_contents(
            request, inline_compilation=True
        )
        stream_iter, first_chunk = await _open_and_peek(None)
    else:
        raise
```
I match against a list of known error strings: "cache expired", "cache not found", "does not match the model in the cached content", and a few more. If the error matches, I invalidate the stale cache, rebuild the contents with the compilation inlined, and retry the stream. All before any bytes reach the client.
Auto Re-Hydration
One more thing. When a fallback happens, I do not just serve the request and move on. The final SSE usage chunk includes cache_fallback_triggered: true. The Convex client watches for this flag and schedules a background re-hydration:
```typescript
if (usage?.cache_fallback_triggered) {
  await ctx.scheduler.runAfter(0, internal.cortex.hydrate, {
    userId, sessionId,
  });
}
```
This creates a fresh cache for the next message. So only one message per expiration window pays full price. Every subsequent message in that session gets the cache benefit again.
The Numbers
Here is what the data shows after both changes went live.
Token Reduction
The episodic restructuring (Change #1) dropped the average tokens per message from the ~40K range to the ~30K range. That is visible in the PostHog chart starting around April 10th.
Cost Per Generation
From the PostHog session tracking, here is the daily breakdown:
| Date | Generations | Total Cost | Cost/Generation |
|---|---|---|---|
| Apr 11 | 67 | $1.14 | $0.0170 |
| Apr 12 | 86 | $1.18 | $0.0137 |
| Apr 13 | 78 | $1.32 | $0.0169 |
| Apr 14 | 79 | $3.08 | $0.0390 |
| Apr 16 | 109 | $1.56 | $0.0143 |
| Apr 17 | 54 | $2.16 | $0.0399 |
| Apr 18 | 62 | $0.55 | $0.0088 |
April 18 is when the explicit cache was fully active. The cost per generation dropped to $0.0088, less than half the cost of a typical day.
The expensive days (Apr 14 and 17) are sessions with heavy ingestion: 28 and 39 generation calls respectively, including the Gemini calls for graph processing during the Sleep Cycle. Those costs include both the chat generation and the knowledge extraction, not just the user-facing messages.
The $2.42 Session Revisited
That 28-message session that cost $2.42? With caching active at the April 18 rate, the same conversation would cost roughly $0.25. That is almost a 90% reduction.
And the AI still has full access to the knowledge graph. The compilation is the same content. It is just served from cache instead of being re-tokenized on every turn.
Observability
Both changes are fully instrumented. On the Axiom side, every chat request logs cache attributes: cache.hit, cache.hit_ratio, cache.fallback_triggered, cache.skip_reason. On PostHog, I track cache_enabled, cache_hit, and cached_tokens per generation.
This means I can answer questions like: what percentage of requests hit the cache? How often does the fallback trigger? Is the TTL refresh working? No guessing.
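The aggregation behind those answers is trivial once the events carry the right attributes. A toy version over hypothetical event dicts (the field names mirror the attributes above; the event shape itself is invented):

```python
# Hypothetical per-request observability events.
events = [
    {"cache_hit": True,  "fallback_triggered": False},
    {"cache_hit": True,  "fallback_triggered": False},
    {"cache_hit": False, "fallback_triggered": True},
    {"cache_hit": True,  "fallback_triggered": False},
]

# What percentage of requests hit the cache?
hit_rate = sum(e["cache_hit"] for e in events) / len(events)

# How often does the fallback trigger?
fallback_rate = sum(e["fallback_triggered"] for e in events) / len(events)
```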
What is Next
The cache layer is live and working. But there are a few things I want to explore:
Tighter compilation budgets. Now that the episodic summary provides session continuity, I think I can trim the base compilation further. Maybe from 25K to 20K tokens. The GraphRAG layer already handles the long tail. Less base knowledge means a smaller cache, which means even cheaper turns.
Multi-model flexibility. Gemini caches are bound to a specific model. If I switch between Flash and Pro based on the task, each model needs its own cache. That is a limitation I have not solved yet.
Building AI solutions is fun. Paying for them is the part that makes you think harder. And thinking harder usually leads to a better architecture.
This is article #5 in the Synapse series. If you want the full backstory:
- The Synapse Story - What Synapse is and why I built it
- Beyond RAG - Knowledge graphs as the memory foundation
- Scaling AI Memory - The hybrid approach with Hydration V2 and GraphRAG
- Full Circle - Giving the knowledge graph a Notion interface via MCP
The code is open source: synapse-chat-ai (frontend) and synapse-cortex (backend).

