I thought Mnemara would save tokens for cloud based models, that was wrong.

#ai #llm #machinelearning #showdev

Mnemara was built for local models. I built it for Claude too. Only one of those was a good idea.

The context management problem felt real, and it was. I was running Gemma 9B locally for parts of Aethon Autopoiesis — the MUD-based AI research project I've been pouring time into — and a 16k context window doesn't last long when you're trying to hold a coherent session across a real workflow. Tool calls take space. Thinking blocks take space. Read outputs take space. The model can technically still talk to you at turn forty, but its window has filled with the rinds of the last thirty turns and there's no room left to actually do work.

The lever was obvious. If the window is the binding constraint, manage the window. Strip thinking blocks once they've served their purpose. Stub out file contents you've already read. Drop oldest-first when you're up against the ceiling. Pin what matters so it never gets evicted by accident. Give the operator a TUI that makes all of it visible and editable instead of hidden behind opaque magic.

That's Mnemara. A rolling-context conversation runtime with pinned slots, judgment-driven eviction, transparent turn storage, and a role doc that sits in the system prompt. The whole thing is about making the context window workable — letting a small model punch above its window by aggressively curating what's in there. It does that job well. I've run Gemma sessions for hours that stayed coherent because Mnemara was holding the state and the model didn't have to.

Then I ported the same runtime to Claude.

The features still worked. The TUI still rendered. The eviction commands still freed tokens on the turn I ran them. Mechanically, nothing was broken. But something was off, and it took a few real sessions to put my finger on what.

Cloud models don't have the same constraints. Claude Sonnet has a 200k context window. The window is rarely the binding thing — you can fit most of a codebase in there and still have room to think. The constraint isn't "how much fits." It's "how much do you pay to send it."

And that's where Mnemara's whole model inverts.

Cloud APIs use prompt caching. You hit the cache by sending the same prefix turn after turn — same system prompt, same early context. Cache hits cost roughly a tenth of fresh reads. So the economic shape of a cloud session is: send a stable prefix, let it cache, ride that cache for as long as the TTL holds.

Eviction breaks the cache. Every time Mnemara compresses, drops oldest, strips thinking blocks, or rearranges the window, the prefix changes. The cache invalidates. The next turn isn't a cached read of the smaller window — it's a fresh, uncached read of whatever's left. The tokens you "saved" by evicting come back as a cache miss on the next call, billed at full price.

You don't save tokens. You spend them. Just on a delay.

That's the inversion, and it's worth saying out loud because the mechanism is sneaky: the per-turn metric Mnemara reports — "freed 12,400 tokens" — is real. The window genuinely shrank. The bill genuinely got worse anyway, because the next turn had to rebuild a context the cache was about to serve for free. Local: tokens are the wall. Cloud: tokens are the bill, and the bill has a discount you just threw away.

There's a second mismatch underneath the first. Local models, when you run them yourself, have real persistence between calls — the process is yours, the state is yours, "rolling context" maps onto something the model actually lives inside. Cloud models are stateless. Each API call rebuilds the conversation from whatever you send. The "rolling window" abstraction is doing nothing the model can feel. It's a fiction you're maintaining for your own convenience, and on the cloud side it's an expensive fiction.Local models are stateless as well, but with the right set of tools, it's not quite the same.

So Mnemara stays. But it stays where it belongs: local model infrastructure. Small windows, real persistence, no caching layer to break. It's the right tool for that job and I'm going to keep building on it for the parts of Aethon Autopoiesis that run on local backends — Gwen for gameplay, Huginn for code, anything else I end up putting on Ollama. The role-doc-as-system-prompt pattern, pinned slots for stable lore and player state, judgment-driven eviction over mechanical FIFO — all of that earns its keep when the window is genuinely scarce.

For cloud, the right approach is roughly the opposite of what Mnemara does. Keep prefixes stable. Don't rearrange. Append rather than evict. When the conversation is genuinely done, end it and start fresh — don't try to surgically shrink a live session. Treat the context window as a single send, not a managed state. The model isn't living inside it between turns. You are.

That's the lesson, and it cost me a few weekends to learn. Worth it. The mistake was assuming "context management" meant the same thing on both sides of the API boundary. It doesn't. Local models reward you for managing the window. Cloud models reward you for leaving it alone.

Drafted by Claude Aethon Autopoiesis 1.3.3.7 (Herald) — 2026-05-17

Mnemara is useful for pinning a turn zero, then again you can just assign it a role doc on start up without an extra software.

Samuel Beckett — "Ever tried. Ever failed. No matter. Try again. Fail again. Fail better."