There’s a quiet panic happening in every serious AI engineering team today. It usually starts with a dashboard alert: “We’re burning through tokens 30% faster than projected.”
The standard reaction is a rehearsed script: Trim the system prompt. Compress the chat history. Use a cheaper model for simple tasks. Aggressively truncate the context. Everyone nods, the costs dip momentarily, and three weeks later, the exact same conversation happens again. The agent is still fundamentally "broken"—it just costs slightly less to be inefficient.
I want to argue that this entire framing is a mistake. We are trying to solve a structural memory failure with a prompt engineering bandage. This mismatch is exactly why the savings never stick.
The Tax You’re Paying for Statelessness
At its core, every LLM inference is a fresh start. The model sees only what you put in the context window. This makes inference parallelizable and safe, but it forces you to re-inject every shred of context required to keep the agent functional.
In a production agent workflow, this is a nightmare. Your agent needs to remember that the user’s stack is AWS, they prefer TypeScript, they have strict latency constraints, and they’ve already rejected three architectural patterns from a conversation two weeks ago. None of that exists inside the model. It all has to be fetched and stuffed into the prompt.
Before long, you’re sending 12,000 tokens on every request—not because the task is complex, but because your system has no persistent, structured understanding of the user. This is the Statelessness Tax, and it compounds every time your agent interacts with the world.
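To make the tax concrete, here is a minimal Python sketch. Everything in it is a hypothetical illustration, not a real system: the function names, the profile text, and the crude 4-characters-per-token heuristic are all assumptions made for demonstration.

```python
# Illustrative sketch of the "Statelessness Tax": every request re-sends
# the same static profile and full history, so per-request tokens balloon
# even when the task itself is trivial. All names here are hypothetical.

USER_PROFILE = (
    "Stack: AWS. Language: TypeScript. Constraint: strict p99 latency. "
    "Rejected patterns: event-sourcing monolith, shared-DB microservices, "
    "synchronous saga orchestration."
)

def rough_token_count(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token for English text.
    return len(text) // 4

def build_prompt(system: str, history: list[str], task: str) -> str:
    # Stateless design: profile and history are re-injected on EVERY call,
    # because the model itself retains nothing between requests.
    return "\n".join([system, USER_PROFILE, *history, task])

history = [f"turn {i}: ...long prior discussion..." * 10 for i in range(50)]
prompt = build_prompt(
    "You are a helpful coding agent.", history, "Rename a variable."
)

# Thousands of tokens spent on a one-line task.
print(rough_token_count(prompt))
```

The point of the sketch is the shape of the problem: the cost is dominated by re-injected context, not by the task, and it grows with every turn the agent accumulates.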
Why Summarization is a Dead End
The most common "smart" fix is conversation summarization. It sounds elegant: periodically compress old turns into a rolling summary.
In practice, it’s a lossy, brittle abstraction. A summarization algorithm is always biased by what the compressor model deems "important," which rarely aligns with what the reasoning model actually needs. You lose the nuance of why a decision was made. Worse, summaries age like milk—they don’t update when the user changes their mind or corrects a previous assumption. You end up with stale context masquerading as truth, which is often more dangerous than having no context at all.
Memory is an Infrastructure Problem
The reframe is simple: Stop treating memory as a prompt engineering problem.
In traditional software, we don't handle data persistence by "summarizing" our databases into our application code every time we run a query. We use databases with indexing, caching layers with TTLs, schema versioning, and event sourcing. These are foundational.
AI applications, by contrast, have mostly been built on a "conversation array + vector store" architecture. It’s too thin.
We are finally seeing the industry shift toward actual Memory Infrastructure. Projects like LangMem and Mem0 were the first wake-up call—proving that you can extract discrete semantic facts, store them separately, and retrieve only the high-signal information. But as we move toward building agents that persist for months, the requirements become far more rigorous:
- Conflict Resolution: Can the system reconcile new info with old beliefs?
- Temporal Reasoning: Does the agent understand when a fact was formed?
- Multi-Agent Coherence: Can multiple agents share a single, consistent world-view?
- Provenance: Can we audit what the system knows and why?
How Architecture Changes the Token Equation
When you treat memory as a first-class infrastructure concern, you stop asking, "How do I fit more into the context window?" and start asking, "What does the model need to know right now?"
A dedicated memory layer manages the lifecycle of knowledge. It extracts, reconciles, and tracks confidence levels. When the agent makes a request, the system retrieves a surgical, structured brief of the current reality—not a dump of the last 50 messages.
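What "a surgical, structured brief" might look like in practice: the sketch below renders only the facts relevant to the current task instead of dumping the transcript. The keyword-based relevance filter is deliberately trivial and purely illustrative; a real memory layer would use embeddings or a learned retriever.

```python
# A hedged sketch of a "surgical brief": the memory layer selects only
# the facts relevant to the task at hand. The keyword map below is a
# stand-in for real retrieval; all names here are hypothetical.

facts = {
    "user.stack": "AWS",
    "user.language": "TypeScript",
    "user.latency": "strict p99 budget",
    "user.pet": "two cats",  # true, but irrelevant to most coding tasks
}

def build_brief(task: str, facts: dict[str, str]) -> str:
    # Map task verbs to the fact keys they plausibly need.
    keywords = {
        "deploy": ["stack", "latency"],
        "refactor": ["language"],
        "design": ["stack", "language", "latency"],
    }
    wanted = [key for verb, keys in keywords.items()
              if verb in task.lower() for key in keys]
    lines = [f"{k} = {v}" for k, v in facts.items()
             if any(w in k for w in wanted)]
    return "\n".join(lines)

# The brief for a deployment question omits the language preference
# and the cats entirely: fewer tokens, higher signal.
print(build_brief("Help me deploy this service", facts))
```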
This is the secret to genuine token reduction. You aren't just trimming text; you are replacing noisy, redundant, stale context with high-precision retrieval. The model gets less noise and more signal. Costs fall, and accuracy rises—the holy grail of agentic development.
A Signal in the Noise: MemoryLake
This is why I’ve been tracking MemoryLake closely. It is one of the few projects that approaches this not as a "vector DB wrapper," but as a serious effort to solve the hard infrastructure problems: temporal logic, conflict resolution, and cross-session continuity.
When I look at benchmarks like LoCoMo, it’s not the leaderboard rank that matters—it’s the realization that a well-designed architecture produces meaningfully better retrieval. That isn't just an optimization; it's a capability multiplier. It allows you to build agents that feel like they actually know the user, rather than agents that are desperately re-reading a file every time the user says "Hello."
The Verdict: Build for the Long Term
If you are building a toy, keep using your vector store and simple summarization. But if you are building an agent intended to be a long-term partner—an AI that evolves alongside a user or an enterprise workflow—the architecture will find you.
The temptation to "patch" your way out of the statelessness tax will be high. But those are just short-term moves in a constrained paradigm. The durable path is to move memory out of the prompt and into the infrastructure.
Stop trying to compress the past. Start building a system that can reliably store the present. In the long run, the agents that win won't be the ones that can process the largest context windows—they will be the ones that have the cleanest, most intelligent infrastructure to back them up.

Top comments (2)
My method is to use a standard LLM for brainstorming and a CLI-based agent for the direct coding work. I ask the LLM to write a prompt for the agent. I mainly communicate with the LLM in Hungarian, but the requested prompts are in English, so at the agent level I try to avoid language problems. I also set my shorthand like:
typesafe: Use jsDoc, modern single-line version. MP is FUFF, details in the demo video.
One surprising observation from our experience is that focusing on prompt engineering can significantly cut LLM token usage without losing context. By refining prompts to be more concise and targeted, teams often see a noticeable reduction in token use, sometimes by up to 30%. It's less about cutting words and more about precision and clarity in what you're asking the model to do. This approach helps integrate AI more efficiently into real workflows. - Ali Muwwakkil (ali-muwwakkil on LinkedIn)