
Mehmet TURAÇ


The Context Window Lie

Everyone is chasing longer context windows. It's the metric on every benchmark sheet. 128k. 1M. Infinite.

But here's the truth most wrappers won't tell you: you don't want a bigger context window. You want better state management.

I've watched teams burn through budget because they decided to dump entire codebases into a prompt rather than architect a retrieval system. They treat the context window like a hard drive. It isn't. It's RAM. And it's expensive RAM.

Transformers have a memory problem. Not because they forget — they remember everything too well. Every token attends to every other token. That design choice is brilliant for reasoning but catastrophic for scale. The attention mechanism scales quadratically. Double the sequence length, quadruple the compute.
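To put numbers on the doubling claim, here is a back-of-the-envelope sketch in Python. The 4096 model dimension is an illustrative assumption, not any particular model's spec:

```python
def attention_score_flops(seq_len: int, d_model: int = 4096) -> float:
    """Rough multiply-add count for one layer's QK^T score matrix:
    an (N x d) by (d x N) matmul costs about 2 * N^2 * d FLOPs."""
    return 2.0 * seq_len**2 * d_model

for n in (1_000, 2_000, 4_000):
    print(f"{n:>5} tokens: {attention_score_flops(n):.1e} FLOPs")
# Each doubling of sequence length quadruples the cost:
# 8.2e+09 -> 3.3e+10 -> 1.3e+11
```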

This isn't theoretical. It hits your P&L. It hits your latency SLOs. It hits your VRAM limits.

The Quadratic Tax

When you run inference on a transformer, you maintain a KV cache. This cache stores the keys and values for every token processed so far, prompt and generation alike. As the conversation grows, the cache grows. Eventually, it doesn't fit on a single GPU. You shard it. You page it. You swap it.

And performance tanks.
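A quick sizing sketch shows why the cache outgrows the card. The parameters below are loosely modeled on a Llama-class model with grouped-query attention; treat every number as illustrative:

```python
def kv_cache_gb(seq_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                head_dim: int = 128, dtype_bytes: int = 2) -> float:
    """Keys + values, per layer, per KV head, per token, in bf16."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len / 1e9

for n in (8_000, 32_000, 128_000):
    print(f"{n:>7} tokens -> {kv_cache_gb(n):.1f} GB")
# 8k: ~1 GB, 32k: ~4 GB, 128k: ~17 GB -- per sequence, before weights.
```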

Most engineers treat this as an infrastructure problem to solve with more hardware. That's the wrong abstraction. You cannot throw H100s at an $O(N^2)$ problem forever. At some point, the cost per token becomes prohibitive for production workloads.

I see teams building agents that hold 50k tokens of conversation history just in case. They assume the model will "know" what to focus on. It does — via attention scores. But you paid for the attention calculation on every single token pair. You taxed yourself to death for data the model ultimately ignored.

Retrieval Augmented Generation (RAG) became the industry patch for this wound. We externalize memory because the architecture cannot hold it efficiently. We chunk documents. We embed them. We retrieve top-k. It's messy. It's brittle. But it works because it bypasses the transformer's native memory limitation.
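The whole pattern fits in a few lines. A toy sketch of the retrieval step; `embed` here is a deterministic stand-in, where a real pipeline would call an embedding model and query a vector store:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in embedding: a unit vector seeded by the text's hash."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

def retrieve_top_k(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Rank chunks by cosine similarity to the query, keep the best k."""
    q = embed(query)
    return sorted(chunks, key=lambda c: q @ embed(c), reverse=True)[:k]
```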

But RAG is a crutch. It's a workaround for an architectural bottleneck.

Linear State Spaces

There is a different way. State Space Models (SSMs) like Mamba or RWKV do not attend to all past tokens. They compress history into a fixed-size state vector.

The complexity is linear. $O(N)$.

This changes the economics entirely. Generating token 10,000 costs roughly the same as generating token 10. There is no growing KV cache; the state stays a fixed size. You can run these models on edge devices. You can run them on CPUs. The inference cost decouples from sequence length.
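Mechanically, the difference is a recurrence over a fixed-size state instead of attention over a growing cache. A heavily simplified sketch (real SSMs like Mamba use structured, input-dependent state matrices; this only shows the constant-memory shape of the computation):

```python
import numpy as np

d_state, d_in = 16, 4
rng = np.random.default_rng(0)
A = np.eye(d_state) * 0.95                # toy state-transition (decay) matrix
B = rng.standard_normal((d_state, d_in))  # toy input projection
C = rng.standard_normal((d_in, d_state))  # toy output projection

token_stream = rng.standard_normal((10_000, d_in))  # per-token embeddings

h = np.zeros(d_state)                 # the ENTIRE memory: fixed size
for x_t in token_stream:
    h = A @ h + B @ x_t               # O(1) work and memory per token
    y_t = C @ h                       # output depends only on current state
```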

For years, researchers thought RNNs were dead. Transformers killed them because RNNs couldn't parallelize training. You couldn't train fast. But SSMs brought back the recurrent idea with hardware-aware parallel algorithms. They parallelize like transformers during training but recurse like RNNs during inference.

This matters for production. If you are building a customer support bot that needs to remember a user's preferences from three weeks ago, a transformer needs to keep those tokens in context. An SSM just updates its state.

But don't pop the champagne yet. SSMs have weaknesses. They struggle with copying tasks. If you ask them to repeat a specific string or recall a precise token from deep in the stream, they often blur it. Transformers excel at content-based retrieval because attention is literally content-based retrieval.

The Hybrid Future

So we are not deleting transformers. We are muting them.

The industry is moving toward hybrid architectures: layers of attention mixed with layers of state space. You get the reasoning power of attention where it matters, usually in the middle layers, and the efficiency of SSMs for token mixing and long-range dependencies.
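In configuration terms, a hybrid is just an interleaving pattern. A hypothetical layout (the one-attention-layer-in-eight ratio is illustrative, not a recommendation):

```python
def build_stack(n_layers: int = 32, attn_every: int = 8) -> list[str]:
    """Mostly SSM blocks, with an attention block every attn_every layers."""
    return ["attention" if (i + 1) % attn_every == 0 else "ssm"
            for i in range(n_layers)]

print(build_stack())
# ['ssm', 'ssm', ..., 'attention', ...] -- attention at layers 8, 16, 24, 32
```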

This is where the engineering work happens now. It's not about prompting anymore. It's about model selection and architecture awareness.

If you are building a log analysis tool, do not use a 128k context transformer. Use a hybrid or a pure SSM. You need to scan long streams, not reason about nuance. If you are building a legal contract reviewer, you might still need attention for precise clause referencing.

Stop treating models as black boxes. Read the architecture papers. Know if your model uses RoPE, ALiBi, or no positional embeddings at all. These choices dictate how your system behaves at scale.
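For instance, RoPE encodes position by rotating query and key channels, while ALiBi skips embeddings entirely and adds a linear distance penalty to attention scores. A minimal numpy sketch of RoPE's rotate-half form:

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotary position embeddings (rotate-half variant).
    x: (seq_len, dim) queries or keys, dim even."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)     # per-channel frequencies
    angles = np.outer(np.arange(seq_len), freqs)  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```

Once queries and keys are dotted together, the rotation leaves only relative distance in the score, which is exactly what governs how such a model extrapolates past its trained length.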

Memory Is Not Context

We need to separate "memory" from "context". Context is what you feed the model right now. Memory is what the system retains over time.

Transformers conflate the two. To remember something, the model must hold it in context. This leads to bloated prompts and lazy engineering.

The next generation of AI platforms will treat memory as a distinct substrate. Vector stores are part of it, but so are state vectors. Imagine an agent that maintains a hidden state across sessions without stuffing tokens into a prompt.

This requires changes in how we serialize state. We can't just dump JSON into a prompt. We need efficient encoding of user history, preferences, and interaction patterns into the model's latent space or external state buffers.

Some teams are already experimenting with "infinite context" training, where models learn to compress their own history. Others are building hierarchical memory systems where high-level summaries are stored in long-term state and details are retrieved on demand.
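A hedged sketch of that hierarchical shape; every class and method name here is hypothetical, the point is the two tiers:

```python
class HierarchicalMemory:
    """Toy two-tier memory: compact summaries stay loaded, raw detail
    is fetched only on demand. All names are illustrative."""

    def __init__(self) -> None:
        self.summaries: list[str] = []     # long-term tier, always in context
        self.details: dict[int, str] = {}  # cold tier, retrieved when needed

    def remember(self, episode_id: int, raw: str, summary: str) -> None:
        self.details[episode_id] = raw
        self.summaries.append(f"[{episode_id}] {summary}")

    def context_block(self) -> str:
        """The only part that ever enters the prompt."""
        return "\n".join(self.summaries)

    def recall(self, episode_id: int) -> str:
        """Pulled in only when the agent decides it needs the raw episode."""
        return self.details.get(episode_id, "")
```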

It's early. But it's necessary.

The Economic Reality

Hype tells you AGI is coming. I tell you cost per token is the real barrier.

Venture capital loves benchmarks. Engineering loves margins. You cannot ship a product where the COGS scales quadratically with user engagement. It breaks unit economics.

I've seen demos where the agent reads every email you've ever sent to answer a question. It looks magical. Then you calculate the inference cost per query. It's $0.50. You charge $0.10. You go bankrupt.
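The arithmetic is trivially checkable. Prices below are hypothetical; real APIs bill input and output tokens at different rates:

```python
def cost_per_query(context_tokens: int, output_tokens: int,
                   usd_per_1k_tokens: float = 0.01) -> float:
    """Flat per-token pricing, purely illustrative."""
    return (context_tokens + output_tokens) / 1000 * usd_per_1k_tokens

print(cost_per_query(context_tokens=45_000, output_tokens=500))
# ~0.455: roughly $0.46 per query against a $0.10 price point
```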

Efficiency isn't just optimization. It's product viability.

The transition to linear-time architectures isn't about making models smarter. It's about making them cheaper. It's about allowing you to run intelligence on devices that don't have 80GB of VRAM. It's about reducing latency so your user doesn't stare at a streaming cursor for ten seconds.

What Comes Next

We will see more hybrids. Mamba-2 is already pushing this. MoE (Mixture of Experts) models are sparsifying the compute. Quantization is getting aggressive.

But the biggest shift is mental. Engineers need to stop assuming context is free. It isn't.

Design your systems assuming the context window is small. Force yourself to build retrieval. Force yourself to manage state. Then, when you get a larger window, it's a bonus — not the foundation.

The transformers we have today are incredible. But they are fuel-inefficient muscle cars. We need sedans. We need hybrids. We need engines that sip tokens instead of guzzling them.

If you are architecting a platform today, ask yourself: does this rely on attention over everything? If the answer is yes, you have a scalability cliff.

Find the state. Compress it. Cache it.

The future of AI engineering isn't bigger models. It's tighter systems.
