
Part 3: Why Transformers Still Forget

This is Part 3 and the final post in a three-part series on why long-context language models still struggle with memory.

In Part 1, we saw why increasing context length does not equal better memory.

In Part 2, we reframed sequence models as memory systems using the MIRAS perspective.

In this final post, we examine Titans, a concrete architecture that puts those memory principles into practice.


Why Titans Exists at All

Titans does not start by asking how to make attention cheaper or how to stretch context windows further. It starts from a more fundamental observation: short-term memory and long-term memory serve different purposes and should not be implemented by the same mechanism.

Attention excels at precise, short-range dependency modelling. It is flexible, expressive, and powerful, but expensive and fragile at scale. Long-term memory, on the other hand, must persist across long horizons, store abstractions rather than raw data, and selectively forget. Titans exists because trying to force attention to play both roles leads to unavoidable trade-offs.

Rather than replacing attention, Titans keeps it where it performs best and introduces a dedicated long-term memory module alongside it.


The Three Memory Components in Titans

At a high level, Titans separates memory into three distinct components.

The core model uses attention as short-term memory, operating over a limited window where precision matters most. This is where immediate reasoning and local dependency tracking happen.

The long-term memory module is implemented as a neural network rather than a fixed-size vector or matrix. Its role is to store information that should persist beyond the attention window. Crucially, this memory is not static; it can be updated as the model processes new data.

Finally, persistent memory captures task-level or global knowledge that does not change during inference. This allows the system to separate stable knowledge from context-specific learning.

This explicit separation is what allows Titans to scale memory without relying on unbounded attention.
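To make the separation concrete, here is a minimal PyTorch-style sketch of how the three components could sit side by side in one block. It is illustrative only: the class, layer sizes, and the way persistent tokens are prepended are my own simplifications, not the Titans implementation.

```python
import torch
import torch.nn as nn

class TitansStyleBlock(nn.Module):
    """Conceptual sketch of the three memory components described above."""

    def __init__(self, d_model: int, n_heads: int, n_persistent: int = 16):
        super().__init__()
        # Short-term memory: standard attention over a bounded window.
        self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Long-term memory: a small neural network whose weights can be
        # updated at test time (see the update sketch later in the post).
        self.long_term_memory = nn.Sequential(
            nn.Linear(d_model, d_model), nn.SiLU(), nn.Linear(d_model, d_model)
        )
        # Persistent memory: learnable tokens that stay fixed at inference.
        self.persistent = nn.Parameter(torch.randn(n_persistent, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Prepend persistent tokens so attention always sees global knowledge.
        batch = x.size(0)
        persistent = self.persistent.unsqueeze(0).expand(batch, -1, -1)
        context = torch.cat([persistent, x], dim=1)
        attn_out, _ = self.attention(context, context, context)
        # Query long-term memory with the current tokens (retrieval step).
        retrieved = self.long_term_memory(x)
        # Drop the persistent positions and combine the two memory paths.
        return attn_out[:, persistent.size(1):] + retrieved

# Usage: block = TitansStyleBlock(d_model=64, n_heads=4)
#        y = block(torch.randn(2, 10, 64))
```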


Figure: Conceptual diagram of the Titans architecture, showing how short-term attention, long-term neural memory, and persistent memory interact.


Learning During Inference: Test-Time Memory Updates

The most distinctive feature of Titans is that it allows memory updates during inference. Instead of freezing all learning at training time, Titans treats long-term memory as something that can evolve while the model is running.

This raises an immediate concern: how does the model avoid learning noise, contradictions, or irrelevant details?

Titans addresses this by introducing a surprise-driven update mechanism. Intuitively, the model measures how unexpected a new input is based on gradient signals. Information that produces little surprise is unlikely to be written to memory, while information that generates strong learning signals is more likely to be retained.

To stabilise this process, Titans incorporates momentum so that important information remains relevant across neighbouring tokens, and adaptive forgetting so memory does not grow without bound. Forgetting is not treated as a failure mode, but as a necessary control mechanism.
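A rough sketch helps make this concrete. The snippet below uses a simple matrix memory trained with a reconstruction loss, where the gradient plays the role of the surprise signal, momentum carries it across neighbouring tokens, and a decay term implements forgetting. The fixed lr, momentum, and forget scalars stand in for the data-dependent gates described in the paper, so treat this as a conceptual illustration rather than the actual Titans update.

```python
import torch

def memory_update_step(memory, key, value, momentum_state,
                       lr=0.1, momentum=0.9, forget=0.01):
    """One surprise-driven update of a simple matrix memory (illustrative)."""
    memory = memory.detach().requires_grad_(True)
    # Surprise: how badly the current memory reconstructs `value` from `key`.
    prediction = key @ memory
    loss = torch.nn.functional.mse_loss(prediction, value)
    grad, = torch.autograd.grad(loss, memory)

    # Momentum keeps recent surprise relevant across neighbouring tokens.
    momentum_state = momentum * momentum_state - lr * grad
    # Forgetting: decay old contents before writing the new update.
    new_memory = (1.0 - forget) * memory.detach() + momentum_state
    return new_memory, momentum_state

# Usage: a (d, d) matrix memory updated token by token at inference time.
d = 64
memory = torch.zeros(d, d)
momentum_state = torch.zeros(d, d)
key, value = torch.randn(1, d), torch.randn(1, d)
memory, momentum_state = memory_update_step(memory, key, value, momentum_state)
```

In the paper, these rates are described as data-dependent rather than fixed scalars, which is what makes the forgetting adaptive: unsurprising inputs barely touch the memory, while surprising ones are written in more strongly.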


How Titans Integrates Memory with Attention

Titans explores multiple ways of connecting long-term memory to the core attention mechanism, each with different trade-offs.

In Memory as Context (MAC), retrieved long-term memory is injected directly into the attention context, allowing attention to decide how much to use. This provides strong performance but increases the load on attention.

In Memory as Gate (MAG), long-term memory runs in parallel with attention, and a gating mechanism blends their outputs. This balances efficiency and expressiveness.

In Memory as Layer (MAL), memory is placed as a layer before attention. This simplifies integration but reduces interaction between short-term and long-term memory.

These variants make explicit that memory integration is a routing decision, not an afterthought.
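As a concrete example of the routing idea, here is a small sketch of the MAG-style combination, where a learned gate blends the parallel attention and memory outputs. The module names and the linear stand-in for the memory branch are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MemoryAsGate(nn.Module):
    """Sketch of MAG-style blending: attention and long-term memory run in
    parallel and a learned gate decides how much of each to keep."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.memory_branch = nn.Linear(d_model, d_model)  # stand-in for neural memory
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attention(x, x, x)   # short-term path
        mem_out = self.memory_branch(x)         # long-term path, in parallel
        # Gate decides, per token and channel, how to mix the two paths.
        g = torch.sigmoid(self.gate(torch.cat([attn_out, mem_out], dim=-1)))
        return g * attn_out + (1.0 - g) * mem_out
```

MAC would instead concatenate retrieved memory into the attention context, and MAL would stack the memory branch as a layer before attention; the gating version sits between the two in terms of how tightly the paths interact.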


Figure: Side-by-side comparison of the MAC, MAG, and MAL memory integration strategies in Titans. Source: Google Research, Titans paper (arXiv:2501.00663).


Scalability and What the Results Actually Show

Titans demonstrates that this memory-first design can scale to extremely long contexts, with experiments extending beyond two million tokens. Importantly, performance does not collapse as context grows. On retrieval-heavy benchmarks, Titans maintains strong accuracy where attention-only models degrade.

The key takeaway is not the exact numbers, but the trend: explicit long-term memory changes how scaling behaves. Instead of paying quadratic costs or compressing aggressively, Titans keeps attention bounded and relies on memory to carry forward what matters.

This is a qualitative shift in how long-context modelling is approached.


Figure: Retrieval accuracy versus context length for Titans and attention-based baselines.


Practical Trade-offs and Open Constraints

Titans is not a free win. Allowing memory to update during inference introduces additional computation and system complexity. Serving such models requires careful monitoring, memory management, and safeguards to prevent drift.

There are also open questions around evaluation. Measuring “true” long-term memory usage in realistic settings is difficult, and synthetic benchmarks can overemphasise recall patterns that do not always transfer cleanly to real-world workloads.

Titans makes these challenges explicit rather than hiding them behind larger context windows.


What Titans Teaches Us Beyond This Architecture

Even if Titans itself is not the final answer, it highlights several durable lessons.

First, memory should be treated as a first-class system, not a side effect of attention or recurrence. Second, forgetting must be controlled and intentional, not incidental. Third, long-context performance improves when models learn what to store, not just what to attend to.

These insights generalise beyond Titans and point toward a broader shift in how sequence models are designed.


Conclusion: From Context Scaling to Memory Design

This series began by questioning the assumption that more context equals better memory. We then reframed sequence models as memory systems with explicit design choices. Titans provides a concrete example of what happens when those choices are made deliberately.

The future of long-context AI systems is unlikely to be defined by ever-larger windows alone. It will be defined by how memory is structured, updated, and forgotten over time.

That shift from context scaling to memory design is the real contribution of the Titans line of work.
