Part 2: Why Transformers Still Forget

This is Part 2 of a three-part series on why long-context language models still struggle with memory.

In Part 1, we saw why increasing context length does not solve the memory problem.

Here, we introduce a memory-centric way of thinking that explains why models remember, forget, or fail under long context.


Why Architectural Labels Stop Being Useful

Most discussions about sequence models revolve around architectural families: Transformers, RNNs, state-space models, linear attention, and so on. While these labels are useful historically, they often hide the real reasons models behave the way they do. Two models with very different architectures can fail for the same reason, while two seemingly similar models can behave very differently under long context.

The MIRAS perspective starts from a simple shift: instead of asking "what architecture is this?", it asks "what kind of memory system is this model implementing?" Once you adopt that lens, many long-context failures stop looking mysterious and start looking inevitable.


Memory as a System, Not a Side Effect

At a high level, any system that processes sequences over time must answer four questions, whether explicitly or implicitly (a minimal interface sketch follows the list):

  1. How does information get written into memory?
  2. How is information retrieved later?
  3. What gets forgotten, and when?
  4. How is memory updated as new data arrives?
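
To make the four questions concrete, here is a minimal sketch of what an explicit memory interface could look like. The class and method names are illustrative assumptions, not an API from the MIRAS paper or any library.

```python
# Hypothetical interface: the four questions written down as methods.
from abc import ABC, abstractmethod

import torch


class SequenceMemory(ABC):
    @abstractmethod
    def write(self, x: torch.Tensor) -> None:
        """1. How does information get written into memory?"""

    @abstractmethod
    def read(self, query: torch.Tensor) -> torch.Tensor:
        """2. How is information retrieved later?"""

    @abstractmethod
    def forget(self) -> None:
        """3. What gets forgotten, and when?"""

    @abstractmethod
    def update(self, feedback: torch.Tensor) -> None:
        """4. How is memory updated as new data arrives?"""
```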

Traditional models answer these questions indirectly. Recurrent models write by compressing history into a hidden state and read by exposing that state at the next step. Transformers write by appending tokens into the context and read by attending over them. Forgetting happens automatically when context limits are exceeded or when compression loses detail.
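
As a sketch of those two implicit strategies (with illustrative names and shapes, not code from any particular model): a recurrent memory compresses everything into one state, while a windowed memory appends until it overflows.

```python
# Hypothetical sketch of the two implicit answers described above.
import torch


class RecurrentMemory:
    """Writes by compressing all history into one fixed-size hidden state."""

    def __init__(self, dim: int):
        self.state = torch.zeros(dim)
        self.mix = torch.nn.Linear(2 * dim, dim)

    def write(self, x: torch.Tensor) -> None:
        # Detail is lost here as a side effect of compression, not by choice.
        self.state = torch.tanh(self.mix(torch.cat([self.state, x])))

    def read(self) -> torch.Tensor:
        return self.state


class WindowMemory:
    """Writes by appending tokens; forgets by overflowing the window."""

    def __init__(self, max_len: int):
        self.max_len = max_len
        self.tokens: list[torch.Tensor] = []

    def write(self, x: torch.Tensor) -> None:
        self.tokens.append(x)
        if len(self.tokens) > self.max_len:
            self.tokens.pop(0)  # the oldest token simply disappears

    def read(self, query: torch.Tensor) -> torch.Tensor:
        keys = torch.stack(self.tokens)               # (n, dim)
        weights = torch.softmax(keys @ query, dim=0)  # attend over the window
        return weights @ keys
```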

MIRAS makes these mechanisms explicit and treats them as design choices, not side effects.


The Four MIRAS Design Knobs

MIRAS (Memory-Informed Recurrent Associative Systems) characterizes sequence models using four core components. These are not tied to any single architecture; they describe how memory behaves.

The first is memory structure. This defines what form memory takes. It might be a vector, a matrix, or a more expressive neural network. Fixed-size structures force compression, while richer structures allow selective retention.
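
For intuition, three memory structures of increasing expressiveness might look like this; the sizes are arbitrary assumptions for illustration.

```python
# Illustrative memory structures of increasing capacity.
import torch

vector_memory = torch.zeros(512)        # one state vector: heavy compression
matrix_memory = torch.zeros(512, 512)   # key-value matrix: more room, still fixed
mlp_memory = torch.nn.Sequential(       # memory as a small network whose weights
    torch.nn.Linear(512, 2048),         # store associations, allowing more
    torch.nn.SiLU(),                    # selective retention
    torch.nn.Linear(2048, 512),
)
```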

The second is attentional bias. This defines what the model considers relevant. In Transformers, this is typically dot-product similarity. MIRAS highlights that this choice strongly influences what gets retrieved and what gets ignored, especially in noisy or long sequences.
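
As a sketch, the dot-product bias defines relevance purely as scaled similarity between a query and the stored keys (illustrative code, not any particular implementation):

```python
# Dot-product attentional bias: relevance = scaled similarity.
import torch

def dot_product_read(query, keys, values):
    # query: (d,), keys and values: (n, d)
    scores = keys @ query / keys.shape[-1] ** 0.5  # similarity decides relevance
    weights = torch.softmax(scores, dim=0)
    return weights @ values  # noisy-but-similar entries still win retrieval
```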

The third is the retention or forgetting mechanism. Forgetting is not a flaw; it is a necessity. The question is whether forgetting is controlled and adaptive, or implicit and uncontrolled. Many models forget simply because they have no choice.
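
A controlled, adaptive version of forgetting can be as simple as an input-dependent retention gate; the formula below illustrates the idea and is not the MIRAS mechanism itself.

```python
# Gated retention: forgetting becomes an explicit, learnable decision.
import torch

def gated_retention(state, new_info, gate_logit):
    alpha = torch.sigmoid(gate_logit)  # near 1: keep the old state; near 0: overwrite
    return alpha * state + (1 - alpha) * new_info
```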

The fourth is the memory update rule. This determines how memory changes over time. Some models update memory only during training. Others allow memory to update during inference in a controlled way.
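
To contrast the two regimes, here is a hedged sketch for a d x d matrix memory: a frozen rule leaves memory untouched at inference, while a test-time rule nudges the memory toward storing a new key-value association. The squared-error step is our illustrative choice, not the formulation from the paper.

```python
# Two memory update rules, sketched for a d x d matrix memory.
import torch

def frozen_update(memory, key, value):
    return memory  # memory only changes through training

def test_time_update(memory, key, value, lr=0.1):
    # One gradient-style step so that memory @ key moves toward value.
    error = memory @ key - value          # how badly the pair is stored
    return memory - lr * torch.outer(error, key)
```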

Illustration: the MIRAS framework as a control panel with four dimensions: memory structure, attentional bias, retention, and update rule.


Reinterpreting Familiar Models Through MIRAS

When you view common architectures through the MIRAS lens, their strengths and weaknesses become clearer.

Transformers use a rich memory structure (the full context window) and a strong attentional bias (similarity-based attention). However, their retention mechanism is crude: once the window is full, older information disappears entirely. Their memory update rule is static during inference.

Linear attention and state-space models modify their structure and update rules to achieve efficiency, but they often rely on aggressive compression. This explains why they scale well but struggle with precise recall over very long sequences.
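
A minimal sketch of why this happens: in (unnormalised) linear attention, every key-value pair is folded into one fixed-size matrix, so recalling an old pair precisely depends on how much later writes have interfered with it.

```python
# Fixed-size state behind (unnormalised) linear attention.
import torch

def linear_attention_step(state, k, v, q):
    state = state + torch.outer(v, k)  # write: compress the pair into the state
    out = state @ q                    # read: approximate recall for the query
    return state, out
```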

The key insight is that these trade-offs are not accidental. They follow directly from the memory design choices each model makes.


Why Loss Functions and Objectives Matter

One subtle but important point in MIRAS is that memory behaviour is influenced not only by architecture, but also by the objective being optimised. Many models rely heavily on mean-squared-error-like objectives or similarity-based losses. These can be sensitive to noise and outliers, which in turn affects what memory updates are emphasised.

MIRAS uses this observation to motivate alternative formulations that change how relevance and stability are defined. The result is not just better robustness, but more predictable memory behaviour under long and noisy inputs.
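
As one illustration (Huber is a standard robust alternative used here for contrast, not necessarily the MIRAS objective): under squared error a single outlier dominates the memory update, while a robust loss bounds its influence.

```python
# The gradient of the objective determines the size of the memory update.
import torch

def update_signal(error, delta=1.0):
    mse_grad = error                                # grows without bound
    huber_grad = torch.clamp(error, -delta, delta)  # outlier influence is capped
    return mse_grad, huber_grad

print(update_signal(torch.tensor(0.5)))    # ordinary error: nearly identical
print(update_signal(torch.tensor(50.0)))   # outlier: the MSE update is 50x larger
```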

This reinforces the central idea: memory is not just where information is stored, but how learning signals shape what is kept.


Why This Framework Matters Before Talking About Titans

Without a framework like MIRAS, Titans can look like a collection of clever tricks: test-time updates, surprise signals, adaptive forgetting. With MIRAS, those choices become legible. They are answers to explicit memory-design questions rather than ad-hoc optimisations.

Part 1 showed that attention alone cannot serve as long-term memory. Part 2 explains why most existing alternatives still fall short. Only after this framing does it make sense to examine Titans as a concrete instantiation of a different memory system.


What to Watch for in Real Applications

If you apply the MIRAS lens to real systems, patterns emerge quickly. Models fail when the memory structure is too rigid, when retention is uncontrolled, or when update rules are frozen despite changing inputs. Conversely, systems become more robust when memory design is intentional and aligned with task requirements.

This perspective is especially relevant for agents, streaming data, long-running processes, and any application where the model must operate continuously rather than in isolated prompts.


Looking Ahead to Part 3

Part 2 sets the conceptual groundwork. In Part 3, we will look closely at the Titans architecture and see how it instantiates these memory principles in practice. We will examine how long-term memory is represented, how it updates during inference, and how forgetting is managed to keep the system stable.
