<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pranava Kailash Subramaniam Prema</title>
    <description>The latest articles on DEV Community by Pranava Kailash Subramaniam Prema (@harvesh_kumar).</description>
    <link>https://dev.to/harvesh_kumar</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3645867%2Fefd8bc76-2e85-4f38-85a4-4f90574ded95.jpg</url>
      <title>DEV Community: Pranava Kailash Subramaniam Prema</title>
      <link>https://dev.to/harvesh_kumar</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/harvesh_kumar"/>
    <language>en</language>
    <item>
      <title>Part 3: Why Transformers Still Forget</title>
      <dc:creator>Pranava Kailash Subramaniam Prema</dc:creator>
      <pubDate>Mon, 05 Jan 2026 09:46:04 +0000</pubDate>
      <link>https://dev.to/harvesh_kumar/part-3-why-transformers-still-forget-1mm3</link>
      <guid>https://dev.to/harvesh_kumar/part-3-why-transformers-still-forget-1mm3</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This is Part 3 and the final post in a three-part series on why long-context language models still struggle with memory.&lt;br&gt;&lt;br&gt;
In &lt;a href="https://forem.com/harvesh_kumar/part-1-long-context-memory-on-why-transformers-still-forget-1am0" rel="noopener noreferrer"&gt;Part 1&lt;/a&gt;, we saw why increasing context length does not equal better memory.&lt;br&gt;&lt;br&gt;
In &lt;a href="https://forem.com/harvesh_kumar/part-2-why-transformers-still-forget-17od" rel="noopener noreferrer"&gt;Part 2&lt;/a&gt;, we reframed sequence models as memory systems using the MIRAS perspective.&lt;br&gt;&lt;br&gt;
In this final post, we examine &lt;strong&gt;Titans&lt;/strong&gt;, a concrete architecture that puts those memory principles into practice.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Why Titans Exist at All
&lt;/h2&gt;

&lt;p&gt;Titans does not start by asking how to make attention cheaper or how to stretch context windows further. It starts from a more fundamental observation: &lt;strong&gt;short-term memory and long-term memory serve different purposes and should not be implemented by the same mechanism&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Attention excels at precise, short-range dependency modelling. It is flexible, expressive, and powerful, but expensive and fragile at scale. Long-term memory, on the other hand, must persist across long horizons, store abstractions rather than raw data, and selectively forget. Titans exist because trying to force attention to play both roles leads to unavoidable trade-offs.&lt;/p&gt;

&lt;p&gt;Rather than replacing attention, Titans keeps it where it performs best and introduces a dedicated long-term memory module alongside it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three Memory Components in Titans
&lt;/h2&gt;

&lt;p&gt;At a high level, Titans separates memory into three distinct components.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;core model&lt;/strong&gt; utilises attention as a form of short-term memory, operating within a limited window where precision matters most. This is where immediate reasoning and local dependency tracking happen.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;long-term memory module&lt;/strong&gt; is implemented as a neural network rather than a fixed-size vector or matrix. Its role is to store information that should persist beyond the attention window. Crucially, this memory is not static; it can be updated as the model processes new data.&lt;/p&gt;

&lt;p&gt;Finally, &lt;strong&gt;persistent memory&lt;/strong&gt; captures task-level or global knowledge that does not change during inference. This allows the system to separate stable knowledge from context-specific learning.&lt;/p&gt;

&lt;p&gt;This explicit separation is what allows Titans to scale memory without relying on unbounded attention.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnd37k2ntku4a7it5pbrp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnd37k2ntku4a7it5pbrp.png" alt="Titans architecture showing short-term attention, long-term neural memory, and persistent memory" width="800" height="436"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Conceptual diagram illustrating how short-term attention and long-term memory interact in Titans.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Learning During Inference: Test-Time Memory Updates
&lt;/h2&gt;

&lt;p&gt;The most distinctive feature of Titans is that it allows &lt;strong&gt;memory updates during inference&lt;/strong&gt;. Instead of freezing all learning at training time, Titans treats long-term memory as something that can evolve while the model is running.&lt;/p&gt;

&lt;p&gt;This raises an immediate concern: how does the model avoid learning noise, contradictions, or irrelevant details?&lt;/p&gt;

&lt;p&gt;Titans addresses this by introducing a &lt;strong&gt;surprise-driven update mechanism&lt;/strong&gt;. Intuitively, the model measures how unexpected a new input is based on gradient signals. Information that produces little surprise is unlikely to be written to memory, while information that generates strong learning signals is more likely to be retained.&lt;/p&gt;

&lt;p&gt;To stabilise this process, Titans incorporates momentum so that important information remains relevant across neighbouring tokens, and adaptive forgetting so memory does not grow without bound. Forgetting is not treated as a failure mode, but as a necessary control mechanism.&lt;/p&gt;
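&lt;p&gt;The shape of such an update rule can be sketched in a few lines. This is a minimal illustration of the idea, not the paper's exact parameterisation: the coefficient names and values below are assumptions chosen for readability.&lt;/p&gt;

```python
import numpy as np

def surprise_update(memory, surprise_state, grad, eta=0.9, theta=0.1, alpha=0.05):
    """One simplified Titans-style memory write.

    `grad` is the gradient of the memory's loss on the new input: a
    large gradient means the input was surprising. `eta` carries
    surprise momentum across neighbouring tokens, `theta` scales the
    instantaneous surprise, and `alpha` is the adaptive forgetting
    (decay) factor that keeps memory bounded.
    """
    surprise_state = eta * surprise_state - theta * grad   # momentum + new surprise
    memory = (1.0 - alpha) * memory + surprise_state       # decay, then write
    return memory, surprise_state

mem = np.zeros(4)
s = np.zeros(4)
# An unsurprising input (tiny gradient) barely touches memory...
mem, s = surprise_update(mem, s, grad=np.array([0.01, 0.0, 0.0, 0.0]))
# ...while a surprising input (large gradient) writes strongly.
mem, s = surprise_update(mem, s, grad=np.array([5.0, 0.0, 0.0, 0.0]))
```

&lt;p&gt;The coefficients here are placeholders; the point is the structure: momentum smooths surprise across neighbouring tokens, and the decay term makes forgetting an explicit, tunable control rather than an accident.&lt;/p&gt;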




&lt;h2&gt;
  
  
  How Titans Integrates Memory with Attention
&lt;/h2&gt;

&lt;p&gt;Titans explores multiple ways of connecting long-term memory to the core attention mechanism, each with different trade-offs.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;Memory as Context (MAC)&lt;/strong&gt;, retrieved long-term memory is injected directly into the attention context, allowing attention to decide how much to use. This provides strong performance but increases the load on attention.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;Memory as Gate (MAG)&lt;/strong&gt;, long-term memory runs in parallel with attention, and a gating mechanism blends their outputs. This balances efficiency and expressiveness.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;Memory as Layer (MAL)&lt;/strong&gt;, memory is placed as a layer before attention. This simplifies integration but reduces interaction between short-term and long-term memory.&lt;/p&gt;

&lt;p&gt;These variants make explicit that memory integration is a routing decision, not an afterthought.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1jusgj852ldxsr9pqdbe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1jusgj852ldxsr9pqdbe.png" alt="Comparison of MAC, MAG, and MAL memory integration strategies" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Side-by-side diagram showing different paths for memory integration in Titans. Source: Google Research, Titans paper (arXiv:2501.00663)&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Scalability and What the Results Actually Show
&lt;/h2&gt;

&lt;p&gt;Titans demonstrates that this memory-first design can scale to &lt;strong&gt;extremely long contexts&lt;/strong&gt;, with experiments extending beyond two million tokens. Importantly, performance does not collapse as context grows. On retrieval-heavy benchmarks, Titans maintains strong accuracy where attention-only models degrade.&lt;/p&gt;

&lt;p&gt;The key takeaway is not the exact numbers, but the trend: &lt;strong&gt;explicit long-term memory changes how scaling behaves&lt;/strong&gt;. Instead of paying quadratic costs or compressing aggressively, Titans keeps attention bounded and relies on memory to carry forward what matters.&lt;/p&gt;

&lt;p&gt;This is a qualitative shift in how long-context modelling is approached.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1eb70eymnqo3pv18uvvs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1eb70eymnqo3pv18uvvs.png" alt="Retrieval accuracy versus context length for Titans and attention-based baselines" width="800" height="436"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Graph illustrating retrieval performance as context length increases.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Trade-offs and Open Constraints
&lt;/h2&gt;

&lt;p&gt;Titans is not a free win. Allowing memory to update during inference introduces additional computation and system complexity. Serving such models requires careful monitoring, memory management, and safeguards to prevent drift.&lt;/p&gt;

&lt;p&gt;There are also open questions around evaluation. Measuring “true” long-term memory usage in realistic settings is difficult, and synthetic benchmarks can overemphasise recall patterns that do not always transfer cleanly to real-world workloads.&lt;/p&gt;

&lt;p&gt;Titans makes these challenges explicit rather than hiding them behind larger context windows.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Titans Teaches Us Beyond This Architecture
&lt;/h2&gt;

&lt;p&gt;Even if Titans itself is not the final answer, it highlights several durable lessons.&lt;/p&gt;

&lt;p&gt;First, &lt;strong&gt;memory should be treated as a first-class system&lt;/strong&gt;, not a side effect of attention or recurrence. Second, forgetting must be controlled and intentional, not incidental. Third, long-context performance improves when models learn what to store, not just what to attend to.&lt;/p&gt;

&lt;p&gt;These insights generalise beyond Titans and point toward a broader shift in how sequence models are designed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion: From Context Scaling to Memory Design
&lt;/h2&gt;

&lt;p&gt;This series began by questioning the assumption that more context equals better memory. We then reframed sequence models as memory systems with explicit design choices. Titans provides a concrete example of what happens when those choices are made deliberately.&lt;/p&gt;

&lt;p&gt;The future of long-context AI systems is unlikely to be defined by ever-larger windows alone. It will be defined by &lt;strong&gt;how memory is structured, updated, and forgotten over time&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That shift from context scaling to memory design is the real contribution of the Titans line of work.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>llm</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Part 2: Why Transformers Still Forget</title>
      <dc:creator>Pranava Kailash Subramaniam Prema</dc:creator>
      <pubDate>Sun, 28 Dec 2025 21:05:38 +0000</pubDate>
      <link>https://dev.to/harvesh_kumar/part-2-why-transformers-still-forget-17od</link>
      <guid>https://dev.to/harvesh_kumar/part-2-why-transformers-still-forget-17od</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This is Part 2 of a three-part series on why long-context language models still struggle with memory.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://forem.com/harvesh_kumar/part-1-long-context-memory-on-why-transformers-still-forget-1am0" rel="noopener noreferrer"&gt;Part 1&lt;/a&gt;, we saw why increasing context length does not solve the memory problem. &lt;/p&gt;

&lt;p&gt;Here, we introduce a memory-centric way of thinking that explains &lt;em&gt;why&lt;/em&gt; models remember, forget, or fail under long context.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Why Architectural Labels Stop Being Useful
&lt;/h2&gt;

&lt;p&gt;Most discussions about sequence models revolve around architectural families: Transformers, RNNs, state-space models, linear attention, and so on. While these labels are useful historically, they often hide the real reasons models behave the way they do. Two models with very different architectures can fail for the same reason, while two seemingly similar models can behave very differently under long context.&lt;/p&gt;

&lt;p&gt;The MIRAS perspective starts from a simple shift: instead of asking &lt;em&gt;what architecture is this?&lt;/em&gt;, it asks &lt;em&gt;what kind of memory system is this model implementing?&lt;/em&gt; Once you adopt that lens, many long-context failures stop looking mysterious and start looking inevitable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Memory as a System, Not a Side Effect
&lt;/h2&gt;

&lt;p&gt;At a high level, any system that processes sequences over time must answer four questions, whether explicitly or implicitly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How does information get written into memory?
&lt;/li&gt;
&lt;li&gt;How is information retrieved later?
&lt;/li&gt;
&lt;li&gt;What gets forgotten, and when?
&lt;/li&gt;
&lt;li&gt;How is memory updated as new data arrives?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Traditional models answer these questions indirectly. Recurrent models write by compressing history into a hidden state and read by exposing that state at the next step. Transformers write by appending tokens into the context and read by attending over them. Forgetting happens automatically when context limits are exceeded or when compression loses detail.&lt;/p&gt;
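&lt;p&gt;A toy contrast of the two write/read strategies makes the difference concrete. This is a deliberately simplified sketch, not a real model: the class names and update rules are illustrative only.&lt;/p&gt;

```python
import numpy as np

class RecurrentMemory:
    """Writes by compressing history into a fixed-size hidden state."""
    def __init__(self, dim):
        self.state = np.zeros(dim)

    def write(self, token, decay=0.9):
        # Older information fades implicitly with every new write.
        self.state = decay * self.state + (1 - decay) * token

    def read(self):
        return self.state  # same size no matter how long the history

class AttentionMemory:
    """Writes by appending key-value pairs; reads by similarity search."""
    def __init__(self, window=4):
        self.window = window
        self.keys, self.values = [], []

    def write(self, key, value):
        self.keys.append(key)
        self.values.append(value)
        # Forgetting happens only when the context window overflows.
        self.keys = self.keys[-self.window:]
        self.values = self.values[-self.window:]

    def read(self, query):
        scores = np.array([query @ k for k in self.keys])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return sum(w * v for w, v in zip(weights, self.values))

rec = RecurrentMemory(dim=3)
att = AttentionMemory(window=4)
for t in range(6):
    token = np.full(3, float(t))
    rec.write(token)
    att.write(token, token)
```

&lt;p&gt;Both answer the four questions, just implicitly: the recurrent state forgets by decay, the attention store forgets by truncation, and neither decision was made with the task in mind.&lt;/p&gt;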

&lt;p&gt;MIRAS makes these mechanisms explicit and treats them as &lt;em&gt;design choices&lt;/em&gt;, not side effects.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Four MIRAS Design Knobs
&lt;/h2&gt;

&lt;p&gt;MIRAS (Memory-Informed Recurrent Associative Systems) characterises sequence models using four core components. These are not tied to any single architecture; they describe &lt;em&gt;how memory behaves&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The first is &lt;strong&gt;memory structure&lt;/strong&gt;. This defines &lt;em&gt;what form memory takes&lt;/em&gt;. It might be a vector, a matrix, or a more expressive neural network. Fixed-size structures force compression, while richer structures allow selective retention.&lt;/p&gt;

&lt;p&gt;The second is &lt;strong&gt;attentional bias&lt;/strong&gt;. This defines &lt;em&gt;what the model considers relevant&lt;/em&gt;. In Transformers, this is typically dot-product similarity. MIRAS highlights that this choice strongly influences what gets retrieved and what gets ignored, especially in noisy or long sequences.&lt;/p&gt;

&lt;p&gt;The third is the &lt;strong&gt;retention or forgetting mechanism&lt;/strong&gt;. Forgetting is not a flaw; it is a necessity. The question is whether forgetting is controlled and adaptive, or implicit and uncontrolled. Many models forget simply because they have no choice.&lt;/p&gt;

&lt;p&gt;The fourth is the &lt;strong&gt;memory update rule&lt;/strong&gt;. This determines how memory changes over time. Some models update memory only during training. Others allow memory to update during inference in a controlled way.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frbzaaxexcjkng98pvcad.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frbzaaxexcjkng98pvcad.png" alt="MIRAS framework control panel" width="800" height="285"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Illustration showing the four MIRAS dimensions: memory structure, attentional bias, retention, and update rule.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Reinterpreting Familiar Models Through MIRAS
&lt;/h2&gt;

&lt;p&gt;When you view common architectures through the MIRAS lens, their strengths and weaknesses become clearer.&lt;/p&gt;

&lt;p&gt;Transformers use a rich memory structure (the full context window) and a strong attentional bias (similarity-based attention). However, their retention mechanism is crude: once the window is full, older information disappears entirely. Their memory update rule is static during inference.&lt;/p&gt;

&lt;p&gt;Linear attention and state-space models modify their structure and update rules to achieve efficiency, but they often rely on aggressive compression. This explains why they scale well but struggle with precise recall over very long sequences.&lt;/p&gt;

&lt;p&gt;The key insight is that these trade-offs are not accidental. They follow directly from the memory design choices each model makes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Loss Functions and Objectives Matter
&lt;/h2&gt;

&lt;p&gt;One subtle but important point in MIRAS is that memory behaviour is influenced not only by architecture, but also by the objective being optimised. Many models rely heavily on mean-squared-error-like objectives or similarity-based losses. These can be sensitive to noise and outliers, which in turn affects what memory updates are emphasised.&lt;/p&gt;

&lt;p&gt;MIRAS uses this observation to motivate alternative formulations that change how relevance and stability are defined. The result is not just better robustness, but more predictable memory behaviour under long and noisy inputs.&lt;/p&gt;

&lt;p&gt;This reinforces the central idea: memory is not just where information is stored, but how learning signals shape what is kept.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Framework Matters Before Talking About Titans
&lt;/h2&gt;

&lt;p&gt;Without a framework like MIRAS, Titans can look like a collection of clever tricks: test-time updates, surprise signals, adaptive forgetting. With MIRAS, those choices become legible. They are answers to explicit memory-design questions rather than ad-hoc optimisations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://forem.com/harvesh_kumar/part-1-long-context-memory-on-why-transformers-still-forget-1am0" rel="noopener noreferrer"&gt;Part 1&lt;/a&gt; showed that attention alone cannot serve as long-term memory. Part 2 explains &lt;em&gt;why&lt;/em&gt; most existing alternatives still fall short. Only after this framing does it make sense to examine Titans as a concrete instantiation of a different memory system.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to Watch for in Real Applications
&lt;/h2&gt;

&lt;p&gt;If you apply the MIRAS lens to real systems, patterns emerge quickly. Models fail when the memory structure is too rigid, when retention is uncontrolled, or when update rules are frozen despite changing inputs. Conversely, systems become more robust when memory design is intentional and aligned with task requirements.&lt;/p&gt;

&lt;p&gt;This perspective is especially relevant for agents, streaming data, long-running processes, and any application where the model must operate continuously rather than in isolated prompts.&lt;/p&gt;




&lt;h2&gt;
  
  
  Looking Ahead to Part 3
&lt;/h2&gt;

&lt;p&gt;Part 2 sets the conceptual groundwork. In Part 3, we will look closely at the Titans architecture and see how it instantiates these memory principles in practice. We will examine how long-term memory is represented, how it updates during inference, and how forgetting is managed to keep the system stable.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
    </item>
    <item>
      <title>Part 1: Why Transformers Still Forget</title>
      <dc:creator>Pranava Kailash Subramaniam Prema</dc:creator>
      <pubDate>Thu, 18 Dec 2025 12:37:51 +0000</pubDate>
      <link>https://dev.to/harvesh_kumar/part-1-long-context-memory-on-why-transformers-still-forget-1am0</link>
      <guid>https://dev.to/harvesh_kumar/part-1-long-context-memory-on-why-transformers-still-forget-1am0</guid>
      <description>&lt;p&gt;This is Part 1 of a three-part series examining why long context is not equivalent to long-term memory in modern language models. Here, we focus on why attention-based systems forget even when context windows grow dramatically. The next parts will introduce a memory-first framework and analyse how the Titans architecture approaches long-term memory explicitly.&lt;/p&gt;

&lt;p&gt;Long-context models are everywhere now. The marketing message is simple: if a model can read more tokens, it can “remember” more. That sounds reasonable, but it is the wrong mental model. A bigger context window mostly turns a model into a better reader, not a better rememberer. The distinction matters because many real-world tasks are not about reading everything; they are about keeping what matters and using it later without constantly re-scanning a massive history.&lt;/p&gt;

&lt;p&gt;This post introduces the core problem behind the &lt;em&gt;Titans&lt;/em&gt; line of work from Google Research: &lt;strong&gt;attention is an excellent short-term memory mechanism, but it is not a complete memory system&lt;/strong&gt;. Titans starts from this premise and proposes a way to introduce long-term memory without relying on quadratic attention over the entire past.&lt;/p&gt;




&lt;h2&gt;
  
  
  The False Promise of “Just Increase the Context Window”
&lt;/h2&gt;

&lt;p&gt;Transformers are built around attention. Attention works by comparing queries to keys across the tokens provided in the context window and retrieving values weighted by similarity. This mechanism can feel like memory because the model can “look back” and reuse earlier information. In reality, however, the model is only conditioning on what is currently visible; it repeatedly consults the context rather than storing information in a durable internal memory.&lt;/p&gt;

&lt;p&gt;As context length increases, the model can consult more text, but relevance becomes harder to isolate. The longer the history, the more distractors exist, and the easier it becomes for retrieval to miss the one detail that matters. This explains the common gap between &lt;em&gt;“the model can technically see the information”&lt;/em&gt; and &lt;em&gt;“the model reliably uses the information.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In short, long context improves access, but it does not guarantee retention.&lt;/p&gt;
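&lt;p&gt;A small numerical experiment makes the dilution concrete. The setup below is illustrative, not from the paper: one key nearly matches the query, and random distractor keys stand in for a growing context.&lt;/p&gt;

```python
import numpy as np

def attention_weight_on_target(n_distractors, dim=32, seed=0):
    """Softmax attention weight assigned to the single relevant key
    as the context fills with unrelated distractor keys."""
    rng = np.random.default_rng(seed)
    target = rng.normal(size=dim)
    query = target + 0.1 * rng.normal(size=dim)  # query nearly matches target
    keys = np.vstack([target, rng.normal(size=(n_distractors, dim))])
    scores = keys @ query / np.sqrt(dim)         # scaled dot-product scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights[0]                            # mass on the relevant key

w_short = attention_weight_on_target(10)
w_long = attention_weight_on_target(10_000)
# The relevant key receives far less attention mass in the longer
# context, even though it is still technically "visible".
```

&lt;p&gt;The relevant detail never left the context; it simply has to compete with thousands of distractors for attention mass.&lt;/p&gt;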

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6cspx1mc9vetju80sz1p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6cspx1mc9vetju80sz1p.png" alt="Attention cost grows with context length (quadratic scaling)" width="800" height="436"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Illustration showing quadratic growth of attention cost as context length increases&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Attention as Memory: Useful, but Incomplete
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://arxiv.org/abs/2501.00663" rel="noopener noreferrer"&gt;Titans paper&lt;/a&gt; frames attention as an &lt;strong&gt;associative memory block&lt;/strong&gt;. Tokens are stored as key–value pairs and retrieved through similarity search. This explains why Transformers perform so well on many sequence tasks. However, it also clarifies the limitation: the model’s output is strictly conditioned on dependencies inside the current context window, which is fundamentally bounded.&lt;/p&gt;

&lt;p&gt;Titans draws a clear conceptual line. Attention behaves like &lt;strong&gt;short-term memory&lt;/strong&gt;: high-fidelity, flexible, and powerful, but constrained by window size and computational cost. Long-term memory requires different properties: persistence over time, selective storage, and the ability to retain useful abstractions without keeping every past token accessible through attention.&lt;/p&gt;

&lt;p&gt;This distinction is not philosophical; it is architectural. Forcing attention to handle long-term storage increases compute cost without delivering reliable recall.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2wh67hwlr4zje1t8bz9d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2wh67hwlr4zje1t8bz9d.png" alt="Short-term attention memory vs long-term persistent memory" width="800" height="436"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Short-term attention paired with persistent long-term neural memory.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why “Efficient” Transformers and Linear Models Still Struggle
&lt;/h2&gt;

&lt;p&gt;One response to attention’s scaling limits is to replace softmax attention with linear or kernel-based alternatives. While these approaches reduce computational complexity, they often behave like linear recurrent models that compress history into a fixed-size state. This makes them efficient, but also introduces information loss.&lt;/p&gt;
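&lt;p&gt;The compression trade-off can be sketched directly. The recurrence below is a simplified caricature of linear attention, written only to show the mechanism: every token is folded into one fixed-size matrix state.&lt;/p&gt;

```python
import numpy as np

def linear_attention_stream(tokens):
    """Fold an arbitrarily long stream into one fixed-size state.

    Each step writes a rank-1 outer product into S and reads by
    multiplying the query against the compressed state. This mirrors
    the recurrent view of linear attention: O(1) memory, lossy writes.
    """
    dim = len(tokens[0][0])
    S = np.zeros((dim, dim))          # fixed-size memory; never grows
    outputs = []
    for k, v, q in tokens:            # (key, value, query) per step
        S = S + np.outer(k, v)        # write: superimpose a rank-1 update
        outputs.append(q @ S)         # read: query the compressed state
    return S, outputs

rng = np.random.default_rng(1)
stream = [tuple(rng.normal(size=8) for _ in range(3)) for _ in range(1000)]
S, outs = linear_attention_stream(stream)
# After 1000 tokens the state is still 8x8: efficient, but the 1000
# individual writes are superimposed and cannot be recovered exactly.
```

&lt;p&gt;The state never grows, which is the efficiency win; but every write lands on top of earlier ones, which is the information loss.&lt;/p&gt;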

&lt;p&gt;The contradiction is clear: these models are most appealing when context is very long, yet very long histories are difficult to compress without losing important details. As a result, two imperfect strategies dominate today:&lt;/p&gt;

&lt;p&gt;Full-attention Transformers retain rich access to recent context but are expensive and bounded. Linear or recurrent variants scale efficiently but risk forgetting critical information due to compression.&lt;/p&gt;

&lt;p&gt;Titans is motivated by this tension. Efficiency and reliable recall clash when only a single memory mechanism is used.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwodli675r0c6log4rvk5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwodli675r0c6log4rvk5.png" alt="Trade-off between full attention and compressed recurrence" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Full attention: accurate but expensive” versus “Compressed recurrence: efficient but forgetful”&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Memory Perspective That Leads to Titans
&lt;/h2&gt;

&lt;p&gt;A key contribution of the Titans work is its &lt;strong&gt;memory-centric view of sequence modelling&lt;/strong&gt;. Models are described in terms of two operations: writing (or updating memory) and reading (or retrieving from memory). Recurrent models write by compressing history into a hidden state. Transformers write by appending keys and values to the context. Retrieval then happens either by reading the hidden state or attending to stored keys.&lt;/p&gt;

&lt;p&gt;Seen through this lens, the important questions shift. Instead of asking which architecture is best, we ask how memory should be structured, how it should update, how it should retrieve information, how it should forget, and how multiple memory modules can be combined so each handles what it does best.&lt;/p&gt;

&lt;p&gt;This framing naturally leads to Titans: an architecture where attention remains in short-term memory, and a separate module is responsible for long-term memory.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Means for Real Systems
&lt;/h2&gt;

&lt;p&gt;For systems that involve continuous reasoning, long-running agents, extended conversations, log analysis, long documents, or time-series data, the limitation appears quickly. Increasing context length helps, but retrieval becomes fragile as the haystack grows. This is why larger windows often improve demos without fully solving reliability.&lt;/p&gt;

&lt;p&gt;Titans is compelling because it does not claim that attending to everything is sufficient. Instead, it argues for an architecture that explicitly incorporates long-term memory and manages retention over extended horizons while remaining computationally practical.&lt;/p&gt;




&lt;h2&gt;
  
  
  Unresolved Questions
&lt;/h2&gt;

&lt;p&gt;Part 1 deliberately leaves one major question unanswered: &lt;strong&gt;what should a long-term memory module look like, and how should it decide what to store and forget?&lt;/strong&gt; The Titans paper addresses this later using a neural long-term memory updated at test time via a “surprise” signal and adaptive forgetting.&lt;/p&gt;

&lt;p&gt;These mechanisms will be explored in Part 3. Before that, Part 2 introduces a broader memory lens that explains why forgetting and retention emerge from design choices rather than bugs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion: We Don’t Have a Context Problem, We Have a Memory Problem
&lt;/h2&gt;

&lt;p&gt;Attention is an exceptional tool for short-range dependency modelling, but treating it as the entire memory system forces trade-offs that do not disappear with larger context windows. Truly scalable long-context systems require dedicated long-term memory mechanisms, not just longer scrollback.&lt;/p&gt;

&lt;p&gt;In Part 2, we will make this memory-first framing explicit so that Titans appears as a logical architectural step rather than an isolated idea.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I Built a Fully Local Prompt Enhancer Chrome Extension with Gemini Nano</title>
      <dc:creator>Pranava Kailash Subramaniam Prema</dc:creator>
      <pubDate>Sun, 07 Dec 2025 21:00:14 +0000</pubDate>
      <link>https://dev.to/harvesh_kumar/i-built-a-fully-local-prompt-enhancer-chrome-extension-with-gemini-nano-1m5m</link>
      <guid>https://dev.to/harvesh_kumar/i-built-a-fully-local-prompt-enhancer-chrome-extension-with-gemini-nano-1m5m</guid>
      <description>&lt;p&gt;Over the last few weeks, I’ve been building a Chrome extension called &lt;strong&gt;Prompt Enhancer&lt;/strong&gt; that turns rough ideas into clear, structured prompts for ChatGPT in a single click. It runs fully on‑device using &lt;strong&gt;Chrome’s built‑in Gemini Nano&lt;/strong&gt; via the Prompt API, so nothing ever leaves your browser.&lt;/p&gt;

&lt;p&gt;In this post, you’ll see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why Prompt Enhancer exists (and the workflow problem it solves)
&lt;/li&gt;
&lt;li&gt;How it works end‑to‑end from the user’s perspective
&lt;/li&gt;
&lt;li&gt;The technical architecture: Manifest V3, content scripts, and the Prompt API
&lt;/li&gt;
&lt;li&gt;What’s shipped today and what’s coming next (v2 ideas and roadmap)
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Problem: We Underspecify Prompts
&lt;/h2&gt;

&lt;p&gt;Most people use ChatGPT (or any LLM) like a search bar: dump a half‑formed thought, hit Enter, and hope for magic. The result is usually OK, but not great.&lt;/p&gt;

&lt;p&gt;Common issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vague prompts like “explain this topic for my exam” with no level, context, or constraints
&lt;/li&gt;
&lt;li&gt;Coding requests with no input/output format, no edge cases, and no performance constraints
&lt;/li&gt;
&lt;li&gt;Writing tasks with no target audience, tone, or length
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As a data/AI person, I find this painful to watch, because &lt;strong&gt;prompt quality directly controls output quality&lt;/strong&gt;. A well‑designed prompt can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Save multiple back‑and‑forth iterations
&lt;/li&gt;
&lt;li&gt;Produce more reliable and testable outputs (especially for code and analysis)
&lt;/li&gt;
&lt;li&gt;Make AI tools actually usable in real workflows
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the idea was simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What if you could type as you usually do, then hit one button that rewrites your rough text into a high‑quality prompt?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s precisely what &lt;strong&gt;Prompt Enhancer&lt;/strong&gt; does on top of ChatGPT, with almost zero friction.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Prompt Enhancer Does (User Experience)
&lt;/h2&gt;

&lt;p&gt;From the user’s point of view, Prompt Enhancer adds a small superpower to the ChatGPT UI without changing how they use it day‑to‑day.&lt;/p&gt;

&lt;h3&gt;
  
  
  One‑Click Prompt Upgrade
&lt;/h3&gt;

&lt;p&gt;Once installed and enabled:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open ChatGPT in Chrome.
&lt;/li&gt;
&lt;li&gt;Start typing any rough idea into the input box.
&lt;/li&gt;
&lt;li&gt;You’ll see a small &lt;strong&gt;“Enhance”&lt;/strong&gt; button appear beside or near the textarea.
&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Enhance&lt;/strong&gt;, wait a moment, and your rough text is replaced with a refined, structured prompt.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The enhanced prompt usually:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adds missing context (audience, constraints, format)
&lt;/li&gt;
&lt;li&gt;Clarifies the task (summarise vs critique vs generate vs explain)
&lt;/li&gt;
&lt;li&gt;Specifies outputs (bullet list, table, code with comments, etc.)
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can quickly review the rewritten prompt, tweak anything you like, then hit Enter as usual.&lt;/p&gt;

&lt;h3&gt;
  
  
  100% On‑Device, No External APIs
&lt;/h3&gt;

&lt;p&gt;The critical design choice: &lt;strong&gt;Prompt Enhancer does not call any external API&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead, it uses &lt;strong&gt;Chrome’s Prompt API&lt;/strong&gt; to talk to Gemini Nano, the small LLM that Chrome can run locally inside the browser.&lt;/p&gt;

&lt;p&gt;That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No API keys
&lt;/li&gt;
&lt;li&gt;No extra backend server to maintain
&lt;/li&gt;
&lt;li&gt;No prompts going to third‑party infrastructure beyond what ChatGPT itself already uses when you submit
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For anyone writing sensitive content (internal docs, research, planning), this matters a lot. The extension never ships your text to another cloud service to enhance it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Keyboard‑First Flow
&lt;/h3&gt;

&lt;p&gt;To keep things fast for power users, Prompt Enhancer also supports keyboard shortcuts (for example, pressing a combination to trigger an enhancement instead of clicking the button). This keeps the workflow:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Type → Enhance → Enter  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;All from the keyboard, which is how heavy users of ChatGPT typically operate.&lt;/p&gt;




&lt;h2&gt;
  
  
  Under the Hood: Architecture and Design
&lt;/h2&gt;

&lt;p&gt;Now for the fun part: how it’s built.&lt;/p&gt;

&lt;p&gt;Prompt Enhancer is a &lt;strong&gt;Manifest V3&lt;/strong&gt; Chrome extension that combines three main pieces:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The extension manifest (permissions + wiring)
&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;content script&lt;/strong&gt; that interacts with the ChatGPT DOM
&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;Prompt API integration&lt;/strong&gt; that calls Gemini Nano locally
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  1. Manifest V3 Basics
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;manifest.json&lt;/code&gt; file:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Declares this as a Manifest V3 extension.
&lt;/li&gt;
&lt;li&gt;Specifies the sites where the content script should run (e.g., &lt;code&gt;https://chat.openai.com/*&lt;/code&gt;).
&lt;/li&gt;
&lt;li&gt;Requests the minimum permissions needed to inject UI and read/write the textarea.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Keeping permissions minimal helps with both security and Chrome Web Store review. It also forces the architecture to be simple, which is a good constraint for a small but focused tool.&lt;/p&gt;
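&lt;p&gt;As a rough illustration, a minimal manifest for this kind of extension might look like the following. The concrete field values here (name, version, shortcut, and the &lt;code&gt;chatgpt.com&lt;/code&gt; host that ChatGPT now also uses) are assumptions for the sketch, not the extension’s actual manifest:&lt;/p&gt;

```json
{
  "manifest_version": 3,
  "name": "Prompt Enhancer",
  "version": "1.0.0",
  "description": "One-click prompt enhancement, fully on-device.",
  "content_scripts": [
    {
      "matches": ["https://chat.openai.com/*", "https://chatgpt.com/*"],
      "js": ["content.js"]
    }
  ],
  "commands": {
    "enhance-prompt": {
      "suggested_key": { "default": "Ctrl+Shift+E" },
      "description": "Enhance the current prompt"
    }
  }
}
```

&lt;p&gt;The &lt;code&gt;commands&lt;/code&gt; entry is the kind of wiring that would back the keyboard shortcut described earlier.&lt;/p&gt;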

&lt;h3&gt;
  
  
  2. Content Script: Injecting the Enhance Button
&lt;/h3&gt;

&lt;p&gt;The content script is responsible for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detecting when you’re on a ChatGPT page
&lt;/li&gt;
&lt;li&gt;Locating the main input textarea
&lt;/li&gt;
&lt;li&gt;Injecting the floating &lt;strong&gt;Enhance&lt;/strong&gt; button into the UI
&lt;/li&gt;
&lt;li&gt;Listening for click/keyboard events and reading the user’s current text
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because ChatGPT’s UI is a React app that changes frequently, the content script needs to be &lt;strong&gt;robust to DOM changes&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use stable selectors where possible
&lt;/li&gt;
&lt;li&gt;Fall back to heuristics like “the largest textarea in the chat input area”
&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;MutationObserver&lt;/code&gt; to re‑attach the button if the UI re-renders
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This prevents the extension from silently breaking when ChatGPT updates its frontend.&lt;/p&gt;
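&lt;p&gt;The “largest textarea” fallback can be sketched as a pure helper. This is a hypothetical &lt;code&gt;pickLargestTextarea&lt;/code&gt;, not the extension’s actual code, with the &lt;code&gt;MutationObserver&lt;/code&gt; wiring shown as a comment:&lt;/p&gt;

```javascript
// Hypothetical fallback heuristic: given candidate elements (each exposing
// a rect with width/height), pick the one with the largest visible area,
// which in the chat UI is likely the main prompt input.
function pickLargestTextarea(candidates) {
  let best = null;
  let bestArea = 0;
  for (const el of candidates) {
    const { width, height } = el.rect;
    const area = width * height;
    if (area > bestArea) {
      bestArea = area;
      best = el;
    }
  }
  return best;
}

// In the real content script this would run over the page's textareas and a
// MutationObserver would re-run it whenever ChatGPT re-renders, e.g.:
//   new MutationObserver(() => attachButton())
//     .observe(document.body, { childList: true, subtree: true });
```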

&lt;p&gt;Once the button is clicked:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The script grabs the current textarea value (the rough prompt).
&lt;/li&gt;
&lt;li&gt;It sends this text to the Gemini Nano logic (still inside the same browser context).
&lt;/li&gt;
&lt;li&gt;When the enhanced prompt comes back, the script overwrites the textarea value and restores focus.&lt;/li&gt;
&lt;/ul&gt;
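&lt;p&gt;One non‑obvious detail in the “overwrite the textarea” step: React‑controlled inputs often ignore a plain assignment to &lt;code&gt;.value&lt;/code&gt;, so a common workaround is to call the native value setter and dispatch an &lt;code&gt;input&lt;/code&gt; event. The helper below is an illustrative sketch of that pattern, not the extension’s actual code:&lt;/p&gt;

```javascript
// Hypothetical sketch: replace the value of a React-controlled textarea.
// Setting .value directly is often reverted by React, so we invoke the
// prototype's native setter and then fire an "input" event so the app's
// state updates, before restoring focus for the user.
function replaceTextareaValue(textarea, newText) {
  const proto = Object.getPrototypeOf(textarea);
  const setter = Object.getOwnPropertyDescriptor(proto, "value").set;
  setter.call(textarea, newText);
  textarea.dispatchEvent(new Event("input", { bubbles: true }));
  textarea.focus();
}
```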

&lt;h3&gt;
  
  
  3. Using Chrome’s Prompt API with Gemini Nano
&lt;/h3&gt;

&lt;p&gt;The most interesting part is the AI integration.&lt;/p&gt;

&lt;p&gt;Chrome has introduced a &lt;strong&gt;Prompt API&lt;/strong&gt; that lets developers send natural‑language requests to the built‑in Gemini Nano model directly inside the browser.&lt;/p&gt;

&lt;p&gt;The typical flow is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check whether the environment supports the built‑in AI APIs (for example, on Dev or Canary builds with the relevant flags enabled).&lt;/li&gt;
&lt;li&gt;Create a language model instance via the Prompt API (this is what connects to Gemini Nano).
&lt;/li&gt;
&lt;li&gt;Pass a “meta‑prompt” plus the user’s rough text to the model.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;strong&gt;meta‑prompt&lt;/strong&gt; is where the actual prompt engineering lives. Conceptually, it looks like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“You are a prompt‑engineering assistant. Rewrite the user’s text as a clear, detailed prompt that specifies context, output format, and constraints, without changing the core intent.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The model then returns an enhanced version of the prompt, which the extension pipes back into the ChatGPT textarea.&lt;/p&gt;
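&lt;p&gt;The three steps above can be sketched roughly as follows. The exact API names (&lt;code&gt;LanguageModel&lt;/code&gt;, &lt;code&gt;availability()&lt;/code&gt;, &lt;code&gt;create()&lt;/code&gt;) reflect one revision of the experimental Prompt API and may differ in your Chrome build, so treat the browser wiring as illustrative:&lt;/p&gt;

```javascript
// Pure helper: combine the meta-prompt with the user's rough text.
function buildEnhancementRequest(roughText) {
  const metaPrompt =
    "You are a prompt-engineering assistant. Rewrite the user's text as a " +
    "clear, detailed prompt that specifies context, output format, and " +
    "constraints, without changing the core intent.";
  return `${metaPrompt}\n\nUser text:\n${roughText}`;
}

// Browser-only wiring (runs only in Chrome with the Prompt API enabled).
async function enhance(roughText) {
  if (typeof LanguageModel === "undefined") {
    throw new Error("Built-in AI is not available in this browser");
  }
  const availability = await LanguageModel.availability();
  if (availability === "unavailable") {
    throw new Error("Gemini Nano is not available on this device");
  }
  const session = await LanguageModel.create();
  return session.prompt(buildEnhancementRequest(roughText));
}
```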

&lt;p&gt;Because Gemini Nano runs locally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latency is low and predictable (no network round-trip time).&lt;/li&gt;
&lt;li&gt;The extension can even function with Wi‑Fi turned off, as long as ChatGPT is already loaded and the model is available in Chrome.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This architecture cleanly separates concerns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Content script&lt;/strong&gt;: DOM integration and UX.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI layer&lt;/strong&gt;: Prompt API + meta‑prompt logic.
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What’s Shipped Today
&lt;/h2&gt;

&lt;p&gt;As of now, Prompt Enhancer includes a solid v1 feature set focused on doing one thing exceptionally well: &lt;strong&gt;turn rough prompts into better ones for ChatGPT.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Shipped features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Floating “Enhance” button&lt;/strong&gt; on the ChatGPT prompt textarea
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On‑device prompt enhancement&lt;/strong&gt; using Gemini Nano via Chrome’s Prompt API
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full local processing&lt;/strong&gt; with no external prompt‑enhancement APIs
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keyboard shortcut support&lt;/strong&gt; for in‑flow power usage
&lt;/li&gt;
&lt;li&gt;A minimal, clean UX that feels native to ChatGPT
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s live on the &lt;strong&gt;Chrome Web Store&lt;/strong&gt;, so users can install it like any other extension and start using it in seconds. &lt;a href="https://chromewebstore.google.com/detail/prompt-enhancer/bbadphpjobokbobhbiejmajbclkkganp?authuser=1&amp;amp;hl=en-GB" rel="noopener noreferrer"&gt;Link to the extension&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Lessons Learned While Building It
&lt;/h2&gt;

&lt;p&gt;Beyond the feature list, this project taught several practical lessons that are worth sharing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Designing with Real‑World Constraints
&lt;/h3&gt;

&lt;p&gt;Shipping to the Chrome Web Store forces you to think about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Permissions and privacy policies
&lt;/li&gt;
&lt;li&gt;Clear explanations of what data is accessed, and why
&lt;/li&gt;
&lt;li&gt;How to communicate that everything runs locally and no extra server is involved
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because Prompt Enhancer is entirely local, the privacy story is strong, but it still needs to be communicated clearly in the listing and docs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Working With Experimental Browser AI
&lt;/h3&gt;

&lt;p&gt;Using Gemini Nano in Chrome is still relatively new, which means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Some setups require enabling flags or using Dev/Canary builds.&lt;/li&gt;
&lt;li&gt;The APIs and documentation are evolving, so error handling must be defensive.
&lt;/li&gt;
&lt;li&gt;It’s essential to provide a graceful fallback message if the model isn’t available yet on a user’s browser.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a trade‑off between being &lt;strong&gt;early&lt;/strong&gt; on a powerful capability and accepting that not every user is ready out of the box.&lt;/p&gt;

&lt;h3&gt;
  
  
  UX Matters More Than Model Choice
&lt;/h3&gt;

&lt;p&gt;One of the biggest takeaways: users care more about flow than about which model is behind it.&lt;/p&gt;

&lt;p&gt;The small details add up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not breaking the user’s typing flow
&lt;/li&gt;
&lt;li&gt;Giving quick visual feedback when the enhancement is running
&lt;/li&gt;
&lt;li&gt;Respecting the original intent instead of over‑rewriting
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, a simple, well‑integrated UX around a local model can improve day‑to‑day AI usage more than yet another separate web app.&lt;/p&gt;




&lt;h2&gt;
  
  
  Roadmap: What’s Coming Next
&lt;/h2&gt;

&lt;p&gt;Prompt Enhancer v1 focuses on ChatGPT, but there’s a lot of room to grow. Here are some ideas and updates planned for future versions.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Multi‑Platform Support
&lt;/h3&gt;

&lt;p&gt;Extend the same enhancement flow beyond ChatGPT to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude
&lt;/li&gt;
&lt;li&gt;Gemini web
&lt;/li&gt;
&lt;li&gt;Other AI tools with standard textareas
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This would turn Prompt Enhancer into a &lt;strong&gt;universal prompt‑upgrade layer&lt;/strong&gt; atop whichever LLM you prefer.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Mode‑Aware Enhancements
&lt;/h3&gt;

&lt;p&gt;Introduce selectable modes such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Coding prompt” (add test cases, clarify language, specify constraints)
&lt;/li&gt;
&lt;li&gt;“Writing prompt” (tone, audience, structure)
&lt;/li&gt;
&lt;li&gt;“Analysis prompt” (data format, assumptions, limitations)
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These modes would tweak the internal meta‑prompt to better match the user’s goal without making them think about prompt engineering too much.&lt;/p&gt;
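&lt;p&gt;Conceptually (this is a v2 sketch, not shipped code), mode‑aware enhancement could be a small lookup layered on top of the base meta‑prompt:&lt;/p&gt;

```javascript
// Hypothetical v2 sketch: each mode contributes extra instructions on top of
// the base meta-prompt; unknown modes fall back to the base behaviour.
const MODE_HINTS = {
  coding: "Specify the language, input/output format, edge cases, and performance constraints.",
  writing: "Specify the target audience, tone, structure, and length.",
  analysis: "Specify the data format, assumptions, and known limitations.",
};

function buildMetaPrompt(mode) {
  const base =
    "You are a prompt-engineering assistant. Rewrite the user's text as a " +
    "clear, detailed prompt without changing the core intent.";
  const hint = MODE_HINTS[mode];
  return hint ? `${base} ${hint}` : base;
}
```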

&lt;h3&gt;
  
  
  3. Configurable Meta‑Prompts
&lt;/h3&gt;

&lt;p&gt;Expose some customisation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Let advanced users tweak the internal meta‑prompt
&lt;/li&gt;
&lt;li&gt;Save and reuse custom enhancement styles (e.g., “consultant style”, “exam prep”, “SDE interview”)
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This would bridge the gap between a plug‑and‑play extension and a more advanced power tool for prompt engineers.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Deeper Chrome Integration
&lt;/h3&gt;

&lt;p&gt;Longer‑term experiments could include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enhancing prompts based on page context (e.g., selected text on a docs page)
&lt;/li&gt;
&lt;li&gt;Using side panels for history and templates
&lt;/li&gt;
&lt;li&gt;Smarter handling of very long prompts or complex workflows
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All while maintaining the core principle: &lt;strong&gt;no extra backend, everything stays local&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;Prompt Enhancer started as a small experiment: “Can a Chrome extension use on‑device AI to quietly make prompts better without getting in the way?” It has turned into a real, shippable tool that improves day‑to‑day AI workflows while respecting privacy and keeping the architecture clean.&lt;/p&gt;

&lt;p&gt;If you use ChatGPT regularly and want to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Get better answers without becoming a full‑time prompt engineer
&lt;/li&gt;
&lt;li&gt;Keep sensitive text local to your device
&lt;/li&gt;
&lt;li&gt;Add a bit of extra power to your existing flow
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;then Prompt Enhancer might be worth a try. &lt;a href="https://chromewebstore.google.com/detail/prompt-enhancer/bbadphpjobokbobhbiejmajbclkkganp?authuser=1&amp;amp;hl=en-GB" rel="noopener noreferrer"&gt;Link to the extension&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can install it from the &lt;strong&gt;Chrome Web Store&lt;/strong&gt;, and any feedback, bug reports, or v2 ideas are very welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
      <category>javascript</category>
    </item>
  </channel>
</rss>
