I recently came across a paper called Nested Learning: The Illusion of Deep Learning by Behrouz and his team, the same researchers behind Titans and Atlas. It caught my attention because it challenges what we usually mean by "deep learning." The paper argues that depth in neural networks isn't just about stacking layers; it's about how many levels of learning the system can apply to itself. Instead of just updating weights, a model built this way learns how to improve its own learning process.
While reading it, I realized this isn't just another optimization trick. It feels like a glimpse of what real intelligence could be: an AI that doesn't just react but reflects, improves, and evolves how it learns over time. The authors even built a prototype called HOPE, a model that modifies itself using feedback, learning not just what to learn but how to learn better.
My Take on the Problems in Deep Learning and Transformers
Today’s deep learning systems — even transformers — are strong but still limited in how they actually learn.
Neural networks are called “deep” because of their layers, but their learning is flat — one optimizer like Adam or SGD updates everything the same way. Once training ends, the model stops learning, like a student who graduates and never studies again. :)
Transformers improved context handling, but they only have short-term memory. They remember what’s inside their context window and forget the rest. Even if I teach GPT something new, it won’t remember it next time — its knowledge is frozen.
To me, that’s what makes this paper exciting. It explores real, continuous learning, where models don’t just perform tasks but actually grow and evolve from their experiences — more like how the human brain learns over time.
How the Human Brain Actually Learns
When I compared this idea to how our brain works, the difference was clear. The brain doesn’t just store facts — it keeps updating how it learns from every experience. Each new moment changes not only what we know but how we learn next time.
Like when I study late and remember less, my brain quietly adjusts — it learns how to learn better. We also have different kinds of memory: fast, short-term memory for quick thoughts and slow, long-term memory for what truly matters.
What’s amazing is that all this happens at different speeds — reflexes form instantly, habits take time. That’s what makes our learning flexible and self-improving. This paper helped me realize that real intelligence isn’t just about storing knowledge — it’s about systems that can adapt the way they adapt.
Why Real Intelligence Learns at Many Speeds
“Nested Learning” mirrors the way our brain learns at multiple speeds. In real life, not all learning happens instantly — some lessons come from quick feedback, and others sink in over time through reflection and repetition.
For example, when I make a mistake in code, I fix it fast — that’s short-term learning. But when I notice a pattern of mistakes across projects and change how I approach debugging, that’s slow, higher-order learning. My brain is basically nesting layers of learning, one inside another.
This is exactly what the paper argues AI should do. Instead of having one rigid update rule for all situations, it should have systems that operate on different timescales — fast ones for reacting to the present and slow ones for improving how it learns in the future. Real intelligence, human or artificial, grows when it can learn fast, remember slow, and keep adjusting both.
Why Optimizers Are Still “Shallow”
One part that stood out to me was how the paper calls our current optimizers “shallow.” At first, that sounded odd — optimizers like Adam or SGD are what make models learn, right? But the point is deeper: they only operate at one level. They adjust the weights, but they never learn how to optimize better on their own.
Think about it like this — an optimizer is a rulebook. It says, “If error is high, change parameters this way.” That rule never changes, no matter how the model behaves or what patterns it encounters. It doesn’t evolve. It’s like a student who keeps using the same study method forever, even when it stops working.
Nested Learning challenges that. It treats the optimizer as something that can learn from its own history — almost like giving the optimizer memory and awareness. So instead of being a fixed rule, it becomes a learner itself. That’s why normal optimizers are called “shallow” — they only see one layer of the learning process, while true intelligence needs many.
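To make that concrete, here is a tiny Python sketch of the idea. It is my own toy, not anything from the paper: an optimizer that keeps a memory of its recent losses and rewrites its own rule (here, just its learning rate) whenever that history says the rule is failing.

```python
class MemoryOptimizer:
    """Toy optimizer that adapts its own update rule from its history.

    Plain SGD applies the same rule forever. This one also watches its
    recent loss history (a second level of learning) and halves its
    step size whenever that history says the current rule is failing.
    """

    def __init__(self, lr):
        self.lr = lr
        self.history = []

    def step(self, w, grad, loss):
        self.history.append(loss)
        # Level 2: inspect the last few losses and, if they are not
        # improving, rewrite the rule itself (here: shrink the lr).
        if len(self.history) >= 5 and self.history[-1] > 0.99 * self.history[-5]:
            self.lr *= 0.5
        # Level 1: the ordinary gradient step on the weights.
        return w - self.lr * grad

# Minimize f(w) = (w - 3)^2, starting from a deliberately bad learning
# rate that makes plain SGD diverge.
w, opt = 0.0, MemoryOptimizer(lr=1.1)
for _ in range(40):
    w = opt.step(w, grad=2 * (w - 3.0), loss=(w - 3.0) ** 2)
print(round(w, 4), opt.lr)  # converges once the optimizer fixes its own lr
```

With the same starting learning rate, plain SGD would diverge forever; this one notices its rule failing and repairs it, which is the extra "level" the paper is pointing at.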
What Nested Learning Really Means
When I finally got to the main idea, Nested Learning, it clicked for me that this isn't just about deeper networks but deeper learning loops. Normally, a model learns by updating its parameters once per training step, all at a single level. But in Nested Learning, multiple layers of learning are stacked inside each other, each operating at its own level.
The paper calls these “levels.”
- Level 1 is the fast learner — it adjusts to new data right away.
- Level 2 is slower — it learns how well Level 1 is learning and changes its strategy.
- Level 3 and beyond keep zooming out, letting the model reflect on its own updates and tweak the process itself.
It’s like having a mind inside a mind inside a mind — each layer watching and improving the one below. What makes it powerful is that it never stops at one rule; it can always find a better way to learn. I realized that’s what makes it feel almost human — because that’s how we grow too, by not just learning facts, but by constantly refining how we learn them.
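Here is a minimal sketch of those levels as nested loops running at different frequencies. The three-level split, the schedule (every step / every 10 / every 100), and all the names are my own simplification of the idea, not the paper's formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
w = np.zeros(2)        # Level 1: weights, updated every step
lr = 0.1               # Level 2: the learning rate, revised every 10 steps
meta = 0.9             # Level 3: how sharply Level 2 reacts, revised every 100
losses = []

for step in range(1, 1001):
    x = rng.normal(size=2)
    err = float(w @ x - true_w @ x)
    losses.append(err ** 2)

    # Level 1 (fast): ordinary gradient step on the weights.
    w -= lr * 2 * err * x

    # Level 2 (slower): judge how well Level 1 has been learning
    # lately and adjust its strategy.
    if step % 10 == 0 and len(losses) >= 20:
        if np.mean(losses[-10:]) > np.mean(losses[-20:-10]):
            lr *= meta          # progress worsened, so damp the fast loop

    # Level 3 (slowest): zoom out further and tweak how Level 2
    # itself adapts.
    if step % 100 == 0:
        meta = max(0.5, meta - 0.02)

print(np.round(w, 3))  # approaches [2, -1]
```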
The Power of Associative Memory
In this paper, associative memory is what allows the model to connect surprise signals over time. Each time it encounters something unexpected, it doesn’t just correct the output; it stores that “surprise” pattern and learns from how surprises evolve. So instead of forgetting past mistakes, it builds a history of how it has been wrong before — and uses that as context for new learning.
I liked how this turns memory from a passive storage system into an active, learning part of the network. It’s not just remembering data; it’s remembering how learning felt last time. That’s what makes the system more adaptive and self-improving, just like how human intuition forms through repeated experiences.
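Here is a toy sketch of what "write on surprise" could look like. The class name, the threshold rule, and the similarity-based recall are all my own illustration, not the paper's design:

```python
import numpy as np

class SurpriseMemory:
    """Toy associative memory that only writes when it is surprised.

    Instead of storing every input, it first predicts the value for a
    key; only when the prediction misses badly does it store the pair.
    The memory thus becomes a record of past surprises.
    """

    def __init__(self, threshold=0.5):
        self.keys, self.values = [], []
        self.threshold = threshold  # how wrong we must be to bother storing

    def predict(self, key):
        if not self.keys:
            return 0.0
        # Recall by similarity: softmax-weighted average of stored values.
        sims = np.array([key @ k for k in self.keys])
        weights = np.exp(sims - sims.max())
        weights /= weights.sum()
        return float(weights @ np.array(self.values))

    def observe(self, key, value):
        surprise = abs(value - self.predict(key))
        if surprise > self.threshold:  # unexpected, so worth remembering
            self.keys.append(key)
            self.values.append(value)
        return surprise

mem = SurpriseMemory()
rng = np.random.default_rng(1)
for _ in range(200):
    k = rng.normal(size=4)
    mem.observe(k, float(np.tanh(k.sum())))
print(f"{len(mem.keys)} surprising events kept out of 200")
```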
HOPE — The Model That Learns to Learn
The paper introduces a model called HOPE (Hierarchical Optimizing Processing Ensemble), and it ties everything together. HOPE is basically the first concrete example of Nested Learning in action. It builds on the earlier Titans architecture, which was already designed for smart memory management: storing "surprising" experiences and forgetting the rest. But Titans could only update itself at two levels, so even though its memory was adaptive, its way of learning stayed fixed.
HOPE takes that concept and adds self-modification. It doesn't just store knowledge; it rewrites how it learns based on what it experiences. The more it learns, the better it gets at learning itself. That's what makes it "hierarchical": every layer is optimizing the one below it, creating an ongoing loop of self-improvement.
When I read that, it felt like looking at a prototype for true adaptive intelligence. HOPE doesn’t just grow its memory; it evolves its own way of thinking. It’s almost like the model is building its own brain architecture in real time — guided only by feedback and surprise.
CMS — Learning Across Multiple Memory Speeds
One of the coolest parts of the paper was the Continuum Memory System (CMS). This idea clicked with me right away because it’s inspired by how our brain manages memory at different speeds. We have fast, short-term memory for reacting to the moment, and slower, long-term memory for storing what truly matters. CMS brings that same principle to AI.
In HOPE, CMS creates layers of memory that operate on different time scales. Fast memory reacts instantly to new data, slow memory holds onto stable knowledge, and middle layers balance both. The system learns what to keep, what to adapt, and what to forget — automatically.
This makes the model more flexible and less likely to “forget” old knowledge when it learns something new, solving a big problem in continual learning. For me, CMS felt like giving the model an actual sense of time — letting it learn short-term lessons without losing its long-term wisdom. It’s memory that grows, refines, and stays balanced, just like ours.
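If I had to sketch the flavor of CMS in a few lines of Python, it would be a stack of running averages, one per timescale. Again, this is my simplification under my own assumptions, not the paper's actual mechanism:

```python
import numpy as np

class ContinuumMemory:
    """Toy continuum of memories: one running average per timescale.

    Fast layers track the latest signal closely; slow layers move only
    a little each step, so older knowledge persists.
    """

    def __init__(self, dim, rates=(0.5, 0.05, 0.005)):
        self.rates = rates                        # fast -> slow
        self.layers = [np.zeros(dim) for _ in rates]

    def update(self, signal):
        for i, rate in enumerate(self.rates):
            # Each layer is an exponential moving average at its own speed.
            self.layers[i] = (1 - rate) * self.layers[i] + rate * signal

cms = ContinuumMemory(dim=1)
for _ in range(200):
    cms.update(np.array([1.0]))    # a long, stable regime
for _ in range(5):
    cms.update(np.array([-1.0]))   # a brief distribution shift
# The fast layer has flipped to the new regime; the slow layer still
# remembers the old one, so nothing is catastrophically forgotten.
print([round(float(layer[0]), 3) for layer in cms.layers])
```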
From Titans to HOPE — How the Architecture Evolved
Before HOPE came along, there was the Titans architecture, which was already an interesting idea. Titans worked like a long-term memory system for AI — it didn’t try to remember everything, only what was surprising. Whenever the model saw something that didn’t match its expectations, it marked that as “important” and stored it. This made Titans good at keeping rare or unexpected experiences while forgetting routine ones, kind of like how our brain remembers unusual events more vividly than daily habits.
But Titans had a big limitation — it could only learn at two levels. It could store knowledge (Level 1) and slightly adjust how it stored it (Level 2), but it couldn’t modify its own learning process. It was stuck with fixed update rules, so even though its memory was smart, its way of learning stayed static.
That’s where HOPE came in as the next step. HOPE keeps Titans’ “surprise-based memory” but adds self-modification, meaning it can change how it learns over time. Instead of just remembering, it reflects on how it learned and improves that process.
In simple terms:
- Titans learns what to remember.
- HOPE learns how to learn better next time.
This shift from reactive memory (Titans) to reflective learning (HOPE) is what made the architecture truly recursive — an AI that can not only adapt but evolve its own learning rules.
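Here is how I would caricature that shift in code. Both versions below are invented for illustration: the Titans-style rule writes through a fixed gate, while the HOPE-style one treats the gate itself as state it can rewrite.

```python
def titans_style_write(memory, x, surprise, gate=0.5, lr=0.1):
    """Titans-flavored: the write rule is fixed by hand."""
    if surprise > gate:             # the gate never changes
        memory = memory + lr * x
    return memory

class HopeStyleWriter:
    """HOPE-flavored: the write rule is state the model can rewrite."""

    def __init__(self, gate=0.5, lr=0.1):
        self.gate, self.lr = gate, lr   # the rule itself is now parameters
        self.recent = []

    def write(self, memory, x, surprise):
        if surprise > self.gate:
            memory = memory + self.lr * x
        # Self-modification: track how often the gate opens and, every
        # 20 observations, rewrite the gate instead of waiting for a
        # human to retune it.
        self.recent.append(surprise > self.gate)
        if len(self.recent) == 20:
            rate = sum(self.recent) / 20
            if rate > 0.5:
                self.gate *= 1.1    # writing too often: tighten
            elif rate < 0.1:
                self.gate *= 0.9    # writing too rarely: loosen
            self.recent = []
        return memory
```

The specific heuristic doesn't matter; what matters is that in the second version the rule lives inside the model's state, where learning can reach it.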
My Takes on the Paper
This paper made me rethink what “deep learning” really means. It’s not just about adding layers — it’s about adding levels of learning. I liked how it pushed the idea that intelligence should evolve, not just perform.
What stood out to me was the mindset shift. Instead of models that just learn tasks, it showed a system that learns how to learn better. That’s the kind of loop real intelligence needs — self-awareness in its own process.
I also liked the balance between fast and slow learning. It reminded me of how humans think — reacting quickly to new events while slowly refining long-term habits. The whole idea felt less like training a model and more like nurturing an evolving mind.