We train a model on handwritten digit classification. 99% accuracy. Then we train the same model on a new task — say, fashion item recognition. We go back and test it on digits. 34% accuracy. It has completely forgotten. Not gradually, not partially — almost entirely.
What Just Happened?
We trained a CNN on MNIST digits — 99.2% accuracy. After fine‑tuning on Fashion MNIST, it reached 91.1% accuracy. But when re‑evaluated on MNIST, accuracy collapsed to 33.9%. This collapse is catastrophic forgetting: the model’s weights shifted to optimize for the new task, erasing the old solution.
Why did training on more data make the model worse at something it already knew?
MNIST is handwritten digits (0–9). Fashion MNIST is clothing items like shirts and shoes. Both are 28×28 grayscale images, but the tasks are distinct.
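The same sequence (train on A, train on B, re-test on A) can be reproduced in a few lines on a toy problem. The sketch below is not the article's CNN and does not use MNIST; it is a hypothetical stand-in that assumes a two-weight logistic classifier, a synthetic grid of 2D points, and two made-up tasks (Task A labels by the first feature, Task B by the second). The helper names `make_grid`, `train`, and `accuracy` are illustrative only.

```python
import math

def make_grid(n=20):
    """Return n*n points on a symmetric grid in [-1, 1]^2 (no point lies on an axis)."""
    coords = [(k + 0.5) / (n / 2) - 1.0 for k in range(n)]
    return [(x0, x1) for x0 in coords for x1 in coords]

def logit(w, x):
    """Linear score of a two-weight model with no bias term."""
    return w[0] * x[0] + w[1] * x[1]

def train(w, points, labels, steps=300, lr=1.0):
    """Full-batch gradient descent on the logistic loss; returns the updated weights."""
    w = list(w)
    for _ in range(steps):
        grad = [0.0, 0.0]
        for x, y in zip(points, labels):
            p = 1.0 / (1.0 + math.exp(-logit(w, x)))
            grad[0] += (p - y) * x[0]
            grad[1] += (p - y) * x[1]
        w = [wi - lr * gi / len(points) for wi, gi in zip(w, grad)]
    return w

def accuracy(w, points, labels):
    """Fraction of points whose predicted class (score > 0) matches the label."""
    correct = sum((logit(w, x) > 0) == (y == 1) for x, y in zip(points, labels))
    return correct / len(points)

points = make_grid()
labels_a = [1 if x0 > 0 else 0 for x0, _ in points]  # Task A: depends on feature 0
labels_b = [1 if x1 > 0 else 0 for _, x1 in points]  # Task B: depends on feature 1

# Train on Task A, then keep training the SAME weights on Task B only.
w = train([0.0, 0.0], points, labels_a)
acc_a_before = accuracy(w, points, labels_a)
w = train(w, points, labels_b)
acc_a_after = accuracy(w, points, labels_a)
acc_b_after = accuracy(w, points, labels_b)

print(f"Task A after training on A: {acc_a_before:.1%}")
print(f"Task B after training on B: {acc_b_after:.1%}")
print(f"Task A after training on B: {acc_a_after:.1%}")
```

Nothing in the second call to `train` refers to Task A, so the weight that carried Task A is free to shrink while the model fits Task B.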
Why Does It Happen?
The core issue is that the model relies on the same set of weights for both tasks. There is no separation or dedicated memory; every parameter is shared. When training shifts from Task A (MNIST digits) to Task B (Fashion MNIST), gradient descent simply minimizes the loss on the data it sees at that moment. It has no awareness that Task A ever existed.
In the loss landscape, imagine two parabolic bowls: one for Task A and one for Task B. The optimum for Task A lies at $\theta_A^*$, while Task B's optimum is at $\theta_B^*$. As training on Task B progresses, the weights move towards $\theta_B^*$. This movement inevitably raises the loss for Task A because its minimum is left behind.
The root cause is the shared weight space. Gradient descent is a stateless optimizer; it only follows the current gradient signal. In this picture the minima for Task A and Task B are far apart, so no single weight configuration $\theta$ sits at both optima, and nothing in sequential training steers the weights toward a compromise between them. This is why catastrophic forgetting occurs.
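The two-bowl picture can be run directly. The sketch below assumes a single scalar weight `theta` and two quadratic losses whose optima, `THETA_A_STAR` and `THETA_B_STAR`, are arbitrary values chosen for illustration; real loss landscapes are high-dimensional and not quadratic.

```python
THETA_A_STAR = -2.0  # hypothetical optimum for Task A
THETA_B_STAR = 3.0   # hypothetical optimum for Task B

def loss(theta, optimum):
    """Parabolic bowl centred on `optimum`."""
    return (theta - optimum) ** 2

def grad(theta, optimum):
    """Derivative of the bowl with respect to theta."""
    return 2.0 * (theta - optimum)

def descend(theta, optimum, steps=100, lr=0.1):
    """Plain gradient descent; it only ever sees the bowl it is given."""
    for _ in range(steps):
        theta -= lr * grad(theta, optimum)
    return theta

# Phase 1: train on Task A only.
theta = descend(0.0, THETA_A_STAR)
loss_a_before = loss(theta, THETA_A_STAR)

# Phase 2: train on Task B only, starting from the Task A solution.
theta = descend(theta, THETA_B_STAR)
loss_a_after = loss(theta, THETA_A_STAR)
loss_b_after = loss(theta, THETA_B_STAR)

print(f"Task A loss after training on A: {loss_a_before:.6f}")
print(f"Task A loss after training on B: {loss_a_after:.6f}")
print(f"Task B loss after training on B: {loss_b_after:.6f}")
```

Once `theta` has settled at Task B's optimum, the Task A loss equals the squared distance between the two optima, so the farther apart the bowls are, the more is forgotten.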
Weight space can be visualized as an N-dimensional space, where each axis corresponds to one parameter. Every point in this space represents a full set of weights, i.e., a complete model. The loss function defines a landscape over this space, and training is the process of moving through it along gradient directions.
Why "Catastrophic"?
Human memory forgets gradually and selectively; you forget less-used memories slowly over time, and the forgetting is usually partial. Neural networks, by contrast, forget catastrophically — performance can collapse within just a few epochs, dropping from near‑perfect accuracy (e.g., 99%) to far lower levels (e.g., 34%) in a single training run. This forgetting is non‑selective: the entire task distribution degrades, not just edge cases.
The underlying tension is the stability–plasticity dilemma. High plasticity allows fast learning of new tasks but destabilizes old ones, leading to catastrophic forgetting. Low plasticity preserves prior knowledge but prevents effective learning of new tasks. Continual learning research seeks to balance this trade‑off.
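A rough way to see the trade-off is to treat the learning rate on the new task as a stand-in for plasticity. The sketch below assumes the same kind of toy setting as the two-bowl picture: one scalar weight, quadratic losses, and arbitrary optima (`THETA_A_STAR`, `THETA_B_STAR`). It starts from the Task A solution, trains on Task B for a fixed number of steps, and reports both losses. The mapping from learning rate to plasticity is an illustrative assumption, not a standard definition.

```python
THETA_A_STAR = -2.0  # hypothetical optimum for the old task
THETA_B_STAR = 3.0   # hypothetical optimum for the new task

def train_on_b(theta, lr, steps=10):
    """Gradient descent on the Task B bowl (theta - THETA_B_STAR)^2 only."""
    for _ in range(steps):
        theta -= lr * 2.0 * (theta - THETA_B_STAR)
    return theta

results = []
for lr in (0.0, 0.01, 0.05, 0.2):  # low to high plasticity
    theta = train_on_b(THETA_A_STAR, lr)
    loss_a = (theta - THETA_A_STAR) ** 2  # how much of Task A was lost
    loss_b = (theta - THETA_B_STAR) ** 2  # how much of Task B is still unlearned
    results.append((lr, loss_a, loss_b))
    print(f"lr={lr:<5} Task A loss={loss_a:7.3f}  Task B loss={loss_b:7.3f}")
```

With a learning rate of zero the old task is perfectly preserved and the new one is never learned; as the learning rate grows, the Task B loss falls and the Task A loss rises.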
How Do We Measure It?
The standard evaluation protocol in continual learning is simple but powerful. We train on Task 1 and record accuracy $R_{1,1}$. Then we train on Task 2 and record accuracy on both tasks, $R_{2,1}$ and $R_{2,2}$. We continue for $N$ tasks. This produces an accuracy matrix $R$, where each entry $R_{i,j}$ is the accuracy on task $j$ after training on task $i$.
From this matrix, we define Backward Transfer (BWT):
$$\mathrm{BWT} = \frac{1}{N-1} \sum_{j=1}^{N-1} \left( R_{N,j} - R_{j,j} \right)$$
A negative BWT means forgetting: performance on earlier tasks has degraded, and the more negative the value, the more severe the loss. In our demo, accuracy on Task A dropped from 99.2% to 33.9% after training Task B, giving a strongly negative BWT of -65.3 percentage points.
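As a worked check of that number, here is a minimal sketch that computes BWT from an accuracy matrix. It assumes the matrix is stored as a list of lists `R`, where `R[i][j]` is the accuracy on task `j` after training on task `i` (zero-based, so Task A is index 0), and it averages the change on every earlier task between the moment it was learned and the end of training. The function name `backward_transfer` is illustrative. The Fashion MNIST accuracy before fine-tuning is not reported in the article, so that entry is left as `None`; BWT does not use it.

```python
def backward_transfer(R):
    """Average of R[last][j] - R[j][j] over all tasks j except the last one."""
    n = len(R)
    if n < 2:
        raise ValueError("BWT needs at least two tasks")
    last = n - 1
    return sum(R[last][j] - R[j][j] for j in range(last)) / (n - 1)

# Accuracies (in %) from the demo: rows = after training on A, then on B;
# columns = evaluated on A (MNIST), B (Fashion MNIST).
R = [
    [99.2, None],  # after Task A; Task B accuracy at this point is not reported
    [33.9, 91.1],  # after Task B
]

bwt = backward_transfer(R)
print(f"BWT = {bwt:.1f} percentage points")
```

With only two tasks the average has a single term, 33.9 - 99.2, which is the drop quoted above.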
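There is also Forward Transfer (FWT), which asks the question in the other direction: does having learned Task A give the model a better starting point on Task B, before it has been trained on Task B at all? It is less central to catastrophic forgetting, but worth noting for completeness.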
Why Not Just Retrain?
The first instinct is obvious: why not just keep all the data and retrain the model from scratch every time new data arrives? It sounds simple, but in practice it is often infeasible.
Privacy and regulation: In domains like medical imaging, regulations such as GDPR and HIPAA restrict how long personal data can be kept and how it can be reused. Once a patient’s scan has been used, it may need to be deleted. You can’t just “keep everything.”
Storage limits: Modern datasets are massive, and they keep growing as new data arrives. Retaining every sample forever is often infeasible.
Compute cost: Retraining a large model from scratch each time is prohibitively expensive. Imagine re‑training GPT‑scale models weekly; the cost would be enormous.
Streaming data: Some data exists only in a stream (sensor feeds, financial ticks) and is never stored. If you don’t learn it as it arrives, it’s gone.
Real‑time systems: In robotics, finance, or autonomous driving, models must adapt continuously without pausing for full retraining.
This is why continual learning exists as a distinct field: to make adaptation possible under these constraints. The open question is the one that drives the research forward: can we teach a network to learn new things without forgetting old ones, the way humans do?
Try it yourself
Curious to see catastrophic forgetting in action?
Explore the full code on GitHub





Top comments (13)
Great article - the mechanics (maths) are way over my head, but I get the general idea - LLMs don't have the inherent "biological plasticity" that our brain has ...
Simple (stupid?) question - can't we "solve" (or work around) the problem by creating an "archive copy" of the model, after training it on dataset A and achieving that perfect score?
Then we proceed with further training, and find out that the performance for dataset A has 'catastrophically' dropped, but no worries - we still have that archive copy!
Or is that too simple (I expect it is)?
Thanks for reading and for the wonderful question. 😁
You're right that simply keeping an archive copy does preserve the old model, but it doesn't solve the core problem. We want a single model which can keep learning without discarding what it already knows. In real environments, the data distribution often changes - for example, an autonomous vehicle trained in urban traffic must later adapt to harsh rural conditions without forgetting how to drive in the city.
Copying checkpoints is more like rolling back than true continual learning. I didn’t go into prevention strategies in this article, but my next piece will give an overview of the landscape so stay tuned! 💙
Thanks, I see what you mean, however, in case of:
"for example, an autonomous vehicle trained in urban traffic must later adapt to harsh rural conditions without forgetting how to drive in the city"
... you could say, keep the 'archive' model (trained in the city) for usage when you're in the city, and then when you move to the countryside you switch to the 'rural' model ;-)
(kind of like specialized models, instead of "one size fits all")
But I admit that that might get unwieldy, and might not be practical :-)
P.S. let's not forget that what modern LLMs are capable of is already insanely impressive, it's doing things we wouldn't have dreamed of 5 years ago
Thanks for the thoughtful follow‑up 🙏
You're right that a practical workaround is to keep specialized models, e.g. an “urban” model and a “rural” model, and switch between them depending on context. In fact, that’s essentially what task‑specific models achieve today. The challenge (and what makes catastrophic forgetting such a research focus) is that in many real‑world deployments we’d prefer a single evolving model that can adapt continuously without losing prior skills. Managing multiple archives quickly gets unwieldy, especially if the environment shifts in subtle or overlapping ways.
And I completely agree that what modern LLMs and deep learning systems can already do is astonishing compared to just a few years ago. The fact that we’re now debating how to make them retain knowledge across tasks shows how far the field has come.
"we’d prefer a single evolving model that can adapt continuously without losing prior skills" - holy grail of "AGI"?
I'm also in awe of people who are able to keep up with all these AI developments, and see the forest for the trees - it's a daunting field, moving at breakneck speed ...
Did you study (and do you understand) the theoretical/mathematical underpinnings of neural networks and LLMs? I guess you did/do ...
The bigger vision (and challenge) is to build an evolving model, though AGI is much more than that. Continual learning focuses on preventing forgetting and adapting to new tasks, while AGI aims for human‑like general intelligence across domains. So, maybe it's one of the building blocks? 🤔
I'm also amazed at how fast the field is moving. My own focus has been on understanding the concepts deeply enough to explain them clearly, and I often find it both fun and daunting. 😁
Interesting! Thanks for sharing.
Appreciate you taking the time to read it 🙌
Catastrophic forgetting is one of those deceptively simple but deep challenges in neural networks. I’ll be following up soon with an overview of strategies researchers use to tackle it, so I’d love to hear your thoughts when that’s out.
Nice, clear walk-through — the loss-landscape framing makes the "why" click in a way the usual hand-wave about "overwriting weights" never does.
What struck me is how cleanly this rhymes with a problem one layer up, at runtime. When you run a long-lived agent, you get a behavioral version of catastrophic forgetting: the "shared weight space" becomes the shared context window, and as new tool results and summaries pile in, the original goal gets optimized away the same way Task A does — no gradient step required, just attention dilution.
And the fixes rhyme too. Parameter isolation ≈ giving the agent an external memory store so old facts don't compete for the same surface. Rehearsal/replay ≈ periodically re-injecting the goal and key decisions so they don't drift out. EWC's "protect the weights that mattered" even maps to pinning certain memories as immutable.
A question for the training side: past two tasks, does replay or a regularizer like EWC hold up better in your experience? My intuition is replay scales worse on storage but degrades more gracefully — curious whether the data agrees.
Thanks for reading the article! I'm glad you liked it. I really like the way you framed the agent runtime issue as “behavioral catastrophic forgetting”. The analogy is great.
To answer your question: replay-based approaches scale poorly in storage (because they need stored exemplars), but they degrade more gracefully as tasks accumulate than regularizers like EWC. The reasoning is that in replay the model keeps some anchor points (exemplars) for past distributions, so accuracy drops gradually. EWC can protect the model from forgetting for a few tasks, but as more tasks are added you can see a sharp decline in accuracy on older tasks as the new landscape diverges.
In practice, a hybrid approach like iCaRL often works better. It stores small exemplar sets (like the replay approach) and uses regularisation/distillation loss terms (like the regularisation approach).
The line doing the work here is "no awareness that Task A ever existed." Every fix for this puts that awareness back, and they differ by where they store it. Regularization like EWC stores it in the loss: measure which weights mattered for A with Fisher information, then penalize moving them. Replay stores it in the data: keep feeding A's examples so the gradient never stops seeing them. Adapters and LoRA store it in parameters: give B its own weights so the two optima stop fighting over the same θ. Same trade billed in three currencies, storage, compute, or parameter count. Which leaves one real question for a practitioner: which of those three can you afford to spend on remembering.
Thanks for reading my article! Yeah, I agree with you. There are a lot of continual learning strategies, and it all depends on what we can afford to spend on. Love the way you framed it as awareness being reintroduced in different forms.