The Anatomy of Catastrophic Forgetting

Saptarshi Sarkar on June 10, 2026

We train a model on handwritten digit classification. 99% accuracy. Then we train the same model on a new task — say, fashion item recognition. We ...
 
leob

Great article - the mechanics (maths) are way over my head, but I get the general idea - LLMs don't have the inherent "biological plasticity" that our brain has ...

Simple (stupid?) question - can't we "solve" (or work around) the problem by creating an "archive copy" of the model, after training it on dataset A and achieving that perfect score?

Then we proceed with further training, and find out that the performance for dataset A has 'catastrophically' dropped, but no worries - we still have that archive copy!

Or is that too simple (I expect it is)?

Saptarshi Sarkar

Thanks for reading and for the wonderful question. 😁

You're right that simply keeping an archive copy does preserve the old model, but it doesn't solve the core problem. We want a single model which can keep learning without discarding what it already knows. In real environments, the data distribution often changes - for example, an autonomous vehicle trained in urban traffic must later adapt to harsh rural conditions without forgetting how to drive in the city.

Copying checkpoints is more like rolling back than true continual learning. I didn’t go into prevention strategies in this article, but my next piece will give an overview of the landscape so stay tuned! 💙

leob

Thanks, I see what you mean, however, in case of:

"for example, an autonomous vehicle trained in urban traffic must later adapt to harsh rural conditions without forgetting how to drive in the city"

... you could say, keep the 'archive' model (trained in the city) for usage when you're in the city, and then when you move to the countryside you switch to the 'rural' model ;-)

(kind of like specialized models, instead of "one size fits all")

But I admit that that might get unwieldy, and might not be practical :-)

P.S. let's not forget that what modern LLMs are capable of is already insanely impressive, it's doing things we wouldn't have dreamed of 5 years ago

Saptarshi Sarkar

Thanks for the thoughtful follow‑up 🙏
It's true that a practical workaround is to keep specialized models, e.g. an “urban” model and a “rural” model, and switch between them depending on context. In fact, that’s essentially what task‑specific models achieve today. The challenge (and what makes catastrophic forgetting such a research focus) is that in many real‑world deployments we’d prefer a single evolving model that can adapt continuously without losing prior skills. Managing multiple archives quickly gets unwieldy, especially if the environment shifts in subtle or overlapping ways.
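
As a rough sketch of that "archive and switch" workaround, the snippet below snapshots a toy model after each training phase and looks up a specialist by context. Everything here is illustrative: the dict-of-weights "model", the `archive` and `pick_model` helpers, and the context labels are assumptions, not anything from the article.

```python
import copy


def archive(model):
    """Freeze a snapshot of the model so later training cannot change it."""
    return copy.deepcopy(model)


def pick_model(registry, context):
    """Return the specialist registered for this context (KeyError if none)."""
    return registry[context]


# A "model" here is just a dict of weights (toy stand-in for a real network).
model = {"w": [0.2, 0.8]}             # pretend this was trained on city data
registry = {"urban": archive(model)}

model["w"][0] = 0.9                   # pretend further training on rural data
registry["rural"] = archive(model)

urban_weights = pick_model(registry, "urban")["w"]   # still [0.2, 0.8]
```

The lookup only works when the context matches a stored label exactly. A suburb that is partly urban and partly rural has no entry, which is the "subtle or overlapping" case where a set of archives stops being enough.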

And I completely agree that what modern LLMs and deep learning systems can already do is astonishing compared to just a few years ago. The fact that we’re now debating how to make them retain knowledge across tasks shows how far the field has come.

leob

"we’d prefer a single evolving model that can adapt continuously without losing prior skills" - holy grail of "AGI"?

I'm also in awe of people who are able to keep up with all these AI developments, and see the forest for the trees - it's a daunting field, moving at breakneck speed ...

Did you study (and do you understand) the theoretical/mathematical underpinnings of neural networks and LLMs? I guess you did/do ...

Saptarshi Sarkar

The bigger vision (and challenge) is to build an evolving model, though AGI is much more than that. Continual learning focuses on preventing forgetting and adapting to new tasks, while AGI aims for human‑like general intelligence across domains. So, maybe it's one of the building blocks? 🤔

I also wonder at how fast the field is moving. My own focus has been on understanding the concepts deeply enough to explain them clearly, and I often find it both fun and daunting. 😁

Marco Dev

Interesting! Thanks for sharing.

Saptarshi Sarkar

Appreciate you taking the time to read it 🙌
Catastrophic forgetting is one of those deceptively simple but deep challenges in neural networks. I’ll be following up soon with an overview of strategies researchers use to tackle it, so I’d love to hear your thoughts when that’s out.

Max Quimby

Nice, clear walk-through — the loss-landscape framing makes the "why" click in a way the usual hand-wave about "overwriting weights" never does.

What struck me is how cleanly this rhymes with a problem one layer up, at runtime. When you run a long-lived agent, you get a behavioral version of catastrophic forgetting: the "shared weight space" becomes the shared context window, and as new tool results and summaries pile in, the original goal gets optimized away the same way Task A does — no gradient step required, just attention dilution.

And the fixes rhyme too. Parameter isolation ≈ giving the agent an external memory store so old facts don't compete for the same surface. Rehearsal/replay ≈ periodically re-injecting the goal and key decisions so they don't drift out. EWC's "protect the weights that mattered" even maps to pinning certain memories as immutable.

A question for the training side: past two tasks, does replay or a regularizer like EWC hold up better in your experience? My intuition is replay scales worse on storage but degrades more gracefully — curious whether the data agrees.
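
The "pin the goal" idea above can be sketched in a few lines. This only illustrates the analogy and is not how any particular agent framework manages context; `build_context` and its item-count budget are hypothetical (a real system would count tokens rather than items).

```python
def build_context(pinned, history, max_items):
    """Return the pinned items plus the most recent history that fits.

    pinned:    items that must never be dropped (e.g. the original goal).
    history:   chronological list of later items (tool results, summaries).
    max_items: total number of items allowed (stand-in for a token budget).
    """
    if len(pinned) > max_items:
        raise ValueError("pinned items alone exceed the context budget")
    room = max_items - len(pinned)
    # Guard against room == 0: history[-0:] would return the whole list.
    recent = history[-room:] if room > 0 else []
    return list(pinned) + recent


context = build_context(
    pinned=["GOAL: migrate the database"],
    history=["tool result 1", "tool result 2", "summary", "tool result 3"],
    max_items=3,
)
```

Old history is what gets evicted, never the goal, so new material cannot crowd the goal out of the window.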

Saptarshi Sarkar

Thanks for reading the article! I'm glad you liked it. I really like the way you framed the agent runtime issue as “behavioral catastrophic forgetting”. The analogy is great.
To answer your question: replay-based approaches scale poorly in storage (they need to keep exemplars), but they degrade more gradually as tasks accumulate than regularizers like EWC do. The reason is that with replay, the model keeps some anchor points (exemplars) from past distributions, so accuracy drops gradually. EWC, by contrast, can protect the model from forgetting for a few tasks, but once you go further you can see a sharp decline in accuracy on older tasks as the new landscape diverges.
In practice, hybrid approaches like iCaRL tend to work better. iCaRL stores small exemplar sets (as in replay) and uses regularisation/distillation loss terms (as in the regularisation approach).
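
To illustrate the storage side of that trade-off, here is a minimal sketch of a fixed-budget exemplar memory: the total budget stays constant and each class's share shrinks as new classes arrive, which is the memory-management idea iCaRL uses. It is a simplification. iCaRL picks exemplars by herding and adds a distillation loss, while this sketch just keeps the first items it is given. The class and method names are made up for illustration.

```python
class ExemplarMemory:
    """Fixed-size store of past examples, split evenly across classes."""

    def __init__(self, budget):
        self.budget = budget   # total number of exemplars allowed
        self.sets = {}         # class label -> list of stored examples

    def add_class(self, label, examples):
        """Store exemplars for a class, shrinking older sets to make room."""
        n_classes = len(self.sets) + (label not in self.sets)
        per_class = self.budget // n_classes
        for key in self.sets:
            # Assumption: each list is ordered by priority, so keep the head.
            self.sets[key] = self.sets[key][:per_class]
        self.sets[label] = list(examples)[:per_class]

    def replay_batch(self):
        """Return all stored (example, label) pairs to mix into training."""
        return [(x, label) for label, xs in self.sets.items() for x in xs]


memory = ExemplarMemory(budget=6)
memory.add_class("digit_0", range(10))   # 6 exemplars kept
memory.add_class("digit_1", range(10))   # 3 per class
memory.add_class("sneaker", range(10))   # 2 per class
```

Each old class keeps fewer anchor points as tasks accumulate, which fits the gradual (rather than sudden) accuracy drop described above.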

Theo Valmis

The line doing the work here is "no awareness that Task A ever existed." Every fix for this puts that awareness back, and they differ by where they store it. Regularization like EWC stores it in the loss: measure which weights mattered for A with Fisher information, then penalize moving them. Replay stores it in the data: keep feeding A's examples so the gradient never stops seeing them. Adapters and LoRA store it in parameters: give B its own weights so the two optima stop fighting over the same θ. Same trade billed in three currencies: storage, compute, or parameter count. Which leaves one real question for a practitioner: which of those three can you afford to spend on remembering?
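
To make the "stored in the loss" case concrete, here is a minimal sketch of the EWC penalty in plain Python. It assumes a diagonal Fisher estimate built by averaging squared per-example gradients of the Task A log-likelihood (the common empirical approximation). The function names, the toy gradients, and the value of `lam` are illustrative and not taken from the article.

```python
def diagonal_fisher(per_example_grads):
    """Estimate diagonal Fisher information as the mean squared gradient.

    per_example_grads: one gradient vector (list of floats) per Task A
    example, evaluated at the Task A optimum. Hypothetical input format.
    """
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    return [sum(g[i] ** 2 for g in per_example_grads) / n for i in range(dim)]


def ewc_penalty(theta, theta_a, fisher, lam):
    """Quadratic penalty (lam / 2) * sum_i F_i * (theta_i - theta_a_i)^2."""
    return 0.5 * lam * sum(
        f * (t - ta) ** 2 for f, t, ta in zip(fisher, theta, theta_a)
    )


# Toy Task A gradients: the first weight has large gradients (it mattered),
# the second has small ones (it is cheap to move).
grads_a = [[2.0, 0.1], [-2.0, -0.1]]
fisher = diagonal_fisher(grads_a)
theta_a = [1.0, 1.0]

# Moving the important weight by 0.5 costs far more than moving the other.
cost_important = ewc_penalty([1.5, 1.0], theta_a, fisher, lam=1.0)
cost_unimportant = ewc_penalty([1.0, 1.5], theta_a, fisher, lam=1.0)
```

During Task B training this penalty would be added to the Task B loss, so the optimizer pays a price for moving weights that had high Fisher values on Task A.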

Saptarshi Sarkar

Thanks for reading my article! Yeah, I agree with you. There are a lot of continual learning strategies, and it all depends on what we can afford to spend. Love the way you framed it as awareness being reintroduced in different forms.