<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alex Mercer</title>
    <description>The latest articles on DEV Community by Alex Mercer (@fourwheels2512).</description>
    <link>https://dev.to/fourwheels2512</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3784948%2F5d9cf3b7-9bc9-4914-a2fa-63323f7446bb.png</url>
      <title>DEV Community: Alex Mercer</title>
      <link>https://dev.to/fourwheels2512</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/fourwheels2512"/>
    <language>en</language>
    <item>
      <title>Catastrophic Forgetting by Language models.</title>
      <dc:creator>Alex Mercer</dc:creator>
      <pubDate>Fri, 27 Feb 2026 18:03:30 +0000</pubDate>
      <link>https://dev.to/fourwheels2512/catastrophic-forgetting-by-language-models-24ai</link>
      <guid>https://dev.to/fourwheels2512/catastrophic-forgetting-by-language-models-24ai</guid>
      <description>&lt;p&gt;To all the awesome experts in AI/ML out there. i need a favor. &lt;br&gt;
I realized there is a gap in Language Models (SLMs/LLMs) remembering the data continuously which is termed as 'catastrophic forgetting'.&lt;/p&gt;

&lt;p&gt;To address this, I came up with an adapter called the Constrained Residual Mixing Adapter (CRMA) that enables continual learning. I tested it on TinyLlama 1.1B and Mistral 7B; the result was -0.1% average drift across 4 sequential domains. Essentially zero forgetting.&lt;/p&gt;
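
&lt;p&gt;The post doesn't describe CRMA's internals, so the following is a purely hypothetical NumPy sketch of what a "constrained residual mixing" adapter could look like: a low-rank adapter branch mixed into the frozen base activation through a gate that is hard-capped to a small range, so the adapter can never overwrite the base representation. Every name here (CRMAdapter, max_mix) is my own illustration, not the author's code.&lt;/p&gt;

```python
import numpy as np

class CRMAdapter:
    """Hypothetical constrained residual mixing adapter (illustration only).

    Output: (1 - g) * x + g * adapter(x), where the mixing gate g is
    hard-constrained to (0, max_mix] so the frozen base signal always
    dominates the residual stream.
    """

    def __init__(self, dim, rank=8, max_mix=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.down = rng.normal(0.0, 0.02, (dim, rank))  # trainable down-projection
        self.up = np.zeros((rank, dim))                 # zero-init up-projection
        self.gate_logit = 0.0                           # scalar gate parameter
        self.max_mix = max_mix

    def mix(self):
        # Sigmoid squashes the logit into (0, 1); scaling caps the mix at max_mix.
        return self.max_mix / (1.0 + np.exp(-self.gate_logit))

    def __call__(self, x):
        g = self.mix()
        delta = np.tanh(x @ self.down) @ self.up        # low-rank adapter branch
        return (1.0 - g) * x + g * delta

adapter = CRMAdapter(dim=16)
x = np.ones((2, 16))
y = adapter(x)
# With up zero-initialised, delta is 0, so y = (1 - g) * x: the adapter
# starts as a slightly damped identity and can only move within max_mix.
```

&lt;p&gt;The point of the hard cap is that, unlike an unconstrained LoRA-style update, the adapter's worst case is bounded by construction, which is one plausible route to the near-zero drift claimed above.&lt;/p&gt;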

&lt;p&gt;CRMA: -0.1% drift. Naive: +351% forgetting. Same model, same data, same hardware.&lt;/p&gt;

&lt;p&gt;This holds at both 1.1B and 7B, with no replay buffer, no EWC (Elastic Weight Consolidation), and no knowledge distillation.&lt;br&gt;
● CRMA Modular vs. Naive, Mistral 7B (4 sequential domains)&lt;/p&gt;

&lt;p&gt;┌─────────┬────────────┬──────────────────┐&lt;br&gt;
  │  Task   │ CRMA Drift │ Naive Forgetting │&lt;br&gt;
  ├─────────┼────────────┼──────────────────┤&lt;br&gt;
  │ Medical │ -0.2%      │ +228%            │&lt;br&gt;
  ├─────────┼────────────┼──────────────────┤&lt;br&gt;
  │ Legal   │ -0.1%      │ +593%            │&lt;br&gt;
  ├─────────┼────────────┼──────────────────┤&lt;br&gt;
  │ Code    │ -0.1%      │ +233%            │&lt;br&gt;
  ├─────────┼────────────┼──────────────────┤&lt;br&gt;
  │ Finance │ +0.0%      │ —                │&lt;br&gt;
  ├─────────┼────────────┼──────────────────┤&lt;br&gt;
  │ Average │ -0.1%      │ +351%            │&lt;br&gt;
  └─────────┴────────────┴──────────────────┘&lt;/p&gt;
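
&lt;p&gt;For anyone who wants to sanity-check numbers like these: the drift/forgetting percentages above read like relative changes in a per-task eval metric measured right after learning a task versus after all subsequent training. A minimal sketch under that assumption follows; the metric, task names, and numbers are illustrative placeholders, not the author's data.&lt;/p&gt;

```python
def drift_pct(before, after):
    """Relative change (%) in a per-task eval metric such as loss or
    perplexity; positive means the task got worse after later training."""
    return 100.0 * (after - before) / before

# Illustrative numbers only: eval loss per task right after learning it
# (before) vs. after all sequential domains have been trained (after).
tasks = {
    "medical": (2.10, 2.09),  # slight improvement -> negative drift
    "legal":   (1.80, 1.81),
    "code":    (1.50, 1.52),
}
drifts = {name: drift_pct(b, a) for name, (b, a) in tasks.items()}
avg_drift = sum(drifts.values()) / len(drifts)
```

&lt;p&gt;Under this reading, "+351% forgetting" would mean the naive baseline's eval metric roughly quadrupled on earlier tasks, while CRMA's stayed within a fraction of a percent.&lt;/p&gt;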

&lt;p&gt;Now the favor: if you're interested in independently verifying these results, I'd love to hear from you. DM me and I'll share everything you need to reproduce them. Thank you, and best wishes.&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Catastrophic Forgetting of language models.</title>
      <dc:creator>Alex Mercer</dc:creator>
      <pubDate>Sun, 22 Feb 2026 11:31:48 +0000</pubDate>
      <link>https://dev.to/fourwheels2512/why-qlora-produces-a-gradient-norm-spike-at-step-44-on-mistral-7b-and-how-to-fix-it-141h</link>
      <guid>https://dev.to/fourwheels2512/why-qlora-produces-a-gradient-norm-spike-at-step-44-on-mistral-7b-and-how-to-fix-it-141h</guid>
      <description>&lt;p&gt;To all the awesome experts in AI/ML out there. i need a favor. &lt;br&gt;
I realized there is a gap in Language Models (SLMs/LLMs) remembering the data continuously which is termed as 'catastrophic forgetting'.&lt;/p&gt;

&lt;p&gt;To address this, I came up with an adapter called the Constrained Residual Mixing Adapter (CRMA) that enables continual learning. I tested it on TinyLlama 1.1B and Mistral 7B; the result was -0.1% average drift across 4 sequential domains. Essentially zero forgetting.&lt;/p&gt;
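
&lt;p&gt;The post doesn't describe CRMA's internals, so the following is a purely hypothetical NumPy sketch of what a "constrained residual mixing" adapter could look like: a low-rank adapter branch mixed into the frozen base activation through a gate that is hard-capped to a small range, so the adapter can never overwrite the base representation. Every name here (CRMAdapter, max_mix) is my own illustration, not the author's code.&lt;/p&gt;

```python
import numpy as np

class CRMAdapter:
    """Hypothetical constrained residual mixing adapter (illustration only).

    Output: (1 - g) * x + g * adapter(x), where the mixing gate g is
    hard-constrained to (0, max_mix] so the frozen base signal always
    dominates the residual stream.
    """

    def __init__(self, dim, rank=8, max_mix=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.down = rng.normal(0.0, 0.02, (dim, rank))  # trainable down-projection
        self.up = np.zeros((rank, dim))                 # zero-init up-projection
        self.gate_logit = 0.0                           # scalar gate parameter
        self.max_mix = max_mix

    def mix(self):
        # Sigmoid squashes the logit into (0, 1); scaling caps the mix at max_mix.
        return self.max_mix / (1.0 + np.exp(-self.gate_logit))

    def __call__(self, x):
        g = self.mix()
        delta = np.tanh(x @ self.down) @ self.up        # low-rank adapter branch
        return (1.0 - g) * x + g * delta

adapter = CRMAdapter(dim=16)
x = np.ones((2, 16))
y = adapter(x)
# With up zero-initialised, delta is 0, so y = (1 - g) * x: the adapter
# starts as a slightly damped identity and can only move within max_mix.
```

&lt;p&gt;The point of the hard cap is that, unlike an unconstrained LoRA-style update, the adapter's worst case is bounded by construction, which is one plausible route to the near-zero drift claimed above.&lt;/p&gt;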

&lt;p&gt;CRMA: -0.1% drift. Naive: +351% forgetting. Same model, same data, same hardware.&lt;/p&gt;

&lt;p&gt;This holds at both 1.1B and 7B, with no replay buffer, no EWC (Elastic Weight Consolidation), and no knowledge distillation.&lt;br&gt;
● CRMA Modular vs. Naive, Mistral 7B (4 sequential domains)&lt;/p&gt;

&lt;p&gt;┌─────────┬────────────┬──────────────────┐&lt;br&gt;
  │  Task   │ CRMA Drift │ Naive Forgetting │&lt;br&gt;
  ├─────────┼────────────┼──────────────────┤&lt;br&gt;
  │ Medical │ -0.2%      │ +228%            │&lt;br&gt;
  ├─────────┼────────────┼──────────────────┤&lt;br&gt;
  │ Legal   │ -0.1%      │ +593%            │&lt;br&gt;
  ├─────────┼────────────┼──────────────────┤&lt;br&gt;
  │ Code    │ -0.1%      │ +233%            │&lt;br&gt;
  ├─────────┼────────────┼──────────────────┤&lt;br&gt;
  │ Finance │ +0.0%      │ —                │&lt;br&gt;
  ├─────────┼────────────┼──────────────────┤&lt;br&gt;
  │ Average │ -0.1%      │ +351%            │&lt;br&gt;
  └─────────┴────────────┴──────────────────┘&lt;/p&gt;
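
&lt;p&gt;For anyone who wants to sanity-check numbers like these: the drift/forgetting percentages above read like relative changes in a per-task eval metric measured right after learning a task versus after all subsequent training. A minimal sketch under that assumption follows; the metric, task names, and numbers are illustrative placeholders, not the author's data.&lt;/p&gt;

```python
def drift_pct(before, after):
    """Relative change (%) in a per-task eval metric such as loss or
    perplexity; positive means the task got worse after later training."""
    return 100.0 * (after - before) / before

# Illustrative numbers only: eval loss per task right after learning it
# (before) vs. after all sequential domains have been trained (after).
tasks = {
    "medical": (2.10, 2.09),  # slight improvement -> negative drift
    "legal":   (1.80, 1.81),
    "code":    (1.50, 1.52),
}
drifts = {name: drift_pct(b, a) for name, (b, a) in tasks.items()}
avg_drift = sum(drifts.values()) / len(drifts)
```

&lt;p&gt;Under this reading, "+351% forgetting" would mean the naive baseline's eval metric roughly quadrupled on earlier tasks, while CRMA's stayed within a fraction of a percent.&lt;/p&gt;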

&lt;p&gt;Now the favor: if you're interested in independently verifying these results, I'd love to hear from you. DM me and I'll share everything you need to reproduce them. Thank you, and best wishes.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>ai</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
