<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aditya Mehra</title>
    <description>The latest articles on DEV Community by Aditya Mehra (@aditya_mehra).</description>
    <link>https://dev.to/aditya_mehra</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3951664%2Fd5a8f323-9067-434b-890e-b6846df43e9f.png</url>
      <title>DEV Community: Aditya Mehra</title>
      <link>https://dev.to/aditya_mehra</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aditya_mehra"/>
    <language>en</language>
    <item>
      <title>I Built a Diagnostic Toolkit for PyTorch Because I Was Tired of Guessing Why Models Fail</title>
      <dc:creator>Aditya Mehra</dc:creator>
      <pubDate>Tue, 26 May 2026 03:31:53 +0000</pubDate>
      <link>https://dev.to/aditya_mehra/i-built-a-diagnostic-toolkit-for-pytorch-because-i-was-tired-of-guessing-why-models-fail-24fl</link>
      <guid>https://dev.to/aditya_mehra/i-built-a-diagnostic-toolkit-for-pytorch-because-i-was-tired-of-guessing-why-models-fail-24fl</guid>
      <description>&lt;p&gt;Every time a PyTorch model refuses to learn, the debugging process looks the same:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Stare at the loss curve&lt;/li&gt;
&lt;li&gt;Wonder if gradients are flowing&lt;/li&gt;
&lt;li&gt;Add print statements everywhere&lt;/li&gt;
&lt;li&gt;Delete them all when it works&lt;/li&gt;
&lt;li&gt;Repeat next week&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After 17 years in distributed systems and SRE, I know this pattern — it is monitoring by vibes. In production infrastructure, we would never accept "the service seems slow" as a diagnostic. We measure. We trace. We verify.&lt;/p&gt;

&lt;p&gt;So I built &lt;strong&gt;torchdiag&lt;/strong&gt; — five diagnostic commands that answer the actual questions.&lt;/p&gt;

&lt;h2&gt;Install&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;torchdiag
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/AddyM" rel="noopener noreferrer"&gt;
        AddyM
      &lt;/a&gt; / &lt;a href="https://github.com/AddyM/torchdiag" rel="noopener noreferrer"&gt;
        torchdiag
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      PyTorch model health diagnostics — gradient checks, dead neuron detection, training verification. Built from an SRE perspective.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;torchdiag&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a href="https://github.com/AddyM/torchdiag/actions/workflows/ci.yml" rel="noopener noreferrer"&gt;&lt;img src="https://github.com/AddyM/torchdiag/actions/workflows/ci.yml/badge.svg" alt="CI"&gt;&lt;/a&gt;
&lt;a href="https://pypi.org/project/torchdiag/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/556d6281b4650a6183f6ff10942cbb051483504da145c1c5afecf195ff69d065/68747470733a2f2f696d672e736869656c64732e696f2f707970692f762f746f726368646961672e737667" alt="PyPI"&gt;&lt;/a&gt;
&lt;a href="https://pytorch.org/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/d72bac763e77fcd1f240e4e57f602c6804c2b10837c8b53b89aa7d2b3d5ffd1e/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f5079546f7263682d4545344332433f7374796c653d666c6174266c6f676f3d7079746f726368266c6f676f436f6c6f723d7768697465" alt="PyTorch"&gt;&lt;/a&gt;
&lt;a href="https://opensource.org/licenses/MIT" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/fdf2982b9f5d7489dcf44570e714e3a15fce6253e0cc6b5aa61a075aac2ff71b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4d49542d79656c6c6f772e737667" alt="License: MIT"&gt;&lt;/a&gt;
&lt;a href="https://www.python.org/downloads/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/bd7bcdc70784bad7073b66850c51f4fed5dc3b2fc782277551b9013c7d27f043/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f707974686f6e2d332e382b2d626c75652e737667" alt="Python 3.8+"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;PyTorch model health diagnostics — built from an SRE perspective.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Stop guessing why your model isn't learning. &lt;code&gt;torchdiag&lt;/code&gt; gives you five diagnostic commands that answer the questions that matter: Are gradients flowing? Are neurons alive? Did the optimizer actually update weights?&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Installation&lt;/h2&gt;
&lt;/div&gt;
&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;pip install torchdiag&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Quick Start&lt;/h2&gt;
&lt;/div&gt;
&lt;div class="highlight highlight-source-python notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;torch&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;torch&lt;/span&gt;.&lt;span class="pl-s1"&gt;nn&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;nn&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;torchdiag&lt;/span&gt;
&lt;span class="pl-s1"&gt;model&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;nn&lt;/span&gt;.&lt;span class="pl-c1"&gt;Sequential&lt;/span&gt;(
    &lt;span class="pl-s1"&gt;nn&lt;/span&gt;.&lt;span class="pl-c1"&gt;Linear&lt;/span&gt;(&lt;span class="pl-c1"&gt;784&lt;/span&gt;, &lt;span class="pl-c1"&gt;256&lt;/span&gt;),
    &lt;span class="pl-s1"&gt;nn&lt;/span&gt;.&lt;span class="pl-c1"&gt;ReLU&lt;/span&gt;(),
    &lt;span class="pl-s1"&gt;nn&lt;/span&gt;.&lt;span class="pl-c1"&gt;Linear&lt;/span&gt;(&lt;span class="pl-c1"&gt;256&lt;/span&gt;, &lt;span class="pl-c1"&gt;64&lt;/span&gt;),
    &lt;span class="pl-s1"&gt;nn&lt;/span&gt;.&lt;span class="pl-c1"&gt;ReLU&lt;/span&gt;(),
    &lt;span class="pl-s1"&gt;nn&lt;/span&gt;.&lt;span class="pl-c1"&gt;Linear&lt;/span&gt;(&lt;span class="pl-c1"&gt;64&lt;/span&gt;, &lt;span class="pl-c1"&gt;10&lt;/span&gt;),
)

&lt;span class="pl-c"&gt;# 1. Model overview&lt;/span&gt;
&lt;span class="pl-s1"&gt;torchdiag&lt;/span&gt;.&lt;span class="pl-c1"&gt;summary&lt;/span&gt;(&lt;span class="pl-s1"&gt;model&lt;/span&gt;)

&lt;span class="pl-c"&gt;# 2. Check for dead neurons&lt;/span&gt;
&lt;span class="pl-s1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;torch&lt;/span&gt;.&lt;span class="pl-c1"&gt;randn&lt;/span&gt;(&lt;span class="pl-c1"&gt;100&lt;/span&gt;, &lt;span class="pl-c1"&gt;784&lt;/span&gt;)
&lt;span class="pl-s1"&gt;torchdiag&lt;/span&gt;.&lt;span class="pl-c1"&gt;check_dead_neurons&lt;/span&gt;(&lt;span class="pl-s1"&gt;model&lt;/span&gt;, &lt;span class="pl-s1"&gt;x&lt;/span&gt;)

&lt;span class="pl-c"&gt;# 3. Verify a full training step works&lt;/span&gt;
&lt;span class="pl-s1"&gt;torchdiag&lt;/span&gt;.&lt;span class="pl-c1"&gt;verify_step&lt;/span&gt;(
    &lt;span class="pl-s1"&gt;model&lt;/span&gt;,
    &lt;span class="pl-s1"&gt;torch&lt;/span&gt;.&lt;span class="pl-c1"&gt;optim&lt;/span&gt;.&lt;span class="pl-c1"&gt;Adam&lt;/span&gt;&lt;/pre&gt;…
&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/AddyM/torchdiag" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;h2&gt;1. What does my model actually look like?&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torchdiag&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.nn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;784&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ReLU&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ReLU&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;torchdiag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Prints parameter count per layer, total/trainable/frozen breakdown, memory footprint, device placement, and dtype distribution. Flags frozen parameters, split-device models, and dtype mismatches.&lt;/p&gt;
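&lt;p&gt;As a sanity check on the per-layer numbers, the parameter counts for the model above can be worked out by hand. The sketch below is plain Python, assumes float32 parameters (4 bytes each), and uses a hypothetical helper &lt;code&gt;linear_params&lt;/code&gt; that is not part of torchdiag.&lt;/p&gt;

```python
# Hypothetical helper, not part of torchdiag: parameter count of one Linear layer.
def linear_params(in_features, out_features, bias=True):
    # weight matrix (out x in) plus an optional bias vector (out)
    return in_features * out_features + (out_features if bias else 0)

# The three Linear layers of the example model; ReLU layers have no parameters.
layers = [(784, 256), (256, 64), (64, 10)]
counts = [linear_params(i, o) for i, o in layers]
total = sum(counts)
bytes_fp32 = total * 4  # assumption: every parameter is stored as float32

print(counts)      # [200960, 16448, 650]
print(total)       # 218058
print(bytes_fp32)  # 872232
```

&lt;p&gt;If a summary tool reports a noticeably different total for this architecture, a layer is probably missing, duplicated, or built without a bias.&lt;/p&gt;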

&lt;h2&gt;2. Are gradients flowing?&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
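&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;784&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# example batch&lt;/span&gt;
&lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;  &lt;span class="c1"&gt;# example labels&lt;/span&gt;
&lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CrossEntropyLoss&lt;/span&gt;&lt;span class="p"&gt;()(&lt;/span&gt;&lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;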
&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;backward&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;torchdiag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;check_gradients&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reports gradient mean, max, and min per layer. Flags vanishing gradients (max below 1e-7), exploding gradients (max above 100), and disconnected parameters (None gradients).&lt;/p&gt;
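&lt;p&gt;To make the three flags concrete, here is a minimal sketch of the same kind of check in plain PyTorch. It is not torchdiag's implementation: the function name &lt;code&gt;grad_report&lt;/code&gt; is hypothetical, and applying the thresholds to the largest absolute gradient value is an assumption. Only the threshold values (1e-7 and 100) come from the description above.&lt;/p&gt;

```python
import torch
import torch.nn as nn

def grad_report(model, vanish=1e-7, explode=100.0):
    """Hypothetical sketch: label each parameter by its largest absolute gradient."""
    report = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            # backward() never reached this parameter
            report[name] = "disconnected"
            continue
        g_max = p.grad.abs().max().item()
        if vanish > g_max:
            report[name] = "vanishing"
        elif g_max > explode:
            report[name] = "exploding"
        else:
            report[name] = "ok"
    return report

# Two layers, but only "used" takes part in the forward pass.
net = nn.ModuleDict({"used": nn.Linear(4, 2), "unused": nn.Linear(4, 2)})
x = torch.ones(16, 4)
net["used"](x).sum().backward()

report = grad_report(net)
print(report)
```

&lt;p&gt;Here the parameters of &lt;code&gt;used&lt;/code&gt; come back as "ok" and those of &lt;code&gt;unused&lt;/code&gt; as "disconnected", because their &lt;code&gt;.grad&lt;/code&gt; is still &lt;code&gt;None&lt;/code&gt; after the backward pass.&lt;/p&gt;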

&lt;h2&gt;3. Are neurons alive?&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;784&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;torchdiag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;check_dead_neurons&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A dead ReLU neuron outputs zero for every input in the batch you pass in. Its gradient is zero on those inputs, so its incoming weights receive no update and it rarely recovers on its own. This command tells you how many dead neurons you have and where they are. It flags layers with more than 50% dead neurons as critical.&lt;/p&gt;
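&lt;p&gt;One common way to detect dead units is a forward hook on each &lt;code&gt;nn.ReLU&lt;/code&gt;. The sketch below illustrates that technique under stated assumptions: the function name &lt;code&gt;dead_relu_fraction&lt;/code&gt; is hypothetical, it only handles 2D activations of shape (batch, features), and it is not torchdiag's actual code.&lt;/p&gt;

```python
import torch
import torch.nn as nn

def dead_relu_fraction(model, x):
    """Hypothetical sketch: per ReLU, the fraction of units that are zero for every sample in x."""
    fractions = {}
    hooks = []

    def make_hook(name):
        def hook(module, inputs, output):
            # a unit counts as dead on this batch if its output is zero for all samples
            dead = (output == 0).all(dim=0)
            fractions[name] = dead.float().mean().item()
        return hook

    for name, module in model.named_modules():
        if isinstance(module, nn.ReLU):
            hooks.append(module.register_forward_hook(make_hook(name)))
    with torch.no_grad():
        model(x)
    for h in hooks:
        h.remove()  # leave the model without hooks afterwards
    return fractions

# Force every unit of the first layer dead with a large negative bias.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
with torch.no_grad():
    model[0].bias.fill_(-1e6)

fractions_out = dead_relu_fraction(model, torch.randn(32, 4))
print(fractions_out)  # {'1': 1.0}
```

&lt;p&gt;The result depends on the probe batch: a unit that is silent on random noise may still fire on real data, so a representative batch gives a more trustworthy count.&lt;/p&gt;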

&lt;h2&gt;4. Does one training step actually work?&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;torchdiag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;verify_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Adam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
    &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CrossEntropyLoss&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;784&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,)),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Runs one complete training step — forward, loss, backward, optimizer step — and verifies each stage works. Confirms output shape is correct, loss is finite, gradients are computed, and parameters actually change.&lt;/p&gt;

&lt;p&gt;Run this before your training loop. If something is broken, you will know in 1 step instead of 100 epochs.&lt;/p&gt;
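&lt;p&gt;The checks described above can be sketched by hand in a few lines. This is an illustration, not torchdiag's implementation: &lt;code&gt;check_one_step&lt;/code&gt; is a hypothetical name, and comparing only the batch dimension stands in for "output shape is correct", since the expected shape depends on the model.&lt;/p&gt;

```python
import torch
import torch.nn as nn

def check_one_step(model, optimizer, loss_fn, x, y):
    """Hypothetical sketch: run one training step and report which stages behaved."""
    # snapshot parameters so we can tell whether the optimizer changed them
    before = [p.detach().clone() for p in model.parameters()]
    optimizer.zero_grad()
    out = model(x)
    loss = loss_fn(out, y)
    loss.backward()
    optimizer.step()
    return {
        # assumption: the output should keep the batch dimension of the input
        "batch_dim_matches": out.shape[0] == x.shape[0],
        "loss_finite": bool(torch.isfinite(loss)),
        "grads_computed": all(p.grad is not None for p in model.parameters()),
        "params_changed": any(
            not torch.equal(b, p.detach())
            for b, p in zip(before, model.parameters())
        ),
    }

torch.manual_seed(0)
model = nn.Linear(8, 3)
result = check_one_step(
    model,
    torch.optim.SGD(model.parameters(), lr=0.1),
    nn.CrossEntropyLoss(),
    torch.randn(16, 8),
    torch.randint(0, 3, (16,)),
)
print(result)
```

&lt;p&gt;A &lt;code&gt;False&lt;/code&gt; under &lt;code&gt;params_changed&lt;/code&gt; usually points at an optimizer built over the wrong parameters or a learning rate of zero.&lt;/p&gt;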

&lt;h2&gt;5. How much memory am I using?&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;torchdiag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;memory_report&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reports CPU RSS, GPU allocated/cached/peak memory per device, and MPS memory on Apple Silicon. Flags when GPU memory utilization exceeds 90%.&lt;/p&gt;
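&lt;p&gt;For the CUDA part, PyTorch already exposes the relevant counters. The sketch below collects them per device. It is a rough illustration only: &lt;code&gt;cuda_memory_report&lt;/code&gt; is a hypothetical name, treating "utilization" as reserved bytes divided by total device memory is an assumption, and CPU RSS and MPS are left out.&lt;/p&gt;

```python
import torch

def cuda_memory_report(threshold=0.9):
    """Hypothetical sketch: per-device CUDA memory figures from PyTorch's caching allocator."""
    rows = []
    if not torch.cuda.is_available():
        return rows  # CPU-only machine: nothing to report
    for i in range(torch.cuda.device_count()):
        total = torch.cuda.get_device_properties(i).total_memory
        reserved = torch.cuda.memory_reserved(i)
        rows.append({
            "device": i,
            "allocated": torch.cuda.memory_allocated(i),
            "reserved": reserved,
            "peak_allocated": torch.cuda.max_memory_allocated(i),
            "total": total,
            # assumption: utilization = reserved bytes over total device memory
            "flagged": reserved / total > threshold,
        })
    return rows

report = cuda_memory_report()
print(report)
```

&lt;p&gt;On a machine without CUDA the function returns an empty list, so it is safe to call unconditionally at the start of a training script.&lt;/p&gt;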

&lt;h2&gt;Why I Built This&lt;/h2&gt;

&lt;p&gt;I spent 11 years at VMware working on distributed systems observability. The first thing you learn in SRE: &lt;strong&gt;never trust a system you cannot measure.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;PyTorch models are systems. They have inputs, internal state, and outputs. When they fail, they fail silently — the loss just stays flat. No error. No exception. Just a number that does not move.&lt;/p&gt;

&lt;p&gt;torchdiag makes the internal state visible. Five commands. No configuration. No dependencies beyond PyTorch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PyPI:&lt;/strong&gt; &lt;a href="https://pypi.org/project/torchdiag/" rel="noopener noreferrer"&gt;pypi.org/project/torchdiag&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/AddyM/torchdiag" rel="noopener noreferrer"&gt;github.com/AddyM/torchdiag&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;CI:&lt;/strong&gt; Tests pass across Python 3.9 to 3.12&lt;/p&gt;

&lt;p&gt;Contributions welcome. If you have a debugging pattern you use repeatedly, &lt;a href="https://github.com/AddyM/torchdiag/issues" rel="noopener noreferrer"&gt;open an issue&lt;/a&gt; — it probably belongs in the toolkit.&lt;/p&gt;

</description>
      <category>pytorch</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
