<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Parv Agarwal</title>
    <description>The latest articles on DEV Community by Parv Agarwal (@parvagarwal).</description>
    <link>https://dev.to/parvagarwal</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3931697%2Fbfb357b1-a766-474a-8e91-1b21a8753286.jpg</url>
      <title>DEV Community: Parv Agarwal</title>
      <link>https://dev.to/parvagarwal</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/parvagarwal"/>
    <language>en</language>
    <item>
      <title>How I Stopped Losing GPU Training Runs During Long Experiments</title>
      <dc:creator>Parv Agarwal</dc:creator>
      <pubDate>Tue, 26 May 2026 14:20:00 +0000</pubDate>
      <link>https://dev.to/parvagarwal/how-i-stopped-losing-gpu-training-runs-during-long-experiments-3ham</link>
      <guid>https://dev.to/parvagarwal/how-i-stopped-losing-gpu-training-runs-during-long-experiments-3ham</guid>
      <description>&lt;h2&gt;
  
  
  How I Stopped Losing GPU Training Runs During Long Experiments
&lt;/h2&gt;

&lt;p&gt;I left a model training on a remote GPU box on a Thursday. Twelve hours, four losses, three datasets, all the hyperparameters I'd been tweaking for a week.&lt;/p&gt;

&lt;p&gt;Friday morning I SSH'd in, ran &lt;code&gt;tmux attach&lt;/code&gt;, and saw this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The process had died at hour two. Ten hours of GPU time, gone. The dataset I needed in memory? Also gone -_- &lt;br&gt;
The metrics CSV my eval script was supposed to emit at the end? Never written, because the job never reached the end. &lt;br&gt;
And I can't tell you how frustrating it was to give so many hours to a run and then find out it was a complete waste of time!&lt;/p&gt;

&lt;p&gt;I sat down and counted how many times this had happened in the previous month.&lt;br&gt;
Five. Five jobs where I'd come back to find the run had quietly died and I'd lost a day, sometimes two, of compute and clock time.&lt;/p&gt;

&lt;p&gt;So I built the tool I wanted to exist. It's called &lt;strong&gt;GPUAlert&lt;/strong&gt;, it's a ~1000-line Python CLI, and it's &lt;code&gt;pip install gpualert&lt;/code&gt; away.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;gpualert
gpualert config &lt;span class="nt"&gt;--init&lt;/span&gt;
gpualert run &lt;span class="nt"&gt;--&lt;/span&gt; python train.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That last line wraps your training command. When it finishes (success, crash, timeout, or you hitting Ctrl+C), you get an email. The email has the full stdout/stderr logs attached. The body has a one-line summary of &lt;em&gt;what went wrong&lt;/em&gt;, such as "GPU out-of-memory", "NaN in loss", "killed by OOMKiller", or "process exited with code 137", pulled from the stderr by a small regex classifier.&lt;/p&gt;
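
&lt;p&gt;To make the "small regex classifier" idea concrete, here is a minimal sketch of how one could work. The summary labels come from the list above; the patterns, the &lt;code&gt;RULES&lt;/code&gt; table, and the &lt;code&gt;summarize_failure&lt;/code&gt; name are illustrative assumptions, not GPUAlert's actual rule set.&lt;/p&gt;

```python
import re

# Hypothetical rule table: (pattern, one-line summary).
# The labels are from the post; the patterns are guesses for illustration.
RULES = [
    (re.compile(r"CUDA out of memory", re.IGNORECASE), "GPU out-of-memory"),
    (re.compile(r"\bnan\b.*\bloss\b|\bloss\b.*\bnan\b", re.IGNORECASE), "NaN in loss"),
    (re.compile(r"oom-?killer|Out of memory: Killed process", re.IGNORECASE), "killed by OOMKiller"),
]


def summarize_failure(stderr_text, exit_code):
    """Return the summary of the first matching rule, else a generic exit-code line."""
    for pattern, summary in RULES:
        if pattern.search(stderr_text):
            return summary
    return f"process exited with code {exit_code}"
```

&lt;p&gt;Rule order matters in a table like this: the first match wins, so the most specific patterns should come first, and the exit code is only the fallback.&lt;/p&gt;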

&lt;p&gt;If it succeeded, the body has the metrics it found in your stdout (&lt;code&gt;accuracy&lt;/code&gt;, &lt;code&gt;loss&lt;/code&gt;, &lt;code&gt;F1&lt;/code&gt;, &lt;code&gt;mAP&lt;/code&gt;), the last value of each. So you glance at your phone and see "Accuracy: 0.92 | Loss: 0.123 | Epochs: 50" before you even open the email.&lt;/p&gt;
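
&lt;p&gt;The "last value of each" behaviour can be sketched in a few lines. This assumes your script prints metrics as &lt;code&gt;name: value&lt;/code&gt; or &lt;code&gt;name=value&lt;/code&gt;; the &lt;code&gt;METRIC_RE&lt;/code&gt; pattern and the &lt;code&gt;last_metrics&lt;/code&gt; helper are hypothetical and may not match what GPUAlert really parses.&lt;/p&gt;

```python
import re

# Assumed log format: "name: value" or "name=value"; metric names from the post.
METRIC_RE = re.compile(
    r"\b(accuracy|loss|f1|map)\b\s*[:=]\s*([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)",
    re.IGNORECASE,
)


def last_metrics(stdout_text):
    """Map each metric name (lowercased) to the last value seen in stdout."""
    found = {}
    for name, value in METRIC_RE.findall(stdout_text):
        found[name.lower()] = float(value)  # later matches overwrite earlier ones
    return found
```

&lt;p&gt;Because later matches overwrite earlier ones, a 50-epoch log collapses to the final epoch's numbers without any per-epoch bookkeeping.&lt;/p&gt;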

&lt;h2&gt;
  
  
  Why not just &lt;code&gt;screen&lt;/code&gt; and &lt;code&gt;mail&lt;/code&gt;?
&lt;/h2&gt;

&lt;p&gt;That's the version I'd been using. The problem isn't &lt;em&gt;availability of Unix primitives&lt;/em&gt;. It's:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;screen&lt;/code&gt; / &lt;code&gt;tmux&lt;/code&gt; keeps the process alive - but you still have to check.&lt;/li&gt;
&lt;li&gt;Plain &lt;code&gt;mail&lt;/code&gt; can attach logs, but you have to wire up the success/failure detection yourself, every time, in every project.&lt;/li&gt;
&lt;li&gt;The stderr tail is rarely enough. CUDA OOM looks like a wall of NCCL
warnings followed by the actual error - you need to grep through dozens of lines to find the cause.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I wanted something I could install once and forget. The whole interface is six commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gpualert run         # wrap a local command
gpualert slurm       # poll a sacct job ID
gpualert config      # interactive setup wizard
gpualert test-email  # SMTP sanity check
gpualert logs        # list recent job log dirs
gpualert version     # show the installed version (currently v0.1.0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The non-obvious design promises
&lt;/h2&gt;

&lt;p&gt;Three properties I wanted in this tool that turned out to be surprisingly fiddly to actually deliver:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Logs always exist on disk, even if the job segfaults.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The launcher creates the log files &lt;em&gt;before&lt;/em&gt; it starts the subprocess. Then the streaming readers pipe stdout/stderr to those files in real time. &lt;br&gt;
If the subprocess dies in the first millisecond, the log files are still on disk: empty of job output, but present, with a &lt;code&gt;[SYSTEM]&lt;/code&gt; header explaining what happened. No more "the job died before it wrote anything so I have no idea what happened" situations.&lt;/p&gt;
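
&lt;p&gt;Here is a minimal sketch of the "create the files first, then stream" pattern using only the standard library. The &lt;code&gt;run_logged&lt;/code&gt; function, the file names, the header text, and the 127 return value for a command that cannot be started are assumptions for illustration, not GPUAlert's launcher.&lt;/p&gt;

```python
import subprocess
import threading
from pathlib import Path


def run_logged(cmd, log_dir):
    """Create the log files first, then stream the child's output into them."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    paths = {name: log_dir / f"{name}.log" for name in ("stdout", "stderr")}
    for path in paths.values():
        # Written before the child starts, so the file exists no matter what.
        path.write_text(f"[SYSTEM] launching: {' '.join(cmd)}\n")

    try:
        proc = subprocess.Popen(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
        )
    except OSError as exc:
        # The command could not even be started; record why and give up.
        with paths["stderr"].open("a") as fh:
            fh.write(f"[SYSTEM] failed to start: {exc}\n")
        return 127

    def pump(stream, path):
        # Append line by line and flush, so the file is current at all times.
        with path.open("a") as fh:
            for line in stream:
                fh.write(line)
                fh.flush()

    threads = [
        threading.Thread(target=pump, args=(proc.stdout, paths["stdout"])),
        threading.Thread(target=pump, args=(proc.stderr, paths["stderr"])),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return proc.wait()
```

&lt;p&gt;One reader thread per pipe keeps a chatty stderr from blocking stdout (and the other way round), and the flush after every line means a hard kill loses at most the line in flight.&lt;/p&gt;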

&lt;p&gt;&lt;strong&gt;2. The notifier never raises.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If SMTP auth fails, if your network is down, if Gmail decides today is the day they rate-limit you: the email fails, you get a printed error, and the CLI still exits with the &lt;em&gt;job's&lt;/em&gt; exit code, not the notifier's.&lt;br&gt;
The logs are still on disk. You can still figure out what happened. The notification path is best-effort and isolated.&lt;/p&gt;
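
&lt;p&gt;The isolation described here boils down to one broad &lt;code&gt;try&lt;/code&gt;/&lt;code&gt;except&lt;/code&gt; around the send call. A minimal sketch, where &lt;code&gt;notify_safely&lt;/code&gt;, &lt;code&gt;finish&lt;/code&gt;, and the printed prefix are hypothetical names and &lt;code&gt;send&lt;/code&gt; stands in for whatever actually talks to SMTP:&lt;/p&gt;

```python
import sys


def notify_safely(send, message):
    """Call send(message); report any failure on stderr instead of raising."""
    try:
        send(message)
        return True
    except Exception as exc:  # deliberately broad: notification is best-effort
        print(f"[gpualert] notification failed: {exc}", file=sys.stderr)
        return False


def finish(job_exit_code, send, message):
    """Try to notify, then hand back the job's own exit code unchanged."""
    notify_safely(send, message)
    return job_exit_code
```

&lt;p&gt;The caller would pass the result of &lt;code&gt;finish&lt;/code&gt; to &lt;code&gt;sys.exit&lt;/code&gt;, so a broken mail server can never change what your shell or scheduler sees as the job's outcome.&lt;/p&gt;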

&lt;p&gt;&lt;strong&gt;3. Logs are always attached to failure emails.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Non-negotiable, not behind a flag. If the job failed, the logs ride along.&lt;br&gt;
The config has an &lt;code&gt;attach_logs_on_failure&lt;/code&gt; field for symmetry, but the runtime ignores it. &lt;br&gt;
Past-me opening a "your job failed" email with no attached logs was the worst case I designed against.&lt;/p&gt;
&lt;h2&gt;
  
  
  What the email actually looks like
&lt;/h2&gt;

&lt;p&gt;For a failed run with CUDA OOM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Subject: [GPUAlert] ❌ FAILED: python train.py --epochs 50

Status   : FAILED
Command  : python train.py --epochs 50
Duration : 1h 47m 12s
Exit Code: 1

ERROR SUMMARY
─────────────
GPU out-of-memory (CUDA OOM)
Suggestion: Try reducing batch size, using gradient checkpointing, or a larger GPU.

LAST 15 LINES OF STDERR
─────────────
[18:42:11] RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB
...

ATTACHED FILES (3)
  - stdout.log (12.4 KB)
  - stderr.log (3.1 KB)
  - combined.log (15.8 KB)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a successful run, the body has the metrics line and a smaller payload.&lt;/p&gt;

&lt;h2&gt;
  
  
  Slurm
&lt;/h2&gt;

&lt;p&gt;If you're on a cluster, you've probably already kicked the job off with &lt;code&gt;sbatch&lt;/code&gt;.&lt;br&gt;
The wrapper-around-command pattern doesn't apply because Slurm already owns the lifecycle. So there's a separate command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sbatch my_job.sh           &lt;span class="c"&gt;# returns "Submitted batch job 12345"&lt;/span&gt;
gpualert slurm 12345
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It polls &lt;code&gt;sacct&lt;/code&gt; every 10 seconds (configurable) until the job hits a terminal state, then sends the email. Same body format, same attachment behaviour, same exit-code semantics.&lt;/p&gt;
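
&lt;p&gt;The polling loop itself is simple. The sketch below assumes &lt;code&gt;sacct&lt;/code&gt; is on your PATH and that Slurm accounting is enabled; the set of states treated as terminal and the function names are my assumptions here, not necessarily what GPUAlert uses.&lt;/p&gt;

```python
import subprocess
import time

# Slurm states treated as final in this sketch (an assumed list).
TERMINAL_STATES = {
    "COMPLETED", "FAILED", "CANCELLED", "TIMEOUT",
    "OUT_OF_MEMORY", "NODE_FAIL", "BOOT_FAIL", "DEADLINE",
}


def query_state(job_id):
    """Ask sacct for the job's allocation-level state; '' if not recorded yet."""
    out = subprocess.run(
        ["sacct", "-j", str(job_id), "-X", "--noheader", "--parsable2", "--format=State"],
        capture_output=True, text=True, check=False,
    ).stdout.split()
    # "CANCELLED by 1234" is reduced to "CANCELLED".
    return out[0] if out else ""


def wait_for_job(job_id, interval=10.0, get_state=query_state, sleep=time.sleep):
    """Poll until the job reaches a terminal state, then return that state."""
    while True:
        state = get_state(job_id)
        if state in TERMINAL_STATES:
            return state
        sleep(interval)
```

&lt;p&gt;Passing &lt;code&gt;get_state&lt;/code&gt; and &lt;code&gt;sleep&lt;/code&gt; in as parameters is only there so the loop can be exercised without a cluster; on a real system you would call &lt;code&gt;wait_for_job(12345)&lt;/code&gt; and leave the defaults alone.&lt;/p&gt;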

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;gpualert
gpualert config &lt;span class="nt"&gt;--init&lt;/span&gt;
gpualert test-email
gpualert run &lt;span class="nt"&gt;--dry-run&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"hello"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;--dry-run&lt;/code&gt; prints the email it would send without actually touching SMTP.&lt;br&gt;
Useful for kicking the tires before you trust it with a real overnight job.&lt;/p&gt;

&lt;p&gt;Code: &lt;a href="https://github.com/Parv-01/gpualert" rel="noopener noreferrer"&gt;https://github.com/Parv-01/gpualert&lt;/a&gt;&lt;br&gt;
PyPI: &lt;a href="https://pypi.org/project/gpualert/" rel="noopener noreferrer"&gt;https://pypi.org/project/gpualert/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;The roadmap, depending on what people ask for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Slack / Discord / Telegram notifier backends&lt;/strong&gt; -&amp;gt; the abstract
&lt;code&gt;BaseNotifier&lt;/code&gt; is already in place; each new backend is a few hundred lines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keyring integration&lt;/strong&gt; so the SMTP password lives in the OS keyring instead of &lt;code&gt;~/.gpualert/config.toml&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-job dashboards&lt;/strong&gt; -&amp;gt; a web view of recent runs across hosts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus exporter&lt;/strong&gt; for cluster-wide stats.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any of those scratch your itch, open a Discussion on the repo. PRs welcome -&amp;gt; there's a 131-check end-to-end harness so it's hard to break things accidentally.&lt;/p&gt;

&lt;p&gt;And even if something does break, that's the reason we all love logs, git versioning, and debugging (high on caffeinated sprint sessions, hehe... &amp;gt;_&amp;lt;)&lt;/p&gt;

</description>
      <category>python</category>
      <category>opensource</category>
      <category>productivity</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
