How I Stopped Losing GPU Training Runs During Long Experiments
I left a model training on a remote GPU box on a Thursday. Twelve hours, four losses, three datasets, all the hyperparameters I'd been tweaking for a week.
Friday morning I SSH'd in, ran tmux attach, and saw this:
RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB
The process had died at hour two. Ten hours of GPU time, gone. The dataset I needed in memory? Also gone -_-
The metrics CSV my eval script was supposed to emit at the end? Never written, because the job never reached the end.
I can't tell you how frustrating it was to give so many hours to a run and find out it was a complete waste of time!
I sat down and counted how many times this had happened in the previous month.
Five. Five jobs where I'd come back to find the run had quietly died and I'd lost a day, sometimes two, of compute and clock time.
So I built the tool I wanted to exist. It's called GPUAlert, it's a ~1000-line Python CLI, and it's pip install gpualert away.
pip install gpualert
gpualert config --init
gpualert run -- python train.py
That last line wraps your training command. When it finishes (success, crash, timeout, or you hitting Ctrl+C), you get an email. The email has the full stdout/stderr logs attached. The body has a one-line summary of what went wrong ("GPU out-of-memory", "NaN in loss", "killed by OOMKiller", "process exited with code 137"), pulled from the stderr by a small regex classifier.
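To make the "small regex classifier" idea concrete, here is a minimal sketch. It is not GPUAlert's actual source: the rule table, the patterns, and the name classify_failure are all assumptions chosen for illustration, and a real classifier would likely carry more rules.

```python
import re

# Hypothetical rule table: (pattern, one-line summary).
# Rules are tried in order, so the first rule that matches anywhere wins.
RULES = [
    (re.compile(r"CUDA out of memory", re.IGNORECASE), "GPU out-of-memory"),
    (re.compile(r"\bloss\b.*\bnan\b|\bnan\b.*\bloss\b", re.IGNORECASE), "NaN in loss"),
    (re.compile(r"oom[-_ ]?kill", re.IGNORECASE), "killed by OOMKiller"),
]


def classify_failure(stderr_text: str, exit_code: int) -> str:
    """Return a one-line summary for a failed run (illustrative only)."""
    for pattern, summary in RULES:
        # '.' does not cross newlines, so the NaN rule only fires when
        # "loss" and "nan" appear on the same line.
        if pattern.search(stderr_text):
            return summary
    # Nothing recognised: fall back to reporting the raw exit code.
    return f"process exited with code {exit_code}"
```

Searching the whole stderr rather than just its tail is what lets the real error surface from underneath a wall of unrelated warnings.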
If it succeeded, the body has the metrics it found in your stdout: accuracy, loss, F1, mAP, the last value of each. So you glance at your phone and see "Accuracy: 0.92 | Loss: 0.123 | Epochs: 50" before you even open the email.
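The "last value of each" behaviour can be sketched in a few lines. This is an illustration under assumptions, not the tool's real parser: it assumes metrics are printed as name: value or name=value, and METRIC_RE, last_metrics, and summary_line are hypothetical names.

```python
import re

# Assumed log format: "name: value" or "name=value" pairs anywhere in a line.
METRIC_RE = re.compile(
    r"\b(accuracy|loss|f1|map)\b\s*[:=]\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)",
    re.IGNORECASE,
)


def last_metrics(stdout_text: str) -> dict[str, float]:
    """Keep the last value seen for each known metric name."""
    found: dict[str, float] = {}
    for name, value in METRIC_RE.findall(stdout_text):
        found[name.lower()] = float(value)  # later matches overwrite earlier ones
    return found


def summary_line(metrics: dict[str, float]) -> str:
    """Join the metrics into a single glanceable line."""
    return " | ".join(f"{name}: {value:g}" for name, value in metrics.items())
```

Overwriting on every match is the whole trick: by the end of the log, each key holds whatever the final epoch printed.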
Why not just screen and mail?
That's the version I'd been using. The problem isn't availability of Unix primitives. It's:
- screen/tmux keeps the process alive, but you still have to check.
- Plain mail can attach logs, but you have to wire up the success/failure detection yourself, every time, in every project.
- The stderr tail is rarely enough. CUDA OOM looks like a wall of NCCL warnings followed by the actual error; you need to grep through dozens of lines to find the cause.
I wanted something I could install once and forget. The whole interface is six commands:
gpualert run # wrap a local command
gpualert slurm # poll a sacct job ID
gpualert config # interactive setup wizard
gpualert test-email # SMTP sanity check
gpualert logs # list recent job log dirs
gpualert version # print the installed version (currently v0.1.0)
The non-obvious design promises
Three properties I wanted in this tool that turned out to be surprisingly fiddly to actually deliver:
1. Logs always exist on disk, even if the job segfaults.
The launcher creates the log files before it starts the subprocess. Then the streaming readers pipe stdout/stderr to those files in real time.
If the subprocess dies in the first millisecond, the log files are still on disk: empty of job output, but present, with a [SYSTEM] header explaining what happened. No more "the job died before it wrote anything so I have no idea what happened" situations.
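Here is a rough sketch of that ordering: create and stamp the files first, launch second, stream third. It is an assumption-laden illustration rather than GPUAlert's launcher; run_logged, _pump, the header wording, and the 127 return code for a failed launch are all hypothetical choices.

```python
import subprocess
import threading
from datetime import datetime
from pathlib import Path


def _pump(stream, log_file):
    """Copy one pipe into its log file line by line, flushing as we go."""
    for line in stream:
        log_file.write(line)
        log_file.flush()


def run_logged(command: list[str], log_dir: Path) -> int:
    log_dir.mkdir(parents=True, exist_ok=True)
    # Open (and therefore create) both files *before* launching anything.
    with open(log_dir / "stdout.log", "w") as out, open(log_dir / "stderr.log", "w") as err:
        header = f"[SYSTEM] {datetime.now().isoformat()} starting: {' '.join(command)}\n"
        for log_file in (out, err):
            log_file.write(header)
            log_file.flush()
        try:
            proc = subprocess.Popen(command, stdout=subprocess.PIPE,
                                    stderr=subprocess.PIPE, text=True)
        except OSError as exc:
            # The launch itself failed; the logs still say why.
            err.write(f"[SYSTEM] failed to start: {exc}\n")
            return 127
        with proc:
            readers = [threading.Thread(target=_pump, args=(proc.stdout, out)),
                       threading.Thread(target=_pump, args=(proc.stderr, err))]
            for reader in readers:
                reader.start()
            for reader in readers:
                reader.join()  # readers finish when the pipes hit EOF
            return proc.wait()
```

One thread per pipe keeps a chatty stderr from blocking stdout (or the reverse), and flushing per line means a hard kill loses at most the line in flight.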
2. The notifier never raises.
If SMTP auth fails, if your network is down, if Gmail decides today is the day they rate-limit you: the email fails, you get a printed error, and the CLI still exits with the job's exit code, not the notifier's.
The logs are still on disk. You can still figure out what happened. The notification path is best-effort and isolated.
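The isolation described above boils down to one broad try/except around the send call, plus never letting the notifier's outcome touch the return value. A minimal sketch, assuming a send callable that takes an EmailMessage; notify_safely, finish, and the printed prefix are hypothetical names, not the tool's API.

```python
import sys
from email.message import EmailMessage


def notify_safely(send, message: EmailMessage) -> bool:
    """Call send(message); report any failure instead of raising."""
    try:
        send(message)
        return True
    except Exception as exc:  # deliberately broad: notification is best-effort
        print(f"[notifier] email failed: {exc!r}", file=sys.stderr)
        return False


def finish(job_exit_code: int, send, message: EmailMessage) -> int:
    """Attempt the notification, then hand back the job's own exit code."""
    notify_safely(send, message)
    return job_exit_code
```

A caller would end with something like sys.exit(finish(code, send, message)), so a shell script wrapping the command still sees the training job's status even when the email never leaves the machine.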
3. Logs are always attached to failure emails.
Non-negotiable, not behind a flag. If the job failed, the logs ride along.
The config has an attach_logs_on_failure field for symmetry, but the runtime ignores it.
Past-me opening a "your job failed" email with no attached logs was the worst case I designed against.
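As a sketch of what "the runtime ignores it" could look like with the standard library's email package: the function below accepts an attach_logs_on_failure argument and then never reads it. The function name and the *.log glob are assumptions made for this example, not the project's real code.

```python
from email.message import EmailMessage
from pathlib import Path


def build_failure_email(subject: str, body: str, log_dir: Path,
                        attach_logs_on_failure: bool = True) -> EmailMessage:
    """Build a failure email; logs are attached no matter what the flag says."""
    # attach_logs_on_failure is accepted for config symmetry and never consulted.
    msg = EmailMessage()
    msg["Subject"] = subject
    msg.set_content(body)
    for path in sorted(log_dir.glob("*.log")):
        # Passing a str makes EmailMessage store the part as text/plain.
        msg.add_attachment(path.read_text(errors="replace"), filename=path.name)
    return msg
```

Reading with errors="replace" is a small defensive choice: a log containing stray binary output should still get attached rather than crash the email builder.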
What the email actually looks like
For a failed run with CUDA OOM:
Subject: [GPUAlert] ❌ FAILED: python train.py --epochs 50
Status : FAILED
Command : python train.py --epochs 50
Duration : 1h 47m 12s
Exit Code: 1
ERROR SUMMARY
─────────────
GPU out-of-memory (CUDA OOM)
Suggestion: Try reducing batch size, using gradient checkpointing, or a larger GPU.
LAST 15 LINES OF STDERR
─────────────
[18:42:11] RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB
...
ATTACHED FILES (3)
- stdout.log (12.4 KB)
- stderr.log (3.1 KB)
- combined.log (15.8 KB)
For a successful run, the body has the metrics line and a smaller payload.
Slurm
If you're on a cluster, you've probably already kicked the job off with sbatch.
The wrapper-around-command pattern doesn't apply because Slurm already owns the lifecycle. So there's a separate command:
sbatch my_job.sh # returns "Submitted batch job 12345"
gpualert slurm 12345
It polls sacct every 10 seconds (configurable) until the job hits a terminal state, then sends the email. Same body format, same attachment behaviour, same exit-code semantics.
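A polling loop of this shape might look like the sketch below. It is an illustration, not the tool's implementation: the choice of sacct flags, the set of states treated as terminal, and the names query_state and wait_for_job are assumptions. The query and sleep hooks are injectable only so the loop can be exercised without a cluster.

```python
import subprocess
import time

# States treated as terminal in this sketch. On some clusters PREEMPTED or
# NODE_FAIL jobs get requeued, so adjust the set to local policy.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT",
                   "OUT_OF_MEMORY", "NODE_FAIL", "PREEMPTED", "BOOT_FAIL", "DEADLINE"}


def query_state(job_id: str) -> str:
    """Ask sacct for the job's state (allocation line only, no header)."""
    result = subprocess.run(
        ["sacct", "-j", job_id, "-X", "--noheader", "--parsable2", "--format=State"],
        capture_output=True, text=True, check=False,
    )
    lines = result.stdout.strip().splitlines()
    # "CANCELLED by 1234" -> "CANCELLED"; empty output means no record yet.
    return lines[0].split()[0] if lines else "UNKNOWN"


def wait_for_job(job_id: str, interval: float = 10.0,
                 query=query_state, sleep=time.sleep) -> str:
    """Poll until the job reaches a terminal state, then return that state."""
    while True:
        state = query(job_id)
        if state in TERMINAL_STATES:
            return state
        sleep(interval)
```

Treating an empty sacct response as "UNKNOWN" rather than an error matters in practice, because accounting records can lag a few seconds behind sbatch.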
Try it
pip install gpualert
gpualert config --init
gpualert test-email
gpualert run --dry-run -- echo "hello"
--dry-run prints the email it would send without actually touching SMTP.
Useful for kicking the tires before you trust it with a real overnight job.
Code: https://github.com/Parv-01/gpualert
PyPI: https://pypi.org/project/gpualert/
What's next
The roadmap, depending on what people ask for:
- Slack / Discord / Telegram notifier backends: the abstract BaseNotifier is already in place; each new backend is a few hundred lines.
- Keyring integration so the SMTP password lives in the OS keyring instead of ~/.gpualert/config.toml.
- Multi-job dashboards: a web view of recent runs across hosts.
- Prometheus exporter for cluster-wide stats.
If any of those scratch your itch, open a Discussion on the repo. PRs welcome: there's a 131-check end-to-end harness, so it's hard to break things accidentally.
And even if something does break, well, that's why we all love logs, git versioning, and debugging (high on caffeinated sprint sessions, hehe... >_<)