<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tanay Joshi</title>
    <description>The latest articles on DEV Community by Tanay Joshi (@tanay_joshi_04).</description>
    <link>https://dev.to/tanay_joshi_04</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3998619%2Fea87aab2-c961-4620-8647-3527bfd643cf.jpg</url>
      <title>DEV Community: Tanay Joshi</title>
      <link>https://dev.to/tanay_joshi_04</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tanay_joshi_04"/>
    <language>en</language>
    <item>
      <title>Never lose a training run again: a checkpoint-and-resume playbook for ephemeral GPUs</title>
      <dc:creator>Tanay Joshi</dc:creator>
      <pubDate>Tue, 23 Jun 2026 11:17:15 +0000</pubDate>
      <link>https://dev.to/tanay_joshi_04/never-lose-a-training-run-again-a-checkpoint-and-resume-playbook-for-ephemeral-gpus-2m1j</link>
      <guid>https://dev.to/tanay_joshi_04/never-lose-a-training-run-again-a-checkpoint-and-resume-playbook-for-ephemeral-gpus-2m1j</guid>
      <description>&lt;p&gt;▶ Prefer to play with it? There's an interactive version of this article where you can break things yourself: &lt;a href="https://resumable-ml-training.vercel.app" rel="noopener noreferrer"&gt;https://resumable-ml-training.vercel.app&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I train a lot of models on compute that can disappear at any second: free notebooks, pre-emptible instances, whatever I can get. Early on, a single disconnect could wipe out hours of work. So I built a pattern that cuts the cost of a dropped session to, at most, the epoch that was in progress. This is that pattern, written up generically so you can drop it into any training loop.&lt;/p&gt;
&lt;h2&gt;The 2 a.m. disconnect&lt;/h2&gt;

&lt;p&gt;If you have ever trained a model on a free GPU, you know the feeling. You kick off a long run, check back later, and the session is gone. The notebook is disconnected, the runtime recycled, and every epoch since the last time you looked has evaporated. You start again from zero.&lt;/p&gt;

&lt;p&gt;Free and pre-emptible compute is one of the best deals in machine learning — but it is &lt;em&gt;ephemeral&lt;/em&gt;. The machine can vanish at any moment: idle timeouts, usage caps, spot-instance reclamation. Fighting this with keep-alive hacks treats the symptom. The real fix is to make your training &lt;strong&gt;resumable&lt;/strong&gt; and your pipeline &lt;strong&gt;idempotent&lt;/strong&gt;, so an interruption simply doesn't matter.&lt;/p&gt;

&lt;p&gt;Here is the pattern I now use for every long run. It rests on five ideas.&lt;/p&gt;
&lt;h2&gt;1. Checkpoint the &lt;em&gt;whole&lt;/em&gt; state, not just the weights&lt;/h2&gt;

&lt;p&gt;The most common mistake is saving only &lt;code&gt;model.state_dict()&lt;/code&gt;. That is not enough to resume training. If you reload only the weights and start a fresh optimizer, you lose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the &lt;strong&gt;optimizer&lt;/strong&gt; state (Adam's moment estimates — momentum and variance),&lt;/li&gt;
&lt;li&gt;the &lt;strong&gt;learning-rate scheduler&lt;/strong&gt; position (so the LR jumps back to its starting value),&lt;/li&gt;
&lt;li&gt;the &lt;strong&gt;epoch counter&lt;/strong&gt; (so you re-run epochs you already finished),&lt;/li&gt;
&lt;li&gt;the &lt;strong&gt;best-so-far&lt;/strong&gt; tracking and early-stopping counter,&lt;/li&gt;
&lt;li&gt;the &lt;strong&gt;RNG state&lt;/strong&gt; (so the run is no longer reproducible across the break).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A resumable checkpoint captures all of it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;make_checkpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;epoch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;epochs_done&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;epoch&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;       &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;state_dict&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;optimizer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;state_dict&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scheduler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;state_dict&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;best_metric&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rng&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;torch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_rng_state&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_rng_state_all&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_available&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;numpy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_state&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getstate&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save this &lt;strong&gt;every epoch&lt;/strong&gt;. For small and mid-sized models the write is usually negligible next to an epoch of training, and the payoff is never losing more than one epoch of work.&lt;/p&gt;

&lt;h2&gt;2. Write atomically, or risk a corrupted checkpoint&lt;/h2&gt;

&lt;p&gt;Here is a subtle trap: if the machine dies &lt;em&gt;while you are writing the checkpoint&lt;/em&gt;, you get a half-written, unreadable file — and now you have lost everything, including the good checkpoint you just overwrote.&lt;/p&gt;

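&lt;p&gt;The fix is the &lt;strong&gt;write-temp-then-rename&lt;/strong&gt; trick. On POSIX filesystems, a rename within the same filesystem is atomic: the checkpoint file is either the complete old version or the complete new version, never a torn mix. Network and FUSE-mounted drives may not give the same guarantee, so check how yours behaves.&lt;br&gt;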
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;save_atomic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;tmp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.tmp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;   &lt;span class="c1"&gt;# str() so pathlib.Path arguments work too&lt;/span&gt;
    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# atomic on the same filesystem
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This one helper has saved me more than once.&lt;/p&gt;
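&lt;p&gt;One caveat: a rename guarantees the file is never torn, but not that its bytes have reached the disk before a crash. Below is a minimal sketch of a stricter variant, assuming a local POSIX-style filesystem and using only the standard library. The name &lt;code&gt;save_json_atomic&lt;/code&gt; is hypothetical, and it writes JSON rather than a &lt;code&gt;torch.save&lt;/code&gt; pickle so it runs anywhere.&lt;/p&gt;

```python
import json
import os
import tempfile
from pathlib import Path


def save_json_atomic(path, obj):
    """Write obj as JSON so readers see the old file or the new one, never a partial one."""
    path = Path(path)
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(obj, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_name, path)  # atomic swap on the same filesystem
    except BaseException:
        # Remove the temp file if anything failed before the swap.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise


# Demo in a throwaway directory; the metric value is made up for illustration.
demo_dir = Path(tempfile.mkdtemp())
save_json_atomic(demo_dir / "results.json", {"best_metric": 0.91})
print(json.loads((demo_dir / "results.json").read_text()))
```

&lt;p&gt;The same flush-and-fsync step can be applied to a checkpoint by opening the temp file in binary mode and passing the file object to &lt;code&gt;torch.save&lt;/code&gt;.&lt;/p&gt;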

&lt;h2&gt;3. A "done marker" makes the entire job idempotent&lt;/h2&gt;

&lt;p&gt;Resuming one run is good. Resuming a &lt;em&gt;sweep&lt;/em&gt; of many runs is better. If you train across several configurations, seeds, or datasets, you want to re-launch the whole batch and have it automatically skip everything already finished and resume only the one that was interrupted.&lt;/p&gt;

&lt;p&gt;The trick is a &lt;strong&gt;done marker&lt;/strong&gt;: write the final results file (metrics, summary) only when a run fully completes. Then the launcher logic is trivial:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_dir&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;results.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;           &lt;span class="c1"&gt;# done marker
&lt;/span&gt;        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[skip] already complete: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;run_dir&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_dir&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;results.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# ... train (resuming from checkpoint if present) ...
&lt;/span&gt;    &lt;span class="nf"&gt;save_atomic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_dir&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;results.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# write marker LAST
&lt;/span&gt;    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_dir&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;checkpoint.pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;unlink&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;missing_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# done -&amp;gt; drop checkpoint
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now your orchestration loop is fully restartable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;SEEDS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;CONFIGS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;run_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;run_dir_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Re-run it after any disconnect. Finished work is skipped, interrupted work resumes, and no completed run is repeated (at most, the epoch that was in flight is redone). This is the same principle build tools like &lt;code&gt;make&lt;/code&gt; use: declare the output, and do the work only if the output is missing or out of date.&lt;/p&gt;
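&lt;p&gt;To see the skip logic end to end, here is a small self-contained sketch of the done-marker idea using only the standard library. It assumes the results are a JSON-serializable dict; &lt;code&gt;train_fn&lt;/code&gt; and &lt;code&gt;fake_train&lt;/code&gt; are hypothetical stand-ins for a real training function.&lt;/p&gt;

```python
import json
import os
import tempfile
from pathlib import Path


def run_one(run_dir, cfg, train_fn):
    """Run train_fn(cfg) once per run_dir; later calls return the stored results."""
    run_dir = Path(run_dir)
    marker = run_dir / "results.json"  # done marker, written last
    if marker.exists():
        return json.loads(marker.read_text())
    run_dir.mkdir(parents=True, exist_ok=True)
    results = train_fn(cfg)  # hypothetical: trains (or resumes) and returns a dict
    tmp = marker.with_name(marker.name + ".tmp")
    tmp.write_text(json.dumps(results))
    os.replace(tmp, marker)  # the marker appears only once it is complete
    return results


calls = []


def fake_train(cfg):
    # Stand-in for real training so the sketch runs anywhere.
    calls.append(cfg["seed"])
    return {"seed": cfg["seed"], "status": "finished"}


sweep_dir = Path(tempfile.mkdtemp())
for attempt in range(2):  # the second pass simulates a re-launch after a disconnect
    for seed in (0, 1):
        run_one(sweep_dir / f"seed{seed}", {"seed": seed}, fake_train)

print(calls)  # each seed trained once: [0, 1]
```

&lt;p&gt;Because the marker is the last thing written, a run that dies mid-training leaves no marker behind and is picked up again on the next launch.&lt;/p&gt;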

&lt;h2&gt;4. Put the state where it outlives the machine&lt;/h2&gt;

&lt;p&gt;A checkpoint is only useful if it survives the thing that died. The number-one mistake on ephemeral compute is writing checkpoints to the node's &lt;strong&gt;local scratch disk&lt;/strong&gt; — which is wiped the instant the runtime is recycled. Your checkpoints must live on storage that is &lt;em&gt;external&lt;/em&gt; to the compute.&lt;/p&gt;

&lt;p&gt;You have two families of options.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud / network storage&lt;/strong&gt; (best for ephemeral cloud GPUs):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A mounted cloud drive (Google Drive, Dropbox, OneDrive, iCloud Drive).&lt;/li&gt;
&lt;li&gt;An object-store bucket (S3, GCS, R2, Azure Blob) you &lt;code&gt;sync&lt;/code&gt; to after each epoch.&lt;/li&gt;
&lt;li&gt;A network filesystem (NFS/SMB) on a persistent volume.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Local / self-hosted storage&lt;/strong&gt; (best when &lt;em&gt;you&lt;/em&gt; own the machine, or for hybrid setups):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An external SSD/HDD, or a second internal disk that is not part of the ephemeral root.&lt;/li&gt;
&lt;li&gt;A home server or NAS the training box can reach over the LAN.&lt;/li&gt;
&lt;li&gt;Your laptop: periodically &lt;code&gt;rsync&lt;/code&gt;/&lt;code&gt;scp&lt;/code&gt; the checkpoint directory back from a remote box, so a copy always exists on hardware you control.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A clean trick that works in both worlds: keep your &lt;strong&gt;code on the fast local disk&lt;/strong&gt; but &lt;strong&gt;symlink the checkpoint/output directory to persistent storage&lt;/strong&gt;. You get local-disk speed for reads and durable state for writes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# code stays on fast local disk; outputs live on durable storage&lt;/span&gt;
&lt;span class="nb"&gt;ln&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; /mnt/persistent/my_project/outputs  ./outputs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The principle is storage-agnostic: &lt;em&gt;the checkpoint must outlive the compute node.&lt;/em&gt; Cloud bucket, mounted drive, NAS, or an external SSD on your desk — any of them works, as long as it is not the disk that gets wiped.&lt;/p&gt;
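&lt;p&gt;When writing straight to the durable location is slow, one option is to save locally and then mirror the finished file. The following is a rough sketch under that assumption; &lt;code&gt;mirror_to_durable&lt;/code&gt; is a hypothetical helper, and two temp directories stand in for local scratch and a persistent mount. Whether the final rename is truly atomic depends on the storage behind the mount.&lt;/p&gt;

```python
import os
import shutil
import tempfile
from pathlib import Path


def mirror_to_durable(local_file, durable_dir):
    """Copy a finished local checkpoint into durable_dir without exposing a partial copy."""
    local_file = Path(local_file)
    durable_dir = Path(durable_dir)
    durable_dir.mkdir(parents=True, exist_ok=True)
    final = durable_dir / local_file.name
    partial = durable_dir / (local_file.name + ".partial")
    shutil.copyfile(local_file, partial)  # slow part: an interruption leaves only the .partial file
    os.replace(partial, final)  # quick swap once the copy is complete
    return final


# Demo: temp directories stand in for local scratch and a persistent mount.
scratch = Path(tempfile.mkdtemp())
durable = Path(tempfile.mkdtemp())
ckpt = scratch / "checkpoint.pt"
ckpt.write_bytes(b"fake checkpoint bytes")
copied = mirror_to_durable(ckpt, durable)
print(copied.read_bytes() == ckpt.read_bytes())  # True
```

&lt;p&gt;On resume, the training script would look for the checkpoint in the durable directory first, since the local copy may be gone.&lt;/p&gt;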

&lt;h2&gt;5. Resume means &lt;em&gt;continue&lt;/em&gt;, not restart&lt;/h2&gt;

&lt;p&gt;With state saved durably, the training loop just checks for a checkpoint on startup and continues:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;start_epoch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;checkpoint_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;ck&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;checkpoint_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;map_location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weights_only&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_state_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ck&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_state_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ck&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;optimizer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_state_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ck&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scheduler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;best_metric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ck&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;best_metric&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;start_epoch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ck&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;epochs_done&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nf"&gt;restore_rng&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ck&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rng&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[resume] continuing from epoch &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;start_epoch&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;epoch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start_epoch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_epochs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;   &lt;span class="c1"&gt;# note: start_epoch, not 0
&lt;/span&gt;    &lt;span class="nf"&gt;train_one_epoch&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
    &lt;span class="nf"&gt;save_atomic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;checkpoint_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;make_checkpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;epoch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A good smoke test: start a run, kill it mid-training, restart it, and confirm the log shows &lt;code&gt;[resume] continuing from epoch N&lt;/code&gt;, with the learning rate picking up smoothly where it left off rather than jumping back to its initial value. If the LR is continuous, your scheduler state survived; a training loss that carries on without a sudden spike is a good sign the optimizer state did too.&lt;/p&gt;
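&lt;p&gt;The loop above calls &lt;code&gt;restore_rng&lt;/code&gt; without showing it. One possible sketch, assuming the &lt;code&gt;rng&lt;/code&gt; dict has the keys used in &lt;code&gt;make_checkpoint&lt;/code&gt;, is below. NumPy and PyTorch are imported lazily so the demo at the bottom runs with the standard library alone; the &lt;code&gt;.cpu()&lt;/code&gt; calls are there because &lt;code&gt;map_location=device&lt;/code&gt; may have moved the saved state tensors to the GPU, while the setters expect CPU tensors.&lt;/p&gt;

```python
import random


def restore_rng(rng):
    """Restore the RNG states captured in the checkpoint's "rng" dict."""
    random.setstate(rng["python"])
    if rng.get("numpy") is not None:
        import numpy as np  # imported lazily so the sketch runs without NumPy
        np.random.set_state(rng["numpy"])
    if rng.get("torch") is not None:
        import torch  # imported lazily so the sketch runs without PyTorch
        # The setters expect CPU tensors; map_location may have moved them to the GPU.
        torch.set_rng_state(rng["torch"].cpu())
        if rng.get("cuda") is not None and torch.cuda.is_available():
            torch.cuda.set_rng_state_all([s.cpu() for s in rng["cuda"]])


# Demo using only Python's RNG: the draws after a restore repeat exactly.
saved = {"python": random.getstate()}
first = [random.random() for _ in range(3)]
restore_rng(saved)
second = [random.random() for _ in range(3)]
print(first == second)  # True
```

&lt;p&gt;Restoring these states makes the random draws repeatable across the break; data-loader workers and non-deterministic GPU kernels can still introduce small differences.&lt;/p&gt;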

&lt;h2&gt;Gotchas I hit (so you don't have to)&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The &lt;code&gt;weights_only&lt;/code&gt; pickle trap.&lt;/strong&gt; Since PyTorch 2.6, &lt;code&gt;torch.load&lt;/code&gt; defaults to &lt;code&gt;weights_only=True&lt;/code&gt;, which &lt;em&gt;refuses&lt;/em&gt; to unpickle object types outside its allowlist, such as the NumPy array inside the NumPy RNG state above. For your own trusted checkpoint files, pass &lt;code&gt;weights_only=False&lt;/code&gt;. (Never do this for files from untrusted sources: unpickling them can run arbitrary code.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Append your logs, don't truncate.&lt;/strong&gt; On resume, open the log file in append mode so you keep the full history across restarts instead of overwriting it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mind early-stopping state.&lt;/strong&gt; If you track "epochs since last improvement," checkpoint that counter too — otherwise a resume silently resets your patience.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Setup cells will re-run, and that's fine.&lt;/strong&gt; When a runtime is recycled, the interpreter's memory is gone — you &lt;em&gt;cannot&lt;/em&gt; avoid re-importing libraries or re-mounting storage. Make those steps cheap and idempotent (mount only if not mounted; install only if missing; cache any preprocessed data to durable storage). The goal is not to skip setup; it is to make the &lt;em&gt;expensive&lt;/em&gt; work resumable.&lt;/li&gt;
&lt;/ul&gt;
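&lt;p&gt;For the early-stopping point, a small tracker with its own &lt;code&gt;state_dict&lt;/code&gt; makes the counter easy to drop into the checkpoint. This is an illustrative sketch, not code from the repo: the &lt;code&gt;EarlyStopping&lt;/code&gt; class is hypothetical, it assumes a higher metric is better, and the metric values are made up.&lt;/p&gt;

```python
class EarlyStopping:
    """Tracks the best metric and the epochs since it last improved (higher is better)."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric):
        """Record one epoch's metric; return True when training should stop."""
        if metric > self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

    def state_dict(self):
        return {"best": self.best, "bad_epochs": self.bad_epochs}

    def load_state_dict(self, state):
        self.best = state["best"]
        self.bad_epochs = state["bad_epochs"]


# Demo: the patience counter survives a simulated restart.
stopper = EarlyStopping(patience=2)
stopper.step(0.80)  # improvement, counter stays at 0
stopper.step(0.79)  # no improvement, counter becomes 1
saved = stopper.state_dict()  # this dict would go into the checkpoint

resumed = EarlyStopping(patience=2)
resumed.load_state_dict(saved)
print(resumed.step(0.78))  # True: second bad epoch in a row, patience exhausted
```

&lt;p&gt;Without the restored counter, a fresh tracker would report only one bad epoch here and training would run longer than intended.&lt;/p&gt;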

&lt;h2&gt;The payoff&lt;/h2&gt;

&lt;p&gt;Once this is in place, the disconnect stops being a disaster and becomes a shrug. You reconnect, re-run the cheap setup, hit "go," and watch it print &lt;code&gt;[resume] continuing from epoch …&lt;/code&gt;. No lost epochs. No duplicate runs. No keep-alive browser hacks.&lt;/p&gt;

&lt;p&gt;A quick checklist to make any training job bulletproof:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Checkpoint model &lt;strong&gt;+ optimizer + scheduler + epoch + best metric + RNG&lt;/strong&gt;, every epoch&lt;/li&gt;
&lt;li&gt;[ ] Write checkpoints &lt;strong&gt;atomically&lt;/strong&gt; (temp file → rename)&lt;/li&gt;
&lt;li&gt;[ ] Write a &lt;strong&gt;done marker&lt;/strong&gt; only on full completion; skip finished runs&lt;/li&gt;
&lt;li&gt;[ ] Store state on something that &lt;strong&gt;outlives the compute node&lt;/strong&gt; (cloud or local)&lt;/li&gt;
&lt;li&gt;[ ] On startup, &lt;strong&gt;resume from the saved epoch&lt;/strong&gt;, not from zero&lt;/li&gt;
&lt;li&gt;[ ] Test it: interrupt, restart, confirm a clean resume&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ephemeral compute is one of the best deals in machine learning. With a resumable, idempotent pipeline, you get all of its upside and almost none of its fragility.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Found this useful?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I write about the unglamorous engineering that makes ML actually ship.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🔗 Interactive walkthrough — &lt;a href="https://resumable-ml-training.vercel.app" rel="noopener noreferrer"&gt;https://resumable-ml-training.vercel.app&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;💻 Runnable code (MIT) — &lt;a href="https://github.com/TanayMjoshi/Bulletproof-training-on-ephemeral-GPUs" rel="noopener noreferrer"&gt;https://github.com/TanayMjoshi/Bulletproof-training-on-ephemeral-GPUs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;💼 LinkedIn — &lt;a href="https://www.linkedin.com/in/tanay-joshi-2a3bba1ab/" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/tanay-joshi-2a3bba1ab/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🐦 X / Twitter — &lt;a href="https://x.com/MysteryMan60934" rel="noopener noreferrer"&gt;https://x.com/MysteryMan60934&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If this saved you a training run, a ⭐ on the repo or a follow means a lot.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>mlops</category>
      <category>learning</category>
    </item>
  </channel>
</rss>
