My 8-hour job died at hour 3 and I had checkpointed almost nothing

#claudecode #automation #productivity #devops

For hours the job had been running clean.

It was a long grind, the kind you start and walk away from, trusting it to chew through the pile while you do something else. Hundreds of items, one after another, all fine.

Three hours in, the quota ran dry and everything stopped.

I came back to a dead run and a sinking feeling. Work had been going for hours, yet my last save sat way back near the start, because I had told myself I would commit when it was all done. So resuming meant starting again near the beginning, with three hours of finished work stranded and no way to reach it.

That is the moment I learned the interruption was never the real problem.

The cap does not care about your progress.

A quota, a timeout, a crash, a killed process. None of them check whether you are at a clean stopping point. They land whenever they land, mid-item, mid-thought, mid-everything.

Blaming the interruption is comfortable, because it sits outside you. It is also the one part you cannot control. Your cap will always arrive at the worst moment, and planning around "it will not happen this run" is planning to lose work.

So I stopped trying to dodge the interruption and started designing for it.

Resume cost is a number you choose.

Here is the idea that changed how I build long jobs: how much work an interruption can destroy is a setting, decided before the run starts.

Save progress every N items and your worst-case loss is N items. Pick N of 500 and a crash can cost you 500 items of redone work. Pick N of 10 and the same crash costs you 10. Same interruption, wildly different pain, and the only difference is a decision you made up front.
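The arithmetic fits in a few lines. This is a minimal sketch, not code from the run that died: `items_lost` and its parameters are names made up for illustration, and it counts only fully finished items, so the one item in flight when the crash lands comes on top of the number it returns.

```python
from typing import Optional


def items_lost(completed_before_crash: int, checkpoint_every: Optional[int]) -> int:
    """Finished items that must be redone after a crash.

    Hypothetical helper for illustration only.
    checkpoint_every=None models a single commit at the very end.
    """
    if checkpoint_every is None:
        # Nothing was ever saved, so every finished item is stranded.
        return completed_before_crash
    if checkpoint_every < 1:
        raise ValueError("checkpoint_every must be at least 1")
    # The last save happened at the most recent multiple of checkpoint_every.
    return completed_before_crash % checkpoint_every


# The same crash, after 1,499 finished items, under three settings.
print(items_lost(1499, 500))   # 499
print(items_lost(1499, 10))    # 9
print(items_lost(1499, None))  # 1499
```

The crash is identical in all three calls. Only the setting changes, and the setting is the whole bill.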

That grind had effectively chosen N of everything: one commit, at the very end. So the worst case was the entire run, and the entire run is what I paid.

Why committing at the end is the bug.

A single end-commit feels efficient: fewer writes, cleaner history, less fuss while the job runs.

It is a trap. Every interruption before that final commit erases all of it. You have built a system whose resume cost equals its full runtime, then handed it to an environment that interrupts at random.

My fix was unglamorous: save often, in small bounded steps, and never save a broken half-item. A failed item gets skipped, never checkpointed, so you never resume into a corrupt state. Boring, mechanical, and it turns a lost afternoon into a lost minute.
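One way that discipline might look in code. Treat it as a sketch under assumptions, not the exact script I run: `load_done`, `save_done`, `run_with_checkpoints`, the JSON file of finished item IDs, and the default of 10 are all invented for this example, and it assumes each item has a stable string ID and that `process` stores its own output somewhere durable.

```python
import json
import os
import tempfile
from typing import Callable, Iterable, Set


def load_done(path: str) -> Set[str]:
    """Return the IDs recorded as finished, or an empty set on a first run."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return set(json.load(f))
    except FileNotFoundError:
        return set()


def save_done(path: str, done: Set[str]) -> None:
    """Write the checkpoint to a temp file, then swap it in with os.replace.

    A crash during the write leaves the previous checkpoint untouched.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(sorted(done), f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)
    except BaseException:
        # Remove the leftover temp file, then let the error propagate.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise


def run_with_checkpoints(
    items: Iterable[str],
    process: Callable[[str], None],
    path: str,
    every: int = 10,
) -> Set[str]:
    """Process items, saving progress after every `every` successes."""
    done = load_done(path)
    since_save = 0
    for item in items:
        if item in done:
            continue  # finished in an earlier run
        try:
            process(item)
        except Exception:
            continue  # failed item: skipped, never recorded as done
        done.add(item)
        since_save += 1
        if since_save >= every:
            save_done(path, done)
            since_save = 0
    if since_save:
        save_done(path, done)  # flush the final partial batch
    return done
```

Because `except Exception` does not catch `KeyboardInterrupt` or `SystemExit`, a killed run simply stops with the last checkpoint intact. Rerunning the same call skips everything already recorded and retries whatever failed or was still in flight.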

The opinion I will defend.

Here it is: blaming the cap, the timeout, or the crash for lost work is misdiagnosing your own design.

An interruption is a certainty, not an accident that might happen. If losing the run hurts, the real fault is the unbounded resume cost you allowed; the interruption was always going to come. Bound that cost and the same crash becomes a shrug.

What checkpoint discipline actually buys.

Nothing about my jobs runs faster now. Checkpointing adds a little overhead, and I pay it gladly.

What I bought is the right to walk away. A long job can die at any moment, at hour one or hour seven, and the cost is always the same small number of items rather than the whole run. The crash still happens. I stopped letting it set fire to hours of finished work.

If you run anything long, a batch, a grind, an overnight pipeline, decide your resume cost on purpose before you start it. Your cap is coming. The only open question is how much it gets to take.

Your turn

What is the longest run you have lost, and what would a checkpoint have saved?

If this was useful

I work through this in public, the wins and the freezes both, mostly on LinkedIn and YouTube. If the real version of building in the open is useful to you, that is where it lives. You can find me on LinkedIn, YouTube and X under Mirza Iqbal, and the work is at next8n.com.

DEV Community
