Mirza Iqbal

Posted on Jun 28

83 percent done, then it died with nothing to resume from

#claudecode #automation #devops #productivity

I came back to my laptop after dinner expecting a finished job.

What I got was a dead script and one number burned into the terminal. 83 percent.

It had been running for hours. Promoting 808 items, one at a time, until it stopped somewhere past item six hundred, because I hit my usage cap and the session got killed mid-loop.

If you have ever left a long script running and come back to find it dead, you know the exact feeling I mean.

It is less anger than arithmetic. You stand there working out how much of all that you now have to do over.

What I am not proud of

I committed once. At the very end.

The script was written to process all 808 items and then save everything in a single commit after the loop finished. Clean code, I thought.

So when it died at 83 percent, my last good save point sat hours behind the actual progress. Most of the work was on disk in scattered pieces, with nothing checkpointed. I threw it away and started the grind again.

That night I stopped caring about the success rate of the job.

I was watching the wrong number

Success rate turned out to be the wrong thing to track.

Resume cost was the number that actually mattered. How much work do I lose if this dies right now.

Looking back at my own grinding habits, I averaged roughly one commit for every three and a half hours of work. Any failure at hour three erased almost everything since the last save.

That job looked productive the entire time it ran. Quietly, it was building a bigger and bigger thing to lose.

You cannot retry a wall

Everyone reaches for retries first. Retries did nothing for me here.

A retry rescues you from a flaky network or a transient timeout. My session got killed by a hard cap. There is no retrying your way past a limit that has already shut the door.
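Here is a minimal sketch of that distinction in Python. The two exception classes and `call_with_retries` are hypothetical names for illustration, not any real library's API. The assumption is that your own code can tell a transient failure from a hard cap.

```python
import time


class TransientError(Exception):
    """Hypothetical: a flaky network call or a timeout that may pass next time."""


class HardLimitError(Exception):
    """Hypothetical: a usage cap. Repeating the call cannot succeed."""


def call_with_retries(action, attempts=3, delay_seconds=0.0):
    """Retry transient failures only; a hard limit propagates immediately."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts, give up loudly
            time.sleep(delay_seconds)
        # HardLimitError is deliberately not caught: retrying it only burns time.
```

A hard limit passes straight through on the first attempt, so the retry loop protects nothing that was done before it.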

Cleverer error handling will not save you. Move the save point instead, so it travels with the work.

You pick a batch size. Every N items, the loop checkpoints and commits what is done so far.

Resume cost stops being hours and becomes, at most, N items. My N is ten now. If it dies at item 617, I lose the handful since the last checkpoint at 610.

That is the whole idea. Bound the loss to a fixed N instead of letting it grow with the clock.
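A minimal sketch of that loop, under some assumptions: the checkpoint is a small JSON file holding the index of the next unprocessed item, and `run`, `save_checkpoint`, `load_checkpoint`, and the file format are all illustrative choices, not the script from that night. It also assumes redoing the few items after the last checkpoint is harmless.

```python
import json
import os
import tempfile

BATCH_SIZE = 10  # N: the most completed items a crash can cost


def save_checkpoint(path, next_index):
    """Atomically record the index of the next unprocessed item."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump({"next_index": next_index}, handle)
    os.replace(tmp_path, path)  # rename over the old file, never half-written


def load_checkpoint(path):
    """Return the saved index, or 0 when there is nothing to resume from."""
    try:
        with open(path) as handle:
            return json.load(handle)["next_index"]
    except FileNotFoundError:
        return 0


def run(items, process, checkpoint_path, batch_size=BATCH_SIZE):
    """Process items in order, checkpointing after every batch_size items."""
    start = load_checkpoint(checkpoint_path)
    for index in range(start, len(items)):
        process(items[index])
        done = index + 1
        if done % batch_size == 0 or done == len(items):
            save_checkpoint(checkpoint_path, done)
```

Calling `run` again with the same checkpoint path picks up at the saved index instead of item one. With 808 items and a crash on item 617, the file says 610 and the second run starts at 611.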

One opinion most people fight at first

Here is what builders of long jobs tend to disagree with until it bites them.

Success rate, for a long job, is close to vanity.

A 99 percent success rate with one commit at the end is more dangerous than an 80 percent success rate that checkpoints every ten items.

One loses everything the moment it trips. The other loses ten items.

Reliability for a long-running job lives in how cheaply it resumes. Finishing rate barely enters into it.
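The arithmetic is small enough to write down. This is a sketch with a hypothetical helper, `items_lost`, assuming items are numbered from 1 and the job dies while working on item `failed_at`.

```python
def items_lost(failed_at, batch_size=None):
    """Completed items that must be redone after a crash on item `failed_at`.

    batch_size=None models a single commit at the very end of the job.
    """
    completed = failed_at - 1
    if batch_size is None:
        return completed  # nothing was ever saved
    return completed % batch_size  # only the items since the last checkpoint


print(items_lost(617))      # 616: one commit at the end loses everything
print(items_lost(617, 10))  # 6: checkpointing every ten loses a handful
```

The first number grows with every hour the job runs. The second can never reach the batch size.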

Patterns that feel like fixes and are not

When this failure mode shows up, three reflexes come out first. None of them shrinks the resume cost.

A bigger queue buffer holds more work in memory, so you lose more when it dies.

A watchdog restarts the process, usually from the top, which is the problem you started with.

More retries fight a transient enemy when the real one is a hard limit.

All three look like progress because the visible error rate drops. They convert a loud failure into a silent one, and silent failures cost more because they hide longer.

A trap hiding inside the fix

I learned one more thing the hard way.

Your checkpoint has to skip the broken item. Saving it is the mistake.

If item 412 is malformed and you commit it anyway, you have saved broken state, and your next resume starts from a lie. Count an item toward the batch only after it actually passed. A failed item gets skipped. It never reaches the checkpoint.

That single rule is the difference between a resume you can trust and one that quietly carries the bug forward.
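One way to sketch that rule, with assumptions stated up front: `run_batches`, `process`, and `commit` are hypothetical names, and a malformed item is assumed to raise `ValueError`. Your real failure signal may look different.

```python
def run_batches(items, process, commit, batch_size=10):
    """Commit only items that passed; failed items never enter a batch."""
    pending = []  # passed results waiting for the next commit
    skipped = []  # failed items, kept out of every checkpoint
    for item in items:
        try:
            result = process(item)
        except ValueError as error:  # assumption: malformed items raise ValueError
            skipped.append((item, str(error)))
            continue
        pending.append(result)
        if len(pending) == batch_size:
            commit(pending)
            pending = []  # start a fresh list, leaving the committed one intact
    if pending:
        commit(pending)  # flush the final partial batch
    return skipped
```

The batch counter here is the number of passed items, so a run of bad input cannot push a half-broken batch into a commit. The returned `skipped` list is what you go back and inspect by hand.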

Where I see the same shape now

Once you start looking, this turns up everywhere.

Think of the automation feeding a Monday morning dashboard. The overnight batch refreshing lead data. A migration grinding for hours against a real table.

They are written to run and finish. Almost nobody asks what happens at hour three when a quota rotates or a machine reboots.

Most of the time the answer is a person re-running yesterday by hand while everyone waits.

Calm, at the end

My grind finished the second time around. It took the same number of hours.

What changed was that I stopped being afraid of it dying, because the most it could ever cost me was ten items.

That is a strange kind of calm to feel about a script. It came entirely from changing which number I was watching.

Your turn

What is the longest job you have left running, and what would it cost you if it died at 83 percent right now?

If this was useful

I work through this in public, the wins and the freezes both, mostly on LinkedIn and YouTube. If the real version of building in the open is useful to you, that is where it lives. Find me on X and GitHub, and find the work at next8n.com.

DEV Community