My Pipeline Crashed Mid-Generation 3 Times — Here's What I Learned About Crash Recovery
🎁 Free resource: AI Publishing Checklist — 7 steps to ship a technical ebook with Python (free, no email required) · Full pipeline + 10 scripts: germy5.gumroad.com/l/xhxkzz (pay what you want, min $9.99)
My laptop died mid-chapter on the third generation run.
Not a graceful shutdown. A hard power loss. The process was in RUNNING state on chapter 7 of 10 when the battery hit zero.
When I reopened the laptop: chapters 1–6 were in DONE state. Chapter 7 was in RUNNING. The pipeline resumed from chapter 7, treated the RUNNING state as an incomplete run, reset it to PENDING, and re-generated cleanly.
No manual intervention. No duplicate API calls for chapters 1–6. No data loss.
This is what a crash-safe state machine looks like in practice.
The full pipeline (including this state machine) is at germy5.gumroad.com/l/xhxkzz — $12.99, 30-day refund.
The Three Crashes
Crash 1: Hard Power Loss (Battery)
Chapter: 7 of 10
State at crash: RUNNING
Recovery: Chapter 7 reset to PENDING on next startup. Chapters 1–6 untouched (DONE). Re-generated chapter 7 from scratch.
API cost saved: 6 chapters × ~$0.05 each = $0.30
Small amount. But the principle matters — the system didn't re-do finished work.
Crash 2: API Rate Limit (429)
Chapter: 4 of 10
State at crash: RUNNING (generation was mid-stream when the 429 hit)
Initial behavior (before fix): The chapter was half-generated when the exception propagated. The state stayed RUNNING.
This exposed a bug: I wasn't resetting RUNNING to PENDING on startup. On restart, the pipeline skipped chapter 4 entirely (saw it as RUNNING = in progress) and moved to chapter 5.
Fix:
```python
def on_startup(chapters):
    """
    Any chapter in RUNNING state on startup was interrupted.
    Reset it to PENDING so it gets re-generated.
    """
    for chapter in chapters:
        if chapter.state == ChapterState.RUNNING:
            print(f"[startup] Resetting chapter {chapter.id} from RUNNING → PENDING")
            chapter.state = ChapterState.PENDING
            save_state(chapter)
```
This became the canonical startup hook. Every run now calls on_startup() before processing begins.
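To see the hook in isolation, here's a minimal harness — the `Chapter` dataclass and the in-memory-only reset are stand-ins for the real pipeline's objects, and `save_state` is omitted:

```python
from dataclasses import dataclass
from enum import Enum

class ChapterState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"

@dataclass
class Chapter:
    id: int
    state: ChapterState

def on_startup(chapters):
    # Any RUNNING chapter was interrupted mid-generation; reset it.
    for chapter in chapters:
        if chapter.state == ChapterState.RUNNING:
            chapter.state = ChapterState.PENDING

chapters = [Chapter(1, ChapterState.DONE), Chapter(2, ChapterState.RUNNING)]
on_startup(chapters)
print([c.state.value for c in chapters])  # ['done', 'pending']
```

Chapter 1 stays DONE; only the interrupted chapter is reset.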
Crash 3: Keyboard Interrupt (Ctrl+C)
Chapter: 2 of 10
State at crash: RUNNING
Recovery: Same as crash 1 after the startup fix. Chapter 2 reset to PENDING. Clean re-generation.
By default, SIGINT surfaces in Python as a KeyboardInterrupt exception. I replaced that default with a handler:
```python
import signal, sys

def signal_handler(sig, frame):
    print("\n[interrupt] Caught SIGINT. State saved. Run again to resume.")
    sys.exit(0)

signal.signal(signal.SIGINT, signal_handler)
```
sys.exit(0) raises SystemExit rather than killing the process outright, so finally blocks still run during stack unwinding — which is what ensures the state file is written before the process exits.
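That claim is easy to verify in isolation: SystemExit is an ordinary exception as far as unwinding is concerned, so a finally block still executes. A minimal demonstration (not pipeline code):

```python
import sys

log = []
try:
    try:
        sys.exit(0)  # raises SystemExit; does not kill the process instantly
    finally:
        log.append("state persisted")  # runs during unwinding
except SystemExit:
    pass  # caught here only so the demo can continue

print(log)  # ['state persisted']
```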
The State Machine That Makes This Work
Four states. Strict transition rules. Disk write on every transition.
```python
from enum import Enum
import json, os

class ChapterState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"
    NEEDS_REVIEW = "needs_review"

# Legal transitions
TRANSITIONS = {
    ChapterState.PENDING: [ChapterState.RUNNING],
    # RUNNING can also fall back to PENDING (rate-limit retry, startup reset)
    ChapterState.RUNNING: [ChapterState.DONE, ChapterState.NEEDS_REVIEW,
                           ChapterState.PENDING],
    ChapterState.DONE: [],  # terminal
    ChapterState.NEEDS_REVIEW: [ChapterState.PENDING],  # after human fix
}

STATE_FILE = "pipeline_state.json"

def transition(chapter, new_state: ChapterState):
    if new_state not in TRANSITIONS[chapter.state]:
        raise ValueError(
            f"Invalid transition: {chapter.state} → {new_state} "
            f"for chapter {chapter.id}"
        )
    chapter.state = new_state
    _persist(chapter)  # disk write happens here, before the next line

def _persist(chapter):
    """Write state to disk immediately. If this fails, we want to know."""
    state_data = load_all_state()
    state_data[str(chapter.id)] = chapter.state.value  # JSON object keys are strings
    with open(STATE_FILE, "w") as f:
        json.dump(state_data, f, indent=2)

def load_all_state() -> dict:
    if not os.path.exists(STATE_FILE):
        return {}
    with open(STATE_FILE) as f:
        return json.load(f)
```

Note the RUNNING → PENDING entry in TRANSITIONS: without it, the rate-limit handler in the processing loop (which sends an interrupted chapter back to PENDING) would itself raise an invalid-transition error.
Why _persist() before continuing:
If you write state after the operation completes, a crash between "operation done" and "state written" leaves you with work that happened but isn't recorded. On restart, you repeat the work.
If you write state before the operation, a crash between "state written" and "operation done" leaves you with a chapter in RUNNING state that will reset to PENDING on startup — and re-run cleanly.
The failure mode of "write before" is always recoverable. The failure mode of "write after" is sometimes not.
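One caveat the `_persist()` snippet glosses over: `json.dump` itself can be interrupted mid-write, leaving a truncated state file. A write-to-temp-then-rename sketch closes that gap (`persist_atomic` is my name, not a function from the pipeline):

```python
import json
import os
import tempfile

def persist_atomic(state_data: dict, path: str = "pipeline_state.json"):
    """Write via temp file + rename so a crash never truncates the live file."""
    dir_name = os.path.dirname(os.path.abspath(path))
    # Temp file must be on the same filesystem as the target for the rename to be atomic
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state_data, f, indent=2)
            f.flush()
            os.fsync(f.fileno())  # force the bytes to disk before the rename
        os.replace(tmp_path, path)  # readers see old or new state, never partial
    except BaseException:
        os.unlink(tmp_path)  # clean up the temp file on any failure
        raise
```

`os.replace` is an atomic rename on both POSIX and Windows, so a crash at any point leaves either the old state file or the new one intact.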
The Complete Chapter Processing Loop
```python
def process_all_chapters(chapters):
    on_startup(chapters)  # reset any RUNNING → PENDING

    for chapter in chapters:
        if chapter.state == ChapterState.DONE:
            print(f"[skip] Chapter {chapter.id}: already DONE")
            continue
        if chapter.state == ChapterState.NEEDS_REVIEW:
            print(f"[block] Chapter {chapter.id}: NEEDS_REVIEW — fix and re-run")
            continue

        # Process PENDING chapters
        transition(chapter, ChapterState.RUNNING)
        try:
            content = generate_chapter(chapter)  # LLM API call
            code = extract_code(content)

            if not validate_syntax(code):
                transition(chapter, ChapterState.NEEDS_REVIEW)
                continue
            if not validate_execution(code):
                transition(chapter, ChapterState.NEEDS_REVIEW)
                continue

            chapter.content = content
            transition(chapter, ChapterState.DONE)

        except RateLimitError:
            # Rate limit: put back to PENDING, wait, retry next run
            transition(chapter, ChapterState.PENDING)
            print(f"Rate limit hit on chapter {chapter.id}. Re-run to continue.")
            break
        except Exception as e:
            transition(chapter, ChapterState.NEEDS_REVIEW)
            print(f"Unexpected error on chapter {chapter.id}: {e}")
```
What I'd Do Differently
1. Finer-grained state per operation.
The current system treats chapter generation as atomic. If generation succeeds but validation fails, the chapter goes to NEEDS_REVIEW. A future improvement: track which validation gate failed (FAILED_SYNTAX vs FAILED_EXECUTION) so the error message is more specific.
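A sketch of what that split could look like — the two failure states are hypothetical, not in the shipped pipeline:

```python
from enum import Enum

class ChapterState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"
    FAILED_SYNTAX = "failed_syntax"        # validate_syntax gate rejected the code
    FAILED_EXECUTION = "failed_execution"  # code parsed but failed to run

# Both failure states would transition back to PENDING after a human fix,
# exactly like NEEDS_REVIEW does today — but the state name itself now says
# which gate to look at.
```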
2. Distributed state lock for parallel runs.
If you ever run two pipeline instances simultaneously (for parallel chapter generation), you need a lock on the state file. SQLite is better than JSON for this case — it handles concurrent writes safely.
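A sketch of the SQLite variant — table name and helper functions are mine, not from the pipeline. SQLite serializes writers, and the busy timeout makes a second instance wait for the lock instead of failing:

```python
import sqlite3

def open_state_db(path="pipeline_state.db"):
    conn = sqlite3.connect(path, timeout=30)  # wait up to 30s for another writer
    conn.execute("""CREATE TABLE IF NOT EXISTS chapter_state (
                        id    TEXT PRIMARY KEY,
                        state TEXT NOT NULL)""")
    conn.commit()
    return conn

def set_state(conn, chapter_id, state):
    # One transaction per transition; upsert so first write and update share a path
    with conn:
        conn.execute(
            "INSERT INTO chapter_state (id, state) VALUES (?, ?) "
            "ON CONFLICT(id) DO UPDATE SET state = excluded.state",
            (str(chapter_id), state),
        )
```

Unlike the JSON file, there's no read-modify-write of the whole state on every transition, so two instances can't clobber each other's updates.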
3. Automatic retry with backoff for rate limits.
Currently, a rate limit sends the chapter back to PENDING and halts. A better behavior: catch the Retry-After header, sleep, and retry the same chapter in the same run.
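A sketch of in-run retry with backoff. `RateLimitError` here is a stand-in for the API client's 429 exception, and the `retry_after` attribute is an assumption about where a parsed Retry-After header would land:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the API client's 429 error."""
    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds, from the Retry-After header

def call_with_backoff(fn, max_retries=5, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError as e:
            if attempt == max_retries - 1:
                raise  # out of retries; let the caller decide
            if e.retry_after is not None:
                delay = e.retry_after  # trust the server's hint
            else:
                # Exponential backoff with jitter to avoid thundering herd
                delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
```

In the processing loop, `generate_chapter(chapter)` would be wrapped as `call_with_backoff(lambda: generate_chapter(chapter))`, so the chapter finishes in the same run instead of being parked at PENDING.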
The complete pipeline with all recovery logic: germy5.gumroad.com/l/xhxkzz ($12.99, 30-day refund).