German Yamil
My Pipeline Crashed Mid-Generation 3 Times — Here's What I Learned About Crash Recovery



🎁 Free resource: AI Publishing Checklist — 7 steps to ship a technical ebook with Python (free, no email required) · Full pipeline + 10 scripts: germy5.gumroad.com/l/xhxkzz (pay what you want, min $9.99)

My laptop died mid-chapter on the third generation run.

Not a graceful shutdown. A hard power loss. The process was in RUNNING state on chapter 7 of 10 when the battery hit zero.

When I reopened the laptop: chapters 1–6 were in DONE state. Chapter 7 was in RUNNING. The pipeline resumed from chapter 7, treated the RUNNING state as an incomplete run, reset it to PENDING, and re-generated cleanly.

No manual intervention. No duplicate API calls for chapters 1–6. No data loss.

This is what a crash-safe state machine looks like in practice.


The full pipeline (including this state machine) is at germy5.gumroad.com/l/xhxkzz — $12.99, 30-day refund.


The Three Crashes

Crash 1: Hard Power Loss (Battery)

Chapter: 7 of 10
State at crash: RUNNING
Recovery: Chapter 7 reset to PENDING on next startup. Chapters 1–6 untouched (DONE). Re-generated chapter 7 from scratch.
API calls saved: 6 chapters × ~$0.05 each = $0.30

Small amount. But the principle matters — the system didn't re-do finished work.

Crash 2: API Rate Limit (429)

Chapter: 4 of 10
State at crash: RUNNING (generation was mid-stream when the 429 hit)
Initial behavior (before fix): The chapter was half-generated when the exception propagated. The state stayed RUNNING.

This exposed a bug: I wasn't resetting RUNNING to PENDING on startup. On restart, the pipeline skipped chapter 4 entirely (saw it as RUNNING = in progress) and moved to chapter 5.

Fix:

def on_startup(chapters):
    """
    Any chapter in RUNNING state on startup was interrupted.
    Reset it to PENDING so it gets re-generated.
    """
    for chapter in chapters:
        if chapter.state == ChapterState.RUNNING:
            print(f"[startup] Resetting chapter {chapter.id} from RUNNING → PENDING")
            chapter.state = ChapterState.PENDING
            save_state(chapter)

This became the canonical startup hook. Every run now calls on_startup() before processing begins.

Crash 3: Keyboard Interrupt (Ctrl+C)

Chapter: 2 of 10
State at crash: RUNNING
Recovery: Same as crash 1 after the startup fix. Chapter 2 reset to PENDING. Clean re-generation.

By default, a keyboard interrupt (SIGINT) surfaces in Python as a KeyboardInterrupt exception. I installed a handler that exits cleanly instead:

import signal, sys

def signal_handler(sig, frame):
    print("\n[interrupt] Caught SIGINT. State saved. Run again to resume.")
    sys.exit(0)

signal.signal(signal.SIGINT, signal_handler)

The sys.exit(0) triggers finally blocks and ensures the state file is written before the process exits.
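A quick demonstration of why that works: sys.exit(0) raises SystemExit, which unwinds the stack like any other exception, so finally blocks run before the interpreter shuts down.

```python
import sys

log = []

def demo():
    try:
        log.append("work")
        sys.exit(0)  # raises SystemExit; the stack unwinds normally
    finally:
        log.append("finally")  # runs before the process would exit

try:
    demo()
except SystemExit:
    pass

print(log)  # ['work', 'finally']
```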

The State Machine That Makes This Work

Four states. Strict transition rules. Disk write on every transition.

from enum import Enum
import json, os

class ChapterState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"
    NEEDS_REVIEW = "needs_review"

# Legal transitions
TRANSITIONS = {
    ChapterState.PENDING:       [ChapterState.RUNNING],
    ChapterState.RUNNING:       [ChapterState.DONE,
                                 ChapterState.NEEDS_REVIEW,
                                 ChapterState.PENDING],  # rate-limit requeue
    ChapterState.DONE:          [],  # terminal
    ChapterState.NEEDS_REVIEW:  [ChapterState.PENDING],  # after human fix
}

STATE_FILE = "pipeline_state.json"

def transition(chapter, new_state: ChapterState):
    if new_state not in TRANSITIONS[chapter.state]:
        raise ValueError(
            f"Invalid transition: {chapter.state} → {new_state} "
            f"for chapter {chapter.id}"
        )
    chapter.state = new_state
    _persist(chapter)  # disk write happens here, before the next line

def _persist(chapter):
    """Write state to disk immediately. If this fails, we want to know."""
    state_data = load_all_state()
    state_data[chapter.id] = chapter.state.value
    with open(STATE_FILE, "w") as f:
        json.dump(state_data, f, indent=2)

def load_all_state() -> dict:
    if not os.path.exists(STATE_FILE):
        return {}
    with open(STATE_FILE) as f:
        return json.load(f)

Why _persist() before continuing:

If you write state after the operation completes, a crash between "operation done" and "state written" leaves you with work that happened but isn't recorded. On restart, you repeat the work.

If you write state before the operation, a crash between "state written" and "operation done" leaves you with a chapter in RUNNING state that will reset to PENDING on startup — and re-run cleanly.

The failure mode of "write before" is always recoverable. The failure mode of "write after" is sometimes not.
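One gap remains in _persist() as shown: a crash in the middle of json.dump can leave a truncated state file that fails to parse on the next startup. A standard mitigation, sketched here as my own addition rather than something in the pipeline above, is write-to-temp-then-rename:

```python
import json
import os
import tempfile

def persist_atomic(state_data: dict, path: str) -> None:
    """Write the full state dict to a temp file, then atomically rename it
    over the real file. os.replace() is atomic on POSIX, so a reader sees
    either the old file or the new one, never a half-written JSON document."""
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state_data, f, indent=2)
            f.flush()
            os.fsync(f.fileno())  # force the bytes to disk before the swap
        os.replace(tmp_path, path)  # atomic swap
    except BaseException:
        os.unlink(tmp_path)  # don't leave orphaned temp files behind
        raise
```

Dropping this in as the body of _persist() keeps the "write before continuing" ordering while closing the torn-write window.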

The Complete Chapter Processing Loop

def process_all_chapters(chapters):
    on_startup(chapters)  # reset any RUNNING → PENDING

    for chapter in chapters:
        if chapter.state == ChapterState.DONE:
            print(f"[skip] Chapter {chapter.id}: already DONE")
            continue

        if chapter.state == ChapterState.NEEDS_REVIEW:
            print(f"[block] Chapter {chapter.id}: NEEDS_REVIEW — fix and re-run")
            continue

        # Process PENDING chapters
        transition(chapter, ChapterState.RUNNING)

        try:
            content = generate_chapter(chapter)  # LLM API call
            code = extract_code(content)

            if not validate_syntax(code):
                transition(chapter, ChapterState.NEEDS_REVIEW)
                continue

            if not validate_execution(code):
                transition(chapter, ChapterState.NEEDS_REVIEW)
                continue

            chapter.content = content
            transition(chapter, ChapterState.DONE)

        except RateLimitError:
            # Rate limit: put back to PENDING, wait, retry next run
            transition(chapter, ChapterState.PENDING)
            print(f"Rate limit hit on chapter {chapter.id}. Re-run to continue.")
            break

        except Exception as e:
            transition(chapter, ChapterState.NEEDS_REVIEW)
            print(f"Unexpected error on chapter {chapter.id}: {e}")

What I'd Do Differently

1. Finer-grained state per operation.

The current system treats chapter generation as atomic. If generation succeeds but validation fails, the chapter goes to NEEDS_REVIEW. A future improvement: track which validation gate failed (FAILED_SYNTAX vs FAILED_EXECUTION) so the error message is more specific.
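A sketch of that split, using the FAILED_SYNTAX / FAILED_EXECUTION names from above (this is the proposed future shape, not the shipped code):

```python
from enum import Enum

class ChapterState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"
    # NEEDS_REVIEW split by which validation gate failed:
    FAILED_SYNTAX = "failed_syntax"        # code didn't parse
    FAILED_EXECUTION = "failed_execution"  # code parsed but crashed when run

# Both failure states are review states; both return to PENDING after a fix.
TRANSITIONS = {
    ChapterState.PENDING:          [ChapterState.RUNNING],
    ChapterState.RUNNING:          [ChapterState.DONE,
                                    ChapterState.FAILED_SYNTAX,
                                    ChapterState.FAILED_EXECUTION,
                                    ChapterState.PENDING],  # rate-limit requeue
    ChapterState.DONE:             [],
    ChapterState.FAILED_SYNTAX:    [ChapterState.PENDING],
    ChapterState.FAILED_EXECUTION: [ChapterState.PENDING],
}
```

The error message then names the failed gate for free, because it's encoded in the state itself.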

2. Distributed state lock for parallel runs.

If you ever run two pipeline instances simultaneously (for parallel chapter generation), you need a lock on the state file. SQLite is better than JSON for this case — it handles concurrent writes safely.
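A minimal SQLite-backed version of the state store might look like this (schema and function names are mine, not from the pipeline):

```python
import sqlite3

def open_state_db(path: str) -> sqlite3.Connection:
    """SQLite serializes writers with file locking, so two pipeline
    instances can't corrupt state the way concurrent JSON writes can."""
    conn = sqlite3.connect(path, timeout=30)  # wait up to 30s for the lock
    conn.execute("PRAGMA journal_mode=WAL")   # readers don't block the writer
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chapter_state ("
        " chapter_id TEXT PRIMARY KEY,"
        " state TEXT NOT NULL)"
    )
    return conn

def set_state(conn: sqlite3.Connection, chapter_id: str, state: str) -> None:
    with conn:  # implicit transaction: commit on success, rollback on error
        conn.execute(
            "INSERT INTO chapter_state (chapter_id, state) VALUES (?, ?)"
            " ON CONFLICT(chapter_id) DO UPDATE SET state=excluded.state",
            (chapter_id, state),
        )

def get_state(conn: sqlite3.Connection, chapter_id: str):
    row = conn.execute(
        "SELECT state FROM chapter_state WHERE chapter_id=?", (chapter_id,)
    ).fetchone()
    return row[0] if row else None
```

Each set_state() is a committed transaction, so it keeps the same "disk write on every transition" guarantee as the JSON version.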

3. Automatic retry with backoff for rate limits.

Currently, a rate limit sends the chapter back to PENDING and halts. A better behavior: catch the Retry-After header, sleep, and retry the same chapter in the same run.
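One way to sketch that in-run retry (the RateLimitError class here is a stand-in; substitute your API client's real 429 exception and however it exposes the parsed Retry-After value):

```python
import time

class RateLimitError(Exception):
    """Stand-in for your API client's real rate-limit exception."""
    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds, parsed from Retry-After

def generate_with_backoff(generate, chapter, max_retries=5):
    """Retry a rate-limited call in the same run instead of requeueing."""
    for attempt in range(max_retries):
        try:
            return generate(chapter)
        except RateLimitError as e:
            # Honor the server's Retry-After when given, else back off exponentially
            delay = e.retry_after if e.retry_after is not None else 2 ** attempt
            print(f"[rate-limit] attempt {attempt + 1}, sleeping {delay}s")
            time.sleep(delay)
    raise RuntimeError(f"still rate limited after {max_retries} attempts")
```

Wrapping the generate_chapter() call in this keeps the chapter in RUNNING for the whole retry sequence; only exhausting the retries would bounce it back to PENDING.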


The complete pipeline with all recovery logic: germy5.gumroad.com/l/xhxkzz ($12.99, 30-day refund).


