German Yamil

Posted on • Originally published at germy5.gumroad.com

I Built a Python Pipeline That Writes, Validates, Translates, and Publishes Ebooks — $20/Month Total Cost

I spent a few weeks building a Python pipeline that does something I couldn't find anywhere else: it writes a technical ebook, validates every code snippet in it, translates it to English, assembles the EPUB, and generates marketing assets — all for $20/month in API costs.

The book it produced is its own proof of concept. Every chapter in it was processed by the exact pipeline it describes.

Here's how it works.

The Architecture: A State Machine

The core problem with LLM-driven pipelines is crash recovery. If your process dies at chapter 7 of 10, you don't want to restart from zero.

The solution is a state machine with atomic checkpoints:

```python
from enum import Enum

class ChapterStatus(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    DONE = "DONE"
    NEEDS_REVIEW = "NEEDS_REVIEW"
```

Every chapter lives in checkpoint.json with one of these four states. Before processing, the system calls:

```python
import json

def cp_recover_running_orphans(checkpoint_path: str) -> int:
    """
    Resets any RUNNING chapters back to PENDING.
    Called at startup to recover from crashes.
    """
    with open(checkpoint_path, "r+") as f:
        data = json.load(f)
        recovered = 0
        for chapter in data["chapters"]:
            if chapter["status"] == "RUNNING":
                chapter["status"] = "PENDING"
                recovered += 1
        if recovered:
            f.seek(0)
            json.dump(data, f, indent=2)
            f.truncate()
    return recovered
```

If the process crashes mid-chapter, the next run resets it to PENDING and retries. No manual intervention. No lost work.
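The write side isn't shown above. A minimal sketch of a crash-safe status update, assuming a `checkpoint.json` shaped like `{"chapters": [{"id": ..., "status": ...}]}` (the `id` field and the tmp-file-plus-`os.replace` pattern are my assumptions, not necessarily the author's implementation):

```python
import json
import os

def cp_set_status(checkpoint_path: str, chapter_id: str, status: str) -> None:
    # Read the whole checkpoint, update one chapter, then swap the file in
    # atomically so a crash mid-write never leaves half-written JSON behind.
    with open(checkpoint_path) as f:
        data = json.load(f)
    for chapter in data["chapters"]:
        if chapter["id"] == chapter_id:
            chapter["status"] = status
    tmp_path = checkpoint_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(data, f, indent=2)
    os.replace(tmp_path, checkpoint_path)  # atomic rename on POSIX and Windows
```

The atomic rename matters: a reader of `checkpoint.json` always sees either the old state or the new one, never a truncated file.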

The Validation Suite: Why Ebooks Fail

Most technical ebooks have code that doesn't run. The author wrote it and it looked right, but it was never executed.

The pipeline runs every script through two validation layers before a chapter can advance to DONE:

Layer 1: AST Parse

```python
import ast

def validate_syntax(code: str) -> tuple[bool, str]:
    try:
        ast.parse(code)
        return True, "OK"
    except SyntaxError as e:
        return False, f"SyntaxError at line {e.lineno}: {e.msg}"
```

This catches syntax errors cheaply, without executing anything.

Layer 2: Runtime Check in Isolation

```python
import subprocess
import tempfile
import os

def validate_runtime(code: str, timeout: int = 10) -> tuple[bool, str]:
    with tempfile.TemporaryDirectory() as tmpdir:
        script_path = os.path.join(tmpdir, "script.py")
        with open(script_path, "w") as f:
            f.write(code)

        try:
            result = subprocess.run(
                ["python3", script_path],
                capture_output=True,
                text=True,
                timeout=timeout,
                cwd=tmpdir
            )
        except subprocess.TimeoutExpired:
            return False, f"Timed out after {timeout}s"

        if result.returncode == 0:
            return True, "OK"
        return False, result.stderr[:500]
```

The script is written to a temp file and executed in an isolated subprocess. If it fails for any reason — import error, runtime exception, timeout — the chapter goes to NEEDS_REVIEW.

Critical security note: Never use exec() on f-string interpolated code in your main process. The isolation is the point.
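Putting the two layers together, the gate that decides a chapter's fate might look like this. A sketch only: the function name and exact wiring are my assumptions, though the DONE/NEEDS_REVIEW mapping follows the states described above (I use `sys.executable` rather than a hard-coded `python3` for portability):

```python
import ast
import os
import subprocess
import sys
import tempfile

def check_chapter_code(code: str, timeout: int = 10) -> str:
    # Layer 1: cheap syntax check, no execution
    try:
        ast.parse(code)
    except SyntaxError:
        return "NEEDS_REVIEW"

    # Layer 2: run the script in an isolated subprocess
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "script.py")
        with open(path, "w") as f:
            f.write(code)
        try:
            result = subprocess.run(
                [sys.executable, path],
                capture_output=True, text=True,
                timeout=timeout, cwd=tmpdir,
            )
        except subprocess.TimeoutExpired:
            return "NEEDS_REVIEW"

    return "DONE" if result.returncode == 0 else "NEEDS_REVIEW"
```

Any failure path, syntax, runtime, or timeout, lands in NEEDS_REVIEW; only a clean exit advances the chapter.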

The Translation Pipeline: Honest QA

The book was written in Spanish. The pipeline translates it to English with a semantic QA layer that doesn't lie about what it's doing.

The key challenge: code blocks can't be translated. Python is Python in any language.

The fence detector splits each chapter into segments:

```python
import re

FENCE_PATTERN = re.compile(r"(```[\w]*\n.*?```)", re.DOTALL)

def split_segments(text: str) -> list[dict]:
    parts = FENCE_PATTERN.split(text)
    segments = []
    for part in parts:
        if FENCE_PATTERN.match(part):
            segments.append({"type": "code", "content": part})
        else:
            segments.append({"type": "prose", "content": part})
    return segments
```

Only prose segments go to the translation API. Code blocks are preserved exactly. After translation, the segments are reassembled in order.
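The reassembly step isn't shown; it falls out of the segment structure. A sketch, reusing `split_segments` from above, with `translate` as a stand-in callable for the real translation API (the `reassemble` name is mine):

```python
import re

FENCE_PATTERN = re.compile(r"(```[\w]*\n.*?```)", re.DOTALL)

def split_segments(text: str) -> list[dict]:
    parts = FENCE_PATTERN.split(text)
    return [
        {"type": "code" if FENCE_PATTERN.match(p) else "prose", "content": p}
        for p in parts
    ]

def reassemble(segments: list[dict], translate) -> str:
    # Code segments pass through byte-for-byte; only prose is translated.
    out = []
    for seg in segments:
        if seg["type"] == "code":
            out.append(seg["content"])
        else:
            out.append(translate(seg["content"]))
    return "".join(out)
```

Because `split_segments` keeps the segments in document order and `reassemble` joins them without reordering, the output is the original chapter with prose swapped out and code untouched.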

The QA layer is honest about its mode:

```python
import os

def qa_translation(original: str, translated: str) -> dict:
    if os.getenv("OPENAI_API_KEY"):
        score = cosine_similarity_via_embeddings(original, translated)
        mode = "SEMANTIC"
    else:
        # Word ratio: honest fallback
        orig_words = len(original.split())
        trans_words = len(translated.split())
        score = min(orig_words, trans_words) / max(orig_words, trans_words)
        mode = "WORD_RATIO_ONLY"

    return {
        "score": score,
        "mode": mode,
        "pass": score >= 0.75
    }
```

If you don't have an OpenAI key, it tells you it's using word ratio. No TF-IDF disguised as semantic similarity.
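The embedding call itself isn't shown above, but the cosine step is standard. A self-contained sketch in pure Python, assuming the embeddings arrive as two equal-length float vectors (the function name here is mine, not the author's `cosine_similarity_via_embeddings`):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|).
    # Close to 1.0 means the texts point the same way in embedding space.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

In practice you would fetch the two embeddings from the API and feed them straight into this function; the 0.75 pass threshold above then applies to its return value.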

The Build System: Safe EPUB Assembly

The Pandoc build has three guardrails:

1. Merge gate — only chapters with status DONE are included:

```python
import json

def merge_chapters(checkpoint_path: str, output_path: str):
    with open(checkpoint_path) as f:
        data = json.load(f)

    done_chapters = [
        ch for ch in data["chapters"]
        if ch["status"] == "DONE"
    ]

    if len(done_chapters) != len(data["chapters"]):
        raise ValueError(
            f"Only {len(done_chapters)}/{len(data['chapters'])} "
            "chapters are DONE. Aborting merge."
        )

    # ... merge and write to output_path
```

2. YAML guardrail — rejects placeholder metadata:

```python
PLACEHOLDER_PATTERNS = ["[TBD]", "[INSERT", "TODO", "XXX"]

def validate_yaml_metadata(metadata: dict):
    for key, value in metadata.items():
        for pattern in PLACEHOLDER_PATTERNS:
            if pattern in str(value):
                raise ValueError(f"Placeholder found in {key}: {value}")
```
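Wired up, the guardrail fails fast on draft metadata. A quick usage sketch (the metadata dicts here are hypothetical examples, not the book's actual metadata):

```python
PLACEHOLDER_PATTERNS = ["[TBD]", "[INSERT", "TODO", "XXX"]

def validate_yaml_metadata(metadata: dict):
    for key, value in metadata.items():
        for pattern in PLACEHOLDER_PATTERNS:
            if pattern in str(value):
                raise ValueError(f"Placeholder found in {key}: {value}")

# A finished record passes silently:
validate_yaml_metadata({"title": "My Book", "author": "Jane Doe"})

# A draft record aborts the build before Pandoc ever runs:
try:
    validate_yaml_metadata({"title": "My Book", "subtitle": "[TBD]"})
except ValueError as e:
    print(e)
```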

3. Pandoc invocation — epub3 with injected CSS:

```bash
pandoc manuscript.md \
  --epub-metadata=metadata.yaml \
  --epub-cover-image=cover.jpg \
  --css=style.css \
  --to=epub3 \
  -o book.epub
```

The Financial Model

```
Fixed cost:  $20/month (Claude Code Pro)
Gumroad:     $19.99 → $17.99 net (Gumroad takes ~10%)
KDP:         $9.99  → $6.99 net  (Amazon takes 30%)

Break-even:  ceil($20 / $17.99) = 2 Gumroad sales/month

Conservative year 1 (3 Gumroad + 5 KDP sales/month):
  Gumroad: 3 × $17.99 × 12 = $647.64
  KDP:     5 × $6.99  × 12 = $419.40
  Costs:   $20 × 12        = $240.00
  Net:     $647.64 + $419.40 − $240.00 = $827.04/year

Conservative catalog model (3 Gumroad/month): $3,790 net/year
```

The pipeline scales linearly. Book 2 takes the same 4-6 hours of active work as Book 1. At 20 books × 3 Gumroad sales/month = $1,079/month net.
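The break-even line above is a one-liner worth checking. A sketch using the article's own numbers (the function name is mine):

```python
import math

def break_even_sales(monthly_cost: float, net_per_sale: float) -> int:
    # Smallest whole number of sales that covers the fixed monthly cost.
    return math.ceil(monthly_cost / net_per_sale)

print(break_even_sales(20.00, 17.99))  # 2 Gumroad sales/month
print(break_even_sales(20.00, 6.99))   # 3 KDP sales/month
```

Two Gumroad sales, or three KDP sales, covers the $20/month fixed cost; everything after that is margin.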

What the Pipeline Produces

  • 10 chapters, ~22,000 words
  • 10 Python scripts — all AST-validated and runtime-tested
  • EPUB in English and Spanish
  • Cover generated with Imagen 4
  • Marketing assets — audio overview, pitch deck, social posts

The complete pipeline, all scripts, and both EPUBs are at:
germy5.gumroad.com/l/xhxkzz — $19.99, 30-day refund.

If you have questions about any specific part of the architecture, ask in the comments.
