I spent a few weeks building a Python pipeline that does something I couldn't find anywhere else: it writes a technical ebook, validates every code snippet in it, translates it to English, assembles the EPUB, and generates marketing assets — all for $20/month in API costs.
The book it produced is its own proof of concept. Every chapter in it was processed by the exact pipeline it describes.
Here's how it works.
The Architecture: A State Machine
The core problem with LLM-driven pipelines is crash recovery. If your process dies at chapter 7 of 10, you don't want to restart from zero.
The solution is a state machine with atomic checkpoints:
```python
from enum import Enum

class ChapterStatus(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    DONE = "DONE"
    NEEDS_REVIEW = "NEEDS_REVIEW"
```
Every chapter lives in checkpoint.json with one of these four states. Before processing, the system calls:
```python
import json

def cp_recover_running_orphans(checkpoint_path: str) -> int:
    """
    Resets any RUNNING chapters back to PENDING.
    Called at startup to recover from crashes.
    """
    with open(checkpoint_path, "r+") as f:
        data = json.load(f)
        recovered = 0
        for chapter in data["chapters"]:
            if chapter["status"] == "RUNNING":
                chapter["status"] = "PENDING"
                recovered += 1
        if recovered:
            f.seek(0)
            json.dump(data, f, indent=2)
            f.truncate()
    return recovered
```
If the process crashes mid-chapter, the next run resets it to PENDING and retries. No manual intervention. No lost work.
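The other half of that contract is that status transitions themselves never leave the checkpoint half-written. A minimal sketch of how a single transition might be made atomic, assuming the checkpoint layout above; `cp_set_status` and the write-temp-then-rename pattern are illustrative, not lifted from the pipeline:

```python
import json
import os
import tempfile

def cp_set_status(checkpoint_path: str, chapter_id: str, status: str) -> None:
    # Illustrative helper: update one chapter's status, then swap the whole
    # file in with os.replace(), which is atomic on POSIX and Windows.
    with open(checkpoint_path) as f:
        data = json.load(f)
    for chapter in data["chapters"]:
        if chapter["id"] == chapter_id:
            chapter["status"] = status
    # Write to a temp file in the same directory so the rename stays
    # on one filesystem, then replace the original in a single step.
    dir_name = os.path.dirname(os.path.abspath(checkpoint_path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(data, f, indent=2)
    os.replace(tmp_path, checkpoint_path)
```

A crash before the `os.replace()` leaves the old checkpoint untouched; a crash after it leaves the new one fully written. There is no in-between state to recover from.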
The Validation Suite: Why Ebooks Fail
Most technical ebooks have code that doesn't run. The author wrote it, it looked right, but it was never executed.
The pipeline runs every script through two validation layers before a chapter can advance to DONE:
Layer 1: AST Parse
```python
import ast

def validate_syntax(code: str) -> tuple[bool, str]:
    try:
        ast.parse(code)
        return True, "OK"
    except SyntaxError as e:
        return False, f"SyntaxError at line {e.lineno}: {e.msg}"
```
This catches syntax errors cheaply, without executing anything.
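It's worth being explicit about what this layer deliberately does not catch: code that parses but fails at runtime. A quick standalone check (the helper is repeated from above so the snippet runs on its own):

```python
import ast

def validate_syntax(code: str) -> tuple[bool, str]:
    try:
        ast.parse(code)
        return True, "OK"
    except SyntaxError as e:
        return False, f"SyntaxError at line {e.lineno}: {e.msg}"

# A genuine syntax error is caught...
ok, msg = validate_syntax("def f(:\n    pass")
print(ok)  # False

# ...but an undefined name parses cleanly. Only a runtime check flags it.
ok, msg = validate_syntax("print(undefined_name)")
print(ok)  # True
```

That gap is exactly why Layer 2 exists.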
Layer 2: Runtime Check in Isolation
```python
import subprocess
import tempfile
import os

def validate_runtime(code: str, timeout: int = 10) -> tuple[bool, str]:
    with tempfile.TemporaryDirectory() as tmpdir:
        script_path = os.path.join(tmpdir, "script.py")
        with open(script_path, "w") as f:
            f.write(code)
        try:
            result = subprocess.run(
                ["python3", script_path],
                capture_output=True,
                text=True,
                timeout=timeout,
                cwd=tmpdir,
            )
        except subprocess.TimeoutExpired:
            return False, f"Timed out after {timeout}s"
        if result.returncode == 0:
            return True, "OK"
        return False, result.stderr[:500]
```
The script is written to a temp file and executed in an isolated subprocess. If it fails for any reason — import error, runtime exception, timeout — the chapter goes to NEEDS_REVIEW.
Critical security note: Never use exec() on f-string interpolated code in your main process. The isolation is the point.
The Translation Pipeline: Honest QA
The book was written in Spanish. The pipeline translates it to English with a semantic QA layer that doesn't lie about what it's doing.
The key challenge: code blocks can't be translated. Python is Python in any language.
The fence detector splits each chapter into segments:
```python
import re

# Matches fenced code blocks so they can be carved out before translation.
FENCE_PATTERN = re.compile(r"(```[\w]*\n.*?```)", re.DOTALL)

def split_segments(text: str) -> list[dict]:
    parts = FENCE_PATTERN.split(text)
    segments = []
    for part in parts:
        if not part:
            continue  # re.split() emits empty strings around adjacent matches
        if FENCE_PATTERN.match(part):
            segments.append({"type": "code", "content": part})
        else:
            segments.append({"type": "prose", "content": part})
    return segments
```
Only prose segments go to the translation API. Code blocks are preserved exactly. After translation, the segments are reassembled in order.
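Reassembly is then just an ordered join over the segment list. A sketch under the assumption that `translate()` is whatever API call you use (an identity stub here), with the fence splitter repeated in compact form so the snippet runs standalone:

```python
import re

FENCE_PATTERN = re.compile(r"(```[\w]*\n.*?```)", re.DOTALL)

def split_segments(text: str) -> list[dict]:
    parts = [p for p in FENCE_PATTERN.split(text) if p]
    return [{"type": "code" if FENCE_PATTERN.match(p) else "prose",
             "content": p} for p in parts]

def translate(text: str) -> str:
    # Stand-in for the real translation API call.
    return text

def translate_chapter(text: str) -> str:
    # Translate prose, pass code through byte-for-byte, rejoin in order.
    return "".join(
        translate(seg["content"]) if seg["type"] == "prose" else seg["content"]
        for seg in split_segments(text)
    )
```

With an identity translator, the round trip must reproduce the chapter exactly; that property makes a cheap regression test for the splitter itself.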
The QA layer is honest about its mode:
```python
import os

def qa_translation(original: str, translated: str) -> dict:
    if os.getenv("OPENAI_API_KEY"):
        score = cosine_similarity_via_embeddings(original, translated)
        mode = "SEMANTIC"
    else:
        # Word ratio: honest fallback
        orig_words = len(original.split())
        trans_words = len(translated.split())
        score = min(orig_words, trans_words) / max(orig_words, trans_words)
        mode = "WORD_RATIO_ONLY"
    return {
        "score": score,
        "mode": mode,
        "pass": score >= 0.75,
    }
```
If you don't have an OpenAI key, it tells you it's using word ratio. No TF-IDF disguised as semantic similarity.
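The semantic branch ultimately reduces to cosine similarity between two embedding vectors. A minimal implementation of that final step, with toy vectors standing in for real embeddings (the embedding call itself is provider-specific and omitted):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product divided by the product of vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Identical directions score 1.0, orthogonal ones 0.0, which is why a fixed threshold like 0.75 is meaningful across chapters.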
The Build System: Safe EPUB Assembly
The Pandoc build has three guardrails:
1. Merge gate — only chapters with status DONE are included:
```python
import json

def merge_chapters(checkpoint_path: str, output_path: str):
    with open(checkpoint_path) as f:
        data = json.load(f)
    done_chapters = [
        ch for ch in data["chapters"]
        if ch["status"] == "DONE"
    ]
    if len(done_chapters) != len(data["chapters"]):
        raise ValueError(
            f"Not all chapters are DONE "
            f"({len(done_chapters)}/{len(data['chapters'])}). Aborting merge."
        )
    # ... merge and write to output_path
```
2. YAML guardrail — rejects placeholder metadata:
```python
PLACEHOLDER_PATTERNS = ["[TBD]", "[INSERT", "TODO", "XXX"]

def validate_yaml_metadata(metadata: dict):
    for key, value in metadata.items():
        for pattern in PLACEHOLDER_PATTERNS:
            if pattern in str(value):
                raise ValueError(f"Placeholder found in {key}: {value}")
```
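For instance, metadata that slipped through a draft stage fails loudly rather than shipping (the guardrail is repeated so the snippet is standalone):

```python
PLACEHOLDER_PATTERNS = ["[TBD]", "[INSERT", "TODO", "XXX"]

def validate_yaml_metadata(metadata: dict):
    for key, value in metadata.items():
        for pattern in PLACEHOLDER_PATTERNS:
            if pattern in str(value):
                raise ValueError(f"Placeholder found in {key}: {value}")

try:
    validate_yaml_metadata({"title": "My Book", "subtitle": "[TBD]"})
except ValueError as e:
    print(e)  # Placeholder found in subtitle: [TBD]
```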
3. Pandoc invocation — epub3 with injected CSS:
```bash
pandoc manuscript.md \
  --epub-metadata=metadata.yaml \
  --epub-cover-image=cover.jpg \
  --css=style.css \
  --to=epub3 \
  -o book.epub
```
The Financial Model
Fixed cost: $20/month (Claude Code Pro)
Gumroad: $19.99 → $17.99 net (Gumroad takes ~10%)
KDP: $9.99 → $6.99 net (Amazon takes 30%)
Break-even: ceil($20 / $17.99) = 2 Gumroad sales/month
Conservative year 1 (3 Gumroad + 5 KDP sales/month):
Gumroad: 3 × $17.99 × 12 = $647.64
KDP: 5 × $6.99 × 12 = $419.40
Costs: $20 × 12 = $240.00
Net: $647.64 + $419.40 - $240.00 = $827.04/year
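The arithmetic is simple enough to sanity-check in a few lines (prices and net figures are the ones quoted above):

```python
GUMROAD_NET = 17.99   # $19.99 list, ~10% Gumroad fee
KDP_NET = 6.99        # $9.99 list, 30% Amazon cut
MONTHLY_COST = 20.00  # Claude Code Pro

# Conservative year 1: 3 Gumroad + 5 KDP sales per month.
gumroad_year = 3 * GUMROAD_NET * 12
kdp_year = 5 * KDP_NET * 12
net = gumroad_year + kdp_year - MONTHLY_COST * 12
print(round(net, 2))  # 827.04
```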
Conservative catalog model (3 Gumroad/month): $3,790 net/year
The pipeline scales linearly: Book 2 takes the same 4-6 hours of active work as Book 1. At 20 books averaging 3 Gumroad sales each per month, that's 20 × 3 × $17.99 ≈ $1,079/month in net revenue.
What the Pipeline Produces
- 10 chapters, ~22,000 words
- 10 Python scripts — all AST-validated and runtime-tested
- EPUB in English and Spanish
- Cover generated with Imagen 4
- Marketing assets — audio overview, pitch deck, social posts
The complete pipeline, all scripts, and both EPUBs are at:
germy5.gumroad.com/l/xhxkzz — $19.99, 30-day refund.
If you have questions about any specific part of the architecture, ask in the comments.