Pandoc is excellent at producing valid EPUB3 from Markdown. Where it falls short is telling you that your metadata still says "TBD" or that three chapters are not finished yet. That is your job, and the right place to do it is before Pandoc ever runs.
This article covers the guardrail layer I put between raw source files and the final EPUB build.
The Problem with Running Pandoc Too Early
An EPUB with [TBD] in the subtitle field, or with a chapter that ends mid-sentence because a writer placeholder was never filled in, is not a draft — it is a broken product. The fix is not discipline; it is a gate that fails loudly when the source is not ready.
I enforce two gates before Pandoc runs:
- Metadata validation: reject any YAML metadata containing placeholder strings.
- Completion gate: read a checkpoint.json file where each chapter has a status, and refuse to build if any chapter is not DONE.
Gate 1: Metadata Validator
    import yaml
    import sys
    from pathlib import Path

    PLACEHOLDER_PATTERNS = ["[TBD]", "TODO", "XXX", "FIXME", "PLACEHOLDER"]


    def load_metadata(metadata_path: Path) -> dict:
        with open(metadata_path, "r", encoding="utf-8") as f:
            # safe_load returns None for an empty file; fall back to an empty dict
            return yaml.safe_load(f) or {}


    def validate_metadata(metadata: dict, source_path: Path) -> list[str]:
        """
        Walk every string value in the metadata dict and flag placeholders.
        Returns a list of error strings; empty list means clean.
        """
        errors = []

        def check_value(key_path: str, value):
            # Recurse through nested dicts and lists; only strings are inspected.
            if isinstance(value, str):
                for pattern in PLACEHOLDER_PATTERNS:
                    if pattern in value:
                        errors.append(
                            f" [{source_path}] '{key_path}' contains '{pattern}': {value!r}"
                        )
            elif isinstance(value, dict):
                for k, v in value.items():
                    check_value(f"{key_path}.{k}", v)
            elif isinstance(value, list):
                for i, item in enumerate(value):
                    check_value(f"{key_path}[{i}]", item)

        for key, val in metadata.items():
            check_value(key, val)
        return errors


    def run_metadata_gate(metadata_path: Path) -> None:
        metadata = load_metadata(metadata_path)
        errors = validate_metadata(metadata, metadata_path)
        if errors:
            print("[FAIL] Metadata contains placeholders:")
            for e in errors:
                print(e)
            sys.exit(1)
        print(f"[OK] Metadata clean: {metadata_path}")
A typical metadata.yaml looks like:
    title: "Building Production APIs with FastAPI"
    subtitle: "A Practical Guide for Backend Engineers"
    author: "Yamil Salinas"
    language: en
    rights: "© 2026 Yamil Salinas. All rights reserved."
    publisher: "Self-published"
If someone adds subtitle: "[TBD]" and forgets to come back to it, the build stops immediately with a clear message.
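Plain substring matching is case-sensitive and ignores word boundaries, so a lowercase `todo` slips through while a title such as "MASTODON for Publishers" is rejected because it happens to contain `TODO`. A minimal sketch of a stricter matcher follows. The `find_placeholders` helper and its regular expression are illustrative assumptions, not part of the validator above.

```python
import re

# Hypothetical stricter matcher: whole words only, any letter case.
PLACEHOLDER_RE = re.compile(
    r"\[TBD\]|\b(?:TODO|XXX|FIXME|PLACEHOLDER)\b",
    re.IGNORECASE,
)


def find_placeholders(value: str) -> list[str]:
    """Return every placeholder token found in value, in order of appearance."""
    return [m.group(0) for m in PLACEHOLDER_RE.finditer(value)]


samples = [
    "[tbd]",                     # lowercase placeholder
    "todo: write the subtitle",  # lowercase TODO
    "MASTODON for Publishers",   # contains "TODO" only as a substring
    "A Practical Guide",         # clean
]
for text in samples:
    print(f"{text!r}: {find_placeholders(text)}")
```

The trade-off is that case-insensitive matching would also flag a legitimate title such as "The Todo App"; dropping `re.IGNORECASE` keeps the word-boundary benefit without that side effect.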
Gate 2: Chapter Completion Check
The checkpoint.json file is maintained by the writing pipeline — each chapter entry gets updated to DONE when it passes review. The build gate reads this file and refuses to proceed if anything is still in progress.
    import json
    import sys  # needed for sys.exit below
    from pathlib import Path


    def load_checkpoint(checkpoint_path: Path) -> dict:
        with open(checkpoint_path, "r", encoding="utf-8") as f:
            return json.load(f)


    def run_completion_gate(checkpoint_path: Path) -> None:
        """
        Expects checkpoint.json of the form:
        {
          "chapters": {
            "01_introduction.md": "DONE",
            "02_setup.md": "DONE",
            "03_routing.md": "IN_PROGRESS"
          }
        }
        Exits with code 1 if any chapter is not DONE.
        """
        checkpoint = load_checkpoint(checkpoint_path)
        chapters = checkpoint.get("chapters", {})
        # Anything other than the exact string "DONE" blocks the build.
        not_done = {k: v for k, v in chapters.items() if v != "DONE"}
        if not_done:
            print("[FAIL] Completion gate: the following chapters are not DONE:")
            for chapter, status in not_done.items():
                print(f" {chapter}: {status}")
            sys.exit(1)
        print(f"[OK] All {len(chapters)} chapter(s) are DONE.")
This keeps the checkpoint file as the single source of truth for build readiness. Writers update it; the build script reads it.
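One gap worth noting: the gate only inspects chapters listed in checkpoint.json, so a Markdown file that exists on disk but was never registered would still reach the build when chapters are collected with a glob. A minimal sketch of a cross-check follows. `diff_checkpoint_against_disk` is a hypothetical helper, and the file names in the demo are made up.

```python
import tempfile
from pathlib import Path


def diff_checkpoint_against_disk(
    chapter_dir: Path, checkpoint: dict
) -> tuple[list[str], list[str]]:
    """
    Compare *.md files on disk with the keys under "chapters".
    Returns (untracked, missing): files on disk that the checkpoint
    does not list, and checkpoint entries with no file on disk.
    """
    on_disk = {p.name for p in chapter_dir.glob("*.md")}
    tracked = set(checkpoint.get("chapters", {}))
    return sorted(on_disk - tracked), sorted(tracked - on_disk)


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    chapter_dir = Path(tmp)
    (chapter_dir / "01_introduction.md").write_text("# Intro\n", encoding="utf-8")
    (chapter_dir / "04_appendix.md").write_text("# Appendix\n", encoding="utf-8")
    checkpoint = {"chapters": {"01_introduction.md": "DONE", "02_setup.md": "DONE"}}
    untracked, missing = diff_checkpoint_against_disk(chapter_dir, checkpoint)

print("untracked:", untracked)
print("missing:", missing)
```

Treating either non-empty list as a gate failure would keep the checkpoint and the chapter directory from drifting apart.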
The Pandoc Build Step
Once both gates pass, Pandoc runs as a subprocess. I always pin the output format to epub3 and pass a CSS file for consistent styling:
    import subprocess
    import sys  # needed for sys.exit below
    from pathlib import Path


    def build_epub(
        chapter_files: list[Path],
        metadata_path: Path,
        cover_image: Path,
        css_path: Path,
        output_path: Path,
    ) -> None:
        """
        Invoke Pandoc to assemble an EPUB3 from the validated chapter files.
        All paths must be absolute to avoid CWD surprises.
        """
        # Make sure the output directory exists before Pandoc writes to it.
        output_path.parent.mkdir(parents=True, exist_ok=True)
        cmd = [
            "pandoc",
            "--from", "markdown+smart",
            "--to", "epub3",
            "--metadata-file", str(metadata_path),
            "--css", str(css_path),
            "--epub-cover-image", str(cover_image),
            "--toc",
            "--toc-depth=2",
            "--output", str(output_path),
        ] + [str(f) for f in chapter_files]
        print(f"[BUILD] Running Pandoc → {output_path.name}")
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print("[FAIL] Pandoc error:")
            print(result.stderr)
            sys.exit(1)
        print("[OK] Pandoc build succeeded.")
Passing --metadata-file separately from the chapter files means the YAML never contaminates the Markdown source. Each chapter file is pure prose.
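If the pandoc executable is not installed, subprocess.run raises FileNotFoundError before any return code exists, so the [FAIL] branch in the build step never runs. A small preflight check, sketched here with the standard library's `shutil.which`, turns that into a readable message. `require_tool` is a hypothetical helper name.

```python
import shutil


def require_tool(name: str) -> str:
    """Return the resolved path of an executable, or raise with a clear message."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(f"[FAIL] '{name}' was not found on PATH; install it first.")
    return path


# Demo with a name that should not exist on any machine.
message = ""
try:
    require_tool("no-such-tool-for-this-demo")
except RuntimeError as exc:
    message = str(exc)
    print(message)
```

In the build script, calling `require_tool("pandoc")` before the subprocess call and converting the exception into `sys.exit(1)` would keep the behavior consistent with the other gates.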
Gate 3: File Size Validation
A successful Pandoc exit code does not mean the EPUB is sane. An EPUB under a certain size threshold almost certainly lost assets — images did not embed, CSS was missing, or chapters were accidentally excluded.
    def validate_output_size(output_path: Path, min_size_kb: int = 50) -> None:
        """
        Fail the build if the output EPUB is suspiciously small.
        Adjust min_size_kb based on your typical book size.
        """
        size_kb = output_path.stat().st_size / 1024
        if size_kb < min_size_kb:
            print(
                f"[FAIL] Output file is only {size_kb:.1f} KB "
                f"(expected at least {min_size_kb} KB). "
                f"Check that all chapters and assets were included."
            )
            sys.exit(1)
        print(f"[OK] Output size: {size_kb:.1f} KB")
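Size alone cannot tell which asset went missing. Because an EPUB is a ZIP archive, the standard `zipfile` module can look inside it. The sketch below checks two things: that the archive contains a `mimetype` entry reading `application/epub+zip`, as the EPUB container format requires, and that it holds at least as many XHTML documents as there are chapters. `inspect_epub` and the one-document-per-chapter heuristic are assumptions; how Pandoc splits content into files depends on its options.

```python
import tempfile
import zipfile
from pathlib import Path


def inspect_epub(epub_path: Path, min_documents: int) -> list[str]:
    """Return a list of problems found inside the archive; empty means OK."""
    problems = []
    with zipfile.ZipFile(epub_path) as zf:
        names = zf.namelist()
        if "mimetype" not in names:
            problems.append("missing 'mimetype' entry")
        elif zf.read("mimetype").decode("ascii", "replace").strip() != "application/epub+zip":
            problems.append("unexpected 'mimetype' content")
        documents = [n for n in names if n.endswith((".xhtml", ".html"))]
        if len(documents) < min_documents:
            problems.append(
                f"only {len(documents)} content document(s), expected >= {min_documents}"
            )
    return problems


# Demo: build a tiny stand-in archive (not a valid EPUB) and inspect it.
with tempfile.TemporaryDirectory() as tmp:
    fake = Path(tmp) / "book.epub"
    with zipfile.ZipFile(fake, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip")
        zf.writestr("EPUB/text/ch001.xhtml", "<html/>")
    ok_result = inspect_epub(fake, min_documents=1)
    short_result = inspect_epub(fake, min_documents=3)

print(ok_result)
print(short_result)
```

Passing `len(chapter_files)` as `min_documents` gives a failure message that points at missing chapters rather than just a small file.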
Putting It Together
    def main():
        root = Path("/path/to/your/book")
        chapter_files = sorted((root / "chapters").glob("*.md"))
        run_metadata_gate(root / "metadata.yaml")
        run_completion_gate(root / "checkpoint.json")
        build_epub(
            chapter_files=chapter_files,
            metadata_path=root / "metadata.yaml",
            cover_image=root / "assets" / "cover.png",
            css_path=root / "assets" / "epub.css",
            output_path=root / "dist" / "book.epub",
        )
        validate_output_size(root / "dist" / "book.epub", min_size_kb=100)
        print("[DONE] Build complete.")


    if __name__ == "__main__":
        main()
Run this in CI or as a pre-publish step. The gates are deterministic: the same inputs pass or fail the same way every time, and broken inputs exit non-zero so no downstream step proceeds.
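A caveat on byte-level reproducibility: according to Pandoc's manual, EPUB output embeds a build timestamp, and a unique identifier is generated unless `identifier` is set in the metadata. The manual describes the `SOURCE_DATE_EPOCH` environment variable for pinning the timestamp. A minimal sketch of preparing such an environment for `subprocess.run` follows. `reproducible_env` is a hypothetical helper, and the epoch value is an arbitrary example.

```python
import os


def reproducible_env(epoch_seconds: int) -> dict[str, str]:
    """Copy the current environment and pin SOURCE_DATE_EPOCH for child processes."""
    env = dict(os.environ)
    env["SOURCE_DATE_EPOCH"] = str(epoch_seconds)
    return env


# Example value: 2026-01-01T00:00:00Z as Unix seconds.
env = reproducible_env(1767225600)
print(env["SOURCE_DATE_EPOCH"])
# Pass it on with: subprocess.run(cmd, capture_output=True, text=True, env=env)
```

Whether two builds then match byte for byte also depends on the Pandoc version and on setting `identifier` in metadata.yaml, so it is worth comparing checksums rather than assuming.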
Why This Matters
Every time I skipped one of these gates during development, something slipped through that I caught only while doing a final read of the assembled EPUB. Spending 15 minutes writing a gate saves hours of post-production cleanup. It also means I can hand off the repository to anyone, and the build either works cleanly or tells them exactly why it did not.
If you want to see the full pipeline that wraps these gates — including the translation step, KDP formatting, and distribution automation — it is documented in a practical guide at https://germy5.gumroad.com/l/xhxkzz for $19.99, with a 30-day refund guarantee.