DEV Community

German Yamil

Assembling Validated EPUB3 Ebooks with Pandoc and Python — Metadata Guardrails and Build Gates

Pandoc is excellent at producing valid EPUB3 from Markdown. Where it falls short is telling you that your metadata still says "TBD" or that three chapters are not finished yet. That is your job, and the right place to do it is before Pandoc ever runs.

This article covers the guardrail layer I put between raw source files and the final EPUB build.

The Problem with Running Pandoc Too Early

An EPUB with [TBD] in the subtitle field, or with a chapter that ends mid-sentence because a writer placeholder was never filled in, is not a draft — it is a broken product. The fix is not discipline; it is a gate that fails loudly when the source is not ready.

I enforce two gates:

  1. Metadata validation — reject any YAML front matter containing placeholder strings.
  2. Completion gate — read a checkpoint.json file where each chapter has a status, and refuse to build if any chapter is not DONE.

Gate 1: Metadata Validator

import yaml
import sys
from pathlib import Path

PLACEHOLDER_PATTERNS = ["[TBD]", "TODO", "XXX", "FIXME", "PLACEHOLDER"]

def load_metadata(metadata_path: Path) -> dict:
    with open(metadata_path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)

def validate_metadata(metadata: dict, source_path: Path) -> list[str]:
    """
    Walk every string value in the metadata dict and flag placeholders.
    Returns a list of error strings; empty list means clean.
    """
    errors = []

    def check_value(key_path: str, value):
        if isinstance(value, str):
            for pattern in PLACEHOLDER_PATTERNS:
                if pattern in value:
                    errors.append(
                        f"  [{source_path}] '{key_path}' contains '{pattern}': {value!r}"
                    )
        elif isinstance(value, dict):
            for k, v in value.items():
                check_value(f"{key_path}.{k}", v)
        elif isinstance(value, list):
            for i, item in enumerate(value):
                check_value(f"{key_path}[{i}]", item)

    for key, val in metadata.items():
        check_value(key, val)

    return errors

def run_metadata_gate(metadata_path: Path) -> None:
    metadata = load_metadata(metadata_path)
    errors = validate_metadata(metadata, metadata_path)
    if errors:
        print("[FAIL] Metadata contains placeholders:")
        for e in errors:
            print(e)
        sys.exit(1)
    print(f"[OK] Metadata clean: {metadata_path}")

A typical metadata.yaml looks like:

title: "Building Production APIs with FastAPI"
subtitle: "A Practical Guide for Backend Engineers"
author: "Yamil Salinas"
language: en
rights: "© 2026 Yamil Salinas. All rights reserved."
publisher: "Self-published"

If someone adds subtitle: "[TBD]" and forgets to come back to it, the build stops immediately with a clear message.
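To make the failure concrete, here is a minimal sketch that runs the same kind of recursive walk over an in-memory dict instead of a YAML file. The helper name find_placeholders and the sample metadata (including the nested creator entry) are assumptions made up for illustration; they are not part of the script above.

```python
# Hypothetical, self-contained variant of the validator: same substring
# matching, but it returns (key_path, pattern) pairs instead of messages.
PLACEHOLDER_PATTERNS = ["[TBD]", "TODO", "XXX", "FIXME", "PLACEHOLDER"]


def find_placeholders(value, key_path=""):
    """Return (key_path, pattern) pairs for every placeholder found in value."""
    hits = []
    if isinstance(value, str):
        hits.extend((key_path, p) for p in PLACEHOLDER_PATTERNS if p in value)
    elif isinstance(value, dict):
        for k, v in value.items():
            child = f"{key_path}.{k}" if key_path else str(k)
            hits.extend(find_placeholders(v, child))
    elif isinstance(value, list):
        for i, item in enumerate(value):
            hits.extend(find_placeholders(item, f"{key_path}[{i}]"))
    return hits


# Made-up sample: one top-level placeholder and one buried in a nested list.
sample = {
    "title": "Building Production APIs with FastAPI",
    "subtitle": "[TBD]",
    "creator": [{"role": "author", "text": "TODO: confirm pen name"}],
}

for path, pattern in find_placeholders(sample):
    print(f"{path} contains {pattern}")
```

Because the match is a plain substring test, a legitimate title that happens to contain one of the patterns would also be flagged. That is usually the safer failure mode for a release gate, but it is worth knowing when choosing the pattern list.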

Gate 2: Chapter Completion Check

The checkpoint.json file is maintained by the writing pipeline — each chapter entry gets updated to DONE when it passes review. The build gate reads this file and refuses to proceed if anything is still in progress.

import json
import sys
from pathlib import Path

def load_checkpoint(checkpoint_path: Path) -> dict:
    with open(checkpoint_path, "r", encoding="utf-8") as f:
        return json.load(f)

def run_completion_gate(checkpoint_path: Path) -> None:
    """
    Expects checkpoint.json of the form:
    {
      "chapters": {
        "01_introduction.md": "DONE",
        "02_setup.md": "DONE",
        "03_routing.md": "IN_PROGRESS"
      }
    }
    Exits with code 1 if any chapter is not DONE.
    """
    checkpoint = load_checkpoint(checkpoint_path)
    chapters = checkpoint.get("chapters", {})
    not_done = {k: v for k, v in chapters.items() if v != "DONE"}

    if not_done:
        print("[FAIL] Completion gate: the following chapters are not DONE:")
        for chapter, status in not_done.items():
            print(f"  {chapter}: {status}")
        sys.exit(1)

    print(f"[OK] All {len(chapters)} chapter(s) are DONE.")

This keeps the checkpoint file as the single source of truth for build readiness. Writers update it; the build script reads it.
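One gap worth guarding against: the gate only sees chapters that are listed in checkpoint.json. A chapter file that exists on disk but was never added to the checkpoint would still be picked up by a glob, and an empty "chapters" object passes trivially. The sketch below cross-checks the two lists. The function name find_checkpoint_gaps is hypothetical, and it assumes checkpoint keys are bare filenames, as in the docstring example above.

```python
from pathlib import Path


def find_checkpoint_gaps(chapter_files, chapters):
    """
    Compare chapter files on disk with checkpoint entries.
    Returns (untracked, missing):
      untracked = files on disk with no checkpoint entry
      missing   = checkpoint entries with no file on disk
    """
    on_disk = {Path(f).name for f in chapter_files}
    tracked = set(chapters)
    return sorted(on_disk - tracked), sorted(tracked - on_disk)


# Illustrative inputs: one file nobody registered, one entry with no file.
untracked, missing = find_checkpoint_gaps(
    [Path("chapters/01_introduction.md"), Path("chapters/04_testing.md")],
    {"01_introduction.md": "DONE", "02_setup.md": "DONE"},
)
print("untracked:", untracked)
print("missing:", missing)
```

Failing the build when either list is non-empty keeps the checkpoint honest in both directions.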

The Pandoc Build Step

Once both gates pass, Pandoc runs as a subprocess. I always pin the output format to epub3 and pass a CSS file for consistent styling:

import subprocess
import sys
from pathlib import Path

def build_epub(
    chapter_files: list[Path],
    metadata_path: Path,
    cover_image: Path,
    css_path: Path,
    output_path: Path,
) -> None:
    """
    Invoke Pandoc to assemble an EPUB3 from the validated chapter files.
    All paths must be absolute to avoid CWD surprises.
    """
    cmd = [
        "pandoc",
        "--from", "markdown+smart",
        "--to", "epub3",
        "--metadata-file", str(metadata_path),
        "--css", str(css_path),
        "--epub-cover-image", str(cover_image),
        "--toc",
        "--toc-depth=2",
        "--output", str(output_path),
    ] + [str(f) for f in chapter_files]

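    # Pandoc does not create missing output directories, so make sure it exists.
    output_path.parent.mkdir(parents=True, exist_ok=True)

    print(f"[BUILD] Running Pandoc → {output_path.name}")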
    result = subprocess.run(cmd, capture_output=True, text=True)

    if result.returncode != 0:
        print("[FAIL] Pandoc error:")
        print(result.stderr)
        sys.exit(1)

    print("[OK] Pandoc build succeeded.")

Passing --metadata-file separately from the chapter files means the YAML never contaminates the Markdown source. Each chapter file is pure prose.

Gate 3: File Size Validation

A successful Pandoc exit code does not mean the EPUB is sane. An EPUB far smaller than your usual output has almost certainly lost content: images did not embed, the CSS was missing, or chapters were accidentally excluded.

import sys
from pathlib import Path

def validate_output_size(output_path: Path, min_size_kb: int = 50) -> None:
    """
    Fail the build if the output EPUB is suspiciously small.
    Adjust min_size_kb based on your typical book size.
    """
    size_kb = output_path.stat().st_size / 1024

    if size_kb < min_size_kb:
        print(
            f"[FAIL] Output file is only {size_kb:.1f} KB "
            f"(expected at least {min_size_kb} KB). "
            f"Check that all chapters and assets were included."
        )
        sys.exit(1)

    print(f"[OK] Output size: {size_kb:.1f} KB")
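File size is a coarse signal. A slightly sharper check, sketched below under the assumption that you only want to confirm the basic container layout, opens the EPUB as a ZIP archive and looks for the two entries every EPUB container must have: a leading mimetype entry and META-INF/container.xml. The function name check_epub_container is hypothetical, and this is not a substitute for a full EPUB validator.

```python
import io
import zipfile

EPUB_MIMETYPE = "application/epub+zip"


def check_epub_container(epub_file) -> list[str]:
    """Return a list of structural problems; an empty list means the basics are present."""
    problems = []
    with zipfile.ZipFile(epub_file) as zf:
        names = zf.namelist()
        if not names or names[0] != "mimetype":
            problems.append("'mimetype' is not the first archive entry")
        # strip() is deliberately lenient about trailing whitespace.
        elif zf.read("mimetype").decode("ascii", "replace").strip() != EPUB_MIMETYPE:
            problems.append("'mimetype' does not contain application/epub+zip")
        if "META-INF/container.xml" not in names:
            problems.append("META-INF/container.xml is missing")
    return problems


# Demo on an in-memory archive so the example runs without a real book.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("mimetype", EPUB_MIMETYPE)
    zf.writestr("META-INF/container.xml", "<container/>")
buffer.seek(0)
print(check_epub_container(buffer))
```

In the build script, the same function could be called with the output path right after the size check, exiting non-zero if the returned list is not empty.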

Putting It Together

def main():
    root = Path("/path/to/your/book")
    chapter_files = sorted((root / "chapters").glob("*.md"))

    run_metadata_gate(root / "metadata.yaml")
    run_completion_gate(root / "checkpoint.json")
    build_epub(
        chapter_files=chapter_files,
        metadata_path=root / "metadata.yaml",
        cover_image=root / "assets" / "cover.png",
        css_path=root / "assets" / "epub.css",
        output_path=root / "dist" / "book.epub",
    )
    validate_output_size(root / "dist" / "book.epub", min_size_kb=100)
    print("[DONE] Build complete.")

if __name__ == "__main__":
    main()

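Run this in CI or as a pre-publish step. The gates are deterministic: the same inputs always pass or fail the same way, and broken inputs exit non-zero so no downstream step proceeds. The EPUB file itself is only byte-for-byte reproducible if you also set an identifier in the metadata and the SOURCE_DATE_EPOCH environment variable, because Pandoc otherwise generates a fresh identifier and build timestamp on every run.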
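If byte-identical output matters, for example to compare checksums between CI runs, Pandoc reads the build timestamp from the SOURCE_DATE_EPOCH environment variable when it is set. The sketch below prepares such an environment for the subprocess call. The helper name reproducible_env and the timestamp value are assumptions for illustration; you would also need a fixed identifier field in metadata.yaml, since Pandoc otherwise generates a random one.

```python
import os


def reproducible_env(source_date_epoch: int, base_env=None) -> dict:
    """Copy an environment mapping and pin SOURCE_DATE_EPOCH for the Pandoc subprocess."""
    env = dict(os.environ if base_env is None else base_env)
    env["SOURCE_DATE_EPOCH"] = str(source_date_epoch)
    return env


# Arbitrary fixed Unix timestamp; in practice this might be the last commit time.
env = reproducible_env(1767225600, base_env={"PATH": "/usr/bin"})
print(env["SOURCE_DATE_EPOCH"])

# Inside build_epub, the call would then become:
#   subprocess.run(cmd, capture_output=True, text=True, env=env)
```

Copying the environment instead of mutating os.environ keeps the setting scoped to the Pandoc call.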

Why This Matters

Every time I skipped one of these gates during development, something slipped through that I caught only during a final read of the assembled EPUB. Spending 15 minutes writing a gate saves hours of post-production cleanup. It also means I can hand the repository to anyone, and the build either works cleanly or tells them exactly why it did not.


If you want to see the full pipeline that wraps these gates — including the translation step, KDP formatting, and distribution automation — it is documented in a practical guide at https://germy5.gumroad.com/l/xhxkzz for $19.99, with a 30-day refund guarantee.
