DEV Community

Ayi NEDJIMI

I automated PDF generation for 1,600 security guides — WeasyPrint lessons

Last year I made a decision I slightly regret: I promised PDF downloads for every security guide on my site. At the time I had around 400 articles. By the time I finished building the pipeline, I had over 1,600. I am not proud of everything I wrote to make this work, but it works reliably in production and I learned a lot along the way.

Why WeasyPrint

I looked at several options: Puppeteer (Node, which I wanted to avoid), wkhtmltopdf (abandoned, with known rendering issues), and WeasyPrint. WeasyPrint is a Python library that converts HTML/CSS to PDF with its own layout engine for CSS print (paged media) rules, using Pango for text; older releases also drew through Cairo. It is not the fastest, but it respects CSS print media queries, handles Unicode properly, and does not require a browser binary or a display server.

For a cybersecurity consulting site generating PDFs from server-rendered HTML, that tradeoff was acceptable.

The basic pipeline

Each article lives at a URL like https://ayinedjimi-consultants.fr/articles/{slug}. The generation script does not go through the public URL: it fetches the same page directly from the Go Fiber backend at localhost:4001 and writes a PDF to /var/www/ayinedjimi-prod/public/static/pdf/{slug}.pdf.

import subprocess
from pathlib import Path

def generate_pdf(slug: str, base_url: str = "http://localhost:4001") -> Path:
    url = f"{base_url}/articles/{slug}"
    output_path = Path(f"/var/www/ayinedjimi-prod/public/static/pdf/{slug}.pdf")
    output_path.parent.mkdir(parents=True, exist_ok=True)

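    # subprocess.run raises subprocess.TimeoutExpired if the 120 s limit is hit
    result = subprocess.run(
        [
            "weasyprint",
            "--optimize-images",
            "--uncompressed-pdf",  # optional: larger files; pikepdf can merge compressed PDFs too
            url,
            str(output_path),
        ],
        capture_output=True,
        text=True,
        timeout=120,
    )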

    if result.returncode != 0:
        raise RuntimeError(f"WeasyPrint failed for {slug}:\n{result.stderr}")

    return output_path
Enter fullscreen mode Exit fullscreen mode
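A minimal sketch of a batch driver around the function above. The name `generate_all` and the `generate` parameter are hypothetical; `generate` stands in for `generate_pdf`. The sketch assumes you want one bad article to be recorded rather than abort the whole run, and it handles the fact that `subprocess.run(..., timeout=...)` raises `subprocess.TimeoutExpired` instead of returning a non-zero exit code.

```python
import subprocess
from typing import Callable, Iterable


def generate_all(
    slugs: Iterable[str],
    generate: Callable[[str], object],
) -> dict[str, str]:
    """Run `generate` for every slug; return {slug: error message} for failures."""
    failures: dict[str, str] = {}
    for slug in slugs:
        try:
            generate(slug)
        except subprocess.TimeoutExpired as exc:
            # Raised by subprocess.run when the timeout is exceeded.
            failures[slug] = f"timed out after {exc.timeout}s"
        except RuntimeError as exc:
            # Raised by generate_pdf when WeasyPrint exits with an error.
            failures[slug] = str(exc)
    return failures
```

Calling `generate_all(slugs, generate_pdf)` then gives a dictionary that can be logged or retried at the end of the run.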

Then, for guides that needed a branded cover page, I used pikepdf to merge a static cover PDF in front of the generated content:

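import pikepdf
from pathlib import Path

def prepend_cover(pdf_path: Path, cover_path: Path) -> None:
    # pikepdf reads pages lazily from pdf_path, so saving straight over it
    # while it is still open risks a corrupt file. Write to a temporary
    # file first and swap it in once both inputs are closed.
    tmp_path = pdf_path.with_suffix(".tmp.pdf")
    with pikepdf.open(cover_path) as cover, pikepdf.open(pdf_path) as content:
        merged = pikepdf.Pdf.new()
        merged.pages.extend(cover.pages)
        merged.pages.extend(content.pages)
        merged.save(tmp_path)
    # replace the original with the merged file
    tmp_path.replace(pdf_path)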

This is simple but effective. The cover is a static PDF I designed once in Inkscape and never touch again.

The print CSS problem

The biggest time sink was not the Python code — it was CSS.

Web CSS and print CSS are two different problems. My site uses Tailwind CSS, and Tailwind does almost nothing useful for @media print. Navigation bars, sticky headers, cookie banners, and dark-mode backgrounds all came through in the PDF looking terrible.

I added a dedicated @media print block to my stylesheet:

@media print {
  header, footer, nav, .cookie-banner, #mobile-cta-bar, .comments-section {
    display: none !important;
  }

  body {
    background: white !important;
    color: black !important;
    font-size: 11pt;
    line-height: 1.5;
  }

  pre, code {
    white-space: pre-wrap;
    word-break: break-word;
    border: 1px solid #ccc;
    padding: 0.5em;
    font-size: 9pt;
  }

  h2 {
    page-break-before: auto;
    page-break-after: avoid;
  }

  table {
    page-break-inside: avoid;
  }

  a[href]::after {
    content: " (" attr(href) ")";
    font-size: 8pt;
    color: #555;
  }
}

The page-break-after: avoid on headings prevents the embarrassing situation where a section title appears at the very bottom of a page with its content on the next one. The a[href]::after rule appends URLs in parentheses, which matters for printed security checklists where readers might want to look something up.

Font handling on a headless server

On my development machine, WeasyPrint found all fonts fine. On the server (Ubuntu 22.04, no desktop environment), it fell back to a generic serif and the output looked wrong.

The fix was to install fonts explicitly and ensure WeasyPrint could find them:

sudo apt-get install -y fonts-open-sans fonts-liberation fontconfig
sudo fc-cache -fv

I also added a @font-face rule pointing to locally hosted font files rather than a CDN, because WeasyPrint makes HTTP requests to fetch external resources and CDN latency adds up at 1,600 articles.
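One way to catch the silent serif fallback before a long run is to ask fontconfig which families it can see. The sketch below is an assumption-laden example, not part of the original pipeline: the function names are hypothetical, and the required family names ("Open Sans", "Liberation Sans") are guesses based on the packages installed above. It parses the output of `fc-list : family`, which prints one line per font with comma-separated family names.

```python
import subprocess


def installed_families(fc_list_output: str) -> set[str]:
    """Parse `fc-list : family` output into a set of lowercase family names."""
    families: set[str] = set()
    for line in fc_list_output.splitlines():
        # One line can carry several comma-separated names for the same font.
        for name in line.split(","):
            name = name.strip().lower()
            if name:
                families.add(name)
    return families


def missing_families(required: list[str], fc_list_output: str) -> list[str]:
    """Return the required families that fontconfig did not report."""
    available = installed_families(fc_list_output)
    return [name for name in required if name.lower() not in available]


def check_server_fonts(required: list[str]) -> list[str]:
    """Query fontconfig on this machine (requires the fc-list binary)."""
    output = subprocess.run(
        ["fc-list", ":", "family"], capture_output=True, text=True, check=True
    ).stdout
    return missing_families(required, output)
```

Running `check_server_fonts(["Open Sans", "Liberation Sans"])` as the same user that runs WeasyPrint should return an empty list when the fonts are visible to that user.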

File permissions and www-data

My Go Fiber backend runs as www-data, and the PDF generation script writes to a directory that the backend serves from. I initially ran the script as root from a cron job. That worked as long as the web server only had to read the files, since root-owned files were still readable. It broke once I needed to regenerate individual PDFs from within the application itself, because www-data could not overwrite them.

The clean solution was to run the cron job as www-data:

# in crontab for www-data
0 2 * * * /opt/ayinedjimi-src/scripts/generate-pdfs.py >> /var/log/pdf-gen.log 2>&1

And ensure the output directory is owned by www-data from the start. If you mix ownership in a directory that a process needs to write to, you will chase confusing permission errors for longer than you want to admit.
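A small preflight check can surface mixed ownership before the first write fails. This is a sketch under assumptions: the function name `check_output_dir` is hypothetical, it targets Unix only, and it only looks at `*.pdf` files directly inside the directory.

```python
import os
from pathlib import Path


def check_output_dir(path: Path) -> list[str]:
    """Return a list of problems that would get in the way of writing PDFs to `path`."""
    if not path.is_dir():
        return [f"{path} is not a directory"]

    problems: list[str] = []
    uid = os.geteuid()
    if path.stat().st_uid != uid:
        problems.append(f"{path} is not owned by the current user (uid {uid})")
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append(f"{path} is not writable by the current user")

    # PDFs left behind by a run under another account (for example root)
    # usually cannot be overwritten in place.
    foreign = sorted(p.name for p in path.glob("*.pdf") if p.stat().st_uid != uid)
    if foreign:
        problems.append(f"{len(foreign)} PDF(s) owned by another user, e.g. {foreign[0]}")
    return problems
```

Run it as www-data at the start of the generation script and stop early if the list is not empty.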

What I wish I had done differently

Incremental generation from day one. My first version regenerated all PDFs every night. That takes about 4 hours for 1,600 articles. The smarter approach, which I eventually built, is to track a pdf_generated_at timestamp in the database and only regenerate articles updated since the last run.
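The timestamp approach can be sketched as follows. Everything except the `pdf_generated_at` column is an assumption for illustration: SQLite stands in for whatever database the site uses, the `articles` table and its `slug` and `updated_at` columns are hypothetical, and timestamps are assumed to be stored as UTC ISO 8601 strings so that string comparison matches chronological order. The query compares each article's own timestamps, a per-article variant of "updated since the last run".

```python
import sqlite3


def slugs_needing_pdf(conn: sqlite3.Connection) -> list[str]:
    """Return slugs that have no PDF yet or were updated after their last PDF."""
    rows = conn.execute(
        """
        SELECT slug FROM articles
        WHERE pdf_generated_at IS NULL OR updated_at > pdf_generated_at
        ORDER BY slug
        """
    ).fetchall()
    return [row[0] for row in rows]


def mark_generated(conn: sqlite3.Connection, slug: str, generated_at: str) -> None:
    """Record the generation time once the PDF has been written successfully."""
    conn.execute(
        "UPDATE articles SET pdf_generated_at = ? WHERE slug = ?",
        (generated_at, slug),
    )
    conn.commit()
```

Calling `mark_generated` only after a successful write means a failed article stays in the list and is retried on the next run.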

Separate the generation from the serving. WeasyPrint is memory-hungry. On my 4GB VPS it sometimes consumed over 1GB for complex articles with many images. Running generation as a background process with a queue rather than inline would have been cleaner.
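A queue-based setup could look roughly like this. It is a sketch, not the production code: `start_pdf_worker` is a hypothetical name, `generate` stands in for a function like `generate_pdf`, and a single in-process worker thread is assumed so that only one memory-hungry render runs at a time. A real deployment might prefer a separate process or an external job queue.

```python
import queue
import threading
from typing import Callable


def start_pdf_worker(
    jobs: "queue.Queue[str | None]",
    generate: Callable[[str], object],
    failures: dict[str, str],
) -> threading.Thread:
    """Start one background thread that renders queued slugs one at a time."""

    def run() -> None:
        while True:
            slug = jobs.get()
            try:
                if slug is None:  # sentinel value: stop the worker
                    return
                generate(slug)
            except Exception as exc:
                # Record the error and keep going with the next article.
                failures[slug] = str(exc)
            finally:
                jobs.task_done()

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker
```

The web application only calls `jobs.put(slug)` and returns immediately; putting `None` on the queue shuts the worker down.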

Test print CSS early. I spent three evenings fixing print CSS issues I could have caught by pressing Ctrl+P in my browser on day one.

The full pipeline is now stable. PDF coverage is above 54% of published articles (I exclude short news items and blog posts). The security checklists section was actually the original motivation — users needed printable, offline-usable versions of hardening guides for firewall audits where internet access is restricted.

If you are building something similar, WeasyPrint is a legitimate choice. Just budget more time for print CSS than you think you need.


I run AYI NEDJIMI Consultants, a cybersecurity consulting firm. We publish security hardening checklists for FortiGate, Palo Alto, Active Directory, and more — free PDF and Excel.
