Michael Lip

Posted on Mar 20 • Originally published at zovo.one

Generating PDFs from HTML: Why It Is Harder Than You Think


The first time a client asked me to add a "Download as PDF" button to a web app, I thought it would take an afternoon. It took a week. The HTML looked perfect in the browser. The PDF output had broken layouts, missing fonts, cut-off tables, and pages that split paragraphs in the middle of sentences. Converting HTML to PDF is one of those problems that seems solved until you actually try to do it well.

Why browsers and PDFs are different

A browser renders HTML into a continuous, scrollable viewport. A PDF is a fixed-size, paginated document. This fundamental difference is behind most of the problems in HTML-to-PDF conversion:

Page breaks. A browser never needs to split content across pages. A PDF does. Where do you break a 3,000-pixel-tall page into letter-sized sheets? The naive approach breaks wherever the page boundary falls, even if that is in the middle of a table row, an image, or a heading that should stay with its following paragraph.

Fonts. Browsers have access to system fonts and web fonts loaded via CSS. PDF generators need to embed fonts in the document. If the font is not embedded, the viewer substitutes a fallback, and the layout shifts.

Dynamic content. JavaScript-rendered content (React components, Chart.js graphs, dynamically loaded data) does not exist in the HTML source. A converter that does not execute JavaScript sees the raw HTML, not the rendered DOM, so that content is simply missing from the PDF.

Print stylesheets. CSS has a @media print rule for print-specific styles, but most web applications do not have one. The screen layout is not the print layout.
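To make the page-break problem concrete, here is a minimal sketch of the naive slicing arithmetic. It assumes 96 CSS pixels per inch and a US Letter sheet with no margins; the function name naiveBreakOffsets is made up for illustration.

```javascript
// Assumption: 96 CSS pixels per inch, US Letter (8.5in x 11in), no margins.
const CSS_PX_PER_INCH = 96;
const LETTER_HEIGHT_PX = 11 * CSS_PX_PER_INCH; // 1056

// Returns the y-offsets (in px) where a naive converter would cut the content.
function naiveBreakOffsets(contentHeightPx, pageHeightPx = LETTER_HEIGHT_PX) {
  const offsets = [];
  for (let y = pageHeightPx; y < contentHeightPx; y += pageHeightPx) {
    offsets.push(y);
  }
  return offsets;
}

console.log(naiveBreakOffsets(3000)); // [ 1056, 2112 ]
```

Whatever happens to sit at y = 1056 and y = 2112 gets cut in half, whether that is a table row, an image, or the line after a heading.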

The three approaches

1. Browser-based: Puppeteer / Playwright

Use a headless browser to render the page, then call page.pdf(). This is the most accurate approach because you get the full browser rendering engine.

const puppeteer = require("puppeteer");

async function htmlToPDF(html) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait until the network has gone idle so images and web fonts are loaded.
    await page.setContent(html, { waitUntil: "networkidle0" });
    return await page.pdf({
      format: "A4",
      margin: { top: "20mm", bottom: "20mm", left: "15mm", right: "15mm" },
      printBackground: true, // keep background colors and images
    });
  } finally {
    // Close the browser even if rendering throws, so no Chromium process leaks.
    await browser.close();
  }
}

Pros: Full CSS support, JavaScript execution, web font rendering, the PDF looks like the web page.

Cons: Requires a Chromium binary (200+ MB), slow startup, high memory usage, not suitable for serverless without optimization.

The printBackground option is critical -- without it, background colors and images are stripped, which makes most modern designs look broken.
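One common way to soften the slow startup is to launch the browser once and open a new page per conversion. The following is a rough sketch of that idea, not a production setup: createPdfRenderer is a hypothetical helper, and the launch function is passed in so the caching logic stays independent of Puppeteer itself.

```javascript
// Sketch: share one headless browser across many conversions.
// `launch` is injected by the caller; with Puppeteer it would be
// () => require("puppeteer").launch().
function createPdfRenderer(launch) {
  let browserPromise = null; // cached so concurrent callers share one launch

  return async function renderPdf(html, pdfOptions = {}) {
    if (!browserPromise) browserPromise = launch();
    const browser = await browserPromise;
    const page = await browser.newPage();
    try {
      await page.setContent(html, { waitUntil: "networkidle0" });
      return await page.pdf({ format: "A4", printBackground: true, ...pdfOptions });
    } finally {
      await page.close(); // close the tab, keep the browser alive
    }
  };
}
```

Usage would look like const renderPdf = createPdfRenderer(() => require("puppeteer").launch()); followed by await renderPdf(html) for each document. The sketch leaves out error recovery: if the launch fails or the browser crashes, the cached promise would need to be reset.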

2. Library-based: wkhtmltopdf, WeasyPrint

wkhtmltopdf uses an older WebKit engine. It is faster and lighter than Puppeteer but supports less CSS (no CSS Grid, and Flexbox only in its old, prefixed form). The project is also no longer actively maintained. It still works well for simple, text-heavy documents like invoices and reports.

wkhtmltopdf --margin-top 20mm --margin-bottom 20mm input.html output.pdf

WeasyPrint is a Python library that implements its own CSS rendering engine focused on print layout. It handles page breaks, headers, footers, and multi-column layouts better than most browser-based tools. It does not execute JavaScript, so the HTML you pass in must already contain the final content.

from weasyprint import HTML

HTML(string=html_content).write_pdf("output.pdf")

3. Client-side: jsPDF, html2canvas

jsPDF generates PDFs entirely in the browser. Combined with html2canvas (which screenshots the DOM to a canvas), you can create PDFs without a server.

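import html2canvas from "html2canvas";
import jsPDF from "jspdf";

const element = document.getElementById("content");
const canvas = await html2canvas(element);
const imgData = canvas.toDataURL("image/png");
const pdf = new jsPDF("p", "mm", "a4");
// Scale the image to the A4 width (210 mm) and keep its aspect ratio,
// instead of stretching it to the full 210 x 297 mm page.
const imgHeight = (canvas.height * 210) / canvas.width;
pdf.addImage(imgData, "PNG", 0, 0, 210, imgHeight);
pdf.save("document.pdf");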

Pros: No server required, works offline.

Cons: The PDF contains a rasterized image, not selectable text. File sizes are large. Quality depends on the resolution the canvas was rendered at, which defaults to the device pixel ratio of the user's screen. Page breaks are manual.
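"Manual" page breaks here usually means drawing the same tall image on several pages, shifted up by one page height each time. The sketch below shows that arithmetic under the A4, millimetre setup used above; planPages and addImageAcrossPages are hypothetical helper names, and pdf is assumed to be a jsPDF instance.

```javascript
// A4 in millimetres, matching the jsPDF("p", "mm", "a4") setup above.
const A4_WIDTH_MM = 210;
const A4_HEIGHT_MM = 297;

// Given the canvas size in pixels, return the scaled image height in mm and
// the top edge (in mm, measured down the image) of the slice for each page.
function planPages(canvasWidthPx, canvasHeightPx) {
  const imgHeightMm = (canvasHeightPx * A4_WIDTH_MM) / canvasWidthPx; // keep aspect ratio
  const pageCount = Math.max(1, Math.ceil(imgHeightMm / A4_HEIGHT_MM));
  const sliceTopsMm = Array.from({ length: pageCount }, (_, i) => i * A4_HEIGHT_MM);
  return { imgHeightMm, sliceTopsMm };
}

// `pdf` is a jsPDF instance and `imgData` a PNG data URL.
function addImageAcrossPages(pdf, imgData, canvasWidthPx, canvasHeightPx) {
  const { imgHeightMm, sliceTopsMm } = planPages(canvasWidthPx, canvasHeightPx);
  sliceTopsMm.forEach((topMm, i) => {
    if (i > 0) pdf.addPage();
    // Draw the full image shifted up so this page shows the next slice.
    pdf.addImage(imgData, "PNG", 0, -topMm, A4_WIDTH_MM, imgHeightMm);
  });
}
```

This is still the naive approach: the cut lands wherever the page boundary falls, so a line of text can be sliced in half. Avoiding that means measuring element positions in the DOM and choosing slice points yourself.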

Controlling page breaks

CSS provides properties specifically for print layout:

/* Avoid breaking inside this element */
.keep-together {
  break-inside: avoid;
}

/* Always start a new page before this element */
.new-page {
  break-before: page;
}

/* Avoid breaking after headings */
h2, h3 {
  break-after: avoid;
}

/* Keep at least 3 lines of a paragraph at the bottom of a page (orphans)
   and at least 3 lines at the top of the next page (widows) */
p {
  widows: 3;
  orphans: 3;
}

The break-inside: avoid property is the most useful. Apply it to table rows, cards, or any element that looks wrong when split across pages. Puppeteer and WeasyPrint both respect these properties. wkhtmltopdf has partial support.

Headers, footers, and page numbers

Puppeteer supports header and footer templates:

const pdf = await page.pdf({
  format: "A4",
  displayHeaderFooter: true,
  headerTemplate: '<div style="font-size:10px; text-align:center; width:100%;">Company Report</div>',
  footerTemplate: '<div style="font-size:10px; text-align:center; width:100%;">Page <span class="pageNumber"></span> of <span class="totalPages"></span></div>',
  margin: { top: "40mm", bottom: "30mm", left: "15mm", right: "15mm" },
});

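The pageNumber and totalPages class names are special: the browser fills them in with the actual values when it prints. The templates do not inherit the page's styles, so set the font size inline as shown. The margin must be large enough to accommodate the header and footer content, otherwise the page content overlaps them.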
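If the header text comes from user data, it is worth escaping it before building the template string. The helper below is a small sketch of that; buildHeaderFooter and escapeHtml are hypothetical names, and the styling simply mirrors the example above.

```javascript
// Escape text before placing it in a template, so titles containing
// "<" or "&" are shown literally instead of being parsed as markup.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Builds the header/footer options for page.pdf().
// pageNumber and totalPages are the special class names described above.
function buildHeaderFooter(title) {
  const style = "font-size:10px; text-align:center; width:100%;";
  return {
    displayHeaderFooter: true,
    headerTemplate: `<div style="${style}">${escapeHtml(title)}</div>`,
    footerTemplate:
      `<div style="${style}">Page <span class="pageNumber"></span>` +
      ` of <span class="totalPages"></span></div>`,
  };
}
```

The result can be spread into the call, for example page.pdf({ format: "A4", ...buildHeaderFooter("Q3 & Q4 Report"), margin: { top: "40mm", bottom: "30mm" } }).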

Common mistakes

Not waiting for content to load. If your HTML loads images, fonts, or data asynchronously, the PDF generator needs to wait. Puppeteer's waitUntil: "networkidle0" waits until there are no network requests for 500ms. For JavaScript-rendered content, you might need page.waitForSelector() targeting a specific element.
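Put together, the waiting steps might look like the sketch below. It assumes page is a Puppeteer Page and that your app renders some marker element once its data is in place; the selector name and the loadForPdf helper are hypothetical.

```javascript
// `page` is a Puppeteer Page. `readySelector` is a hypothetical marker element
// that the app renders once its asynchronous content has loaded.
async function loadForPdf(page, html, readySelector, timeoutMs = 10000) {
  // 1. Load the markup and wait for the network to go quiet.
  await page.setContent(html, { waitUntil: "networkidle0" });

  // 2. Wait for JavaScript-rendered content, if a marker was given.
  if (readySelector) {
    await page.waitForSelector(readySelector, { timeout: timeoutMs });
  }

  // 3. Wait for web fonts, so text is laid out with the final font.
  await page.evaluate(async () => {
    await document.fonts.ready;
  });
}
```

After await loadForPdf(page, html, "#chart-ready"), calling page.pdf() should see the finished layout. A timeout on the selector is deliberate: it is better for the job to fail loudly than to produce a PDF with an empty chart.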

Forgetting print-specific CSS. Add a @media print stylesheet that hides navigation, adjusts margins, forces black text on white background, and removes unnecessary decorative elements.
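When you cannot edit the app's stylesheets, one option is to inject print rules into the HTML before converting it. Here is a minimal sketch; the selectors in PRINT_CSS are placeholders for your own markup, and withPrintStyles is a hypothetical helper.

```javascript
// Placeholder selectors: adjust these to your own markup.
const PRINT_CSS = `
@media print {
  nav, footer, .no-print { display: none !important; }
  body { color: #000; background: #fff; }
  a { color: inherit; text-decoration: none; }
}`;

// Inserts the print rules before </head>, or prepends them if there is no <head>.
function withPrintStyles(html, css = PRINT_CSS) {
  const styleTag = `<style>${css}</style>`;
  // A replacer function is used so "$" in the CSS is never treated specially.
  return /<\/head>/i.test(html)
    ? html.replace(/<\/head>/i, () => `${styleTag}</head>`)
    : styleTag + html;
}
```

Puppeteer's page.pdf() renders with print media by default, so rules inside @media print apply without any extra configuration.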

Ignoring PDF accessibility. A PDF generated from an image (the html2canvas approach) is not accessible. Screen readers cannot read it. If accessibility matters, use a text-based PDF generator.

Not testing with real data. A PDF that looks perfect with 10 rows in a table breaks when there are 500 rows. Test with production-scale data.
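A cheap way to approximate this before production data is available is to generate tables of different sizes and render each one. The generator below is only a sketch with placeholder values; buildTestTable and the row counts are my own choices, not a standard.

```javascript
// Builds an HTML table with `rowCount` placeholder rows for pagination tests.
function buildTestTable(rowCount) {
  const rows = Array.from({ length: rowCount }, (_, i) => {
    const n = i + 1;
    return `<tr><td>${n}</td><td>Item ${n}</td><td>${(n * 9.99).toFixed(2)}</td></tr>`;
  }).join("\n");

  return `<style>tr { break-inside: avoid; }</style>
<table>
<thead><tr><th>#</th><th>Description</th><th>Amount</th></tr></thead>
<tbody>
${rows}
</tbody>
</table>`;
}

// Row counts worth rendering: empty, one page, just over a page, production scale.
const ROW_COUNTS = [0, 10, 60, 500];
```

Rendering buildTestTable(n) for each value in ROW_COUNTS and paging through the output tends to surface split rows, orphaned headings, and missing table headers quickly.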

For quick HTML-to-PDF conversion without setting up a build pipeline -- converting a single page, testing a print layout, or generating a one-off document -- I keep a converter at zovo.one/free-tools/html-to-pdf-converter that handles the rendering and page formatting.

HTML-to-PDF conversion is a solved problem in the sense that tools exist. It is an unsolved problem in the sense that every project discovers its own edge cases. Choose the right approach for your constraints, handle page breaks explicitly, and test with real data.

I'm Michael Lip. I build free developer tools at zovo.one. 350+ tools, all private, all free.
