How PDFs Work Under the Hood (and Why Merging Them Is Harder Than You Think)

PDF looks simple from the outside. Open a file, see pages, print or share. But under the surface, PDF is one of the most complex document formats in widespread use. The specification runs to roughly 1,000 pages. A single PDF file can contain fonts, images, JavaScript, 3D models, multimedia, form fields, digital signatures, and embedded files. It's not just a page description format -- it's a container format that happens to describe pages.

Understanding a little about PDF internals makes you a better developer whenever you need to generate, merge, split, or parse PDFs. Here's what's actually inside the file.

The four sections of a PDF

Every PDF has four structural components:

1. Header. The first line identifies the PDF version: %PDF-1.7 or %PDF-2.0. The second line is usually a comment with high-bit characters that tell text editors the file is binary, not text.

2. Body. The objects that make up the document's content. Each object has a number and a generation (usually 0) and starts with a line such as 1 0 obj. Objects can be dictionaries, arrays, streams (often compressed binary data), strings, names, numbers, booleans, null, or references to other objects.

3. Cross-reference table. A table that maps each object number to its byte offset in the file. This is what allows random access -- a PDF reader can jump directly to any object without reading the entire file sequentially.

4. Trailer. Points to the root object (the document catalog) and to the cross-reference table. The trailer sits at the end of the file, so the reader starts there and follows the offsets back into the body.

When you open a large PDF, the reader doesn't load the whole file. It reads the trailer, finds the cross-reference table, and then loads only the objects needed for the current page. This is why a 500-page PDF opens to page 1 almost instantly.
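
To make the "start at the end" idea concrete, here is a minimal sketch in plain Node.js. It relies on one detail from the PDF specification that isn't spelled out above: the last lines of a file are the startxref keyword, the byte offset of the cross-reference table, and %%EOF. The inspectPdf function and the tiny hand-built sample file are my own illustration, not how any particular reader is implemented, and real files (incremental updates, cross-reference streams) need far more care.

```javascript
// Sketch only: find the version in the header and the xref offset near the end.
function inspectPdf(bytes) {
  const buf = Buffer.from(bytes);

  // Header: the file starts with "%PDF-<major>.<minor>".
  const head = buf.subarray(0, 16).toString('latin1');
  const versionMatch = /^%PDF-(\d+\.\d+)/.exec(head);

  // The final lines are "startxref", a byte offset, and "%%EOF".
  const tail = buf.subarray(Math.max(0, buf.length - 1024)).toString('latin1');
  const xrefMatch = /startxref\s+(\d+)\s+%%EOF\s*$/.exec(tail);

  return {
    version: versionMatch ? versionMatch[1] : null,
    xrefOffset: xrefMatch ? Number(xrefMatch[1]) : null,
  };
}

// Tiny hand-built file, just enough structure to exercise the function.
const body = '%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>\nendobj\n';
const xrefOffset = Buffer.byteLength(body, 'latin1');
const samplePdf = Buffer.from(
  body +
    'xref\n0 2\n' +
    '0000000000 65535 f \n' +
    '0000000009 00000 n \n' +
    'trailer\n<< /Size 2 /Root 1 0 R >>\n' +
    `startxref\n${xrefOffset}\n%%EOF\n`,
  'latin1'
);

const info = inspectPdf(samplePdf);
// Jumping straight to the reported offset lands on the "xref" keyword.
const atOffset = samplePdf
  .subarray(info.xrefOffset, info.xrefOffset + 4)
  .toString('latin1');
console.log(info.version, atOffset); // 1.7 xref
```

Only the first few bytes and the last kilobyte are inspected, which is the same reason a reader can open a large file without scanning it from start to finish.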

Why merging is non-trivial

If PDF files were just concatenated pages, merging would be trivial: concatenate the bytes and update the page count. But they're not. Here's what a merge operation actually needs to handle:

Object number conflicts. Each PDF numbers its objects starting from 1. When you merge two files, both have an "object 1," an "object 2," etc. The merger must renumber all objects in the second file to avoid collisions.

Cross-reference rebuilding. After renumbering objects, every cross-reference entry and every internal object reference must be updated. A reference like 5 0 R (meaning "object 5, generation 0") might need to become 105 0 R in the merged file.

Font subsetting. PDFs usually embed only the glyphs they use from a font (subset embedding). If document A uses Helvetica with characters A-M and document B uses Helvetica with characters N-Z, the merged document would ideally have one combined font subset. In practice, most mergers keep both subsets as separate font objects, which increases file size but avoids the complexity of font merging.

Resource dictionaries. Each page has a resource dictionary listing the fonts, images, and color spaces it uses. These resources are often shared across pages. The merger must ensure that shared resources in the source files remain accessible and that naming conflicts are resolved.

Bookmarks and outlines. If either source PDF has a table of contents (outline), the merger should combine them. Each outline entry points at a destination page, and those destinations need to be remapped so they land on the right pages in the merged file.

Form fields. If both PDFs contain form fields, field names might conflict. Two files might both have a field called "name" with different values and properties.
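
The renumbering step from the list above (5 0 R becoming 105 0 R) can be sketched on a small piece of PDF-style text. This is a toy under a big assumption: it treats the objects as plain text and shifts every number with a regular expression. Real libraries parse the objects first, because streams are binary and a string could happen to contain something that looks like a reference. The renumberObjects name and the offset of 100 are hypothetical, chosen to match the example numbers.

```javascript
// Sketch: shift every object number in a chunk of PDF-style text by a fixed offset.
function renumberObjects(source, offset) {
  return source
    // Indirect references: "5 0 R" -> "105 0 R"
    .replace(/\b(\d+) (\d+) R\b/g, (_, num, gen) => `${Number(num) + offset} ${gen} R`)
    // Object headers: "5 0 obj" -> "105 0 obj"
    .replace(/\b(\d+) (\d+) obj\b/g, (_, num, gen) => `${Number(num) + offset} ${gen} obj`);
}

// A fragment of the "second file", whose numbers would collide with the first.
const secondFile = [
  '4 0 obj',
  '<< /Type /Page /Parent 2 0 R /Contents 5 0 R >>',
  'endobj',
  '5 0 obj',
  '<< /Length 0 >>',
  'endobj',
].join('\n');

const shifted = renumberObjects(secondFile, 100);
console.log(shifted);
// 104 0 obj
// << /Type /Page /Parent 102 0 R /Contents 105 0 R >>
// endobj
// 105 0 obj
// << /Length 0 >>
// endobj
```

After a shift like this, the byte offsets of every object have changed too, which is why the cross-reference table has to be rebuilt rather than patched.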

Merging PDFs in JavaScript

A widely used JavaScript library for PDF manipulation is pdf-lib, which runs in Node.js and in the browser:

import { PDFDocument } from 'pdf-lib';
import fs from 'fs';

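// Merge the given PDF files, in order, into merged.pdf.
async function mergePDFs(paths) {
  const merged = await PDFDocument.create();

  for (const path of paths) {
    const bytes = fs.readFileSync(path);
    const doc = await PDFDocument.load(bytes);
    // copyPages clones the pages and the objects they reference into `merged`.
    const pages = await merged.copyPages(doc, doc.getPageIndices());
    pages.forEach(page => merged.addPage(page));
  }

  const mergedBytes = await merged.save();
  fs.writeFileSync('merged.pdf', mergedBytes);
}

// Surface load or write failures instead of leaving an unhandled rejection.
mergePDFs(['doc1.pdf', 'doc2.pdf', 'doc3.pdf']).catch(console.error);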

The copyPages method handles the object renumbering and reference updating internally. It's doing significant work behind those two lines.

Merging in Python

Python's go-to library is pypdf (the successor to PyPDF2):

from pypdf import PdfWriter

writer = PdfWriter()

for path in ['doc1.pdf', 'doc2.pdf', 'doc3.pdf']:
    writer.append(path)

writer.write('merged.pdf')
writer.close()

pypdf's append method handles page-by-page copying, object renumbering, and resource deduplication. You can also append specific page ranges, given as a (start, stop) tuple of zero-based indices where stop is excluded, like Python's range:

writer.append('doc1.pdf', pages=(0, 5))   # first 5 pages
writer.append('doc2.pdf', pages=(2, 10))   # pages 3-10
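
pdf-lib has no range argument; copyPages takes an explicit array of zero-based indices, as in the merge example earlier. A small helper can translate the same half-open (start, stop) convention. This is a sketch: rangeToIndices is a hypothetical name, and clamping to the page count is my own choice rather than something either library prescribes.

```javascript
// Expand a half-open [start, stop) range into zero-based page indices.
function rangeToIndices(start, stop, pageCount) {
  const first = Math.max(0, start);
  const end = Math.min(stop, pageCount); // clamp to the document length
  const indices = [];
  for (let i = first; i < end; i++) {
    indices.push(i);
  }
  return indices;
}

console.log(rangeToIndices(0, 5, 20));  // [ 0, 1, 2, 3, 4 ]  (first 5 pages)
console.log(rangeToIndices(2, 10, 20)); // [ 2, 3, 4, 5, 6, 7, 8, 9 ]  (pages 3-10)
console.log(rangeToIndices(2, 10, 4));  // [ 2, 3 ]  (clamped to a 4-page file)
```

The result can be passed where the earlier example used doc.getPageIndices(), for instance merged.copyPages(doc, rangeToIndices(0, 5, doc.getPageCount())).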

Common problems when merging

File size explosion. If both source PDFs embed the same fonts, the merged file contains both copies. Merging two files that each embed 500KB of fonts produces a file with about 1MB of font data. Some libraries offer font deduplication, but it's not always enabled by default.

Encrypted PDFs. PDFs can be encrypted with an owner password (restricting editing and printing) or a user password (restricting opening). Libraries that support decryption can usually open files protected only by an owner password, but need the user password when one is set. Support varies between libraries, so check before relying on it. Merging an encrypted PDF without handling the encryption typically fails with an error or produces corrupted output.

Linearized PDFs. Some PDFs are "linearized" (optimized for web viewing) with a special structure that allows page-at-a-time downloading. Merging breaks linearization. The output is still a valid PDF but isn't optimized for progressive loading.

Annotations and links. Internal links (like "go to page 5") point at a specific page in the source document. After merging, that page might be page 15 in the merged file. Good merge libraries remap the link targets to the copied pages; basic ones don't, leaving you with broken internal links.

Mixed page sizes. There's no requirement that pages in a PDF be the same size. Document A might be Letter and document B might be A4. The merged file will have pages of different sizes, which is valid but can surprise users when printing.
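
Mixed page sizes are easy to detect before they surprise anyone. The sketch below works on plain { width, height } objects in PDF points (1/72 inch); with pdf-lib you could collect them with page.getSize(), which as far as I know returns that shape. The function names, the one-point tolerance, and the two-entry size table are assumptions made for illustration.

```javascript
// Known paper sizes in PDF points (1 pt = 1/72 inch), portrait orientation.
const KNOWN_SIZES = {
  Letter: [612, 792],
  A4: [595.28, 841.89],
};

// Label a page size, ignoring orientation; unknown sizes get a "WxHpt" label.
function classifyPageSize({ width, height }, tolerance = 1) {
  const [short, long] = width <= height ? [width, height] : [height, width];
  for (const [name, [w, h]] of Object.entries(KNOWN_SIZES)) {
    if (Math.abs(short - w) <= tolerance && Math.abs(long - h) <= tolerance) {
      return name;
    }
  }
  return `${Math.round(width)}x${Math.round(height)}pt`;
}

// Count how many pages fall under each label.
function summarizePageSizes(sizes) {
  const counts = {};
  for (const size of sizes) {
    const label = classifyPageSize(size);
    counts[label] = (counts[label] || 0) + 1;
  }
  return counts;
}

const sizes = [
  { width: 612, height: 792 },
  { width: 612, height: 792 },
  { width: 595.28, height: 841.89 },
  { width: 841.89, height: 595.28 }, // landscape A4
];

const summary = summarizePageSizes(sizes);
const mixed = Object.keys(summary).length > 1;
console.log(summary, mixed); // { Letter: 2, A4: 2 } true
```

A merge tool could use a check like this to warn the user rather than silently producing a mixed document.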

Splitting: the reverse operation

Splitting a PDF is simpler than merging because you don't have object conflicts. You extract pages into a new document, copying only the objects those pages reference:

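import { PDFDocument } from 'pdf-lib';
import fs from 'fs';

// pageRanges is a list of [start, end] pairs: zero-based page indices, both inclusive.
async function splitPDF(path, pageRanges) {
  const bytes = fs.readFileSync(path);
  const doc = await PDFDocument.load(bytes);

  for (const [start, end] of pageRanges) {
    const newDoc = await PDFDocument.create();
    // Expand [start, end] into the explicit index list copyPages expects.
    const indices = Array.from(
      { length: end - start + 1 },
      (_, i) => start + i
    );
    const pages = await newDoc.copyPages(doc, indices);
    pages.forEach(page => newDoc.addPage(page));

    // One output file per range, named after the indices it contains.
    const newBytes = await newDoc.save();
    fs.writeFileSync(`pages_${start}-${end}.pdf`, newBytes);
  }
}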

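The main pitfall with splitting is shared resources. If pages 1 and 10 share a large image and you extract only page 1, the resulting file must still include the full image, even though the source file stored it only once. Split a document into many pieces and each piece carries its own copy, so the pieces can add up to far more than the original file.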
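
"Copying only the objects those pages reference" is a graph walk. Here is a toy version, assuming a made-up object graph in which each object simply lists the object numbers it refers to; the objects map and collectReachable are hypothetical, not any library's internals. A real page also points back up to its parent page tree, and a copier has to avoid following that link or it would pull in every page.

```javascript
// Toy object graph: object number -> description and the objects it references.
const objects = new Map([
  [1, { kind: 'page 1', refs: [3, 4] }],
  [2, { kind: 'page 10', refs: [3, 5] }],
  [3, { kind: 'large shared image', refs: [] }],
  [4, { kind: 'font A', refs: [] }],
  [5, { kind: 'font B', refs: [] }],
]);

// Collect every object reachable from a root, visiting each object once.
function collectReachable(graph, rootId) {
  const seen = new Set();
  const stack = [rootId];
  while (stack.length > 0) {
    const id = stack.pop();
    if (seen.has(id)) continue;
    seen.add(id);
    for (const ref of graph.get(id)?.refs ?? []) {
      stack.push(ref);
    }
  }
  return [...seen].sort((a, b) => a - b);
}

console.log(collectReachable(objects, 1)); // [ 1, 3, 4 ]
console.log(collectReachable(objects, 2)); // [ 2, 3, 5 ]
```

Object 3, the shared image, shows up in both results, so both output files end up carrying a full copy of it.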

Client-side PDF merging

With pdf-lib running in the browser, you can merge PDFs entirely on the client side. Nothing is uploaded to a server, which removes the main privacy concern:

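import { PDFDocument } from 'pdf-lib';

// `files` is an array of File objects, e.g. from an <input type="file" multiple>.
async function mergeInBrowser(files) {
  const merged = await PDFDocument.create();

  for (const file of files) {
    const bytes = await file.arrayBuffer();
    const doc = await PDFDocument.load(bytes);
    const pages = await merged.copyPages(doc, doc.getPageIndices());
    pages.forEach(page => merged.addPage(page));
  }

  // Wrap the merged bytes in a Blob and trigger a download via a temporary link.
  const blob = new Blob([await merged.save()], { type: 'application/pdf' });
  const url = URL.createObjectURL(blob);

  const a = document.createElement('a');
  a.href = url;
  a.download = 'merged.pdf';
  a.click();

  // Release the object URL once the download has had a chance to start.
  setTimeout(() => URL.revokeObjectURL(url), 1000);
}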

This is the approach I used when building the PDF merger at zovo.one/free-tools/pdf-merger. Everything runs in your browser -- the files never leave your machine.

PDFs are deceptively complex. The format's flexibility is what makes it the universal document standard, but that same flexibility is what makes operations like merging, splitting, and editing non-trivial. When something goes wrong with a PDF operation, understanding the internal structure -- objects, cross-references, resource dictionaries -- is usually the fastest path to a fix.

I'm Michael Lip. I build free developer tools at zovo.one. 350+ tools, all private, all free.