PDF Is Still the Hardest File Format to Work With. Here's Why.

#tauri #rust #programming #webdev

All tests run on an 8-year-old MacBook Air.

I've spent months building a PDF tool. PDF is the most infuriating file format I've ever worked with.

Not because it's poorly designed, but because it has been accumulating complexity since 1993, and the PDF 1.7 spec alone runs to roughly 750 pages.

It's not a document format. It's a programming language.

A PDF page's content is a stream of operands followed by operators, in postfix order like its ancestor PostScript, with a graphics state stack and its own coordinate system. A "page" is a program that draws itself.

BT                   % Begin text
/F1 12 Tf            % Set font F1 at 12pt
100 700 Td           % Move to position
(Hello, World.) Tj   % Draw text string
ET                   % End text

That's not markup. That's code. Parsing it correctly requires implementing an interpreter.
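
To make "implementing an interpreter" concrete, here is a minimal sketch in Rust. It assumes a drastically reduced subset of the syntax (numbers, names, unescaped and non-nested literal strings, comments), and the names Token, tokenize, and extract_text are hypothetical, not taken from any library. It shows only the core loop: push operands, act when an operator arrives.

```rust
/// A token in a (heavily simplified) PDF content stream.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Number(f64),
    Name(String),     // e.g. /F1 (stored without the slash)
    Str(String),      // literal string, e.g. (Hello)
    Operator(String), // e.g. BT, Tf, Td, Tj, ET
}

/// Split a content stream into tokens. Handles only numbers, names,
/// non-nested literal strings without escapes, operators, and % comments.
fn tokenize(src: &str) -> Vec<Token> {
    let chars: Vec<char> = src.chars().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        if c.is_whitespace() {
            i += 1;
        } else if c == '%' {
            // Comment: skip to end of line.
            while i < chars.len() && chars[i] != '\n' {
                i += 1;
            }
        } else if c == '(' {
            // Literal string: read up to the closing parenthesis.
            let start = i + 1;
            i = start;
            while i < chars.len() && chars[i] != ')' {
                i += 1;
            }
            tokens.push(Token::Str(chars[start..i].iter().collect()));
            i += 1; // skip ')'
        } else {
            // Anything else runs until whitespace or a delimiter.
            let start = i;
            i += 1;
            while i < chars.len() && !chars[i].is_whitespace() && !"()/%".contains(chars[i]) {
                i += 1;
            }
            let word: String = chars[start..i].iter().collect();
            if let Some(name) = word.strip_prefix('/') {
                tokens.push(Token::Name(name.to_string()));
            } else if let Ok(n) = word.parse::<f64>() {
                tokens.push(Token::Number(n));
            } else {
                tokens.push(Token::Operator(word));
            }
        }
    }
    tokens
}

/// Run the tokens through an operand stack and collect text shown by Tj.
fn extract_text(src: &str) -> Vec<String> {
    let mut stack: Vec<Token> = Vec::new();
    let mut shown = Vec::new();
    for token in tokenize(src) {
        match token {
            Token::Operator(op) => {
                // Operands precede their operator, so they are on the stack.
                if op == "Tj" {
                    if let Some(Token::Str(s)) = stack.pop() {
                        shown.push(s);
                    }
                }
                // Every operator consumes its operands; a real interpreter
                // would update fonts, matrices, and graphics state here.
                stack.clear();
            }
            operand => stack.push(operand),
        }
    }
    shown
}
```

Feeding it the five-line snippet above yields a single string, Hello, World. What comes back are raw character codes, though. Turning those into Unicode is a font problem, and a real interpreter also has to handle escapes, hex strings, arrays, dictionaries, and inline images.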

Font handling will break you

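Font dictionaries in PDF come in 7 different subtypes. Each has different encoding rules.

Type0, Type1, MMType1, Type3, TrueType, CIDFontType0, CIDFontType2

On top of that, the embedded font program can be Type 1, TrueType, or one of the compact formats: Type1C, CIDFontType0C, OpenType.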

A scanned PDF from a 2003 copier might use a custom encoding table that maps byte values to glyph IDs in a way that made sense to that specific device in 2003.

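Text extraction from these files produces garbage unless you implement the full encoding resolution chain: the font's ToUnicode CMap if it has one, then its Encoding entry and Differences array, then the font program's built-in encoding. Many libraries stop partway.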
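
Here is a rough sketch of what an encoding resolution chain can look like for a simple single-byte font. Everything in it is an assumption for illustration: the FontMaps struct, the decode_code function, and the three-entry glyph table standing in for a real glyph-name list. Real fonts also need multi-byte CMaps and proper base encodings, which this skips.

```rust
use std::collections::HashMap;

/// The per-font lookup tables a text extractor would have parsed already.
/// Field names are hypothetical; they mirror PDF concepts, not any crate.
struct FontMaps {
    /// From the font's /ToUnicode CMap: character code -> Unicode text.
    to_unicode: HashMap<u8, String>,
    /// From /Encoding /Differences: character code -> glyph name.
    differences: HashMap<u8, String>,
}

/// Tiny stand-in for a full glyph-name table.
fn glyph_name_to_char(name: &str) -> Option<char> {
    match name {
        "A" => Some('A'),
        "fi" => Some('\u{FB01}'),     // fi ligature
        "bullet" => Some('\u{2022}'), // bullet
        _ => None,
    }
}

/// Resolve one character code by trying each source in priority order.
fn decode_code(code: u8, font: &FontMaps) -> String {
    // 1. An explicit ToUnicode mapping wins.
    if let Some(text) = font.to_unicode.get(&code) {
        return text.clone();
    }
    // 2. Otherwise translate the glyph name from /Differences.
    if let Some(ch) = font
        .differences
        .get(&code)
        .and_then(|name| glyph_name_to_char(name))
    {
        return ch.to_string();
    }
    // 3. Fall back to treating printable ASCII as itself. This is a
    //    simplification of the real base encodings.
    if code.is_ascii_graphic() || code == b' ' {
        return (code as char).to_string();
    }
    // 4. Nothing matched: emit the Unicode replacement character.
    '\u{FFFD}'.to_string()
}
```

The point of the ordering is that each step is less trustworthy than the one before it. A device-specific glyph name like g123 falls through every step and comes out as the replacement character, which is exactly the garbage you see in extracted text.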

"Deleted" content isn't deleted

PDF supports incremental updates. When a tool saves an edit this way, the original objects stay in the file. New objects are appended, along with a new cross-reference (xref) section that points to the new versions.

The old versions are still in the bytes. They're just not indexed.

[Original PDF bytes, ending in their own %%EOF]
[Updated object - new version of page 1]
[New xref section pointing to updated objects]
[New trailer and startxref]
%%EOF

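A forensic tool (or a hex editor) can read the original content. This is why editing a PDF and saving it doesn't reliably erase the old content or metadata. Only a full rewrite that drops unreferenced objects does.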
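
You don't need a forensic suite to see this. A minimal sketch, assuming each saved revision ends with its own %%EOF marker: find the markers, then slice the file at an earlier one. The function names revision_ends and revision_bytes are hypothetical. Treat it as a heuristic, since the marker bytes can also turn up for other reasons (linearized files, or inside stream data).

```rust
/// Return the byte offset just past each `%%EOF` marker in the file.
/// More than one marker often means the file has incremental updates.
fn revision_ends(bytes: &[u8]) -> Vec<usize> {
    const MARKER: &[u8] = b"%%EOF";
    bytes
        .windows(MARKER.len())
        .enumerate()
        .filter(|(_, window)| *window == MARKER)
        .map(|(start, _)| start + MARKER.len())
        .collect()
}

/// Slice out the file as it existed at an earlier revision (0 = original).
/// Returns None if the file has no such revision.
fn revision_bytes(bytes: &[u8], revision: usize) -> Option<&[u8]> {
    revision_ends(bytes).get(revision).map(|&end| &bytes[..end])
}
```

If revision_bytes(&data, 0) returns a slice shorter than the whole file, that slice is a candidate for the document as it looked before the edit, and any PDF reader can be pointed at it.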

The spec is advisory, not enforced

PDF viewers are so permissive that malformed PDFs became common. Files that violate the spec in multiple ways open fine in Adobe Reader, so nobody fixed them.

Now every PDF parser has to handle malformed files gracefully, because real-world PDFs are full of spec violations.

A stricter parser like lopdf (a Rust PDF library) can fail on files that Adobe Reader opens without complaint. You end up writing recovery code for files that technically shouldn't exist.
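
One common flavor of recovery code is rebuilding the object index when the file's own xref table is missing or wrong. The sketch below is an illustration under loud assumptions, not lopdf's API or anyone's production code: rebuild_xref is a hypothetical name, and it only recognizes "N G obj" headers that start a newline-terminated line.

```rust
use std::collections::BTreeMap;

/// Rebuild an object index by scanning for `N G obj` headers, ignoring
/// whatever the file's own cross-reference table claims.
/// Returns object number -> byte offset of the last definition seen.
fn rebuild_xref(bytes: &[u8]) -> BTreeMap<u32, usize> {
    let mut index = BTreeMap::new();
    let mut offset = 0;
    for line in bytes.split(|&b| b == b'\n') {
        // Binary stream data usually fails UTF-8 decoding and is skipped.
        if let Ok(text) = std::str::from_utf8(line) {
            // Object headers look like "12 0 obj"; split_whitespace also
            // tolerates a trailing carriage return.
            let parts: Vec<&str> = text.split_whitespace().collect();
            if parts.len() >= 3 && parts[2] == "obj" {
                if let (Ok(num), Ok(_gen)) =
                    (parts[0].parse::<u32>(), parts[1].parse::<u16>())
                {
                    // Later definitions overwrite earlier ones, matching
                    // how incremental updates supersede old objects.
                    index.insert(num, offset);
                }
            }
        }
        offset += line.len() + 1; // +1 for the '\n' consumed by split
    }
    index
}
```

A real recovery pass has more to worry about: files that use bare carriage returns as line endings, headers glued to the dictionary that follows, objects packed inside compressed object streams, and stream data that happens to look like a header.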