After 2,900 tests, here's the CSS subset that actually matters for PDF

Vladyslav Kyslenko — Sun, 03 May 2026 23:28:18 +0000

Two years ago I decided that every PDF editor on the market was wrong, and that I was going to fix it. The fixing took longer than expected.

This is the story of why, what I learned about CSS that I wish someone had told me on day one, and the architectural choices that turned out to be the right ones (and the ones that didn't).

The thing I noticed

Open any PDF in Smallpdf, iLovePDF, pdfFiller, DocHub, Sejda, even Adobe Acrobat. Try to add a row to a table. Watch what happens.

In every single case, the editor doesn't actually edit the table. It just inserts a floating, absolutely-positioned text box on top of the page. The original PDF is completely untouched underneath. Move the new "row" anywhere. There's no table. There's no document structure. There's a page-shaped image with text boxes pasted on top.

This is not a bug. This is how all of them work, because it's the only thing that's tractable when you're starting from a finished PDF. The PDF format doesn't really have tables. It has positioned glyphs. To get back to a table, you have to reconstruct one. Nobody wants to do that, so nobody does.

I wanted to do that.

The "just use a library" phase

For about three weeks I assumed someone had already solved this and I just had to find the right library. I tried:

Puppeteer / Playwright. These work great for HTML→PDF if you control the HTML. Useless for the reverse direction. Also a 200MB Chromium dependency on your server. Hard pass.

wkhtmltopdf. Deprecated, based on an ancient WebKit, breaks on anything modern. Flexbox? Forget it. Grid? Lol. The maintainers have explicitly asked people to stop using it.

Dompdf. PHP, more or less abandoned, no flexbox, no grid, broken table layouts. Fine for invoices in 2012.

mPDF. Better than Dompdf, decent box model, but tables fall apart on anything non-trivial and the page break logic has had open issues for years.

Prince XML. Actually good. Closed-source, $3,800 per license, not viable for a SaaS that needs to scale.

WeasyPrint. Python, surprisingly capable, but the rendering doesn't match Chrome and there's no good way to bring it into a PHP/Node environment without subprocess gymnastics.

None of them solved my actual problem, which was bidirectional: I needed to render HTML to PDF and parse PDF back into HTML that wasn't div soup. Even if I picked the best of the above for one direction, the other direction was still unsolved.

So I started writing my own.

Why this was technically a stupid idea

Writing a CSS rendering engine is one of those things people warn you about. There's a reason browsers have hundreds of contributors and decades of work behind them. The CSS specification is roughly 3,000 pages of edge cases. Real-world HTML breaks every assumption you make.

Every senior engineer I told about this said some version of "that's a five-year project, you'll never ship." They were almost right. Two years in, I have an engine that handles maybe 90% of the CSS that actually matters for PDF, and that "actually matters" qualifier is doing a lot of work.

Which brings me to the most useful thing I learned.

CSS for PDF is a much smaller spec than CSS for the web

This is the realization that made the project possible. The full CSS specification is enormous. The CSS specification that applies to a printed page is much, much smaller. Once you accept that, the scope collapses from "impossible" to "hard but finite."

Things that don't exist in PDF:

:hover, :focus, :active and all interaction pseudo-classes
Animations and transitions
position: sticky
Scroll behavior, overflow scrolling
Viewport units in any meaningful sense (a PDF page has fixed dimensions)
Most of the modern container query stuff
JavaScript-driven dynamic styles

You can throw out maybe 30% of CSS immediately. Nobody is hovering on a printed invoice.
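
As a rough illustration of that first cut, here is a minimal sketch of pruning a stylesheet before layout. The rule representation (a selector string plus a dict of declarations), the function name, and the exact contents of the drop lists are all assumptions for the example, not the engine's actual data model.

```python
import re

# Assumption: these lists mirror a few items from the list above;
# a real engine's lists would be longer.
INTERACTION_PSEUDO = re.compile(r":(hover|focus|active)\b")
DROPPED_PREFIXES = ("animation", "transition", "scroll-behavior")


def prune_for_print(rules):
    """Drop selectors and declarations that cannot affect a static page.

    `rules` is a list of (selector, declarations) pairs, where
    `declarations` maps property names to values.
    """
    kept = []
    for selector, declarations in rules:
        # Naive split on commas; selectors like :is(a, b) would need a real parser.
        parts = [part.strip() for part in selector.split(",")]
        parts = [part for part in parts if not INTERACTION_PSEUDO.search(part)]
        if not parts:
            continue
        remaining = {}
        for prop, value in declarations.items():
            if prop.startswith(DROPPED_PREFIXES):
                continue
            if prop == "position" and value == "sticky":
                # Dropped so the property falls back to its default.
                continue
            remaining[prop] = value
        if remaining:
            kept.append((", ".join(parts), remaining))
    return kept
```

Doing this before layout means the rest of the engine never has to know these features exist.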

Things that do exist and that you have to get right:

The full box model: margins (including collapsing), padding, borders, dimensions, box-sizing
Typography: line-height, letter-spacing, word-spacing, font metrics, vertical alignment, baseline calculation
Tables: the entire table layout algorithm, which is its own beast
Flexbox: yes, in PDF, and almost no PDF tool supports this properly
Grid: same
Page breaks: break-before, break-after, break-inside, widows, orphans
Print media queries (@media print, @page)
Multi-column layout

The PDF-relevant subset of CSS is roughly 60-70% of the full spec. That's a lot, but it's bounded. You can finish it.

I have around 2,900 tests just on the engine, organized roughly along these lines. Each one is a piece of HTML+CSS rendered to PDF and compared byte-for-byte against an expected output. When I find a CSS edge case in the wild that breaks something, it goes in as a test. The test count is mostly a function of how much CSS I've encountered in real documents, not how clever I am.
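
A golden-file test of this kind can be sketched in a few lines. The directory layout (input.html next to expected.pdf) and the render callable are hypothetical stand-ins, not the engine's real API. Byte-for-byte comparison only works if the renderer is deterministic, so anything like a creation timestamp or a random document ID has to be pinned.

```python
from pathlib import Path
from typing import Callable, Optional


def first_difference(actual: bytes, expected: bytes) -> Optional[int]:
    """Return the offset of the first differing byte, or None if identical."""
    if actual == expected:
        return None
    for offset, (a, b) in enumerate(zip(actual, expected)):
        if a != b:
            return offset
    # One output is a strict prefix of the other.
    return min(len(actual), len(expected))


def run_golden_case(case_dir: Path, render: Callable[[str], bytes]) -> Optional[int]:
    """Render input.html and compare it to expected.pdf in the same directory.

    `render` is a hypothetical HTML-to-PDF entry point supplied by the caller.
    """
    html = (case_dir / "input.html").read_text(encoding="utf-8")
    expected = (case_dir / "expected.pdf").read_bytes()
    return first_difference(render(html), expected)
```

Returning the offset of the first mismatch, rather than a bare pass or fail, makes it easier to find which PDF object changed.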

The page break rabbit hole

This is the part nobody warns you about, so I'll warn you here.

Page breaks are deceptively simple in the spec. break-inside: avoid means "don't break inside this element." widows: 3 means "if you break a paragraph, leave at least 3 lines on the next page." Easy.
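
For a single paragraph in isolation, the rule really is that easy. A minimal sketch, with function names made up for the example: orphans is the minimum number of lines left at the bottom of the current page, widows is the minimum carried to the top of the next one.

```python
def allowed_splits(line_count, orphans=2, widows=2):
    """Return every legal count of lines that may stay on the current page."""
    return list(range(orphans, line_count - widows + 1))


def lines_to_keep(line_count, lines_that_fit, orphans=2, widows=2):
    """Decide how many lines of a paragraph stay on the current page.

    Returns line_count if the whole paragraph fits, 0 if it must move
    to the next page in one piece, otherwise the largest legal split.
    """
    if lines_that_fit >= line_count:
        return line_count
    candidates = [
        k for k in allowed_splits(line_count, orphans, widows)
        if k <= lines_that_fit
    ]
    return max(candidates) if candidates else 0
```

With widows set to 3, a 10-line paragraph that has room for 8 lines keeps only 7, because keeping 8 would strand 2 lines on the next page. A 4-line paragraph with orphans at 2 and widows at 3 has no legal split at all and moves as a unit.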

In practice, page break logic is a constraint solver pretending to be a CSS property. Consider a real document: a header, two paragraphs, a table with break-inside: avoid, a figure with a caption that should stay together, a footnote. The page is almost full. Where do you break?

The naive answer is "wherever you run out of vertical space." This is wrong, and it's wrong in different ways depending on which element you hit first. If the break would land in the middle of the table and the table has break-inside: avoid, you have to push the entire table to the next page, which leaves a huge gap, which might trigger different rules about minimum content per page. Once you push it, the figure that was right after it might now fit awkwardly. If you start moving the figure, its caption rule kicks in.

The browser engines (Blink, WebKit, Gecko) have all converged on something like a multi-pass algorithm: lay out everything assuming infinite vertical space, then break iteratively, allowing layout to reflow at each break point. You can't just paginate after layout. Layout has to be aware of pagination.

I rewrote my page break code three times. The first version was greedy (break when full). The second version was lookahead-based (try to find a "good" break point N pixels back). The third version, which actually works, is a constraint-based pass that knows about all the avoid-break and widow/orphan rules together and finds breaks that satisfy as many as possible.
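
To make the difference between greedy and constraint-based concrete, here is a toy version of the idea. It is not the engine's algorithm: it is a small dynamic program over unbreakable blocks, with a made-up Block type, a single keep_with_next flag standing in for the avoid rules, and an arbitrary penalty weight. It picks the set of breaks with the lowest total cost instead of breaking as soon as the page is full.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    name: str
    height: float
    keep_with_next: bool = False  # stands in for break-after: avoid


def paginate(blocks, page_height, violation_penalty=1000.0):
    """Split blocks into pages, minimizing rule violations plus wasted space."""
    n = len(blocks)
    best = [0.0] * (n + 1)   # best[i]: minimum cost of laying out blocks[i:]
    choice = [n] * (n + 1)   # choice[i]: index where the page starting at i ends
    for i in range(n - 1, -1, -1):
        best[i] = math.inf
        used = 0.0
        for j in range(i + 1, n + 1):  # candidate page holds blocks[i:j]
            used += blocks[j - 1].height
            if used > page_height and j > i + 1:
                break  # overflow; an oversized lone block still gets its own page
            cost = best[j]
            if j < n:  # a break after blocks[j - 1]
                cost += max(0.0, page_height - used)  # wasted space on this page
                if blocks[j - 1].keep_with_next:
                    cost += violation_penalty
            if cost < best[i]:
                best[i], choice[i] = cost, j
    pages, i = [], 0
    while i < n:
        pages.append([block.name for block in blocks[i:choice[i]]])
        i = choice[i]
    return pages
```

On a 100-unit page with two paragraphs (40 and 30), a figure (25, kept with its caption), a caption (10), and a table (50), a greedy pass fills the first page to 95 and splits the figure from its caption. The sketch above breaks after the second paragraph instead, leaving 30 units empty and keeping the figure, caption, and table together on page two.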

The CSS Fragmentation spec describes all of this, but in a way that assumes you already know what it means. You don't, until you've implemented it.

If anyone reading this is about to start a similar project: do not skip the spec. Read CSS Fragmentation Module Level 3 cover to cover. Then read it again. Then write your tests before you write your algorithm.

Cross-language byte-identical output

The engine is primarily PHP (the rendering side), with Python doing extraction work and TypeScript doing some of the editor logic. Keeping these in sync is harder than it sounds.

The reason is floating point. CSS layout involves a lot of arithmetic on dimensions: width = 100% - margin-left - margin-right, multiplied across nested elements, and small differences accumulate. PHP's float is IEEE 754 double, Python's float is IEEE 754 double, JavaScript's number is IEEE 754 double. Same standard. But floating-point arithmetic isn't associative, so different operation orders give different results, and the math libraries and rounding helpers underneath each runtime don't always agree in the last digit either.

You can converge them, but only by being deliberate. The relevant moves:

All layout math goes through a single, deterministic order of operations
All font metric lookups use a single shared font cache, not per-language font tables (those drift)
All transcendental operations (rare in layout, but they happen) use lookup tables, not native math libraries
Output is rounded to a fixed precision (1/1000 of a point, in my case) at well-defined boundaries
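
A small sketch of the first and last of those moves, with hypothetical function names. The rounding is written as floor(x + 0.5) on purpose: the built-in helpers disagree on ties (Python's round goes to even, PHP's round goes away from zero, JavaScript's Math.round goes toward positive infinity), while this expression uses only operations that behave the same way in all three.

```python
import math


def content_width(container_pt, margin_left_pt, margin_right_pt):
    """Subtract margins in one fixed order; every port must use this order."""
    return (container_pt - margin_left_pt) - margin_right_pt


def quantize(value_pt):
    """Round a length in points to an integer number of 1/1000 pt.

    Ties round toward positive infinity.
    """
    return math.floor(value_pt * 1000 + 0.5)


# The same three numbers, summed in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)                      # False
print(quantize(left_to_right) == quantize(right_to_left))  # True
```

Quantizing hides the difference here, but only because the error is far from a rounding boundary. That is why the fixed order of operations matters as well as the rounding.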

I have tests that take the same HTML, render it through the PHP path and the Python path, and assert the outputs are byte-identical. 161 Python tests, 51 TypeScript tests, all gating on byte-identity. Every time I think I've found a clever optimization, one of them catches a regression.

The harder direction: PDF back into semantic HTML

Rendering HTML to PDF is the easy half. The hard half is going the other way.

Take a PDF you didn't create. There's no DOM. There are positioned glyphs. There might be a /Table structure tag if the document was tagged for accessibility, but in practice 95% of PDFs in the wild aren't tagged. You're left with: "here's a glyph at coordinates (143.2, 287.4) in font Helvetica-Bold size 11."

To get from there to a real <table><tr><td> structure, you have to reconstruct intent. This is fundamentally an inference problem.

For tables specifically, there are two main approaches. Lattice-based detection looks for ruling lines (the borders of cells). Stream-based detection looks at columnar alignment of text without borders. Both have failure modes. Tables with merged cells break lattice detection. Tables with irregular spacing break stream detection. Production-quality table reconstruction needs both, plus heuristics for when each is right.
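
The core of the stream approach can be sketched as a projection onto the x-axis: merge the horizontal extents of every word, and whatever gaps survive are candidate column gutters. This is a generic illustration, not the pipeline described here. The input format (x0, x1 pairs in points) and the min_gap threshold are assumptions.

```python
def detect_columns(word_extents, min_gap=5.0):
    """Group horizontal word extents into column bands.

    `word_extents` is an iterable of (x0, x1) pairs for the words in a
    candidate table region. Extents closer together than `min_gap` are
    merged into one band.
    """
    columns = []
    for x0, x1 in sorted(word_extents):
        if columns and x0 - columns[-1][1] < min_gap:
            # Overlaps or nearly touches the current band: extend it.
            columns[-1][1] = max(columns[-1][1], x1)
        else:
            columns.append([x0, x1])
    return [tuple(band) for band in columns]
```

The failure mode is visible in the code: one long cell that spills across a gutter merges two columns into one, which is the irregular-spacing problem.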

I built the deterministic version first: a multi-stage pipeline that does coordinate analysis, identifies ruling lines via PyMuPDF, detects multi-column layouts, infers list structures, and produces a tagged tree. This handles maybe 70-80% of real-world documents well. The remaining cases are where AI vision helps. I use AI vision for the structural extraction on documents that the deterministic pipeline can't fully reconstruct, with the deterministic output as a strong prior.

The combination matters more than either alone. A pure-LLM approach on a 15-page financial document hallucinates numbers. A pure-deterministic approach fails on visually-structured documents that don't have ruling lines. Running both, with the deterministic output anchoring the AI vision pass, gets you to "good enough that an accountant won't quit your product on document one."

There's a content-skipping problem with LLMs on long documents that I won't go into in detail here, but the short version: don't try to extract a 15-page PDF in a single call. Chunk it, validate row counts against PyMuPDF's coordinate analysis, retry on mismatch. The "just use a bigger context window" answer is wrong.
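
The chunk, validate, retry loop looks roughly like this. Both callables are hypothetical: extract_chunk stands in for the model call and expected_row_count for the count derived from coordinate analysis. The chunk size and retry limit are placeholder values.

```python
def extract_in_chunks(pages, extract_chunk, expected_row_count,
                      chunk_size=3, max_retries=2):
    """Extract table rows a few pages at a time, validating each chunk.

    `extract_chunk(chunk)` returns a list of rows for those pages.
    `expected_row_count(chunk)` returns how many rows the deterministic
    pass found in the same pages.
    """
    rows = []
    for start in range(0, len(pages), chunk_size):
        chunk = pages[start:start + chunk_size]
        expected = expected_row_count(chunk)
        for _attempt in range(max_retries + 1):
            candidate = extract_chunk(chunk)
            if len(candidate) == expected:
                rows.extend(candidate)
                break
        else:
            # Never matched: surface it instead of silently accepting a gap.
            raise ValueError(
                f"pages {start}..{start + len(chunk) - 1}: "
                f"expected {expected} rows, got {len(candidate)}"
            )
    return rows
```

A row count is a weak check, since it cannot catch a wrong number inside a row, but it is cheap and it catches skipped content.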

The weird architectural choice: source in metadata

For documents that are saved through ReflowPDF (the editor I built on top of the engine), I do something most PDF tools don't: I take the original semantic HTML, AES-256 encrypt it, and embed it directly in the PDF's metadata.

What this means in practice: the PDF is its own source file. You can re-open it in ReflowPDF, the engine reads the metadata, decrypts the HTML, and you're editing the original semantic structure with no information loss and no server-side storage of anything.

This is a weird choice. The standard play is to save the source on the server and let the user "open from cloud." That requires me to store user files. I didn't want to store user files. PDFs from accountants and lawyers contain things I don't want sitting in my database.

The downside is the PDFs get a bit bigger (the embedded HTML is well under the size of the rendered content for most documents, but it's still extra bytes). The upside is the PDF is portable, self-contained, and can be opened back into a real editable state by anyone with my tool.

I haven't seen anyone else do this. There might be a reason. So far, two years in, the reason hasn't shown up.

What I'd do differently

A few honest answers.

I'd start the test suite from day one, not month four. Adding tests retroactively to a layout engine is a special kind of pain.

I'd pick the cross-language strategy upfront. I added Python and TypeScript implementations after PHP was already deep, and unifying them was more work than it would have been to design for it.

I'd write the page break algorithm before writing anything else. It touches every other part of the engine, and not having it sorted made every layout decision provisional.

I'd skip about six months of "make it work for arbitrary HTML from the wild" effort. The HTML I actually need to render is HTML I generate from semantic PDF reconstruction. It's a much narrower target than "the entire web." I optimized for the wrong distribution for too long.

I would not change the decision to build the engine. Every senior engineer who told me not to do this was operating from the assumption that the existing libraries are good enough. They aren't, for what I'm trying to do. A project that uses Puppeteer or wkhtmltopdf as a primitive cannot offer structural reflow, period. The architecture forecloses it. The only way to get there is to own the rendering.

The product

I'm building this into ReflowPDF, a browser-based PDF editor where editing a table actually edits a table. Not a floating text box. A real <tr>. Add a row, the layout reflows correctly. Pre-launch, waitlist is at https://reflowpdf.com if any of this resonates.

If you're working on something in this space, or you have a PDF that breaks the reconstruction, my contact is on the site. I want to see the documents that break. Those are the most useful bug reports I get.

The takeaway

If there's a single thing I'd want a developer reading this to leave with, it's this: every PDF tool feels broken in the same way because they all share the same architectural assumption, and the assumption is wrong. Treating a PDF as a finished image with text boxes pasted on top is the wrong layer of abstraction for editing. The right layer is the document structure that produced the PDF in the first place. Reconstructing that structure is hard but tractable. Once you have it, the rest follows.

It took me two years to get here and I'd be lying if I said it was a comfortable two years. But the engine works, the tests pass, and the documents come out the other side as documents and not as div soup. That, I'll take.

DEV Community: Vladyslav Kyslenko