Saurabh Shah

PDF to LaTeX Conversion: Why It's Hard and What Actually Works

Why automated PDF to LaTeX tools produce unusable output - and the right approach for academic documents.

PDF to LaTeX is a harder problem than Word to LaTeX, and it's worth understanding why before you spend time trying to automate it.

At The LaTeX Lab, PDF conversions make up a significant portion of the projects we handle - researchers who only have a final PDF of their paper, no original source file. Here's what we've learned about where the process breaks and what to do about it.

Why PDFs Are a Poor Source for LaTeX Reconstruction

A PDF is a rendering format. It stores instructions for placing glyphs on a page at precise coordinates. It does not store:

  • Semantic structure (what is a heading vs. body text)
  • Mathematical relationships (what is a fraction, what is a subscript, what is an operator)
  • Table structure (where rows and columns begin and end)
  • Bibliography metadata (author, journal, DOI - just the rendered string)

When a PDF to LaTeX converter processes a document, it's reverse-engineering rendered output back into structural markup. For plain body text, this works tolerably. For everything else, it doesn't.
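To see why, consider what a converter actually receives: a flat list of glyphs with page coordinates, nothing more. A minimal sketch (hypothetical glyph data; a real pipeline would extract it with a library like pdfminer.six) of the very first step every tool performs, grouping glyphs into text lines by y-coordinate:

```python
# Each glyph is (character, x, y): position is all a PDF guarantees.
# Hypothetical data standing in for real extraction output.
glyphs = [
    ("H", 72, 700), ("i", 80, 700),
    ("x", 72, 680), ("=", 82, 680), ("2", 92, 680),
]

def group_into_lines(glyphs, y_tolerance=2):
    """Cluster glyphs into lines by y-coordinate, then sort each line by x."""
    lines = {}
    for ch, x, y in glyphs:
        # Snap y to the nearest already-seen line within tolerance.
        key = next((k for k in lines if abs(k - y) <= y_tolerance), y)
        lines.setdefault(key, []).append((x, ch))
    # Highest y first (PDF coordinates grow upward), left to right within a line.
    return ["".join(ch for _, ch in sorted(line))
            for _, line in sorted(lines.items(), reverse=True)]

print(group_into_lines(glyphs))  # ['Hi', 'x=2']
```

Even after this step succeeds, whether "x=2" is body text, a display equation, or a table cell is information the PDF never contained; everything past line grouping is guesswork.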

What Automated PDF to LaTeX Tools Actually Produce

Take a simple display equation rendered in a PDF - say, a fraction inside a summation. In the PDF, this is a set of glyph positions: a sigma character at coordinate (x1, y1), a fraction bar at (x2, y2), numerator glyphs above it, denominator below. There's no metadata saying "this is a summation with lower limit n=0 and upper limit N."

An automated converter has to guess the mathematical structure from the spatial arrangement of glyphs. For simple equations it sometimes guesses correctly. For anything involving:

  • Nested fractions
  • Matrix environments
  • Aligned multi-line derivations
  • Custom operators or symbols
  • Subscripts and superscripts on top of each other

...the output ranges from wrong to completely absent. The equation either gets skipped, rendered as an image extracted from the PDF, or reconstructed incorrectly.

% What pdf2latex-style tools give you for a complex equation:
\includegraphics[width=0.8\textwidth]{eq_extracted_01.png}
% Or worse, nothing at all.

% What the equation actually needs to be:
\begin{equation}
  \hat{y} = \sum_{n=0}^{N} w_n \cdot \phi\!\left(\frac{x - \mu_n}{\sigma_n}\right)
\end{equation}
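The guessing is all heuristics on baselines and font sizes. A toy sketch (hypothetical thresholds, not any real converter's logic) of how a tool might classify a glyph as a subscript or superscript relative to the preceding base glyph:

```python
def classify_script(base_y, base_size, glyph_y, glyph_size):
    """Guess sub/superscript from baseline shift and size ratio.
    Toy thresholds: real converters tune these per font and still
    misfire on nested scripts, operator limits, and stacked indices."""
    shifted_up = glyph_y > base_y + 0.3 * base_size
    shifted_down = glyph_y < base_y - 0.2 * base_size
    smaller = glyph_size < 0.8 * base_size
    if smaller and shifted_up:
        return "superscript"
    if smaller and shifted_down:
        return "subscript"
    return "inline"

# x^2: the '2' sits above the baseline at 70% size
print(classify_script(base_y=100, base_size=10, glyph_y=104, glyph_size=7))
# w_n: the 'n' sits below the baseline
print(classify_script(base_y=100, base_size=10, glyph_y=97, glyph_size=7))
```

An expression like `\sum_{n=0}^{N}` defeats this immediately: positionally, the limits above and below the sigma look exactly like a superscript and subscript on one oversized glyph.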

Tables: Even Harder Than Equations

Table reconstruction from PDF is genuinely unsolved as an automated problem. The PDF stores each cell's text as positioned glyphs - it has no concept of rows, columns, or cell boundaries except as inferred from whitespace.

For simple two-column tables with clear spacing, automated tools produce something usable. For tables with:

  • Merged cells (colspan/rowspan)
  • Ruled lines between specific rows
  • Multi-line cell content
  • Rotated headers

...the output is a jumbled list of strings, not a table. We've never seen an automated tool handle a longtable or a tabularx with multi-line cells correctly.
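The core difficulty shows up even in miniature: columns exist only as clusters of x-coordinates, so a cell merged across columns has nowhere to go. A minimal sketch with hypothetical cell positions:

```python
def infer_columns(cells, x_tolerance=15):
    """Cluster cell left-edges into columns. Works for regular grids;
    breaks as soon as a cell spans columns or spacing is irregular."""
    columns = []  # representative x positions seen so far
    for _text, x in cells:
        if not any(abs(cx - x) <= x_tolerance for cx in columns):
            columns.append(x)
    return sorted(columns)

# A regular two-column table: two clean clusters.
regular = [("Method", 72), ("Score", 200), ("Ours", 72), ("0.91", 200)]
print(infer_columns(regular))  # [72, 200]

# A header cell merged across both columns sits at neither x position,
# so it spawns a phantom third "column" and scrambles the grid.
merged = regular + [("Results on test set", 120)]
print(infer_columns(merged))  # [72, 120, 200]
```

Whitespace clustering has no way to distinguish "a new column" from "a spanning cell", which is exactly why merged cells and multi-line content defeat automated reconstruction.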

The Text Layer Problem on Scanned PDFs

If the PDF was produced by scanning a physical document, there's an additional layer: OCR. Scanned PDFs don't have a text layer at all - they're images. Any text extraction first requires OCR, which introduces its own error rate.

For academic documents with domain-specific notation, OCR error rates on equations are high. A scanned \hbar / 2 is frequently OCR'd as H/2 - the glyphs are visually similar, but the symbols mean entirely different things.

If you're working from a scanned PDF, the realistic path is:

  1. OCR the document with a high-accuracy engine (ABBYY FineReader or similar)
  2. Manually verify all equations and symbols against the original
  3. Reconstruct the LaTeX from scratch using the OCR output as a reference

There's no shortcut here that produces submission-ready output.
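One practical aid for step 2 is to flag tokens that fall into known OCR confusion pairs so a human verifies them against the original. A minimal sketch with a hypothetical (and deliberately tiny) confusion table:

```python
# Visually-similar pairs that OCR engines commonly swap in math contexts.
# Hypothetical, deliberately small table; a real checklist would be longer.
CONFUSABLE = {
    "H": "\\hbar (reduced Planck constant)?",
    "l": "1 (one) or \\ell?",
    "O": "0 (zero)?",
    "x": "\\times (multiplication)?",
}

def flag_suspect_tokens(ocr_text):
    """Return (token, possible_confusion) pairs worth manual verification."""
    return [(tok, CONFUSABLE[tok]) for tok in ocr_text.split()
            if tok in CONFUSABLE]

print(flag_suspect_tokens("E = H w / 2"))
# [('H', '\\hbar (reduced Planck constant)?')]
```

This doesn't fix anything automatically; it just concentrates the human verification effort on the tokens most likely to be wrong.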

What the Right Approach Looks Like

For academic documents, the only reliable approach to PDF to LaTeX is semi-manual reconstruction:

  1. Extract clean body text from the PDF (where the text layer exists)
  2. Read every equation directly from the PDF and typeset it in LaTeX math mode from scratch
  3. Reconstruct every table in booktabs format from scratch
  4. Extract and rebuild the bibliography as a .bib file - either from the PDF's reference list or by looking up each reference in CrossRef/Google Scholar to get clean metadata
  5. Apply the journal or university template and compile

It's time-intensive precisely because the hard parts can't be automated. But the output is a properly structured .tex file that compiles cleanly - not a pile of image fallbacks with broken formatting.
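Step 4 is the most mechanical part of the process. The CrossRef REST API returns reference metadata as JSON; the sketch below converts one such record into a BibTeX entry, using a hardcoded placeholder record in the shape CrossRef's /works endpoint returns (the network call itself, and all real bibliographic data, are omitted):

```python
# Placeholder record in the shape of a CrossRef /works message
# (fields abridged; title and container-title arrive as lists).
record = {
    "title": ["An Example Paper"],
    "author": [{"given": "Jane", "family": "Doe"}],
    "container-title": ["Journal of Examples"],
    "issued": {"date-parts": [[2020]]},
    "DOI": "10.0000/example.doi",
}

def to_bibtex(rec, key):
    """Render a CrossRef work record as a minimal BibTeX @article entry."""
    authors = " and ".join(f"{a['family']}, {a['given']}" for a in rec["author"])
    return "\n".join([
        f"@article{{{key},",
        f"  author  = {{{authors}}},",
        f"  title   = {{{rec['title'][0]}}},",
        f"  journal = {{{rec['container-title'][0]}}},",
        f"  year    = {{{rec['issued']['date-parts'][0][0]}}},",
        f"  doi     = {{{rec['DOI']}}},",
        "}",
    ])

print(to_bibtex(record, "doe2020example"))
```

Looking each reference up this way yields clean metadata instead of whatever the PDF's rendered reference list happened to contain, which is why rebuilding the .bib file from CrossRef beats scraping the bibliography text.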

The LaTeX Lab handles PDF to LaTeX conversion for academic papers - every equation typeset in math mode from scratch, tables rebuilt, bibliography reconstructed as clean BibTeX. Get a quote here.
