You paste a PDF into ChatGPT, ask a simple question about a number on page 4, and get a confidently wrong answer. The instinct is to blame the model. Usually the real problem is upstream: by the time the model reads your document, the text is already broken.
Here is what actually happens, and how to fix the part you control.
What a PDF becomes when you extract its text
A PDF is a layout format, not a text format. It stores glyphs at coordinates, not sentences. When you (or a library) pull the text out, you get something like this:
Q3 Financial
Re port Revenue4.2M(cid:32)+18%YoY oper ating
marg in 31% ●●● page 1 of 12 confidential——
Look at what happened:
- Words split across line breaks: `Re port`, `oper ating`, `marg in`. To a tokenizer these are now two tokens each, and semantically they are not the words you meant.
- Numbers fused to text: `Revenue4.2M` with no separator.
- Encoding artifacts: `(cid:32)` is a font glyph that never mapped back to a character.
- Page furniture as content: `page 1 of 12`, `confidential`, headers and footers repeated on every page, now interleaved with real sentences.
Feed that to an LLM and ask for the operating margin. It has to guess which number is the answer and which is noise. Guesses become wrong answers.
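If you want a quick way to spot this kind of damage before it reaches the model, a rough diagnostic could look like the sketch below. The function name `extraction_report` and the three patterns are assumptions made for illustration, tuned to the sample above; real documents will need their own patterns.

```python
import re

def extraction_report(text: str) -> dict:
    """Count a few common signs of damaged PDF text extraction."""
    return {
        # Unmapped glyphs that some extractors print as (cid:NN)
        "cid_artifacts": len(re.findall(r"\(cid:\d+\)", text)),
        # A run of letters fused directly to a digit, e.g. "Revenue4.2M"
        "fused_numbers": len(re.findall(r"[A-Za-z]{3,}\d", text)),
        # Page counters such as "page 1 of 12"
        "page_counters": len(
            re.findall(r"\bpage \d+ of \d+\b", text, re.IGNORECASE)
        ),
    }

# A shortened version of the extraction shown earlier
sample = (
    "Q3 Financial\n"
    "Re port Revenue4.2M(cid:32)+18%YoY oper ating\n"
    "marg in 31% page 1 of 12 confidential"
)
print(extraction_report(sample))
# {'cid_artifacts': 1, 'fused_numbers': 1, 'page_counters': 1}
```

Non-zero counts do not prove the text is unusable, but they are a cheap signal that you should read the extracted text yourself before trusting any answer built on it.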
Tables are worse
Most business questions are about tables, and tables are where naive extraction fails hardest. A clean table like:
| Quarter | Revenue | Margin |
|---|---|---|
| Q3 | $4.2M | 31% |
often extracts as a flat run of numbers with the column structure gone:
Quarter Revenue Margin Q3 4.2 31 Q2 3.9 28
Now "which quarter had 31% margin" is genuinely ambiguous to the model, because the row/column relationship that carried the meaning is gone.
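To make the point concrete, here is a minimal sketch that rebuilds a Markdown table from a flat run like the one above. It rests on two strong assumptions that real documents rarely satisfy: you already know the column count, and no cell contains whitespace. The helper name `flat_to_markdown` is hypothetical.

```python
def flat_to_markdown(cells: list[str], n_cols: int) -> str:
    """Regroup a flat list of cells into a Markdown table with n_cols columns."""
    if n_cols <= 0 or len(cells) % n_cols != 0:
        raise ValueError("cell count does not fit the column count")
    # Slice the flat list into rows of n_cols cells each
    rows = [cells[i:i + n_cols] for i in range(0, len(cells), n_cols)]
    header, body = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "---|" * n_cols,  # separator row
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)

flat = "Quarter Revenue Margin Q3 4.2 31 Q2 3.9 28"
print(flat_to_markdown(flat.split(), n_cols=3))
# | Quarter | Revenue | Margin |
# |---|---|---|
# | Q3 | 4.2 | 31 |
# | Q2 | 3.9 | 28 |
```

Once the rows are back, "which quarter had 31% margin" has exactly one answer. Note that the units (`$`, `M`, `%`) were already lost in the flat text, and this sketch cannot recover them.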
Why this also costs you money
Every token you send is billed, whether it carries meaning or not. Repeated headers and footers, broken hyphenation (which can roughly double the token count of each split word), and layout padding can be 30 to 60 percent of a raw document's tokens. If you send the same document many times (one call per question, or per user in a RAG loop), that waste compounds fast.
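A back-of-the-envelope calculation shows how quickly this adds up. The numbers below are hypothetical, chosen only to illustrate the arithmetic; plug in your own token counts and call volume.

```python
def wasted_tokens(raw_tokens: int, noise_fraction: float, calls: int) -> int:
    """Tokens spent on non-content when one document is sent `calls` times."""
    if not 0.0 <= noise_fraction <= 1.0:
        raise ValueError("noise_fraction must be between 0 and 1")
    return round(raw_tokens * noise_fraction) * calls

# Hypothetical: a 20,000-token raw document, 40% noise, sent 500 times
print(wasted_tokens(20_000, 0.40, 500))  # 4000000
```

Multiply the result by your provider's per-token input price to turn it into a cost.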
How to fix the input
You do not need a smarter model. You need cleaner input. In order of impact:
- Strip page furniture. Remove repeated headers, footers, page numbers and watermarks before the model sees them.
- Rejoin broken words. Fix hyphenated line breaks so `oper\nating` becomes `operating`. One word, one token.
- Reconstruct tables as Markdown. A Markdown table keeps rows and columns aligned, and LLMs read it reliably. This single change fixes most "wrong number" answers.
- OCR the scanned pages. If a PDF is image-based, text extraction returns nothing. Run OCR so those pages are not silently empty.
- Measure tokens with a real tokenizer. Count before and after (using the tokenizer your model actually uses, not a word count) so you can see the reduction.
Here is the shape of a minimal pipeline in Python:
text = extract_text(pdf) # your extractor of choice
text = remove_repeated_headers(text)
text = dehyphenate(text) # rejoin words split across lines
text = tables_to_markdown(text) # the high-value step
# now send `text` to the model, and cache it so you don't redo this every call
The hard part is not the pipeline itself. It is `tables_to_markdown` and reliable header removal across the messy variety of real documents. That is where most home-grown pipelines quietly break.
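For a sense of what the two simpler steps involve, here is a minimal sketch of header removal and dehyphenation. It assumes you have the text one string per page (the pipeline above passes a single string, so this is a variation, not the same signature). The names `remove_repeated_lines` and `min_share` are hypothetical, and the heuristics are deliberately naive.

```python
import re
from collections import Counter

def remove_repeated_lines(pages: list[str], min_share: float = 0.6) -> list[str]:
    """Drop lines that recur on at least `min_share` of pages (likely furniture)."""
    def normalize(line: str) -> str:
        # Mask digits so "page 1 of 12" and "page 2 of 12" count as the same line
        return re.sub(r"\d+", "#", line.strip().lower())

    counts = Counter()
    for page in pages:
        # Count each distinct non-blank line once per page
        counts.update({normalize(l) for l in page.splitlines() if l.strip()})

    # Require at least two pages so a single-page document is left untouched
    threshold = max(2, round(min_share * len(pages)))
    furniture = {key for key, n in counts.items() if n >= threshold}
    return [
        "\n".join(l for l in page.splitlines() if normalize(l) not in furniture)
        for page in pages
    ]

def dehyphenate(text: str) -> str:
    # Rejoin a word split by a hyphen at a line end (lowercase on both sides)
    return re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)

pages = [
    "ACME Corp confidential\nRevenue grew in the third quar-\nter.\npage 1 of 3",
    "ACME Corp confidential\nOperating margin was 31%.\npage 2 of 3",
    "ACME Corp confidential\nOutlook is stable.\npage 3 of 3",
]
cleaned = dehyphenate("\n".join(remove_repeated_lines(pages)))
print(cleaned)
# Revenue grew in the third quarter.
# Operating margin was 31%.
# Outlook is stable.
```

Both heuristics have known failure modes. Masking digits can also match table rows that share a shape across pages, and the dehyphenation rule will wrongly merge genuine compounds such as "well-known" when they happen to break at a line end. Treat this as a starting point to test against your own documents.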
The takeaway
The next time an LLM gives you a wrong answer from a document, check the extracted text before blaming the model. Nine times out of ten, the text was already broken. Fix the input and the output follows.
I got tired of maintaining this preprocessing by hand, so I built PackForAI to do it: it converts PDF, Word, Excel, PowerPoint, CSV and JSON into clean, compact Markdown, reconstructs tables, recovers scanned pages with OCR, and shows the token count before and after. There is a free tier and a REST API. If you deal with documents and LLMs, it might save you the pipeline. Feedback welcome, especially on formats to add next.