DEV Community

İbrahim tok


Why ChatGPT and Claude give wrong answers from your PDFs (and how to fix the input)

You paste a PDF into ChatGPT, ask a simple question about a number on page 4, and get a confidently wrong answer. The instinct is to blame the model. Usually the real problem is upstream: by the time the model reads your document, the text is already broken.

Here is what actually happens, and how to fix the part you control.

What a PDF becomes when you extract its text

A PDF is a layout format, not a text format. It stores glyphs at coordinates, not sentences. When you (or a library) pull the text out, you get something like this:

Q3 Financial
Re port  Revenue4.2M(cid:32)+18%YoY oper ating
marg in 31% ●●● page 1 of 12 confidential——

Look at what happened:

  • Words broken apart, typically where a line wrapped in the original layout: Re port, oper ating, marg in. A tokenizer now sees two unrelated fragments where there was one word, so the model is no longer reading the word you meant.
  • Numbers fused to text: Revenue4.2M with no separator.
  • Encoding artifacts: (cid:32) is a glyph ID that the extractor could not map back to a Unicode character.
  • Page furniture as content: page 1 of 12, confidential, headers and footers repeated on every page, now interleaved with real sentences.

Feed that to an LLM and ask for the operating margin. It has to reassemble oper ating marg in into a phrase, then decide whether 31% or +18% is the number attached to it. Guesses become wrong answers.
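A quick way to see how damaged your own extraction is: count the telltale patterns before you send anything. This is a minimal sketch, and the three regular expressions (and the name `damage_report`) are my own illustrative heuristics, not a standard set. The sample string is adapted from the extraction shown above.

```python
import re

# Heuristic patterns for the kinds of damage described above.
# These are illustrative assumptions, not an exhaustive list.
DAMAGE_PATTERNS = {
    # Glyph IDs the extractor could not map to a character, e.g. "(cid:32)".
    "unmapped_glyphs": re.compile(r"\(cid:\d+\)"),
    # A lowercase word running straight into a digit, e.g. "Revenue4.2M".
    "fused_number": re.compile(r"[a-z]{3}\d"),
    # Page counters such as "page 1 of 12".
    "page_furniture": re.compile(r"page \d+ of \d+", re.IGNORECASE),
}


def damage_report(text: str) -> dict[str, int]:
    """Count how often each heuristic pattern matches in the extracted text."""
    return {name: len(p.findall(text)) for name, p in DAMAGE_PATTERNS.items()}


sample = (
    "Q3 Financial\n"
    "Re port  Revenue4.2M(cid:32)+18%YoY oper ating\n"
    "marg in 31% page 1 of 12 confidential\n"
)
report = damage_report(sample)
print(report)
```

Non-zero counts do not prove the answer will be wrong, but they are a cheap signal that the text needs cleaning first. Split words like oper ating are harder to detect with a regex and usually need a dictionary check.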

Tables are worse

Most business questions are about tables, and tables are where naive extraction fails hardest. A clean table like:

| Quarter | Revenue | Margin |
| --- | --- | --- |
| Q3 | $4.2M | 31% |
| Q2 | $3.9M | 28% |

often extracts as a flat run of numbers with the column structure gone:

Quarter Revenue Margin Q3 4.2 31 Q2 3.9 28

Now "which quarter had 31% margin" is genuinely ambiguous to the model, because the row/column relationship that carried the meaning is gone.
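To make the contrast concrete, here is a small sketch that renders the same data as a Markdown table, so each value stays attached to its row and column. It assumes you already have the cells as structured rows (getting them out of the PDF is the hard part and is not shown). The function name `rows_to_markdown` is hypothetical, and the units on the Q2 row are assumed from the flattened output above.

```python
def rows_to_markdown(header: list[str], rows: list[list[str]]) -> str:
    """Render a header and rows as a Markdown pipe table."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        # A row with the wrong width would silently shift values into
        # the wrong column, which is exactly the failure we want to avoid.
        if len(row) != len(header):
            raise ValueError(f"row {row!r} does not match header width")
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)


table = rows_to_markdown(
    ["Quarter", "Revenue", "Margin"],
    [["Q3", "$4.2M", "31%"], ["Q2", "$3.9M", "28%"]],  # Q2 units are assumed
)
print(table)
```

With the table in this form, "which quarter had 31% margin" has one readable answer: the row that starts with Q3.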

Why this also costs you money

Every token you send is billed, whether it carries meaning or not. Repeated headers and footers, broken hyphenation (a split word usually costs more tokens than the intact one), and layout padding can be 30 to 60 percent of a raw document's tokens. If you send the same document many times (one call per question, or per user in a RAG loop), that waste is multiplied by every call.
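The arithmetic is simple enough to sketch. Every number below is hypothetical: the document size, the 40 percent noise share, the call count, and the price per million input tokens are placeholders to swap for your own measurements and your provider's actual pricing.

```python
def wasted_tokens(raw_tokens: int, waste_fraction: float, calls: int) -> int:
    """Tokens billed across all calls that carry no meaning."""
    if not 0.0 <= waste_fraction <= 1.0:
        raise ValueError("waste_fraction must be between 0 and 1")
    return round(raw_tokens * waste_fraction) * calls


# Hypothetical inputs: a 20,000-token raw document, 40% noise, 500 calls.
wasted = wasted_tokens(raw_tokens=20_000, waste_fraction=0.40, calls=500)

# Hypothetical price: 3.00 currency units per million input tokens.
price_per_million = 3.0
cost = wasted / 1_000_000 * price_per_million

print(wasted, cost)
```

Under these assumed numbers, 8,000 wasted tokens per call become 4,000,000 over 500 calls. The per-call waste looks small; the total does not.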

How to fix the input

You do not need a smarter model. You need cleaner input. In order of impact:

  1. Strip page furniture. Remove repeated headers, footers, page numbers and watermarks before the model sees them.
  2. Rejoin broken words. Fix hyphenated line breaks so oper-\nating becomes operating, which the tokenizer can then encode as the intact word.
  3. Reconstruct tables as Markdown. A Markdown table keeps rows and columns aligned, and LLMs read it reliably. This single change fixes most "wrong number" answers.
  4. OCR the scanned pages. If a PDF is image-based, text extraction returns nothing. Run OCR so those pages are not silently empty.
  5. Measure tokens with a real tokenizer. Count before and after (using the tokenizer your model actually uses, not a word count) so you can see the reduction.
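Steps 2, 4 and 5 are small enough to sketch directly. Treat this as a starting point under stated assumptions: all three function names are hypothetical, the 20-character threshold for "probably a scanned page" is arbitrary, and `count_tokens` is whatever counting function your model's tokenizer provides. The whitespace counter in the demo is a stand-in only, so the example runs without extra dependencies.

```python
import re
from typing import Callable


def dehyphenate(text: str) -> str:
    """Rejoin words hyphenated across a line break: 'oper-\\nating' -> 'operating'.

    Limitation: a real compound broken at its hyphen ('well-\\nknown')
    is also joined, so a dictionary check may be needed in practice.
    """
    return re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)


def pages_needing_ocr(page_texts: list[str], min_chars: int = 20) -> list[int]:
    """Return 1-based numbers of pages whose extracted text is nearly empty."""
    return [
        number
        for number, page in enumerate(page_texts, start=1)
        if len(page.strip()) < min_chars
    ]


def token_reduction(
    before: str, after: str, count_tokens: Callable[[str], int]
) -> float:
    """Fraction of tokens removed by cleaning, using the supplied counter."""
    n_before = count_tokens(before)
    if n_before == 0:
        return 0.0
    return 1 - count_tokens(after) / n_before


raw = "The oper-\nating margin was 31%.\n"
clean = dehyphenate(raw)

empty_pages = pages_needing_ocr(
    ["Q3 Financial Report, revenue and margin", "", "  \n"]
)


def whitespace_count(s: str) -> int:
    """Stand-in counter for the demo; pass your real tokenizer instead."""
    return len(s.split())


saved = token_reduction(raw, clean, whitespace_count)
print(repr(clean), empty_pages, round(saved, 3))
```

Passing the counter in as a parameter keeps the measurement honest: the same function reports the reduction for whichever tokenizer your model actually uses.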

Here is the shape of a minimal pipeline in Python:

def prepare_for_llm(pdf):
    # The four helpers are placeholders: plug in your own implementations.
    text = extract_text(pdf)              # your extractor of choice
    text = remove_repeated_headers(text)  # strip page furniture
    text = dehyphenate(text)              # rejoin words split across lines
    text = tables_to_markdown(text)       # the high-value step
    # Send the result to the model, and cache it so you don't redo this every call.
    return text

The hard part is not the loop, it is tables_to_markdown and reliable header removal across the messy variety of real documents. That is where most home-grown pipelines quietly break.
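To show why header removal gets messy, here is one common heuristic, sketched under assumptions: it takes the text per page (not one merged string), treats any line that recurs on most pages as furniture, and masks digits so that page counters match each other. The names `remove_repeated_lines` and `_normalize` and the 60 percent threshold are my own choices for illustration.

```python
import re
from collections import Counter


def _normalize(line: str) -> str:
    """Lowercase, trim, and mask digits so 'page 1 of 3' matches 'page 2 of 3'."""
    return re.sub(r"\d+", "#", line.strip().lower())


def remove_repeated_lines(pages: list[str], min_share: float = 0.6) -> list[str]:
    """Drop lines that recur on at least `min_share` of the pages."""
    if len(pages) < 2:
        return list(pages)  # nothing to compare against

    seen = Counter()
    for page in pages:
        # Count each distinct non-blank line once per page.
        seen.update({_normalize(l) for l in page.splitlines() if l.strip()})

    threshold = min_share * len(pages)
    furniture = {line for line, n in seen.items() if n >= threshold}

    return [
        "\n".join(l for l in page.splitlines() if _normalize(l) not in furniture)
        for page in pages
    ]


pages = [
    "ACME Corp confidential\nRevenue grew 18%.\npage 1 of 3",
    "ACME Corp confidential\nOperating margin was 31%.\npage 2 of 3",
    "ACME Corp confidential\nOutlook is stable.\npage 3 of 3",
]
cleaned = remove_repeated_lines(pages)
print(cleaned)
```

The weak spot is visible in the digit masking: table rows that differ only in their numbers normalize to the same string, so a table spanning many pages can be mistaken for furniture. That kind of edge case is what breaks home-grown pipelines.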

The takeaway

The next time an LLM gives you a wrong answer from a document, check the extracted text before blaming the model. More often than not, the text was already broken. Fix the input and the output follows.


I got tired of maintaining this preprocessing by hand, so I built PackForAI to do it: it converts PDF, Word, Excel, PowerPoint, CSV and JSON into clean, compact Markdown, reconstructs tables, recovers scanned pages with OCR, and shows the token count before and after. There is a free tier and a REST API. If you deal with documents and LLMs, it might save you the pipeline. Feedback welcome, especially on formats to add next.
