<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: İbrahim tok</title>
    <description>The latest articles on DEV Community by İbrahim tok (@ibrahim_tok_634ace81a8b6).</description>
    <link>https://dev.to/ibrahim_tok_634ace81a8b6</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4013379%2F5377c5a0-92d0-4851-a40d-61894d72594e.png</url>
      <title>DEV Community: İbrahim tok</title>
      <link>https://dev.to/ibrahim_tok_634ace81a8b6</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ibrahim_tok_634ace81a8b6"/>
    <language>en</language>
    <item>
      <title>Why ChatGPT and Claude give wrong answers from your PDFs (and how to fix the input)</title>
      <dc:creator>İbrahim tok</dc:creator>
      <pubDate>Fri, 03 Jul 2026 09:50:12 +0000</pubDate>
      <link>https://dev.to/ibrahim_tok_634ace81a8b6/why-chatgpt-and-claude-give-wrong-answers-from-your-pdfs-and-how-to-fix-the-input-2oll</link>
      <guid>https://dev.to/ibrahim_tok_634ace81a8b6/why-chatgpt-and-claude-give-wrong-answers-from-your-pdfs-and-how-to-fix-the-input-2oll</guid>
      <description>&lt;p&gt;You paste a PDF into ChatGPT, ask a simple question about a number on page 4, and get a confidently wrong answer. The instinct is to blame the model. Usually the real problem is upstream: by the time the model reads your document, the text is already broken.&lt;/p&gt;

&lt;p&gt;Here is what actually happens, and how to fix the part you control.&lt;/p&gt;

&lt;h2&gt;What a PDF becomes when you extract its text&lt;/h2&gt;

&lt;p&gt;A PDF is a layout format, not a text format. It stores glyphs at coordinates, not sentences. When you (or a library) pull the text out, you get something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Q3 Financial
Re port  Revenue4.2M(cid:32)+18%YoY oper ating
marg in 31% ●●● page 1 of 12 confidential——
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look at what happened:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Words split across line breaks&lt;/strong&gt;: &lt;code&gt;Re port&lt;/code&gt;, &lt;code&gt;oper ating&lt;/code&gt;, &lt;code&gt;marg in&lt;/code&gt;. A tokenizer now sees two unrelated fragments instead of one word, so the meaning you intended is gone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Numbers fused to text&lt;/strong&gt;: &lt;code&gt;Revenue4.2M&lt;/code&gt; with no separator.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Encoding artifacts&lt;/strong&gt;: &lt;code&gt;(cid:32)&lt;/code&gt; is a raw character ID that the extractor could not map back to a Unicode character.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Page furniture as content&lt;/strong&gt;: &lt;code&gt;page 1 of 12&lt;/code&gt;, &lt;code&gt;confidential&lt;/code&gt;, headers and footers repeated on every page, now interleaved with real sentences.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Feed that to an LLM and ask for the operating margin. It has to work out that the fragments &lt;code&gt;oper ating&lt;/code&gt; and &lt;code&gt;marg in 31%&lt;/code&gt; belong together and that everything around them is noise. Guesses become wrong answers.&lt;/p&gt;

&lt;h2&gt;Tables are worse&lt;/h2&gt;

&lt;p&gt;Many business questions are about numbers in tables, and tables are where naive extraction fails hardest. A clean table like:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Quarter&lt;/th&gt;
&lt;th&gt;Revenue&lt;/th&gt;
&lt;th&gt;Margin&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Q3&lt;/td&gt;
&lt;td&gt;$4.2M&lt;/td&gt;
&lt;td&gt;31%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q2&lt;/td&gt;
&lt;td&gt;$3.9M&lt;/td&gt;
&lt;td&gt;28%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;often extracts as a flat run of numbers with the column structure gone:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csvs"&gt;&lt;code&gt;&lt;span class="k"&gt;Quarter&lt;/span&gt; &lt;span class="k"&gt;Revenue&lt;/span&gt; &lt;span class="k"&gt;Margin&lt;/span&gt; &lt;span class="k"&gt;Q&lt;/span&gt;&lt;span class="mf"&gt;3&lt;/span&gt; &lt;span class="mf"&gt;4.2&lt;/span&gt; &lt;span class="mf"&gt;31&lt;/span&gt; &lt;span class="k"&gt;Q&lt;/span&gt;&lt;span class="mf"&gt;2&lt;/span&gt; &lt;span class="mf"&gt;3.9&lt;/span&gt; &lt;span class="mf"&gt;28&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now "which quarter had 31% margin" is genuinely ambiguous to the model, because the row/column relationship that carried the meaning is gone.&lt;/p&gt;

&lt;h2&gt;Why this also costs you money&lt;/h2&gt;

&lt;p&gt;Every token you send is billed, whether it carries meaning or not. Repeated headers and footers, words broken across lines (each fragment is tokenized separately, so a split word usually costs more tokens than the intact one), and layout padding can add up to &lt;strong&gt;30 to 60 percent&lt;/strong&gt; of a raw document's tokens. If you send the same document many times (one call per question, or per user in a RAG loop), that waste is multiplied by every call.&lt;/p&gt;
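
&lt;p&gt;To see how that adds up, here is a back-of-the-envelope sketch. The document size, call count and price per million tokens are made-up placeholders, not real pricing, and &lt;code&gt;wasted_cost&lt;/code&gt; is a hypothetical helper; only the 30 to 60 percent range comes from the paragraph above.&lt;/p&gt;

```python
def wasted_cost(doc_tokens, waste_fraction, calls, usd_per_million_tokens):
    """Return the dollars spent on tokens that carry no meaning."""
    wasted_tokens = doc_tokens * waste_fraction * calls
    return wasted_tokens * usd_per_million_tokens / 1_000_000


# Hypothetical numbers: a 40,000-token document sent 500 times
# at an assumed price of 3 USD per million input tokens.
for fraction in (0.30, 0.60):
    cost = wasted_cost(40_000, fraction, calls=500, usd_per_million_tokens=3.0)
    print(f"{fraction:.0%} waste: {cost:.2f} USD spent on noise")
```

&lt;p&gt;Swap in your own document size, traffic and your provider's current price to get a figure that means something for your workload.&lt;/p&gt;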

&lt;h2&gt;How to fix the input&lt;/h2&gt;

&lt;p&gt;You do not need a smarter model. You need cleaner input. In order of impact:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Strip page furniture.&lt;/strong&gt; Remove repeated headers, footers, page numbers and watermarks before the model sees them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rejoin broken words.&lt;/strong&gt; Fix hyphenated line breaks so &lt;code&gt;oper-\nating&lt;/code&gt; becomes &lt;code&gt;operating&lt;/code&gt;. The intact word usually takes fewer tokens than its fragments and carries the meaning you intended.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reconstruct tables as Markdown.&lt;/strong&gt; A Markdown table keeps rows and columns aligned, and LLMs read it reliably. This single change fixes most "wrong number" answers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OCR the scanned pages.&lt;/strong&gt; If a page is a scanned image, text extraction returns nothing for it. Run OCR so those pages are not silently empty.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measure tokens with a real tokenizer.&lt;/strong&gt; Count before and after (using the tokenizer your model actually uses, not a word count) so you can see the reduction.&lt;/li&gt;
&lt;/ol&gt;
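
&lt;p&gt;As a rough illustration of steps 1 and 2, here is a minimal sketch using only the standard library. It assumes you already have one string per page, and it uses a deliberately simple rule: a line is page furniture only if it appears on every page once digits are masked. &lt;code&gt;strip_page_furniture&lt;/code&gt; is a hypothetical name, and real documents will need fuzzier matching than this.&lt;/p&gt;

```python
import re
from collections import Counter


def strip_page_furniture(pages):
    """Drop lines that repeat on every page (headers, footers, page numbers).

    Simplifying assumption: a line counts as furniture only if it appears
    on all pages once digits are masked, so "page 1 of 12" matches "page 2 of 12".
    """
    if len(pages) in (0, 1):
        return list(pages)  # nothing to compare against

    def key(line):
        return re.sub(r"\d+", "#", line.strip())

    seen_on = Counter()
    for page in pages:
        # Count each distinct line once per page.
        seen_on.update({key(line) for line in page.splitlines() if line.strip()})

    furniture = {k for k, n in seen_on.items() if n == len(pages)}
    return [
        "\n".join(line for line in page.splitlines() if key(line) not in furniture)
        for page in pages
    ]


def dehyphenate(text):
    """Rejoin words hyphenated across a line break: "oper-\\nating" becomes "operating".

    Caveat: this also merges genuine compounds such as "well-\\nknown".
    """
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


pages = [
    "ACME Q3 Report\nRevenue grew and oper-\nating margin rose.\npage 1 of 2",
    "ACME Q3 Report\nCosts were flat.\npage 2 of 2",
]
cleaned = dehyphenate("\n".join(strip_page_furniture(pages)))
print(cleaned)
```

&lt;p&gt;The order matters: furniture is removed first so that a footer sitting between two halves of a hyphenated word does not block the rejoin.&lt;/p&gt;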

&lt;p&gt;Here is the shape of a minimal pipeline in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;            &lt;span class="c1"&gt;# your extractor of choice
&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;remove_repeated_headers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;dehyphenate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;            &lt;span class="c1"&gt;# rejoin words split across lines
&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tables_to_markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c1"&gt;# the high-value step
# now send `text` to the model, and cache it so you don't redo this every call
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The hard part is not the loop, it is &lt;code&gt;tables_to_markdown&lt;/code&gt; and reliable header removal across the messy variety of real documents. That is where most home-grown pipelines quietly break.&lt;/p&gt;
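
&lt;p&gt;To make the target format concrete, here is a toy sketch that turns the flattened run from the table example back into a Markdown table. The helpers &lt;code&gt;regroup&lt;/code&gt; and &lt;code&gt;rows_to_markdown&lt;/code&gt; are hypothetical, and the sketch assumes the column count is known and no cell contains a space. Neither assumption holds in general, which is exactly why this step is hard on real documents.&lt;/p&gt;

```python
def rows_to_markdown(header, rows):
    """Render a header and data rows as a Markdown table."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


def regroup(flat, columns):
    """Split a flat run of cells into rows of `columns` cells each.

    Assumes no cell contains whitespace and the column count is known;
    real extractors usually need glyph coordinates to recover this.
    """
    cells = flat.split()
    return [cells[i:i + columns] for i in range(0, len(cells), columns)]


header, *rows = regroup("Quarter Revenue Margin Q3 4.2 31 Q2 3.9 28", columns=3)
print(rows_to_markdown(header, rows))
```

&lt;p&gt;Once each value sits in a row next to its quarter, the question "which quarter had 31% margin" has exactly one reading again.&lt;/p&gt;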

&lt;h2&gt;The takeaway&lt;/h2&gt;

&lt;p&gt;The next time an LLM gives you a wrong answer from a document, check the extracted text before blaming the model. More often than not, the text was already broken. Fix the input and the output follows.&lt;/p&gt;
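
&lt;p&gt;Checking the extracted text can be partly automated. The sketch below flags three of the symptoms described earlier: empty pages, &lt;code&gt;(cid:N)&lt;/code&gt; markers, and words fused to decimal numbers. The function name &lt;code&gt;extraction_warnings&lt;/code&gt; and the regular expressions are illustrative assumptions, not a complete detector.&lt;/p&gt;

```python
import re


def extraction_warnings(pages):
    """Return human-readable warnings about a list of extracted page texts."""
    warnings = []
    for number, text in enumerate(pages, start=1):
        if not text.strip():
            warnings.append(f"page {number}: no text, probably a scan that needs OCR")
        cids = re.findall(r"\(cid:\d+\)", text)
        if cids:
            warnings.append(f"page {number}: {len(cids)} unmapped glyph marker(s)")
        # Heuristic: three or more letters immediately followed by a decimal number.
        fused = re.findall(r"[A-Za-z]{3,}\d+\.\d+", text)
        if fused:
            warnings.append(f"page {number}: text fused to numbers: {fused}")
    return warnings


sample_pages = ["Re port  Revenue4.2M(cid:32)+18%YoY oper ating", ""]
for warning in extraction_warnings(sample_pages):
    print(warning)
```

&lt;p&gt;An empty list does not prove the text is clean, but any warning is a good reason to look at the raw extraction before asking the model anything.&lt;/p&gt;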




&lt;p&gt;I got tired of maintaining this preprocessing by hand, so I built &lt;a href="https://packforai.com" rel="noopener noreferrer"&gt;&lt;strong&gt;PackForAI&lt;/strong&gt;&lt;/a&gt; to do it: it converts PDF, Word, Excel, PowerPoint, CSV and JSON into clean, compact Markdown, reconstructs tables, recovers scanned pages with OCR, and shows the token count before and after. There is a free tier and a REST API. If you deal with documents and LLMs, it might save you the pipeline. Feedback welcome, especially on formats to add next.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>chatgpt</category>
      <category>llm</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
