<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Simone Cocca</title>
    <description>The latest articles on DEV Community by Simone Cocca (@simonec_dev).</description>
    <link>https://dev.to/simonec_dev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4000628%2F6f89b9e6-9f75-47fb-90e4-3bf04665c8ca.jpeg</url>
      <title>DEV Community: Simone Cocca</title>
      <link>https://dev.to/simonec_dev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/simonec_dev"/>
    <language>en</language>
    <item>
      <title>How to Fix PDF Table Duplication in RAG / LLM Pipelines (Python)</title>
      <dc:creator>Simone Cocca</dc:creator>
      <pubDate>Wed, 24 Jun 2026 13:17:27 +0000</pubDate>
      <link>https://dev.to/simonec_dev/how-to-fix-pdf-table-duplication-in-rag-llm-pipelines-python-5fii</link>
      <guid>https://dev.to/simonec_dev/how-to-fix-pdf-table-duplication-in-rag-llm-pipelines-python-5fii</guid>
      <description>&lt;p&gt;Building RAG (Retrieval-Augmented Generation) pipelines is a great way to supercharge LLMs with custom data. However, if your pipeline relies on parsing standard PDFs, you've probably hit a massive roadblock: &lt;strong&gt;table text duplication&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When you combine general text extraction with table extraction, most open-source PDF parsers give you the table data twice. First it appears in the page text as a messy, misaligned block of prose. Then it appears again as the raw strings pulled from the table cells.&lt;/p&gt;

&lt;p&gt;This duplication obscures the document layout for the LLM and can inflate your token usage by as much as 3x or 4x.&lt;/p&gt;

&lt;p&gt;Here is how I solved this issue in Python, and how you can implement the same logic in your data pipelines.&lt;/p&gt;




&lt;h2&gt;The Strategy: Bounding-Box Masking&lt;/h2&gt;

&lt;p&gt;Instead of running a blind text extraction across the entire page, split the logic into a coordinated three-step process using a library like &lt;code&gt;pdfplumber&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Table Detection:&lt;/strong&gt; Locate the exact coordinates (&lt;code&gt;bbox&lt;/code&gt;) of every table on the PDF page.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Markdown Conversion:&lt;/strong&gt; Extract the data inside those coordinates and format it into clean, structured GitHub-Flavored Markdown tables (&lt;code&gt;|---|---|&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Masking Trick:&lt;/strong&gt; Before running the general text extraction on the page, you must dynamically crop or filter out the characters falling inside those table bounding boxes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By masking those areas, the final text stream contains clean prose and perfectly structured Markdown tables, with zero duplicate strings.&lt;/p&gt;
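&lt;p&gt;The three steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the exact code behind the hosted services. The helper names (&lt;code&gt;center_in_bbox&lt;/code&gt;, &lt;code&gt;outside_tables&lt;/code&gt;, &lt;code&gt;rows_to_markdown&lt;/code&gt;, &lt;code&gt;page_to_text&lt;/code&gt;, &lt;code&gt;pdf_to_text&lt;/code&gt;) are hypothetical. The sketch relies on pdfplumber's &lt;code&gt;find_tables()&lt;/code&gt;, &lt;code&gt;Page.filter()&lt;/code&gt; and &lt;code&gt;extract_text()&lt;/code&gt;, treats the first row of each table as its header, and appends the tables after the prose instead of restoring reading order.&lt;/p&gt;

```python
def center_in_bbox(obj, bbox):
    """Return True when the object's center point lies inside bbox.

    bbox uses pdfplumber's (x0, top, x1, bottom) order; obj is a pdfplumber
    object dict (for example a char) with x0, x1, top and bottom keys.
    """
    x0, top, x1, bottom = bbox
    cx = (obj["x0"] + obj["x1"]) / 2
    cy = (obj["top"] + obj["bottom"]) / 2
    # Inclusive containment check: clamping a value into [low, high]
    # leaves it unchanged only if it was already inside that range.
    return min(max(cx, x0), x1) == cx and min(max(cy, top), bottom) == cy


def outside_tables(obj, table_bboxes):
    """Step 3 predicate: keep only objects that fall outside every table."""
    return not any(center_in_bbox(obj, bbox) for bbox in table_bboxes)


def rows_to_markdown(rows):
    """Step 2: format rows of cells as a GitHub-Flavored Markdown table.

    Assumption: the first row is the header. Cells may be None.
    """
    def clean(cell):
        # Newlines and unescaped pipes would break a Markdown table row.
        return (cell or "").replace("\n", " ").replace("|", "\\|").strip()

    if not rows:
        return ""
    width = max(len(row) for row in rows)
    padded = [[clean(c) for c in row] + [""] * (width - len(row)) for row in rows]
    lines = ["| " + " | ".join(padded[0]) + " |", "|" + "---|" * width]
    lines += ["| " + " | ".join(row) + " |" for row in padded[1:]]
    return "\n".join(lines)


def page_to_text(page):
    """Combine masked prose and Markdown tables for one pdfplumber page."""
    tables = page.find_tables()                    # step 1: table detection
    bboxes = [table.bbox for table in tables]
    # Page.filter keeps only the objects for which the callback returns True.
    masked = page.filter(lambda obj: outside_tables(obj, bboxes))
    prose = (masked.extract_text() or "").strip()
    markdown = [rows_to_markdown(table.extract()) for table in tables]
    # Simplification: tables are appended after the prose, not interleaved.
    return "\n\n".join(part for part in [prose, *markdown] if part)


def pdf_to_text(path):
    """Run page_to_text over every page of the PDF at path."""
    import pdfplumber  # third-party; imported lazily so the helpers above run without it

    with pdfplumber.open(path) as pdf:
        return "\n\n".join(page_to_text(page) for page in pdf.pages)
```

&lt;p&gt;Calling &lt;code&gt;pdf_to_text("report.pdf")&lt;/code&gt; (a placeholder file name) would return one string per document. Testing the center of each object, instead of requiring its whole box to sit inside the table, is a design choice that tolerates characters slightly overlapping the table border.&lt;/p&gt;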




&lt;h2&gt;Production-Ready Implementation&lt;/h2&gt;

&lt;p&gt;If you don't want to spend days writing custom bounding-box filters, handling PDF edge cases, and chasing memory leaks on serverless infrastructure, I have wrapped this exact architecture into two hosted microservices.&lt;/p&gt;

&lt;p&gt;I published them on RapidAPI with a &lt;strong&gt;permanent free tier&lt;/strong&gt; so you can stress-test them with your own pipelines:&lt;/p&gt;

&lt;h3&gt;1. 📄 Universal PDF to Clean Markdown API&lt;/h3&gt;

&lt;p&gt;This endpoint processes the PDF entirely in memory, applies the bounding-box masking logic described above, and returns clean Markdown with headers and nested lists properly formatted.&lt;br&gt;
👉 &lt;a href="https://rapidapi.com/SimoneC31/api/universal-pdf-to-clean-markdown-converter" rel="noopener noreferrer"&gt;Test the PDF Parser Endpoint Here&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;2. ✂️ LLM Token Optimizer &amp;amp; Cleaner API&lt;/h3&gt;

&lt;p&gt;A fast companion utility that strips formatting artifacts, excess whitespace, and system noise from raw text strings, shrinking your final prompt payload before you send it to OpenAI or Claude.&lt;br&gt;
👉 &lt;a href="https://rapidapi.com/SimoneC31/api/llm-ready-web-purifier-and-token-optimizer" rel="noopener noreferrer"&gt;Test the Token Optimizer Endpoint Here&lt;/a&gt;&lt;/p&gt;
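&lt;p&gt;For a rough idea of what this kind of cleanup involves, here is a minimal local sketch. It is an assumption about the general technique, not the hosted API's actual implementation, and the function name &lt;code&gt;compact_text&lt;/code&gt; is hypothetical. It only normalizes whitespace and drops control characters.&lt;/p&gt;

```python
import re
import unicodedata


def compact_text(text):
    """Normalize whitespace and drop control characters from raw text."""
    # Treat non-breaking spaces and Windows line endings like their plain forms.
    text = text.replace("\u00a0", " ").replace("\r\n", "\n")
    # Remove control characters (Unicode category Cc) except newline and tab.
    text = "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    text = re.sub(r"[ \t]+", " ", text)     # runs of spaces or tabs become one space
    text = re.sub(r" ?\n ?", "\n", text)    # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)  # keep at most one blank line in a row
    return text.strip()
```

&lt;p&gt;Collapsing spaces this way also strips indentation, so a sketch like this suits prose and Markdown tables but not code blocks, where leading whitespace carries meaning.&lt;/p&gt;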




&lt;p&gt;How are you currently handling complex PDF structures (like nested cells or multi-page tables) in your AI apps? Let's discuss in the comments below!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>learning</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
