<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: intercepted16</title>
    <description>The latest articles on DEV Community by intercepted16 (@intercepted16).</description>
    <link>https://dev.to/intercepted16</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1358653%2F4e4784e9-6e3d-44f5-a6d0-a8e57c097fc0.png</url>
      <title>DEV Community: intercepted16</title>
      <link>https://dev.to/intercepted16</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/intercepted16"/>
    <language>en</language>
    <item>
      <title>Won't LLMs eventually train on themselves? It'll slowly decline in output..</title>
      <dc:creator>intercepted16</dc:creator>
      <pubDate>Thu, 08 Jan 2026 01:06:10 +0000</pubDate>
      <link>https://dev.to/intercepted16/wont-llms-eventually-train-on-themselves-itll-slowly-decline-in-output-6d7</link>
      <guid>https://dev.to/intercepted16/wont-llms-eventually-train-on-themselves-itll-slowly-decline-in-output-6d7</guid>
      <description>&lt;p&gt;TL;DR: LLMs train on stuff like documentation, GitHub repositories, StackOverflow, and Reddit. But as we keep using LLMs, their own output goes into these platforms. Which means.. they'll train on themselves at one point. Each time, maybe the quality is 0.1% worse. This adds up, exponentially.&lt;/p&gt;

&lt;p&gt;LLMs do produce good output. But that's because they were trained on human data. You can already tell that AI output is &lt;em&gt;slightly&lt;/em&gt; worse at times. Sometimes, much worse.&lt;/p&gt;

&lt;p&gt;Kinda like the Telephone game.. the message slowly gets diluted. Each generation's loss may be tiny, but the losses compound, generation after generation.&lt;/p&gt;
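&lt;p&gt;The compounding can be sketched numerically. The 0.1% per-generation loss rate below is a hypothetical illustration, not a measured figure:&lt;/p&gt;

```python
# Hypothetical illustration: if each training generation loses a fixed
# fraction of quality, the remaining quality decays exponentially.
# The 0.1% rate is an assumed number, not a measurement.
def quality_after(generations, loss_per_generation=0.001):
    """Quality remaining after n generations of compounding loss."""
    return (1 - loss_per_generation) ** generations

print(round(quality_after(100), 3))   # 0.905 -- about 9.5% gone after 100 generations
print(round(quality_after(1000), 3))  # 0.368 -- most of the quality lost
```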

</description>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>I made a fast, structured PDF extractor for RAG; 300 pages a second</title>
      <dc:creator>intercepted16</dc:creator>
      <pubDate>Tue, 06 Jan 2026 07:18:43 +0000</pubDate>
      <link>https://dev.to/intercepted16/i-made-a-fast-structured-pdf-extractor-for-rag-300-pages-a-second-34d1</link>
      <guid>https://dev.to/intercepted16/i-made-a-fast-structured-pdf-extractor-for-rag-300-pages-a-second-34d1</guid>
      <description>&lt;p&gt;Hi all,&lt;/p&gt;

&lt;p&gt;I hope you're doing well. I'd like to share what I believe may be a useful tool I've made.&lt;/p&gt;

&lt;p&gt;I was recently helping develop a cybersecurity RAG assistant with my dad (I'm 15). He didn't really care about speed, but I did. In fact, I got annoyed. I couldn't find a single lightning-fast PDF parser for RAG with quality intact. I had this weird itch to scratch.. I wanted to change my chunking pipeline and see results INSTANTLY.&lt;/p&gt;

&lt;p&gt;And so, I ended up porting (kind of, though it has a different output format) &lt;code&gt;pymupdf4llm&lt;/code&gt; to C, then binding it back to Python. Just changing the language and fixing some algorithms made a huge difference..&lt;/p&gt;

&lt;p&gt;~300 pages a second. 30x faster than &lt;code&gt;pymupdf4llm&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  what exactly is it?
&lt;/h2&gt;

&lt;p&gt;A fast PDF extractor for Python. I used most of &lt;code&gt;pymupdf4llm&lt;/code&gt;'s features and heuristics for detection and parsing as a reference, then rewrote it in C for speed. However, unlike &lt;code&gt;pymupdf4llm&lt;/code&gt; and many others, I chose to output structured JSON for RAG, with &lt;strong&gt;a lot&lt;/strong&gt; of data: geometry, typography, document structure, etc.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;speed:&lt;/strong&gt; ~300 pages/second on CPU. no GPU needed. 1 million pages in ~55 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  the problem
&lt;/h2&gt;

&lt;p&gt;Most PDF extractors give you either raw text (fast but unusable) or &lt;em&gt;full-on&lt;/em&gt; OCR and ML pipelines. &lt;br&gt;
However, for RAG, a middle ground of fidelity and speed is needed, especially at larger volumes.&lt;br&gt;
This tool gives you structured data, enabling smarter chunks (chunks based on more than just word counts matter a lot) while keeping speeds fast.&lt;/p&gt;

&lt;p&gt;Also, chunking matters more than people think. I'm serious here.. not even related to my tool, but I used to have 200-word slivers of text.. and bigger embedding models were NOT helping, lol.&lt;/p&gt;
&lt;h2&gt;
  
  
  what you get
&lt;/h2&gt;

&lt;p&gt;JSON output with metadata for every element:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"heading"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Step 1. Gather threat intelligence"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"bbox"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;64.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;173.74&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;491.11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;218.00&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"font_size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;21.64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"font_weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bold"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example, instead of splitting on word counts and overlaps, you can now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;use bounding boxes to find semantic boundaries (&lt;em&gt;where does this chunk probably end&lt;/em&gt;, &lt;strong&gt;literally&lt;/strong&gt;.. instead of guessing for each document)&lt;/li&gt;
&lt;li&gt;filter out headers and footers from the top &amp;amp; bottom of pages&lt;/li&gt;
&lt;li&gt;and lots more. you've got ALL the data!&lt;/li&gt;
&lt;/ul&gt;
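&lt;p&gt;For instance, a structure-aware chunker over these elements could look like this minimal sketch. The 792 pt page height and the 50 pt header/footer bands are assumptions for illustration, not values from the tool:&lt;/p&gt;

```python
# Minimal sketch: start a new chunk at each heading and drop elements
# in the header/footer bands, using the JSON element format above.
# PAGE_HEIGHT (US Letter, 792 pt) and BAND are assumed values.
PAGE_HEIGHT = 792.0
BAND = 50.0

def chunk(elements):
    """Group element texts into chunks split at headings."""
    chunks, current = [], []
    for el in elements:
        y_top = el["bbox"][1]
        # skip anything sitting in the top or bottom band of the page
        if BAND > y_top or y_top > PAGE_HEIGHT - BAND:
            continue
        if el["type"] == "heading" and current:
            chunks.append(current)
            current = []
        current.append(el["text"])
    if current:
        chunks.append(current)
    return chunks
```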

&lt;h2&gt;
  
  
  comparison table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Speed (pages/sec)&lt;/th&gt;
&lt;th&gt;Tables&lt;/th&gt;
&lt;th&gt;Images (Figures)&lt;/th&gt;
&lt;th&gt;OCR (Y/N)&lt;/th&gt;
&lt;th&gt;JSON Output&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;pymupdf4llm-C&lt;/td&gt;
&lt;td&gt;~300&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No (WIP)&lt;/td&gt;
&lt;td&gt;N&lt;/td&gt;
&lt;td&gt;Yes (structured)&lt;/td&gt;
&lt;td&gt;RAG, high volume&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pymupdf4llm&lt;/td&gt;
&lt;td&gt;~10&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (but no ML to extract contents)&lt;/td&gt;
&lt;td&gt;N&lt;/td&gt;
&lt;td&gt;Markdown&lt;/td&gt;
&lt;td&gt;General extraction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pymupdf (alone)&lt;/td&gt;
&lt;td&gt;~250&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Not by itself; requires extra work, I believe&lt;/td&gt;
&lt;td&gt;N&lt;/td&gt;
&lt;td&gt;No (text only)&lt;/td&gt;
&lt;td&gt;Basic text extraction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;marker&lt;/td&gt;
&lt;td&gt;~0.5-1&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (contents with ML?)&lt;/td&gt;
&lt;td&gt;Y (optional?)&lt;/td&gt;
&lt;td&gt;Markdown&lt;/td&gt;
&lt;td&gt;Maximum fidelity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;docling&lt;/td&gt;
&lt;td&gt;~2-5&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Y&lt;/td&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;td&gt;Document intelligence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PaddleOCR&lt;/td&gt;
&lt;td&gt;~20-50&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Y&lt;/td&gt;
&lt;td&gt;Text&lt;/td&gt;
&lt;td&gt;Scanned documents&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;the tradeoff:&lt;/strong&gt; speed and control versus automatic extraction. marker and docling give higher fidelity if you have time; this is built for when you don't.&lt;/p&gt;

&lt;h2&gt;
  
  
  what it handles well
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;high volume PDF ingestion (millions of pages)&lt;/li&gt;
&lt;li&gt;RAG pipelines where document structure matters for chunking&lt;/li&gt;
&lt;li&gt;custom downstream processing; you own the logic&lt;/li&gt;
&lt;li&gt;cost sensitive deployments; CPU only, no expensive inference&lt;/li&gt;
&lt;li&gt;iteration speed; refine your chunking strategy in minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  what it doesn't handle
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;scanned or image heavy PDFs (no OCR)&lt;/li&gt;
&lt;li&gt;99%+ accuracy on complex edge cases; this trades some precision for speed&lt;/li&gt;
&lt;li&gt;figure or image extraction&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  why i built this
&lt;/h2&gt;

&lt;p&gt;Dumb reason. I just got bored of waiting for the PDFs to chunk every time I made a minor change. I couldn't find anything faster with even 50% of the quality. And anyway, my chunks were trash. So it was either raw text or ML, and I didn't want either.&lt;/p&gt;

&lt;h2&gt;
  
  
  links
&lt;/h2&gt;

&lt;p&gt;repo: &lt;a href="https://github.com/intercepted16/pymupdf4llm-C" rel="noopener noreferrer"&gt;https://github.com/intercepted16/pymupdf4llm-C&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;pip: &lt;code&gt;pip install pymupdf4llm-C&lt;/code&gt; (&lt;a href="https://pypi.org/project/pymupdf4llm-C" rel="noopener noreferrer"&gt;https://pypi.org/project/pymupdf4llm-C&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;note: prebuilt wheels for Python 3.9 through 3.14 (inclusive) on macOS ARM, macOS x64, and Linux (glibc newer than 2011). no Windows; it's a pain to build for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;small disclaimer&lt;/strong&gt;: AI was used for assistance in making the project. if you've got a problem with that, that's OK.&lt;/p&gt;

&lt;p&gt;docs and examples in the repo. Feedback would be nice!&lt;/p&gt;

</description>
      <category>programming</category>
      <category>rag</category>
      <category>pdf</category>
      <category>python</category>
    </item>
  </channel>
</rss>
