
I made a fast, structured PDF extractor for RAG; 300 pages a second

Hi all,

I hope you're doing well. I'd like to share (what I believe) may be a useful tool I've made.

I was recently helping develop a cybersecurity RAG assistant with my dad (I'm 15). He doesn't really care about speed, but I did. In fact, I got annoyed. I couldn't find a single lightning-fast PDF parser for RAG that kept quality intact. I had this weird itch to scratch: I wanted to change my chunking pipeline and see results INSTANTLY.

So I ended up porting pymupdf4llm to C (kind of; the output format is different), then binding it back to Python. Just changing the language and fixing up the algorithms made a huge difference.

~300 pages a second. 30x faster than pymupdf4llm.

what exactly is it?

A fast PDF extractor for Python. I used most of pymupdf4llm's detection and parsing features and heuristics as a reference, then wrote it in C for speed. Unlike pymupdf4llm and many others, though, it outputs structured JSON with a lot of data for RAG: geometry, typography, document structure, etc.

speed: ~300 pages/second on CPU. no GPU needed. 1 million pages in ~55 minutes.

the problem

Most PDF extractors give you either raw text (fast but unusable) or full-on OCR and ML pipelines (accurate but slow).
For RAG, though, you need a middle ground of fidelity and speed, especially at larger volumes.
This tool gives you structured data while staying fast, which allows for smarter chunks; chunks based on more than raw word counts matter a lot.

Also, chunking matters more than people think. I'm serious here; this isn't even about my tool, but I used to have 200-word slivers of text, and bigger embedding models were NOT helping, lol.

what you get

JSON output with metadata for every element:

{
  "type": "heading",
  "text": "Step 1. Gather threat intelligence",
  "bbox": [64.00, 173.74, 491.11, 218.00],
  "font_size": 21.64,
  "font_weight": "bold"
}
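As a tiny sketch of what you can do with that (my own illustration, not the library's API; I'm assuming the output has been saved as a plain JSON list of element dicts like the one above, and the file name is made up):

import json

# assumption: the extractor's output was saved as a JSON list of element
# dicts shaped like the example above ("output.json" is a made-up name)
with open("output.json") as f:
    elements = json.load(f)

# quick look at the document structure: how many of each element type
counts = {}
for el in elements:
    counts[el["type"]] = counts.get(el["type"], 0) + 1
print(counts)  # e.g. {'heading': 42, 'paragraph': 310, ...}

# pull out just the headings, e.g. to sketch a table of contents
headings = [el["text"] for el in elements if el["type"] == "heading"]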

For example, instead of splitting on word counts and overlaps, you can now (rough sketch after this list):

  • use bounding boxes to find semantic boundaries (where a chunk literally ends on the page, instead of guessing for each document)
  • filter out headers and footers from the top & bottom of pages
  • and lots more. you've got ALL the data!
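Here's a rough sketch of the first two ideas (again my own illustration, not something the library ships; the page height, margin bands, and top-left coordinate origin are assumptions, so adjust to your documents):

import json

# assumptions: elements arrive in reading order, "bbox" is [x0, y0, x1, y1]
# in points with a top-left origin (like PyMuPDF), and the page height and
# margin bands below are made-up illustration values
PAGE_HEIGHT = 792   # US Letter, in points
HEADER_BAND = 50    # anything ending this close to the top = running header
FOOTER_BAND = 50    # anything starting this close to the bottom = running footer

def is_page_furniture(el):
    x0, y0, x1, y1 = el["bbox"]
    return y1 <= HEADER_BAND or y0 >= PAGE_HEIGHT - FOOTER_BAND

def chunk_elements(elements):
    chunks, current = [], []
    for el in elements:
        if is_page_furniture(el):
            continue                    # filter out headers/footers
        if el["type"] == "heading" and current:
            chunks.append(current)      # a heading marks a semantic boundary
            current = []
        current.append(el)
    if current:
        chunks.append(current)
    return chunks

with open("output.json") as f:
    elements = json.load(f)

chunks = chunk_elements(elements)
chunk_texts = ["\n".join(el["text"] for el in chunk) for chunk in chunks]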

comparison table

| Tool | Speed (pps) | Tables | Images (figures) | OCR (Y/N) | Output | Best for |
| --- | --- | --- | --- | --- | --- | --- |
| pymupdf4llm-C | ~300 | Yes | No (WIP) | N | Structured JSON | RAG, high volume |
| pymupdf4llm | ~10 | Yes | Yes (but no ML to get contents) | N | Markdown | General extraction |
| pymupdf (alone) | ~250 | No | Not by itself; needs more effort, I believe | N | Text only | Basic text extraction |
| marker | ~0.5-1 | Yes | Yes (contents with ML?) | Y (optional?) | Markdown | Maximum fidelity |
| docling | ~2-5 | Yes | Yes | Y | JSON | Document intelligence |
| PaddleOCR | ~20-50 | Yes | Yes | Y | Text | Scanned documents |

the tradeoff: speed and control over automatic extraction. marker and docling give higher fidelity if you have time; this is built for when you don't.

what it handles well

  • high volume PDF ingestion (millions of pages)
  • RAG pipelines where document structure matters for chunking
  • custom downstream processing; you own the logic
  • cost-sensitive deployments; CPU only, no expensive inference
  • iteration speed; refine your chunking strategy in minutes

what it doesn't handle

  • scanned or image-heavy PDFs (no OCR)
  • 99%+ accuracy on complex edge cases; this trades some precision for speed
  • figure or image extraction

why i built this

Dumb reason. I just got bored of waiting for the PDFs to re-chunk every time I made a minor change. I couldn't find anything faster with even 50% of the quality. And anyway, my chunks were trash. So it was either raw text or ML, and I didn't want either of them.

links

repo: https://github.com/intercepted16/pymupdf4llm-C

pip: pip install pymupdf4llm-C (https://pypi.org/project/pymupdf4llm-C)

note: prebuilt wheels for Python 3.9 through 3.14 (inclusive), for macOS ARM, macOS x64, and Linux (glibc newer than 2011). no Windows; it's a pain to build for.

small disclaimer: AI was used for assistance in making the project. if you've got a problem with that, that's OK.

docs and examples in the repo. Feedback would be nice!
