
I made a fast, structured PDF extractor for RAG; 300 pages a second

Hi all,

I hope you're doing well. I'd like to share (what I believe) may be a useful tool I've made.

I was recently helping develop a cybersecurity RAG assistant with my dad (I'm 15). He doesn't really care about speed, but I did. In fact, I got annoyed. I couldn't find a single lightning-fast PDF parser for RAG that kept quality intact. I had this weird itch to scratch: I wanted to change my chunking pipeline and see results INSTANTLY.

So I ended up porting pymupdf4llm to C (kind of; the output format is different), then binding it back to Python. Just changing the language and fixing up the algorithms made a huge difference.

~300 pages a second. 30x faster than pymupdf4llm.

what exactly is it?

A fast PDF extractor for Python. I used most of pymupdf4llm's detection and parsing features and heuristics as a reference, then wrote it in C for speed. Unlike pymupdf4llm and many others, though, it outputs structured JSON with a lot of data for RAG: geometry, typography, document structure, etc.

speed: ~300 pages/second on CPU. no GPU needed. 1 million pages in ~55 minutes.

the problem

Most PDF extractors give you either raw text (fast but unusable) or full-on OCR and ML pipelines (accurate but slow).
For RAG, though, you need a middle ground of fidelity and speed, especially at larger volumes.
This tool gives you structured data while staying fast, which allows for smarter chunks; chunks based on more than raw word counts matter a lot.

Also, chunking matters more than people think. I'm serious here; this isn't even about my tool, but I used to have 200-word slivers of text, and bigger embedding models were NOT helping, lol.

what you get

JSON output with metadata for every element:

{
  "type": "heading",
  "text": "Step 1. Gather threat intelligence",
  "bbox": [64.00, 173.74, 491.11, 218.00],
  "font_size": 21.64,
  "font_weight": "bold"
}
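As a tiny sketch of what you can do with that (my own illustration, not the library's API; I'm assuming the output has been saved as a plain JSON list of element dicts like the one above, and the file name is made up):

import json

# assumption: the extractor's output was saved as a JSON list of element
# dicts shaped like the example above ("output.json" is a made-up name)
with open("output.json") as f:
    elements = json.load(f)

# quick look at the document structure: how many of each element type
counts = {}
for el in elements:
    counts[el["type"]] = counts.get(el["type"], 0) + 1
print(counts)  # e.g. {'heading': 42, 'paragraph': 310, ...}

# pull out just the headings, e.g. to sketch a table of contents
headings = [el["text"] for el in elements if el["type"] == "heading"]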

For example, instead of splitting on word counts and overlaps, you can now (rough sketch after this list):

  • use bounding boxes to find semantic boundaries (where a chunk literally ends on the page, instead of guessing for each document)
  • filter out headers and footers from the top & bottom of pages
  • and lots more. you've got ALL the data!
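Here's a rough sketch of the first two ideas (again my own illustration, not something the library ships; the page height, margin bands, and top-left coordinate origin are assumptions, so adjust to your documents):

import json

# assumptions: elements arrive in reading order, "bbox" is [x0, y0, x1, y1]
# in points with a top-left origin (like PyMuPDF), and the page height and
# margin bands below are made-up illustration values
PAGE_HEIGHT = 792   # US Letter, in points
HEADER_BAND = 50    # anything ending this close to the top = running header
FOOTER_BAND = 50    # anything starting this close to the bottom = running footer

def is_page_furniture(el):
    x0, y0, x1, y1 = el["bbox"]
    return y1 <= HEADER_BAND or y0 >= PAGE_HEIGHT - FOOTER_BAND

def chunk_elements(elements):
    chunks, current = [], []
    for el in elements:
        if is_page_furniture(el):
            continue                    # filter out headers/footers
        if el["type"] == "heading" and current:
            chunks.append(current)      # a heading marks a semantic boundary
            current = []
        current.append(el)
    if current:
        chunks.append(current)
    return chunks

with open("output.json") as f:
    elements = json.load(f)

chunks = chunk_elements(elements)
chunk_texts = ["\n".join(el["text"] for el in chunk) for chunk in chunks]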

comparison table

| Tool | Speed (pps) | Tables | Images (figures) | OCR (Y/N) | Output | Best for |
| --- | --- | --- | --- | --- | --- | --- |
| pymupdf4llm-C | ~300 | Yes | No (WIP) | N | Structured JSON | RAG, high volume |
| pymupdf4llm | ~10 | Yes | Yes (but no ML to get contents) | N | Markdown | General extraction |
| pymupdf (alone) | ~250 | No | Not by itself; needs more effort, I believe | N | Text only | Basic text extraction |
| marker | ~0.5-1 | Yes | Yes (contents with ML?) | Y (optional?) | Markdown | Maximum fidelity |
| docling | ~2-5 | Yes | Yes | Y | JSON | Document intelligence |
| PaddleOCR | ~20-50 | Yes | Yes | Y | Text | Scanned documents |

the tradeoff: speed and control over automatic extraction. marker and docling give higher fidelity if you have time; this is built for when you don't.

what it handles well

  • high volume PDF ingestion (millions of pages)
  • RAG pipelines where document structure matters for chunking
  • custom downstream processing; you own the logic
  • cost-sensitive deployments; CPU only, no expensive inference
  • iteration speed; refine your chunking strategy in minutes

what it doesn't handle

  • scanned or image-heavy PDFs (no OCR)
  • 99%+ accuracy on complex edge cases; this trades some precision for speed
  • figure or image extraction

why i built this

Dumb reason. I just got bored of waiting for the PDFs to re-chunk every time I made a minor change. I couldn't find anything faster with even 50% of the quality. And anyway, my chunks were trash. So it was either raw text or ML, and I didn't want either of them.

links

repo: https://github.com/intercepted16/pymupdf4llm-C

pip: pip install pymupdf4llm-C (https://pypi.org/project/pymupdf4llm-C)

note: prebuilt wheels for Python 3.9 through 3.14 (inclusive), for macOS ARM, macOS x64, and Linux (glibc newer than 2011). no Windows; it's a pain to build for.

small disclaimer: AI was used for assistance in making the project. if you've got a problem with that, that's OK.

docs and examples in the repo. Feedback would be nice!
