Hi all,
I hope you're doing well. I'd like to share a tool I've made that I believe may be useful.
I was recently helping develop a cybersecurity RAG program with my dad (I'm 15). He probably didn't care about speed, but I did. In fact, I got annoyed: I couldn't find a single lightning-fast PDF parser for RAG that kept quality intact. I had this weird itch to scratch... I wanted to change my chunking pipeline and see results INSTANTLY.
And so, I ended up porting pymupdf4llm to C (kind of, since it has a different output format), then binding it back to Python. The results were better than I even expected.
~300 pages a second. 30x faster than pymupdf4llm.
what exactly is it?
A fast PDF extractor for Python. I used most of pymupdf4llm's features and heuristics for detection and parsing as a reference, then wrote it in C for speed. However, unlike pymupdf4llm and many others, I chose to output structured JSON built for RAG, with heaps of data: geometry, typography, document structure, etc.
speed: ~300 pages/second on CPU. no GPU needed. 1 million pages in ~55 minutes.
the problem
Most PDF extractors give you either raw text (fast but unusable) or full-on OCR and ML pipelines. For RAG, I believe you want structured data you can control: you want to build smart chunks based on document layout, not just word count. And you want this fast, especially when processing large volumes.
Also, chunking matters more than people think. I'm serious here... this isn't even about my tool, but I used to have 200-word slivers of text, and bigger embedding models were NOT helping, lol.
what you get
JSON output with metadata for every element:
{
  "type": "heading",
  "text": "Step 1. Gather threat intelligence",
  "bbox": [64.00, 173.74, 491.11, 218.00],
  "font_size": 21.64,
  "font_weight": "bold"
}
For example, instead of splitting on word counts and overlaps, you can now:
- use bounding boxes to find semantic boundaries (where a chunk actually ends, instead of guessing per document)
- filter out headers and footers from the top & bottom of pages
- and lots more. you've got ALL the data! (a rough sketch of the first two is below)
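Here's a rough sketch of those first two bullets, assuming you've already written the extractor's JSON to a file as a flat list of elements shaped like the snippet above. The page height, the edge band, and the top-left-origin bbox convention are my own assumptions here, not something the tool dictates, so tune them per document:

```python
import json

# Assumptions: bbox is [x0, y0, x1, y1] in points with a top-left origin,
# and pages are US Letter (792 pt tall). Adjust both for your own documents.
PAGE_HEIGHT = 792
EDGE_BAND = 60  # elements within this band of the top/bottom edge count as headers/footers


def is_header_or_footer(el):
    _, y0, _, y1 = el["bbox"]
    return y1 < EDGE_BAND or y0 > PAGE_HEIGHT - EDGE_BAND


def chunk_by_headings(elements):
    """Start a new chunk at every heading instead of at an arbitrary character count."""
    chunks, current = [], []
    for el in elements:
        if is_header_or_footer(el):
            continue  # drop page furniture before it ever reaches the embedder
        if el["type"] == "heading" and current:
            chunks.append("\n".join(current))
            current = []
        current.append(el["text"])
    if current:
        chunks.append("\n".join(current))
    return chunks


with open("output.json") as f:
    elements = json.load(f)  # assuming a flat list of elements like the one above

for chunk in chunk_by_headings(elements):
    print(chunk[:80])
```

The thresholds are obviously document dependent, but that's the point: the chunking decisions live in your code, not inside the extractor.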
comparison table
| Tool | Speed (pages/sec) | Tables | Images (Figures) | OCR | Output format | Best For |
|---|---|---|---|---|---|---|
| pymupdf4llm-C | ~300 | Yes | No (WIP) | No | Structured JSON | RAG, high volume |
| pymupdf4llm | ~10 | Yes | Yes (no ML to read contents) | No | Markdown | General extraction |
| pymupdf (alone) | ~250 | No | Not directly (takes extra work, I believe) | No | Plain text | Basic text extraction |
| marker | ~0.5-1 | Yes | Yes (contents via ML, I think) | Optional, I think | Markdown | Maximum fidelity |
| docling | ~2-5 | Yes | Yes | Yes | JSON | Document intelligence |
| PaddleOCR | ~20-50 | Yes | Yes | Yes | Plain text | Scanned documents |
the tradeoff is speed and control versus automatic extraction. marker and docling give higher fidelity if you have the time; this is built for when you don't.
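If you want to check the speed numbers on your own corpus instead of trusting my table, a tiny harness like this is how I compare parsers. The `extract_fn` callable is a placeholder of my own (it should take a PDF path and return the number of pages it processed); wire it up to whichever entry point the repo docs give you, or to another parser as a baseline:

```python
import time
from pathlib import Path


def pages_per_second(extract_fn, pdf_paths):
    """Time any extraction callable over a corpus.

    extract_fn is a placeholder: pass a function that takes a PDF path and
    returns the number of pages it processed.
    """
    total_pages = 0
    start = time.perf_counter()
    for path in pdf_paths:
        total_pages += extract_fn(path)
    elapsed = time.perf_counter() - start
    return total_pages / elapsed


# usage: pages_per_second(my_extract, sorted(Path("corpus").glob("*.pdf")))
```

I keep it as a plain callable on purpose, so I can swap parsers without touching the timing code.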
what it handles well
- high volume PDF ingestion (millions of pages)
- RAG pipelines where document structure matters for chunking
- custom downstream processing; you own the logic
- cost sensitive deployments; CPU only, no expensive inference
- iteration speed; refine your chunking strategy in minutes
what it doesn't handle
- scanned or image heavy PDFs (no OCR)
- 99%+ accuracy on complex edge cases; this trades some precision for speed
- figure or image extraction
why i built this
I used this in my own RAG project and the difference was great. I got to see results instantly, and my chunks were better! (I used bounding boxes to find where one paragraph ends, not "2000 chars, 500 overlap".)
links
repo: https://github.com/intercepted16/pymupdf4llm-C
pip: pip install pymupdf4llm-C (https://pypi.org/project/pymupdf4llm-C)
note: prebuilt wheels for Python 3.9 through 3.14 (macOS ARM, macOS x64, Linux with glibc newer than 2011). no Windows; it's a pain to build for.
docs and examples in the repo. would appreciate any feedback.