Dr Hernani Costa

Posted on Apr 21 • Originally published at radar.firstaimovers.com

RAG Document Ingestion: Why CPU-First Beats GPU Guessing

#ai #productivity #automation #programming

Most RAG deployments fail upstream, not downstream. When your document ingestion pipeline treats every PDF the same, you inherit chaos at retrieval time—and no LLM can fix bad source data.

This is where CPU-first document ingestion strategy becomes your competitive edge, especially on constrained hardware like Raspberry Pi 5. The real win is not picking the fanciest parser. It is routing cheap extraction to expensive OCR only when the evidence demands it.

CPU-First Document Ingestion for RAG on Raspberry Pi 5

TL;DR: Build a CPU-first document ingestion stack for RAG on Raspberry Pi 5 using PyMuPDF, Tesseract, and smart OCR routing.

How to turn messy PDFs into reliable Markdown, structured extraction, and searchable context without jumping too early to GPU-heavy infrastructure

Most RAG teams obsess over models and ignore ingestion, but the real failure often starts upstream. Adopting a CPU-first document ingestion strategy is crucial, especially when working with constrained hardware, as it addresses the root cause of many RAG system failures: bad document intake.

If your PDFs arrive as scans, screenshots, mixed layouts, broken tables, or low-quality exports, your retrieval stack inherits every flaw upstream. By the time the LLM answers badly, the root cause is often not reasoning. It is bad document intake.

Document Ingestion Is the Hidden Bottleneck in RAG

A PDF is a container, not a promise of usable text.

Some PDFs are born-digital and clean. Others are scans. Many are hybrids full of text objects, signatures, screenshots, repeated headers, and visual tables that do not exist as real semantic tables in the file structure. That is why document ingestion is best treated as routing plus fallbacks, not as a single "PDF to Markdown" tool choice.

What matters most is not which library you like on social media. What matters is whether your pipeline can decide, page by page, when cheap extraction is enough and when expensive work is justified.

That is the strategic shift.

Your ingestion layer should not assume every file is hard. It should prove when a file is hard.

Cheap Detection Beats Expensive Guessing

The strongest ingestion systems do one thing well: they route cheap to expensive.

They start with fast CPU-native extraction. Then they escalate only if the page quality, layout complexity, or missing text makes OCR or deeper parsing necessary. That is how you protect speed, cost, and reliability at the same time.

In my experience, teams that skip this step burn time in the wrong place. They throw heavier models or more complex parsers at documents that never needed them.

Why a CPU-First Document Ingestion Architecture Fits Raspberry Pi 5

This is why a CPU-first document ingestion stack makes sense on Raspberry Pi 5.

The hardware profile gives you a solid edge compute envelope, but not a free pass to brute-force every document with heavyweight parsing. That forces discipline, which is a good thing. It pushes you toward a cleaner architecture: Acquire → Detect → Convert → OCR → Extract → Store → Embed.

That sequence matters because it isolates failures.

You preserve the raw file. You detect what kind of document you actually have. You convert first. You OCR only where the evidence says you need OCR. Then, and only then, you hand normalized text to structured extraction and embeddings.

Why This Separation Works

The LLM should not be your page parser.

Its job is stronger when it runs after conversion, over normalized Markdown or clean text. That is the difference between asking a model to rescue chaos and asking it to extract structure from stable input. Your current runtime logic already points in that direction, with GPT-4o mini handling structured extraction on normalized text and embeddings handled separately.

That design is not just technically cleaner. It is easier to maintain.

A builder can improve OCR without touching extraction prompts. A product lead can upgrade embeddings without rewriting parsing logic. A platform team can measure which stage is slow instead of blaming "AI" as one monolithic black box. This level of detail is a core part of effective AI Governance & Risk Advisory.

PyMuPDF and Tesseract Are the Practical Default Lane

For a Pi-first deployment, PyMuPDF plus Tesseract is the right default.

PyMuPDF plays two critical roles. First, it is the fast deterministic lane for born-digital PDFs. Second, it is the preflight gate that helps you decide whether OCR is even necessary. Its own published benchmark suite covers thousands of pages, and it is especially strong for high-throughput text extraction. It also supports OCR workflows through Tesseract when installed separately.

That gives you a simple operational rule:

Use PyMuPDF first. OCR only when the extracted text is sparse, broken, or obviously image-based.

Tesseract Still Earns Its Place

Tesseract remains practical because it fits the edge constraint well.

It has real deployment gravity, broad language support, and a clear Debian package path, including Dutch language data through tesseract-ocr-nld. That matters for reproducibility in real Pi environments.

There is a nuance here. Tesseract does expose optional OpenCL support, but that should not become the center of your architecture. The better mental model is still CPU-first with good routing, preprocessing, and language-pack hygiene.

Throughput Is Good Enough When You Gate OCR

A useful external Pi 5 anchor is roughly 25.6 seconds for a 10-page Tesseract OCR task in one public Raspberry Pi 5 benchmark context. That is not a universal guarantee, but it is enough to show the real lesson: OCR is feasible on Pi 5, yet still expensive enough that you should not run it blindly on every page.

This is where many teams lose time.

They treat OCR as a default. It should be a penalty, not a baseline.

Docling and Marker Are Scale-Up Tools, Not Default Decisions

This is where nuance matters.

Docling is useful when you need a richer pipeline framework with a canonical document object, configurable OCR and layout stages, and concurrency controls through bounded queues. It becomes more attractive when document variety increases and you want stronger pipeline orchestration. Its MIT licensing also makes it easier to discuss in commercial product contexts.

Marker belongs in a different conversation. It is compelling when high-fidelity PDF-to-Markdown or JSON conversion becomes a serious priority, especially once you introduce stronger worker hardware. But it also brings licensing considerations and a more deliberate product decision.

That is why I would frame them this way:

PyMuPDF + Tesseract for the default local lane
Docling for more structured, threaded, scale-up parsing workflows
Marker for high-fidelity reconstruction when the business case justifies the added complexity

The Real Decision Is Not "Best Tool"

The real decision is when to escalate.

You do not win by picking the fanciest parser. You win by keeping your easiest documents cheap, your hardest documents recoverable, and your operational choices explicit.

That is what creates a system instead of a pile of tools.

A Four-Step Framework for CPU-First Document Ingestion

Here is the framework I would use with any SME building RAG on constrained hardware, often as part of our AI Readiness Assessment and broader Business Process Optimization strategy.

Start with native extraction
Use PyMuPDF to test whether the document already contains usable text and whether basic ordering can be recovered cleanly.
Gate OCR aggressively
Run Tesseract only when extracted text is missing, broken, or clearly image-based. Preserve metadata about which pages were OCR'd.
Normalize before enrichment
Convert documents into Markdown or stable structured text before sending content to extraction models or embeddings. That keeps the LLM focused on meaning, not page repair.
Escalate only when complexity earns it
Bring in Docling, Marker, or larger worker tiers only when layout fidelity, tables, or throughput justify the cost.

This is the strategic insight that separates leaders from followers: do not let your hardest document define the cost of every easy one.

Reliability Comes From Architecture, Not Tool Hype

RAG quality is downstream of ingestion quality.

If your pipeline preserves raw files, tracks OCR usage, separates conversion from extraction, and defines explicit escalation paths, you get a system that can be improved over time. If it does not, you get a fragile demo that works until the first ugly PDF arrives.

Let me make this concrete.

Imagine an operations leader at a mid-sized logistics company. They want supplier contracts, invoices, and scanned compliance forms searchable inside an internal assistant. The wrong move is to send everything through a heavyweight parsing stack from day one. The right move is to classify the documents, keep the easy PDFs fast, and escalate only the genuinely messy ones. That is how you protect latency, budget, and trust.

My take is simple: document ingestion is not a preprocessing detail. It is part of a broader Digital Transformation Strategy and Workflow Automation Design that connects technical execution to measurable business outcomes.

DEV Community