Extracting text from PDFs is still one of the most common tasks in data engineering, AI pipelines, and automation workflows. Whether you're building a search system, a retrieval-augmented generation (RAG) pipeline, or simply processing reports, the first step is turning PDFs into clean, usable text.
At first glance this sounds simple, but PDFs were never designed to be machine-readable in the way modern formats are. A PDF is essentially a set of instructions describing how a page should look, not a structured representation of paragraphs, headings, or tables. That means text may be stored in fragments, positioned arbitrarily, or embedded as images.
Because of this, native extraction often produces broken sentences, incorrect reading order, or missing content. Modern tools try to reconstruct structure rather than just reading raw text streams, which is why the choice of extraction method matters.
How PDF Text Extraction Works
Most PDF extraction pipelines follow the same high-level process. First, the document is parsed page by page. Then text blocks are detected and assembled into a readable order. If the document contains scanned pages instead of selectable text, OCR is applied. Finally, the output is normalized so it can be indexed, searched, or passed to downstream systems.
Even though this workflow sounds straightforward, each step contains a surprising amount of complexity. Reading order detection, for example, becomes difficult in multi-column layouts or technical documents. Tables introduce another layer of difficulty, because the visual structure does not always map cleanly to text.
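To make the reading-order problem concrete, here is a deliberately simplified sketch for a two-column page. It assumes each extracted text block carries top-left (x, y) coordinates with y increasing downward; real layout engines work with much richer geometry, but the core idea of grouping by column before sorting top-to-bottom is the same.

```python
# Minimal sketch of reading-order reconstruction for a two-column page.
# Assumes each block carries (x, y) coordinates for its top-left corner,
# with y increasing downward. Real engines use far richer layout data.

def order_blocks(blocks, page_width):
    """Sort text blocks into left-column-then-right-column reading order."""
    mid = page_width / 2
    left = [b for b in blocks if b["x"] < mid]
    right = [b for b in blocks if b["x"] >= mid]
    # Within each column, read top to bottom.
    ordered = sorted(left, key=lambda b: b["y"]) + sorted(right, key=lambda b: b["y"])
    return [b["text"] for b in ordered]

blocks = [
    {"x": 320, "y": 40, "text": "Column 2, first paragraph"},
    {"x": 40, "y": 200, "text": "Column 1, second paragraph"},
    {"x": 40, "y": 40, "text": "Column 1, first paragraph"},
]
print(order_blocks(blocks, page_width=600))
```

A naive extractor that sorted purely by y would interleave the two columns, which is exactly the broken reading order described above.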
This is why many teams eventually move beyond simple PDF libraries to more complete document processing frameworks.
Extracting Text from a PDF in Python
In Python, the basic workflow for extracting text usually looks the same regardless of the library being used. A document is loaded, parsed, and converted into text that can be printed, stored, or processed further. Different libraries use different APIs, but the general pattern remains consistent. The real differences appear in how well they handle layout, performance, and OCR.
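In outline, that shared pattern looks like the sketch below. `FakeReader` is a stand-in for whatever parser you choose (pypdf's `PdfReader`, for instance, exposes a very similar `pages` / `extract_text()` shape); only the stub is invented here, the loop itself is the common pattern.

```python
# Library-agnostic sketch of the usual extraction loop. FakeReader stands in
# for a real parser such as pypdf's PdfReader; the loop shape is the same.

class FakePage:
    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text

class FakeReader:
    def __init__(self, path):
        # A real reader would parse the file at `path`; here we fake two pages.
        self.pages = [FakePage("Page one text."), FakePage("Page two text.")]

def pdf_to_text(path):
    reader = FakeReader(path)
    # Join per-page text with form feeds so page boundaries survive.
    return "\f".join(page.extract_text() for page in reader.pages)

print(pdf_to_text("document.pdf"))
```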
Using Kreuzberg for PDF Extraction
Modern document pipelines often require more than just reading text streams. They need consistent metadata, reliable handling of different formats, and good performance when processing large batches of files.
Kreuzberg is designed for this type of workload. It uses a Rust-based extraction engine with Python bindings (bindings for 11 other programming languages are available as of March 2026), enabling efficient document processing while integrating smoothly into Python pipelines.
Here is how to get started with Kreuzberg in Python. First, install the package:
```
pip install kreuzberg
```

For the simplest case, extracting text from a PDF synchronously, use `extract_file_sync`:

```python
from kreuzberg import extract_file_sync

result = extract_file_sync("document.pdf")
print(result.content)
print(f"Pages: {result.metadata['page_count']}")
```
If you are working in an async context, the async variant works identically:

```python
import asyncio

from kreuzberg import extract_file

async def main():
    result = await extract_file("document.pdf")
    print(result.content)
    print(f"Tables found: {len(result.tables)}")

asyncio.run(main())
```
The ExtractionResult object returned by both variants gives you result.content for the extracted text, result.tables for any detected tables, and result.metadata for document properties like page count and format type. To process multiple PDFs at once, use the batch extraction functions, which handle concurrency automatically:
```python
from pathlib import Path

from kreuzberg import batch_extract_files_sync

paths = list(Path("documents").glob("*.pdf"))
results = batch_extract_files_sync(paths)

for path, result in zip(paths, results):
    print(f"{path.name}: {len(result.content)} characters")
```
For scanned PDFs, enable OCR by passing an ExtractionConfig with an OcrConfig:

```python
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig

config = ExtractionConfig(
    ocr=OcrConfig(backend="tesseract", language="eng")
)
result = extract_file_sync("scanned.pdf", config=config)
print(result.content)
```
For Chinese documents, you can also use PaddleOCR:
```python
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig

config = ExtractionConfig(
    ocr=OcrConfig(backend="paddleocr", language="zh")
)
result = extract_file_sync("scanned.pdf", config=config)
print(f"Extracted content (preview): {result.content[:100]}")
print(f"Total characters: {len(result.content)}")
```
If you get a libonnxruntime.so loading error, install onnxruntime first:

```
python -m pip install --upgrade onnxruntime
```
If the error persists on Linux, add the onnxruntime/capi directory to LD_LIBRARY_PATH before running your script (replace the path with your actual venv location):

```
export LD_LIBRARY_PATH="<venv>/lib/pythonX.Y/site-packages/onnxruntime/capi:$LD_LIBRARY_PATH"
```
Kreuzberg supports Tesseract, EasyOCR, and PaddleOCR as backends, which is useful for multilingual documents where backend quality varies by language.
Handling Scanned PDFs
One of the biggest challenges in real-world workflows is dealing with scanned documents. These files contain images instead of selectable text, so extraction requires optical character recognition.
A modern pipeline typically detects when text is missing and automatically runs OCR before merging the results into the document structure. The quality of OCR depends heavily on language, resolution, and document quality, which is why systems that allow different OCR backends are often more reliable in multilingual environments.
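The detection step can be sketched as a simple per-page heuristic: if native extraction yields little or no text, fall back to OCR. In this sketch, `run_ocr` and the `MIN_CHARS` threshold are placeholders for a real backend call (such as Tesseract) and a tuned cutoff.

```python
# Sketch of an OCR fallback: pages whose native text layer is empty (or nearly
# so) are re-processed with OCR. run_ocr is a placeholder for a real backend.

MIN_CHARS = 20  # below this, assume the page is a scan with no text layer

def run_ocr(page_image):
    # Stand-in for a real OCR call, e.g. pytesseract.image_to_string(...).
    return f"<ocr text for {page_image}>"

def extract_page(native_text, page_image):
    if len(native_text.strip()) >= MIN_CHARS:
        return native_text
    return run_ocr(page_image)

print(extract_page("A normal page with plenty of selectable text.", "p1.png"))
print(extract_page("", "p2.png"))  # empty text layer, so OCR runs instead
```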
Extracting Tables and Structured Content
Tables are another area where simple extraction approaches struggle. Even when the text is captured correctly, the relationships between rows and columns may be lost.
More advanced extraction pipelines attempt to detect table regions and preserve structure so that data remains usable. This is particularly important in financial reports, research papers, and operational documents where tables often contain the most important information.
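As a concrete illustration, a detected table typically arrives as rows of cells (a list of lists is the common shape, and is for example what pdfplumber's `extract_tables` returns). The job is then to serialize it so the row/column relationships survive into the text output, for instance as Markdown:

```python
# Sketch: render a detected table (rows of cells) as Markdown so the
# row/column structure survives into the extracted text.

def table_to_markdown(rows):
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)

table = [
    ["Quarter", "Revenue"],
    ["Q1", "1.2M"],
    ["Q2", "1.5M"],
]
print(table_to_markdown(table))
```

Flattening the same cells into a single run of text would keep every word but lose the fact that "Q2" and "1.5M" belong to the same row, which is the failure mode described above.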
Performance and Scaling Considerations
Performance becomes increasingly important as soon as you begin processing more than a handful of files. Batch ingestion, RAG pipelines, and search indexing workflows may involve thousands or millions of documents, and inefficiencies at the parsing stage quickly become expensive.
Several factors influence performance, including how the parsing engine is implemented, how memory is managed, and how well the system supports concurrency. Tools that rely heavily on interpreted execution or external subprocesses often encounter bottlenecks at scale, while native parsing engines tend to perform better under sustained workloads.
This is one reason many modern document processing tools use compiled cores with language bindings on top.
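The batching pattern itself is straightforward to sketch with a thread pool; `extract_one` below is a stand-in for any extraction call that spends its time in I/O or in a native engine that releases the GIL. (Libraries with built-in batch functions, like the one shown earlier, handle this concurrency for you.)

```python
# Sketch of concurrent batch extraction with a thread pool. extract_one
# stands in for a real extraction call doing I/O or native-code work.

from concurrent.futures import ThreadPoolExecutor

def extract_one(path):
    return f"text of {path}"

def extract_all(paths, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map preserves input order regardless of completion order.
        return list(pool.map(extract_one, paths))

paths = [f"doc{i}.pdf" for i in range(5)]
print(extract_all(paths))
```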
Where PDF Extraction Fits in a Modern Pipeline
In most real systems, text extraction is only the first step. Once text is available, it is typically split into chunks, converted into embeddings, and stored in a vector database for retrieval.
This architecture has become standard for document search and RAG systems because it allows large collections of documents to be queried efficiently. Reliable extraction is the foundation that makes everything else possible.
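The chunking step is simple to sketch: split the extracted text into overlapping windows so context is not severed at chunk boundaries. The sizes below are arbitrary illustrations; production systems usually chunk on token counts rather than characters.

```python
def chunk_text(text, size=200, overlap=50):
    """Split text into fixed-size character windows with the given overlap."""
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk:
            chunks.append(chunk)
    return chunks

text = ("word " * 100).strip()  # stand-in for extracted PDF text
chunks = chunk_text(text, size=120, overlap=30)
print(f"{len(chunks)} chunks, first chunk {len(chunks[0])} characters")
```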
Common Pitfalls
Developers new to PDF extraction often assume that all PDFs behave the same way. In reality, documents vary widely in structure and quality, and a pipeline that works well for one dataset may fail on another.
It is always worth testing extraction using a mix of documents, including scanned files, multi-column layouts, and large reports. Problems usually appear quickly under realistic conditions.
Another common mistake is ignoring metadata. Information such as page numbers, titles, and document structure often becomes critical later, especially when building retrieval systems that need to cite sources.
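One lightweight way to keep that information is to carry it alongside each piece of text so a retrieval system can cite its source later. A minimal sketch, assuming you have per-page text available:

```python
# Sketch: attach source metadata (file name, page number) to each page's text
# so downstream retrieval can cite where an answer came from.

def records_with_sources(pages, source):
    """`pages` is a list of per-page text strings; returns citable records."""
    records = []
    for page_num, text in enumerate(pages, start=1):
        records.append({
            "text": text,
            "source": source,
            "page": page_num,
        })
    return records

pages = ["Intro text.", "Methods text."]
for rec in records_with_sources(pages, "report.pdf"):
    print(f"{rec['source']} p.{rec['page']}: {rec['text']}")
```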
Final Thoughts
Extracting text from PDFs in Python is easier than it was a few years ago, but the fundamental challenges of document structure and layout remain. Choosing tools that handle these complexities well can significantly improve the quality of downstream systems, from search to RAG to analytics. Once the ingestion layer is reliable, the rest of the pipeline becomes far easier to design and maintain.