Haystack already has converters for PDFs, for DOCX, for HTML. If you're building a RAG pipeline, you've probably used at least two of them. But if you've ever tried to build an indexing pipeline that handles everything a user might throw at it (PDFs, scanned invoices, spreadsheets, PowerPoint decks, images, archives), you know the pain. You end up wiring together three or four different converters, each with its own quirks, its own failure modes, and its own gaps in format coverage.
KreuzbergConverter is now merged into haystack-core-integrations. It's a single component that extracts text from 91+ file formats, runs OCR on scanned documents, preserves table structure, and does all of it locally. No API keys, no per-page billing, no files leaving your infrastructure.
Here's how it works.
Why document extraction breaks most RAG pipelines
The first step of any RAG pipeline is text extraction, and it determines the ceiling for everything downstream. If your extractor drops table data, your LLM can't answer questions about pricing tables. If it mangles OCR output, your embeddings are unusable. If it fails on a PPTX file, that document just doesn't exist in your knowledge base.
Flawed document extraction is consistently identified as the primary bottleneck for RAG pipelines, establishing a performance ceiling that even the most optimized retrieval strategies, chunking methods, or LLM selections cannot break through.
The setup you're replacing
If you're processing mixed file types in Haystack, your indexing pipeline probably looks something like this:
```python
from haystack import Pipeline
from haystack.components.converters import DOCXToDocument, HTMLToDocument, PyPDFToDocument
from haystack.components.joiners import DocumentJoiner
from haystack.components.routers import FileTypeRouter

pipeline = Pipeline()
pipeline.add_component("router", FileTypeRouter(
    mime_types=[
        "application/pdf",
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "text/html",
    ]
))
pipeline.add_component("pdf_converter", PyPDFToDocument())
pipeline.add_component("docx_converter", DOCXToDocument())
pipeline.add_component("html_converter", HTMLToDocument())
pipeline.add_component("joiner", DocumentJoiner())
pipeline.connect("router.application/pdf", "pdf_converter")
pipeline.connect("router.application/vnd.openxmlformats-officedocument.wordprocessingml.document", "docx_converter")
pipeline.connect("router.text/html", "html_converter")
pipeline.connect("pdf_converter", "joiner")
pipeline.connect("docx_converter", "joiner")
pipeline.connect("html_converter", "joiner")
```
Six components wired together, and you still can't handle XLSX, PPTX, images, or scanned PDFs. Every new format means adding another converter and another route.
With KreuzbergConverter, that entire setup becomes:
```python
from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.converters.kreuzberg import KreuzbergConverter

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("converter", KreuzbergConverter())
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("converter", "writer")

pipeline.run({"converter": {"sources": ["contract.pdf", "invoice.png", "report.xlsx"]}})
```
One component. Format detection is automatic: Kreuzberg identifies the MIME type and routes internally, so you don't write conditional logic for different file types.
What the output looks like
Here's what you get back when you process a mixed batch. Each file produces a Haystack Document with extracted content and rich metadata:
```python
converter = KreuzbergConverter()
result = converter.run(sources=["quarterly_report.pdf", "scanned_receipt.png", "budget.xlsx"])

for doc in result["documents"]:
    print(f"Source: {doc.meta['file_path']}")
    print(f"Type: {doc.meta['mime_type']}")
    print(f"Languages: {doc.meta['detected_languages']}")
    print(f"Quality: {doc.meta['quality_score']}")
    print(f"Content preview: {doc.content[:100]}")
```
```text
Source: quarterly_report.pdf
Type: application/pdf
Languages: ['en']
Quality: 0.95
Content preview: Q3 2025 Financial Results\n\nRevenue grew 12% year-over-year...
---
Source: scanned_receipt.png
Type: image/png
Languages: ['en']
Quality: 0.72
Content preview: STORE #4421\n123 Main Street\nTotal: $47.83...
---
Source: budget.xlsx
Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Languages: ['en']
Quality: 0.98
Content preview: | Department | Q1 Budget | Q1 Actual | Variance |...
```
Notice the quality scores. Clean PDF: 0.95. Scanned receipt: 0.72, still usable, but the converter is telling you the extraction was less confident. Spreadsheet with native digital data: 0.98. You can filter on this before embedding; there's no point polluting your vector store with garbage text from a barely readable fax.
Three output modes that make it better
Unified (default): One Document per file. All content merged. Drop this into a simple search pipeline and move on.
Per-page: One Document per page, with page numbers and blank-page detection in metadata. Useful when you need page-level precision: you can ask "what does page 14 of this contract say?"
Chunked: Text is split into semantic chunks, each with an optional embedding vector. This means you can skip the separate DocumentSplitter step entirely if you want. One component does extraction and chunking in a single pass.
Most Haystack converters give you a blob of text and leave chunking to a downstream component. KreuzbergConverter does both, which means fewer pipeline stages to debug.
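As a sketch of how you'd use the per-page mode downstream, here is a page-level lookup in plain Python. The `page_number` metadata key and the dict-based documents are illustrative stand-ins, not the integration's exact API; check the integration docs for the real key names.

```python
# Sketch: answering "what does page 14 say?" from per-page output.
# Assumes each per-page Document carries a `page_number` metadata key;
# the exact key name may differ - check the integration docs.
pages = [
    {"content": "Definitions and scope ...", "meta": {"page_number": 13}},
    {"content": "Termination clause: either party may ...", "meta": {"page_number": 14}},
    {"content": "Signatures ...", "meta": {"page_number": 15}},
]

def page_text(docs, n):
    """Return the text of page n, or None if that page wasn't extracted."""
    for d in docs:
        if d["meta"].get("page_number") == n:
            return d["content"]
    return None

print(page_text(pages, 14))
```

The same lookup is impossible in unified mode, where page boundaries are merged away; that's the whole tradeoff between the two modes.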
OCR that works on documents
When the converter hits a scanned PDF or an image file, it runs OCR automatically. Three backends are supported:
- Tesseract (default): general-purpose, comes pre-installed on most systems
- EasyOCR: better for handwriting and non-Latin scripts, GPU-accelerated
- PaddleOCR: high-volume workloads, 80+ languages, PP-OCRv5 support
Before running OCR, the converter preprocesses images: DPI adjustment, rotation, deskewing, denoising, contrast enhancement, binarization. This happens automatically. If you've ever dealt with scanned faxes or photos of receipts in a production pipeline, you know that raw OCR on unprocessed images produces unusable text. The preprocessing step is what makes the difference between "sort of works" and "it really works."
Tables come through as structured data
Tables extracted from documents are converted to structured Markdown: the row/column structure is preserved rather than flattened into a wall of text. For Markdown and HTML output, tables are inlined where they appeared in the original document.
Financial figures, product specifications, pricing tiers, compliance checklists: a huge amount of high-value enterprise data lives in tables. Lose the structure during extraction and your LLM has no way to reconstruct it.
Metadata you can use
Every Document the converter produces comes with metadata that's useful for downstream filtering:
- mime_type - actual detected format
- detected_languages - automatic language detection
- quality_score - extraction confidence (0.0 to 1.0)
- page_count, image_count
- annotations - PDF comments and highlights
- processing_warnings - anything that went wrong during extraction
The quality_score field is worth calling out. You can use it to filter before embedding: skip documents below 0.7, for example, instead of polluting your vector store with garbage text from a badly scanned document. The language detection lets you route documents to language-specific models or prompts downstream.
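That filtering step can be sketched in a few lines of plain Python. The `Doc` dataclass below is a minimal stand-in for Haystack's `Document` so the example is self-contained; the `quality_score` key is the one the converter writes into metadata.

```python
from dataclasses import dataclass, field

# Minimal stand-in for haystack.Document, just enough to show the idea.
@dataclass
class Doc:
    content: str
    meta: dict = field(default_factory=dict)

def filter_by_quality(docs, threshold=0.7):
    """Keep only documents whose extraction confidence meets the threshold."""
    return [d for d in docs if d.meta.get("quality_score", 0.0) >= threshold]

docs = [
    Doc("Q3 2025 Financial Results ...", {"quality_score": 0.95}),
    Doc("ST0RE #44Z1 ...", {"quality_score": 0.41}),   # barely readable scan
    Doc("| Department | Q1 Budget | ...", {"quality_score": 0.98}),
]

kept = filter_by_quality(docs)
print(len(kept))  # 2: the 0.41 scan is dropped before embedding
```

Running this between the converter and the embedder keeps low-confidence extractions out of the vector store entirely.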
Parallel batch processing
KreuzbergConverter processes multiple files using a Rust-powered thread pool (rayon). This is not Python-level parallelism limited by the GIL, it's real multi-threaded processing at the system level.
Single file? Processed directly, no overhead. Multiple files? Automatically parallelized across your CPU cores. According to the Kreuzberg project's benchmarks, the library processes 35+ files per second on CPU hardware.
Indexing thousands of documents is a common enterprise scenario, and there this is the difference between a pipeline run that takes minutes and one that takes hours.
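A quick back-of-the-envelope using the 35 files/second benchmark figure quoted above:

```python
# Rough throughput estimate based on the project's ~35 files/sec benchmark.
# Real numbers depend heavily on file sizes and how much OCR is involved.
files = 10_000
files_per_second = 35

seconds = files / files_per_second
print(f"{seconds / 60:.1f} minutes")  # roughly 4.8 minutes for 10k files
```

At one file per second, the same batch would take close to three hours, which is where the minutes-versus-hours framing comes from.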
Proper error handling
If one file in a batch is corrupted or unreadable, the converter logs a warning and moves on. You still get Documents from everything else. This sounds obvious, but in production pipelines processing thousands of files, one bad PDF bringing down the entire indexing job is an operational problem. KreuzbergConverter treats it as a warning, not an exception.
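The same skip-and-warn pattern is easy to reproduce around any extraction call if you want explicit control. This is a plain-Python sketch; `extract` is a hypothetical stand-in, not a Kreuzberg API.

```python
import logging

logger = logging.getLogger("indexing")

def extract(path):
    # Hypothetical stand-in for a real extraction call.
    if path.endswith(".corrupt"):
        raise ValueError(f"unreadable file: {path}")
    return f"text of {path}"

def convert_batch(paths):
    """Convert what we can; log a warning and skip anything that fails."""
    texts = []
    for path in paths:
        try:
            texts.append(extract(path))
        except Exception as exc:
            logger.warning("Skipping %s: %s", path, exc)
    return texts

texts = convert_batch(["a.pdf", "b.corrupt", "c.xlsx"])
print(len(texts))  # 2: the corrupt file is skipped, not fatal
```

The point is the control flow: one bad input produces a log line, not a dead indexing job.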
Where it sits in a pipeline
A typical Haystack RAG indexing pipeline with KreuzbergConverter is just converter → embedder → writer; in chunked mode the converter also covers the splitting step, so there's no separate DocumentSplitter stage.
KreuzbergConverter is fully serializable: you can save your pipeline as YAML or JSON, version-control it, and deploy it across environments. This matters for DevOps teams managing production pipelines.
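As a rough sketch, the serialized YAML for the minimal converter-plus-writer pipeline would look something like the fragment below. The structure follows Haystack's `Pipeline.dumps()` format, but the exact component paths and parameters here are illustrative; generate the real file from your own pipeline rather than writing it by hand.

```yaml
components:
  converter:
    type: haystack_integrations.components.converters.kreuzberg.KreuzbergConverter
    init_parameters: {}
  writer:
    type: haystack.components.writers.document_writer.DocumentWriter
    init_parameters:
      document_store:
        type: haystack.document_stores.in_memory.document_store.InMemoryDocumentStore
        init_parameters: {}
connections:
  - sender: converter.documents
    receiver: writer.documents
```

A file like this can live in version control next to your application code, and `Pipeline.loads()` reconstructs the pipeline in any environment where the integration is installed.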
Why not just use the cloud APIs?
AWS Textract, Google Document AI, and Azure Document Intelligence all handle document extraction, and they're usually accurate. They're also expensive, and they require sending your files to someone else's servers.
The per-page costs add up fast:
| Service | Basic OCR (per 1K pages) | Table extraction (per 1K pages) |
|---|---|---|
| AWS Textract | $1.50 | $15.00 |
| Google Document AI | $1.50 | $30.00 |
| Azure Document Intelligence | $1.50 | $30.00 |
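To make the table concrete, here is a rough cost sketch for a mid-sized corpus, using the table-extraction prices listed above:

```python
# Rough cost of table extraction for 100,000 pages,
# using the per-1,000-page prices from the table above.
pages = 100_000
price_per_1k = {
    "AWS Textract": 15.00,
    "Google Document AI": 30.00,
    "Azure Document Intelligence": 30.00,
}

for service, price in price_per_1k.items():
    cost = pages / 1_000 * price
    print(f"{service}: ${cost:,.0f}")
# AWS Textract: $1,500
# Google Document AI: $3,000
# Azure Document Intelligence: $3,000
```

And that's for a single indexing pass; re-indexing after a chunking or model change pays the same bill again.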
KreuzbergConverter runs on your own hardware. The API cost is $0.
Beyond cost, there's the compliance question. If you're in healthcare, legal, or finance, sending contracts and medical records to a third-party API might not be an option. Local processing means your documents never leave your infrastructure.
How it compares to other Haystack converters
Haystack already has converters: PyPDFToDocument, DOCXToDocument, HTMLToDocument. There are also integrations for Docling, Unstructured, and MarkItDown.
The tradeoffs:
PyPDFToDocument / DOCXToDocument / HTMLToDocument: Built-in, lightweight, reliable for their specific format. But each handles one format. If you're processing mixed file types, you need a router component that picks the right converter for each file. KreuzbergConverter replaces that entire pattern with one component.
Unstructured: Powerful, but cloud-dependent (the free tier has limitations, the full API is paid). The OSS version has 54 dependencies and a 146 MB install footprint.
Docling (IBM): Good for structured document understanding with a deep learning approach. But it's 1 GB+ installed and 88 dependencies. It's a heavy tool for a heavy job.
KreuzbergConverter: 71 MB installed, 20 dependencies, ELv2 licensed, runs locally. It won't match Docling's deep layout understanding on complex research papers. Still, for the 90% case (extracting clean text and tables from everyday business documents), it's faster, lighter, and doesn't require a GPU.
The other thing worth mentioning: PyMuPDF, which several tools depend on, uses an AGPL-3.0 license. If your application distributes PyMuPDF and isn't open-source, you need a commercial license from Artifex. Kreuzberg is MIT and has no restrictions on commercial use.
Getting started
Install:
```shell
pip install kreuzberg-haystack
```
Minimal pipeline:
```python
from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.converters.kreuzberg import KreuzbergConverter

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("converter", KreuzbergConverter())
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("converter", "writer")

pipeline.run({"converter": {"sources": ["report.pdf", "data.xlsx", "scan.png"]}})
```
If you're running Haystack in production with mixed file types, we suggest trying KreuzbergConverter. It's a single component that ensures your documents stay on your local infrastructure while reducing API costs to zero.
Links
- Kreuzberg GitHub: github.com/kreuzberg-dev/kreuzberg
- Haystack integration docs: docs.haystack.deepset.ai/docs/kreuzbergconverter
We're launching Kreuzberg Cloud soon. Join the waitlist for early access here: https://kreuzberg.dev/