Document Structure Extraction with Kreuzberg

Extracting structured data from PDFs is one of the hardest problems in AI infrastructure. Most tools give you a text dump but no headings, no table boundaries, no distinction between a caption and a footnote. When Docling launched, it changed the game with a genuinely good layout model.

We want to be clear: Docling is a great project, and we have the greatest respect for the team at IBM for putting it out there. It's also fully open-source under a permissive Apache-2.0 license. We integrated their model into Kreuzberg and embedded it into a Rust-native pipeline, where it runs 2.8× faster with a fraction of the memory footprint.

This post covers the behind-the-scenes part: what we used, what we rebuilt from scratch, and where the speed comes from.

Why Document Structure Matters for AI and RAG Pipelines
If you’re building AI infrastructure like RAG pipelines, document processing workflows, or any AI application that ingests PDFs at scale, flat text extraction isn’t enough anymore.

Consider what happens when you feed an LLM a PDF that’s been extracted as a single blob of text. The model can’t distinguish a section heading from body text. It can’t tell if a number belongs to a table cell or a footnote. It merges multi-column layouts into nonsense. The retrieval quality of your entire pipeline degrades because the source data has no structure.

Docling, IBM’s open-source document understanding library, addressed this head-on. Their RT-DETR v2 layout model (called Docling Heron) classifies 17 different document element types: headings, paragraphs, tables, figures, captions, page headers, footers, and more. It produces a structural representation that downstream systems can actually work with.

The model is excellent. The issue lies in what’s around it.

Docling is a Python library built on deep learning inference. Model loading takes time. Processing is sequential. Memory usage scales with document complexity. For a single document or a research prototype, that’s fine. For thousands of documents in a production pipeline, especially if your stack isn’t Python, it starts to matter. That’s the gap we set out to close.

The Foundation
Starting with Kreuzberg v4.5.0, we integrated Docling's RT-DETR v2 layout model directly into our Rust-native pipeline. The model is open-source under Apache-2.0, and we want to be transparent about its use. Docling's team built something excellent. But the model is only one piece of a document extraction system. The inference runtime, the text extraction layer, the page processing strategy, and the table reconstruction pipeline are all our own, written in Rust. The result is a system that uses Docling's layout intelligence but runs it through an entirely different execution engine.

Here’s where the engineering differences live.

Engineering the Pipeline

ONNX Runtime for Layout Inference
The RT-DETR v2 model runs through ONNX Runtime, not Python's PyTorch. There's no Python dependency and no GIL contention, and hardware acceleration (CPU, CUDA, CoreML, TensorRT) is supported natively. All of this is configurable through a typed AccelerationConfig that works across every language binding Kreuzberg supports.

This alone eliminates the cold-start penalty. The ONNX session loads once and stays resident.
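The load-once pattern is easy to sketch. Here `load_layout_session` is a hypothetical stand-in for the real model loader (which in Kreuzberg builds an ONNX Runtime session); the point is that the expensive setup happens exactly once per process:

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def load_layout_session(model_path: str):
    # Expensive one-time setup. In Kreuzberg this is an ONNX Runtime
    # session; a plain dict stands in for it in this sketch.
    return {"model_path": model_path, "ready": True}

# First call pays the loading cost; every later call reuses the
# cached, resident session.
s1 = load_layout_session("layout.onnx")
s2 = load_layout_session("layout.onnx")
assert s1 is s2  # same resident session, no cold start
```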

Parallel Page Processing
Layout inference processes page batches in a single session.run() call. SLANet-Plus (the table structure recognition model) and layout inference both run in parallel using thread-local model instances and Rayon workers. Each page is processed independently and released after extraction, keeping memory usage flat even on 500-page documents.
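The per-page parallelism described above (Rayon in the Rust core) can be sketched with Python's standard library. `classify_page` is a hypothetical per-page step, and `threading.local` stands in for the thread-local model instances:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_tls = threading.local()

def get_model():
    # One model instance per worker thread, created lazily on first use.
    if not hasattr(_tls, "model"):
        _tls.model = object()  # placeholder for a real model handle
    return _tls.model

def classify_page(page_number: int) -> tuple[int, str]:
    get_model()  # each worker reuses its own thread-local instance
    # Placeholder work; a real pipeline runs layout inference here,
    # then releases the page's buffers to keep memory flat.
    return page_number, "processed"

with ThreadPoolExecutor(max_workers=4) as pool:
    # map preserves input order even though pages run concurrently
    results = list(pool.map(classify_page, range(8)))

assert [n for n, _ in results] == list(range(8))
```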

Docling processes pages sequentially through Python. Kreuzberg processes them concurrently through Rust. On a 100-page PDF, that difference compounds fast.

Native PDF Text Extraction via PDFium
This is where most of the quality gains come from, and it’s the biggest architectural divergence from Docling.

Instead of relying on the layout model’s pipeline to also handle text extraction, Kreuzberg reads text directly from the PDF’s native text layer using PDFium’s character-level API. This preserves exact character positions, font metadata (bold, italic, size), and Unicode encoding. The layout model then classifies and organizes this high-fidelity text according to the document’s visual structure.

The distinction matters because Docling’s pipeline treats the rendered page image as the primary input for both layout detection and text extraction. Kreuzberg uses the page image only for layout detection, then pulls text from the PDF’s native layer. You get neural-network-quality structure classification with lossless text fidelity.
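The rebuild step can be illustrated with a toy sketch: given characters with baseline coordinates, group them into lines by y position and order each line left to right. The character tuples below are invented sample data, not actual PDFium output:

```python
from itertools import groupby

# (glyph, x, y_baseline) tuples: invented sample data standing in
# for PDFium's character-level records.
chars = [
    ("W", 10, 700), ("o", 18, 700), ("r", 26, 700), ("d", 34, 700),
    ("n", 12, 680), ("e", 20, 680), ("x", 28, 680), ("t", 36, 680),
]

# Sort top-of-page first (descending y), then left-to-right within a line.
chars.sort(key=lambda c: (-c[2], c[1]))
lines = [
    "".join(glyph for glyph, _, _ in grp)
    for _, grp in groupby(chars, key=lambda c: c[2])
]
print(lines)  # ['Word', 'next']
```

A production version also has to handle rotated text, sub/superscripts, and lines that share a baseline across columns, but the core grouping idea is the same.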

Structure Tree Integration
When a PDF contains a tagged structure tree (common in PDF/A and accessibility-compliant documents), Kreuzberg uses the author’s original paragraph boundaries and heading hierarchy, then applies layout model predictions as classification overrides. The structure tree gives you the author’s intent; the layout model gives you visual classification. Combining both produces better results than either alone.
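One way to picture the combination is as a merge: start from the structure tree's author-supplied label and let a sufficiently confident model prediction override it. The function name and the 0.8 threshold are illustrative assumptions, not Kreuzberg's actual policy:

```python
def merge_labels(tree_label: str, model_label: str, model_score: float,
                 threshold: float = 0.8) -> str:
    # Keep the author's intent from the tagged structure tree unless
    # the layout model is confident enough to override the classification.
    return model_label if model_score >= threshold else tree_label

assert merge_labels("paragraph", "caption", 0.95) == "caption"
assert merge_labels("heading", "paragraph", 0.40) == "heading"
```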

Fixing Edge Cases in PDFs
The single biggest quality improvement came not from the layout model integration, but from rewriting how we extract text from PDFs at the character level.

Before v4.5.0, Kreuzberg used PdfiumParagraph::from_objects(), a paragraph-level extraction approach that relied on PDFium's built-in text segmentation. It worked on clean documents but broke down on anything with non-standard font matrices, complex column layouts, or broken CMap encodings. And PDFs are full of exactly these problems.

We replaced it with per-character text extraction using PDFium’s PdfPageText::chars() API. Every character is read individually with its exact position, font size, and baseline coordinates. From there, we rebuild the text structure ourselves.

This unlocked a chain of fixes that would have been impossible at the paragraph level:

Broken font metrics. Many PDFs report incorrect font sizes due to font matrix scaling. PDFium might say font_size=1 when the rendered text is clearly 12pt. Our old 4pt minimum filter would silently drop all content from these pages. Now, when the filter would remove everything, it's skipped automatically. The same logic applies to margin filtering: when it would remove all text on a page (PDFs with baseline values outside expected bands), the filter falls back gracefully.
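The fallback logic is simple to sketch: apply a filter, and if it would drop everything, treat that as a sign the page's metrics are broken and skip it:

```python
def filter_with_fallback(items, predicate):
    # Apply the filter; if it removes every item (a sign the page's
    # reported metrics are bogus), fall back to the unfiltered input.
    kept = [item for item in items if predicate(item)]
    return kept if kept else items

# A page whose reported font sizes are all broken (e.g. font_size=1):
broken_page = [1.0, 1.0, 1.0]
assert filter_with_fallback(broken_page, lambda s: s >= 4.0) == broken_page

# A normal page still gets the filter's effect:
assert filter_with_fallback([12.0, 2.0], lambda s: s >= 4.0) == [12.0]
```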

Ligature corruption. LaTeX-generated PDFs with broken ToUnicode CMaps produce garbled text: different becomes di!erent, offices becomes o”ces. We repair these inline during character iteration using a vowel/consonant heuristic to disambiguate ambiguous ligature mappings. Fixing this during extraction rather than as a post-processing pass improved both accuracy and performance.
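One way to picture the disambiguation: a single corrupted glyph could stand for several ligatures (ff, fi, ffi, ...), and the letter that follows it hints at which one. If a vowel follows, a ligature ending in "f" is likely ("di!erent" → "different"); if a consonant follows, one ending in "i" is likely ("o”ces" → "offices"). This toy version is our illustration of the idea, not Kreuzberg's actual repair code:

```python
VOWELS = set("aeiou")

def repair_ligature(word: str, bad: str) -> str:
    # Choose a replacement ligature from the letter after the broken
    # glyph: vowel next -> ligature likely ends in 'f' ("ff"),
    # consonant next -> ligature likely ends in 'i' ("ffi").
    i = word.find(bad)
    if i == -1 or i + len(bad) >= len(word):
        return word
    nxt = word[i + len(bad)]
    lig = "ff" if nxt in VOWELS else "ffi"
    return word.replace(bad, lig)

assert repair_ligature("di!erent", "!") == "different"
assert repair_ligature("o\u201dces", "\u201d") == "offices"
```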

Word spacing artifacts. PDFium sometimes inserts spurious spaces mid-word — shall be active becomes s hall a b e active. Pages with detected broken spacing are re-extracted using character-level gap analysis (font_size × 0.33 threshold). Clean pages use the fast single-call path. On the ISO 21111–10 test document, this reduced garbled lines from 406 to zero.
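The gap test translates almost directly into code: walk the character boxes in reading order and emit a space only when the gap to the next character exceeds font_size × 0.33. The character tuples here are invented sample data:

```python
def rebuild_word(chars, font_size: float) -> str:
    # chars: (glyph, x_start, x_end) tuples, already in reading order.
    # A space is real only if the inter-character gap exceeds the
    # threshold; PDFium's spurious mid-word spaces fall below it.
    threshold = font_size * 0.33
    out, prev_end = [], None
    for glyph, x0, x1 in chars:
        if prev_end is not None and (x0 - prev_end) > threshold:
            out.append(" ")
        out.append(glyph)
        prev_end = x1
    return "".join(out)

# "s hall" with spurious mid-word spacing collapses back to "shall":
chars = [("s", 0, 5), ("h", 6, 11), ("a", 12, 17), ("l", 18, 21), ("l", 22, 25)]
assert rebuild_word(chars, font_size=12.0) == "shall"

# A genuinely large gap (5pt > 12 * 0.33 = 3.96pt) still yields a space:
assert rebuild_word(chars + [("b", 30, 35)], font_size=12.0) == "shall b"
```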

Multi-column reading order. Federal Register-style multi-column PDFs jumped from 69.9% to 90.7% F1 after switching to PDFium’s text API, which naturally handles column reading order without us needing to implement column detection heuristics.

The final result: Kreuzberg’s PDF markdown extraction hit 91.0% average F1 across 16 test PDFs, compared to Docling’s 91.4%. Effectively at parity, while being 10–50× faster.

How Table Extraction Works
Table extraction runs in two stages.

First, the RT-DETR v2 layout model identifies table regions on the page image. Then, Kreuzberg crops each detected region and runs SLANet-Plus, a specialized model that predicts internal table structure: rows, columns, cells, including colspan and rowspan.

The predicted cell grid is matched against native PDF text positions to reconstruct accurate markdown tables. This hybrid approach (neural structure prediction plus native text extraction) avoids the OCR-like quality loss you get when working only with rendered page images.

We also tightened the detection heuristics. Table detection now requires at least 3 aligned columns, which eliminates false positives from two-column text layouts like academic papers and newsletters. Post-processing rejects tables with 2 or fewer columns, tables where more than 50% of cells contain long text, or tables with an average cell length above 50 characters. These rules cut false positive detections significantly without hurting recall.
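Those rejection rules translate almost directly into code. This sketch applies them to a cell grid (a list of rows of strings); the thresholds come from the text above, except the definition of a "long" cell as over 50 characters, which is our assumption:

```python
def is_plausible_table(rows: list[list[str]]) -> bool:
    # Rule 1: require at least 3 columns (rejects two-column text layouts).
    if not rows or len(rows[0]) < 3:
        return False
    cells = [cell for row in rows for cell in row]
    # Rule 2: reject grids where most cells hold long text (likely prose).
    long_cells = sum(1 for cell in cells if len(cell) > 50)
    if long_cells / len(cells) > 0.5:
        return False
    # Rule 3: reject grids whose average cell length exceeds 50 characters.
    avg_len = sum(len(cell) for cell in cells) / len(cells)
    return avg_len <= 50

assert is_plausible_table([["Name", "Qty", "Price"], ["Widget", "2", "9.99"]])
assert not is_plausible_table([["left column text", "right column text"]])
```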

Benchmarks: How We Measured This
We benchmarked Kreuzberg against Docling on 171 PDF documents spanning academic papers, government and legal documents, invoices, OCR scans, and edge cases. F1 score is the harmonic mean of precision and recall: how much of the expected content was correctly extracted, and how much of what was extracted was actually correct.
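Concretely, with precision and recall in hand:

```python
def f1(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall; zero if both are zero.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. 90% of extracted content correct, 88% of expected content recovered:
score = f1(0.90, 0.88)
assert abs(score - 0.8899) < 0.001
```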

| Metric | Kreuzberg | Docling |
| --- | --- | --- |
| Structure F1 | 42.1% | 41.7% |
| Text F1 | 88.9% | 86.7% |
| Avg. processing time | 1,032 ms/doc | 2,894 ms/doc |

Structure F1 measures how accurately document elements such as headings, paragraphs, and tables are detected. The 2.8× speed advantage comes from four angles: Rust's native memory management, PDFium character-level text extraction (no Python overhead), ONNX Runtime inference (no PyTorch), and Rayon parallelism across pages.

In broader benchmarks, we compared Kreuzberg to Apache Tika, Docling, MarkItDown, Unstructured.io, PDFPlumber, MinerU, MuPDF4LLM, and more. There, you can see Kreuzberg is substantially faster on average, with much lower memory usage and a smaller installation footprint. The Docker image is around 1–1.3GB versus Docling’s 1GB+ Python installation before you even add your application code.

What This Means for Your Stack
Already using Docling and happy with the quality? You’ll get equivalent extraction accuracy from Kreuzberg with less infrastructure overhead. The layout model is the same, the execution is faster, and the memory is lower.

Running a polyglot stack? If your backend is Rust, Ruby, Java, Go, PHP, Elixir, C#, R, C, TypeScript (Node/Bun/Wasm/Deno), Kreuzberg gives you the same layout detection capabilities without wrapping a Python service behind an HTTP endpoint. Native bindings for 12 languages, same Rust core underneath.

Processing at scale? The combination of parallel page processing, native text extraction, and efficient ONNX inference means significantly higher document throughput on the same hardware. No GPU required for layout detection; CPU inference is fast enough for most production workloads.

Layout detection is available across all 12 language bindings, the CLI, the REST API, and the MCP server. Models auto-download from HuggingFace on first use and are cached locally.

Get Started

```shell
# CLI
kreuzberg extract document.pdf --layout-detection
```

```python
# Python
from kreuzberg import extract_file, ExtractionConfig

result = await extract_file("document.pdf", ExtractionConfig(
    layout_detection=True,
    output_format="markdown",
))
```

Document structure extraction is becoming table stakes for production AI pipelines. Modern AI systems depend on structured document data, and the faster you can extract it, the more scalable your pipeline becomes.

We’re grateful to the Docling team at IBM for the truly great foundation they’ve provided. If you’re running Docling in production today, try Kreuzberg against it on your actual documents and let us know what you think.
