TI for Kreuzberg

Kreuzberg vs. Unstructured.io: Benchmarks and Architecture Comparison (March 2026)

When building document pipelines, the choice of extraction library directly impacts performance, infrastructure costs, and reliability. Two tools that developers compare are Kreuzberg and Unstructured.io. Both can extract text and metadata from documents, but they differ significantly in architecture and behavior under load. The purpose of this comparison is not to declare a universal winner but to explain where the differences come from and when they matter in practice.

The Core Difference: Architecture

The most important distinction between the two systems is architectural. Unstructured.io is primarily Python-based and designed around flexible pipelines and integrations, which makes its open source library convenient for rapid prototyping and experimentation in Python-centric environments.

Kreuzberg takes a different approach. Its extraction engine is implemented in Rust and exposed through bindings to multiple languages. In March 2026, Kreuzberg supports 12 programming languages: Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, WASM, and TypeScript/Node.js. This design moves performance-critical work into compiled native code while keeping the developer experience accessible in Python and other stacks.

This is important for performance because compiled native code runs directly on the CPU without an interpreter or runtime in between, unlike Python, which adds an extra layer of execution overhead (bytecode interpretation, dynamic typing, and runtime dispatch).
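You can see this overhead in miniature without Rust in the picture at all: the same summation runs far faster when the loop executes inside CPython's compiled C internals instead of as interpreted bytecode. This is an illustrative micro-benchmark of interpreter overhead, not a Kreuzberg benchmark:

```python
import time

# Illustrative micro-benchmark of interpreter overhead: the same summation
# executed as an interpreted Python loop vs. the C-implemented built-in sum().
data = list(range(1_000_000))

start = time.perf_counter()
total = 0
for x in data:             # bytecode dispatch + boxed-int arithmetic per iteration
    total += x
loop_time = time.perf_counter() - start

start = time.perf_counter()
total_builtin = sum(data)  # the loop body runs in compiled C inside CPython
builtin_time = time.perf_counter() - start

print(f"interpreted loop: {loop_time * 1000:.1f} ms")
print(f"built-in sum:     {builtin_time * 1000:.1f} ms")
```

A native extraction engine gets this kind of saving on every hot path, not just on summation.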

In practical terms, this architectural difference affects:

  • how memory is allocated and reused (e.g. predictable allocation patterns vs Python object overhead)
  • how concurrency is implemented (native threads vs multiprocessing or async workarounds)
  • how much work happens across process boundaries (serialization/deserialization costs)
  • and how much runtime initialization is required before processing begins

These factors become increasingly noticeable when processing large batches of files or running pipelines continuously in production.
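The process-boundary point is easy to underestimate. Here is a sketch of the serialization tax that multiprocessing-style pipelines pay on every hand-off between parent and worker processes; the payload is hypothetical stand-in data, not Kreuzberg output:

```python
import pickle
import time

# Hypothetical payload standing in for one extracted document; a
# multiprocessing-based pipeline serializes data like this on every
# hop between the parent process and its workers.
payload = {"text": "lorem ipsum " * 50_000, "pages": list(range(200))}

start = time.perf_counter()
blob = pickle.dumps(payload)   # cost of sending work to a worker process
restored = pickle.loads(blob)  # cost of receiving the result back
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"round-trip serialization: {elapsed_ms:.2f} ms for {len(blob):,} bytes")
```

A native engine sharing memory across threads skips this round trip entirely.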

What the Benchmarks Measure

Document-processing benchmarks are meaningful only when they measure more than raw speed. Throughput, latency percentiles, memory usage, installation size, cold start time, and extraction reliability all contribute to real-world performance.
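For instance, latency percentiles and mean throughput can be derived from per-document timings like this (the numbers below are illustrative, not published benchmark results):

```python
import statistics

# Illustrative per-document extraction latencies in milliseconds
# (made-up numbers, not measured benchmark data).
latencies_ms = [12, 14, 13, 15, 11, 90, 13, 12, 16, 14, 13, 220]

qs = statistics.quantiles(latencies_ms, n=100)  # 99 percentile cut points
p50, p95, p99 = qs[49], qs[94], qs[98]
throughput = len(latencies_ms) / (sum(latencies_ms) / 1000)  # docs per second

print(f"p50={p50:.1f} ms  p95={p95:.1f} ms  p99={p99:.1f} ms")
print(f"mean throughput ~ {throughput:.1f} docs/s")
```

Note how two slow outliers barely move the median but dominate the tail percentiles, which is exactly why p95/p99 matter more than averages in production.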

Kreuzberg’s published benchmarks use reproducible tests and real-world datasets containing PDFs, Office documents, images, and multilingual text. The benchmark harness runs continuously and is designed to isolate extraction performance from external bottlenecks (e.g. network or I/O), so results reflect the behavior of the extraction engine itself.

This is important because pipelines often behave very differently depending on file type and layout complexity.

Throughput in Practice

In comparative benchmarks, Kreuzberg has demonstrated significantly higher throughput than many Python-only pipelines, in some cases processing documents roughly 9–50x faster on average across tested workloads. Results vary depending on document type and configuration, but the pattern is consistent: architecture matters.


Snapshot (2 March 2026) of the Kreuzberg Rust (PDF) benchmarks, showing Duration and Quality Score. See current benchmarks at https://kreuzberg.dev/benchmarks

These differences arise from several factors:

  • Native parsing avoids interpreter overhead and reduces per-operation cost
  • Tight loops and parsing routines can be optimized at compile time
  • Memory locality and cache efficiency are improved in compiled code
  • Parallel execution can be implemented without the constraints of the Python GIL

When processing large datasets, these advantages compound. Even small per-document overhead differences (e.g. a few milliseconds) can translate into minutes or hours of total runtime at scale.
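A quick back-of-the-envelope check of that compounding effect (the 3 ms overhead figure is hypothetical):

```python
# Back-of-the-envelope: per-document overhead compounding at scale.
# 3 ms of extra cost per document (a hypothetical figure) across a
# 10-million-document corpus:
overhead_ms = 3
documents = 10_000_000

extra_seconds = overhead_ms / 1000 * documents
print(f"{overhead_ms} ms/doc x {documents:,} docs = {extra_seconds / 3600:.1f} extra hours")
```

Three milliseconds per document turns into roughly eight extra hours of compute across ten million documents, before any parallelism is applied.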

Installation Size and Operational Considerations

If you’ve ever deployed a large-scale system, you know that another practical difference shows up in installation size and dependency complexity. Benchmarks have shown a smaller installation footprint and fewer dependencies for Kreuzberg compared with heavier Python frameworks like Unstructured.

Python-based pipelines often depend on multiple layers of libraries (parsers, OCR tools, system packages), which increases container size and introduces more potential points of failure or version conflicts. In contrast, Kreuzberg packages much of its functionality into a compiled core, reducing the need for large runtime dependency chains.

While this may seem like a minor detail, it affects container build times, CI pipelines, and cold start latency. In large distributed systems, operational efficiency often matters as much as raw processing speed.

Cold Start and Resource Efficiency

Cold start time becomes particularly important in serverless or autoscaling environments. Systems that rely on large dependency stacks or complex initialization routines may take longer to start, increasing latency and costs.

Native engines with smaller runtime requirements tend to start faster because there is less dynamic initialization (e.g. importing modules, resolving dependencies, initializing interpreters). They also tend to use memory more efficiently due to tighter control over allocation and fewer intermediary objects.
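One rough way to feel the dynamic-initialization cost is to time a fresh interpreter starting up and performing an import. This probes only the runtime layer; real serverless cold starts add container pull and platform initialization on top:

```python
import subprocess
import sys
import time

# Rough cold-start probe: time a fresh interpreter starting up and
# importing a module. Real serverless cold starts add container pull
# and platform initialization on top of this runtime-level cost.
def cold_import_seconds(module: str) -> float:
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", f"import {module}"], check=True)
    return time.perf_counter() - start

light = cold_import_seconds("json")
print(f"fresh interpreter + import json: {light * 1000:.0f} ms")
```

Swap in a dependency-heavy framework module for `json` and the gap widens considerably; a prebuilt native binary pays almost none of this per-start cost.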

Lower memory usage also allows higher parallelism and better performance on smaller machines, which directly impacts cost efficiency in production environments.

Extraction Quality

Performance alone is not enough. Extraction quality depends on document type, layout complexity, language, and OCR requirements. Some pipelines may perform better on certain classes of documents than others, which is why testing on representative datasets is essential. Benchmarks provide useful signals, but they cannot replace real-world evaluation.

In our own benchmarks (early 2026; see the benchmark run and live dashboard), these tradeoffs show up clearly in practice. Across mixed real-world datasets, including PDFs, Office documents, HTML, and scanned images, we observe that performance and extraction behavior vary significantly by document class.

On clean, text-based PDFs, multiple pipelines achieve high success rates (often >95%) with relatively stable latency. In contrast, on OCR-heavy or layout-complex documents (e.g. scanned pages, tables, multilingual content), both latency and extraction consistency diverge more noticeably.

In our runs, throughput differences between approaches range from ~9× to as high as ~50× depending on the workload, while structured extraction success rates can drop by 10–30 percentage points on harder document types. These effects are often linked to how each system handles layout detection, OCR integration, and post-processing heuristics.

These figures are drawn from our late February 2026 benchmark runs and will evolve, but they consistently reinforce the same point: performance and quality are highly dependent on file type, language, and processing mode (single-file vs batch), which is why evaluating on representative datasets is critical before making production decisions.

Choosing Between the Two

Unstructured.io has built a reputation as a flexible, developer-friendly toolkit for working with unstructured data. Its open-source library can be a strong choice for rapid prototyping, especially in Python-heavy environments where flexibility and ecosystem integration are the primary concerns. For smaller workloads or experimental projects, this can be an excellent fit. For production use at scale, bigger teams rely on Unstructured’s hosted platform, which is tailored more toward enterprise deployments and managed workflows.

The open-source Kreuzberg library tends to be particularly attractive in production ingestion pipelines, where throughput, resource efficiency, and predictable performance become more important. In these environments, architectural differences translate directly into cost savings and faster processing. Many teams and companies already use the open-source library in their workflows.

Kreuzberg Cloud will be a fully managed offering built on the same open core, for teams that need reduced complexity and strong results behind a single API. The library will remain MIT-licensed (permissive open source) forever; the commercial offering is being built around the core library, not by restricting the library itself.

Final Thoughts

Document processing is the foundation of modern AI and search systems. In production environments, pipelines often need to handle millions, or even tens of millions, of documents, where even small inefficiencies compound quickly.

The ingestion layer, including extraction, chunking, and embedding preparation, determines how fast data can be processed, how reliably it can be retrieved, and how much infrastructure is required to operate the system at scale.

Testing with real data remains the most reliable way to decide, and you’re welcome to run Kreuzberg’s newly published comparative benchmarks on GitHub, in batch mode or against a single file. The benchmarks run continuously.

Benchmarks consistently show that architecture plays a major role in performance. Native engines, efficient memory usage, and minimal dependencies often lead to higher throughput and lower operational cost. At the same time, the best tool always depends on the documents being processed and the requirements of the system.
