Posted on Jan 12

Announcing Kreuzberg v4

#opensource #python #rust #ai

We're excited to announce that Kreuzberg v4.0.0 is released!

What is Kreuzberg:
Kreuzberg is a document intelligence library that extracts structured data from 56+ formats, including PDFs, Office docs, HTML, emails, images and many more. It is built for RAG/LLM pipelines and ships with OCR, semantic chunking, embeddings, and metadata extraction.

The new v4 is a ground-up rewrite in Rust with bindings for 9 other languages!

What changed:

Rust core: Significantly faster extraction and lower memory usage. No more Python GIL bottlenecks.
Pandoc is gone: Native Rust parsers for all formats. One less system dependency to manage.
10 supported languages: native Rust plus bindings for Python, TypeScript/Node.js, Java, Go, C#, Ruby, PHP, Elixir, and WASM for browsers. Same API, same behavior, pick your stack.
Plugin system: Register custom document extractors, swap OCR backends (Tesseract, EasyOCR, PaddleOCR), add post-processors for cleaning/normalization, and hook in validators for content verification.
Production-ready: REST API, MCP server, Docker images, async-first throughout.
ML pipeline features: ONNX embeddings on CPU (requires ONNX Runtime 1.22.x), streaming parsers for large docs, batch processing, byte-accurate offsets for chunking.
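
To make the plugin idea above more concrete, here is a minimal sketch of how an extractor registry with post-processors and validators can fit together. Every name in it (`Pipeline`, `register_extractor`, `collapse_whitespace`, `require_non_empty`) is hypothetical and chosen for illustration; this is not Kreuzberg's actual plugin API, so check the docs for the real interfaces.

```python
from typing import Callable, Dict, List

# Hypothetical types for illustration only; not Kreuzberg's real plugin API.
Extractor = Callable[[bytes], str]
PostProcessor = Callable[[str], str]
Validator = Callable[[str], None]  # expected to raise ValueError on bad content


class Pipeline:
    """Extract text, then clean it, then verify it."""

    def __init__(self) -> None:
        self.extractors: Dict[str, Extractor] = {}
        self.post_processors: List[PostProcessor] = []
        self.validators: List[Validator] = []

    def register_extractor(self, mime_type: str, extractor: Extractor) -> None:
        # One extractor per MIME type; a later registration replaces an earlier one.
        self.extractors[mime_type] = extractor

    def run(self, data: bytes, mime_type: str) -> str:
        if mime_type not in self.extractors:
            raise KeyError(f"no extractor registered for {mime_type}")
        text = self.extractors[mime_type](data)
        for post_process in self.post_processors:
            text = post_process(text)
        for validate in self.validators:
            validate(text)
        return text


def collapse_whitespace(text: str) -> str:
    # Post-processor: normalize runs of whitespace to single spaces.
    return " ".join(text.split())


def require_non_empty(text: str) -> None:
    # Validator: reject documents that produced no text at all.
    if not text:
        raise ValueError("extraction produced no text")


pipeline = Pipeline()
pipeline.register_extractor("text/plain", lambda data: data.decode("utf-8"))
pipeline.post_processors.append(collapse_whitespace)
pipeline.validators.append(require_non_empty)

result = pipeline.run(b"Hello,\n\n   world!  ", "text/plain")
print(result)
```

Keeping extraction, cleanup, and validation as separate stages means each one can be swapped without touching the others, which is the property a plugin system is after.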

Why polyglot matters:

Document processing shouldn't force your language choice. Your Python ML pipeline, Go microservice, and TypeScript frontend can all use the same extraction engine with identical results. The Rust core is the single source of truth; bindings are thin wrappers that expose idiomatic APIs for each language.

Why the Rust rewrite:

The Python implementation hit a ceiling, and it also prevented us from offering the library in other languages. Rust gives us predictable performance, lower memory, and a clean path to multi-language support through FFI.

Open-Source License:

Kreuzberg is and will remain MIT-licensed. MIT is one of the most permissive licenses: it allows use, modification, and redistribution of the code, including in proprietary systems, with no copyleft obligations. The only condition is that the copyright and license notice is kept with copies of the software. The codebase, issue tracker, and contribution process remain entirely public.

Document Intelligence Features:

Document intelligence in Kreuzberg extends beyond basic text extraction. In v4, the engine supports:

Text extraction across broad formats: 56+ document formats are supported, including .pdf, .docx, .pptx, .xls, .eml, .msg, and structured XML formats.
Metadata generation: Output includes structural metadata such as page boundaries, section headings, and encoding information.
Chunking strategies: Configurable chunking that respects document structure, enabling finer control over segment sizes for downstream use.
Byte-accurate position tracking: Offsets within extracted text are tracked at the byte level for accurate slicing and reference.
Token reduction methods: Built-in strategies to reduce token counts for model context windows without external preprocessing libraries.
Embeddings support: Optional local embedding generation, using ONNX models, to enable semantic indexing as part of standard pipelines.

These capabilities are exposed through both library APIs and standalone executable components.
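
To show why byte-level offsets matter for chunking, here is a small pure-Python sketch that groups paragraphs into chunks and reports their positions as UTF-8 byte offsets. The function name `chunk_paragraphs`, the `max_bytes` parameter, and the blank-line paragraph rule are assumptions made for this example; it does not reflect how Kreuzberg implements chunking internally.

```python
from typing import List, Tuple


def chunk_paragraphs(text: str, max_bytes: int) -> List[Tuple[int, int]]:
    """Group blank-line-separated paragraphs into chunks.

    Returns (start, end) byte offsets into text.encode("utf-8"). Paragraphs
    are merged while the chunk stays within max_bytes; a single paragraph
    longer than max_bytes becomes its own oversized chunk.
    """
    data = text.encode("utf-8")
    separator = b"\n\n"

    # Locate each paragraph as a byte span. Searching the encoded bytes is
    # safe because 0x0A never appears inside a multi-byte UTF-8 sequence.
    spans: List[Tuple[int, int]] = []
    pos = 0
    while pos <= len(data):
        nxt = data.find(separator, pos)
        end = len(data) if nxt == -1 else nxt
        if end > pos:
            spans.append((pos, end))
        if nxt == -1:
            break
        pos = nxt + len(separator)

    # Greedily merge neighbouring paragraphs while the chunk fits.
    chunks: List[Tuple[int, int]] = []
    for start, end in spans:
        if chunks and end - chunks[-1][0] <= max_bytes:
            chunks[-1] = (chunks[-1][0], end)
        else:
            chunks.append((start, end))
    return chunks


text = "Zürich café\n\nSecond paragraph.\n\nThird."
data = text.encode("utf-8")
chunks = chunk_paragraphs(text, max_bytes=32)

for start, end in chunks:
    # Slice the encoded bytes, not the str, when using byte offsets.
    print((start, end), repr(data[start:end].decode("utf-8")))

# Character and byte lengths differ once non-ASCII text is involved.
print(len(text), len(data))  # 38 40
```

The last line is the point: "ü" and "é" each take two bytes in UTF-8, so using a byte offset as a `str` index would cut the text in the wrong place. Offsets are only useful when producer and consumer agree on the unit.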

The v4 release consolidates Kreuzberg’s role as an open, high-performance document processing and intelligence engine suitable for embedding into production workflows, pipelines, and services.

v4 treats document intelligence as a foundational layer for AI systems, compliance workflows, and enterprise data operations. The release reflects a maturing of both the system and the thinking behind it: open, extensible, performant, and designed to integrate into the systems that depend on it.

Links

Star us on GitHub
Read the Docs
Subreddit
Join our Discord Server
