Na'aman Hirschfeld

Announcing Kreuzberg v2.0: A Lightweight, Modern Python Text Extraction library

🔍 What’s Kreuzberg?

Kreuzberg is a Python library that provides a unified async/sync interface for extracting text from PDFs, images, Office documents, and more.

Key Features

  • Async First: Optimized async using anyio and worker processes.
  • Minimal Dependencies: A much smaller footprint than alternatives such as Unstructured.io or Docling.
  • Serverless-and-Docker Ready: Perfect for serverless functions and containerized deployments.
  • Local Processing: All processing is done locally, with no API calls or cloud services.
  • Modern Python: Built for Python 3.9+ with rigorous typing and extensive testing.
  • Versatile: Supports various formats, including PDFs, spreadsheets, Markdown, LaTeX, and more.
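
To make the async-first idea concrete, here is a minimal sketch of one extraction routine exposed through both an async and a sync entry point. It uses only the standard library. The names `extract_text`, `extract_text_sync`, and `_decode_blocking` are hypothetical and are not Kreuzberg's API, and the sketch offloads work to a thread with asyncio for brevity, whereas Kreuzberg uses anyio and worker processes.

```python
import asyncio


def _decode_blocking(data: bytes) -> str:
    # Hypothetical stand-in for blocking or CPU-bound extraction work.
    return data.decode("utf-8", errors="replace").strip()


async def extract_text(data: bytes) -> str:
    # Async entry point: run the blocking step off the event loop
    # so other coroutines keep making progress.
    return await asyncio.to_thread(_decode_blocking, data)


def extract_text_sync(data: bytes) -> str:
    # Sync entry point: same logic, driven by a short-lived event loop.
    return asyncio.run(extract_text(data))


if __name__ == "__main__":
    print(extract_text_sync(b"  hello from a byte stream \n"))
```

Keeping the sync function as a thin wrapper around the async one means there is only one code path to test; the trade-off is that the wrapper cannot be called from inside an already running event loop.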

🚀 What’s New in Version 2.0?

Kreuzberg v2.0 brings significant enhancements to performance, usability, and feature set. Here’s what’s new:

  • Sync APIs: Kreuzberg now supports synchronous extraction methods alongside async workflows.
  • Batch Processing: Efficiently process multiple files or byte streams in parallel.
  • Smart PDF Handling: Automatically fall back to OCR when direct text extraction fails.
  • Metadata Extraction: Retrieve metadata like document titles or creators using Pandoc.
  • Excel Multi-Sheet Support: Extract text from workbooks that contain multiple sheets.
  • Enhanced Performance: Worker processes for faster, resource-efficient extraction.
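
Two of the items above, batch processing and the OCR fallback, can be sketched together. The code below is an illustration under stated assumptions, not Kreuzberg's implementation: `extract_with_fallback`, `batch_extract`, `fake_direct`, and `fake_ocr` are hypothetical names, and treating "empty text or a `ValueError`" as a failed direct extraction is an assumption, since the post does not say how failure is detected.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence

# An extractor is any async callable that turns bytes into text.
Extractor = Callable[[bytes], Awaitable[str]]


async def extract_with_fallback(data: bytes, direct: Extractor, ocr: Extractor) -> str:
    # Try the cheap direct path first (assumption: failure means
    # a ValueError or text that is empty after stripping whitespace).
    try:
        text = await direct(data)
    except ValueError:
        text = ""
    return text if text.strip() else await ocr(data)


async def batch_extract(items: Sequence[bytes], direct: Extractor, ocr: Extractor) -> List[str]:
    # gather() runs the extractions concurrently and keeps input order.
    return await asyncio.gather(
        *(extract_with_fallback(item, direct, ocr) for item in items)
    )


async def fake_direct(data: bytes) -> str:
    # Hypothetical direct extractor: only inputs tagged "TEXT:" carry text.
    prefix = b"TEXT:"
    return data[len(prefix):].decode() if data.startswith(prefix) else ""


async def fake_ocr(data: bytes) -> str:
    # Hypothetical OCR step: reports the input size instead of real OCR.
    return f"[ocr] {len(data)} bytes"


if __name__ == "__main__":
    docs = [b"TEXT:hello", b"\x00\x01"]
    print(asyncio.run(batch_extract(docs, fake_direct, fake_ocr)))
```

Passing the extractors in as parameters keeps the fallback logic independent of any particular PDF or OCR backend, which also makes it easy to test with stubs like the two above.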

Check out the v2.0 changelog for more details.

🎯 Who’s It For?

Kreuzberg is ideal for developers building:

  • Retrieval-Augmented Generation (RAG) systems
  • LLM-powered applications
  • Document indexing, analysis, and automation tools

If you’re looking for a lightweight, efficient solution for text extraction, Kreuzberg is a great choice.

⚖️ How Kreuzberg Compares

Here’s how Kreuzberg stacks up against alternatives:

  1. Python OSS Libraries

    • Unstructured.io: Feature-rich but heavy, making it unsuitable for serverless or low-resource environments.
    • Docling: Another strong alternative, but larger and heavier; it is better suited to high-volume, GPU-based workloads.
  2. Non-Python OSS Libraries

    • Apache Tika: Requires a Java server running as a sidecar, with Python client libraries available.
    • Grobid: Excellent for structured research text extraction but comes with a ~20GB Docker image.
  3. Commercial APIs
    Paid solutions like Azure Document Intelligence or AWS Textract offer best-in-class OCR and layout extraction. However, unlike Kreuzberg, they charge for usage and depend on cloud services.

Starring ⭐ is Caring

If Kreuzberg sounds like the library you’ve been looking for, check it out on GitHub.

Please star the repo ⭐. It helps others discover the project and motivates me to keep improving it!
