🔍 What’s Kreuzberg?
Kreuzberg is a Python library that provides a unified async/sync interface for extracting text from PDFs, images, Office documents, and more.
Key Features
- Async First: Optimized async using anyio and worker processes.
- Minimal Dependencies: Much smaller footprint compared to alternatives.
- Serverless-and-Docker Ready: Perfect for serverless functions and containerized deployments.
- Local Processing: All processing is done locally, with no API calls or cloud services.
- Modern Python: Built for Python 3.9+ with rigorous typing and extensive testing.
- Versatile: Supports various formats, including PDFs, spreadsheets, Markdown, LaTeX, and more.
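To make the "unified async/sync interface" idea concrete, here is a minimal sketch of the general pattern using only the standard library. It is not Kreuzberg's implementation or API: `extract_text`, `extract_text_sync`, and `_decode` are hypothetical names, and the "extraction" is just UTF-8 decoding pushed to a worker thread (Kreuzberg itself is described above as using anyio and worker processes).

```python
import asyncio


def _decode(data: bytes) -> str:
    # Hypothetical stand-in for blocking, CPU-bound extraction work.
    return data.decode("utf-8", errors="replace").strip()


async def extract_text(data: bytes) -> str:
    # Async entry point: offload the blocking call so the event loop stays responsive.
    # asyncio.to_thread is available from Python 3.9.
    return await asyncio.to_thread(_decode, data)


def extract_text_sync(data: bytes) -> str:
    # Sync entry point: run the async version to completion.
    # Must be called from code that is not already inside a running event loop.
    return asyncio.run(extract_text(data))
```

Both entry points share one code path, so behavior stays the same whether the caller is async or not. Call `await extract_text(...)` inside async code and `extract_text_sync(...)` everywhere else.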
🚀 What’s New in Version 2.0?
Kreuzberg v2.0 brings significant enhancements to performance, usability, and feature set. Here’s what’s new:
- Sync APIs: Kreuzberg now supports synchronous extraction methods alongside async workflows.
- Batch Processing: Efficiently process multiple files or byte streams in parallel.
- Smart PDF Handling: Automatically fall back to OCR when direct text extraction fails.
- Metadata Extraction: Retrieve metadata like document titles or creators using Pandoc.
- Excel Multi-Sheet Support: Handle even the most complex spreadsheets.
- Enhanced Performance: Worker processes for faster, resource-efficient extraction.
Check out the v2.0 changelog for more details.
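Two of the ideas above, OCR fallback and batch processing, can be sketched in a few lines. This is an illustration under stated assumptions, not Kreuzberg's code: `direct_text`, `ocr_text`, `extract_with_fallback`, and `batch_extract` are hypothetical names, both extractors are trivial stand-ins, and `asyncio.gather` gives concurrency on one event loop rather than the worker-process parallelism the library describes.

```python
import asyncio
from typing import List


async def direct_text(data: bytes) -> str:
    # Hypothetical stand-in for reading an embedded text layer.
    # Returns an empty string when there is nothing usable.
    return data.decode("utf-8", errors="ignore").strip()


async def ocr_text(data: bytes) -> str:
    # Hypothetical stand-in for an OCR pass over rendered pages.
    return f"<ocr result for {len(data)} bytes>"


async def extract_with_fallback(data: bytes) -> str:
    # Try the cheap path first; fall back to OCR only if it yields no text.
    text = await direct_text(data)
    return text if text else await ocr_text(data)


async def batch_extract(items: List[bytes]) -> List[str]:
    # gather runs the extractions concurrently and preserves input order.
    results = await asyncio.gather(*(extract_with_fallback(i) for i in items))
    return list(results)
```

Running `asyncio.run(batch_extract([b"hello", b"   "]))` takes the direct path for the first item and the OCR stand-in for the second, since whitespace-only input counts as "no text".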
🎯 Who’s It For?
Kreuzberg is ideal for developers building:
- Retrieval-Augmented Generation (RAG) systems
- LLM-powered applications
- Document indexing, analysis, and automation tools
If you’re looking for a lightweight, efficient solution for text extraction, Kreuzberg is a great choice.
⚖️ How Kreuzberg Compares
Here’s how Kreuzberg stacks up against alternatives:
- Python OSS Libraries
  - Unstructured.io: Feature-rich but heavy, making it unsuitable for serverless or low-resource environments.
  - Docling: Another strong alternative but larger and heavier; better suited for high-volume, GPU-based workloads.
- Non-Python OSS Libraries
  - Apache Tika: Requires a Java server running as a sidecar, with Python client libraries available.
  - Grobid: Excellent for structured research text extraction but comes with a ~20GB Docker image.
- Commercial APIs
  - Paid solutions like Azure Document Intelligence or AWS Textract offer best-in-class OCR and layout extraction. However, they come with pricing concerns and cloud dependencies, unlike Kreuzberg.
Starring ⭐ is Caring
If Kreuzberg sounds like the library you’ve been looking for, check it out on GitHub.
Please star the repo ⭐—it helps others discover the project and motivates me to keep improving it!