I've been working on Chunklet-py - a powerful Python library for intelligent text and document chunking that's perfect for LLM/RAG applications. Here's why you might want to check it out:
🧠 What It Does
Chunklet-py is your friendly neighborhood text splitter that takes all kinds of content and breaks it into smart, context-aware chunks. Instead of dumb character-count splitting, it gives you specialized tools for:
- Sentence Splitter - Multilingual text splitting (50+ languages!)
- Plain Text Chunker - Basic text chunking with constraints
- Document Chunker - Processes PDFs, DOCX, EPUB, ODT, CSV, Excel, and more
- Code Chunker - Language-agnostic code splitting that preserves structure
- Chunk Visualizer - Interactive web interface for real-time chunk exploration
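To make "context-aware" concrete, here's a toy, library-independent sketch of the difference between character-count slicing and sentence-aware packing. This is not chunklet-py's implementation (the library does real multilingual sentence segmentation, not a one-line regex); it just shows why respecting sentence boundaries matters. The 30-character budget is arbitrary.

```python
# Toy sketch only -- NOT chunklet-py's implementation.
import re

text = "Chunking keeps ideas intact. Each sentence stays whole. Context survives."

# Naive approach: fixed-size slices cut through words and sentences.
naive_chunks = [text[i:i + 30] for i in range(0, len(text), 30)]

# Sentence-aware approach: split on sentence boundaries, then pack whole
# sentences into chunks without exceeding the size budget.
sentences = re.split(r"(?<=[.!?])\s+", text)
chunks, current = [], ""
for sentence in sentences:
    if current and len(current) + len(sentence) > 30:
        chunks.append(current.strip())
        current = ""
    current += sentence + " "
if current.strip():
    chunks.append(current.strip())

print(naive_chunks)  # slices land mid-word, mid-sentence
print(chunks)        # every chunk is a complete sentence
```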
🚀 Key Features
- Blazingly Fast: Parallel processing for large document batches
- Featherlight Footprint: Lean and memory-efficient
- Rich Metadata: Context-aware metadata for advanced RAG applications
- Multilingual Mastery: 50+ languages with intelligent detection
- Triple Interface: CLI, library, or web interface
- Infinitely Customizable: Pluggable token counters, custom splitters, processors
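On the "pluggable token counters" point: a token counter is just a callable that maps text to a count, so chunk budgets can line up with whatever tokenizer your model actually uses. The sketch below uses a crude whitespace count; how you hand it to chunklet-py (the `token_counter` name in the comment) is my assumption, so check the docs for the real parameter.

```python
# Sketch of a pluggable token counter: text in, count out.
def simple_token_counter(text: str) -> int:
    # Crude whitespace-based count; swap in your model's real tokenizer
    # (e.g. a tiktoken encoding's encode method) for accurate budgets.
    return len(text.split())

# Hypothetical wiring -- the parameter name is an assumption, verify in the docs:
# chunker = PlainTextChunker(token_counter=simple_token_counter)
```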
💻 Quick Example
```python
from chunklet import PlainTextChunker

chunker = PlainTextChunker()

chunks = chunker.chunk(
    "Your long text here...",
    max_tokens=1000,
    max_sentences=10,
)

for chunk in chunks:
    print(f"Content: {chunk.content[:50]}...")
    print(f"Metadata: {chunk.metadata}")
```
📈 Why It Matters
Traditional text splitting often breaks meaning - mid-sentence cuts, lost context, language confusion. Chunklet-py keeps your content's structure and meaning intact, making it perfect for:
- Preparing data for LLMs
- Building RAG systems
- AI search applications
- Document processing pipelines
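To ground the RAG point: once the text is chunked, a retrieval pipeline is essentially embed, index, and search. The sketch below continues from the Quick Example's `chunks` and uses a deliberately dumb hashing-trick embedding as a stand-in for a real embedding model; it shows where chunks fit, not a recommended setup.

```python
import math
from collections import Counter

def embed(text: str, dims: int = 64) -> list[float]:
    # Placeholder "hashing trick" embedding -- replace with a real model.
    vec = [0.0] * dims
    for token, count in Counter(text.lower().split()).items():
        vec[hash(token) % dims] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are unit-normalized, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

# Index: pair each chunk's text with its embedding.
index = [(chunk.content, embed(chunk.content)) for chunk in chunks]

# Retrieve: rank chunks by similarity to the query, keep the top 3 for the prompt.
query_vec = embed("your question here")
top = sorted(index, key=lambda item: -cosine(query_vec, item[1]))[:3]
for content, _ in top:
    print(content[:80])
```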
🛠️ Installation

```bash
pip install chunklet-py

# For full features:
pip install "chunklet-py[all]"
```
📊 Community & Stats
- 50+ languages supported
- 10+ document formats processed
- MIT licensed - free and open source
- Active development with comprehensive testing
Check out the documentation and GitHub repo for more details!
What do you think? Have you worked on similar text processing challenges? Any questions about chunking strategies or the library?