I've been working on Chunklet-py - a powerful Python library for intelligent text and document chunking that's perfect for LLM/RAG applications. Here's why you might want to check it out:
🧠 What It Does
Chunklet-py is your friendly neighborhood text splitter that takes all kinds of content and breaks it into smart, context-aware chunks. Instead of dumb character-count splitting, it gives you specialized tools for:
- Sentence Splitter - Multilingual text splitting (50+ languages!)
- Plain Text Chunker - Basic text chunking with constraints
- Document Chunker - Processes PDFs, DOCX, EPUB, ODT, CSV, Excel, and more
- Code Chunker - Language-agnostic code splitting that preserves structure
- Chunk Visualizer - Interactive web interface for real-time chunk exploration
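To make "context-aware" concrete, here's a toy, library-independent sketch of the difference between character-count slicing and sentence-aware packing. This is not chunklet-py's implementation (the library does real multilingual sentence segmentation, not a one-line regex); it just shows why respecting sentence boundaries matters. The 30-character budget is arbitrary.

```python
# Toy sketch only -- NOT chunklet-py's implementation.
import re

text = "Chunking keeps ideas intact. Each sentence stays whole. Context survives."

# Naive approach: fixed-size slices cut through words and sentences.
naive_chunks = [text[i:i + 30] for i in range(0, len(text), 30)]

# Sentence-aware approach: split on sentence boundaries, then pack whole
# sentences into chunks without exceeding the size budget.
sentences = re.split(r"(?<=[.!?])\s+", text)
chunks, current = [], ""
for sentence in sentences:
    if current and len(current) + len(sentence) > 30:
        chunks.append(current.strip())
        current = ""
    current += sentence + " "
if current.strip():
    chunks.append(current.strip())

print(naive_chunks)  # slices land mid-word, mid-sentence
print(chunks)        # every chunk is a complete sentence
```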
🚀 Key Features
- Blazingly Fast: Parallel processing for large document batches
- Featherlight Footprint: Lean and memory-efficient
- Rich Metadata: Context-aware metadata for advanced RAG applications
- Multilingual Mastery: 50+ languages with intelligent detection
- Triple Interface: CLI, library, or web interface
- Infinitely Customizable: Pluggable token counters, custom splitters, processors
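On the "pluggable token counters" point: a token counter is just a callable that maps text to a count, so chunk budgets can line up with whatever tokenizer your model actually uses. The sketch below uses a crude whitespace count; how you hand it to chunklet-py (the `token_counter` name in the comment) is my assumption, so check the docs for the real parameter.

```python
# Sketch of a pluggable token counter: text in, count out.
def simple_token_counter(text: str) -> int:
    # Crude whitespace-based count; swap in your model's real tokenizer
    # (e.g. a tiktoken encoding's encode method) for accurate budgets.
    return len(text.split())

# Hypothetical wiring -- the parameter name is an assumption, verify in the docs:
# chunker = PlainTextChunker(token_counter=simple_token_counter)
```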
💻 Quick Example
```python
from chunklet import PlainTextChunker

chunker = PlainTextChunker()

chunks = chunker.chunk(
    "Your long text here...",
    max_tokens=1000,
    max_sentences=10,
)

for chunk in chunks:
    print(f"Content: {chunk.content[:50]}...")
    print(f"Metadata: {chunk.metadata}")
```
📈 Why It Matters
Traditional text splitting often breaks meaning - mid-sentence cuts, lost context, language confusion. Chunklet-py keeps your content's structure and meaning intact, making it perfect for:
- Preparing data for LLMs
- Building RAG systems
- AI search applications
- Document processing pipelines
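To ground the RAG point: once the text is chunked, a retrieval pipeline is essentially embed, index, and search. The sketch below continues from the Quick Example's `chunks` and uses a deliberately dumb hashing-trick embedding as a stand-in for a real embedding model; it shows where chunks fit, not a recommended setup.

```python
import math
from collections import Counter

def embed(text: str, dims: int = 64) -> list[float]:
    # Placeholder "hashing trick" embedding -- replace with a real model.
    vec = [0.0] * dims
    for token, count in Counter(text.lower().split()).items():
        vec[hash(token) % dims] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are unit-normalized, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

# Index: pair each chunk's text with its embedding.
index = [(chunk.content, embed(chunk.content)) for chunk in chunks]

# Retrieve: rank chunks by similarity to the query, keep the top 3 for the prompt.
query_vec = embed("your question here")
top = sorted(index, key=lambda item: -cosine(query_vec, item[1]))[:3]
for content, _ in top:
    print(content[:80])
```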
🛠️ Installation

```bash
pip install chunklet-py

# For full features:
pip install "chunklet-py[all]"
```
📊 Community & Stats
- 50+ languages supported
- 10+ document formats processed
- MIT licensed - free and open source
- Active development with comprehensive testing
Check out the documentation and GitHub repo for more details!
What do you think? Have you worked on similar text processing challenges? Any questions about chunking strategies or the library?