DEV Community

Speedyk-005

**Chunklet-py: One Library to Split Them All - Sentence, Code, Docs**

I've been working on Chunklet-py - a powerful Python library for intelligent text and document chunking that's perfect for LLM/RAG applications. Here's why you might want to check it out:

🔧 What It Does

Chunklet-py is your friendly neighborhood text splitter that takes all kinds of content and breaks it into smart, context-aware chunks. Instead of dumb character-count splitting, it gives you specialized tools for:

  • Sentence Splitter - Multilingual text splitting (50+ languages!)
  • Plain Text Chunker - Basic text chunking with sentence and token constraints
  • Document Chunker - Processes PDFs, DOCX, EPUB, ODT, CSV, Excel, and more
  • Code Chunker - Language-agnostic code splitting that preserves structure
  • Chunk Visualizer - Interactive web interface for real-time chunk exploration
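
The "preserves structure" part of the code chunker is the interesting bit. As a library-independent illustration of the idea (a toy sketch, not Chunklet-py's actual algorithm), splitting code at top-level definition boundaries keeps every function and class whole:

```python
import re

def split_at_top_level_defs(source: str) -> list[str]:
    """Split Python source at top-level def/class boundaries,
    keeping each definition intact (illustration only)."""
    boundaries = [m.start() for m in re.finditer(r"^(?:def|class)\s", source, re.MULTILINE)]
    if not boundaries:
        return [source]
    # Keep any leading code (imports, constants) as its own chunk.
    starts = ([0] if boundaries[0] != 0 else []) + boundaries
    ends = starts[1:] + [len(source)]
    return [source[s:e] for s, e in zip(starts, ends)]

code = '''import math

def area(r):
    return math.pi * r ** 2

class Circle:
    def __init__(self, r):
        self.r = r
'''

chunks = split_at_top_level_defs(code)
# chunks: the import block, then area(), then Circle - no mid-function cuts.
```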

🚀 Key Features

  • Blazingly Fast: Parallel processing for large document batches
  • Featherlight Footprint: Lightweight and memory-efficient
  • Rich Metadata: Context-aware metadata for advanced RAG applications
  • Multilingual Mastery: 50+ languages with intelligent detection
  • Triple Interface: CLI, library, or web interface
  • Infinitely Customizable: Pluggable token counters, custom splitters, processors
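
"Pluggable token counters" means chunk sizes can be measured with your own tokenizer rather than a built-in one. A counter is just a callable from text to an integer (the exact parameter name to pass it under is best confirmed in the Chunklet-py docs); a minimal sketch:

```python
def whitespace_token_counter(text: str) -> int:
    """A trivial counter: one token per whitespace-separated word.
    Swap in tiktoken or your model's tokenizer for realistic counts."""
    return len(text.split())

n = whitespace_token_counter("Chunklet-py splits text into context-aware chunks")
```

With a counter like this plugged in, a limit such as max_tokens=1000 is interpreted in your tokenizer's units rather than raw characters.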

💻 Quick Example

```python
from chunklet import PlainTextChunker

chunker = PlainTextChunker()
chunks = chunker.chunk(
    "Your long text here...",
    max_tokens=1000,
    max_sentences=10
)

for chunk in chunks:
    print(f"Content: {chunk.content[:50]}...")
    print(f"Metadata: {chunk.metadata}")
```

📊 Why It Matters

Traditional text splitting often breaks meaning - mid-sentence cuts, lost context, language confusion. Chunklet-py keeps your content's structure and meaning intact, making it perfect for:

  • Preparing data for LLMs
  • Building RAG systems
  • AI search applications
  • Document processing pipelines
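
The difference is easy to demonstrate. Below, a naive fixed-width split cuts blindly mid-word, while a simple sentence-aware packer (a toy sketch of the general idea, not Chunklet-py's implementation) always breaks at sentence boundaries:

```python
import re

text = ("RAG quality depends on chunking. "
        "Bad chunks cut sentences in half. "
        "Good chunks keep ideas whole.")

# Naive fixed-width split: cuts every 40 characters, wherever that falls.
naive = [text[i:i + 40] for i in range(0, len(text), 40)]

# Sentence-aware split: pack whole sentences up to a ~60-char budget.
sentences = re.split(r"(?<=[.!?])\s+", text)
chunks, current = [], ""
for s in sentences:
    if current and len(current) + len(s) + 1 > 60:
        chunks.append(current)
        current = s
    else:
        current = f"{current} {s}".strip()
if current:
    chunks.append(current)
# Every sentence-aware chunk ends cleanly at a sentence boundary;
# the naive chunks end wherever character 40 happens to land.
```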

πŸ› οΈ Installation

```shell
pip install chunklet-py

# For full features:
pip install "chunklet-py[all]"
```

📈 Community & Stats

  • 50+ languages supported
  • 10+ document formats processed
  • MIT licensed - free and open source
  • Active development with comprehensive testing

Check out the documentation and GitHub repo for more details!

What do you think? Have you worked on similar text processing challenges? Any questions about chunking strategies or the library?
