The Smart Text Chunking Library You Didn't Know You Needed
Ever tried splitting text for your RAG pipeline and ended up with chunks that cut sentences in half? Or worse — chunks that lose all context between them?
Yeah, I've been there too. That's exactly why I built chunklet-py — a Python library that actually understands text structure.
This post hits the highlights — visit the full documentation for everything else, including:
- Custom sentence splitters for specialized languages
- Custom document processors for unusual file formats
- Custom tokenizers to match your LLM
- CLI flags for batch processing, parallel jobs, error handling, timeouts
- Advanced features like overlap, offset, strict mode, docstring modes
⚠ Quick heads up!
This tutorial requires chunklet-py v2.2.0+ and uses APIs not available in earlier versions. Upgrade to the latest version and see the documentation or What’s New for details.
The Problem with Dumb Splitting
Here's what usually happens:
# The naive approach
chunks = [text[i:i+500] for i in range(0, len(text), 500)]
This works... until it doesn't:
- Sentences cut mid-way (split after "The model got" and the leftover "75%" is meaningless on its own)
- No context between chunks
- Broken code if you're chunking source files
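To make the first failure concrete, here is a small self-contained sketch of fixed-width slicing. The sample sentence and the 14-character width are made up for illustration; any width that ignores sentence boundaries runs into the same problem sooner or later.

```python
# Illustrative only: the sample text and chunk width are arbitrary choices.
text = "The model got 75% accuracy. It was trained for ten epochs."


def naive_chunks(text: str, size: int) -> list[str]:
    """Slice text every `size` characters, ignoring sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


chunks = naive_chunks(text, 14)
for chunk in chunks:
    # repr() makes leading and trailing spaces visible
    print(repr(chunk))
```

The second chunk starts with "75% accuracy." and nothing in it tells you what scored 75%.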
Solution: chunklet-py
A smart text and code chunking library that respects natural boundaries.
Features
50+ languages supported — Auto-detects language and applies the right splitting rules. No more treating German the same as English.
Multiple constraint types — Mix and match:
- max_sentences: group by sentences
- max_tokens: respect LLM context limits
- max_section_breaks: keep Markdown headers together (headings ##, horizontal rules ---, <details> tags)
- max_lines: for code chunking
- max_functions: keep functions together
Multiple file formats — PDF, DOCX, EPUB, HTML, Markdown, LaTeX, ODT, CSV, Excel, plain text — one library handles them all.
Rich metadata — Every chunk comes with source references, character spans, and structural info.
Composable constraints — Mix and match limits to get exactly the chunks you need.
Pluggable architecture — Swap in custom tokenizers, sentence splitters, or document processors.
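If "composable constraints" sounds abstract, the idea can be sketched in a few lines: keep adding sentences to a chunk and close it as soon as any limit would be exceeded. This is a toy illustration, not how chunklet-py is implemented. The function name `toy_chunk`, the regex sentence split, and the whitespace "tokenizer" are all simplifying assumptions of mine.

```python
import re


def toy_chunk(text: str, max_sentences: int, max_tokens: int) -> list[str]:
    """Group sentences greedily; close a chunk when either limit would be exceeded.

    Toy stand-ins: sentences end at ., ! or ? followed by whitespace, and
    tokens are whitespace-separated words. Real splitters and tokenizers differ.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, tokens = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Flush the current chunk if adding this sentence would break a limit.
        if current and (len(current) >= max_sentences or tokens + n > max_tokens):
            chunks.append(" ".join(current))
            current, tokens = [], 0
        current.append(sentence)
        tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One two three. Four five. Six seven eight nine. Ten."
print(toy_chunk(sample, max_sentences=2, max_tokens=5))
```

Tightening either limit produces smaller chunks, and whichever limit is hit first decides where the chunk ends.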
What's New in v2.2.0
- API Unification: Methods renamed to chunk_text, chunk_file, chunk_texts, chunk_files for consistency
- Visualizer redesign: Fullscreen mode, 3-row layout, smoother hovers
- More code languages: ColdFusion, VB.NET, PHP 8 attributes, Pascal support
- Ruff: Switched to Ruff for faster linting
Check the What's New page for full details.
Installation
pip install chunklet-py
For document support:
pip install chunklet-py[structured-document]
For code:
pip install chunklet-py[code]
For visualization:
pip install chunklet-py[visualization]
Code Examples
Core Imports
from chunklet import DocumentChunker # For PDFs, DOCX, and general text
from chunklet import CodeChunker # For source code
from chunklet import SentenceSplitter # For just sentences
from chunklet import visualizer # Web-based visualizer
DocumentChunker API
Four methods cover most use cases:
| Method | Input | Return Type |
|---|---|---|
| chunk_text(text) | str | List[Chunk] |
| chunk_file(path) | Path or str | List[Chunk] |
| chunk_texts(list) | List[str] | Generator[Chunk] |
| chunk_files(list) | List[Path] | Generator[Chunk] |
DocumentChunker Example
chunker = DocumentChunker()
# Feel free to mix and match these
chunks = chunker.chunk_text(
    text,
    max_sentences=3,       # Stop after X sentences
    max_tokens=500,        # Don't blow up the LLM context
    max_section_breaks=2,  # Respect the Markdown headers
    overlap_percent=20,    # Give it some "memory" of the last chunk
    offset=0,              # Skip the first N sentences
)
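The "memory" that overlap gives a chunk is easier to see with a toy example. The sketch below assumes overlap is counted in whole sentences and rounded down. That is my simplification for illustration, and chunklet-py may compute overlap differently. `overlapping_groups` is a hypothetical helper, not part of the library.

```python
def overlapping_groups(
    sentences: list[str], size: int, overlap_percent: int
) -> list[list[str]]:
    """Group sentences into windows of `size`, repeating the tail of each window.

    Assumption for illustration: overlap is measured in whole sentences,
    rounded down, and every window contributes at least one new sentence.
    """
    carry = min(size - 1, size * overlap_percent // 100)
    step = size - carry
    groups = []
    for start in range(0, len(sentences), step):
        groups.append(sentences[start:start + size])
        if start + size >= len(sentences):
            break  # the last window already reaches the end
    return groups


sentences = ["S1.", "S2.", "S3.", "S4.", "S5.", "S6.", "S7."]
groups = overlapping_groups(sentences, size=5, overlap_percent=20)
for group in groups:
    print(group)
```

With 5 sentences per window and 20% overlap, the last sentence of one window is repeated as the first sentence of the next.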
CodeChunker Example
chunker = CodeChunker()
chunks = chunker.chunk_text(
    code,
    max_lines=50,     # Height limit
    max_tokens=512,   # Width limit
    max_functions=1,  # One function per chunk
    strict=True,      # True: Crash on big blocks; False: Slice anyway
)
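To show what "one function per chunk" means in practice, here is a rough Python-only sketch built on the standard library's `ast` module. This is not how CodeChunker works internally (this post doesn't cover that, and the library handles many languages). `functions_as_chunks` is a hypothetical helper that only shows the kind of boundary you want to respect.

```python
import ast


def functions_as_chunks(source: str) -> list[str]:
    """Return the source text of each top-level function as its own chunk."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # get_source_segment returns the exact text spanned by the node
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                chunks.append(segment)
    return chunks


code = '''
def add(a, b):
    return a + b

def sub(a, b):
    return a - b
'''

function_chunks = functions_as_chunks(code)
for chunk in function_chunks:
    print(chunk)
    print("---")
```

Each chunk is a complete, parseable function, which is exactly what fixed-width slicing cannot guarantee.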
SentenceSplitter (Just Sentences)
from chunklet import SentenceSplitter
splitter = SentenceSplitter()
sentences = splitter.split_text(text, lang="en")
Handles tricky cases like "Dr." or "U.S.A." without breaking them up.
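Why are abbreviations tricky? A splitter that breaks on every period would cut right after "Dr.". Here is a deliberately tiny sketch of one way around that, using a hand-written abbreviation list. The list and the `toy_split` function are hypothetical and are not chunklet-py's actual rules, which cover far more cases and languages.

```python
import re

# Hypothetical, tiny abbreviation list for illustration only.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "u.s.a."}


def toy_split(text: str) -> list[str]:
    """Split after ., ! or ? unless the word is a known abbreviation."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        ends_sentence = re.search(r"[.!?]$", word) is not None
        if ends_sentence and word.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


result = toy_split("Dr. Smith moved to the U.S.A. last year. He likes it there.")
print(result)
```

The obvious weakness is a sentence that really does end with an abbreviation, which is one reason to lean on a dedicated splitter instead of rolling your own.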
Output Object
Chunkers return Chunk objects (Box instances), so you use dot notation:
for chunk in chunks:
    print(chunk.content)   # The actual text/code
    print(chunk.metadata)  # Chunk metadata
Visualizer (Interactive Web UI)
Launch a web interface to experiment with chunking parameters:
chunklet visualize
Or programmatically:
from chunklet import visualizer
v = visualizer.Visualizer(host="127.0.0.1", port=8000)
v.serve() # Opens in your browser
CLI Examples
Prefer the terminal? chunklet-py ships with a full CLI:
# Basic text chunking
chunklet chunk "Your text here." --max-tokens 500
# Chunk a file
chunklet chunk --source document.pdf --max-tokens 500 --metadata
# Split text into sentences
chunklet split "Your text here." --lang en
# Split a file into sentences
chunklet split --source my_file.txt --destination sentences.txt
# Start the interactive visualizer
chunklet visualize
# Code chunking
chunklet chunk --code --source my_script.py --max-functions 1
# Batch processing a directory
chunklet chunk --doc --source ./my_docs --destination ./chunks --n-jobs 4
# With error handling
chunklet chunk --doc --source ./my_docs --on-errors skip
How It Compares
| Library | The Deal | Focus |
|---|---|---|
| chunklet-py | All-in-one, lightweight, multilingual, language-agnostic. | Text, Code, Docs |
| LangChain | Full LLM framework with basic splitters. Good for prototyping. | Full Stack |
| Chonkie | Chunking + embeddings + vector DB all-in-one. | Pipelines |
| Semchunk | Text-only, fast semantic splitting. | Text |
Wrap Up
Chunklet-py is production-ready. It's lightweight, has no heavy dependencies, and the API is consistent — no more guessing which method name to use.
Check it out: github.com/speedyk-005/chunklet-py
Questions? Drop them in the comments!