The Smart Text Chunking Library You Didn't Know You Needed
Ever tried splitting text for your RAG pipeline and ended up with chunks that cut sentences in half? Or worse — chunks that lose all context between them?
Yeah, I've been there too. That's exactly why I built chunklet-py — a Python library that actually understands text structure.
This post hits only the highlights and doesn't cover everything — visit the full documentation for everything else, including:
- Custom sentence splitters for specialized languages
- Custom document processors for unusual file formats
- Custom tokenizers to match your LLM
- Rich metadata available on every chunk
- CLI flags for batch processing, parallel jobs, error handling, timeouts
- Additional args like `n_jobs`, `lang`, `show_progress`, ...
⚠ Quick heads up!
This tutorial requires chunklet-py v2.2.0+ and uses APIs not available in earlier versions. Upgrade to the latest version and see the documentation or What's New for details.
The Problem with Dumb Splitting
Here's what usually happens:
```python
# The naive approach: fixed-size character slices
chunks = [text[i:i+500] for i in range(0, len(text), 500)]
```
This works... until it doesn't:
- Sentences cut mid-way ("The model got 75%" → "75%" becomes meaningless)
- No context between chunks
- Broken code if you're chunking source files
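To see the first failure concretely, here's a toy reproduction of the naive approach, using a small chunk size to keep the example short:

```python
text = "The model got 75% accuracy. We then fine-tuned it."

# Fixed-size slicing, 30 characters at a time
chunks = [text[i:i + 30] for i in range(0, len(text), 30)]

print(chunks[0])  # "The model got 75% accuracy. We" -- sentence cut mid-way
print(chunks[1])  # " then fine-tuned it." -- stranded fragment with no context
```

The second chunk carries a dangling fragment that no embedding model can make sense of on its own.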
Solution: chunklet-py
A smart text and code chunking library that respects natural boundaries.
Features
50+ languages supported — Auto-detects language and applies the right splitting rules. No more treating German the same as English.
Multiple constraint types — Mix and match:
- `max_sentences` — group by sentences
- `max_tokens` — respect LLM context limits
- `max_section_breaks` — keep Markdown sections together (headings `##`, horizontal rules `---`, `<details>` tags)
- `max_lines` — for code chunking
- `max_functions` — keep functions together
Multiple file formats — PDF, DOCX, EPUB, HTML, Markdown, LaTeX, ODT, CSV, Excel, plain text — one library handles them all.
Rich metadata — Every chunk comes with source references, character spans, and structural info.
Composable constraints — Mix and match limits to get exactly the chunks you need.
Pluggable architecture — Swap in custom tokenizers, sentence splitters, or document processors.
What's New in v2.2.0
- API Unification — methods renamed to `chunk_text`, `chunk_file`, `chunk_texts`, `chunk_files` for consistency
- Visualizer redesign — fullscreen mode, 3-row layout, smoother hovers
- More code languages — ColdFusion, VB.NET, PHP 8 attributes, Pascal support
- Ruff — Switched to Ruff for faster linting
Check the What's New page for full details.
Installation
```shell
pip install chunklet-py
```
For document support:
```shell
pip install chunklet-py[structured-document]
```
For code:
```shell
pip install chunklet-py[code]
```
For visualization:
```shell
pip install chunklet-py[visualization]
```
Code Examples
Core Imports
```python
from chunklet import DocumentChunker   # For PDFs, DOCX, and general text
from chunklet import CodeChunker       # For source code
from chunklet import SentenceSplitter  # For just sentences
from chunklet import visualizer        # Web-based visualizer
```
DocumentChunker API
Four methods cover most use cases:
| Method | Input | Return Type |
|---|---|---|
| `chunk_text(text)` | `str` | `List[Chunk]` |
| `chunk_file(path)` | `Path` or `str` | `List[Chunk]` |
| `chunk_texts(list)` | `List[str]` | `Generator[Chunk]` |
| `chunk_files(list)` | `List[Path]` | `Generator[Chunk]` |
DocumentChunker Example
```python
chunker = DocumentChunker()

# Feel free to mix and match these constraints
chunks = chunker.chunk_text(
    text,
    max_sentences=3,       # Stop after 3 sentences
    max_tokens=500,        # Don't blow up the LLM context
    max_section_breaks=2,  # Respect the Markdown headers
    overlap_percent=20,    # Give it some "memory" of the last chunk
    offset=0,              # Skip the first N sentences
)
```
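To build intuition for what `overlap_percent` buys you, here's a rough, library-independent sketch of ~20% overlap between consecutive sentence-based chunks. This only illustrates the idea; it is not chunklet-py's actual algorithm:

```python
sentences = ["S1.", "S2.", "S3.", "S4.", "S5.", "S6."]
max_sentences = 3

# ~20% of a 3-sentence chunk rounds to 1 sentence of carry-over
overlap = max(1, round(max_sentences * 0.20))
step = max_sentences - overlap

chunks = [sentences[i:i + max_sentences]
          for i in range(0, len(sentences) - overlap, step)]
# Each chunk starts with the tail of the previous one:
# [S1, S2, S3], [S3, S4, S5], [S5, S6]
```

That repeated boundary sentence is the "memory" that keeps retrieval from losing context at chunk edges.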
CodeChunker Example
```python
chunker = CodeChunker()
chunks = chunker.chunk_text(
    code,
    max_lines=50,           # Height limit
    max_tokens=512,         # Width limit
    max_functions=1,        # One function per chunk
    strict=True,            # True: raise on oversized blocks; False: slice anyway
    include_comments=True,  # True by default
    docstring_mode="all",   # Options: all, excluded, summary
)
```
⚠ Token Counter Requirement
When using the `max_tokens` constraint, you must supply a `token_counter` function: a callable you provide that accepts a string and returns its token count as an integer. Omitting it raises a `MissingTokenCounterError`.
You can also pass `token_counter` directly to any chunking method. If it's given in both the constructor and the method, the method-level one wins.
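A `token_counter` can be as simple as a whitespace split, though in practice you'd wrap your LLM's real tokenizer (e.g. tiktoken) for accurate counts. This sketch defines only the callable itself; the commented lines show where it would plug in, per the rule above:

```python
def count_tokens(text: str) -> int:
    """Naive whitespace token count -- swap in your model's tokenizer for accuracy."""
    return len(text.split())

# Passed at construction time:
#   chunker = DocumentChunker(token_counter=count_tokens)
# Or per call (this one wins if both are given):
#   chunks = chunker.chunk_text(text, max_tokens=500, token_counter=count_tokens)
```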
SentenceSplitter (Just Sentences)
```python
from chunklet import SentenceSplitter

splitter = SentenceSplitter()
sentences = splitter.split_text(text, lang="en")  # You can also set lang="auto"
```
Handles tricky cases like "Dr." or "U.S.A." without breaking them up.
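For contrast, here's what a naive period-based split does with those abbreviations (plain Python, no chunklet-py involved):

```python
text = "Dr. Smith visited the U.S.A. She returned on Friday."

# Naive splitting on ". " shreds abbreviations into bogus fragments
naive = text.split(". ")
print(naive)  # ['Dr', 'Smith visited the U.S.A', 'She returned on Friday.']
```

A proper sentence splitter recognizes "Dr." and "U.S.A." as abbreviations and returns two sentences instead of three fragments.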
50+ languages are explicitly supported through dedicated libraries (pysbd covers 40+, Indic NLP Library covers 11, sentsplit covers 4, and Sentencex covers ~15, with some overlap), and a Fallback Splitter handles any other language via Unicode rules — see the Supported Languages documentation.
Output Object
Chunkers return `Chunk` objects (`Box` instances), so you can use dot notation:
```python
for chunk in chunks:
    print(chunk.content)   # The actual text/code
    print(chunk.metadata)  # Chunk metadata
```
Visualizer (Interactive Web UI)
Launch a web interface to experiment with chunking parameters:
```shell
chunklet visualize
```
Or programmatically:
```python
from chunklet import visualizer

v = visualizer.Visualizer(host="127.0.0.1", port=8000)
v.serve()  # Opens in your browser
```
CLI Examples
Prefer the terminal? chunklet-py ships with a full CLI.
Here are some quick examples:
```shell
# Basic text chunking
chunklet chunk "Your text here." --max-tokens 500

# Chunk a file
chunklet chunk --source document.pdf --max-tokens 500 --metadata

# Split text into sentences
chunklet split "Your text here." --lang en

# Split a file into sentences
chunklet split --source my_file.txt --destination sentences.txt

# Start the interactive visualizer
chunklet visualize

# Code chunking
chunklet chunk --code --source my_script.py --max-functions 1

# Batch processing a directory
chunklet chunk --doc --source ./my_docs --destination ./chunks --n-jobs 4

# With error handling
chunklet chunk --doc --source ./my_docs --on-errors skip
```
How It Compares
While there are other chunking libraries available, Chunklet-py stands out for its unique combination of versatility, performance, and ease of use. Here's a quick look at how it compares to some of the alternatives:
| Library | Key Differentiator | Focus |
|---|---|---|
| chunklet-py | All-in-one, lightweight, multilingual, language-agnostic with specialized algorithms. | Text, Code, Docs |
| LangChain | Full LLM framework with basic splitters (e.g., RecursiveCharacterTextSplitter, Markdown, HTML, code splitters). Good for prototyping but basic for complex docs or multilingual needs. | Full Stack |
| Chonkie | All-in-one pipeline (chunking + embeddings + vector DB). Uses tree-sitter for code. Multilingual. | Pipelines |
| Semchunk | Text-only, fast semantic splitting. Built-in tiktoken/HuggingFace support. 85% faster than alternatives. | Text |
| CintraAI Code Chunker | Code-specific, uses tree-sitter. Initially supports Python, JS, CSS only. | Code |
Chunklet-py is a specialized, drop-in replacement for the chunking step in any RAG pipeline. It handles text, documents, and code without heavy dependencies, while keeping your project lightweight.
🙌 Contributors & Thanks
A huge thank you to the awesome people who helped shape Chunklet-py:
- @jmbernabotto — for helping mostly on the CLI part, suggesting fixes, features, and design improvements.
- @arnoldfranz — for reporting the CLI Path Validation Bug (#6) that helped improve error handling.
License
Check out the LICENSE file for all the details.
Wrap Up
Chunklet-py is production-ready. It's lightweight, has no heavy dependencies, and the API is consistent — no more guessing which method name to use.
Check it out: github.com/speedyk-005/chunklet-py
Questions? Drop them in the comments!