# Chunklet: Smarter Text Chunking for Python Developers

## Why Context Matters in Text Splitting
When preprocessing documents for NLP tasks, standard splitting methods often:
- Break sentences mid-thought ("The patient showed improvement. However," → "However,")
- Ignore linguistic boundaries in non-English texts
- Lose critical context between chunks
Chunklet solves this with structural awareness.
## 1. Installation & Basic Usage
```bash
pip install chunklet
```
Minimal Example:
```python
from chunklet import Chunklet

text = "First sentence. Second sentence. Third sentence."
chunker = Chunklet()
chunks = chunker.chunk(text, mode="sentence", max_sentences=2)

# Output:
# ["First sentence. Second sentence.", "Third sentence."]
```
This preserves complete sentences while respecting chunk size limits.
## 2. Key Features Explained

### Hybrid Chunking Mode
Combines structural and size-based splitting:
```python
chunks = chunker.chunk(
    text,
    mode="hybrid",
    max_sentences=3,      # Structural limit
    max_tokens=200,       # Size limit
    overlap_percent=15    # Context preservation
)
```
Why this matters:
- Prevents chunks from becoming too long or too short
- Overlap maintains relationships between sections
- Works equally well on code, markdown, or prose
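The `max_tokens` limit only makes sense if Chunklet can count tokens. Here is a minimal sketch, assuming `token_counter` is any callable from text to an integer count (the same hook the customization example further down passes a tokenizer into); the whitespace count and the synthetic `long_text` are just stand-ins:

```python
from chunklet import Chunklet

# Assumption: token_counter is any callable mapping a string to a token count.
# A naive whitespace count stands in for a real tokenizer here.
chunker = Chunklet(token_counter=lambda s: len(s.split()))

# A synthetic document long enough to produce several chunks.
long_text = " ".join(f"Sentence number {i} discusses the same topic." for i in range(1, 41))

chunks = chunker.chunk(
    long_text,
    mode="hybrid",
    max_sentences=3,
    max_tokens=200,
    overlap_percent=15
)
```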
### Multilingual Support
```python
# Auto-detection (36+ languages)
chunks = chunker.chunk(multilingual_text)

# Manual override
chunks = chunker.chunk(japanese_text, language="ja")
```
How it works:
- Uses `py3langid` for fast language detection (see the sketch below)
- Applies language-specific sentence boundaries
- Falls back to regex splitting for unsupported languages
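To get a feel for the detection step, here is a rough sketch using `py3langid` directly; the exact call Chunklet makes internally may differ, and the sample sentence is only an illustration:

```python
import py3langid as langid
from chunklet import Chunklet

sample = "これはチャンク分割のテストに使う日本語の文章です。"

# classify() returns a (language_code, score) pair.
lang, score = langid.classify(sample)
print(lang)  # expected: "ja"

# The detected code can then be passed explicitly, mirroring the manual override above.
chunker = Chunklet()
chunks = chunker.chunk(sample, language=lang, mode="sentence", max_sentences=2)
```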
## 3. Real-World Use Cases

### Preparing Legal Documents
```python
from pathlib import Path

legal_text = Path("contract.txt").read_text()
chunks = chunker.chunk(
    legal_text,
    mode="hybrid",
    max_tokens=512,
    overlap_percent=20  # Critical for clause relationships
)
```
Why it works:
- Preserves entire contract clauses
- Maintains references between sections (e.g., "as defined in Section 2.1"), which the check below makes easy to verify
- Handles complex punctuation in legal prose
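To see the overlap doing its job, a quick inspection loop (plain Python, not part of the Chunklet API) can print the tail of each chunk next to the head of the following one; text repeated at the boundary is the carried-over context. This assumes `chunks` from the snippet above:

```python
# Hypothetical sanity check: with overlap_percent > 0, the end of one chunk
# should reappear at the start of the next.
for prev_chunk, next_chunk in zip(chunks, chunks[1:]):
    print("... " + prev_chunk[-80:])
    print(next_chunk[:80] + " ...")
    print("-" * 60)
```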
### Processing Academic Papers
```python
chunker = Chunklet(
    sentence_splitter=custom_academic_splitter,  # Handles citations
    token_counter=scibert_tokenizer              # Domain-specific counting
)
```
Customization options:
- Plug in any sentence splitter
- Use HuggingFace tokenizers as token counters (see the sketch below)
- Adjust chunking thresholds per document type
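The constructor call above references `custom_academic_splitter` and `scibert_tokenizer` without defining them. Here is a minimal sketch of what such hooks could look like, assuming `sentence_splitter` maps a string to a list of sentences and `token_counter` maps a string to an integer; the SciBERT checkpoint name is one possible choice, not something Chunklet requires:

```python
import re
from chunklet import Chunklet
from transformers import AutoTokenizer

def custom_academic_splitter(text: str) -> list[str]:
    # Naive placeholder: split on sentence-final punctuation followed by whitespace.
    # A real academic splitter would special-case abbreviations such as "et al." or "Fig.".
    return re.split(r"(?<=[.!?])\s+", text)

# One possible SciBERT checkpoint; any HuggingFace tokenizer works the same way.
hf_tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")

def scibert_tokenizer(text: str) -> int:
    # Count subword tokens, excluding the [CLS]/[SEP] specials.
    return len(hf_tokenizer.encode(text, add_special_tokens=False))

chunker = Chunklet(
    sentence_splitter=custom_academic_splitter,
    token_counter=scibert_tokenizer
)
```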
## 4. Performance Considerations
```python
# For large datasets:
results = chunker.batch_chunk(
    documents,
    n_jobs=4,         # Parallel processing
    chunk_size=1000   # Documents per batch
)
```
Optimization tips:
- Enable `use_cache=True` for repeated texts
- Pre-filter very short or very long documents
- Monitor memory with `memory_profiler` (see the sketch below)
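For the memory tip, `memory_profiler` can wrap the batch call directly. A sketch, assuming `chunker` and `documents` from the snippet above and the default sampling interval:

```python
from memory_profiler import memory_usage

# memory_usage accepts a (callable, args, kwargs) tuple and returns sampled MiB values.
samples = memory_usage((
    chunker.batch_chunk,
    (documents,),
    {"n_jobs": 4, "chunk_size": 1000}
))
print(f"Peak memory during batch chunking: {max(samples):.1f} MiB")
```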
Ready to try?
GitHub Repository | PyPI Package