DEV Community: Speedyk-005

yasbd-lib vs PySBD: two philosophies of sentence boundary detection

Speedyk-005 — Sun, 19 Jul 2026 19:22:59 +0000

Sentence boundary detection sounds boring. Split on . ? !, done, right? Anyone who has tried knows otherwise. Abbreviations, decimals, URLs, nested quotes, ellipses, legal citations, biomedical jargon: each one turns "split text" into a language-specific puzzle.

Two Python libraries tackle this problem with different philosophies. PySBD has been the go-to since 2020, covers 22 languages, and is a port of Ruby's Pragmatic Segmenter [1]. yasbd-lib is newer, covers 39 languages, and takes a different architectural approach.

This is the difference between protecting boundaries and finding them.

Architecture: Mutation vs. Pointers

To understand how these engines behave on large datasets, we have to look at how they treat input strings.

PySBD: The Transformation Pipeline

PySBD operates as a multi-stage transformation pipeline [1]. It treats text as a mutable object that must be modified before it can be split [1]. To prevent punctuation within abbreviations, numbers, or URLs from triggering false splits, PySBD applies regular expressions to replace characters with placeholder tokens [1].

flowchart TD
    A[Raw Input Text] --> B["Replace . with {} / Mask URLs"]
    B --> C[Run Rule Engine]
    C --> D[Split on Splitting Marks]
    D --> E[Reverse Replacement / Restore Text]
    E --> F[Extract Text Segments]

The structural consequence: the original string layout is transformed during processing. Because the text is modified mid-flight, calculating exact character offsets (spans) relative to the original uncleaned text requires a post-processing reconstruction step [1]. If you enable text cleaning (clean=True) and also request character spans, PySBD raises an error because it cannot guarantee that offsets still match after modification [1].
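
The mask, split, restore cycle can be sketched in a few lines. This is a toy illustration and not PySBD's code: the abbreviation list, the placeholder character, and the function name mask_split_restore are all assumptions made for the example.

```python
import re

# Hypothetical abbreviation list and placeholder; PySBD's real rules are far larger.
ABBREVIATIONS = ("Dr.", "Mr.", "e.g.")
PLACEHOLDER = "\u2205"  # stands in for a protected period


def mask_split_restore(text: str) -> list[str]:
    """Toy protect -> split -> restore pipeline."""
    masked = text
    # 1. Mutate the text: hide periods that must not end a sentence.
    for abbr in ABBREVIATIONS:
        masked = masked.replace(abbr, abbr.replace(".", PLACEHOLDER))
    # 2. Split on whitespace that follows the remaining terminators.
    pieces = re.split(r"(?<=[.?!])\s+", masked)
    # 3. Restore the hidden periods in every piece.
    return [p.replace(PLACEHOLDER, ".") for p in pieces if p]


print(mask_split_restore("Dr. Smith arrived. He was late, e.g. by an hour."))
# ['Dr. Smith arrived.', 'He was late, e.g. by an hour.']
```

Note that the whitespace between sentences is discarded in step 2, so the returned strings no longer say where they sat in the input. That is the gap a span reconstruction step has to close.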

yasbd-lib: The Query Planning Approach

yasbd-lib treats text as immutable [2]. It does not modify the raw string [2]. Instead, its architecture resembles a database query planner—generating candidate coordinate arrays and using language-specific filters to narrow down boundary slices [2].

flowchart LR
    A[Raw Input String] --> B[Pass 1: Aggressive Candidate Identification]
    B --> C[Pass 2: Modular Filter Elimination]
    C --> D[Project Slices / Return Index Pointers]

By moving integer pointers rather than altering the text, yasbd-lib keeps the source layout intact throughout processing [2]. Token spans are tracked as a first-class structural signal during parsing rather than reconstructed afterward [2].
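
A minimal sketch of the candidate-then-filter idea, assuming a tiny abbreviation set and two simple filters. None of these names (find_spans, word_before, ABBREVIATIONS) come from yasbd-lib; they only illustrate how boundaries can be found as offsets without editing the input.

```python
import re

# Hypothetical filter data; a real language profile would carry far more entries.
ABBREVIATIONS = {"dr", "mr", "e.g"}


def word_before(text: str, idx: int) -> str:
    """Return the run of non-space characters that ends just before idx."""
    start = idx
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    return text[start:idx]


def find_spans(text: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets into the untouched input string."""
    # Pass 1: treat every terminator as a candidate boundary.
    candidates = [m.end() for m in re.finditer(r"[.?!]", text)]

    # Pass 2: filters reject candidates; the text itself is never edited.
    boundaries = []
    for end in candidates:
        followed_by_gap = end == len(text) or text[end].isspace()
        is_abbreviation = word_before(text, end - 1).lower() in ABBREVIATIONS
        if followed_by_gap and not is_abbreviation:
            boundaries.append(end)

    # Projection: turn surviving boundaries into slices of the original.
    spans, start = [], 0
    for end in boundaries:
        spans.append((start, end))
        start = end
        while start < len(text) and text[start].isspace():
            start += 1
    if start < len(text):
        spans.append((start, len(text)))
    return spans


text = "Dr. Smith arrived.  He was late, e.g. by an hour."
for start, end in find_spans(text):
    print((start, end), repr(text[start:end]))
```

Because every result is a pair of offsets into the original string, the double space between the two sentences is still accounted for and no later search is needed to recover positions.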

Deep-Dive Feature Breakdown

1. Spans and Character Offsets

Because PySBD does not natively track indices during its transformation phase, calculating character offsets requires a post-processing step that searches the original document to locate each sentence [1].

The PySBD Reconstruction Step

PySBD reconstructs spans by scanning the original text for each sentence. The method below is quoted from its source:

def sentences_with_char_spans(self, sentences):
    sent_spans = []
    prior_end_char_idx = 0
    for sent in sentences:
        for match in re.finditer('{0}\s*'.format(re.escape(sent)), self.original_text):
            match_str = match.group()
            match_start_idx, match_end_idx = match.span()
            if match_end_idx > prior_end_char_idx:
                sent_spans.append(
                    TextSpan(match_str, match_start_idx, match_end_idx))
                prior_end_char_idx = match_end_idx
                break
    return sent_spans

The Trade-off: This reconstruction performs repeated searches over the original document. In pathological cases—such as documents with many repeated sentences—this can approach quadratic behavior. For typical use cases, the overhead is manageable, but it does add runtime cost on longer texts.
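
To make the repeated-sentence case concrete, the sketch below mirrors the loop structure of the quoted method and counts how many regex matches it visits. The function name count_matches_examined and the test document are assumptions for illustration; it is not part of PySBD.

```python
import re


def count_matches_examined(sentences: list[str], original: str) -> int:
    """Mirror the search-based reconstruction and count regex matches visited."""
    examined = 0
    prior_end = 0
    for sent in sentences:
        # Each sentence restarts the scan from the beginning of the document.
        for match in re.finditer(re.escape(sent) + r"\s*", original):
            examined += 1
            if match.end() > prior_end:
                prior_end = match.end()
                break
    return examined


for n in (10, 100, 1000):
    sentences = ["Yes."] * n
    original = " ".join(sentences)
    print(n, count_matches_examined(sentences, original))
# 10 55
# 100 5050
# 1000 500500
```

With n identical sentences the k-th lookup has to skip k - 1 earlier matches, so the total is n(n + 1)/2. Documents without repeats stop at the first match per sentence, although each search still starts from the top of the text.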

The yasbd-lib Approach

In yasbd-lib, spans are produced natively during boundary detection [2]. Each boundary is emitted together with its offsets as it is found, so no second search over the original text is needed [2]. The library also provides an adapter layer for migrating from PySBD [2]:

from yasbd.utils.pysbd_adapter import Segmenter

seg = Segmenter(language="ja")
res = seg.segment('田中さんは「準備は完了しました」そう言って部屋を出た。U.S.A.の経済政策 is complex.')
print(res)
# ['田中さんは「準備は完了しました」そう言って部屋を出た。', 'U.S.A.の経済政策 is complex.']

2. Memory and Streaming

PySBD processes text as complete string buffers [1]. yasbd-lib provides abstractions for memory-constrained environments through lazy evaluation via ParagraphStream and StreamCleaner [2]:

from yasbd.utils.cleaner import StreamCleaner
from yasbd import BoundaryDetector

cleaner = StreamCleaner("Hello  world.   This is  messy.")
detector = BoundaryDetector(lang="en")

sentences = list(detector.segment(cleaner))
print(sentences)
# ['Hello world.', 'This is messy.']
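
The general idea behind lazy evaluation can be shown with a plain generator. This is a standard-library sketch under the assumption that input arrives line by line; iter_paragraphs is a hypothetical helper and says nothing about how ParagraphStream or StreamCleaner are implemented.

```python
import io
from typing import Iterable, Iterator


def iter_paragraphs(lines: Iterable[str]) -> Iterator[str]:
    """Yield one whitespace-normalized paragraph at a time."""
    buffer: list[str] = []
    for line in lines:
        if line.strip():
            # Collapse runs of whitespace inside the line.
            buffer.append(" ".join(line.split()))
        elif buffer:
            # A blank line closes the current paragraph.
            yield " ".join(buffer)
            buffer = []
    if buffer:
        yield " ".join(buffer)


source = io.StringIO("Hello  world.\nThis is  messy.\n\nSecond   paragraph.\n")
for paragraph in iter_paragraphs(source):
    print(paragraph)
# Hello world. This is messy.
# Second paragraph.
```

Only one paragraph is held in memory at a time, so the same loop works on an open file handle of any size.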

3. Resource Management

Under the hood of the BoundaryDetector pipeline, yasbd-lib manages execution rules using a 5-entry LRU cache (_MAX_CACHED_RULES = 5) [2]. When using automatic language identification (lang="auto"), if confidence drops below the threshold (_MIN_CONFIDENCE = 0.8), the module logs an informational message rather than masking the failure [2]:

# From boundary_detector.py
if lang == "auto":
    lang, confidence = classify_language(snippet)
    if confidence < _MIN_CONFIDENCE:
        log_info(
            self.verbose,
            "Low confidence ({:.2f}) for detected lang {!r} in auto mode",
            confidence,
            lang,
        )
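
What a five-entry LRU cache means in practice can be seen with functools.lru_cache. This sketch assumes a hypothetical load_rules function standing in for rule construction; whether yasbd-lib uses functools internally is not something the source states.

```python
from functools import lru_cache

_MAX_CACHED_RULES = 5  # same value the article quotes


@lru_cache(maxsize=_MAX_CACHED_RULES)
def load_rules(lang: str) -> dict:
    """Hypothetical loader: stands in for building a language's rule set."""
    return {"lang": lang}


# Six distinct languages, then a return to the first one.
for lang in ["en", "fr", "de", "es", "it", "ja", "en"]:
    load_rules(lang)

info = load_rules.cache_info()
print(info.hits, info.misses, info.currsize)
# 0 7 5
```

Loading "ja" evicts "en", the least recently used entry, so the final "en" call is a miss. A workload that rotates through more than five languages would rebuild rule sets repeatedly.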

Additionally, yasbd-lib supports preserving token boundaries inside parentheses or brackets via preserve_quote_and_paren=True [2].

Maintenance Status: A Critical Consideration

The architectural differences matter, but there's another factor: PySBD is effectively unmaintained. As of December 2025, an open issue (#135) [9] notes that the repository has seen no recent updates, with multiple open PRs from contributors and a maintainer who has seemingly abandoned the project. The issue author explicitly requested archiving the project to signal to downstream users that they should no longer incorporate it [9].

The maintenance situation has real consequences. Consider these unresolved issues:

[Issue #79] [10] - Infinite Loop (October 2020):

import pysbd

segmenter = pysbd.Segmenter(language="en", clean=False)
text = "..[111 111 111 111 111 111 111 111 111 111]"
segmenter.segment(text)  # Hangs indefinitely

The problem is catastrophic backtracking in NUMBERED_REFERENCE_REGEX. The maintainer acknowledged it in February 2021, saying "Need to dug into details" [10]. Over four years later, it remains unresolved.

[Issue #92] [11] - Catastrophic Backtracking in HTMLTagRule (February 2021):

HTMLTagRule = Rule(r"<\/?\w+((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[\^'\">\s]+))?)+\s*|\s*)\/?>", '')

When processing unfinished HTML attributes, this regex can cause the segmenter to hang indefinitely [11]. A simplified fix was proposed in the same issue, but it remains unreviewed and unmerged.

Both issues stem from the same root cause: regex patterns with nested quantifiers in the transformation pipeline [10, 11]. The project has no active maintainer to review or merge fixes [9].
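
The effect of nested quantifiers is easy to reproduce with a textbook pattern. The regex below is a classic demonstration case and not one of PySBD's rules; the input sizes are kept small on purpose so the script finishes quickly.

```python
import re
import time

# A classic nested-quantifier pattern; not one of PySBD's regexes.
NESTED = re.compile(r"(a+)+$")

for n in (14, 16, 18, 20):
    text = "a" * n + "b"  # the trailing "b" forces the match to fail
    start = time.perf_counter()
    result = NESTED.match(text)
    elapsed = time.perf_counter() - start
    print(f"n={n:2d} matched={result is not None} seconds={elapsed:.4f}")
```

Each extra "a" roughly doubles the number of ways the engine can divide the run between the inner and outer quantifier before giving up, so a few dozen characters are enough to make a call appear to hang.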

yasbd-lib was built in response to this situation. It offers a drop-in adapter that mirrors PySBD's Segmenter interface, so existing code can switch engines without heavy refactoring [2, 9].

Benchmark Comparisons

The architectural differences influence accuracy and performance.

Feature	PySBD	yasbd-lib
Python Support	3.7–3.11 [1]	3.10–3.14 [2]
Maintenance	Unmaintained (as of 2025) [9]	Actively maintained [2]
Known Issues	Infinite loop on numbered references [10]; catastrophic backtracking in HTML cleaner [11]	No known catastrophic backtracking issues
Approach	Monolithic transformation pipeline [1]	Modular immutable pipeline [2]
State Handling	String mutation with placeholder tokens [1]	Pointer-based operations [2]
Language Profiles	22 Languages [1]	39 Languages [2]
English Golden Score	77 / 92 (83.7%) [2]	91 / 92 (98.9%) [2]
Framework Adapters	Native API [1]	spaCy v3+ integration [2]

Benchmark Note: The English Golden Score is measured on the project's expanded golden corpus of 92 evaluation cases [2]. The original PySBD corpus contained 48 cases; the expanded set removes ambiguous examples and adds coverage for abbreviation chains, contiguous terminators, and other edge cases. Full methodology and test cases are available in the benchmarks directory.
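
The percentages in the table follow directly from the pass counts. A short check, using only the figures quoted above:

```python
TOTAL_CASES = 92
passed = {"PySBD": 77, "yasbd-lib": 91}

for name, count in passed.items():
    print(f"{name}: {count}/{TOTAL_CASES} = {count / TOTAL_CASES:.1%}")
# PySBD: 77/92 = 83.7%
# yasbd-lib: 91/92 = 98.9%

print("difference in cases:", passed["yasbd-lib"] - passed["PySBD"])
# difference in cases: 14
```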

Figure: a multi-library performance comparison across increasing text sizes. In the project's benchmarks, yasbd-lib outperforms the alternatives at every scale shown.

Edge Case Behavior

Consider how both engines handle challenging inputs:

Input with numbered references (Issue #79): "..[111 111 111 111 111 111 111 111 111 111]" [10]

PySBD: Can enter an infinite loop due to catastrophic backtracking in NUMBERED_REFERENCE_REGEX. This was reported in October 2020 and remains unresolved [10].
yasbd-lib: The two-pass boundary detection approach avoids complex regex substitutions, preventing this class of issue [2].

Input with unfinished HTML (Issue #92): "<iframe width="100%" ... src="url Lorem ipsum..." [11]

PySBD: The HTML cleaning regex can cause catastrophic backtracking, hanging the segmenter indefinitely [11]. Reported in February 2021, still unresolved.
yasbd-lib: Uses a StreamCleaner with configurable cleaning steps, including optional HTML unwrapping that avoids nested quantifiers [2].

Extensibility: Configuration Approaches

What happens when you need to handle custom abbreviations like "Com." or "Adm."?

PySBD: Internal Rule Modification

Because PySBD's rules operate on a shared transformation timeline, they are interdependent [1]. Adding exceptions requires modifying the internal mutation flow [1].

As documented in [Issue #108] [3]:

"Unfortunately, there is no specific documentation about modifying rules as there are so many and each rule is associated with some form of transformation... All those operations need to be performed in that sequence as they are interrelated... Best way is to use python debugger and see how your input text goes through different transformations."

Adding rules without understanding the full pipeline can break downstream regex patterns [1]. With the project unmaintained, there is no clear path for getting such fixes merged upstream [9].

yasbd-lib: Declarative Language Profiles

yasbd-lib decouples matching mechanics from language-specific data [2]. It exposes structured hooks for customization [2]:

The base Rules class defines sets for:

TITLE_ABBRVS: Honorifics that should not split sentences
REFERENCE_ABBRVS: Citation abbreviations (fig, pág)
INLINE_ONLY_ABBRVS: Abbreviations that don't end sentences (blvd)
DATE_ABBRVS: Month and weekday abbreviations
DOTTED_GEOPOL_ABBRVS: Geo abbreviations like U.S., E.U.
TERMINATORS: Extra sentence-ending punctuation
COMMON_SENT_STARTERS: Boundary hints for languages without spaces
POST_QUOTATIVE_PARTICLES and REPORTING_WORDS: For dialogue attribution

To add a new language, you create a file like fr.py, subclass Rules as FrRules, and override only the sets your language needs. The language template [6] provides the structure.
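
The override pattern might look roughly like this. The Rules class here is a stand-in defined only for the example, and the use of frozenset plus the specific French entries are assumptions; the real base class and its attribute types come from yasbd-lib and its language template.

```python
class Rules:
    """Stand-in for the base profile; the real class ships with yasbd-lib."""

    TITLE_ABBRVS: frozenset[str] = frozenset()
    REFERENCE_ABBRVS: frozenset[str] = frozenset({"fig"})
    DATE_ABBRVS: frozenset[str] = frozenset()


class FrRules(Rules):
    """French profile: override only the sets that differ from the base."""

    TITLE_ABBRVS = frozenset({"m", "mme", "mlle", "dr"})
    DATE_ABBRVS = frozenset({"janv", "févr", "avr", "juil", "sept", "oct", "nov", "déc"})


# Sets that are not overridden are inherited unchanged.
print(FrRules.REFERENCE_ABBRVS)
```

Keeping the data in class attributes means a profile is pure configuration: nothing in FrRules touches matching logic.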

External Language Packs

yasbd-lib supports loading custom language modules at runtime via register_lang_packs() [2]:

from yasbd.rules import register_lang_packs
from yasbd import BoundaryDetector

register_lang_packs(["clinical_yasbd_pack"])
detector = BoundaryDetector(lang="clinical")

Changes to a language-specific profile do not affect the core engine's boundary detection [2].

Language Profile Policy

As documented in [Issue #198] [5], yasbd-lib has frozen its built-in language set at 39 profiles for the v1.x series to maintain API stability. Additional languages must be loaded externally via register_lang_packs() using community-maintained packages like yasbd-extras or yasbd-community.

Which Library Should You Choose?

graph TD
    A[Which SBD to choose?] --> B{Using legacy spaCy v2?}
    B -- Yes --> C[Consider PySBD with caution]
    B -- No --> D{Need active maintenance?}
    D -- Yes --> E[Use yasbd-lib]
    D -- No --> F[Use yasbd-lib for accuracy gains]

Consider PySBD only if:

Absolute legacy lock-in: You are maintaining an existing pipeline tied to spaCy v2 or older deployments that cannot be migrated. Be aware that the project is unmaintained and has known unresolved issues, including infinite loops with numbered references and catastrophic backtracking with certain HTML inputs [9, 10, 11].

Use yasbd-lib if:

Active maintenance: The project is actively maintained with a clear contribution path [2].
Performance at scale: Benchmark results on the Sherlock Holmes text (594k characters) show yasbd-lib completing in approximately 1.6 seconds compared to 13.3 seconds for PySBD on the same hardware, roughly an 8x difference [2]. (These results are from the project's benchmark suite; your mileage may vary based on hardware and Python version.)
Character span accuracy: Native span tracking may be preferable for downstream tasks like NER training or RAG indexing [2].
Non-standard inputs: The modular design handles raw markdown, chat logs, and multilingual text [2].
Custom language rules: The declarative profile system simplifies adding new languages or domain-specific abbreviations [2].
Migration path: The included PySBD adapter allows incremental migration without rewriting your entire pipeline [2].
No catastrophic backtracking: The pointer-based architecture avoids the regex issues that plague PySBD's transformation pipeline [2, 10, 11].

References

Introducing chunklet-py 2.2.0+

Speedyk-005 — Mon, 23 Feb 2026 03:10:16 +0000

The Smart Text Chunking Library You Didn't Know You Needed

Ever tried splitting text for your RAG pipeline and ended up with chunks that cut sentences in half? Or worse — chunks that lose all context between them?

Yeah, I've been there too. That's exactly why I built chunklet-py — a Python library that actually understands text structure.

This post covers only the highlights. The full documentation covers the rest, including:

Custom sentence splitters for specialized languages
Custom document processors for unusual file formats
Custom tokenizers to match your LLM
Rich metadata available on each chunk
CLI flags for batch processing, parallel jobs, error handling, timeouts
Additional args like n_jobs, lang, show_progress, ...

⚠ Quick heads up!

This tutorial requires chunklet-py v2.2.0+ and uses APIs not available in earlier versions.

Upgrade to the latest version and see the documentation or What’s New for details.

The Problem with Dumb Splitting

Here's what usually happens:

# The naive approach
chunks = [text[i:i+500] for i in range(0, len(text), 500)]

This works... until it doesn't:

Sentences cut mid-way ("The model got 75%" → "75%" becomes meaningless)
No context between chunks
Broken code if you're chunking source files
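
The failure is easy to see on a short string. The sample text and the reduced width are made up for the demonstration:

```python
text = "The model got 75% accuracy on the test set. The baseline got 60%."

# Fixed-width slicing, as in the naive approach (width shrunk to fit the demo).
WIDTH = 20
chunks = [text[i:i + WIDTH] for i in range(0, len(text), WIDTH)]

for chunk in chunks:
    print(repr(chunk))
```

The first chunk ends in the middle of the word "accuracy", and no chunk lines up with a sentence boundary.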

Solution: chunklet-py

A smart text and code chunking library that respects natural boundaries.

Features

50+ languages supported — Auto-detects language and applies the right splitting rules. No more treating German the same as English.

Multiple constraint types — Mix and match:

max_sentences — group by sentences
max_tokens — respect LLM context limits
max_section_breaks — keep Markdown headers together (headings ##, horizontal rules ---, <details> tags)
max_lines — for code chunking
max_functions — keep functions together

Multiple file formats — PDF, DOCX, EPUB, HTML, Markdown, LaTeX, ODT, CSV, Excel, plain text — one library handles them all.

Rich metadata — Every chunk comes with source references, character spans, and structural info.

Composable constraints — Mix and match limits to get exactly the chunks you need.

Pluggable architecture — Swap in custom tokenizers, sentence splitters, or document processors.

What's New in v2.2.0

API Unification — Methods renamed to chunk_text, chunk_file, chunk_texts, chunk_files for consistency
Visualizer redesign — Fullscreen mode, 3-row layout, smoother hovers
More code languages — ColdFusion, VB.NET, PHP 8 attributes, Pascal support
Ruff — Switched to Ruff for faster linting

Check the What's New page for full details.

Installation

pip install chunklet-py

For document support:

pip install "chunklet-py[structured-document]"

For code:

pip install "chunklet-py[code]"

For visualization:

pip install "chunklet-py[visualization]"

Code Examples

Core Imports

from chunklet import DocumentChunker   # For PDFs, DOCX, and general text
from chunklet import CodeChunker       # For source code
from chunklet import SentenceSplitter  # For just sentences
from chunklet import visualizer        # Web-based visualizer

DocumentChunker API

Four methods cover most use cases:

Method	Input	Return Type
`chunk_text(text)`	str	List[Chunk]
`chunk_file(path)`	Path or str	List[Chunk]
`chunk_texts(list)`	List[str]	Generator[Chunk]
`chunk_files(list)`	List[Path]	Generator[Chunk]

DocumentChunker Example

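# max_tokens requires a token_counter; a plain word count stands in for a real tokenizer
chunker = DocumentChunker(token_counter=lambda s: len(s.split()))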

# Feel free to mix and match these
chunks = chunker.chunk_text(
    text,
    max_sentences=3,       # Stop after X sentences
    max_tokens=500,        # Don't blow up the LLM context
    max_section_breaks=2,  # Respect the Markdown headers
    overlap_percent=20,    # Give it some "memory" of the last chunk
    offset=0               # Skip the first N sentences
)

CodeChunker Example

# max_tokens requires a token_counter; a plain word count stands in for a real tokenizer
chunker = CodeChunker(token_counter=lambda s: len(s.split()))

chunks = chunker.chunk_text(
    code,
    max_lines=50,           # Height limit
    max_tokens=512,         # Width limit
    max_functions=1,        # One function per chunk
    strict=True,            # True: Crash on big blocks; False: Slice anyway
    include_comments=True,  # True by default
    docstring_mode="all",   # Options are: all, excluded, summary
)

⚠ Token Counter Requirement
When using the max_tokens constraint, a token_counter function is essential. This function, which you provide, should accept a string and return an integer representing its token count. Failing to provide a token_counter will result in a MissingTokenCounterError.
You can also provide the token_counter directly to any chunking method. If provided in both the constructor and the method, the one in the method will be used.
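
Any callable that takes a string and returns an int fits that description. The one below is a rough stand-in that counts words and punctuation marks; the name simple_token_counter is made up, and a tokenizer matching your LLM will give more faithful counts.

```python
import re


def simple_token_counter(text: str) -> int:
    """Rough token count: words and punctuation marks counted separately."""
    return len(re.findall(r"\w+|[^\w\s]", text))


print(simple_token_counter("Hello, world!"))  # 4
```

Per the note above, it would be handed over as token_counter=simple_token_counter, either when constructing a chunker or on an individual chunking call.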

SentenceSplitter (Just Sentences)

from chunklet import SentenceSplitter

splitter = SentenceSplitter()
sentences = splitter.split_text(text, lang="en")   # You can also set it to "auto"

Handles tricky cases like "Dr." or "U.S.A." without breaking them up.

50+ languages are explicitly supported through dedicated libraries: pysbd covers 22, Indic NLP Library covers 11, sentsplit covers 4, and Sentencex covers ~15, with some overlap. The Fallback Splitter handles any other language via Unicode rules (see the Supported Languages documentation).

Output Object

Chunkers return Chunk objects (Box instances), so you use dot notation:

for chunk in chunks:
    print(chunk.content)   # The actual text/code
    print(chunk.metadata)  # Chunk metadata

Visualizer (Interactive Web UI)

Launch a web interface to experiment with chunking parameters:

chunklet visualize

Or programmatically:

from chunklet import visualizer

v = visualizer.Visualizer(host="127.0.0.1", port=8000)
v.serve()  # Opens in your browser

CLI Examples

Prefer the terminal? chunklet-py ships with a full CLI.

Here are some quick examples:

# Basic text chunking
chunklet chunk "Your text here." --max-tokens 500

# Chunk a file
chunklet chunk --source document.pdf --max-tokens 500 --metadata

# Split text into sentences
chunklet split "Your text here." --lang en

# Split a file into sentences
chunklet split --source my_file.txt --destination sentences.txt

# Start the interactive visualizer
chunklet visualize

# Code chunking
chunklet chunk --code --source my_script.py --max-functions 1

# Batch processing a directory
chunklet chunk --doc --source ./my_docs --destination ./chunks --n-jobs 4

# With error handling
chunklet chunk --doc --source ./my_docs --on-errors skip

How It Compares

While there are other chunking libraries available, Chunklet-py stands out for its unique combination of versatility, performance, and ease of use. Here's a quick look at how it compares to some of the alternatives:

Library	Key Differentiator	Focus
chunklet-py	All-in-one, lightweight, multilingual, language-agnostic with specialized algorithms.	Text, Code, Docs
LangChain	Full LLM framework with basic splitters (e.g., RecursiveCharacterTextSplitter, Markdown, HTML, code splitters). Good for prototyping but basic for complex docs or multilingual needs.	Full Stack
Chonkie	All-in-one pipeline (chunking + embeddings + vector DB). Uses `tree-sitter` for code. Multilingual.	Pipelines
Semchunk	Text-only, fast semantic splitting. Built-in tiktoken/HuggingFace support. 85% faster than alternatives.	Text
CintraAI Code Chunker	Code-specific, uses `tree-sitter`. Initially supports Python, JS, CSS only.	Code

Chunklet-py is a specialized, drop-in replacement for the chunking step in any RAG pipeline. It handles text, documents, and code without heavy dependencies, while keeping your project lightweight.

🙌 Contributors & Thanks

A huge thank you to the awesome people who helped shape Chunklet-py:

@jmbernabotto — for helping mostly on the CLI part, suggesting fixes, features, and design improvements.
@arnoldfranz — for reporting the CLI Path Validation Bug (#6) that helped improve error handling.

License

Check out the LICENSE file for all the details.

Wrap Up

Chunklet-py is production-ready. It's lightweight, has no heavy dependencies, and the API is consistent — no more guessing which method name to use.

Check it out: github.com/speedyk-005/chunklet-py

Questions? Drop them in the comments!

Chunklet-py (v2+): One Library to Split Them All - Sentence, Code, Docs

Speedyk-005 — Sat, 20 Dec 2025 18:24:05 +0000

I've been working on Chunklet-py - a powerful Python library for intelligent text and document chunking that's perfect for LLM/RAG applications. Here's why you might want to check it out:

⚠ This guide targets chunklet-py v2.1.1.

APIs from v2.2.0+ are not included.

See the latest docs for updates.

🔧 What It Does

Chunklet-py is your friendly neighborhood text splitter that takes all kinds of content and breaks it into smart, context-aware chunks. Instead of dumb character-count splitting, it gives you specialized tools for:

Sentence Splitter - Multilingual text splitting (50+ languages!)
Plain Text Chunker - Basic text chunking with constraints
Document Chunker - Processes PDFs, DOCX, EPUB, ODT, CSV, Excel, and more
Code Chunker - Language-agnostic code splitting that preserves structure
Chunk Visualizer - Interactive web interface for real-time chunk exploration

🚀 Key Features

Blazingly Fast: Parallel processing for large document batches
Featherlight Footprint: Lightweight and memory-efficient
Rich Metadata: Context-aware metadata for advanced RAG applications
Multilingual Mastery: 50+ languages with intelligent detection
Triple Interface: CLI, library, or web interface
Infinitely Customizable: Pluggable token counters, custom splitters, processors

💻 Quick Example

from chunklet import PlainTextChunker

chunker = PlainTextChunker()
chunks = chunker.chunk(
    "Your long text here...",
    max_tokens=1000,
    max_sentences=10
)

for chunk in chunks:
    print(f"Content: {chunk.content[:50]}...")
    print(f"Metadata: {chunk.metadata}")

📊 Why It Matters

Traditional text splitting often breaks meaning - mid-sentence cuts, lost context, language confusion. Chunklet-py keeps your content's structure and meaning intact, making it perfect for:

Preparing data for LLMs
Building RAG systems
AI search applications
Document processing pipelines

🛠️ Installation

pip install chunklet-py

# For full features:
pip install "chunklet-py[all]"

📈 Community & Stats

50+ languages supported
10+ document formats processed
MIT licensed - free and open source
Active development with comprehensive testing

Check out the documentation and GitHub repo for more details!

What do you think? Have you worked on similar text processing challenges? Any questions about chunking strategies or the library?

🔗 Related Posts

🚀 Latest version (v2.2.0+):

https://dev.to/speed_k_7e1b449706e59e433/-introducing-chunklet-py-dj8
🧠 Legacy version (v1, outdated):

https://dev.to/speed_k_7e1b449706e59e433/stop-breaking-context-smarter-text-chunking-for-python-nlp-projects-2n8n

"Stop Breaking Context: Smarter Text Chunking for Python NLP Projects"

Speedyk-005 — Wed, 13 Aug 2025 21:59:51 +0000

Chunklet: Smarter Text Chunking for Python Developers

⚠ This post is outdated

This guide uses chunklet v1.x, which is no longer maintained. See the Migration Guide: https://speedyk-005.github.io/chunklet-py/latest/migration/

👉 Use chunklet-py v2.x instead:

https://dev.to/speed_k_7e1b449706e59e433/chunklet-py-one-library-to-split-them-all-sentence-code-docs-2eeg

🚀 Latest version (v2.2.0+):

https://dev.to/speed_k_7e1b449706e59e433/-introducing-chunklet-py-dj8

Why Context Matters in Text Splitting

When preprocessing documents for NLP tasks, standard splitting methods often:

Break sentences mid-thought ("The patient showed improvement. However," → "However,")
Ignore linguistic boundaries in non-English texts
Lose critical context between chunks

Chunklet solves this with structural awareness.

1. Installation & Basic Usage

pip install chunklet

Minimal Example:

from chunklet import Chunklet

text = "First sentence. Second sentence. Third sentence."
chunker = Chunklet()
chunks = chunker.chunk(text, mode="sentence", max_sentences=2)

# Output:
# ["First sentence. Second sentence.", "Third sentence."]

This preserves complete sentences while respecting chunk size limits.

2. Key Features Explained

Hybrid Chunking Mode

Combines structural and size-based splitting:

chunks = chunker.chunk(
    text,
    mode="hybrid",
    max_sentences=3,  # Structural limit
    max_tokens=200,   # Size limit
    overlap_percent=15  # Context preservation
)

Why this matters:

Prevents chunks from becoming too long or too short
Overlap maintains relationships between sections
Works equally well on code, markdown, or prose
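
How a sentence limit and overlap interact can be sketched as a sliding window. This is a conceptual illustration only: group_sentences is a hypothetical helper, and it expresses overlap as a whole number of sentences, whereas the library takes overlap_percent.

```python
def group_sentences(sentences: list[str], max_sentences: int, overlap: int) -> list[list[str]]:
    """Group sentences into chunks, repeating the last `overlap` sentences."""
    if not 0 <= overlap < max_sentences:
        raise ValueError("overlap must be smaller than max_sentences")
    step = max_sentences - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(sentences[start:start + max_sentences])
        if start + max_sentences >= len(sentences):
            break  # the window already reached the end
    return chunks


sentences = ["S1.", "S2.", "S3.", "S4.", "S5."]
print(group_sentences(sentences, max_sentences=3, overlap=1))
# [['S1.', 'S2.', 'S3.'], ['S3.', 'S4.', 'S5.']]
```

The repeated "S3." is what carries context from one chunk into the next.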

Multilingual Support

# Auto-detection (36+ languages)
chunks = chunker.chunk(multilingual_text)

# Manual override
chunks = chunker.chunk(japanese_text, language="ja")

How it works:

Uses py3langid for fast language detection
Applies language-specific sentence boundaries
Falls back to regex for unsupported languages
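
The dispatch-with-fallback pattern can be sketched with the standard library alone. Language detection is left out here (the language code is passed in), and the splitter table, both regexes, and the function names are assumptions for illustration, not chunklet's internals.

```python
import re
from typing import Callable


def regex_fallback(text: str) -> list[str]:
    """Generic fallback: split on whitespace that follows . ? or !"""
    return [s for s in re.split(r"(?<=[.?!])\s+", text) if s]


# Hypothetical table of language-specific splitters.
SPLITTERS: dict[str, Callable[[str], list[str]]] = {
    # Japanese has no spaces between sentences, so split right after the terminator.
    "ja": lambda text: [s for s in re.split(r"(?<=[。！？])", text) if s],
}


def split_sentences(text: str, lang: str) -> list[str]:
    """Use a dedicated splitter when one exists, else the regex fallback."""
    return SPLITTERS.get(lang, regex_fallback)(text)


print(split_sentences("今日は晴れ。明日は雨。", "ja"))
print(split_sentences("Hola. Adiós.", "xx"))
```

Unknown language codes still produce output through the fallback, at the cost of the abbreviation and punctuation handling a dedicated splitter would provide.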

3. Real-World Use Cases

Preparing Legal Documents

from pathlib import Path

legal_text = Path("contract.txt").read_text()
chunks = chunker.chunk(
    legal_text,
    mode="hybrid",
    max_tokens=512,
    overlap_percent=20  # Critical for clause relationships
)

Why it works:

Preserves entire contract clauses
Maintains references between sections (e.g., "as defined in Section 2.1")
Handles complex punctuation in legal prose

Processing Academic Papers

chunker = Chunklet(
    sentence_splitter=custom_academic_splitter,  # Handles citations
    token_counter=scibert_tokenizer  # Domain-specific counting
)

Customization options:

Plug in any sentence splitter
Use HuggingFace tokenizers
Adjust chunking thresholds per document type

4. Performance Considerations

# For large datasets:
results = chunker.batch_chunk(
    documents,
    n_jobs=4,         # Parallel processing
    chunk_size=1000   # Documents per batch
)

Optimization tips:

Enable use_cache=True for repeated texts
Pre-filter very short/long documents
Monitor memory with memory_profiler

Ready to try?

GitHub Repository | PyPI Package