<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Speedyk-005</title>
    <description>The latest articles on DEV Community by Speedyk-005 (@speed_k_7e1b449706e59e433).</description>
    <link>https://dev.to/speed_k_7e1b449706e59e433</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2273923%2F3adfb205-24cf-4e87-ba59-30c6e0beb6ac.jpg</url>
      <title>DEV Community: Speedyk-005</title>
      <link>https://dev.to/speed_k_7e1b449706e59e433</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/speed_k_7e1b449706e59e433"/>
    <language>en</language>
    <item>
      <title>Introducing chunklet-py 2.2.0+</title>
      <dc:creator>Speedyk-005</dc:creator>
      <pubDate>Mon, 23 Feb 2026 03:10:16 +0000</pubDate>
      <link>https://dev.to/speed_k_7e1b449706e59e433/-introducing-chunklet-py-dj8</link>
      <guid>https://dev.to/speed_k_7e1b449706e59e433/-introducing-chunklet-py-dj8</guid>
      <description>&lt;p&gt;The Smart Text Chunking Library You Didn't Know You Needed&lt;/p&gt;

&lt;p&gt;Ever tried splitting text for your RAG pipeline and ended up with chunks that cut sentences in half? Or worse — chunks that lose all context between them?&lt;/p&gt;

&lt;p&gt;Yeah, I've been there too. That's exactly why I built &lt;a href="https://github.com/speedyk-005/chunklet-py" rel="noopener noreferrer"&gt;chunklet-py&lt;/a&gt; — a Python library that actually understands text structure.&lt;/p&gt;

&lt;p&gt;This post covers only the highlights. The &lt;a href="https://speedyk-005.github.io/chunklet-py/latest/" rel="noopener noreferrer"&gt;full documentation&lt;/a&gt; covers the rest, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Custom sentence splitters for specialized languages&lt;/li&gt;
&lt;li&gt;Custom document processors for unusual file formats&lt;/li&gt;
&lt;li&gt;Custom tokenizers to match your LLM&lt;/li&gt;
&lt;li&gt;Rich metadata attached to each chunk&lt;/li&gt;
&lt;li&gt;CLI flags for batch processing, parallel jobs, error handling, timeouts&lt;/li&gt;
&lt;li&gt;Additional arguments such as &lt;code&gt;n_jobs&lt;/code&gt;, &lt;code&gt;lang&lt;/code&gt;, and &lt;code&gt;show_progress&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠ &lt;strong&gt;Quick heads up!&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
This tutorial requires &lt;code&gt;chunklet-py v2.2.0+&lt;/code&gt; and uses APIs not available in earlier versions. &lt;/p&gt;

&lt;p&gt;Upgrade to the latest version and see the &lt;a href="https://speedyk-005.github.io/chunklet-py/latest/" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; or &lt;a href="https://speedyk-005.github.io/chunklet-py/latest/whats-new/" rel="noopener noreferrer"&gt;What’s New&lt;/a&gt; for details.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Problem with Dumb Splitting
&lt;/h2&gt;

&lt;p&gt;Here's what usually happens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The naive approach
&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works... until it doesn't:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sentences cut mid-way ("The model got 75%" → "75%" becomes meaningless)&lt;/li&gt;
&lt;li&gt;No context between chunks&lt;/li&gt;
&lt;li&gt;Broken code if you're chunking source files&lt;/li&gt;
&lt;/ul&gt;
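&lt;p&gt;To make the first failure concrete, here is a small sketch of fixed-width slicing on a made-up two-sentence string. The sample text and the 30-character width are illustrative assumptions, and no chunklet-py code is involved:&lt;/p&gt;

```python
# Illustrative only: plain Python, no chunklet-py involved.
text = "The model got 75% accuracy on the test set. It was trained for ten epochs."
width = 30  # hypothetical chunk size, kept small so the problem is easy to see

# Same idea as the naive approach above, just with a smaller window.
chunks = [text[i:i + width] for i in range(0, len(text), width)]

for chunk in chunks:
    print(repr(chunk))
```

&lt;p&gt;Every slice boundary falls wherever the character count says it should, so the first chunk stops in the middle of its sentence.&lt;/p&gt;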

&lt;h2&gt;
  
  
  Solution: chunklet-py
&lt;/h2&gt;

&lt;p&gt;A smart text and code chunking library that respects natural boundaries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;50+ languages supported&lt;/strong&gt; — Auto-detects language and applies the right splitting rules. No more treating German the same as English.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multiple constraint types&lt;/strong&gt; — Mix and match:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;max_sentences&lt;/code&gt; — group by sentences&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max_tokens&lt;/code&gt; — respect LLM context limits
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max_section_breaks&lt;/code&gt; — keep Markdown headers together (headings &lt;code&gt;##&lt;/code&gt;, horizontal rules &lt;code&gt;---&lt;/code&gt;, &lt;code&gt;&amp;lt;details&amp;gt;&lt;/code&gt; tags)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max_lines&lt;/code&gt; — for code chunking&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max_functions&lt;/code&gt; — keep functions together&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Multiple file formats&lt;/strong&gt; — PDF, DOCX, EPUB, HTML, Markdown, LaTeX, ODT, CSV, Excel, plain text — one library handles them all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rich metadata&lt;/strong&gt; — Every chunk comes with source references, character spans, and structural info.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Composable constraints&lt;/strong&gt; — Mix and match limits to get exactly the chunks you need.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pluggable architecture&lt;/strong&gt; — Swap in custom tokenizers, sentence splitters, or document processors.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's New in v2.2.0
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API Unification&lt;/strong&gt; — Methods renamed to &lt;code&gt;chunk_text&lt;/code&gt;, &lt;code&gt;chunk_file&lt;/code&gt;, &lt;code&gt;chunk_texts&lt;/code&gt;, &lt;code&gt;chunk_files&lt;/code&gt; for consistency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visualizer redesign&lt;/strong&gt; — Fullscreen mode, 3-row layout, smoother hovers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More code languages&lt;/strong&gt; — ColdFusion, VB.NET, PHP 8 attributes, Pascal support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ruff&lt;/strong&gt; — Switched to Ruff for faster linting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check the &lt;a href="https://speedyk-005.github.io/chunklet-py/latest/whats-new/" rel="noopener noreferrer"&gt;What's New&lt;/a&gt; page for full details.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;chunklet-py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For document support:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;chunklet-py[structured-document]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;chunklet-py[code]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For visualization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;chunklet-py[visualization]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Code Examples
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Core Imports
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;chunklet&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DocumentChunker&lt;/span&gt;   &lt;span class="c1"&gt;# For PDFs, DOCX, and general text
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;chunklet&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CodeChunker&lt;/span&gt;       &lt;span class="c1"&gt;# For source code
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;chunklet&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceSplitter&lt;/span&gt;  &lt;span class="c1"&gt;# For just sentences
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;chunklet&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;visualizer&lt;/span&gt;        &lt;span class="c1"&gt;# Web-based visualizer
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  DocumentChunker API
&lt;/h3&gt;

&lt;p&gt;Four methods cover most use cases:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;Return Type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;chunk_text(text)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;str&lt;/td&gt;
&lt;td&gt;List[Chunk]&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;chunk_file(path)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Path or str&lt;/td&gt;
&lt;td&gt;List[Chunk]&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;chunk_texts(list)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List[str]&lt;/td&gt;
&lt;td&gt;Generator[Chunk]&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;chunk_files(list)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List[Path]&lt;/td&gt;
&lt;td&gt;Generator[Chunk]&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  DocumentChunker Example
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;chunker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DocumentChunker&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Feel free to mix and match these
&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chunker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chunk_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_sentences&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;# Stop after X sentences
&lt;/span&gt;    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# Don't blow up the LLM context
&lt;/span&gt;    &lt;span class="n"&gt;max_section_breaks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Respect the Markdown headers
&lt;/span&gt;    &lt;span class="n"&gt;overlap_percent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# Give it some "memory" of the last chunk
&lt;/span&gt;    &lt;span class="n"&gt;offset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;               &lt;span class="c1"&gt;# Skip the first N sentences
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
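&lt;p&gt;To build intuition for &lt;code&gt;overlap_percent&lt;/code&gt;, the sketch below groups already-split sentences into windows that repeat the tail of the previous window. The function &lt;code&gt;group_with_overlap&lt;/code&gt; is a hypothetical helper written for this illustration, and chunklet-py's own overlap logic may well work differently:&lt;/p&gt;

```python
def group_with_overlap(sentences, max_sentences, overlap_percent):
    """Group sentences into windows that share a tail with the previous window.

    Hypothetical helper for illustration; not part of chunklet-py.
    """
    # Number of sentences carried over from one window to the next.
    overlap = max_sentences * overlap_percent // 100
    # Always advance by at least one sentence so the loop terminates.
    step = max(1, max_sentences - overlap)

    groups = []
    for start in range(0, len(sentences), step):
        groups.append(sentences[start:start + max_sentences])
        # Stop once a window has reached the end of the input.
        if start + max_sentences >= len(sentences):
            break
    return groups


sentences = ["A.", "B.", "C.", "D.", "E.", "F.", "G."]
groups = group_with_overlap(sentences, max_sentences=4, overlap_percent=25)
for group in groups:
    print(" ".join(group))
```

&lt;p&gt;With these made-up inputs, one sentence is shared between consecutive windows, which is the "memory" the comment above refers to.&lt;/p&gt;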



&lt;h3&gt;
  
  
  CodeChunker Example
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;chunker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CodeChunker&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chunker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chunk_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_lines&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# Height limit
&lt;/span&gt;    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# Width limit
&lt;/span&gt;    &lt;span class="n"&gt;max_functions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;# One function per chunk
&lt;/span&gt;    &lt;span class="n"&gt;strict&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;# True: Crash on big blocks; False: Slice anyway
&lt;/span&gt;    &lt;span class="n"&gt;include_comments&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# True by default
&lt;/span&gt;    &lt;span class="n"&gt;docstring_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# Options are: all, excluded, summary
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;⚠ &lt;strong&gt;Token Counter Requirement&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;max_tokens&lt;/code&gt; constraint requires a &lt;code&gt;token_counter&lt;/code&gt; function that you provide. It should accept a string and return its token count as an integer. Without one, chunking fails with a &lt;a href="https://speedyk-005.github.io/chunklet-py/latest/exceptions-and-warnings/#missingtokencountererror" rel="noopener noreferrer"&gt;MissingTokenCounterError&lt;/a&gt;.&lt;br&gt;
You can pass &lt;code&gt;token_counter&lt;/code&gt; to the constructor or directly to any chunking method. If it is provided in both places, the one passed to the method is used.&lt;/p&gt;
&lt;/blockquote&gt;
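&lt;p&gt;A token counter is just a callable that maps a string to an integer. The sketch below is a rough regex-based stand-in, and the name &lt;code&gt;simple_token_counter&lt;/code&gt; is hypothetical. For real context limits you would probably wrap the tokenizer that matches your LLM instead:&lt;/p&gt;

```python
import re


def simple_token_counter(text: str) -> int:
    """Approximate a token count: each word and each punctuation mark counts as one.

    Hypothetical stand-in for illustration; real LLM tokenizers count differently.
    """
    return len(re.findall(r"\w+|[^\w\s]", text))


count = simple_token_counter("Chunking text is fun!")
print(count)
```

&lt;p&gt;As described in the note above, a function like this could be passed as &lt;code&gt;token_counter&lt;/code&gt; to the constructor or to a chunking method.&lt;/p&gt;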

&lt;h3&gt;
  
  
  SentenceSplitter (Just Sentences)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;chunklet&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceSplitter&lt;/span&gt;

&lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceSplitter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;sentences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lang&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# You can also set it to "auto"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Handles tricky cases like "Dr." or "U.S.A." without breaking them up.&lt;/p&gt;
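&lt;p&gt;To see why abbreviations are tricky, here is a naive punctuation-based splitter in plain Python. It is a hypothetical counter-example, not how chunklet-py splits sentences:&lt;/p&gt;

```python
import re


def naive_split(text: str) -> list[str]:
    """Split after every run of '.', '!' or '?', with no abbreviation handling."""
    return [piece.strip() for piece in re.findall(r"[^.!?]+[.!?]+", text)]


sentences = naive_split("Dr. Smith arrived. He sat down.")
print(sentences)
```

&lt;p&gt;The input has two sentences, but the naive splitter returns three pieces because it treats the period in "Dr." as a sentence end.&lt;/p&gt;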

&lt;p&gt;More than 50 languages are explicitly supported through dedicated libraries: pysbd covers 40+, Indic NLP Library covers 11, sentsplit covers 4, and Sentencex covers ~15, with some overlap. For any other language, the Fallback Splitter applies Unicode-based rules. See the &lt;a href="https://speedyk-005.github.io/chunklet-py/latest/supported-languages/" rel="noopener noreferrer"&gt;Supported Languages Documentation&lt;/a&gt; for details.&lt;/p&gt;

&lt;h3&gt;
  
  
  Output Object
&lt;/h3&gt;

&lt;p&gt;Chunkers return Chunk objects (Box instances), so you use dot notation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# The actual text/code
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Chunk metadata
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Visualizer (Interactive Web UI)
&lt;/h3&gt;

&lt;p&gt;Launch a web interface to experiment with chunking parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;chunklet visualize
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or programmatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;chunklet&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;visualizer&lt;/span&gt;

&lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;visualizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Visualizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;127.0.0.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;serve&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Opens in your browser
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  CLI Examples
&lt;/h2&gt;

&lt;p&gt;Prefer the terminal? chunklet-py ships with a &lt;a href="https://speedyk-005.github.io/chunklet-py/latest/getting-started/cli/" rel="noopener noreferrer"&gt;full CLI&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here are some quick examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Basic text chunking&lt;/span&gt;
chunklet chunk &lt;span class="s2"&gt;"Your text here."&lt;/span&gt; &lt;span class="nt"&gt;--max-tokens&lt;/span&gt; 500

&lt;span class="c"&gt;# Chunk a file&lt;/span&gt;
chunklet chunk &lt;span class="nt"&gt;--source&lt;/span&gt; document.pdf &lt;span class="nt"&gt;--max-tokens&lt;/span&gt; 500 &lt;span class="nt"&gt;--metadata&lt;/span&gt;

&lt;span class="c"&gt;# Split text into sentences&lt;/span&gt;
chunklet &lt;span class="nb"&gt;split&lt;/span&gt; &lt;span class="s2"&gt;"Your text here."&lt;/span&gt; &lt;span class="nt"&gt;--lang&lt;/span&gt; en

&lt;span class="c"&gt;# Split a file into sentences&lt;/span&gt;
chunklet &lt;span class="nb"&gt;split&lt;/span&gt; &lt;span class="nt"&gt;--source&lt;/span&gt; my_file.txt &lt;span class="nt"&gt;--destination&lt;/span&gt; sentences.txt

&lt;span class="c"&gt;# Start the interactive visualizer&lt;/span&gt;
chunklet visualize

&lt;span class="c"&gt;# Code chunking&lt;/span&gt;
chunklet chunk &lt;span class="nt"&gt;--code&lt;/span&gt; &lt;span class="nt"&gt;--source&lt;/span&gt; my_script.py &lt;span class="nt"&gt;--max-functions&lt;/span&gt; 1

&lt;span class="c"&gt;# Batch processing a directory&lt;/span&gt;
chunklet chunk &lt;span class="nt"&gt;--doc&lt;/span&gt; &lt;span class="nt"&gt;--source&lt;/span&gt; ./my_docs &lt;span class="nt"&gt;--destination&lt;/span&gt; ./chunks &lt;span class="nt"&gt;--n-jobs&lt;/span&gt; 4

&lt;span class="c"&gt;# With error handling&lt;/span&gt;
chunklet chunk &lt;span class="nt"&gt;--doc&lt;/span&gt; &lt;span class="nt"&gt;--source&lt;/span&gt; ./my_docs &lt;span class="nt"&gt;--on-errors&lt;/span&gt; skip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How It Compares
&lt;/h2&gt;

&lt;p&gt;Other chunking libraries exist, and each has a different focus. Here's a quick look at how chunklet-py compares to some of the alternatives:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Library&lt;/th&gt;
&lt;th&gt;Key Differentiator&lt;/th&gt;
&lt;th&gt;Focus&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;chunklet-py&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;All-in-one, lightweight, multilingual, language-agnostic with specialized algorithms.&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Text, Code, Docs&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/langchain-ai/langchain" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Full LLM framework with basic splitters (e.g., RecursiveCharacterTextSplitter, Markdown, HTML, code splitters). Good for prototyping but basic for complex docs or multilingual needs.&lt;/td&gt;
&lt;td&gt;Full Stack&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/chonkie-inc/chonkie" rel="noopener noreferrer"&gt;Chonkie&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;All-in-one pipeline (chunking + embeddings + vector DB). Uses &lt;code&gt;tree-sitter&lt;/code&gt; for code. Multilingual.&lt;/td&gt;
&lt;td&gt;Pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/isaacus-dev/semchunk" rel="noopener noreferrer"&gt;Semchunk&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Text-only, fast semantic splitting. Built-in tiktoken/HuggingFace support. Reported to be 85% faster than alternatives.&lt;/td&gt;
&lt;td&gt;Text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/CintraAI/code-chunker" rel="noopener noreferrer"&gt;CintraAI Code Chunker&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Code-specific, uses &lt;code&gt;tree-sitter&lt;/code&gt;. Initially supports Python, JS, CSS only.&lt;/td&gt;
&lt;td&gt;Code&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Chunklet-py is a specialized, drop-in replacement for the chunking step in any RAG pipeline. It handles text, documents, and code without pulling in heavy dependencies.&lt;/p&gt;




&lt;h2&gt;
  
  
  🙌 Contributors &amp;amp; Thanks
&lt;/h2&gt;

&lt;p&gt;A huge thank you to the awesome people who helped shape Chunklet-py:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/jmbernabotto" rel="noopener noreferrer"&gt;@jmbernabotto&lt;/a&gt;, for help mostly with the CLI: suggested fixes, features, and design improvements.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/arnoldfranz" rel="noopener noreferrer"&gt;@arnoldfranz&lt;/a&gt;, for reporting the CLI Path Validation Bug (#6), which helped improve error handling.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  License
&lt;/h2&gt;

&lt;p&gt;Check out the &lt;a href="https://github.com/speedyk-005/chunklet-py/blob/main/LICENSE" rel="noopener noreferrer"&gt;LICENSE&lt;/a&gt; file for all the details.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap Up
&lt;/h2&gt;

&lt;p&gt;Chunklet-py is production-ready. It's lightweight, has no heavy dependencies, and the API is consistent — no more guessing which method name to use.&lt;/p&gt;

&lt;p&gt;Check it out: &lt;a href="https://github.com/speedyk-005/chunklet-py" rel="noopener noreferrer"&gt;github.com/speedyk-005/chunklet-py&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Questions? Drop them in the comments!&lt;/p&gt;

</description>
      <category>rag</category>
      <category>chunk</category>
      <category>llm</category>
      <category>python</category>
    </item>
    <item>
      <title>Chunklet-py (v2+): One Library to Split Them All - Sentence, Code, Docs</title>
      <dc:creator>Speedyk-005</dc:creator>
      <pubDate>Sat, 20 Dec 2025 18:24:05 +0000</pubDate>
      <link>https://dev.to/speed_k_7e1b449706e59e433/chunklet-py-one-library-to-split-them-all-sentence-code-docs-2eeg</link>
      <guid>https://dev.to/speed_k_7e1b449706e59e433/chunklet-py-one-library-to-split-them-all-sentence-code-docs-2eeg</guid>
      <description>&lt;p&gt;I've been working on &lt;strong&gt;Chunklet-py&lt;/strong&gt; - a powerful Python library for intelligent text and document chunking that's perfect for LLM/RAG applications. Here's why you might want to check it out:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠ This guide targets &lt;code&gt;chunklet-py v2.1.1&lt;/code&gt;.&lt;br&gt;&lt;br&gt;
APIs from &lt;code&gt;v2.2.0+&lt;/code&gt; are not included.&lt;br&gt;&lt;br&gt;
See the &lt;a href="https://speedyk-005.github.io/chunklet-py/latest/" rel="noopener noreferrer"&gt;latest docs&lt;/a&gt; for updates.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  🔧 &lt;strong&gt;What It Does&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Chunklet-py is your friendly neighborhood text splitter that takes all kinds of content and breaks it into smart, context-aware chunks. Instead of dumb character-count splitting, it gives you specialized tools for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sentence Splitter&lt;/strong&gt; - Multilingual text splitting (50+ languages!)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plain Text Chunker&lt;/strong&gt; - Basic text chunking with constraints&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document Chunker&lt;/strong&gt; - Processes PDFs, DOCX, EPUB, ODT, CSV, Excel, and more&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code Chunker&lt;/strong&gt; - Language-agnostic code splitting that preserves structure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chunk Visualizer&lt;/strong&gt; - Interactive web interface for real-time chunk exploration&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🚀 &lt;strong&gt;Key Features&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Blazingly Fast&lt;/strong&gt;: Parallel processing for large document batches&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Featherlight Footprint&lt;/strong&gt;: Lightweight and memory-efficient&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rich Metadata&lt;/strong&gt;: Context-aware metadata for advanced RAG applications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multilingual Mastery&lt;/strong&gt;: 50+ languages with intelligent detection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Triple Interface&lt;/strong&gt;: CLI, library, or web interface&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infinitely Customizable&lt;/strong&gt;: Pluggable token counters, custom splitters, processors&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  💻 &lt;strong&gt;Quick Example&lt;/strong&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;chunklet&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PlainTextChunker&lt;/span&gt;

&lt;span class="n"&gt;chunker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PlainTextChunker&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chunker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Your long text here...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_sentences&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Metadata: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  📊 &lt;strong&gt;Why It Matters&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Traditional text splitting often breaks meaning - mid-sentence cuts, lost context, language confusion. Chunklet-py keeps your content's structure and meaning intact, making it perfect for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Preparing data for LLMs&lt;/li&gt;
&lt;li&gt;Building RAG systems&lt;/li&gt;
&lt;li&gt;AI search applications&lt;/li&gt;
&lt;li&gt;Document processing pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🛠️ &lt;strong&gt;Installation&lt;/strong&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;chunklet-py

&lt;span class="c"&gt;# For full features:&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"chunklet-py[all]"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  📈 &lt;strong&gt;Community &amp;amp; Stats&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;50+ languages&lt;/strong&gt; supported&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10+ document formats&lt;/strong&gt; processed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MIT licensed&lt;/strong&gt; - free and open source&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Active development&lt;/strong&gt; with comprehensive testing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check out the &lt;a href="https://speedyk-005.github.io/chunklet-py/latest" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; and &lt;a href="https://github.com/speedyk-005/chunklet-py" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; for more details!&lt;/p&gt;

&lt;p&gt;What do you think? Have you worked on similar text processing challenges? Any questions about chunking strategies or the library?&lt;/p&gt;

&lt;h2&gt;
  
  
  🔗 Related Posts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;🚀 Latest version (v2.2.0+):&lt;br&gt;&lt;br&gt;
&lt;a href="https://dev.to/speed_k_7e1b449706e59e433/-introducing-chunklet-py-dj8"&gt;https://dev.to/speed_k_7e1b449706e59e433/-introducing-chunklet-py-dj8&lt;/a&gt;  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;🧠 Legacy version (v1, outdated):&lt;br&gt;&lt;br&gt;
&lt;a href="https://dev.to/speed_k_7e1b449706e59e433/stop-breaking-context-smarter-text-chunking-for-python-nlp-projects-2n8n"&gt;https://dev.to/speed_k_7e1b449706e59e433/stop-breaking-context-smarter-text-chunking-for-python-nlp-projects-2n8n&lt;/a&gt; &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>nlp</category>
      <category>chunker</category>
      <category>rag</category>
    </item>
    <item>
      <title>Stop Breaking Context: Smarter Text Chunking for Python NLP Projects</title>
      <dc:creator>Speedyk-005</dc:creator>
      <pubDate>Wed, 13 Aug 2025 21:59:51 +0000</pubDate>
      <link>https://dev.to/speed_k_7e1b449706e59e433/stop-breaking-context-smarter-text-chunking-for-python-nlp-projects-2n8n</link>
      <guid>https://dev.to/speed_k_7e1b449706e59e433/stop-breaking-context-smarter-text-chunking-for-python-nlp-projects-2n8n</guid>
      <description>&lt;h1&gt;
  
  
  &lt;strong&gt;Chunklet: Smarter Text Chunking for Python Developers&lt;/strong&gt;
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠ &lt;strong&gt;This post is outdated&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
This guide uses &lt;code&gt;chunklet v1.x&lt;/code&gt;, which is no longer maintained. See the Migration Guide: &lt;a href="https://speedyk-005.github.io/chunklet-py/latest/migration/" rel="noopener noreferrer"&gt;https://speedyk-005.github.io/chunklet-py/latest/migration/&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;👉 Use &lt;code&gt;chunklet-py v2.x&lt;/code&gt; instead:&lt;br&gt;&lt;br&gt;
&lt;a href="https://dev.to/speed_k_7e1b449706e59e433/chunklet-py-one-library-to-split-them-all-sentence-code-docs-2eeg"&gt;https://dev.to/speed_k_7e1b449706e59e433/chunklet-py-one-library-to-split-them-all-sentence-code-docs-2eeg&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;🚀 Latest version (v2.2.0+):&lt;br&gt;&lt;br&gt;
&lt;a href="https://dev.to/speed_k_7e1b449706e59e433/-introducing-chunklet-py-dj8"&gt;https://dev.to/speed_k_7e1b449706e59e433/-introducing-chunklet-py-dj8&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why Context Matters in Text Splitting&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When preprocessing documents for NLP tasks, standard splitting methods often:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Break sentences mid-thought (&lt;code&gt;"The patient showed improvement. However," → "However,"&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Ignore linguistic boundaries in non-English texts&lt;/li&gt;
&lt;li&gt;Lose critical context between chunks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Chunklet solves this with structural awareness.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Installation &amp;amp; Basic Usage&lt;/strong&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;chunklet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Minimal Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;chunklet&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Chunklet&lt;/span&gt;

&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;First sentence. Second sentence. Third sentence.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;chunker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Chunklet&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chunker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_sentences&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Output:
# ["First sentence. Second sentence.", "Third sentence."]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;This preserves complete sentences while respecting chunk size limits.&lt;/em&gt;&lt;/p&gt;
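&lt;p&gt;To make the idea concrete, here is a minimal sketch of sentence-based grouping in plain Python. It is not Chunklet's implementation: &lt;code&gt;split_sentences&lt;/code&gt; and &lt;code&gt;chunk_by_sentences&lt;/code&gt; are hypothetical helpers, and the regex splitter is deliberately naive (it would trip over abbreviations such as "e.g.").&lt;/p&gt;

```python
import re


def split_sentences(text):
    # Naive splitter (an assumption, not Chunklet's logic): a run of
    # non-terminator characters followed by any trailing terminators.
    parts = re.findall(r"[^.!?]+[.!?]*", text)
    return [p.strip() for p in parts if p.strip()]


def chunk_by_sentences(text, max_sentences=2):
    sentences = split_sentences(text)
    # Walk the sentence list in fixed-size groups so that no sentence
    # is ever cut in the middle.
    return [
        " ".join(sentences[i:i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]


chunks = chunk_by_sentences("First sentence. Second sentence. Third sentence.")
print(chunks)
```

&lt;p&gt;The key point is that the unit of splitting is the sentence, not the character count, so a chunk boundary can only fall between sentences.&lt;/p&gt;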

&lt;h2&gt;
  
  
  &lt;strong&gt;2. Key Features Explained&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Hybrid Chunking Mode&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Combines structural and size-based splitting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chunker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hybrid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_sentences&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Structural limit
&lt;/span&gt;    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# Size limit
&lt;/span&gt;    &lt;span class="n"&gt;overlap_percent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;  &lt;span class="c1"&gt;# Context preservation
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Why this matters:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prevents chunks from becoming too long or too short&lt;/li&gt;
&lt;li&gt;Overlap maintains relationships between sections&lt;/li&gt;
&lt;li&gt;Works equally well on code, markdown, or prose&lt;/li&gt;
&lt;/ul&gt;
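&lt;p&gt;How overlap carries context forward can be sketched with a sliding window over sentences. This is an illustration under stated assumptions, not the library's algorithm: the hypothetical &lt;code&gt;chunk_with_overlap&lt;/code&gt; measures overlap in whole sentences and rounds up, whereas Chunklet may compute it differently.&lt;/p&gt;

```python
import math


def chunk_with_overlap(sentences, max_sentences=3, overlap_percent=15):
    # Assumption: overlap is counted in whole sentences, rounded up.
    overlap = math.ceil(max_sentences * overlap_percent / 100)
    # Always advance by at least one sentence, even with a large overlap.
    step = max(1, max_sentences - overlap)
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(sentences[start:start + max_sentences])
        # Stop once the current window has reached the last sentence.
        if not sentences[start + max_sentences:]:
            break
    return chunks


sentences = ["S1.", "S2.", "S3.", "S4.", "S5."]
for chunk in chunk_with_overlap(sentences):
    print(chunk)
```

&lt;p&gt;With five sentences, a window of three, and 15% overlap, the last sentence of the first chunk is repeated at the start of the second, which is what lets a retriever see the link between neighbouring chunks.&lt;/p&gt;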

&lt;h3&gt;
  
  
  &lt;strong&gt;Multilingual Support&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Auto-detection (36+ languages)
&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chunker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;multilingual_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Manual override
&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chunker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;japanese_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ja&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;How it works:&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Uses &lt;code&gt;py3langid&lt;/code&gt; for fast language detection&lt;/li&gt;
&lt;li&gt;Applies language-specific sentence boundaries&lt;/li&gt;
&lt;li&gt;Falls back to regex for unsupported languages&lt;/li&gt;
&lt;/ol&gt;
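&lt;p&gt;The three steps above amount to a lookup with a fallback. Here is a rough sketch of that dispatch; every name in it (&lt;code&gt;SPLITTERS&lt;/code&gt;, &lt;code&gt;split_by_language&lt;/code&gt;, the per-language functions) is hypothetical, and the detector is passed in as a plain function so the example runs without any third-party package.&lt;/p&gt;

```python
import re


def _split(pattern, text):
    return [p.strip() for p in re.findall(pattern, text) if p.strip()]


def split_english(text):
    return _split(r"[^.!?]+[.!?]*", text)


def split_japanese(text):
    # Japanese sentences commonly end with full-width punctuation.
    return _split(r"[^。！？]+[。！？]*", text)


def split_fallback(text):
    # Generic regex covering common terminators from both sets above.
    return _split(r"[^.!?。！？]+[.!?。！？]*", text)


# Hypothetical registry of language-specific splitters.
SPLITTERS = {"en": split_english, "ja": split_japanese}


def split_by_language(text, detect, language=None):
    # An explicit language skips detection (the "manual override" case).
    lang = language or detect(text)
    # Unknown languages fall back to the generic regex splitter.
    splitter = SPLITTERS.get(lang, split_fallback)
    return splitter(text)


print(split_by_language("こんにちは。元気ですか？", detect=lambda t: "ja"))
print(split_by_language("Hola. Adios.", detect=lambda t: "es"))
```

&lt;p&gt;In a real pipeline the &lt;code&gt;detect&lt;/code&gt; argument would wrap an actual language identifier such as &lt;code&gt;py3langid&lt;/code&gt;; the lambdas here just stand in for it.&lt;/p&gt;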

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Real-World Use Cases&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Preparing Legal Documents&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;legal_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;contract.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read_text&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chunker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;legal_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hybrid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;overlap_percent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;  &lt;span class="c1"&gt;# Critical for clause relationships
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Why it works:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Preserves entire contract clauses&lt;/li&gt;
&lt;li&gt;Maintains references between sections (e.g., "as defined in Section 2.1")&lt;/li&gt;
&lt;li&gt;Handles complex punctuation in legal prose&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Processing Academic Papers&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;chunker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Chunklet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;sentence_splitter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;custom_academic_splitter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Handles citations
&lt;/span&gt;    &lt;span class="n"&gt;token_counter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;scibert_tokenizer&lt;/span&gt;  &lt;span class="c1"&gt;# Domain-specific counting
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Customization options:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Plug in any sentence splitter&lt;/li&gt;
&lt;li&gt;Use HuggingFace tokenizers&lt;/li&gt;
&lt;li&gt;Adjust chunking thresholds per document type&lt;/li&gt;
&lt;/ul&gt;
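&lt;p&gt;The snippet above leaves &lt;code&gt;custom_academic_splitter&lt;/code&gt; undefined, so here is one way such plug-ins might look. This is a sketch under assumptions: it treats the splitter as a function from a string to a list of sentences and the token counter as a function from a string to an integer, which may not match the exact signatures Chunklet expects. &lt;code&gt;ABBREVIATIONS&lt;/code&gt; and &lt;code&gt;whitespace_token_counter&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```python
import re

# Abbreviations whose periods should not end a sentence (illustrative list).
ABBREVIATIONS = ("et al.", "e.g.", "i.e.", "Fig.", "Eq.")


def custom_academic_splitter(text):
    # Temporarily hide the periods inside known abbreviations, split on
    # the remaining terminators, then restore the hidden periods.
    for abbr in ABBREVIATIONS:
        text = text.replace(abbr, abbr.replace(".", "\x00"))
    parts = re.findall(r"[^.!?]+[.!?]*", text)
    return [p.replace("\x00", ".").strip() for p in parts if p.strip()]


def whitespace_token_counter(text):
    # Crude stand-in for a model tokenizer: counts whitespace-separated words.
    return len(text.split())


sample = "Smith et al. (2020) found X. See Fig. 2 for details."
print(custom_academic_splitter(sample))
print(whitespace_token_counter(sample))
```

&lt;p&gt;A domain tokenizer would replace the word count with the length of its encoded output; the placeholder approach for abbreviations is simple, but it only protects the patterns you list.&lt;/p&gt;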

&lt;h2&gt;
  
  
  &lt;strong&gt;4. Performance Considerations&lt;/strong&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# For large datasets:
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chunker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;batch_chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# Parallel processing
&lt;/span&gt;    &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;     &lt;span class="c1"&gt;# Documents per batch
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Optimization tips:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enable &lt;code&gt;use_cache=True&lt;/code&gt; for repeated texts&lt;/li&gt;
&lt;li&gt;Pre-filter very short/long documents&lt;/li&gt;
&lt;li&gt;Monitor memory with &lt;code&gt;memory_profiler&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
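&lt;p&gt;To show why caching and parallelism help, here is a standalone sketch that uses only the standard library. It does not reproduce Chunklet's &lt;code&gt;batch_chunk&lt;/code&gt;: &lt;code&gt;chunk_text&lt;/code&gt; is a placeholder that splits on blank lines, and both function names are hypothetical.&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


@lru_cache(maxsize=1024)
def chunk_text(text):
    # Placeholder chunker: split on blank lines. The cache means an
    # identical document seen again later is served from memory.
    return tuple(p.strip() for p in text.split("\n\n") if p.strip())


def batch_chunk(documents, n_jobs=4):
    # Executor.map preserves input order, so results line up with documents.
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        return list(pool.map(chunk_text, documents))


docs = ["Intro.\n\nBody.", "Single paragraph.", "Intro.\n\nBody."]
for result in batch_chunk(docs):
    print(result)
```

&lt;p&gt;Threads keep the example simple, but for CPU-bound chunking in CPython a process pool is usually the better choice, since threads share one interpreter lock.&lt;/p&gt;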




&lt;p&gt;&lt;strong&gt;Ready to try?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://github.com/Speedyk-005/chunklet" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt; | &lt;a href="https://pypi.org/project/chunklet/" rel="noopener noreferrer"&gt;PyPI Package&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>nlp</category>
      <category>opensource</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
