<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: JohnCLlStokes</title>
    <description>The latest articles on DEV Community by JohnCLlStokes (@johnstokes228).</description>
    <link>https://dev.to/johnstokes228</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3918392%2Ff215e8a5-8b22-43f9-9afd-114fcd2d0a13.png</url>
      <title>DEV Community: JohnCLlStokes</title>
      <link>https://dev.to/johnstokes228</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/johnstokes228"/>
    <language>en</language>
    <item>
      <title>Mathematically Optimal Chunking Strategy</title>
      <dc:creator>JohnCLlStokes</dc:creator>
      <pubDate>Thu, 07 May 2026 17:11:39 +0000</pubDate>
      <link>https://dev.to/johnstokes228/mathematically-optimal-chunking-strategy-6f5</link>
      <guid>https://dev.to/johnstokes228/mathematically-optimal-chunking-strategy-6f5</guid>
      <description>&lt;p&gt;&lt;em&gt;In this blog I will introduce the core ideas behind the &lt;a href="https://github.com/cashewe/darn" rel="noopener noreferrer"&gt;darn&lt;/a&gt; package - designed to avoid degraded trust in RAG systems caused by lost context in chunks&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There are &lt;a href="https://www.pinecone.io/learn/chunking-strategies/" rel="noopener noreferrer"&gt;many&lt;/a&gt; documented chunking strategies for &lt;a href="https://en.wikipedia.org/wiki/Retrieval-augmented_generation" rel="noopener noreferrer"&gt;RAG&lt;/a&gt; systems readily available online. These can range from incredibly simple character (or token) limits, to rules-based splitting strategies, or even LLM backed semantic chunking methods. From my experience however, none of these methods provide the production-worthy ‘one-size-fits-all’ approach that they claim to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Limit-based strategies are not context-aware enough to work in documents containing anything more than short paragraphs of plain text, often leading to contextless half-sentences, or the bottom of tables being lost to orphaned chunks.&lt;/li&gt;
&lt;li&gt;‘Simple’ rules-based strategies inevitably grow in complexity as more edge cases are found until the point that the code becomes completely unmanageable.&lt;/li&gt;
&lt;li&gt;LLM-based ‘semantic’ strategies are costly, inconsistent between documents and can produce drastically different outcomes between runs (and for any production use case, trust me, you will repeat runs).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of this leads to inaccurate recall, incomplete answers and a debugging nightmare for maintainers, which can see promising products rapidly lose the trust of their users. Indeed, for a solution to be truly business-ready, I’d posit it should aim to achieve the following goals:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;be &lt;strong&gt;deterministic&lt;/strong&gt; - index rebuilds are unavoidable, and you don’t want to introduce unexpected changes when they happen.&lt;/li&gt;
&lt;li&gt;be &lt;strong&gt;extensible&lt;/strong&gt; - real documents are never as uniform as you’d expect and your code will quickly become unmaintainable if you don’t plan to adapt to this truth.&lt;/li&gt;
&lt;li&gt;be &lt;strong&gt;structurally-aware&lt;/strong&gt; - respect the format of your documents, and context awareness should reliably follow, as most people structure their prose semantically.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By considering these goals up-front, we can avoid a ball-of-mud solution that causes fresh headaches for each new edge case raised. After some thinking on possible shapes for such a solution (and admittedly, at least partially, to give myself a chance to play with &lt;a href="https://rust-lang.org/" rel="noopener noreferrer"&gt;rust&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Graph_theory" rel="noopener noreferrer"&gt;graphs&lt;/a&gt;), I developed &lt;code&gt;darn&lt;/code&gt; - a tool that aims to use the context inherent in a document’s &lt;em&gt;structure&lt;/em&gt;, and an &lt;em&gt;extensible&lt;/em&gt; list of weighted user preferences, to &lt;em&gt;deterministically&lt;/em&gt; settle on its chunk boundaries.&lt;/p&gt;

&lt;h2&gt;The logic behind darn&lt;/h2&gt;

&lt;p&gt;At its core, the idea behind the tool is to invert the traditional chunking logic such that rather than blindly demanding:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fibafz1fax1poqtm8jlwb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fibafz1fax1poqtm8jlwb.jpg" alt="split here!" width="679" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;we instead pose the question:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fep15fr1jcnizp46fjiw7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fep15fr1jcnizp46fjiw7.jpg" alt="where shall i split?" width="696" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Though this distinction may seem arbitrary, asking rather than acting is a better reflection of the confidence we have in our knowledge of the underlying collection of documents (be real, you aren’t intimately familiar with all 3,000 .docx files in that directory, are you?). By accepting that we will be searching not for the ‘correct’ place to split, but rather the ‘least bad’, we can form an architecture that is ‘closer’ to the ambiguous reality of the problem. This will ensure our solution remains flexible to the unique idiosyncrasies in the documents.&lt;/p&gt;

&lt;p&gt;In order to locate these ‘least bad’ splitting points, we need to assign a scalar value to how ‘bad’ each choice would be, and then use a mathematical optimisation algorithm to pick the best set of choices for us. This can be done by applying ‘rules’, with associated ‘punishments’ for breaking them, to ‘structures’ in our text, for instance:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;if you split a chunk mid-sentence, the cost is 50.&lt;/p&gt;

&lt;p&gt;if you split a chunk mid-word, the cost is 100.&lt;/p&gt;

&lt;p&gt;etc ...&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In &lt;code&gt;darn&lt;/code&gt;, we apply these punishments to the structures found in &lt;a href="https://www.markdownguide.org/basic-syntax/" rel="noopener noreferrer"&gt;markdown text&lt;/a&gt; (e.g. &lt;code&gt;Heading&lt;/code&gt;, &lt;code&gt;List&lt;/code&gt;, &lt;code&gt;ListElement&lt;/code&gt;, etc…), with the option to extend to additional custom text types. Additional types so far implemented include &lt;code&gt;Sentence&lt;/code&gt; and &lt;code&gt;Word&lt;/code&gt; (which are both crucial for chunking but unserved by traditional markdown representations). These punishments can stack based on the number of rules you would violate by splitting in a given location, i.e. a mid-sentence word would cost 150 to split according to the above rules.&lt;/p&gt;
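&lt;p&gt;As a minimal illustration of the stacking (the rule names and costs here mirror the blockquote above, and are invented for the example rather than being &lt;code&gt;darn&lt;/code&gt;’s actual internals), the cost of a candidate split is just the sum of every rule it would violate:&lt;/p&gt;

```python
# Illustrative only: rule names and costs mirror the example rules above,
# not darn's actual internals.
RULES = {
    "mid_sentence": 50,   # splitting a chunk mid-sentence costs 50
    "mid_word": 100,      # splitting a chunk mid-word costs 100
}

def split_cost(violated: set[str]) -> int:
    """Sum the punishments for every rule a candidate split would break."""
    return sum(RULES[rule] for rule in violated)

# Punishments stack: a split inside a word is also inside a sentence,
# so it violates both rules at once.
print(split_cost({"mid_sentence", "mid_word"}))  # 150
print(split_cost({"mid_sentence"}))              # 50
print(split_cost(set()))                         # 0 (a 'free' boundary)
```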

&lt;p&gt;By taking the sum of the punishments at each candidate splitting location, we can quantify how ‘bad’ a decision it would be to split there. This problem can be collapsed into a form of the common “bounded &lt;a href="https://en.wikipedia.org/wiki/Shortest_path_problem" rel="noopener noreferrer"&gt;shortest path&lt;/a&gt;” problem in which we try to find the ‘shortest’ (for us, ‘least-punished’) route from A to Z in a given number of steps or less, which has a canonical mathematical solution, displayed below: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1eky35ssqz0fwl7ypdsm.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1eky35ssqz0fwl7ypdsm.jpg" alt="DAG" width="800" height="615"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To illustrate the means of finding the ‘least bad’ solution, imagine a world where the characters in your text are spaces on a monopoly board, the cost of landing on each space represents the summed punishment we mentioned prior, and we roll a dice to determine how far we can travel each turn.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8xfnvg12t338fprrzc2n.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8xfnvg12t338fprrzc2n.jpg" alt="monopoly" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Given the dice only has 6 sides, what are the best possible spaces to land on to minimise the total cost? Whilst it may seem intuitive to always aim for the cheapest space within range (this is a so-called ‘&lt;a href="https://en.wikipedia.org/wiki/Greedy_algorithm" rel="noopener noreferrer"&gt;greedy&lt;/a&gt;’ approach), in practice taking what seems optimal in a given roll may well end up incurring a higher total cost in future!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwlkbti3jcw1h4044pi3v.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwlkbti3jcw1h4044pi3v.jpg" alt="the greedy trap" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In order to avoid this scenario, we must encode into our decision-making process some understanding of the future impacts of our present actions. Mathematically, this can be done as simply as starting at the end and working backwards, calculating the minimum possible cost to exit the board from each space in reverse.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fszouauwivd82o029qj7m.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fszouauwivd82o029qj7m.jpg" alt="dynamic programming" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By taking this reversed approach to decision making, we are able to learn what the future impacts of our current choices will be, as they are encoded as costs into our rolling window of ‘reachable’ spaces. From here, the problem solves itself - we simply select the space in range of our dice with the lowest ‘minimum cost to exit’ and follow the path we used to arrive at that cost back up the board (in computer science, this is known as &lt;a href="https://www.geeksforgeeks.org/dsa/dynamic-programming/" rel="noopener noreferrer"&gt;dynamic programming&lt;/a&gt;). Performing this same process on our text, we select the set of chunk boundaries (the ‘path’) that collectively form the ‘least bad’ solution - no one chunk is optimised in a vacuum.&lt;/p&gt;
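&lt;p&gt;The backward pass above can be sketched in a few lines of python. To be clear, this is a toy illustration of the bounded shortest path idea, not &lt;code&gt;darn&lt;/code&gt;’s actual rust implementation; the function name and cost list are invented for the example:&lt;/p&gt;

```python
def best_splits(costs: list[float], max_chunk: int) -> list[int]:
    """Pick split positions minimising total cost, with chunks of at
    most max_chunk positions. costs[j] is the punishment for placing a
    boundary at position j; positions run 0..n, with n the end of the
    text (costs[n] should be 0, since finishing is always free)."""
    n = len(costs) - 1
    INF = float("inf")
    min_cost = [INF] * (n + 1)  # cheapest 'cost to exit' from each space
    next_pos = [0] * (n + 1)    # the move that achieves that cost
    min_cost[n] = 0.0           # the end of the board costs nothing
    for i in range(n - 1, -1, -1):                        # backward pass
        for j in range(i + 1, min(i + max_chunk, n) + 1):  # reachable spaces
            candidate = costs[j] + min_cost[j]
            if min_cost[i] > candidate:
                min_cost[i], next_pos[i] = candidate, j
    # Follow the recorded moves forward to recover the boundaries.
    splits, i = [], 0
    while i != n:
        i = next_pos[i]
        splits.append(i)
    return splits

# Toy board: positions 2 and 5 are cheap, and a 'roll' covers at most 3.
print(best_splits([0, 100, 1, 100, 100, 1, 100, 0], max_chunk=3))  # [2, 5, 7]
```

&lt;p&gt;Because every position’s ‘minimum cost to exit’ is computed before any boundary is chosen, a cheap-looking space is only selected if it also leads somewhere cheap - exactly the insurance against the greedy trap described above.&lt;/p&gt;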

&lt;h2&gt;&lt;code&gt;darn&lt;/code&gt; in action&lt;/h2&gt;

&lt;p&gt;To solidify the value of this approach in a text chunking context, let’s take a look at a simple &lt;a href="https://jaspervdj.be/lorem-markdownum/" rel="noopener noreferrer"&gt;lorem ipsum&lt;/a&gt; example, using the following ‘document’:&lt;/p&gt;

&lt;blockquote&gt;
&lt;h1&gt;Virum si memores territus&lt;/h1&gt;
&lt;h2&gt;Vetat excipit Mnemonidas&lt;/h2&gt;

&lt;p&gt;Lorem markdownum nullo saliunt, subibis. It fortis crede caedis tenet ille in ovis fuerunt cycno crura faciunt: illic. Matri manat, harenam et dabat in Amoris tamen aderisque Deucalioneas foret arte stabula iugis tot pinus mora.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sit oculis si opus mensas&lt;/li&gt;
&lt;li&gt;Sua negat annis repulsa&lt;/li&gt;
&lt;li&gt;Opus dea rite felle inquit duritia ambiguo&lt;/li&gt;
&lt;li&gt;Humana quaque hunc alii censet procorum&lt;/li&gt;
&lt;li&gt;Antra auro&lt;/li&gt;
&lt;li&gt;Calydonis meas lusibus&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;We’ll set the maximum chunk size here to an arbitrary character count of 200 to keep the example behaviour obvious. The code we’ll run is as follows (I have added comments to the unused settings in case they’re of interest):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from darn_it import Chunker

chunker = Chunker(
    # rules=[]  # use default rules here, rather than defining custom punishments
)

with open("file.md", "r") as f:
    text = f.read()

chunker.get_chunks(
    text=text,
    chunk_size=200,
    # granularity="characters"  # can be set to "tokens" if using the 'model' param
    # model="gpt-4o-mini"  # uses tiktoken to set the tokeniser per model
    # overlap=0  # can be set to allow overlap between chunks
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;darn&lt;/code&gt;’s output looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[
    Chunk(
        start=0,
        end=177,
        content=# Virum si memores territus
            ## Vetat excipit Mnemonidas
            Lorem markdownum nullo saliunt, subibis. It fortis crede caedis tenet ille in ovis fuerunt cycno crura faciunt: illic.
    ), Chunk(
        start=177,
        end=287,
        content= Matri manat, harenam et dabat in Amoris tamen aderisque Deucalioneas foret arte stabula iugis tot pinus mora.
    ), Chunk(
        start=287,
        end=468,
        content=- Sit oculis si opus mensas
            - Sua negat annis repulsa
            - Opus dea rite felle inquit duritia ambiguo
            - Humana quaque hunc alii censet procorum
            - Antra auro
            - Calydonis meas lusibus
    )
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see here that &lt;code&gt;darn&lt;/code&gt; has obeyed sentence structures &lt;em&gt;and&lt;/em&gt; maintained the full context of the list - choosing to create a smaller penultimate chunk to achieve this. This is due to the selected (default) punishments discouraging splitting lists and sentences. Though the default rules also aim to discourage splitting paragraphs, the character limit we have set made this impossible and so the optimal splits were selected to protect the integrity of the list.&lt;/p&gt;

&lt;p&gt;Compare this to a naive limit-based strategy, which for the same character limit would have its final chunk begin partway into the fourth bullet point. Imagine the loss of trust if a user asking your RAG system to summarise a contract clause got back only half a bullet point! Equally, compare this to semantic methods, which would likely have arrived at a similar split - as the text’s structure already reflects its semantic cohesion - but may not have done so reliably.&lt;/p&gt;
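&lt;p&gt;For contrast, the naive limit-based splitter really is a one-liner, and it is easy to see why it shears through bullet points: it has no notion of structure at all (a generic sketch, not any particular library):&lt;/p&gt;

```python
def naive_chunks(text: str, size: int) -> list[str]:
    """Split text into fixed-size chunks with zero structural awareness."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# Boundaries land wherever the character count dictates - mid-word,
# mid-sentence, or halfway down a bullet list.
print(naive_chunks("- Sit oculis si opus mensas - Sua negat annis repulsa", 30))
```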

&lt;p&gt;Referring back to our original criteria, we can see even from this simple example that &lt;code&gt;darn&lt;/code&gt; is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;deterministic&lt;/strong&gt; - by using a mathematically optimal strategy, we guarantee the same output for each run, provided no parameters / input text are changed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;extensible&lt;/strong&gt; - new behaviour is as simple as adding a new rule or structure type.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;structurally-aware&lt;/strong&gt; - it is literally built on the structure of the markdown text.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And as a bonus, the tool is written in rust-backed python - so it runs super-fast and can be slotted into most production workloads easily.&lt;/p&gt;

&lt;h2&gt;When not to use &lt;code&gt;darn&lt;/code&gt;&lt;/h2&gt;

&lt;p&gt;There are still some obvious scenarios in which &lt;code&gt;darn&lt;/code&gt; won’t be the most appropriate tool, and these are worth being aware of, notably:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;if you have a ‘small’ collection of documents that can be artisanally split into bespoke chunks, this will likely outperform the mathematical optima (though this isn’t a sustainable strategy for a production workload).&lt;/li&gt;
&lt;li&gt;if you don’t have any kind of upper bound on how big your chunks can be, the algorithm becomes poorly defined and its solution will lose most of its decision-making value.&lt;/li&gt;
&lt;li&gt;if your data can’t be meaningfully converted to markdown (e.g. if it’s entirely made up of computer code) then &lt;code&gt;darn&lt;/code&gt; will split based on structures that may not reflect the meaning of the underlying document.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first two of these are philosophically different from the scenario &lt;code&gt;darn&lt;/code&gt; is designed for, so they will likely never be something I look to support. The last, however, would actually be a fairly simple feature request: since the algorithm is decoupled from the ‘structures’ on which it acts, I could feasibly update &lt;code&gt;darn&lt;/code&gt; to work with alternate text structures too if there were meaningful demand.&lt;/p&gt;

&lt;h2&gt;Wrap up&lt;/h2&gt;

&lt;p&gt;On the surface, I hope I’ve been able to demonstrate a viable ‘default’ tool for chunking here - and perhaps next time you need to split text, you will reach for &lt;code&gt;darn&lt;/code&gt; in the first instance! Going a little deeper however, &lt;code&gt;darn&lt;/code&gt;’s core innovation (a refusal to write strict rules for situations I don’t fully understand) reflects a design pattern that is relatively new to me as an engineer, and one I’m keen to mine further in future works. Not all problems have a universal solution, and when they don’t, let your architecture reflect ambiguity, not feigned certainty.&lt;/p&gt;

&lt;p&gt;cheers,&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Johnno&lt;/em&gt;&lt;/p&gt;

</description>
      <category>rag</category>
      <category>nlp</category>
      <category>algorithms</category>
    </item>
  </channel>
</rss>
