<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vivek</title>
    <description>The latest articles on DEV Community by Vivek (@kasturivivek).</description>
    <link>https://dev.to/kasturivivek</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1166521%2F86c9646c-4c66-48b8-908a-681e87a24a5b.png</url>
      <title>DEV Community: Vivek</title>
      <link>https://dev.to/kasturivivek</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kasturivivek"/>
    <language>en</language>
    <item>
      <title>Chunking Strategies for LLM Applications: A Practical Guide to Better RAG Systems</title>
      <dc:creator>Vivek</dc:creator>
      <pubDate>Sun, 24 May 2026 14:56:09 +0000</pubDate>
      <link>https://dev.to/kasturivivek/chunking-strategies-for-llm-applications-a-practical-guide-to-better-rag-systems-30ck</link>
      <guid>https://dev.to/kasturivivek/chunking-strategies-for-llm-applications-a-practical-guide-to-better-rag-systems-30ck</guid>
      <description>&lt;p&gt;&lt;em&gt;Learn how chunking impacts retrieval quality, embedding performance, and the overall effectiveness of Retrieval-Augmented Generation (RAG) systems.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;When building AI applications using &lt;strong&gt;Retrieval-Augmented Generation (RAG)&lt;/strong&gt;, developers often focus on selecting the best LLM or embedding model. But one foundational step is frequently underestimated &lt;strong&gt;chunking&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chunking&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Chunking is the process of breaking large documents into smaller, manageable pieces before generating embeddings and storing them in a vector database.&lt;/p&gt;

&lt;p&gt;Poor chunking can lead to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Irrelevant retrieval results&lt;/li&gt;
&lt;li&gt;Hallucinated answers&lt;/li&gt;
&lt;li&gt;Missing context&lt;/li&gt;
&lt;li&gt;Higher inference costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Good chunking, on the other hand, dramatically improves retrieval precision and response quality.&lt;/p&gt;

&lt;p&gt;In this article, we'll explore the most common &lt;strong&gt;chunking strategies&lt;/strong&gt;, their trade-offs, and when to use each.&lt;/p&gt;




&lt;h1&gt;
  
  
  Why Chunking Matters
&lt;/h1&gt;

&lt;p&gt;LLMs and embedding models cannot process infinitely large documents efficiently.&lt;/p&gt;

&lt;p&gt;Consider a 200-page PDF.&lt;/p&gt;

&lt;p&gt;Instead of embedding the entire file as one vector, we split it into smaller chunks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Large Document
      ↓
 Chunking
      ↓
Embeddings
      ↓
Vector Database
      ↓
Semantic Retrieval
      ↓
LLM Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Without Chunking
&lt;/h3&gt;

&lt;p&gt;A single massive embedding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;loses semantic granularity&lt;/li&gt;
&lt;li&gt;retrieves irrelevant sections&lt;/li&gt;
&lt;li&gt;increases token cost&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  With Chunking
&lt;/h3&gt;

&lt;p&gt;Relevant document sections become searchable and retrievable.&lt;/p&gt;




&lt;h1&gt;
  
  
  Understanding the Chunking Trade-Off
&lt;/h1&gt;

&lt;p&gt;Chunk size affects retrieval quality.&lt;/p&gt;

&lt;p&gt;Too small:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Missing context
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Too large:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Noise + irrelevant information
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The ideal chunk balances:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;semantic meaning&lt;/li&gt;
&lt;li&gt;retrieval precision&lt;/li&gt;
&lt;li&gt;token efficiency&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  1. Fixed-Size Chunking
&lt;/h1&gt;

&lt;p&gt;The simplest and most widely used approach.&lt;/p&gt;

&lt;p&gt;Documents are split based on a fixed character or token limit.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;500 tokens&lt;/li&gt;
&lt;li&gt;1000 characters&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Document
──────────────────────────
Chunk 1 (500 tokens)
Chunk 2 (500 tokens)
Chunk 3 (500 tokens)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Python Example
&lt;/h2&gt;

&lt;p&gt;Using LangChain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.text_splitter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CharacterTextSplitter&lt;/span&gt;

&lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Pros
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Easy to implement&lt;/li&gt;
&lt;li&gt;Fast processing&lt;/li&gt;
&lt;li&gt;Predictable chunk sizes&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cons
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Ignores document structure&lt;/li&gt;
&lt;li&gt;May cut sentences mid-way&lt;/li&gt;
&lt;li&gt;Can reduce semantic meaning&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Best For
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;quick prototypes&lt;/li&gt;
&lt;li&gt;small datasets&lt;/li&gt;
&lt;li&gt;simple RAG systems&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  2. Recursive Chunking
&lt;/h1&gt;

&lt;p&gt;A smarter version of fixed-size chunking.&lt;/p&gt;

&lt;p&gt;Instead of splitting blindly, it attempts to preserve structure.&lt;/p&gt;

&lt;p&gt;Typical hierarchy:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Paragraph&lt;/li&gt;
&lt;li&gt;Sentence&lt;/li&gt;
&lt;li&gt;Word&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Only if a larger section exceeds size limits does it split further.&lt;/p&gt;




&lt;h2&gt;
  
  
  Workflow
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Paragraph too large?
        ↓
Split into sentences
        ↓
Sentence too large?
        ↓
Split into words
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;p&gt;LangChain Recursive Splitter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.text_splitter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;

&lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Pros
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Preserves meaning&lt;/li&gt;
&lt;li&gt;Better retrieval quality&lt;/li&gt;
&lt;li&gt;Handles mixed documents&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cons
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Slightly slower&lt;/li&gt;
&lt;li&gt;May still ignore domain-specific structure&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Best For
&lt;/h2&gt;

&lt;p&gt;Most RAG systems.&lt;/p&gt;

&lt;p&gt;This is often the &lt;strong&gt;default recommendation&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  3. Sentence-Based Chunking
&lt;/h1&gt;

&lt;p&gt;This strategy keeps chunks aligned with sentence boundaries.&lt;/p&gt;

&lt;p&gt;Instead of arbitrary token counts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Chunk = Complete Sentences
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;p&gt;Document:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AI systems rely on retrieval.
Chunking improves retrieval quality.
Poor chunking hurts accuracy.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Possible chunks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Chunk 1:
AI systems rely on retrieval.

Chunk 2:
Chunking improves retrieval quality.

Chunk 3:
Poor chunking hurts accuracy.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Python Example
&lt;/h2&gt;

&lt;p&gt;Using NLTK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;nltk&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nltk.tokenize&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sent_tokenize&lt;/span&gt;

&lt;span class="n"&gt;sentences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sent_tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Pros
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Natural language boundaries&lt;/li&gt;
&lt;li&gt;Cleaner embeddings&lt;/li&gt;
&lt;li&gt;Improved semantic integrity&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cons
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Uneven chunk sizes&lt;/li&gt;
&lt;li&gt;Large sentences may exceed limits&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Best For
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;conversational data&lt;/li&gt;
&lt;li&gt;articles&lt;/li&gt;
&lt;li&gt;QA systems&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  4. Paragraph-Based Chunking
&lt;/h1&gt;

&lt;p&gt;Paragraphs usually contain a coherent idea.&lt;/p&gt;

&lt;p&gt;This makes them useful chunk boundaries.&lt;/p&gt;




&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Paragraph 1 → Chunk 1
Paragraph 2 → Chunk 2
Paragraph 3 → Chunk 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Pros
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;High semantic coherence&lt;/li&gt;
&lt;li&gt;Human-readable chunks&lt;/li&gt;
&lt;li&gt;Works well for blogs and docs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cons
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Paragraph length varies&lt;/li&gt;
&lt;li&gt;Large paragraphs can overflow&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Best For
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;blogs&lt;/li&gt;
&lt;li&gt;documentation&lt;/li&gt;
&lt;li&gt;research papers&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  5. Overlapping Chunking
&lt;/h1&gt;

&lt;p&gt;One major issue with chunking:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;context loss at boundaries.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;Chunk 1:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The API authentication uses JWT...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Chunk 2:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...tokens for secure communication.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Important meaning spans both chunks.&lt;/p&gt;

&lt;p&gt;Overlap solves this.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Overlap Works
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Chunk 1
──────────────
AAAA BBBB CCCC

Chunk 2
          CCCC DDDD EEEE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CCCC&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;appears in both chunks.&lt;/p&gt;




&lt;h2&gt;
  
  
  Code Example
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Pros
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Better retrieval continuity&lt;/li&gt;
&lt;li&gt;Reduces boundary problems&lt;/li&gt;
&lt;li&gt;Higher answer accuracy&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cons
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;More embeddings&lt;/li&gt;
&lt;li&gt;Larger vector storage&lt;/li&gt;
&lt;li&gt;Increased retrieval cost&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Best For
&lt;/h2&gt;

&lt;p&gt;Nearly all production RAG systems.&lt;/p&gt;

&lt;p&gt;Typical overlap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10–20%&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  6. Semantic Chunking
&lt;/h1&gt;

&lt;p&gt;Semantic chunking uses meaning instead of size.&lt;/p&gt;

&lt;p&gt;The document is split where topic changes occur.&lt;/p&gt;

&lt;p&gt;This is significantly more intelligent.&lt;/p&gt;




&lt;h2&gt;
  
  
  Concept
&lt;/h2&gt;

&lt;p&gt;Instead of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Every 500 tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;we split by:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Meaning shift
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;p&gt;Document:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Section A → Databases
Section B → Kubernetes
Section C → Security
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Semantic chunking creates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Chunk 1 → Database topic
Chunk 2 → Kubernetes topic
Chunk 3 → Security topic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  High-Level Pipeline
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Text
 ↓
Sentence embeddings
 ↓
Similarity comparison
 ↓
Topic boundary detection
 ↓
Chunks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Python Example (Conceptual)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics.pairwise&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cosine_similarity&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sentence similarity determines where to split.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pros
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Excellent retrieval quality&lt;/li&gt;
&lt;li&gt;Topic-aware&lt;/li&gt;
&lt;li&gt;Strong contextual relevance&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cons
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Computationally expensive&lt;/li&gt;
&lt;li&gt;More implementation effort&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Best For
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;enterprise search&lt;/li&gt;
&lt;li&gt;legal documents&lt;/li&gt;
&lt;li&gt;knowledge bases&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  7. Structure-Aware Chunking
&lt;/h1&gt;

&lt;p&gt;Some documents already contain structure.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HTML headings&lt;/li&gt;
&lt;li&gt;Markdown sections&lt;/li&gt;
&lt;li&gt;PDFs with titles&lt;/li&gt;
&lt;li&gt;Code files&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of ignoring this, we use it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;p&gt;Markdown:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Authentication&lt;/span&gt;
JWT details...

&lt;span class="gh"&gt;# Rate Limiting&lt;/span&gt;
API throttling...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Chunks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Authentication section
Rate Limiting section
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Code Example
&lt;/h2&gt;

&lt;p&gt;Markdown Header Splitter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.text_splitter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MarkdownHeaderTextSplitter&lt;/span&gt;

&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Header1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;##&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Header2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Pros
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;High semantic consistency&lt;/li&gt;
&lt;li&gt;Uses author intent&lt;/li&gt;
&lt;li&gt;Excellent for documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cons
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Depends on clean formatting&lt;/li&gt;
&lt;li&gt;Less effective on raw text&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Best For
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;developer docs&lt;/li&gt;
&lt;li&gt;wikis&lt;/li&gt;
&lt;li&gt;technical manuals&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  8. Code Chunking
&lt;/h1&gt;

&lt;p&gt;Source code needs special handling.&lt;/p&gt;

&lt;p&gt;Splitting every 500 characters can break logic.&lt;/p&gt;

&lt;p&gt;Instead:&lt;/p&gt;

&lt;p&gt;Split by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;function&lt;/li&gt;
&lt;li&gt;class&lt;/li&gt;
&lt;li&gt;module&lt;/li&gt;
&lt;li&gt;AST nodes&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Bad Chunk
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;login&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;cut halfway.&lt;/p&gt;




&lt;h2&gt;
  
  
  Better Chunk
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Entire&lt;/span&gt; &lt;span class="nf"&gt;login&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;function&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Example Using Tree-sitter
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tree_sitter&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AST-based parsing preserves syntax.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pros
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Maintains logical structure&lt;/li&gt;
&lt;li&gt;Better code retrieval&lt;/li&gt;
&lt;li&gt;Strong for AI coding assistants&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cons
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Language-specific tooling&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Best For
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;code copilots&lt;/li&gt;
&lt;li&gt;repository search&lt;/li&gt;
&lt;li&gt;software documentation&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Comparing Chunking Strategies
&lt;/h1&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Quality&lt;/th&gt;
&lt;th&gt;Complexity&lt;/th&gt;
&lt;th&gt;Best Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fixed Size&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Prototypes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recursive&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;General RAG&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sentence&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;QA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Paragraph&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Articles&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Overlap&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Production RAG&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic&lt;/td&gt;
&lt;td&gt;Very High&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Enterprise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Structure-Aware&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Docs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code Chunking&lt;/td&gt;
&lt;td&gt;Very High&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Code AI&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h1&gt;
  
  
  A Practical Chunking Strategy
&lt;/h1&gt;

&lt;p&gt;Many successful RAG systems use a &lt;strong&gt;hybrid approach&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Structure-aware
        +
Recursive splitting
        +
10–20% overlap
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Document
   ↓
Heading Split
   ↓
Recursive Chunking
   ↓
Overlap
   ↓
Embeddings
   ↓
Vector DB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This usually offers the best balance between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;relevance&lt;/li&gt;
&lt;li&gt;cost&lt;/li&gt;
&lt;li&gt;simplicity&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Final Thoughts
&lt;/h1&gt;

&lt;p&gt;Chunking is not just preprocessing.&lt;/p&gt;

&lt;p&gt;It directly influences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;retrieval precision&lt;/li&gt;
&lt;li&gt;embedding quality&lt;/li&gt;
&lt;li&gt;hallucination rate&lt;/li&gt;
&lt;li&gt;user experience&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is no universal best strategy.&lt;/p&gt;

&lt;p&gt;A good rule:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Start with recursive + overlap&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Move to &lt;strong&gt;semantic or structure-aware chunking&lt;/strong&gt; as complexity grows&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;code-aware chunking&lt;/strong&gt; for engineering systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In many cases, improving chunking yields larger gains than switching to a bigger LLM.&lt;/p&gt;




</description>
      <category>ai</category>
      <category>rag</category>
      <category>learning</category>
      <category>programming</category>
    </item>
    <item>
      <title>Building LLM Applications: Core Concepts of RAG, Embeddings, and Orchestration</title>
      <dc:creator>Vivek</dc:creator>
      <pubDate>Sun, 05 Apr 2026 19:29:21 +0000</pubDate>
      <link>https://dev.to/kasturivivek/building-llm-applications-core-concepts-of-rag-embeddings-and-orchestration-4on5</link>
      <guid>https://dev.to/kasturivivek/building-llm-applications-core-concepts-of-rag-embeddings-and-orchestration-4on5</guid>
      <description>&lt;p&gt;&lt;strong&gt;Objective&lt;/strong&gt;&lt;br&gt;
This article explains the core architecture and implementation of LLM-based systems&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM Invocation&lt;/li&gt;
&lt;li&gt;Prompt Engineering&lt;/li&gt;
&lt;li&gt;Embeddings &amp;amp; Vector Search&lt;/li&gt;
&lt;li&gt;RAG Pipeline&lt;/li&gt;
&lt;li&gt;LangGraph Workflows&lt;/li&gt;
&lt;li&gt;Production Architecture&lt;/li&gt;
&lt;li&gt;Streaming &amp;amp; Scaling&lt;/li&gt;
&lt;li&gt;Key Takeaways&lt;/li&gt;
&lt;li&gt;References&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;LLM Invocation&lt;/strong&gt;&lt;br&gt;
LLM Invocation: How We “Call” Large Language Models (and What Actually Happens)&lt;/p&gt;

&lt;p&gt;Large Language Models (LLMs) like GPT-style models are usually accessed through something that looks like an API call. But what you’re really doing is an LLM invocation: sending structured input (messages) into a model and receiving generated output back.&lt;/p&gt;

&lt;p&gt;This post explains what “LLM invocation” means, why it’s different from typical APIs, and the execution flow that happens every time you ask a model a question.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is LLM Invocation?
&lt;/h2&gt;

&lt;p&gt;LLM Invocation is the process of interacting with a large language model by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sending structured input (usually a list of messages)&lt;/li&gt;
&lt;li&gt;receiving generated output (the model’s response)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unlike traditional APIs, LLM invocation has some unique characteristics.&lt;/p&gt;

&lt;h2&gt;
  
  
  How LLM Invocation differs from traditional APIs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1) Input is natural language (plus structure)
&lt;/h3&gt;

&lt;p&gt;In a typical REST API, your input is rigid (JSON payloads with fixed fields). With LLMs, your “input” is mostly language.&lt;/p&gt;

&lt;p&gt;Even though the request may be wrapped in a JSON format (roles, messages, metadata), the substance is natural language instructions.&lt;/p&gt;

&lt;h3&gt;
  
  
  2) Output is probabilistic
&lt;/h3&gt;

&lt;p&gt;Traditional APIs return deterministic results for the same request (assuming the underlying data doesn’t change).&lt;/p&gt;

&lt;p&gt;LLMs don’t work like that. They generate output via token-by-token prediction, so the result can vary depending on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;randomness settings (temperature, top_p, etc.)&lt;/li&gt;
&lt;li&gt;tiny wording differences in the prompt&lt;/li&gt;
&lt;li&gt;context length and ordering&lt;/li&gt;
&lt;li&gt;model version&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short: same prompt does not always mean the exact same output.&lt;/p&gt;

&lt;h3&gt;
  
  
  3) Context is everything
&lt;/h3&gt;

&lt;p&gt;This is the most important point operationally:&lt;/p&gt;

&lt;p&gt;LLMs don’t “remember” in the way apps do.&lt;/p&gt;

&lt;p&gt;They only see what you send inside the context window during that invocation. If something isn’t included in the messages, the model can’t use it (unless it’s part of the model’s training, which is general—not your private state).&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Concept: Context Window
&lt;/h2&gt;

&lt;p&gt;LLMs operate inside a context window, which is basically the maximum amount of text (tokens) the model can consider at once.&lt;/p&gt;

&lt;p&gt;That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The model does not retain memory across requests by default&lt;/li&gt;
&lt;li&gt;Every time you invoke it, it processes the full message stack you provide&lt;/li&gt;
&lt;li&gt;Input quality determines output quality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your prompt is unclear, contradictory, or missing key constraints, the model’s output will reflect that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Execution Flow: What happens during an LLM call?
&lt;/h2&gt;

&lt;p&gt;A simplified invocation pipeline looks like this:&lt;/p&gt;

&lt;p&gt;1) User Query  &lt;/p&gt;

&lt;p&gt;2) Message Formatting (system + user + optional assistant history)  &lt;/p&gt;

&lt;p&gt;3) LLM Processing (token-by-token prediction)  &lt;/p&gt;

&lt;p&gt;4) Generated Response&lt;/p&gt;

&lt;p&gt;The “magic” is in step 2 and step 3:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Step 2 determines what the model is allowed to assume and how it should behave (system instructions are especially powerful).&lt;/li&gt;
&lt;li&gt;Step 3 is not retrieval of a stored answer; it’s generation of the next most likely token repeatedly until the output is complete.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Example: A simple LLM invocation (JavaScript)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;system&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;You are a technical assistant&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Explain vector databases&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Final Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;LLM invocation = sending structured messages + receiving generated output.&lt;/li&gt;
&lt;li&gt;Unlike traditional APIs, LLM outputs are probabilistic and context-dependent.&lt;/li&gt;
&lt;li&gt;LLMs don’t remember across calls; they only know what you include in the context window.&lt;/li&gt;
&lt;li&gt;The quality and structure of your input (especially system + user messages) strongly determines the quality of output.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Note
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;This article is based on my hands-on learning and implementation of LLM systems. AI tools were used to assist in structuring and refining the content.&lt;/li&gt;
&lt;li&gt;This is part 1 of the series. In upcoming parts, we will dive into other topics.&lt;/li&gt;
&lt;li&gt;Follow along to build a complete understanding of LLM-based systems.&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
  </channel>
</rss>
