DEV Community: Vivek

Chunking Strategies for LLM Applications: A Practical Guide to Better RAG Systems

Vivek — Sun, 24 May 2026 14:56:09 +0000

Learn how chunking impacts retrieval quality, embedding performance, and the overall effectiveness of Retrieval-Augmented Generation (RAG) systems.

Introduction

When building AI applications using Retrieval-Augmented Generation (RAG), developers often focus on selecting the best LLM or embedding model. But one foundational step is frequently underestimated chunking

Chunking

Chunking is the process of breaking large documents into smaller, manageable pieces before generating embeddings and storing them in a vector database.

Poor chunking can lead to:

Irrelevant retrieval results
Hallucinated answers
Missing context
Higher inference costs

Good chunking, on the other hand, dramatically improves retrieval precision and response quality.

In this article, we'll explore the most common chunking strategies, their trade-offs, and when to use each.

Why Chunking Matters

LLMs and embedding models cannot process infinitely large documents efficiently.

Consider a 200-page PDF.

Instead of embedding the entire file as one vector, we split it into smaller chunks:

Large Document
      ↓
 Chunking
      ↓
Embeddings
      ↓
Vector Database
      ↓
Semantic Retrieval
      ↓
LLM Response

Without Chunking

A single massive embedding:

loses semantic granularity
retrieves irrelevant sections
increases token cost

With Chunking

Relevant document sections become searchable and retrievable.

Understanding the Chunking Trade-Off

Chunk size affects retrieval quality.

Too small:

Missing context

Too large:

Noise + irrelevant information

The ideal chunk balances:

semantic meaning
retrieval precision
token efficiency

1. Fixed-Size Chunking

The simplest and most widely used approach.

Documents are split based on a fixed character or token limit.

Example:

500 tokens
1000 characters

How It Works

Document
──────────────────────────
Chunk 1 (500 tokens)
Chunk 2 (500 tokens)
Chunk 3 (500 tokens)

Python Example

Using LangChain:

from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)

chunks = splitter.split_text(document)

Pros

Easy to implement
Fast processing
Predictable chunk sizes

Cons

Ignores document structure
May cut sentences mid-way
Can reduce semantic meaning

Best For

quick prototypes
small datasets
simple RAG systems

2. Recursive Chunking

A smarter version of fixed-size chunking.

Instead of splitting blindly, it attempts to preserve structure.

Typical hierarchy:

Paragraph
Sentence
Word

Only if a larger section exceeds size limits does it split further.

Workflow

Paragraph too large?
        ↓
Split into sentences
        ↓
Sentence too large?
        ↓
Split into words

Example

LangChain Recursive Splitter:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)

chunks = splitter.split_text(document)

Pros

Preserves meaning
Better retrieval quality
Handles mixed documents

Cons

Slightly slower
May still ignore domain-specific structure

Best For

Most RAG systems.

This is often the default recommendation.

3. Sentence-Based Chunking

This strategy keeps chunks aligned with sentence boundaries.

Instead of arbitrary token counts:

Chunk = Complete Sentences

Example

Document:

AI systems rely on retrieval.
Chunking improves retrieval quality.
Poor chunking hurts accuracy.

Possible chunks:

Chunk 1:
AI systems rely on retrieval.

Chunk 2:
Chunking improves retrieval quality.

Chunk 3:
Poor chunking hurts accuracy.

Python Example

Using NLTK:

import nltk
from nltk.tokenize import sent_tokenize

sentences = sent_tokenize(document)

Pros

Natural language boundaries
Cleaner embeddings
Improved semantic integrity

Cons

Uneven chunk sizes
Large sentences may exceed limits

Best For

conversational data
articles
QA systems

4. Paragraph-Based Chunking

Paragraphs usually contain a coherent idea.

This makes them useful chunk boundaries.

Example

Paragraph 1 → Chunk 1
Paragraph 2 → Chunk 2
Paragraph 3 → Chunk 3

Pros

High semantic coherence
Human-readable chunks
Works well for blogs and docs

Cons

Paragraph length varies
Large paragraphs can overflow

Best For

blogs
documentation
research papers

5. Overlapping Chunking

One major issue with chunking:

context loss at boundaries.

Example:

Chunk 1:

The API authentication uses JWT...

Chunk 2:

...tokens for secure communication.

Important meaning spans both chunks.

Overlap solves this.

How Overlap Works

Chunk 1
──────────────
AAAA BBBB CCCC

Chunk 2
          CCCC DDDD EEEE

Notice:

CCCC

appears in both chunks.

Code Example

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100
)

Pros

Better retrieval continuity
Reduces boundary problems
Higher answer accuracy

Cons

More embeddings
Larger vector storage
Increased retrieval cost

Best For

Nearly all production RAG systems.

Typical overlap:

10–20%

6. Semantic Chunking

Semantic chunking uses meaning instead of size.

The document is split where topic changes occur.

This is significantly more intelligent.

Concept

Instead of:

Every 500 tokens

we split by:

Meaning shift

Example

Document:

Section A → Databases
Section B → Kubernetes
Section C → Security

Semantic chunking creates:

Chunk 1 → Database topic
Chunk 2 → Kubernetes topic
Chunk 3 → Security topic

High-Level Pipeline

Text
 ↓
Sentence embeddings
 ↓
Similarity comparison
 ↓
Topic boundary detection
 ↓
Chunks

Python Example (Conceptual)

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

Sentence similarity determines where to split.

Pros

Excellent retrieval quality
Topic-aware
Strong contextual relevance

Cons

Computationally expensive
More implementation effort

Best For

enterprise search
legal documents
knowledge bases

7. Structure-Aware Chunking

Some documents already contain structure.

Examples:

HTML headings
Markdown sections
PDFs with titles
Code files

Instead of ignoring this, we use it.

Example

Markdown:

# Authentication
JWT details...

# Rate Limiting
API throttling...

Chunks:

Authentication section
Rate Limiting section

Code Example

Markdown Header Splitter:

from langchain.text_splitter import MarkdownHeaderTextSplitter

headers = [
    ("#", "Header1"),
    ("##", "Header2")
]

Pros

High semantic consistency
Uses author intent
Excellent for documentation

Cons

Depends on clean formatting
Less effective on raw text

Best For

developer docs
wikis
technical manuals

8. Code Chunking

Source code needs special handling.

Splitting every 500 characters can break logic.

Instead:

Split by:

function
class
module
AST nodes

Bad Chunk

def login():
    ...

cut halfway.

Better Chunk

Entire login() function

Example Using Tree-sitter

import tree_sitter

AST-based parsing preserves syntax.

Pros

Maintains logical structure
Better code retrieval
Strong for AI coding assistants

Cons

Language-specific tooling

Best For

code copilots
repository search
software documentation

Comparing Chunking Strategies

Strategy	Quality	Complexity	Best Use
Fixed Size	Low	Low	Prototypes
Recursive	High	Low	General RAG
Sentence	Medium	Low	QA
Paragraph	Medium	Low	Articles
Overlap	High	Low	Production RAG
Semantic	Very High	High	Enterprise
Structure-Aware	High	Medium	Docs
Code Chunking	Very High	High	Code AI

A Practical Chunking Strategy

Many successful RAG systems use a hybrid approach.

Example:

Structure-aware
        +
Recursive splitting
        +
10–20% overlap

Pipeline:

Document
   ↓
Heading Split
   ↓
Recursive Chunking
   ↓
Overlap
   ↓
Embeddings
   ↓
Vector DB

This usually offers the best balance between:

relevance
cost
simplicity

Final Thoughts

Chunking is not just preprocessing.

It directly influences:

retrieval precision
embedding quality
hallucination rate
user experience

There is no universal best strategy.

A good rule:

Start with recursive + overlap
Move to semantic or structure-aware chunking as complexity grows
Use code-aware chunking for engineering systems

In many cases, improving chunking yields larger gains than switching to a bigger LLM.

Building LLM Applications: Core Concepts of RAG, Embeddings, and Orchestration

Vivek — Sun, 05 Apr 2026 19:29:21 +0000

Objective
This article explains the core architecture and implementation of LLM-based systems

Table of Contents

LLM Invocation
Prompt Engineering
Embeddings & Vector Search
RAG Pipeline
LangGraph Workflows
Production Architecture
Streaming & Scaling
Key Takeaways
References

LLM Invocation
LLM Invocation: How We “Call” Large Language Models (and What Actually Happens)

Large Language Models (LLMs) like GPT-style models are usually accessed through something that looks like an API call. But what you’re really doing is an LLM invocation: sending structured input (messages) into a model and receiving generated output back.

This post explains what “LLM invocation” means, why it’s different from typical APIs, and the execution flow that happens every time you ask a model a question.

What is LLM Invocation?

LLM Invocation is the process of interacting with a large language model by:

sending structured input (usually a list of messages)
receiving generated output (the model’s response)

Unlike traditional APIs, LLM invocation has some unique characteristics.

How LLM Invocation differs from traditional APIs

1) Input is natural language (plus structure)

In a typical REST API, your input is rigid (JSON payloads with fixed fields). With LLMs, your “input” is mostly language.

Even though the request may be wrapped in a JSON format (roles, messages, metadata), the substance is natural language instructions.

2) Output is probabilistic

Traditional APIs return deterministic results for the same request (assuming the underlying data doesn’t change).

LLMs don’t work like that. They generate output via token-by-token prediction, so the result can vary depending on:

randomness settings (temperature, top_p, etc.)
tiny wording differences in the prompt
context length and ordering
model version

In short: same prompt does not always mean the exact same output.

3) Context is everything

This is the most important point operationally:

LLMs don’t “remember” in the way apps do.

They only see what you send inside the context window during that invocation. If something isn’t included in the messages, the model can’t use it (unless it’s part of the model’s training, which is general—not your private state).

Core Concept: Context Window

LLMs operate inside a context window, which is basically the maximum amount of text (tokens) the model can consider at once.

That means:

The model does not retain memory across requests by default
Every time you invoke it, it processes the full message stack you provide
Input quality determines output quality

If your prompt is unclear, contradictory, or missing key constraints, the model’s output will reflect that.

Execution Flow: What happens during an LLM call?

A simplified invocation pipeline looks like this:

1) User Query

2) Message Formatting (system + user + optional assistant history)

3) LLM Processing (token-by-token prediction)

4) Generated Response

The “magic” is in step 2 and step 3:

Step 2 determines what the model is allowed to assume and how it should behave (system instructions are especially powerful).
Step 3 is not retrieval of a stored answer; it’s generation of the next most likely token repeatedly until the output is complete.

Example: A simple LLM invocation (JavaScript)

const response = await model.invoke([
  { role: "system", content: "You are a technical assistant" },
  { role: "user", content: "Explain vector databases" }
]);

Final Takeaways

LLM invocation = sending structured messages + receiving generated output.
Unlike traditional APIs, LLM outputs are probabilistic and context-dependent.
LLMs don’t remember across calls; they only know what you include in the context window.
The quality and structure of your input (especially system + user messages) strongly determines the quality of output.

Note

This article is based on my hands-on learning and implementation of LLM systems. AI tools were used to assist in structuring and refining the content.
This is part 1 of the series. In upcoming parts, we will dive into other topics.
Follow along to build a complete understanding of LLM-based systems.