Dev J. Shah 🥑
Chunking for context: 6 Strategies Every AI Engineer Should Know

Introduction

Chunking splits data into pieces, converts each piece into an embedding, and stores the embeddings in a vector database so they can later be retrieved to provide context to an LLM.

This data can be split using multiple strategies. The ultimate goal is to ensure that whenever the relevant chunk(s) are fetched, they provide enough context for the LLM to properly address the user's query. The following are some of the strategies that can be used to split the data.


Fixed Size

The most common method of chunking is fixed-size: the text is split into chunks of a fixed number of characters. For instance, if the data contains 5,000 characters and the chunk size is 250 characters, the data gets divided into 20 chunks.

For this strategy, one important parameter is the overlap. Developers often overlap characters between consecutive chunks to help preserve context. In simple words, if the overlap is 20 characters, the last 20 characters of a chunk also become the first 20 characters of the next chunk.


Recursive

Recursive chunking splits text by using the largest meaningful structure first, and only falls back to smaller ones if a chunk is still too large. This is easiest to see with an example.

Assuming the following data is to be split.

Recursive chunking splits text step by step while preserving meaning.

In retrieval-augmented generation systems, documents often exceed the context window of large language models. If text is split blindly into fixed sizes, important ideas may be broken across chunks, and retrieval quality suffers.

Recursive chunking solves this by attempting to keep larger semantic units intact first, such as paragraphs. Only when a paragraph exceeds the size limit does the system split it further into sentences or smaller units.

Assume the required size of each chunk is 20 tokens (roughly 20 words).

Following is the order in which recursive chunking divides the text.

  1. Whole document
  2. Paragraphs (split on `\n\n`)
  3. Sentences (split on `.`)
  4. Tokens

Following this order, the splitter first checks if the whole document can fit into a single chunk. However, it exceeds the limit of 20 tokens. Hence, it would divide the content into separate paragraphs.

P1: 10 tokens

Recursive chunking splits text step by step while preserving meaning.

P2: 33 tokens

In retrieval-augmented generation systems, documents often exceed the context window of large language models. If text is split blindly into fixed sizes, important ideas may be broken across chunks, and retrieval quality suffers.

P3: 35 tokens

Recursive chunking solves this by attempting to keep larger semantic units intact first, such as paragraphs. Only when a paragraph exceeds the size limit does the system split it further into sentences or smaller units.

Since P1 fits within the limit of 20 tokens, P1 becomes the first chunk. P2 and P3 exceed the limit and will be further divided into sentences.

P2, S1: 17 tokens

In retrieval-augmented generation systems, documents often exceed the context window of large language models.

P2, S2: 19 tokens

If text is split blindly into fixed sizes, important ideas may be broken across chunks, and retrieval quality suffers.

P3, S1: 17 tokens

Recursive chunking solves this by attempting to keep larger semantic units intact first, such as paragraphs.

P3, S2: 18 tokens

Only when a paragraph exceeds the size limit does the system split it further into sentences or smaller units.

All of these fit within the 20-token limit. Hence, the final list of chunks is P1, P2-S1, P2-S2, P3-S1, P3-S2.
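The walkthrough above can be sketched as a small recursive function. This is a simplified illustration (using word count as a rough stand-in for tokens), not the implementation from any specific library:

```python
def recursive_chunks(text: str, max_tokens: int = 20,
                     separators=("\n\n", ". ")) -> list[str]:
    """Recursively split text, trying the largest separator first and
    falling back to smaller ones only when a piece is still too big."""
    tokens = len(text.split())  # rough proxy: 1 token ~ 1 word
    if tokens <= max_tokens:
        return [text.strip()] if text.strip() else []
    if not separators:
        # No separators left: hard-split at the token level.
        words = text.split()
        return [" ".join(words[i:i + max_tokens])
                for i in range(0, len(words), max_tokens)]
    sep, rest = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:  # separator not present, try the next finer one
        return recursive_chunks(text, max_tokens, rest)
    chunks = []
    for part in parts:
        chunks.extend(recursive_chunks(part, max_tokens, rest))
    return chunks
```

Running this on the three-paragraph example with a 20-token limit yields exactly the five chunks listed above: the short first paragraph survives intact, and the two longer paragraphs are split into sentences.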


Document-Based

Document-based chunking splits text along natural document boundaries like sections, subsections, and paragraphs, which is similar to recursive chunking. However, the key difference is that when a section exceeds the token limit, document-based chunking identifies semantically meaningful split points (like topic shifts) rather than mechanically splitting at arbitrary separators. This ensures each chunk contains coherent, self-contained information.

For instance, the following data,

# Analysis Framework
We employed a mixed-methods approach combining quantitative regression analysis with qualitative thematic coding. The quantitative component used multiple linear regression to identify predictors of user satisfaction, controlling for demographic variables including age, gender, and geographic location. Model selection involved comparing AIC values across nested models. The qualitative component involved coding open-ended responses using an inductive approach to identify emergent themes. Two independent coders achieved an inter-rater reliability of ΞΊ=0.85.

will split as follows

# Analysis Framework - Quantitative
We employed a mixed-methods approach combining quantitative regression analysis with qualitative thematic coding. The quantitative component used multiple linear regression to identify predictors of user satisfaction, controlling for demographic variables including age, gender, and geographic location. Model selection involved comparing AIC values across nested models.
# Analysis Framework - Qualitative
The qualitative component involved coding open-ended responses using an inductive approach to identify emergent themes. Two independent coders achieved an inter-rater reliability of ΞΊ=0.85.
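The structural half of this strategy — cutting along document boundaries — can be sketched with a regex over markdown headings. Detecting the topic shift inside an oversized section (the step that produced the Quantitative/Qualitative split above) typically requires embeddings or an LLM and is omitted here:

```python
import re

def split_by_headings(markdown: str) -> list[str]:
    """Split a markdown document into one chunk per heading-led section.
    A zero-width lookahead keeps each heading with the text that follows it."""
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [s.strip() for s in sections if s.strip()]
```

Each resulting chunk begins with its own heading, so the retrieved context is self-describing.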

Hierarchical

In this type of chunking strategy, the data is first divided into small semantic units, that is, either sentences or small paragraphs, to form separate chunks. This can be considered a level 3 division (the most granular level). Further, level 3 chunks are grouped together based on similarity, which forms level 2 chunks, and lastly, based on the main topic, level 2 chunks are grouped to become level 1 chunks. This bottom-up aggregation is what makes hierarchical chunking unique.

In this strategy, the interesting work happens at retrieval time. The user query is matched against the most relevant level 3 chunk; upon a successful match, the parent level 2 or level 1 chunk is retrieved so the LLM receives the broader surrounding context.
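The retrieval step can be sketched as: match on the granular leaf, return its parent. The `score` function here is a toy word-overlap stand-in for real embedding similarity, and the dictionary structure is illustrative:

```python
def score(query: str, text: str) -> int:
    """Toy relevance score: number of words shared with the query.
    Real systems would compare embeddings instead."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def hierarchical_retrieve(query: str, leaves: list[dict]) -> str:
    """Match the query against leaf (level 3) chunks, but return the
    parent (level 2) chunk to give the LLM broader context."""
    best = max(leaves, key=lambda leaf: score(query, leaf["text"]))
    return best["parent"]
```

The key design choice is that the fine-grained chunk is used only for matching; what gets placed in the prompt is the coarser parent.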


Semantic

In Semantic Chunking, content is divided based on the actual meaning and topic of the text.

Imagine the algorithm reading through your document sentence by sentence. It starts building a "chunk" and asks itself: "Is this next sentence still talking about the same thing?" As long as the next sentence is semantically related to the previous one, it is added to the current chunk.

As soon as the topic shifts or if the chunk reaches a maximum token limit, the current chunk is closed, and a new one begins. This ensures that a single idea stays together, making it much easier for an AI to retrieve the right context later.

The model identifies topic shifts by computing the distance between sentence embeddings, typically using cosine similarity.
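The sentence-by-sentence loop described above can be sketched as follows. The `embed` function here is a deliberately crude bag-of-words placeholder so the example is self-contained; a real pipeline would call a sentence-embedding model, and the `0.2` threshold is an arbitrary illustrative value:

```python
import math
from collections import Counter

def embed(sentence: str) -> Counter:
    """Placeholder embedding: a bag-of-words vector.
    Swap in a real sentence-embedding model in practice."""
    return Counter(sentence.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[str]:
    """Grow the current chunk while consecutive sentences stay similar;
    start a new chunk when similarity drops below the threshold."""
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(sent)) >= threshold:
            current.append(sent)
        else:
            chunks.append(" ".join(current))
            current = [sent]
    chunks.append(" ".join(current))
    return chunks
```

Two sentences about cats stay together, while an unrelated sentence about the stock market starts a fresh chunk.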


LLM Based

In LLM-based chunking, the model acts as an intelligent editor, choosing precisely where to place dividers.

Key Considerations for Breaking Content:

  • Semantic Drift: It detects subtle transitions in subject matter, asking: "Are we still talking about the same core concept?"
  • Conceptual Integrity: It ensures an idea is fully explained before cutting.
  • Logical Grouping: It groups related ideas even if they use different wording or span across multiple paragraphs.
  • Structural Intelligence: It respects the flow of headings and sections but prioritizes the actual narrative flow over rigid formatting.
  • Relational Awareness: The LLM excels at identifying functional pairs, such as:
    • Cause ↔ Effect
    • Definition ↔ Example
    • Problem ↔ Solution
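One common way to implement this is to ask the model to insert an explicit split marker wherever it judges the topic shifts, then cut on that marker. The prompt wording, marker string, and the `llm` callable below are all illustrative assumptions, not a specific provider's API:

```python
SPLIT_PROMPT = """Insert the marker <<<SPLIT>>> between passages of the
following text wherever the topic shifts. Keep the text itself unchanged.

{text}"""

def llm_chunks(text: str, llm) -> list[str]:
    """Ask an LLM (any callable mapping prompt -> completion string)
    to mark split points, then divide the text on those markers."""
    marked = llm(SPLIT_PROMPT.format(text=text))
    return [chunk.strip() for chunk in marked.split("<<<SPLIT>>>") if chunk.strip()]
```

Keeping the model call behind a plain callable makes the chunker easy to test with a stub and easy to swap across providers.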

Key Considerations for Chunking

Context Window Limits: Every embedding model has a maximum token limit (e.g., 512, 8192 tokens). If a text chunk exceeds this limit, the model will simply truncate the text, ignoring any content beyond the limit. This results in incomplete vector representations and lost data. Therefore, your maximum chunk size must always be safely below the embedding model's context window limit.
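Because truncation is silent, it is worth validating chunk sizes before embedding. A minimal sketch, using word count as a rough proxy for tokens (a real pipeline would use the embedding model's own tokenizer):

```python
def validate_chunks(chunks: list[str], max_tokens: int = 512) -> list[str]:
    """Raise if any chunk would be silently truncated by the embedding model.
    Word count is a crude stand-in for a real tokenizer here."""
    oversized = [c for c in chunks if len(c.split()) > max_tokens]
    if oversized:
        raise ValueError(
            f"{len(oversized)} chunk(s) exceed {max_tokens} tokens "
            "and would be silently truncated"
        )
    return chunks
```

Failing loudly at ingestion time is much cheaper than debugging incomplete embeddings after retrieval quality degrades.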

Granularity vs. Context: A larger context window allows for bigger chunks, capturing more context but potentially diluting specific details. Smaller windows force smaller chunks, which are more precise but may lack surrounding context. The choice of chunk size is a direct trade-off that must align with the capabilities of the specific embedding model you intend to use downstream.


Conclusion

Choosing the right chunking strategy depends on the nature of your data, the embedding model you use, and the kind of queries your system needs to handle. Each strategy covered above comes with its own trade-offs between simplicity, semantic accuracy, and computational cost.

If you want to learn more about AI engineering and related terminology, consider reading “AI Engineering” by Chip Huyen.
