Youdiowei Eteimorde

Understanding LangChain's RecursiveCharacterTextSplitter

Large language models are powerful tools, but they have a distinct limitation known as the context window. The context window sets the maximum amount of text, measured in tokens, that a model can process in a single request. Take, for example, the original gpt-3.5-turbo, which has a context length of 4,096 tokens, roughly 3,000 English words.

But what occurs when you present these models with a document that exceeds their context window? This is where a clever strategy known as "chunking" comes into play. Chunking involves dividing the document into smaller, more manageable sections that fit comfortably within the context window of the large language model.
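To make the idea concrete, here is a minimal sketch of the most naive form of chunking: slicing a string every N characters. The helper name `fixed_size_chunks` and the sample sentence are made up for illustration and are not part of LangChain.

```python
def fixed_size_chunks(text: str, chunk_size: int) -> list[str]:
    """Slice text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Hypothetical sample text, used only for this illustration.
sample = "Chunking splits a long document into smaller pieces."
naive_chunks = fixed_size_chunks(sample, 20)
print(naive_chunks)
# ['Chunking splits a lo', 'ng document into sma', 'ller pieces.']
```

Every chunk fits the limit, but words such as "long" and "smaller" are cut in half. Splitting on natural boundaries like paragraphs, lines and spaces is meant to avoid exactly that.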

LangChain provides a range of text splitters to choose from. Among these, the RecursiveCharacterTextSplitter is the one its documentation recommends for generic text.

Quick overview

The RecursiveCharacterTextSplitter takes a large text and splits it into chunks that fit a specified chunk size. It does this using an ordered list of separators, which defaults to ["\n\n", "\n", " ", ""].

It first tries to split the text on the first separator, \n\n. Any resulting piece that is still larger than the chunk size is split again on the next separator, \n, and so on down the list until every piece fits within the chunk size.
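The fallback through the separator list can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's actual implementation: the function name `split_recursively` is hypothetical, it only breaks text down and never merges small pieces back together, and it discards the separators it splits on.

```python
def split_recursively(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Break text into pieces of at most chunk_size characters.

    Separators are tried in order; the next one is only used on pieces
    that are still too long after splitting on the current one.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left to try: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    separator, remaining = separators[0], separators[1:]
    # An empty separator means "split between every character".
    parts = list(text) if separator == "" else text.split(separator)
    pieces = []
    for part in parts:
        pieces.extend(split_recursively(part, remaining, chunk_size))
    return pieces


# Made-up demo text with a paragraph break, a line break and spaces.
demo = "Title\n\nfirst line\nsecond line has more words"
pieces = split_recursively(demo, ["\n\n", "\n", " ", ""], chunk_size=12)
print(pieces)
# ['Title', 'first line', 'second', 'line', 'has', 'more', 'words']
```

Notice that "first line" survives intact because it already fits, while the longer line has to fall through to the space separator.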

Code implementation

What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.

The text above is extracted from an article written by Paul Graham, titled: What I Worked On. Let's utilize the RecursiveCharacterTextSplitter to break it into small chunks, each with a maximum size of 100 characters.

First we import it from langchain:

from langchain.text_splitter import RecursiveCharacterTextSplitter

Let's load the text we wish to create chunks from into a variable called text.

text = """What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.
"""

Next we create a RecursiveCharacterTextSplitter instance with a chunk_size of 100 and a chunk_overlap of zero. Passing Python's built-in len as the length_function means each chunk is measured by its character count.

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=0,
    length_function=len,
)

The RecursiveCharacterTextSplitter offers several methods for performing splits. In our case, we will use the split_text method. It takes the text as a string and returns a list of strings, one per chunk.

texts = text_splitter.split_text(text)
print(len(texts))  # 11
# repr() shows the newlines as \n instead of rendering them
print(repr(texts[0]))  # 'What I Worked On\n\nFebruary 2021'

Upon performing the split our text was successfully divided into a total of 11 separate chunks.

In-Depth Explanation

Recursion

As its name suggests, the RecursiveCharacterTextSplitter uses recursion as its core mechanism for splitting text. Let's walk through, step by step, how our earlier code produced its chunks.

For our walkthrough, we'll use the same text and parameters as in the code implementation: the excerpt from Paul Graham's essay and a chunk size of 100 characters. The separators we use for splitting will be ["\n\n", "\n", " ", ""].

Paul Graham's Essay

Let's begin with our initial text. Currently presented in human-readable form, our next step involves transforming it into a format that computers can readily comprehend.

Paul Graham's Essay for computers

Now, the new lines have been converted to \n, which is precisely what we need in order to carry out our splitting process.
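In Python, one way to see this escaped form is `repr`, which shows each line break as `\n` instead of rendering it. A small sketch, using only the first two lines of the essay:

```python
sample = """What I Worked On

February 2021
"""

print(sample)        # rendered: line breaks appear as actual new lines
print(repr(sample))  # escaped: 'What I Worked On\n\nFebruary 2021\n'

# The blank line between the title and the date is the "\n\n" separator.
paragraph_breaks = sample.count("\n\n")
print(paragraph_breaks)  # 1
```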

Highlighted Paul Graham's essay

Let's select our text. This can be likened to invoking the split_text method on our text.

As mentioned earlier, the RecursiveCharacterTextSplitter tries its separators in order. Its first attempt uses \n\n, a blank line, which splits the text by paragraphs. Let's now identify all occurrences of this separator within our text.

Occurrence of paragraphs in the essay

Once we've located all instances of the \n\n characters, the subsequent step involves executing a split using this character as our designated separator.

Split text by paragraph

Presently, we have four splits. Our next step involves assessing each split to check whether they meet the condition of being smaller than our specified chunk size, which is set at 100 characters.
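This check is easy to reproduce with plain Python. The sketch below uses a shortened version of the excerpt (an assumption made to keep the example compact, so the lengths differ from the full text) and labels each paragraph split as good or too long.

```python
CHUNK_SIZE = 100

# Shortened stand-in for the essay excerpt; only the structure matters here.
essay = (
    "What I Worked On\n\n"
    "February 2021\n\n"
    "Before college the two main things I worked on, outside of school, "
    "were writing and programming. I didn't write essays.\n\n"
    "The first programs I tried writing were on the IBM 1401 that our "
    "school district used for what was then called \"data processing.\"\n"
)

splits = essay.split("\n\n")
is_good = [len(split) < CHUNK_SIZE for split in splits]

for index, split in enumerate(splits):
    verdict = "good" if is_good[index] else "too long"
    print(index, len(split), verdict)
```

The two short header lines come out as good splits, while both paragraphs are too long and need further splitting.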

The first two splits satisfy this condition, which makes them good splits. Since joining them with the \n\n separator still gives fewer than 100 characters, we can combine them to create our initial chunk.

Initial chunk of RecursiveCharacterTextSplitter

Proceeding to the next split, we find that it is too large and cannot be reduced any further using \n\n. Therefore, we move on to the next separator: \n. Our objective is to split on \n and see whether that reduces the split's size.

This is akin to invoking split_text on that split again, but with \n as the separator. This is where recursion comes into play.

Second chunk

Upon executing the split using the \n character, we end up with two splits. The first split qualifies as a good split, given that it contains only one character. However, the second split surpasses our designated chunk size.

Consequently, we need to invoke the split_text method on this particular split once again. However, this time we'll employ a split using the next character in our character list, which happens to be the ' ' character.

RecursiveCharacterTextSplitter using spaces

Finally, we have successfully decreased the split size. Now, we proceed to iterate through each split in order to perform a merge. The guiding principle for these merges is that no resulting merged split should exceed our designated chunk size of 100 characters.
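The merge step can be sketched as a greedy loop: keep appending pieces to the current chunk until the next piece would push it past the chunk size, then start a new chunk. The helper `merge_pieces` is hypothetical and is a simplification of what LangChain does (it ignores chunk_overlap and assumes every piece already fits on its own); the chunk size of 25 is chosen only to keep the output short.

```python
def merge_pieces(pieces: list[str], separator: str, chunk_size: int) -> list[str]:
    """Greedily join pieces with separator without exceeding chunk_size."""
    chunks = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            # The piece still fits, so keep growing the current chunk.
            current = candidate
        else:
            # The piece does not fit: close the current chunk, start a new one.
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


words = "They had hardly any plot, just characters with strong feelings".split(" ")
merged = merge_pieces(words, " ", chunk_size=25)
print(merged)
# ['They had hardly any plot,', 'just characters with', 'strong feelings']
```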

merge list

Following the merge, we end up with four chunks, each adhering to our condition that a chunk should not surpass 100 characters.

Now, let's revisit the original text splits and identify which split remains to be processed.

Final chunk in RecursiveCharacterTextSplitter

We still have one split that is greater than our chunk size. We repeat the same procedures again.

Perform split with RecursiveCharacterTextSplitter

We start by splitting on the newline character, \n.

Perform split with RecursiveCharacterTextSplitter using spaces

We perform a split using spaces as the separators.

Perform merge with RecursiveCharacterTextSplitter

Next, we proceed with a merge, ensuring that no merged segments exceed the defined chunk size.

After going through the entire process, we end up with eleven chunks, each of which adheres to the 100-character limit. This outcome matches what we achieved programmatically.

Final chunks from using the RecursiveCharacterTextSplitter


And there we have it. We've delved into the inner workings of LangChain's RecursiveCharacterTextSplitter. For those who are intrigued, you can explore the source code here. If you found this article informative, please consider showing your appreciation with a reaction: 💖 🦄 🤯 🙌 🔥

Top comments (15)

scitlec

The maintainers of the LangChain documentation should link to your useful explanation.

Thanks!

Pham Minh Quang

I totally agree! The LangChain documentation just sucks.

Youdiowei Eteimorde

Thanks for your kind words 🥰 Who knows, they might.

James Stover

Something doesn't quite work right, as I see some words throughout my text after splitting are broken apart with a space, making two non-words of each of them. They have quite a few characters in between, so it isn't frequent, but in a large body of text, these add up. I am concerned about the detrimental impact on the vector embeddings and retrieval.

Youdiowei Eteimorde

Splitting is far from perfect. Hopefully more efficient techniques will be developed.

Cristian Molina

What about the chunk_overlap param?

Youdiowei Eteimorde • Edited

The chunk_overlap parameter determines how much the chunks overlap with each other.

For example let's split your comment into three chunks.

What about | the chunk_ | overlap param?

Now let's make neighbouring chunks overlap by roughly 5 characters:

What about the | about the chunk_ | chunk_overlap param?

If we didn't use chunk overlapping, your comment would have lost its meaning when split.
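A rough sketch of the idea in plain Python, using a character-level sliding window. The helper `chunks_with_overlap` is hypothetical; LangChain applies the overlap to the pieces it merges rather than to raw character positions, so its real output will differ from this.

```python
def chunks_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early so the last window is not just a repeat of the previous tail.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


comment = "What about the chunk_overlap param?"
overlapped = chunks_with_overlap(comment, chunk_size=15, chunk_overlap=5)
print(overlapped)
# ['What about the ', ' the chunk_over', '_overlap param?']
```

Each chunk starts with the last 5 characters of the previous one, so text near a boundary appears in both neighbours.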

Cristian Molina

Thanks! That makes sense, but what value should I use if, for instance, I need to save the texts in a vector DB later for RAG?
Does it matter? If this is significant I'd add this information to the article.
Thanks again.

Youdiowei Eteimorde

It all depends on your data and what you are trying to achieve. Augmenting LLMs with external knowledge is still in its infancy, so you can experiment with different parameters to see how your LLM performs during RAG.

Ajeet Verma

great explanation!

Pahlwan Rabiul Islam

Thanks a lot

Nikhil Kulkarni

This was very helpful, Thanks for the detailed explanation!

RamKumar-T-R

Good write-up, with insightful detail on the implementation part.

edcults

Thanks a lot for the detailed explanation. I wonder why this is not linked from or published on the LangChain blog.

Vishal Nagda

It's really a valuable post.