DEV Community

Understanding LangChain's RecursiveCharacterTextSplitter

Youdiowei Eteimorde on August 12, 2023

Large language models are powerful tools with extensive capabilities; nonetheless, they grapple with a distinct limitation known as the context win...

scitlec • Aug 31 '23

The maintainers of the LangChain documentation should link to your useful explanation.

Thanks!

Pham Minh Quang • Sep 10 '23

I totally agree! The LangChain documentation just sucks.

Youdiowei Eteimorde • Sep 1 '23

Thanks for your kind words 🥰 Who knows, they might.

Cristian Molina • Jan 25 '24

What about the chunk_overlap param?

Youdiowei Eteimorde • Jan 27 '24 • Edited

The chunk_overlap parameter determines how much the chunks overlap with each other.

For example, let's split your comment into three chunks.

What about | the chunk_ | overlap param?

Now let's overlap each chunk with its neighbor by about 5 characters:

What about the | about the chunk_ | chunk_overlap param?

If we didn't use chunk overlapping, your comment would have lost its meaning when split.
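
The overlap idea above can be sketched in a few lines of plain Python. This is only an illustration of a fixed-size sliding window, not LangChain's actual algorithm (the real splitter also honors separators such as spaces and newlines); the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows where neighbors share chunk_overlap characters.

    Illustrative only: unlike LangChain's splitter, this ignores separators
    and may cut words in half.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


comment = "What about the chunk_overlap param?"
chunks = split_with_overlap(comment, chunk_size=15, chunk_overlap=5)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk ends with the same 5 characters that the next chunk starts with, so text near a boundary appears in both chunks.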

Cristian Molina • Jan 28 '24

Thanks! That makes sense, but what value should I use if, for instance, I need to save the texts in a vector DB later to augment a RAG pipeline?
Does it matter? If this is significant, I'd add this information to the article.
Thanks again.

Youdiowei Eteimorde • Jan 29 '24

It all depends on your data and what you are trying to achieve. The whole practice of augmenting LLMs with external knowledge is still in its infancy, so you can experiment with different params to see how your LLM performs during RAG.

James Stover • Jan 17 '24

Something doesn’t quite work right: after splitting, some words throughout my text are broken apart by a space, turning each of them into two non-words. There are quite a few characters between these breaks, so it isn’t frequent, but in a large body of text they add up. I am concerned about the detrimental impact on the vector embeddings and retrieval.

Youdiowei Eteimorde • Jan 27 '24

Splitting is far from perfect. Hopefully more efficient techniques will be developed.

Pahlwan Rabiul Islam • Nov 22 '23

Thanks a lot

Lambozo • Jul 11 '24

Could I just make the chunk_size really big? Like 1000 or even 2000? Are there any downsides to doing that?

Youdiowei Eteimorde • Jul 14 '24

Yes, you could. If your LLM has a very large context window, it won't be a problem.
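
A rough way to check whether a larger chunk_size still fits is to estimate the token count of the chunks you plan to pass to the model. The sketch below assumes chunk_size is counted in characters and uses a ratio of about 4 characters per token, which is only a rule-of-thumb assumption for English text; the function name and the numbers are hypothetical, and a real tokenizer would give exact counts.

```python
def fits_in_context(
    chunk_size_chars: int,
    num_chunks: int,
    context_window_tokens: int,
    reserved_tokens: int = 0,
    chars_per_token: float = 4.0,  # ASSUMPTION: rough heuristic, not an exact figure
) -> bool:
    """Estimate whether num_chunks chunks fit in the model's context window.

    reserved_tokens is the budget kept aside for the prompt and the answer.
    """
    estimated_chunk_tokens = (chunk_size_chars * num_chunks) / chars_per_token
    return estimated_chunk_tokens + reserved_tokens <= context_window_tokens


# Hypothetical numbers: 4 retrieved chunks of 2000 characters, a 4096-token window,
# and 1024 tokens kept free for the question and the answer.
small_batch_fits = fits_in_context(2000, 4, 4096, reserved_tokens=1024)
large_batch_fits = fits_in_context(2000, 10, 4096)
print(small_batch_fits, large_batch_fits)
```

Under these assumptions, bigger chunks mainly limit how many of them can be retrieved at once.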

Lambozo • Jul 15 '24

Thank you for your reply. How do I find out how large my LLM's context window is? And how can I increase it?

Ajeet Verma • Jan 23 '24

great explanation!

edcults • Apr 22 '24

Thanks a lot for the detailed explanation. I wonder why this is not linked from or published on the LangChain blog.

Meenakshi Arumugam • Jan 13 '25

Clear explanation, thanks a lot!!

Leonard Pleschberger • Sep 21 '24

Great explanation, many thanks!

eden lin • Sep 29 '24

very good, thank you

Ni • Jun 1 '24

You should write more posts like this!
For someone new to LangChain and text splitting, this post really went deep on the subject.

Thanks!

Youdiowei Eteimorde • Jun 4 '24

Thank you for your nice words ☺️

Vishal Nagda • Mar 27 '24

It's really a valuable post.

RamKumar-T-R • Jan 1 '24

Good write-up, with insightful detail on the implementation.

Sophie du Couédic • Jun 27 '24

I am wondering why '.' is not part of the default separators. It seems to me that it would be effective for separating sentences.

Nosy Scout • Jan 16 '25

In my opinion, you would risk losing very relevant information: splitting on '.' also breaks numbers at the decimal point, and the pieces are not always merged back together.
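
The decimal-point problem is easy to reproduce with plain Python, independent of LangChain. The sketch below contrasts a naive split on '.' with a regular expression that only splits on whitespace following sentence-ending punctuation; the sample text and the pattern are illustrative assumptions, and the pattern would still mis-split abbreviations such as "e.g. this".

```python
import re

sample = "Pi is about 3.14. The chunk has 2.5 times more text."

# Naive approach: every '.' is a separator, so "3.14" and "2.5" are torn apart.
naive_pieces = sample.split(".")

# Alternative: split only on whitespace that follows '.', '!' or '?'.
# The '.' inside "3.14" is followed by a digit, so it is left alone.
sentence_pieces = re.split(r"(?<=[.!?])\s+", sample)

print(naive_pieces)
print(sentence_pieces)
```

With the second approach the numbers stay intact because a period only counts as a boundary when whitespace comes right after it.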

Nikhil Kulkarni • Nov 7 '23

This was very helpful, Thanks for the detailed explanation!

Nosy Scout • Jan 16 '25 • Edited

Hey, great post! Could you please explain how the start_index for each chunk is computed? It is sometimes -1, and it almost never corresponds to the start position of the chunk text in the original file. Thanks!
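
One possible explanation for the -1 values, offered as an assumption rather than a confirmed description of LangChain's internals: -1 is what Python's `str.find` returns when a substring is not found. If start_index is obtained by searching for each chunk's text in the original string, any chunk whose text no longer matches the original verbatim (for example after whitespace was stripped or separators were changed) would get -1. The helper `locate_chunks` below is hypothetical and only demonstrates that behavior.

```python
def locate_chunks(text: str, chunks: list[str]) -> list[int]:
    """Return the position of each chunk in text, or -1 if it is not found verbatim.

    ASSUMPTION: this mimics a search-based start index; it is not LangChain code.
    """
    positions = []
    search_from = 0
    for chunk in chunks:
        index = text.find(chunk, search_from)
        positions.append(index)
        if index != -1:
            # Continue just past the previous match so overlapping chunks are still found.
            search_from = index + 1
    return positions


original = "alpha beta gamma"
verbatim_positions = locate_chunks(original, ["alpha beta", "beta gamma"])
altered_positions = locate_chunks(original, ["alpha beta", "beta-gamma"])
print(verbatim_positions, altered_positions)
```

If this assumption holds, comparing each chunk against the original text with `in` would show which chunks were altered during splitting.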

Healey Wong • Jul 5 '24

Great and direct explanation!

akthar hussain • Jan 1 '25

appreciate finding this article!