Indumathi R
Day 5 - Chunking continued - RAG

Sliding window chunking
To understand this method, we need to know two parameters: window size and step size. Let's now see how, with the help of these two parameters, sliding window chunking works.

Consider the following:

  1. Sample text:
    Redis is an open-source, in-memory data store that is primarily used as a cache, database, and message broker. Unlike traditional databases that store data on disk, Redis keeps data in memory (RAM), which makes data access extremely fast. It is commonly used in applications where high performance and low latency are critical, such as caching frequently accessed data, managing user sessions, real-time analytics, task queues, and messaging systems.

  2. Window size = 15

  3. Step size = 5

The window starts at the first character. It takes the first 15 characters and stores them as chunk 1:
"Redis is an ope"
Now the window moves; how far it moves depends on the step size. Since we set the step size to 5, the window shifts forward 5 characters. From that new position, it takes the next 15 characters and stores them as chunk 2:
" is an open-sou" (note the leading space)

Roughly, sliding window chunking looks like this.
[Redis [is an [open-source], in-memory d[ata store] that is primarily used as a cache, database], and message broker. Unlike traditional].
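To make the mechanics concrete, here is a minimal sketch in Python. The function name and the character-level splitting are my own illustration, not from any particular library:

```python
def sliding_window_chunks(text, window_size=15, step_size=5):
    # Slide a fixed-size window over the text, moving `step_size`
    # characters forward each time, so consecutive chunks overlap.
    chunks = []
    for start in range(0, len(text), step_size):
        chunks.append(text[start:start + window_size])
        if start + window_size >= len(text):
            break  # the window has reached the end of the text
    return chunks

sample = ("Redis is an open-source, in-memory data store that is "
          "primarily used as a cache, database, and message broker.")
chunks = sliding_window_chunks(sample)
print(repr(chunks[0]))  # 'Redis is an ope'
print(repr(chunks[1]))  # ' is an open-sou'
```

In practice the window and step would be far larger (hundreds of characters or tokens); the tiny values here just mirror the walkthrough above.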

Sliding window chunking is essentially a form of overlapping chunking. Unlike normal overlapping chunking, where we carry over roughly a quarter of the previous chunk, sliding window chunking overlaps much more extensively.

Overlapping chunking has a limitation: if the text contains two unrelated ideas, overlapping brings them close together and forcefully creates a relationship between them. This can produce absurd results. Sliding window chunking carries the same limitation. Token consumption is also higher: since more chunks are generated, a correspondingly larger number of tokens must be processed (the tokens are produced by the embedding model).

Another disadvantage of this approach is that, for a given query point, redundant results will be returned (since there is heavy repetition across chunks).

Where can sliding window chunking be used?
When the items in a text are not closely related to each other and we need to explicitly establish a relationship between them, sliding window chunking can be used. In essence, it links less related items together.

Token-based chunking
The input text is first converted to tokens. A single word or character can be considered a token. Each token is assigned a number (similar to one-hot encoding). These numbers are then sent to the embedding model to generate vector points.
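One common way to do this is with the tiktoken library. A minimal sketch, assuming tiktoken is installed (the function name and chunk size are illustrative):

```python
# pip install tiktoken
import tiktoken

def token_based_chunks(text, max_tokens=100):
    # Convert text to integer token IDs, group the IDs into
    # fixed-size batches, then decode each batch back to text.
    enc = tiktoken.get_encoding("cl100k_base")
    token_ids = enc.encode(text)
    return [enc.decode(token_ids[i:i + max_tokens])
            for i in range(0, len(token_ids), max_tokens)]
```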

When can token-based chunking be used?
When the embedding model enforces rate limits, we can choose this method to send a fixed number of tokens at a time (say 100 or 200). This method is not widely used.

TOON (Token-Oriented Object Notation)
This notation was introduced to send JSON to an LLM in a more compact manner. But it is not very effective.
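Roughly, the idea is to flatten repeated JSON keys into a tabular, indentation-based form. The comparison below is only an approximation of the notation from memory, not an exact spec:

```
JSON:
{"users": [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]}

TOON (approximately):
users[2]{id,name}:
  1,Alice
  2,Bob
```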

Some of the commonly used chunking methodologies are shared in this and the previous post. There isn't a one-size-fits-all chunking method; it varies based on our use case and dataset.

Converting documents to chunks
The tools below convert documents into a text format so that the text can be split into proper chunks. PDFs cannot be processed as-is.

1. PyPDFLoader from LangChain
2. pypdf
3. PyMuPDF, etc.
4. Tesseract (for documents containing scanned pages)
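As an example, here is a minimal sketch of loading a PDF with LangChain's PyPDFLoader, assuming langchain-community and pypdf are installed (the file path is a placeholder):

```python
# pip install langchain-community pypdf
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("sample.pdf")  # "sample.pdf" is a placeholder path
pages = loader.load()               # one Document per page (text + metadata)
print(pages[0].page_content[:200])  # first 200 characters of page 1
```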

Here too, there is no single best tool or package for processing PDFs; it varies based on the document's data. For special elements like tables in a document, there are a few tools that handle them. First we detect the table (for example, by means of regular expressions that look for the whitespace before and after it), and the entire table is converted into one chunk. We can also use tools like Camelot to process tabular data, as sketched below. Sometimes there can also be images in a document, but in a vector DB it is quite difficult to link images and textual data together. This is all about chunking methodologies.
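A minimal sketch of table extraction with Camelot, assuming camelot-py is installed (the file path and page selection are placeholders):

```python
# pip install "camelot-py[cv]"
import camelot

# "report.pdf" is a placeholder; pages="1" limits parsing to the first page
tables = camelot.read_pdf("report.pdf", pages="1")
table_chunk = tables[0].df.to_string()  # whole table serialized as one chunk
print(table_chunk)
```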
