What is chunking?
Chunking is one of the steps in a RAG pipeline: dividing a large document into several smaller parts. Each part is called a chunk. Let's consider the following passage:
Redis is a high-speed, in-memory data structure store that functions as a database, cache, message broker, and streaming engine. It is widely used for real-time applications because it keeps data in RAM rather than on disk, enabling sub-millisecond response times. Unlike traditional databases (like MySQL or PostgreSQL) that read from a hard drive, Redis operates in the computer's main memory, which is significantly faster.
Suppose we give the whole passage to the embedding model. It generates a single point (let's call it P1), which is stored in the vector DB. There is a small problem with this approach. If I ask a query like "How does Redis function?", the intended answer is "database, cache, message broker, and streaming engine". However, since the entire passage is stored as a single point, retrieval can't pick out that specific part; it returns the entire passage. To retrieve only the relevant part and leave out the irrelevant parts, chunking is very important.
Chunking can be performed in two ways:
- Discrete chunking
- Semantic chunking
How small should a chunk be?
If I ask an LLM "How are you?" and it answers "The sun rises in the east", the statement is not wrong; it is just irrelevant to the question. An LLM often won't simply say "I don't know"; it tries to make up some answer. With chunking, we control which pieces of text are retrieved and handed to the LLM, and that in turn shapes the answer it gives.
Discrete chunking
Discrete chunking uses fixed logic to generate chunks. Let's see some types of discrete chunking:
Fixed Chunking
If I set the size to 25 characters, each chunk will contain exactly 25 characters (only the last chunk may be shorter). In a paragraph, the first 25 characters go into chunk 1, the next 25 characters go into chunk 2, and so on. If I split the Redis passage this way, the first chunk would be "Redis is a high-speed, in" and the second chunk would be "-memory data structure st". Looking at these chunks, we can see that the meaning of the words is lost due to splitting. What can we infer from the chunk "Redis is a high-speed, in"? The meaning is lost, right?
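Fixed chunking as described above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `fixed_chunks` is hypothetical, and it counts every character, including spaces and punctuation, toward the 25.

```python
def fixed_chunks(text: str, size: int = 25) -> list[str]:
    """Split text into consecutive pieces of exactly `size` characters.

    The last piece may be shorter if the text length is not a multiple of size.
    """
    if size <= 0:
        raise ValueError("size must be a positive integer")
    return [text[i:i + size] for i in range(0, len(text), size)]


# First sentence of the Redis passage used in this post.
passage = (
    "Redis is a high-speed, in-memory data structure store that functions "
    "as a database, cache, message broker, and streaming engine."
)

chunks = fixed_chunks(passage, size=25)
print(chunks[0])  # Redis is a high-speed, in
print(chunks[1])  # -memory data structure st
```

The cut lands in the middle of "in-memory" and "store", which is exactly the loss of meaning discussed above.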
How can we do chunking better here?
Instead of cutting at exactly 25 characters, we can continue until the sentence is complete, i.e. take 25 characters and then keep going until the next full stop. In this case, chunk 1 would be "Redis is a high-speed, in-memory data structure store that functions as a database, cache, message broker, and streaming engine."
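A rough sketch of this "25 characters, then continue to the full stop" idea is shown below. The function name `chunks_to_sentence_end` and the `min_size` parameter are made up for illustration, and the code assumes that every "." ends a sentence, which is true for the Redis passage but would break on abbreviations or decimal numbers.

```python
def chunks_to_sentence_end(text: str, min_size: int = 25) -> list[str]:
    """Take at least `min_size` characters, then extend to the next full stop."""
    if min_size <= 0:
        raise ValueError("min_size must be a positive integer")
    chunks = []
    start = 0
    while start < len(text):
        # Earliest position where a chunk of min_size characters could end.
        end = text.find(".", start + min_size - 1)
        if end == -1:
            # No full stop left: take the rest of the text.
            end = len(text) - 1
        chunk = text[start:end + 1].strip()
        if chunk:
            chunks.append(chunk)
        start = end + 1
    return chunks


# The Redis passage used in this post.
redis_passage = (
    "Redis is a high-speed, in-memory data structure store that functions "
    "as a database, cache, message broker, and streaming engine. "
    "It is widely used for real-time applications because it keeps data in "
    "RAM rather than on disk, enabling sub-millisecond response times. "
    "Unlike traditional databases (like MySQL or PostgreSQL) that read from "
    "a hard drive, Redis operates in the computer's main memory, which is "
    "significantly faster."
)

for number, chunk in enumerate(chunks_to_sentence_end(redis_passage), start=1):
    print(f"chunk {number}: {chunk}")
```

Because every sentence in this passage is longer than 25 characters, each chunk ends up being exactly one full sentence.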
Overlapping chunks
As the heading suggests, some words are shared between neighbouring chunks. Consider the first sentence, "Redis is a high-speed, in-memory data structure store that functions as a database, cache, message broker, and streaming engine.", and the second sentence, "It is widely used for real-time applications because it keeps data in RAM rather than on disk, enabling sub-millisecond response times." If overlapping chunking is applied, a few words from the end of the previous sentence are added to the start of the next one, i.e.
Chunk 1 would be "Redis is a high-speed, in-memory data structure store that functions as a database, cache, message broker, and streaming engine." and chunk 2 would be "database, cache, message broker, and streaming engine. It is widely used for real-time applications because it keeps data in RAM rather than on disk, enabling sub-millisecond response times."
Sometimes the points for two chunks end up far from each other in the vector space even though the texts are closely related. Overlapping chunks reduce this to some extent, because neighbouring chunks share some of their text.
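The overlapping example above can be reproduced with a small Python sketch. The function name `overlapping_sentence_chunks` and the `overlap_words` parameter are assumptions made for illustration; the overlap of 7 words is chosen only because it matches the example in this post, not because it is a recommended value. Sentences are split naively at a full stop followed by whitespace.

```python
import re


def overlapping_sentence_chunks(text: str, overlap_words: int = 7) -> list[str]:
    """One chunk per sentence, prefixed with the last words of the previous one."""
    # Naive sentence split: break at whitespace that follows a full stop.
    sentences = [s.strip() for s in re.split(r"(?<=\.)\s+", text) if s.strip()]
    chunks = []
    for index, sentence in enumerate(sentences):
        if index == 0 or overlap_words <= 0:
            # The first chunk has no previous sentence to borrow from.
            chunks.append(sentence)
            continue
        tail = sentences[index - 1].split()[-overlap_words:]
        chunks.append(" ".join(tail + [sentence]))
    return chunks


# First two sentences of the Redis passage used in this post.
two_sentences = (
    "Redis is a high-speed, in-memory data structure store that functions "
    "as a database, cache, message broker, and streaming engine. "
    "It is widely used for real-time applications because it keeps data in "
    "RAM rather than on disk, enabling sub-millisecond response times."
)

overlapped = overlapping_sentence_chunks(two_sentences, overlap_words=7)
print(overlapped[1])
# database, cache, message broker, and streaming engine. It is widely used
# for real-time applications because it keeps data in RAM rather than on
# disk, enabling sub-millisecond response times.
```

A larger overlap keeps more shared context between neighbours but also stores more duplicated text, so the value is a trade-off to tune for your own documents.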