DEV Community

Ramya Perumal

RAG - Sliding Window, Token Based Chunking and PDF Chunking Packages

Sliding Window Chunking

Sliding window chunking is a more compute-intensive chunking method, because it produces many overlapping chunks from the same text.

In this method, a window size is defined based on a character or token limit. Instead of creating completely separate chunks, the window moves forward gradually while keeping part of the previous content.

  • The character or token limit is called the window size
  • The amount the window moves forward each time is called the step size

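This is a stricter form of overlapping chunking: the overlap is fixed by the two sizes (window size minus step size) rather than added as a small margin between chunks.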

How it Works

Suppose:

Window size = 500 characters
Step size = 100 characters

The first chunk may contain characters 1–500.
The second chunk starts after moving 100 characters and may contain characters 101–600.

Consecutive chunks here share 400 characters. Because of this overlap, related information is repeatedly included across chunks.
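The example above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function name `sliding_window_chunks` and its defaults are assumptions chosen to match the numbers in this post.

```python
def sliding_window_chunks(text: str, window_size: int = 500, step_size: int = 100) -> list[str]:
    """Split text into overlapping chunks of up to window_size characters.

    Hypothetical helper: the window advances by step_size characters each time.
    """
    if window_size <= 0 or step_size <= 0:
        raise ValueError("window_size and step_size must be positive")

    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + window_size])
        # Stop once the current window already reaches the end of the text.
        if start + window_size >= len(text):
            break
        start += step_size
    return chunks
```

Python slices are zero-based, so `text[0:500]` corresponds to characters 1–500 and `text[100:600]` to characters 101–600. For a 1,000-character input with these sizes, the function returns six chunks.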

Benefits

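The major benefit of this method is that a sentence and its surrounding context appear together in several chunks, so a fact is less likely to be cut off at a chunk boundary. Overlapping chunks also produce similar embeddings, so they sit close together in the vector database, almost like clusters. This improves retrieval in scenarios where context changes frequently.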

Disadvantages

Problem 1: Higher Token Consumption

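Since overlapping text is embedded repeatedly, the embedding model processes more tokens. With a window of 500 and a step of 100, each character is embedded about five times. This raises the bill for hosted embedding APIs; local embedding models avoid the per-token fee but still spend the extra compute and storage.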
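To get a feel for the overhead, the arithmetic can be written out. The sketch below is an assumption-laden estimate in characters rather than tokens, and `embedding_overhead` is a hypothetical helper name.

```python
import math


def embedding_overhead(text_len: int, window_size: int, step_size: int) -> float:
    """Return how many times, on average, each character gets embedded."""
    if text_len <= 0:
        return 0.0
    if text_len <= window_size:
        n_chunks = 1
    else:
        # One chunk at offset 0, then one per step until the window reaches the end.
        n_chunks = math.ceil((text_len - window_size) / step_size) + 1
    # The last chunk may be shorter than the window, hence the min().
    embedded = sum(min(window_size, text_len - i * step_size) for i in range(n_chunks))
    return embedded / text_len
```

For a 10,000-character document, a window of 500 with a step of 100 gives a ratio of 4.8, while a step equal to the window (no overlap) gives 1.0.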

Problem 2: Duplicate Retrieval

Because overlapping chunks have very similar embeddings, the retriever may return multiple duplicate or nearly identical chunks instead of different contexts, and all of them are then passed to the LLM.

As a result:

  • Context diversity decreases
  • Token usage increases
  • Retrieval efficiency may drop
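One possible way to soften this problem (not something the method itself provides) is to filter retrieved hits that overlap too much before sending them to the LLM. The sketch below assumes each hit carries the chunk's start offset and a similarity score, and that all hits come from the same document; the name `drop_overlapping_hits` and the 0.5 threshold are illustrative assumptions.

```python
def drop_overlapping_hits(hits, window_size, max_overlap=0.5):
    """Keep the best-scoring hits whose text does not overlap too much.

    hits: list of (start_offset, score) tuples from one document.
    max_overlap: largest allowed shared fraction of the window (assumed value).
    """
    kept = []
    # Visit hits from highest to lowest score so the best one always survives.
    for start, score in sorted(hits, key=lambda h: h[1], reverse=True):
        overlaps = (
            max(0, window_size - abs(start - kept_start)) / window_size
            for kept_start, _ in kept
        )
        if all(o <= max_overlap for o in overlaps):
            kept.append((start, score))
    return kept
```

With a window of 500, hits starting at offsets 0, 100 and 200 share 80% and 60% of their text with the first one, so only the best of the three is kept.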

Where Sliding Window Chunking is Useful

Sliding window chunking is useful in scenarios where context switching happens frequently.

Example: Source Code

In coding-related datasets:

  • Different parts of the code may not be directly related
  • One service or module may trigger another service indirectly
  • Important context may exist across multiple sections

For example, in microservices architecture:

  • One service event may trigger another service
  • Related logic may exist in different files or services

Sliding window chunking helps preserve such relationships when the related logic sits close together in the same text, because no boundary cuts it apart, even though it comes with higher token consumption.

Token Based Chunking

Token-based chunking mainly focuses on cost optimization and model limitations.

LLMs process text as tokens rather than words.

Depending on the tokenizer and model:

  • One word may become a single token
  • One word may become multiple tokens
  • Sometimes even a single character can become a token

Since models have token input limits, token-based chunking is used to ensure the content stays within the allowed token size.

In this method:

  1. Text is split based on token count
  2. Chunks are converted into embedding vectors
  3. Vectors are stored in the vector database

This method is mainly used when working with strict token constraints.
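Step 1 can be sketched as follows. The tokenizer here is a deliberately simple stand-in (`simple_tokenize`, a hypothetical helper that splits on words and punctuation); a real system would use the tokenizer that belongs to its embedding model, and token counts would differ.

```python
import re
from typing import Callable


def simple_tokenize(text: str) -> list[str]:
    """Stand-in tokenizer (assumption): words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def token_chunks(
    text: str,
    max_tokens: int = 256,
    tokenize: Callable[[str], list[str]] = simple_tokenize,
) -> list[list[str]]:
    """Split text into consecutive chunks of at most max_tokens tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = tokenize(text)
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]
```

Passing the tokenizer as a parameter keeps the chunking logic independent of any particular model; each returned chunk is a list of tokens that never exceeds the limit.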

TOON (Token-Oriented Object Notation)

TOON stands for Token-Oriented Object Notation.

It is an alternative representation format designed to reduce token usage compared to JSON.

JSON is human-readable, but it repeats the same keys and structure for every object in an array, which increases token consumption.

TOON declares the keys of a uniform array once and then lists only the values for each row, so the same information is represented in a more token-efficient format.

The purpose is to reduce embedding and inference cost while preserving context.
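The idea can be illustrated with a small sketch that turns a uniform list of flat records into a TOON-style table. This is a simplified illustration and not the official TOON encoder: the helper `to_toon_table` is hypothetical, and it ignores quoting, nesting and other rules of the format.

```python
import json


def to_toon_table(name: str, rows: list[dict]) -> str:
    """Render flat, uniform records as a TOON-style table (simplified sketch)."""
    if not rows:
        raise ValueError("this sketch only handles non-empty lists")
    fields = list(rows[0].keys())
    if any(list(row.keys()) != fields for row in rows):
        raise ValueError("all rows must share the same keys in the same order")

    # The keys appear once in the header instead of once per record.
    header = name + "[" + str(len(rows)) + "]{" + ",".join(fields) + "}:"
    lines = ["  " + ",".join(str(row[f]) for f in fields) for row in rows]
    return "\n".join([header] + lines)


users = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
as_json = json.dumps({"users": users})
as_toon = to_toon_table("users", users)
print(as_toon)
print(len(as_json), len(as_toon))  # character counts, only a rough proxy for tokens
```

The first `print` shows the header `users[2]{id,name}:` followed by one indented line of values per record. Character length is used here only as a rough stand-in for token count; the real saving depends on the tokenizer.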

LLMLingua

LLMLingua is a framework used for prompt compression.

It uses a small language model to identify and remove low-information tokens from a prompt, producing a shorter version that aims to preserve the original meaning and context.

The main goals are:

  • Reduce token consumption
  • Lower inference cost
  • Improve efficiency

However, aggressive compression may sometimes reduce retrieval or answer quality compared with the original, uncompressed JSON or text.

Summary of Chunking Methods

The commonly used chunking methods are:

  • Fixed Chunking
  • Overlapping Chunking
  • Semantic Chunking
  • Embedding-Based Chunking
  • Sliding Window Chunking
  • Token-Based Chunking

These methods represent different approaches and trade-offs.

In real-world applications, multiple chunking methods are often combined depending on:

  • Dataset type
  • Retrieval quality
  • Cost constraints
  • Token limitations
  • Application requirements

There is no single chunking strategy that works best for every dataset.

PDF Reading in RAG Systems

To process documents such as internal company communication files, PDFs must first be converted into readable text.

Several libraries are commonly used for this purpose, either directly or through LangChain document loaders:

  • PyPDFLoader (a LangChain loader built on pypdf)
  • pypdf
  • PyMuPDF

Different document types require different processing approaches. A single package may not work effectively for all document formats.

Challenges in PDF Processing

Documents may contain:

  • Scanned images
  • Multi-column layouts
  • Tables
  • Handwritten text
  • Two-sided scanned pages

Because of this, preprocessing becomes an important step.

Tools Used in PDF Processing

Camelot

Camelot is commonly used to extract table content from text-based PDFs. It does not work on scanned pages, which need OCR first.

Tesseract

Tesseract is an OCR engine. It, or a computer vision model, is used to convert scanned images into machine-readable text.

Final RAG Flow for Documents

  1. Raw documents are collected
  2. Images, tables, and scanned content are converted into text
  3. Data is cleaned
  4. Documents are split into chunks using chunking methods
  5. Chunks are converted into embedding vectors
  6. Vectors are stored in the vector database for retrieval
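Steps 3 to 6 can be sketched end to end with the standard library alone. Everything here is a toy stand-in under stated assumptions: `toy_embed` is a hashed bag-of-words vector, not a real embedding model, and the "vector database" is a plain Python list. Step 2 (OCR and table extraction) is assumed to have already produced plain text.

```python
import math
import re
import zlib


def clean(text: str) -> str:
    """Step 3: collapse whitespace left over from extraction."""
    return re.sub(r"\s+", " ", text).strip()


def chunk(text: str, size: int = 200) -> list[str]:
    """Step 4: fixed-size character chunks (any chunking method could go here)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Step 5 stand-in (assumption): hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for word in re.findall(r"\w+", text.lower()):
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(documents: list[str], size: int = 200) -> list[tuple[list[float], str]]:
    """Step 6: store (vector, chunk text) pairs in an in-memory list."""
    index = []
    for doc in documents:
        for piece in chunk(clean(doc), size):
            index.append((toy_embed(piece), piece))
    return index


def search(index, query: str, k: int = 3) -> list[str]:
    """Retrieve the k chunks with the highest cosine similarity to the query."""
    q = toy_embed(query)
    ranked = sorted(index, key=lambda item: -sum(a * b for a, b in zip(q, item[0])))
    return [text for _, text in ranked[:k]]
```

In a real pipeline, `toy_embed` would be replaced by calls to an embedding model and the list by a vector database, but the order of the steps stays the same.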
