So, you’ve decided to build a RAG (Retrieval-Augmented Generation) system. Congrats! 🎉
You’ve just volunteered for one of AI’s most underrated challenges: getting your data to play nice with your LLM.
Let me paint a picture.
You’ve got your fancy LLM humming, the vector DB is ready, and you excitedly upload your first PDF.
Then you ask something innocent like:
“What’s the company policy on remote work?”
…and your LLM replies with:
“The company policy regarding quantum physics states that Schrödinger’s cat may work remotely on Tuesdays.”
Wait, what?!
Welcome to the wild world of RAG data ingestion, where your biggest enemy isn’t the model, but your chunking strategy.
The “Just Chunk It” Fallacy (a.k.a. Fixed-Size Chunking)
Maybe it’s your first day building RAG. You think, how hard can this be?
“I’ll just split my text into fixed-size pieces, say every 50 characters.”
Here’s what happens next:
text = "The Python Global Interpreter Lock (GIL) is a mutex that protects..."
chunks = [text[i:i+50] for i in range(0, len(text), 50)]
Result:
chunks[0] = "The Python Global Interpreter Lock (GIL) is a mu"
chunks[1] = "tex that protects access to Python objects, pre"
Congratulations — you just dissected the word “mutex.”
Your LLM now thinks you’re talking about Egyptian relics or LaTeX syntax.
✅ The Fix: Smart Chunking (Recursive Character Text Splitting)
You wouldn’t interrupt someone mid-word, right?
Do the same for your text. Use recursive splitting that respects natural boundaries:
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]
)
This tells your splitter:
“Start with paragraphs. If that fails, try sentences. Still too long? Fine, use spaces. But please, don’t butcher words.”
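The idea is simple enough to sketch without any library at all. Here’s a toy, dependency-free version of the recursive strategy (it skips the real splitter’s merge-and-overlap logic, so treat it as an illustration, not a replacement):

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ")):
    """Toy recursive splitter: try the coarsest separator first,
    recurse with finer separators on pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # No natural boundary left: fall back to a hard cut
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for part in text.split(sep):
        if len(part) <= chunk_size:
            chunks.append(part)
        else:
            chunks.extend(recursive_split(part, chunk_size, rest))
    return [c for c in chunks if c.strip()]
```

Words only get cut when every separator has been exhausted, which is exactly the behavior you want.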
Different documents deserve different chunking styles:
🧠 Technical docs: 800–1000 chars — context matters. (Semantic chunking)
🧑‍💻 Code: Keep functions intact. (Syntax-aware chunking)
📊 Tables: Don’t split them at all. (Structure-preserving chunking)
Metadata Amnesia - The Context Killer
You’ve ingested 500 documents.
A user asks about “Q2 2024 sales strategy,” and your RAG confidently answers using… Q2 2019 data. 🤦‍♂️
Why? Because you chunked the text with no metadata and no context.
✅ The Fix: Metadata-Enriched Chunking
chunk_metadata = {
    "source": "Q2_2024_Sales_Strategy.pdf",
    "document_type": "strategic_plan",
    "date": "2024-04-01",
    "department": "Sales",
    "author": "Jane Smith",
    "page": 5,
    "section": "Market Analysis",
    "quarter": "Q2",
    "year": 2024,
    "tags": ["sales", "strategy", "B2B"]
}
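Attaching that metadata at ingestion time is a one-liner per chunk. Here’s a hypothetical helper (the `enrich_chunks` name and the `chunk_index` field are my own invention, not a library API):

```python
def enrich_chunks(chunks, metadata):
    """Pair every chunk with its document-level metadata plus its
    position in the document, so each piece stays traceable."""
    return [
        {"text": chunk, "metadata": {**metadata, "chunk_index": i}}
        for i, chunk in enumerate(chunks)
    ]
```

Every chunk now carries its provenance with it into the vector store.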
Now your vector DB can do this:
results = vectordb.search(
    query="sales strategy",
    filter={"year": 2024, "quarter": "Q2"}
)
Boom: context restored.
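The exact filter syntax varies by vector DB (the `vectordb.search` call above is illustrative), but the underlying idea is just predicate matching on metadata. A minimal sketch in plain Python, assuming chunks are stored as `(text, metadata)` pairs:

```python
def filter_chunks(chunks, **conditions):
    """Keep only the (text, metadata) pairs whose metadata
    matches every given key/value condition."""
    return [
        (text, meta)
        for text, meta in chunks
        if all(meta.get(key) == value for key, value in conditions.items())
    ]
```

In production you’d let the vector DB apply this filter before (or alongside) the similarity search, not after.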
“PDF Is Just Text” - The Great Delusion
PDFs are not just text. They can be:
- Scanned images (OCR required)
- Multi-column layouts
- Tables of doom
- Pages full of header/footer noise
Naïve parsing gives you this:
"Product Q1 Sales Q2 Sales Widget $100K $150K"
Your LLM now believes “Q1 Sales Q2 Sales Widget” is a product line. 💀
✅ The Fix: Format-Specific Parsing
For normal PDFs:
import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        tables = page.extract_tables()
For scanned PDFs:
import pytesseract
from pdf2image import convert_from_path
images = convert_from_path('scanned.pdf')
text = "".join(pytesseract.image_to_string(img) for img in images)
Want to know if OCR is needed?
def is_scanned_pdf(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() returns None for image-only pages, so default to ""
        text = pdf.pages[0].extract_text() or ""
    return len(text.strip()) < 50
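You can wire that check into a single entry point that routes each PDF to the right extractor. In this sketch the two extractors are passed in as callables (stand-ins for the pdfplumber and pytesseract paths above), which also makes the routing logic easy to test:

```python
def extract_pdf_text(pdf_path, extract_text, ocr_text, min_chars=50):
    """Try plain text extraction first; fall back to OCR when the
    result is empty or suspiciously short (likely a scanned page)."""
    text = extract_text(pdf_path) or ""
    if len(text.strip()) < min_chars:
        return ocr_text(pdf_path)
    return text
```

The `min_chars` threshold is a heuristic; tune it against your own corpus.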
“Tables Are Just Text” - No, They’re Not.
Without structure, your tables turn into gibberish.
Original:
| Metric | Q1 | Q2 | Q3 |
|---------|-------|--------|-------|
| Revenue | $1M | $1.5M | $2M |
| Costs | $600K | $700K | $800K |
Parsed (naïve):
"Metric Q1 Q2 Q3 Revenue $1M $1.5M $2M Costs $600K $700K $800K"
✅ The Fix:
Use Structured-to-Unstructured Transformation.
Option 1 - Markdown Table Serialization
def table_to_markdown(table):
    headers = table[0]
    md = "| " + " | ".join(headers) + " |\n"
    md += "| " + " | ".join(["---"] * len(headers)) + " |\n"
    for row in table[1:]:
        md += "| " + " | ".join(str(cell) for cell in row) + " |\n"
    return md
Option 2 - Natural Language Linearization
def table_to_context(table):
    headers = table[0]
    return "\n".join(
        f"{row[0]}: " + ", ".join(f"{headers[i]} is {row[i]}" for i in range(1, len(row)))
        for row in table[1:]
    )
💡 Pro tip: Do both! Store markdown for humans, and natural language for the LLM.
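To see the payoff, here are both helpers run on the Revenue table from above (helpers restated so the snippet runs standalone):

```python
def table_to_markdown(table):
    headers = table[0]
    md = "| " + " | ".join(headers) + " |\n"
    md += "| " + " | ".join(["---"] * len(headers)) + " |\n"
    for row in table[1:]:
        md += "| " + " | ".join(str(cell) for cell in row) + " |\n"
    return md

def table_to_context(table):
    headers = table[0]
    return "\n".join(
        f"{row[0]}: " + ", ".join(f"{headers[i]} is {row[i]}" for i in range(1, len(row)))
        for row in table[1:]
    )

table = [
    ["Metric", "Q1", "Q2", "Q3"],
    ["Revenue", "$1M", "$1.5M", "$2M"],
    ["Costs", "$600K", "$700K", "$800K"],
]

print(table_to_context(table))
# Revenue: Q1 is $1M, Q2 is $1.5M, Q3 is $2M
# Costs: Q1 is $600K, Q2 is $700K, Q3 is $800K
```

The linearized form embeds far better, because every cell is spelled out next to its row and column labels.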
“One Size Fits All” Myth (Chunk Size Disaster)
If you believe “500 characters is the magic chunk size,” brace for chaos.
Every document type is a snowflake.
✅ The Fix: Domain-Adaptive Chunking
CHUNK_CONFIGS = {
    "code": {"size": 300, "overlap": 30},
    "legal": {"size": 1000, "overlap": 200},
    "technical_manual": {"size": 800, "overlap": 100},
    "chat_logs": {"size": 200, "overlap": 20},
    "general": {"size": 500, "overlap": 50}
}

def get_chunker(document_type):
    config = CHUNK_CONFIGS.get(document_type, CHUNK_CONFIGS["general"])
    return RecursiveCharacterTextSplitter(
        chunk_size=config["size"],
        chunk_overlap=config["overlap"]
    )
Pick your chunker wisely.
Duplicate Content Problem
Ingest from multiple sources, and soon your system starts echoing itself — five identical answers, all slightly different.
✅ The Fix: Semantic Deduplication
Approach 1 - Fuzzy Matching
from difflib import SequenceMatcher
def are_chunks_similar(c1, c2, threshold=0.85):
    return SequenceMatcher(None, c1, c2).ratio() >= threshold
Approach 2 - Hash-Based Deduplication
import hashlib
def hash_chunk(text):
    normalized = ' '.join(text.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()
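In practice you want both: hashing catches exact (and whitespace/case-normalized) duplicates cheaply, and fuzzy matching mops up the near-duplicates the hash misses. One way to combine them, sketched here as a single pass (the quadratic fuzzy check is fine for small batches; at scale you’d compare embeddings instead):

```python
import hashlib
from difflib import SequenceMatcher

def dedupe_chunks(chunks, threshold=0.85):
    """Drop exact duplicates via normalized hashes (cheap),
    then near-duplicates via fuzzy matching (expensive)."""
    seen_hashes, kept = set(), []
    for chunk in chunks:
        normalized = ' '.join(chunk.lower().split())
        h = hashlib.sha256(normalized.encode()).hexdigest()
        if h in seen_hashes:
            continue
        if any(SequenceMatcher(None, chunk, k).ratio() >= threshold for k in kept):
            continue
        seen_hashes.add(h)
        kept.append(chunk)
    return kept
```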
Because no one likes that “I’ve heard this before” vibe.
Closing Thoughts
RAG ingestion is hard. Documents are messy. And there’s always that one PDF that breaks everything.
But here’s the truth:
Everyone’s RAG system is held together with duct tape and prayer.
Knowing where to apply the duct tape is the trick.
Start simple, then:
- Get basic chunking working
- Add metadata
- Test with real documents
- Iterate based on failure
The best RAG system isn’t the one with fancy tech - it’s the one that actually answers user questions correctly.
So go forth, and chunk responsibly.
Got your own RAG horror story? Drop it in the comments😅
If you found this helpful, smash that ❤️ and follow me for more dev adventures!