Rajesh Singh
RAG Chunking Strategies That Actually Work (and Why Most Don’t)

So, you’ve decided to build a RAG (Retrieval-Augmented Generation) system. Congrats! 🎉
You’ve just volunteered for one of AI’s most underrated challenges: getting your data to play nice with your LLM.

Let me paint a picture.

You’ve got your fancy LLM humming, the vector DB is ready, and you excitedly upload your first PDF.
Then you ask something innocent like:

“What’s the company policy on remote work?”

…and your LLM replies with:

“The company policy regarding quantum physics states that Schrödinger’s cat may work remotely on Tuesdays.”

Wait, what?!
Welcome to the wild world of RAG data ingestion, where your biggest enemy isn’t the model, but your chunking strategy.

The “Just Chunk It” Fallacy (a.k.a. Fixed-Size Chunking)

Maybe it’s your first day building RAG. You think: how hard can this be?

“I’ll split my text every 500 characters.”

Here’s what happens next:

# Toy demo: 50-char chunks instead of 500 so the breakage fits on screen
text = "The Python Global Interpreter Lock (GIL) is a mutex that protects..."
chunks = [text[i:i+50] for i in range(0, len(text), 50)]

Result:

chunks[0] = "The Python Global Interpreter Lock (GIL) is a mu"
chunks[1] = "tex that protects access to Python objects, pre"

Congratulations — you just dissected the word “mutex.”
Your LLM now thinks you’re talking about Egyptian relics or LaTeX syntax.

✅ The Fix: Smart Chunking (Recursive Character Text Splitting)

You wouldn’t interrupt someone mid-word, right?

Do the same for your text. Use recursive splitting that respects natural boundaries:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]
)

This tells your splitter:

“Start with paragraphs. If that fails, try sentences. Still too long? Fine, use spaces. But please, don’t butcher words.”
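Under the hood, the idea is simple enough to sketch in plain Python. This is a toy illustration of the recursive strategy, not LangChain's actual implementation:

```python
def recursive_split(text, chunk_size=500, separators=("\n\n", "\n", ". ", " ")):
    """Toy recursive splitter: try coarse separators first, recurse with finer ones."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    # Pick the coarsest separator that actually appears in the text
    sep = next((s for s in separators if s in text), None)
    if sep is None:
        # No natural boundary left: hard character split as a last resort
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) > chunk_size:
            # Flush what we have; an oversized piece recurses to finer separators
            chunks.extend(recursive_split(current, chunk_size, separators))
            current = part
        else:
            current = candidate
    chunks.extend(recursive_split(current, chunk_size, separators))
    return chunks
```

With `chunk_size=50`, the GIL sentence from earlier now breaks at word boundaries, so "mutex" survives intact.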

Different documents deserve different chunking styles:

🧠 Technical docs: 800–1000 chars — context matters. (Semantic chunking)
🧑‍💻 Code: Keep functions intact. (Syntax-aware chunking)
📊 Tables: Don’t split them at all. (Structure-preserving chunking)

Metadata Amnesia - The Context Killer

You’ve ingested 500 documents.
A user asks about “Q2 2024 sales strategy,” and your RAG confidently answers using… Q2 2019 data. 🤦‍♂️

Why? Because you chunked the text with no metadata and no context attached.

✅ The Fix: Metadata-Enriched Chunking

chunk_metadata = {
    "source": "Q2_2024_Sales_Strategy.pdf",
    "document_type": "strategic_plan",
    "date": "2024-04-01",
    "department": "Sales",
    "author": "Jane Smith",
    "page": 5,
    "section": "Market Analysis",
    "quarter": "Q2",
    "year": 2024,
    "tags": ["sales", "strategy", "B2B"]
}

Now your vector DB can do this:

results = vectordb.search(
    query="sales strategy",
    filter={"year": 2024, "quarter": "Q2"}
)

Boom: context restored.
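The `vectordb` above is a stand-in for whatever store you use, but the pattern is the same everywhere: attach metadata at ingest time, filter on it at query time. A minimal in-memory sketch (the chunk-dict shape and function names are illustrative, not any library's API):

```python
def make_chunks(text, metadata, chunk_size=500):
    # Every chunk inherits a copy of its parent document's metadata
    pieces = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return [{"text": p, "metadata": dict(metadata)} for p in pieces]

def filter_chunks(chunks, **criteria):
    # The metadata pre-filter a vector DB applies before similarity search
    return [c for c in chunks
            if all(c["metadata"].get(k) == v for k, v in criteria.items())]

chunks = (
    make_chunks("Q2 2024 sales strategy...", {"year": 2024, "quarter": "Q2"})
    + make_chunks("Q2 2019 sales strategy...", {"year": 2019, "quarter": "Q2"})
)
hits = filter_chunks(chunks, year=2024, quarter="Q2")
```

Only the 2024 chunks survive the filter, so similarity search never even sees the 2019 data.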

“PDF Is Just Text” - The Great Delusion

PDFs are not just text. They can be:

  • Scanned images (OCR required)
  • Multi-column layouts
  • Tables of doom
  • Pages full of header/footer noise

Naïve parsing gives you this:

"Product Q1 Sales Q2 Sales Widget $100K $150K"

Your LLM now believes “Q1 Sales Q2 Sales Widget” is a product line. 💀

✅ The Fix: Format-Specific Parsing

For normal PDFs:

import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        tables = page.extract_tables()

For scanned PDFs:

import pytesseract
from pdf2image import convert_from_path

images = convert_from_path('scanned.pdf')
# Join with newlines so text from adjacent pages doesn't fuse together
text = "\n".join(pytesseract.image_to_string(img) for img in images)

Want to know if OCR is needed?

def is_scanned_pdf(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() returns None when the page has no text layer
        text = pdf.pages[0].extract_text() or ""
        return len(text.strip()) < 50
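Putting the two paths together: a small router that tries native extraction first and falls back to OCR. The extractors are passed in as callables (you'd plug in the pdfplumber and pytesseract snippets from above), which keeps the decision logic itself easy to test — the function name and parameters here are my own, not from any library:

```python
def extract_pdf_text(pdf_path, native_extractor, ocr_extractor, min_chars=50):
    """Prefer the native text layer; fall back to OCR for scanned files."""
    # Native extractors may return None for image-only pages
    text = native_extractor(pdf_path) or ""
    if len(text.strip()) >= min_chars:
        return text
    return ocr_extractor(pdf_path)
```

In production you'd also want this per page, not per file — mixed scanned/native PDFs are depressingly common.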

“Tables Are Just Text” - No, They’re Not.

Without structure, your tables turn into gibberish.

Original:

| Metric  | Q1    | Q2     | Q3    |
|---------|-------|--------|-------|
| Revenue | $1M   | $1.5M  | $2M   |
| Costs   | $600K | $700K  | $800K |

Parsed (naïve):

"Metric Q1 Q2 Q3 Revenue $1M $1.5M $2M Costs $600K $700K $800K"

✅ The Fix:

Use Structured-to-Unstructured Transformation.

Option 1 - Markdown Table Serialization

def table_to_markdown(table):
    headers = table[0]
    md = "| " + " | ".join(headers) + " |\n"
    md += "| " + " | ".join(["---"] * len(headers)) + " |\n"
    for row in table[1:]:
        md += "| " + " | ".join(str(cell) for cell in row) + " |\n"
    return md

Option 2 - Natural Language Linearization

def table_to_context(table):
    headers = table[0]
    return "\n".join(
        f"{row[0]}: " + ", ".join(f"{headers[i]} is {row[i]}" for i in range(1, len(row)))
        for row in table[1:]
    )

💡 Pro tip: Do both! Store markdown for humans, and natural language for the LLM.
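Here's one way the pro tip could look in practice: a single chunk carrying both representations (the two helpers are repeated inline so this snippet runs standalone; `table_to_chunk` and the dict shape are my own naming):

```python
def table_to_chunk(table):
    headers = table[0]
    # Markdown serialization, for display to humans
    md = "| " + " | ".join(headers) + " |\n"
    md += "| " + " | ".join(["---"] * len(headers)) + " |\n"
    for row in table[1:]:
        md += "| " + " | ".join(str(cell) for cell in row) + " |\n"
    # Natural-language linearization, for the embedding model
    text = "\n".join(
        f"{row[0]}: " + ", ".join(f"{headers[i]} is {row[i]}" for i in range(1, len(row)))
        for row in table[1:]
    )
    return {"markdown": md, "text": text}

table = [
    ["Metric", "Q1", "Q2", "Q3"],
    ["Revenue", "$1M", "$1.5M", "$2M"],
    ["Costs", "$600K", "$700K", "$800K"],
]
chunk = table_to_chunk(table)
```

Embed `chunk["text"]`, and keep `chunk["markdown"]` in the chunk's metadata so the UI can render the real table.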

“One Size Fits All” Myth (Chunk Size Disaster)

If you believe “500 characters is the magic chunk size,” brace for chaos.
Every document type is a snowflake.

✅ The Fix: Domain-Adaptive Chunking

CHUNK_CONFIGS = {
    "code": {"size": 300, "overlap": 30},
    "legal": {"size": 1000, "overlap": 200},
    "technical_manual": {"size": 800, "overlap": 100},
    "chat_logs": {"size": 200, "overlap": 20},
    "general": {"size": 500, "overlap": 50}
}

def get_chunker(document_type):
    config = CHUNK_CONFIGS.get(document_type, CHUNK_CONFIGS["general"])
    return RecursiveCharacterTextSplitter(
        chunk_size=config["size"],
        chunk_overlap=config["overlap"]
    )

Pick your chunker wisely.
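Of course, `get_chunker` needs a `document_type` from somewhere. One crude but serviceable heuristic is file-extension sniffing — the mapping below is purely illustrative, so tune it to your own corpus:

```python
import os

# Assumed mapping: extend this with whatever file types your pipeline ingests
EXTENSION_TO_TYPE = {
    ".py": "code", ".js": "code", ".ts": "code",
    ".log": "chat_logs",
    ".md": "technical_manual",
}

def detect_document_type(filename):
    ext = os.path.splitext(filename)[1].lower()
    return EXTENSION_TO_TYPE.get(ext, "general")
```

For anything ambiguous (a `.txt` that's actually a legal contract), a cheap classifier over the first page beats guessing from the filename.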

Duplicate Content Problem

Ingest from multiple sources, and soon your system starts echoing itself — five identical answers, all slightly different.

✅ The Fix: Semantic Deduplication

Approach 1 - Fuzzy Matching

from difflib import SequenceMatcher

def are_chunks_similar(c1, c2, threshold=0.85):
    return SequenceMatcher(None, c1, c2).ratio() >= threshold

Approach 2 - Hash-Based Deduplication

import hashlib

def hash_chunk(text):
    normalized = ' '.join(text.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

Because no one likes that “I’ve heard this before” vibe.
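The two approaches combine naturally: hash first (cheap, catches exact and whitespace-only duplicates), then fuzzy-match only the survivors. A sketch of that two-pass pipeline (`dedupe_chunks` is my own helper, and the quadratic fuzzy pass assumes modest batch sizes):

```python
import hashlib
from difflib import SequenceMatcher

def dedupe_chunks(chunks, threshold=0.85):
    seen_hashes, kept = set(), []
    for chunk in chunks:
        # Pass 1: cheap exact dedup on whitespace/case-normalized text
        normalized = " ".join(chunk.lower().split())
        h = hashlib.sha256(normalized.encode()).hexdigest()
        if h in seen_hashes:
            continue
        seen_hashes.add(h)
        # Pass 2: fuzzy comparison against survivors (O(n^2) - fine for small batches)
        if any(SequenceMatcher(None, chunk, k).ratio() >= threshold for k in kept):
            continue
        kept.append(chunk)
    return kept
```

At serious scale, swap the fuzzy pass for MinHash/LSH so you're not comparing every pair.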

Closing Thoughts

RAG ingestion is hard. Documents are messy. And there’s always that one PDF that breaks everything.

But here’s the truth:
Everyone’s RAG system is held together with duct tape and prayer.
Knowing where to apply the duct tape is the trick.

Start simple, then:

  1. Get basic chunking working
  2. Add metadata
  3. Test with real documents
  4. Iterate based on failure

The best RAG system isn’t the one with fancy tech - it’s the one that actually answers user questions correctly.

So go forth, and chunk responsibly.

Got your own RAG horror story? Drop it in the comments😅
If you found this helpful, smash that ❤️ and follow me for more dev adventures!
