DEV Community: Yash Bhoskar

Agentic Chunking - Why Your RAG Pipeline Is Quietly Failing (And How to Fix It)

Yash Bhoskar — Thu, 25 Jun 2026 16:21:48 +0000

The Real Reason RAG Fails

You embedded your docs. You picked a great model. Answers are still shallow and wrong.

The culprit isn't your model. It's your chunks.

Split ideas in the wrong place and your retriever returns broken context. Your model hallucinates to fill the gaps. Traditional chunking optimizes for speed — agentic chunking optimizes for understanding.

The Chunking Showdown

Method	How It Works	Strengths	Weaknesses
Fixed-Size	Split every N tokens	Fast, cheap	Cuts mid-idea, mixes unrelated concepts
Semantic	Split at similarity boundaries	More natural breaks	Still static, misses long-range links
Proposition-Based	Extract atomic facts first	High granularity, factually precise	Can feel fragmented without smart grouping
Agentic	LLM decides chunk membership + evolves metadata	Meaning-first, dynamic, coherent	More LLM calls, higher indexing cost

What Makes Agentic Chunking Different

It behaves like a good editor, not a pair of scissors:

Generalizes across vocabulary — apples, pizza, and sushi all become food_preferences
Evolves chunk metadata dynamically — titles and summaries refresh as new content is added, improving retrieval ranking
Handles real-world mess — blogs, research notes, and docs that repeat or shift topics don't break it

The Core Loop

Built on the Dense X Retrieval paper (arXiv:2312.06648), which found that propositions outperform sentences and passages as retrieval units:


for proposition in extract_propositions(document):
    chunk_id = find_relevant_chunk(proposition, chunk_outline)
    if chunk_id:
        add_to_chunk(chunk_id, proposition)
        refresh_metadata(chunk_id)
    else:
        create_new_chunk(proposition)

For reliable structured output from the LLM:

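from pydantic import BaseModel  # str | None below needs Python 3.10+

class ChunkID(BaseModel):
    chunk_id: str | None = None  # None means no existing chunk fits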
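
To make the loop concrete, here is a minimal, self-contained sketch. It is an illustration under stated assumptions, not the paper's or LangChain's implementation: `Chunk`, `agentic_chunk`, `topic_of`, `toy_assign`, and `toy_title` are hypothetical names, and the two toy functions use a keyword lookup where a real pipeline would call an LLM (for example, one constrained to the `ChunkID` schema above).

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    title: str = ""
    propositions: list = field(default_factory=list)


def agentic_chunk(propositions, assign, make_title):
    """Group propositions into chunks, refreshing each title as its chunk grows.

    assign(proposition, outline) returns an existing chunk id or None.
    make_title(propositions) returns a fresh title for a chunk.
    Both are placeholders for LLM calls.
    """
    chunks = {}
    for prop in propositions:
        # The outline is what the "agent" sees: chunk ids and current titles
        outline = {cid: chunk.title for cid, chunk in chunks.items()}
        chunk_id = assign(prop, outline)
        if chunk_id is None or chunk_id not in chunks:
            chunk_id = f"chunk_{len(chunks)}"  # no match: open a new chunk
            chunks[chunk_id] = Chunk()
        chunks[chunk_id].propositions.append(prop)
        # Refresh metadata every time the chunk changes
        chunks[chunk_id].title = make_title(chunks[chunk_id].propositions)
    return chunks


# Toy stand-ins for the LLM (assumption: keyword lookup instead of reasoning)
FOOD_WORDS = {"apples", "pizza", "sushi"}


def topic_of(text):
    words = set(text.lower().replace(".", "").split())
    return "food_preferences" if words & FOOD_WORDS else "other"


def toy_assign(prop, outline):
    topic = topic_of(prop)
    return next((cid for cid, title in outline.items() if title == topic), None)


def toy_title(props):
    return topic_of(props[0])


facts = [
    "Sam likes apples.",
    "Sam likes pizza.",
    "The meeting is on Friday.",
    "Sam dislikes sushi.",
]
chunks = agentic_chunk(facts, toy_assign, toy_title)
for chunk_id, chunk in chunks.items():
    print(chunk_id, chunk.title, chunk.propositions)
```

The three food facts end up in one chunk even though they are not adjacent in the input, which is the behavior a position-based splitter cannot give you.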

Want to try this in LangChain? The community prompt is live here:
🔗 kumja/proposal-indexing on LangSmith Hub

🌽 The Corn Test

Your knowledge base contains three corn-related facts: fresh corn, corn tortillas, and high-fructose corn syrup.

A fixed-size chunker mashes them together. A query about healthy snacks retrieves the corn syrup fact too — poisoning your context.

An agentic chunker separates them into fresh_produce, traditional_cuisine, and food_additives. The right chunk, the right query, the right answer.

If your chunking can handle corn, it can handle production data.

Honest Tradeoff

Need ultra-low cost and latency? Semantic chunking is fine.

Need retrieval quality you can trust? Agentic chunking is worth the extra LLM calls — especially for long-form docs, overlapping topics, or any pipeline where a wrong answer has real consequences.

FAQs

Is agentic chunking the same as proposition-based chunking?
No — propositions are the raw material. Agentic chunking is the assembly process that groups them intelligently.

Does it work with any embedding model?
Yes. It's a pre-processing step. Embed with whatever you prefer after chunking.

Best model to use as the agent?
Smaller models (Haiku, GPT-4o-mini) handle chunk assignment well. Use larger models for generating high-quality titles and summaries.

Good for real-time ingestion?
Best for batch indexing. For real-time, run semantic chunking fast and agentic chunking async in the background.

📄 Research: Chen et al. (2023). Dense X Retrieval. arXiv:2312.06648
🔗 LangSmith Prompt: kumja/proposal-indexing

Docling - AI-Powered Document Pipeline for LLMs & RAG

Yash Bhoskar — Thu, 25 Jun 2026 16:18:20 +0000

If you've ever tried feeding a PDF into an LLM and wondered why the output was garbage — the problem wasn't your model. It was your parser.
Docling is an open-source document AI pipeline by IBM Research that goes far beyond text extraction. Unlike traditional tools like pypdf or pdfplumber, Docling uses deep learning to understand document structure — reconstructing tables, fixing reading order, and producing clean, LLM-ready output. Whether you're building a RAG system, processing financial reports, or ingesting research papers, Docling is the document intelligence layer your pipeline is missing.

It doesn’t just extract content — it reconstructs the meaningful layout of a document.

Why Docling Beats Traditional Parsers

Let’s be honest — traditional libraries were never built for AI workflows.

Feature	Traditional Parsers	Docling
Text Extraction	✅	✅
Layout Understanding	❌	✅
Table Reconstruction	❌ (messy text)	✅ (structured grid)
Multi-format Support	Limited	Extensive
Reading Order	Broken in columns	Correct
Chunking for LLMs	Manual	Built-in
Metadata Awareness	❌	✅

The Real Problem with Traditional Tools

Traditional tools:

Extract text based on positions, not meaning
Break tables into unreadable blobs
Completely mess up multi-column layouts
Lose context like headings, sections, and hierarchy

Result: Garbage input → Poor LLM output

Docling’s Edge

Docling flips the game:

Uses deep learning models (not heuristics)
Understands document structure like a human
Outputs clean, structured, LLM-ready data

This is not parsing — this is document intelligence.

Multi-Format Support (One Pipeline to Rule Them All)

Docling isn’t just for PDFs.

It seamlessly handles:

PDF
Word (.docx)
PowerPoint (.pptx)
Excel (.xlsx)
HTML / Markdown
Images (PNG, JPEG, TIFF)
AsciiDoc

You can run a single pipeline across mixed document types — something traditional tools simply can’t do.

The Parsing Phase — Where Docling Truly Shines

Layout Understanding (DocLayNet)

Docling uses a layout model trained on the DocLayNet dataset to identify:

Headings
Paragraphs
Tables
Figures
Captions
Footnotes
Lists
Code blocks

It doesn’t just see text — it understands what that text is.

Table Parsing (TableFormer)

Traditional tools butcher tables.

Docling uses TableFormer to:

Reconstruct full table grids
Handle merged cells
Understand multi-line headers
Preserve row/column relationships

Output = Clean, structured data (not scrambled text)

Figure & Chart Detection

Extracts figures as images
Links them with captions
Maintains document context

⚠️ Note: It does not interpret chart data — only isolates it cleanly.

🔍 OCR (But Done Right)

For scanned documents:

Uses EasyOCR / Tesseract
Maintains layout-aware reading order

No more left-to-right OCR chaos.

Reading Order Recovery

This is a silent killer in PDFs.

Docling:

Fixes multi-column reading
Reconstructs logical flow
Makes documents actually readable for LLMs

Chunking — Built for RAG (This is Gold)

If you're building RAG systems, this is where Docling delivers the most value.

Hierarchical Chunking

Respects structure (heading → section → paragraph)
No random splits mid-sentence

Hybrid Chunking

Combines:
- Semantic structure
- Token limits

Perfect chunks for LLM context windows

Context Preservation

Each chunk carries:

Page number
Bounding box
Section hierarchy

Retrieval becomes accurate + explainable

Tables & Figures Stay Intact

Tables are never split
Figures remain atomic

No more broken context in retrieval
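
Here is a rough sketch of what that looks like in code. It assumes the `docling` package is installed with its chunking dependencies and that `HybridChunker` and the chunk metadata fields used here (`meta.headings`, `meta.doc_items`, `prov`, `page_no`) match the version you run, so check the Docling docs before relying on it. `iter_chunks` and `chunk_provenance` are illustrative helper names, not part of Docling.

```python
def iter_chunks(source):
    """Convert a document and yield Docling chunks (needs `pip install docling`)."""
    # Imported lazily because Docling pulls in heavy model dependencies.
    from docling.chunking import HybridChunker
    from docling.document_converter import DocumentConverter

    doc = DocumentConverter().convert(source).document
    yield from HybridChunker().chunk(dl_doc=doc)


def chunk_provenance(chunk):
    """Collect the section headings and page numbers attached to one chunk."""
    pages = sorted(
        {prov.page_no for item in chunk.meta.doc_items for prov in item.prov}
    )
    return {"headings": list(chunk.meta.headings or []), "pages": pages}


# Usage (assumes a local file named report.pdf):
# for chunk in iter_chunks("report.pdf"):
#     print(chunk_provenance(chunk), chunk.text[:80])
```

Storing the headings and pages next to each embedding is what lets you show a user where an answer came from.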

DoclingDocument — The Secret Sauce

Instead of raw text, Docling outputs a:

`DoclingDocument`

A structured representation of:

Entire document hierarchy
Layout elements
Metadata

You can export it as:

Markdown
JSON
HTML

This makes the pipeline fully composable
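
A minimal sketch of the convert-then-export flow, assuming the `docling` package is installed and that the export method names below match your installed version. `export_all` is a hypothetical helper, and the optional `converter` argument exists only so the function can be exercised without loading Docling's models.

```python
import json


def export_all(source, converter=None):
    """Convert one document and return Markdown, JSON, and HTML strings."""
    if converter is None:
        # Imported lazily: Docling loads heavy model dependencies.
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()
    doc = converter.convert(source).document  # a DoclingDocument
    return {
        "markdown": doc.export_to_markdown(),
        "json": json.dumps(doc.export_to_dict()),
        "html": doc.export_to_html(),
    }


# Usage (assumes a local file named report.pdf):
# outputs = export_all("report.pdf")
# print(outputs["markdown"][:500])
```

Markdown is usually the handiest format for prompts, while the JSON form keeps the full hierarchy for downstream tooling.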

Plug-and-Play with LLM Ecosystems

Docling integrates with the wider LLM tooling ecosystem, so you can drop it straight into your RAG pipeline as the ingestion layer.

⚠️ What Docling Isn’t Perfect At

Let’s keep it real:

❌ No chart-to-data interpretation
🐢 Slow for very large documents (200+ pages)
⚖️ Overkill for simple text PDFs

When Should You Use Docling?

Use Docling when working with:

📄 Research papers
📊 Financial reports
📘 Technical documentation
📜 Contracts

Basically — anything with structure

💡 When NOT to Use It

Skip Docling if:

You just need plain text extraction
Your documents are extremely simple

In those cases, lighter tools are faster.

Bonus: Notebook for Hands-On Usage

A full notebook is attached where you can explore Docling in action and integrate it efficiently into your pipeline.

Final Thoughts

Docling isn’t just another parser — it’s a foundation layer for Document AI systems.

If traditional tools are:

“Extract text and hope for the best”

Docling is:

“Understand the document, preserve its meaning, and make it LLM-ready”

🧠 My Take

As LLM applications grow, input quality matters more than model size.

Docling solves the real bottleneck:
👉 Turning messy documents into structured, meaningful data

And that’s exactly why it stands out.

Deep Dive into Semantic Chunking for RAG

Yash Bhoskar — Thu, 25 Jun 2026 16:12:33 +0000

In the previous article, Different Chunking Methods for RAG, we explored several strategies used to split documents before feeding them into a Retrieval-Augmented Generation (RAG) pipeline.

In this chapter, we’ll go deeper into Semantic Chunking — one of the most powerful techniques for improving retrieval accuracy in modern RAG systems.

We’ll cover:

What semantic chunking actually means
How it works internally
Why it improves retrieval accuracy
How it compares to other chunking strategies used in production systems

Why Traditional Chunking Often Fails

Most early RAG pipelines relied on fixed-size chunking, where documents are split into chunks of a predefined size (for example, 500 tokens with a 50-token overlap).

While this approach is simple, it introduces a fundamental problem:

it ignores the semantic structure of the text.

For example, imagine a paragraph discussing transformer architectures, followed by another paragraph explaining reinforcement learning. A fixed-size splitter might cut the text in the middle of the explanation, creating chunks that contain partial or mixed topics.

This leads to two common issues:

Context fragmentation – important ideas get split across chunks.
Noisy retrieval – chunks contain unrelated information.

When these chunks are retrieved during query time, the LLM receives incomplete or irrelevant context, which directly reduces answer quality.

What is Semantic Chunking?

Semantic chunking is a strategy that splits documents based on meaning rather than size.

Instead of arbitrarily cutting text every few hundred tokens, semantic chunking groups sentences that discuss the same topic.

The goal is simple:

Each chunk should represent a coherent semantic idea.

For example, consider the following sequence of sentences:

Sentence 1: Explanation of transformers
Sentence 2: Attention mechanism in transformers
Sentence 3: Multi-head attention architecture
Sentence 4: Reinforcement learning algorithms

A semantic chunker would produce:

Chunk 1 → Sentences 1–3 (transformer topic)
Chunk 2 → Sentence 4 (new topic)

This ensures that each chunk represents a complete concept, which significantly improves retrieval relevance.

How Semantic Chunking Works

Most semantic chunking implementations follow a similar pipeline.

Step 1 — Sentence Segmentation

The document is first split into sentences.

Example:

Document → Sentence1, Sentence2, Sentence3, Sentence4

This allows the algorithm to analyze semantic similarity at a granular level.

Step 2 — Generate Sentence Embeddings

Each sentence is converted into a vector representation using an embedding model.

Common embedding models include:

Sentence Transformers
BGE embeddings
Instructor embeddings
OpenAI embeddings

Each sentence is now represented as a high-dimensional vector capturing its meaning.

Step 3 — Compute Similarity Between Sentences

Next, the algorithm calculates cosine similarity between consecutive sentences.

Example:

similarity(S1, S2)
similarity(S2, S3)
similarity(S3, S4)

High similarity indicates the sentences belong to the same topic, while low similarity suggests a topic shift.
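
As a quick illustration of this step, here is cosine similarity in plain Python. The three-dimensional vectors are made up for the example; real sentence embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: s1 and s2 share a topic, s3 does not
s1, s2, s3 = [1.0, 1.0, 0.0], [1.0, 0.8, 0.1], [0.0, 0.1, 1.0]
print(round(cosine_similarity(s1, s2), 3))
print(round(cosine_similarity(s2, s3), 3))
```

The first pair points in nearly the same direction and scores close to 1, while the second pair scores close to 0. That kind of drop is what signals a topic shift.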

Step 4 — Detect Topic Boundaries

If the similarity between sentences drops below a predefined threshold, a new chunk boundary is created.

Example rule:

similarity ≥ 0.75 → same chunk
similarity < 0.75 → start new chunk

This dynamically segments the document based on semantic transitions.

Step 5 — Build Semantic Chunks

Finally, sentences are grouped into chunks that maintain topic continuity.

Unlike fixed chunking, semantic chunks may vary in size, but they maintain contextual coherence.
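
The five steps can be sketched end to end with the standard library alone. This is a toy under stated assumptions: `embed` is a bag-of-words counter standing in for a real embedding model, the sentence splitter is a naive regex, and the 0.5 threshold is chosen for this toy embedding (useful thresholds depend on the model you use). All function names are illustrative.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Step 1: naive split after ., !, or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def embed(sentence):
    """Step 2: toy embedding, a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Step 3: cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(text, threshold=0.5):
    """Steps 4 and 5: start a new chunk wherever similarity drops below threshold."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) >= threshold:
            chunks[-1].append(sentence)  # same topic: extend the current chunk
        else:
            chunks.append([sentence])  # topic shift: open a new chunk
    return [" ".join(chunk) for chunk in chunks]


text = (
    "Transformers use attention layers. "
    "Attention layers let transformers weigh tokens. "
    "Reinforcement learning trains agents with rewards."
)
for chunk in semantic_chunks(text):
    print(chunk)
```

The two transformer sentences share enough vocabulary to stay together, and the reinforcement learning sentence starts a new chunk.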

Why Semantic Chunking Improves RAG Performance

Semantic chunking improves RAG pipelines in several important ways.

1. Better Context Integrity

Each chunk contains a complete explanation of a concept, which helps the LLM reason more effectively.

2. Higher Retrieval Precision

Vector similarity search works best when chunks represent clear semantic topics rather than mixed content.

3. Reduced Hallucination

When retrieved context is precise and coherent, the LLM is less likely to generate unsupported information.

4. Improved Answer Grounding

Because chunks are semantically aligned, answers are better supported by retrieved documents.

Accuracy Comparison with Other Chunking Methods

In practice, semantic chunking tends to outperform traditional chunking approaches, although results depend on the corpus and the evaluation setup. The ratings below are qualitative.

Chunking Method	Retrieval Precision	Context Quality	Implementation Effort
Fixed Token Chunking	Medium	Low	Easy
Recursive Chunking	Medium–High	Medium	Moderate
Semantic Chunking	High	High	Advanced

In many RAG systems, teams report the following, though exact figures vary with the corpus and evaluation method:

15–30% improvement in retrieval relevance
More grounded responses
Lower hallucination rates

These improvements become especially noticeable in long-form documents like research papers, legal documents, or technical documentation.

Practical Challenges

Despite its advantages, semantic chunking is not always trivial to implement.

Some practical challenges include:

Higher compute cost
Generating embeddings for every sentence can be expensive for large document sets.

Threshold tuning
The similarity threshold must be tuned carefully to avoid overly small or overly large chunks.

Variable chunk sizes
Chunks can become uneven, which sometimes requires adding a maximum token limit.

Production Best Practices

In most production RAG systems, semantic chunking is combined with token limits and overlap strategies.

A common configuration looks like this:

Semantic similarity threshold: 0.75
Max chunk size: 800 tokens
Overlap: 50 tokens

This ensures chunks remain semantically meaningful while staying within model limits.
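
One way to apply such a configuration is a post-processing pass over the semantic chunks. The sketch below is an assumption-laden illustration: `ChunkConfig` and `enforce_limits` are hypothetical names, and tokens are approximated by whitespace-separated words, so swap in your model's tokenizer for accurate counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkConfig:
    similarity_threshold: float = 0.75  # used by the semantic splitter itself
    max_tokens: int = 800
    overlap: int = 50


def enforce_limits(chunks, config=ChunkConfig()):
    """Split any chunk longer than max_tokens into overlapping windows."""
    if not 0 <= config.overlap < config.max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")
    step = config.max_tokens - config.overlap
    result = []
    for chunk in chunks:
        words = chunk.split()
        if len(words) <= config.max_tokens:
            result.append(chunk)  # already within the limit: keep as is
            continue
        # Slide a window so consecutive pieces share `overlap` words
        for start in range(0, len(words) - config.overlap, step):
            result.append(" ".join(words[start:start + config.max_tokens]))
    return result


small = ChunkConfig(max_tokens=5, overlap=2)
for piece in enforce_limits(["a b c", "w1 w2 w3 w4 w5 w6 w7 w8 w9"], small):
    print(piece)
```

Chunks that already fit are left untouched, so the semantic boundaries survive wherever the size limit allows.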

What’s Next

Semantic chunking is a powerful technique, but it’s just one piece of the puzzle. In the next chapter, we’ll explore Agentic Chunking — a dynamic approach where the LLM itself decides how to group information based on meaning and relevance, evolving chunk metadata over time.

Follow along as we discuss Agentic Chunking in our next chapter.

Different Chunking Methods for RAG

Yash Bhoskar — Thu, 25 Jun 2026 16:04:39 +0000

The Ultimate Guide to Chunking Methods for RAG

What is Chunking?

To stay within a Large Language Model's (LLM) token limit, we employ chunking: a preprocessing technique that breaks continuous text into discrete blocks. This lets the model process information efficiently without exceeding its context window.

What is RAG?

LLMs often suffer from hallucinations, generating false information with unearned confidence. This lack of factual "grounding" makes them unreliable for many high-stakes tasks.

To solve this, RAG (Retrieval-Augmented Generation) was introduced to provide LLMs with a "source of truth" to consult before answering.

To make RAG work, we first turn our documents into "digital fingerprints" called vector embeddings. We use specialized AI models (bi-encoders) to translate human text into these numbers, which are then stored in a vector database.

Think of it like a high-tech library: the quality of the search depends entirely on how we’ve filed the information. If our "chunks" of text are too big or too small, the AI won't find the right answer. That’s why choosing a smart chunking strategy is just as important as the search method itself for getting accurate results.

Different Chunking Methods

1. Fixed-Size Chunking

This is the most straightforward "brute force" approach where you decide on a set number of characters or tokens (e.g., 500 characters) and split the text exactly at those intervals.

While it is incredibly fast and computationally cheap, it is "blind" to the content. It often cuts sentences in half or separates a heading from its relevant paragraph, which can lead to a loss of context during retrieval. This is the "old reliable" method—it just counts characters and cuts.

from langchain_text_splitters import CharacterTextSplitter  # pip install langchain-text-splitters

text = "Your long document text here..."
splitter = CharacterTextSplitter(
    separator="",
    chunk_size=500,
    chunk_overlap=50  # Overlap helps keep context between chunks
)
chunks = splitter.split_text(text)

2. Recursive Chunking

Considered the "industry standard" for many applications, this method attempts to be more polite to the structure of the text. It uses a hierarchy of separators—starting with double newlines, then single newlines, then spaces—to break the text.

If a paragraph is too big, it looks for the next best place to split it, aiming to keep related sentences together in a single block as much as possible. This is the recommended default.

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""]
)
chunks = splitter.create_documents([text])
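
To see what the separator hierarchy is doing, here is a simplified standard-library sketch of the idea. It is not LangChain's implementation (which also handles overlap and other details); `recursive_split` is an illustrative name, and sizes are measured in characters.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits: keep growing the chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece alone is too big: retry with the next, finer separator
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = "Para one is short.\n\nPara two is a bit longer than the limit allows."
for chunk in recursive_split(text, chunk_size=30):
    print(repr(chunk))
```

The short paragraph survives whole, and only the long one is broken further, at word boundaries rather than mid-word.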

3. Document-Specific Chunking

This method acknowledges that a Python script, an HTML page, and a Markdown file are structured differently. Instead of treating everything like a plain wall of text, it uses the document’s inherent formatting (like <div> tags, # headers, or function definitions) to determine the boundaries.

This ensures that a single function or a specific sub-section of a manual stays intact as a coherent unit. This is best for structured data like Markdown, HTML, or Code.

from langchain_text_splitters import MarkdownHeaderTextSplitter

markdown_text = "# Guide\n\nIntro text.\n\n## Setup\n\nInstall steps."

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]

splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks = splitter.split_text(markdown_text)
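
If you want to see the mechanics without a library, the same idea fits in a few lines of plain Python. This is a simplified sketch, not how LangChain implements it: `split_by_headers` is an illustrative name, and it does not special-case `#` lines inside fenced code blocks.

```python
import re


def split_by_headers(markdown_text, max_level=2):
    """Split Markdown into sections, one per header up to max_level."""
    pattern = re.compile(r"^(#{1,%d})\s+(.*)$" % max_level)
    sections, path, body = [], {}, []

    def flush():
        text = "\n".join(body).strip()
        if text:
            sections.append({"headers": dict(path), "text": text})
        body.clear()

    for line in markdown_text.splitlines():
        match = pattern.match(line)
        if match:
            flush()  # close the section that was being collected
            level = len(match.group(1))
            # Drop headers at this level or deeper before recording the new one
            for deeper in [lvl for lvl in path if lvl >= level]:
                del path[deeper]
            path[level] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return sections


doc = "# Guide\nIntro.\n## Setup\nInstall it.\n## Usage\nRun it."
for section in split_by_headers(doc):
    print(section)
```

Each section carries its header path as metadata, so a retrieved chunk still knows which part of the document it came from.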

4. Semantic Chunking

Rather than looking at characters or formatting, this method looks at meaning. It analyzes the "distance" in ideas between sentences; as long as the sentences are talking about the same topic, they stay in the same chunk.

When the model detects a significant shift in the subject matter, it creates a break. This results in chunks that vary in size but are incredibly consistent in their topical focus. This requires an embedding model to "read" the sentences and decide if they belong together.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

# It groups sentences by how similar they are in meaning
splitter = SemanticChunker(OpenAIEmbeddings())
chunks = splitter.create_documents([text])

5. Agentic Chunking

This is the most advanced and "human-like" strategy, where an LLM acts as an autonomous agent to decide where the breaks should go. The agent reads the document and asks, "Does this part stand alone as a complete thought?" It essentially "edits" the document into logical pieces based on high-level reasoning.

While this is the most accurate and context-aware method, it is also the slowest and most expensive because it requires multiple AI calls just to prepare the data. This is usually a custom "loop" where you ask an LLM to look at a chunk and decide if it's "complete" or needs more text.

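# Conceptual sketch (usually implemented via LangGraph or a custom loop).
# `is_complete` stands in for an LLM call that answers "Is this a complete thought?"
def agentic_split(sentences, is_complete):
    chunks, current = [], []
    for sentence in sentences:
        current.append(sentence)            # 1. Grow the candidate with the next sentence
        if is_complete(" ".join(current)):  # 2. Ask the LLM about the candidate
            chunks.append(" ".join(current))  # 4. If YES: create the chunk and move on
            current = []
        # 3. If NO: loop again and add the next sentence
    if current:
        chunks.append(" ".join(current))    # Keep any trailing text as a final chunk
    return chunks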

Which One Should You Use?

Method	Best For...	Difficulty	Cost
Fixed	Quick prototypes	Very Easy	$0
Recursive	General text / Articles	Easy	$0
Document	Code / Formatted docs	Medium	$0
Semantic	Deep research / RAG	Hard	Low (Embedding API calls)
Agentic	High-precision needs	Very Hard	High (LLM API calls)

What's Next?

That’s the high-level view of how we break down data for LLMs! But knowing the definitions is only half the battle. In the coming weeks, I’ll be breaking down each of these strategies in detail—sharing the code, the common pitfalls, and the "Goldilocks" settings for your chunk sizes.

Follow along as we take a Deep Dive into Semantic Chunking for RAG in our next chapter.

RAG Is Not Just Chunking, Embedding, Retrieval, Generation

Yash Bhoskar — Thu, 25 Jun 2026 15:52:21 +0000

If I had a dollar for every time someone explained RAG in exactly four boxes and an arrow between each, I'd have enough to fine-tune a small LLM by now.

Here's the thing — those four boxes aren't wrong. They're just the skeleton. And a skeleton without organs, blood flow, and a nervous system doesn't walk anywhere. It just lies there looking like it should work.

So before you nod along to the "it's simple" version, sit with these for a second:

Did your parser actually capture the table on page 14, or did it turn into word soup?
That chart your document had — does your pipeline even know it existed?
Why that chunk size? Why that overlap? Did you pick it, or did a tutorial pick it for you?
Your vector DB choice — was that a real decision, or the first result on Google?
The 5 chunks you retrieved — are they relevant, or just similar-sounding?
Is there noise riding along with the signal, diluting your answer?
How do you know the LLM's answer is actually grounded in what you retrieved, and not just... plausible?

That's not pedantry. That's the entire difference between a RAG demo that wows your manager once and a RAG system that survives contact with real users and real documents.

The Real Flow (Bird's-Eye View)

Think of it less like a pipe and more like a relay race with judges at every handoff:

Stage	What's actually happening	The question nobody asks
Parsing	Documents → clean structured text	Did tables/images survive, or vanish?
Chunking	Splitting text into digestible pieces	Why this size? Why this overlap?
Embedding	Turning chunks into vectors	Does this model "get" your domain?
Storage	Vectors land in a DB	Picked for hype, or for your scale/latency needs?
Hybrid Search	Keyword (BM25) + semantic search	Are you only doing vector search and missing exact matches?
Metadata Filtering	Narrowing by source/date/dept	Or is everything just dumped into one giant pile?
Reranking	Cross-encoder re-scores top candidates	Or are you trusting raw similarity scores blindly?
Context Selection	Picking the final Top-K chunks	Too few = missing info. Too many = confused LLM.
Generation	LLM writes the answer	Grounded in your docs, or politely hallucinating?
Answer Relevancy	Did it actually answer the question	Anyone checking, or just shipping it?

Every single row above has its own failure modes, its own trade-offs, and honestly — its own rabbit hole worth a blog post of its own.

Why This Actually Matters

A "simple" RAG pipeline fails silently. It doesn't crash — it just gives you a confidently wrong answer, citing a chunk that's 70% irrelevant, built from a table your parser butchered, retrieved because it was vector-similar rather than actually-useful. And nobody notices until a user does.

Good RAG isn't about stacking the four boxes. It's about making every junction in that relay race accountable — parsing accountable for fidelity, chunking accountable for context, retrieval accountable for relevance, generation accountable for grounding.

What's Next

This was the 30,000-ft view — intentionally not deep, just enough to make you go "oh, there's way more going on here." Up next, I'll deep-dive each stage one by one, starting with the most underrated villain of every RAG pipeline: document parsing (yes, before you even think about chunking).

Stay tuned. 🧠

Inspired by my own hurdles 🙂