When people debug poor RAG results, they usually blame:
- the vector database
- the embedding model
- the prompt
But in real systems, the most common root cause sits much earlier:
👉 The document ingestion pipeline
If ingestion is wrong, retrieval will be wrong — no matter how good your embeddings or LLM are.
In this article, we’ll break down:
- What a document ingestion pipeline actually is
- The role of loaders, splitters, and embeddings
- Why chunking is the most underestimated design decision
- How ingestion mistakes silently ruin RAG systems
This is concept-first thinking, with tooling examples only where helpful.
1. What Is a Document Ingestion Pipeline?
At a high level, an ingestion pipeline converts raw data into retrievable semantic units.
Raw Source
→ Documents
→ Chunks
→ Embeddings
→ Vector Store
Each stage:
- Loses information if done poorly
- Constrains everything downstream
- Is extremely hard to “fix later”
A bad ingestion pipeline doesn’t fail loudly — it fails by returning plausible but wrong answers.
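To make the flow concrete, here is a deliberately naive, self-contained sketch of the four stages. Nothing in it is a real library API; each function is a toy stand-in for the loader, splitter, embedder, and store you would actually use:

```python
# A toy, self-contained version of the four stages. Every function
# here is a naive placeholder for illustration, not a real library API.

def load_documents(source: str) -> list[str]:
    # Raw Source -> Documents (here: one plain-text file per source)
    with open(source, encoding="utf-8") as f:
        return [f.read()]

def split_documents(docs: list[str], size: int = 500) -> list[str]:
    # Documents -> Chunks (fixed-size character windows; fine for a
    # demo, and exactly the naive splitting this article warns about)
    return [d[i:i + size] for d in docs for i in range(0, len(d), size)]

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    # Chunks -> Embeddings (letter-frequency vectors, not semantics)
    return [[c.count(chr(o)) / max(len(c), 1) for o in range(97, 123)]
            for c in chunks]

# Embeddings -> Vector Store (a plain list standing in for a real index)
vector_store: list[tuple[list[float], str]] = []

def ingest(source: str) -> None:
    chunks = split_documents(load_documents(source))
    vector_store.extend(zip(embed_chunks(chunks), chunks))
```

Each arrow in the flow above corresponds to one call here, and each call is a place where information can quietly disappear.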
2. Document Loaders: The Foundation (Often Ignored)
What loaders do
Document loaders are responsible for:
- Reading raw sources (PDFs, HTML, Markdown, APIs, DBs)
- Extracting text
- Attaching metadata
Examples of sources:
- PDFs (policies, contracts)
- Websites / wikis
- Git repositories
- Knowledge bases (Confluence, Notion)
- Databases or APIs
Common loader failures
- PDFs with broken text order
- Headers/footers mixed into content
- Navigation menus treated as text
- Missing or inconsistent metadata
If metadata is lost at this stage, you cannot recover it later.
Design rule #1
Treat metadata as first-class data.
At minimum, preserve:
- source (URL / file / system)
- page or section
- document type
- timestamp
- access scope
Frameworks (e.g. LangChain loaders) help, but you should inspect loader output manually at least once, as in the sketch below.
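One way to do that inspection, assuming LangChain's community PDF loader (the file name is a placeholder; any loader that returns documents with `page_content` and `metadata` works the same way):

```python
# Inspect what a loader actually produced before trusting it.
# Assumes langchain-community and pypdf are installed; "policy.pdf"
# is a placeholder file name.
from langchain_community.document_loaders import PyPDFLoader

docs = PyPDFLoader("policy.pdf").load()

for doc in docs[:3]:
    print(repr(doc.page_content[:200]))  # broken text order shows up here
    print(doc.metadata)                  # expect at least source + page

# Fail at ingestion time, not query time, if required metadata is missing.
required = {"source", "page"}
for doc in docs:
    missing = required - doc.metadata.keys()
    assert not missing, f"{doc.metadata.get('source')}: missing {missing}"
```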
3. Text Splitters: Where Most RAG Systems Break
RAG systems do not retrieve documents.
They retrieve chunks.
This makes text splitting one of the most important — and misunderstood — steps in RAG.
Why splitting matters
Bad splitting causes:
- Partial facts
- Broken reasoning
- “Lost in the middle” effects
- Irrelevant or misleading retrievals
Once text is chunked incorrectly, embeddings faithfully encode the wrong thing.
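For orientation, here is what a typical splitting step looks like with LangChain's RecursiveCharacterTextSplitter. The file name is a placeholder, and the sizes are illustrative, not recommendations; the next section explains why no single number is right:

```python
# Split a corpus into chunks. chunk_size / chunk_overlap are measured
# in characters by default; the values here are illustrative only.
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = open("policy.txt", encoding="utf-8").read()  # placeholder corpus

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " "],  # prefer paragraph boundaries
)
chunks = splitter.split_text(text)

# Sanity-check boundaries: do chunks start and end mid-sentence?
for chunk in chunks[:5]:
    print(repr(chunk[:60]), "...", repr(chunk[-40:]))
```

If the sanity check shows chunks that routinely start or end mid-thought, fix the separators or the source text before touching anything downstream.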
4. Chunk Size Is a Trade-Off, Not a Constant
There is no universally “correct” chunk size.
Small chunks
Pros
- Higher recall
- Precise matching
Cons
- Loss of local context
- Fragmented meaning
Large chunks
Pros
- Better semantic completeness
- More context per chunk
Cons
- Fewer chunks fit in the context window
- Irrelevant content dilutes relevance
The right chunk size depends on document structure and query intent — not a blog default.
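One cheap way to make the trade-off visible is to split the same corpus at two very different sizes and actually read the resulting chunks. A sketch, reusing the placeholder corpus from above:

```python
# Compare what "one chunk" means at two different sizes.
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = open("policy.txt", encoding="utf-8").read()  # placeholder corpus

for size in (200, 1500):
    chunks = RecursiveCharacterTextSplitter(
        chunk_size=size, chunk_overlap=0
    ).split_text(text)
    print(f"chunk_size={size}: {len(chunks)} chunks")
    print(chunks[0])  # does a single chunk hold a complete thought?
```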
5. Why Bad Chunking Ruins Even Perfect Embeddings
This is the key misconception:
“If embeddings are good, retrieval will be good.”
Not true.
Embeddings encode what you give them.
If a chunk:
- mixes multiple topics
- cuts sentences mid-thought
- spans unrelated sections
Then the embedding becomes a semantic average of everything in the chunk, and retrieval quality collapses (the sketch below makes this concrete).
This is why:
- the same embedding model can perform wildly differently across systems
- RAG quality varies more by ingestion than by model choice
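You can observe the semantic-average effect directly with any sentence-embedding model. A sketch using sentence-transformers (the model name is a common default; the sentences are invented for illustration):

```python
# Show the "semantic average" effect with a small embedding model.
# The model name is a common default; the sentences are invented.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

clean = "Employees accrue 20 vacation days per year."
mixed = ("Employees accrue 20 vacation days per year. "
         "The cafeteria closes at 3pm. Server logs rotate weekly.")
query = "How many vacation days do employees get?"

q, c, m = model.encode([query, clean, mixed])
print("query vs clean chunk:", util.cos_sim(q, c).item())
print("query vs mixed chunk:", util.cos_sim(q, m).item())
# The mixed chunk typically scores lower: the unrelated sentences
# pull its vector away from the topic the query cares about.
```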
6. Overlap: A Necessary Evil
Chunk overlap exists to:
- preserve continuity
- avoid cutting critical information
But overlap has costs:
- More chunks
- Higher storage cost
- Higher retrieval noise
Overlap should be:
- intentional
- minimal
- justified by document structure
Blindly adding overlap is not a fix — it’s a tax.
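The tax is easy to quantify: with chunk size s and overlap o, the splitter advances only s − o characters per chunk, so chunk count (and storage) grows by roughly s / (s − o). A back-of-the-envelope sketch:

```python
# Back-of-the-envelope cost of overlap: each chunk advances the window
# by (size - overlap), so overlap inflates chunk count and storage.
def chunk_count(corpus_chars: int, size: int, overlap: int) -> int:
    stride = size - overlap
    return -(-(corpus_chars - overlap) // stride)  # ceil((N - o) / stride)

N = 10_000_000  # 10M characters of source text (illustrative)
for overlap in (0, 50, 200, 400):
    print(f"overlap={overlap:>3}: {chunk_count(N, 500, overlap):,} chunks")
# overlap=  0: 20,000 chunks
# overlap= 50: 22,223 chunks   (+11%)
# overlap=200: 33,333 chunks   (+67%)
# overlap=400: 99,996 chunks   (+400%)
```

Note how the cost is nonlinear: overlap at 10% of chunk size is cheap, overlap at 80% multiplies your index by five.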
7. Embeddings Come Last (For a Reason)
Embeddings are often treated as the “magic step”.
In reality:
- They are deterministic
- They faithfully reflect upstream decisions
- They cannot repair bad ingestion
By the time text reaches the embedding stage, most architectural decisions are already locked in.
This is why changing the embedding model rarely fixes poor RAG results.
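Determinism is worth internalizing: the same chunk in means the same vector out, so re-running the embedding step cannot repair a badly chunked corpus. A sketch (the model name is a common default; the chunk is an invented example of loader damage):

```python
# Embeddings are a pure function of their input: same chunk in, same
# vector out. Re-embedding cannot repair upstream mistakes.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

garbled = "...continued on page 7. 4.2 Termination. Either party may"
v1 = model.encode(garbled)
v2 = model.encode(garbled)
print(np.allclose(v1, v2))  # True: the damage is encoded faithfully, twice
```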
8. Ingestion Is an Architectural Decision
In production systems:
- Ingestion pipelines evolve over time
- Different document types need different strategies
- Re-ingestion is expensive and risky
That makes ingestion:
- a platform concern
- not a one-off script
- not a junior task
If you design ingestion casually, you pay for it forever.
9. Key Takeaways
- RAG failures usually start at ingestion
- Loaders must preserve clean text and metadata
- Chunking decisions dominate retrieval quality
- Embeddings encode mistakes faithfully
- Ingestion pipelines are architecture, not plumbing
What’s Next
In the next article, we’ll explore:
Why “Lost in the Middle” Breaks Most RAG Systems
— and why retrieving the right chunks doesn’t guarantee the model will use them.
