
Parth Sarthi Sharma

Loaders, Splitters & Embeddings — How Bad Chunking Breaks Even Perfect RAG Systems

Diagram: the RAG document ingestion pipeline

When people debug poor RAG results, they usually blame:

  • the vector database
  • the embedding model
  • the prompt

But in real systems, the most common root cause sits much earlier:

👉 The document ingestion pipeline

If ingestion is wrong, retrieval will be wrong — no matter how good your embeddings or LLM are.

In this article, we’ll break down:

  • What a document ingestion pipeline actually is
  • The role of loaders, splitters, and embeddings
  • Why chunking is the most underestimated design decision
  • How ingestion mistakes silently ruin RAG systems

This is concept-first thinking, with tooling examples only where helpful.

1. What Is a Document Ingestion Pipeline?

At a high level, an ingestion pipeline converts raw data into retrievable semantic units.

Raw Source
  → Documents
    → Chunks
      → Embeddings
        → Vector Store

Each stage:

  • Loses information if done poorly
  • Constrains everything downstream
  • Is extremely hard to “fix later”

A bad ingestion pipeline doesn’t fail loudly — it fails by returning plausible but wrong answers.
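The stages above can be sketched as a toy pipeline. Everything here is illustrative, not a real framework API: the loader wraps inline text, the splitter cuts fixed-size character chunks, and the "embedding" is just vowel counts standing in for a model call.

```python
def load(source: str) -> list[dict]:
    # Toy loader: wraps raw text as one document with metadata attached.
    return [{"text": source, "metadata": {"source": "inline"}}]

def split(docs: list[dict], size: int = 40) -> list[dict]:
    # Toy splitter: fixed-size character chunks; metadata is carried forward.
    return [
        {"text": d["text"][i:i + size], "metadata": d["metadata"]}
        for d in docs
        for i in range(0, len(d["text"]), size)
    ]

def embed(chunks: list[dict]) -> list[dict]:
    # Toy "embedding": vowel counts; a real system calls a model here.
    for c in chunks:
        c["vector"] = [c["text"].count(v) for v in "aeiou"]
    return chunks

# Each stage constrains the next: whatever split() loses, embed() cannot recover.
records = embed(split(load("Raw sources become documents, then chunks, then embeddings.")))
```

Note the one-way dependency: the vectors in `records` are computed from chunk text alone, so any context dropped during loading or splitting is simply absent downstream.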

2. Document Loaders: The Foundation (Often Ignored)

What loaders do

Document loaders are responsible for:

  • Reading raw sources (PDFs, HTML, Markdown, APIs, DBs)
  • Extracting text
  • Attaching metadata

Examples of sources:

  • PDFs (policies, contracts)
  • Websites / wikis
  • Git repositories
  • Knowledge bases (Confluence, Notion)
  • Databases or APIs

Common loader failures

  • PDFs with broken text order
  • Headers/footers mixed into content
  • Navigation menus treated as text
  • Missing or inconsistent metadata

If metadata is lost at this stage, you cannot recover it later.

Design rule #1

Treat metadata as first-class data.

At minimum, preserve:

  • source (URL / file / system)
  • page or section
  • document type
  • timestamp
  • access scope

Frameworks (e.g. LangChain loaders) help — but you must inspect loader output manually, at least once.
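As a sketch of "metadata as first-class data", here is a minimal hand-rolled loader. The field names (`source`, `section`, `doc_type`, `timestamp`, `access_scope`) mirror the list above but are placeholders, not any framework's schema; the heading extraction and access scope are deliberately crude stubs.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def load_markdown(path: Path) -> dict:
    # Illustrative loader: reads a Markdown file and attaches metadata
    # at load time, because it cannot be recovered later.
    text = path.read_text(encoding="utf-8")
    first_line = text.splitlines()[0] if text else ""
    return {
        "text": text,
        "metadata": {
            "source": str(path),                 # URL / file / system
            "section": first_line.lstrip("# "),  # crude: first heading
            "doc_type": "markdown",
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "access_scope": "internal",          # stub: wire to your ACL system
        },
    }

# Inspect loader output manually, at least once:
tmp = Path(tempfile.mkdtemp()) / "policy.md"
tmp.write_text("# Refund Policy\nRefunds are issued within 14 days.", encoding="utf-8")
doc = load_markdown(tmp)
```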

3. Text Splitters: Where Most RAG Systems Break

RAG systems do not retrieve documents.
They retrieve chunks.

This makes text splitting one of the most important — and misunderstood — steps in RAG.

Why splitting matters

Bad splitting causes:

  • Partial facts
  • Broken reasoning
  • “Lost in the middle” effects
  • Irrelevant or misleading retrievals

Once text is chunked incorrectly, embeddings faithfully encode the wrong thing.
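A tiny illustration of the failure mode, with made-up text and an arbitrary width of 45 characters: a naive fixed-width splitter cuts a fact mid-word, while even a crude sentence-boundary split keeps each fact whole.

```python
import re

FACT = "The warranty covers parts for 24 months. Labor is covered for 12 months."

# Naive fixed-width splitting cuts the second fact mid-word
# ("...months. Labo" / "r is covered..."):
naive = [FACT[i:i + 45] for i in range(0, len(FACT), 45)]

# Sentence-aware splitting keeps each fact intact:
sentences = re.split(r"(?<=[.!?])\s+", FACT)
```

An embedding of `naive[1]` would faithfully encode "r is covered for 12 months." and nothing about labor, which is exactly the "partial facts" problem above.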

4. Chunk Size Is a Trade-Off, Not a Constant

There is no universally “correct” chunk size.

Small chunks

Pros

  • Higher recall
  • Precise matching

Cons

  • Loss of local context
  • Fragmented meaning

Large chunks

Pros

  • Better semantic completeness
  • More context per chunk

Cons

  • Fewer chunks fit in the context window
  • Irrelevant content dilutes relevance

The right chunk size depends on document structure and query intent — not a blog default.
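The trade-off is easy to see on a toy document (the splitter and sizes here are illustrative): the same text yields many small, precise chunks or a few context-rich ones, and you cannot have both from one setting.

```python
def chunk(text: str, size: int) -> list[str]:
    # Illustrative fixed-size splitter; real splitters respect boundaries.
    return [text[i:i + size] for i in range(0, len(text), size)]

doc = "word " * 200  # a 1000-character toy document

small = chunk(doc, 100)  # 10 precise chunks, little local context each
large = chunk(doc, 500)  # 2 context-rich chunks, fewer fit per prompt
```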

5. Why Bad Chunking Ruins Even Perfect Embeddings

This is the key misconception:

“If embeddings are good, retrieval will be good.”

Not true.

Embeddings encode what you give them.

If a chunk:

  • mixes multiple topics
  • cuts sentences mid-thought
  • spans unrelated sections

Then the embedding becomes a semantic average — and retrieval quality collapses.

This is why:

  • identical embedding models can perform wildly differently
  • RAG quality varies more by ingestion than by model choice
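The "semantic average" effect can be shown with toy 2-D vectors (the numbers and topic labels are invented for illustration; real embeddings have hundreds of dimensions but the geometry is the same):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Standard cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings for two unrelated topics:
billing = [1.0, 0.0]
hiking = [0.0, 1.0]

# A chunk mixing both topics embeds near their average:
mixed = [(b + h) / 2 for b, h in zip(billing, hiking)]

sim_clean = cosine(billing, billing)  # 1.0: clean chunk matches its topic
sim_mixed = cosine(mixed, billing)    # ~0.71: mixed chunk matches neither well
```

A billing query now ranks the mixed chunk well below a clean one, even though the billing content is in there, which is why retrieval quality collapses without the embedding model being at fault.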

6. Overlap: A Necessary Evil

Chunk overlap exists to:

  • preserve continuity
  • avoid cutting critical information

But overlap has costs:

  • More chunks
  • Higher storage cost
  • Higher retrieval noise

Overlap should be:

  • intentional
  • minimal
  • justified by document structure

Blindly adding overlap is not a fix — it’s a tax.
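The cost is mechanical: stepping by `size - overlap` means every unit of overlap produces more chunks to store, embed, and rank. A minimal sketch (sizes are arbitrary):

```python
def split_with_overlap(text: str, size: int, overlap: int) -> list[str]:
    # Step by (size - overlap) so consecutive chunks share `overlap` chars.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

doc = "abcdefghij" * 10  # 100 characters

no_overlap = split_with_overlap(doc, 20, 0)    # 5 chunks
with_overlap = split_with_overlap(doc, 20, 5)  # 7 chunks for the same text
```

Here 5 characters of overlap (25% of the chunk) inflates chunk count by 40%, and every extra chunk is another near-duplicate candidate competing at retrieval time.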

7. Embeddings Come Last (For a Reason)

Embeddings are often treated as the “magic step”.

In reality:

  • They are deterministic
  • They faithfully reflect upstream decisions
  • They cannot repair bad ingestion

By the time text reaches the embedding stage:

  • Most architectural decisions are already locked in

This is why changing the embedding model rarely fixes poor RAG results.

8. Ingestion Is an Architectural Decision

In production systems:

  • Ingestion pipelines evolve over time
  • Different document types need different strategies
  • Re-ingestion is expensive and risky

That makes ingestion:

  • a platform concern
  • not a one-off script
  • not a junior task

If you design ingestion casually, you pay for it forever.

9. Key Takeaways

  • RAG failures usually start at ingestion
  • Loaders must preserve clean text and metadata
  • Chunking decisions dominate retrieval quality
  • Embeddings encode mistakes faithfully
  • Ingestion pipelines are architecture, not plumbing

What’s Next

In the next article, we’ll explore:

Why “Lost in the Middle” Breaks Most RAG Systems
— and why retrieving the right chunks doesn’t guarantee the model will use them.
