When people debug poor RAG results, they usually blame:
- the vector database
- the embedding model
- the prompt
But in real systems, the most common root cause sits much earlier:
👉 The document ingestion pipeline
If ingestion is wrong, retrieval will be wrong — no matter how good your embeddings or LLM are.
In this article, we’ll break down:
- What a document ingestion pipeline actually is
- The role of loaders, splitters, and embeddings
- Why chunking is the most underestimated design decision
- How ingestion mistakes silently ruin RAG systems
This is concept-first thinking, with tooling examples only where helpful.
1. What Is a Document Ingestion Pipeline?
At a high level, an ingestion pipeline converts raw data into retrievable semantic units.
Raw Source
→ Documents
→ Chunks
→ Embeddings
→ Vector Store
Each stage:
- Loses information if done poorly
- Constrains everything downstream
- Is extremely hard to “fix later”
A bad ingestion pipeline doesn’t fail loudly — it fails by returning plausible but wrong answers.
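To make the flow concrete, here is a deliberately naive, self-contained sketch of the four stages. Nothing in it is a real library API; each function is a toy stand-in for the loader, splitter, embedder, and store you would actually use:

```python
# A toy, self-contained version of the four stages. Every function
# here is a naive placeholder for illustration, not a real library API.

def load_documents(source: str) -> list[str]:
    # Raw Source -> Documents (here: one plain-text file per source)
    with open(source, encoding="utf-8") as f:
        return [f.read()]

def split_documents(docs: list[str], size: int = 500) -> list[str]:
    # Documents -> Chunks (fixed-size character windows; fine for a
    # demo, and exactly the naive splitting this article warns about)
    return [d[i:i + size] for d in docs for i in range(0, len(d), size)]

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    # Chunks -> Embeddings (letter-frequency vectors, not semantics)
    return [[c.count(chr(o)) / max(len(c), 1) for o in range(97, 123)]
            for c in chunks]

# Embeddings -> Vector Store (a plain list standing in for a real index)
vector_store: list[tuple[list[float], str]] = []

def ingest(source: str) -> None:
    chunks = split_documents(load_documents(source))
    vector_store.extend(zip(embed_chunks(chunks), chunks))
```

Each arrow in the flow above corresponds to one call here, and each call is a place where information can quietly disappear.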
2. Document Loaders: The Foundation (Often Ignored)
What loaders do
Document loaders are responsible for:
- Reading raw sources (PDFs, HTML, Markdown, APIs, DBs)
- Extracting text
- Attaching metadata
Examples of sources:
- PDFs (policies, contracts)
- Websites / wikis
- Git repositories
- Knowledge bases (Confluence, Notion)
- Databases or APIs
Common loader failures
- PDFs with broken text order
- Headers/footers mixed into content
- Navigation menus treated as text
- Missing or inconsistent metadata
If metadata is lost at this stage, you cannot recover it later.
Design rule #1
Treat metadata as first-class data.
At minimum, preserve:
- source (URL / file / system)
- page or section
- document type
- timestamp
- access scope
Frameworks (e.g. LangChain loaders) help, but you should inspect loader output manually at least once, as in the sketch below.
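One way to do that inspection, assuming LangChain's community PDF loader (the file name is a placeholder; any loader that returns documents with `page_content` and `metadata` works the same way):

```python
# Inspect what a loader actually produced before trusting it.
# Assumes langchain-community and pypdf are installed; "policy.pdf"
# is a placeholder file name.
from langchain_community.document_loaders import PyPDFLoader

docs = PyPDFLoader("policy.pdf").load()

for doc in docs[:3]:
    print(repr(doc.page_content[:200]))  # broken text order shows up here
    print(doc.metadata)                  # expect at least source + page

# Fail at ingestion time, not query time, if required metadata is missing.
required = {"source", "page"}
for doc in docs:
    missing = required - doc.metadata.keys()
    assert not missing, f"{doc.metadata.get('source')}: missing {missing}"
```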
3. Text Splitters: Where Most RAG Systems Break
RAG systems do not retrieve documents.
They retrieve chunks.
This makes text splitting one of the most important — and misunderstood — steps in RAG.
Why splitting matters
Bad splitting causes:
- Partial facts
- Broken reasoning
- “Lost in the middle” effects
- Irrelevant or misleading retrievals
Once text is chunked incorrectly, embeddings faithfully encode the wrong thing.
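For orientation, here is what a typical splitting step looks like with LangChain's RecursiveCharacterTextSplitter. The file name is a placeholder, and the sizes are illustrative, not recommendations; the next section explains why no single number is right:

```python
# Split a corpus into chunks. chunk_size / chunk_overlap are measured
# in characters by default; the values here are illustrative only.
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = open("policy.txt", encoding="utf-8").read()  # placeholder corpus

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " "],  # prefer paragraph boundaries
)
chunks = splitter.split_text(text)

# Sanity-check boundaries: do chunks start and end mid-sentence?
for chunk in chunks[:5]:
    print(repr(chunk[:60]), "...", repr(chunk[-40:]))
```

If the sanity check shows chunks that routinely start or end mid-thought, fix the separators or the source text before touching anything downstream.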
4. Chunk Size Is a Trade-Off, Not a Constant
There is no universally “correct” chunk size.
Small chunks
Pros
- Higher recall
- Precise matching
Cons
- Loss of local context
- Fragmented meaning
Large chunks
Pros
- Better semantic completeness
- More context per chunk
Cons
- Fewer chunks fit in the context window
- Irrelevant content dilutes relevance
The right chunk size depends on document structure and query intent — not a blog default.
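One cheap way to make the trade-off visible is to split the same corpus at two very different sizes and actually read the resulting chunks. A sketch, reusing the placeholder corpus from above:

```python
# Compare what "one chunk" means at two different sizes.
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = open("policy.txt", encoding="utf-8").read()  # placeholder corpus

for size in (200, 1500):
    chunks = RecursiveCharacterTextSplitter(
        chunk_size=size, chunk_overlap=0
    ).split_text(text)
    print(f"chunk_size={size}: {len(chunks)} chunks")
    print(chunks[0])  # does a single chunk hold a complete thought?
```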
5. Why Bad Chunking Ruins Even Perfect Embeddings
This is the key misconception:
“If embeddings are good, retrieval will be good.”
Not true.
Embeddings encode what you give them.
If a chunk:
- mixes multiple topics
- cuts sentences mid-thought
- spans unrelated sections
Then the embedding becomes a semantic average of everything in the chunk, and retrieval quality collapses (the sketch below makes this concrete).
This is why:
- the same embedding model can perform wildly differently across systems
- RAG quality varies more by ingestion than by model choice
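You can observe the semantic-average effect directly with any sentence-embedding model. A sketch using sentence-transformers (the model name is a common default; the sentences are invented for illustration):

```python
# Show the "semantic average" effect with a small embedding model.
# The model name is a common default; the sentences are invented.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

clean = "Employees accrue 20 vacation days per year."
mixed = ("Employees accrue 20 vacation days per year. "
         "The cafeteria closes at 3pm. Server logs rotate weekly.")
query = "How many vacation days do employees get?"

q, c, m = model.encode([query, clean, mixed])
print("query vs clean chunk:", util.cos_sim(q, c).item())
print("query vs mixed chunk:", util.cos_sim(q, m).item())
# The mixed chunk typically scores lower: the unrelated sentences
# pull its vector away from the topic the query cares about.
```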
6. Overlap: A Necessary Evil
Chunk overlap exists to:
- preserve continuity
- avoid cutting critical information
But overlap has costs:
- More chunks
- Higher storage cost
- Higher retrieval noise
Overlap should be:
- intentional
- minimal
- justified by document structure
Blindly adding overlap is not a fix — it’s a tax.
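The tax is easy to quantify: with chunk size s and overlap o, the splitter advances only s − o characters per chunk, so chunk count (and storage) grows by roughly s / (s − o). A back-of-the-envelope sketch:

```python
# Back-of-the-envelope cost of overlap: each chunk advances the window
# by (size - overlap), so overlap inflates chunk count and storage.
def chunk_count(corpus_chars: int, size: int, overlap: int) -> int:
    stride = size - overlap
    return -(-(corpus_chars - overlap) // stride)  # ceil((N - o) / stride)

N = 10_000_000  # 10M characters of source text (illustrative)
for overlap in (0, 50, 200, 400):
    print(f"overlap={overlap:>3}: {chunk_count(N, 500, overlap):,} chunks")
# overlap=  0: 20,000 chunks
# overlap= 50: 22,223 chunks   (+11%)
# overlap=200: 33,333 chunks   (+67%)
# overlap=400: 99,996 chunks   (+400%)
```

Note how the cost is nonlinear: overlap at 10% of chunk size is cheap, overlap at 80% multiplies your index by five.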
7. Embeddings Come Last (For a Reason)
Embeddings are often treated as the “magic step”.
In reality:
- They are deterministic
- They faithfully reflect upstream decisions
- They cannot repair bad ingestion
By the time text reaches the embedding stage, most architectural decisions are already locked in.
This is why changing the embedding model rarely fixes poor RAG results.
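Determinism is worth internalizing: the same chunk in means the same vector out, so re-running the embedding step cannot repair a badly chunked corpus. A sketch (the model name is a common default; the chunk is an invented example of loader damage):

```python
# Embeddings are a pure function of their input: same chunk in, same
# vector out. Re-embedding cannot repair upstream mistakes.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

garbled = "...continued on page 7. 4.2 Termination. Either party may"
v1 = model.encode(garbled)
v2 = model.encode(garbled)
print(np.allclose(v1, v2))  # True: the damage is encoded faithfully, twice
```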
8. Ingestion Is an Architectural Decision
In production systems:
- Ingestion pipelines evolve over time
- Different document types need different strategies
- Re-ingestion is expensive and risky
That makes ingestion:
- a platform concern
- not a one-off script
- not a junior task
If you design ingestion casually, you pay for it forever.
9. Key Takeaways
- RAG failures usually start at ingestion
- Loaders must preserve clean text and metadata
- Chunking decisions dominate retrieval quality
- Embeddings encode mistakes faithfully
- Ingestion pipelines are architecture, not plumbing
What’s Next
In the next article, we’ll explore:
Why “Lost in the Middle” Breaks Most RAG Systems
— and why retrieving the right chunks doesn’t guarantee the model will use them.
