Most RAG discussions focus on retrieval quality.
- Which embeddings to use.
- Which vector database is faster.
- Which similarity metric performs better.
But in production, RAG systems rarely fail because of retrieval alone.
They fail because of how content is chunked, batched, and indexed — quietly, expensively, and at scale.
## Why Chunking Is a Cost Decision, Not Just a Text Decision
Chunking is often treated as a preprocessing step:
“Split documents into 500-token chunks and move on.”
That decision impacts:
- Retrieval accuracy
- Context window usage
- Latency
- Token cost
- Index size
- Re-ranking complexity
Bad chunking doesn’t just reduce answer quality — it multiplies operational cost.
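Numbers make this concrete. The sketch below is pure back-of-the-envelope arithmetic; every figure in it (corpus size, overlap, prices, top-k) is an illustrative assumption, not a benchmark. What it shows is the shape of the relationship: smaller chunks mean more vectors and more overlap tokens paid to the embedding API, and whatever chunk size you pick gets multiplied into every prompt you send.

```python
# Back-of-the-envelope: how chunk size alone moves index size, embedding
# spend, and per-query prompt size. All numbers are illustrative assumptions.

CORPUS_TOKENS = 50_000_000    # total tokens across all documents (assumed)
OVERLAP = 50                  # tokens shared between adjacent chunks (assumed)
EMBED_COST_PER_1K = 0.0001    # $ per 1K tokens embedded (assumed price)
LLM_COST_PER_1K = 0.001      # $ per 1K prompt tokens (assumed price)
TOP_K = 8                     # chunks stuffed into each prompt (assumed)

for chunk_size in (256, 512, 1024):
    num_chunks = CORPUS_TOKENS // chunk_size          # vectors in the index
    embedded = num_chunks * (chunk_size + OVERLAP)    # overlap is embedded twice
    embed_cost = embedded / 1000 * EMBED_COST_PER_1K
    prompt_tokens = TOP_K * chunk_size                # retrieved context per query
    query_cost = prompt_tokens / 1000 * LLM_COST_PER_1K
    print(f"chunk={chunk_size:>4}: {num_chunks:>7,} vectors, "
          f"embed ~${embed_cost:,.2f}, {prompt_tokens:,} prompt tokens/query "
          f"(~${query_cost:.4f})")
```

Halving the chunk size roughly doubles the vector count, and the overlap tax grows with it.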
## The Real Trade-Offs in Chunk Size

### Small Chunks

**Pros**
- More precise retrieval
- Better semantic focus
- Lower “Lost in the Middle” risk
**Cons**
- More chunks per document
- Larger vector index
- Higher retrieval fan-out
- More context assembly overhead
### Large Chunks

**Pros**
- Fewer vectors
- Smaller index
- Faster ingestion
**Cons**
- Lower relevance density
- More noise per chunk
- Higher chance of ignored context
- Worse attention utilisation
👉 There is no “perfect” chunk size.
There is only context-aware chunking.
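“Context-aware” can start as simply as refusing to cut mid-paragraph. Below is a minimal sketch of one such strategy: pack whole paragraphs up to a token budget instead of slicing at a fixed offset. The whitespace-based token count is an assumption for brevity; a real pipeline would use its embedding model's tokenizer.

```python
def chunk_by_paragraphs(text: str, max_tokens: int = 400) -> list[str]:
    """Pack whole paragraphs into chunks of at most ~max_tokens.

    Token counting here is a crude whitespace split, for illustration only.
    Caveat: a single paragraph longer than max_tokens still becomes its own
    oversized chunk and would need a fallback sentence splitter.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in text.split("\n\n"):
        para_len = len(para.split())
        if current and current_len + para_len > max_tokens:
            chunks.append("\n\n".join(current))  # close the chunk at a boundary
            current, current_len = [], 0
        current.append(para)
        current_len += para_len
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Sentence-level or semantic-similarity splitting follows the same shape; what matters is that the boundary decision looks at the content, not at a byte offset.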
## Why Batching Quietly Becomes Your Biggest Cost Lever
Most teams underestimate batching.
Batching affects:
- Ingestion throughput
- Embedding API cost
- Failure recovery
- Observability
- Reprocessing overhead
### Common Anti-Pattern
- Ingesting documents one by one
- Embedding synchronously
- No retry or visibility
This works for demos. It collapses at scale.
### What Good Batching Looks Like
Production-grade ingestion pipelines:
- Batch documents intentionally
- Track batch IDs
- Log failures per batch
- Allow partial retries
- Emit metrics per stage
Batching isn’t just optimisation — it’s operational control.
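A minimal sketch of those five properties in a single loop follows. `embed_batch` and the commented-out `index_upsert` are hypothetical stand-ins for your embedding client and vector store, not any specific library's API:

```python
import logging
import uuid
from itertools import islice

logger = logging.getLogger("ingestion")

def batches(items, size):
    """Yield fixed-size batches from any iterable."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch

def ingest(docs, embed_batch, batch_size=64, max_retries=3):
    """Embed documents in tracked batches with partial retry.

    `docs` is assumed to be a list of dicts with a "text" key; `embed_batch`
    is a stand-in that takes a list of texts and returns a list of vectors.
    """
    failed = []
    for batch in batches(docs, batch_size):
        batch_id = uuid.uuid4().hex[:8]            # traceable unit of work
        for attempt in range(1, max_retries + 1):
            try:
                vectors = embed_batch([d["text"] for d in batch])
                # index_upsert(batch, vectors)     # hand off to your vector store
                logger.info("batch=%s ok docs=%d", batch_id, len(batch))
                break
            except Exception as exc:
                logger.warning("batch=%s attempt=%d failed: %s",
                               batch_id, attempt, exc)
        else:
            failed.extend(batch)                   # keep for a later retry pass
    return failed
```

Returning the failed documents instead of crashing is the point: a bad batch becomes a metric and a retry queue, not a 2 a.m. re-ingestion of the whole corpus.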
## Indexing: The Forgotten Scaling Problem
Indexing is often treated as “fire and forget”.
But indexing decisions affect:
- Query latency
- Memory footprint
- Rebuild cost
- Migration complexity
Questions teams forget to ask:
- Can we re-index incrementally?
- Can we support multiple indexes per domain?
- Can we rebuild without downtime?
- Can we version indexes safely?
RAG systems age badly without a good indexing strategy.
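One pattern that answers most of those questions at once is treating indexes as versioned artifacts behind an alias. The sketch below assumes a hypothetical `store` client; the method names (`create_index`, `upsert`, `point_alias`) are stand-ins for whatever aliasing or collection-swap mechanism your vector database exposes:

```python
import datetime

def rebuild_index(store, docs, alias="docs-prod", batch_size=64):
    """Blue-green re-index: build a new versioned index in the background,
    then atomically repoint the alias that queries actually hit.
    The old version stays around for instant rollback."""
    version = datetime.datetime.now().strftime("docs-%Y%m%d-%H%M")
    store.create_index(version)
    for i in range(0, len(docs), batch_size):
        store.upsert(version, docs[i:i + batch_size])  # old index keeps serving
    store.point_alias(alias, version)                  # atomic cutover, no downtime
```

Incremental re-indexing, per-domain indexes, and safe rollbacks all fall out of the same discipline: queries never name a physical index, only an alias.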
## Why These Costs Compound in Production
Here’s the uncomfortable truth:
Every extra chunk
→ increases retrieval cost
→ increases prompt size
→ increases token spend
→ increases latency
→ reduces answer quality
Poor chunking and batching don’t fail loudly.
They fail financially.
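Put assumed numbers on that chain and the compounding is hard to miss. Everything below is chosen for round arithmetic, not measured:

```python
# Illustrative only: retrieval fan-out multiplied across query volume.
QUERIES_PER_DAY = 100_000    # assumed traffic
CHUNK_TOKENS = 512           # assumed chunk size
LLM_COST_PER_1K = 0.001      # $ per 1K prompt tokens (assumed price)

for top_k in (4, 8, 16):
    tokens_per_query = top_k * CHUNK_TOKENS
    daily = QUERIES_PER_DAY * tokens_per_query / 1000 * LLM_COST_PER_1K
    print(f"top_k={top_k:>2}: {tokens_per_query:>6,} prompt tokens/query "
          f"-> ${daily:>6,.0f}/day for retrieved context alone")
```

Going from top_k=4 to top_k=16 quadruples daily context spend, and the model must now find the answer inside four times as much mostly irrelevant text.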
## Practical Guidelines (That Actually Work)
Some battle-tested principles:
- Prefer smaller, semantically complete chunks
- Avoid “just in case” retrieval
- Batch ingestion with observability
- Track cost per document, not per query (see the tracker sketch below)
- Treat indexes as versioned assets
- Re-evaluate chunking as usage evolves
RAG is not a static pipeline — it’s a living system.
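“Track cost per document” can start as small as a counter keyed by document and pipeline stage, emitted next to your batch metrics. A minimal sketch, with assumed stage names:

```python
from collections import defaultdict

class DocCostTracker:
    """Accumulate spend per document, broken down by pipeline stage.
    The stage names used below ("embed", "index") are assumptions;
    use whatever stages your pipeline actually has."""

    def __init__(self) -> None:
        self._costs = defaultdict(lambda: defaultdict(float))

    def record(self, doc_id: str, stage: str, usd: float) -> None:
        self._costs[doc_id][stage] += usd

    def total(self, doc_id: str) -> float:
        return sum(self._costs[doc_id].values())

tracker = DocCostTracker()
tracker.record("doc-42", "embed", 0.004)
tracker.record("doc-42", "index", 0.001)
print(f"doc-42 lifetime cost: ${tracker.total('doc-42'):.4f}")
```

Once a document's lifetime cost is visible, re-chunking and re-indexing decisions stop being guesses.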
## Final Takeaway
RAG systems don’t get expensive overnight.
They get expensive through:
- Over-chunking
- Over-retrieval
- Under-observability
- Poor batching discipline
If you don’t design ingestion for scale, your costs will scale for you.
## What’s Next
In the next article, we’ll step back and ask:
Simple RAG vs Agentic RAG: What Problem Are You Actually Solving?
Because adding agents before fixing ingestion is usually a mistake.
## Discussion
How are you currently handling chunking and batching in your RAG pipelines?
What trade-offs have surprised you the most?