Most RAG discussions focus on retrieval quality.
- Which embeddings to use.
- Which vector database is faster.
- Which similarity metric performs better.
But in production, RAG systems rarely fail because of retrieval alone.
They fail because of how content is chunked, batched, and indexed — quietly, expensively, and at scale.
## Why Chunking Is a Cost Decision, Not Just a Text Decision
Chunking is often treated as a preprocessing step:
“Split documents into 500-token chunks and move on.”
That decision impacts:
- Retrieval accuracy
- Context window usage
- Latency
- Token cost
- Index size
- Re-ranking complexity
Bad chunking doesn’t just reduce answer quality — it multiplies operational cost.
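Numbers make this concrete. The sketch below is pure back-of-the-envelope arithmetic; every figure in it (corpus size, overlap, prices, top-k) is an illustrative assumption, not a benchmark. What it shows is the shape of the relationship: smaller chunks mean more vectors and more overlap tokens paid to the embedding API, and whatever chunk size you pick gets multiplied into every prompt you send.

```python
# Back-of-the-envelope: how chunk size alone moves index size, embedding
# spend, and per-query prompt size. All numbers are illustrative assumptions.

CORPUS_TOKENS = 50_000_000    # total tokens across all documents (assumed)
OVERLAP = 50                  # tokens shared between adjacent chunks (assumed)
EMBED_COST_PER_1K = 0.0001    # $ per 1K tokens embedded (assumed price)
LLM_COST_PER_1K = 0.001      # $ per 1K prompt tokens (assumed price)
TOP_K = 8                     # chunks stuffed into each prompt (assumed)

for chunk_size in (256, 512, 1024):
    num_chunks = CORPUS_TOKENS // chunk_size          # vectors in the index
    embedded = num_chunks * (chunk_size + OVERLAP)    # overlap is embedded twice
    embed_cost = embedded / 1000 * EMBED_COST_PER_1K
    prompt_tokens = TOP_K * chunk_size                # retrieved context per query
    query_cost = prompt_tokens / 1000 * LLM_COST_PER_1K
    print(f"chunk={chunk_size:>4}: {num_chunks:>7,} vectors, "
          f"embed ~${embed_cost:,.2f}, {prompt_tokens:,} prompt tokens/query "
          f"(~${query_cost:.4f})")
```

Halving the chunk size roughly doubles the vector count, and the overlap tax grows with it.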
## The Real Trade-Offs in Chunk Size

### Small Chunks

**Pros**
- More precise retrieval
- Better semantic focus
- Lower “Lost in the Middle” risk
**Cons**
- More chunks per document
- Larger vector index
- Higher retrieval fan-out
- More context assembly overhead
### Large Chunks

**Pros**
- Fewer vectors
- Smaller index
- Faster ingestion
**Cons**
- Lower relevance density
- More noise per chunk
- Higher chance of ignored context
- Worse attention utilisation
👉 There is no “perfect” chunk size.
There is only context-aware chunking.
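“Context-aware” can start as simply as refusing to cut mid-paragraph. Below is a minimal sketch of one such strategy: pack whole paragraphs up to a token budget instead of slicing at a fixed offset. The whitespace-based token count is an assumption for brevity; a real pipeline would use its embedding model's tokenizer.

```python
def chunk_by_paragraphs(text: str, max_tokens: int = 400) -> list[str]:
    """Pack whole paragraphs into chunks of at most ~max_tokens.

    Token counting here is a crude whitespace split, for illustration only.
    Caveat: a single paragraph longer than max_tokens still becomes its own
    oversized chunk and would need a fallback sentence splitter.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in text.split("\n\n"):
        para_len = len(para.split())
        if current and current_len + para_len > max_tokens:
            chunks.append("\n\n".join(current))  # close the chunk at a boundary
            current, current_len = [], 0
        current.append(para)
        current_len += para_len
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Sentence-level or semantic-similarity splitting follows the same shape; what matters is that the boundary decision looks at the content, not at a byte offset.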
## Why Batching Quietly Becomes Your Biggest Cost Lever
Most teams underestimate batching.
Batching affects:
- Ingestion throughput
- Embedding API cost
- Failure recovery
- Observability
- Reprocessing overhead
### Common Anti-Pattern
- Ingesting documents one by one
- Embedding synchronously
- No retry or visibility
This works for demos. It collapses at scale.
### What Good Batching Looks Like
Production-grade ingestion pipelines:
- Batch documents intentionally
- Track batch IDs
- Log failures per batch
- Allow partial retries
- Emit metrics per stage
Batching isn’t just optimisation — it’s operational control.
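A minimal sketch of those five properties in a single loop follows. `embed_batch` and the commented-out `index_upsert` are hypothetical stand-ins for your embedding client and vector store, not any specific library's API:

```python
import logging
import uuid
from itertools import islice

logger = logging.getLogger("ingestion")

def batches(items, size):
    """Yield fixed-size batches from any iterable."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch

def ingest(docs, embed_batch, batch_size=64, max_retries=3):
    """Embed documents in tracked batches with partial retry.

    `docs` is assumed to be a list of dicts with a "text" key; `embed_batch`
    is a stand-in that takes a list of texts and returns a list of vectors.
    """
    failed = []
    for batch in batches(docs, batch_size):
        batch_id = uuid.uuid4().hex[:8]            # traceable unit of work
        for attempt in range(1, max_retries + 1):
            try:
                vectors = embed_batch([d["text"] for d in batch])
                # index_upsert(batch, vectors)     # hand off to your vector store
                logger.info("batch=%s ok docs=%d", batch_id, len(batch))
                break
            except Exception as exc:
                logger.warning("batch=%s attempt=%d failed: %s",
                               batch_id, attempt, exc)
        else:
            failed.extend(batch)                   # keep for a later retry pass
    return failed
```

Returning the failed documents instead of crashing is the point: a bad batch becomes a metric and a retry queue, not a 2 a.m. re-ingestion of the whole corpus.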
## Indexing: The Forgotten Scaling Problem
Indexing is often treated as “fire and forget”.
But indexing decisions affect:
- Query latency
- Memory footprint
- Rebuild cost
- Migration complexity
Questions teams forget to ask:
- Can we re-index incrementally?
- Can we support multiple indexes per domain?
- Can we rebuild without downtime?
- Can we version indexes safely?
RAG systems age badly without a good indexing strategy.
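One pattern that answers most of those questions at once is treating indexes as versioned artifacts behind an alias. The sketch below assumes a hypothetical `store` client; the method names (`create_index`, `upsert`, `point_alias`) are stand-ins for whatever aliasing or collection-swap mechanism your vector database exposes:

```python
import datetime

def rebuild_index(store, docs, alias="docs-prod", batch_size=64):
    """Blue-green re-index: build a new versioned index in the background,
    then atomically repoint the alias that queries actually hit.
    The old version stays around for instant rollback."""
    version = datetime.datetime.now().strftime("docs-%Y%m%d-%H%M")
    store.create_index(version)
    for i in range(0, len(docs), batch_size):
        store.upsert(version, docs[i:i + batch_size])  # old index keeps serving
    store.point_alias(alias, version)                  # atomic cutover, no downtime
```

Incremental re-indexing, per-domain indexes, and safe rollbacks all fall out of the same discipline: queries never name a physical index, only an alias.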
## Why These Costs Compound in Production
Here’s the uncomfortable truth:
Every extra chunk
→ increases retrieval cost
→ increases prompt size
→ increases token spend
→ increases latency
→ reduces answer quality
Poor chunking and batching don’t fail loudly.
They fail financially.
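Put assumed numbers on that chain and the compounding is hard to miss. Everything below is chosen for round arithmetic, not measured:

```python
# Illustrative only: retrieval fan-out multiplied across query volume.
QUERIES_PER_DAY = 100_000    # assumed traffic
CHUNK_TOKENS = 512           # assumed chunk size
LLM_COST_PER_1K = 0.001      # $ per 1K prompt tokens (assumed price)

for top_k in (4, 8, 16):
    tokens_per_query = top_k * CHUNK_TOKENS
    daily = QUERIES_PER_DAY * tokens_per_query / 1000 * LLM_COST_PER_1K
    print(f"top_k={top_k:>2}: {tokens_per_query:>6,} prompt tokens/query "
          f"-> ${daily:>6,.0f}/day for retrieved context alone")
```

Going from top_k=4 to top_k=16 quadruples daily context spend, and the model must now find the answer inside four times as much mostly irrelevant text.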
## Practical Guidelines (That Actually Work)
Some battle-tested principles:
- Prefer smaller, semantically complete chunks
- Avoid “just in case” retrieval
- Batch ingestion with observability
- Track cost per document, not per query (see the tracker sketch below)
- Treat indexes as versioned assets
- Re-evaluate chunking as usage evolves
RAG is not a static pipeline — it’s a living system.
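“Track cost per document” can start as small as a counter keyed by document and pipeline stage, emitted next to your batch metrics. A minimal sketch, with assumed stage names:

```python
from collections import defaultdict

class DocCostTracker:
    """Accumulate spend per document, broken down by pipeline stage.
    The stage names used below ("embed", "index") are assumptions;
    use whatever stages your pipeline actually has."""

    def __init__(self) -> None:
        self._costs = defaultdict(lambda: defaultdict(float))

    def record(self, doc_id: str, stage: str, usd: float) -> None:
        self._costs[doc_id][stage] += usd

    def total(self, doc_id: str) -> float:
        return sum(self._costs[doc_id].values())

tracker = DocCostTracker()
tracker.record("doc-42", "embed", 0.004)
tracker.record("doc-42", "index", 0.001)
print(f"doc-42 lifetime cost: ${tracker.total('doc-42'):.4f}")
```

Once a document's lifetime cost is visible, re-chunking and re-indexing decisions stop being guesses.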
## Final Takeaway
RAG systems don’t get expensive overnight.
They get expensive through:
- Over-chunking
- Over-retrieval
- Under-observability
- Poor batching discipline
If you don’t design ingestion for scale, your costs will scale for you.
## What’s Next
In the next article, we’ll step back and ask:
Simple RAG vs Agentic RAG: What Problem Are You Actually Solving?
Because adding agents before fixing ingestion is usually a mistake.
## Discussion
How are you currently handling chunking and batching in your RAG pipelines?
What trade-offs have surprised you the most?