Parth Sarthi Sharma

Chunking, Batching & Indexing: The Hidden Costs of RAG Systems

Most RAG discussions focus on retrieval quality.

  • Which embeddings to use.
  • Which vector database is faster.
  • Which similarity metric performs better.

But in production, RAG systems rarely fail because of retrieval alone.

They fail because of how content is chunked, batched, and indexed — quietly, expensively, and at scale.

Why Chunking Is a Cost Decision, Not Just a Text Decision

Chunking is often treated as a preprocessing step:

“Split documents into 500-token chunks and move on.”

That decision impacts:

  • Retrieval accuracy
  • Context window usage
  • Latency
  • Token cost
  • Index size
  • Re-ranking complexity

Bad chunking doesn’t just reduce answer quality — it multiplies operational cost.

The Real Trade-Offs in Chunk Size

Small Chunks
Pros

  • More precise retrieval
  • Better semantic focus
  • Lower “Lost in the Middle” risk

Cons

  • More chunks per document
  • Larger vector index
  • Higher retrieval fan-out
  • More context assembly overhead

Large Chunks

Pros

  • Fewer vectors
  • Smaller index
  • Faster ingestion

Cons

  • Lower relevance density
  • More noise per chunk
  • Higher chance relevant content gets ignored
  • Worse attention utilisation

👉 There is no “perfect” chunk size.
There is only context-aware chunking.
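
To make that concrete, here is a minimal sketch of context-aware chunking: paragraphs get packed into chunks under a token budget instead of being cut at a fixed offset. The whitespace token count and the 500-token budget are illustrative assumptions, not recommendations.

```python
# A minimal sketch of context-aware chunking, assuming plain-text
# documents with blank-line paragraph breaks. The whitespace token
# count is a rough proxy for a real tokenizer such as tiktoken.

def count_tokens(text: str) -> int:
    return len(text.split())  # proxy only; swap in your real tokenizer

def chunk_by_paragraph(document: str, max_tokens: int = 500) -> list[str]:
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for para in document.split("\n\n"):
        para_tokens = count_tokens(para)
        # Flush the current chunk if this paragraph would overflow it,
        # so chunks end on semantic (paragraph) boundaries.
        if current and current_tokens + para_tokens > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += para_tokens
        # Note: a single paragraph longer than max_tokens still becomes
        # one oversized chunk; a real pipeline would split it further.
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

In practice you would swap the word-count proxy for your actual tokenizer and pick the budget based on your embedding model and retrieval top-k.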

Why Batching Quietly Becomes Your Biggest Cost Lever

Most teams underestimate batching.

Batching affects:

  • Ingestion throughput
  • Embedding API cost
  • Failure recovery
  • Observability
  • Reprocessing overhead

Common Anti-Pattern

  • Ingesting documents one by one
  • Embedding synchronously
  • No retry or visibility

This works for demos. It collapses at scale.

What Good Batching Looks Like

Production-grade ingestion pipelines:

  • Batch documents intentionally
  • Track batch IDs
  • Log failures per batch
  • Allow partial retries
  • Emit metrics per stage

Batching isn’t just optimisation — it’s operational control.
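
As a sketch, here is what intentional batching can look like: explicit batch IDs, per-batch failure logs, and partial retries. `embed_batch` stands in for whatever embedding client you use; the batch size and retry count are placeholder values.

```python
# A sketch of intentional batching with batch IDs, per-batch failure
# logging, and partial retries. `embed_batch` is a placeholder for
# your embedding client; batch size and retry count are illustrative.
import logging
import uuid

logger = logging.getLogger("ingestion")

def ingest(documents: list[str], embed_batch, batch_size: int = 64,
           max_retries: int = 3) -> list[str]:
    failed_batch_ids: list[str] = []
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        batch_id = str(uuid.uuid4())
        for attempt in range(1, max_retries + 1):
            try:
                embed_batch(batch)  # placeholder embedding call
                logger.info("batch %s embedded (%d docs)", batch_id, len(batch))
                break
            except Exception as exc:
                logger.warning("batch %s attempt %d failed: %s",
                               batch_id, attempt, exc)
        else:
            # Retries exhausted: record the batch so only it gets
            # reprocessed later, not the whole corpus. A real pipeline
            # would also persist the batch contents.
            failed_batch_ids.append(batch_id)
    return failed_batch_ids
```

The key property: a failure costs you one batch, not the whole corpus, and every batch leaves a trace in the logs.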

Indexing: The Forgotten Scaling Problem

Indexing is often treated as “fire and forget”.

But indexing decisions affect:

  • Query latency
  • Memory footprint
  • Rebuild cost
  • Migration complexity

Questions teams forget to ask:

  • Can we re-index incrementally?
  • Can we support multiple indexes per domain?
  • Can we rebuild without downtime?
  • Can we version indexes safely?

RAG systems age badly without a good indexing strategy.
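
One pattern that answers most of those questions is treating index versions like deployments: build the new version alongside the live one, validate it, then cut over atomically. The `store` interface below is hypothetical, but most vector databases expose equivalent alias or collection-swap primitives.

```python
# A sketch of versioned indexes with an atomic alias cutover. The
# `store` methods (resolve_alias, create_index, ingest, point_alias,
# drop_index) are hypothetical stand-ins for your vector database's
# own alias or collection-swap API.

def reindex_without_downtime(store, documents, validate) -> str:
    old_version = store.resolve_alias("docs-live")          # e.g. "docs-v3"
    new_version = f"docs-v{int(old_version.rsplit('v', 1)[1]) + 1}"

    store.create_index(new_version)   # build alongside the live index
    store.ingest(new_version, documents)

    if not validate(new_version):
        store.drop_index(new_version)  # old version keeps serving traffic
        raise RuntimeError("validation failed; alias left unchanged")

    store.point_alias("docs-live", new_version)  # atomic cutover
    return new_version
```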

Why These Costs Compound in Production

Here’s the uncomfortable truth:

Every extra chunk
→ increases retrieval cost
→ increases prompt size
→ increases token spend
→ increases latency
→ reduces answer quality

Poor chunking and batching don’t fail loudly.
They fail financially.
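
A quick back-of-envelope calculation shows how this compounds. Every number here is an illustrative assumption, not a benchmark:

```python
# Back-of-envelope arithmetic for how extra chunks become token spend.
# All numbers (top_k, chunk size, price, query volume) are assumptions.
top_k = 8                      # chunks stuffed into each prompt
chunk_tokens = 500             # tokens per chunk
queries_per_day = 50_000
price_per_1k_tokens = 0.0005   # hypothetical input-token price, USD

prompt_tokens = top_k * chunk_tokens  # 4,000 context tokens per query
daily_cost = queries_per_day * prompt_tokens / 1_000 * price_per_1k_tokens
print(f"${daily_cost:,.2f}/day")      # $100.00/day on context alone

# Bumping top_k from 8 to 12 "just in case" adds 50% to that line item
# before it improves a single answer.
```

Nothing in that chain shows up as an error. It shows up as an invoice.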

Practical Guidelines (That Actually Work)

Some battle-tested principles:

  • Prefer smaller, semantically complete chunks
  • Avoid “just in case” retrieval
  • Batch ingestion with observability
  • Track cost per document, not per query (sketched below)
  • Treat indexes as versioned assets
  • Re-evaluate chunking as usage evolves

RAG is not a static pipeline — it’s a living system.
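
For the cost-per-document guideline above, here is a tiny sketch of what that attribution can look like. The fields and per-1k-token prices are illustrative; real numbers come from your pipeline's own metering.

```python
# A tiny sketch of per-document cost attribution. Fields and prices
# are illustrative assumptions, not a metering standard.
from dataclasses import dataclass

@dataclass
class DocumentCost:
    doc_id: str
    embedding_tokens: int          # tokens embedded at ingestion (paid once)
    prompt_tokens_served: int = 0  # context tokens this doc has contributed

    def lifetime_cost(self, embed_price: float, prompt_price: float) -> float:
        # Ingestion cost is fixed; serving cost accrues every time this
        # document's chunks land in a prompt.
        return (self.embedding_tokens / 1_000 * embed_price
                + self.prompt_tokens_served / 1_000 * prompt_price)
```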

Final Takeaway

RAG systems don’t get expensive overnight.

They get expensive through:

  • Over-chunking
  • Over-retrieval
  • Under-observability
  • Poor batching discipline

If you don’t design ingestion for scale, your costs will scale for you.

What’s Next

In the next article, we’ll step back and ask:

Simple RAG vs Agentic RAG: What Problem Are You Actually Solving?

Because adding agents before fixing ingestion is usually a mistake.

Discussion

How are you currently handling chunking and batching in your RAG pipelines?
What trade-offs have surprised you the most?
