DEV Community

강해수
강해수

Posted on • Originally published at riversealab.com

1024-token RAG chunks cut my storage cost in half — and nearly doubled my Claude bill

Switching from 512 to 1024-token chunks saved $1.20/month on Vectorize. It cost me $92 more on Claude Sonnet. I didn't see that coming until I did the math.

I run an ad analytics SaaS with a daily agent flow that hits a RAG step on every cycle — about 400 runs a day. I'd left the chunk size at the default 512 tokens for three months before I got curious enough to actually measure it. So I indexed the same 100 ad reports three ways (256, 512, 1024 tokens), ran 20 fixed queries five times each, and tracked latency, citation accuracy, and estimated monthly cost.

The summary table looked like 512 wins cleanly:

Chunk size Avg latency Citation accuracy Monthly vector cost
256 tokens 43ms 16.2 / 20 ~$4.80
512 tokens 51ms 17.8 / 20 ~$2.40
1024 tokens 67ms 15.1 / 20 ~$1.20

But the vector storage number is a trap. With 1024-token chunks, each top_k: 5 retrieval pulls ~5,120 tokens into Claude's context instead of ~2,560. At $3/M input tokens, that's roughly 2,560 extra tokens × 12,000 monthly calls = 30.7M tokens = $92/month. The $1.20 Vectorize saving doesn't touch it.

1024-token chunks also produced what I'd call dilution rather than hallucination — Claude wasn't making things up, it was including too much surrounding text and missing the actual point. One campaign's 60-day performance data would land as a single dense chunk, and the model would surface averages instead of the anomaly I was asking about. Smaller chunks hurt in the opposite direction: a campaign summary split across three 256-token pieces meant top_k: 3 often came back with an incomplete picture.

What I run now is two indexes on the same data — a report-512 namespace for accuracy-first summary queries and a live-256 namespace for latency-sensitive ad-hoc questions:

const summaryChunks = await vectorize.query(embedding, {
  topK: 5,
  namespace: 'report-512',
});

const liveChunks = await vectorize.query(embedding, {
  topK: 3,
  namespace: 'live-256',
});
Enter fullscreen mode Exit fullscreen mode

Double the storage cost, but the Claude savings on the live path more than cover it at my call volume. Whether that math holds at lower volumes — it probably doesn't.

I also hit a production incident during this experiment that had nothing to do with chunk size: swapping embedding models mid-index caused a dimension mismatch that killed the entire RAG step at 9am. That failure and the chunk overlap experiments I haven't run yet are in the full writeup.

I wrote up the full breakdown — including the dimension mismatch incident and what I'd test next with dynamic chunking by document type — over on riversealab.com.

Full post →

Top comments (0)