What I learned building a document chunking and embedding API for RAG

#rag #ai

Chunking sounds like the boring part of RAG. It is also where a lot of retrieval quality is won or lost. I built a document chunking and embedding API and ran it in production, and these are the things that actually moved the needle.

Repo: https://github.com/ahmetguness/doc-chunking-api
Live demo (3 free runs): https://chunkingservice.com

Sentence-aware beats fixed-size

The naive approach is to split text every N characters or tokens. It is simple and it quietly hurts retrieval, because it cuts sentences in half and splits ideas across chunks. Sentence-aware chunking with a configurable overlap keeps each chunk coherent, so the embedding actually represents a complete thought. This one change usually improves retrieval more than swapping embedding models.

Tables are their own problem

Real documents are not just prose. CSV and Excel files carry meaning in rows and columns, and a generic text splitter shreds a record across chunk boundaries, so a row like a customer and their balance gets separated from its header. Treating tables as a distinct extraction path, rather than flattening them into text first, keeps rows intact and makes the retrieved context usable.

The embedding model is a tradeoff, not a default

The API supports nine embedding models and runs BAAI/bge-m3 in production. bge-m3 is a strong multilingual default, but model choice is a tradeoff between quality, dimension size (which affects your vector DB cost), and latency. The right answer depends on your data and budget, which is why it is a parameter, not a hardcoded choice.

Multilingual preprocessing has sharp edges

The most surprising lesson: for Turkish and other multilingual text, lowercasing before chunking measurably improved retrieval with bge-m3. But lowercasing is not universal. Turkish has dotted and dotless I, so a naive lowercase corrupts words. Locale-aware normalization mattered, and getting it wrong silently degraded results in a way that was hard to spot without an eval set.

Treat it like an API, not a script

The difference between a notebook and something you can rely on is the boring infrastructure: auth, rate limiting, structured logging, and supporting local (CPU/GPU/CUDA) or cloud backends so it runs where you need it. None of this is glamorous, but it is what lets you actually depend on the thing.

Takeaway

If your RAG answers are weak, look at chunking and retrieval before you blame the model. Sentence-aware splitting, table-aware extraction, and locale-correct preprocessing are cheap changes with outsized impact.

Code: https://github.com/ahmetguness/doc-chunking-api
Demo: https://chunkingservice.com

What does your chunking pipeline look like, and what broke the first time you put it in front of real documents?